[PATCHv4 0/8] nvme: export additional diagnostic counters via sysfs

Venkat Rao Bagalkote venkat88 at linux.ibm.com
Mon May 25 02:12:31 PDT 2026


On 17/05/26 12:06 am, Nilay Shroff wrote:
> Hi,
>
> The NVMe driver encounters various events and conditions during normal
> operation that are either not tracked today or not exposed to userspace
> via sysfs. Lack of visibility into these events can make it difficult to
> diagnose subtle issues related to controller behavior, multipath
> stability, and I/O reliability.
>
> This patchset adds several diagnostic counters that provide improved
> observability into NVMe behavior. These counters are intended to help
> users understand events such as transient path unavailability,
> controller retries/reconnect/reset, failovers, and I/O failures. They
> can also be consumed by monitoring tools such as nvme-top.
>
> Specifically, this series proposes to export the following counters via
> sysfs:
>    - Command retry count
>    - Multipath failover count
>    - Command error count
>    - I/O requeue count
>    - I/O failure count
>    - Controller reset event counts
>    - Controller reconnect counts
>
> The first patch in the series adds a new diag attribute group under the per-path,
> ns-head and ctrl sysfs directories so that all diagnostic counters can be
> grouped together under a diag sub-directory. The subsequent patches in the series
> add the diagnostic counters listed above.
>
> Please note that this patchset doesn't make any functional change but
> rather exports relevant counters to user space via sysfs.
>
> As usual, feedback/comments/suggestions are welcome!
>
> Changes from v3:
>    - To be consistent in naming, all counters are suffixed with _count
>      (Keith Busch)
>    - The first patch in the series creates new attribute group named
>      diag and all counters are now grouped under this new sysfs
>      attribute group (Keith Busch)
>    - Counters are defined as atomic_long_t instead of size_t (Keith Busch)
>    - Removed RB and TB tags due to above changes
> Link to v3: https://lore.kernel.org/all/20260220175024.292898-1-nilay@linux.ibm.com/
>
> Changes from v2:
>    - Allow user to write to sysfs attributes so that user could
>      reset stat counters, if needed (Sagi)
>    - The controller reconnect counter nr_reconnects could reset
>      to zero once connection is re-established, so instead of
>      exposing nr_reconnects counter via sysfs introduce a new
>      counter which accumulates the reconnect attempts and export
>      this accumulated counter via sysfs (Sagi)
> Link to v2: https://lore.kernel.org/all/20260205124810.682559-1-nilay@linux.ibm.com/
>
> Changes from v1:
>    - Remove export of stats for admin command retry count (Keith)
>    - Use size_add() to ensure stat counters don't overflow (Keith)
> Link to v1: https://lore.kernel.org/all/20260130182028.885089-1-nilay@linux.ibm.com/
>
> Nilay Shroff (8):
>    nvme: add diag attribute group under sysfs
>    nvme: export command retry count via sysfs
>    nvme: export multipath failover count via sysfs
>    nvme: export command error counters via sysfs
>    nvme: export I/O requeue count when no path is available via sysfs
>    nvme: export I/O failure count when no path is available via sysfs
>    nvme: export controller reset event count via sysfs
>    nvme: export controller reconnect event count via sysfs
>
>   drivers/nvme/host/core.c      |  15 ++-
>   drivers/nvme/host/fc.c        |   3 +
>   drivers/nvme/host/multipath.c |  87 ++++++++++++++
>   drivers/nvme/host/nvme.h      |  13 +++
>   drivers/nvme/host/pci.c       |   1 +
>   drivers/nvme/host/rdma.c      |   2 +
>   drivers/nvme/host/sysfs.c     | 214 ++++++++++++++++++++++++++++++++++
>   drivers/nvme/host/tcp.c       |   2 +
>   8 files changed, 336 insertions(+), 1 deletion(-)
>

Hello Nilay,

Applied this patch series on top of v7.1-rc5 and boot-tested on ppc64le.

Verified the new NVMe diag sysfs hierarchy and counters exposed by this 
series.

Validation steps executed:

Read all exported NVMe diag counters:

# for f in $(find /sys -path '*nvme*diag/*_count' 2>/dev/null); do echo "$f: $(cat "$f")"; done

Reset all writable counters to zero:

# for f in $(find /sys -path '*nvme*diag/*_count' 2>/dev/null); do echo 0 > "$f" && echo "reset ok $f"; done

Negative test with invalid input:

# echo abc > /sys/devices/pci0525:48/0525:48:00.0/nvme/nvme0/diag/command_error_count
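
For anyone who wants to repeat these checks, the same three steps can be wrapped in small shell functions. This is only a sketch of what was run by hand above, not part of the patch series. The function names and the optional root-directory argument are assumptions added here so the loops can be pointed at a scratch tree instead of /sys.

```shell
#!/bin/sh
# Sketch only: helper names and the root argument are hypothetical,
# introduced for illustration; the report above ran the loops directly on /sys.

# List every *_count attribute that sits inside an nvme diag directory.
list_diag_counters() {
    find "${1:-/sys}" -path '*nvme*diag/*_count' 2>/dev/null
}

# Print "path: value" for each counter found below the given root.
read_diag_counters() {
    list_diag_counters "$1" | while IFS= read -r f; do
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}

# Write 0 to each counter and report success or failure per file.
reset_diag_counters() {
    list_diag_counters "$1" | while IFS= read -r f; do
        if echo 0 > "$f" 2>/dev/null; then
            printf 'reset ok %s\n' "$f"
        else
            printf 'reset FAILED %s\n' "$f"
        fi
    done
}

# Negative test for a single counter file: returns 0 when the
# non-numeric write is rejected, 1 when it is accepted.
invalid_write_rejected() {
    ! echo abc > "$1" 2>/dev/null
}
```

On a system with the series applied, `read_diag_counters /sys` followed by `reset_diag_counters /sys` (as root) should mirror the first two steps, and `invalid_write_rejected` can be called on any one counter path for the negative test.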

Observed results:

diag directories were present under:

   - controller paths, e.g. /sys/devices/.../nvme/nvmeX/diag/
   - per-path namespace paths, e.g. /sys/devices/.../nvme/nvmeX/nvmeYcZnW/diag/
   - namespace-head paths, e.g.
     /sys/devices/virtual/nvme-subsystem/nvme-subsysX/nvmeYnZ/diag/

Controller counters observed:
reset_count
command_error_count
reconnect_count on fabrics controllers


# ll /sys/devices/virtual/nvme-fabrics/ctl/nvme7/diag
total 0
-rw-r--r--. 1 root root 65536 May 25 03:58 command_error_count
-rw-r--r--. 1 root root 65536 May 25 03:58 reconnect_count
-rw-r--r--. 1 root root 65536 May 25 03:58 reset_count

# ll /sys/devices/pci052a:58/052a:58:00.0/nvme/nvme2/diag
total 0
-rw-r--r--. 1 root root 65536 May 25 03:58 command_error_count
-rw-r--r--. 1 root root 65536 May 25 03:58 reset_count


Per-path counters observed:
multipath_failover_count
command_error_count
command_retries_count

# ll /sys/devices/pci052a:58/052a:58:00.0/nvme/nvme2/nvme2c2n1/diag
total 0
-rw-r--r--. 1 root root 65536 May 25 03:58 command_error_count
-rw-r--r--. 1 root root 65536 May 25 03:58 command_retries_count
-rw-r--r--. 1 root root 65536 May 25 03:58 multipath_failover_count


Namespace-head counters observed:
io_fail_no_available_path_count
io_requeue_no_usable_path_count


# ll /sys/devices/virtual/nvme-subsystem/nvme-subsys1/nvme1n2/diag
total 0
-rw-r--r--. 1 root root 65536 May 25 03:58 io_fail_no_available_path_count
-rw-r--r--. 1 root root 65536 May 25 03:58 io_requeue_no_usable_path_count


All reads returned numeric values.

All reset writes to 0 succeeded.

Invalid text write failed as expected:

-bash: echo: write error: Invalid argument


If it all looks good, please add the tag below.


Tested-by: Venkat Rao Bagalkote <venkat88 at linux.ibm.com>


Regards,

Venkat.






