[PATCHv3 0/7] nvme: export additional diagnostic counters via sysfs
Venkat
venkat88 at linux.ibm.com
Sun Feb 22 04:36:38 PST 2026
> On 20 Feb 2026, at 11:18 PM, Nilay Shroff <nilay at linux.ibm.com> wrote:
>
> Hi,
>
> The NVMe driver encounters various events and conditions during normal
> operation that are either not tracked today or not exposed to userspace
> via sysfs. Lack of visibility into these events can make it difficult to
> diagnose subtle issues related to controller behavior, multipath
> stability, and I/O reliability.
>
> This patchset adds several diagnostic counters that provide improved
> observability into NVMe behavior. These counters are intended to help
> users understand events such as transient path unavailability,
> controller retries/reconnect/reset, failovers, and I/O failures. They
> can also be consumed by monitoring tools such as nvme-top.
>
> Specifically, this series proposes to export the following counters via
> sysfs:
> - Command retry count
> - Multipath failover count
> - Command error count
> - I/O requeue count
> - I/O failure count
> - Controller reset event counts
> - Controller reconnect counts
>
> The patchset consists of seven patches:
> Patch 1: Export command retry count
> Patch 2: Export multipath failover count
> Patch 3: Export command error count
> Patch 4: Export I/O requeue count
> Patch 5: Export I/O failure count
> Patch 6: Export controller reset event counts
> Patch 7: Export controller reconnect event count
>
> Please note that this patchset doesn't make any functional change; it
> only exports the relevant counters to user space via sysfs.
>
> As usual, feedback/comments/suggestions are welcome!
>
> Changes from v2:
> - Allow users to write to the sysfs attributes so that they can
> reset the stat counters, if needed (Sagi)
> - The controller reconnect counter nr_reconnects could reset
> to zero once connection is re-established, so instead of
> exposing nr_reconnects counter via sysfs introduce a new
> counter which accumulates the reconnect attempts and export
> this accumulated counter via sysfs (Sagi)
> Link to v2: https://lore.kernel.org/all/20260205124810.682559-1-nilay@linux.ibm.com/
>
> Changes from v1:
> - Remove export of stats for admin command retry count (Keith)
> - Use size_add() to ensure stat counters don't overflow (Keith)
> Link to v1: https://lore.kernel.org/all/20260130182028.885089-1-nilay@linux.ibm.com/
>
> Nilay Shroff (7):
> nvme: export command retry count via sysfs
> nvme: export multipath failover count via sysfs
> nvme: export command error counters via sysfs
> nvme: export I/O requeue count when no path is available via sysfs
> nvme: export I/O failure count when no path is available via sysfs
> nvme: export controller reset event count via sysfs
> nvme: export controller reconnect event count via sysfs
>
> drivers/nvme/host/core.c | 18 +++-
> drivers/nvme/host/fc.c | 5 +
> drivers/nvme/host/multipath.c | 89 ++++++++++++++++++
> drivers/nvme/host/nvme.h | 13 ++-
> drivers/nvme/host/rdma.c | 4 +
> drivers/nvme/host/sysfs.c | 167 ++++++++++++++++++++++++++++++++++
> drivers/nvme/host/tcp.c | 3 +
> 7 files changed, 297 insertions(+), 2 deletions(-)
>
> --
> 2.52.0
>
>
Hello Nilay,
I tested this patch series and found that a couple of attributes are missing.
Missing diagnostic counters:
1. I/O requeue count
2. I/O failure count
All the other diagnostic counters are exposed via sysfs correctly.
Controller-level counters observed:
- reset_events
- reconnect_events
- command_error_count
Namespace-instance counters observed:
- command_retries
- multipath_failover_count
- command_error_count
Logs:
ll /sys/class/nvme/nvme3/
total 0
-r--r--r-- 1 root root 65536 Feb 22 05:49 address
-r--r--r-- 1 root root 65536 Feb 22 05:58 cntlid
-r--r--r-- 1 root root 65536 Feb 22 05:49 cntrltype
-rw-r--r-- 1 root root 65536 Feb 22 06:10 command_error_count
-rw-r--r-- 1 root root 65536 Feb 22 05:58 ctrl_loss_tmo
-r--r--r-- 1 root root 65536 Feb 22 05:49 dctype
--w------- 1 root root 65536 Feb 22 05:58 delete_controller
-r--r--r-- 1 root root 65536 Feb 22 05:58 dev
lrwxrwxrwx 1 root root 0 Feb 22 05:50 device -> ../../ctl
-rw-r--r-- 1 root root 65536 Feb 22 05:58 fast_io_fail_tmo
-r--r--r-- 1 root root 65536 Feb 22 05:49 firmware_rev
-r--r--r-- 1 root root 65536 Feb 22 05:51 hostid
-r--r--r-- 1 root root 65536 Feb 22 05:51 hostnqn
-r--r--r-- 1 root root 65536 Feb 22 05:58 kato
-r--r--r-- 1 root root 65536 Feb 22 05:49 model
-r--r--r-- 1 root root 65536 Feb 22 05:49 numa_node
drwxr-xr-x 9 root root 0 Feb 22 05:49 nvme3c3n1
drwxr-xr-x 9 root root 0 Feb 22 05:49 nvme3c3n10
drwxr-xr-x 9 root root 0 Feb 22 05:49 nvme3c3n2
drwxr-xr-x 9 root root 0 Feb 22 05:49 nvme3c3n3
drwxr-xr-x 9 root root 0 Feb 22 05:49 nvme3c3n4
drwxr-xr-x 9 root root 0 Feb 22 05:49 nvme3c3n5
drwxr-xr-x 9 root root 0 Feb 22 05:49 nvme3c3n6
drwxr-xr-x 9 root root 0 Feb 22 05:49 nvme3c3n7
drwxr-xr-x 9 root root 0 Feb 22 05:49 nvme3c3n8
drwxr-xr-x 9 root root 0 Feb 22 05:49 nvme3c3n9
-rw-r--r-- 1 root root 65536 Feb 22 05:58 passthru_err_log_enabled
drwxr-xr-x 2 root root 0 Feb 22 05:58 power
-r--r--r-- 1 root root 65536 Feb 22 05:49 queue_count
-rw-r--r-- 1 root root 65536 Feb 22 05:58 reconnect_delay
-rw-r--r-- 1 root root 65536 Feb 22 06:11 reconnect_events
--w------- 1 root root 65536 Feb 22 05:58 rescan_controller
--w------- 1 root root 65536 Feb 22 06:11 reset_controller
-rw-r--r-- 1 root root 65536 Feb 22 06:10 reset_events
-r--r--r-- 1 root root 65536 Feb 22 05:49 serial
-r--r--r-- 1 root root 65536 Feb 22 05:49 sqsize
-r--r--r-- 1 root root 65536 Feb 22 05:49 state
-r--r--r-- 1 root root 65536 Feb 22 05:51 subsysnqn
lrwxrwxrwx 1 root root 0 Feb 22 05:49 subsystem -> ../../../../../class/nvme
-r--r--r-- 1 root root 65536 Feb 22 05:51 transport
-rw-r--r-- 1 root root 65536 Feb 22 05:49 uevent
ll /sys/class/nvme/nvme3/nvme3c3n8
total 0
-r--r--r-- 1 root root 65536 Feb 22 06:02 alignment_offset
-r--r--r-- 1 root root 65536 Feb 22 05:51 ana_grpid
-r--r--r-- 1 root root 65536 Feb 22 05:51 ana_state
-r--r--r-- 1 root root 65536 Feb 22 06:02 capability
-rw-r--r-- 1 root root 65536 Feb 22 06:07 command_error_count
-rw-r--r-- 1 root root 65536 Feb 22 06:07 command_retries
-r--r--r-- 1 root root 65536 Feb 22 06:02 csi
lrwxrwxrwx 1 root root 0 Feb 22 05:50 device -> ../../nvme3
-r--r--r-- 1 root root 65536 Feb 22 06:02 discard_alignment
-r--r--r-- 1 root root 65536 Feb 22 06:02 diskseq
-r--r--r-- 1 root root 65536 Feb 22 06:02 events
-r--r--r-- 1 root root 65536 Feb 22 06:02 events_async
-rw-r--r-- 1 root root 65536 Feb 22 06:02 events_poll_msecs
-r--r--r-- 1 root root 65536 Feb 22 06:02 ext_range
-r--r--r-- 1 root root 65536 Feb 22 06:02 hidden
drwxr-xr-x 2 root root 0 Feb 22 06:02 holders
-r--r--r-- 1 root root 65536 Feb 22 06:02 inflight
drwxr-xr-x 2 root root 0 Feb 22 06:02 integrity
-r--r--r-- 1 root root 65536 Feb 22 06:02 metadata_bytes
drwxr-xr-x 18 root root 0 Feb 22 06:02 mq
-rw-r--r-- 1 root root 65536 Feb 22 06:07 multipath_failover_count
-r--r--r-- 1 root root 65536 Feb 22 06:02 nguid
-r--r--r-- 1 root root 65536 Feb 22 06:02 nsid
-r--r--r-- 1 root root 65536 Feb 22 06:02 numa_nodes
-r--r--r-- 1 root root 65536 Feb 22 06:02 nuse
-r--r--r-- 1 root root 65536 Feb 22 06:02 partscan
-rw-r--r-- 1 root root 65536 Feb 22 06:02 passthru_err_log_enabled
drwxr-xr-x 2 root root 0 Feb 22 06:02 power
drwxr-xr-x 2 root root 0 Feb 22 05:49 queue
-r--r--r-- 1 root root 65536 Feb 22 06:02 queue_depth
-r--r--r-- 1 root root 65536 Feb 22 06:02 range
-r--r--r-- 1 root root 65536 Feb 22 05:49 removable
-r--r--r-- 1 root root 65536 Feb 22 06:02 ro
-r--r--r-- 1 root root 65536 Feb 22 05:50 size
drwxr-xr-x 2 root root 0 Feb 22 06:02 slaves
-r--r--r-- 1 root root 65536 Feb 22 06:02 stat
lrwxrwxrwx 1 root root 0 Feb 22 05:49 subsystem -> ../../../../../../class/block
drwxr-xr-x 2 root root 0 Feb 22 06:02 trace
-rw-r--r-- 1 root root 65536 Feb 22 05:49 uevent
-r--r--r-- 1 root root 65536 Feb 22 06:02 uuid
-r--r--r-- 1 root root 65536 Feb 22 06:02 wwid
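For anyone who wants to repeat this check on another controller or namespace, the manual comparison of the listings can be scripted. This is only a minimal sketch: the attribute names are the ones observed in the listings above, the helper names (check_attrs, read_counter) are made up for illustration, and the counter values are treated as plain text because their exact format is not shown here. The I/O requeue and I/O failure attribute names are not given in this thread, so they are left out; append them to the lists once they are known.

```python
import os

# Attribute names copied from the two listings above.
CTRL_ATTRS = ["reset_events", "reconnect_events", "command_error_count"]
NS_ATTRS = ["command_retries", "multipath_failover_count", "command_error_count"]


def check_attrs(directory, expected):
    """Split the expected attribute names into (present, missing) for one directory."""
    present = [name for name in expected
               if os.path.isfile(os.path.join(directory, name))]
    missing = [name for name in expected if name not in present]
    return present, missing


def read_counter(directory, name):
    """Return the raw text of one attribute with surrounding whitespace removed."""
    with open(os.path.join(directory, name)) as f:
        return f.read().strip()


def report(directory, expected):
    """Build one human-readable line per expected attribute."""
    present, missing = check_attrs(directory, expected)
    lines = [f"{name}: {read_counter(directory, name)}" for name in present]
    lines += [f"{name}: MISSING" for name in missing]
    return lines
```

For example, print("\n".join(report("/sys/class/nvme/nvme3", CTRL_ATTRS))) covers the controller directory, and the same call with "/sys/class/nvme/nvme3/nvme3c3n8" and NS_ATTRS covers the namespace instance.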
Regards,
Venkat.