[PATCHv4 0/8] nvme: export additional diagnostic counters via sysfs
Nilay Shroff
nilay at linux.ibm.com
Sat May 16 11:36:47 PDT 2026
Hi,
The NVMe driver encounters various events and conditions during normal
operation that are either not tracked today or not exposed to userspace
via sysfs. Lack of visibility into these events can make it difficult to
diagnose subtle issues related to controller behavior, multipath
stability, and I/O reliability.
This patchset adds several diagnostic counters that provide improved
observability into NVMe behavior. These counters are intended to help
users understand events such as transient path unavailability,
controller retries/reconnect/reset, failovers, and I/O failures. They
can also be consumed by monitoring tools such as nvme-top.
Specifically, this series proposes to export the following counters via
sysfs:
- Command retry count
- Multipath failover count
- Command error count
- I/O requeue count
- I/O failure count
- Controller reset event counts
- Controller reconnect counts
The first patch in the series adds a new diag attribute group under the per-path,
ns-head and ctrl sysfs directories so that all diagnostic counters are
grouped together under a diag sub-directory. The subsequent patches in the series
add the diagnostic counters listed above.
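As a rough illustration of how a monitoring tool might consume one of these
counters from userspace, here is a minimal sketch. It assumes each attribute
file holds a single decimal value followed by a newline; the helper name
read_counter() is hypothetical and not part of this series.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Read one decimal counter value from a sysfs-style attribute file.
 * Assumption: the file contains a single decimal number and a newline.
 * Returns 0 on success and stores the value in *val; returns -1 on
 * failure and leaves *val untouched.
 */
static int read_counter(const char *path, long *val)
{
	char buf[64];
	char *end;
	long tmp;
	FILE *f;

	assert(path != NULL && val != NULL);

	f = fopen(path, "r");
	if (!f)
		return -1;
	if (!fgets(buf, sizeof(buf), f)) {
		fclose(f);
		return -1;
	}
	fclose(f);

	errno = 0;
	tmp = strtol(buf, &end, 10);
	if (errno != 0 || end == buf)
		return -1;	/* out of range or no digits found */

	*val = tmp;
	return 0;
}
```

A caller would pass the full path of one counter under a diag sub-directory,
for example something like /sys/class/nvme/nvme0/diag/reset_count. That path
is only a placeholder: the actual attribute names are defined by the
individual patches.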
Please note that this patchset doesn't make any functional change but
rather exports relevant counters to user space via sysfs.
As usual, feedback/comments/suggestions are welcome!
Changes from v3:
- To be consistent in naming, all counters are suffixed with _count
(Keith Busch)
- The first patch in the series creates new attribute group named
diag and all counters are now grouped under this new sysfs
attribute group (Keith Busch)
- Counters are defined as atomic_long_t instead of size_t (Keith Busch)
- Removed RB and TB tags due to above changes
Link to v3: https://lore.kernel.org/all/20260220175024.292898-1-nilay@linux.ibm.com/
Changes from v2:
- Allow user to write to sysfs attributes so that user could
reset stat counters, if needed (Sagi)
- The controller reconnect counter nr_reconnects could reset
to zero once connection is re-established, so instead of
exposing nr_reconnects counter via sysfs introduce a new
counter which accumulates the reconnect attempts and export
this accumulated counter via sysfs (Sagi)
Link to v2: https://lore.kernel.org/all/20260205124810.682559-1-nilay@linux.ibm.com/
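To illustrate the two v2 changes above, here is a small userspace model, not
the kernel code from this series. It assumes a per-episode attempt counter
that is cleared once the connection is re-established, plus a separate
accumulated counter that is what gets exported and that a write resets to
zero. C11 atomic_long stands in for the kernel's atomic_long_t, and all
struct, field and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Userspace model only; names are hypothetical, not taken from the patches. */
struct ctrl_model {
	int nr_reconnects;		/* attempts in the current episode */
	atomic_long reconnect_count;	/* accumulated attempts (exported) */
};

static void model_init(struct ctrl_model *c)
{
	assert(c != NULL);
	c->nr_reconnects = 0;
	atomic_init(&c->reconnect_count, 0);
}

/* Each reconnect attempt bumps both counters. */
static void model_reconnect_attempt(struct ctrl_model *c)
{
	c->nr_reconnects++;
	atomic_fetch_add(&c->reconnect_count, 1);
}

/* On success only the per-episode counter is cleared. */
static void model_reconnect_success(struct ctrl_model *c)
{
	c->nr_reconnects = 0;
}

/* Models reading the attribute. */
static long model_show(struct ctrl_model *c)
{
	return atomic_load(&c->reconnect_count);
}

/* Models a userspace write that resets the exported counter. */
static void model_reset(struct ctrl_model *c)
{
	atomic_store(&c->reconnect_count, 0);
}
```

In this model, three failed attempts followed by a successful reconnect leave
nr_reconnects at zero while the accumulated counter still reads three, which
is why the accumulated value is the more useful one to export.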
Changes from v1:
- Remove export of stats for admin command retry count (Keith)
- Use size_add() to ensure stat counters don't overflow (Keith)
Link to v1: https://lore.kernel.org/all/20260130182028.885089-1-nilay@linux.ibm.com/
Nilay Shroff (8):
nvme: add diag attribute group under sysfs
nvme: export command retry count via sysfs
nvme: export multipath failover count via sysfs
nvme: export command error counters via sysfs
nvme: export I/O requeue count when no path is available via sysfs
nvme: export I/O failure count when no path is available via sysfs
nvme: export controller reset event count via sysfs
nvme: export controller reconnect event count via sysfs
drivers/nvme/host/core.c | 15 ++-
drivers/nvme/host/fc.c | 3 +
drivers/nvme/host/multipath.c | 87 ++++++++++++++
drivers/nvme/host/nvme.h | 13 +++
drivers/nvme/host/pci.c | 1 +
drivers/nvme/host/rdma.c | 2 +
drivers/nvme/host/sysfs.c | 214 ++++++++++++++++++++++++++++++++++
drivers/nvme/host/tcp.c | 2 +
8 files changed, 336 insertions(+), 1 deletion(-)
--
2.53.0