[PATCH 0/3] nvme-tcp: start error recovery after KATO
Hannes Reinecke
hare at suse.de
Fri Sep 8 03:00:46 PDT 2023
Hi all,
there have been some very insistent reports of data corruption
with certain target implementations due to command retries.
The problem is that for TCP we start error recovery
immediately after either a command timeout or a (local) link loss.
That is contrary to the NVMe base spec, which states in
section 3.9:

  If a Keep Alive Timer expires:
  a) the controller shall ...
  and
  b) the host assumes all outstanding commands are not completed
     and re-issues commands as appropriate.

I.e. we should retry commands only after KATO has expired.
With this patchset we will always wait until KATO has expired before
starting error recovery. This will cause a longer delay until
failed commands are retried, but that's kinda the point
of this patchset :-)
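
To illustrate the timing only, here is a minimal userspace sketch of how
the remaining wait could be computed. It is not code from the patches:
recovery_delay_ms() and its parameters (the current time, the time the
current keep-alive interval started, and KATO, all in milliseconds) are
hypothetical names chosen for this example.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not taken from the patches: return how many
 * milliseconds error recovery should be deferred so that it starts no
 * earlier than the end of the current KATO interval.
 */
static uint64_t recovery_delay_ms(uint64_t now_ms, uint64_t last_ka_ms,
                                  uint64_t kato_ms)
{
    uint64_t deadline_ms = last_ka_ms + kato_ms;

    /* KATO has already expired: recovery may start right away. */
    if (now_ms >= deadline_ms)
        return 0;

    /* Otherwise wait for whatever is left of the interval. */
    return deadline_ms - now_ms;
}
```

With a KATO of 5000 ms and a failure detected 1000 ms into the interval,
this yields a 4000 ms delay. In the driver such a value would presumably
become the delay of the delayed 'err_work'; that wiring is an assumption
and is not shown here.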
As usual, comments and reviews are welcome.

Hannes Reinecke (3):
  nvme-tcp: Do not terminate commands when in RESETTING
  nvme-tcp: make 'err_work' a delayed work
  nvme-tcp: delay error recovery until the next KATO interval

 drivers/nvme/host/core.c |  3 ++-
 drivers/nvme/host/nvme.h |  1 +
 drivers/nvme/host/tcp.c  | 29 +++++++++++++++++++++++------
 3 files changed, 26 insertions(+), 7 deletions(-)

--
2.35.3