[LSF/MM/BPF TOPIC] NVMe-oF connect retries

Wed May 8 02:52:38 PDT 2024

Hi all,

I'd like to request another session for LSF/MM:

NVMe-oF connect retries

There have been several discussions on the mailing list about how to
handle failures and retries that occur during 'connect'.
Issues to discuss:
- Should the initial connect return with a status after _all_
   queues are connected? That will introduce a severe lag
   for large installations, with the risk of systemd timing out
   the command.
- Should we try to combine workflows? TCP has three different
   'connect' code paths, one for the initial connect, one for
   reset, and one for reconnect.
- Where should a possible retry be handled? Should user space
   be responsible for a retry, or should it be left to the driver?
- If user space should be driving the retry, how can we return
   a meaningful error to user space?
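
To make the last two points a bit more concrete, here is a rough sketch
of what a user-space retry loop might look like. It only works if the
error coming back lets user space tell a transient failure from a
permanent one. Everything in it is an assumption for illustration:
connect_once() is a hypothetical callable, not an existing nvme-cli or
libnvme function, and the set of errno values treated as transient is
made up, not what any transport returns today.

```python
import errno
import time

# Assumption: this split between transient and permanent errors is
# illustrative only; agreeing on the real one is the open question.
TRANSIENT = {errno.ECONNREFUSED, errno.ETIMEDOUT,
             errno.EHOSTUNREACH, errno.EAGAIN}


def connect_with_retry(connect_once, max_attempts=4, base_delay=0.5,
                       sleep=time.sleep):
    """Retry a hypothetical connect_once() with exponential backoff.

    connect_once() is expected to return on success and raise OSError
    on failure. Returns the number of attempts that were needed.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            connect_once()
            return attempt
        except OSError as err:
            # Give up at once on permanent errors, or when out of attempts.
            if err.errno not in TRANSIENT or attempt == max_attempts:
                raise
            # Back off: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

If the driver can only report a generic failure, a loop like this has
no basis for deciding whether another attempt is worthwhile, which is
why the error reporting question matters for a user-space retry.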

It would be good if we could come to a consensus here so that
we can start consolidating the various transports.

Cheers,

Hannes