nvme-fc vs nvme-tcp: Initial connect behavior

Fri Jun 16 02:12:23 PDT 2023

Hi,

blktests' nvme/041 fails for the fc transport because fc behaves differently
on the initial connect. While tcp/rdma do the initial connect directly in the
context of the write to the /dev/nvme-fabrics file, fc defers it to a workqueue
(for retries).

The nvme/041 test case first tries to set up an authenticated connection with
an invalid key, so the connect is expected to fail. For tcp/rdma the invalid
key is reported back to userspace and all allocated resources are freed.

fc, however, defers the connect and even retries with the invalid key.
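
To make the difference concrete, here is a toy userspace model of the two
behaviors. It is only a sketch and not kernel code: SyncTransport,
DeferredTransport, connect_once() and VALID_KEY are made-up names, and the
"workqueue" is a plain deque that the caller drains by hand.

```python
from collections import deque

VALID_KEY = "valid"  # hypothetical stand-in for the correct DH-HMAC-CHAP key


class AuthError(Exception):
    """Models the 'authentication setup failed' error."""


def connect_once(key):
    # Assumed single connect attempt: fails whenever the key is wrong.
    if key != VALID_KEY:
        raise AuthError("authentication setup failed")


class SyncTransport:
    """tcp/rdma-like: connect runs in the context of the caller."""

    def __init__(self):
        self.controllers = []

    def create_ctrl(self, key):
        connect_once(key)             # the error propagates to the writer
        self.controllers.append(key)  # only reached on success


class DeferredTransport:
    """fc-like: the first connect is handed to a workqueue."""

    def __init__(self):
        self.controllers = []
        self.workqueue = deque()

    def create_ctrl(self, key):
        self.controllers.append(key)  # controller exists before any connect ran
        self.workqueue.append(key)    # the caller sees success at this point

    def run_work(self):
        key = self.workqueue.popleft()
        try:
            connect_once(key)
            return True
        except AuthError:
            self.workqueue.append(key)  # retry later with the same invalid key
            return False
```

In this model SyncTransport().create_ctrl("bad") raises and leaves no
controller behind, while DeferredTransport().create_ctrl("bad") returns
normally, keeps the controller, and every run_work() call fails and requeues
the same key.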

The test then tries to clean up the invalid configuration by issuing an 'nvme
disconnect', but this fails too.

After this the test tries to set up a connection with the correct key, but
this fails for fc because there is already a connection attempt in progress
(the real reason why the test fails):

   run blktests nvme/041 at 2023-06-15 14:19:06
   nvmet: adding nsid 1 to subsystem blktests-subsystem-1
   nvme nvme2: NVME-FC{0}: create association : host wwpn 0x20001100aa000002  rport wwpn 0x20001100aa000001: NQN "blktests-subsystem-1"
   (NULL device *): {0:0} Association created
   [10505] nvmet: ctrl 1 start keep-alive timer for 5 secs
   [10505] nvmet: check nqn.2014-08.org.nvmexpress:uuid:40931c1f-4bf3-4c23-b1af-deb3f65a73e0
   [10505] nvmet: nvmet_setup_dhgroup: ctrl 1 selecting dhgroup 0
   [10505] nvmet: nvmet_setup_auth: using hash none key d4 8b f7 36 75 34 db 1a a2 c5 75 9a 6b 05 de 3a 9e e8 68 61 c5 d6 cb 0b 1a f2 c4 f8 3b 24 a2 a5
   nvmet: creating nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:40931c1f-4bf3-4c23-b1af-deb3f65a73e0 with DH-HMAC-CHAP.
   nvme nvme2: qid 0: no key
   nvme nvme2: qid 0: authentication setup failed
   nvme nvme2: NVME-FC{0}: reset: Reconnect attempt failed (401)
   nvme nvme2: NVME-FC{0}: Reconnect attempt in 2 seconds
   nvme nvme2: NVME-FC{0}: new ctrl: NQN "blktests-subsystem-1"
   [10505] nvmet: ctrl 1 stop keep-alive
   (NULL device *): {0:0} Association deleted
   (NULL device *): {0:0} Association freed
   (NULL device *): Disconnect LS failed: No Association
   nvme nvme2: NVME-FC{0}: controller connectivity lost. Awaiting Reconnect
   nvme nvme2: NVME-FC{0}: transport unloading: deleting ctrl
   nvme_fc: nvme_fc_exit_module: waiting for ctlr deletes
   nvme nvme2: Removing ctrl: NQN "blktests-subsystem-1"
   nvme_fc: nvme_fc_exit_module: ctrl deletes complete

I tinkered a bit with making the initial connect synchronous to see if I
could get the test case working, but I got hit by a KASAN report in
nvmf_dev_show() and haven't figured out what's going on yet.

My question is: what are we going to do about it? 4c984154efa1 ("nvme-fc: change
controllers first connect to use reconnect path") elaborates on why it was
changed to use workqueues, and I suppose these reasons still exist.

Thanks,
Daniel