[RFC] nvme-rdma: Stop queues when starting with error recovery

Daniel Wagner dwagner at suse.de
Mon May 23 08:21:02 PDT 2022


When we enter error recovery we should stop all queue activities and
all armed timers.

For example, we could arm an ANATT timer right before we enter
error recovery but not successfully recover before the timer
fires. The timer is supposed to be active only while the controller is
in LIVE state, hence we should call nvme_stop_ctrl() when starting the
recovery activities.

Signed-off-by: Daniel Wagner <dwagner at suse.de>
---

nvme_stop_ctrl() does cancel pending ANATT timers. But so far I have
not got hold of logs from when the two controllers come back live, so
this might not work as expected.

My question is: do we just want to cancel the timer, or is
nvme_stop_ctrl() the right function here? Obviously, the same problem
exists for nvme-tcp.

[  889.241541] nvme nvme0: creating 4 I/O queues.
[  892.341152] nvme nvme0: mapped 4/0/0 default/read/poll queues.
[  892.350942] nvme nvme0: new ctrl: NQN "XXX", addr 192.20.93.101:4420
[  892.402493] nvme nvme1: creating 4 I/O queues.
[  895.392810] nvme nvme1: mapped 4/0/0 default/read/poll queues.
[  895.402029] nvme nvme1: new ctrl: NQN "XXX", addr 192.20.93.102:4420
[  895.471730] nvme nvme2: creating 4 I/O queues.
[  898.509195] nvme nvme2: mapped 4/0/0 default/read/poll queues.
[  898.519015] nvme nvme2: new ctrl: NQN "XXX", addr 192.20.193.101:4420
[  898.571169] nvme nvme3: creating 4 I/O queues.
[  901.592283] nvme nvme3: mapped 4/0/0 default/read/poll queues.
[  901.601832] nvme nvme3: new ctrl: NQN "XXX", addr 192.20.193.102:4420

[  983.429977] nvme nvme3: I/O 0 QID 0 timeout
[  983.434472] nvme nvme3: starting error recovery
[  984.549958] nvme nvme0: I/O 0 QID 0 timeout
[  984.554452] nvme nvme0: starting error recovery
[  986.962375] nvme nvme3: failed nvme_keep_alive_end_io error=10
[  986.986898] nvme nvme3: Reconnecting in 10 seconds...

[ 1226.486740] nvme nvme3: Reconnecting in 10 seconds...
[ 1227.749980] nvme nvme0: rdma connection establishment failed (-110)
[ 1227.761593] nvme nvme0: Failed reconnect attempt 18
[ 1227.766848] nvme nvme0: Reconnecting in 10 seconds...

[ 1235.685958] nvme nvme0: ANATT timeout, resetting controller.
[ 1235.692107] nvme nvme3: ANATT timeout, resetting controller.
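
To make the sequence in the log above easier to reason about, here is a
minimal user-space sketch of it. This is not kernel code: every name in
it (model_ctrl, model_arm_anatt, model_error_recovery, model_tick,
model_run) is hypothetical and exists only for illustration, and the
tick values are arbitrary. The only thing it assumes from this patch is
that stopping the controller cancels a pending ANATT timer.

```c
/* User-space model only; all names are hypothetical, not kernel API. */
#include <assert.h>
#include <stdbool.h>

enum model_state { MODEL_LIVE, MODEL_CONNECTING };

struct model_ctrl {
	enum model_state state;
	bool anatt_armed;	/* stands in for a pending ANATT timer */
	int anatt_expires;	/* tick at which the timer would fire */
	int fired_not_live;	/* timer expirations seen outside LIVE */
};

/* Arm the modelled ANATT timer 'timeout' ticks after 'now'. */
static void model_arm_anatt(struct model_ctrl *c, int now, int timeout)
{
	assert(timeout > 0);
	c->anatt_armed = true;
	c->anatt_expires = now + timeout;
}

/* Enter error recovery, optionally cancelling the timer first. */
static void model_error_recovery(struct model_ctrl *c, bool cancel_timer)
{
	if (cancel_timer)
		c->anatt_armed = false;
	c->state = MODEL_CONNECTING;
}

/* Process one tick: fire the timer if it is armed and has expired. */
static void model_tick(struct model_ctrl *c, int now)
{
	if (c->anatt_armed && now >= c->anatt_expires) {
		c->anatt_armed = false;
		if (c->state != MODEL_LIVE)
			c->fired_not_live++;
	}
}

/* Returns how often the timer fired while the controller was not LIVE. */
static int model_run(bool cancel_timer)
{
	struct model_ctrl c = { .state = MODEL_LIVE };
	int now;

	model_arm_anatt(&c, 0, 5);		/* armed just before the failure */
	model_error_recovery(&c, cancel_timer);	/* reconnect never succeeds */
	for (now = 1; now <= 10; now++)
		model_tick(&c, now);
	return c.fired_not_live;
}
```

model_run(false) models the behaviour without this patch, where the
timer stays armed across error recovery; model_run(true) models the
timer being cancelled on entry to error recovery.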

 drivers/nvme/host/rdma.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/drivers/nvme/host/rdma.c b/drivers/nvme/host/rdma.c
index b87c8ae41d9b..209dd1becd6c 100644
--- a/drivers/nvme/host/rdma.c
+++ b/drivers/nvme/host/rdma.c
@@ -1197,8 +1197,7 @@ static void nvme_rdma_error_recovery_work(struct work_struct *work)
 	struct nvme_rdma_ctrl *ctrl = container_of(work,
 			struct nvme_rdma_ctrl, err_work);
 
-	nvme_stop_keep_alive(&ctrl->ctrl);
-	flush_work(&ctrl->ctrl.async_event_work);
+	nvme_stop_ctrl(&ctrl->ctrl);
 	nvme_rdma_teardown_io_queues(ctrl, false);
 	nvme_start_queues(&ctrl->ctrl);
 	nvme_rdma_teardown_admin_queue(ctrl, false);
-- 
2.29.2



