Hang at NVME Host caused by Controller reset
Sagi Grimberg
sagi at grimberg.me
Thu Jul 30 16:59:37 EDT 2020
>>>>>> This time, with "nvme-fabrics: allow to queue requests for live queues"
>>>>>> patch applied, I see hang only at blk_queue_enter():
>>>>>
>>>>> Interesting, does the reset loop hang, or is it able to make
>>>>> forward progress?
>>>>
>>>> Looks like the freeze depth is messed up by the timeout handler.
>>>> We shouldn't call nvme_tcp_teardown_io_queues in the timeout handler
>>>> because it messes with the freeze depth, causing the unfreeze to not
>>>> wake the waiter (blk_queue_enter). We should simply stop the queue
>>>> and complete the I/O. The condition was wrong too: we need to do
>>>> this only for the connect command (which cannot reset the timer),
>>>> so we should check for reserved in the timeout handler.
>>>>
>>>> Can you please try this patch?
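For reference, the direction described above looks roughly like the
sketch below. This is only a sketch of the idea, not the patch that was
attached; the names follow the upstream nvme-tcp driver
(nvme_tcp_timeout, nvme_tcp_stop_queue, nvme_tcp_error_recovery), and
details may differ.

static enum blk_eh_timer_return
nvme_tcp_timeout(struct request *rq, bool reserved)
{
        struct nvme_tcp_request *req = blk_mq_rq_to_pdu(rq);
        struct nvme_ctrl *ctrl = &req->queue->ctrl->ctrl;

        if (reserved) {
                /*
                 * Reserved tag: the connect command, which cannot reset
                 * the timer. Don't call nvme_tcp_teardown_io_queues()
                 * here (that changes the freeze depth, so the later
                 * unfreeze never wakes the blk_queue_enter() waiter);
                 * just stop this queue and complete the timed-out request.
                 */
                nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
                nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
                blk_mq_complete_request(rq);
                return BLK_EH_DONE;
        }

        /* Everything else goes through the normal error recovery path. */
        nvme_tcp_error_recovery(ctrl);
        return BLK_EH_RESET_TIMER;
}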
>>> Even with this patch I see hangs, as shown below:
>>
>> While it's omitted from the log you provided, it's possible
>> that we just reset the timer for timed-out admin commands, which
>> makes the error recovery flow stuck.
>>
>> Can you please try this:
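Again a sketch of the idea rather than the patch that was attached: the
extra change would widen the check in the same timeout handler so that
a timed-out command on a controller that isn't live (e.g. an admin
command during error recovery) is also completed, instead of only
rearming the timer.

        if (reserved || ctrl->state != NVME_CTRL_LIVE) {
                /*
                 * Resetting or connecting: rearming the timer for a
                 * timed-out admin command would leave error recovery
                 * stuck, so stop the queue and complete the request.
                 */
                nvme_tcp_stop_queue(ctrl, nvme_tcp_queue_id(req->queue));
                nvme_req(rq)->status = NVME_SC_HOST_ABORTED_CMD;
                blk_mq_complete_request(rq);
                return BLK_EH_DONE;
        }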
> Sagi,
>
> Now the malloc failure below at the target is causing a blk_queue_enter/wait_event hang at the host.
>
> [ +0.000004] Workqueue: nvmet_tcp_wq nvmet_tcp_io_work [nvmet_tcp]
> [ +0.000001] Call Trace:
> [ +0.000007] dump_stack+0x57/0x6b
> [ +0.000005] warn_alloc+0xf7/0x160
> [ +0.000003] __alloc_pages_slowpath.constprop.123+0x831/0x9f0
> [ +0.000001] __alloc_pages_nodemask+0x2cd/0x300
> [ +0.000004] kmalloc_order+0x1d/0x90
> [ +0.000002] kmalloc_order_trace+0x1d/0xb0
> [ +0.000002] nvmet_tcp_install_queue+0x3f/0xd0 [nvmet_tcp]
> [ +0.000004] nvmet_install_queue+0x86/0x120 [nvmet]
> [ +0.000001] nvmet_execute_io_connect+0xf4/0x1b0 [nvmet]
> [ +0.000002] nvmet_tcp_io_work+0xa46/0xadb [nvmet_tcp]
> [ +0.000002] process_one_work+0x149/0x380
> [ +0.000002] worker_thread+0x41/0x3a0
>
OK, that is a new bug... Let's take care of that separately.
> This malloc failure may be expected, as I am running the reset
> controller script in a loop without any sleep (it took more than 4
> hours to hit this malloc failure on my target machine). But I think
> the hang issue at the host needs to be addressed.
The I/O won't complete unless you disconnect from the controller; it
shouldn't hang.
BTW, handling a controller reset during I/O when the target all of a
sudden becomes unresponsive in the middle of the reset sequence
requires some more fixes, which I'll send out soon.
>
>
> Host side dmesg:
> [ +0.000198] nvme nvme1: creating 12 I/O queues.
> [ +0.005365] nvme nvme1: Connect command failed, error wo/DNR bit: 6
> [ +0.000165] nvme nvme1: failed to connect queue: 1 ret=6
> [ +0.000418] nvme nvme1: Reconnecting in 10 seconds...
> [ +35.705727] cxgb4 0000:02:00.4: Port 0 link down, reason: Link Down
> [ +0.615556] cxgb4 0000:02:00.4 enp2s0f4: link up, 40Gbps, full-duplex, Tx/Rx PAUSE
> [ +0.001146] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0f4: link becomes ready
> [Jul30 05:51] nvme nvme1: queue 0: timeout request 0x1 type 4
> [ +0.000020] nvme nvme1: Connect command failed, error wo/DNR bit: -16388
> [ +0.000170] nvme nvme1: failed to connect queue: 0 ret=-4
> [ +0.000148] nvme nvme1: Failed reconnect attempt 2
> [ +0.000001] nvme nvme1: Reconnecting in 10 seconds...
> [ +13.311922] nvme nvme1: failed to connect socket: -110
> [ +0.000125] nvme nvme1: Failed reconnect attempt 3
> [ +0.000001] nvme nvme1: Reconnecting in 10 seconds...
> [ +2.586017] cxgb4 0000:02:00.4: Port 0 link down, reason: Link Down
> [ +0.557409] cxgb4 0000:02:00.4 enp2s0f4: link up, 40Gbps, full-duplex, Tx/Rx PAUSE
> [ +0.001144] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0f4: link becomes ready
> [Jul30 05:52] cxgb4 0000:02:00.4: Port 0 link down, reason: Link Down
> [ +0.551397] cxgb4 0000:02:00.4 enp2s0f4: link up, 40Gbps, full-duplex, Tx/Rx PAUSE
> [ +0.001073] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0f4: link becomes ready
> [Jul30 05:53] INFO: task nvme:5944 blocked for more than 122 seconds.
> [ +0.000125] Not tainted 5.8.0-rc7ekr+ #2
> [ +0.000111] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [ +0.000177] nvme D14392 5944 2796 0x00004000
> [ +0.000123] Call Trace:
> [ +0.001375] __schedule+0x32b/0x670
> [ +0.000125] schedule+0x45/0xb0
> [ +0.000138] blk_queue_enter+0x1e9/0x250
> [ +0.000137] ? wait_woken+0x70/0x70
> [ +0.000116] blk_mq_alloc_request+0x53/0xc0
> [ +0.000122] nvme_alloc_request+0x61/0x70 [nvme_core]
> [ +0.000121] nvme_submit_user_cmd+0x50/0x310 [nvme_core]
> [ +0.000123] nvme_user_cmd+0x12e/0x1c0 [nvme_core]
> [ +0.000142] ? _copy_to_user+0x22/0x30
> [ +0.000135] blkdev_ioctl+0x100/0x250
> [ +0.000122] block_ioctl+0x34/0x40
> [ +0.000129] ksys_ioctl+0x82/0xc0
That is expected: without multipath, the I/O will not complete, and
unless the target recovers (or the host disconnects), the I/O keeps
waiting. That is the expected behavior.
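To illustrate why it keeps waiting: the blk_queue_enter() in the host
trace blocks while the queue is frozen, and the freeze taken during
teardown is only undone once the I/O queues are re-established (or the
controller is deleted on disconnect). A minimal sketch of that pairing,
using the real nvme core helpers but a made-up wrapper name
(example_reconnect_flow) purely for illustration:

static void example_reconnect_flow(struct nvme_ctrl *ctrl)
{
        nvme_start_freeze(ctrl);   /* new blk_queue_enter() callers block */
        nvme_stop_queues(ctrl);    /* quiesce the I/O queues */

        /* ... tear down and re-establish the transport queues ... */

        nvme_start_queues(ctrl);   /* unquiesce */
        nvme_wait_freeze(ctrl);    /* wait for in-flight requests to drain */
        nvme_unfreeze(ctrl);       /* wakes the blk_queue_enter() waiters */
}

If the reconnect never gets that far, only a disconnect (which deletes
the controller and its queues) releases the waiter.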