nvme/tcp: infinite loop, livelock, and OOPS issues on disconnect

Jonathan Nicklin jnicklin at blockbridge.com
Mon Sep 26 06:42:25 PDT 2022


On Mon, Sep 19, 2022 at 11:27 AM Sagi Grimberg <sagi at grimberg.me> wrote:
>
>
> >>> Hello,
> >>>
> >>> We've seen system hangs and OOPS with nvme/tcp with host-initiated
> >>> disconnects during fault testing. To help debug, I've developed a short repro
> >>> scenario (~5 minutes runtime) that triggers the issue on 5.15.65.
> >>
> >> Does this happen on upstream as well?
> >
> > I have moved my testing over to 5.19.8 and have reproduced the issue.
> > However, the cycle time is now much longer (~12 hours), likely due to a
> > change in behavior from other fixes.
>
> probably...
>
> >>> To reproduce:
> >>>
> >>> On the host:
> >>>    - run connect and disconnect to a single subsystem in a loop
> >>> On the target:
> >>>    - inject random transport failures every 1-2 seconds
> >>
> >> Is this a single controller in the subsystem? or more?
> >
> > Single controller subsystem.
> >
> >>> There are a few observed results:
> >>>
> >>> 1. The system either hangs completely (reboot required) or nvme management
> >>> commands hang (e.g., nvme list and nvme connect)
> >>
> >> Would it be possible to check with the following applied?
> >> 41d07df7de84 ("nvme-tcp: always fail a request when sending it failed")
> >
> > This patch was not present in the original testing on 5.15.65. It is present in
> > the latest repro on 5.19.8.
>
> OK.
>
> >> Perhaps it's possible that we don't fail the command, and we are unable
> >> to get to the iterator that cancels all commands because
> >> nvme_remove_namespaces() is also blocked?
> >>
> >>> 2. If the system stays alive, "nvme nvme1: failed to send request -32" streams
> >>> from the kernel in an infinite loop.
> >>
> >> Hmm, this is strange... does this stream appear while the system is hung
> >> without forward progress, or while the test continues?
> >
> > On 5.15.65, this was streaming in the kernel log while the test was hung;
> > any command via nvme-cli hung. On 5.19.8, I did not see the streaming
> > behavior.
>
> OK.
>
> >>>
> >>> 3. nvme tracing shows a loop trying to read the same lba:
> >>>
> >>> kworker/2:1H-6376    [002] .....  4314.468725: nvme_setup_cmd:
> >>>     nvme1: disk=nvme1c1n1, qid=3, cmdid=61519, nsid=1, flags=0x0, meta=0x0
> >>>     cmd=(nvme_cmd_read slba=8, len=7, ctrl=0x0, dsmgmt=0, reftag=0)
> >>> kworker/2:1H-6376    [002] .N...  4314.473568: nvme_complete_rq:
> >>>     nvme1: disk=nvme1c1n1, qid=3, cmdid=61519, res=0x0, retries=0, flags=0x0,
> >>>     status=0x370
> >>>
> >>> 4. Hung task warnings for kernel worker threads:
> >>>
> >>> INFO: task kworker/u16:2:6524 blocked for more than 491 seconds.
> >>>         Not tainted 5.15.65-f0.el7.x86_64 #1
> >>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>> task:kworker/u16:2   state:D stack:    0 pid: 6524 ppid:     2 flags:0x00004000
> >>> Workqueue: nvme-wq nvme_scan_work [nvme_core]
> >>>
> >>> Call Trace:
> >>>    <TASK>
> >>>    __schedule+0x390/0x910
> >>>    ? scan_shadow_nodes+0x40/0x40
> >>>    schedule+0x55/0xe0
> >>>    io_schedule+0x16/0x40
> >>>    do_read_cache_page+0x55d/0x850
> >>>    ? __page_cache_alloc+0x90/0x90
> >>>    read_cache_page+0x12/0x20
> >>>    read_part_sector+0x3f/0x110
> >>>    amiga_partition+0x3d/0x3e0
> >>>    ? osf_partition+0x33/0x220
> >>>    ? put_partition+0x90/0x90
> >>>    bdev_disk_changed+0x1fe/0x4d0
> >>>    blkdev_get_whole+0x7b/0x90
> >>>    blkdev_get_by_dev+0xda/0x2d0
> >>>    device_add_disk+0x356/0x3b0
> >>>    nvme_mpath_set_live+0x13c/0x1a0 [nvme_core]
> >>>    ? nvme_parse_ana_log+0xae/0x1a0 [nvme_core]
> >>>    nvme_update_ns_ana_state+0x3a/0x40 [nvme_core]
> >>>    nvme_mpath_add_disk+0x120/0x160 [nvme_core]
> >>>    nvme_alloc_ns+0x594/0xa00 [nvme_core]
> >>>    nvme_validate_or_alloc_ns+0xb9/0x1a0 [nvme_core]
> >>>    ? __nvme_submit_sync_cmd+0x1d2/0x210 [nvme_core]
> >>>    nvme_scan_work+0x281/0x410 [nvme_core]
> >>>    process_one_work+0x1be/0x380
> >>>    worker_thread+0x37/0x3b0
> >>>    ? process_one_work+0x380/0x380
> >>>    kthread+0x12d/0x150
> >>>    ? set_kthread_struct+0x50/0x50
> >>>    ret_from_fork+0x1f/0x30
> >>>    </TASK>
> >>> INFO: task nvme:6725 blocked for more than 491 seconds.
> >>>         Not tainted 5.15.65-f0.el7.x86_64 #1
> >>> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >>> task:nvme            state:D stack:    0 pid: 6725 ppid:  1761 flags:0x00004000
> >>> Call Trace:
> >>>    <TASK>
> >>>    __schedule+0x390/0x910
> >>>    ? sched_clock+0x9/0x10
> >>>    schedule+0x55/0xe0
> >>>    schedule_timeout+0x24b/0x2e0
> >>>    ? try_to_wake_up+0x358/0x510
> >>>    ? finish_task_switch+0x88/0x2c0
> >>>    wait_for_completion+0xa5/0x110
> >>>    __flush_work+0x144/0x210
> >>>    ? worker_attach_to_pool+0xc0/0xc0
> >>>    flush_work+0x10/0x20
> >>>    nvme_remove_namespaces+0x41/0xf0 [nvme_core]
> >>>    nvme_do_delete_ctrl+0x47/0x66 [nvme_core]
> >>>    nvme_sysfs_delete.cold.96+0x8/0xd [nvme_core]
> >>>    dev_attr_store+0x14/0x30
> >>>    sysfs_kf_write+0x38/0x50
> >>>    kernfs_fop_write_iter+0x146/0x1d0
> >>>    new_sync_write+0x114/0x1b0
> >>>    ? intel_pmu_handle_irq+0xe0/0x420
> >>>    vfs_write+0x18d/0x270
> >>>    ksys_write+0x61/0xe0
> >>>    __x64_sys_write+0x1a/0x20
> >>>    do_syscall_64+0x37/0x90
> >>>    entry_SYSCALL_64_after_hwframe+0x61/0xcb
> >>> RIP: 0033:0x7ff4b79c4ba0
> >>> RSP: 002b:00007fff37cea528 EFLAGS: 00000246
> >>>    ORIG_RAX: 0000000000000001
> >>> RAX: ffffffffffffffda RBX: 00000000006532e0 RCX: 00007ff4b79c4ba0
> >>> RDX: 0000000000000001 RSI: 000000000043b146 RDI: 0000000000000003
> >>> RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000001962fe0
> >>> R10: 00007fff37ce9f60 R11: 0000000000000246 R12: 0000000001963090
> >>> R13: 00007fff37ceab20 R14: 0000000000000037 R15: 00000000006535c0
> >>>    </TASK>
> >>>
> >>> 5. And, an occasional OOPS (same test, possibly a different issue)
> >
> >> There were some fixes related to oops observed sporadically with
> >> connect/disconnect/reset loops. It was fixed in:
> >> 0fa0f99fc84e ("nvme: fix a possible use-after-free in controller reset
> >> during load")
> >>
> >> Is it applied on your system?
> >
> > This patch was not present in the original testing on 5.15.65. It is present in
> > the latest repro on 5.19.8.
>
> Thanks.
>
> >
> >>>
> >>> BUG: kernel NULL pointer dereference, address: 0000000000000008
> >>> #PF: supervisor read access in kernel mode
> >>> #PF: error_code(0x0000) - not-present page
> >>> PGD 0
> >>> P4D 0
> >>> Oops: 0000 [#1] SMP NOPTI
> >>> CPU: 4 PID: 2131 Comm: kworker/4:2H Not tainted 5.15.65-f0.el7.x86_64 #1
> >>> Hardware name: Supermicro SYS-1019S-WR/X11SSW-F, BIOS 2.1a 03/07/2018
> >>> Workqueue:  0x0 (events_highpri)
> >>> RIP: 0010:process_one_work+0x34/0x380
> >>> Code: 57 41 56 41 55 41 54 49 89 fc 53 48 89 f3 48 83 ec 10 48 8b 06
> >>> 4c 8b 7f 48 49 89 c5 45 30 ed a8 04 b8 00 00 00 00 4c 0f 44 e8 <49> 8b
> >>> 45 08 44 8b b0 00 01 00 00 41 83 e6 20 41 f6 47 10 04 75 0f
> >>> RSP: 0018:ffffc90001527e90 EFLAGS: 00010046
> >>> RAX: 0000000000000000 RBX: ffff8881459395d0 RCX: dead000000000122
> >>> RDX: 00000001001d8113 RSI: ffff8881459395d0 RDI: ffff888124d85c00
> >>> RBP: ffffc90001527ec8 R08: ffff888857b2cea8 R09: 0000000000000381
> >>> R10: 00000000000001ce R11: 0000000000000000 R12: ffff888124d85c00
> >>> R13: 0000000000000000 R14: ffff888857b2c120 R15: ffff888857b2c100
> >>> FS:  0000000000000000(0000) GS:ffff888857b00000(0000) knlGS:0000000000000000
> >>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >>> CR2: 0000000000000008 CR3: 000000000260a006 CR4: 00000000003706e0
> >>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >>> Call Trace:
> >>>    <TASK>
> >>>    worker_thread+0x37/0x3b0
> >>>    ? process_one_work+0x380/0x380
> >>>    kthread+0x12d/0x150
> >>>    ? set_kthread_struct+0x50/0x50
> >>>    ret_from_fork+0x1f/0x30
> >>>    </TASK>
> >>>
> >>> If there is some way that I can assist, let me know.
>
> Thanks for moving to a recent kernel.
>
> Based on the stack trace, the reason for the hang is that
> disconnect attempts to remove all the namespaces before starting
> to tear down the controller; before doing that, it flushes any
> ns scanning so the two do not run concurrently.
>
> But ns scanning is stuck while discovering a namespace, because
> adding it as a disk triggers some I/O, most likely for things
> like partition scanning.
>
> The ns scan is stuck only because its inflight I/O never completes,
> which is likely related to the errors you are injecting from the
> controller side every now and then...
>
> I do see a possible race here: right when a transport error is
> observed, error recovery is scheduled (err_work), but before it
> executes, delete_ctrl starts tearing down the controller,
> in particular calling nvme_stop_ctrl() -> nvme_tcp_stop_ctrl(),
> which cancels the err_work.
>
> If error recovery does not run, nothing will fail inflight
> commands unless the controller fails them explicitly.
>
> What would be helpful is to know whether, once the issue reproduces,
> ctrl->err_work was canceled without executing.
>
> So if the theory is correct, this fix is warranted:
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index d5871fd6f769..2524b5304bfb 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -2237,7 +2237,7 @@ static void nvme_reset_ctrl_work(struct
> work_struct *work)
>
>   static void nvme_tcp_stop_ctrl(struct nvme_ctrl *ctrl)
>   {
> -       cancel_work_sync(&to_tcp_ctrl(ctrl)->err_work);
> +       flush_work(&to_tcp_ctrl(ctrl)->err_work);
>          cancel_delayed_work_sync(&to_tcp_ctrl(ctrl)->connect_work);
>   }
> --
>
> This will let err_work run, which guarantees that all inflight
> requests are completed/cancelled.

Good news: I ran with the proposed fix continuously for three days
without reproducing the issue. Is there any specific instrumentation
you want me to run to get direct confirmation of the bug/fix? If not,
I think we're all set.
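
If it would help, I could also run the unpatched code with a small debug
warning along the following lines in drivers/nvme/host/tcp.c (an untested
sketch; the work_pending() check and message text are my own additions) to
see whether err_work is ever cancelled while still pending:

--
static void nvme_tcp_stop_ctrl(struct nvme_ctrl *ctrl)
{
	/*
	 * Debug only: if err_work is still pending here, it was scheduled
	 * but never ran, and cancel_work_sync() below will discard it,
	 * leaving inflight requests unfailed.
	 */
	if (work_pending(&to_tcp_ctrl(ctrl)->err_work))
		dev_warn(ctrl->device,
			 "err_work pending at stop_ctrl, cancelling before it ran\n");

	cancel_work_sync(&to_tcp_ctrl(ctrl)->err_work);
	cancel_delayed_work_sync(&to_tcp_ctrl(ctrl)->connect_work);
}
--

Hitting that warning during the repro on the unpatched kernel would directly
confirm the theory.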


