nvme-tcp: io_work NULL pointer when racing with queue stop

Chris Leech cleech at redhat.com
Tue Jan 18 17:31:40 PST 2022


On Tue, Jan 18, 2022 at 3:32 PM Sagi Grimberg <sagi at grimberg.me> wrote:
>
> > Hi,
>
> Hey Chris,
>
> > I'm looking at a reported NULL pointer fault with controller reset
> > testing under load
>
> Just to understand, is this upstream kernel?

This was a CentOS Stream 9 kernel, which should be largely in sync
with upstream currently.

> > and it appears that the nvme_tcp_stop_queue
> > sequence is racing against scheduled io_work.  The fault occurs when
> > kernel_sendpage is called from the workqueue context and hits a NULL
> > sock->ops pointer, which is cleared by kernel_sock_shutdown before the
> > call to cancel_work_sync.
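
The faulting dereference is right at the start of kernel_sendpage(),
which checks for a sendpage op before calling it.  Roughly, from my
reading of net/socket.c (paraphrased, details may differ by kernel
version):

int kernel_sendpage(struct socket *sock, struct page *page, int offset,
                    size_t size, int flags)
{
        if (sock->ops->sendpage)  /* oopses here if sock->ops is NULL */
                return sock->ops->sendpage(sock, page, offset, size, flags);

        return sock_no_sendpage(sock, page, offset, size, flags);
}

With sock->ops == NULL, the load of ops->sendpage faults at a small
offset from NULL, which lines up with the CR2 value in the trace below.
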
>
> Hmm, can you point me to where kernel_sock_shutdown is clearing
> sock->ops? The assumption that nvme-tcp takes is that accessing
> the socket is not allowed only after sock_release() but
> still valid after kernel_sock_shutdown.

Looks like I need to apologize, Sagi.  I'm not sure where I got turned
around looking at the socket code, but you're right: kernel_sock_shutdown
doesn't look to be the culprit here.

Thank you for the detailed description below. I'm going to go back
to my crash report and take another look at this one.

- Chris

> The sequence follows this logic:
> 1. kernel_sock_shutdown(SHUT_RDWR) - guarantee we do not send or
> receive anything to/from the wire even if any of the driver contexts
> access the socket from this point on.
> 2. nvme_tcp_restore_sock_calls() - guarantee that we are not woken up
> by any socket event (incoming data, or buffer space becoming
> available to send) from this point on.
> 3. cancel_work_sync(io_work) - cancel and fence io_work, and make sure
> that it is not self-requeued. So in essence guarantee that io_work is
> not triggered again (not by self-requeue, nor by socket callbacks).
>
> Here we should be able to safely continue to release the socket
> in nvme_tcp_free_queue as there should be no way anything can access
> the socket concurrently.
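
For anyone following along, step (2) is the part that detaches the
driver from the socket callbacks.  From my reading of the driver it
looks roughly like this (a sketch, not the literal upstream code):

static void nvme_tcp_restore_sock_calls(struct nvme_tcp_queue *queue)
{
        struct socket *sock = queue->sock;

        /* serialize against the callbacks running in softirq context */
        write_lock_bh(&sock->sk->sk_callback_lock);
        sock->sk->sk_user_data = NULL;
        /* put back the original TCP callbacks saved at queue setup */
        sock->sk->sk_data_ready = queue->data_ready;
        sock->sk->sk_state_change = queue->state_change;
        sock->sk->sk_write_space = queue->write_space;
        write_unlock_bh(&sock->sk->sk_callback_lock);
}

Once that returns, any data_ready/write_space events go back to the
plain TCP handlers, so they can no longer queue io_work.
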
>
> The only places where io_work is triggered from the code are:
> 1. .queue_work
> 2. socket_event (data_ready/write_space)
> 3. from io_work itself (quota exhausted, or r2t was received)
>
> So (2)+(3) should be prevented by nvme_tcp_stop_queue as outlined
> above, and (1) is prevented by quiescing the request_queue which is
> happening earlier in the reset sequence (nvme_tcp_teardown_io_queues
> calls nvme_stop_queues before nvme_tcp_stop_io_queues).
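
So the ordering in the reset path looks, simplified, something like
this (paraphrasing; the real function also deals with the admin queue,
outstanding requests, and the tagset):

static void nvme_tcp_teardown_io_queues(struct nvme_ctrl *ctrl, bool remove)
{
        nvme_stop_queues(ctrl);         /* quiesce request queues: stops (1) */
        nvme_tcp_stop_io_queues(ctrl);  /* per queue: shutdown, restore
                                         * callbacks, cancel io_work: (2)+(3) */
        /* ... fail or requeue outstanding requests, optionally free queues ... */
}
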
>
> So this is the logic that nvme-tcp follows to safely reset the
> controller during traffic/load.
>
> I have to say that if kernel_sock_shutdown() is clearing sock->ops
> then we clearly have a wrong assumption, but I'm surprised that it
> even works at all...
>
> However, I did not find any evidence of kernel_sock_shutdown() clearing
> sock->ops, nor of the conditions under which it would.
>
> >
> > [ 4353.145377] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
> > [ 4353.151921] RIP: 0010:kernel_sendpage+0x11/0xc0
> > [ 4353.157005] Code: ff ff ff 31 db e9 70 ff ff ff be ff ff ff ff e9
> > 49 ff ff ff 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 fd 48 83 ec 18 48
> > 8b 47 20 <4c> 8b 88 a0 00 00 00 4d 85 c9 74 3b 48 8b 7e 08 48 8d 47 ff
> > 83 e7
> > [ 4353.177985] RSP: 0000:ffffb5870a7b7d68 EFLAGS: 00010292
> > [ 4353.189185] RAX: 0000000000000000 RBX: 0000000080000000 RCX: 0000000000000048
> > [ 4353.189196] RDX: 0000000000000000 RSI: ffffdffa25000000 RDI: ffff981b45de3400
> > [ 4353.189199] RBP: ffff981b45de3400 R08: 00000000000000c0 R09: ffff98182073b0ec
> > [ 4353.189202] R10: 000000000000000c R11: 000000000000000c R12: ffff9817c4885e38
> > [ 4353.226418] R13: 0000000000000048 R14: ffff9817d0018000 R15: 0000000000000000
> > [ 4353.234393] FS:  0000000000000000(0000) GS:ffff981b2fa00000(0000)
> > knlGS:0000000000000000
> > [ 4353.243442] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 4353.249871] CR2: 00000000000000a0 CR3: 00000001307ce004 CR4: 00000000003706f0
> > [ 4353.257850] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 4353.265824] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [ 4353.273798] Call Trace:
> > [ 4353.276554]  nvme_tcp_try_send_cmd_pdu+0xf4/0x250 [nvme_tcp]
> > [ 4353.282901]  ? tcp_rcv_space_adjust+0x1d/0x1c0
> > [ 4353.287888]  nvme_tcp_try_send+0x15a/0x330 [nvme_tcp]
> > [ 4353.293553]  ? release_sock+0x43/0x90
> > [ 4353.297666]  ? nvme_tcp_try_recv+0x73/0xa0 [nvme_tcp]
> > [ 4353.303334]  nvme_tcp_io_work+0x37/0x90 [nvme_tcp]
> > [ 4353.308723]  process_one_work+0x1e3/0x3b0
> > [ 4353.313237]  worker_thread+0x50/0x3b0
> > [ 4353.317335]  ? rescuer_thread+0x370/0x370
> > [ 4353.321813]  kthread+0x149/0x170
> > [ 4353.325440]  ? set_kthread_struct+0x40/0x40
> > [ 4353.330128]  ret_from_fork+0x22/0x30
> >
> > It would seem easy enough to cancel the io_work before
> > kernel_sock_shutdown, but I think that still leaves an opportunity for
> > it to be rescheduled.
>
> How would it be rescheduled? By who?
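
The paths I had in mind were the socket callbacks and the self-requeue
at the tail of io_work, which looks roughly like this (a sketch, not
the literal code):

static void nvme_tcp_io_work(struct work_struct *w)
{
        struct nvme_tcp_queue *queue =
                container_of(w, struct nvme_tcp_queue, io_work);

        /* ... bounded try_send/try_recv loop; returns early when idle ... */

        /* time quota exhausted with work still pending: run once more */
        queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
}

But as you lay out above, the callbacks are restored before
cancel_work_sync() and the self-requeue is fenced by it, so neither
path should fire once __nvme_tcp_stop_queue() returns.
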
>
> >  The TCP callbacks all appear to be protected
> > by the sk_callback_lock and checks against sk_user_data, so it may be
> > possible to restore those first?  I'm not sure of the best way to
> > safely shut down the socket in this case.
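
To spell out the pattern I mean: the driver's callbacks already take
the lock and bail out once sk_user_data has been cleared, along the
lines of (again a sketch, not the exact driver code):

static void nvme_tcp_data_ready(struct sock *sk)
{
        struct nvme_tcp_queue *queue;

        read_lock_bh(&sk->sk_callback_lock);
        queue = sk->sk_user_data;
        /* NULL once nvme_tcp_restore_sock_calls() has run */
        if (likely(queue && queue->rd_enabled))
                queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
        read_unlock_bh(&sk->sk_callback_lock);
}

so restoring the callbacks before the shutdown looked like it might
close the window I was imagining.
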
> >
> > I'll try and get some testing on this change, but wanted to get this
> > on the list and see what other suggestions you might have.
> >
> > Thanks!
> > Chris Leech
> >
> > diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> > index 4ceb28675fdf..d9e1f875f80d 100644
> > --- a/drivers/nvme/host/tcp.c
> > +++ b/drivers/nvme/host/tcp.c
> > @@ -1567,9 +1567,9 @@ static void nvme_tcp_restore_sock_calls(struct nvme_tcp_queue *queue)
> >
> >   static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
> >   {
> > -       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
> >          nvme_tcp_restore_sock_calls(queue);
> >          cancel_work_sync(&queue->io_work);
> > +       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
> >   }
>
> On the surface, this seems harmless (other than redundantly
> sending/receiving data to/from the wire while we are terminating
> the queue). But it is unclear to me why this would fix anything...
>
> Obviously from the bug report, it seems that there is something
> that is not taken into account, but I don't yet see what it is
> or how this patch would fix it.
>



