nvme-tcp: io_work NULL pointer when racing with queue stop

Sagi Grimberg sagi at grimberg.me
Tue Jan 18 15:32:18 PST 2022


> Hi,

Hey Chris,

> I'm looking at a reported NULL pointer fault with controller reset
> testing under load

Just to understand, is this upstream kernel?

> and it appears that the nvme_tcp_stop_queue
> sequence is racing against scheduled io_work.  The fault occurs when
> kernel_sendpage is called from the workqueue context and hits a NULL
> sock->ops pointer, which is cleared by kernel_sock_shutdown before the
> call to cancel_work_sync.

Hmm, can you point me to where kernel_sock_shutdown is clearing
sock->ops? The assumption nvme-tcp makes is that accessing the
socket only becomes invalid after sock_release(), and is still
valid after kernel_sock_shutdown().

The sequence follows this logic:
1. kernel_sock_shutdown(SHUT_RDWR) - guarantee that we do not send or
receive anything on the wire when any of the driver contexts access
the socket from this point on.
2. nvme_tcp_restore_sock_calls() - guarantee that we are not woken up
by any socket event, incoming data, or buffer space becoming
available to send, from this point on.
3. cancel_work_sync(io_work) - cancel and fence io_work, and make sure
that it does not self-requeue. So in essence guarantee that io_work is
not triggered again (neither by self-requeue nor by socket callbacks).

At this point we should be able to safely go on and release the socket
in nvme_tcp_free_queue, as there should be no way anything can access
the socket concurrently.
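
Put differently, the current code does roughly this (simplified sketch
of __nvme_tcp_stop_queue as it stands today, matching the diff you
quote below, annotated with the steps above):

static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
{
        /* (1) stop sending/receiving on the wire */
        kernel_sock_shutdown(queue->sock, SHUT_RDWR);
        /* (2) detach our socket callbacks, no more wakeups from the net stack */
        nvme_tcp_restore_sock_calls(queue);
        /* (3) fence io_work; it must not be triggered again after this */
        cancel_work_sync(&queue->io_work);
}

Only then does nvme_tcp_free_queue() go on to call sock_release().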

The only places where io_work is triggered from the code are:
1. .queue_rq
2. socket callbacks (data_ready/write_space)
3. io_work itself (quota exhausted, or an r2t was received)

So (2)+(3) should be prevented by nvme_tcp_stop_queue as outlined
above, and (1) is prevented by quiescing the request_queue, which
happens earlier in the reset sequence (nvme_tcp_teardown_io_queues
calls nvme_stop_queues before nvme_tcp_stop_io_queues).
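
For reference, the callback restore in (2) does roughly the following
(sketch from memory of the upstream code; details may differ in your
tree):

static void nvme_tcp_restore_sock_calls(struct nvme_tcp_queue *queue)
{
        struct socket *sock = queue->sock;

        /* serialize against callbacks running under sk_callback_lock */
        write_lock_bh(&sock->sk->sk_callback_lock);
        sock->sk->sk_user_data = NULL;
        /* restore the callbacks we saved when the queue was set up */
        sock->sk->sk_data_ready = queue->data_ready;
        sock->sk->sk_state_change = queue->state_change;
        sock->sk->sk_write_space = queue->write_space;
        write_unlock_bh(&sock->sk->sk_callback_lock);
}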

So this is the logic that nvme-tcp follows to safely reset the
controller during traffic/load.

I have to say that if kernel_sock_shutdown() is clearing sock->ops
then we clearly have a wrong assumption, but I'm surprised that it
even works at all...

However, I did not find any evidence of kernel_sock_shutdown()
clearing sock->ops, nor under which conditions it would do so.
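
For what it's worth, as far as I can tell kernel_sock_shutdown() in
net/socket.c is just a thin wrapper and never touches sock->ops
(approximate, check your tree):

int kernel_sock_shutdown(struct socket *sock, enum sock_shutdown_cmd how)
{
        /* delegate to the protocol shutdown handler; sock->ops is left alone */
        return sock->ops->shutdown(sock, how);
}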

> 
> [ 4353.145377] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
> [ 4353.151921] RIP: 0010:kernel_sendpage+0x11/0xc0
> [ 4353.157005] Code: ff ff ff 31 db e9 70 ff ff ff be ff ff ff ff e9
> 49 ff ff ff 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 fd 48 83 ec 18 48
> 8b 47 20 <4c> 8b 88 a0 00 00 00 4d 85 c9 74 3b 48 8b 7e 08 48 8d 47 ff
> 83 e7
> [ 4353.177985] RSP: 0000:ffffb5870a7b7d68 EFLAGS: 00010292
> [ 4353.189185] RAX: 0000000000000000 RBX: 0000000080000000 RCX: 0000000000000048
> [ 4353.189196] RDX: 0000000000000000 RSI: ffffdffa25000000 RDI: ffff981b45de3400
> [ 4353.189199] RBP: ffff981b45de3400 R08: 00000000000000c0 R09: ffff98182073b0ec
> [ 4353.189202] R10: 000000000000000c R11: 000000000000000c R12: ffff9817c4885e38
> [ 4353.226418] R13: 0000000000000048 R14: ffff9817d0018000 R15: 0000000000000000
> [ 4353.234393] FS:  0000000000000000(0000) GS:ffff981b2fa00000(0000)
> knlGS:0000000000000000
> [ 4353.243442] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 4353.249871] CR2: 00000000000000a0 CR3: 00000001307ce004 CR4: 00000000003706f0
> [ 4353.257850] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 4353.265824] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [ 4353.273798] Call Trace:
> [ 4353.276554]  nvme_tcp_try_send_cmd_pdu+0xf4/0x250 [nvme_tcp]
> [ 4353.282901]  ? tcp_rcv_space_adjust+0x1d/0x1c0
> [ 4353.287888]  nvme_tcp_try_send+0x15a/0x330 [nvme_tcp]
> [ 4353.293553]  ? release_sock+0x43/0x90
> [ 4353.297666]  ? nvme_tcp_try_recv+0x73/0xa0 [nvme_tcp]
> [ 4353.303334]  nvme_tcp_io_work+0x37/0x90 [nvme_tcp]
> [ 4353.308723]  process_one_work+0x1e3/0x3b0
> [ 4353.313237]  worker_thread+0x50/0x3b0
> [ 4353.317335]  ? rescuer_thread+0x370/0x370
> [ 4353.321813]  kthread+0x149/0x170
> [ 4353.325440]  ? set_kthread_struct+0x40/0x40
> [ 4353.330128]  ret_from_fork+0x22/0x30
> 
> It would seem easy enough to cancel the io_work before
> kernel_sock_shutdown, but I think that still leaves an opportunity for
> it to be rescheduled.

How would it be rescheduled? By who?

>  The TCP callbacks do all appear to be protected
> by the sk_callback_lock and checks against sk_user_data, so it may be
> possible to restore those first?  I'm not sure on the best way to
> safely shut down the socket in this case.
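
For reference, the data_ready hook does roughly the following
(simplified sketch from memory of the upstream driver), so once
sk_user_data is cleared under the callback lock it should not queue
io_work again:

static void nvme_tcp_data_ready(struct sock *sk)
{
        struct nvme_tcp_queue *queue;

        read_lock_bh(&sk->sk_callback_lock);
        queue = sk->sk_user_data;
        /* after nvme_tcp_restore_sock_calls() this is NULL and we bail out */
        if (likely(queue && queue->rd_enabled))
                queue_work_on(queue->io_cpu, nvme_tcp_wq, &queue->io_work);
        read_unlock_bh(&sk->sk_callback_lock);
}
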
> 
> I'll try and get some testing on this change, but wanted to get this
> on the list and see what other suggestions you might have.
> 
> Thanks!
> Chris Leech
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index 4ceb28675fdf..d9e1f875f80d 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1567,9 +1567,9 @@ static void nvme_tcp_restore_sock_calls(struct nvme_tcp_queue *queue)
> 
>   static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
>   {
> -       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
>          nvme_tcp_restore_sock_calls(queue);
>          cancel_work_sync(&queue->io_work);
> +       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
>   }

On the surface, this seems harmless (other than redundantly
sending/receiving data to/from the wire while we are terminating
the queue). But it is unclear to me why this would fix anything...

Obviously from the bug report it seems that there is something
that is not being taken into account, but I don't yet see what it
is, nor how this patch fixes it.
