nvme-tcp: io_work NULL pointer when racing with queue stop

Chris Leech cleech at redhat.com
Tue Jan 18 10:36:21 PST 2022


Hi,

I'm looking at a reported NULL pointer fault with controller reset
testing under load, and it appears that the nvme_tcp_stop_queue
sequence is racing against scheduled io_work.  The fault occurs when
kernel_sendpage is called from the workqueue context and hits a NULL
sock->ops pointer, which is cleared by kernel_sock_shutdown before the
call to cancel_work_sync.
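
For context, kernel_sendpage() in net/socket.c dereferences sock->ops
unconditionally, so a cleared ops pointer faults on the load of
ops->sendpage (the CR2 of 0xa0 below looks like it lines up with the
offset of ->sendpage in struct proto_ops on a typical config).  Roughly,
from the v5.x source (paraphrased from memory, not a verbatim copy):

    int kernel_sendpage(struct socket *sock, struct page *page, int offset,
                        size_t size, int flags)
    {
            if (sock->ops->sendpage) {      /* faults here when sock->ops is NULL */
                    WARN_ONCE(!sendpage_ok(page),
                              "improper page for zero-copy send");
                    return sock->ops->sendpage(sock, page, offset, size, flags);
            }
            return sock_no_sendpage(sock, page, offset, size, flags);
    }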

[ 4353.145377] Workqueue: nvme_tcp_wq nvme_tcp_io_work [nvme_tcp]
[ 4353.151921] RIP: 0010:kernel_sendpage+0x11/0xc0
[ 4353.157005] Code: ff ff ff 31 db e9 70 ff ff ff be ff ff ff ff e9 49 ff ff ff 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 fd 48 83 ec 18 48 8b 47 20 <4c> 8b 88 a0 00 00 00 4d 85 c9 74 3b 48 8b 7e 08 48 8d 47 ff 83 e7
[ 4353.177985] RSP: 0000:ffffb5870a7b7d68 EFLAGS: 00010292
[ 4353.189185] RAX: 0000000000000000 RBX: 0000000080000000 RCX: 0000000000000048
[ 4353.189196] RDX: 0000000000000000 RSI: ffffdffa25000000 RDI: ffff981b45de3400
[ 4353.189199] RBP: ffff981b45de3400 R08: 00000000000000c0 R09: ffff98182073b0ec
[ 4353.189202] R10: 000000000000000c R11: 000000000000000c R12: ffff9817c4885e38
[ 4353.226418] R13: 0000000000000048 R14: ffff9817d0018000 R15: 0000000000000000
[ 4353.234393] FS:  0000000000000000(0000) GS:ffff981b2fa00000(0000) knlGS:0000000000000000
[ 4353.243442] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4353.249871] CR2: 00000000000000a0 CR3: 00000001307ce004 CR4: 00000000003706f0
[ 4353.257850] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 4353.265824] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 4353.273798] Call Trace:
[ 4353.276554]  nvme_tcp_try_send_cmd_pdu+0xf4/0x250 [nvme_tcp]
[ 4353.282901]  ? tcp_rcv_space_adjust+0x1d/0x1c0
[ 4353.287888]  nvme_tcp_try_send+0x15a/0x330 [nvme_tcp]
[ 4353.293553]  ? release_sock+0x43/0x90
[ 4353.297666]  ? nvme_tcp_try_recv+0x73/0xa0 [nvme_tcp]
[ 4353.303334]  nvme_tcp_io_work+0x37/0x90 [nvme_tcp]
[ 4353.308723]  process_one_work+0x1e3/0x3b0
[ 4353.313237]  worker_thread+0x50/0x3b0
[ 4353.317335]  ? rescuer_thread+0x370/0x370
[ 4353.321813]  kthread+0x149/0x170
[ 4353.325440]  ? set_kthread_struct+0x40/0x40
[ 4353.330128]  ret_from_fork+0x22/0x30

It would seem easy enough to cancel the io_work before
kernel_sock_shutdown, but I think that still leaves a window for it to
be rescheduled.  The TCP callbacks all appear to be protected by
sk_callback_lock and checks against sk_user_data, so it may be possible
to restore those first?  I'm not sure of the best way to safely shut
down the socket in this case.
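
From memory, the nvme-tcp socket callbacks look roughly like the sketch
below (paraphrased, not verbatim): every path that can requeue io_work
checks sk_user_data under sk_callback_lock, and
nvme_tcp_restore_sock_calls() takes that lock as a writer while it
unhooks the callbacks and clears sk_user_data.

    static void nvme_tcp_data_ready(struct sock *sk)
    {
            struct nvme_tcp_queue *queue;

            read_lock_bh(&sk->sk_callback_lock);
            queue = sk->sk_user_data;
            if (likely(queue && queue->rd_enabled))
                    queue_work_on(queue->io_cpu, nvme_tcp_wq,
                                  &queue->io_work);
            read_unlock_bh(&sk->sk_callback_lock);
    }

If the callbacks are restored under that lock before cancel_work_sync(),
nothing should be left that can requeue io_work afterwards, which is what
the reordering below relies on.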

I'll try to get some testing on this change, but I wanted to get it on
the list first and see what other suggestions you might have.

Thanks!
Chris Leech

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 4ceb28675fdf..d9e1f875f80d 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -1567,9 +1567,9 @@ static void nvme_tcp_restore_sock_calls(struct nvme_tcp_queue *queue)

 static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
 {
-       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
        nvme_tcp_restore_sock_calls(queue);
        cancel_work_sync(&queue->io_work);
+       kernel_sock_shutdown(queue->sock, SHUT_RDWR);
 }
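
With this ordering the callbacks are unhooked first, so data_ready and
write_space can no longer requeue io_work; cancel_work_sync() then waits
out any io_work that is already running; and the socket is only shut
down once nothing is left that can be sending on it.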



