[PATCH] nvme: tcp: avoid race between lock and destroy of queue_lock
Hannes Reinecke
hare at suse.de
Thu Sep 12 02:16:02 PDT 2024
On 9/12/24 08:59, Sagi Grimberg wrote:
>
>
>
> On 12/09/2024 9:27, Shin'ichiro Kawasaki wrote:
>> Commit 76d54bf20cdc ("nvme-tcp: don't access released socket during
>> error recovery") added a mutex_lock() call for the queue->queue_lock
>> in nvme_tcp_get_address(). However, the mutex_lock() races with
>> mutex_destroy() in nvme_tcp_free_queue(), and causes the WARN below.
>>
>> DEBUG_LOCKS_WARN_ON(lock->magic != lock)
>> WARNING: CPU: 3 PID: 34077 at kernel/locking/mutex.c:587
>> __mutex_lock+0xcf0/0x1220
>> Modules linked in: nvmet_tcp nvmet nvme_tcp nvme_fabrics iw_cm ib_cm
>> ib_core pktcdvd nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib
>> nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct
>> nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set
>> nf_tables qrtr sunrpc ppdev 9pnet_virtio 9pnet pcspkr netfs parport_pc
>> parport e1000 i2c_piix4 i2c_smbus loop fuse nfnetlink zram bochs
>> drm_vram_helper drm_ttm_helper ttm drm_kms_helper xfs drm sym53c8xx
>> floppy nvme scsi_transport_spi nvme_core nvme_auth serio_raw
>> ata_generic pata_acpi dm_multipath qemu_fw_cfg [last unloaded: ib_uverbs]
>> CPU: 3 UID: 0 PID: 34077 Comm: udisksd Not tainted 6.11.0-rc7 #319
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>> 1.16.3-2.fc40 04/01/2014
>> RIP: 0010:__mutex_lock+0xcf0/0x1220
>> Code: 08 84 d2 0f 85 c8 04 00 00 8b 15 ef b6 c8 01 85 d2 0f 85 78 f4
>> ff ff 48 c7 c6 20 93 ee af 48 c7 c7 60 91 ee af e8 f0 a7 6d fd <0f> 0b
>> e9 5e f4 ff ff 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1
>> RSP: 0018:ffff88811305f760 EFLAGS: 00010286
>> RAX: 0000000000000000 RBX: ffff88812c652058 RCX: 0000000000000000
>> RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000001
>> RBP: ffff88811305f8b0 R08: 0000000000000001 R09: ffffed1075c36341
>> R10: ffff8883ae1b1a0b R11: 0000000000010498 R12: 0000000000000000
>> R13: 0000000000000000 R14: dffffc0000000000 R15: ffff88812c652058
>> FS: 00007f9713ae4980(0000) GS:ffff8883ae180000(0000)
>> knlGS:0000000000000000
>> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 00007fcd78483c7c CR3: 0000000122c38000 CR4: 00000000000006f0
>> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>> Call Trace:
>> <TASK>
>> ? __warn.cold+0x5b/0x1af
>> ? __mutex_lock+0xcf0/0x1220
>> ? report_bug+0x1ec/0x390
>> ? handle_bug+0x3c/0x80
>> ? exc_invalid_op+0x13/0x40
>> ? asm_exc_invalid_op+0x16/0x20
>> ? __mutex_lock+0xcf0/0x1220
>> ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
>> ? __pfx___mutex_lock+0x10/0x10
>> ? __lock_acquire+0xd6a/0x59e0
>> ? nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
>> nvme_tcp_get_address+0xc2/0x1e0 [nvme_tcp]
>> ? __pfx_nvme_tcp_get_address+0x10/0x10 [nvme_tcp]
>> nvme_sysfs_show_address+0x81/0xc0 [nvme_core]
>> dev_attr_show+0x42/0x80
>> ? __asan_memset+0x1f/0x40
>> sysfs_kf_seq_show+0x1f0/0x370
>> seq_read_iter+0x2cb/0x1130
>> ? rw_verify_area+0x3b1/0x590
>> ? __mutex_lock+0x433/0x1220
>> vfs_read+0x6a6/0xa20
>> ? lockdep_hardirqs_on+0x78/0x100
>> ? __pfx_vfs_read+0x10/0x10
>> ksys_read+0xf7/0x1d0
>> ? __pfx_ksys_read+0x10/0x10
>> ? __x64_sys_openat+0x105/0x1d0
>> do_syscall_64+0x93/0x180
>> ? lockdep_hardirqs_on_prepare+0x16d/0x400
>> ? do_syscall_64+0x9f/0x180
>> ? lockdep_hardirqs_on+0x78/0x100
>> ? do_syscall_64+0x9f/0x180
>> ? __pfx_ksys_read+0x10/0x10
>> ? lockdep_hardirqs_on_prepare+0x16d/0x400
>> ? do_syscall_64+0x9f/0x180
>> ? lockdep_hardirqs_on+0x78/0x100
>> ? do_syscall_64+0x9f/0x180
>> ? lockdep_hardirqs_on_prepare+0x16d/0x400
>> ? do_syscall_64+0x9f/0x180
>> ? lockdep_hardirqs_on+0x78/0x100
>> ? do_syscall_64+0x9f/0x180
>> ? lockdep_hardirqs_on_prepare+0x16d/0x400
>> ? do_syscall_64+0x9f/0x180
>> ? lockdep_hardirqs_on+0x78/0x100
>> ? do_syscall_64+0x9f/0x180
>> ? lockdep_hardirqs_on_prepare+0x16d/0x400
>> ? do_syscall_64+0x9f/0x180
>> ? lockdep_hardirqs_on+0x78/0x100
>> ? do_syscall_64+0x9f/0x180
>> ? do_syscall_64+0x9f/0x180
>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>> RIP: 0033:0x7f9713f55cfa
>> Code: 55 48 89 e5 48 83 ec 20 48 89 55 e8 48 89 75 f0 89 7d f8 e8 e8
>> 74 f8 ff 48 8b 55 e8 48 8b 75 f0 41 89 c0 8b 7d f8 31 c0 0f 05 <48> 3d
>> 00 f0 ff ff 77 2e 44 89 c7 48 89 45 f8 e8 42 75 f8 ff 48 8b
>> RSP: 002b:00007ffd7f512e70 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
>> RAX: ffffffffffffffda RBX: 000055c38f316859 RCX: 00007f9713f55cfa
>> RDX: 0000000000000fff RSI: 00007ffd7f512eb0 RDI: 0000000000000011
>> RBP: 00007ffd7f512e90 R08: 0000000000000000 R09: 00000000ffffffff
>> R10: 0000000000000000 R11: 0000000000000246 R12: 000055c38f317148
>> R13: 0000000000000000 R14: 00007f96f4004f30 R15: 000055c3b6b623c0
>> </TASK>
>>
>> The WARN is observed when the blktests test case nvme/014 is repeated
>> with the tcp transport. It is rare; around 200 repetitions are required
>> to reproduce it in my test environment.
>>
>> To avoid the WARN, serialize the two functions nvme_tcp_get_address()
>> and nvme_tcp_free_queue(). Add the flag NVME_TCP_Q_REFERRED and set it
>> when nvme_tcp_get_address() starts. Have nvme_tcp_free_queue() wait
>> until the flag is cleared, so that it does not run before
>> nvme_tcp_get_address() has finished. Also check the NVME_TCP_Q_ALLOCATED
>> flag when nvme_tcp_get_address() starts. That flag is cleared at the
>> start of nvme_tcp_free_queue(), which ensures that
>> nvme_tcp_get_address() is not executed while nvme_tcp_free_queue() runs.
>>
>> Of note is that nvme_tcp_stop_queue() locks queue->queue_lock in the
>> same manner as nvme_tcp_get_address(). However, nvme_tcp_stop_queue() is
>> always called before nvme_tcp_free_queue(), so it is safe to assume
>> that queue->queue_lock has not yet been destroyed when
>> nvme_tcp_stop_queue() runs. Therefore this fix does not touch
>> nvme_tcp_stop_queue().
>>
>> Fixes: 76d54bf20cdc ("nvme-tcp: don't access released socket during
>> error recovery")
>> Signed-off-by: Shin'ichiro Kawasaki <shinichiro.kawasaki at wdc.com>
>> ---
>> drivers/nvme/host/tcp.c | 16 +++++++++++++++-
>> 1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
>> index a2a47d3ab99f..1457ffda207b 100644
>> --- a/drivers/nvme/host/tcp.c
>> +++ b/drivers/nvme/host/tcp.c
>> @@ -127,6 +127,7 @@ enum nvme_tcp_queue_flags {
>> NVME_TCP_Q_ALLOCATED = 0,
>> NVME_TCP_Q_LIVE = 1,
>> NVME_TCP_Q_POLLING = 2,
>> + NVME_TCP_Q_REFERRED = 3,
>> };
>> enum nvme_tcp_recv_state {
>> @@ -1365,6 +1366,9 @@ static void nvme_tcp_free_queue(struct nvme_ctrl
>> *nctrl, int qid)
>> if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
>> return;
>> + wait_on_bit_io(&queue->flags, NVME_TCP_Q_REFERRED,
>> + TASK_UNINTERRUPTIBLE);
>> +
>> if (queue->hdr_digest || queue->data_digest)
>> nvme_tcp_free_crypto(queue);
>> @@ -2613,7 +2617,14 @@ static int nvme_tcp_get_address(struct
>> nvme_ctrl *ctrl, char *buf, int size)
>> {
>> struct nvme_tcp_queue *queue = &to_tcp_ctrl(ctrl)->queues[0];
>> struct sockaddr_storage src_addr;
>> - int ret, len;
>> + int ret, len = 0;
>> +
>> + /* Serialize this function and nvme_tcp_free_queue() */
>> + if (test_and_set_bit(NVME_TCP_Q_REFERRED, &queue->flags))
>> + return 0;
>> +
>> + if (!test_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
>> + goto out;
>> len = nvmf_get_address(ctrl, buf, size);
>> @@ -2631,6 +2642,9 @@ static int nvme_tcp_get_address(struct nvme_ctrl
>> *ctrl, char *buf, int size)
>> done:
>> mutex_unlock(&queue->queue_lock);
>> +out:
>> + clear_and_wake_up_bit(NVME_TCP_Q_REFERRED, &queue->flags);
>> +
>> return len;
>> }
>
> This is really not needed.. You can already serialize with:
> --
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index a2a47d3ab99f..91c19d06cae9 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -1365,6 +1365,10 @@ static void nvme_tcp_free_queue(struct nvme_ctrl
> *nctrl, int qid)
> if (!test_and_clear_bit(NVME_TCP_Q_ALLOCATED, &queue->flags))
> return;
>
> + /* make sure any concurrent .get_address completes */
> + mutex_lock(&queue->queue_lock);
> + mutex_unlock(&queue->queue_lock);
> +
> if (queue->hdr_digest || queue->data_digest)
> nvme_tcp_free_crypto(queue);
Wouldn't it be sufficient to check for NVME_TCP_Q_LIVE outside of the
mutex?
diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 76c8e01d8f08..0db3dde3a478 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -2688,10 +2688,11 @@ static int nvme_tcp_get_address(struct nvme_ctrl
*ctrl, char *buf, int size)
len = nvmf_get_address(ctrl, buf, size);
+ if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
+ return len;
+
mutex_lock(&queue->queue_lock);
- if (!test_bit(NVME_TCP_Q_LIVE, &queue->flags))
- goto done;
ret = kernel_getsockname(queue->sock, (struct sockaddr
*)&src_addr);
if (ret > 0) {
if (len > 0)
Q_LIVE is cleared _way_ earlier than queue_lock is destroyed, so by the
time nvme_tcp_free_queue() is called the bit is already cleared.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich