[PATCH v2] nvme: tcp: avoid race between queue_lock lock and destroy

Hannes Reinecke hare at suse.de
Fri Oct 4 00:41:26 PDT 2024


On 10/3/24 22:13, Keith Busch wrote:
> On Wed, Oct 02, 2024 at 01:51:41PM +0900, Shin'ichiro Kawasaki wrote:
>> From: Hannes Reinecke <hare at suse.de>
>>
>> Commit 76d54bf20cdc ("nvme-tcp: don't access released socket during
>> error recovery") added a mutex_lock() call for the queue->queue_lock
>> in nvme_tcp_get_address(). However, the mutex_lock() races with
>> mutex_destroy() in nvme_tcp_free_queue(), and causes the WARN below.
> 
> <snip>
> 
>> The WARN is observed when the blktests test case nvme/014 is repeated
>> with tcp transport. It is rare: in some test environments about 200
>> repetitions are required to reproduce it.
>>
>> To avoid the WARN, check the NVME_TCP_Q_LIVE flag before locking
>> queue->queue_lock. The flag is cleared a long time before the lock
>> gets destroyed.
> 
> I've applied this to nvme-6.12, but the existence of this queue_lock
> seems strange. It looks like tcp is relying on blk-mq's timeout to
> individually complete requests after the queue is stopped, but I feel
> like there should be a way to complete everything in a single batch. We
> have the generic nvme_cancel_request() for this reason, but fabrics has
> its own way of doing it one at a time?

nvme_cancel_request() just terminates the command internally, so it's of
limited use for fabrics, where we need to synchronize with the controller
about when to retry commands.
Using the timeout handler is a bit unfortunate: while we start error
recovery from the timeout handler (which then proceeds to abort all
commands), other timeouts might trigger in the meantime, leading to
interesting race conditions.
There is a TPAR pending for implementing a third-party controller reset
command which will clear that up, so I've held off trying to work on
that until the TPAR is ratified.
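
For reference, the ordering described in the quoted patch (test the
NVME_TCP_Q_LIVE flag first, and only then take queue->queue_lock) can be
sketched in userspace C roughly as below. This is not the kernel code:
struct demo_queue and the demo_*() helpers are hypothetical stand-ins, a
pthread mutex plays the role of queue_lock, and an atomic bool plays the
role of the LIVE bit. The sketch only shows the order of operations; like
the patch, it relies on the flag being cleared well before the lock is
destroyed, and does not by itself prove the absence of a race.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical userspace stand-in for the kernel's per-queue state. */
struct demo_queue {
	atomic_bool live;	/* plays the role of NVME_TCP_Q_LIVE */
	pthread_mutex_t lock;	/* plays the role of queue->queue_lock */
	char addr[32];		/* plays the role of the socket address */
};

static void demo_queue_init(struct demo_queue *q, const char *addr)
{
	pthread_mutex_init(&q->lock, NULL);
	strncpy(q->addr, addr, sizeof(q->addr) - 1);
	q->addr[sizeof(q->addr) - 1] = '\0';
	atomic_init(&q->live, true);
}

/* Clear the live flag under the lock, so no reader is mid-access. */
static void demo_queue_stop(struct demo_queue *q)
{
	pthread_mutex_lock(&q->lock);
	atomic_store(&q->live, false);
	pthread_mutex_unlock(&q->lock);
}

/* Destroy the lock; the queue must have been stopped beforehand. */
static void demo_queue_free(struct demo_queue *q)
{
	assert(!atomic_load(&q->live));
	pthread_mutex_destroy(&q->lock);
}

/*
 * Copy the address into buf and return its length. If the queue is not
 * live, return 0 without ever touching the lock.
 */
static size_t demo_get_address(struct demo_queue *q, char *buf, size_t size)
{
	size_t len;

	if (size == 0)
		return 0;
	buf[0] = '\0';

	/* Check the flag first: a stopped queue's lock may be going away. */
	if (!atomic_load(&q->live))
		return 0;

	pthread_mutex_lock(&q->lock);
	len = strlen(q->addr);
	if (len >= size)
		len = size - 1;
	memcpy(buf, q->addr, len);
	buf[len] = '\0';
	pthread_mutex_unlock(&q->lock);

	return len;
}
```

A caller would run demo_queue_init(), call demo_get_address() as often as
needed, and tear down with demo_queue_stop() followed by
demo_queue_free(); any demo_get_address() call made after the stop returns
early instead of locking.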

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich