[LSF/MM/BPF TOPIC] Block-layer device resets

Hannes Reinecke hare at suse.de
Tue Feb 3 17:43:20 PST 2026


On 2/3/26 13:19, Nilay Shroff wrote:
> 
> 
> On 2/3/26 4:34 AM, Hannes Reinecke wrote:
>> On 2/2/26 02:46, Damien Le Moal wrote:
>>> On 2/2/26 02:06, Hannes Reinecke wrote:
>>>> Hi all,
>>>>
>>>> We are currently working on implementing cross-controller resets for
>>>> NVMe, which requires sending a command to the target, which should then
>>>> terminate all commands on a given controller.
>>>> While we could easily terminate the controller, the specification
>>>> also requires us to terminate all outstanding commands.
>>>> Which then recurses into my all-time favourite topic on how to
>>>> abort outstanding commands from the fs/bio layer.
>>>>
>>>> However, here we don't have to dissect/match to individual commands,
>>>> but rather have to abort everything, which seems rather easier.
>>>>
>>>> So I would like to fathom whether such a thing is feasible/reasonable
>>>> (I think so, obviously, and can think of several other use-cases, too,
>>>> qemu springs to mind here ...) and discuss possible implementations
>>>> (set 'req->deadline' to zero for all pending commands?).
>>>> Or maybe we can do such a thing already and I'm just not aware of it...
>>>
>>> Hmmm... Command timeouts ? E.g. if a controller is slow to respond (send
>>> completions), the block layer timeout timer may trigger, which will call into
>>> the low level device driver to force a reset. But before the reset actually
>>> happens, completions may actually come back, and we do handle that race
>>> correctly, well at least for scsi/ata.
>>>
>>> Your scenario sounds very similar to this: once you reset the controller,
>>> whatever was pending will be silent and can be aborted or retried. So it does
>>> sound like that should not be too difficult, no ? Generalize the timeout
>>> processing or do something similar ?
>>>
>> The good thing is we don't even need to generalize anything. It should
>> be sufficient to walk the inflight requests and set
>> 'rq->deadline' to 'jiffies'. General idea here is to just _initiate_
>> command termination with this, one then still has to wait for all
>> commands to complete, but at least now there is a reasonable chance
>> that this will happen quickly.
> Well, if the request which is being terminated this way happens to be an admin
> command then it may cause a controller reset. The issue with this approach
> is that we're artificially inducing the timeout (instead of actually issuing
> an abort), and the NVMe driver timeout handler assumes an admin command timeout
> is fatal and resets the controller.
>
Well, yes, and no.
Command timeouts _are_ artificial; none of the command specifications
mention any timeout, so anything we do here is actually guesswork.
So the controller _must_ be able to abort commands.
I am fully aware that the NVMe driver does _not_ abort commands on 
timeout (but rather resets the controller), but that is an 
implementation detail of NVMe.
(And arguably something which we should fix. Time for another LSF
session, maybe).
Main point is that the block layer already has driver callbacks for
timeout, so one does not need to implement yet another callback.
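
A minimal userspace sketch of that idea, for illustration only. It is
not kernel code: the 'model_*' types and functions are hypothetical
stand-ins, the plain 'now >= deadline' comparison ignores jiffies
wraparound, and the real timeout handling does considerably more.
It only shows the two steps: pull every inflight deadline to "now",
then let the ordinary timeout scan hand each request to the driver's
existing timeout callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace model only: these mimic, but are not, the kernel structures. */
struct model_request {
    unsigned long deadline; /* tick at which the request times out */
    bool inflight;          /* issued to the device, not yet completed */
    bool timed_out;         /* set once the driver timeout callback ran */
};

/* Step 1: initiate termination by moving every inflight deadline to "now". */
static size_t model_expire_inflight(struct model_request *rqs, size_t n,
                                    unsigned long now)
{
    size_t expired = 0;

    assert(rqs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (!rqs[i].inflight)
            continue;       /* idle requests are left untouched */
        rqs[i].deadline = now;
        expired++;
    }
    return expired;
}

/* Step 2: the ordinary timeout scan calls the driver's timeout callback
 * for every inflight request whose deadline has been reached. */
static size_t model_timeout_scan(struct model_request *rqs, size_t n,
                                 unsigned long now,
                                 void (*timeout_fn)(struct model_request *))
{
    size_t fired = 0;

    assert(rqs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (rqs[i].inflight && !rqs[i].timed_out && now >= rqs[i].deadline) {
            timeout_fn(&rqs[i]);
            fired++;
        }
    }
    return fired;
}

/* Stand-in for a driver timeout handler; a real one would abort or reset. */
static void model_driver_timeout(struct model_request *rq)
{
    rq->timed_out = true;
}
```

Note that step 1 only _initiates_ termination; the caller still has to
wait for the commands to actually complete.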

> IMO, conceptually, the goal here is not to force a timeout but to explicitly
> abort all outstanding commands as required by the NVMe cross-controller reset
> semantics. So combining these two (timeout and abort) mechanisms makes the intent
> unclear and coupling abort semantics to timeout handling makes it fragile.
> 
> So from this perspective, it would be cleaner to have an explicit blk-mq callback
> for aborting all outstanding requests. The block layer would invoke this callback,
> and each driver could implement the abort logic according to its own requirements
> and specification constraints. For NVMe, this would allow us to abort in-flight
> commands without overloading the timeout path or triggering unintended controller
> resets. Any thoughts?
> 
Which is basically the direction I had been thinking about, having a
generic (block-layer) function to invoke device resets.
_How_ this function will be implemented is open for debate.
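
Purely to make the two options concrete, here is a small userspace
sketch of one possible shape. Everything in it is an assumption:
'sketch_ops', its 'abort_all' member and 'sketch_reset_device()' are
hypothetical names, and no such blk-mq callback exists today. The
generic entry point uses an explicit driver callback when one is
provided and otherwise falls back to expiring the deadlines so that
the existing timeout path takes over.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_queue;

/* Hypothetical driver operations; abort_all is NOT an existing blk-mq hook. */
struct sketch_ops {
    /* Optional: abort every outstanding command, return how many. */
    size_t (*abort_all)(struct sketch_queue *q);
};

struct sketch_queue {
    const struct sketch_ops *ops;
    size_t inflight;        /* commands currently outstanding */
    bool deadlines_expired; /* fallback taken: deadlines forced to "now" */
};

enum sketch_path { SKETCH_ABORT_CALLBACK, SKETCH_TIMEOUT_FALLBACK };

/* Generic entry point: prefer the explicit callback, else reuse timeouts. */
static enum sketch_path sketch_reset_device(struct sketch_queue *q,
                                            size_t *aborted)
{
    assert(q != NULL && aborted != NULL);

    *aborted = 0;
    if (q->ops && q->ops->abort_all) {
        *aborted = q->ops->abort_all(q);
        return SKETCH_ABORT_CALLBACK;
    }
    /* No callback: mark deadlines expired and let the timeout path run. */
    q->deadlines_expired = true;
    return SKETCH_TIMEOUT_FALLBACK;
}

/* Example driver callback: drops all outstanding commands in the model. */
static size_t sketch_driver_abort_all(struct sketch_queue *q)
{
    size_t n = q->inflight;

    q->inflight = 0;
    return n;
}
```

With a fallback like this, drivers whose timeout handler is too
heavy-handed (e.g. resetting the controller on an admin command
timeout) could opt in to the explicit callback, while others would
need no changes.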

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich


