[LSF/MM/BPF TOPIC] Block-layer device resets

Tue Feb 3 04:19:12 PST 2026

On 2/3/26 4:34 AM, Hannes Reinecke wrote:
> On 2/2/26 02:46, Damien Le Moal wrote:
>> On 2/2/26 02:06, Hannes Reinecke wrote:
>>> Hi all,
>>>
>>> We are currently working on implementing cross-controller resets for
>>> NVMe, which requires to send a command to the target which then should
>>> terminate all commands on a given controller.
>>> While we could easily terminate the controller, the specification
>>> also requires us to terminate all outstanding commands.
>>> Which then recurses into my all-time favourite topic on how to
>>> abort outstanding commands from the fs/bio layer.
>>>
>>> However, here we don't have to dissect/match to individual commands,
>>> but rather have to abort everything, which seems rather easier.
>>>
>>> So I would like to fathom whether such a thing is feasible/reasonable
>>> (I think so, obviously, and can think of several other use-cases, too,
>>> qemu springs to mind here ...) and discuss possible implementations
>>> (set 'req->deadline' to zero for all pending commands?).
>>> Or maybe we can do such a thing already and I'm just not aware of it...
>>
>> Hmmm... Command timeouts ? E.g. if a controller is slow to respond (send
>> completions), the block layer timeout timer may trigger, which will call into
>> the low level device driver to force a reset. But before the reset actually
>> happens, completions may actually come back, and we do handle that race
>> correctly, well at least for scsi/ata.
>>
>> Your scenario sounds very similar to this: once you reset the controller,
>> whatever was pending will be silent and can be aborted or retried. So it does
>> sound like that should not be too difficult, no ? Generalize the timeout
>> processing or do something similar ?
>>
> The good thing is we don't even need to generalize anything. It should
> be sufficient to walk the inflight requests and set
> 'rq->deadline' to 'jiffies'. General idea here is to just _initiate_
> command termination with this, one then still has to wait for all
> commands to complete, but at least now there is a reasonable chance
> that this will happen quickly.
Well, if a request terminated this way happens to be an admin command, it
may trigger a controller reset. The issue with this approach is that we're
artificially inducing a timeout (instead of actually issuing an abort), and
the NVMe driver's timeout handler treats an admin command timeout as fatal
and resets the controller.

IMO, conceptually, the goal here is not to force a timeout but to explicitly
abort all outstanding commands, as required by the NVMe cross-controller reset
semantics. Combining the two mechanisms (timeout and abort) makes the intent
unclear, and coupling abort semantics to timeout handling makes it fragile.

So from this perspective, it would be cleaner to have an explicit blk-mq callback
for aborting all outstanding requests. The block layer would invoke this callback,
and each driver could implement the abort logic according to its own requirements
and specification constraints. For NVMe, this would allow us to abort in-flight
commands without overloading the timeout path or triggering unintended controller
resets. Any thoughts?
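
To make the contrast concrete, here is a rough userspace model of the two
approaches. This is not kernel code: every type and function name below
(model_queue, model_ops, abort_all, and so on) is hypothetical and exists only
for illustration, and the "admin timeout means reset" rule is a simplification
of the behaviour described above.

```c
/* Userspace model only; none of these names are existing kernel APIs. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum rq_state { RQ_IDLE, RQ_INFLIGHT, RQ_ABORTED };

struct model_rq {
	enum rq_state state;
	bool is_admin;			/* admin command vs. I/O command */
};

struct model_queue;

struct model_ops {
	/* timeout-style hook: invoked once a request's deadline has expired */
	void (*timeout)(struct model_queue *q, struct model_rq *rq);
	/* hypothetical hook: explicitly abort everything that is in flight */
	void (*abort_all)(struct model_queue *q);
};

struct model_queue {
	const struct model_ops *ops;
	struct model_rq *rqs;
	size_t nr_rqs;
	unsigned int ctrl_resets;	/* controller resets the driver triggered */
};

/* Driver timeout handler: an admin timeout is treated as fatal (assumed). */
static void drv_timeout(struct model_queue *q, struct model_rq *rq)
{
	if (rq->is_admin)
		q->ctrl_resets++;
	rq->state = RQ_ABORTED;
}

/* Driver abort handler: terminate in-flight requests, no reset involved. */
static void drv_abort_all(struct model_queue *q)
{
	for (size_t i = 0; i < q->nr_rqs; i++)
		if (q->rqs[i].state == RQ_INFLIGHT)
			q->rqs[i].state = RQ_ABORTED;
}

static const struct model_ops drv_ops = {
	.timeout	= drv_timeout,
	.abort_all	= drv_abort_all,
};

/* Approach A: expire every deadline, so each request hits the timeout path. */
static void model_expire_all(struct model_queue *q)
{
	assert(q && q->ops && q->ops->timeout);
	for (size_t i = 0; i < q->nr_rqs; i++)
		if (q->rqs[i].state == RQ_INFLIGHT)
			q->ops->timeout(q, &q->rqs[i]);
}

/* Approach B: the core calls the explicit abort hook exactly once. */
static void model_abort_all(struct model_queue *q)
{
	assert(q && q->ops && q->ops->abort_all);
	q->ops->abort_all(q);
}
```

In this model, calling model_expire_all() on a queue that holds one in-flight
admin command bumps ctrl_resets as a side effect, while model_abort_all()
leaves it untouched. That side effect is the coupling I would like to avoid.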

Thanks,
--Nilay