[PATCH RFC 5/5] block, nvme: add failed_bio callback for multipath bio failover

Wed May 20 08:07:49 PDT 2026

On Wed, May 20, 2026 at 09:27:46AM +0200, Christoph Hellwig wrote:
> On Tue, May 19, 2026 at 10:23:26AM -0700, Keith Busch wrote:
> > From: Keith Busch <kbusch at kernel.org>
> >
> > The nvme driver has long utilized a zero capacity to indicate the path
> > isn't reachable, which creates a race condition with IO dispatch when
> > paths are being detached on a live system: when the block layer rejects
> > a bio early due to a capacity check failure, drivers with multipath
> > support using the original bio have no interception point to redirect
> > the bio to another path.
>
> Trying to reverse-engineer - the problem is that the block-layer
> code catches being beyond the capacity and directly completes the bio,
> right?

Yes, and in the case being addressed here, the "zero capacity" setting
is path specific, hence the driver wants to attempt a failover. I
imagine general capacity violations are not path specific though, so
this is kind of a weird case.

> IMHO the right fix is to get rid of the capacity hacks, and have a flag
> we can catch in the nvme driver and complete through the mechanisms.

Sure. I think we can at least remove the unconditional set_capacity(0)
from the nvme driver because the block layer generically does that if
we've done a surprise removal on the namespace's disk.

As to removing the hack from the block layer too, it's been a while, but
I recall it was the easiest way to get forward progress for some
degenerate case of continuously attempting to sync dirty pages. But
maybe we don't need it anymore: it looks like there are now more checks
of the GD_DEAD flag that might make that whole capacity trick
unnecessary. I'll try to whip up a quick test to verify.
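
To make the distinction above concrete, here is a rough userspace model
of the two behaviours. It is only a sketch, not kernel code: every name
in it (toy_path, submit_capacity_check, submit_with_failover) is made up
for illustration, and the "dead" field merely stands in for whatever
per-path flag we would end up using.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only; none of these names are kernel APIs. */
struct toy_path {
	unsigned long long capacity;	/* sectors; 0 models the "unreachable" hack */
	bool dead;			/* hypothetical per-path "gone" flag */
};

enum toy_result { TOY_OK, TOY_EIO };

/*
 * Capacity variant: the generic layer sees an out-of-range sector and
 * completes the I/O with an error itself, so a multipath driver never
 * gets a chance to pick another path.
 */
enum toy_result submit_capacity_check(const struct toy_path *p,
				      unsigned long long sector)
{
	assert(p != NULL);
	if (sector >= p->capacity)
		return TOY_EIO;
	return TOY_OK;
}

/*
 * Flag variant: the path keeps its real capacity and the driver skips
 * paths marked dead. A genuine capacity violation is still an error,
 * because that is not path specific.
 */
enum toy_result submit_with_failover(const struct toy_path *paths, size_t n,
				     unsigned long long sector)
{
	assert(paths != NULL);
	for (size_t i = 0; i < n; i++) {
		if (paths[i].dead)
			continue;	/* fail over to the next path */
		if (sector >= paths[i].capacity)
			return TOY_EIO;	/* real out-of-range I/O */
		return TOY_OK;
	}
	return TOY_EIO;			/* no usable path left */
}
```

In this model, a path whose capacity was zeroed fails sector 8 in
submit_capacity_check() even though a healthy sibling path exists, while
submit_with_failover() on a { dead, healthy } pair succeeds for the same
sector and still rejects a sector beyond the real capacity.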