[PATCH v1 1/3] driver core: Support asynchronous driver shutdown

Wed Mar 30 07:12:18 PDT 2022

> From: Linux-nvme <linux-nvme-bounces at lists.infradead.org> On Behalf Of
> Oliver O'Halloran
> Sent: Monday, March 28, 2022 8:20 PM
> To: Tanjore Suresh
> Cc: Greg Kroah-Hartman; Rafael J . Wysocki; Christoph Hellwig; Sagi Grimberg;
> Bjorn Helgaas; Linux Kernel Mailing List; linux-nvme at lists.infradead.org; linux-
> pci
> Subject: Re: [PATCH v1 1/3] driver core: Support asynchronous driver shutdown
> 
> 
> On Tue, Mar 29, 2022 at 10:35 AM Tanjore Suresh <tansuresh at google.com>
> wrote:
> >
> > This changes the bus driver interface with additional entry points to
> > enable devices to implement asynchronous shutdown. The existing
> > synchronous interface to shutdown is unmodified and retained for
> > backward compatibility.
> >
> > This changes the common device shutdown code to enable devices to
> > participate in asynchronous shutdown implementation.
> 
> nice to see someone looking at improving the shutdown path

Agreed!

I know this patch is mainly for PCI devices, however, NVMe over Fabrics 
devices can suffer even longer shutdowns. Last September, I reported 
that shutting down an NVMe-oF TCP connection while the network is down 
will result in a 1-minute deadlock. That's because the driver tries to perform 
a proper shutdown by sending commands to the remote target and the 
timeout for unanswered commands is 1-minute. If one needs to shut down 
several NVMe-oF connections, each connection will be shut down sequentially 
taking each 1 minute. Try running "nvme disconnect-all" while the network 
is down and you'll see what I mean. Of course, the KATO is supposed to 
detect when connectivity is lost, but if you have a long KATO (e.g. 2 minutes)
you will most likely hit this condition.

Here's the patch I proposed in September, which shortens the timeout to 
5 sec on a disconnect.

http://lists.infradead.org/pipermail/linux-nvme/2021-September/027867.html

Regards,
Martin Belanger

> 
> > Signed-off-by: Tanjore Suresh <tansuresh at google.com>
> > ---
> >  drivers/base/core.c        | 39 +++++++++++++++++++++++++++++++++++++-
> >  include/linux/device/bus.h | 10 ++++++++++
> >  2 files changed, 48 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/base/core.c b/drivers/base/core.c index
> > 3d6430eb0c6a..359e7067e8b8 100644
> > --- a/drivers/base/core.c
> > +++ b/drivers/base/core.c
> > @@ -4479,6 +4479,7 @@ EXPORT_SYMBOL_GPL(device_change_owner);
> > *snip*
> 
> This all seems a bit dangerous and I'm wondering what systems you've tested
> these changes with. I had a look at implementing something similar a few years
> ago and one case that always concerned me was embedded systems where the
> PCIe root complex also has a driver bound.
> Say you've got the following PCIe topology:
> 
> 00:00.0 - root port
> 01:00.0 - nvme drive
> 
> With the current implementation of device_shutdown() we can guarantee that
> the child device (the nvme) is shut down before we start trying to shut down the
> parent device (the root complex) so there's no possibility of deadlocks and
> other dependency headaches. With this implementation of async shutdown we
> lose that guarantee and I'm not sure what the consequences are. Personally I
> was never able to convince myself it was safe, but maybe you're braver than I
> am :)
> 
> That all said, there's probably only a few kinds of device that will really want to
> implement async shutdown support so maybe you can restrict it to leaf devices
> and flip the ordering around to something
> like:
> 
> for_each_device(dev) {
>    if (can_async(dev) && has_no_children(dev))
>       start_async_shutdown(dev)
> }
> wait_for_all_async_shutdowns_to_finish()
> 
> // tear down the remaining system devices synchronously
> for_each_device(dev)
>    do_sync_shutdown(dev)
> 
> >  /*
> > diff --git a/include/linux/device/bus.h b/include/linux/device/bus.h
> > index a039ab809753..e261819601e9 100644
> > --- a/include/linux/device/bus.h
> > +++ b/include/linux/device/bus.h
> > @@ -93,6 +101,8 @@ struct bus_type {
> >         void (*sync_state)(struct device *dev);
> >         void (*remove)(struct device *dev);
> >         void (*shutdown)(struct device *dev);
> > +       void (*shutdown_pre)(struct device *dev);
> > +       void (*shutdown_post)(struct device *dev);
> 
> Call them shutdown_async_start() / shutdown_async_end() or something IMO.
> These names are not at all helpful and they're easy to mix up their role with
> the class based shutdown_pre / _post