[PATCH v9 0/4] shut down devices asynchronously
Laurence Oberman
loberman at redhat.com
Fri Oct 11 08:52:19 PDT 2024
On Fri, 2024-10-11 at 04:22 +0000, Michael Kelley wrote:
> From: Stuart Hayes <stuart.w.hayes at gmail.com> Sent: Wednesday, October 9, 2024 10:58 AM
> >
> > This adds the ability for the kernel to shut down devices
> > asynchronously.
> >
> > Only devices with drivers that enable it are shut down asynchronously.
> >
> > This can dramatically reduce system shutdown/reboot time on systems
> > that have multiple devices that take many seconds to shut down (like
> > certain NVMe drives). On one system tested, the shutdown time went
> > from 11 minutes without this patch to 55 seconds with the patch.
>
> I've been testing this series against a 6.11.0 kernel in an Azure VM,
> which is running as a guest on the Microsoft Hyper-V hypervisor. The
> VM has 16 vCPUs, 128 Gbytes of memory, and two physical NVMe
> controllers that are mapped into the VM so that it can access them
> directly.
>
> But I wanted to confirm that the two NVMe controllers are being shut
> down in parallel. So before doing a shutdown, I set
> /sys/module/kernel/parameters/initcall_debug to "Y" so the shutdown
> of each device is recorded in the console output. Here's the full set
> of device shutdown messages:
>
> 172.609825 platform intel_rapl_msr.0: shutdown
> 172.611940 mlx5_ib.rdma mlx5_core.rdma.0: shutdown
> 172.613931 mlx5_core.eth mlx5_core.eth.0: shutdown
> 172.618116 nvme c2b7:00:00.0: shutdown
> 172.618262 nvme 132e:00:00.0: shutdown
> 172.618349 mlx5_core 1610:00:02.0: shutdown
> 172.618359 mlx5_core 1610:00:02.0: Shutdown was called
> 172.782768 hv_pci ba152dae-1610-4c67-b925-81ac4902e4ce: shutdown
> 172.786405 sd 0:0:0:1: shutdown
> 172.788788 sd 0:0:0:0: shutdown
> 172.789949 sd 0:0:0:0: [sda] Synchronizing SCSI cache
> 172.794209 atkbd serio0: shutdown
> 172.795974 hv_utils 242ff919-07db-4180-9c2e-b86cb68c8c55: shutdown
> 172.800432 hv_pci 0cdfe983-132e-434b-8025-fc9ab43c0fc5: shutdown
> 172.802812 hv_pci 2394da4f-c2b7-43bd-b72f-d3482ef6850a: shutdown
> 172.805145 hv_netvsc 0022487e-1043-0022-487e-10430022487e: shutdown
> 172.807575 hyperv_fb 5620e0c7-8062-4dce-aeb7-520c7ef76171: shutdown
> 172.810026 hyperv_keyboard d34b2567-b9b6-42b9-8778-0a4ec0b955bf: shutdown
> 172.812522 hid_hyperv 58f75a6d-d949-4320-99e1-a2a2576d581c: shutdown
> 172.814982 hv_balloon 1eccfd72-4b41-45ef-b73a-4a6e44c12924: shutdown
> 172.817376 vmbus c4e5e7d1-d748-4afc-979d-683167910a55: shutdown
> 172.819789 hv_storvsc f8b3781b-1e82-4818-a1c3-63d806ec15bb: shutdown
> 172.822324 hv_storvsc f8b3781a-1e82-4818-a1c3-63d806ec15bb: shutdown
> 172.824813 hv_utils 2dd1ce17-079e-403c-b352-a1921ee207ee: shutdown
> 172.827199 hv_utils b6650ff7-33bc-4840-8048-e0676786f393: shutdown
> 172.829653 hv_utils fd149e91-82e0-4a7d-afa6-2a4166cbd7c0: shutdown
> 172.836408 platform eisa.0: shutdown
> 172.838558 alarmtimer alarmtimer.0.auto: shutdown
> 172.842461 platform Fixed MDIO bus.0: shutdown
> 172.864709 kgdboc kgdboc: shutdown
> 172.878009 serial8250 serial8250: shutdown
> 172.889725 platform pcspkr: shutdown
> 172.904386 rtc_cmos 00:02: shutdown
> 172.906217 serial 00:01: shutdown
> 172.907799 serial 00:00: shutdown
> 172.910427 platform efivars.0: shutdown
> 172.913341 platform rtc-efi.0: shutdown
> 172.915470 vmgenid HYPER_V_GEN_COUNTER_V1:00: shutdown
> 172.917479 vmbus VMBUS:00: shutdown
> 172.919012 platform PNP0003:00: shutdown
> 172.926707 reg-dummy reg-dummy: shutdown
> 172.961360 ACPI: PM: Preparing to enter system sleep state S5
>
> You see the Mellanox CX-5 NIC, the two NVMe devices, various Hyper-V
> virtual devices, and platform devices being shut down. Everything
> seems to work properly, so that's good. The two NVMe devices are shut
> down very close in time, so they are probably being done in parallel.
>
> As a comparison, I did the same thing with an unmodified 6.11.0
> kernel. Indeed, the NVMe device shutdowns are significantly further
> apart in time (110 milliseconds). That's not noticeably slow like the
> NVMe devices you were dealing with, but doing them in parallel helps
> a little bit.
>
> But here's the kicker: The overall process of shutting down the
> devices took *longer* with the patch set than without. Here's the
> same output from a 6.11.0 kernel without the patch set:
>
> 745.455493 platform intel_rapl_msr.0: shutdown
> 745.456999 mlx5_ib.rdma mlx5_core.rdma.0: shutdown
> 745.458557 mlx5_core.eth mlx5_core.eth.0: shutdown
> 745.460166 mlx5_core 1610:00:02.0: shutdown
> 745.461570 mlx5_core 1610:00:02.0: Shutdown was called
> 745.466053 nvme 132e:00:00.0: shutdown
> 745.579284 nvme c2b7:00:00.0: shutdown
> 745.718739 hv_pci ba152dae-1610-4c67-b925-81ac4902e4ce: shutdown
> 745.721114 sd 0:0:0:1: shutdown
> 745.722254 sd 0:0:0:0: shutdown
> 745.723357 sd 0:0:0:0: [sda] Synchronizing SCSI cache
> 745.725259 atkbd serio0: shutdown
> 745.726405 hv_utils 242ff919-07db-4180-9c2e-b86cb68c8c55: shutdown
> 745.728375 hv_pci 0cdfe983-132e-434b-8025-fc9ab43c0fc5: shutdown
> 745.730347 hv_pci 2394da4f-c2b7-43bd-b72f-d3482ef6850a: shutdown
> 745.732281 hv_netvsc 0022487e-1043-0022-487e-10430022487e: shutdown
> 745.734318 hyperv_fb 5620e0c7-8062-4dce-aeb7-520c7ef76171: shutdown
> 745.736488 hyperv_keyboard d34b2567-b9b6-42b9-8778-0a4ec0b955bf: shutdown
> 745.738628 hid_hyperv 58f75a6d-d949-4320-99e1-a2a2576d581c: shutdown
> 745.740770 hv_balloon 1eccfd72-4b41-45ef-b73a-4a6e44c12924: shutdown
> 745.742835 vmbus c4e5e7d1-d748-4afc-979d-683167910a55: shutdown
> 745.744765 hv_storvsc f8b3781b-1e82-4818-a1c3-63d806ec15bb: shutdown
> 745.746861 hv_storvsc f8b3781a-1e82-4818-a1c3-63d806ec15bb: shutdown
> 745.748907 hv_utils 2dd1ce17-079e-403c-b352-a1921ee207ee: shutdown
> 745.750948 hv_utils b6650ff7-33bc-4840-8048-e0676786f393: shutdown
> 745.753012 hv_utils fd149e91-82e0-4a7d-afa6-2a4166cbd7c0: shutdown
> 745.755000 platform eisa.0: shutdown
> 745.756266 alarmtimer alarmtimer.0.auto: shutdown
> 745.757868 platform Fixed MDIO bus.0: shutdown
> 745.759447 kgdboc kgdboc: shutdown
> 745.760679 serial8250 serial8250: shutdown
> 745.762110 platform pcspkr: shutdown
> 745.763387 rtc_cmos 00:02: shutdown
> 745.764726 serial 00:01: shutdown
> 745.765898 serial 00:00: shutdown
> 745.767036 platform efivars.0: shutdown
> 745.768783 platform rtc-efi.0: shutdown
> 745.770240 vmgenid HYPER_V_GEN_COUNTER_V1:00: shutdown
> 745.771949 vmbus VMBUS:00: shutdown
> 745.773197 platform PNP0003:00: shutdown
> 745.774540 reg-dummy reg-dummy: shutdown
> 745.775964 ACPI: PM: Preparing to enter system sleep state S5
>
> There's some modest variability in the individual steps, but the
> 110 ms saved on the NVMe devices seems to be given back on some other
> devices. I did the comparison twice with similar results. (I have the
> full data set with comparisons in an Excel spreadsheet.)
>
> Any thoughts on what might be causing this? I haven't gone into the
> details of your algorithms for parallelizing, but is there any extra
> overhead that could be adding to the time? Or maybe this is something
> unique to Hyper-V guests. The overall difference is only a few tens
> of milliseconds, so not that big of a deal. But maybe it's an
> indicator that something unexpected is happening that we should
> understand.
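
Not an answer, but one place to look for those few tens of milliseconds: per
the changelog, the series is built on the kernel async machinery with
cookie-based ordering, so an asynchronous shutdown adds an async_schedule()
plus a cookie synchronization on top of the ->shutdown() call itself, and that
fixed per-device cost would show up mostly on devices whose shutdown was
already nearly free. Roughly this pattern (an illustration only, not the v9
code; the shutdown_item wrapper and helper names below are invented for the
sketch):

#include <linux/async.h>
#include <linux/device.h>

struct shutdown_item {
	struct device	*dev;
	async_cookie_t	cookie;		/* cookie of this device's own shutdown */
	async_cookie_t	shutdown_after;	/* highest cookie among its children/consumers */
};

static void async_shutdown_one(void *data, async_cookie_t cookie)
{
	struct shutdown_item *item = data;

	/* Children/consumers were queued earlier and so hold lower cookies;
	 * wait for them before running this device's ->shutdown(). */
	async_synchronize_cookie(item->shutdown_after + 1);

	if (item->dev->driver && item->dev->driver->shutdown)
		item->dev->driver->shutdown(item->dev);
}

static void schedule_one_shutdown(struct shutdown_item *item)
{
	/* One async work item and one cookie wait per device: a small fixed
	 * cost even when the device's own ->shutdown() is already fast. */
	item->cookie = async_schedule(async_shutdown_one, item);
}

Whether that scheduling overhead actually accounts for the difference you
measured is only a guess without profiling.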
>
> I'll keep thinking about the issue and see if I can get any more insight.
>
> Michael Kelley
>
> >
> > Changes from V8:
> >
> > Deal with shutdown hangs resulting when a parent/supplier device is
> > later in the devices_kset list than its children/consumers:
> > * Ignore sync_state_only devlinks for shutdown dependencies
> > * Ignore shutdown_after for devices that don't want async shutdown
> > * Add a sanity check to revert to sync shutdown for any device that
> >   would otherwise wait for a child/consumer shutdown that hasn't
> >   already been scheduled
> >
> > Changes from V7:
> >
> > Do not expose driver async_shutdown_enable in sysfs.
> > Wrapped a long line.
> >
> > Changes from V6:
> >
> > Removed a sysfs attribute that allowed the async device shutdown to be
> > "on" (with driver opt-out), "safe" (driver opt-in), or "off"... what
> > was previously "safe" is now the only behavior, so drivers now only
> > need to have the option to enable or disable async shutdown.
> >
> > Changes from V5:
> >
> > Separated into multiple patches to make review easier.
> > Reworked some code to make it more readable
> > Made devices wait for consumers to shut down, not just children
> > (suggested by David Jeffery)
> >
> > Changes from V4:
> >
> > Change code to use cookies for synchronization rather than async domains
> > Allow async shutdown to be disabled via sysfs, and allow driver opt-in
> > or opt-out of async shutdown (when not disabled), with the ability to
> > control driver opt-in/opt-out via sysfs
> >
> > Changes from V3:
> >
> > Bug fix (used "parent" not "dev->parent" in device_shutdown)
> >
> > Changes from V2:
> >
> > Removed recursive functions to schedule children to be shut down
> > before parents, since the existing device_shutdown loop will already
> > do this
> >
> > Changes from V1:
> >
> > Rewritten using kernel async code (suggested by Lukas Wunner)
> >
> >
> > Stuart Hayes (4):
> > driver core: don't always lock parent in shutdown
> > driver core: separate function to shutdown one device
> > driver core: shut down devices asynchronously
> > nvme-pci: Make driver prefer asynchronous shutdown
> >
> >  drivers/base/base.h           |   4 +
> >  drivers/base/core.c           | 137 +++++++++++++++++++++++++++-------
> >  drivers/nvme/host/pci.c       |   1 +
> >  include/linux/device/driver.h |   2 +
> >  4 files changed, 118 insertions(+), 26 deletions(-)
> >
> > --
> > 2.39.3
> >
>
>
Hopefully helpful.

Interesting: once again I tested the V9 patch bundle in the Red Hat lab,
and I again see a great improvement of 6 to 8 times faster on a server
with 24 NVMe devices. Measuring this using dmesg timestamps gets me a
shutdown in about 8 seconds versus 50+ seconds.

The problem with my testing is that I don't have all the hardware. For
example, what failed last time on V8 (the SOC board) and was reported is
not covered by the Red Hat lab testing.

So, for what it is worth:
Tested-by: Laurence Oberman <loberman at redhat.com>