[PATCH v9 0/4] shut down devices asynchronously

Fri Oct 11 08:52:19 PDT 2024

On Fri, 2024-10-11 at 04:22 +0000, Michael Kelley wrote:
> From: Stuart Hayes <stuart.w.hayes at gmail.com> Sent: Wednesday,
> October 9, 2024 10:58 AM
> > 
> > This adds the ability for the kernel to shut down devices
> > asynchronously.
> > 
> > Only devices with drivers that enable it are shut down
> > asynchronously.
> > 
> > This can dramatically reduce system shutdown/reboot time on systems
> > that have multiple devices that take many seconds to shut down (like
> > certain NVMe drives). On one system tested, the shutdown time went
> > from 11 minutes without this patch to 55 seconds with the patch.
> 
> I've been testing this series against a 6.11.0 kernel in an Azure VM,
> which is running as a guest on the Microsoft Hyper-V hypervisor. The
> VM has 16 vCPUs, 128 Gbytes of memory, and two physical NVMe
> controllers that are mapped into the VM so that it can access them
> directly.
> 
> But I wanted to confirm that the two NVMe controllers are being shut
> down in parallel.  So before doing a shutdown, I set
> /sys/module/kernel/parameters/initcall_debug to "Y" so the shutdown
> of each device is recorded in the console output.  Here's the full
> set of device shutdown messages:
> 
> 172.609825      platform intel_rapl_msr.0: shutdown
> 172.611940      mlx5_ib.rdma mlx5_core.rdma.0: shutdown
> 172.613931      mlx5_core.eth mlx5_core.eth.0: shutdown
> 172.618116      nvme c2b7:00:00.0: shutdown
> 172.618262      nvme 132e:00:00.0: shutdown
> 172.618349      mlx5_core 1610:00:02.0: shutdown
> 172.618359      mlx5_core 1610:00:02.0: Shutdown was called
> 172.782768      hv_pci ba152dae-1610-4c67-b925-81ac4902e4ce: shutdown
> 172.786405      sd 0:0:0:1: shutdown
> 172.788788      sd 0:0:0:0: shutdown
> 172.789949      sd 0:0:0:0: [sda] Synchronizing SCSI cache
> 172.794209      atkbd serio0: shutdown
> 172.795974      hv_utils 242ff919-07db-4180-9c2e-b86cb68c8c55: shutdown
> 172.800432      hv_pci 0cdfe983-132e-434b-8025-fc9ab43c0fc5: shutdown
> 172.802812      hv_pci 2394da4f-c2b7-43bd-b72f-d3482ef6850a: shutdown
> 172.805145      hv_netvsc 0022487e-1043-0022-487e-10430022487e: shutdown
> 172.807575      hyperv_fb 5620e0c7-8062-4dce-aeb7-520c7ef76171: shutdown
> 172.810026      hyperv_keyboard d34b2567-b9b6-42b9-8778-0a4ec0b955bf: shutdown
> 172.812522      hid_hyperv 58f75a6d-d949-4320-99e1-a2a2576d581c: shutdown
> 172.814982      hv_balloon 1eccfd72-4b41-45ef-b73a-4a6e44c12924: shutdown
> 172.817376      vmbus c4e5e7d1-d748-4afc-979d-683167910a55: shutdown
> 172.819789      hv_storvsc f8b3781b-1e82-4818-a1c3-63d806ec15bb: shutdown
> 172.822324      hv_storvsc f8b3781a-1e82-4818-a1c3-63d806ec15bb: shutdown
> 172.824813      hv_utils 2dd1ce17-079e-403c-b352-a1921ee207ee: shutdown
> 172.827199      hv_utils b6650ff7-33bc-4840-8048-e0676786f393: shutdown
> 172.829653      hv_utils fd149e91-82e0-4a7d-afa6-2a4166cbd7c0: shutdown
> 172.836408      platform eisa.0: shutdown
> 172.838558      alarmtimer alarmtimer.0.auto: shutdown
> 172.842461      platform Fixed MDIO bus.0: shutdown
> 172.864709      kgdboc kgdboc: shutdown
> 172.878009      serial8250 serial8250: shutdown
> 172.889725      platform pcspkr: shutdown
> 172.904386      rtc_cmos 00:02: shutdown
> 172.906217      serial 00:01: shutdown
> 172.907799      serial 00:00: shutdown
> 172.910427      platform efivars.0: shutdown
> 172.913341      platform rtc-efi.0: shutdown
> 172.915470      vmgenid HYPER_V_GEN_COUNTER_V1:00: shutdown
> 172.917479      vmbus VMBUS:00: shutdown
> 172.919012      platform PNP0003:00: shutdown
> 172.926707      reg-dummy reg-dummy: shutdown
> 172.961360      ACPI: PM: Preparing to enter system sleep state S5
> 
> You see the Mellanox CX-5 NIC, the two NVMe devices, various Hyper-V
> virtual devices, and platform devices being shut down. Everything
> seems to work properly, so that's good. The two NVMe devices are shut
> down very close together in time, so they are probably being done in
> parallel.
> 
> As a comparison, I did the same thing with an unmodified 6.11.0
> kernel. Indeed, the NVMe device shutdowns are significantly further
> apart in time (110 milliseconds). That's not noticeably slow like the
> NVMe devices you were dealing with, but doing them in parallel helps
> a little bit.
> 
> But here's the kicker:  The overall process of shutting down the
> devices took *longer* with the patch set than without.  Here's the
> same output from a 6.11.0 kernel without the patch set:
> 
> 745.455493      platform intel_rapl_msr.0: shutdown
> 745.456999      mlx5_ib.rdma mlx5_core.rdma.0: shutdown
> 745.458557      mlx5_core.eth mlx5_core.eth.0: shutdown
> 745.460166      mlx5_core 1610:00:02.0: shutdown
> 745.461570      mlx5_core 1610:00:02.0: Shutdown was called
> 745.466053      nvme 132e:00:00.0: shutdown
> 745.579284      nvme c2b7:00:00.0: shutdown
> 745.718739      hv_pci ba152dae-1610-4c67-b925-81ac4902e4ce: shutdown
> 745.721114      sd 0:0:0:1: shutdown
> 745.722254      sd 0:0:0:0: shutdown
> 745.723357      sd 0:0:0:0: [sda] Synchronizing SCSI cache
> 745.725259      atkbd serio0: shutdown
> 745.726405      hv_utils 242ff919-07db-4180-9c2e-b86cb68c8c55: shutdown
> 745.728375      hv_pci 0cdfe983-132e-434b-8025-fc9ab43c0fc5: shutdown
> 745.730347      hv_pci 2394da4f-c2b7-43bd-b72f-d3482ef6850a: shutdown
> 745.732281      hv_netvsc 0022487e-1043-0022-487e-10430022487e: shutdown
> 745.734318      hyperv_fb 5620e0c7-8062-4dce-aeb7-520c7ef76171: shutdown
> 745.736488      hyperv_keyboard d34b2567-b9b6-42b9-8778-0a4ec0b955bf: shutdown
> 745.738628      hid_hyperv 58f75a6d-d949-4320-99e1-a2a2576d581c: shutdown
> 745.740770      hv_balloon 1eccfd72-4b41-45ef-b73a-4a6e44c12924: shutdown
> 745.742835      vmbus c4e5e7d1-d748-4afc-979d-683167910a55: shutdown
> 745.744765      hv_storvsc f8b3781b-1e82-4818-a1c3-63d806ec15bb: shutdown
> 745.746861      hv_storvsc f8b3781a-1e82-4818-a1c3-63d806ec15bb: shutdown
> 745.748907      hv_utils 2dd1ce17-079e-403c-b352-a1921ee207ee: shutdown
> 745.750948      hv_utils b6650ff7-33bc-4840-8048-e0676786f393: shutdown
> 745.753012      hv_utils fd149e91-82e0-4a7d-afa6-2a4166cbd7c0: shutdown
> 745.755000      platform eisa.0: shutdown
> 745.756266      alarmtimer alarmtimer.0.auto: shutdown
> 745.757868      platform Fixed MDIO bus.0: shutdown
> 745.759447      kgdboc kgdboc: shutdown
> 745.760679      serial8250 serial8250: shutdown
> 745.762110      platform pcspkr: shutdown
> 745.763387      rtc_cmos 00:02: shutdown
> 745.764726      serial 00:01: shutdown
> 745.765898      serial 00:00: shutdown
> 745.767036      platform efivars.0: shutdown
> 745.768783      platform rtc-efi.0: shutdown
> 745.770240      vmgenid HYPER_V_GEN_COUNTER_V1:00: shutdown
> 745.771949      vmbus VMBUS:00: shutdown
> 745.773197      platform PNP0003:00: shutdown
> 745.774540      reg-dummy reg-dummy: shutdown
> 745.775964      ACPI: PM: Preparing to enter system sleep state S5
> 
> There's some modest variability in the individual steps, but the 110
> ms saved on the NVMe device seems to be given back on some other
> devices. I did the comparison twice with similar results. (I have the
> full data set with comparisons in an Excel spreadsheet.)
> 
> Any thoughts on what might be causing this?  I haven't gone into the
> details of your algorithms for parallelizing, but is there any extra
> overhead that could be adding to the time? Or maybe this is
> something unique to Hyper-V guests. The overall difference is only
> a few tens of milliseconds, so not that big of a deal. But maybe it's
> an indicator that something unexpected is happening that we should
> understand.
> 
> I'll keep thinking about the issue and see if I can get any more
> insight.
> 
> Michael Kelley
> 
> > 
> > Changes from V8:
> > 
> > Deal with shutdown hangs resulting when a parent/supplier device is
> >   later in the devices_kset list than its children/consumers:
> >   * Ignore sync_state_only devlinks for shutdown dependencies
> >   * Ignore shutdown_after for devices that don't want async shutdown
> >   * Add a sanity check to revert to sync shutdown for any device that
> >     would otherwise wait for a child/consumer shutdown that hasn't
> >     already been scheduled
> > 
> > Changes from V7:
> > 
> > Do not expose driver async_shutdown_enable in sysfs.
> > Wrapped a long line.
> > 
> > Changes from V6:
> > 
> > Removed a sysfs attribute that allowed the async device shutdown to be
> > "on" (with driver opt-out), "safe" (driver opt-in), or "off"... what was
> > previously "safe" is now the only behavior, so drivers now only need to
> > have the option to enable or disable async shutdown.
> > 
> > Changes from V5:
> > 
> > Separated into multiple patches to make review easier.
> > Reworked some code to make it more readable
> > Made devices wait for consumers to shut down, not just children
> >   (suggested by David Jeffery)
> > 
> > Changes from V4:
> > 
> > Change code to use cookies for synchronization rather than async domains
> > Allow async shutdown to be disabled via sysfs, and allow driver opt-in or
> >   opt-out of async shutdown (when not disabled), with ability to control
> >   driver opt-in/opt-out via sysfs
> > 
> > Changes from V3:
> > 
> > Bug fix (used "parent" not "dev->parent" in device_shutdown)
> > 
> > Changes from V2:
> > 
> > Removed recursive functions to schedule children to be shut down before
> >   parents, since existing device_shutdown loop will already do this
> > 
> > Changes from V1:
> > 
> > Rewritten using kernel async code (suggested by Lukas Wunner)
> > 
> > 
> > Stuart Hayes (4):
> >   driver core: don't always lock parent in shutdown
> >   driver core: separate function to shutdown one device
> >   driver core: shut down devices asynchronously
> >   nvme-pci: Make driver prefer asynchronous shutdown
> > 
> >  drivers/base/base.h           |   4 +
> >  drivers/base/core.c           | 137 +++++++++++++++++++++++++++-------
> >  drivers/nvme/host/pci.c       |   1 +
> >  include/linux/device/driver.h |   2 +
> >  4 files changed, 118 insertions(+), 26 deletions(-)
> > 
> > --
> > 2.39.3
> > 
> 
> 

Hopefully this is helpful.
Interesting. Once again I tested the V9 patch set in the Red Hat lab, and
I again see a large improvement: shutdown is 6 to 8 times faster on a
server with 24 NVMe devices.
Measured with dmesg timestamps, shutdown takes about 8 seconds versus
50+ seconds.
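
For anyone who wants to repeat this kind of timestamp comparison, here is a
minimal sketch in Python. It is not part of the patch set. It assumes each
console line starts with a seconds timestamp (optionally behind a "> " mail
quote prefix), and the helper names parse_events, shutdown_window and
gap_between are made up for illustration. The sample lines are taken from
the two logs Michael posted above.

```python
import re

# Optional mail-quote prefix ("> "), then a seconds timestamp, then the message.
LINE_RE = re.compile(r"^(?:>\s*)*(\d+\.\d+)\s+(.*)$")


def parse_events(text):
    """Return (timestamp, message) pairs; lines without a timestamp are skipped."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            events.append((float(match.group(1)), match.group(2).strip()))
    return events


def shutdown_window(events):
    """Seconds between the first and the last timestamped line."""
    return events[-1][0] - events[0][0]


def gap_between(events, needle):
    """Seconds between the first and last message containing `needle`."""
    times = [t for t, msg in events if needle in msg]
    return times[-1] - times[0]


# First line, the two NVMe lines, and last line from each posted log.
PATCHED = """\
> 172.609825      platform intel_rapl_msr.0: shutdown
> 172.618116      nvme c2b7:00:00.0: shutdown
> 172.618262      nvme 132e:00:00.0: shutdown
> 172.961360      ACPI: PM: Preparing to enter system sleep state S5
"""

UNPATCHED = """\
> 745.455493      platform intel_rapl_msr.0: shutdown
> 745.466053      nvme 132e:00:00.0: shutdown
> 745.579284      nvme c2b7:00:00.0: shutdown
> 745.775964      ACPI: PM: Preparing to enter system sleep state S5
"""

for label, log in (("patched", PATCHED), ("unpatched", UNPATCHED)):
    events = parse_events(log)
    print(f"{label}: window {shutdown_window(events) * 1000:.1f} ms, "
          f"nvme gap {gap_between(events, 'nvme ') * 1000:.3f} ms")
```

On those numbers the NVMe gap shrinks from roughly 113 ms to well under
1 ms with the patch set, while the overall window grows by about 31 ms,
which matches Michael's "a few 10's of milliseconds" observation. The
window only spans the first to the last logged line, so it says nothing
about time spent before the first device or after the S5 message.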
The limitation of my testing is that I don't have all the hardware. For
example, what failed last time on V8 (SOC board) and was reported is
not covered by the Red Hat lab testing.
So, for what it is worth:
Tested-by: Laurence Oberman <loberman at redhat.com>