[PATCH] nvme: Boot as soon as the boot controller has been probed

Keith Busch kbusch at kernel.org
Thu Nov 12 09:20:24 EST 2020


On Wed, Nov 11, 2020 at 08:26:12PM -0800, Bart Van Assche wrote:
> On 11/11/20 11:19 AM, Keith Busch wrote:
> > On Sat, Nov 07, 2020 at 08:09:03PM -0800, Bart Van Assche wrote:
> >> The following two issues have been introduced by commit 1811977568e0
> >> ("nvme/pci: Use async_schedule for initial reset work"):
> >> - The boot process waits until all NVMe controllers have been probed
> >>   instead of only waiting until the boot controller has been probed.
> >>   This slows down the boot process.
> >> - Some of the controller probing work happens asynchronously without
> >>   the device core being aware of this.
> >>
> >> Hence this patch that makes all probing work happen from nvme_probe()
> >> and that tells the device core to probe multiple NVMe controllers
> >> concurrently by setting PROBE_PREFER_ASYNCHRONOUS.
> > 
> > I am finding that this setting probes devices in parallel on boot up,
> > but serially for devices added after boot. That's making this a rather
> > unappealing patch.
> 
> Hi Keith,
> 
> Thanks for verifying this. However, it is unexpected to me that my
> patch results in serial probing of NVMe devices added after boot.
> Could you share more details about how that test was run? I'm
> wondering whether the serialization was an artifact of the test
> setup. My understanding of the driver core is that the decision
> whether or not to probe concurrently is made inside
> __driver_attach(), and that PROBE_PREFER_ASYNCHRONOUS should trigger
> a call to the following code:
> 
> 		async_schedule_dev(__driver_attach_async_helper, dev);
> 
> I expect that calling the above code should result in concurrent probing.
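
That matches my reading of the boot-time path. Condensed from
drivers/base/dd.c of this era (device matching, locking, and error
handling elided -- a sketch, not the verbatim kernel code), the
dispatch looks roughly like this:

  static int __driver_attach(struct device *dev, void *data)
  {
          struct device_driver *drv = data;

          if (driver_allows_async_probing(drv)) {
                  /*
                   * Drivers with probe_type = PROBE_PREFER_ASYNCHRONOUS
                   * land here: the probe is queued and may run
                   * concurrently with probes of other devices.
                   */
                  async_schedule_dev(__driver_attach_async_helper, dev);
                  return 0;
          }

          /* Everyone else is probed synchronously, one at a time. */
          device_driver_attach(drv, dev);
          return 0;
  }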

The easiest check is something like this (as long as you're not actively
using your nvme namespaces):

  # echo 1 | tee /sys/class/nvme/nvme*/device/remove
  # echo 1 > /sys/bus/pci/rescan

The subsequent probes don't go through the async_schedule path. They
show this stack instead:

[   44.528906] Call Trace:
[   44.530512]  dump_stack+0x6d/0x88
[   44.532373]  nvme_probe+0x2f/0x59b [nvme]
[   44.534466]  local_pci_probe+0x3d/0x70
[   44.536471]  pci_device_probe+0x107/0x1b0
[   44.538562]  really_probe+0x1be/0x410
[   44.540564]  driver_probe_device+0xe1/0x150
[   44.542733]  ? driver_allows_async_probing+0x50/0x50
[   44.545170]  bus_for_each_drv+0x7b/0xc0
[   44.547172]  __device_attach+0xeb/0x170
[   44.549163]  pci_bus_add_device+0x4a/0x70
[   44.551182]  pci_bus_add_devices+0x2c/0x70
[   44.553229]  pci_rescan_bus+0x25/0x30
[   44.555126]  rescan_store+0x61/0x90
[   44.556975]  kernfs_fop_write+0xcb/0x1b0
[   44.558940]  vfs_write+0xbe/0x200
[   44.560710]  ksys_write+0x5f/0xe0
[   44.562483]  do_syscall_64+0x2d/0x40
[   44.564354]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
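
If I'm reading drivers/pci/bus.c and drivers/base/dd.c correctly, the
difference is the entry point: pci_bus_add_device() binds the driver
through device_attach(), which passes allow_async = false into
__device_attach(), so driver_allows_async_probing() never gets a
chance to defer the probe on this path. A condensed sketch (details
elided, not the verbatim kernel code):

  int device_attach(struct device *dev)
  {
          /* allow_async == false: any probe runs right here, serially */
          return __device_attach(dev, false);
  }

  void device_initial_probe(struct device *dev)
  {
          /* allow_async == true: the probe may go to an async worker */
          __device_attach(dev, true);
  }

  void pci_bus_add_device(struct pci_dev *dev)
  {
          /* ... */
          dev->match_driver = true;
          device_attach(&dev->dev);  /* rescan/hotplug lands here */
  }

So hotplugged and rescanned devices always take the synchronous
branch, no matter what probe_type the driver asks for.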


