`nvme_disable_ctrl()` takes 411 ms on a Dell XPS 13 with SK hynix PC300 NVMe

Thu May 2 22:52:32 PDT 2024

Dear Keith,

Thank you for your reply with a lot of background. This is much appreciated.

On 02.05.24 at 10:43, Keith Busch wrote:
> On Thu, May 02, 2024 at 08:12:39AM +0200, Paul Menzel wrote:
>>>> That doesn't seem too hard to believe to me. A safe shutdown can often
>>>> take a while for an SSD. I've seen other implementations orders of
>>>> magnitude worse than what you're showing.
>>>
>>> But why? Due to physics or due to "slow" firmware?
> 
> Maybe both? The run time metadata doesn't necessarily match the on-disk
> format, and constructing that can take a moment. These devices' CPUs are
> usually the cheapest the vendor could get that satisfies a run-time
> performance target, so may be under powered for computational tasks.
> 
> And it may also have to flush pending user data from its internal
> memory, which could be a few GB.
> 
> Lower end devices don't even have memory, so may have to make many round
> trips to host memory to retrieve its metadata then manipulate that to
> its on-disk format.

Thank you for the details. Indeed, “slow” firmware can be caused by
low-performance chips.

> Maybe this could be better optimized, but vendors may not consider
> shutdown time to be a high priority.

As this is all a black box, it’s hard to know. If shutdown time were more
visible, vendors might make it a higher priority.

> This gets worse as you add more nvme devices to your system because
> shutdown is serialized. Some of us have proposed patches parallelizing
> this process. I wish I could spend more time on helping see that to
> completion, but other priorities get in the way. :(

I didn’t know. I only have systems with one NVMe device. In other areas,
like initializing CPU cores and applying microcode updates, developers
also try to parallelize the work to decrease boot time.

>>> So this confirms the ftrace findings. Excuse my ignorance, so the
>>> time-out is in seconds? And how does this relate to the rtd3e value (410
>>> ms /= 60 ms /= 5 s(?)?
> 
> The driver provides a user tunable parameter to specify the minimum
> timeout value, and it defaults to 5 seconds.
> 
>    nvme_core.shutdown_timeout=<time_in_seconds>
> 
> The driver selects this or the advertised rtd3e, whichever is greater.

Reading the code, that is in `nvme_init_identify()`:

	if (id->rtd3e) {
		/* us -> s */
		u32 transition_time = le32_to_cpu(id->rtd3e) / USEC_PER_SEC;

		ctrl->shutdown_timeout = clamp_t(unsigned int, transition_time,
						 shutdown_timeout, 60);

		if (ctrl->shutdown_timeout != shutdown_timeout)
			dev_info(ctrl->device,
				 "D3 entry latency set to %u seconds\n",
				 ctrl->shutdown_timeout);
	} else
		ctrl->shutdown_timeout = shutdown_timeout;
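
To check my reading of the excerpt above, here is a small user-space sketch of the same arithmetic. It is not kernel code: `clamp_uint()` and `effective_shutdown_timeout()` are names I made up, and I assume the module parameter does not exceed the 60 second cap.

```c
#include <assert.h>
#include <stdint.h>

#define USEC_PER_SEC 1000000U

/* User-space stand-in for the kernel's clamp_t(); hypothetical helper. */
static unsigned int clamp_uint(unsigned int val, unsigned int lo,
			       unsigned int hi)
{
	if (val < lo)
		return lo;
	if (val > hi)
		return hi;
	return val;
}

/*
 * Hypothetical helper mirroring the quoted logic: rtd3e is given in
 * microseconds, the module parameter in seconds, the result in seconds.
 */
static unsigned int effective_shutdown_timeout(uint32_t rtd3e_us,
					       unsigned int param_s)
{
	/* Assumption of this sketch only: parameter is within the cap. */
	assert(param_s <= 60);

	if (!rtd3e_us)
		return param_s;

	/* Integer division: anything below one second becomes 0. */
	return clamp_uint(rtd3e_us / USEC_PER_SEC, param_s, 60);
}
```

With rtd3e = 60000 us (60 ms) and the default parameter of 5, the integer division yields 0, and the clamp raises it to 5 seconds. So, if I read it correctly, the 60 ms the device advertises has no effect on the timeout here.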

> We can't trust devices to report this correctly (and NVMe 1.0 didn't
> even provide a way for a device to report an expected shutdown time), so
> this exists to prevent unsafe shutdowns. Devices are supposed to survive
> an unsafe shutdown, but it's best to avoid that path.

So is that what happens in my case: the SK hynix reports 60 ms, but
actually takes 411 ms?

> The parameter is in granularity of seconds because the NVMe 1.0 spec
> said to "wait at least one second" for a shutdown to complete. Not the
> most clear wording for a spec, but that's where we started.

Thank you for the details. I am not a spec writer, but my gut feeling is
that there should always be a polling(?) solution, and that only upper
bounds, that is “no longer than”, should be specified.
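
To illustrate what I mean by polling with only an upper bound, here is a rough user-space sketch with a simulated clock. Everything in it (`struct fake_dev`, `fake_dev_shutdown_done()`, `poll_shutdown()`) is hypothetical and made up for illustration; it is not how the driver or the spec words it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical device model: shutdown completes after ready_after_ms. */
struct fake_dev {
	unsigned int ready_after_ms;
};

static bool fake_dev_shutdown_done(const struct fake_dev *dev,
				   unsigned int now_ms)
{
	return now_ms >= dev->ready_after_ms;
}

/*
 * Poll every interval_ms until the device reports completion, but never
 * longer than timeout_ms. Returns the simulated elapsed time in ms, or
 * -1 if the upper bound was reached first.
 */
static int poll_shutdown(const struct fake_dev *dev, unsigned int timeout_ms,
			 unsigned int interval_ms)
{
	unsigned int now_ms = 0;

	assert(interval_ms > 0);

	for (;;) {
		if (fake_dev_shutdown_done(dev, now_ms))
			return (int)now_ms;
		if (now_ms >= timeout_ms)
			return -1;
		now_ms += interval_ms;	/* stands in for a real sleep */
	}
}
```

In this model, a device needing 411 ms is done after 411 ms when polled every millisecond with a 5000 ms bound, and the host only waits the full 5000 ms for a device that never finishes in time.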

Kind regards,

Paul