My Western Digital SN850 appears to suffer from deep power state issues - considering submitting quirk patch.

Thu May 19 03:14:41 PDT 2022

Thank you for your help, Keith and Christoph. I've done some more investigating.

On Mon, 16 May 2022, at 18:58, Christoph Hellwig wrote:
> On Sun, May 15, 2022 at 01:44:32PM -0600, Keith Busch wrote:
> > Some of the behavior you're describing has been isolated to specific
> > drive+platform combinations in the past, but let's hear your results from the
> > follow up experiements before considering if we need to introduce a dmi
> > type quirk.
> 
> Also, are we even sure this is related to power states?
> 

That's a good question. I had assumed it was, because every post I found that mentions this error was about changing the latency setting.

> >  
> > > Anyway, with all that background, I'm happy to try NVME_QUIRK_NO_DEEPEST_PS for 15b7:5011 locally, and submit here if it works.
> > 
> > I think that's worth trying. Alternatively, you could just mess with the
> > module's param 'nvme_core.default_ps_max_latency_us' value and see if only the
> > deepest states or if any low power state is problematic.
> 
> Maybe even just nvme_core.default_ps_max_latency_us=0 to verify it
> really is power state related as a start.
> 

I have now tried both of these options (separately and together), and the issue still occurs. I confirmed the settings on the kernel command line:

cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-5.13.19-6-pve root=/dev/mapper/pve-root ro quiet video=efifb:off acpi_enforce_resources=lax nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
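
The command line only shows what was requested at boot. As a rough sketch (not something suggested in this thread), the requested value could be compared with what the loaded nvme_core module reports in sysfs. The function names below are made up for illustration, and the sysfs path assumes nvme_core exposes default_ps_max_latency_us as a readable module parameter:

```shell
# Print the value of KEY from a kernel command line file.
# Usage: cmdline_param KEY [FILE]   (FILE defaults to /proc/cmdline)
# Returns non-zero if KEY=... is not present.
cmdline_param() {
    key=$1
    file=${2:-/proc/cmdline}
    tr ' ' '\n' < "$file" |
        awk -v k="$key" '
            index($0, k "=") == 1 { print substr($0, length(k) + 2); found = 1 }
            END { exit !found }
        '
}

# Compare the requested latency limit with the value the module reports.
# Both paths can be overridden; the defaults are assumptions about the system.
check_ps_latency() {
    sysfs=${1:-/sys/module/nvme_core/parameters/default_ps_max_latency_us}
    cmdline=${2:-/proc/cmdline}
    want=$(cmdline_param nvme_core.default_ps_max_latency_us "$cmdline") || want="(not set)"
    have=$(cat "$sysfs")
    echo "requested=$want effective=$have"
}
```

Running check_ps_latency with no arguments on the affected host should print both values. If nvme-cli is installed, "nvme get-feature /dev/nvme0 -f 0x0c -H" (the device name is an assumption) should additionally show whether autonomous power state transitions are enabled on the controller.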

Does this mean it's definitively not a power state issue?

The slightly positive news is that I now have a fairly dependable way of reproducing the issue. I've described it over in the Proxmox forums (https://forum.proxmox.com/threads/what-processes-resources-are-used-while-doing-a-vm-backup-in-stop-mode.109779/) but in short, just backing up a container (from the affected drive, to an unaffected one) has about a 30% chance of causing the drive to go offline. I suppose that is another indication it has nothing to do with entering lower power states (autonomous or otherwise), since it happens while the disk is actively being read.
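
To make the reproduction less manual, a loop along these lines could repeat the backup until the drive drops out. This is only a sketch: run_until_offline is a hypothetical helper, and it treats either a failing command or a missing device node as "offline", which is an assumption about how the failure shows up:

```shell
# Run a command up to MAX times. Stop at the first run after which the
# command failed or the device node DEV no longer exists.
# Usage: run_until_offline MAX DEV COMMAND [ARGS...]
run_until_offline() {
    max=$1
    dev=$2
    shift 2
    i=1
    while [ "$i" -le "$max" ]; do
        if ! "$@" || [ ! -e "$dev" ]; then
            echo "failed on run $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $max runs"
}
```

For example, "run_until_offline 10 /dev/nvme0n1 vzdump 101 --mode stop --storage backup", where the device name, container ID and storage name are placeholders rather than values from my setup.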

I ran strace on the backup process to compare what it does when it fails with what it does when it completes without error, but nothing stands out.

Another thing I tried was a raw dd read from the affected disk to /dev/null, to see whether intensive reading alone triggers this, and it did not. During that read the controller temperature maxed out at 63C, and nvme smart-log didn't show any critical warnings.
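
For anyone wanting to repeat that check, the two smart-log fields could be pulled out with a small filter like the one below. It is a sketch that assumes the plain-text "name : value" layout printed by nvme smart-log, which differs slightly between nvme-cli versions; smart_summary is a made-up name:

```shell
# Reduce `nvme smart-log` text (read from stdin) to the critical warning
# flag and the composite temperature.
smart_summary() {
    awk -F: '
        /^critical_warning/ { gsub(/[ \t]/, "", $2); warn = $2 }
        /^temperature/      { gsub(/^[ \t]+/, "", $2); temp = $2 }
        END { print "critical_warning=" warn " temperature=" temp }
    '
}
```

While something like "dd if=/dev/nvme0n1 of=/dev/null bs=1M status=progress" runs in one terminal, "nvme smart-log /dev/nvme0 | smart_summary" can be run every few seconds in another. The device names here are assumptions.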

I'm wondering whether there's any low-level debugging or BIOS setting that would help me identify if this is a hardware issue (PSU, SSD, motherboard).

As a complete aside on mailing list etiquette: my understanding is that messages should be text only and use 'inline' style replies. Responses come both from the list and directly from the people replying, so should I "reply all" or reply just to the list? For now, I've gone for the latter.