My Western Digital SN850 appears to suffer from deep power state issues - considering submitting quirk patch.

Keith Busch kbusch at kernel.org
Thu May 19 13:02:03 PDT 2022


On Thu, May 19, 2022 at 11:14:41AM +0100, Marcos Scriven wrote:
> 
> I have now tried both of these options (separately and together), and the issue still occurs. I confirmed settings in command line:
> 
> cat /proc/cmdline
> BOOT_IMAGE=/boot/vmlinuz-5.13.19-6-pve root=/dev/mapper/pve-root ro quiet video=efifb:off acpi_enforce_resources=lax nvme_core.default_ps_max_latency_us=0 pcie_aspm=off
> 
> Does this mean it's definitively not a power state issue?

If you're still seeing the same all-f's failure even with these settings, I
think that rules out the autonomous power settings provided by nvme and pcie.

It doesn't necessarily rule out platform-specific power capabilities, but
those are well outside my view of the nvme driver stack, and I don't have any
ideas off the top of my head on what to even check.
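
One way to sanity-check that those settings actually took effect on this boot
is to query the driver, the drive, and the link directly. A rough sketch (the
device path and PCI address below are just placeholders for your setup):

  # cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
  # nvme get-feature /dev/nvme0 -f 0x0c -H
  # lspci -vvv -s 01:00.0 | grep -i aspm

The get-feature decode should report autonomous power state transitions
disabled, and the lspci LnkCtl line shows whether ASPM is actually off on the
link.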
 
> The slightly positive news is I now have a fairly dependable way of replicating the issue. I've described it over in the Proxmox forums (https://forum.proxmox.com/threads/what-processes-resources-are-used-while-doing-a-vm-backup-in-stop-mode.109779/) but in short, just backing up a container (from the affected drive to an unaffected one) has about a 30% chance of causing the drive to go offline. I suppose the fact that this happens right when the disk is being read is another indicator that it's not to do with lowering power states (autonomous or otherwise).
> 
> I tried running strace on the process to see if I could spot anything different about what it's doing when it fails versus when it completes without error, but I can't see anything obvious.
> 
> Another thing I tried was a raw dd read from the affected disk to /dev/null, to see if it was something about intensive reading that causes this, and it did not happen. During that run the controller temp maxes out at 63C. nvme smart-log doesn't show any critical warnings.

63C, not great, not terrible.
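
For reference, you can compare that against the thresholds the drive itself
reports (values are in Kelvin; adjust the device path for your system):

  # nvme id-ctrl /dev/nvme0 | grep -iE 'wctemp|cctemp'

63C should still be below the warning threshold on most consumer drives, but
it's worth confirming for this one.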

'dd' is a nice tool, but you may be able to push the drive harder with 'fio',
assuming an intense read workload is what triggers the failure. Just a quick
example:

  # fio --name=global --filename=/dev/nvme0n1 --bs=64k --direct=1 --ioengine=libaio --rw=randread --iodepth=32 --numjobs=8 --name=test
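
While that runs, it may be worth watching the kernel log and the controller
temperature from another terminal (again, the device path is an example):

  # dmesg -w
  # watch -n 1 'nvme smart-log /dev/nvme0 | grep -i temperature'

That should tell you whether sustained reads alone can knock the controller
offline, or whether it only happens during the backup job.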
 
> I'm wondering if there's anything that would help me identify if it's a hardware issue (PSU, SSD, motherboard) in terms of low-level debugging or BIOS settings.

It does sound hardware-related, but I'm not aware of any reasonable tools or
methods to debug it. Right now, I can only recommend verifying you've got the
latest vendor-provided firmware installed.
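
You can check the firmware revision the drive is currently running, and which
firmware slots are populated, with something like (device path is an example):

  # nvme id-ctrl /dev/nvme0 | grep -i '^fr '
  # nvme fw-log /dev/nvme0

and compare that against whatever WD currently ships for the SN850.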

> As a complete aside on mailing list protocol, my understanding is text only, and 'inline' style. Responses come both from the group and directly from the people replying - should I "reply all" or just the group? For now, I've gone for the latter.

The mailing list only accepts plain text. Top posting is generally frowned
upon. A reply-all is fine. Wrapping columns at 80 characters helps readability.


