nvme nvme0: I/O 0 (I/O Cmd) QID 1 timeout, aborting, source drive corruption observed

Keith Busch kbusch at kernel.org
Tue Dec 20 08:56:23 PST 2022


On Tue, Dec 20, 2022 at 10:10:30AM +0900, J. Hart wrote:
> On 12/19/22 11:41 PM, Keith Busch wrote:
> > Given the potential flakiness of read corruption, I'd disable relaxed
> > ordering and see if that improves anything.
> 
> I am not familiar with this part.  How is this done?
> 
> > 
> > >                          MaxPayload 128 bytes, MaxReadReq 512 bytes
> > >                  DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr+ TransPend-
> > >                  LnkCap: Port #0, Speed 8GT/s, Width x4, ASPM L1, Latency L0 <1us, L1 <8us
> > >                          ClockPM+ Surprise- LLActRep- BwNot-
> > >                  LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- Retrain- CommClk+
> > >                          ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
> > >                  LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
> 
> 
> > Something seems off if it's downtraining to Gen1 x1. I believe this
> > setup should be capable of Gen2 x4. It sounds like the links among these
> > components may not be reliable.
> > 
> > Your first post mentioned the total transfer was 50GB. If you have deep
> > enough queues, the tail latency will exceed the default timeout values
> > when you're limited to that kind of bandwidth. You'd probably be better
> > off from a performance standpoint with a cheaper SATA SSD on AHCI.
> 
> It would be unfortunate, I think, if the linux driver could not be made to
> implement the NVMe standards on somewhat older equipment from perhaps ten
> or fifteen years ago.  Earlier than that is perhaps not terribly practical,
> of course.  Equipment like that which is still operating does tend to be
> reliable, and it's something of a shame to have to waste it.  Some of us
> also lack the wherewithal to update equipment every two years, especially
> older people or those in areas where the economy is not so good.  As I
> think we all know, there's more of that these days than we'd like... :-)
> 
> In any case, I'm very willing to run tests on this equipment if that will
> help.  I'm fairly familiar with building kernels, writing software and that
> sort of thing, but perhaps less so with fixing drivers.
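
To your earlier question about how to disable relaxed ordering: one way is
to clear the "Enable Relaxed Ordering" bit (bit 4) of the PCIe Device Control
register with setpci. A rough, untested sketch, assuming the drive is at
01:00.0 (substitute the address lspci reports for your device):

  # confirm it is currently enabled: the DevCtl line should show "RlxdOrd+"
  lspci -vvv -s 01:00.0 | grep RlxdOrd

  # clear bit 4 of Device Control (offset 8 in the PCI Express capability);
  # the value:mask form only modifies the bits set in the mask
  setpci -s 01:00.0 CAP_EXP+8.w=0:10

Note the setting doesn't survive a reset or reboot, so re-apply it before
re-running the workload.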
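
Separately, if you only want to rule out slow-but-otherwise-correct
completions behind those timeouts, the default nvme I/O timeout (30 seconds)
can be raised at runtime, for example:

  echo 60 > /sys/module/nvme_core/parameters/io_timeout

or with nvme_core.io_timeout=60 on the kernel command line. That only papers
over latency, though; it does nothing for corrupted reads.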

For the record, the linux driver does implement the nvme standards, and it
works fine on older equipment that is capable of implementing them.

The problem you're describing sounds closer to the pcie phy layer, far
below the nvme protocol. There's really not a lot we can do at the
kernel layer to say for sure, though; you'd need something like an
expensive pcie protocol analyzer to really confirm. But even if we did
have that kind of data, it's unlikely to reveal a viable work-around.
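
One cheap check you can do from software is to watch the negotiated link
speed and width while the workload runs, e.g. (again assuming the drive is
at 0000:01:00.0):

  D=/sys/bus/pci/devices/0000:01:00.0
  cat $D/current_link_speed $D/current_link_width
  cat $D/max_link_speed $D/max_link_width

or just the LnkSta/LnkCap lines from "lspci -vvv". If the link stays pinned
at 2.5GT/s x1, or bounces around under load, that points at the link rather
than the driver. It's worth looking at the upstream bridge port the drive
hangs off of as well, not just the endpoint.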

Though I am skeptical, Christoph also seemed to think there's a possibility
you've hit a real kernel issue with your setup, but I don't know if he has
any ideas beyond enabling KASAN to see if that catches anything.
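
Since you mentioned you're comfortable building kernels, enabling KASAN is
roughly a matter of switching it on in your existing config and rebuilding,
something like this from the kernel source tree:

  ./scripts/config -e KASAN -e KASAN_GENERIC
  make olddefconfig
  make -j$(nproc)

Then boot the new kernel, reproduce the workload, and watch dmesg for KASAN
reports. Expect it to be noticeably slower and to use more memory than a
normal build.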


