[PATCH] nvme-pci: 512 byte aligned dma pool segment quirk

Thu Nov 14 08:47:22 PST 2024

 ---- On Thu, 14 Nov 2024 15:46:41 +0000  Keith Busch  wrote --- 
 > On Thu, Nov 14, 2024 at 11:38:03AM +0000, Pawel Anikiel wrote:
 > > I've been tracking down an issue that seems to be related (identical?) to
 > > this one, and I would like to propose a different fix.
 > > 
 > > I have a device with the aforementioned NVMe-eMMC bridge, and I was
 > > experiencing nvme read timeouts after updating the kernel from 5.15 to
 > > 6.6. Doing a kernel bisect, I arrived at the same dma pool commit as
 > > Robert in the original thread.
 > > 
 > > After trying out some changes in the nvme-pci driver, I came up with the
 > > same fix as in this thread: change the alignment of the small pool to
 > > 512. However, I wanted to get a deeper understanding of what's going on.
 > > 
 > > After a lot of analysis, I found out why the nvme timeouts were happening:
 > > The bridge incorrectly implements PRP list chaining.
 > > 
 > > When doing a read of exactly 264 sectors, and allocating a PRP list with
 > > offset 0xf00, the last PRP entry in that list lies right before a page
 > > boundary.  The bridge incorrectly (?) assumes that it's a pointer to a
 > > chained PRP list, tries to do a DMA to address 0x0, gets a bus error,
 > > and crashes.
 > > 
 > > When doing a write of 264 sectors with PRP list offset of 0xf00,
 > > the bridge treats data as a pointer, and writes incorrect data to
 > > the drive. This might be why Robert is experiencing fs corruption.
 > 
 > This sounds very plausible, great analysis. Curious though, even without
 > the dma pool optimizations, you could still allocate a PRP list at that
 > offset. I wonder why the problem only showed up once we optimized the
 > pool allocator.

yeah, me too. This is why I ended up at the 512 byte alignment idea.
I was speculating that maybe it has a 512 byte cache, and it caches whole 512 byte aligned blocks. So when we issue the previous segment immediately again, while it is processing the current segment, it may not have invalidated that first segment and start processing it using stale prp's that were meant for the previous block.
I'm not sure why last entry in page required to be a chain in fw would be exposed more readily by the the dmapool block stack change.

 >  
 > > So if my findings are right, the correct quirk would be "don't make PRP
 > > lists ending on a page boundary".
 > 
 > Coincidently enough, the quirk in this patch achieves that. But it's
 > great to understand why it was successful.
 > 
 > > Changing the small dma pool alignment to 512 happens to fix the issue
 > > because it never allocates a PRP list with offset 0xf00. Theoretically,
 > > the issue could still happen with the page pool, but this bridge has
 > > a max transfer size of 64 pages, which is not enough to fill an entire
 > > page-sized PRP list.
 > 
 > Thanks, this answers my question in the other thread: MDTS is too small
 > to hit the same bug with the large pool.
 > 

I hadn't appreciated the MDTS meaning that we never actually use a full page in a given transfer.
For reference, it is 6, which looks like it means only half a page will ever get used by this device. 
In case it is useful for further discussion, attached is the nvme controller identity output

-------------- next part --------------
A non-text attachment was scrubbed...
Name: nvme-id.log
Type: application/octet-stream
Size: 7870 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20241114/393674ac/attachment.obj>