5.10.40-1 - Invalid SGL for payload:131072 nents:13

Fri Jul 23 16:35:19 PDT 2021

Hi,

On Tue, Jul 20, 2021 at 05:34:34PM -0700, Keith Busch wrote:
> On Tue, Jul 20, 2021 at 10:07:33PM +0000, Andy Smith wrote:
> > I have a Debian stable machine with a Samsung PM983 NVMe and a
> > Samsung SM883 in an MD RAID-1. It's been running the 4.19.x Debian
> > packaged kernel for almost 2 years now.
> > 
> > About 24 hours ago I upgraded its kernel to the buster-backports
> > kernel which is version 5.10.40-1~bpo10+1 and around four hours
> > after that I got this:
> > 
> > Jul 20 02:17:54 lamb kernel: [21061.388607] sg[0] phys_addr:0x00000015eb803000 offset:0 length:4096 dma_address:0x000000209e7b7000 dma_length:4096
> > Jul 20 02:17:54 lamb kernel: [21061.389775] sg[1] phys_addr:0x00000015eb7bc000 offset:0 length:4096 dma_address:0x000000209e7b8000 dma_length:4096
> > Jul 20 02:17:54 lamb kernel: [21061.390874] sg[2] phys_addr:0x00000015eb809000 offset:0 length:4096 dma_address:0x000000209e7b9000 dma_length:4096
> > Jul 20 02:17:54 lamb kernel: [21061.391974] sg[3] phys_addr:0x00000015eb766000 offset:0 length:4096 dma_address:0x000000209e7ba000 dma_length:4096
> > Jul 20 02:17:54 lamb kernel: [21061.393042] sg[4] phys_addr:0x00000015eb7a3000 offset:0 length:4096 dma_address:0x000000209e7bb000 dma_length:4096
> > Jul 20 02:17:54 lamb kernel: [21061.394086] sg[5] phys_addr:0x00000015eb7c6000 offset:0 length:4096 dma_address:0x000000209e7bc000 dma_length:4096
> > Jul 20 02:17:54 lamb kernel: [21061.395078] sg[6] phys_addr:0x00000015eb7c2000 offset:0 length:4096 dma_address:0x000000209e7bd000 dma_length:4096
> > Jul 20 02:17:54 lamb kernel: [21061.396042] sg[7] phys_addr:0x00000015eb7a9000 offset:0 length:4096 dma_address:0x000000209e7be000 dma_length:4096
> > Jul 20 02:17:54 lamb kernel: [21061.397004] sg[8] phys_addr:0x00000015eb775000 offset:0 length:4096 dma_address:0x000000209e7bf000 dma_length:4096
> > Jul 20 02:17:54 lamb kernel: [21061.397971] sg[9] phys_addr:0x00000015eb7c7000 offset:0 length:4096 dma_address:0x00000020ff520000 dma_length:4096
> > Jul 20 02:17:54 lamb kernel: [21061.398889] sg[10] phys_addr:0x00000015eb7cb000 offset:0 length:4096 dma_address:0x00000020ff521000 dma_length:4096
> > Jul 20 02:17:54 lamb kernel: [21061.399814] sg[11] phys_addr:0x00000015eb7e3000 offset:0 length:61952 dma_address:0x00000020ff522000 dma_length:61952
> > Jul 20 02:17:54 lamb kernel: [21061.400754] sg[12] phys_addr:0x00000015eb7f2200 offset:512 length:24064 dma_address:0x00000020ff531200 dma_length:24064
> 
> Perhaps we should add the virt_addr in this print. If it was there, I
> think it should show that the phys offset doesn't match the virtual
> offset, which we are depending on.
> 
> Are you using swiotlb? If so, this recent patch sounds like it should
> fix offset issues:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit?id=5f89468e2f060031cd89fd4287298e0eaf246bf6

I was struggling to reproduce the above issue on my test hardware, so
I took the time to resolve the sector offsets into logical volumes
to work out which Xen guests were involved. I found that two guests
have triggered it, and both of them have partitioned their block
device in an unaligned fashion, e.g. with a partition starting at
sector 63 with 512-byte sectors.
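
To illustrate what I mean by unaligned, here is a minimal Python sketch
of the arithmetic. The 4096-byte boundary is my assumption (a typical
page size), and misalignment() is just a name made up for this example:

```python
SECTOR_SIZE = 512   # bytes per sector, as on the affected guests
BOUNDARY = 4096     # assumed alignment boundary (typical page size)


def misalignment(start_sector, sector_size=SECTOR_SIZE, boundary=BOUNDARY):
    """Return how many bytes past the previous boundary a partition starts."""
    return (start_sector * sector_size) % boundary


# Old DOS-style layout: first partition at sector 63.
print(misalignment(63))    # 63 * 512 = 32256 bytes, not a multiple of 4096
# Modern layout: first partition at sector 2048 (1 MiB).
print(misalignment(2048))  # 0, i.e. aligned
```

A partition starting at sector 63 begins 3584 bytes past a 4096-byte
boundary, so I/O that is aligned inside the guest is shifted relative
to the host's view of the device.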

Making the same setup on a guest on my test host, I can now reliably
trigger this within a minute or so using fio.

I should now be able to test the patch mentioned above and/or
bisect to see what changed. Just thought I'd mention it in case the
unaligned nature sparks any other memories for anyone.
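
In case it is useful, here is a rough Python sketch that checks the
sg[] dump quoted above against the PRP rule as I understand it: every
element except the first should start on a page boundary, and every
element except the last should end on one. The 4096-byte page size and
the function name are assumptions of mine, and this is not the
kernel's code:

```python
PAGE = 4096  # assumed controller page size

# (dma_address, dma_length) pairs copied from the sg[] dump quoted above.
SG = [
    (0x000000209e7b7000, 4096),
    (0x000000209e7b8000, 4096),
    (0x000000209e7b9000, 4096),
    (0x000000209e7ba000, 4096),
    (0x000000209e7bb000, 4096),
    (0x000000209e7bc000, 4096),
    (0x000000209e7bd000, 4096),
    (0x000000209e7be000, 4096),
    (0x000000209e7bf000, 4096),
    (0x00000020ff520000, 4096),
    (0x00000020ff521000, 4096),
    (0x00000020ff522000, 61952),
    (0x00000020ff531200, 24064),
]


def prp_violations(sg, page=PAGE):
    """List (index, 'start' | 'end') for elements breaking the page rule."""
    bad = []
    last = len(sg) - 1
    for i, (addr, length) in enumerate(sg):
        # Every element but the first must begin on a page boundary.
        if i != 0 and addr % page:
            bad.append((i, "start"))
        # Every element but the last must finish on a page boundary.
        if i != last and (addr + length) % page:
            bad.append((i, "end"))
    return bad


print(prp_violations(SG))
print(sum(length for _, length in SG))
```

On the dump above this flags sg[11], which ends 0x200 bytes past a
page boundary, and sg[12], which starts there. The lengths add up to
the 131072-byte payload from the warning.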

Thanks,
Andy