[PATCH 0/4] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages

Ofir Gal ofir.gal at volumez.com
Mon Jun 3 03:32:54 PDT 2024



On 30/05/2024 20:58, Sagi Grimberg wrote:
> Hey Ofir,
>
> On 30/05/2024 16:26, Ofir Gal wrote:
>> skb_splice_from_iter() warns on !sendpage_ok(), which results in nvme-tcp
>> data transfer failure. This warning leads to hanging IO.
>>
>> nvme-tcp uses sendpage_ok() to check the first page of an iterator and
>> disables MSG_SPLICE_PAGES if that page is not sendable. The iterator can
>> represent a list of contiguous pages.
>>
>> When MSG_SPLICE_PAGES is enabled, skb_splice_from_iter() is used; it
>> requires all pages in the iterator to be sendable and checks each page
>> with sendpage_ok().
>>
>> nvme_tcp_try_send_data() might allow MSG_SPLICE_PAGES when the first
>> page is sendable but the following ones are not. skb_splice_from_iter()
>> will attempt to send all the pages in the iterator; when it reaches an
>> unsendable page, the IO hangs.
>
> Interesting. Do you know where this buffer came from? I find it strange
> that we get a bvec with a contiguous segment which consists of non-slab
> originated pages together with slab originated pages... it is surprising
> to see a mix of the two.

I find it strange as well; I haven't investigated the origin of the IO
yet. I suspect the first 2 pages are the superblocks of the raid
(mdp_superblock_1 and bitmap_super_s) and the rest of the IO is the
bitmap.

I stumbled on the same issue when running xfs_format (couldn't
reproduce it from scratch). I suspect there are other cases that mix
slab pages and non-slab pages.
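
For reference, the per-page check that skb_splice_from_iter() applies is
sendpage_ok() from include/linux/net.h, which boils down to:

        static inline bool sendpage_ok(struct page *page)
        {
                return !PageSlab(page) && page_count(page) >= 1;
        }

So a slab-allocated page (e.g. a kmalloc'd superblock) anywhere in the
buffer fails the check even when the first page passes.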

> I'm wondering if this is something that happened before David's splice_pages
> changes. Maybe before that with multipage bvecs? Anyway, it is strange; I've
> never seen that.

I haven't bisected to the commit that caused the behavior, but I have
tested Ubuntu with a 6.2.0 kernel and the bug didn't occur (6.2.0
doesn't contain David's splice_pages changes).

I'm not familiar with the "multipage bvecs" patch; which patch are you
referring to?

> David, strange that nvme-tcp is setting a single contiguous-element bvec
> but it is broken up into PAGE_SIZE increments in skb_splice_from_iter...
>
>>
>> The patch introduces a helper, sendpages_ok(), which returns true if
>> all the contiguous pages are sendable.
>>
>> Drivers that want to send contiguous pages with MSG_SPLICE_PAGES may use
>> this helper to check whether the page list is OK. If the helper does not
>> return true, the driver should clear the MSG_SPLICE_PAGES flag.
>>
>>
>> The bug is reproducible; to reproduce it we need nvme-over-tcp
>> controllers with an optimal IO size bigger than PAGE_SIZE. Creating a
>> raid with a bitmap over those devices reproduces the bug.
>>
>> To simulate a large optimal IO size you can use dm-stripe with a
>> single device. A script that reproduces the issue on top of brd
>> devices using dm-stripe is attached below.
>
> This is a great candidate for blktests. It would be very beneficial to
> have it added there.
Good idea, will do!
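
For anyone following along, here is a minimal sketch of what the proposed
helper checks (the exact signature and placement in the series may
differ); it simply runs sendpage_ok() over every page spanned by the
buffer:

        static inline bool sendpages_ok(struct page *page, size_t len,
                                        size_t offset)
        {
                struct page *p = page + (offset >> PAGE_SHIFT);
                size_t count = 0;

                /* Walk the contiguous pages covering len bytes and
                 * check each one with sendpage_ok().
                 */
                while (count < len) {
                        if (!sendpage_ok(p))
                                return false;

                        p++;
                        count += PAGE_SIZE;
                }

                return true;
        }

A caller such as nvme_tcp_try_send_data() would then gate MSG_SPLICE_PAGES
on sendpages_ok() for the whole segment instead of sendpage_ok() on the
first page only.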


