[PATCH v6 0/3] bugfix: Introduce sendpages_ok() to check sendpage_ok() on contiguous pages
Ofir Gal
ofir.gal at volumez.com
Mon Jul 22 23:56:04 PDT 2024
skb_splice_from_iter() warns on !sendpage_ok(), which results in nvme-tcp
data transfer failure. This warning leads to hanging IO.
nvme-tcp uses sendpage_ok() to check only the first page of an iterator
in order to decide whether to disable MSG_SPLICE_PAGES. The iterator can
represent a list of contiguous pages.
When MSG_SPLICE_PAGES is enabled, skb_splice_from_iter() is used. It
requires all pages in the iterator to be sendable and checks each page
with sendpage_ok().
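For reference, sendpage_ok() (include/linux/net.h) considers a page
sendable when it is not a slab page and its refcount is at least one:

static inline bool sendpage_ok(struct page *page)
{
        return !PageSlab(page) && page_count(page) >= 1;
}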
nvme_tcp_try_send_data() might allow MSG_SPLICE_PAGES when the first
page is sendable but the following ones are not. skb_splice_from_iter()
will then attempt to send all the pages in the iterator, and when it
reaches an unsendable page the IO hangs.
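A simplified sketch of the check in nvme_tcp_try_send_data() before this
series (only the first page of the fragment is tested):

        /* first page only; later pages in the fragment are not checked */
        if (!sendpage_ok(page))
                msg.msg_flags &= ~MSG_SPLICE_PAGES;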
This series introduces a helper, sendpages_ok(), which returns true only
if all the contiguous pages are sendable.
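A minimal sketch of the helper (patch 1 has the authoritative version):
it walks the pages covered by len, starting from offset, and applies
sendpage_ok() to each:

static inline bool sendpages_ok(struct page *page, size_t len, size_t offset)
{
        struct page *p = page + (offset >> PAGE_SHIFT);
        size_t count = 0;

        while (count < len) {
                if (!sendpage_ok(p))
                        return false;

                p++;
                count += PAGE_SIZE;
        }

        return true;
}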
Drivers that want to send contiguous pages with MSG_SPLICE_PAGES may use
this helper to check whether the whole page list is OK. If the helper
returns false, the driver should clear the MSG_SPLICE_PAGES flag.
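For example, the nvme-tcp conversion in patch 2 boils down to checking
the whole fragment instead of just its first page (a sketch of the call
site in nvme_tcp_try_send_data()):

        if (!sendpages_ok(page, len, offset))
                msg.msg_flags &= ~MSG_SPLICE_PAGES;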
The root cause is a bug in md-bitmap: it sends a page that wasn't
allocated for the bitmap. This causes the IO to be a mixture of slab and
non-slab pages.
As Christoph Hellwig noted on v2, the issue can occur in similar cases
due to IO merges.
The bug is reproducible. Reproducing it requires nvme-over-tcp
controllers with an optimal IO size bigger than PAGE_SIZE; creating a
raid with a bitmap over those devices reproduces the bug.
In order to simulate a large optimal IO size you can use dm-stripe with
a single device.
A test that reproduces the issue on top of brd devices using dm-stripe
has been submitted to blktests ("md: add regression test for
"md/md-bitmap: fix writing non bitmap pages").
The bug will no longer reproduce once "md/md-bitmap: fix writing non
bitmap pages" is merged, because it solves the root cause issue. A
different test can be devised to reproduce the bug.
To test this theory I added three debug prints: one in
nvme_tcp_try_send_data() and two in skb_splice_from_iter(), the first
before the sendpage_ok() check and the second on !sendpage_ok(), after
the warning.
nvme_tcp: sendpage_ok, page: 0x654eccd7 (pfn: 120755), len: 262144, offset: 0
skbuff: before sendpage_ok - i: 0. page: 0x654eccd7 (pfn: 120755)
skbuff: before sendpage_ok - i: 1. page: 0x1666a4da (pfn: 120756)
skbuff: before sendpage_ok - i: 2. page: 0x54f9f140 (pfn: 120757)
WARNING: at net/core/skbuff.c:6848 skb_splice_from_iter+0x142/0x450
skbuff: !sendpage_ok - page: 0x54f9f140 (pfn: 120757). is_slab: 1, page_count: 1
...
stack trace:
...
WARNING: at net/core/skbuff.c:6848 skb_splice_from_iter+0x141/0x450
Workqueue: nvme_tcp_wq nvme_tcp_io_work
Call Trace:
? show_regs+0x6a/0x80
? skb_splice_from_iter+0x141/0x450
? __warn+0x8d/0x130
? skb_splice_from_iter+0x141/0x450
? report_bug+0x18c/0x1a0
? handle_bug+0x40/0x70
? exc_invalid_op+0x19/0x70
? asm_exc_invalid_op+0x1b/0x20
? skb_splice_from_iter+0x141/0x450
tcp_sendmsg_locked+0x39e/0xee0
? _prb_read_valid+0x216/0x290
tcp_sendmsg+0x2d/0x50
inet_sendmsg+0x43/0x80
sock_sendmsg+0x102/0x130
? vprintk_default+0x1d/0x30
? vprintk+0x3c/0x70
? _printk+0x58/0x80
nvme_tcp_try_send_data+0x17d/0x530
nvme_tcp_try_send+0x1b7/0x300
nvme_tcp_io_work+0x3c/0xc0
process_one_work+0x22e/0x420
worker_thread+0x50/0x3f0
? __pfx_worker_thread+0x10/0x10
kthread+0xd6/0x100
? __pfx_kthread+0x10/0x10
ret_from_fork+0x3c/0x60
? __pfx_kthread+0x10/0x10
ret_from_fork_asm+0x1b/0x30
...
---
Changelog:
v6, added Acked-by tag from Jakub.
v5, removed the libceph patch as it is not necessary.
v4, moved assignment to declaration in sendpages_ok(), added review tags
from Sagi Grimberg.
v3, removed the DIV_ROUND_UP as Sagi suggested, added Reviewed-by tags
from Christoph Hellwig, Hannes Reinecke and Christoph Böhmwalder, and
added an explanation of the root cause issue to the cover letter.
v2, fixed a typo in the patch subject.
Ofir Gal (3):
net: introduce helper sendpages_ok()
nvme-tcp: use sendpages_ok() instead of sendpage_ok()
drbd: use sendpages_ok() instead of sendpage_ok()
drivers/block/drbd/drbd_main.c | 2 +-
drivers/nvme/host/tcp.c | 2 +-
include/linux/net.h | 19 +++++++++++++++++++
3 files changed, 21 insertions(+), 2 deletions(-)
--
2.45.1