max_hw_sectors error caused by recent NVMe driver commit

Keith Busch kbusch at kernel.org
Fri Mar 3 08:44:31 PST 2023


On Fri, Mar 03, 2023 at 05:24:43PM +0100, Daniel Gomez wrote:
> On Fri, Feb 17, 2023 at 5:05 PM Michael Kelley (LINUX)
> <mikelley at microsoft.com> wrote:
> >
> > Yes, that's correct.   But "the device is limited to a lower max_segments
> > value" isn't really because max_segments is limited.  The limit is on
> > max_hw_sectors_kb derived from the NVMe controller MDTS value,
> > as you have shown in your table below.  Then the max_segments value
> > is derived from max_hw_sectors_kb.  For example, if max_hw_sectors_kb
> > is 128 Kbytes, you can never need more than 33 segments.  Each segment
> > can describe 4 Kbytes (a page), so with 128 Kbytes you get 32 segments.
> > Then add 1 segment to handle the case where the memory buffer doesn't
> > start on a page boundary, and you get 33.
> 
> I'm trying to understand and clarify this part. So, I found Keith's
> patch [1] where he addresses (and introduces) the same case (quote
> from the patch: 'One additional segment is added to account for a
> transfer that may start in the middle of a page.'). But I'm not sure
> when that case should be considered when we split an I/O. Is that
> handled in bio_split_rw + bvec_split_segs functions?

The function responsible for splitting bios to the request_queue's limits is
__bio_split_to_limits(). This happens before blk-mq forms the 'struct request' for
the low-level driver in blk_mq_submit_bio().

> So, for mdts=7 and max_segments being now 128, when can we see and
> test that a transfer starts in the middle of a page and therefore,
> that we need an extra segment for that? In that case, shouldn't the
> split max at 128 - 1 (508 KiB)?

You can ensure you start in the middle of a page by allocating page-aligned
memory, then adding an offset.

The number of segments you actually get from your allocation depends on how
many discontiguous pages the system backs your virtual memory with. User space
doesn't have direct control over that unless you're using huge pages. It's
entirely possible a 512KiB buffer has just one segment, or 129 (assuming 4k
pages), or any number in between.

> In my tests [2], I can see splits are made in chunks of 128 in some
> cases. Whether or not these cases are page-aligned I don’t know but
> I'd like to verify that.
> 
> [1] https://lore.kernel.org/all/1439417874-8925-1-git-send-email-keith.busch@intel.com/
> 
> [2] while true; do sleep 1; dd iflag=direct if=/dev/nvme0n1 bs=1M
> count=1 of=/dev/null status=progress; done



More information about the Linux-nvme mailing list