max_hw_sectors error caused by recent NVMe driver commit

Daniel Gomez dagmcr at gmail.com
Fri Mar 3 08:24:43 PST 2023


Hi,

On Fri, Feb 17, 2023 at 5:05 PM Michael Kelley (LINUX)
<mikelley at microsoft.com> wrote:
>
> From: Daniel Gomez <dagmcr at gmail.com> Sent: Friday, February 17, 2023 5:28 AM
> >
> > >> value to be set.  In your example, I would guess the value of 512 Kbytes came
> > >> from querying the NVMe device for its max transfer size. Ideally, to support
> > >> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> > >> the middle of a page as described above).  But the value of max_segments
> > >> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> > >> defined in drivers/nvme/host/pci.c.  The value of 127 is chosen to make
> > >> sure that the data structure containing the scatterlist fits in a single page.
> >
> > >
> > > Should be 128 possible segments now in -next, but yeah, 129 would be ideal.
> >
> > Quoting Michael,
> >
> > >> the middle of a page as described above).  But the value of max_segments
> > >> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> > >> defined in drivers/nvme/host/pci.c.  The value of 127 is chosen to make
> > >> sure that the data structure containing the scatterlist fits in a single page.
> >
> > Yes, I can see that. I guess the 129 needs to be reduced to 127 (or
> > after Keith optimization patch to 128) but not when the device is
> > limited to a lower max_segments value because they fit anyway in a
> > single page?
>
> Yes, that's correct.   But "the device is limited to a lower max_segments
> value" isn't really because max_segments is limited.  The limit is on
> max_hw_sectors_kb derived from the NVMe controller MDTS value,
> as you have shown in your table below.  Then the max_segments value
> is derived from max_hw_sectors_kb.  For example, if max_hw_sectors_kb
> is 128 Kbytes, you can never need more than 33 segments.  Each segment
> can describe 4 Kbytes (a page), so with 128 Kbytes you get 32 segments.
> Then add 1 segment to handle the case where the memory buffer doesn't
> start on a page boundary, and you get 33.
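
Writing out that arithmetic as a quick sketch (assuming 4 KiB pages;
this is only the back-of-the-envelope calculation, not the driver
code):

#include <stdio.h>

int main(void)
{
        unsigned int max_hw_sectors_kb = 128;   /* MDTS-derived transfer limit */
        unsigned int page_kb = 4;               /* 4 KiB pages assumed */

        /* 128 KiB / 4 KiB = 32 page-sized segments for a page-aligned buffer */
        unsigned int segs = max_hw_sectors_kb / page_kb;

        /*
         * One extra segment covers a buffer that starts in the middle of a
         * page, because the same 128 KiB then spans one additional page.
         */
        printf("worst-case segments: %u\n", segs + 1);  /* prints 33 */

        return 0;
}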

I'm trying to understand and clarify this part. I found Keith's
patch [1], which introduces (and explains) this same case (quoting
the patch: 'One additional segment is added to account for a
transfer that may start in the middle of a page.'). But I'm not sure
where that case is taken into account when we split an I/O. Is it
handled in the bio_split_rw + bvec_split_segs functions?
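
To make the mid-page case concrete, here is a small sketch of the
page-span arithmetic (illustrative only; it is not what bio_split_rw()
or bvec_split_segs() literally do):

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Number of pages a buffer of 'len' bytes touches when it starts at
 * byte offset 'start' in memory. Each touched page needs a segment.
 */
static unsigned long pages_touched(unsigned long start, unsigned long len)
{
        unsigned long first = start / PAGE_SIZE;
        unsigned long last = (start + len - 1) / PAGE_SIZE;

        return last - first + 1;
}

int main(void)
{
        unsigned long len = 512UL * 1024;       /* one 512 KiB transfer */

        /* page-aligned buffer: 128 pages -> 128 segments */
        printf("aligned:   %lu pages\n", pages_touched(0, len));

        /* buffer starting 2 KiB into a page: 129 pages -> 129 segments */
        printf("unaligned: %lu pages\n", pages_touched(2048, len));

        return 0;
}

So a page-aligned 512 KiB buffer fits exactly in 128 segments, while
the same buffer shifted into the middle of a page needs 129; that
129th segment is what the quoted "+1" accounts for.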

So, for MDTS=7 and max_segments now being 128, when can we observe
and test a transfer that starts in the middle of a page and therefore
needs that extra segment? And in that case, shouldn't the split be
capped at 128 - 1 segments (508 KiB)?

In my tests [2], I can see splits being made in chunks of 128 in some
cases. Whether or not those transfers are page-aligned I don't know,
but I'd like to verify that.

[1] https://lore.kernel.org/all/1439417874-8925-1-git-send-email-keith.busch@intel.com/

[2] while true; do sleep 1; dd iflag=direct if=/dev/nvme0n1 bs=1M count=1 of=/dev/null status=progress; done
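
And here is a sketch of a user-space check for the unaligned case: an
O_DIRECT read whose buffer is 512-byte aligned (usually enough for
O_DIRECT) but deliberately not page aligned, so the transfer starts in
the middle of a page. The device path is the one from [2]; the 512 KiB
size matches the max_hw_sectors_kb discussed above. Adjust for your
setup.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
        const size_t len = 512 * 1024;          /* 512 KiB transfer */
        void *buf;
        char *p;
        int fd;
        ssize_t n;

        /* Page-aligned allocation with one extra page of slack ... */
        if (posix_memalign(&buf, 4096, len + 4096))
                return 1;
        /*
         * ... then shift the start by 512 bytes: still legal for
         * O_DIRECT, but the buffer now begins in the middle of a page.
         */
        p = (char *)buf + 512;

        fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
        if (fd < 0) {
                perror("open");
                return 1;
        }

        n = pread(fd, p, len, 0);
        printf("read %zd bytes from a non-page-aligned buffer\n", n);

        close(fd);
        free(buf);
        return 0;
}

dd with iflag=direct typically ends up with a page-aligned buffer, so
it may never take the extra-segment path; hence the explicit
misalignment here.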

On Fri, Feb 17, 2023 at 5:05 PM Michael Kelley (LINUX)
<mikelley at microsoft.com> wrote:
>
> I'm making a subtle distinction
> here between "max_segments is limited" and "you can't need more than
> XX segments for a given max_hw_sectors_kb value".
>
> Michael
>
> >
> > Following the kernel code, I can see the max_hw_sectors_kb is
> > calculated using max_hw_sectors = 2 ^ (MDTS + page_shift - 9),
> > max_hw_sectors_kb is just max_hw_sectors >> 1, and max_segments is 2 ^
> > (MDTS) + 1 (simplified). Using QEMU (NVMe emulation) to be able to
> > change the MDTS, I get the following:
> >
> >   MDTS   max_hw_sectors   max_hw_sectors_kb   max_segments
> >  ------ ---------------- ------------------- --------------
> >   0      1024             512                 127
> >   1      16               8                   3
> >   2      32               16                  5
> >   3      64               32                  9
> >   4      128              64                  17
> >   5      256              128                 33
> >   6      512              256                 65
> >   7      1024             512                 127
> >
> > >
> > > The limit confuses many because user space can sometimes get
> > > 512 KiB IO to work and other times the same program fails, all
> > > because of physical memory contiguity that user space isn't always
> > > aware of. A sure-fire way to never hit that limit is to allocate
> > > hugepages.
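
For reference, the non-zero MDTS rows of that table can be reproduced
from the simplified formulas quoted earlier (4 KiB pages and the
NVME_MAX_SEGS = 127 cap assumed; MDTS=0 means no device-imposed limit,
so that row is determined by other caps and is left out):

#include <stdio.h>

#define PAGE_SHIFT      12      /* 4 KiB pages */
#define NVME_MAX_SEGS   127     /* driver cap discussed above */

int main(void)
{
        for (unsigned int mdts = 1; mdts <= 7; mdts++) {
                /* max_hw_sectors = 2 ^ (MDTS + page_shift - 9), in 512 B units */
                unsigned int max_hw_sectors = 1u << (mdts + PAGE_SHIFT - 9);
                unsigned int max_hw_sectors_kb = max_hw_sectors >> 1;

                /* one 4 KiB segment per page, plus one for a mid-page start */
                unsigned int max_segments = max_hw_sectors_kb / 4 + 1;

                if (max_segments > NVME_MAX_SEGS)
                        max_segments = NVME_MAX_SEGS;

                printf("MDTS=%u max_hw_sectors=%-4u max_hw_sectors_kb=%-3u max_segments=%u\n",
                       mdts, max_hw_sectors, max_hw_sectors_kb, max_segments);
        }

        return 0;
}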


