max_hw_sectors error caused by recent NVMe driver commit
Daniel Gomez
dagmcr at gmail.com
Fri Mar 3 08:24:43 PST 2023
Hi,
On Fri, Feb 17, 2023 at 5:05 PM Michael Kelley (LINUX)
<mikelley at microsoft.com> wrote:
>
> From: Daniel Gomez <dagmcr at gmail.com> Sent: Friday, February 17, 2023 5:28 AM
> >
> > >> value to be set. In your example, I would guess the value of 512 Kbytes came
> > >> from querying the NVMe device for its max transfer size. Ideally, to support
> > >> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> > >> the middle of a page as described above). But the value of max_segments
> > >> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> > >> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
> > >> sure that the data structure containing the scatterlist fits in a single page.
> >
> > >
> > > Should be 128 possible segments now in -next, but yeah, 129 would be ideal.
> >
> > Quoting Michael,
> >
> > >> the middle of a page as described above). But the value of max_segments
> > >> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> > >> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
> > >> sure that the data structure containing the scatterlist fits in a single page.
> >
> > Yes, I can see that. I guess the 129 needs to be reduced to 127 (or,
> > after Keith's optimization patch, to 128), but not when the device is
> > limited to a lower max_segments value, because the descriptors fit
> > anyway in a single page?
>
> Yes, that's correct. But "the device is limited to a lower max_segments
> value" isn't really because max_segments is limited. The limit is on
> max_hw_sectors_kb derived from the NVMe controller MDTS value,
> as you have shown in your table below. Then the max_segments value
> is derived from max_hw_sectors_kb. For example, if max_hw_sectors_kb
> is 128 Kbytes, you can never need more than 33 segments. Each segment
> can describe 4 Kbytes (a page), so with 128 Kbytes you get 32 segments.
> Then add 1 segment to handle the case where the memory buffer doesn't
> start on a page boundary, and you get 33.
I'm trying to understand and clarify this part. I found Keith's patch [1]
where he addresses (and introduces) the same case (quoting the patch:
'One additional segment is added to account for a transfer that may
start in the middle of a page.'). But I'm not sure when that case needs
to be accounted for when we split an I/O. Is that handled in the
bio_split_rw + bvec_split_segs functions?
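To check that I'm following the arithmetic, here is a toy user-space
calculation (not the driver code; the segs_needed helper is my own) that
reproduces the 33-segment example above and the 129-segment worst case
for 512 KiB:

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Worst-case number of 4 KiB segments for a transfer of 'bytes' that
 * starts 'offset' bytes into a page. */
static unsigned long segs_needed(unsigned long bytes, unsigned long offset)
{
    unsigned long first = offset % PAGE_SIZE;

    if (first) {
        unsigned long head = PAGE_SIZE - first;

        if (bytes <= head)
            return 1;
        bytes -= head;
        return 1 + (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
    }
    return (bytes + PAGE_SIZE - 1) / PAGE_SIZE;
}

int main(void)
{
    printf("128 KiB, page aligned:   %lu\n", segs_needed(128 * 1024, 0));
    printf("128 KiB, mid-page start: %lu\n", segs_needed(128 * 1024, 512));
    printf("512 KiB, mid-page start: %lu\n", segs_needed(512 * 1024, 512));
    return 0;
}

It prints 32, 33 and 129, matching Michael's example and the ideal 129
that the driver caps at NVME_MAX_SEGS.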
So, for mdts=7 and max_segments now being 128, when can we see and test
a transfer that starts in the middle of a page and therefore needs that
extra segment? In that case, shouldn't the split be capped at 128 - 1
segments (508 KiB)?
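If it helps, this is the kind of test I have in mind to force that case:
an O_DIRECT read whose buffer is deliberately offset 512 bytes from a
page boundary. The device path is only an example, and a device with a
4 KiB logical block size would reject the 512-byte alignment:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define XFER_SIZE (512 * 1024) /* 512 KiB, i.e. the mdts=7 limit */

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/nvme0n1";
    void *buf;
    char *p;
    ssize_t ret;
    int fd;

    /* Page-aligned allocation, then skip 512 bytes so the transfer
     * starts in the middle of a page. */
    if (posix_memalign(&buf, 4096, XFER_SIZE + 4096)) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    p = (char *)buf + 512;

    fd = open(dev, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    ret = pread(fd, p, XFER_SIZE, 0);
    if (ret < 0)
        perror("pread");
    else
        printf("read %zd bytes with a page-misaligned buffer\n", ret);

    close(fd);
    free(buf);
    return 0;
}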
In my tests [2], I can see splits being made in chunks of 128 segments in
some cases. Whether or not those cases are page-aligned I don't know, but
I'd like to verify that.
[1] https://lore.kernel.org/all/1439417874-8925-1-git-send-email-keith.busch@intel.com/
[2] while true; do sleep 1; dd iflag=direct if=/dev/nvme0n1 bs=1M
count=1 of=/dev/null status=progress; done
On Fri, Feb 17, 2023 at 5:05 PM Michael Kelley (LINUX)
<mikelley at microsoft.com> wrote:
> [...]
> Then add 1 segment to handle the case where the memory buffer doesn't
> start on a page boundary, and you get 33. I'm making a subtle distinction
> here between "max_segments is limited" and "you can't need more than
> XX segments for a given max_hw_sectors_kb value".
>
> Michael
>
> >
> > Following the kernel code, I can see the max_hw_sectors_kb is
> > calculated using max_hw_sectors = 2 ^ (MDTS + page_shift - 9),
> > max_hw_sectors_kb is just max_hw_sectors >> 1, and max_segments is 2 ^
> > (MDTS) + 1 (simplified). Using QEMU (NVMe emulation) to be able to
> > change the MDTS, I get the following:
> >
> > MDTS   max_hw_sectors   max_hw_sectors_kb   max_segments
> > ----   --------------   -----------------   ------------
> >    0             1024                 512            127
> >    1               16                   8              3
> >    2               32                  16              5
> >    3               64                  32              9
> >    4              128                  64             17
> >    5              256                 128             33
> >    6              512                 256             65
> >    7             1024                 512            127
> >
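Side note: the table above can be reproduced from those expressions with
a few lines of C. PAGE_SHIFT and the pre-patch NVME_MAX_SEGS cap of 127
are hard-coded here just for illustration, and MDTS=0 (no limit) is left
out because the driver's own default applies there:

#include <stdio.h>

#define PAGE_SHIFT    12   /* 4 KiB pages */
#define NVME_MAX_SEGS 127  /* driver cap before the -next change to 128 */

int main(void)
{
    for (unsigned int mdts = 1; mdts <= 7; mdts++) {
        unsigned long max_hw_sectors = 1UL << (mdts + PAGE_SHIFT - 9);
        unsigned long max_hw_sectors_kb = max_hw_sectors >> 1;
        unsigned long max_segments = (1UL << mdts) + 1;

        if (max_segments > NVME_MAX_SEGS)
            max_segments = NVME_MAX_SEGS;

        printf("%u\t%lu\t%lu\t%lu\n", mdts, max_hw_sectors,
               max_hw_sectors_kb, max_segments);
    }
    return 0;
}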
> > >
> > > The limit confuses many because user space can sometimes get 512 KiB
> > > IO to work and other times the same program fails, all because of physical
> > > memory contiguity that user space isn't always aware of. A sure-fire way to
> > > never hit that limit is to allocate hugepages.
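For completeness, the hugepage route Keith mentions can be tried with an
explicit hugetlb mapping. A minimal sketch, assuming 2 MiB hugepages have
been reserved on the system:

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define HUGE_SIZE (2UL * 1024 * 1024) /* one 2 MiB hugepage */

int main(void)
{
    /* Needs reserved hugepages, e.g.: echo 16 > /proc/sys/vm/nr_hugepages */
    void *buf = mmap(NULL, HUGE_SIZE, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

    if (buf == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    memset(buf, 0, HUGE_SIZE); /* fault the hugepage in */
    printf("2 MiB hugepage-backed buffer at %p\n", buf);
    /* Used as the O_DIRECT buffer, the memory is physically contiguous,
     * so, as I understand it, a 512 KiB transfer never needs the
     * worst-case segment count. */
    munmap(buf, HUGE_SIZE);
    return 0;
}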