max_hw_sectors error caused by recent NVMe driver commit
Michael Kelley (LINUX)
mikelley at microsoft.com
Fri Feb 17 08:05:14 PST 2023
From: Daniel Gomez <dagmcr at gmail.com> Sent: Friday, February 17, 2023 5:28 AM
>
> >> value to be set. In your example, I would guess the value of 512 Kbytes came
> >> from querying the NVMe device for its max transfer size. Ideally, to support
> >> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> >> the middle of a page as described above). But the value of max_segments
> >> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> >> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
> >> sure that the data structure containing the scatterlist fits in a single page.
>
> >
> > Should be 128 possible segments now in -next, but yeah, 129 would be ideal.
>
> Quoting Michael,
>
> >> the middle of a page as described above). But the value of max_segments
> >> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> >> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
> >> sure that the data structure containing the scatterlist fits in a single page.
>
> Yes, I can see that. I guess the 129 needs to be reduced to 127 (or
> after Keith's optimization patch to 128) but not when the device is
> limited to a lower max_segments value because they fit anyway in a
> single page?
Yes, that's correct. But "the device is limited to a lower max_segments
value" isn't really because max_segments is limited. The limit is on
max_hw_sectors_kb derived from the NVMe controller MDTS value,
as you have shown in your table below. Then the max_segments value
is derived from max_hw_sectors_kb. For example, if max_hw_sectors_kb
is 128 Kbytes, you can never need more than 33 segments. Each segment
can describe 4 Kbytes (a page), so with 128 Kbytes you get 32 segments.
Then add 1 segment to handle the case where the memory buffer doesn't
start on a page boundary, and you get 33. I'm making a subtle distinction
here between "max_segments is limited" and "you can't need more than
XX segments for a given max_hw_sectors_kb value".
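
As a rough illustration of the arithmetic above (a sketch only, not the
kernel's actual code), the Python below recomputes the limits from MDTS
using the simplified formulas quoted later in this message. It assumes
4 Kbyte pages (page_shift = 12) and the NVME_MAX_SEGS value of 127
discussed in this thread. The function names are made up for the example.

```python
PAGE_SHIFT = 12      # assumption: 4 Kbyte pages
SECTOR_SHIFT = 9     # 512-byte sectors
NVME_MAX_SEGS = 127  # value discussed in this thread (128 in -next)


def limits_from_mdts(mdts):
    """Return (max_hw_sectors, max_hw_sectors_kb) for a non-zero MDTS."""
    max_hw_sectors = 1 << (mdts + PAGE_SHIFT - SECTOR_SHIFT)
    max_hw_sectors_kb = max_hw_sectors >> 1  # two 512-byte sectors per Kbyte
    return max_hw_sectors, max_hw_sectors_kb


def segments_needed(max_hw_sectors_kb):
    """One segment per page, plus one for a buffer not starting on a page boundary."""
    page_kb = (1 << PAGE_SHIFT) // 1024
    return max_hw_sectors_kb // page_kb + 1


def max_segments(max_hw_sectors_kb, cap=NVME_MAX_SEGS):
    """Segments needed for the transfer size, limited by the driver's cap."""
    return min(segments_needed(max_hw_sectors_kb), cap)


if __name__ == "__main__":
    for mdts in range(1, 8):
        sectors, kb = limits_from_mdts(mdts)
        print(mdts, sectors, kb, segments_needed(kb), max_segments(kb))
```

The two segment functions are kept separate to mirror the distinction
above: segments_needed() is what a given max_hw_sectors_kb can ever
require, while max_segments() is that number after the driver's cap.
The cap only matters once the transfer size reaches 512 Kbytes.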
Michael
>
> Following the kernel code, I can see the max_hw_sectors_kb is
> calculated using max_hw_sectors = 2 ^ (MDTS + page_shift - 9),
> max_hw_sectors_kb is just max_hw_sectors >> 1, and max_segments is 2 ^
> (MDTS) + 1 (simplified). Using QEMU (NVMe emulation) to be able to
> change the MDTS, I get the following:
>
> MDTS   max_hw_sectors   max_hw_sectors_kb   max_segments
> ------ ---------------- ------------------- --------------
> 0      1024             512                 127
> 1      16               8                   3
> 2      32               16                  5
> 3      64               32                  9
> 4      128              64                  17
> 5      256              128                 33
> 6      512              256                 65
> 7      1024             512                 127
>
> >
> > The limit confuses many because user space can sometimes get 512 KiB
> > IO to work and other times the same program fails, all because of physical
> > memory contiguity that user space isn't always aware of. A sure-fire way to
> > never hit that limit is to allocate hugepages.