max_hw_sectors error caused by recent NVMe driver commit

Michael Kelley (LINUX) mikelley at microsoft.com
Fri Feb 17 08:05:14 PST 2023


From: Daniel Gomez <dagmcr at gmail.com> Sent: Friday, February 17, 2023 5:28 AM
> 
> >> value to be set.  In your example, I would guess the value of 512 Kbytes came
> >> from querying the NVMe device for its max transfer size. Ideally, to support
> >> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> >> the middle of a page as described above).  But the value of max_segments
> >> is limited by the NVMe driver itself using the value of NVME_MAX_SEGS
> >> defined in drivers/nvme/host/pci.c.  The value of 127 is chosen to make
> >> sure that the data structure containing the scatterlist fits in a single page.
> 
> >
> > Should be 128 possible segments now in -next, but yeah, 129 would be ideal.
> 
> Quoting Michael,
> 
> >> the middle of a page as described above).  But the value of max_segments
> >> is limited by the NVMe driver itself using the value of NVME_MAX_SEGS
> >> defined in drivers/nvme/host/pci.c.  The value of 127 is chosen to make
> >> sure that the data structure containing the scatterlist fits in a single page.
> 
> Yes, I can see that. I guess the 129 needs to be reduced to 127 (or
> to 128 after Keith's optimization patch) but not when the device is
> limited to a lower max_segments value because they fit anyway in a
> single page?

Yes, that's correct.  But "the device is limited to a lower max_segments
value" isn't really because max_segments itself is limited.  The limit is
on max_hw_sectors_kb, which is derived from the NVMe controller's MDTS
value, as you have shown in your table below.  The max_segments value is
then derived from max_hw_sectors_kb.  For example, if max_hw_sectors_kb
is 128 Kbytes, you can never need more than 33 segments.  In the worst
case each segment describes only 4 Kbytes (a page), so 128 Kbytes needs
32 segments.  Then add 1 segment to handle the case where the memory
buffer doesn't start on a page boundary, and you get 33.  I'm making a
subtle distinction here between "max_segments is limited" and "you can't
need more than XX segments for a given max_hw_sectors_kb value".
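
As a rough illustration of that arithmetic, the worst-case segment count
for a given max_hw_sectors_kb could be sketched like this.  The helper
name worst_case_segments and the 4 Kbyte page size are assumptions made
for the example; this is not kernel code.

```python
PAGE_SIZE_KB = 4  # assumed page size (4 Kbytes), as in the example above


def worst_case_segments(max_hw_sectors_kb, page_size_kb=PAGE_SIZE_KB):
    """Most segments a transfer of max_hw_sectors_kb can ever need.

    Hypothetical helper, not a kernel function.  Assumes the worst case
    where each segment covers exactly one page, plus one extra segment
    for a buffer that does not start on a page boundary.
    """
    full_pages = max_hw_sectors_kb // page_size_kb
    return full_pages + 1


# Print the worst-case count for a few transfer sizes (in Kbytes).
for kb in (64, 128, 256, 512):
    print(kb, worst_case_segments(kb))
```

With these assumptions, 128 Kbytes gives 33 and 512 Kbytes gives 129,
which is where the "129 would be ideal" figure comes from.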

Michael

> 
> Following the kernel code, I can see that max_hw_sectors is
> calculated as 2 ^ (MDTS + page_shift - 9), max_hw_sectors_kb is
> just max_hw_sectors >> 1, and max_segments is 2 ^ (MDTS) + 1
> (simplified). Using QEMU (NVMe emulation) to be able to
> change the MDTS, I get the following:
> 
>   MDTS   max_hw_sectors   max_hw_sectors_kb   max_segments
>  ------ ---------------- ------------------- --------------
>   0      1024             512                 127
>   1      16               8                   3
>   2      32               16                  5
>   3      64               32                  9
>   4      128              64                  17
>   5      256              128                 33
>   6      512              256                 65
>   7      1024             512                 127
> 
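The non-zero MDTS rows of this table can be reproduced from the
simplified formulas with a short sketch.  It assumes page_shift = 12
(4 Kbyte pages) and the NVME_MAX_SEGS cap of 127 mentioned earlier in
the thread; limits_for_mdts is a hypothetical name, and MDTS = 0 is
left out because the formula does not describe that row.

```python
PAGE_SHIFT = 12        # assumed 4 Kbyte page size
NVME_MAX_SEGS = 127    # driver cap quoted earlier in this thread


def limits_for_mdts(mdts, page_shift=PAGE_SHIFT, max_segs=NVME_MAX_SEGS):
    """Return (max_hw_sectors, max_hw_sectors_kb, max_segments).

    Hypothetical helper that follows the simplified formulas in the
    quoted mail; only meaningful for MDTS >= 1.
    """
    if mdts < 1:
        raise ValueError("MDTS=0 is not covered by this formula")
    max_hw_sectors = 1 << (mdts + page_shift - 9)    # 512-byte sectors
    max_hw_sectors_kb = max_hw_sectors >> 1          # sectors -> Kbytes
    max_segments = min((1 << mdts) + 1, max_segs)    # capped by the driver
    return max_hw_sectors, max_hw_sectors_kb, max_segments


# Print one row per MDTS value, in the same column order as the table.
for mdts in range(1, 8):
    print(mdts, *limits_for_mdts(mdts))
```

Only the MDTS = 7 row hits the cap: the formula would give 129
segments, and the driver limit reduces it to 127.
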
> >
> > The limit confuses many because user space can sometimes get 512 KiB
> > IO to work and other times the same program fails, and all because of physical
> > memory contiguity that user space isn't always aware of. A sure-fire way to
> > never hit that limit is to allocate hugepages.
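
A toy model of that effect, assuming 4 Kbyte pages and the 127-segment
cap discussed in this thread: physically adjacent pages are counted as
one segment, so the same 512 KiB buffer can need anywhere from 1 to 128
segments depending on where its pages happen to land.  count_segments
and the page frame numbers are made up for illustration, and real
merging in the block layer is subject to further limits not modeled
here.

```python
NVME_MAX_SEGS = 127          # cap discussed in this thread
PAGES_IN_512K = 512 // 4     # 128 pages, assuming 4 Kbyte pages


def count_segments(page_frames):
    """Count runs of physically consecutive page frame numbers.

    Hypothetical model: each run of adjacent frames becomes one segment.
    """
    segments = 0
    previous = None
    for pfn in page_frames:
        if previous is None or pfn != previous + 1:
            segments += 1    # a gap in physical memory starts a new segment
        previous = pfn
    return segments


# Made-up layouts: every page isolated vs. one physically contiguous run.
scattered = [2 * i for i in range(PAGES_IN_512K)]
contiguous = list(range(PAGES_IN_512K))

for name, frames in (("scattered", scattered), ("contiguous", contiguous)):
    needed = count_segments(frames)
    print(name, needed, "fits" if needed <= NVME_MAX_SEGS else "too many")
```

In this model the fully scattered buffer needs 128 segments and exceeds
the cap, while the contiguous one (as a hugepage-backed buffer would be)
needs a single segment.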


