max_hw_sectors error caused by recent NVMe driver commit

Daniel Gomez dagmcr at gmail.com
Fri Feb 17 05:28:25 PST 2023


On Mon, Feb 13, 2023 at 5:57 PM Keith Busch <kbusch at kernel.org> wrote:
>
> On Mon, Feb 13, 2023 at 04:42:31PM +0000, Michael Kelley (LINUX) wrote:
> > Ideally, to support
> > 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> > the middle of a page as described above).  But the value of max_segments
> > is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> > defined in drivers/nvme/host/pci.c.  The value of 127 is chosen to make
> > sure that the data structure containing the scatterlist fits in a single page.
> > See nvme_pci_alloc_iod_mempool().
Michael,

Thanks for clarifying, that really helps. Also, the error is fixed after
applying Christoph's patch [1].

[1] https://lore.kernel.org/all/20230213072035.288225-1-hch@lst.de/

>> value to be set.  In your example, I would guess the value of 512 Kbytes came
>> from querying the NVMe device for its max transfer size. Ideally, to support
>> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
>> the middle of a page as described above).  But the value of max_segments
>> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
>> defined in drivers/nvme/host/pci.c.  The value of 127 is chosen to make
>> sure that the data structure containing the scatterlist fits in a single page.

>
> Should be 128 possible segments now in -next, but yeah, 129 would be ideal.

Quoting Michael,

>> the middle of a page as described above).  But the value of max_segments
>> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
>> defined in drivers/nvme/host/pci.c.  The value of 127 is chosen to make
>> sure that the data structure containing the scatterlist fits in a single page.

Yes, I can see that. I guess the 129 needs to be reduced to 127 (or to
128 after Keith's optimization patch), but not when the device is
already limited to a lower max_segments value, since those fit in a
single page anyway?
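
For my own understanding, here is a back-of-envelope sketch of the
single-page constraint Michael describes. The sizes below are
assumptions for a typical x86-64 build (32 bytes per scatterlist entry,
a few PRP-list page pointers), not values taken from a specific kernel
tree:

#include <stdio.h>

int main(void)
{
	size_t page_size = 4096;	/* PAGE_SIZE on x86-64 */
	size_t sg_entry  = 32;		/* assumed sizeof(struct scatterlist) */
	size_t prp_ptrs  = 3 * 8;	/* assumed pointers to PRP-list pages */

	/* 127 * 32 + 24 = 4088 bytes: still fits in one 4096-byte page */
	printf("127 segments: %zu bytes (page is %zu)\n",
	       127 * sg_entry + prp_ptrs, page_size);

	/* 129 * 32 + 24 = 4152 bytes: well over one page */
	printf("129 segments: %zu bytes\n", 129 * sg_entry + prp_ptrs);

	return 0;
}

With these assumed sizes 127 entries leave just enough headroom (4088 of
4096 bytes), while 128 would already be 4120 bytes; I take it Keith's
change shrinks the per-request overhead enough to make 128 fit.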

Following the kernel code, I can see that max_hw_sectors is calculated
as 2 ^ (MDTS + page_shift - 9), max_hw_sectors_kb is just
max_hw_sectors >> 1, and max_segments is 2 ^ MDTS + 1 (simplified).
Using QEMU's NVMe emulation, which lets me change the MDTS, I get the
following:

  MDTS   max_hw_sectors   max_hw_sectors_kb   max_segments
 ------ ---------------- ------------------- --------------
  0      1024             512                 127
  1      16               8                   3
  2      32               16                  5
  3      64               32                  9
  4      128              64                  17
  5      256              128                 33
  6      512              256                 65
  7      1024             512                 127
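
To double-check that reading of the formulas, here is a small userspace
sketch (plain C, not driver code) that reproduces the MDTS = 1..7 rows
above, assuming page_shift = 12 and the 127-segment driver cap. MDTS = 0
is left out because it means the device reports no transfer size limit,
so that row comes from other caps in the stack rather than from this
formula:

#include <stdio.h>

#define PAGE_SHIFT    12	/* assumed 4 KiB controller page size */
#define NVME_MAX_SEGS 127	/* driver cap from drivers/nvme/host/pci.c */

int main(void)
{
	printf("MDTS  max_hw_sectors  max_hw_sectors_kb  max_segments\n");
	for (unsigned int mdts = 1; mdts <= 7; mdts++) {
		unsigned int max_hw_sectors = 1u << (mdts + PAGE_SHIFT - 9);
		unsigned int max_hw_sectors_kb = max_hw_sectors >> 1;
		/* one 4 KiB segment per 8 sectors, plus one for an unaligned start */
		unsigned int max_segments = max_hw_sectors / 8 + 1;

		if (max_segments > NVME_MAX_SEGS)
			max_segments = NVME_MAX_SEGS;

		printf("%4u  %14u  %17u  %12u\n", mdts, max_hw_sectors,
		       max_hw_sectors_kb, max_segments);
	}
	return 0;
}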

>
> The limit confuses many because user space can sometimes get 512 KiB
> IO to work and other times the same program fails, all because of physical
> memory contiguity that user space isn't always aware of. A sure-fire way to
> never hit that limit is to allocate hugepages.
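
For reference, a minimal sketch of the hugepage approach Keith
mentions, assuming a preallocated hugetlb pool (e.g. via
/proc/sys/vm/nr_hugepages), the default 2 MiB hugepage size, and a
placeholder device path:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define HUGEPAGE_SIZE	(2UL * 1024 * 1024)
#define IO_SIZE		(512 * 1024)	/* the 512 KiB transfer discussed above */

int main(void)
{
	/* A 2 MiB hugepage is physically contiguous, so a 512 KiB buffer
	 * inside it cannot be split into more segments than the device
	 * allows by scattered user pages. */
	void *buf = mmap(NULL, HUGEPAGE_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}

	int fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);	/* placeholder */
	if (fd < 0) {
		perror("open");
		return 1;
	}

	ssize_t ret = pread(fd, buf, IO_SIZE, 0);
	printf("pread returned %zd\n", ret);

	close(fd);
	munmap(buf, HUGEPAGE_SIZE);
	return 0;
}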


