max_hw_sectors error caused by recent NVMe driver commit
Daniel Gomez
dagmcr at gmail.com
Fri Feb 17 05:28:25 PST 2023
On Mon, Feb 13, 2023 at 5:57 PM Keith Busch <kbusch at kernel.org> wrote:
>
> On Mon, Feb 13, 2023 at 04:42:31PM +0000, Michael Kelley (LINUX) wrote:
> > Ideally, to support
> > 512 Kbyte transfers, you would want 129 segments (to allow for starting in
> > the middle of a page as described above). But the value of max_segments
> > is limited by the NVME driver itself using the value of NVME_MAX_SEGS
> > defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
> > sure that the data structure containing the scatterlist fits in a single page.
> > See nvme_pci_alloc_iod_mempool().
Michael,
Thanks for clarifying. It really helps. Also, it's fixed after
applying Christoph's patch [1].
[1] https://lore.kernel.org/all/20230213072035.288225-1-hch@lst.de/
>> value to be set. In your example, I would guess the value of 512 Kbytes came
>> from querying the NVMe device for its max transfer size. Ideally, to support
>> 512 Kbyte transfers, you would want 129 segments (to allow for starting in
>> the middle of a page as described above). But the value of max_segments
>> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
>> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
>> sure that the data structure containing the scatterlist fits in a single page.
>
> Should be 128 possible segments now in -next, but yeah, 129 would be ideal.
Quoting Michael,
>> the middle of a page as described above). But the value of max_segments
>> is limited by the NVME driver itself using the value of NVME_MAX_SEGS
>> defined in drivers/nvme/host/pci.c. The value of 127 is chosen to make
>> sure that the data structure containing the scatterlist fits in a single page.
Yes, I can see that. I guess the 129 needs to be reduced to 127 (or to
128 after Keith's optimization patch), but not when the device is
limited to a lower max_segments value, since those already fit in a
single page?
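(Spelling out the arithmetic: 512 KiB is 128 pages at 4 KiB each, and a
buffer that starts in the middle of a page spans 129 pages, hence the
ideal 129 segments.)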
Following the kernel code, I can see that max_hw_sectors is calculated
as 2 ^ (MDTS + page_shift - 9), max_hw_sectors_kb is just
max_hw_sectors >> 1, and max_segments is 2 ^ MDTS + 1 (simplified, then
capped at NVME_MAX_SEGS). Using QEMU's NVMe emulation to change the
MDTS, I get the following (a small user-space sketch that reproduces
these numbers follows the table):
MDTS  max_hw_sectors  max_hw_sectors_kb  max_segments
----  --------------  -----------------  ------------
   0            1024                512           127
   1              16                  8             3
   2              32                 16             5
   3              64                 32             9
   4             128                 64            17
   5             256                128            33
   6             512                256            65
   7            1024                512           127
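For reference, here is a quick user-space sketch (my own, not kernel
code) that reproduces the table from those simplified formulas,
assuming a 4 KiB page (page_shift = 12) and the current NVME_MAX_SEGS
cap of 127. MDTS = 0 means no device limit, so the driver defaults
apply (first row) and the loop skips it:

#include <stdio.h>

#define NVME_MAX_SEGS 127	/* 128 in -next after Keith's change */

int main(void)
{
	int page_shift = 12;	/* 4 KiB pages */

	printf("MDTS  max_hw_sectors  max_hw_sectors_kb  max_segments\n");
	for (int mdts = 1; mdts <= 7; mdts++) {
		/* max_hw_sectors = 2 ^ (MDTS + page_shift - 9) */
		unsigned int max_hw_sectors = 1u << (mdts + page_shift - 9);
		unsigned int max_hw_sectors_kb = max_hw_sectors >> 1;
		/* one segment per page, plus one for an unaligned start */
		unsigned int max_segments =
			(max_hw_sectors >> (page_shift - 9)) + 1;

		if (max_segments > NVME_MAX_SEGS)
			max_segments = NVME_MAX_SEGS;
		printf("%4d  %14u  %17u  %12u\n", mdts, max_hw_sectors,
		       max_hw_sectors_kb, max_segments);
	}
	return 0;
}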
>
> The limit confuses many because user space can sometimes get 512 KiB
> IO to work and other times the same program fails, all because of physical
> memory contiguity that user space isn't always aware of. A sure-fire way to
> never hit that limit is to allocate hugepages.
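For completeness, a minimal sketch of that approach (my own, untested
against this setup): read 512 KiB with O_DIRECT into a buffer backed by
a single 2 MiB hugepage, so the data is physically contiguous and the
segment limit cannot be hit. It assumes hugepages have been reserved
(e.g. via /proc/sys/vm/nr_hugepages) and that /dev/nvme0n1 is readable:

#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t huge = 2 * 1024 * 1024;	/* one 2 MiB hugepage */
	size_t len = 512 * 1024;	/* the transfer size that hits the limit */
	void *buf;
	ssize_t ret;
	int fd;

	buf = mmap(NULL, huge, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (buf == MAP_FAILED) {
		perror("mmap(MAP_HUGETLB)");
		return 1;
	}

	fd = open("/dev/nvme0n1", O_RDONLY | O_DIRECT);
	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* hugepage-backed buffer: no page-boundary segment splitting */
	ret = pread(fd, buf, len, 0);
	printf("pread returned %zd\n", ret);

	close(fd);
	munmap(buf, huge);
	return 0;
}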