max_hw_sectors error caused by recent NVMe driver commit

Michael Kelley (LINUX) mikelley at microsoft.com
Mon Feb 13 08:42:31 PST 2023


From: Daniel Gomez <dagmcr at gmail.com> Sent: Saturday, February 11, 2023 1:39 AM
> 
> Hi all,
> 
> On Sat, Feb 11, 2023 at 6:38 AM Michael Kelley (LINUX)
> <mikelley at microsoft.com> wrote:
> >
> > Commit 3f30a79c2e2c ("nvme-pci: set constant paramters in nvme_pci_alloc_ctrl")
> > appears to have introduced an error in how max_hw_sectors is calculated.  The
> > value of max_hw_sectors is based on dma_max_mapping_size(), which indirectly
> > uses dma_addressing_limited() to decide if swiotlb_max_mapping_size() should
> > be used.
> >
> > In this commit, setting max_hw_sectors is moved to nvme_pci_alloc_dev().
> > But dma_addressing_limited() depends on the dev->dma_mask, which hasn't
> > been set.  dma_addressing_limited() returns "true", and the swiotlb max mapping
> > size is used, limiting NVMe transfers to 504 sectors (252 Kbytes).
> >
> > Prior to this commit, max_hw_sectors isn't set until after the call to
> > dma_set_mask_and_coherent() in nvme_pci_enable(), as called from
> > nvme_reset_work().   max_hw_sectors is correctly determined based on
> > values reported by the NVMe controller.
> >
> > I haven't provided a fix because I'm not that familiar with the overall structure
> > of the code and the intent of the code reorganization.  I'm not sure if setting
> > the DMA mask should be moved earlier, or setting max_hw_sectors should
> > be moved back to its original location.
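
For reference, the decision path looks roughly like this (a paraphrased
sketch from my reading of kernel/dma/mapping.c, kernel/dma/direct.c and
include/linux/dma-mapping.h; exact details vary by kernel version):

    /*
     * Sketch only, not verbatim kernel code.  dma_max_mapping_size()
     * ends up in something like this for directly mapped devices.
     */
    size_t dma_direct_max_mapping_size(struct device *dev)
    {
            /*
             * If swiotlb is active and the device looks address-limited,
             * cap mappings at the swiotlb bounce buffer limit, which is
             * what produces the 252 Kbyte / 504 sector ceiling.
             */
            if (is_swiotlb_active(dev) && dma_addressing_limited(dev))
                    return swiotlb_max_mapping_size(dev);
            return SIZE_MAX;
    }

    bool dma_addressing_limited(struct device *dev)
    {
            /*
             * dma_get_mask() falls back to DMA_BIT_MASK(32) while
             * dev->dma_mask is still unset, so calling this before
             * dma_set_mask_and_coherent() makes the device look
             * limited on a system with memory above 4 GB.
             */
            return min_not_zero(dma_get_mask(dev), dev->bus_dma_limit) <
                    dma_get_required_mask(dev);
    }
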
> 
> Yesterday, I ran into the same problem. Below you can find my test
> case. I'm trying to get familiar with the code, so any help would be
> much appreciated.
> 
> I could reproduce it with a simple dd test [1] and btrace. In summary,
> multi-page bvec splits are performed in 4 steps/chunks of 63 segments
> (63 * 4K / 512 = 504 sectors) and 1 step of 4 segments (4 * 4K / 512 =
> 32 sectors). But I think it should be just 4 steps of 64 segments as
> that is the max_segments upper limit in my test case. Below you can
> find my test case and numbers:

The maximum size of a single I/O transfer is limited by the more
restrictive of two limits: max_segments and max_sectors_kb.  In your
example below, the limiting factor is max_sectors_kb: 252 Kbytes is
63 pages of 4 Kbytes.  But note that up to 64 segments *could* be
needed for a 252 Kbyte transfer if the starting memory address is in
the middle of a 4 Kbyte page.  A segment is needed for the partial page
at the beginning and another segment is needed for the partial page at
the end.  Also, if the target physical memory has chunks larger than
4 Kbytes that are contiguous, even fewer than 63 segments would be
needed, because each segment describes a *contiguous* physical memory
area using its starting address and length.  But the I/O transfer would
still be limited to 252 Kbytes because of max_sectors_kb.
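
To make the segment arithmetic concrete, here is a small user-space
sketch (plain C, not kernel code; it assumes 4 Kbyte pages and no
merging of physically contiguous pages):

    #include <stdio.h>

    #define PAGE_SIZE 4096UL

    /* Worst case: every page the buffer touches needs its own segment. */
    static unsigned long worst_case_segments(unsigned long offset_in_page,
                                             unsigned long len)
    {
            return (offset_in_page + len + PAGE_SIZE - 1) / PAGE_SIZE;
    }

    int main(void)
    {
            /* 252 Kbytes, page aligned: 63 segments */
            printf("%lu\n", worst_case_segments(0, 252 * 1024));

            /* 252 Kbytes, starting mid-page: 64 segments */
            printf("%lu\n", worst_case_segments(2048, 252 * 1024));

            /* 512 Kbytes, starting mid-page: 129 segments */
            printf("%lu\n", worst_case_segments(2048, 512 * 1024));
            return 0;
    }

And with max_sectors_kb capped at 252, the 1 Mbyte dd read (2048
sectors) has to be split into 4 I/Os of 504 sectors plus one of 32
sectors, which is exactly what the btrace output quoted below shows.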

> 
> Check:
> root at qemuarm64:/sys/block/nvme0n1/queue# grep -H . max_sectors_kb
> max_hw_sectors_kb max_segments max_segment_size optimal_io_size
> logical_block_size chunk_sectors
> max_sectors_kb:252
> max_hw_sectors_kb:252
> max_segments:64
> max_segment_size:4294967295
> optimal_io_size:512
> logical_block_size:512
> chunk_sectors:0
> 
> dd test:
> dd iflag=direct if=/dev/nvme0n1 bs=1M of=/dev/null status=progress
> 
> btrace snippet:
> 259,0    0    23476     1.111774467   751  Q  RS 1923072 + 2048 [dd]
> 259,0    0    23477     1.111774842   751  X  RS 1923072 / 1923576 [dd]
> 259,0    0    23478     1.111775342   751  G  RS 1923072 + 504 [dd]
> 259,0    0    23479     1.111775425   751  I  RS 1923072 + 504 [dd]
> 259,0    0    23480     1.111776133   751  X  RS 1923576 / 1924080 [dd]
> 259,0    0    23481     1.111776258   751  G  RS 1923576 + 504 [dd]
> 259,0    0    23482     1.111776300   751  I  RS 1923576 + 504 [dd]
> 259,0    0    23483     1.111776675   751  X  RS 1924080 / 1924584 [dd]
> 259,0    0    23484     1.111776967   751  G  RS 1924080 + 504 [dd]
> 259,0    0    23485     1.111777008   751  I  RS 1924080 + 504 [dd]
> 259,0    0    23486     1.111777383   751  X  RS 1924584 / 1925088 [dd]
> 259,0    0    23487     1.111777467   751  G  RS 1924584 + 504 [dd]
> 259,0    0    23488     1.111777550   751  I  RS 1924584 + 504 [dd]
> 259,0    0    23489     1.111777758   751  G  RS 1925088 + 32 [dd]
> 259,0    0    23490     1.111777800   751  I  RS 1925088 + 32 [dd]
> 259,0    0    23491     1.111779383    36  D  RS 1923072 + 504 [kworker/0:1H]
> 259,0    0    23492     1.111780092    36  D  RS 1923576 + 504 [kworker/0:1H]
> 259,0    0    23493     1.111780800    36  D  RS 1924080 + 504 [kworker/0:1H]
> 259,0    0    23494     1.111781425    36  D  RS 1924584 + 504 [kworker/0:1H]
> 259,0    0    23495     1.111781717    36  D  RS 1925088 + 32 [kworker/0:1H]
> 259,0    0    23496     1.112201967   749  C  RS 1923072 + 504 [0]
> 259,0    0    23497     1.112563925   749  C  RS 1923072 + 2048 [0]
> 259,0    0    23498     1.112564425   749  C  RS 1923072 + 2016 [0]
> 259,0    0    23499     1.112564800   749  C  RS 1923072 + 1512 [0]
> 259,0    0    23500     1.112758217   749  C  RS 1923072 + 1008 [0]
> 
> In addition, going back to Michael's referenced commit and running the
> test in the same setup, I got the following values:
> 
> max_sectors_kb:512
> max_hw_sectors_kb:512
> max_segments:127
> max_segment_size:4294967295
> optimal_io_size:512
> logical_block_size:512
> 
> So, I'm not sure why my max_segments went from 64 -> 127 and
> max_sectors_kb and max_hw_sectors_kb from 252 -> 512. Perhaps my
> first assumption was wrong?

The bug I pointed out caused max_hw_sectors to be set incorrectly.  Going
back to the referenced commit reverted the bug and allowed the correct
value to be set.  In your example, I would guess the value of 512 Kbytes
came from querying the NVMe device for its max transfer size.  Ideally,
to support 512 Kbyte transfers, you would want 129 segments (128 pages,
plus one to allow for starting in the middle of a page as described
above).  But the value of max_segments is limited by the NVMe driver
itself using the value of NVME_MAX_SEGS defined in
drivers/nvme/host/pci.c.  The value of 127 is chosen to make sure that
the data structure containing the scatterlist fits in a single page.
See nvme_pci_alloc_iod_mempool().
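
As a rough sanity check on that limit (sizes here are config dependent,
so treat the numbers as illustrative only; sizeof(struct scatterlist)
is typically 32 bytes on a 64-bit build):

    #include <stdio.h>

    #define PAGE_SIZE     4096u
    #define SG_ENTRY_SIZE   32u    /* assumed scatterlist entry size */

    int main(void)
    {
            /* 127 entries: 4064 bytes, which leaves room for the rest of
             * the per-command allocation within one 4096-byte page. */
            printf("%u\n", 127 * SG_ENTRY_SIZE);

            /* 128 entries: 4096 bytes, the entire page by themselves. */
            printf("%u\n", 128 * SG_ENTRY_SIZE);
            return 0;
    }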

Hopefully this information helps ...

Michael



