[RFCv2] nvme-pci: adjust tagset parameters to match b/w

Keith Busch kbusch at kernel.org
Wed Oct 13 09:49:35 PDT 2021


On Wed, Oct 13, 2021 at 12:04:49PM -0400, Martin K. Petersen wrote:
> 
> Keith,
> 
> > Instead of auto-adjusting the timeout to cope with the worst case
> > scenario, this version adjusts the IO depth and max transfer size so
> > that the worst case scenario fits within the driver's timeout
> > tolerance.
> 
> I am totally in agreement with the concept of preventing mismatched
> queue depth and I/O timeout.
> 
> I am a bit concerned wrt. PCIe bandwidth as a measure since that is not
> necessarily representative of a drive's actual performance. But
> obviously the queue feedback mechanism is still a bit of a can of
> worms. For a sanity check, PCIe bandwidth is definitely acceptable.
> 
> > +	static const u32 min_bytes = 128 * 1024;
> > +	static const u32 min_depth = 128;
> 
> Not sure about 128 as min depth. We have seen workloads where 32 or 64
> performed best in practice although the drive internally had more
> command slots available.

Probably fine. I chose 128 somewhat arbitrarily, and suspect 32 tags per
hctx will be sufficient for most workloads without harming anything.

This algorithm does not change the nvme IO queue depth either; it only
limits what blk-mq is allowed to dispatch, so if this throttle kicks in
we will simply have more SQ command entries than we can ever use.
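
To make the worst case concrete, here is a minimal userspace sketch of
the sizing model (calc_dispatch_limits and struct dispatch_limits are
hypothetical names; only min_bytes and min_depth come from the patch
hunk quoted above, and the real patch may clamp in a different order):

#include <stdint.h>
#include <stdio.h>

struct dispatch_limits {
        uint32_t depth;         /* tags blk-mq may keep in flight */
        uint32_t max_bytes;     /* largest single transfer */
};

/*
 * A full queue drains in roughly depth * max_bytes / bandwidth
 * seconds, so shrink the dispatch parameters until that fits the IO
 * timeout.  The SQ itself keeps its hardware size; only what blk-mq
 * may dispatch is limited.
 */
static struct dispatch_limits calc_dispatch_limits(uint64_t bw_bytes_per_sec,
                uint32_t timeout_sec, uint32_t hw_depth, uint32_t hw_max_bytes)
{
        static const uint32_t min_bytes = 128 * 1024;   /* from the patch */
        static const uint32_t min_depth = 128;          /* from the patch */
        struct dispatch_limits l = { hw_depth, hw_max_bytes };
        uint64_t budget = bw_bytes_per_sec * timeout_sec;

        /* Shrink the transfer size first, but never below min_bytes. */
        while ((uint64_t)l.depth * l.max_bytes > budget &&
               l.max_bytes > min_bytes)
                l.max_bytes /= 2;
        /* Then shrink the dispatch depth, but never below min_depth. */
        while ((uint64_t)l.depth * l.max_bytes > budget &&
               l.depth > min_depth)
                l.depth /= 2;
        return l;
}

int main(void)
{
        /* ~100 MB/s link, 30 s timeout, 1023 tags, 4 MiB transfers:
         * 1023 * 4 MiB (~4.3 GB) exceeds the 3 GB budget, so the max
         * transfer is halved to 2 MiB while the depth stays put. */
        struct dispatch_limits l =
                calc_dispatch_limits(100000000ULL, 30, 1023, 4 * 1024 * 1024);

        printf("depth %u, max transfer %u bytes\n", l.depth, l.max_bytes);
        return 0;
}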
 
> Also, haven't looked closely at your algorithm yet (impending meeting),
> just want to point out that the default 30 seconds is a totally
> unacceptable timeout for many of the things we care about.  Our I/O
> timeout is typically set to a couple of seconds. My hunch is that the
> algorithm needs to take that into account.

30 seconds is just the default if you didn't ask to change it. The
algorithm will react to the user-requested IO timeout, so if you request
2 seconds via the module parameter, the driver will try to adjust
dispatch parameters accordingly.
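
For scale, plugging that 2 second request into the same hypothetical
model at ~1 GB/s of link bandwidth (the numbers are illustrative, not
from the patch):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
        /* in-flight byte budget = bandwidth * timeout = ~2 GB */
        uint64_t budget = 1000000000ULL * 2;
        uint32_t depth = 1023, max_bytes = 4 * 1024 * 1024;

        /* 1023 tags * 4 MiB (~4.3 GB worst case) no longer fits, so
         * halve the transfer size until it does, as in the sketch
         * above; here it settles at 1 MiB with the depth untouched. */
        while ((uint64_t)depth * max_bytes > budget && max_bytes > 128 * 1024)
                max_bytes /= 2;

        printf("2s timeout: depth %u, max transfer %u bytes\n",
               depth, max_bytes);
        return 0;
}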

> In the past we have been bitten by scaling algorithms that were
> essentially linear in nature and failed to pick sensible defaults at
> very low input values or timeouts.
> 
> Anyway. Will have a closer look later.
> 
> -- 
> Martin K. Petersen	Oracle Linux Engineering


