[PATCH] nvme-pci: calculate IO timeout
Keith Busch
kbusch at kernel.org
Wed Oct 13 08:53:58 PDT 2021
On Wed, Oct 13, 2021 at 11:34:33PM +0800, Ming Lei wrote:
> On Tue, Oct 12, 2021 at 07:27:44PM -0700, Keith Busch wrote:
> > Existing host and nvme device combinations are more frequently capable
> > of sustaining outstanding transfer sizes exceeding the driver's default
> > timeout tolerance, given the available device throughput.
> >
> > Let's consider a "mid" level server and controller with 128 CPUs and an
> > NVMe controller with no MDTS limit (the driver will throttle to 4MiB).
> >
> > If we assume the driver's default 1k depth per-queue, this can allow
> > 128k outstanding IO submission queue entries.
> >
> > If all SQ entries are transferring the 4MiB max request, 512GiB will be
> > outstanding at the same time, with the default 30 second timer to
> > complete all of it.
> >
> > If we assume a currently modern PCIe Gen4 x4 NVMe device, that amount of
> > data will take ~70 seconds to transfer over the PCIe link, not
> > considering the device side internal latency: timeouts and IO failures
> > are therefore inevitable.
>
> The PCIe link is supposed to be much quicker than handling IOs on the
> device side, so the nvme device should already be saturated before the
> PCIe link is used up. Is there any event or feedback from the nvme side
> (host or device) about the saturation status?
>
> SCSI has such a mechanism, so that queue depth can be adjusted according
> to the feedback, and Martin is familiar with this field.
Device-side saturation should be reached at queue depths lower than the
ones considered here, and that usually happens without saturating the
link.
We do not really have any event feedback for the NVMe driver to react
to, though, so I had this patch cautiously assume 50% throughput when
calculating the timeout.
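For a rough sense of the numbers, the arithmetic from the commit message
can be sketched as below. This is not the patch's code: the function
names are made up for illustration, and the link rate is an assumption
based on PCIe Gen4's 16 GT/s per lane with 128b/130b encoding, ignoring
protocol overhead and device-side latency.

```python
MIB = 1 << 20
GIB = 1 << 30


def outstanding_bytes(nr_queues, queue_depth, max_transfer_bytes):
    """Worst case: every SQ entry on every queue carries a max-size request."""
    return nr_queues * queue_depth * max_transfer_bytes


def pcie_link_bytes_per_sec(gt_per_sec, lanes, encoding=128 / 130):
    """Raw link rate in bytes/s (assumption: line encoding overhead only)."""
    return gt_per_sec * 1e9 * encoding / 8 * lanes


def drain_seconds(total_bytes, link_bytes_per_sec, efficiency=1.0):
    """Time to move total_bytes at a given fraction of the link rate."""
    return total_bytes / (link_bytes_per_sec * efficiency)


# 128 CPUs (one IO queue each), 1k depth per queue, 4MiB per request
total = outstanding_bytes(128, 1024, 4 * MIB)
# PCIe Gen4 x4
link = pcie_link_bytes_per_sec(16, 4)

print(f"outstanding:      {total / GIB:.0f} GiB")
print(f"full link rate:   {drain_seconds(total, link):.0f} s")
print(f"50% of link rate: {drain_seconds(total, link, 0.5):.0f} s")
```

Under these assumptions the full-rate figure lands close to the ~70
seconds quoted above, and assuming only half the link rate doubles it.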
I suppose we could react to the IO completion times and try to adjust
queue depths accordingly, though that is probably better suited to a
longer term project.