[PATCH] nvme-pci: calculate IO timeout

Sagi Grimberg sagi at grimberg.me
Wed Oct 13 03:53:37 PDT 2021


> Modern host and NVMe device combinations are increasingly capable of
> queueing more outstanding transfer data than the device's available
> throughput can complete within the driver's default timeout.
> 
> Let's consider a "mid" level server and controller with 128 CPUs and an
> NVMe controller with no MDTS limit (the driver will throttle to 4MiB).
> 
> If we assume the driver's default 1k depth per-queue, this can allow
> 128k outstanding IO submission queue entries.
> 
> If all SQ Entries are transferring the 4MiB max request, 512GB will be
> outstanding at the same time with the default 30 second timer to
> complete the entirety.
> 
> If we assume a modern PCIe Gen4 x4 NVMe device, that amount of data
> will take ~70 seconds to transfer over the PCIe link, not considering
> the device side internal latency: timeouts and IO failures are
> therefore inevitable.
> 
> There are some driver options to mitigate the issue:
> 
>   a) Throttle the hw queue depth
>       - harms high-depth single-threaded workloads
>   b) Throttle the number of IO queues
>       - harms low-depth multi-threaded workloads
>   c) Throttle max transfer size
>       - harms large sequential workloads
>   d) Delay dispatch based on outstanding data transfer
>       - requires hot-path atomics
>   e) Increase IO Timeout
> 
> This RFC implements option 'e', increasing the timeout. The timeout is
> calculated based on the largest possible outstanding data transfer
> against the device's available bandwidth. The link time is arbitrarily
> doubled to allow for additional device side latency and potential link
> sharing with another device.
> 
> The obvious downside to this option is that it may take a long time
> for the driver to notice a stuck controller.
> 
> Any other ideas?

I think that in the case where the workload behaves in the worst
possible way, the admin should probably override the default manually.
I don't think it is desirable to have an absolute-worst-case default
timeout.
