[PATCH] nvme-pci: calculate IO timeout
Sagi Grimberg
sagi at grimberg.me
Wed Oct 13 03:53:37 PDT 2021
> Existing host and nvme device combinations are more frequently capable
> of sustaining outstanding transfer sizes exceeding the driver's default
> timeout tolerance, given the available device throughput.
>
> Let's consider a "mid" level server and controller with 128 CPUs and an
> NVMe controller with no MDTS limit (the driver will throttle to 4MiB).
>
> If we assume the driver's default 1k depth per-queue, this can allow
> 128k outstanding IO submission queue entries.
>
> If all SQ Entries are transferring the 4MiB max request, 512GB will be
> outstanding at the same time with the default 30 second timer to
> complete the entirety.
>
> If we assume a current PCIe Gen4 x4 NVMe device, that amount of
> data will take ~70 seconds to transfer over the PCIe link, not
> considering the device side internal latency: timeouts and IO failures
> are therefore inevitable.
>
> There are some driver options to mitigate the issue:
>
> a) Throttle the hw queue depth
> - harms high-depth single-threaded workloads
> b) Throttle the number of IO queues
> - harms low-depth multi-threaded workloads
> c) Throttle max transfer size
> - harms large sequential workloads
> d) Delay dispatch based on outstanding data transfer
> - requires hot-path atomics
> e) Increase IO Timeout
>
> This RFC implements option 'e', increasing the timeout. The timeout is
> calculated based on the largest possible outstanding data transfer
> against the device's available bandwidth. The link time is arbitrarily
> doubled to allow for additional device side latency and potential link
> sharing with another device.
>
> The obvious downside to this option is that it may take a long time for
> the driver to notice a stuck controller.
>
> Any other ideas?
I think that if a workload really does hit the worst possible case,
the admin should probably override the default manually. I don't
think it is desirable to have an absolute-worst-case default timeout.
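
For reference, the arithmetic in the quoted description can be checked
with a rough sketch like the one below. It is not the patch's code. All
function and variable names are made up for illustration, and the link
bandwidth is an assumption: the raw PCIe Gen4 rate (16 GT/s per lane,
128b/130b encoding) with no allowance for packet or protocol overhead.
"512GB" is treated as 512 GiB.

```python
# Rough check of the worst-case figures from the quoted description.
# Names are illustrative only; none of them come from the nvme driver.

MIB = 1 << 20
GIB = 1 << 30


def worst_case_outstanding_bytes(nr_queues, queue_depth, max_transfer_bytes):
    """Bytes in flight if every SQ entry carries a max-size request."""
    return nr_queues * queue_depth * max_transfer_bytes


def pcie_bandwidth_bytes_per_sec(gt_per_sec, lanes, encoding=128 / 130):
    """Raw link bandwidth; ignores TLP/DLLP overhead (assumption)."""
    return gt_per_sec * 1e9 * encoding / 8 * lanes


def link_time_seconds(outstanding_bytes, bandwidth_bytes_per_sec):
    """Time to move everything outstanding across the link."""
    return outstanding_bytes / bandwidth_bytes_per_sec


def doubled_timeout_seconds(link_time):
    """Link time doubled, as the description says the RFC does."""
    return 2 * link_time


# 128 CPUs (one IO queue each), 1k entries per queue, 4MiB per request.
outstanding = worst_case_outstanding_bytes(128, 1024, 4 * MIB)
bandwidth = pcie_bandwidth_bytes_per_sec(gt_per_sec=16, lanes=4)
link_time = link_time_seconds(outstanding, bandwidth)
timeout = doubled_timeout_seconds(link_time)

print(f"outstanding: {outstanding / GIB:.0f} GiB")
print(f"link time:   {link_time:.1f} s")
print(f"timeout:     {timeout:.1f} s")
```

Under these assumptions the link time comes out just under 70 seconds,
which matches the ~70 second figure above, and the doubled value is
more than four times the 30 second default.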