[PATCH 3/3] nvme: improve handling of long keep alives

Sagi Grimberg sagi at grimberg.me
Tue Apr 18 09:59:51 PDT 2023



On 4/18/23 01:55, Uday Shankar wrote:
> Upon keep alive completion, nvme_keep_alive_work is scheduled with the
> same delay every time. If keep alive commands are completing slowly,
> this may cause a keep alive timeout. The following trace illustrates the
> issue, taking KATO = 8 and TBKAS off for simplicity:
> 
> 1. t = 0: run nvme_keep_alive_work, send keep alive
> 2. t = ε: keep alive reaches controller, controller restarts its keep
>            alive timer
> 3. t = 4: host receives keep alive completion, schedules
>            nvme_keep_alive_work with delay 4
> 4. t = 8: run nvme_keep_alive_work, send keep alive
> 
> Here, a keep alive having RTT of 4 causes a delay of at least 8 - ε
> between the controller receiving successive keep alives. With ε small,
> the controller is likely to detect a keep alive timeout.
> 
> Fix this by calculating the RTT of the keep alive command, and adjusting
> the scheduling delay of the next keep alive work accordingly.

Is this something that was actually observed in practice?

It is surprising that host->ctrl would be super fast while
ctrl->host is slow enough for this situation to occur
in reality...


