[PATCH 3/4] nvme: tcp: fix race between timeout and normal completion

Tue Oct 20 04:11:11 EDT 2020

> NVMe TCP timeout handler allows to abort request directly when the
> controller isn't in LIVE state. nvme_tcp_error_recovery() updates
> controller state as RESETTING, and schedule reset work function. If
> new timeout comes before the work function is called, the new timedout
> request will be aborted directly, however at that time, the controller
> isn't shut down yet, then timeout abort vs. normal completion race
> will be triggered.

This assertion is incorrect, the before completing the request from
the timeout handler, we call nvme_tcp_stop_queue, which guarantees upon
return that no more completions will be seen from this queue.

> Fix the race by the following approach:
> 
> 1) aborting timed out request directly only in case that controller is in
> CONNECTING and DELETING state. In the two states, controller has been shutdown,
> so it is safe to do so; Also, it is enough to recovery controller in this way,
> because we only stop/destroy queues during RESETTING, and cancel all in-flight
> requests, no new request is required in RESETTING.

Unfortunately RESETTING also requires direct completion because this
state may include I/O that may timeout and unless we complete it
the reset flow cannot make forward progress
(nvme_disable_ctrl/nvme_shutdown_ctrl generate I/O in fabrics).

> 
> 2) delay unquiesce io queues and admin queue until controller is LIVE
> because it isn't necessary to start queues during RESETTING. Instead,
> this way may risk timeout vs. normal completion race because we need
> to abort timed-out request directly during CONNECTING state for setting
> up controller.

We can't unquisce I/O only when the controller is LIVE because I/O needs
to be able to failover for multipath, which should not be tied with
the controller becoming LIVE again what-so-ever...