[RFC] nvme-rdma: Stop queues when starting with error recovery

Sagi Grimberg sagi at grimberg.me
Tue May 24 06:13:28 PDT 2022



On 5/24/22 16:12, Sagi Grimberg wrote:
> 
> 
> On 5/23/22 18:21, Daniel Wagner wrote:
>> When we enter error recovery we should stop all queue activities and
>> all armed timers.
>>
>> For example, we could arming an ANATT timer right before we enter
>> error recovery but do not successfully recover before the timer
>> fires. The timer is supposed only be active when the controller is in
>> LIVE state hence we should call nvme_stop_ctrl when starting with the
>> recover activites.
> 
> This takes me back... IIRC the main reason why the error recovery
> handler in nvme-rdma (and nvme-tcp) does not start with nvme_stop_ctrl()
> is because back then, nvme_stop_ctrl() could block until an
> admin_timeout expired on an inflight admin command (I have a vague
> memory that if cancelled/flushed the scan_work), which delayed I/O
> failover, hence we just cherry-picked what needed to be stopped to make
> sure I/O failover will not be blocked by anything and progress quickly.
> 
> If I look at nvme_stop_queue

nvme_stop_ctrl obviously...

  it does not touch the scan_work, but it
> does flush ana_work which includes I/O, so it can still block on it
> to fail after admin_timeout expires... also flush failfast_work is
> taking the namespaces_rwsem (although read access). The fw_act_work
> is probably meaningless for fabrics...
> 
> Maybe we need to have something like nvme_stop_ctrl_fast, that will just
> stop the non-blocking stuff, so failover can work quickly, and then
> after we have failover, we can do the full nvme_stop_ctrl() or the
> reminder of nvme_stop_ctrl_fast...
> 
> thoughts?



More information about the Linux-nvme mailing list