[PATCH 2/6] nvme-pci: fix the freeze and quiesce for shutdown and reset case

jianchao.wang jianchao.w.wang at oracle.com
Tue Feb 6 18:13:51 PST 2018


Hi Keith

Sorry for bothering you again.

On 02/07/2018 10:03 AM, jianchao.wang wrote:
> Hi Keith
> 
> Thanks for your time and kindly response on this.
> 
> On 02/06/2018 11:13 PM, Keith Busch wrote:
>> On Tue, Feb 06, 2018 at 09:46:36AM +0800, jianchao.wang wrote:
>>> Hi Keith
>>>
>>> Thanks for your kindly response.
>>>
>>> On 02/05/2018 11:13 PM, Keith Busch wrote:
>>>>  but how many requests are you letting enter to their demise by
>>>> freezing on the wrong side of the reset?
>>>
>>> There are only two differences between this patch and the original one.
>>> 1. Don't freeze the queue for the reset case. At that moment, the outstanding requests are requeued back to the blk-mq queues.
>>>    Requests that newly arrive during the reset also stay in the blk-mq queues. None of these requests can enter the nvme driver
>>>    layer because the request_queues are quiesced, and they will be issued after the reset completes successfully.
>>> 2. Drain the request queue before nvme_dev_disable. This is nearly the same as the previous behavior, which also unquiesces the
>>>    queue and lets the requests drain. The only difference is that this patch invokes wait_freeze in nvme_dev_disable instead
>>>    of nvme_reset_work.
>>>
>>> We don't sacrifice any request. This patch does the same thing as the previous one and makes things clearer.
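
To make 1) and 2) above more concrete, the disable path I have in mind looks
roughly like this (only a sketch, not the literal patch; nvme_dev_disable_sketch
is just an illustrative name):

static void nvme_dev_disable_sketch(struct nvme_dev *dev, bool shutdown)
{
	if (shutdown) {
		/* 2) gate new submitters and let the already entered
		 * requests drain before the controller goes away
		 * (assuming the controller is still alive; the dead
		 * case is discussed further below) */
		nvme_start_freeze(&dev->ctrl);
		nvme_wait_freeze(&dev->ctrl);
	}

	/* quiesce in both cases: nothing new reaches ->queue_rq() */
	nvme_stop_queues(&dev->ctrl);

	/* 1) for the reset case we stop here: no freeze, the outstanding
	 * requests get canceled and requeued, and the queues are
	 * restarted after the reset completes */
}
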
>>
>> No, what you're proposing is quite different.

What is the difference? Could you please point it out?
I have shared my understanding below,
but honestly I don't see the difference you are referring to.
Or are you referring to the 4th patch? If so, I also explain that below.

I really appreciate you taking the time to explain this. :)
Many thanks
Jianchao

>>
>> By "enter", I'm referring to blk_queue_enter. 
> When a request is allocated, it holds a reference on request_queue->q_usage_counter until it is freed.
> Please refer to
> blk_mq_get_request -> blk_queue_enter_live
> blk_mq_free_request -> blk_queue_exit
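
In other words, what that call chain boils down to is just this pairing
(a simplified sketch, not new code):

	/* taken in blk_queue_enter_live() when the request is allocated */
	percpu_ref_get(&q->q_usage_counter);

	/* ... the request lives: queued, possibly requeued, dispatched ... */

	/* dropped in blk_queue_exit() when the request is freed */
	percpu_ref_put(&q->q_usage_counter);
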
> 
> Regarding 'a request enters into an hctx', I don't quite get the point.
> I think you mean that it enters the nvme driver layer.
> 
>> Once a request enters
>> into an hctx, it can not be backed out to re-enter a new hctx if the
>> original one is invalidated.
> 
> I also don't quite get the point here. We certainly will not issue a request that has already been
> issued to another hctx.
> What this patch, and also the original one, does is disable/shutdown the controller,
> then cancel and requeue (or fail) the outstanding requests.
> 
> The requeue mechanism ensures the requests are inserted into the ctx that req->mq_ctx->cpu
> points to.
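
For illustration, that requeue step uses the existing block layer helpers
along these lines (a sketch, not code from the patch):

	/* a canceled request is parked on q->requeue_list ... */
	blk_mq_requeue_request(req, false);

	/* ... and requeue_work later re-inserts it on the ctx that
	 * req->mq_ctx->cpu points to */
	blk_mq_kick_requeue_list(req->q);
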
> 
>>
>> Prior to a reset, all requests that have entered the queue are committed
>> to that hctx, 
> 
> A request could be on
> - the blk-mq per-cpu ctx->rq_list or an IO scheduler list,
> - the hctx->dispatch list, or
> - request_queue->requeue_list (from which it will be inserted back into the 1st case).
> 
> When requests are issued, they are dequeued from the 1st or 2nd case and submitted to the nvme driver layer.
> These are the _outstanding_ requests.
> 
> When the request queue is quiesced, requests stay on the blk-mq per-cpu ctx->rq_list, IO scheduler list,
> or hctx->dispatch list, and can no longer be issued to the driver layer.
> When the request queue is frozen, bios are gated in generic_make_request, so new requests can no longer enter
> the blk-mq layer, let alone the nvme driver layer.
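
Put differently, in terms of the existing block layer helpers (just to
illustrate the two mechanisms, not new code):

	/* quiesce: ->queue_rq() is no longer called; requests stay parked
	 * on ctx->rq_list / scheduler lists / hctx->dispatch */
	blk_mq_quiesce_queue(q);
	/* ... */
	blk_mq_unquiesce_queue(q);

	/* freeze: new submitters are gated in blk_queue_enter(), and the
	 * wait below returns only when q_usage_counter drops to zero,
	 * i.e. every request that already entered has been freed */
	blk_freeze_queue_start(q);
	blk_mq_freeze_queue_wait(q);
	/* ... */
	blk_mq_unfreeze_queue(q);
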
> 
> For the reset case, the nvme controller will be back soon, so we needn't freeze the queue; quiescing is enough.
> The outstanding requests will be canceled and _requeued_ to request_queue->requeue_list, then inserted into the
> blk-mq layer again by requeue_work. When reset_work completes and starts the queues again, all the requests will be
> issued again. :)
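
So the reset side would be roughly (a sketch built from the existing
helpers, not the literal patch):

	nvme_stop_queues(&dev->ctrl);	/* quiesce only, no freeze */
	blk_mq_tagset_busy_iter(&dev->tagset,
				nvme_cancel_request, &dev->ctrl);
					/* outstanding IO -> requeue_list */
	/* ... reset and re-enable the controller ... */
	nvme_start_queues(&dev->ctrl);	/* unquiesce; the requeued IO is
					 * issued again */
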
> 
> For the shutdown case, freezing plus quiescing is safer. We also wait for the requests to complete if the controller is still
> alive. If it is dead, we need to fail them directly instead of requeuing them; otherwise an IO hang will come up, because the
> controller will be offline for some time.
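
And the shutdown side would be roughly (again only a sketch; 'dead' stands
for whatever liveness check is used, e.g. reading CSTS):

	nvme_start_freeze(&dev->ctrl);		/* gate new submitters */
	if (!dead)
		nvme_wait_freeze(&dev->ctrl);	/* let entered IO complete */
	nvme_stop_queues(&dev->ctrl);		/* then quiesce dispatch */
	blk_mq_tagset_busy_iter(&dev->tagset,
				nvme_cancel_request, &dev->ctrl);
	if (dead)
		nvme_kill_queues(&dev->ctrl);	/* fail IO instead of
						 * requeueing it, so nothing
						 * hangs while the controller
						 * is offline */
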
>   
> 
>> and we can't do anything about that. The only thing we can
>> do is prevent new requests from entering until we're sure that hctx is
>> valid on the other side of the reset.
>>
> Yes, that is what this patch does.
> 
> Let me add some explanation here about the 4th patch, "nvme-pci: break up nvme_timeout and nvme_dev_disable".
> Also, thanks for taking the time to look into this; it is really appreciated!
> 
> The key point is blk_abort_request. It forces the request to expire so that it is handled in timeout work
> context, which is safe against racing with the irq completion path. This is the most important reason
> to use blk_abort_request.
> We don't _complete_ the request or _rearm_ the timer, but just set a CANCELED flag, so the request will not be freed.
> These requests then cannot be taken away by the irq completion path, and the timeout path is also avoided (there are no
> outstanding requests any more). That is why we say 'all the outstanding requests are grabbed'. Once the controller is
> completely shut down, we can complete these requests safely. This is the core idea of the 4th patch.
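
A rough sketch of that idea (not the actual patch; nvme_abort_rq is just an
illustrative name for the tag iterator callback):

	/* force every outstanding request to expire, so each of them is
	 * funneled into the timeout path */
	static void nvme_abort_rq(struct request *req, void *data, bool reserved)
	{
		blk_abort_request(req);
	}

	blk_mq_tagset_busy_iter(&dev->tagset, nvme_abort_rq, NULL);

	/* then, in nvme_timeout(), for a request we are canceling we only
	 * mark it; we neither complete it nor rearm the timer, so the irq
	 * path cannot take it away and it is completed later, once the
	 * controller is fully shut down */
	nvme_req(req)->flags |= NVME_REQ_CANCELLED;
	return BLK_EH_NOT_HANDLED;
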
> 
> Many thanks
> Jianchao
>  
> 


