[PATCH] nvme-fabrics: allow to queue requests for live queues
James Smart
james.smart at broadcom.com
Tue Jul 28 16:11:31 EDT 2020
On 7/28/2020 10:50 AM, Sagi Grimberg wrote:
>
>>>>> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
>>>>> index 4ec4829d6233..2e7838f42e36 100644
>>>>> --- a/drivers/nvme/host/fabrics.c
>>>>> +++ b/drivers/nvme/host/fabrics.c
>>>>> @@ -564,21 +564,13 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
>>>>>  {
>>>>>         struct nvme_request *req = nvme_req(rq);
>>>>>
>>>>> -       /*
>>>>> -        * If we are in some state of setup or teardown only allow
>>>>> -        * internally generated commands.
>>>>> -        */
>>>>> -       if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD))
>>>> "if (!blk_rq_is_passthrough(rq))" should not delete. Because if we
>>>> delete,
>>>> the normal io will be send to target, the target can not treat the io
>>>> if the queue is not NVME_CTRL_LIVE.
>>>
>>> Sure it does; the only reason for us to deny this I/O is if the
>>> queue is not live. The controller state should only _advise_ us on
>>> whether we need to look at the queue state.
>>
>> I disagree strongly with removing the check on NVME_REQ_USERCMD.
>> We've seen cli ioctls going to the admin queue while we're in the
>> middle of controller initialization, and it has hosed the controller
>> state in some cases. We've seen commands issued before the
>> controller is in the proper state. The admin queue may be live - but
>> we don't necessarily want other io sneaking in.
>
> Can you please give an example? NVME_REQ_USERCMD should not be any
> different from any other type of I/O.
Keep in mind - in order to send any command, Fabric connect included,
the admin queue must be unquiesced.
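Concretely, every transport does something like the below before the
Connect can go out (paraphrased from memory, not an exact call site):

    /* the admin queue must accept requests before Fabrics Connect */
    blk_mq_unquiesce_queue(ctrl->admin_q);
    /* ...but once unquiesced, a stray ioctl can be queued here too */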
Here's what we have today:
At the time of the reset, or immediately during it, an ioctl is done -
it can be (and has been seen to be) queued on the request queue, as
there are no state checks at submission. When this io gets scheduled
while in the resetting or even the connecting state, it currently
fails the NVME_REQ_USERCMD check, which falls into
nvmf_fail_nonready_command, which fails back with BLK_STS_RESOURCE,
which reschedules it. This rescheduling recurs until it's tried again
with the controller state changed (usually to LIVE), which then passes
the admin-queue live check. It's good to go.
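In code terms, the loop is roughly this - my paraphrase of the stock
fabrics helpers, not the exact source (tcp shown; rdma/fc are
analogous):

    /* each transport's queue_rq() starts with roughly: */
    bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);

    if (!nvmf_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
        return nvmf_fail_nonready_command(&queue->ctrl->ctrl, rq);

    /* and nvmf_fail_nonready_command() keeps a retryable io alive: */
    if (ctrl->state != NVME_CTRL_DELETING &&
        ctrl->state != NVME_CTRL_DEAD &&
        !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
        return BLK_STS_RESOURCE;   /* blk-mq requeues and retries later */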
With your patch:
The ioctl will at least be requeued with BLK_STS_RESOURCE until after
the Fabric Connect has completed (Connect is explicitly allowed by the
state-specific check), as queue->live is hopefully still false. It may
actually be issued while we're in the resetting state, only to be
terminated afterward. All of the transports set the queue live flag as
of Fabric Connect completing. If the ioctl is retried at that point,
it may be sent on the admin queue immediately - before or in between
any of the commands issued by nvme_enable_ctrl(), including before the
controller is enabled, before nvme_init_identify(), and before
nvme_start_ctrl() has been called. Hopefully none of these commands
try to set/change configuration. They simply shouldn't be allowed
until after the controller has transitioned to LIVE. There's no need
to allow them in-between.
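For clarity, with your patch the ready check reduces to roughly this -
my reading of the proposal, not the exact hunk:

    bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
            bool queue_live)
    {
        struct nvme_request *req = nvme_req(rq);

        switch (ctrl->state) {
        case NVME_CTRL_CONNECTING:
            /* only the Fabrics Connect command itself may pass */
            if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) &&
                req->cmd->fabrics.fctype == nvme_fabrics_type_connect)
                return true;
            break;
        case NVME_CTRL_DEAD:
            return false;
        default:
            break;
        }

        /* everything else - ioctls included - is gated only by queue_live */
        return queue_live;
    }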
>
> Also, do note that pci does allow queueing any type of command based
> on the queue state only. Fabrics should be slightly different,
> because we have the CONNECTING state where we want to let only the
> connect command be issued.
>
That doesn't make me feel any better. PCI bit-bangs register writes;
the odds they can fail/be discarded on the "fabric" are near zero.
Admin commands - sending, processing, and responding - are much, much
faster than on a fabric, where the cmd can be dropped/delayed by
fabric routing. The odds of having core-generated commands fail while
reconnecting/reinitializing are much, much lower. All of which says:
fabrics have a much higher command and inter-command latency than pcie
does - so pcie is going to have a harder time seeing these cases.
>
>> As for the blk_rq_is_passthrough check - I guess I can see it being
>> based on the queue state, and the check looks ok (we should never
>> see !blk_rq_is_passthrough on the admin q).
>> But...
>> - I don't know why it was that important to change it. On the
>> connecting path, all you're doing is letting io start flowing before
>> all the queues have been created. Did you really need to start that
>> much sooner?
>
> The issue is that a controller in RESETTING state will have requests
> that are being issued, and if we don't let them pass through, they
> will hang around forever being requeued, preventing queue freeze from
> completing.
Is it really "forever", or is it until the controller loss timeout?
Granted, 10 minutes does seem like forever.
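For anyone following along, the hang is in the freeze wait: entered
requests hold a queue usage reference, and blk_mq_freeze_queue_wait()
can't drain while they keep being requeued. Roughly (stock helper,
paraphrased):

    void nvme_wait_freeze(struct nvme_ctrl *ctrl)
    {
        struct nvme_ns *ns;

        down_read(&ctrl->namespaces_rwsem);
        list_for_each_entry(ns, &ctrl->namespaces, list)
            /* blocks until the queue's usage counter drains to zero */
            blk_mq_freeze_queue_wait(ns->queue);
        up_read(&ctrl->namespaces_rwsem);
    }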
>
> See bug report from Krishnamraju Eraparaju.
I'll go look.
>
>>
>> - But on the resetting path, or deleting cases, you've added a
>> condition now where the controller state was changed, but there is a
>> delay before the transport updates the queue's live flag. It's
>> common practice in the transports to change state, then schedule a
>> work element to perform the actual teardown. Why would you want io
>> to continue to flow during that window? This may bring out other
>> problems we've avoided in the past.
>
> What are you referring to? the change here? the controller reset must
> allow requests that came in before we started queue freeze to pass,
> otherwise freeze will never complete.
I'm referring to the io's that now pass through with the patch, as the
controller state is no longer blocking them (e.g. the removal of the
!blk_rq_is_passthrough and NVME_REQ_USERCMD checks). They will pass
the ready checks and go out on the wire, creating more active io that
has to be deleted when the reset finally quiesces and terminates the
active io list. For FC, we have real link traffic for any io that gets
started, so it raises the demands. Should be fine, but it is creating
more load at a point where it could have been blocked.
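To make the window concrete - illustrative ordering only, not verbatim
driver code:

    nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING); /* state flips now */
    queue_work(nvme_reset_wq, &ctrl->reset_work);      /* teardown runs later */

    /*
     * Until reset_work quiesces the queues and clears the transport's
     * queue-live flag, nvmf_check_ready() still sees queue_live == true,
     * so new io passes the ready check and goes out on the wire - only
     * to be terminated when the teardown finally runs.
     */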
-- james