[PATCH] nvme-fabrics: allow to queue requests for live queues
James Smart
james.smart at broadcom.com
Tue Jul 28 16:11:31 EDT 2020
On 7/28/2020 10:50 AM, Sagi Grimberg wrote:
>
>>>>> diff --git a/drivers/nvme/host/fabrics.c b/drivers/nvme/host/fabrics.c
>>>>> index 4ec4829d6233..2e7838f42e36 100644
>>>>> --- a/drivers/nvme/host/fabrics.c
>>>>> +++ b/drivers/nvme/host/fabrics.c
>>>>> @@ -564,21 +564,13 @@ bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
>>>>>  {
>>>>>         struct nvme_request *req = nvme_req(rq);
>>>>>
>>>>> -       /*
>>>>> -        * If we are in some state of setup or teardown only allow
>>>>> -        * internally generated commands.
>>>>> -        */
>>>>> -       if (!blk_rq_is_passthrough(rq) || (req->flags & NVME_REQ_USERCMD))
>>>> "if (!blk_rq_is_passthrough(rq))" should not delete. Because if we
>>>> delete,
>>>> the normal io will be send to target, the target can not treat the io
>>>> if the queue is not NVME_CTRL_LIVE.
>>>
>>> Sure it does; the only reason for us to deny this I/O is if the
>>> queue is not live. The controller state should only _advise_ us on
>>> whether we need to look at the queue state.
>>
>> I disagree strongly with removing the check on NVME_REQ_USERCMD.
>> We've seen cli ioctls going to the admin queue while we're in the
>> middle of controller initialization, and it has hosed the controller
>> state in some cases. We've seen commands issued before the
>> controller is in the proper state. The admin queue may be live - but
>> we don't necessarily want other io sneaking in.
>
> Can you please give an example? NVME_REQ_USERCMD should not be any
> different from any other type of I/O.
Keep in mind - in order to send any command, Fabric connect included,
the admin queue must be unquiesced.
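Concretely, every transport does something like the below before the
Connect can go out (paraphrased from memory, not an exact call site):

    /* the admin queue must accept requests before Fabrics Connect */
    blk_mq_unquiesce_queue(ctrl->admin_q);
    /* ...but once unquiesced, a stray ioctl can be queued here too */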
Here's what we have today:
At the time of the reset, or immediately during it, an ioctl is done -
it can be (and has been seen to be) queued on the request queue, as
there are no state checks at submission. When this io gets scheduled
while in the resetting or even the connecting state, it currently
fails the NVME_REQ_USERCMD check, which falls into
nvmf_fail_nonready_command, which fails back with BLK_STS_RESOURCE,
which reschedules it. This rescheduling recurs until it's tried again
with the controller state changed (usually to LIVE), which then passes
the admin-queue live check. It's good to go.
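In code terms, the loop is roughly this - my paraphrase of the stock
fabrics helpers, not the exact source (tcp shown; rdma/fc are
analogous):

    /* each transport's queue_rq() starts with roughly: */
    bool queue_ready = test_bit(NVME_TCP_Q_LIVE, &queue->flags);

    if (!nvmf_check_ready(&queue->ctrl->ctrl, rq, queue_ready))
        return nvmf_fail_nonready_command(&queue->ctrl->ctrl, rq);

    /* and nvmf_fail_nonready_command() keeps a retryable io alive: */
    if (ctrl->state != NVME_CTRL_DELETING &&
        ctrl->state != NVME_CTRL_DEAD &&
        !blk_noretry_request(rq) && !(rq->cmd_flags & REQ_NVME_MPATH))
        return BLK_STS_RESOURCE;   /* blk-mq requeues and retries later */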
With your patch:
The ioctl will at least be requeued with BLK_STS_RESOURCE until after
the Fabric Connect has completed (Connect is explicitly allowed by the
state-specific check), as queue->live is hopefully still false. It may
actually be issued while we're in the resetting state, only to be
terminated afterward. All of the transports set the queue live flag as
of Fabric Connect completing. If the ioctl is retried at that point,
it may be sent on the admin queue immediately - before or in between
any of the commands issued by nvme_enable_ctrl(), including before the
controller is enabled, before nvme_init_identify(), and before
nvme_start_ctrl() has been called. Hopefully none of these commands
try to set/change configuration. They simply shouldn't be allowed
until after the controller has transitioned to LIVE. There's no need
to allow them in-between.
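For clarity, with your patch the ready check reduces to roughly this -
my reading of the proposal, not the exact hunk:

    bool __nvmf_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
            bool queue_live)
    {
        struct nvme_request *req = nvme_req(rq);

        switch (ctrl->state) {
        case NVME_CTRL_CONNECTING:
            /* only the Fabrics Connect command itself may pass */
            if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) &&
                req->cmd->fabrics.fctype == nvme_fabrics_type_connect)
                return true;
            break;
        case NVME_CTRL_DEAD:
            return false;
        default:
            break;
        }

        /* everything else - ioctls included - is gated only by queue_live */
        return queue_live;
    }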
>
> Also, do note that pci does allow queueing any type of command based
> on the queue state only. Fabrics should be slightly different,
> because we have the CONNECTING state where we want to let only the
> connect command be issued.
>
That doesn't make me feel any better. PCI bit-bangs register writes;
the odds they can fail/be discarded on the "fabric" are near zero.
Admin commands - sending, processing, and responding - are much, much
faster than on a fabric, where the cmd can be dropped/delayed by
fabric routing. The odds of having core-generated commands fail while
reconnecting/reinitializing are much, much lower. All of which says:
fabrics have a much higher command and inter-command latency than pcie
does - so pcie is going to have a harder time seeing these cases.
>
>> As for the blk_rq_is_passthrough check - I guess I can see it being
>> based on the queue state, and the check looks ok (we should never
>> see !blk_rq_is_passthrough on the admin q).
>> But...
>> - I don't know why it was that important to change it. On the
>> connecting path, all you're doing is letting io start flowing before
>> all the queues have been created. Did you really need to start that
>> much sooner?
>
> The issue is that a controller in RESETTING state will have requests
> that are being issued, and if we don't let them pass through, they
> will hang around forever being requeued, preventing queue freeze from
> completing.
Is it really "forever", or is it until the controller loss timeout?
Granted, 10 minutes does seem like forever.
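For anyone following along, the hang is in the freeze wait: entered
requests hold a queue usage reference, and blk_mq_freeze_queue_wait()
can't drain while they keep being requeued. Roughly (stock helper,
paraphrased):

    void nvme_wait_freeze(struct nvme_ctrl *ctrl)
    {
        struct nvme_ns *ns;

        down_read(&ctrl->namespaces_rwsem);
        list_for_each_entry(ns, &ctrl->namespaces, list)
            /* blocks until the queue's usage counter drains to zero */
            blk_mq_freeze_queue_wait(ns->queue);
        up_read(&ctrl->namespaces_rwsem);
    }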
>
> See bug report from Krishnamraju Eraparaju.
I'll go look.
>
>>
>> - But on the resetting path, or deleting cases, you've added a
>> condition now where the controller state was changed, but there is a
>> delay before the transport updates the queue's live flag. It's
>> common practice in the transports to change state, then schedule a
>> work element to perform the actual teardown. Why would you want io
>> to continue to flow during that window? This may bring out other
>> problems we've avoided in the past.
>
> What are you referring to? the change here? the controller reset must
> allow requests that came in before we started queue freeze to pass,
> otherwise freeze will never complete.
I'm referring to the io's that now pass through with the patch, as the
controller state is no longer blocking them (e.g. the removal of the
!blk_rq_is_passthrough and NVME_REQ_USERCMD checks). They will pass
the ready checks and go out on the wire, creating more active io that
has to be deleted when the reset finally quiesces and terminates the
active io list. For FC, we have real link traffic for any io that gets
started, so it raises the demands. Should be fine, but it is creating
more load at a point where it could have been blocked.
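To make the window concrete - illustrative ordering only, not verbatim
driver code:

    nvme_change_ctrl_state(ctrl, NVME_CTRL_RESETTING); /* state flips now */
    queue_work(nvme_reset_wq, &ctrl->reset_work);      /* teardown runs later */

    /*
     * Until reset_work quiesces the queues and clears the transport's
     * queue-live flag, nvmf_check_ready() still sees queue_live == true,
     * so new io passes the ready check and goes out on the wire - only
     * to be terminated when the teardown finally runs.
     */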
-- james