[PATCH] nvme-fabrics: allow to queue requests for live queues

Sagi Grimberg sagi at grimberg.me
Tue Jul 28 16:38:57 EDT 2020


>> Can you please give an example? NVME_REQ_USERCMD should not be any
>> different from any other type of I/O.
> 
> Keep in mind  - in order to send any command, Fabric connect included, 
> the admin queue must be unquiesced.
> 
> Here's what we have today:
> At the time of the reset, or immediately during the reset, an ioctl is
> done - it can (and has been seen to) be queued on the request queue as
> there are no state checks. When this io gets scheduled, as of the
> resetting or even the connecting state, it currently fails the
> NVME_REQ_USERCMD check, which falls back into
> nvmf_fail_nonready_command, which fails back with BLK_STS_RESOURCE,
> which reschedules it. This rescheduling recurs until it is tried
> again with the controller state changed (usually LIVE), which then
> passes the admin-queue live check. It's good to go.
> 
> With your patch:
> The ioctl will at least be requeued with BLK_STS_RESOURCE until after 
> the Fabric Connect is completed (Connect is explicitly allowed by the 
> state-specific check) as hopefully queue->live is false. It may actually 
> be issued if we're in this resetting state only to be terminated 
> afterward. All of the transports set the queue live flag as of Fabric 
> Connect complete. If the ioctl is retried as of that point - it may be 
> immediately sent on the admin queue - before or in-between any of the
> commands issued by nvme_enable_ctrl() - including before the controller
> is enabled, before nvme_init_identify(), and before nvme_start_ctrl()
> has been called. Hopefully none of these commands try to set/change
> configuration. They simply shouldn't be allowed until after the
> controller has transitioned to LIVE. There's no need to allow them 
> in-between.

What you are describing is correct, that is why we have connect_q to
save us from it for I/O queues.
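
Just to illustrate what connect_q buys us (a minimal sketch, not any
specific transport's code; sketch_alloc_connect_q is a made-up name):
the I/O-queue Fabric Connect is submitted on a dedicated request queue
allocated from the same tag set, so the regular I/O queues never have
to be unquiesced just to carry the connect, and nothing can sneak out
ahead of it.

#include <linux/blk-mq.h>
#include <linux/err.h>
#include "nvme.h"	/* struct nvme_ctrl, which carries connect_q */

/*
 * Sketch only: allocate a dedicated connect_q from the I/O tag set.
 * The connect command for each I/O queue is issued on connect_q while
 * the regular I/O request queues stay quiesced/frozen until the
 * transport marks the queue live.
 */
static int sketch_alloc_connect_q(struct nvme_ctrl *ctrl,
				  struct blk_mq_tag_set *set)
{
	ctrl->connect_q = blk_mq_init_queue(set);
	if (IS_ERR(ctrl->connect_q))
		return PTR_ERR(ctrl->connect_q);
	return 0;
}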

The problem I have with this code is that checking for NVME_REQ_USERCMD
is a big hammer that also lands on what can be normal I/O
(e.g. nvme read/write/write-zeroes), and we must not rely on this
indication to prevent what you are describing.
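
To make it concrete, the check we are arguing about behaves roughly
like this (a simplified sketch of the ready logic, not a verbatim copy
of the fabrics helpers); returning false here is what sends the
request back with BLK_STS_RESOURCE (or fails it outright) instead of
letting it go out:

#include <linux/blk-mq.h>
#include "nvme.h"	/* nvme_req(), NVME_REQ_USERCMD, controller states */

/* Simplified sketch of the ready check under discussion. */
static bool sketch_check_ready(struct nvme_ctrl *ctrl, struct request *rq,
			       bool queue_live)
{
	struct nvme_request *req = nvme_req(rq);

	if (likely(ctrl->state == NVME_CTRL_LIVE))
		return true;

	/*
	 * The big hammer: any user/passthrough command is bounced while
	 * the controller is not LIVE, even if it is just a read, write
	 * or write-zeroes issued through the passthrough path.
	 */
	if (req->flags & NVME_REQ_USERCMD)
		return false;

	/* in CONNECTING, only the fabrics connect command may go out */
	if (ctrl->state == NVME_CTRL_CONNECTING) {
		if (blk_rq_is_passthrough(rq) && nvme_is_fabrics(req->cmd) &&
		    req->cmd->fabrics.fctype == nvme_fabrics_type_connect)
			return true;
		return false;
	}

	/* everything else is allowed only on a live queue */
	return queue_live;
}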

It may be (just maybe) OK to check NVME_REQ_USERCMD only for the admin
queue, but we are really circling around the fact that we cannot
reliably guarantee that the admin connect goes out first, because we
have to unquiesce the admin queue before we issue the admin connect,
and a stray command might go into execution before we submit the admin
connect.
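
The window I am worried about looks roughly like this in every
transport's admin queue setup (a sketch of the ordering only;
sketch_start_admin_queue is a made-up name, the two helpers it calls
are real):

#include <linux/blk-mq.h>
#include "nvme.h"
#include "fabrics.h"	/* nvmf_connect_admin_queue() */

/*
 * Sketch of the ordering problem: the admin queue must be unquiesced
 * before the admin connect can be issued, so anything already sitting
 * on admin_q may be dispatched in this window, ahead of the connect.
 */
static int sketch_start_admin_queue(struct nvme_ctrl *ctrl)
{
	blk_mq_unquiesce_queue(ctrl->admin_q);	/* the window opens here... */

	/*
	 * ...a stray request (e.g. a user ioctl queued while we were
	 * RESETTING) may now go into execution before the connect below.
	 */
	return nvmf_connect_admin_queue(ctrl);	/* queue marked live after */
}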

>> Also, do note that pci allows queueing any type of command based
>> on the queue state only. fabrics should be slightly different because
>> we have the CONNECTING state where we want to allow only the connect
>> command to be issued.
>>
> 
> That doesn't make me feel any better. PCI bit-bangs register writes;
> the odds they can fail/be discarded on the "fabric" are near zero. Admin
> Commands - sending/processing/responding - are much, much faster than
> on a fabric, where the cmd can be dropped/delayed by fabric routing.
> Odds of having core-generated commands fail while
> reconnecting/reinitializing are much, much lower. All of which says:
> fabrics have a much higher command and inter-command latency than pcie
> does, so pcie is going to have a harder time seeing these cases.

Timing here is not the issue. We should take a step back and understand
that we fail any user command for a very specific reason. We should
solve the root cause and not paper over it.

>>> As for the blk_rq_is_passthrough check - I guess I can see it being 
>>> based on the queue state, and the check looks ok  (we should never 
>>> see !blk_rq_is_passthrough on the admin q).
>>> But...
>>> - I don't know why it was that important to change it. On the 
>>> connecting path, all you're doing is letting io start flowing before 
>>> all the queues have been created. Did you really need to start that
>>> much sooner?
>>
>> The issue is that a controller in the RESETTING state will have
>> requests that are being issued, and if we don't let them pass through,
>> they will hang around forever being requeued, preventing the queue
>> freeze from completing.
> 
> is it really "forever" or is it until the controller loss timeout? granted,
> 10 minutes does seem like forever

ctrl_loss_tmo can be 1000000 years; it cannot rescue us.

>>> - But on the resetting path, or deleting cases, you've added a 
>>> condition now where the controller state was changed, but there was a 
>>> delay before the transport marked the queue live. It's common 
>>> practice in the transports to change state then schedule a work 
>>> element to perform the actual state change. Why would you want io to
>>> continue to flow during that window? This may bring out other
>>> problems we've avoided in the past.
>>
>> What are you referring to? the change here? the controller reset must
>> allow requests that came in before we started queue freeze to pass,
>> otherwise freeze will never complete.
> 
> I'm referring to the io's that now pass through with the patch as the 
> controller state is no longer blocking them (e.g. the removal of the 
> !blk_rq_is_passthrough and NVME_REQ_USERCMD checks).

First of all, you have a bug hiding there if blk_mq_update_nr_hw_queues
actually ever does anything, because it won't be able to reliably freeze
the queue (the !blk_rq_is_passthrough check keeps the commands
endlessly rescheduling themselves, never completing, hence the queue
cannot be frozen). This is what started all of this: because today we
support multiple queue maps, blk_mq_update_nr_hw_queues always ends up
freezing the queue, so this can happen much more frequently.
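
Schematically the hang is just the freeze reference count never
draining (not actual driver code, only the shape of the problem): a
request that keeps being bounced with BLK_STS_RESOURCE still holds its
q_usage_counter reference, so the freeze that blk_mq_update_nr_hw_queues
performs can never finish.

#include <linux/blk-mq.h>

/*
 * Schematic only: blk_mq_freeze_queue() waits for q_usage_counter to
 * drain, but a request that is endlessly requeued with
 * BLK_STS_RESOURCE never completes and never drops its reference, so
 * we block here forever.
 */
static void sketch_freeze_for_remap(struct request_queue *q)
{
	blk_mq_freeze_queue(q);		/* never returns in that scenario */
	/* ... remap/resize hw queues here ... */
	blk_mq_unfreeze_queue(q);
}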

> They will pass the
> ready checks and go out on the wire, creating more active io that has to 
> be deleted when the reset finally quiesces and terminates the active io 
> list. For FC, we have real link traffic for any io that gets started,
> so it raises the demands. Should be fine, but it is creating more load
> at a point where it could have been blocked.

As I said, freezing the queue is impossible when we let these commands
hang in a limbo state, and a queue freeze may occur in a lot of places,
so we must either let them pass or fence them when starting the freeze.
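
To spell out the fencing option (a hypothetical sketch built on the
existing fail-nonready pattern; the "fenced" flag is made up for
illustration): once the freeze starts we stop returning
BLK_STS_RESOURCE and instead fail the stuck commands with a host path
error, so the usage counters can drain and the freeze can complete.

#include <linux/blk-mq.h>
#include "nvme.h"	/* nvme_req(), nvme_complete_rq() */

/* Hypothetical sketch of fencing nonready commands at freeze start. */
static blk_status_t sketch_fail_nonready(struct request *rq, bool fenced)
{
	if (!fenced)
		return BLK_STS_RESOURCE;	/* requeue and retry later */

	/* fenced: complete now so the queue freeze can make progress */
	nvme_req(rq)->status = NVME_SC_HOST_PATH_ERROR;
	blk_mq_start_request(rq);
	nvme_complete_rq(rq);
	return BLK_STS_OK;
}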


