[PATCH v9] nvme-fabrics: reject I/O to offline device
Hannes Reinecke
hare at suse.de
Thu Oct 1 04:55:00 EDT 2020
On 9/29/20 5:27 PM, Victor Gladkov wrote:
> Commands get stuck while Host NVMe-oF controller is in reconnect state.
> NVMe controller enters into reconnect state when it loses connection with the
> target. It tries to reconnect every 10 seconds
> (default) until successful reconnection or until reconnect time-out is reached.
> The default reconnect time out is 10 minutes.
>
> Applications are expecting commands to complete with success or error within
> a certain timeout (30 seconds by default). The NVMe host is enforcing that
> timeout while it is connected, never the less, during reconnection, the timeout
> is not enforced and commands may get stuck for a long period or even forever.
>
> To fix this long delay due to the default timeout we introduce new session
> parameter "fast_io_fail_tmo". The timeout is measured in seconds from the
> controller reconnect, any command beyond that timeout is rejected. The new
> parameter value may be passed during 'connect'.
> The default value of -1 means no timeout (similar to current behavior).
>
> We add a new controller NVME_CTRL_FAILFAST_EXPIRED and respective
> delayed work that updates the NVME_CTRL_FAILFAST_EXPIRED flag.
>
> When the controller is entering the CONNECTING state, we schedule the
> delayed_work based on failfast timeout value. If the transition is out of
> CONNECTING, terminate delayed work item and ensure failfast_expired is
> false. If delayed work item expires then set "NVME_CTRL_FAILFAST_EXPIRED"
> flag to true.
>
> We also update nvmf_fail_nonready_command() and
> nvme_available_path() functions with check the
> "NVME_CTRL_FAILFAST_EXPIRED" controller flag.
>
> For multipath (function nvme_available_path()):
> The path will not be considered available if
> "NVME_CTRL_FAILFAST_EXPIRED" controller flag is set
> and the controller in NVME_CTRL_CONNECTING state.
> This prevents commands from getting stuck
> when available paths have tried to reconnect for too long.
>
> Signed-off-by: Victor Gladkov <victor.gladkov at kioxia.com>
> Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni at wdc.com>
> Reviewed-by: Hannes Reinecke <hare at suse.de>
>
> ---
> Changes from V8:
>
> 1. Added multipath behavior description as requested by Sagi Grimberg
> (Fri Sep 18 16:38:58 EDT 2020)
>
> Changes from V7:
>
> 1. Expanded the patch description as requested by James Smart
> (Thu Aug 13 11:00:25 EDT 2020)
>
> Changes from V6:
>
> 1. Changed according to Hannes Reinecke review:
> in nvme_start_failfast_work() and nvme_stop_failfast_work() procedures.
>
> Changes from V5:
>
> 1. Drop the "off" string option for fast_io_fail_tmo.
>
> Changes from V4:
>
> 1. Remove subsysnqn from dev_info just keep "failfast expired"
> in nvme_failfast_work().
> 2. Remove excess lock in nvme_failfast_work().
> 3. Change "timeout disabled" value to -1, '0' is to fail I/O right away now.
> 4. Add "off" string for fast_io_fail_tmo option as a -1 equivalent.
>
> Changes from V3:
>
> 1. BUG_FIX: Fix a bug in nvme_start_failfast_work() amd
> nvme_stop_failfast_work() when accessing ctrl->opts as it will fail
> for PCIe transport when is called nvme_change_ctrl_state()
> from nvme_reset_work(), since we don't set the ctrl->opts for
> PCIe transport.
> 2. Line wrap in nvme_start_failfast_work(), nvme_parse_option() and
> for macro NVMF_ALLOWED_OPTS definition.
> 3. Just like all the state change code add a switch for newly
> added state handling outside of the state machine in
> nvme_state_change().
> 4. nvme_available_path() add /* fallthru */ after if..break inside
> switch which is under list_for_each_entry_rcu().
> 5. Align newly added member nvmf_ctrl_options fast_io_fail_tmp.
> 6. Fix the tabs before if in nvme_available_path() and line wrap
> for the same.
> 7. In nvme_failfast_work() use state != NVME_CTRL_CONNECTING
> instead of == to get rid of the parentheses and avoid char > 80.
> 8. Get rid of the ";" at the end of the comment for @fast_io_fail_tmp.
> 9. Change the commit log style to match the one we have in the NVMe
> repo.
>
> Changes from V2:
>
> 1. Several coding style and small fixes.
> 2. Fix the comment for NVMF_DEF_FAIL_FAST_TMO.
> 3. Don't call functionality from the state machine.
> 4. Rename fast_fail_tmo -> fast_io_fail_tmo to match SCSI
> semantics.
>
> Changes from V1:
>
> 1. Add a new session parameter called "fast_fail_tmo". The timeout
> is measured in seconds from the controller reconnect, any command
> beyond that timeout is rejected. The new parameter value may be
> passed during 'connect', and its default value is 30 seconds.
> A value of 0 means no timeout (similar to current behavior).
> 2. Add a controller flag of "failfast_expired".
> 3. Add dedicated delayed_work that update the "failfast_expired"
> controller flag.
> 4. When entering CONNECTING, schedule the delayed_work based on
> failfast timeout value. If transition out of CONNECTING, terminate
> delayed work item and ensure failfast_expired is false.
> If delayed work item expires: set "failfast_expired" flag to true.
> 5. Update nvmf_fail_nonready_command() with check the
> "failfast_expired" controller flag.
>
> ---
> drivers/nvme/host/core.c | 49
> ++++++++++++++++++++++++++++++++++++++++++-
> drivers/nvme/host/fabrics.c | 25 +++++++++++++++++++---
> drivers/nvme/host/fabrics.h | 5 +++++
> drivers/nvme/host/multipath.c | 5 ++++-
> drivers/nvme/host/nvme.h | 3 +++
> 5 files changed, 82 insertions(+), 5 deletions(-)
>
I did some more experiments with this, and found that there are some
issues with ANA handling.
If reconnect works, but the ANA state indicates that we still can't sent
I/O (eg by still being in transitioning), we hit the 'requeueing I/O'
state despite fast_io_fail_tmo being set. Not sure if that's the
expected outcome.
For that it might be better to move the FAILFAST_EXPIRED bit into the
namespace, as then we could selectively clear the bit in
nvme_failfast_work():
@@ -151,12 +151,16 @@ EXPORT_SYMBOL_GPL(nvme_try_sched_reset);
static void nvme_failfast_work(struct work_struct *work) {
struct nvme_ctrl *ctrl = container_of(to_delayed_work(work),
struct nvme_ctrl, failfast_work);
+ struct nvme_ns *ns;
- if (ctrl->state != NVME_CTRL_CONNECTING)
- return;
-
-
- set_bit(NVME_CTRL_FAILFAST_EXPIRED, &ctrl->flags);
+ down_read(&ctrl->namespaces_rwsem);
+ list_for_each_entry(ns, &ctrl->namespaces, list) {
+ if (ctrl->state != NVME_CTRL_LIVE ||
+ (ns->ana_state != NVME_ANA_OPTIMIZED &&
+ ns->ana_state != NVME_ANA_NONOPTIMIZED))
+ set_bit(NVME_NS_FAILFAST_EXPIRED, &ns->flags);
+ }
+ up_read(&ctrl->namespaces_rwsem);
dev_info(ctrl->device, "failfast expired\n");
...and we could leave the failfast worker running even after the
controller transitioned to LIVE.
Cf the attached patch for details.
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-nvme-move-FAILFAST_EXPIRED-bit-into-namespace.patch
Type: text/x-patch
Size: 5553 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20201001/fe47591b/attachment-0001.bin>
More information about the Linux-nvme
mailing list