parallel nvme format
Niklas Cassel
Niklas.Cassel at wdc.com
Thu Dec 14 05:56:05 PST 2023
Hello guys,
We have observed the following on linux stable v6.6.x.
(We tried to backport commit 839a40d1e730 ("nvme: fix deadlock between reset
and scan"), but it didn't fix the problem.)
What happens is that we are formatting 8 different namespaces in parallel.
[Wed Nov 29 16:43:56 2023] INFO: task nvme:18411 blocked for more than 120 seconds.
[Wed Nov 29 16:43:56 2023] Not tainted 6.6.2 #1
[Wed Nov 29 16:43:56 2023] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
+disables this message.
[Wed Nov 29 16:43:56 2023] task:nvme state:D stack:0 pid:18411 ppid:18405 flags:0x00004000
[Wed Nov 29 16:43:56 2023] Call Trace:
[Wed Nov 29 16:43:56 2023] <TASK>
[Wed Nov 29 16:43:56 2023] __schedule+0x34a/0xef0
[Wed Nov 29 16:43:56 2023] ? preempt_count_add+0x7c/0xc0
[Wed Nov 29 16:43:56 2023] ? _raw_spin_lock_irqsave+0x29/0x60
[Wed Nov 29 16:43:56 2023] schedule+0x6b/0xf0
[Wed Nov 29 16:43:56 2023] schedule_timeout+0x233/0x290
[Wed Nov 29 16:43:56 2023] ? __timer_delete_sync+0x31/0x50
[Wed Nov 29 16:43:56 2023] ? schedule_timeout+0x1a3/0x290
[Wed Nov 29 16:43:56 2023] wait_for_completion+0x8b/0x140
[Wed Nov 29 16:43:56 2023] __flush_work+0x139/0x240
[Wed Nov 29 16:43:56 2023] ? __pfx_wq_barrier_func+0x10/0x10
[Wed Nov 29 16:43:56 2023] flush_work+0x10/0x20
[Wed Nov 29 16:43:56 2023] nvme_passthru_end+0x60/0x170 [nvme_core]
[Wed Nov 29 16:43:56 2023] nvme_submit_user_cmd+0x162/0x170 [nvme_core]
[Wed Nov 29 16:43:56 2023] nvme_user_cmd.constprop.0+0x156/0x1d0 [nvme_core]
[Wed Nov 29 16:43:56 2023] nvme_dev_ioctl+0x12c/0x2a0 [nvme_core]
[Wed Nov 29 16:43:56 2023] __x64_sys_ioctl+0x95/0xd0
[Wed Nov 29 16:43:56 2023] do_syscall_64+0x38/0x90
[Wed Nov 29 16:43:56 2023] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
flush_work() waits for more than 120 seconds for scan_work to complete,
but even after 120 seconds, scan_work hasn't completed.
Inspecting nvme_scan_work(), the only thing that can cause it not to return
after 120 seconds is if it is waiting in mutex_lock() to acquire the scan mutex.
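For reference, the relevant locking in nvme_scan_work() looks roughly like this
in v6.6 (paraphrased, with most of the function elided):

static void nvme_scan_work(struct work_struct *work)
{
	struct nvme_ctrl *ctrl =
		container_of(work, struct nvme_ctrl, scan_work);
	...
	mutex_lock(&ctrl->scan_lock);	/* blocks for as long as a format holds it */
	/* scan/validate namespaces */
	mutex_unlock(&ctrl->scan_lock);
}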
What appears to be happening is:
format command #1 completes and calls nvme_passthru_end().
nvme_passthru_end() then calls:
mutex_unlock(&ctrl->scan_lock);
...
nvme_queue_scan(ctrl);
flush_work(&ctrl->scan_work);
Directly after format command #1 unlocks the mutex,
format command #2 will get the mutex in nvme_passthru_start().
Format command #2 will release the mutex in nvme_passthru_end(),
but only after the format command has completed.
The default timeout for the nvme format command via nvme-cli is 600 seconds,
so it is easy to imagine that the format command will take longer than 120
seconds.
So the scan work queued at the end of format command #1 will eventually get to
run, but only after the second nvme format command has completed.
(The queued scan work will finally be able to acquire the mutex once
nvme_passthru_end() for the second format command has released it.)
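To make the ordering concrete, here is a small userspace model of the same
interleaving (not kernel code: pthreads instead of the workqueue, a plain
mutex standing in for ctrl->scan_lock, pthread_join() standing in for
flush_work(), and a 5 second sleep standing in for a long Format NVM command):

/* cc -o scan_race scan_race.c -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t scan_lock = PTHREAD_MUTEX_INITIALIZER;
static double t0;

static double now(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* models nvme_scan_work(): needs scan_lock before it can do anything */
static void *scan_work(void *arg)
{
	printf("[%4.1fs] scan_work: waiting for scan_lock\n", now() - t0);
	pthread_mutex_lock(&scan_lock);
	printf("[%4.1fs] scan_work: got scan_lock, scanning\n", now() - t0);
	pthread_mutex_unlock(&scan_lock);
	return NULL;
}

/* models format #2: takes scan_lock in start(), releases it in end() */
static void *format2(void *arg)
{
	pthread_mutex_lock(&scan_lock);		/* nvme_passthru_start() */
	printf("[%4.1fs] format #2: got scan_lock, formatting\n", now() - t0);
	sleep(5);				/* long-running Format NVM */
	pthread_mutex_unlock(&scan_lock);	/* nvme_passthru_end() */
	printf("[%4.1fs] format #2: done, released scan_lock\n", now() - t0);
	return NULL;
}

int main(void)
{
	pthread_t scan, fmt2;

	t0 = now();

	/* format #1 finished: nvme_passthru_end() drops scan_lock ... */
	pthread_mutex_lock(&scan_lock);
	printf("[%4.1fs] format #1: done, releasing scan_lock\n", now() - t0);
	pthread_mutex_unlock(&scan_lock);

	/* ... and format #2 immediately wins the race for the mutex */
	pthread_create(&fmt2, NULL, format2, NULL);
	sleep(1);

	/* nvme_queue_scan() + flush_work(): format #1 now waits for the scan */
	pthread_create(&scan, NULL, scan_work, NULL);
	printf("[%4.1fs] format #1: flushing scan_work\n", now() - t0);
	pthread_join(scan, NULL);
	printf("[%4.1fs] format #1: flush done, blocked for format #2's whole format\n",
	       now() - t0);

	pthread_join(fmt2, NULL);
	return 0;
}

The join in "format #1" only returns after ~5 seconds, i.e. once "format #2"
releases the lock; with a real format that takes minutes, that is exactly the
flush_work() stall in the hung task trace above.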
I don't think that formatting multiple namespaces in parallel is an
unreasonable thing to do, so I do not think that we should scare our
users with "hung task timeouts".
My first thought was that perhaps we could do something like:
diff --git a/drivers/nvme/host/core.c b/drivers/nvme/host/core.c
index 46a4c9c5ea96..811d40cf23f6 100644
--- a/drivers/nvme/host/core.c
+++ b/drivers/nvme/host/core.c
@@ -1113,6 +1113,7 @@ u32 nvme_passthru_start(struct nvme_ctrl *ctrl, struct nvme_ns *ns, u8 opcode)
* effects say only one namespace is affected.
*/
if (effects & NVME_CMD_EFFECTS_CSE_MASK) {
+ flush_work(&ctrl->scan_work);
mutex_lock(&ctrl->scan_lock);
mutex_lock(&ctrl->subsys->lock);
nvme_mpath_start_freeze(ctrl->subsys);
But that will not work, because there is still a race window between the time
nvme_passthru_end() unlocks the mutex and the time it queues the scan work.
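Spelled out as an interleaving (even with the flush added in nvme_passthru_start()):

	format #1: nvme_passthru_end()   -> mutex_unlock(&ctrl->scan_lock)
	format #2: nvme_passthru_start() -> flush_work()  (scan_work is idle, returns)
	format #2: nvme_passthru_start() -> mutex_lock(&ctrl->scan_lock)
	format #1: nvme_passthru_end()   -> nvme_queue_scan() + flush_work()
	           -> scan_work blocks on scan_lock until format #2 completes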
I guess, in the case that nvme_passthru_end() has to perform a scan, we want
to unlock the scan mutex only after the scan has actually been performed, but
right now the scan runs from the workqueue (queued via nvme_queue_scan()) and
acquires the mutex itself...
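Purely as an illustration of that idea (nvme_scan_locked() below is a
hypothetical helper that would perform the scan with scan_lock already held,
and scan_needed stands for whatever condition queues the scan today; neither
exists in the current code):

	/* sketch of nvme_passthru_end(), not a real patch */
	if (effects & NVME_CMD_EFFECTS_CSE_MASK) {
		/* ... undo the freeze taken in nvme_passthru_start() ... */
		if (scan_needed)
			nvme_scan_locked(ctrl);	/* hypothetical: scan with scan_lock held */
		mutex_unlock(&ctrl->scan_lock);	/* drop the lock only after the scan */
	}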
Thoughts?
Kind regards,
Niklas