[PATCH v2 for-5.8-rc 0/6] address deadlocks in high stress ns scanning and ana updates
Sagi Grimberg
sagi at grimberg.me
Tue Jun 23 20:18:47 EDT 2020
Changes from v1:
- Fixed compilation error in patch #4
- Added patch #5 to resolve a use-after-free condition
Hey All,
The following patches address deadlocks observed while stress testing a
connect/disconnect storm concurrently with rapid ANA path switches (paths
transitioning between live<->inaccessible) on a large number of
namespaces (100+).
The test mainly triggers three flows:
1. ongoing ns scanning, in the presence of concurrent ANA path state changes
and controller removals (disconnect).
2. ongoing ns scanning (or ANA processing) in the presence of concurrent
controller removal (disconnect).
3. ongoing ANA processing in the presence of concurrent controller removal
(disconnect).
What we observed is that when we disconnect while scan_work and/or ana_work
are running, we can easily deadlock. The main reason is that scan_work and
ana_work may both register the gendisk, triggering I/O (partition scans).
Given that a controller removal (disconnect) may be running at the same time,
that I/O may block. The problem with blocking the head->disk I/O under the
locks taken by both ana_work and scan_work is that no other path can update
path states and, by doing so, unblock the blocked I/O.
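To make the dependency chain above concrete, here is a minimal user-space
sketch that models it as a wait-for graph and checks for a cycle. It is not
kernel code: the actor names, struct wait_graph, build_graph() and
has_deadlock() are all hypothetical, and "do not block on I/O while holding
the lock" is only assumed here as one way of breaking the cycle.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical actors in the wait-for chain; names are illustrative only. */
enum actor {
	SCAN_WORK,	/* registers the gendisk, triggering partition-scan I/O */
	HEAD_DISK_IO,	/* I/O on head->disk, blocked until a path is usable */
	PATH_UPDATE,	/* work on another path that could update path states */
	NOBODY,		/* sentinel: not waiting on anyone */
	NR_ACTORS = NOBODY
};

/* waits_for[a] is the single actor that `a` is currently blocked on. */
struct wait_graph {
	enum actor waits_for[NR_ACTORS];
};

/*
 * Follow the wait-for chain from `start`. Coming back to `start` means
 * there is a cycle through it, i.e. `start` can never make progress.
 */
static bool has_deadlock(const struct wait_graph *g, enum actor start)
{
	enum actor cur = g->waits_for[start];

	for (int steps = 0; steps < NR_ACTORS && cur != NOBODY; steps++) {
		if (cur == start)
			return true;
		cur = g->waits_for[cur];
	}
	return false;
}

/*
 * Build the graph for the scenario described above. The only difference
 * between the two variants is whether scan_work still holds the lock
 * while it is blocked on the I/O it triggered.
 */
static struct wait_graph build_graph(bool io_issued_under_lock)
{
	struct wait_graph g = {
		.waits_for = {
			/* scan_work waits for the partition scan to finish */
			[SCAN_WORK]    = HEAD_DISK_IO,
			/* the I/O waits for some path state to be updated */
			[HEAD_DISK_IO] = PATH_UPDATE,
			/* the updater needs the lock that scan_work may hold */
			[PATH_UPDATE]  = io_issued_under_lock ? SCAN_WORK
							      : NOBODY,
		},
	};
	return g;
}
```

In this model, has_deadlock() reports a cycle for build_graph(true) and none
for build_graph(false), which mirrors the observation that the blocked I/O
can only be released if the path-state update is not stuck behind the same
lock.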
With this patchset applied, the test passes without any deadlocks.
The last patch is posted as an RFC. While it solves a real problem, we are
essentially adding state to the controller without going through the normal
controller state. The reason is that the controller state also affects
ongoing mpath I/O, which is the original cause of the deadlock. We are open
to better alternatives, if any exist.
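As a rough illustration of the trade-off described above, the sketch below
keeps a side flag next to the controller state, so that setting the flag
does not change what the I/O path sees. This is not the code from the
patch: struct ctrl, CTRL_FLAG_STOP_SCAN and all helper names are
hypothetical, and what the extra state gates (here, disk registration) is
an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified controller state machine. */
enum ctrl_state {
	CTRL_LIVE,
	CTRL_DELETING,
};

/* Hypothetical side flag kept outside the state machine. */
#define CTRL_FLAG_STOP_SCAN	(1UL << 0)

struct ctrl {
	enum ctrl_state state;
	unsigned long flags;
};

/* The multipath I/O path looks only at the controller state. */
static bool path_usable_for_io(const struct ctrl *c)
{
	return c->state == CTRL_LIVE;
}

/* Scan/ANA-side work looks only at the side flag (assumed usage). */
static bool may_register_disk(const struct ctrl *c)
{
	return !(c->flags & CTRL_FLAG_STOP_SCAN);
}

/* Mark the start of removal without touching the controller state. */
static void begin_removal(struct ctrl *c)
{
	c->flags |= CTRL_FLAG_STOP_SCAN;
}
```

After begin_removal(), may_register_disk() returns false while
path_usable_for_io() is unchanged, which is the property the cover letter
is after; the cost is a second piece of state that has to be kept
consistent with the state machine by hand.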
Anton Eidelman (3):
  nvme-multipath: fix deadlock between ana_work and scan_work
  nvme-multipath: fix deadlock due to head->lock
  nvme-core: fix deadlock in disconnect during scan_work and/or ana_work
Sagi Grimberg (3):
  nvme: fix possible deadlock when I/O is blocked
  nvme: don't protect ns mutation with ns->head->lock
  nvme-multipath: fix bogus request queue reference put
 drivers/nvme/host/core.c      | 11 +++++++-
 drivers/nvme/host/multipath.c | 48 +++++++++++++++++++++++++----------
 drivers/nvme/host/nvme.h      |  3 +++
 3 files changed, 47 insertions(+), 15 deletions(-)
--
2.25.1