[PATCH v2 for-5.8-rc 0/6] address deadlocks in high stress ns scanning and ana updates

Sagi Grimberg sagi at grimberg.me
Tue Jun 23 20:18:47 EDT 2020


Changes from v1:
- Fixed compilation error in patch #4
- Added patch #5 to resolve a use-after-free condition

Hey All,

The following patches address deadlocks observed while stress testing a
connect/disconnect storm combined with rapid, concurrent ANA path switches
(paths may transition between live<->inaccessible) on a large number of
namespaces (100+).

The test mainly triggers three flows:
1. ongoing ns scanning, in the presence of concurrent ANA path state changes
   and controller removals (disconnect).
2. ongoing ns scanning (or ana processing) in the presence of concurrent
   controller removal (disconnect).
3. ongoing ANA processing in the presence of concurrent controller removal
   (disconnect).

What we observed is that when we disconnect while scan_work and/or ana_work
are running, we can easily deadlock. The main reason is that scan_work and
ana_work may both register the gendisk, which triggers I/O (partition scans).
Given that a controller removal (disconnect) may be running at the same time,
that I/O may block. The problem with blocking head->disk I/O under the locks
taken by both ana_work and scan_work is that no other path can then update
path states and, by doing so, unblock the blocked I/O.
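
To illustrate the pattern, here is a minimal userspace sketch. It is not
the driver code and not what the patches do: state_lock and every function
name below are hypothetical stand-ins, and the "I/O" is modeled as something
that can only complete if a path-state update can take the lock. Where the
real I/O would block forever, the model uses trylock and reports failure.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for a lock held by scan_work/ana_work. */
pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Models another path updating its state. The update needs state_lock,
 * so it cannot make progress while the scanner holds it. trylock is used
 * so that the model reports the conflict instead of hanging.
 */
bool path_state_update_can_run(void)
{
	if (pthread_mutex_trylock(&state_lock) != 0)
		return false;	/* lock is held: update is stuck */
	pthread_mutex_unlock(&state_lock);
	return true;
}

/* Models blocked I/O: it completes only once a path update has run. */
bool blocked_io_can_complete(void)
{
	return path_state_update_can_run();
}

/* Problematic ordering: the I/O is triggered while the lock is held. */
bool register_disk_under_lock(void)
{
	bool done;

	pthread_mutex_lock(&state_lock);
	done = blocked_io_can_complete();	/* would block forever */
	pthread_mutex_unlock(&state_lock);
	return done;
}

/* Safer ordering: finish the locked section, then trigger the I/O. */
bool register_disk_after_unlock(void)
{
	pthread_mutex_lock(&state_lock);
	/* bookkeeping only, no I/O in here */
	pthread_mutex_unlock(&state_lock);
	return blocked_io_can_complete();
}

/* Usage example: the two orderings behave differently in this model. */
void check_model(void)
{
	assert(!register_disk_under_lock());
	assert(register_disk_after_unlock());
}
```

In this model the first ordering never lets the I/O complete, while the
second does, because the path update is free to take the lock once the
scanner has dropped it.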

With this patchset applied, the test passes without any deadlocks.

The last patch is posted as an RFC. While it solves a real problem, it
essentially adds state to the controller that does not go through the normal
controller state. The reason is that the controller state would also affect
ongoing mpath I/O, which is the original cause of the deadlock. We are open
to better alternatives if any exist.

Anton Eidelman (3):
  nvme-multipath: fix deadlock between ana_work and scan_work
  nvme-multipath: fix deadlock due to head->lock
  nvme-core: fix deadlock in disconnect during scan_work and/or ana_work

Sagi Grimberg (3):
  nvme: fix possible deadlock when I/O is blocked
  nvme: don't protect ns mutation with ns->head->lock
  nvme-multipath: fix bogus request queue reference put

 drivers/nvme/host/core.c      | 11 +++++++-
 drivers/nvme/host/multipath.c | 48 +++++++++++++++++++++++++----------
 drivers/nvme/host/nvme.h      |  3 +++
 3 files changed, 47 insertions(+), 15 deletions(-)

-- 
2.25.1



