[PATCH v3 for-5.8-rc 0/5] address deadlocks in high stress ns scanning and ana updates

Sagi Grimberg sagi at grimberg.me
Wed Jun 24 04:53:07 EDT 2020


Changes from v2:
- removed RFC patch #6 from the series
- patch #1: updated change log
- patch #2: renaming and minor restructuring
- patch #3: clarified change log and moved srcu and requeue_work scheduling
  outside of the head->lock
- patch #4: renamed flag to NVME_NSHEAD_DISK_LIVE

Changes from v1:
- Fixed compilation error in patch #4
- Added patch #5 to resolve a use-after-free condition

Hey All,

The following patches address deadlocks observed while stress testing a
connect/disconnect storm concurrently with rapid ANA path switches (paths
may transition between live<->inaccessible) on a large number of
namespaces (100+).

The test mainly triggers three flows:
1. ongoing ns scanning, in the presence of concurrent ANA path state changes
   and controller removals (disconnect).
2. ongoing ns scanning (or ANA processing) in the presence of concurrent
   controller removal (disconnect).
3. ongoing ANA processing in the presence of concurrent controller removal
   (disconnect).

What we observed is that when we disconnect while scan_work and/or ana_work
are running, we can easily deadlock. The main reason is that scan_work and
ana_work may both register the gendisk, which triggers I/O (partition scans).
Given that a controller removal (disconnect) may be running at the same time,
that I/O may block. The problem with blocking head->disk I/O under the locks
taken by both ana_work and scan_work is that no other path can then update its
path state and, by doing so, unblock the blocked I/O.
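
To make the pattern above concrete, here is a minimal user-space sketch. It is
a model only, not code from this series: head_lock, path_live, scan_ready and
scan_completes() are hypothetical names, the "I/O" is just a wait on an event,
and a timeout stands in for what would be an indefinite hang in the kernel.
Moving the blocking step outside the lock is shown as one generic way to break
the cycle; the individual patches may do it differently.

```python
import threading


def scan_completes(io_under_lock: bool, timeout: float = 0.2) -> bool:
    """Model of scan_work racing with a path state update.

    Returns True if the simulated partition-scan I/O completes, and False
    if it is still blocked when the timeout expires.
    """
    head_lock = threading.Lock()      # hypothetical stand-in for the shared lock
    path_live = threading.Event()     # set once a path state update has run
    scan_ready = threading.Event()    # orders the two threads deterministically
    result = []

    def blocking_io() -> bool:
        # Simulated disk registration I/O: it can only finish after some
        # path becomes live again.
        scan_ready.set()
        return path_live.wait(timeout)

    def scan_work() -> None:
        if io_under_lock:
            # Problematic pattern: block on I/O while holding the lock.
            with head_lock:
                result.append(blocking_io())
        else:
            # Alternative: touch shared state under the lock, then issue
            # the blocking I/O after the lock has been dropped.
            with head_lock:
                pass
            result.append(blocking_io())

    def ana_work() -> None:
        # The path state update needs the same lock before it can
        # unblock the pending I/O.
        scan_ready.wait()
        with head_lock:
            path_live.set()

    threads = [threading.Thread(target=scan_work),
               threading.Thread(target=ana_work)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```

In this model, scan_completes(io_under_lock=True) reports the I/O as stuck,
because the update that would unblock it is waiting on the lock that the
blocked thread holds. With io_under_lock=False the update runs and the I/O
completes.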

With this patchset applied (plus the RFC patch that we dropped from the
series), the test passes without any deadlocks.

Anton Eidelman (2):
  nvme-multipath: fix deadlock between ana_work and scan_work
  nvme-multipath: fix deadlock due to head->lock

Sagi Grimberg (3):
  nvme: fix possible deadlock when I/O is blocked
  nvme: don't protect ns mutation with ns->head->lock
  nvme-multipath: fix bogus request queue reference put

 drivers/nvme/host/core.c      |  1 -
 drivers/nvme/host/multipath.c | 46 ++++++++++++++++++++++-------------
 drivers/nvme/host/nvme.h      |  2 ++
 3 files changed, 31 insertions(+), 18 deletions(-)

-- 
2.25.1



