[PATCH 0/2 v2] Fix double completing a request

Israel Rukshin israelr at mellanox.com
Sun Apr 8 07:08:37 PDT 2018


This patch series fixes two bugs that were reproduced while hitting a
block mq timeout (controller reset).

The first bug is a warning of the block layer:

 WARNING: CPU: 9 PID: 563 at block/blk-mq.c:534 __blk_mq_complete_request
 Workqueue: kblockd blk_mq_timeout_work
 RIP: 0010:__blk_mq_complete_request+0x154/0x160
 Call Trace:
  bt_iter+0x43/0x50
  blk_mq_queue_tag_busy_iter+0xfb/0x230
  ? blk_mq_complete_request+0x80/0x80
  ? blk_mq_complete_request+0x80/0x80
  ? __call_rcu.constprop.72+0x170/0x1c0
  blk_mq_timeout_work+0xf6/0x1e0

From the code:
WARN_ON_ONCE(blk_mq_rq_state(rq) != MQ_RQ_IN_FLIGHT);

The second bug is a NULL deref of a request mr:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
 IP: __nvme_rdma_recv_done.isra.48+0x1ba/0x300 [nvme_rdma]
 Call Trace:
  <IRQ>
  nvme_rdma_recv_done+0x12/0x20 [nvme_rdma]
  __ib_process_cq+0x58/0xb0 [ib_core]
  ib_poll_handler+0x1d/0x70 [ib_core]
  irq_poll_softirq+0x98/0xf0
  __do_softirq+0xbc/0x1c0
  irq_exit+0x9a/0xb0
  do_IRQ+0x4c/0xd0
  common_interrupt+0x90/0x90
  </IRQ>


These two bugs are related: both happen because we complete the requests
from several places:
 - rdma completions
 - block mq reset work
 - nvme abort commands
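
As an illustration only, here is a minimal userspace sketch of why several
independent completers are a problem. It is not kernel code: the names
model_request, model_start and model_try_complete are hypothetical, and the
compare-and-swap is just a way to model "only one completer may win". It is
not what the patches do. The second caller finds the request no longer
in flight, which is the same condition the WARN_ON_ONCE quoted above checks.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical userspace model of a request's lifecycle; not kernel code. */
enum rq_state { RQ_IDLE, RQ_IN_FLIGHT, RQ_COMPLETE };

struct model_request {
	_Atomic int state;
};

/* Mark the request as issued to the (imaginary) transport. */
static void model_start(struct model_request *rq)
{
	atomic_store(&rq->state, RQ_IN_FLIGHT);
}

/*
 * Returns true only for the caller that moves the request from
 * IN_FLIGHT to COMPLETE. Any later caller sees a state other than
 * IN_FLIGHT and gets false: that is a double completion attempt.
 */
static bool model_try_complete(struct model_request *rq)
{
	int expected = RQ_IN_FLIGHT;

	return atomic_compare_exchange_strong(&rq->state, &expected,
					      RQ_COMPLETE);
}
```

If an rdma completion and the timeout work both call model_try_complete()
on the same request, exactly one of them gets true.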

The first commit prevents the block layer from completing the request.
Those requests will be completed by the nvme abort mechanism.
The second commit fixes the race between rdma completions and
nvme abort commands.
It fixes the race by flushing all the rdma completions before
starting the abort commands mechanism.
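
The ordering described above can be sketched with a small single-threaded
model. This is an assumption-laden illustration, not the nvme-rdma code:
struct model_rq and the model_* helpers are hypothetical, and how the real
driver flushes its completions is not shown in this cover letter. The point
is only that once pending completions are drained first, the abort step
touches just the requests that are still in flight, so each request is
finished exactly once.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model; these are not the nvme-rdma data structures. */
struct model_rq {
	bool in_flight;      /* still owned by the driver */
	bool cqe_pending;    /* a transport completion is queued for it */
	int  times_finished; /* should end up as exactly 1 */
};

/* Step 1: process every completion that is already queued. */
static void model_drain_completions(struct model_rq *rqs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (rqs[i].cqe_pending && rqs[i].in_flight) {
			rqs[i].cqe_pending = false;
			rqs[i].in_flight = false;
			rqs[i].times_finished++;
		}
	}
}

/* Step 2: abort only what is still in flight after the drain. */
static void model_abort_remaining(struct model_rq *rqs, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (rqs[i].in_flight) {
			rqs[i].in_flight = false;
			rqs[i].times_finished++;
		}
	}
}

/* Recovery in the order the cover letter describes: flush, then abort. */
static void model_error_recovery(struct model_rq *rqs, size_t n)
{
	model_drain_completions(rqs, n);
	model_abort_remaining(rqs, n);
}
```

With two requests that have a completion pending and two that do not,
model_error_recovery() leaves every request with times_finished == 1.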

Changes from v1:
 - Added cover letter

Israel Rukshin (2):
  nvme-rdma: Fix race between queue timeout and error recovery
  nvme-rdma: Fix race at error recovery

 drivers/nvme/host/rdma.c | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

-- 
1.8.3.1



