[PATCH] nvme-pci: fix potential I/O hang when CQ is full

Junnan Zhang zhangjn_dev at 163.com
Wed Feb 11 01:47:44 PST 2026


On Tue, 10 Feb 2026 16:57:12 +0100, Christoph Hellwig wrote:

> We can't update the CQ head before consuming the CQEs, otherwise
> the device can reuse them.  And devices must not discard completions
> when there is no completion queue entry, nvme does allow SQs and CQs
> to be smaller than the number of outstanding commands.

Updating the CQ head before consuming the CQEs would not let the device reuse
those entries: the driver can only submit new commands after a CQE has been
consumed, so the device never gets the opportunity to overwrite an entry that
has not yet been read.

Actually, the root cause of the issue is that the underlying device receives
more outstanding commands from the NVMe driver than the queue depth (q_depth)
allows, leading to a CQ-full condition.

In my environment the NVMe admin queue depth is 32, so at most 32 commands can
be outstanding concurrently. During NVMe disk removal, the driver sends
commands on the admin queue to delete all of the I/O queues. Once 32 commands
are outstanding, any additional commands must wait for earlier ones to
complete.

During NVMe interrupt handling, the current implementation first processes the
CQEs and only then updates the CQ head. The commands allocated by
nvme_delete_queue are not handled through the batch completion path in the
interrupt handler: after a CQE is consumed, its tag is released and the
upper-layer driver is notified, even though the CQ head has not yet been
written back, so from the device's point of view the completion cycle is not
finished. Upon being woken, the driver immediately submits a new command to
the SQ. When the device finishes that command and tries to write its result to
the CQ while the head is still stale, the number of commands the device has
processed exceeds the NVMe queue depth; there is no free CQ slot for the
completion, and a CQ-full error is reported.

The above process can be illustrated by the following diagram:

          driver              irq             underlying(virtual/hardware)
          ------             ------                     ------
      1. Wait for tag
                            1. Read CQE      CQ is full, wait for head update
                            2. Handle CQE
                            3. Wake up tag
                               (blk_mq_put_tag)
      2. Get tag
      3. Issue new cmd
                                              1. Process cmd
                                              2. Try write to CQ
                                              3. CQ is full, discard cmd!
                            4. Update CQ head
                                (LATE!)
      4. Cmd timeout

Best regards,
Junnan Zhang



