[PATCH v2 4/4] nvme: code command_id with a genctr for use-after-free validation

Max Gurtovoy mgurtovoy at nvidia.com
Wed May 26 02:26:02 PDT 2021


On 5/26/2021 11:41 AM, Sagi Grimberg wrote:
>
>>>> The bad controller should be fixed.
>>>>
>>>> In the past, I've sent patches that check that the sqid in the nvme
>>>> cqe matches, to protect against faulty drives that might send the
>>>> completion on a wrong msix.
>>>>
>>>> My patch wasn't accepted since it added an additional "if" in the
>>>> fast path.
>>>>
>>>> Now we're adding many more operations in the fast path because of a
>>>> buggy ctrl?
>>> I shared the same performance concern on v1 of this series. I haven't
>>> been able to test this one yet (I only have emulation for two more
>>> weeks).
>>>
>>> Hannes says the bug this catches happens frequently enough on TCP. If
>>> we don't catch it, we get a kernel panic or corruption, so we
>>> definitely need to do something. Sagi correctly noted that this type
>>> of bug is not unique to TCP (or even NVMe, for that matter), but if
>>> there is a performance impact on PCI, and no one has so far reported
>>> such an issue, I would still recommend that this type of mitigation
>>> be isolated to transports that actually observe invalid CQEs.
>>
>> Please do that if possible.
>
> I disagree with that. The original report was about TCP, but that is
> pure coincidence as far as I'm concerned; as I indicated before, there
> is nothing TCP-specific about this.
>
> Today we don't protect against such a scenario. I originally dismissed
> this issue on the grounds that the buggy controller should be fixed,
> but an argument was made that a buggy controller should not be able to
> crash the kernel and we should have a safeguard against it, which is a
> valid argument. And a crash is not the worst that can happen: with
> slightly different timing this could have caused silent data
> corruption. So it seems absolutely appropriate to me considering the
> possible consequences.
>
> The way I see it, we should decide where we stand here: either we
> continue to ignore this possible panic/data corruption (which I
> personally think would be wrong), or we address it, and if we do,
> we should do it in the correct place, not limit it to the observed
> component.
>
> As for the performance concerns, I'd be surprised if performance is
> noticeably impacted by 2 assignments and 2 optimized branches.
> Also, the overall nvme_request did not grow (still 32 bytes, half a
> cacheline). But let's see if measurements prove otherwise...
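
For reference, the scheme under discussion packs a small per-request
generation counter into the spare high bits of the 16-bit command_id and
checks it when the completion comes back. The sketch below is
illustrative only; the 4/12 bit split and all names are assumptions, not
the actual patch, but it shows where the extra assignment at submission
and the extra compare at completion land:

/*
 * Illustrative sketch only: the 16-bit command_id carries the blk-mq
 * tag in the low bits and a small generation counter in the high bits.
 * The controller echoes command_id back in the CQE, so a completion
 * that refers to a reused tag shows up as a generation mismatch.
 */
#include <stdint.h>

#define CID_GEN_SHIFT	12		/* 4 generation bits */
#define CID_GEN_MASK	0xfu
#define CID_TAG_MASK	0xfffu		/* 12 tag bits */

struct req_state {
	uint8_t genctr;			/* bumped each time the tag is reissued */
};

/* Submission side: one extra increment/assignment to fold the generation in. */
static uint16_t cid_encode(struct req_state *req, uint16_t tag)
{
	req->genctr++;
	return (uint16_t)(((req->genctr & CID_GEN_MASK) << CID_GEN_SHIFT) |
			  (tag & CID_TAG_MASK));
}

/* Completion side: one extra compare against the stored generation. */
static int cid_is_current(const struct req_state *req, uint16_t cid)
{
	return ((cid >> CID_GEN_SHIFT) & CID_GEN_MASK) ==
	       (req->genctr & CID_GEN_MASK);
}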

Of course this will not have a big impact, but that is not the point.

The point is that I don't understand the reasoning here.

Why didn't we defend against a buggy controller with the code below? (We
had a bug in our NVMe SNAP controller that caused this, so I decided to
contribute this case upstream.)

diff --git a/drivers/nvme/host/pci.c b/drivers/nvme/host/pci.c
index 5a0bf6a..ab7ff34 100644
--- a/drivers/nvme/host/pci.c
+++ b/drivers/nvme/host/pci.c
@@ -983,6 +983,13 @@ static inline void nvme_handle_cqe(struct nvme_queue *nvmeq, u16 idx)
         volatile struct nvme_completion *cqe = &nvmeq->cqes[idx];
         struct request *req;

+       if (unlikely(le16_to_cpu(cqe->sq_id) != nvmeq->qid)) {
+               dev_warn(nvmeq->dev->ctrl.device,
+                        "got completion on sqid %d instead of sqid %d\n",
+                        nvmeq->qid, le16_to_cpu(cqe->sq_id));
+               return;
+       }
+
         if (unlikely(cqe->command_id >= nvmeq->q_depth)) {
                 dev_warn(nvmeq->dev->ctrl.device,
                         "invalid id %d completed on queue %d\n",

And why do we want to defend against buggy controllers that perform
double completions?
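
For what it's worth, once a tag has been reissued, a duplicate or late
completion under the scheme in this series carries a stale generation
and would fail validation rather than complete an unrelated request.
A minimal standalone scenario, again using an assumed 4/12 bit split
rather than the actual patch:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint16_t tag = 42;
	uint8_t genctr = 3;		/* generation at first submission */
	uint16_t cid = (uint16_t)(((genctr & 0xf) << 12) | (tag & 0xfff));

	/* First completion: the generation matches, the request completes. */
	printf("first completion valid: %d\n",
	       ((cid >> 12) & 0xf) == (genctr & 0xf));

	genctr++;			/* tag reissued for a new request */

	/* Duplicate/late completion carrying the old cid: stale generation. */
	printf("duplicate completion valid: %d\n",
	       ((cid >> 12) & 0xf) == (genctr & 0xf));
	return 0;
}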
