[PATCH v2 4/4] nvme: code command_id with a genctr for use-after-free validation

Sagi Grimberg sagi at grimberg.me
Wed May 26 01:41:17 PDT 2021


>>> The bad controller should be fixed.
>>>
>>> In the past, I've sent patches that check that the sqid in the nvme cqe
>>> matches, to protect against faulty drives that might send the completion
>>> on the wrong msix.
>>>
>>> My patch wasn't accepted since it added an additional "if" in the fast path.
>>>
>>> Now we're adding many more operations in the fast path because of a
>>> buggy ctrl?
>> I shared the same performance concern on v1 on this series. I haven't
>> been able to test this one yet (only have emulation for two more weeks).
>>
>> Hannes says the bug this catches happens frequently enough on TCP. If we
>> don't catch it, we get a kernel panic or corruption, so we definitely need
>> to do something. Sagi correctly noted this type of bug is not unique to TCP
>> (or even NVMe, for that matter), but if there is a performance impact on
>> PCI, and no one has so far reported such an issue there, I would still
>> recommend this type of mitigation be isolated to transports that actually
>> observe invalid CQEs.
>>   
>>   
> 
> Please do that if possible.

I disagree with that. The original report was about TCP, but that is
pure coincidence as far as I'm concerned; as I indicated before, there
is nothing TCP-specific about this.

Today we don't protect against such a scenario. I originally dismissed
this issue, as we should probably address the buggy controller itself,
but an argument was made that a buggy controller should not be able to
crash the kernel and we should have a safeguard against it, which is a
valid argument. And a panic is not the worst that can happen: with
slightly different timing, the same bug could have caused silent data
corruption. So the mitigation seems absolutely appropriate to me
considering the possible consequences.
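
To spell out the failure mode: the completion path trusts the
controller-supplied command_id to look up the request, so a stale or
duplicated CQE silently resolves to whatever request currently owns
that tag. A rough sketch of the unguarded path (function and field
names are approximate, not the exact driver code):

static void handle_cqe(struct nvme_queue *nvmeq, struct nvme_completion *cqe)
{
	u16 command_id = READ_ONCE(cqe->command_id);
	struct request *req;

	req = blk_mq_tag_to_rq(nvme_queue_tagset(nvmeq), command_id);
	if (unlikely(!req)) {
		/* tag not in flight at all: this we can detect today */
		dev_warn(nvmeq->dev->ctrl.device,
			 "invalid id %d completed\n", command_id);
		return;
	}

	/*
	 * But if the tag was already completed and recycled for a newer
	 * command, nothing here notices: we complete the wrong request,
	 * hence the use-after-free, or with slightly different timing,
	 * silent data corruption.
	 */
	if (!nvme_try_complete_req(req, cqe->status, cqe->result))
		nvme_pci_complete_rq(req);
}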

The way I see it, we should decide where we stand here: either we
continue to ignore this possible panic/data-corruption (which I
personally think would be wrong), or we address it, and if we do,
we should do it in the correct place, not limit it to the one
transport where it happened to be observed.

As for the performance concerns, I'd be surprised if performance is
noticeably impacted by 2 assignments and 2 optimized branches (see the
sketch after the struct layouts below). Also, the overall nvme_request
did not grow (still 32 bytes, half a cacheline). But let's see if
measurements prove otherwise...

Before patch:
struct nvme_request {
         struct nvme_command *      cmd;                  /*     0     8 */
         union nvme_result          result;               /*     8     8 */
         u8                         retries;              /*    16     1 */
         u8                         flags;                /*    17     1 */
         u16                        status;               /*    18     2 */

         /* XXX 4 bytes hole, try to pack */

         struct nvme_ctrl *         ctrl;                 /*    24     8 */

         /* size: 32, cachelines: 1, members: 6 */
         /* sum members: 28, holes: 1, sum holes: 4 */
         /* last cacheline: 32 bytes */
};

After patch:
struct nvme_request {
         struct nvme_command *      cmd;                  /*     0     8 */
         union nvme_result          result;               /*     8     8 */
         u8                         genctr;               /*    16     1 */
         u8                         retries;              /*    17     1 */
         u8                         flags;                /*    18     1 */

         /* XXX 1 byte hole, try to pack */

         u16                        status;               /*    20     2 */

         /* XXX 2 bytes hole, try to pack */

         struct nvme_ctrl *         ctrl;                 /*    24     8 */

         /* size: 32, cachelines: 1, members: 7 */
         /* sum members: 29, holes: 2, sum holes: 3 */
         /* last cacheline: 32 bytes */
};
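
For reference, those 2 assignments and 2 branches amount to roughly
the following (a simplified sketch of the scheme, with a hypothetical
nvme_stamp_cid() wrapper standing in for the submission-path hunks;
see the patch itself for the real helpers). Submission bumps a small
per-request generation counter and folds it into the upper bits of
command_id; completion splits the command_id apart again and rejects
the CQE if the generation doesn't match:

/*
 * command_id layout: | 4-bit genctr | 12-bit blk-mq tag |
 */
#define nvme_genctr_mask(gen)		((gen) & 0xf)
#define nvme_tag_from_cid(cid)		((cid) & 0xfff)
#define nvme_genctr_from_cid(cid)	(((cid) >> 12) & 0xf)

static inline u16 nvme_cid(struct request *rq)
{
	return (nvme_genctr_mask(nvme_req(rq)->genctr) << 12) | rq->tag;
}

/* submission side: the 2 assignments */
static inline void nvme_stamp_cid(struct request *req,
		struct nvme_command *cmd)
{
	nvme_req(req)->genctr++;		/* assignment 1 */
	cmd->common.command_id = nvme_cid(req);	/* assignment 2 */
}

/*
 * completion side: the 2 branches, both unlikely() so the fast path
 * stays straight-line code
 */
static inline struct request *nvme_find_rq(struct blk_mq_tags *tags,
		u16 command_id)
{
	u8 genctr = nvme_genctr_from_cid(command_id);
	u16 tag = nvme_tag_from_cid(command_id);
	struct request *rq;

	rq = blk_mq_tag_to_rq(tags, tag);
	if (unlikely(!rq)) {
		pr_err("could not locate request for tag %#x\n", tag);
		return NULL;
	}
	if (unlikely(nvme_genctr_mask(nvme_req(rq)->genctr) != genctr)) {
		pr_err("request %#x genctr mismatch (got %#x expected %#x)\n",
		       tag, genctr,
		       nvme_genctr_mask(nvme_req(rq)->genctr));
		return NULL;
	}
	return rq;
}

The genctr byte slots into the 4-byte hole the old layout already had,
which is why the struct stays at 32 bytes, and a faulty completion now
fails the lookup instead of handing back a recycled request.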



