NVMe induced NULL deref in bt_iter()
Jens Axboe
axboe at kernel.dk
Mon Jul 3 09:01:48 PDT 2017
On 07/02/2017 04:45 AM, Max Gurtovoy wrote:
>
>
> On 6/30/2017 8:26 PM, Jens Axboe wrote:
>> Hi Max,
>
> Hi Jens,
>
>>
>> I remembered you reporting this. I think this is a regression introduced
>> with the scheduling, since ->rqs[] isn't static anymore. ->static_rqs[]
>> is, but that's not indexable by the tag we find. So I think we need to
>> guard those with a NULL check. The actual requests themselves are
>> static, so we know the memory itself isn't going away. But if we race
>> with completion, we could find a NULL there, validly.
>>
>> Since you could reproduce it, can you try the below?
>
> I still can repro the null deref with this patch applied.
>
>>
>> diff --git a/block/blk-mq-tag.c b/block/blk-mq-tag.c
>> index d0be72ccb091..b856b2827157 100644
>> --- a/block/blk-mq-tag.c
>> +++ b/block/blk-mq-tag.c
>> @@ -214,7 +214,7 @@ static bool bt_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>> bitnr += tags->nr_reserved_tags;
>> rq = tags->rqs[bitnr];
>>
>> - if (rq->q == hctx->queue)
>> + if (rq && rq->q == hctx->queue)
>> iter_data->fn(hctx, rq, iter_data->data, reserved);
>> return true;
>> }
>> @@ -249,8 +249,8 @@ static bool bt_tags_iter(struct sbitmap *bitmap, unsigned int bitnr, void *data)
>> if (!reserved)
>> bitnr += tags->nr_reserved_tags;
>> rq = tags->rqs[bitnr];
>> -
>> - iter_data->fn(rq, iter_data->data, reserved);
>> + if (rq)
>> + iter_data->fn(rq, iter_data->data, reserved);
>> return true;
>> }
>
> see the attached file for dmesg output.
>
> output of gdb:
>
> (gdb) list *(blk_mq_flush_busy_ctxs+0x48)
> 0xffffffff8127b108 is in blk_mq_flush_busy_ctxs
> (./include/linux/sbitmap.h:234).
> 229
> 230 for (i = 0; i < sb->map_nr; i++) {
> 231 struct sbitmap_word *word = &sb->map[i];
> 232 unsigned int off, nr;
> 233
> 234 if (!word->word)
> 235 continue;
> 236
> 237 nr = 0;
> 238 off = i << sb->shift;
>
>
> when I change the "if (!word->word)" to "if (word && !word->word)"
> I can get null deref at "nr = find_next_bit(&word->word, word->depth,
> nr);". Seems like somehow word becomes NULL.
>
> Adding the linux-nvme guys too.
> Sagi has mentioned that this can be null only if we remove the tagset
> while I/O is trying to get a tag and when killing the target we get into
> error recovery and periodic reconnects, which does _NOT_ include freeing
> the tagset, so this is probably the admin tagset.
>
> Sagi,
> you've mention a patch for centrelizing the treatment of the admin
> tagset to the nvme core. I think I missed this patch, so can you please
> send a pointer to it and I'll check if it helps ?
Right, this is clearly a different issue and my first thought as well
was that it's a missing quiesce of the queue. We're iterating the tags
when they are being torn down.
Looks like Sagi's patch fixes the issue, so I'm considering this one
resolved.
--
Jens Axboe
More information about the Linux-nvme
mailing list