nvmet panics during high load

Alon Horev alon at vastdata.com
Sun Aug 13 06:11:02 PDT 2017


Sorry for the lack of response. I haven't tested the patch.
I intended to look into the issue further and suggest a simpler
solution that calculates the maximum queue size correctly, but I
simply didn't have the time.

On Sun, Aug 13, 2017 at 3:10 PM, Alon Horev <alon at vastdata.com> wrote:
> Sorry for the lack of response. I haven't tested the patch.
> I intended to look into the issue further and suggest a simpler solution
> that calculates the maximum queue size correctly, but I simply didn't
> have the time.
>
> On Sun, 13 Aug 2017 at 14:54 Sagi Grimberg <sagi at grimberg.me> wrote:
>>
>>
>> >> Hi All,
>> >
>> > Hey Alon,
>> >
>> >> This is my first post on this mailing list. Let me know if this is the
>> >> wrong place or format to post bugs in.
>> >
>> > This is the correct place.
>> >
>> >> We're running nvmef using RDMA on kernel 4.11.8.
>> >> We found a zero-dereference bug in nvmet during high load and
>> >> identified the root cause:
>> >> The location according to current linux master (fd2b2c57) is
>> >> drivers/nvme/target/rdma.c at function nvmet_rdma_get_rsp line 170.
>> >> list_first_entry is called on the list of free responses (free_rsps),
>> >> which is unexpectedly empty. I added an assert to validate that, and
>> >> also tested a hack that enlarges the queue by a factor of 10, which
>> >> seemed to solve the problem.
>> >> It's probably not a leak but a miscalculation of the size of the queue
>> >> (queue->recv_queue_size * 2). Can anyone explain the rationale behind
>> >> this calculation? Is the queue assumed to never be empty?
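For reference, the path in question looks roughly like this (simplified from
the 4.11-era drivers/nvme/target/rdma.c; treat it as a sketch rather than an
exact quote of the file):

        /*
         * free_rsps is the per-queue pool of response contexts, sized to
         * recv_queue_size * 2 at allocation time and protected by rsps_lock.
         * Nothing here handles the pool being empty: on an empty list,
         * list_first_entry() returns a bogus pointer computed from the list
         * head itself, and the caller's use of rsp then crashes.
         */
        static struct nvmet_rdma_rsp *
        nvmet_rdma_get_rsp(struct nvmet_rdma_queue *queue)
        {
                struct nvmet_rdma_rsp *rsp;
                unsigned long flags;

                spin_lock_irqsave(&queue->rsps_lock, flags);
                rsp = list_first_entry(&queue->free_rsps,
                                struct nvmet_rdma_rsp, free_list);
                list_del(&rsp->free_list);
                spin_unlock_irqrestore(&queue->rsps_lock, flags);

                return rsp;
        }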
>> >
>> > Well, you are correct that the code assumes that it always has a free
>> > rsp to use, and yes, it's a wrong assumption. The reason is that
>> > rsps are freed upon the send completion of an nvme command (cqe).
>> >
>> > If for example one or more acks from the host on this send were dropped
>> > (which can very well happen in a high-load switched fabric environment),
>> > then we might end up needing more resources than we originally thought.
>> >
>> > Does your cluster involve a cascade of one or more switches? That would
>> > explain how we're getting there.
>> >
>> > We use a heuristic of 2x the queue_size so that we can pipeline a full
>> > queue-depth and still have a queue-depth to spare, as completions might
>> > take time.
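To make that lifecycle concrete: the pool is filled once, at queue allocation
time, with recv_queue_size * 2 entries, and an rsp only goes back onto
free_rsps from the send-completion path, roughly via (again a sketch of the
same era of the file):

        /*
         * Called from the release path off the send completion handler,
         * i.e. only once the send for the nvme completion has finished;
         * until then the rsp stays off the free list.
         */
        static void nvmet_rdma_put_rsp(struct nvmet_rdma_rsp *rsp)
        {
                unsigned long flags;

                spin_lock_irqsave(&rsp->queue->rsps_lock, flags);
                list_add_tail(&rsp->free_list, &rsp->queue->free_rsps);
                spin_unlock_irqrestore(&rsp->queue->rsps_lock, flags);
        }

So if send completions are delayed (for example because acks were dropped and
had to be retransmitted), more than 2x queue-depth rsps can be in flight at
once and the preallocated pool runs dry.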
>> >
>> > I think that allocating 10x is overkill, but maybe something that
>> > grows lazily would fit better (not sure if we want to shrink as well).
>> >
>> > Can you try out this (untested) patch:
>>
>> Alon, did you happen to test this?
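As an illustration only of the "grows lazily" idea discussed above (not the
untested patch referenced in the quote, which is not shown here), the get
path could fall back to an atomic allocation when the preallocated pool is
exhausted:

        /*
         * Illustration only: fall back to an on-demand allocation when the
         * preallocated pool is exhausted.  The "allocated" flag is
         * hypothetical, and a real change would also have to set up the
         * freshly allocated rsp's send buffer/SGE and make
         * nvmet_rdma_put_rsp() kfree() such rsps instead of listing them
         * back onto free_rsps.
         */
        static struct nvmet_rdma_rsp *
        nvmet_rdma_get_rsp(struct nvmet_rdma_queue *queue)
        {
                struct nvmet_rdma_rsp *rsp;
                unsigned long flags;

                spin_lock_irqsave(&queue->rsps_lock, flags);
                rsp = list_first_entry_or_null(&queue->free_rsps,
                                struct nvmet_rdma_rsp, free_list);
                if (likely(rsp))
                        list_del(&rsp->free_list);
                spin_unlock_irqrestore(&queue->rsps_lock, flags);

                if (unlikely(!rsp)) {
                        rsp = kzalloc(sizeof(*rsp), GFP_ATOMIC);
                        if (!rsp)
                                return NULL;    /* callers would need to cope */
                        rsp->allocated = true;  /* hypothetical: freed in put */
                }

                return rsp;
        }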



-- 
Alon Horev
+972-524-517-627


