nvmet panics during high load

Sagi Grimberg sagi at grimberg.me
Sun Aug 13 04:53:59 PDT 2017


>> Hi All,
> 
> Hey Alon,
> 
>> This is my first post on this mailing list. Let me know if this is the
>> wrong place or format to post bugs in.
> 
> This is the correct place.
> 
>> We're running nvmef using RDMA on kernel 4.11.8.
>> We found a NULL-pointer-dereference bug in nvmet during high load and
>> identified the root cause:
>> The location according to current linux master (fd2b2c57) is
>> drivers/nvme/target/rdma.c at function nvmet_rdma_get_rsp line 170.
>> list_first_entry is called on the list of free responses (free_rsps)
>> while that list is empty, which is obviously unexpected. I added an
>> assert to validate that, and I also tested a hack that enlarges the
>> queue by a factor of 10, which seemed to solve it.
>> It's probably not a leak but a miscalculation of the size of the queue
>> (queue->recv_queue_size * 2). Can anyone explain the rationale behind
>> this calculation? Is the queue assumed to never be empty?
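
For reference, nvmet_rdma_get_rsp() around that point looked roughly like
the sketch below (reconstructed from the driver source and slightly
abridged, so details may differ from the exact tree):

static struct nvmet_rdma_rsp *
nvmet_rdma_get_rsp(struct nvmet_rdma_queue *queue)
{
	struct nvmet_rdma_rsp *rsp;
	unsigned long flags;

	spin_lock_irqsave(&queue->rsps_lock, flags);
	/* Assumes free_rsps is never empty: on an empty list,
	 * list_first_entry() hands back a bogus pointer derived from
	 * the list head, and the crash follows shortly after. */
	rsp = list_first_entry(&queue->free_rsps,
				struct nvmet_rdma_rsp, free_list);
	list_del(&rsp->free_list);
	spin_unlock_irqrestore(&queue->rsps_lock, flags);

	return rsp;
}
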
> 
> Well, you are correct that the code assumes that it always has a free
> rsp to use, and yes, it's a wrong assumption. The reason is that
> rsps are freed upon the send completion of an nvme command (cqe).
> 
> If, for example, one or more acks from the host for this send were dropped
> (which can very well happen in a high-load switched fabric environment),
> then we might end up needing more resources than we originally thought.
> 
> Does your cluster involve a cascade of one or more switches? That would
> explain how we're getting there.
> 
> We use a heuristic of 2x the queue_size so that we can pipeline a full
> queue-depth and also have a queue-depth to spare, as completions might
> take time.
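
The sizing itself happens when a queue's rsps are allocated; roughly as
follows (abridged, with the per-rsp buffer setup and error unwinding
omitted):

static int nvmet_rdma_alloc_rsps(struct nvmet_rdma_queue *queue)
{
	/* one queue-depth worth of rsps that can be in flight, plus one
	 * queue-depth of slack for rsps whose send completions have not
	 * come back yet */
	int nr_rsps = queue->recv_queue_size * 2, i;

	queue->rsps = kcalloc(nr_rsps, sizeof(*queue->rsps), GFP_KERNEL);
	if (!queue->rsps)
		return -ENOMEM;

	for (i = 0; i < nr_rsps; i++) {
		struct nvmet_rdma_rsp *rsp = &queue->rsps[i];

		/* ... per-rsp DMA/buffer setup elided ... */
		list_add_tail(&rsp->free_list, &queue->free_rsps);
	}
	return 0;
}
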
> 
> I think that allocating 10x is overkill, but maybe something that
> grows lazily would fit better (not sure if we want to shrink as well).
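
The patch mentioned just below is not reproduced in this archive excerpt.
Purely as an illustration of the lazy-growth idea, and not the posted
patch, a fallback could look something like the hypothetical sketch below;
the "allocated" field would be a new member of struct nvmet_rdma_rsp, and
the GFP flags plus the matching kfree() in nvmet_rdma_put_rsp() would
depend on the calling context:

static struct nvmet_rdma_rsp *
nvmet_rdma_get_rsp(struct nvmet_rdma_queue *queue)
{
	struct nvmet_rdma_rsp *rsp = NULL;
	unsigned long flags;

	spin_lock_irqsave(&queue->rsps_lock, flags);
	if (!list_empty(&queue->free_rsps)) {
		rsp = list_first_entry(&queue->free_rsps,
				struct nvmet_rdma_rsp, free_list);
		list_del(&rsp->free_list);
	}
	spin_unlock_irqrestore(&queue->rsps_lock, flags);

	if (unlikely(!rsp)) {
		/* pool exhausted: fall back to a one-off allocation that
		 * nvmet_rdma_put_rsp() would kfree() instead of returning
		 * to the pool (a real version would also need the same
		 * per-rsp setup that nvmet_rdma_alloc_rsps() does) */
		rsp = kzalloc(sizeof(*rsp), GFP_ATOMIC);
		if (rsp)
			rsp->allocated = true;
	}
	return rsp;
}
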
> 
> Can you try out this (untested) patch:

Alon, did you happen to test this?


