nvmet: race condition while CQE are getting processed concurrently with the DISCONNECTED event

Tue Mar 14 06:22:14 PDT 2017

On 03/13/2017 04:05 PM, Sagi Grimberg wrote:
>> Hi Sagi
>
> Hi Yi,
>
>> With this patch, issue[1] cannot be reproduced, but I still can
>> reproduce issue[2]. thanks
>>
>> [1]
>>
>> kernel NULL pointer on nvmet with stress
>> rescan_controller/reset_controller test during I/O
>>
>> [2]
>> mlx4_core 0000:07:00.0: swiotlb buffer is full and OOM observed during
>> stress test on reset_controller
>
> So this particular issue seems to have to do with the fact that the
> target is attacked with reset attempts. The controller teardown happens
> asynchronously and waits for safe termination, while re-establishments
> are allowed immediately.
>
> I think we need to somehow figure out what our max allowed number of
> active controllers is, and simply refuse controller establishment if
> we exceed this number. The tricky part is to understand what this
> magic number is.
>
> Question, does the test in [2] complete eventually? or is this a fatal
> error?
Hi Sagi
If I don't stop the reset_controller operation on the client side, it
eventually ends in a fatal error, since the operation continuously
consumes memory on the target side.
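
To make the suggestion quoted above concrete, here is a rough user-space
sketch of a cap on active controllers. It is only an illustration, not
nvmet code: MAX_ACTIVE_CTRLS, ctrl_try_admit() and ctrl_release() are
hypothetical names, and the value 4 is arbitrary, since picking the real
limit is exactly the open question. The point it shows is that a slot is
given back only when the asynchronous teardown has finished, so repeated
resets cannot pile up controllers that are still being torn down.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical cap; choosing the real value is the open question. */
#define MAX_ACTIVE_CTRLS 4

/* Controllers that are live or still being torn down. */
static atomic_int active_ctrls;

/* Take a slot if we are under the cap, otherwise refuse the connect. */
static bool ctrl_try_admit(void)
{
	int cur = atomic_load(&active_ctrls);

	while (cur < MAX_ACTIVE_CTRLS) {
		/* On failure, cur is refreshed with the current count. */
		if (atomic_compare_exchange_weak(&active_ctrls, &cur, cur + 1))
			return true;
	}
	return false;
}

/* Call only after the asynchronous teardown has fully completed. */
static void ctrl_release(void)
{
	atomic_fetch_sub(&active_ctrls, 1);
}
```

With something like this, a connect request that arrives while
MAX_ACTIVE_CTRLS controllers are still live or draining would be
rejected instead of allocating more resources on the target.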

Thanks
Yi