nvmet: race condition while CQE are getting processed concurrently with the DISCONNECTED event

Mon Mar 13 01:05:53 PDT 2017

> Hi Sagi

Hi Yi,

> With this patch, issue[1] cannot be reproduced, but I still can
> reproduce issue[2]. thanks
>
> [1]
>
> kernel NULL pointer on nvmet with stress
> rescan_controller/reset_controller test during I/O
>
> [2]
> mlx4_core 0000:07:00.0: swiotlb buffer is full and OOM observed during
> stress test on reset_controller

So this particular issue seems to have to do with the fact that the
target is hammered with reset attempts. The controller teardown happens
asynchronously and waits for safe termination, while re-establishments
are allowed immediately, so controllers that are still tearing down can
pile up alongside the newly established ones.

I think we need to somehow figure out what our maximum number of
allowed active controllers is and simply refuse controller
establishment once we exceed it. The tricky part is working out what
this magic number should be.
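
To make the idea concrete, here is a rough userspace sketch of such a
cap, not actual nvmet code. All names (MAX_ACTIVE_CTRLS, ctrl_slot_get,
ctrl_slot_put) are hypothetical, and the limit of 4 is an arbitrary
placeholder since the right value is exactly the open question. The
point is only that a slot is taken at establishment and given back when
the asynchronous teardown has really finished, not when it is queued.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical cap; the right value is the open question above. */
#define MAX_ACTIVE_CTRLS 4

/* Controllers that are live or still being torn down. */
static atomic_int active_ctrls;

/*
 * Would be called on controller establishment (assumption).
 * Returns false when the cap is reached, i.e. refuse the connect.
 */
static bool ctrl_slot_get(void)
{
	int cur = atomic_load(&active_ctrls);

	while (cur < MAX_ACTIVE_CTRLS) {
		/* On failure, cur is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&active_ctrls, &cur, cur + 1))
			return true;
	}
	return false;
}

/*
 * Would be called only once teardown has fully completed (assumption),
 * so controllers that are still draining keep counting against the cap.
 */
static void ctrl_slot_put(void)
{
	atomic_fetch_sub(&active_ctrls, 1);
}
```

With something like this, a host that resets in a tight loop would see
establishment refused until earlier teardowns release their slots,
instead of the target accumulating half-dead controllers.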

Question: does the test in [2] complete eventually, or is this a fatal
error?