[PATCH] nvme: Revert: Fix controller creation races with teardown flow

Mon Aug 31 18:26:28 EDT 2020

On 8/28/2020 4:59 PM, Sagi Grimberg wrote:
>
>>> This is indeed a regression.
>>>
>>> Perhaps we should also revert:
>>> 12a0b6622107 ("nvme: don't hold nvmf_transports_rwsem for more than 
>>> transport lookups")
>>>
>>> Which inherently caused this by removing the serialization of
>>> .create_ctrl()...
>>
>> no, I believe the patch on the semaphore is correct. Otherwise - 
>> things can be blocked a long time.. a minute (1 cmd timeout) or even 
>> multiple minutes in the case where a command failure in core layers 
>> effectively gets ignored and thus doesn't cause the error path in the 
>> transport. There can be multiple /dev/nvme-fabrics commands stacked 
>> up that can make the delays look much longer to the last guy.
>>
>> as far as creation vs teardown... yeah, not fun, but there are other 
>> ways to deal with it. FC: I got rid of the separate create/reconnect 
>> threads a while ago thus the return-control-while-reconnecting 
>> behavior, so I've had to deal with it.  It's one area it'd be nice to 
>> see some convergence in implementation again between transports.
>
> Doesn't fc have a bug there? in create_ctrl after flushing the
> connect_work, what is telling it if delete is running in with it
> (or that it already ran...)

I guess I don't understand what the bug is you are thinking about. Maybe 
there's a short period that the ctrl ptr is perhaps freed, thus the 
pointer shouldn't be used - but I don't see it as almost everything is 
simply looking at  the value of the pointer, not dereferencing it.

I do have a bug or two  with delete association fighting with 
create_association - but it's mainly due to nvme_fc_error_recovery not 
the delete routine. I've reworked this area after seeing your other 
patches and will be posting after some more testing.  But no reason for 
synchronizing all ctrl creates.

-- james