[PATCH] nvme: Revert: Fix controller creation races with teardown flow
James Smart
james.smart at broadcom.com
Mon Aug 31 18:26:28 EDT 2020
On 8/28/2020 4:59 PM, Sagi Grimberg wrote:
>
>>> This is indeed a regression.
>>>
>>> Perhaps we should also revert:
>>> 12a0b6622107 ("nvme: don't hold nvmf_transports_rwsem for more than
>>> transport lookups")
>>>
>>> Which inherently caused this by removing the serialization of
>>> .create_ctrl()...
>>
>> no, I believe the patch on the semaphore is correct. Otherwise -
>> things can be blocked a long time.. a minute (1 cmd timeout) or even
>> multiple minutes in the case where a command failure in core layers
>> effectively gets ignored and thus doesn't cause the error path in the
>> transport. There can be multiple /dev/nvme-fabrics commands stacked
>> up that can make the delays look much longer to the last guy.
>>
>> as far as creation vs teardown... yeah, not fun, but there are other
>> ways to deal with it. FC: I got rid of the separate create/reconnect
>> threads a while ago thus the return-control-while-reconnecting
>> behavior, so I've had to deal with it. It's one area it'd be nice to
>> see some convergence in implementation again between transports.
>
> Doesn't fc have a bug there? in create_ctrl after flushing the
> connect_work, what is telling it if delete is running in with it
> (or that it already ran...)
I guess I don't understand what the bug is you are thinking about. Maybe
there's a short period that the ctrl ptr is perhaps freed, thus the
pointer shouldn't be used - but I don't see it as almost everything is
simply looking at the value of the pointer, not dereferencing it.
I do have a bug or two with delete association fighting with
create_association - but it's mainly due to nvme_fc_error_recovery not
the delete routine. I've reworked this area after seeing your other
patches and will be posting after some more testing. But no reason for
synchronizing all ctrl creates.
-- james
More information about the Linux-nvme
mailing list