[PATCH 1/3] nvme-core: improve avoiding false remove namespace

Fri Aug 21 16:23:28 EDT 2020

>>> So the one thing I'm not even sure about is if just ignoring the
>>> errors was a good idea to start with.  They obviously are if we just
>>> did a rescan and did run into an error while rescanning a namespace
>>> that didn't change.  But what if it actually did change?
>>
>> Right, we don't know, so if we failed without DNR, we assume that
>> we will retry again and ignore the error. The assumption is that
>> we will retry when we reconnect, as we don't have a retry mechanism
>> for these requests.
> 
> Yes.  And I think for anything related to namespace (re-)scanning
> we can actually trivially build a sane retry mechanism.  That is, give
> up on the current scan_work, and just rescan once after a short wait.

There is no point in doing that if we are disconnected and will in
the future reconnect, which will trigger a scan that can actually
work.
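
To make the distinction concrete, here is a rough sketch of the decision
being discussed, written as standalone C rather than actual nvme-core
code. All names (classify_scan_result, the scan_action values,
SKETCH_SC_DNR) are made up for illustration; only the 0x4000 value is
meant to mirror the DNR bit in the status word. It assumes positive
values are NVMe status codes and negative values are errnos:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the DNR ("do not retry") status bit. */
#define SKETCH_SC_DNR 0x4000u

/* Hypothetical outcomes of identifying one namespace during a scan. */
enum scan_action {
	SCAN_KEEP_NS,		/* identify succeeded, nothing to do */
	SCAN_REMOVE_NS,		/* failed with DNR set: namespace is really gone */
	SCAN_WAIT_RECONNECT,	/* transient failure, controller not live */
	SCAN_RESCAN_LATER,	/* transient failure, controller still live */
};

/*
 * status > 0: NVMe status code, possibly with the DNR bit set.
 * status < 0: errno-style failure (cancelled request, allocation, ...).
 */
static enum scan_action classify_scan_result(int status, bool ctrl_live)
{
	if (status == 0)
		return SCAN_KEEP_NS;

	/* Only a positive status carries a meaningful DNR bit. */
	if (status > 0 && ((unsigned int)status & SKETCH_SC_DNR))
		return SCAN_REMOVE_NS;

	/*
	 * Anything else is treated as transient. While disconnected, the
	 * scan triggered by the next reconnect is the only retry that can
	 * work, so an extra delayed rescan adds nothing.
	 */
	return ctrl_live ? SCAN_RESCAN_LATER : SCAN_WAIT_RECONNECT;
}
```

With this shape, the short-wait rescan only ever applies while the
controller is live; a disconnected controller just waits for the scan
that reconnect triggers.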

>>> So I think a logic like in this patch kinda makes sense, but I think
>>> we also need to retry and scan again on these kinds of errors.
>>
>> So you are OK with keeping nvme_submit_sync_cmd returning -ENODEV for
>> cancelled requests and have the scan flow assume that these are
>> cancelled requests?
> 
> How does nvme_submit_sync_cmd return -ENODEV?  As far as I can tell
> -ENODEV is our special escape for expected-ish errors in namespace
> scanning.

One of these escapes I guess :)

>> At the very least we need a good comment to say what is going on there.
> 
> Absolutely.
> 
>>
>>> Btw, did you ever actually see -ENOMEM in practice?  With the small
>>> allocations that we do it really should not happen normally, so
>>> special casing for it always felt a little strange.
>>
>> Never seen it, it's there just because we have allocations in the path.
>>
>>> FYI, I've started rebasing various bits of work I've done to start
>>> untangling the mess.  Here is my current WIP, which in this form
>>> is completely untested:
>>>
>>> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/nvme-scanning-cleanup
>>
>> This does not yet contain sorting out what is discussed here, correct?
> 
> No, but all the infrastructure needed to implement my above idea.  Most
> importantly the crazy revalidate callchains are pretty much gone and we're
> down to just a few functions with reasonable call chains.

OK, that makes sense. I'm still not convinced the retry makes sense
though...