[PATCH 1/3] nvme-core: improve avoiding false remove namespace
Sagi Grimberg
sagi at grimberg.me
Fri Aug 21 16:23:28 EDT 2020
>>> So the one thing I'm not even sure about is if just ignoring the
>>> errors was a good idea to start with. They obviously are if we just
>>> did a rescan and did run into an error while rescanning a namespace
>>> that didn't change. But what if it actually did change?
>>
>> Right, we don't know, so if we failed without DNR, we assume that
>> we will retry again and ignore the error. The assumption is that
>> we will retry when we will reconnect as we don't have a retry mechanism
>> for these requests.
>
> Yes. And I think for anything related to namespace (re-)scanning
> we can actually trivially build a sane retry mechanism. That is give
> up on the current scan_work, and just rescan one after a short wait.
There is no point in doing that if we are disconnected and will in
the future reconnect, which will trigger a scan that can actually
work.
>>> So I think a logic like in this patch kinda makes sense, but I think
>>> we also need to retry and scan again on these kinds of errors.
>>
>> So you are OK with keeping nvme_submit_sync_cmd returning -ENODEV for
>> cancelled requests and have the scan flow assume that these are
>> cancelled requests?
>
> How does nvme_submit_sync_cmd return -ENODEV? As far as I can tell
> -ENODEV is our special escape for expected-ish errors in namespace
> scanning.
One of these escapes I guess :)
>> At the very least we need a good comment to say what is going on there.
>
> Absolutely.
>
>>
>> Btw,
>>> did you ever actually see -ENOMEM in practice? With the small
>>> allocations that we do it really should not happen normally, so
>>> special casing for it always felt a little strange.
>>
>> Never seen it, it's there just because we have allocations in the path.
>>
>>> FYI, I've started rebasing various bits of work I've done to start
>>> untangling the mess. Here is my current WIP, which in this form
>>> is completely untested:
>>>
>>> http://git.infradead.org/users/hch/misc.git/shortlog/refs/heads/nvme-scanning-cleanup
>>
>> This does not yet contain sorting out what is discussed here correct?
>
> No, but all the infrastructure needed to implement my above idead. Most
> importanty the crazy revalidate callchains are pretty much gone and we're
> down to just a few functions with reasonable call chains.
OK, that makes sense. I'm still not convinced the retry makes sense
though...
More information about the Linux-nvme
mailing list