[PATCH 0/2] nvme-fabrics: short-circuit connect retries
Hannes Reinecke
hare at suse.de
Fri Jul 9 01:59:10 PDT 2021
On 7/9/21 10:55 AM, Hannes Reinecke wrote:
> On 7/9/21 10:34 AM, Chao Leng wrote:
>>
>>
>> On 2021/7/9 12:57, James Smart wrote:
>>> On 7/7/2021 11:29 PM, Chao Leng wrote:
>>>>
>>>>
>>>> On 2021/6/29 19:00, Hannes Reinecke wrote:
> [ .. ]
>>>>>
>>>>> Please give details for the issues you are concerned with.
>>>> For example, suppose the host sends a fabrics command, but the command
>>>> is corrupted by an internal HBA error or by an abnormal or attacked
>>>> network, and the target then sets DNR in its response. If the host
>>>> does not reconnect after a DNR response on the old connection, the
>>>> connection cannot recover automatically. The corruption may be
>>>> transient, and a reconnect attempt may succeed, so trying to
>>>> reconnect is the better choice.
>>>> If the host is not to reconnect after a DNR response, the target
>>>> should set DNR only for internal target errors.
>>>>>
>>>
>>> It really doesn't matter what you describe is happening on the back
>>> end of the controllers/subsystem. The rev 1.4 spec says "if the same
>>> command is re-submitted to any controller in the NVM subsystem, then
>>> that re-submitted command is expected to fail." - So, if there's a
>>> chance that a reconnect would succeed, which would be on a different
>>> controller - then the subsystem is not following that statement. So you
>>> shouldn't be setting DNR. If you disagree with this behavior, it
>>> will need to be taken up with the NVM Express group.
>> I agree with the NVMe spec. My point is that the Linux kernel NVMe
>> target does not behave exactly like this.
>> So both host and target need to be modified. In addition, compatibility
>> between different versions should be considered.
>
> Ah. You are talking about the _target_. I would be the first to admit
> that this could be cleared up quite a bit; there are lots of
> inconsistencies there.
>
> And sure, I can do a patch for that, too.
>
But incidentally, I'd rather solve this in a different patchset, as
there are lots of places in the target code where we return
NVME_SC_INTERNAL upon allocation failure, which by rights should be
retryable. But then I'd rather implement ACRE for the target first, as
then we could classify the retry frequency; for allocation failures we
should give the target more time between retries than for, e.g., locking issues.
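
As a rough illustration of that classification idea (a sketch only, not code
from any patchset): the target could pick a Command Retry Delay (CRD) index
per failure class, and the host would turn that index into a delay using the
CRDT values from Identify Controller, which the spec defines in units of
100 ms. The bit layout below follows the status field as the Linux host
driver sees it, as far as I recall; the failure classes and both function
names are hypothetical.

```c
#include <stdint.h>

/* Status-field bits (phase bit already shifted out). */
#define SC_INTERNAL   0x0006u  /* generic "Internal Error" status code */
#define SC_CRD_SHIFT  11
#define SC_CRD_MASK   0x1800u  /* 2-bit Command Retry Delay index */
#define SC_DNR        0x4000u  /* Do Not Retry */

/* Hypothetical failure classes on the target side. */
enum fail_class {
	FAIL_LOCK_CONTENTION,  /* likely to clear quickly */
	FAIL_ALLOC,            /* give the target more time */
	FAIL_PERMANENT,        /* retrying cannot help */
};

/* Target side: build a status word for a failed command. */
static uint16_t classify_status(enum fail_class fc)
{
	switch (fc) {
	case FAIL_LOCK_CONTENTION:
		return SC_INTERNAL | (1u << SC_CRD_SHIFT);  /* use CRDT1 */
	case FAIL_ALLOC:
		return SC_INTERNAL | (2u << SC_CRD_SHIFT);  /* use CRDT2 */
	default:
		return SC_INTERNAL | SC_DNR;
	}
}

/*
 * Host side: delay in milliseconds before a retry, 0 for an immediate
 * retry, or -1 if the command must not be retried. crdt[] holds
 * CRDT1..CRDT3 in units of 100 ms.
 */
static int retry_delay_ms(uint16_t status, const uint16_t crdt[3])
{
	unsigned int crd;

	if (status & SC_DNR)
		return -1;

	crd = (status & SC_CRD_MASK) >> SC_CRD_SHIFT;
	return crd ? crdt[crd - 1] * 100 : 0;
}
```

With CRDT1 smaller than CRDT2, lock contention gets retried sooner than an
allocation failure, and neither is reported with DNR set.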
Cheers,
Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
hare at suse.de +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer