[LSF/MM/BPF TOPIC] TP4129 KATO corrections and clarification

Sagi Grimberg sagi at grimberg.me
Fri Feb 7 01:00:46 PST 2025




On 05/02/2025 18:53, Ballard, Curtis C (HPE Storage) wrote:
>>> On Fri, Jan 31, 2025 at 06:48:20PM +0000, Ballard, Curtis C (HPE Storage) wrote:
>>> If there are specific aspects of TP4129 that are broken, I would be interested in learning more about them and how they are broken.
>> Seriously, Curtis, we've been through this a dozen times in the TWG.
>> It just adds another best effort timer that is not better than the
>> one that already is there.  But you guys insist on papering over it,
>> and there is no real reason for Linux to adopt that.  (despite the pages
>> long John rant).
>>
> Thanks for that response Christoph. It is really helpful to understand that
> the concern is around the timer and the "best effort" aspects. I was aware that
> concerns had been expressed on whether the timer was reliable and the TWG took
> those seriously.
>
> Reading between the lines on the term "best effort" I think that the concern
> is that the timers don't provide a reliable technique to prevent the ghost
> write data corruption issue that was discussed at LSF last year.
>
> There was quite a bit of work to go through scenarios, and timer processing, to
> address the concerns that the timer was a "best effort" (to reuse the term)
> rather than a reliable method. I'm not aware of a scenario where applying the
> timers, with the process described in TP4129, fails to provide a reliable
> assurance that it is safe to retry write commands.

Unless the controller can reliably upper-bound the time it takes to abort
all inflight commands, I can't see how this is 100% reliable. That will
likely mean a healthy margin is taken so that it does not break in practice.

>    While it is slower than
> any of us would like, having a method to prevent the possibility of data
> corruption seems like a worthy effort.
>
> If there is a scenario where the timer management model described in TP4129
> isn't a reliable method of determining that it is safe to retry write commands
> then we need to do some investigation to see if we can resolve that issue.
>
> I appreciate the comments on TP8028. I don't like how slow the timer method is
> and really hope that we can continue to work on methods of making NVMe-oF
> error detection and recovery faster, while maintaining data integrity. The
> TP8028 effort is what we have in development and I'm optimistic that we'll be
> able to discuss that model at LSF.

I think it's reasonable for a subsystem to report a failover delay to the
host. While it is indeed a "best effort", it's most likely sufficient that
implementers never see where it breaks... I just had an issue with prior
attempts to make this behavior universal.

I also agree that making quiesce in failover explicit is the best solution.
But I don't think we should reject any attempt at making the host respect
the controller CQT.


