[LSF/MM/BPF TOPIC] TP4129 KATO corrections and clarification
Sagi Grimberg
sagi at grimberg.me
Fri Feb 7 01:00:46 PST 2025
On 05/02/2025 18:53, Ballard, Curtis C (HPE Storage) wrote:
>>> On Fri, Jan 31, 2025 at 06:48:20PM +0000, Ballard, Curtis C (HPE Storage) wrote:
>>> If there are specific aspects of TP4129 that are broken, I would be interested in learning more about them and how they are broken.
>> Seriously Curtis, we've been through this a dozen times in the TWG.
>> It just adds another best effort timer that is not better than the
>> one that already is there. But you guys insist on papering over it,
>> and there is no real reason for Linux to adopt that. (despite the pages
>> long John rant).
>>
> Thanks for that response Christoph. It is really helpful to understand that
> the concern is around the timer and the "best effort" aspects. I was aware that
> concerns had been expressed on whether the timer was reliable and the TWG took
> those seriously.
>
> Reading between the lines on the term "best effort" I think that the concern
> is that the timers don't provide a reliable technique to prevent the ghost
> write data corruption issue that was discussed at LSF last year.
>
> There was quite a bit of work to go through scenarios, and timer processing, to
> address the concerns that the timer was a "best effort" (to reuse the term)
> rather than a reliable method. I'm not aware of a scenario where applying the
> timers, with the process described in TP4129, fails to provide a reliable
> assurance that it is safe to retry write commands.
Unless the controller can reliably put an upper bound on how long it takes to
abort all of its inflight commands, I can't see how this is 100% reliable. In
practice that likely means a healthy margin gets added on top, so that it does
not break.
> While it is slower than
> any of us would like, having a method to prevent the possibility of data
> corruption seems like a worthy effort.
>
> If there is a scenario where the timer management model described in TP4129
> isn't a reliable method of determining that it is safe to retry write commands
> then we need to do some investigation to see if we can resolve that issue.
>
> I appreciate the comments on TP8028. I don't like how slow the timer method is
> and really hope that we can continue to work on methods of making NVMe-oF
> error detection and recovery faster, while maintaining data integrity. The
> TP8028 effort is what we have in development and I'm optimistic that we'll be
> able to discuss that model at LSF.
I think it's reasonable for a subsystem to report a failover delay to the host.
While it is indeed a "best effort", it's most likely sufficient that implementers
never see where it breaks... I just had an issue with prior attempts to make
this behavior universal.

I also agree that making quiesce in failover explicit is the best solution. But
I don't think we should reject any attempt at making the host respect the
controller CQT.
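
To make "respect the delay" concrete, here is a rough sketch of the host-side behavior I have in mind: once a path is declared failed, retries of inflight commands on another path are held back until the reported delay has elapsed. This is only an illustration. `FailoverGate` and `quiesce_delay_s` are hypothetical names, not taken from TP4129 or from the Linux nvme driver, and how the delay value is derived from what the controller reports is deliberately left out.

```python
import time


class FailoverGate:
    """Hold back retries of inflight commands until a quiesce delay has passed.

    Hypothetical sketch: quiesce_delay_s stands in for whatever delay the
    host derives from the values the controller reports.
    """

    def __init__(self, quiesce_delay_s, clock=time.monotonic):
        self.quiesce_delay_s = quiesce_delay_s
        self.clock = clock          # injectable so the logic can be tested
        self.retry_after = None     # None means no failover is pending

    def path_failed(self):
        # Record the earliest time at which a retry on another path is allowed.
        self.retry_after = self.clock() + self.quiesce_delay_s

    def may_retry(self):
        # Retries are allowed if no failure is pending or the delay has elapsed.
        if self.retry_after is None:
            return True
        return self.clock() >= self.retry_after
```

The gate is only as good as the delay fed into it, which is exactly the "best effort" part: if the controller takes longer than the reported bound to abort its inflight commands, a retry released by this gate can still race with a stale write.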