[RFC] arm64: Enforce observed order for spinlock and data
bdegraaf at codeaurora.org
bdegraaf at codeaurora.org
Wed Oct 5 08:30:08 PDT 2016
On 2016-10-05 11:10, Peter Zijlstra wrote:
> On Wed, Oct 05, 2016 at 10:55:57AM -0400, bdegraaf at codeaurora.org
> wrote:
>> On 2016-10-04 15:12, Mark Rutland wrote:
>> >Hi Brent,
>> >
>> >Could you *please* clarify if you are trying to solve:
>> >
>> >(a) a correctness issue (e.g. data corruption) seen in practice.
>> >(b) a correctness issue (e.g. data corruption) found by inspection.
>> >(c) A performance issue, seen in practice.
>> >(d) A performance issue, found by inspection.
>> >
>> >Any one of these is fine; we just need to know in order to be able to
>> >help effectively, and so far it hasn't been clear.
>
> Brent, you forgot to state which: 'a-d' is the case here.
>
>> I found the problem.
>>
>> Back in September of 2013, arm64 atomics were broken due to missing
>> barriers
>> in certain situations, but the problem at that time was undiscovered.
>>
>> Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in
>> at
>> that
>> time and changed the correct cmpxchg64 in lockref.c to
>> cmpxchg64_relaxed.
>>
>> d2212b4 appeared to be OK at that time because the additional barrier
>> requirements of this specific code sequence were not yet discovered,
>> and
>> this change was consistent with the arm64 atomic code of that time.
>>
>> Around February of 2014, some discovery led Will to correct the
>> problem with
>> the atomic code via commit 8e86f0b409a44193f1587e87b69c5dcf8f65be67,
>> which
>> has an excellent explanation of potential ordering problems with the
>> same
>> code sequence used by lockref.c.
>>
>> With this updated understanding, the earlier commit
>> (d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted.
>>
>> Because acquire/release semantics are insufficient for the full
>> ordering,
>> the single barrier after the store exclusive is the best approach,
>> similar
>> to Will's atomic barrier fix.
>
> This again does not in fact describe the problem.
>
> What is the problem with lockref, and how (refer the earlier a-d
> multiple choice answer) was this found.
>
> Now, I have been looking, and we have some idea what you _might_ be
> alluding to, but please explain which accesses get reordered how and
> cause problems.
Sorry for the confusion, this was a "b" item (correctness fix based on
code
inspection. I had sent an answer to this yesterday, but didn't realize
that
it was in a separate, private email thread.
I'll work out the before/after problem scenarios and send them along
once
I've hashed them out (it may take a while for me to paint a clear
picture).
In the meantime, however, consider that even without the spinlock code
in
the picture, lockref needs to treat the cmpxchg as a full system-level
atomic,
because multiple agents could access the value in a variety of timings.
Since
atomics similar to this are barriered on arm64 since 8e86f0b, the access
to
lockref should be similar.
Brent
More information about the linux-arm-kernel
mailing list