[RFC] arm64: Enforce observed order for spinlock and data

bdegraaf at codeaurora.org
Wed Oct 5 08:30:08 PDT 2016


On 2016-10-05 11:10, Peter Zijlstra wrote:
> On Wed, Oct 05, 2016 at 10:55:57AM -0400, bdegraaf at codeaurora.org wrote:
>> On 2016-10-04 15:12, Mark Rutland wrote:
>> >Hi Brent,
>> >
>> >Could you *please* clarify if you are trying to solve:
>> >
>> >(a) a correctness issue (e.g. data corruption) seen in practice.
>> >(b) a correctness issue (e.g. data corruption) found by inspection.
>> >(c) A performance issue, seen in practice.
>> >(d) A performance issue, found by inspection.
>> >
>> >Any one of these is fine; we just need to know in order to be able to
>> >help effectively, and so far it hasn't been clear.
> 
> Brent, you forgot to state which: 'a-d' is the case here.
> 
>> I found the problem.
>> 
>> Back in September of 2013, arm64 atomics were broken due to missing
>> barriers in certain situations, but the problem at that time was
>> undiscovered.
>> 
>> Will Deacon's commit d2212b4dce596fee83e5c523400bf084f4cc816c went in
>> at that time and changed the correct cmpxchg64 in lockref.c to
>> cmpxchg64_relaxed.
>> 
>> d2212b4 appeared to be OK at that time because the additional barrier
>> requirements of this specific code sequence were not yet discovered,
>> and this change was consistent with the arm64 atomic code of that time.
>> 
>> Around February of 2014, some discovery led Will to correct the
>> problem with the atomic code via commit
>> 8e86f0b409a44193f1587e87b69c5dcf8f65be67, which has an excellent
>> explanation of potential ordering problems with the same code sequence
>> used by lockref.c.
>> 
>> With this updated understanding, the earlier commit
>> (d2212b4dce596fee83e5c523400bf084f4cc816c) should be reverted.
>> 
>> Because acquire/release semantics are insufficient for the full
>> ordering, the single barrier after the store exclusive is the best
>> approach, similar to Will's atomic barrier fix.
> 
> This again does not in fact describe the problem.
> 
> What is the problem with lockref, and how (refer the earlier a-d
> multiple choice answer) was this found.
> 
> Now, I have been looking, and we have some idea what you _might_ be
> alluding to, but please explain which accesses get reordered how and
> cause problems.

Sorry for the confusion; this was a "b" item (correctness fix based on
code inspection). I had sent an answer to this yesterday, but didn't
realize that it was in a separate, private email thread.

I'll work out the before/after problem scenarios and send them along
once I've hashed them out (it may take a while for me to paint a clear
picture). In the meantime, however, consider that even without the
spinlock code in the picture, lockref needs to treat the cmpxchg as a
full system-level atomic, because multiple agents could access the value
with a variety of timings. Since similar atomics have been barriered on
arm64 since 8e86f0b, the access to lockref should be treated the same
way.
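
To make the shape of the code in question concrete, here is a rough
user-space sketch in C11 atomics. It is not the kernel's lockref code:
struct lockref_model, lockref_model_get(), the retry limit and the bit
layout are all hypothetical, and the C11 memory model is not identical to
the kernel's. The only point is that the same compare-and-exchange loop
can be run relaxed (standing in for cmpxchg64_relaxed) or fully ordered
(standing in for cmpxchg64), and the choice is a single ordering argument.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of a lockref: one 64-bit word holding a lock value
 * in the low 32 bits and a reference count in the high 32 bits.
 * The layout is an assumption made for illustration only.
 */
struct lockref_model {
	_Atomic uint64_t lock_count;
};

#define LOCK_MASK 0xffffffffull	/* non-zero means "lock is held" */
#define COUNT_ONE (1ull << 32)	/* one reference in the high half */

/*
 * Try to take a reference without taking the lock.
 * 'order' is the ordering used when the compare-and-exchange succeeds:
 * memory_order_relaxed models the relaxed variant, memory_order_seq_cst
 * models the fully ordered one.
 * Returns false if the lock is held or the retries are exhausted, in
 * which case a caller would fall back to a locked slow path.
 */
static bool lockref_model_get(struct lockref_model *ref,
			      memory_order order, int retries)
{
	uint64_t old;

	assert(ref != NULL);
	old = atomic_load_explicit(&ref->lock_count, memory_order_relaxed);

	while (retries-- > 0) {
		if (old & LOCK_MASK)
			return false;
		/* On failure, 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_strong_explicit(&ref->lock_count,
							    &old,
							    old + COUNT_ONE,
							    order,
							    memory_order_relaxed))
			return true;
	}
	return false;
}
```

Both orderings give the same result when only one thread is running; the
difference only matters for how other agents may observe the update
relative to surrounding accesses, which is what the scenarios I owe you
need to show.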

Brent


