LSE atomic op ordering is weaker than intended?

Wed Mar 3 18:04:20 GMT 2021

On 04/03/2021 00.36, Will Deacon wrote:
>> Did I miss something, or is this in fact an issue?
> 
> Both. The -AL atomics are actually special-cased in the
> "barrier-ordered-before" relation in the Arm ARM:
> 
>    [RW1 is barrier-ordered-before if]
>    * RW1 appears in program order before an atomic instruction with both
>      Acquire and Release semantics that appears in program order before
>      RW2.
> 
> However, that isn't sufficient to order prior accesses with the "load part"
> of the RmW and later accesses with the "store part" of the RmW, as you have
> observed in your test. I'm aware of some pending proposals in this area of
> the architecture, so I'm reluctant to make any changes until that's
> bottomed-out, but I'll make a note to chase that up.

I had actually seen that part of the spec, and looked at it sideways a 
few times, but concluded it wasn't giving me the ordering guarantees I 
was looking for (this was before I wrote the litmus test). You're right, 
it does nonetheless make it stronger than the mere combination of 
_acquire and _release semantics.

Glad to hear this is something being worked on! I've been giving myself 
a crash course in memory model minutiae over the past few weeks :)

>> (And while I'm talking to the right people: this issue aside, do atomic ops
>> on Normal memory create ordering with Device memory ops, or are there no
>> guarantees there due to the fact that Normal memory is mapped
>> inner-shareable and the ordering guarantees thus do not extend to
>> outer-shareable Device accesses? My current understanding is the latter,
>> but I find the ARM ARM wording hard to conclusively grok here.)
> 
> Outer-shareable is a superset of inner-shareable, but I think this would be
> easier with a specific example. I'll go and look at the AIC patch, since
> this is all a lot easier to talk about in the context of some real code.
> 
> Which is the latest version I should look at?

I'm planning to send a v3 tomorrow, so I'll CC you on that patch 
(don't bother with v2, this part of the code is changing a lot). That 
said, it basically boils down to the following two sequences:

A:

// ...stuff that needs to be ordered prior to the atomic here
ret = atomic_fetch_or_release(flags...)
if (condition on ret and unrelated stuff) {
	writel(reg_send, ...) // includes pre-barrier
}

B:

writel_relaxed(reg_ack, ...)
dma_wmb() // need a post-barrier
atomic_fetch_andnot_acquire(flags...)
// ...stuff that needs to be ordered after the atomic here

My current understanding is that I cannot drop the dma_wmb() in B, 
switch A to writel_relaxed(), and rely on fully-ordered atomic ops 
instead, because the atomic ops, operating on Normal inner-shareable 
memory, would not make any guarantees about ordering with accesses to 
Device outer-shareable memory. I need the I/O writes to be ordered with 
regard to the atomics.
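
For illustration only, the shape of sequences A and B can be modelled in 
userspace with C11 atomics. This is a sketch under loud assumptions: 
FLAG_PENDING, seq_a(), seq_b() and the fake_* helpers are hypothetical 
names, the "registers" are plain volatile variables rather than Device 
memory, the condition in A is reduced to "the flag was previously 
clear", and C11 fences only stand in for where the kernel barriers sit. 
It says nothing about how Normal and Device accesses are actually 
ordered on Arm, which is the open question here.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define FLAG_PENDING 0x1u		/* hypothetical flag bit */

static atomic_uint flags;		/* stands in for the Normal-memory flags word */
static volatile uint32_t reg_send;	/* stand-ins for Device registers */
static volatile uint32_t reg_ack;

/* Stand-in for writel(): barrier first, then the store. */
static void fake_writel(uint32_t val, volatile uint32_t *reg)
{
	atomic_thread_fence(memory_order_release);
	*reg = val;
}

/* Stand-in for writel_relaxed(): the store with no barrier. */
static void fake_writel_relaxed(uint32_t val, volatile uint32_t *reg)
{
	*reg = val;
}

/* Stand-in for dma_wmb(); a C11 fence is only a placeholder here. */
static void fake_dma_wmb(void)
{
	atomic_thread_fence(memory_order_release);
}

/* Sequence A: release RmW on the flags, then a conditional register write. */
static bool seq_a(uint32_t val)
{
	unsigned int old = atomic_fetch_or_explicit(&flags, FLAG_PENDING,
						    memory_order_release);

	if (!(old & FLAG_PENDING)) {	/* assumed condition on ret */
		fake_writel(val, &reg_send);
		return true;
	}
	return false;
}

/* Sequence B: register write, post-barrier, then acquire RmW on the flags. */
static unsigned int seq_b(uint32_t val)
{
	fake_writel_relaxed(val, &reg_ack);
	fake_dma_wmb();
	return atomic_fetch_and_explicit(&flags, ~FLAG_PENDING,
					 memory_order_acquire);
}
```

In this model, calling seq_a() twice in a row writes reg_send only the 
first time, and a following seq_b() writes reg_ack and clears the flag.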

-- 
Hector Martin (marcan at marcan.st)
Public Key: https://mrcn.st/pub