LSE atomic op ordering is weaker than intended?

Wed Mar 3 18:40:55 GMT 2021

On Thu, Mar 04, 2021 at 03:04:20AM +0900, Hector Martin wrote:
> On 04/03/2021 00.36, Will Deacon wrote:
> > > Did I miss something, or is this in fact an issue?
> > 
> > Both. The -AL atomics are actually special-cased in the
> > "barrier-ordered-before" relation in the Arm ARM:
> > 
> >    [RW1 is barrier-ordered-before if]
> >    * RW1 appears in program order before an atomic instruction with both
> >      Acquire and Release semantics that appears in program order before
> >      RW2.
> > 
> > However, that isn't sufficient to order prior accesses with the "load part"
> > of the RmW and later accesses with the "store part" of the RmW, as you have
> > observed in your test. I'm aware of some pending proposals in this area of
> > the architecture, so I'm reluctant to make any changes until that's
> > bottomed-out, but I'll make a note to chase that up.
> 
> I had actually seen that part of the spec, and looked at it sideways a few
> times, but concluded it wasn't giving me the ordering guarantees I was
> looking for (this was before I wrote the litmus test). You're right, it does
> nonetheless make it stronger than the mere combination of _acquire and
> _release semantics.
> 
> Glad to hear this is something being worked on! I've been giving myself a
> crash course in memory model minutiae over the past few weeks :)
> 
> > > (And while I'm talking to the right people: this issue aside, do atomic ops
> > > on Normal memory create ordering with Device memory ops, or are there no
> > > guarantees there due to the fact that Normal memory is mapped
> > > inner-shareable and the ordering guarantees thus do not extend to
> > > outer-shareable Device accesses? My currenty understanding is the latter,
> > > but I find the ARM ARM wording hard to conclusively grok here.)
> > 
> > Outer-shareable is a superset of inner-shareable, but I think this would be
> > easier with a specific example. I'll go and look at the AIC patch, since
> > this is all a lot easier to talk about in the context of some real code.
> > 
> > Which is the latest version I should look at?
> 
> I'm just about to send a v3 tomorrow, so I'll CC you on that patch (don't
> bother with v2, this part of the code is changing a lot). That said, it's
> basically the following two sequences:
> 
> A:
> 
> // ...stuff that needs to be ordered prior to the atomic here
> ret = atomic_fetch_or_release(flags...)
> if (condition on ret and unrelated stuff) {
> 	writel(reg_send, ...) // includes pre-barrier

Looks you have a control dependency here, so I think that can be
writel_relaxed() [with a comment!].

> }
> 
> B:
> 
> writel_relaxed(reg_ack, ...)

Interesting, is there no MMIO read on the IRQ path?

> dma_wmb() // need a post-barrier
> atomic_fetch_andnot_acquire(flags...)
> // ...stuff that needs to be ordered after the atomic here

This looks pretty dodgy to me as a later read could drift up before teh
writel_relaxed().

> My current understanding is that I cannot drop the dma_wmb() in B and use
> _relaxed in A() and instead use full-ordered atomic ops, because the atomic
> ops, operating on normal IS memory, would not make any statements regarding
> ordering with device OS memory. I need the I/O writes to be ordered with
> regard to the atomics.

If this is all to do with IPIs between CPUs and the "stuff that needs to be
ordered <before/after> the atomic" is all normal, cacheable memory then I
dont't think you need to worry about the outer-shareable domain at all. The
only reason we have that in dma_*mb() is because the coherent DMA buffers
might be non-cacheable to deal with non-coherent DMA.

But hack something together, and I'll look at your v3 to see what's going
on.

Will