LSE atomic op ordering is weaker than intended?

Wed Mar 3 19:37:31 GMT 2021

On 04/03/2021 03.40, Will Deacon wrote:
>> // ...stuff that needs to be ordered prior to the atomic here
>> ret = atomic_fetch_or_release(flags...)
>> if (condition on ret and unrelated stuff) {
>> 	writel(reg_send, ...) // includes pre-barrier
> 
> Looks you have a control dependency here, so I think that can be
> writel_relaxed() [with a comment!].

Wouldn't the control dependency be strictly on the *read* portion of the 
atomic, while I'm looking to order on the write part?

Though, actually, I omitted something; the unrelated stuff includes an 
atomic_read_acquire that will create order against the write-release 
portion of the atomic (not per the Linux cross-platform docs, but yes 
per ARMv8 B2.3.3) and *that* should be a suitable control dependency if 
I'm not mistaken.

>> writel_relaxed(reg_ack, ...)
> 
> Interesting, is there no MMIO read on the IRQ path?

There's a prior read, but that doesn't help here; the important part is 
that this writel (ACKing the hardware IPI) cannot move after the atomic 
op after it (figuring out which software IPIs it represents), because if 
it does we might ACK a subsequent IPI we're not about to handle.

>> dma_wmb() // need a post-barrier
>> atomic_fetch_andnot_acquire(flags...)
>> // ...stuff that needs to be ordered after the atomic here
> 
> This looks pretty dodgy to me as a later read could drift up before teh
> writel_relaxed().

That's fine, as long as those reads are ordered after the atomic fetch 
portion itself (which they should be since it's an _acquire).

AIUI the worst case ordering this can result in is:

- read portion of atomic
- the entire rest of the code (if it does no writes)
- writel_relaxed()
- write portion of atomic

That's fine: the result of the atomic read gives us license to run the 
rest of the code; the writel doesn't need to have happened. The 
important part is that the write portion of the atomic can't happen 
before the writel.

> If this is all to do with IPIs between CPUs and the "stuff that needs to be
> ordered <before/after> the atomic" is all normal, cacheable memory then I
> dont't think you need to worry about the outer-shareable domain at all. The
> only reason we have that in dma_*mb() is because the coherent DMA buffers
> might be non-cacheable to deal with non-coherent DMA.

What worried me is whether the order created by atomics on normal memory 
might not extend to MMIO ops on device memory, which are treated as 
outer-shareable (per D5.5.1).

I don't fully understand how the entire B2.3 section on ordering 
interacts across memory regions having different attributes. That 
section talks about a Common Shareability Domain, and also explicitly 
mentions only Normal memory locations, so I'm having a lot of trouble 
piecing together how MMIO ops are treated within the entire memory 
ordering framework (and all the litmus test stuff seems to assume 
everythign is IS...).

All the actual memory involved in everything but the writel() stuff is 
all normal cacheable memory; what I need to ensure is that I can 
transitively create memory ordering constraints between normal memory 
accesses through MMIO writes.

That is, that I can create two ordered sequences on two PEs:

(normal access 1; mmio write 1)
(mmio write 2; normal access 2)

Such that I can assert that at least one of the following orderings are 
true:

(normal access 1; normal access 2)
(as observed by the PEs)

or:

(mmio write 2; mmio write 1)
(as observed by the MMIO device)

But never this combination:

(normal access 2; normal access 1)
(mmio write 1; mmio write 2)

> But hack something together, and I'll look at your v3 to see what's going
> on.

Please do, hopefully the full code makes more sense.

-- 
Hector Martin (marcan at marcan.st)
Public Key: https://mrcn.st/pub