[PATCH] riscv/atomic.h: optimize ops with acquire/release ordering
Puranjay Mohan
puranjay at kernel.org
Wed May 8 06:58:58 PDT 2024
Andrea Parri <parri.andrea at gmail.com> writes:
>> I think Guo Ren sent a patch[1] like this earlier but it did not get
>> comments yet. I will reply on that thread[1] as well.
>
> TBF, those changes appeared in a later submission/series,
>
> https://lore.kernel.org/lkml/20220505035526.2974382-1-guoren@kernel.org/
>
> a submission that received a similar feedback from the ATOMIC INFRASTRUCTURE
> maintainers and myself: in short, "please explain _why_ you are doing what
> you are doing".
>
>
>> I saw the commit 5ce6c1f3535fa ("riscv/atomic: Strengthen
>> implementations with fences") and all the related discussions.
>>
>> This is what I understand from the discussions:
>>
>> RISCV's LR.aq/SC.rl were following RCpc ordering but the LKMM expected
>> RCsc ordering from lock() and unlock(). So you added fences to force RCsc
>> ordering in the locks/atomics.
>
> Appreciate the effort. Some corrections/clarifications:
>
> When 5ce6c1f3535fa was developed, the LKMM expected "less-than-RCsc" ordering
> from the lock operations. Some of those properties were illustrated by the
> unlock-lock-read-ordering litmus test you reported (and included in
>
> 0123f4d76ca63 ("riscv/spinlock: Strengthen implementations with fences") ).
>
> It's also worth mentioning that, when 5ce6c1f3535fa was discussed, the LKMM
> expected similar read-read ordering properties to hold for ordinary acquire
> and release operations, i.e. not necessary a lock operation.
>
> Later changes to the LKMM relaxed those properties for ordinary acquires and
> releases, and added extra ordering for locks, cf.
>
> 6e89e831a9017 ("tools/memory-model: Add extra ordering for locks and remove it for ordinary release/acquire")
> ddfe12944e848 ("tools/memory-model: Provide extra ordering for unlock+lock pair on the same CPU")
>
> Roughly speaking, such changes made the LKMM's locks RCtso, and this matches
> the current LKMM's approach. (Yes I know, there is code assuming/wishing RCsc
> locks... long story, not strictly related to this discussion/thread: IAC, for
> completeness, I'll say more about that in my comments below.)
>
> My changes/the current implementations provides RCtso (not RCsc) ordering for
> RISCV's locks and atomics; in fact, by their very design, this RCtso is pretty
> easy to see/prove:
>
> (1) every release op provides RW to W order (or stronger);
>
> (2) every acquire op provides more than R to R order (typically R to RW
> order, but in atomic_cond_load_acquire() & co. that R-to-W order is
> limited to the "R" associated with the acquire op itself).
>
> Put together, (1-2) give R-to-R, R-to-W and W-to-W order (aka RCtso) as claimed.
> Notice that this argument holds for every locks operations and types (spinlock,
> rwlock, mutex, rt_mutex, semaphore, rw_semaphore, etc.) and that it does _not_
> require any audit of the locking code. More on this point below.
>
>
>> An experiment with LKMM and RISCV MM:
>>
>> The following litmus test should not reach (1:r0=1 /\ 1:r1=0) with LKMM:
>>
>> C unlock-lock-read-ordering
>>
>> {}
>> /* s initially owned by P1 */
>>
>> P0(int *x, int *y)
>> {
>> WRITE_ONCE(*x, 1);
>> smp_wmb();
>> WRITE_ONCE(*y, 1);
>> }
>>
>> P1(int *x, int *y, spinlock_t *s)
>> {
>> int r0;
>> int r1;
>>
>> r0 = READ_ONCE(*y);
>> spin_unlock(s);
>> spin_lock(s);
>> r1 = READ_ONCE(*x);
>> }
>>
>> exists (1:r0=1 /\ 1:r1=0)
>>
>> Which is indeed true:
>>
>> Test unlock-lock-read-ordering Allowed
>> States 3
>> 1:r0=0; 1:r1=0;
>> 1:r0=0; 1:r1=1;
>> 1:r0=1; 1:r1=1;
>> No
>> Witnesses
>> Positive: 0 Negative: 3
>> Flag unmatched-unlock
>> Condition exists (1:r0=1 /\ 1:r1=0)
>> Observation unlock-lock-read-ordering Never 0 3
>> Time unlock-lock-read-ordering 0.01
>> Hash=ab0cfdcde54d1bb1faa731533980f424
>>
>> And when I map this test to RISC-V:
>>
>> RISCV RISCV-unlock-lock-read-ordering
>> {
>> 0:x2=x;
>> 0:x4=y;
>>
>> 1:x2=x;
>> 1:x4=y;
>> 1:x6=s;
>> }
>> P0 | P1 ;
>> ori x1,x0,1 | lw x1,0(x4) ;
>> sw x1,0(x2) | amoswap.w.rl x0,x0,(x6) ;
>> fence w,w | ori x5,x0,1 ;
>> ori x3,x0,1 | amoswap.w.aq x0,x5,(x6) ;
>> sw x3,0(x4) | lw x3,0(x2) ;
>> exists (1:x1=1 /\ 1:x3=0)
>>
>> This also doesn't reach the condition:
>>
>> Test RISCV-unlock-lock-read-ordering Allowed
>> States 3
>> 1:x1=0; 1:x3=0;
>> 1:x1=0; 1:x3=1;
>> 1:x1=1; 1:x3=1;
>> No
>> Witnesses
>> Positive: 0 Negative: 3
>> Condition exists (1:x1=1 /\ 1:x3=0)
>> Observation RISCV-unlock-lock-read-ordering Never 0 3
>> Time RISCV-unlock-lock-read-ordering 0.01
>> Hash=d845d36e2a8480165903870d135dd81e
>
> Which "mapping" did you use for this experiment/analysis? Looking at the
Actually, by mapping I meant:
R1
amoswap.w.rl
amoswap.w.aq
R2
will provide R1->R2 ordering like:
R1
spin_unlock()
spin_lock()
R2
That test is for read->read ordering enforced by unlock()->lock() and I
just wanted to say that the current RISC-V memory model provides that
with all(amo/lr/sc) .rl -> .aq operations.
> current spinlock code for RISCV (and including the atomic changes at stake)
> P1 seems to be better described by something like:
>
> fence rw,w // arch_spin_unlock --> smp_store_release
> sw
>
> lr.w // arch_spin_trylock --> arch_try_cmpxchg
> bne
> sc.w.rl
> bnez
> fence rw,rw
>
> or
>
> amoadd.w.aqrl // arch_spin_lock --> atomic_fetch_add
>
> or
>
> lw // arch_spin_lock --> atomic_cond_read_acquire ; smp_mb (??)
> bne
> fence r,r
>
> fence rw,rw
>
> Looking at the rwlock code (for which the same RCtso property is expected to
> hold, even though that hasn't been formalized in the LKMM yet), I see (again,
> including your atomic changes):
>
> amoadd.w.rl // queued_read_unlock --> atomic_sub_return_release
> amoadd.w.aq // queued_read_lock --> atomic_add_return_acquire
>
> and
>
> fence rw,w // queue_write_unlock --> smp_store_release
> sw
> lr.w // queue_write_lock --> atomic_try_cmpxchg_acquire
> bne
> sc.w
> bnez
> fence r,rw
>
> I won't list the slowpath scenarios. Or even the mutex, semaphore, etc. I
> believe you got the point...
Thanks for these details, I will do more testing with this on herd7.
Something out of curiosity?:
>From my understanding of the current version of the RV memory model:
.aq provides .aq -> all ordering
.rl provides all -> .rl ordering
and because this is RCsc variant of release consistency
.rl -> .aq
which means
R/W
amoswap.w.rl
amoswap.w.aq
R/W
Should act as a full fence? R/W -> rl -> aq -> R/W
>
>> Your commit mentioned that the above test would reach the exists
>> condition for RISCV.
>>
>> So, maybe the model has been modified to make .aq and .rl RCsc now?
>
> Yes. .aq and .rl are RCsc. They were considered RCpc when 5ce6c1f3535fa
> 0123f4d76ca63 were discussed (which happened _before_ the RISC-V's memory
> model was ratified) as clearly remarked in their commit messages.
>
> The ATOMIC maintainers went as far as "bisecting" the RISC-V ISA spec in
>
> https://lore.kernel.org/lkml/YrPei6q4rIAx6Ymf@boqun-archlinux/
>
> but, as they say, it's hard to help people who don't want to be helped...
>
>
>> This presentation[2] by Dan Lustig says on page 31:
>>
>> | PPO RULES 5-7
>> | A release that precedes an acquire in program
>> | order also precedes it in global memory order
>> | • i.e., the RCsc variant of release consistency
>>
>> If above is true, removing the weak fences and using LR, SC, AMOs with
>> aq, rl, and aqrl bits could be used in the kernel AMOs and locks.
>
> The problem with this argument is that it relies on all lock ops to come with
> an RCsc annotation, which is simply not true in the current/RISCV code as the
> few snippets above also suggested.
>
> BTW, arm64 uses a similar argument, except all its releases/acquires come with
> RCsc annotations (which greatly simplifies the analysis). The argument could
> be easily made to work in RISCV _provided_ its ISA were augmented with lw.aq
> and sw.rl, but it's been ~6 years...
>
> Said this, maybe we're "lucky" and all the unlock+lock pairs will just work w/
> your changes. I haven't really checked, and I probably won't until the only
> motivation for such changes will be "lower inst count in qemu".
>
> On such regard, remark that Section A.5, "Code porting and mapping guidelines"
> of the RISCV ISA spec provides alternative mapping for our atomics, including
> AMO mapping w/ .aq and .rl annotations: I'm sure those mappings were subject
> to a fair amount of review and formal analysis (although I was not involved in
> that work/review at the time): if inst count is so important to you, why not
> simply follow those guidelines? (Notice that such re-write would require some
> modification to non-AMO mappings, cf. smp_store_release() and LR/SC mappings.)
So, I will do the following now:
1. Do some benchmarking on real hardware and find out how much overhead
these weak fences add.
2. Study the LKMM and the RVWMO for the next few weeks/months or however
much time it takes me to confidently reason about things written in
these two models.
3. Study the locking / related code of RISC-V to see what could break if
we change all these operations in accordance with "Code Porting and
Mapping Guidelines" of RISCV ISA.
4. I will use the herd7 models of LKMM and RVWMO and see if everything
works as expected after these changes.
And If I am convinced after all this, I will send a patch to implement
"Code Porting and Mapping Guidelines" + provide performance numbers from
real hardware.
Thanks for the detailed explainations and especially regarding how the
LKMM evolved.
Puranjay
More information about the linux-riscv
mailing list