[PATCH V2 0/3] riscv: atomic: Optimize AMO instructions usage

Wed Apr 20 10:03:12 PDT 2022

On 4/20/2022 1:33 AM, Guo Ren wrote:
> Thx Dan,
> 
> On Wed, Apr 20, 2022 at 1:12 AM Dan Lustig <dlustig at nvidia.com> wrote:
>>
>> On 4/17/2022 12:51 AM, Guo Ren wrote:
>>> Hi Boqun & Andrea,
>>>
>>> On Sun, Apr 17, 2022 at 10:26 AM Boqun Feng <boqun.feng at gmail.com> wrote:
>>>>
>>>> On Sun, Apr 17, 2022 at 12:49:44AM +0800, Guo Ren wrote:
>>>> [...]
>>>>>
>>>>> If both the aq and rl bits are set, the atomic memory operation is
>>>>> sequentially consistent and cannot be observed to happen before any
>>>>> earlier memory operations or after any later memory operations in the
>>>>> same RISC-V hart and to the same address domain.
>>>>>                 "0:     lr.w     %[p],  %[c]\n"
>>>>>                 "       sub      %[rc], %[p], %[o]\n"
>>>>>                 "       bltz     %[rc], 1f\n".
>>>>> -               "       sc.w.rl  %[rc], %[rc], %[c]\n"
>>>>> +               "       sc.w.aqrl %[rc], %[rc], %[c]\n"
>>>>>                 "       bnez     %[rc], 0b\n"
>>>>> -               "       fence    rw, rw\n"
>>>>>                 "1:\n"
>>>>> So .rl + fence rw, rw is over constraints, only using sc.w.aqrl is more proper.
>>>>>
>>>>
>>>> Can .aqrl order memory accesses before and after it (not against itself,
>>>> against each other), i.e. act as a full memory barrier? For example, can
>>> From the RVWMO spec description, the .aqrl annotation appends the same
>>> effect with "fence rw, rw" to the AMO instruction, so it's RCsc.
>>>
>>> Not only .aqrl, and I think the below also could be an RCsc when
>>> sc.w.aq is executed:
>>> A: Pre-Access
>>> B: lr.w.rl ADDR-0
>>> ...
>>> C: sc.w.aq ADDR-0
>>> D: Post-Acess
>>> Because sc.w.aq has overlap address & data dependency on lr.w.rl, the
>>> global memory order should be A->B->C->D when sc.w.aq is executed. For
>>> the amoswap
>>
>> These opcodes aren't actually meaningful, unfortunately.
>>
>> Quoting the ISA manual chapter 10.2: "Software should not set the rl bit
>> on an LR instruction unless the aq bit is also set, nor should software
>> set the aq bit on an SC instruction unless the rl bit is also set."
> 1. Oh, I've missed the behind half of the ISA manual. But why can't we
> utilize lr.rl & sc.aq in software programming to guarantee the
> sequence?

lr.aq and sc.rl map more naturally to hardware than lr.rl and sc.aq.
Plus, they just aren't common operations to begin with, e.g., there
is no smp_store_acquire() or smp_load_release(), nor are there
equivalents in C/C++ atomics.

> 2. Using .aqrl to replace the fence rw, rw is okay to ISA manual,
> right? And reducing a fence instruction to gain better performance:
>                 "0:     lr.w     %[p],  %[c]\n"
>                  "       sub      %[rc], %[p], %[o]\n"
>                  "       bltz     %[rc], 1f\n".
>  -               "       sc.w.rl  %[rc], %[rc], %[c]\n"
>  +              "       sc.w.aqrl %[rc], %[rc], %[c]\n"
>                  "       bnez     %[rc], 0b\n"
>  -               "       fence    rw, rw\n"

Yes, using .aqrl is valid.

Dan

>>
>> Dan
>>
>>> The purpose of the whole patchset is to reduce the usage of
>>> independent fence rw, rw instructions, and maximize the usage of the
>>> .aq/.rl/.aqrl aonntation of RISC-V.
>>>
>>>                 __asm__ __volatile__ (                                  \
>>>                         "0:     lr.w %0, %2\n"                          \
>>>                         "       bne  %0, %z3, 1f\n"                     \
>>>                         "       sc.w.rl %1, %z4, %2\n"                  \
>>>                         "       bnez %1, 0b\n"                          \
>>>                         "       fence rw, rw\n"                         \
>>>                         "1:\n"                                          \
>>>
>>>> we end up with u == 1, v == 1, r1 on P0 is 0 and r1 on P1 is 0, for the
>>>> following litmus test?
>>>>
>>>>     C lr-sc-aqrl-pair-vs-full-barrier
>>>>
>>>>     {}
>>>>
>>>>     P0(int *x, int *y, atomic_t *u)
>>>>     {
>>>>             int r0;
>>>>             int r1;
>>>>
>>>>             WRITE_ONCE(*x, 1);
>>>>             r0 = atomic_cmpxchg(u, 0, 1);
>>>>             r1 = READ_ONCE(*y);
>>>>     }
>>>>
>>>>     P1(int *x, int *y, atomic_t *v)
>>>>     {
>>>>             int r0;
>>>>             int r1;
>>>>
>>>>             WRITE_ONCE(*y, 1);
>>>>             r0 = atomic_cmpxchg(v, 0, 1);
>>>>             r1 = READ_ONCE(*x);
>>>>     }
>>>>
>>>>     exists (u=1 /\ v=1 /\ 0:r1=0 /\ 1:r1=0)
>>> I think my patchset won't affect the above sequence guarantee. Current
>>> RISC-V implementation only gives RCsc when the original value is the
>>> same at least once. So I prefer RISC-V cmpxchg should be:
>>>
>>>
>>> -                       "0:     lr.w %0, %2\n"                          \
>>> +                      "0:     lr.w.rl %0, %2\n"                          \
>>>                         "       bne  %0, %z3, 1f\n"                     \
>>>                         "       sc.w.rl %1, %z4, %2\n"                  \
>>>                         "       bnez %1, 0b\n"                          \
>>> -                       "       fence rw, rw\n"                         \
>>>                         "1:\n"                                          \
>>> +                        "       fence w, rw\n"                    \
>>>
>>> To give an unconditional RSsc for atomic_cmpxchg.
>>>
>>>>
>>>> Regards,
>>>> Boqun
>>>
>>>
>>>
> 
> 
>