[PATCH] ARM: The mandatory barrier rmb() must be a dsb() in for device accesses

Sun Apr 10 06:10:41 EDT 2011

On 9 April 2011 20:03, Ming Lei <tom.leiming at gmail.com> wrote:
> 2011/4/9 Catalin Marinas <catalin.marinas at arm.com>:
>> Probably the above document isn't comprehensive enough. It mainly
>> targets memory ordering between processors. I think another example
>> that mentions DSB is the mailbox scenario.
>
> I also saw this example of 8.1, but I don't think it is related with
> linux read memory barrier, see below:
>
>        - 8.1 requires that the 'STR R5, [R1]' in P1 is completed before
>        'LDR R5, [R1]' in P2;

Its not related to the read memory barrier but it refers to ordering
between accesses to Normal memory and Device memory. The mailbox here
at [R4] can be a device. I'm copying the relevant code here for others
following this thread:

P1:
    STR R5, [R1] ; message stored to shared memory location
    DSB [ST]
    STR R1, [R4] ; R4 contains the address of a mailbox

P2:
    ; interrupt service routine
    LDR R5, [R1]

>        - Documentation/memory-barriers.txt said:
>        1), Memory barriers are such interventions.  They impose a
>        perceived partial ordering over the memory operations on either
>        side of the barrier.
>
>        2),There is no guarantee that any of the memory accesses specified
>        before a memory barrier will be _complete_ by the completion of a
>        memory barrier instruction; the barrier can be considered to draw a
>        line in that CPU's access queue that accesses of the appropriate type
>        may not cross.

While the above is correct, the Linux document is quite vague in
relation to the mandatory barriers and it's not clear whether the
above only refers to the SMP barriers.

> In fact, IMO, the 'DSB' should be used before 'LDR R5, [R1]' in P2 instead
> of in P1 because of the belows:

A DSB on one CPU does not control the ordering on another CPU. So the
DSB on P2 is useless. Note that the P2 code is executed in an
interrupt routine raised as a result of STR R1, [R4]. If P2 had a loop
polling for [R4] to get a valid address, we would indeed need a DSB
before LDR R5, [R1].

>        - out of order of the two stores in P1 is allowable

That's why we have a DSB in there, to ensure that the [R1] location is
written before we store the R1 to the [R4] device.

>        - P2 should get message address from the mailbox pointed by R4 first,
>        so how can we make sure the 'STR R1, [R4]' in P1 is completed if
>        R4 doesn't point to a Strongly-Ordered memory.

Even if the memory is SO, on newer CPUs I don't think it guarantees
the completion (pretty much acting like Device memory). But here we
don't care when the store to [R4] completes. When this happen, P2 will
get an interrupt. P2 does not execute the LDR R5, [R1] in interrupt
routine speculatively (i.e. before the interrupt was received).

>> Let's assume we have a device that performs the two steps below:
>>
>> 1. Writes data to RAM
>> 2. Updates its status register
>>
>> A driver running on the CPU has some code as below:
>>
>>    LDR [Device]    @ read the device status
>>    DMB               @ current barrier that we have in readl
>>    TST                 @ check whether the DMA transfer is ready
>>    BEQ out
>>    LDR [Normal]    @ read the DMA buffer
>>    ...
>> out:
>>
>> With the code above, the CPU may do the following steps:
>>
>> 1. Issue read from the device. Note that it does not wait for the read
>> to complete.
>> 2. DMB - ensures that no subsequent memory accesses happen before the
>> previous ones.
>> 3. Issues read from normal memory speculatively. This is allowed
>> because the TST/BEQ are only flow control dependency. In case the
>> condition fails, the read is discarded.
>> 4. The read from Normal memory (DMA buffer) completes. This could
>> happen before the I/O read at point 1 depending on the bus speeds.
>> 5. The Device read completes. This can happen after the Normal read
>> because of different bus speeds.
>> 6. TST clears CPSR.Z
>> 7. BEQ not executed.
>> 8. Normal read data moved to register.
>>
>> So, even if the CPU issues the read from Device and Normal memory in
>> order (steps 1, 3), they can happen at the device and RAM level out of
>> order (steps 4, 5) and the CPU could read data not yet written by the
>> device.
>>
>> The solution is to use a DSB which ensures the completion of the
>> Device read before issuing the Normal memory read.
>
> I agree a DSB is needed in such case, but I am not sure the patch is
> good. Maybe we should keep rmb not changed and only replace __iormb
> as DSB if this issue only happens between device io memory and normal
> memory, then rmb will not degrade performance a little as does by your
> patch.

What other use do you see for the mandatory barriers if not in
relation to I/O? They are not for SMP configurations. We already have
a DSB in wmb().

The mandatory barriers could be present in device drivers that use the
relaxed I/O accessors and mandatory barriers explicitly as
optimisation. This was the approach recommended in past mailing list
discussions.

-- 
Catalin