[RFC PATCH 3/4] ARM: bL_entry: Match memory barriers to architectural requirements

Santosh Shilimkar santosh.shilimkar at ti.com
Wed Jan 16 07:11:00 EST 2013


On Wednesday 16 January 2013 05:19 PM, Dave Martin wrote:
> On Wed, Jan 16, 2013 at 12:20:47PM +0530, Santosh Shilimkar wrote:
>> + Catalin, RMK
>>
>> Dave,
>>
>> On Tuesday 15 January 2013 10:18 PM, Dave Martin wrote:
>>> For architectural correctness even Strongly-Ordered memory accesses
>>> require barriers in order to guarantee that multiple CPUs have a
>>> coherent view of the ordering of memory accesses.
>>>
>>> Virtually everything done by this early code is done via explicit
>>> memory access only, so DSBs are seldom required.  Existing barriers
>>> are demoted to DMB, except where a DSB is needed to synchronise
>>> non-memory signalling (i.e., before a SEV).  If a particular
>>> platform performs cache maintenance in its power_up_setup function,
>>> it should force it to complete explicitly including a DSB, instead
>>> of relying on the bL_head framework code to do it.
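
If I read that rule correctly, the distinction looks roughly like the sketch
below (plain C with inline asm, names invented here, not the actual bL_head.S
code, which is assembly): a DMB is enough to order one explicit memory access
against another, but SEV is not a memory access, so a DSB is needed to make
sure the store has completed before the event is signalled.

    #include <stdint.h>

    /* Hypothetical flag, standing in for one of the bL_entry control fields. */
    static volatile uint32_t ready_flag;

    /* Ordering one explicit memory access against later ones: a DMB is enough. */
    static void publish_flag(void)
    {
            ready_flag = 1;
            __asm__ volatile ("dmb" ::: "memory");  /* order the store vs. subsequent accesses */
    }

    /* Waking CPUs parked in WFE: SEV is not a memory access, so a DSB is
     * needed to guarantee the store is visible before the event is sent. */
    static void publish_flag_and_wake(void)
    {
            ready_flag = 1;
            __asm__ volatile ("dsb" ::: "memory");  /* complete the store system-wide */
            __asm__ volatile ("sev");               /* then signal the waiting CPUs */
    }
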
>>>
>>> Some additional DMBs are added to ensure all the memory ordering
>>> properties required by the race avoidance algorithm.  DMBs are also
>>> moved out of loops, and for clarity some are moved so that most
>>> directly follow the memory operation which needs to be
>>> synchronised.
>>>
>>> The setting of a CPU's bL_entry_vectors[] entry is also required to
>>> act as a synchronisation point, so a DMB is added after checking
>>> that entry to ensure that other CPUs do not observe gated
>>> operations leaking across the opening of the gate.
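
So for the gate itself the shape would be roughly as below (again a C sketch
for illustration only; the surrounding code is invented, the real code is
assembly): the DMB is issued once after the polling loop, i.e. after the entry
has been observed non-zero rather than on every iteration, and it prevents
anything done after the gate opens from being observed ahead of the gate.

    #include <stdint.h>

    /* Illustrative declaration only; the real array lives in the bL_entry code. */
    extern volatile uint32_t bL_entry_vectors[];

    static void wait_at_gate(unsigned int cpu)
    {
            /* Poll the vector entry; WFE just parks the CPU between checks. */
            while (bL_entry_vectors[cpu] == 0)
                    __asm__ volatile ("wfe");

            /*
             * One DMB after the gate is seen open: operations performed past
             * this point must not be observed by other CPUs as having leaked
             * back across the opening of the gate.
             */
            __asm__ volatile ("dmb" ::: "memory");
    }
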
>>>
>>> Signed-off-by: Dave Martin <dave.martin at linaro.org>
>>> ---
>>
>> Sorry to pick on this again, but I am not able to understand why
>> the strongly ordered access needs barriers. At least from the
>> ARM point of view, a strongly ordered write will be more of a blocking
>> write, and the interconnect further down is also supposed to respect that
>
> This is what I originally assumed (hence the absence of barriers in
> the initial patch).
>
>> rule. SO reads and writes are like adding a barrier after every load and store,
>
> This assumption turns out to be wrong, unfortunately, although in
> a uniprocessor scenario it makes no difference.  A SO memory access
> does block the CPU making the access, but explicitly does not
> block the interconnect.
>
I suspected the interconnect was involved when you described the need
for barriers for SO memory regions.

> In a typical boot scenario for example, all secondary CPUs are
> quiescent or powered down, so there's no problem.  But we can't make
> the same assumptions when we're trying to coordinate between
> multiple active CPUs.
>
>> so adding explicit barriers doesn't make sense. Is this a side
>> effect of some "write early response" kind of optimization at the
>> interconnect level?
>
> Strongly-Ordered accesses are always non-shareable, so there is
> no explicit guarantee of coherency between multiple masters.
>
This is probably where the issue is, then. My understanding is exactly
the opposite here, and hence I wasn't worried about the multi-master
CPU scenario, since the shareable attributes would take care of it,
given that the same page tables are used in an SMP system.

ARM documentation says:
------------------------------------------------------------
Shareability and the S bit, with TEX remap
The memory type of a region, as indicated in the Memory type column of
Table B3-12 on page B3-1350, provides the first level of control of
whether the region is shareable:
• If the memory type is Strongly-ordered then the region is Shareable
------------------------------------------------------------

> If there is only one master, it makes no difference, but if there
> are multiple masters, there is no guarantee that they are connected
> to a slave device (DRAM controller in this case) via a single
> slave port.
>
See above. We are talking about multiple CPUs here and not
the DSP or other co-processors. In either case we are discussing
code which is getting executed on ARM CPUs, so we can safely limit
the discussion to the multi-master ARM CPU case.

> The architecture only guarantees global serialisation when there is a
> single slave device, but provides no way to know whether two accesses
> from different masters will reach the same slave port.  This is in the
> realms of "implementation defined."
>
> Unfortunately, a high-performance component like a DRAM controller
> is exactly the kind of component which may implement multiple
> master ports, so you can't guarantee that accesses are serialised
> in the same order from the perspective of all masters.  There may
> be some pipelining and caching between each master port and the actual
> memory, for example.  This is allowed, because there is no requirement
> for the DMC to look like a single slave device from the perspective
> of multiple masters.
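
If it helps, the race those extra DMBs close can be restated as the classic
message-passing case, sketched here in C with inline asm purely for
illustration (the names are invented; the real code is assembly in bL_head.S).
With a multi-ported DRAM controller and no barriers, CPU B could observe the
flag set yet still read a stale payload, even though both locations are
Strongly-Ordered.

    #include <stdint.h>

    /* Both locations are Strongly-Ordered/uncached in the real code; names invented. */
    volatile uint32_t payload;
    volatile uint32_t flag;

    /* CPU A: publish the payload, then raise the flag. */
    void writer(void)
    {
            payload = 42;
            __asm__ volatile ("dmb" ::: "memory");  /* payload must be visible before flag, to all observers */
            flag = 1;
    }

    /* CPU B: wait for the flag, then read the payload. */
    uint32_t reader(void)
    {
            while (flag == 0)
                    ;                               /* spin until CPU A raises the flag */
            __asm__ volatile ("dmb" ::: "memory");  /* don't let the payload read appear to happen earlier */
            return payload;
    }
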
>
> A multi-ported slave might provide transparent coherency between master
> ports, but it is only required to guarantee this when the accesses
> are shareable (SO is always non-shared), or when explicit barriers
> are used to force synchronisation between the device's master ports.
>
> Of course, a given platform may have a DMC with only one slave
> port, in which case the barriers should not be needed.  But I wanted
> this code to be generic enough to be reusable -- hence the
> addition of the barriers.  The CPU does not need to wait for a DMB
> to "complete" in any sense, so this does not necessarily have a
> meaningful impact on performance.
>
> This is my understanding anyway.
>
>> Will you be able to point to specs or documents which state
>> this requirement?
>
> Unfortunately, this is one of those things which we require not because
> there is a statement in the ARM ARM to say that we need it -- rather,
> there is no statement in the ARM ARM to say that we don't.
>
Thanks a lot for the elaborate answer. It helps me understand the rationale
at least.

Regards
Santosh



