[PATCH v7 4/4] ARM: Add support for Hisilicon Kunpeng L3 cache controller

Tue Feb 2 07:18:50 EST 2021

On 2021/2/2 16:44, Arnd Bergmann wrote:
> On Tue, Feb 2, 2021 at 8:16 AM Zhen Lei <thunder.leizhen at huawei.com> wrote:
>> +
>> +/*
>> + * All read and write operations on L3 cache registers are protected by the
>> + * spinlock, except for l3cache_init(). Each time the L3 cache operation is
>> + * performed, all related information is filled into its registers. Therefore,
>> + * there is no memory order problem when only _relaxed() functions are used.
> 
> Thank you for including the text.
> 
> I don't think the explanation with the spin_lock() explains why this
> can be considered safe though, as spin_lock() only contains serialization
> against other CPUs (smp_mb()) rather than the stronger DMA barriers
> implied by readl and writel. As Russell previously explained, these
> barriers are the L1 cache operations (e.g. v7_dma_inv_range) do
> include stronger barriers, so it would be better to come up with a
> justification based on those.

Okay, I'll correct the description.

> 
>> + * This can help us achieve some performance improvement:
>> + * 1) The readl_relaxed() is about 20ns faster than readl().
>> + * 2) The writel_relaxed() is about 123ns faster than writel().
> 
> These are not really the performance numbers I asked for, as a
> low-level benchmark comparing the instructions is rather meaningless.
> The time spent waiting for the barrier depends on what else is going
> on around the barrier. Also, most of the time would likely be
> spent spinning in the loop around readl() while the cache operations
> are in progress, so the latency of a single readl() is not necessarily
> significant.
> 
> To have a more useful performance number, try mentioning the
> most performance sensitive non-coherent DMA master on one
> of the chips that has this cache controller, and a high-level
> performance number such as "1.2% more network packets per
> second" if that is something you can measure easily.

It's not easy. My board only have debugging NIC, only the downstream
products have high-speed service NIC. Software needs to be packaged
layer by layer.

> 
> Of course, if all high-speed DMA masters on this chip are
> cache coherent, there is no need for performance numbers, just
> mention that we don't care about speed in that case.

It's not cache coherent, otherwise, the L3 cache does not need to be
operated.

> 
>         Arnd
> 
> .
>