[PATCH] perf: arm_spe: Use Inner Shareable DSB when draining the buffer

Mon Oct 19 08:55:42 EDT 2020

On 2020-10-19 13:24, Mark Rutland wrote:
> On Tue, Oct 06, 2020 at 05:13:31PM +0100, Alexandru Elisei wrote:
>> Hi Marc,
>> 
>> Thank you for having a look at the patch!
>> 
>> On 10/6/20 4:32 PM, Marc Zyngier wrote:
>> > Hi Alex,
>> >
>> > On Tue, 06 Oct 2020 16:05:20 +0100,
>> > Alexandru Elisei <alexandru.elisei at arm.com> wrote:
>> >> From ARM DDI 0487F.b, page D9-2807:
>> >>
>> >> "Although the Statistical Profiling Extension acts as another observer in
>> >> the system, for determining the Shareability domain of the DSB
>> >> instructions, the writes of sample records are treated as coming from the
>> >> PE that is being profiled."
>> >>
>> >> Similarly, on page D9-2801:
>> >>
>> >> "The memory type and attributes that are used for a write by the
>> >> Statistical Profiling Extension to the Profiling Buffer is taken from the
>> >> translation table entries for the virtual address being written to. That
>> >> is:
>> >> - The writes are treated as coming from an observer that is coherent with
>> >>   all observers in the Shareability domain that is defined by the
>> >>   translation tables."
>> >>
>> >> All the PEs are in the Inner Shareable domain, use a DSB ISH to make sure
>> >> writes to the profiling buffer have completed.
>> > I'm a bit sceptical of this change. The SPE writes are per-CPU, and
>> > all we are trying to ensure is that the CPU we are running on has
>> > drained its own queue of accesses.
>> >
>> > The accesses being made within the IS domain doesn't invalidate the
>> > fact that they are still per-CPU, because "the writes of sample
>> > records are treated as coming from the PE that is being profiled.".
>> >
>> > So why should we have an IS-wide synchronisation for accesses that are
>> > purely local?
>> 
>> I think I might have misunderstood how perf spe works. Below is my 
>> original train
>> of thought.
>> 
>> In the buffer management event interrupt we drain the buffer, and if 
>> the buffer is
>> full, we call arm_spe_perf_aux_output_end() -> perf_aux_output_end(). 
>> The comment
>> for perf_aux_output_end() says "Commit the data written by hardware 
>> into the ring
>> buffer by adjusting aux_head and posting a PERF_RECORD_AUX into the 
>> perf buffer.
>> It is the pmu driver's responsibility to observe ordering rules of the 
>> hardware,
>> so that all the data is externally visible before this is called." My 
>> conclusion
>> was that after we drain the buffer, the data must be visible to all 
>> CPUs.
> 
> FWIW, this reasoning sounds correct to me. The DSB NSH will be
> sufficient to drain the buffer, but we need the DSB ISH to ensure that
> it's visbile to other CPUs at the instant we call 
> perf_aux_output_end().

Right. I think I missed that last bit (and Alex's email at the same 
time).

> Otherwise, if CPU x is reading the ring-buffer written by CPU y, it
> might see the aux buffer pointers updated before the samples are
> viisble, and hence read junk from the buffer.
> 
> We can add a comment to that effect (or rework perf_aux_output_end()
> somehow to handle that ordering).

I'd rather this is done in perf_aux_output_end(), as a full blown DSB 
ISH
on guest entry is pretty harsh... It would also nicely split the 
responsibilities:

- KVM stops SPE and make sure the output is drained
- Perf makes the data visible to all CPUs

Thoughts?

         M.
-- 
Jazz is not dead. It just smells funny...