[RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings

Mon Mar 2 18:20:19 PST 2015

Hi Christoffer,

I don't understand how the CPU can handle different cache attributes
used by QEMU and the guest. Won't you run into the B2.9 checklist?
Wouldn't cache evictions or cleans wipe out guest updates to the same
cache line(s)?

- Mario

On 03/02/2015 08:31 AM, Christoffer Dall wrote:
> On Tue, Feb 24, 2015 at 05:47:19PM +0000, Ard Biesheuvel wrote:
>> On 24 February 2015 at 14:55, Andrew Jones <drjones at redhat.com> wrote:
>>> On Fri, Feb 20, 2015 at 04:36:26PM +0100, Andrew Jones wrote:
>>>> On Fri, Feb 20, 2015 at 02:37:25PM +0000, Ard Biesheuvel wrote:
>>>>> On 20 February 2015 at 14:29, Andrew Jones <drjones at redhat.com> wrote:
>>>>>> So looks like the 3 orders of magnitude greater number of traps
>>>>>> (only to el2) don't impact kernel compiles.
>>>>>>
>>>>>
>>>>> OK, good! That was what I was hoping for, obviously.
>>>>>
>>>>>> Then I thought I'd be able to quickly measure the number of cycles
>>>>>> a trap to el2 takes with this kvm-unit-tests test
>>>>>>
>>>>>> int main(void)
>>>>>> {
>>>>>>         unsigned long start, end;
>>>>>>         unsigned int sctlr;
>>>>>>
>>>>>>         asm volatile(
>>>>>>         "       mrs %0, sctlr_el1\n"
>>>>>>         "       msr pmcr_el0, %1\n"
>>>>>>         : "=&r" (sctlr) : "r" (5));
>>>>>>
>>>>>>         asm volatile(
>>>>>>         "       mrs %0, pmccntr_el0\n"
>>>>>>         "       msr sctlr_el1, %2\n"
>>>>>>         "       mrs %1, pmccntr_el0\n"
>>>>>>         : "=&r" (start), "=&r" (end) : "r" (sctlr));
>>>>>>
>>>>>>         printf("%lx\n", end - start);
>>>>>>         return 0;
>>>>>> }
>>>>>>
>>>>>> after applying this patch to kvm
>>>>>>
>>>>>> diff --git a/arch/arm64/kvm/hyp.S b/arch/arm64/kvm/hyp.S
>>>>>> index bb91b6fc63861..5de39d740aa58 100644
>>>>>> --- a/arch/arm64/kvm/hyp.S
>>>>>> +++ b/arch/arm64/kvm/hyp.S
>>>>>> @@ -770,7 +770,7 @@
>>>>>>
>>>>>>         mrs     x2, mdcr_el2
>>>>>>         and     x2, x2, #MDCR_EL2_HPMN_MASK
>>>>>> -       orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
>>>>>> +//     orr     x2, x2, #(MDCR_EL2_TPM | MDCR_EL2_TPMCR)
>>>>>>         orr     x2, x2, #(MDCR_EL2_TDRA | MDCR_EL2_TDOSA)
>>>>>>
>>>>>>         // Check for KVM_ARM64_DEBUG_DIRTY, and set debug to trap
>>>>>>
>>>>>> But I get zero for the cycle count. Not sure what I'm missing.
>>>>>>
>>>>>
>>>>> No clue tbh. Does the counter work as expected in the host?
>>>>>
>>>>
>>>> Guess not. I dropped the test into a module_init and inserted
>>>> it on the host. Always get zero for pmccntr_el0 reads. Or, if
>>>> I set it to something non-zero with a write, then I always get
>>>> that back - no increments. pmcr_el0 looks OK... I had forgotten
>>>> to set bit 31 of pmcntenset_el0, but doing that still doesn't
>>>> help. Anyway, I assume the problem is me. I'll keep looking to
>>>> see what I'm missing.
>>>>
>>>
>>> I returned to this and see that the problem was indeed me. I needed yet
>>> another enable bit set (the filter register needed to be instructed to
>>> count cycles while in el2). I've attached the code for the curious.
>>> The numbers are mean=6999, std_dev=242. Run on the host, or in a guest
>>> running on a host without this patch series (after TVM traps have been
>>> disabled), I get a pretty consistent 40.
>>>
>>> I checked how many vm-sysreg traps we do during the kernel compile
>>> benchmark. It's 124924. So it's a bit strange that we don't see the
>>> benchmark taking 10 to 20 seconds longer on average. I should probably
>>> double check my runs. In any case, while I like the approach of this
>>> series, the overhead is looking non-negligible.
>>>
>>
>> Thanks a lot for producing these numbers. 125k x 7k == <1 billion
>> cycles == <1 second on a >1 GHz machine, I think?
>> Or am I missing something? How long does the actual compile take?
>>
> I ran a sequence of benchmarks that I occasionally run (pbzip,
> kernbench, and hackbench) and I also saw < 1% performance degradation,
> so I think we can trust that somewhat.  (I can post the raw numbers when
> I have ssh access to my Linux desktop - sending this from Somewhere Over
> The Atlantic).
> 
> However, my concerns with these patches are on two points:
> 
> 1. It's not a fix-all.  We still have the case where the guest expects
> the behavior of device memory (for strong ordering for example) on a RAM
> region, which we now break.  Similarly, this doesn't support the
> non-coherent DMA to RAM region case.
> 
> 2. While the code is probably as nice as this kind of stuff gets, it
> is non-trivial and extremely difficult to debug.  The counter-point here
> is that we may end up handling other stuff at EL2 for performance reasons
> in the future.
> 
> Mainly because of point 1 above, I am leaning towards thinking userspace
> should do the invalidation when it knows it needs to, either through KVM
> via a memslot flag or through some other syscall mechanism.
> 
> Thanks,
> -Christoffer
>