[RFC/RFT PATCH 0/3] arm64: KVM: work around incoherency with uncached guest mappings

Thu Mar 5 11:13:43 PST 2015

On 5 March 2015 at 15:58, Catalin Marinas <catalin.marinas at arm.com> wrote:
> On Thu, Mar 05, 2015 at 01:26:39PM +0100, Paolo Bonzini wrote:
>> On 05/03/2015 13:03, Catalin Marinas wrote:
>> >> > I'd hate to have to do that. PCI should be entirely probeable
>> >> > given that we tell the guest where the host bridge is, that's
>> >> > one of its advantages.
>> > I didn't say a DT node per device, the DT doesn't know what PCI devices
>> > are available (otherwise it defeats the idea of probing). But we need to
>> > tell the OS where the host bridge is via DT.
>> >
>> > So the guest would be told about two host bridges: one for real devices
>> > and another for virtual devices. These can have different coherency
>> > properties.
>>
>> Yeah, and it would suck that the user needs to know the difference
>> between the coherency proprties of the host bridges.
>
> The host needs to know about this, unless we assume full coherency on
> all the platforms. Arguably, Qemu needs to know as well if it is the one
> generating the DT for guest (or at least passing some snippets from the
> host DT).
>
>> It would especially suck if the user has a cluster with different
>> machines, some of them coherent and others non-coherent, and then has to
>> debug why the same configuration works on some machines and not on others.
>
> That's a problem indeed, especially with guest migration. But I don't
> think we have any sane solution here for the bus master DMA.
>
>> To avoid replying in two different places, which of the solutions look
>> to me like something that half-works?  Pretty much all of them, because
>> in the end it is just a processor misfeature.  For example, Intel
>> virtualization extensions let the hypervisor override stage1 translation
>> _if necessary_.  AMD doesn't, but has some other quirky things that let
>> you achieve the same effect..
>
> ARM can override them as well but only making them stricter. Otherwise,
> on a weakly ordered architecture, it's not always safe (let's say the
> guest thinks it accesses Strongly Ordered memory and avoids barriers for
> flag updates but the host "upgrades" it to Cacheable which breaks the
> memory order).
>
> If we want the host to enforce guest memory mapping attributes via stage
> 2, we could do it the other way around: get the guests to always assume
> full cache coherency, generating Normal Cacheable mappings, but use the
> stage 2 attributes restriction in the host to make such mappings
> non-cacheable when needed (it works this way on ARM but not in the other
> direction to relax the attributes).
>

This was precisely the idea of the MAIR mangling patch: promote all
stage1 mappings to cacheable, and put the host in control by allowing
it to supersede them with device mappings in stage 2.
But it appears there are too many corner cases where this doesn't
quite work out.

>> In particular, I am not even sure that this is about bus coherency,
>> because this problem does not happen when the device is doing bus master
>> DMA.  Working around coherency for bus master DMA would be easy.
>
> My previous emails on the "dma-coherent" property were only about bus
> master DMA (which would cause the correct selection of the DMA API ops
> in the guest).
>
> But even for bus master DMA, guest OS still needs to be aware of the
> (virtual) device DMA capabilities (cache coherent or not). You may be
> able to work around it in the host (stage 2, explicit cache flushing or
> SMMU attributes) if the guests assumes non-coherency but it's not really
> efficient (nor nice to implement).
>
>> The problem arises with MMIO areas that the guest can reasonably expect
>> to be uncacheable, but that are optimized by the host so that they end
>> up backed by cacheable RAM.  It's perfectly reasonable that the same
>> device needs cacheable mapping with one userspace, and works with
>> uncacheable mapping with another userspace that doesn't optimize the
>> MMIO area to RAM.
>
> Unless the guest allocates the framebuffer itself (e.g.
> dma_alloc_coherent), we can't control the cacheability via
> "dma-coherent" properties as it refers to bus master DMA.
>
> So for MMIO with the buffer allocated by the host (Qemu), the only
> solution I see on ARM is for the host to ensure coherency, either via
> explicit cache maintenance (new KVM API) or by changing the memory
> attributes used by Qemu to access such virtual MMIO.
>
> Basically Qemu is acting as a bus master when reading the framebuffer it
> allocated but the guest considers it a slave access and we don't have a
> way to tell the guest that such accesses should be cacheable, nor can we
> upgrade them via architecture features.
>
>> Currently the VGA framebuffer is the main case where this happen, and I
>> don't expect many more.  Because this is not bus master DMA, it's hard
>> to find a QEMU API that can be hooked to invalidate the cache.  QEMU is
>> just reading from an array of chars.
>
> I now understand the problem better. I was under the impression that the
> guest allocates the framebuffer itself and tells Qemu where it is (like
> in amba-clcd.c for example).
>
>> In practice, the VGA framebuffer has an optimization that uses dirty
>> page tracking, so we could piggyback on the ioctls that return which
>> pages are dirty.  It turns out that piggybacking on those ioctls also
>> should fix the case of migrating a guest while the MMU is disabled.
>
> Yes, Qemu would need to invalidate the cache before reading a dirty
> framebuffer page.
>
> As I said above, an API that allows non-cacheable mappings for the VGA
> framebuffer in Qemu would also solve the problem. I'm not sure what KVM
> provides here (or whether we can add such API).
>
>> We could use _DSD to export the device tree property separately for each
>> device, but that wouldn't work for hotplugged devices.
>
> This would only work for bus master DMA, so it doesn't solve the VGA
> framebuffer issue.
>
> --
> Catalin