[PATCH v2 1/1] KVM: arm64: allow the VM to select DEVICE_* and NORMAL_NC for IO memory

Tue Dec 5 08:22:33 PST 2023

On Tue, Dec 05, 2023 at 09:05:17AM -0400, Jason Gunthorpe wrote:
> On Tue, Dec 05, 2023 at 11:40:47AM +0000, Catalin Marinas wrote:
> > > - Will had unanswered questions in another part of the thread:
> > > 
> > >   https://lore.kernel.org/all/20231013092954.GB13524@willie-the-truck/
> > > 
> > >   Can someone please help concluding it?
> > 
> > Is this about reclaiming the device? I think we concluded that we can't
> > generalise this beyond PCIe, though not sure there was any formal
> > statement to that thread. The other point Will had was around stating
> > in the commit message why we only relax this to Normal NC. I haven't
> > checked the commit message yet, it needs careful reading ;).
> 
> Not quite, we said reclaiming is VFIO's problem and if VFIO can't
> reliably reclaim a device it shouldn't create it in the first place.
> 
> Again, I think a lot of this is trying to take VFIO problems into KVM.
> 
> VFIO devices should not exist if they pose a harm to the system. If
> VFIO decided to create the devices anyhow (eg admin override or
> something) then it is not KVM's job to do any further enforcement.

Yeah, I made this argument in the past. But it's a fair question to ask
since the Arm world is different from x86. Just reusing an existing
driver in a different context may break its expectations. Does a Normal
NC access complete by the time a TLBI (for Stage 2) and DSB (DVMsync)
have completed? It does reach some point of serialisation with
subsequent accesses to the same address, but I'm not sure how it is
ordered with an access to a different location like the config space
used for reset. Maybe it's not a problem at all, or it is safe only for
PCIe, but it would be good to get to the bottom of this.

Device-nGnRnE has some stronger rules around end-point completion and
that's what vfio-pci uses. KVM, however, went for the slightly more
relaxed nGnRE variant which, at least per the Arm ARM, doesn't have
these guarantees.
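
For reference, a minimal sketch of the stage 2 MemAttr[3:0] encodings
being compared here, assuming FEAT_S2FWB is not in use. The macro and
helper names are made up for illustration and are not the kernel's; the
helpers only restate the architectural properties discussed in this
thread (early write acknowledgement and unaligned access), not anything
about a particular interconnect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Stage 2 MemAttr[3:0] values (no FEAT_S2FWB). Names are illustrative. */
#define S2_MEMATTR_DEVICE_nGnRnE 0x0u /* no Gather, no Reorder, no Early ack */
#define S2_MEMATTR_DEVICE_nGnRE  0x1u /* as above, but Early ack permitted */
#define S2_MEMATTR_NORMAL_NC     0x5u /* Normal, Inner/Outer Non-cacheable */
#define S2_MEMATTR_NORMAL_WB     0xfu /* Normal, Inner/Outer Write-Back */

/* MemAttr[3:2] == 0b00 selects one of the Device types. */
static bool s2_memattr_is_device(uint8_t attr)
{
	assert(attr <= 0xf);
	return (attr >> 2) == 0;
}

/*
 * For Device types, MemAttr[1:0] == 0b00 is nGnRnE; every other Device
 * encoding permits an early write acknowledgement, i.e. the write may
 * be acknowledged before it reaches the end-point.
 */
static bool s2_memattr_device_early_ack(uint8_t attr)
{
	return s2_memattr_is_device(attr) && (attr & 0x3) != 0;
}

/* Unaligned accesses are only architecturally permitted to Normal memory. */
static bool s2_memattr_allows_unaligned(uint8_t attr)
{
	return !s2_memattr_is_device(attr);
}
```

With these definitions, s2_memattr_device_early_ack() is false only for
nGnRnE among the Device types, and s2_memattr_allows_unaligned() is true
for Normal NC, which is where the unaligned access question comes from.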

> Remember, the feedback we got from the CPU architects was that even
> DEVICE_* will experience an uncontained failure if the device triggers
> an error response in shipping ARM IP.
> 
> The reason PCIe is safe is because the PCI bridge does not generate
> errors in the first place!

That's an argument to restrict this feature to PCIe. It's really about
having fewer arguments over the behaviour of other devices. Marc did raise
another issue with the GIC VCPU interface (does this even have a vma in
the host VMM?). That's a class of devices where the mapping is
context-switched, so the TLBI+DSB rules don't help.

> Thus, the way a platform device can actually be safe is if it too
> never generates errors in the first place! Obviously this approach
> works just as well with NORMAL_NC.
> 
> If a platform device does generate errors then we shouldn't expect
> containment at all, and the memory type has no bearing on the
> safety. The correct answer is to block these platform devices from
> VFIO/KVM/etc because they can trigger uncontained failures.

Assuming the error containment is sorted, there are two other issues
with other types of devices:

1. Ordering guarantees on reclaim or context switch

2. Unaligned accesses

On (2), I think PCIe is fairly clear on how the TLPs are generated, so I
wouldn't expect additional errors here. But I have no idea what AMBA/AXI
does here in general. Perhaps it's fine, I don't think we looked into it
as the focus was mostly on PCIe.

So, I think it would be easier to get this patch upstream if we limit
the change to PCIe devices for now. We may relax this further in the
future. Do you actually have a need for non-PCIe devices to support WC
in the guest, or is it more about the complexity of the logic to detect
whether it's actually a PCIe BAR we are mapping into the guest? (I can
see some Arm GPU folk asking for this but those devices are not easily
virtualisable).

-- 
Catalin