[PATCH v3 2/2] kvm: arm64: set io memory s2 pte as normalnc for vfio pci devices

Thu Dec 21 05:19:18 PST 2023

Catching up on emails before going on holiday (again).

On Thu, Dec 14, 2023 at 04:56:01PM +0000, Oliver Upton wrote:
> On Thu, Dec 14, 2023 at 04:48:15PM +0100, Lorenzo Pieralisi wrote:
> > > AFAICT, the only reason PCI devices can get the blanket treatment of
> > > Normal-NC at stage-2 is because userspace has a Device-* mapping and can't
> > > speculatively load from the alias. This feels a bit hacky, and maybe we
> > > should prioritize an interface for mapping a device into a VM w/o a
> > > valid userspace mapping.
> > 
> > FWIW - I have tried to summarize the reasoning behind PCIe devices
> > Normal-NC default stage-2 safety in a document that I have just realized
> > now it has become this series cover letter, I don't think the PCI blanket
> > treatment is related *only* to the current user space mappings (ie
> > BTW, AFAICS it is also *possible* at present to map a prefetchable BAR through
> > sysfs with Normal-NC memory attributes in the host at the same time a PCI
> > device is passed-through to a guest with VFIO - and therefore we have a
> > dev-nGnRnE stage-1 mapping for it. Don't think anyone does that - what for -
> > but it is possible and KVM would not know about it).
> > 
> > Again, FWIW, we were told (source Arm ARM) mismatched aliases concerning
> > device-XXX vs Normal-NC are not problematic as long as the transactions
> > issued for the related mappings are independent (and none of the
> > mappings is cacheable).
> > 
> > I appreciate this is not enough to give everyone full confidence on
> > this solution robustness - that's why I wrote that up so that we know
> > what we are up against and write KVM interfaces accordingly.
> 
> Apologies, I didn't mean to question what's going on here from the
> hardware POV. My concern was more from the kernel + user interfaces POV,
> this all seems to work (specifically for PCI) by maintaining an
> intentional mismatch between the VFIO stage-1 and KVM stage-2 mappings.

If you stare at it long enough, the mismatch starts to look fine ;).
Even if you have both the VFIO stage 1 and the KVM stage 2 as Normal NC,
the guest can still set its stage 1 to Device and introduce an
architectural mismatch. These aliases have a bad reputation but the
behaviour is constrained architecturally.
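
To make that concrete, here is a rough model (not kernel code) of how the
guest's effective attribute falls out of combining its stage 1 with the KVM
stage 2. It assumes no FWB, so the more restrictive type of the two stages
wins. The enum and helper names are made up for illustration.

```c
#include <stdbool.h>

/* Ordered from most to least restrictive; illustrative names only. */
enum mem_attr {
	ATTR_DEVICE = 0,
	ATTR_NORMAL_NC = 1,
	ATTR_NORMAL_WB = 2,
};

/* Without FWB, the combined type is the more restrictive of the two stages. */
static inline enum mem_attr combine_s1_s2(enum mem_attr s1, enum mem_attr s2)
{
	return s1 < s2 ? s1 : s2;
}

/*
 * True if the VMM alias (host stage 1) and the guest's effective mapping
 * (guest stage 1 combined with KVM stage 2) end up with different types.
 */
static inline bool alias_mismatch(enum mem_attr vmm_s1, enum mem_attr guest_s1,
				  enum mem_attr kvm_s2)
{
	return vmm_s1 != combine_s1_s2(guest_s1, kvm_s2);
}
```

With the VMM alias and KVM stage 2 both ATTR_NORMAL_NC, a guest stage 1 of
ATTR_DEVICE still makes alias_mismatch() return true, so matching the host
attributes alone cannot rule the mismatch out.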

IMHO we should move on from this attribute mismatch since we can't fully
solve it anyway and focus instead on what the device and system can
tolerate, and on who's responsible for deciding which MMIO ranges can be
mapped as Normal NC. There are a few options here (talking in the PCIe context
but it can be extended to other VFIO mappings):

1. The VMM is responsible for intra-BAR relaxation of the KVM stage 2:
   a) via the stage 1 VFIO mapping attributes - Device or Normal
   b) via other means (e.g. ioctl(<range>)) while the stage 1 VFIO stays
      Device

2. KVM decides the intra-BAR relaxation irrespective of the VFIO stage 1
   attributes (VMM mapping)

3. KVM decides the full-BAR relaxation with the guest responsible for
   the intra-BAR attributes. As with (2), that's irrespective of the
   VFIO stage 1 host mapping

Whichever option we pick, it won't be the host forcing the Normal NC
mapping; that's still a guest decision, with the host only allowing it.

(1) needs specific device knowledge in the VMM or a VFIO-specific driver
(or both if the VMM isn't fully trusted to request the right
attributes). (2) moves the device-specific knowledge to KVM or a
combination of KVM and VFIO-specific driver. Things can get a lot worse
if the Device vs Normal ranges within a BAR are configurable and need
some paravirtualised interface for the guest to agree with the host.

These patches aim for (3) but only if the host VFIO driver deems it safe
(hence PCIe only for now). I find this an acceptable compromise.
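
As a sketch of what (3) boils down to on the KVM side, assuming a
hypothetical per-mapping flag that the host VFIO driver sets when it
considers Normal NC safe (all names below are invented for illustration and
are not what the patches use):

```c
#include <stdbool.h>

/* Stage 2 memory types KVM could pick from; illustrative names only. */
enum s2_attr { S2_DEVICE_nGnRE, S2_NORMAL_NC, S2_NORMAL_WB };

/* Hypothetical description of the host mapping backing a memslot. */
struct mapping_info {
	bool is_mmio;		/* backed by device MMIO rather than RAM */
	bool vfio_allows_nc;	/* assumed: host VFIO driver deems Normal NC safe */
};

/*
 * Option (3): relax the whole range to Normal NC only when the host driver
 * allows it. The guest stage 1 then decides the intra-BAR attributes.
 */
static inline enum s2_attr pick_s2_attr(const struct mapping_info *m)
{
	if (!m->is_mmio)
		return S2_NORMAL_WB;
	return m->vfio_allows_nc ? S2_NORMAL_NC : S2_DEVICE_nGnRE;
}
```

Note that the VMM's own stage 1 attributes are not an input here, which is
the point of (2) and (3) as opposed to (1a).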

If we really want to avoid any aliases (though I think we are spending
too many cycles on something that's not a real issue), the only way is
to have fd-based mappings in KVM so that there's no VMM alias. After
that we need to choose between (2) and (3) since the VMM may no longer
be able to probe the device and figure out which ranges need what
attributes.

> If we add more behind-the-scenes tricks to get other MMIO mappings
> working in the future then this whole interaction will get even
> hairier. At least if we follow the stage-1 attributes (where possible)
> then we can document some sort of expected behavior in KVM. The VMM would
> need know if the device has read side-effects, as the only way to get a
> Normal-NC mapping in the guest would be to have one at stage-1.

I don't think KVM or the VMM should attempt to hand-hold the guest and
ensure that it maps an MMIO range with read side-effects appropriately. The
guest driver can do this by itself or get incorrect hw behaviour. Such
hand-holding is only needed if the speculative loads have wider system
implications but we concluded that it's not the case for PCIe. Even with
a Device mapping, the guest can always issue random reads from an
assigned MMIO range and cause side-effects.

> Kinda stinks to make the VMM aware of the device, but IMO it is a
> fundamental limitation of the way we back memslots right now.

As I mentioned above, the limitation may be more complex if the
intra-BAR attributes are not something readily available in the device
documentation. Maybe Jason or Ankit can shed some light here: are those
intra-BAR ranges configurable by the (guest) driver, or are they already
pre-configured by firmware so that the driver only needs to probe them?

Anyway, about to go on the Christmas break, so most likely I'll follow
up in January. Happy holidays!

-- 
Catalin