[RFC] ARM vGIC-ITS tables serialization when running protected VMs

Marc Zyngier maz at kernel.org
Sun Apr 27 09:37:08 PDT 2025


On Tue, 15 Apr 2025 10:44:39 +0100,
David Woodhouse <dwmw2 at infradead.org> wrote:
> 
> On Tue, 2025-04-15 at 09:35 +0100, Marc Zyngier wrote:
> > On Mon, 14 Apr 2025 12:12:43 +0100,
> > Ilias Stamatis <ilstam at amazon.com> wrote:
> > > 
> > > # The problem
> > > 
> > > KVM's ARM Virtual Interrupt Translation Service (ITS) interface supports the
> > > KVM_DEV_ARM_ITS_SAVE_TABLES and KVM_DEV_ARM_ITS_RESTORE_TABLES operations.
> > > These operations save and restore a set of tables (Device Tables, Interrupt
> > > Translation Tables, Collection Table) to and from guest memory.
> > > 
> > > This can be a problem when running a protected VM on top of pKVM or another
> > > lowvisor since the host kernel (running at EL1) cannot access guest memory.
> > > 
> > 
> > pKVM doesn't allow a guest to be saved/restored, full stop.
> 
> Yet. Either it's going to need to learn to support live update, or
> it'll remain a toy solution.

Toy solution to what problem?

> 
> > > # Page declassification and why ITTs are special
> > > 
> > > The Collection and Device tables are page aligned and their sizes must be a
> > > multiple of page size. If the lowvisor knows where these tables live, it is
> > > possible to "declassify" the corresponding pages and configure the MMU such
> > > that the EL1 host can write to guest memory directly.
> > > 
> > > The ITTs (Interrupt Translation Tables) are different. They are NOT page
> > > aligned, they are 256-byte aligned, and their size is variable. That means that
> > > the lowvisor can't declassify pages containing ITTs and configure the MMU
> > > giving the host direct access as above since those pages may contain unrelated
> > > data.
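The alignment point above can be made concrete. A minimal sketch (my own illustration, not code from KVM or pKVM; the function name is hypothetical): an ITT only has to be 256-byte aligned, so declassifying the pages it touches is only safe when the table happens to cover whole pages exactly, otherwise the remainder of the page may hold unrelated guest data.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE	4096UL
#define PAGE_MASK	(~(PAGE_SIZE - 1))

/*
 * Hypothetical helper: an ITT can be handed over to the untrusted
 * host as whole pages only if both its base and size are page
 * multiples; a merely 256-byte-aligned table fails this check.
 */
static bool itt_safely_declassifiable(uint64_t base, uint64_t size)
{
	return !(base & ~PAGE_MASK) && !(size & ~PAGE_MASK);
}
```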
> > 
> > And it is the responsibility of the guest to make these page aligned
> > if it intends to let the hypervisor use them. To sum it up, the ITT
> > isn't special at all.
> 
> The ITT has nothing to do with virtualization, does it? And despite
> this being logically "DMA", I don't believe it's possible to advertise
> it as being behind the SMMU, which would have allowed for access
> control (and would indeed have meant that the guest would be expected
> to grant access to full pages).
> 
> What exactly are you suggesting? That the GIC specification should be
> changed to require page alignment, or to document that in a
> confidential compute setup, the remainder of any page which contains
> ITTs will be implicitly made non-confidential and shared with the
> hypervisor?

The GIC architecture is of course absolutely perfect, and there is
nothing to change there.

I'm suggesting you do what any OS designed to run under a confidential
infrastructure does: expose page-sized data to the non-trusted
infrastructure. Linux has done that for a while (as part
of both CCA and pKVM enablement), and I don't see why your toy guests
can't do the same. It's not like using a page pool for ITT allocations
is rocket science, is it?
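The page-pool idea suggested above can be sketched in a few lines. This is my own guest-side illustration under stated assumptions (the `share_with_hypervisor` call is a hypothetical placeholder for whatever sharing hypercall the platform provides): every ITT allocation is rounded up to whole, page-aligned pages, so no page ever mixes ITT data with private guest data.

```c
#include <stdint.h>
#include <stdlib.h>

#define PAGE_SIZE 4096UL

/*
 * Hypothetical guest-side ITT allocator: back each ITT with whole
 * pages that the guest has already shared with the untrusted
 * hypervisor. Rounding up wastes at most a page per ITT but keeps
 * every shared page free of unrelated data.
 */
static void *itt_alloc(size_t size)
{
	size_t aligned = (size + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1);
	void *p = aligned_alloc(PAGE_SIZE, aligned);

	/* Real guest would do: share_with_hypervisor(p, aligned); */
	return p;
}
```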

> And then the lowvisor would also have to snoop the ITS command queues
> to even find out which pages to implicitly allow access to?

Why should it? As long as you only expose pages that contain only
GIC-related data, you should be safe.

However, if your hypervisor doesn't fully validate the interaction of
the *host* with the HW, then you're dead in the water.

> > > If the lowvisor knows where the ITTs live in guest memory it could instead
> > > perform the guest memory accesses on behalf of the host. I.e. the EL1 host
> > > would attempt to save the ITTs to guest memory like it does today, that would
> > > generate a data abort, and then the EL2 lowvisor could perform the copy after
> > > validating that the faulting address belongs to an ITT in guest memory.
> > > 
> > > One issue with the above is that the ITS save/restore happens at hypervisor
> > > live update which is a time sensitive operation and the extra traps (one per
> > > interrupt mapping?) can introduce significant additional overhead
> > > there.
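The validation step in the trap-based scheme above might look like the following. This is a sketch of my own, not pKVM code; the `itt_region` structure and its population are assumed to exist somehow, which is exactly the hard part the proposal points out. The per-access lookup also hints at the overhead concern: one trap plus one walk per copied chunk.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of an ITT the lowvisor has learned about. */
struct itt_region {
	uint64_t base;	/* IPA of the ITT */
	uint64_t size;	/* size in bytes */
};

/*
 * Before performing a copy on behalf of the faulting EL1 host, the
 * lowvisor checks that the accessed range lies entirely inside a
 * known ITT; anything else in the page stays off-limits.
 */
static bool access_is_valid(const struct itt_region *regions, size_t n,
			    uint64_t ipa, uint64_t len)
{
	for (size_t i = 0; i < n; i++) {
		if (ipa >= regions[i].base &&
		    ipa + len <= regions[i].base + regions[i].size)
			return true;
	}
	return false;
}
```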
> > 
> > I don't believe this for a second.
> 
> You don't believe that every millisecond of live update downtime,
> perceived by the guest as unwanted steal time of a hypervisor that's
> generally trying to be as quiescent as possible, is an issue?

I absolutely don't. Certainly not for something that has no tangible
existence, with no performance numbers whatsoever, and based on shaky
premises.

> > > Another issue is that it's actually hard for the lowvisor to know where these
> > > tables live without trusting the EL1 host which virtualizes the ITS. It is
> > > especially hard knowing the locations of the ITTs (compared to
> > > Collection/Device tables) because that probably means having to parse the ITS
> > > command queue from EL2 which is complex and undesirable.
> > > 
> > > # An alternative: Serializing ITTs into a userspace buffer
> > 
> > NAK.
> > 
> > Share the page-aligned memory with the rest of the hypervisor, and use
> > the existing API.
> 
> That seems like a bad choice. All this is just using guest memory to
> store KVM's state.

The architecture *mandates* the memory allocation. KVM uses this
memory for the purpose described in the architecture. If you don't
like it, invent your own interrupt architecture. Trust me, it's real
fun!

> Yes, the guest provides a buffer which the virtual hardware *may* use
> if it wants, but with no IOMMU or access control defined in the
> specification.
> 
> It seems like it would be much cleaner just to let KVM pass its state
> up to userspace for serialization like we do for all *other* KVM state,
> which is what Ilias is proposing.

Sure. You could also decide that SMMU page tables should be extracted
separately, because that's the exact same rationale. You could also
build your own hypervisor instead of inventing new ways to make the
KVM API even more of a terrible mess.

	M.

-- 
Without deviation from the norm, progress is not possible.



More information about the linux-arm-kernel mailing list