[PATCH RFCv1 04/14] iommufd: Add struct iommufd_viommu and iommufd_viommu_ops

Wed May 22 06:39:05 PDT 2024

On Wed, May 22, 2024 at 08:58:34AM +0000, Tian, Kevin wrote:
> > From: Jason Gunthorpe <jgg at nvidia.com>
> > Sent: Tuesday, May 14, 2024 11:56 PM
> > 
> > On Sun, May 12, 2024 at 08:34:02PM -0700, Nicolin Chen wrote:
> > > On Sun, May 12, 2024 at 11:03:53AM -0300, Jason Gunthorpe wrote:
> > > > On Fri, Apr 12, 2024 at 08:47:01PM -0700, Nicolin Chen wrote:
> > > > > Add a new iommufd_viommu core structure to represent a vIOMMU
> > instance in
> > > > > the user space, typically backed by a HW-accelerated feature of an
> > IOMMU,
> > > > > e.g. NVIDIA CMDQ-Virtualization (an ARM SMMUv3 extension) and
> > AMD Hardware
> > > > > Accelerated Virtualized IOMMU (vIOMMU).
> > > >
> > > > I expect this will also be the only way to pass in an associated KVM,
> > > > userspace would supply the kvm when creating the viommu.
> > > >
> > > > The tricky bit of this flow is how to manage the S2. It is necessary
> > > > that the S2 be linked to the viommu:
> > > >
> > > >  1) ARM BTM requires the VMID to be shared with KVM
> > > >  2) AMD and others need the S2 translation because some of the HW
> > > >     acceleration is done inside the guest address space
> > > >
> > > > I haven't looked closely at AMD but presumably the VIOMMU create will
> > > > have to install the S2 into a DID or something?
> > > >
> > > > So we need the S2 to exist before the VIOMMU is created, but the
> > > > drivers are going to need some more fixing before that will fully
> > > > work.
> 
> Can you elaborate on this point? VIOMMU is a dummy container when
> it's created and the association to S2 comes relevant only until when
> VQUEUE is created inside and linked to a device? 

VIOMMU contains:
 - A nesting parent
 - A KVM
 - Any global per-VM data the driver needs
   * In ARM case this is VMID, sometimes shared with KVM
   * In AMD case this is will allocate memory in the
     "viommu backing storage memory"

Objects can be created on top of a VIOMMU:
 - A nested HWPT (iommu_hwpt_alloc::pt_id can be a viommu)
 - A vqueue (ARM/AMD)
 - Other AMD virtualized objects (EventLog, PPRLog)

It is desirable to keep the VIOMMU linked to only a single nesting
parent that never changes. Given it seems to be a small ask to
allocate the nesting parent before the VIOMMU providing it at VIOMMU
creation time looks like it will simplify the drivers because they can
rely on it always existing and never changing.

I think this lends itself to a logical layered VMM design..

 - If VFIO is being used get an iommufd
 - Allocate an IOAS for the entire guest physical
 - Determine the vIOMMU driver to use
 - Allocate a HWPT for use by all the vIOMMU instances
 - Allocate a VIOMMU per vIOMMU instance

On ARM the S2 is not divorced from the VIOMMU, ARM requires a single
VMID, shared with KVM, and localized to a single VM for some of the
bypass features (vBTM, vCMDQ). So to attach a S2 you actually have to
attach the VIOMMU to pick up the correct VMID.

I imagine something like this:
   hwpt_alloc(deva, nesting_parent=true) = shared_s2
   viommu_alloc(deva, shared_s2) = viommu1
   viommu_alloc(devb, shared_s2) = viommu2
   hwpt_alloc(deva, viommu1, vste) = deva_vste
   hwpt_alloc(devb, viommu2, vste) = devb_vste
   attach(deva, deva_vste)
   attach(devb, devb_vste)
   attach(devc, shared_s2)

The driver will then know it should program three different VMIDs for
the same S2 page table, which matches the ARM expectation for
VMID. That is to say we'd pass in the viommu as the pt_id for the
iommu_hwpt_alloc. The viommu would imply both the S2 page table and
any meta information like VMID the driver needs.

Both AMD and the vCMDQ thing need to translate some PFNs through the
S2 and program them elsewhere, this is manually done by SW, and there
are three choices I guess:
 - Have the VMM do it and provide  void __user * to the driver
 - Have the driver do it through the S2 directly and track
   S2 invalidations
 - Have the driver open an access on the IOAS and use the access unmap

Not sure which is the best..

> > Right, Intel currently doesn't need it, but I feel like everyone will
> > need this eventually as the fast invalidation path is quite important.
> 
> yes, there is no need but I don't see any harm of preparing for such
> extension on VT-d. Logically it's clearer, e.g. if we decide to move
> device TLB invalidation to a separate uAPI then vIOMMU is certainly
> a clearer object to carry it. and hardware extensions really looks like
> optimization on software implementations.
> 
> and we do need make a decision now, given if we make vIOMMU as
> a generic object for all vendors it may have potential impact on
> the user page fault support which Baolu is working on.

> the so-called
> fault object will be contained in vIOMMU, which is software managed
> on VT-d/SMMU but passed through on AMD. 

Hmm, given we currently have no known hardware entanglement between
PRI and VIOMMU it does seem OK for PRI to just exist seperate for
now. If someone needs them linked someday we can add a viommu_id to
the create pri queue command.

> And probably we don't need another handle mechanism in the attach
> path, suppose the vIOMMU object already contains necessary
> information to find out iommufd_object for a reported fault.

The viommu might be useful to have the kernel return the vRID instead
of the dev_id in the fault messages. I'm not sure how valuable this
is..

Jason