[PATCH v1 14/14] iommu/arm-smmu-v3: Add arm_smmu_cache_invalidate_user

Nicolin Chen nicolinc at nvidia.com
Wed Mar 22 12:21:27 PDT 2023


On Wed, Mar 22, 2023 at 02:28:38PM -0300, Jason Gunthorpe wrote:
> On Wed, Mar 22, 2023 at 10:11:33AM -0700, Nicolin Chen wrote:
> 
> > > Yes, there are a few different ways to handle this and still preserve
> > > batching. It is part of the reason it would be hard to make the kernel
> > > natively parse the commandq
> > 
> > Yea. I think the way I described above might be the cleanest,
> > since the host kernel would only handle all the leftover TLBI
> > commands? I am open to other, better ideas, if there are any.
> 
> It seems best to have userspace take a first pass over the cmdq and
> then send what it didn't handle to the kernel

Yes. I can go ahead with this approach for v2.
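
To make the split concrete, here is a rough sketch of what such a first pass could look like on the userspace side. It is only an illustration under stated assumptions: struct smmu_cmd, first_pass() and the OP_* names are hypothetical, the opcode values are illustrative and should be checked against the SMMUv3 spec, and a real VMM would actually process the non-TLBI commands rather than just count them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative opcode values; verify against the SMMUv3 spec before use. */
enum {
    OP_CFGI_STE     = 0x03,
    OP_TLBI_NH_ASID = 0x11,
    OP_TLBI_NH_VA   = 0x12,
    OP_CMD_SYNC     = 0x46,
};

/* Hypothetical 16-byte command layout: opcode in bits [7:0] of data[0]. */
struct smmu_cmd {
    uint64_t data[2];
};

static unsigned int cmd_opcode(const struct smmu_cmd *cmd)
{
    return (unsigned int)(cmd->data[0] & 0xff);
}

static int is_tlbi(const struct smmu_cmd *cmd)
{
    unsigned int op = cmd_opcode(cmd);

    return op == OP_TLBI_NH_ASID || op == OP_TLBI_NH_VA;
}

/*
 * Walk the guest commands once. TLBI commands are copied, in order, into
 * "batch" so that they can be sent to the kernel in one call. Everything
 * else is counted as handled in userspace (a real VMM would emulate it).
 * Returns the number of commands placed in the batch.
 */
static size_t first_pass(const struct smmu_cmd *cmds, size_t n,
                         struct smmu_cmd *batch, size_t batch_cap,
                         size_t *handled_in_user)
{
    size_t nbatch = 0;

    *handled_in_user = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_tlbi(&cmds[i])) {
            assert(nbatch < batch_cap);
            batch[nbatch++] = cmds[i];
        } else {
            (*handled_in_user)++;
        }
    }
    return nbatch;
}
```

The point is only that batching survives: the leftover TLBI commands still reach the kernel as one contiguous array.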

> > > On the other hand, we could add some more native kernel support for a
> > > SW emulated vCMDQ and that might be interesting for performance.
> > 
> > That's something I have thought about too. But it would feel
> > like changing the "hardware" of the VM, right? If the host
> > kernel enables nesting, then we'd have this extra queue for
> > TLBI commands. From the driver's perspective, it would feel
> > like detecting an extra feature bit in the HW register, but
> > there's no such bit in the SMMU HW spec :)
> 
> You'd trigger it the same way vCMDQ triggers. It is basically SW
> emulated vCMDQ.

It still feels like something very big. Off the top of my head,
we'd need a pair of new emulated registers for consumer and
producer indexes, and perhaps some configuration registers
too. How should we put them into the MMIO space? Maybe we could
emulate that via ECMDQ? So, for QEMU, the SMMU device model
always has the ECMDQ feature so we can have this extra MMIO
space for a separate CMDQ.
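
For reference, a minimal sketch of the producer/consumer register pair such an emulated queue would need, following the usual SMMUv3 convention of an index plus one wrap bit. The names (struct vq_regs, vq_inc(), vq_pending() and so on) are hypothetical, and nothing here says where the registers would live in the MMIO space.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register pair for a queue of (1 << shift) entries. */
struct vq_regs {
    uint32_t prod;      /* written by the guest: index + wrap bit */
    uint32_t cons;      /* written by the emulation: index + wrap bit */
    unsigned int shift; /* log2 of the number of entries */
};

/* Mask covering the index bits and the single wrap bit above them. */
static uint32_t vq_mask(const struct vq_regs *q)
{
    assert(q->shift < 31);
    return (1u << (q->shift + 1)) - 1;
}

/* Advance a PROD or CONS value by one entry, toggling the wrap bit on rollover. */
static uint32_t vq_inc(const struct vq_regs *q, uint32_t reg)
{
    return (reg + 1) & vq_mask(q);
}

/* Number of commands the guest has queued that are not yet consumed. */
static uint32_t vq_pending(const struct vq_regs *q)
{
    return (q->prod - q->cons) & vq_mask(q);
}

/* Empty: index and wrap bit both match. */
static int vq_empty(const struct vq_regs *q)
{
    return vq_pending(q) == 0;
}

/* Full: indexes match but the wrap bits differ. */
static int vq_full(const struct vq_regs *q)
{
    return vq_pending(q) == (1u << q->shift);
}
```

The wrap bit is what lets the emulation tell a full queue from an empty one without a separate counter.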

> > Yet, would you please elaborate how it impacts performance?
> > I can only see the benefit of isolation, from having a SW
> > emulated VCMDQ exclusively for TLBI commands vs. having a
> > single CMDQ interlacing different commands, because both of
> > them require trapping and some sort of dispatching.
> 
> In theory we could make it work like virtio-iommu where the
> doorbell ring for the SW emulated vCMDQ is delivered directly to a
> kernel thread and chop a bunch of latency out of it.

With a SW emulated VCMDQ, the dispatching is moved to the
guest kernel, vs. the hypervisor. I still don't see a big
improvement here. Perhaps we should run a benchmark with
some experimental changes.

> The issue is latency to complete invalidation as in a vSVA scenario
> the virtual process MM will block on IOMMU invalidation whenever it
> does any mm_struct maintenance. I.e. you slow a vast set of
> operations. The less latency the better.

Yea. If it has a noticeable perf gain, we should do that.

Do you prefer this to happen with this series? I would think
of adding this at a later stage, although I am not sure if
the uAPI would be completely compatible. It seems to me that
we would need a different uAPI, so as to set up a queue at an
earlier stage, and then to ring a bell when QEMU traps any
incoming commands in the emulated VCMDQ.

> > Btw, just to confirm my understanding, a use case having two
> > or more iommu_domains means an S2 iommu_domain replacement,
> > right? I.e. a running S2 iommu_domain gets replaced on the fly
> > by a different S2 iommu_domain holding a different VMID, while
> > the IOAS still has the previous mappings? When would that
> > actually happen in the real world?
> 
> It doesn't have to be replace - what is needed is that every vPCI
> device connected to the same SMMU instance be using the same S2 and
> thus the same VM_ID.
> 
> IOW every SID must be linked to the same VM_ID or invalidation commands
> will not be properly processed.
> 
> qemu would have to have multiple SMMU instances according to S2
> domains, which is probably true anyhow since we need to know what
> physical SMMU instance to deliver the invalidation to anyhow.

I am not 100% following this part. So, you mean that we're
safe if we only have one SMMU instance, because there'd be
only one S2 domain, while multiple S2 domains would happen
if we have multiple SMMU instances?

Can we still use the same S2 domain for multiple instances?
Our approach of setting up a stage-2 mapping in QEMU is to
map the entire guest memory. I don't see a point in having
a separate S2 domain, even if there are multiple instances?
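
To spell out the constraint as I read it (every SID behind one SMMU instance must resolve to the same VMID), a check could look roughly like the sketch below. The struct and function names are hypothetical and do not come from this series or from QEMU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one device attachment. */
struct sid_binding {
    uint32_t sid;               /* stream ID of the vPCI device */
    unsigned int smmu_instance; /* which SMMU instance it sits behind */
    uint16_t vmid;              /* VMID of the S2 domain it is attached to */
};

/*
 * Returns 1 if all SIDs behind "instance" share a single VMID (or if the
 * instance has no SIDs at all), and 0 as soon as two different VMIDs show up.
 */
static int instance_vmid_consistent(const struct sid_binding *b, size_t n,
                                    unsigned int instance)
{
    int seen = 0;
    uint16_t vmid = 0;

    assert(b != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (b[i].smmu_instance != instance)
            continue;
        if (!seen) {
            vmid = b[i].vmid;
            seen = 1;
        } else if (b[i].vmid != vmid) {
            return 0;
        }
    }
    return 1;
}
```

With a single S2 domain shared by all instances, this check passes trivially for every instance, which is the case I am asking about above.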

Btw, from a private discussion with Eric, he expressed the
difficulty of adding multiple SMMU instances in QEMU, as it
would complicate the device and ACPI components. For VCMDQ,
we do need a multi-instance environment, because there are
multiple physical pairs of SMMU+VCMDQ, i.e. multiple VCMDQ
MMIO regions being attached/used by different devices. So,
I have been exploring a different approach by creating an
internal multiplication inside VCMDQ...

Thanks
Nic



More information about the linux-arm-kernel mailing list