[PATCH v2 0/5] Convert riscv to use the generic iommu page table

Robin Murphy robin.murphy at arm.com
Mon Feb 2 08:43:48 PST 2026


On 2026-02-02 2:37 pm, Jason Gunthorpe wrote:
> On Mon, Feb 02, 2026 at 02:00:07PM +0000, Robin Murphy wrote:
> 
>>> DMA-FQ requires two functionalities from the page table:
>>> 1) use gather->freelist to avoid a HW UAF (iommupt always does this)
>>
>> Nope, correct DMA API usage would almost never unmap an entire table, so
>> synchronous non-leaf maintenance in that path still doesn't hurt DMA-FQ
>> either (e.g. io-pgtable-arm).
> 
> Well, it certainly would hurt workloads like IB MRs, which can have
> quite a lot of IOVA in a single dma_map_sg(), and we do want to see the
> table levels removed to avoid the waste that Pasha has talked
> about. Doing single invalidations of potentially a lot of levels in a
> DMA-FQ environment is unnecessary overhead.
> 
> But I get your point that simple (say, storage) use of the DMA API
> wouldn't be bothered by this, and you could still get a lot of benefit
> without using the free list.

Yeah, users dealing with giant non-physically-contiguous scatterlists
are the exception (hence "almost"), but such big things are already
taking the slow path for IOVA allocation/freeing, and they're presumably
not churning at high frequency, so would stand to see a lot less benefit
from flush queues in the first place. If anything, having big lumps of
IOVA space (and pagetable memory) tied up in the queues could even make
matters worse overall.
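
(For reference, the cutoff I have in mind is the rcache limit in the
IOVA allocator - the sketch below is a simplified paraphrase from
memory rather than the exact upstream code, and the function name is
made up for illustration:)

/*
 * Roughly what alloc_iova_fast() in drivers/iommu/iova.c does: only
 * ranges below IOVA_RANGE_CACHE_MAX_SIZE orders are eligible for the
 * per-CPU caches, so the giant mappings discussed above bypass them
 * and hit the rbtree slow path on every allocation and free.
 */
static unsigned long sketch_alloc_iova_fast(struct iova_domain *iovad,
					    unsigned long size,
					    unsigned long limit_pfn)
{
	struct iova *new_iova;
	unsigned long iova_pfn;

	/* Small sizes are rounded up so they can be cached and reused */
	if (size < (1 << (IOVA_RANGE_CACHE_MAX_SIZE - 1)))
		size = roundup_pow_of_two(size);

	iova_pfn = iova_rcache_get(iovad, size, limit_pfn + 1);
	if (iova_pfn)
		return iova_pfn;

	/* Anything bigger than the cacheable sizes always ends up here */
	new_iova = alloc_iova(iovad, size, limit_pfn, true);
	return new_iova ? new_iova->pfn_lo : 0;
}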

>> If a pagetable implementation wanted to refcount and eagerly free empty
>> tables upon leaf unmaps, then yes it would need deferred freeing, but
>> frankly it would be better off just not doing that at all for DMA-FQ anyway
>> (as IOVA caching would make it likely to need to repopulate the same level
>> of table soon.)
> 
> Today it isn't done with refcounts; rather, if the unmapped IOVA range
> fully contains a table level then that table level can go away too. It
> does trim interior page tables for large IOVA allocations, but small
> ones are unlikely to free anything.

Right, and other than the non-contiguous scatterlist case, for anything
where a dma_unmap_*() might take out a table-sized region at once, the
corresponding dma_map_*() would have put it down as a block anyway.
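
(To illustrate - a rough paraphrase of the page size selection done by
__iommu_map(), from memory rather than the exact upstream code, with
the helper name invented for this sketch:)

/*
 * For an aligned, physically contiguous 2MiB dma_map_page() on a
 * 4K-granule table, the largest eligible size comes out as SZ_2M, so
 * the region is installed as a single block PTE rather than a child
 * table full of 4K entries - and its unmap then clears one leaf entry
 * with no table to free.
 */
static size_t sketch_pick_pgsize(unsigned long pgsize_bitmap,
				 unsigned long iova, phys_addr_t paddr,
				 size_t size)
{
	unsigned long addr_merge = iova | paddr;
	/* Hardware-supported sizes no larger than the request... */
	unsigned long sizes = pgsize_bitmap & GENMASK(__fls(size), 0);

	/* ...and no larger than the alignment of the addresses */
	if (addr_merge)
		sizes &= GENMASK(__ffs(addr_merge), 0);

	if (WARN_ON(!sizes))
		return 0;

	/* Largest remaining size wins */
	return BIT(__fls(sizes));
}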

>>> The one call to iommu_iotlb_sync() is only for the para-virtualization
>>> optimization of narrowing invalidations. It would be nonsensical for a
>>> driver to enable this optimization and offer IOMMU_CAP_DEFERRED_FLUSH.
>>
>> Not necessarily - in the PV case it can be desirable to minimise
>> over-invalidation *if* you're trapping for targeted invalidations in strict
>> mode. However, depending on the usage pattern it may also be beneficial to
>> have non-strict let the FQ mechanism batch up work to minimise the number of
>> traps taken - e.g. s390 is in this situation, and is precisely why we added
>> IOMMU_DMA_OPTS_SINGLE_QUEUE to help optimise for that.
> 
> Okay, so if I understand you right, it should check for
> iommu_iotlb_gather_queued() and disable PT_FEAT_FLUSH_RANGE_NO_GAPS
> mode entirely? I.e. there is no point in doing small invalidations if
> the caller is going to do a flush-all?
> 
> This way the user gets to pick between DMA-FQ and DMA-strict?

Indeed, and furthermore we permit relaxing from DMA to DMA-FQ on a
live domain, so although a virtualisation-aware driver may use
PT_FEAT_FLUSH_RANGE_NO_GAPS and also call iommu_set_dma_strict() by
default, that doesn't mean IOMMU_CAP_DEFERRED_FLUSH can't still be
brought into play later. So I guess it should probably be something like
the below (except that the other fix just sent breaks the if/else logic,
ho hum...)

> Also Intel would probably benefit from .shadow_on_flush too?

I think it mostly depends on how the vIOMMU is implemented. It did seem
potentially mildly beneficial to virtio at the time, but I'm not sure if
anyone's ever tried it for Intel/AMD.

Thanks,
Robin.

----->8-----
diff --git a/drivers/iommu/generic_pt/iommu_pt.h b/drivers/iommu/generic_pt/iommu_pt.h
index 3327116a441c..b5cc3094f543 100644
--- a/drivers/iommu/generic_pt/iommu_pt.h
+++ b/drivers/iommu/generic_pt/iommu_pt.h
@@ -51,7 +51,9 @@ static void gather_range_pages(struct iommu_iotlb_gather *iotlb_gather,
 		iommu_pages_stop_incoherent_list(free_list,
 						 iommu_table->iommu_device);
 
-	if (pt_feature(common, PT_FEAT_FLUSH_RANGE_NO_GAPS) &&
+	if (iommu_iotlb_gather_queued(iotlb_gather)) {
+		/* No need to bother, FQ will take care of TLBs */
+	} else if (pt_feature(common, PT_FEAT_FLUSH_RANGE_NO_GAPS) &&
 	    iommu_iotlb_gather_is_disjoint(iotlb_gather, iova, len)) {
 		iommu_iotlb_sync(&iommu_table->domain, iotlb_gather);
 		/*
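
For completeness, the caller side that feeds iommu_iotlb_gather_queued()
looks roughly like the below - a paraphrase of dma-iommu's unmap path
rather than the exact code, with the function name invented for the
sketch. The flush-queue decision is made per-domain and recorded in the
gather, which is what the new check above reads back:

/* Sketch along the lines of __iommu_dma_unmap() in dma-iommu.c */
static void sketch_dma_iommu_unmap(struct iommu_domain *domain,
				   struct iommu_dma_cookie *cookie,
				   dma_addr_t dma_addr, size_t size)
{
	struct iommu_iotlb_gather iotlb_gather;
	size_t unmapped;

	iommu_iotlb_gather_init(&iotlb_gather);
	/* DMA-FQ domains mark the gather as queued up front... */
	iotlb_gather.queued = READ_ONCE(cookie->fq_domain);

	unmapped = iommu_unmap_fast(domain, dma_addr, size, &iotlb_gather);
	WARN_ON(unmapped != size);

	/* ...so only strict domains sync here; FQ defers to the queue */
	if (!iotlb_gather.queued)
		iommu_iotlb_sync(domain, &iotlb_gather);
}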



