[PATCH v8 07/15] iommupt: Add map_pages op
Alexey Kardashevskiy
aik at amd.com
Tue Jan 20 17:08:19 PST 2026
On 20/1/26 04:37, Jason Gunthorpe wrote:
> On Mon, Jan 19, 2026 at 12:00:47PM +1100, Alexey Kardashevskiy wrote:
>> On 18/1/26 02:43, Jason Gunthorpe wrote:
>>> On Sat, Jan 17, 2026 at 03:54:52PM +1100, Alexey Kardashevskiy wrote:
>>>
>>>> I am trying this with TEE-IO on AMD SEV and hitting problems.
>>>
>>> My understanding is that if you want to use SEV today you also have to
>>> use the kernel command line parameter to force 4k IOMMU pages?
>>
>> No, not only 4K. I do not enforce any page size by default so it is
>> "everything but 512G", only when the device is "accepted" - I unmap
>> everything in QEMU, "accept" the device, then map everything again
>> but this time IOMMU uses the (4K|2M) pagemask and takes RMP entry
>> sizes into account.
>
> I mean, I'm telling you how things work in upstream right now. If you
> want this to work you set the 4k only cmdline option and it
> works. None of what you are describing is upstream. Upstream does not
> support > 4K IOPTEs if RMP is used.
Ah, that. Well, even now, if you force swiotlb, the IOMMU should be able to use huge pages. But OK, point taken.
>>>> Now, from time to time the guest will share 4K pages which makes the
>>>> host OS smash NPT's 2MB PDEs to 4K PTEs, and 2M RMP entries to 4K
>>>> RMP entries, and since the IOMMU performs RMP checks - IOMMU PDEs
>>>> have to use the same granularity as NPT and RMP.
>>>
>>> IMHO this is a bad hardware choice, it is going to make some very
>>> troublesome software, so sigh.
>>
>> afaik the Other OS is still not using 2MB pages (or does but not much?) and runs on the same hw :)
>>
>> Sure we can force some rules in Linux to make the sw simpler though.
>
> I mean that the HW requires the sizes in multiple SW controlled
> tables to all match. Instead the HW should read all the tables and
> compute the appropriate smallest size automatically.
Not sure I follow. The IOMMU table matches the QEMU page table, so that is two tables already, and the IOMMU cannot just blindly use 2M PTEs if the guest is backed with 4K pages.
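To spell the constraint out: since the IOMMU cross-checks the RMP, an IOPTE cannot be larger than the smallest granularity among the tables covering the range. A minimal illustration, with a made-up helper name (not iommupt code):

```c
#include <assert.h>

#define SZ_4K (1UL << 12)
#define SZ_2M (1UL << 21)

/*
 * Illustrative only: the effective IOPTE size for a range is bounded
 * by both the guest backing/NPT granularity and the RMP entry size.
 */
static unsigned long iopte_size_for(unsigned long npt_size,
				    unsigned long rmp_size)
{
	return npt_size < rmp_size ? npt_size : rmp_size;
}
```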
>>> Doing it at mapping time doesn't seem right to me, AFAICT the RMP can
>>> change dynamically whenever the guest decides to change the
>>> private/shared status of memory?
>>
>> The guest requests page state conversion which makes KVM change RMPs
>> and potentially smash huge pages, the guest only (in)validates the
>> RMP entry but does not change ASID+GPA+otherbits, the host does. But
>> yeah a race is possible here.
>
> It is not even a "race", it is just something the VMM has to deal with
> whenever the RMP changes.
>
>>> My expectation for AMD was that the VMM would be monitoring the RMP
>>> granularity and use cut or "increase/decrease page size" through
>>> iommupt to adjust the S2 mapping so it works with these RMP
>>> limitations.
>>>
>>> Those don't fully exist yet, but they are in the plans.
>>
>> I remember the talks about hitless smashing, but in the case of RMPs an atomic xchg is not enough (we have a HW engine for that).
>
> I don't think you need hitless here, if the guest is doing
> encrypted/decrypted conversions then it can be expected to not do DMA
> at the same time, or at least it is OK if DMA during this period
> fails.
The guest converts only a handful of 4K pages (say, when guest userspace wants to read certificates via guest-os->host-os->fw), and only the converted part is not expected to see DMA; the rest of the 2MB page remains DMA-able.
> So long as the VMM gets a chance to fix the iommu before the guest
> understands the RMP change is completed it would be OK.
The IOMMU HW needs to understand the change too. After I smash an IO PDE, there is a small window, before the RMP entry is smashed, when incoming traffic may hit the not-yet-converted part of the 2MB page and the RMP check in the IOMMU will fail. The HW+FW engine mentioned above can stall DMA for a few ms while it is smashing things.
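A hypothetical sketch of the ordering (invented names, not a real API): if the 2M IO PDE were smashed before the matching RMP entry, in-flight traffic to the unconverted part of the 2MB page would fail the IOMMU's RMP check, so the engine stalls DMA around both updates:

```c
#include <assert.h>

/* Illustrative op log for the FW+HW smash engine described above. */
enum op { OP_STALL_DMA, OP_SMASH_IOPDE, OP_SMASH_RMP, OP_RESUME_DMA };

static int fw_smash_2m(enum op *log)
{
	int n = 0;

	log[n++] = OP_STALL_DMA;   /* no DMA observes the intermediate state */
	log[n++] = OP_SMASH_IOPDE; /* replace the 2M IOPDE with a 4K PTE table */
	log[n++] = OP_SMASH_RMP;   /* split the 2M RMP entry into 4K entries */
	log[n++] = OP_RESUME_DMA;
	return n;
}
```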
> I'm assuming there is a VMM call involved here?
Yes.
>>> It assumes that the VMM is continually aware of what all the RMP PTEs
>>> look like and when they are changing so it can make the required
>>> adjustments.
>>>
>>> The flow would be some thing like..
>>> 1) Create an IOAS
>>> 2) Create a HWPT. If there is some known upper bound on RMP/etc page
>>> size then limit the HWPT page size to the upper bound
>>> 3) Map stuff into the ioas
>>> 4) Build the RMP/etc and map ranges of page granularity
>>> 5) Call iommufd to adjust the page size within ranges
>>
>> Say, I hotplug a device into a VM with a mix of 4K and 2M RMPs. QEMU
>> will ask iommufd to map everything (and that would be 2M/1G), should
>> then QEMU ask KVM to walk through ranges and call iommufd directly
>> to make IO PDEs/PTEs match RMPs?
>
> Yes, assuming it isn't already tracking it on its own.
>
>> I mean, I have to do the KVM->iommufd part anyway when 2M->4K
>> smashing happens in runtime but the initial mapping could be simpler
>> if iommufd could check RMP.
>
> Yeah, but then we have to implement two completely different
> flows. You can't do without the above since you have to deal with
> dynamic changes to the RMP by the guest.
>
> Making it so map can happen right the first time is an
> optimization. Let's get the basics and then think about optimizing. I
> think optimizing hot plug is not important, nor do I know how good an
> optimization this would even be.
Got it.
>> For the time being I do bypass IOMMU and make KVM call another FW+HW DMA engine to smash IOPDEs.
>
> I don't even want to know what that means :\ You can't change the
> IOMMU page tables owned by linux from FW or you are creating bugs.
Oh, but I can :) It is a FW call which takes a pointer to a 2MB IOPDE and a new table of 4K PTEs filled with the old PDE's pfn plus offsets; the FW then exchanges the old IOPDE with the new table and smashes the corresponding RMP entry, suspending DMA while doing so.
If I get it right, for other platforms the entire IOMMU table is going to live in a secure space, so there will be similar FW calls; it is not that different.
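For illustration, the "4K PTEs filled with the old PDE's pfn plus offsets" part could look roughly like this. The encoding is made up (pfn shifted by 12 plus low flag bits); the real AMD IOPTE layout differs:

```c
#include <assert.h>
#include <stdint.h>

#define PTES_PER_2M 512

/*
 * Illustrative sketch of building the replacement 4K PTE table that is
 * handed to the FW call: 512 entries, each the old 2M PDE's pfn plus
 * the entry's offset, with the old flags carried over.
 */
static void fill_4k_table(uint64_t pde_pfn, uint64_t flags, uint64_t *ptes)
{
	for (int i = 0; i < PTES_PER_2M; i++)
		ptes[i] = ((pde_pfn + i) << 12) | flags;
}
```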
>> ps. I am still curious about:
>>
>>> btw just realized - does the code check that the folio_size
>>> matches IO pagesize? Or batch_to_domain() is expected to start a
>>> new batch if the next page size is not the same as previous? With
>>> THP, we can have a mix of page sizes
>
> The batch has a linear chunk of consecutive physical addresses. It has
> nothing to do with folios. The batch can start and end on any physical
> address so long as all addresses within the range are contiguously
> mapped.
>
> The iommu mapping logic accepts contiguous physical ranges and breaks
> them back down into IOPTEs. There is no direct relationship between
> folio size and IOPTE construction.
Ah right, pfn_reader_first/next take care of that contiguity. Never mind. Thanks,
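In other words, the mapping side only sees a (phys, len) range and greedily carves it into the largest supported, naturally aligned IOPTE sizes. A hedged sketch (the size list and names are illustrative; AMDv1's 16k encoding is one such size; assumes phys and len are multiples of the smallest size):

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K  (1ULL << 12)
#define SZ_16K (1ULL << 14)
#define SZ_2M  (1ULL << 21)

/*
 * Illustrative only: break a contiguous physical range into the
 * largest IOPTE sizes that fit and are naturally aligned, trying the
 * (descending) supported sizes in order at each step.
 */
static int split_range(uint64_t phys, uint64_t len,
		       const uint64_t *sizes, int nsizes, /* descending */
		       uint64_t *out)
{
	int n = 0;

	while (len) {
		for (int i = 0; i < nsizes; i++) {
			uint64_t sz = sizes[i];

			if (len >= sz && !(phys & (sz - 1))) {
				out[n++] = sz;
				phys += sz;
				len -= sz;
				break;
			}
		}
	}
	return n;
}
```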
> For example the iommufd selftest often has a scenario where it lucks
> into maybe 16k of consecutive PFNs because that is just what the MM
> does on a fresh boot. Even though they are actually 4k folios they
> will be mapped into AMDv1's 16k IOPTE encoding.
>
> Jason
--
Alexey