[PATCH v8 07/15] iommupt: Add map_pages op

Alexey Kardashevskiy aik at amd.com
Wed Feb 25 15:11:56 PST 2026



On 18/1/26 02:43, Jason Gunthorpe wrote:
> On Sat, Jan 17, 2026 at 03:54:52PM +1100, Alexey Kardashevskiy wrote:
> 
>> I am trying this with TEE-IO on AMD SEV and hitting problems.
> 
> My understanding is that if you want to use SEV today you also have to
> use the kernel command line parameter to force 4k IOMMU pages?
> 
> So, I think your questions are about trying to enhance this to get
> larger pages in the IOMMU when possible?
> 
>> Now, from time to time the guest will share 4K pages which makes the
>> host OS smash NPT's 2MB PDEs to 4K PTEs, and 2M RMP entries to 4K
>> RMP entries, and since the IOMMU performs RMP checks - IOMMU PDEs
>> have to use the same granularity as NPT and RMP.
> 
> IMHO this is a bad hardware choice, it is going to make some very
> troublesome software, so sigh.
> 
>> So I end up in a situation when QEMU asks to map, for example, 2GB
>> of guest RAM and I want most of it to be 2MB mappings, and only
>> handful of 2MB pages to be split into 4K pages. But it appears so
>> that the above enforces the same page size for entire range.
> 
>> In the old IOMMU code, I handled it like this:
>>
>> https://github.com/AMDESE/linux-kvm/commit/0a40130987b7b65c367390d23821cc4ecaeb94bd#diff-f22bea128ddb136c3adc56bc09de9822a53ba1ca60c8be662a48c3143c511963L341
>>
>> tl;dr: I constantly re-calculate the page size while mapping.
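(For context, the re-calculation boils down to something like the sketch below - a rough model of the idea, not the actual code from that commit: on every step, pick the largest page size that is aligned for both IOVA and physical address, fits in the remaining length, and does not exceed whatever granularity the RMP currently allows for the range.)

```c
#include <assert.h>
#include <stdint.h>

#define SZ_4K	0x1000UL
#define SZ_2M	0x200000UL
#define SZ_1G	0x40000000UL

/*
 * Toy model of recalculating the IOMMU page size on every mapping
 * step: return the largest supported size that is aligned for both
 * the IOVA and the physical address, fits in the remaining length,
 * and does not exceed the RMP granularity for this range (rmp_max).
 */
static uint64_t calc_pgsize(uint64_t iova, uint64_t paddr, uint64_t len,
			    uint64_t rmp_max)
{
	static const uint64_t sizes[] = { SZ_1G, SZ_2M, SZ_4K };
	unsigned int i;

	for (i = 0; i < 3; i++) {
		uint64_t sz = sizes[i];

		if (sz > rmp_max)
			continue;
		if ((iova | paddr) & (sz - 1))
			continue;	/* not aligned for this size */
		if (len < sz)
			continue;	/* does not fit in what is left */
		return sz;
	}
	return 0;	/* caller must guarantee 4K alignment at minimum */
}
```

So a 2GB map call mostly advances in 2M (or 1G) steps, and only the ranges where the RMP forces rmp_max down to 4K fall back to 4K PTEs.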
> 
> Doing it at mapping time doesn't seem right to me, AFAICT the RMP can
> change dynamically whenever the guest decides to change the
> private/shared status of memory?
> 
> My expectation for AMD was that the VMM would be monitoring the RMP
> granularity and use cut or "increase/decrease page size" through
> iommupt to adjust the S2 mapping so it works with these RMP
> limitations.
> 
> Those don't fully exist yet, but they are in the plans.
> 
> It assumes that the VMM is continually aware of what all the RMP PTEs
> look like and when they are changing so it can make the required
> adjustments.
> 
> The flow would be something like..
>   1) Create an IOAS
>   2) Create a HWPT. If there is some known upper bound on RMP/etc page
>      size then limit the HWPT page size to the upper bound
>   3) Map stuff into the ioas
>   4) Build the RMP/etc and map ranges of page granularity
>   5) Call iommufd to adjust the page size within ranges


I am about to try this approach now. Step 5) means splitting bigger pages into smaller ones, and I remember you were working on hitless smashing of IO PDEs - do you have something to play with? I could not spot anything on github but do not want to reinvent it. Thanks,
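To illustrate what I mean by smashing in the simplest case: splitting one 2M leaf is conceptually just building a table of 512 4K PTEs that cover the same physical range, then installing that table in place of the leaf (a toy userspace model, not iommupt code):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SZ_4K		0x1000UL
#define PTES_PER_TABLE	512	/* 512 * 4K == 2M */

/*
 * Toy model of smashing a 2M leaf: build a table of 512 4K PTEs
 * mapping the same physical range starting at pde_paddr.  A hitless
 * implementation would then publish this table with a single atomic
 * store into the PDE and invalidate the IOTLB afterwards, so DMA
 * never observes a non-present entry in between.
 */
static uint64_t *smash_2m_to_4k(uint64_t pde_paddr)
{
	uint64_t *pt = malloc(PTES_PER_TABLE * sizeof(*pt));
	unsigned int i;

	if (!pt)
		return NULL;
	for (i = 0; i < PTES_PER_TABLE; i++)
		pt[i] = pde_paddr + (uint64_t)i * SZ_4K;
	return pt;
}
```

The interesting part is of course the atomic publish and the IOTLB invalidation ordering, which is what I was hoping your hitless splitting work already handles.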



>   6) Guest changes encrypted state so RMP changes
>   7) VMM adjusts the ranges of page granularity and calls iommufd with
>      the updates
>   8) iommupt code increases/decreases the page size as required.
> 
> Does this seem reasonable?
> 
>> I know, ideally we would only share memory in 2MB chunks but we are
>> not there yet as I do not know the early boot stage on x86 enough to
> 
> Even 2M is too small, I'd expect real scenarios to want to get up to
> 1GB ??
> 
> Jason

-- 
Alexey




More information about the linux-riscv mailing list