[PATCH v4 2/3] swiotlb: dma: its: Enforce host page-size alignment for shared buffers

Tue Apr 28 05:20:53 PDT 2026

Marc Zyngier <maz at kernel.org> writes:

> On Mon, 27 Apr 2026 07:31:07 +0100,
> "Aneesh Kumar K.V (Arm)" <aneesh.kumar at kernel.org> wrote:
>> 
>> When running private-memory guests, the guest kernel must apply additional
>> constraints when allocating buffers that are shared with the hypervisor.
>> 
>> These shared buffers are also accessed by the host kernel and therefore
>> must be aligned to the host’s page size, and have a size that is a multiple
>> of the host page size.
>> 
>> On non-secure hosts, set_guest_memory_attributes() tracks memory at the
>> host PAGE_SIZE granularity. This creates a mismatch when the guest applies
>> attributes at 4K boundaries while the host uses 64K pages. In such cases,
>> set_guest_memory_attributes() returns -EINVAL, preventing the
>> conversion of memory regions from private to shared.
>> 
>> Architectures such as Arm can tolerate realm physical address space
>> (protected memory) PFNs being mapped as shared memory, as incorrect
>> accesses are detected and reported as GPC faults. However, relying on this
>> mechanism is unsafe and can still lead to kernel crashes.
>> 
>> This is particularly likely when guest_memfd allocations are mmapped and
>> accessed from userspace. Once exposed to userspace, we cannot guarantee
>> that applications will only access the intended 4K shared region rather
>> than the full 64K page mapped into their address space. Such userspace
>> addresses may also be passed back into the kernel and accessed via the
>> linear map, resulting in a GPC fault and a kernel crash.
>> 
>> With CCA, although Stage-2 mappings managed by the RMM still operate at a
>> 4K granularity, shared pages must nonetheless be aligned to the
>> host-managed page size and sized as whole host pages to avoid the issues
>> described above.
>
> I thought that was being fixed, and that there was now a strong
> guarantee that RMM and host are aligned on the page size. Even more,
> S2 is totally irrelevant here. The only thing that matters is the host
> page size vs the guest page size. Nothing else.
>

Yes, the latest RMM update includes the ability to change the granule
size.

The section above in the commit message was intended to explain that the
S2 mapping size is irrelevant. I agree it is not clear as written, so I
will reword it to improve clarity.
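
To make the host-versus-guest page size point concrete, here is a minimal
userspace sketch of the constraint. It is not code from the patch: the
names sketch_is_aligned() and sketch_range_shareable() are hypothetical,
and it assumes the host page size is a power of two.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): true if x is a multiple of
 * the power-of-two granule.
 */
static bool sketch_is_aligned(uint64_t x, uint64_t granule)
{
	assert(granule != 0 && (granule & (granule - 1)) == 0);
	return (x & (granule - 1)) == 0;
}

/*
 * Hypothetical check: a guest range can be converted to shared only if
 * both its base and its size are multiples of the host page size. The
 * Stage-2 mapping granule does not appear anywhere in this test.
 */
static bool sketch_range_shareable(uint64_t base, uint64_t size,
				   uint64_t host_page_size)
{
	return size != 0 &&
	       sketch_is_aligned(base, host_page_size) &&
	       sketch_is_aligned(size, host_page_size);
}
```

With a 64K host, a single 4K guest page fails this check even when its
base happens to be 64K-aligned, which matches the -EINVAL case described
in the commit message.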

>
>> 
>> Introduce a new helper, mem_decrypt_align(), to allow callers to enforce
>> the required alignment and size constraints for shared buffers.
>> 
>> The architecture-specific implementation of mem_decrypt_align() will be
>> provided in a follow-up patch.
>> 
>> Note on restricted-dma-pool:
>> rmem_swiotlb_device_init() uses reserved-memory regions described by
>> firmware. Those regions are not changed in-kernel to satisfy host granule
>> alignment. This is intentional: we do not expect restricted-dma-pool
>> allocations to be used with CCA. If restricted-dma-pool is intended for CCA
>> shared use, firmware must provide base/size aligned to the host IPA-change
>> granule.
>> 
>> Signed-off-by: Aneesh Kumar K.V (Arm) <aneesh.kumar at kernel.org>
>> ---
>>  arch/arm64/mm/mem_encrypt.c      | 19 +++++++++++++++----
>>  drivers/irqchip/irq-gic-v3-its.c | 20 +++++++++++++-------
>>  include/linux/mem_encrypt.h      | 14 ++++++++++++++
>>  kernel/dma/contiguous.c          | 10 ++++++++++
>>  kernel/dma/direct.c              | 16 ++++++++++++++--
>>  kernel/dma/pool.c                |  4 +++-
>>  kernel/dma/swiotlb.c             | 21 +++++++++++++--------
>>  7 files changed, 82 insertions(+), 22 deletions(-)
>> 
>
> [...]
>
>> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
>> index 291d7668cc8d..239d7e3bc16f 100644
>> --- a/drivers/irqchip/irq-gic-v3-its.c
>> +++ b/drivers/irqchip/irq-gic-v3-its.c
>> @@ -213,16 +213,17 @@ static gfp_t gfp_flags_quirk;
>>  static struct page *its_alloc_pages_node(int node, gfp_t gfp,
>>  					 unsigned int order)
>>  {
>> +	unsigned int new_order;
>>  	struct page *page;
>>  	int ret = 0;
>>  
>> -	page = alloc_pages_node(node, gfp | gfp_flags_quirk, order);
>> -
>> +	new_order = get_order(mem_decrypt_align((PAGE_SIZE << order)));
>> +	page = alloc_pages_node(node, gfp | gfp_flags_quirk, new_order);
>>  	if (!page)
>>  		return NULL;
>>  
>>  	ret = set_memory_decrypted((unsigned long)page_address(page),
>> -				   1 << order);
>> +				   1 << new_order);
>>  	/*
>>  	 * If set_memory_decrypted() fails then we don't know what state the
>>  	 * page is in, so we can't free it. Instead we leak it.
>> @@ -241,13 +242,16 @@ static struct page *its_alloc_pages(gfp_t gfp, unsigned int order)
>>  
>>  static void its_free_pages(void *addr, unsigned int order)
>>  {
>> +	int new_order;
>> +
>> +	new_order = get_order(mem_decrypt_align((PAGE_SIZE << order)));
>>  	/*
>>  	 * If the memory cannot be encrypted again then we must leak the pages.
>>  	 * set_memory_encrypted() will already have WARNed.
>>  	 */
>> -	if (set_memory_encrypted((unsigned long)addr, 1 << order))
>> +	if (set_memory_encrypted((unsigned long)addr, 1 << new_order))
>>  		return;
>> -	free_pages((unsigned long)addr, order);
>> +	free_pages((unsigned long)addr, new_order);
>>  }
>>
>
> Here's the non-obfuscated version of the two hunks above (and let it
> be on the record that New Order is a terrible, overrated band):
>
> diff --git a/drivers/irqchip/irq-gic-v3-its.c b/drivers/irqchip/irq-gic-v3-its.c
> index 291d7668cc8da..a4d555aaee241 100644
> --- a/drivers/irqchip/irq-gic-v3-its.c
> +++ b/drivers/irqchip/irq-gic-v3-its.c
> @@ -216,6 +216,7 @@ static struct page *its_alloc_pages_node(int node, gfp_t gfp,
>  	struct page *page;
>  	int ret = 0;
>  
> +	order = get_order(mem_decrypt_align(PAGE_SIZE << order));
>  	page = alloc_pages_node(node, gfp | gfp_flags_quirk, order);
>  
>  	if (!page)
> @@ -245,6 +246,7 @@ static void its_free_pages(void *addr, unsigned int order)
>  	 * If the memory cannot be encrypted again then we must leak the pages.
>  	 * set_memory_encrypted() will already have WARNed.
>  	 */
> +	order = get_order(mem_decrypt_align(PAGE_SIZE << order));
>  	if (set_memory_encrypted((unsigned long)addr, 1 << order))
>  		return;
>  	free_pages((unsigned long)addr, order);
>

I will include this in the next revision.
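
For reference, the order arithmetic in the simplified hunks can be
modelled in userspace as below. This is only a sketch: it assumes a 4K
guest PAGE_SIZE and that mem_decrypt_align() rounds a size up to a
power-of-two granule (the real arch implementation is in a follow-up
patch and is not shown here). All sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4K guest pages */
#define SKETCH_PAGE_SIZE  ((size_t)1 << SKETCH_PAGE_SHIFT)

/* Assumed behaviour of mem_decrypt_align(): round size up to the granule. */
static size_t sketch_decrypt_align(size_t size, size_t granule)
{
	assert(granule != 0 && (granule & (granule - 1)) == 0);
	return (size + granule - 1) & ~(granule - 1);
}

/* Userspace model of get_order(): smallest order covering size bytes. */
static unsigned int sketch_get_order(size_t size)
{
	unsigned int order = 0;

	while ((SKETCH_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* Mirrors: order = get_order(mem_decrypt_align(PAGE_SIZE << order)); */
static unsigned int sketch_shared_order(unsigned int order, size_t granule)
{
	return sketch_get_order(sketch_decrypt_align(SKETCH_PAGE_SIZE << order,
						     granule));
}
```

Under these assumptions an order-0 request on a 64K host becomes order 4,
while requests that already cover a whole number of host pages keep their
order. Because the allocation and free paths apply the same computation to
the same input order, they arrive at the same adjusted value.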

>>  static struct gen_pool *itt_pool;
>> @@ -268,11 +272,13 @@ static void *itt_alloc_pool(int node, int size)
>>  		if (addr)
>>  			break;
>>  
>> -		page = its_alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO, 0);
>> +		page = its_alloc_pages_node(node, GFP_KERNEL | __GFP_ZERO,
>> +					    get_order(mem_decrypt_granule_size()));
>
> You already taught its_alloc_pages_node() about the decrypt granule
> size stuff. I don't think we need to see more of it (and you don't
> mess with the call that is just above it).
>
>>  		if (!page)
>>  			break;
>>  
>> -		gen_pool_add(itt_pool, (unsigned long)page_address(page), PAGE_SIZE, node);
>> +		gen_pool_add(itt_pool, (unsigned long)page_address(page),
>> +			     mem_decrypt_granule_size(), node);
>
> I'd rather see something like mem_decrypt_align(PAGE_SIZE), which
> keeps the intent clear.
>

The helper was added based on feedback from a previous version. I assume
you are suggesting that only this caller should switch?

-aneesh