[PATCH] iommu/io-pgtable-arm: Add support for contiguous hint bit
Vijayanand Jitta
vijayanand.jitta at oss.qualcomm.com
Wed Jun 24 22:47:21 PDT 2026
On 6/20/2026 1:10 AM, Daniel Mentz wrote:
> On Thu, Jun 18, 2026 at 2:06 AM Vijayanand Jitta
> <vijayanand.jitta at oss.qualcomm.com> wrote:
>> Support is gated behind CONFIG_IOMMU_IO_PGTABLE_CONTIG_HINT, which
>> provides a compile-time opt-out for hardware affected by SMMU errata
>> related to the contiguous bit.
>
> Have you considered making this a runtime option? Compare this with
> arm_smmu_device_iidr_probe() where the smmuv3 driver disables certain
> features based on the identified implementation and the errata
> affecting that implementation.
>
Thanks for the review comments.
Good point. I’ll drop the Kconfig switch and make this runtime-controlled
via an io-pgtable quirk, so SMMU drivers can disable CONT based on errata.
>> On the mapping side, __arm_lpae_map() detects when the requested size
>> matches a contiguous range at the next level, sets the CONT bit on all
>> PTEs in the group, then recurses with the base block size and an
>> adjusted pgcount.
>
> I would perform this check at the current level not the previous
> level. See comments below.
>
Sure, will update this check at current level.
>>
>> On the unmapping side, the CONT bit is cleared from all PTEs in the
>> affected contiguous group before any individual entry is invalidated,
>> following the Break-Before-Make requirement of the architecture.
>
> My understanding is that for unmap operations, the following rule applies:
>
> The IOVA range targeted by an unmap operation must exactly match the
> IOVA range of a previous map operation. Partial unmap operations are
> not allowed.
>
> The iopgtable code previously had a function named
> arm_lpae_split_blk_unmap() which allowed a block mapping to be split
> up. However, that function has since been removed, which aligns with
> prohibiting partial unmaps.
> The other concern I have is a potential race condition: While one
> thread clears the contiguous bit, another thread could try to unmap
> the same descriptor.
>
> Consider dropping support for partial unmap and just triggering a
> WARN_ON() if you detect that a contiguous group is partially unmapped.
>
Sure, will drop partial unmap support and I'll update with WARN_ON()
as suggested.
>> +static inline int arm_lpae_cont_pmds(unsigned long size)
>
> PMD is not a term that is used in this file. I advise against
> introducing this term.
>
Agreed, I’ll avoid PMD terminology here and rename those helpers/comments
to use block-level wording.
>> +static u32 arm_lpae_find_num_cont(struct arm_lpae_io_pgtable *data, int lvl)
>> +{
>> + if (lvl == ARM_LPAE_MAX_LEVELS - 2)
>> + return arm_lpae_cont_pmds(ARM_LPAE_BLOCK_SIZE(lvl, data));
>> + else if (lvl == ARM_LPAE_MAX_LEVELS - 1)
>> + return arm_lpae_cont_ptes(ARM_LPAE_BLOCK_SIZE(lvl, data));
>
> Consider supporting the contiguous bit at lookup level 1.
>
Sure.
>> static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
>> phys_addr_t paddr, size_t size, size_t pgcount,
>> arm_lpae_iopte prot, int lvl, arm_lpae_iopte *ptep,
>> @@ -463,6 +583,7 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
>> size_t tblsz = ARM_LPAE_GRANULE(data);
>> struct io_pgtable_cfg *cfg = &data->iop.cfg;
>> int ret = 0, num_entries, max_entries, map_idx_start;
>> + u32 num_cont = 1;
>>
>> /* Find our entry at the current level */
>> map_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
>> @@ -505,6 +626,24 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
>> return -EEXIST;
>> }
>>
>> + if (arm_lpae_pte_is_contiguous_range(data, size, lvl + 1, &num_cont)) {
>
> I would recommend performing this check at the actual level not at the
> previous lookup level i.e. not at the (lvl - 1) level. Imagine the
> following situation: The granule size is 4KB, the initial lookup level
> is 2, and size is 32MB. I'm wondering if in that case, it'll just keep
> recursing until it hits (WARN_ON(lvl >= ARM_LPAE_MAX_LEVELS - 1)).
>
Right, I see your point. The contiguous-size check should be done against the
current level, I’ll fix that in v2.
>> +#ifdef CONFIG_IOMMU_IO_PGTABLE_CONTIG_HINT
>> +static void arm_lpae_cont_clear(struct arm_lpae_io_pgtable *data,
>> + unsigned long iova, int lvl,
>> + arm_lpae_iopte *ptep, size_t num_entries)
>> +{
>> + struct io_pgtable_cfg *cfg = &data->iop.cfg;
>> + u32 num_cont = arm_lpae_find_num_cont(data, lvl);
>> + arm_lpae_iopte *cont_ptep;
>> + arm_lpae_iopte *cont_ptep_start;
>> + unsigned long cont_iova;
>> + int offset, itr;
>> +
>> + cont_ptep = ptep - ARM_LPAE_LVL_IDX(iova, lvl, data);
>> + cont_iova = round_down(iova,
>> + ARM_LPAE_BLOCK_SIZE(lvl, data) * num_cont);
>
> As a result of this round_down() function, you are accessing a
> descriptor that describes an IOVA outside the range targeted by the
> iommu_unmap call. Consequently, you might race against another thread
> accessing the same descriptor.
>
Agreed. I’m going to drop partial-unmap handling for contiguous groups,
so we will only operate on an exact aligned contiguous range and
reject partial unmaps with WARN_ON(). That also removes the need for
the current round_down()-based logic.
>> + cont_ptep += ARM_LPAE_LVL_IDX(cont_iova, lvl, data);
>> + cont_ptep_start = cont_ptep;
>> +
>> + /*
>> + * iova may not be aligned to the contiguous group boundary; include
>> + * any leading entries so round_up() covers all overlapping groups.
>> + */
>> + offset = ARM_LPAE_LVL_IDX(iova, lvl, data) -
>> + ARM_LPAE_LVL_IDX(cont_iova, lvl, data);
>> + num_entries = round_up(offset + num_entries, num_cont);
>> +
>> + for (itr = 0; itr < num_entries; itr++) {
>> + WRITE_ONCE(*cont_ptep, READ_ONCE(*cont_ptep) & ~ARM_LPAE_PTE_CONT);
>
> This read-modify-write operation is not safe due to the potential race
> described above.
>
With partial unmap support removed, I suppose this should be fine now.
>> + cont_ptep++;
>> + }
>> +
>> + if (!cfg->coherent_walk)
>> + __arm_lpae_sync_pte(cont_ptep_start, num_entries, cfg);
>> +}
>> +#else
>> +static void arm_lpae_cont_clear(struct arm_lpae_io_pgtable *data,
>> + unsigned long iova, int lvl,
>> + arm_lpae_iopte *ptep, size_t num_entries)
>> +{
>> +}
>> +#endif
>> +
>> static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>> struct iommu_iotlb_gather *gather,
>> unsigned long iova, size_t size, size_t pgcount,
>> @@ -660,7 +841,7 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>> {
>> arm_lpae_iopte pte;
>> struct io_pgtable *iop = &data->iop;
>> - int i = 0, num_entries, max_entries, unmap_idx_start;
>> + int i = 0, num_cont = 1, num_entries, max_entries, unmap_idx_start;
>>
>> /* Something went horribly wrong and we ran out of page table */
>> if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
>> @@ -675,9 +856,15 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>> }
>>
>> /* If the size matches this level, we're in the right place */
>> - if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
>> + if (size == ARM_LPAE_BLOCK_SIZE(lvl, data) ||
>> + (size == arm_lpae_find_num_cont(data, lvl) *
>> + ARM_LPAE_BLOCK_SIZE(lvl, data))) {
>> + size_t pte_size;
>> +
>> max_entries = arm_lpae_max_entries(unmap_idx_start, data);
>> - num_entries = min_t(int, pgcount, max_entries);
>> + num_cont = arm_lpae_check_num_cont(data, size, lvl);
>> + num_entries = min_t(int, num_cont * pgcount, max_entries);
>> + pte_size = size / num_cont;
>>
>> /* Find and handle non-leaf entries */
>> for (i = 0; i < num_entries; i++) {
>> @@ -687,11 +874,27 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>> break;
>> }
>>
>> + /*
>> + * Break-Before-Make: before invalidating any leaf
>> + * entry, clear the CONT bit from every entry in the
>> + * contiguous group(s) and flush the TLB, as required
>> + * by the architecture. arm_lpae_cont_clear() covers
>> + * the full [iova, iova + num_entries * pte_size) range
>> + * via round_up(), so subsequent entries read back
>> + * CONT=0 and skip this block.
>> + */
>> + if (pte & ARM_LPAE_PTE_CONT) {
>> + arm_lpae_cont_clear(data, iova, lvl, ptep, num_entries);
>> + io_pgtable_tlb_flush_walk(iop, iova,
>> + num_entries * pte_size,
>> + ARM_LPAE_GRANULE(data));
>
> I believe this is inefficient. Consider the case where we unmap 2MB
> worth of IOVA space mapped by 512 4KB page descriptors with the
> contiguous bit set. If I'm not mistaken, you're running CMOs
> (__arm_lpae_sync_pte) twice for every page descriptor. In addition,
> io_pgtable_tlb_flush_walk() will submit an extra CMD_SYNC and wait for
> it's completion.
>
> Additionally, you perform rounding in arm_lpae_cont_clear(). However,
> io_pgtable_tlb_flush_walk() is called on the original, potentially
> unaligned range. Can this lead to under invalidation? Again, my
> preference would be to drop support for partial unmaps which would
> also remove the requirement for calling io_pgtable_tlb_flush_walk()
> here.
>
Agreed. The current unmap path is more complex and expensive than necessary.
Since partial unmap of contiguous groups should not be supported, I will remove
the rounding-based handling and only permit unmaps that exactly match an
aligned contiguous group. That also eliminates the need for the
extra io_pgtable_tlb_flush_walk() here.
Thanks,
Vijay
>> + }
>> +
>> if (!iopte_leaf(pte, lvl, iop->fmt)) {
>> __arm_lpae_clear_pte(&ptep[i], &iop->cfg, 1);
>>
>> /* Also flush any partial walks */
>> - io_pgtable_tlb_flush_walk(iop, iova + i * size, size,
>> + io_pgtable_tlb_flush_walk(iop, iova + i * pte_size, pte_size,
>> ARM_LPAE_GRANULE(data));
>> __arm_lpae_free_pgtable(data, lvl + 1, iopte_deref(pte, data));
>> }
More information about the linux-arm-kernel
mailing list