[PATCH] iommu/io-pgtable-arm: Add support for contiguous hint bit

Vijayanand Jitta vijayanand.jitta at oss.qualcomm.com
Wed Jun 24 22:47:21 PDT 2026



On 6/20/2026 1:10 AM, Daniel Mentz wrote:
> On Thu, Jun 18, 2026 at 2:06 AM Vijayanand Jitta
> <vijayanand.jitta at oss.qualcomm.com> wrote:
>> Support is gated behind CONFIG_IOMMU_IO_PGTABLE_CONTIG_HINT, which
>> provides a compile-time opt-out for hardware affected by SMMU errata
>> related to the contiguous bit.
> 
> Have you considered making this a runtime option? Compare this with
> arm_smmu_device_iidr_probe() where the smmuv3 driver disables certain
> features based on the identified implementation and the errata
> affecting that implementation.
> 

Thanks for the review comments.

Good point. I’ll drop the Kconfig switch and make this runtime-controlled
via an io-pgtable quirk, so SMMU drivers can disable CONT based on errata.

>> On the mapping side, __arm_lpae_map() detects when the requested size
>> matches a contiguous range at the next level, sets the CONT bit on all
>> PTEs in the group, then recurses with the base block size and an
>> adjusted pgcount.
> 
> I would perform this check at the current level not the previous
> level. See comments below.
> 

Sure, will update this check at current level.

>>
>> On the unmapping side, the CONT bit is cleared from all PTEs in the
>> affected contiguous group before any individual entry is invalidated,
>> following the Break-Before-Make requirement of the architecture.
> 
> My understanding is that for unmap operations, the following rule applies:
> 
> The IOVA range targeted by an unmap operation must exactly match the
> IOVA range of a previous map operation. Partial unmap operations are
> not allowed.
> 
> The iopgtable code previously had a function named
> arm_lpae_split_blk_unmap() which allowed a block mapping to be split
> up. However, that function has since been removed, which aligns with
> prohibiting partial unmaps.
> The other concern I have is a potential race condition: While one
> thread clears the contiguous bit, another thread could try to unmap
> the same descriptor.
> 
> Consider dropping support for partial unmap and just triggering a
> WARN_ON() if you detect that a contiguous group is partially unmapped.
> 

Sure, will drop partial unmap support and  I'll update with WARN_ON()
as suggested.

>> +static inline int arm_lpae_cont_pmds(unsigned long size)
> 
> PMD is not a term that is used in this file. I advise against
> introducing this term.
> 

Agreed, I’ll avoid PMD terminology here and rename those helpers/comments
to use block-level wording.

>> +static u32 arm_lpae_find_num_cont(struct arm_lpae_io_pgtable *data, int lvl)
>> +{
>> +       if (lvl == ARM_LPAE_MAX_LEVELS - 2)
>> +               return arm_lpae_cont_pmds(ARM_LPAE_BLOCK_SIZE(lvl, data));
>> +       else if (lvl == ARM_LPAE_MAX_LEVELS - 1)
>> +               return arm_lpae_cont_ptes(ARM_LPAE_BLOCK_SIZE(lvl, data));
> 
> Consider supporting the contiguous bit at lookup level 1.
> 

Sure.

>>  static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
>>                           phys_addr_t paddr, size_t size, size_t pgcount,
>>                           arm_lpae_iopte prot, int lvl, arm_lpae_iopte *ptep,
>> @@ -463,6 +583,7 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
>>         size_t tblsz = ARM_LPAE_GRANULE(data);
>>         struct io_pgtable_cfg *cfg = &data->iop.cfg;
>>         int ret = 0, num_entries, max_entries, map_idx_start;
>> +       u32 num_cont = 1;
>>
>>         /* Find our entry at the current level */
>>         map_idx_start = ARM_LPAE_LVL_IDX(iova, lvl, data);
>> @@ -505,6 +626,24 @@ static int __arm_lpae_map(struct arm_lpae_io_pgtable *data, unsigned long iova,
>>                 return -EEXIST;
>>         }
>>
>> +       if (arm_lpae_pte_is_contiguous_range(data, size, lvl + 1, &num_cont)) {
> 
> I would recommend performing this check at the actual level not at the
> previous lookup level i.e. not at the (lvl - 1) level. Imagine the
> following situation: The granule size is 4KB, the initial lookup level
> is 2, and size is 32MB. I'm wondering if in that case, it'll just keep
> recursing until it hits (WARN_ON(lvl >= ARM_LPAE_MAX_LEVELS - 1)).
> 

Right, I see your point. The contiguous-size check should be done against the
current level, I’ll fix that in v2.

>> +#ifdef CONFIG_IOMMU_IO_PGTABLE_CONTIG_HINT
>> +static void arm_lpae_cont_clear(struct arm_lpae_io_pgtable *data,
>> +                               unsigned long iova, int lvl,
>> +                               arm_lpae_iopte *ptep, size_t num_entries)
>> +{
>> +       struct io_pgtable_cfg *cfg = &data->iop.cfg;
>> +       u32 num_cont = arm_lpae_find_num_cont(data, lvl);
>> +       arm_lpae_iopte *cont_ptep;
>> +       arm_lpae_iopte *cont_ptep_start;
>> +       unsigned long cont_iova;
>> +       int offset, itr;
>> +
>> +       cont_ptep = ptep - ARM_LPAE_LVL_IDX(iova, lvl, data);
>> +       cont_iova = round_down(iova,
>> +                              ARM_LPAE_BLOCK_SIZE(lvl, data) * num_cont);
> 
> As a result of this round_down() function, you are accessing a
> descriptor that describes an IOVA outside the range targeted by the
> iommu_unmap call. Consequently, you might race against another thread
> accessing the same descriptor.
> 

Agreed. I’m going to drop partial-unmap handling for contiguous groups,
so we will only operate on an exact aligned contiguous range and
reject partial unmaps with WARN_ON(). That also removes the need for
the current round_down()-based logic.

>> +       cont_ptep += ARM_LPAE_LVL_IDX(cont_iova, lvl, data);
>> +       cont_ptep_start = cont_ptep;
>> +
>> +       /*
>> +        * iova may not be aligned to the contiguous group boundary; include
>> +        * any leading entries so round_up() covers all overlapping groups.
>> +        */
>> +       offset = ARM_LPAE_LVL_IDX(iova, lvl, data) -
>> +                ARM_LPAE_LVL_IDX(cont_iova, lvl, data);
>> +       num_entries = round_up(offset + num_entries, num_cont);
>> +
>> +       for (itr = 0; itr < num_entries; itr++) {
>> +               WRITE_ONCE(*cont_ptep, READ_ONCE(*cont_ptep) & ~ARM_LPAE_PTE_CONT);
> 
> This read-modify-write operation is not safe due to the potential race
> described above.
> 

With partial unmap support removed, I suppose this should be fine now.

>> +               cont_ptep++;
>> +       }
>> +
>> +       if (!cfg->coherent_walk)
>> +               __arm_lpae_sync_pte(cont_ptep_start, num_entries, cfg);
>> +}
>> +#else
>> +static void arm_lpae_cont_clear(struct arm_lpae_io_pgtable *data,
>> +                               unsigned long iova, int lvl,
>> +                               arm_lpae_iopte *ptep, size_t num_entries)
>> +{
>> +}
>> +#endif
>> +
>>  static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>>                                struct iommu_iotlb_gather *gather,
>>                                unsigned long iova, size_t size, size_t pgcount,
>> @@ -660,7 +841,7 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>>  {
>>         arm_lpae_iopte pte;
>>         struct io_pgtable *iop = &data->iop;
>> -       int i = 0, num_entries, max_entries, unmap_idx_start;
>> +       int i = 0, num_cont = 1, num_entries, max_entries, unmap_idx_start;
>>
>>         /* Something went horribly wrong and we ran out of page table */
>>         if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
>> @@ -675,9 +856,15 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>>         }
>>
>>         /* If the size matches this level, we're in the right place */
>> -       if (size == ARM_LPAE_BLOCK_SIZE(lvl, data)) {
>> +       if (size == ARM_LPAE_BLOCK_SIZE(lvl, data) ||
>> +           (size == arm_lpae_find_num_cont(data, lvl) *
>> +                    ARM_LPAE_BLOCK_SIZE(lvl, data))) {
>> +               size_t pte_size;
>> +
>>                 max_entries = arm_lpae_max_entries(unmap_idx_start, data);
>> -               num_entries = min_t(int, pgcount, max_entries);
>> +               num_cont = arm_lpae_check_num_cont(data, size, lvl);
>> +               num_entries = min_t(int, num_cont * pgcount, max_entries);
>> +               pte_size = size / num_cont;
>>
>>                 /* Find and handle non-leaf entries */
>>                 for (i = 0; i < num_entries; i++) {
>> @@ -687,11 +874,27 @@ static size_t __arm_lpae_unmap(struct arm_lpae_io_pgtable *data,
>>                                 break;
>>                         }
>>
>> +                       /*
>> +                        * Break-Before-Make: before invalidating any leaf
>> +                        * entry, clear the CONT bit from every entry in the
>> +                        * contiguous group(s) and flush the TLB, as required
>> +                        * by the architecture.  arm_lpae_cont_clear() covers
>> +                        * the full [iova, iova + num_entries * pte_size) range
>> +                        * via round_up(), so subsequent entries read back
>> +                        * CONT=0 and skip this block.
>> +                        */
>> +                       if (pte & ARM_LPAE_PTE_CONT) {
>> +                               arm_lpae_cont_clear(data, iova, lvl, ptep, num_entries);
>> +                               io_pgtable_tlb_flush_walk(iop, iova,
>> +                                                         num_entries * pte_size,
>> +                                                         ARM_LPAE_GRANULE(data));
> 
> I believe this is inefficient. Consider the case where we unmap 2MB
> worth of IOVA space mapped by 512 4KB page descriptors with the
> contiguous bit set. If I'm not mistaken, you're running CMOs
> (__arm_lpae_sync_pte) twice for every page descriptor. In addition,
> io_pgtable_tlb_flush_walk() will submit an extra CMD_SYNC and wait for
> it's completion.
> 
> Additionally, you perform rounding in arm_lpae_cont_clear(). However,
> io_pgtable_tlb_flush_walk() is called on the original, potentially
> unaligned range. Can this lead to under invalidation? Again, my
> preference would be to drop support for partial unmaps which would
> also remove the requirement for calling io_pgtable_tlb_flush_walk()
> here.
> 

Agreed. The current unmap path is more complex and expensive than necessary.
Since partial unmap of contiguous groups should not be supported, I will remove
the rounding-based handling and only permit unmaps that exactly match an
aligned contiguous group. That also eliminates the need for the 
extra io_pgtable_tlb_flush_walk() here.

Thanks,
Vijay

>> +                       }
>> +
>>                         if (!iopte_leaf(pte, lvl, iop->fmt)) {
>>                                 __arm_lpae_clear_pte(&ptep[i], &iop->cfg, 1);
>>
>>                                 /* Also flush any partial walks */
>> -                               io_pgtable_tlb_flush_walk(iop, iova + i * size, size,
>> +                               io_pgtable_tlb_flush_walk(iop, iova + i * pte_size, pte_size,
>>                                                           ARM_LPAE_GRANULE(data));
>>                                 __arm_lpae_free_pgtable(data, lvl + 1, iopte_deref(pte, data));
>>                         }




More information about the linux-arm-kernel mailing list