[PATCH 3/3] iommu/arm-smmu-v3: Add a max_tlbi_ops for __arm_smmu_tlb_inv_range()

Robin Murphy robin.murphy at arm.com
Tue Aug 22 02:30:35 PDT 2023


On 2023-08-22 09:45, Nicolin Chen wrote:
> When __arm_smmu_tlb_inv_range() is called with a large size, the latency
> of the call can be long: part of it comes from the software overhead of
> building the commands, and part from the CMDQ hardware consuming that
> large number of commands. The latency can be especially significant on
> an SMMU that does not support range invalidation commands, i.e. lacks
> ARM_SMMU_FEAT_RANGE_INV.
> 
> One way to optimize this is to replace the large number of VA
> invalidation commands with a single per-ASID or per-VMID invalidation
> command once the invalidation size reaches a preset threshold based on
> the number of entries per io-pgtable, similar to MAX_TLBI_OPS in
> arm64's tlbflush.h.
> 
> Add a max_tlbi_ops in arm_smmu_domain, and convert a large number of
> per-granule TLBI commands to a single per-ASID or per-VMID TLBI
> command, on SMMUs without ARM_SMMU_FEAT_RANGE_INV.
> 
> Signed-off-by: Nicolin Chen <nicolinc at nvidia.com>
> ---
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 10 +++++++++-
>   drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h |  1 +
>   2 files changed, 10 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> index d6c647e1eb01..3f0db30932bd 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
> @@ -1897,7 +1897,14 @@ static void __arm_smmu_tlb_inv_range(struct arm_smmu_cmdq_ent *cmd,
>   	if (!size)
>   		return;
>   
> -	if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> +	if (!(smmu->features & ARM_SMMU_FEAT_RANGE_INV)) {
> +		/*
> +		 * When the size reaches a threshold, replace per-granule TLBI
> +		 * commands with one single per-asid or per-vmid TLBI command.
> +		 */
> +		if (size >= granule * smmu_domain->max_tlbi_ops)
> +			return arm_smmu_tlb_inv_domain(smmu_domain);

This looks like it's at the wrong level - we should have figured this 
out before we got as far as low-level command-building. I'd have thought 
it would be a case of short-circuiting directly from 
arm_smmu_tlb_inv_range_domain() to arm_smmu_tlb_inv_context().
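
For illustration, the decision at that level could look something like the
following. This is a rough, self-contained sketch: the struct, the counters,
and the function name are invented stand-ins for the real driver state, not
the actual SMMUv3 code.

```c
#include <assert.h>
#include <stddef.h>

/* Invented stand-in for the relevant bits of an SMMU domain. */
struct fake_domain {
	unsigned long nents_per_pgtable; /* e.g. 512 for 4K granule, 8-byte PTEs */
	int full_invalidations;          /* would be arm_smmu_tlb_inv_context() calls */
	int range_invalidations;         /* would be __arm_smmu_tlb_inv_range() calls */
};

/*
 * Sketch of the suggested short-circuit: decide *before* any low-level
 * command building whether a single full-context (per-ASID/per-VMID)
 * invalidation is cheaper than issuing size/granule TLBI-by-VA commands.
 */
static void tlb_inv_range_domain(struct fake_domain *d,
				 size_t size, size_t granule)
{
	if (size >= granule * d->nents_per_pgtable) {
		d->full_invalidations++;	/* short-circuit to full flush */
		return;
	}
	d->range_invalidations++;		/* build per-granule commands */
}
```

With 512 entries per table and a 4K granule, the threshold works out to 2MB:
anything smaller still takes the per-VA path.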

> +	} else {
>   		/* Get the leaf page size */
>   		tg = __ffs(smmu_domain->domain.pgsize_bitmap);
>   
> @@ -2258,6 +2265,7 @@ static int arm_smmu_domain_finalise(struct iommu_domain *domain,
>   	}
>   
>   	smmu_domain->pgtbl_ops = pgtbl_ops;
> +	smmu_domain->max_tlbi_ops = pgtbl_cfg.nents_per_pgtable;

And now we're carrying *three* copies of the same information around 
everywhere? Honestly, just pull cfg->bits_per_level out of the 
io_pgtable_ops at the point where you need it, like the pagetable code 
itself manages to do perfectly happily. Wrap it in an io-pgtable helper 
if you think that's cleaner.
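
As a rough illustration of what such a helper could compute (the name is
invented, not an existing io-pgtable API): for LPAE-style tables a level
holds 2^bits_per_level entries, so the threshold can be derived on the fly
rather than cached in yet another field.

```c
#include <assert.h>

/*
 * Hypothetical helper: derive the entries-per-table count from
 * bits_per_level instead of carrying a duplicate nents/max_tlbi_ops
 * copy around in arm_smmu_domain.
 */
static inline unsigned long pgtable_nents(unsigned int bits_per_level)
{
	return 1UL << bits_per_level;
}
```

For example, a 4K granule with 8-byte descriptors gives 9 bits per level,
i.e. 512 entries; a 64K granule gives 13 bits, i.e. 8192.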

Thanks,
Robin.

>   	return 0;
>   }
>   
> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> index dcab85698a4e..f68c95a2e24f 100644
> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h
> @@ -721,6 +721,7 @@ struct arm_smmu_domain {
>   	struct io_pgtable_ops		*pgtbl_ops;
>   	bool				stall_enabled;
>   	atomic_t			nr_ats_masters;
> +	unsigned long			max_tlbi_ops;
>   
>   	enum arm_smmu_domain_stage	stage;
>   	union {


