[PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries

Nipun Gupta nipun.gupta at nxp.com
Wed Nov 2 12:26:02 PDT 2016


Hi Robin,

> -----Original Message-----
> From: Robin Murphy [mailto:robin.murphy at arm.com]
> Sent: Wednesday, November 02, 2016 16:51
> To: Nipun Gupta <nipun.gupta at nxp.com>; will.deacon at arm.com; linux-arm-
> kernel at lists.infradead.org; iommu at lists.linux-foundation.org
> Cc: Stuart Yoder <stuart.yoder at nxp.com>
> Subject: Re: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable
> caching of bypass entries
> 
> Hi Nipun,
> 
> On 02/11/16 13:35, Nipun Gupta wrote:
> > The SMTNMB_TLBEN in the Auxiliary Configuration Register (ACR)
> > provides an option to enable the updation of TLB in case of bypass
> > transactions due to no stream match in the stream match table. This
> > reduces the latencies of the subsequent transactions with the same stream-id
> which bypasses the SMMU.
> > This provides a significant performance benefit for certain networking
> > workloads.
> 
> ...at the cost of increased TLB contention against other workloads.
> However, in the general case we'd expect the system to be fully described, so if
> there aren't any unmatched Stream IDs there hopefully shouldn't be an impact
> to leaving this switched on. I'd be interested to see some actual performance
> numbers, though - you already know my opinion about unsubstantiated quotes
> from the MMU-500 TRM.

With this change we have seen substantial performance improvement of ~9-10%
with DPDK l3fwd application (http://dpdk.org/doc/guides/sample_app_ug/l3_forward.html)
on NXP's LS2088a platform (single core as well as multi-core). Also, with ODP reflector application
(loopback mode - NXP in-house) we have seen 5% improvement in performance on
LS1088 platform.

W.r.t. the read latencies, they are reduced to avg. ~50 platform cycles from avg. ~140
platform cycles per memory read transactions which follow this bypass path (on LS2088
with DPDK l3fwd application).

(Apologies, I cannot share the DPDK/ODP exact performance numbers on the mailing list).

> 
> > Signed-off-by: Nipun Gupta <nipun.gupta at nxp.com>
> > ---
> >  drivers/iommu/arm-smmu.c | 21 +++++++++++++++------
> >  1 file changed, 15 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index
> > ce2a9d4..7010a5c 100644
> > --- a/drivers/iommu/arm-smmu.c
> > +++ b/drivers/iommu/arm-smmu.c
> > @@ -246,6 +246,7 @@ enum arm_smmu_s2cr_privcfg {
> >
> >  #define ARM_MMU500_ACTLR_CPRE		(1 << 1)
> >
> > +#define ACR_SMTNMB_TLBEN		(1 << 8)
> 
> ACR is entirely implementation-defined, so there are no generic field names.
> Please follow the naming convention handily demonstrated in the subsequent
> context line.
> 
> >  #define ARM_MMU500_ACR_CACHE_LOCK	(1 << 26)
> 
> Actually, can we also please keep these in descending order of bit position like
> everything else?
> 
> >  #define CB_PAR_F			(1 << 0)
> > @@ -1569,18 +1570,26 @@ static void arm_smmu_device_reset(struct
> arm_smmu_device *smmu)
> >  	for (i = 0; i < smmu->num_mapping_groups; ++i)
> >  		arm_smmu_write_sme(smmu, i);
> >
> > +	/* Get the major rev required for configuring ACR */
> 
> That comment is nonsense.
> 
> > +	reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> > +	major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> > +
> >  	/*
> >  	 * Before clearing ARM_MMU500_ACTLR_CPRE, need to
> >  	 * clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
> >  	 * bit is only present in MMU-500r2 onwards.
> >  	 */
> > -	reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> > -	major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> > -	if ((smmu->model == ARM_MMU500) && (major >= 2)) {
> > -		reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> > +	reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> > +	if ((smmu->model == ARM_MMU500) && (major >= 2))
> >  		reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
> > -		writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
> > -	}
> > +
> > +	/*
> > +	 * Set the SMTNMB_TLBEN in ACR so that the transactions which
> > +	 * bypass with SMMU due to no stream match found in the SMR table
> > +	 * are updated in the TLB's.
> 
> Or simply, e.g. "Allow unmatched Stream IDs to allocate bypass TLB entries for
> reduced latency". It's already clear from the code what bit's being set where, we
> only need to remember *why*.
> 
> > +	 */
> > +	reg |= ACR_SMTNMB_TLBEN;
> > +	writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
> 
> Are you sure it's perfectly fine to set that implementation-defined bit on any
> SMMU implementation other than the two-and-a-half ARM Ltd. ones which
> happen to share the same meaning? I'm certainly not.
> 
> The correct flow would be something like this:
> 
> 	if (smmu->model == ARM_MMU500) {
> 		reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> 		major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> 		reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> 		if (major >= 2)
> 	  		reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
> 		reg |= ACR_SMTNMB_TLBEN;
> 		writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
> 	}
> 
> The shape of the current code avoids an extra level of indentation, but once you
> have to have the nested conditional anyway, it might as well all be predicated
> appropriately.
> 

Thank you for providing the useful comments. I would incorporate them all in next version :).

Regards,
Nipun

> Robin.
> 
> >  	/* Make sure all context banks are disabled and clear CB_FSR  */
> >  	for (i = 0; i < smmu->num_context_banks; ++i) {
> >




More information about the linux-arm-kernel mailing list