[PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries

Thu Nov 3 11:24:11 PDT 2016

Hi Robin,

> -----Original Message-----
> From: Robin Murphy [mailto:robin.murphy at arm.com]
> Sent: Thursday, November 03, 2016 16:39
> To: Nipun Gupta <nipun.gupta at nxp.com>; will.deacon at arm.com; linux-arm-
> kernel at lists.infradead.org; iommu at lists.linux-foundation.org
> Cc: Stuart Yoder <stuart.yoder at nxp.com>
> Subject: Re: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable
> caching of bypass entries
> 
> On 02/11/16 19:26, Nipun Gupta wrote:
> >
> > Hi Robin,
> >
> >> -----Original Message-----
> >> From: Robin Murphy [mailto:robin.murphy at arm.com]
> >> Sent: Wednesday, November 02, 2016 16:51
> >> To: Nipun Gupta <nipun.gupta at nxp.com>; will.deacon at arm.com; linux-arm-
> >> kernel at lists.infradead.org; iommu at lists.linux-foundation.org
> >> Cc: Stuart Yoder <stuart.yoder at nxp.com>
> >> Subject: Re: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to
> enable
> >> caching of bypass entries
> >>
> >> Hi Nipun,
> >>
> >> On 02/11/16 13:35, Nipun Gupta wrote:
> >>> The SMTNMB_TLBEN in the Auxiliary Configuration Register (ACR)
> >>> provides an option to enable the updation of TLB in case of bypass
> >>> transactions due to no stream match in the stream match table. This
> >>> reduces the latencies of the subsequent transactions with the same stream-
> id
> >> which bypasses the SMMU.
> >>> This provides a significant performance benefit for certain networking
> >>> workloads.
> >>
> >> ...at the cost of increased TLB contention against other workloads.
> >> However, in the general case we'd expect the system to be fully described, so
> if
> >> there aren't any unmatched Stream IDs there hopefully shouldn't be an
> impact
> >> to leaving this switched on. I'd be interested to see some actual performance
> >> numbers, though - you already know my opinion about unsubstantiated
> quotes
> >> from the MMU-500 TRM.
> >
> > With this change we have seen substantial performance improvement of ~9-
> 10%
> > with DPDK l3fwd application
> (http://dpdk.org/doc/guides/sample_app_ug/l3_forward.html)
> > on NXP's LS2088a platform (single core as well as multi-core). Also, with ODP
> reflector application
> > (loopback mode - NXP in-house) we have seen 5% improvement in
> performance on
> > LS1088 platform.
> >
> > W.r.t. the read latencies, they are reduced to avg. ~50 platform cycles from
> avg. ~140
> > platform cycles per memory read transactions which follow this bypass path
> (on LS2088
> > with DPDK l3fwd application).
> >
> > (Apologies, I cannot share the DPDK/ODP exact performance numbers on the
> mailing list).
> 
> That's understandable, and I'm not sure I'd know how to interpret them
> anyway ;) I reckon the percentages make a sufficiently compelling
> qualification of the improvement, so it would be good to have that
> summarised in the commit log.

Sure, I'll add a part of it in the commit log.

> 
> >>
> >>> Signed-off-by: Nipun Gupta <nipun.gupta at nxp.com>
> >>> ---
> >>>  drivers/iommu/arm-smmu.c | 21 +++++++++++++++------
> >>>  1 file changed, 15 insertions(+), 6 deletions(-)
> >>>
> >>> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index
> >>> ce2a9d4..7010a5c 100644
> >>> --- a/drivers/iommu/arm-smmu.c
> >>> +++ b/drivers/iommu/arm-smmu.c
> >>> @@ -246,6 +246,7 @@ enum arm_smmu_s2cr_privcfg {
> >>>
> >>>  #define ARM_MMU500_ACTLR_CPRE		(1 << 1)
> >>>
> >>> +#define ACR_SMTNMB_TLBEN		(1 << 8)
> >>
> >> ACR is entirely implementation-defined, so there are no generic field names.
> >> Please follow the naming convention handily demonstrated in the subsequent
> >> context line.
> >>
> >>>  #define ARM_MMU500_ACR_CACHE_LOCK	(1 << 26)
> >>
> >> Actually, can we also please keep these in descending order of bit position
> like
> >> everything else?
> >>
> >>>  #define CB_PAR_F			(1 << 0)
> >>> @@ -1569,18 +1570,26 @@ static void arm_smmu_device_reset(struct
> >> arm_smmu_device *smmu)
> >>>  	for (i = 0; i < smmu->num_mapping_groups; ++i)
> >>>  		arm_smmu_write_sme(smmu, i);
> >>>
> >>> +	/* Get the major rev required for configuring ACR */
> >>
> >> That comment is nonsense.
> >>
> >>> +	reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> >>> +	major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> >>> +
> >>>  	/*
> >>>  	 * Before clearing ARM_MMU500_ACTLR_CPRE, need to
> >>>  	 * clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
> >>>  	 * bit is only present in MMU-500r2 onwards.
> >>>  	 */
> >>> -	reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> >>> -	major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> >>> -	if ((smmu->model == ARM_MMU500) && (major >= 2)) {
> >>> -		reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> >>> +	reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> >>> +	if ((smmu->model == ARM_MMU500) && (major >= 2))
> >>>  		reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
> >>> -		writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
> >>> -	}
> >>> +
> >>> +	/*
> >>> +	 * Set the SMTNMB_TLBEN in ACR so that the transactions which
> >>> +	 * bypass with SMMU due to no stream match found in the SMR table
> >>> +	 * are updated in the TLB's.
> >>
> >> Or simply, e.g. "Allow unmatched Stream IDs to allocate bypass TLB entries
> for
> >> reduced latency". It's already clear from the code what bit's being set where,
> we
> >> only need to remember *why*.
> >>
> >>> +	 */
> >>> +	reg |= ACR_SMTNMB_TLBEN;
> >>> +	writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
> >>
> >> Are you sure it's perfectly fine to set that implementation-defined bit on any
> >> SMMU implementation other than the two-and-a-half ARM Ltd. ones which
> >> happen to share the same meaning? I'm certainly not.
> >>
> >> The correct flow would be something like this:
> >>
> >> 	if (smmu->model == ARM_MMU500) {
> >> 		reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> >> 		major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> >> 		reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> >> 		if (major >= 2)
> >> 	  		reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
> >> 		reg |= ACR_SMTNMB_TLBEN;
> >> 		writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
> >> 	}
> >>
> >> The shape of the current code avoids an extra level of indentation, but once
> you
> >> have to have the nested conditional anyway, it might as well all be predicated
> >> appropriately.
> >>
> >
> > Thank you for providing the useful comments. I would incorporate them all in
> next version :).
> 
> Cool. Just for clarity (I realise I was thinking it, but never said it
> outright), whilst MMU-40x do share the same feature with the same ACR
> bit definition as MMU-500, I'd be inclined not to bother with them.
> Since the monolithic microarchitecture means there's normally a separate
> MMU-40x per device, if you don't want translation for that device you
> can simply not probe the thing to turn it on in the first place.
> 

This seems a pretty decent reason to have this bit set only for MMU-500.
I'll send a patch v2 soon.

Thanks,
Nipun

> Robin.
> 
> >
> > Regards,
> > Nipun
> >
> >> Robin.
> >>
> >>>  	/* Make sure all context banks are disabled and clear CB_FSR  */
> >>>  	for (i = 0; i < smmu->num_context_banks; ++i) {
> >>>
> >