[PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries

Robin Murphy robin.murphy at arm.com
Thu Nov 3 04:08:34 PDT 2016


On 02/11/16 19:26, Nipun Gupta wrote:
> 
> Hi Robin,
> 
>> -----Original Message-----
>> From: Robin Murphy [mailto:robin.murphy at arm.com]
>> Sent: Wednesday, November 02, 2016 16:51
>> To: Nipun Gupta <nipun.gupta at nxp.com>; will.deacon at arm.com; linux-arm-
>> kernel at lists.infradead.org; iommu at lists.linux-foundation.org
>> Cc: Stuart Yoder <stuart.yoder at nxp.com>
>> Subject: Re: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable
>> caching of bypass entries
>>
>> Hi Nipun,
>>
>> On 02/11/16 13:35, Nipun Gupta wrote:
>>> The SMTNMB_TLBEN in the Auxiliary Configuration Register (ACR)
>>> provides an option to enable the updation of TLB in case of bypass
>>> transactions due to no stream match in the stream match table. This
>>> reduces the latencies of the subsequent transactions with the same stream-id
>> which bypasses the SMMU.
>>> This provides a significant performance benefit for certain networking
>>> workloads.
>>
>> ...at the cost of increased TLB contention against other workloads.
>> However, in the general case we'd expect the system to be fully described, so if
>> there aren't any unmatched Stream IDs there hopefully shouldn't be an impact
>> to leaving this switched on. I'd be interested to see some actual performance
>> numbers, though - you already know my opinion about unsubstantiated quotes
>> from the MMU-500 TRM.
> 
> With this change we have seen substantial performance improvement of ~9-10%
> with DPDK l3fwd application (http://dpdk.org/doc/guides/sample_app_ug/l3_forward.html)
> on NXP's LS2088a platform (single core as well as multi-core). Also, with ODP reflector application
> (loopback mode - NXP in-house) we have seen 5% improvement in performance on
> LS1088 platform.
> 
> W.r.t. the read latencies, they are reduced to avg. ~50 platform cycles from avg. ~140
> platform cycles per memory read transactions which follow this bypass path (on LS2088
> with DPDK l3fwd application).
> 
> (Apologies, I cannot share the DPDK/ODP exact performance numbers on the mailing list).

That's understandable, and I'm not sure I'd know how to interpret them
anyway ;) I reckon the percentages make a sufficiently compelling
qualification of the improvement, so it would be good to have that
summarised in the commit log.

>>
>>> Signed-off-by: Nipun Gupta <nipun.gupta at nxp.com>
>>> ---
>>>  drivers/iommu/arm-smmu.c | 21 +++++++++++++++------
>>>  1 file changed, 15 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index
>>> ce2a9d4..7010a5c 100644
>>> --- a/drivers/iommu/arm-smmu.c
>>> +++ b/drivers/iommu/arm-smmu.c
>>> @@ -246,6 +246,7 @@ enum arm_smmu_s2cr_privcfg {
>>>
>>>  #define ARM_MMU500_ACTLR_CPRE		(1 << 1)
>>>
>>> +#define ACR_SMTNMB_TLBEN		(1 << 8)
>>
>> ACR is entirely implementation-defined, so there are no generic field names.
>> Please follow the naming convention handily demonstrated in the subsequent
>> context line.
>>
>>>  #define ARM_MMU500_ACR_CACHE_LOCK	(1 << 26)
>>
>> Actually, can we also please keep these in descending order of bit position like
>> everything else?
>>
>>>  #define CB_PAR_F			(1 << 0)
>>> @@ -1569,18 +1570,26 @@ static void arm_smmu_device_reset(struct
>> arm_smmu_device *smmu)
>>>  	for (i = 0; i < smmu->num_mapping_groups; ++i)
>>>  		arm_smmu_write_sme(smmu, i);
>>>
>>> +	/* Get the major rev required for configuring ACR */
>>
>> That comment is nonsense.
>>
>>> +	reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
>>> +	major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
>>> +
>>>  	/*
>>>  	 * Before clearing ARM_MMU500_ACTLR_CPRE, need to
>>>  	 * clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
>>>  	 * bit is only present in MMU-500r2 onwards.
>>>  	 */
>>> -	reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
>>> -	major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
>>> -	if ((smmu->model == ARM_MMU500) && (major >= 2)) {
>>> -		reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
>>> +	reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
>>> +	if ((smmu->model == ARM_MMU500) && (major >= 2))
>>>  		reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
>>> -		writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
>>> -	}
>>> +
>>> +	/*
>>> +	 * Set the SMTNMB_TLBEN in ACR so that the transactions which
>>> +	 * bypass with SMMU due to no stream match found in the SMR table
>>> +	 * are updated in the TLB's.
>>
>> Or simply, e.g. "Allow unmatched Stream IDs to allocate bypass TLB entries for
>> reduced latency". It's already clear from the code what bit's being set where, we
>> only need to remember *why*.
>>
>>> +	 */
>>> +	reg |= ACR_SMTNMB_TLBEN;
>>> +	writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
>>
>> Are you sure it's perfectly fine to set that implementation-defined bit on any
>> SMMU implementation other than the two-and-a-half ARM Ltd. ones which
>> happen to share the same meaning? I'm certainly not.
>>
>> The correct flow would be something like this:
>>
>> 	if (smmu->model == ARM_MMU500) {
>> 		reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
>> 		major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
>> 		reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
>> 		if (major >= 2)
>> 	  		reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
>> 		reg |= ACR_SMTNMB_TLBEN;
>> 		writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
>> 	}
>>
>> The shape of the current code avoids an extra level of indentation, but once you
>> have to have the nested conditional anyway, it might as well all be predicated
>> appropriately.
>>
> 
> Thank you for providing the useful comments. I would incorporate them all in next version :).

Cool. Just for clarity (I realise I was thinking it, but never said it
outright), whilst MMU-40x do share the same feature with the same ACR
bit definition as MMU-500, I'd be inclined not to bother with them.
Since the monolithic microarchitecture means there's normally a separate
MMU-40x per device, if you don't want translation for that device you
can simply not probe the thing to turn it on in the first place.

Robin.

> 
> Regards,
> Nipun
> 
>> Robin.
>>
>>>  	/* Make sure all context banks are disabled and clear CB_FSR  */
>>>  	for (i = 0; i < smmu->num_context_banks; ++i) {
>>>
> 




More information about the linux-arm-kernel mailing list