[PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable caching of bypass entries
Nipun Gupta
nipun.gupta at nxp.com
Wed Nov 2 12:26:02 PDT 2016
Hi Robin,
> -----Original Message-----
> From: Robin Murphy [mailto:robin.murphy at arm.com]
> Sent: Wednesday, November 02, 2016 16:51
> To: Nipun Gupta <nipun.gupta at nxp.com>; will.deacon at arm.com; linux-arm-
> kernel at lists.infradead.org; iommu at lists.linux-foundation.org
> Cc: Stuart Yoder <stuart.yoder at nxp.com>
> Subject: Re: [PATCH] iommu: arm-smmu: Set SMTNMB_TLBEN in ACR to enable
> caching of bypass entries
>
> Hi Nipun,
>
> On 02/11/16 13:35, Nipun Gupta wrote:
> > The SMTNMB_TLBEN in the Auxiliary Configuration Register (ACR)
> > provides an option to enable the updation of TLB in case of bypass
> > transactions due to no stream match in the stream match table. This
> > reduces the latencies of the subsequent transactions with the same stream-id
> which bypasses the SMMU.
> > This provides a significant performance benefit for certain networking
> > workloads.
>
> ...at the cost of increased TLB contention against other workloads.
> However, in the general case we'd expect the system to be fully described, so if
> there aren't any unmatched Stream IDs there hopefully shouldn't be an impact
> to leaving this switched on. I'd be interested to see some actual performance
> numbers, though - you already know my opinion about unsubstantiated quotes
> from the MMU-500 TRM.
With this change we have seen substantial performance improvement of ~9-10%
with DPDK l3fwd application (http://dpdk.org/doc/guides/sample_app_ug/l3_forward.html)
on NXP's LS2088a platform (single core as well as multi-core). Also, with ODP reflector application
(loopback mode - NXP in-house) we have seen 5% improvement in performance on
LS1088 platform.
W.r.t. the read latencies, they are reduced to avg. ~50 platform cycles from avg. ~140
platform cycles per memory read transactions which follow this bypass path (on LS2088
with DPDK l3fwd application).
(Apologies, I cannot share the DPDK/ODP exact performance numbers on the mailing list).
>
> > Signed-off-by: Nipun Gupta <nipun.gupta at nxp.com>
> > ---
> > drivers/iommu/arm-smmu.c | 21 +++++++++++++++------
> > 1 file changed, 15 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c index
> > ce2a9d4..7010a5c 100644
> > --- a/drivers/iommu/arm-smmu.c
> > +++ b/drivers/iommu/arm-smmu.c
> > @@ -246,6 +246,7 @@ enum arm_smmu_s2cr_privcfg {
> >
> > #define ARM_MMU500_ACTLR_CPRE (1 << 1)
> >
> > +#define ACR_SMTNMB_TLBEN (1 << 8)
>
> ACR is entirely implementation-defined, so there are no generic field names.
> Please follow the naming convention handily demonstrated in the subsequent
> context line.
>
> > #define ARM_MMU500_ACR_CACHE_LOCK (1 << 26)
>
> Actually, can we also please keep these in descending order of bit position like
> everything else?
>
> > #define CB_PAR_F (1 << 0)
> > @@ -1569,18 +1570,26 @@ static void arm_smmu_device_reset(struct
> arm_smmu_device *smmu)
> > for (i = 0; i < smmu->num_mapping_groups; ++i)
> > arm_smmu_write_sme(smmu, i);
> >
> > + /* Get the major rev required for configuring ACR */
>
> That comment is nonsense.
>
> > + reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> > + major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> > +
> > /*
> > * Before clearing ARM_MMU500_ACTLR_CPRE, need to
> > * clear CACHE_LOCK bit of ACR first. And, CACHE_LOCK
> > * bit is only present in MMU-500r2 onwards.
> > */
> > - reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> > - major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> > - if ((smmu->model == ARM_MMU500) && (major >= 2)) {
> > - reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> > + reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> > + if ((smmu->model == ARM_MMU500) && (major >= 2))
> > reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
> > - writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
> > - }
> > +
> > + /*
> > + * Set the SMTNMB_TLBEN in ACR so that the transactions which
> > + * bypass with SMMU due to no stream match found in the SMR table
> > + * are updated in the TLB's.
>
> Or simply, e.g. "Allow unmatched Stream IDs to allocate bypass TLB entries for
> reduced latency". It's already clear from the code what bit's being set where, we
> only need to remember *why*.
>
> > + */
> > + reg |= ACR_SMTNMB_TLBEN;
> > + writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
>
> Are you sure it's perfectly fine to set that implementation-defined bit on any
> SMMU implementation other than the two-and-a-half ARM Ltd. ones which
> happen to share the same meaning? I'm certainly not.
>
> The correct flow would be something like this:
>
> if (smmu->model == ARM_MMU500) {
> reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_ID7);
> major = (reg >> ID7_MAJOR_SHIFT) & ID7_MAJOR_MASK;
> reg = readl_relaxed(gr0_base + ARM_SMMU_GR0_sACR);
> if (major >= 2)
> reg &= ~ARM_MMU500_ACR_CACHE_LOCK;
> reg |= ACR_SMTNMB_TLBEN;
> writel_relaxed(reg, gr0_base + ARM_SMMU_GR0_sACR);
> }
>
> The shape of the current code avoids an extra level of indentation, but once you
> have to have the nested conditional anyway, it might as well all be predicated
> appropriately.
>
Thank you for providing the useful comments. I would incorporate them all in next version :).
Regards,
Nipun
> Robin.
>
> > /* Make sure all context banks are disabled and clear CB_FSR */
> > for (i = 0; i < smmu->num_context_banks; ++i) {
> >
More information about the linux-arm-kernel
mailing list