[PATCH] iommu/arm-smmu-v3: Add SMMUv3.2 range invalidation support

Rob Herring robh at kernel.org
Thu Jan 16 15:09:06 PST 2020


On Thu, Jan 16, 2020 at 3:23 PM Auger Eric <eric.auger at redhat.com> wrote:
>
> Hi Rob,
>
> On 1/16/20 5:57 PM, Rob Herring wrote:
> > On Wed, Jan 15, 2020 at 10:33 AM Auger Eric <eric.auger at redhat.com> wrote:
> >>
> >> Hi Rob,
> >>
> >> On 1/15/20 3:02 PM, Rob Herring wrote:
> >>> On Wed, Jan 15, 2020 at 3:21 AM Auger Eric <eric.auger at redhat.com> wrote:
> >>>>
> >>>> Hi Rob,
> >>>>
> >>>> On 1/13/20 3:39 PM, Rob Herring wrote:
> >>>>> Arm SMMUv3.2 adds support for TLB range invalidate operations.
> >>>>> Support for range invalidate is determined by the RIL bit in the IDR3
> >>>>> register.
> >>>>>
> >>>>> The range invalidate is in units of the leaf page size and operates on
> >>>>> 1-32 chunks of a power of 2 multiple pages. First we determine from the
> >>>>> size what power of 2 multiple we can use and then adjust the granule to
> >>>>> 32x that size.
> >
> >>>>> @@ -2022,12 +2043,39 @@ static void arm_smmu_tlb_inv_range(unsigned long iova, size_t size,
> >>>>>               cmd.tlbi.vmid   = smmu_domain->s2_cfg.vmid;
> >>>>>       }
> >>>>>
> >>>>> +     if (smmu->features & ARM_SMMU_FEAT_RANGE_INV) {
> >>>>> +             unsigned long tg, scale;
> >>>>> +
> >>>>> +             /* Get the leaf page size */
> >>>>> +             tg = __ffs(smmu_domain->domain.pgsize_bitmap);
> >>>> it is unclear to me why you can't set tg with the granule parameter.
> >>>
> >>> granule could be 2MB sections if THP is enabled, right?
> >>
> >> Ah OK I thought it was a page size and not a block size.
> >>
> >> I requested this feature a long time ago for virtual SMMUv3. With
> >> DPDK/VFIO the guest was sending page TLB invalidation for each page
> >> (granule=4K or 64K) part of the hugepage buffer and those were trapped
> >> by the VMM. This stalled qemu.
> >
> > I did some more testing to make sure THP is enabled, but haven't been
> > able to get granule to be anything but 4K. I only have the Fast Model
> > with AHCI on PCI to test this with. Maybe I'm hitting some place where
> > THPs aren't supported yet.
> >
> >>>>> +             /* Determine the power of 2 multiple number of pages */
> >>>>> +             scale = __ffs(size / (1UL << tg));
> >>>>> +             cmd.tlbi.scale = scale;
> >>>>> +
> >>>>> +             cmd.tlbi.num = CMDQ_TLBI_RANGE_NUM_MAX - 1;
> >>>> Also could you explain why you use CMDQ_TLBI_RANGE_NUM_MAX.
> >>>
> >>> How's this:
> >>> /* The invalidation loop defaults to the maximum range */
> >> I would have expected num=0 directly. Don't we invalidate the @size in
> >> one shot as 2^scale * pages of granularity @tg? I fail to understand
> >> when NUM > 0.
> >
> > NUM is > 0 anytime size is not a power of 2. For example, if size is
> > 33 pages, then it takes 2 loops doing 32 pages and then 1 page. If
> > size is 34 pages, then NUM is (17-1) and SCALE is 1.
> OK I get it now. I misread the scale computation as log2() :-(.
>
> I still have a doubt about the scale choice. What if you invalidate a
> large number of pages such as 1025 pages. scale is 0 and you end up with
> 32 * 32 * 2^0 + 1 * 2 * 2^0  invalidations (33). Whereas you could
> invalidate the whole range with 2 invalidation commands: 1 x 2^10 +
> 1 x 2^0 (packing the invalidations by largest scale). Am I correct or am I
> still missing something?

No, you're not missing anything; that's correct. 33 is a lot better
than 1025 though. :) 1023 pages is about the worst case if we assume we
get 2MB blocks, but maybe that's not a good assumption given our testing
so far...
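
For reference, the command count of the scheme discussed so far (one
scale taken from the lowest set bit of the page count, at most 32 chunks
per command) can be modelled in userspace roughly as below. This is only
a sketch of the arithmetic, not the patch code: fixed_scale_cmds() is a
hypothetical name, the size is assumed to be already in units of leaf
pages, and the GCC/Clang builtin __builtin_ctzl() stands in for the
kernel's __ffs().

```c
/*
 * Hypothetical userspace model: number of range-invalidate commands when
 * a single SCALE is chosen from the lowest set bit of the page count and
 * each command covers at most 32 chunks of 2^SCALE pages.
 */
static unsigned long fixed_scale_cmds(unsigned long npages)
{
	unsigned int scale;
	unsigned long chunks;

	if (!npages)
		return 0;	/* __builtin_ctzl(0) is undefined */

	scale = __builtin_ctzl(npages);	/* stand-in for __ffs() */
	chunks = npages >> scale;	/* number of 2^scale-page chunks */

	return (chunks + 31) / 32;	/* 32 chunks per command, rounded up */
}
```

With this model, 1025 pages gives 33 commands, 1023 pages gives 32, and
34 pages gives a single command.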

So thinking out loud, I guess we could iterate on power of 2 chunks of
size (in units of pages) like this:

while (size) {
  scale = __fls(size);
  range = 1UL << scale;
  size &= ~range;

  iova += range;
}

But that means NUM is always 0, so also not ideal. So we need to
extract 5 bits from size for NUM on each iteration:

while (size) {
  scale = __ffs(size);
  num = ((size >> scale) & 0x1f) - 1;
  size -= (num + 1) * (1UL << scale);

  ...
}

So worst case, we'd have 4 invalidates for up to 4G.
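
A userspace sketch of that 5-bits-per-iteration idea, just to check the
arithmetic. It is not the driver code: struct range_cmd and split_range()
are hypothetical names, the size is assumed to be in units of leaf pages,
the NUM field is assumed to hold the chunk count minus one, and the
GCC/Clang builtin __builtin_ctzl() stands in for __ffs().

```c
#include <stddef.h>

/* Hypothetical model of the two fields of one range-invalidate command. */
struct range_cmd {
	unsigned int scale;	/* SCALE: log2 of pages per chunk */
	unsigned int num;	/* NUM: number of chunks minus one */
};

/*
 * Split npages into range commands, consuming up to 5 bits of the page
 * count per command starting at the lowest set bit. Returns the number
 * of commands written to out[], never more than max.
 */
static size_t split_range(unsigned long npages, struct range_cmd *out,
			  size_t max)
{
	size_t n = 0;

	while (npages && n < max) {
		/* Lowest set bit picks SCALE (stand-in for __ffs()) */
		unsigned int scale = __builtin_ctzl(npages);
		/* Bit 'scale' is set, so this is always in 1..31 */
		unsigned long chunks = (npages >> scale) & 0x1f;

		out[n].scale = scale;
		out[n].num = chunks - 1;
		npages -= chunks << scale;
		n++;
	}

	return n;
}
```

With this sketch, 34 pages becomes one command (SCALE=1, NUM=16), 1025
pages becomes two (SCALE=0 then SCALE=10, both with NUM=0), and a
20-bit page count with every bit set takes four.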

> Besides in the patch I think in the while loop the iova should be
> incremented with the actual number of invalidated bytes and not the max
> sized granule variable.

Ok.

Rob


