[PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers

Michael Shavit mshavit at google.com
Mon Oct 23 01:36:36 PDT 2023


On Fri, Oct 20, 2023 at 7:39 PM Jason Gunthorpe <jgg at nvidia.com> wrote:
>
> On Fri, Oct 20, 2023 at 04:23:44PM +0800, Michael Shavit wrote:
> > The comment helps a lot thank you.
> >
> > I do still have some final reservations: wouldn't it be clearer with
> > the loop unrolled? After all, it's only 3 steps in the worst case...
> > Something like:
>
> I thought about that, but a big point for me was to consolidate the
> algorithm between CD/STE. Inlining everything makes it much more
> difficult to achieve this. Actually my first sketches were trying to
> write it unrolled.
>
> > +       arm_smmu_get_ste_used(target, &target_used);
> > +       arm_smmu_get_ste_used(cur, &cur_used);
> > +       if (!hitless_possible(target, target_used, cur, cur_used)) {
>
> hitless_possible requires the loop of the step function to calculate
> it.

Possibly yes. Another option would be to have a list of transitions
(e.g. IDENTITY -> S1) we expect and want to be hitless and check
against that list. It defeats some of the purpose of your design, but
it's also not obvious to me that we really need such flexibility in
the first place.
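
Roughly what I have in mind, as a sketch (all of these names are
invented for illustration, nothing here is from the actual driver):

enum ste_cfg { STE_CFG_ABORT, STE_CFG_IDENTITY, STE_CFG_S1, STE_CFG_S2 };

/* Hypothetical allow-list of the transitions we promise to keep hitless. */
static const struct {
	enum ste_cfg from, to;
} hitless_transitions[] = {
	{ STE_CFG_IDENTITY, STE_CFG_S1 },
	{ STE_CFG_S1, STE_CFG_IDENTITY },
};

static bool transition_is_hitless(enum ste_cfg from, enum ste_cfg to)
{
	int i;

	for (i = 0; i != ARRAY_SIZE(hitless_transitions); i++)
		if (hitless_transitions[i].from == from &&
		    hitless_transitions[i].to == to)
			return true;
	return false;
}

Any pair not in the table would take the safe break-before-make path
through abort.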

>
> > +               target->data[0] = STRTAB_STE_0_V;
> > +               arm_smmu_sync_ste_for_sid(smmu, sid);
>
> I still like V=0 as I think we do want the event for this case.

Oh yeah sure, I did not think too carefully about this.
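
In the sketch above the break step would then become something like
(clearing V so that a transaction racing with the update records a
C_BAD_STE event instead of being silently aborted):

	/* Break: make the STE invalid while the rest is rewritten. */
	WRITE_ONCE(cur->data[0], 0);
	arm_smmu_sync_ste_for_sid(smmu, sid);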

>
> > +               /*
> > +                * The STE is now in the abort state, where none of the
> > +                * bits except STRTAB_STE_0_V and STRTAB_STE_0_CFG are
> > +                * accessed. This allows all other words of the STE to be
> > +                * written without further disruption.
> > +                */
> > +               arm_smmu_get_ste_used(cur, &cur_used);
> > +       }
> > +       /* write bits in all positions unused by the STE */
> > +       for (i = 0; i != ARRAY_SIZE(cur->data); i++) {
> > +               /* (should probably optimize this away if no write needed) */
> > +               WRITE_ONCE(cur->data[i], (cur->data[i] & cur_used[i]) |
> > +                          (target->data[i] & ~cur_used[i]));
> > +       }
> > +       arm_smmu_sync_ste_for_sid(smmu, sid);
>
> Yes, I wanted to avoid all the syncs if they are not required.

Should be easy enough to optimize away in this alternate version as well.
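
e.g. have the pass report whether it actually wrote anything and only
sync when it did. A sketch, with a helper name of my own making:

/*
 * Write the target value into every bit position the current
 * configuration does not use.  Returns true if anything changed, so
 * the caller can skip the sync when this pass was a no-op.
 */
static bool write_unused_bits(__le64 *cur, const __le64 *target,
			      const __le64 *cur_used, int len)
{
	bool changed = false;
	int i;

	for (i = 0; i != len; i++) {
		__le64 val = (cur[i] & cur_used[i]) |
			     (target[i] & ~cur_used[i]);

		if (val != cur[i]) {
			WRITE_ONCE(cur[i], val);
			changed = true;
		}
	}
	return changed;
}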

>
> > +       /*
> > +        * It should now be possible to make a single qword write to make
> > +        * the new configuration take effect.
> > +        */
> > +       for (i = 0; i != ARRAY_SIZE(cur->data); i++) {
> > +               if ((cur->data[i] & target_used[i]) !=
> > +                   (target->data[i] & target_used[i]))
> > +                       /*
> > +                        * TODO:
> > +                        * WARN_ONCE if this condition hits more than once in
> > +                        * the loop
> > +                        */
> > +                       WRITE_ONCE(cur->data[i], target->data[i]);
> > +       }
>
> > +       arm_smmu_sync_ste_for_sid(smmu, sid);
>
> This needs to be optional too
>
> And there is another optional 4th pass to set the unused target values
> to 0.
>
> Basically you have captured the core algorithm, but I think if you
> fill in all the missing bits to get up to the same functionality it
> will be longer and unsharable with the CD side.
>
> You could perhaps take this approach and split it into 4 sharable step
> functions:
>
>  if (step1(target, target_used, cur, cur_used, len)) {
>    arm_smmu_sync_ste_for_sid(smmu, sid);
>    arm_smmu_get_ste_used(cur, &cur_used);
>  }
>
>  if (step2(target, target_used, cur, cur_used, len))
>    arm_smmu_sync_ste_for_sid(smmu, sid);
>
>  if (step3(target, target_used, cur, cur_used, len)) {
>    arm_smmu_sync_ste_for_sid(smmu, sid);
>    arm_smmu_get_ste_used(cur, &cur_used);
>  }
>
>  if (step4(target, target_used, cur, cur_used, len))
>    arm_smmu_sync_ste_for_sid(smmu, sid);
>
> To me this is inelegant: if we only need to do step 3, we have to
> redundantly scan the array twice. The rolled-up version just goes
> directly to step 3.
>
> However this does convince me you've thought very carefully about this
> and have not found a flaw in the design!

Right, it's more of a style/readability question, so I'll leave it to
others to chime in further if needed :).
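
FWIW, the optional 4th pass mentioned above looks like it fits the
same mold and should be just as easy to share between STE and CD; a
sketch, again with invented naming:

/*
 * After the new configuration is live, clear every bit position the
 * target does not use so the entry ends up in a canonical form.
 * Returns true if a final sync is needed.
 */
static bool clear_unused_bits(__le64 *cur, const __le64 *target_used,
			      int len)
{
	bool changed = false;
	int i;

	for (i = 0; i != len; i++) {
		__le64 val = cur[i] & target_used[i];

		if (val != cur[i]) {
			WRITE_ONCE(cur[i], val);
			changed = true;
		}
	}
	return changed;
}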


