[PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers

Jason Gunthorpe jgg at nvidia.com
Fri Oct 20 04:39:18 PDT 2023


On Fri, Oct 20, 2023 at 04:23:44PM +0800, Michael Shavit wrote:
> The comment helps a lot thank you.
> 
> I do still have some final reservations: wouldn't it be clearer with
> the loop un-rolled? After all it's only 3 steps in the worst case....
> Something like:

I thought about that, but a big point for me was to consolidate the
algorithm between CD/STE. Inlining everything makes it much more
difficult to achieve this. Actually my first sketches were trying to
write it unrolled.
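
To give a feel for the consolidation I mean: roughly, the STE/CD specifics
get hidden behind a small ops structure so one algorithm serves both sides.
A sketch only, with invented names, not the actual patch interface:

/* Illustrative names, not the actual patch interface */
struct entry_writer_ops {
	unsigned int num_qwords;
	/* STE vs CD specific: which bits may the HW read for this config? */
	void (*get_used)(const __le64 *entry, __le64 *used);
	/* STE vs CD specific: issue CFGI_STE vs CFGI_CD and wait */
	void (*sync)(void *cookie);
};

/* One shared write algorithm, driven entirely through the ops */
void entry_write(const struct entry_writer_ops *ops, void *cookie,
		 __le64 *cur, const __le64 *target);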

> +       arm_smmu_get_ste_used(target, &target_used);
> +       arm_smmu_get_ste_used(cur, &cur_used);
> +       if (!hitless_possible(target, target_used, cur_used, cur_used)) {

hitless_possible() requires the same loop as the step function to calculate
it.
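
To spell that out, here is a rough sketch of how the check falls out of the
same per-qword scan (the name and signature are made up for illustration):

static bool hitless_possible(const __le64 *cur, const __le64 *cur_used,
			     const __le64 *target, const __le64 *target_used,
			     unsigned int num_qwords)
{
	unsigned int critical_qwords = 0;
	unsigned int i;

	for (i = 0; i != num_qwords; i++) {
		/* Value of this qword after the pass that fills in unused bits */
		__le64 after_unused = (cur[i] & cur_used[i]) |
				      (target[i] & ~cur_used[i]);

		/* Does a bit the HW is still reading have to change after that? */
		if ((after_unused ^ target[i]) & target_used[i])
			critical_qwords++;
	}

	/* At most one critical qword means a single 64-bit store finishes it */
	return critical_qwords <= 1;
}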

> +               target->data[0] = STRTAB_STE_0_V;
> +               arm_smmu_sync_ste_for_sid(smmu, sid);

I still like V=0 as I think we do want the C_BAD_STE event for this case.

> +               /*
> +                * The STE is now in abort where none of the bits except
> +                * STRTAB_STE_0_V and STRTAB_STE_0_CFG are accessed. This allows
> +                * all other words of the STE to be written without further
> +                * disruption.
> +                */
> +               arm_smmu_get_ste_used(cur, &cur_used);
> +       }
> +       /* write bits in all positions unused by the STE */
> +       for (i = 0; i != ARRAY_SIZE(cur->data); i++) {
> +               /* (should probably optimize this away if no write needed) */
> +               WRITE_ONCE(cur->data[i], (cur->data[i] & cur_used[i]) | (target->data[i] & ~cur_used[i]));
> +       }
> +       arm_smmu_sync_ste_for_sid(smmu, sid);

Yes, I wanted to avoid all the syncs if they are not required.
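
Concretely, each pass can report whether it actually wrote anything so the
caller only issues the CFGI when the pass was not a no-op. Something along
these lines (a sketch only, not what the patch does verbatim):

static bool entry_set_unused_bits(__le64 *cur, const __le64 *cur_used,
				  const __le64 *target, unsigned int num_qwords)
{
	bool changed = false;
	unsigned int i;

	for (i = 0; i != num_qwords; i++) {
		/* Keep every bit the HW currently reads, update the rest */
		__le64 val = (cur[i] & cur_used[i]) |
			     (target[i] & ~cur_used[i]);

		if (val != cur[i]) {
			WRITE_ONCE(cur[i], val);
			changed = true;
		}
	}
	return changed;
}

/* caller only pays for the sync when the pass did something */
if (entry_set_unused_bits(cur->data, cur_used, target->data,
			  ARRAY_SIZE(cur->data)))
	arm_smmu_sync_ste_for_sid(smmu, sid);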

> +       /*
> +        * It should now be possible to make a single qword write to make
> +        * the new configuration take effect.
> +        */
> +       for (i = 0; i != ARRAY_SIZE(cur->data); i++) {
> +               if ((cur->data[i] & target_used[i]) != (target->data[i] & target_used[i]))
> +                       /*
> +                        * TODO:
> +                        * WARN_ONCE if this condition hits more than once in
> +                        * the loop
> +                        */
> +                       WRITE_ONCE(cur->data[i], target->data[i]);
> +       }

> +       arm_smmu_sync_ste_for_sid(smmu, sid);

This sync needs to be optional too.

And there is another optional 4th pass to set the bits the target no longer
uses back to 0.
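
Roughly like this (sketch only, and it assumes the make function only sets
bits that the target actually uses): once the critical qword has flipped the
entry to the new config, the remaining qwords can simply be set to the target
values, again only syncing if something changed:

static bool entry_clear_unused_bits(__le64 *cur, const __le64 *target,
				    unsigned int num_qwords)
{
	bool changed = false;
	unsigned int i;

	for (i = 0; i != num_qwords; i++) {
		/*
		 * Every bit the new config uses already matches target, and
		 * target is assumed to have 0 in the positions it does not
		 * use, so this write only clears bits the HW is ignoring.
		 */
		if (cur[i] != target[i]) {
			WRITE_ONCE(cur[i], target[i]);
			changed = true;
		}
	}
	return changed;
}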

Basically you have captured the core algorithm, but I think that once you
fill in all the missing bits to reach the same functionality it will be
longer and not sharable with the CD side.

You could perhaps take this approach and split it into 4 sharable step
functions:

 if (step1(target, target_used, cur, cur_used, len)) {
  arm_smmu_sync_ste_for_sid(smmu, sid);
  arm_smmu_get_ste_used(cur, &cur_used);
 }

 if (step2(target, target_used, cur, cur_used, len))
  arm_smmu_sync_ste_for_sid(smmu, sid);

 if (step3(target, target_used, cur, cur_used, len)) {
  arm_smmu_sync_ste_for_sid(smmu, sid);
  arm_smmu_get_ste_used(cur, &cur_used);
 }

 if (step4(target, target_used, cur, cur_used, len))
  arm_smmu_sync_ste_for_sid(smmu, sid);

To me this is inelegant: if we only need to do step 3, we have to redundantly
scan the array two times. The rolled-up version goes directly to step 3.

However this does convince me you've thought very carefully about this
and have not found a flaw in the design!

Thanks,
Jason


