[PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers

Jason Gunthorpe jgg at nvidia.com
Sun Dec 17 05:03:55 PST 2023


On Sat, Dec 16, 2023 at 04:26:48AM +0800, Michael Shavit wrote:

> Ok, I took a proper stab at trying to unroll the loop on the github
> version of this patch (v3+).
> As you suspected, it's not easy to re-use the unrolled version for
> both STE and CD writing as we'd have to pass in callbacks for syncing
> the STE/CD and recomputing arm_smmu_{get_ste/cd}_used. 

Yes, that is why I structured it as an iterator.

> I personally find this quite a bit more readable as a sequential
> series of steps instead of a loop. It also only requires 3 STE/CD
> syncs in the worst case compared to 4 in the loop version since we

The existing version is at most 3 as well; it works the same way,
checking the number of critical qwords after computing the first step
of the hitless flow.

However, what you have below has the same problem as the first
sketch: it always does 3 syncs. The existing version fully minimizes
the syncs. That is why it is so complex to unroll; you have to check
before every sync whether that sync is needed at all.
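
To make that concrete, each step the loop takes has roughly this shape
(a sketch only; arm_smmu_ste_step() is a name I made up here, built on
the arm_smmu_ste_set_unused_bits() helper from your patch):

static bool arm_smmu_ste_step(struct arm_smmu_ste *ste,
                              const struct arm_smmu_ste *target)
{
        struct arm_smmu_ste step = *ste;
        bool changed = false;
        int i;

        /* Set every bit the HW is not currently reading to its target value */
        arm_smmu_ste_set_unused_bits(&step, target);

        for (i = 0; i != ARRAY_SIZE(ste->data); i++) {
                if (step.data[i] != ste->data[i]) {
                        WRITE_ONCE(ste->data[i], step.data[i]);
                        changed = true;
                }
        }
        /* The caller issues a sync only when something was written */
        return changed;
}

The writer repeats steps like this (with the critical qword / V=0
handling in between) and skips the sync whenever a step returns false,
which is how the existing version keeps the sync count minimal.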

This could probably be organized so that one shared function computes
the "plan" and then the STE/CD copy paths each execute that plan. It
avoids the loop, but all the code is still basically the same; there
is just more of it.
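
Roughly, with all the names below invented just to show the shape:

struct arm_smmu_entry_plan {
        /* Bitmask of qwords that can be written before the critical qword */
        u8 pre_qwords;
        /* Index of the single qword whose write switches configurations */
        int crit_qword;
        /* false means the entry must be torn down (V=0) first */
        bool hitless;
};

/* Shared by the STE and CD paths */
static void arm_smmu_compute_entry_plan(const __le64 *cur,
                                        const __le64 *target,
                                        const __le64 *target_used,
                                        unsigned int num_qwords,
                                        struct arm_smmu_entry_plan *plan);

The STE copy would then execute the plan using
arm_smmu_sync_ste_for_sid() and the CD copy using the CD sync, with
each phase skipped when it has nothing to write so the sync count
stays minimal.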

I'm fine with any of these ways.

Jason

>  static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
>                                struct arm_smmu_ste *ste,
>                                const struct arm_smmu_ste *target)
>  {
> +       struct arm_smmu_ste staging_ste;
>         struct arm_smmu_ste target_used;
> +       u8 staging_used_diff = 0;
> +       int i = 0;
> 
> +       /*
> +        * Compute a staging ste that has all the bits currently unused by HW
> +        * set to their target values, such that committing it to the STE table
> +        * wouldn't disrupt the hardware.
> +        */
> +       memcpy(&staging_ste, ste, sizeof(staging_ste));
> +       arm_smmu_ste_set_unused_bits(&staging_ste, target);
> +
> +       /*
> +        * Determine if it's possible to reach the target configuration from the
> +        * staged configuration in a single qword write (ignoring bits that are
> +        * unused under the target configuration).
> +        */
>         arm_smmu_get_ste_used(target, &target_used);
> -               WARN_ON_ONCE(target->data[i] & ~target_used.data[i]);
> +       /*
> +        * But first sanity check that masks in arm_smmu_get_ste_used() are up
> +        * to date.
> +        */
> +       for (i = 0; i != ARRAY_SIZE(target->data); i++) {
> +               if (WARN_ON_ONCE(target->data[i] & ~target_used.data[i]))
> +                       target_used.data[i] |= target->data[i];
> +       }
> 
> +       for (i = 0; i != ARRAY_SIZE(staging_ste.data); i++) {
> +               /*
> +                * Each set bit in staging_used_diff is the index of a qword
> +                * within the staged STE that does not match the target,
> +                * considering only the bits the target actually uses.
> +                */
> +               if ((staging_ste.data[i] & target_used.data[i]) !=
> +                   target->data[i])
> +                       staging_used_diff |= 1 << i;
> +       }
> +       if (hweight8(staging_used_diff) > 1) {
> +               /*
> +                * More than 1 qword is mismatched and a hitless transition from
> +                * the current ste to the target ste is not possible. Clear the
> +                * V bit and recompute the staging STE.
> +                * Because the V bit is cleared, the staging STE will be equal
> +                * to the target STE except for the first qword.
> +                */
> +               staging_ste.data[0] &= ~cpu_to_le64(STRTAB_STE_0_V);
> +               arm_smmu_ste_set_unused_bits(&staging_ste, target);
> +               staging_used_diff = 1;
> +       }
> +
> +       /*
> +        * Commit the staging STE.
> +        */
> +       for (i = 0; i != ARRAY_SIZE(staging_ste.data); i++)
> +               WRITE_ONCE(ste->data[i], staging_ste.data[i]);
> +       arm_smmu_sync_ste_for_sid(smmu, sid);
> +
> +       /*
> +        * It's now possible to switch to the target configuration with a write
> +        * to a single qword. Make that switch now.
> +        */
> +       i = ffs(staging_used_diff) - 1;
> +       WRITE_ONCE(ste->data[i], target->data[i]);
> +       arm_smmu_sync_ste_for_sid(smmu, sid);

The non-hitless flow doesn't look right to me: it should set V=0,
then load all qwords but 0, then load qword 0, in exactly that
sequence. If the goal is clarity then the two programming flows
should be explicitly spelled out.
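
ie for the non-hitless case, spelled out explicitly (a sketch only,
reusing the helpers and locals from your patch):

        /* 1) Clear V so the HW stops interpreting the STE */
        WRITE_ONCE(ste->data[0],
                   ste->data[0] & ~cpu_to_le64(STRTAB_STE_0_V));
        arm_smmu_sync_ste_for_sid(smmu, sid);

        /* 2) Load every qword except 0 while the STE is invalid */
        for (i = 1; i != ARRAY_SIZE(ste->data); i++)
                WRITE_ONCE(ste->data[i], target->data[i]);
        arm_smmu_sync_ste_for_sid(smmu, sid);

        /* 3) Load qword 0 last to make the new configuration visible */
        WRITE_ONCE(ste->data[0], target->data[0]);
        arm_smmu_sync_ste_for_sid(smmu, sid);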

Jason


