[PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
Michael Shavit
mshavit at google.com
Fri Oct 20 01:23:44 PDT 2023
On Wed, Oct 18, 2023 at 9:05 PM Jason Gunthorpe <jgg at nvidia.com> wrote:
>
> On Wed, Oct 18, 2023 at 07:05:49PM +0800, Michael Shavit wrote:
>
> > > FWIW, I found the most difficult part the used bit calculation, not
> > > the update algorithm. Difficult because it is hard to read and find in
> > > the spec when things are IGNORED, but it is a "straightforward" job of
> > > finding IGNORED cases and making the used bits 0.
> >
> > The update algorithm is the part I'm finding much harder to read and
> > review :) . arm_smmu_write_entry_step in particular is hard to read
> > through; on top of which there's some subtle dependencies between loop
> > iterations that weren't obvious to grok:
>
> Yes, you have it right, it is basically a classic greedy
> algorithm. Let's improve the comment.
>
> > * Relying on the used_bits to be recomputed after the first iteration,
> > where V was set to 0, so that more bits can now be set.
> > * The STE having to be synced between iterations to prevent broken STE
> > reads by the SMMU (there's a comment somewhere else in arm-smmu-v3.c
> > that would fit nicely here instead). But the caller is responsible for
> > calling this between iterations for some reason (supposedly to support
> > CD entries as well in the next series)
>
> Yes, for CD entry support.
>
> How about:
>
> /*
> * This algorithm updates any STE/CD to any value without creating a situation
> * where the HW can perceive a corrupted entry. HW is only required to have 64
> * bit atomicity with stores from the CPU, while entries are multiple 64 bit
> * values long.
> *
> * The algorithm works by evolving the entry toward the target in a series of
> * steps. Each step synchronizes with the HW so that the HW can not see an entry
> * torn across two steps. Upon each call cur/cur_used reflect the current
> * synchronized value seen by the HW.
> *
> * During each step the HW can observe a torn entry that has any combination of
> * the step's old/new 64 bit words. The algorithm objective is for the HW
> * behavior to always be one of current behavior, V=0, or new behavior, during
> * each step, and across all steps.
> *
> * At each step, one of three actions is chosen to evolve cur to target:
> * - Update all unused bits with their target values.
> * This relies on the IGNORED behavior described in the specification
> * - Update a single 64-bit value
> * - Update all unused bits and set V=0
> *
> * The last two actions will cause cur_used to change, which will then allow the
> * first action on the next step.
> *
> * In the most general case we can make any update in three steps:
> * - Disrupting the entry (V=0)
> * - Fill now unused bits, all bits except V
> * - Make valid (V=1), single 64 bit store
> *
> * However this disrupts the HW while it is happening. There are several
> * interesting cases where a STE/CD can be updated without disturbing the HW
> * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
> * because the used bits don't intersect. We can detect this by calculating how
> * many 64 bit values need update after adjusting the unused bits and skip the
> * V=0 process.
> */
> static bool arm_smmu_write_entry_step(__le64 *cur, const __le64 *cur_used,
>
> Jason
The comment helps a lot, thank you.
I do still have some final reservations: wouldn't it be clearer with
the loop unrolled? After all, it's only 3 steps in the worst case.
Something like:
+ arm_smmu_get_ste_used(target, target_used);
+ arm_smmu_get_ste_used(cur, cur_used);
+ if (!hitless_possible(target, target_used, cur, cur_used)) {
+         cur->data[0] = STRTAB_STE_0_V;
+         arm_smmu_sync_ste_for_sid(smmu, sid);
+         /*
+          * The STE is now in abort where none of the bits except
+          * STRTAB_STE_0_V and STRTAB_STE_0_CFG are accessed. This allows
+          * all other words of the STE to be written without further
+          * disruption.
+          */
+         arm_smmu_get_ste_used(cur, cur_used);
+ }
+ /* Write bits in all positions unused by the current STE. */
+ for (i = 0; i != ARRAY_SIZE(cur->data); i++) {
+         /* (this store could be skipped when nothing would change) */
+         WRITE_ONCE(cur->data[i], (cur->data[i] & cur_used[i]) |
+                                  (target->data[i] & ~cur_used[i]));
+ }
+ arm_smmu_sync_ste_for_sid(smmu, sid);
+ /*
+  * It should now be possible to make a single qword write to make the
+  * new configuration take effect.
+  */
+ for (i = 0; i != ARRAY_SIZE(cur->data); i++) {
+         if ((cur->data[i] & target_used[i]) !=
+             (target->data[i] & target_used[i])) {
+                 /*
+                  * TODO: WARN_ONCE if this condition hits more than once
+                  * in the loop.
+                  */
+                 WRITE_ONCE(cur->data[i], target->data[i]);
+         }
+ }
+ arm_smmu_sync_ste_for_sid(smmu, sid);
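
(For clarity, hitless_possible() above isn't anything from the patch; it's
just a name for the check I have in mind: skip the V=0 detour when, after
filling every bit position the current configuration ignores, at most one
qword still differs in bits the HW actually looks at. Roughly, reusing the
types from the snippet above:)

static bool hitless_possible(const struct arm_smmu_ste *target,
                             const __le64 *target_used,
                             const struct arm_smmu_ste *cur,
                             const __le64 *cur_used)
{
        unsigned int i, changed = 0;

        for (i = 0; i != ARRAY_SIZE(cur->data); i++) {
                /* Value this qword would hold after the "unused bits" pass */
                __le64 filled = (cur->data[i] & cur_used[i]) |
                                (target->data[i] & ~cur_used[i]);

                /* Does a bit the target configuration cares about still differ? */
                if ((filled ^ target->data[i]) & target_used[i])
                        changed++;
        }
        /* A single 64-bit store is atomic for the HW, so one qword is fine */
        return changed <= 1;
}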
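
(And to double check my reading of the used-bits side: this is not the
arm_smmu_get_ste_used() from the patch, just a stripped-down illustration of
the "IGNORED fields contribute 0 to used_bits" idea, showing only V, CFG and
a couple of the stage-1 fields:)

static void example_ste_used_bits(const struct arm_smmu_ste *ent,
                                  __le64 *used_bits)
{
        memset(used_bits, 0, ARRAY_SIZE(ent->data) * sizeof(*used_bits));

        used_bits[0] = cpu_to_le64(STRTAB_STE_0_V);
        if (!(ent->data[0] & cpu_to_le64(STRTAB_STE_0_V)))
                return; /* V=0: the HW ignores every other field */

        used_bits[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
        switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent->data[0]))) {
        case STRTAB_STE_0_CFG_ABORT:
        case STRTAB_STE_0_CFG_BYPASS:
                /* Nothing beyond V and CFG is looked at */
                break;
        case STRTAB_STE_0_CFG_S1_TRANS:
                /* CD table pointer fields only matter once stage 1 is on */
                used_bits[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
                                            STRTAB_STE_0_S1CTXPTR_MASK |
                                            STRTAB_STE_0_S1CDMAX);
                used_bits[1] |= cpu_to_le64(STRTAB_STE_1_S1DSS);
                break;
        /* stage 2 and nested cases would mark their own fields here */
        }
}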