[PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
Michael Shavit
mshavit at google.com
Fri Oct 20 01:23:44 PDT 2023
On Wed, Oct 18, 2023 at 9:05 PM Jason Gunthorpe <jgg at nvidia.com> wrote:
>
> On Wed, Oct 18, 2023 at 07:05:49PM +0800, Michael Shavit wrote:
>
> > > FWIW, I found the most difficult part the used bit calculation, not
> > > the update algorithm. Difficult because it is hard to read and find in
> > > the spec when things are IGNORED, but it is a "straightforward" job of
> > > finding IGNORED cases and making the used bits 0.
> >
> > The update algorithm is the part I'm finding much harder to read and
> > review :) . arm_smmu_write_entry_step in particular is hard to read
> > through; on top of which there's some subtle dependencies between loop
> > iterations that weren't obvious to grok:
>
> Yes, you have it right, it is basically a classic greedy
> algorithm. Let's improve the comment.
>
> > * Relying on the used_bits to be recomputed after the first iteration,
> > where V was set to 0, so that more bits can now be set.
> > * The STE having to be synced between iterations to prevent broken STE
> > reads by the SMMU (there's a comment somewhere else in arm-smmu-v3.c
> > that would fit nicely here instead). But the caller is responsible for
> > calling this between iterations for some reason (supposedly to support
> > CD entries as well in the next series)
>
> Yes, for CD entry support.
>
> How about:
>
> /*
> * This algorithm updates any STE/CD to any value without creating a situation
> * where the HW can perceive a corrupted entry. HW is only required to have 64
> * bit atomicity with stores from the CPU, while entries are multiple 64 bit
> * values long.
> *
> * The algorithm works by evolving the entry toward the target in a series of
> * steps. Each step synchronizes with the HW so that the HW can not see an entry
> * torn across two steps. Upon each call cur/cur_used reflect the current
> * synchronized value seen by the HW.
> *
> * During each step the HW can observe a torn entry that has any combination of
> * the step's old/new 64 bit words. The algorithm objective is for the HW
> * behavior to always be one of current behavior, V=0, or new behavior, during
> * each step, and across all steps.
> *
> * At each step, one of three actions is chosen to evolve cur to target:
> * - Update all unused bits with their target values.
> * This relies on the IGNORED behavior described in the specification
> * - Update a single 64-bit value
> * - Update all unused bits and set V=0
> *
> * The last two actions will cause cur_used to change, which will then allow the
> * first action on the next step.
> *
> * In the most general case we can make any update in three steps:
> * - Disrupting the entry (V=0)
> * - Fill now unused bits, all bits except V
> * - Make valid (V=1), single 64 bit store
> *
> * However this disrupts the HW while it is happening. There are several
> * interesting cases where a STE/CD can be updated without disturbing the HW
> * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
> * because the used bits don't intersect. We can detect this by calculating how
> * many 64 bit values need update after adjusting the unused bits and skip the
> * V=0 process.
> */
> static bool arm_smmu_write_entry_step(__le64 *cur, const __le64 *cur_used,
>
> Jason
The comment helps a lot, thank you.
I do still have some final reservations: wouldn't it be clearer with
the loop unrolled? After all, it's only 3 steps in the worst case.
Something like:
+ arm_smmu_get_ste_used(target, target_used);
+ arm_smmu_get_ste_used(cur, cur_used);
+ if (!hitless_possible(target, target_used, cur, cur_used)) {
+         cur->data[0] = STRTAB_STE_0_V;
+         arm_smmu_sync_ste_for_sid(smmu, sid);
+         /*
+          * The STE is now in abort where none of the bits except
+          * STRTAB_STE_0_V and STRTAB_STE_0_CFG are accessed. This allows
+          * all other words of the STE to be written without further
+          * disruption.
+          */
+         arm_smmu_get_ste_used(cur, cur_used);
+ }
+ /* Write bits in all positions unused by the current STE. */
+ for (i = 0; i != ARRAY_SIZE(cur->data); i++) {
+         /* (this store could be skipped when nothing would change) */
+         WRITE_ONCE(cur->data[i], (cur->data[i] & cur_used[i]) |
+                                  (target->data[i] & ~cur_used[i]));
+ }
+ arm_smmu_sync_ste_for_sid(smmu, sid);
+ /*
+  * It should now be possible to make a single qword write to make the
+  * new configuration take effect.
+  */
+ for (i = 0; i != ARRAY_SIZE(cur->data); i++) {
+         if ((cur->data[i] & target_used[i]) !=
+             (target->data[i] & target_used[i])) {
+                 /*
+                  * TODO: WARN_ONCE if this condition hits more than once
+                  * in the loop.
+                  */
+                 WRITE_ONCE(cur->data[i], target->data[i]);
+         }
+ }
+ arm_smmu_sync_ste_for_sid(smmu, sid);
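
(For clarity, hitless_possible() above isn't anything from the patch; it's
just a name for the check I have in mind: skip the V=0 detour when, after
filling every bit position the current configuration ignores, at most one
qword still differs in bits the HW actually looks at. Roughly, reusing the
types from the snippet above:)

static bool hitless_possible(const struct arm_smmu_ste *target,
                             const __le64 *target_used,
                             const struct arm_smmu_ste *cur,
                             const __le64 *cur_used)
{
        unsigned int i, changed = 0;

        for (i = 0; i != ARRAY_SIZE(cur->data); i++) {
                /* Value this qword would hold after the "unused bits" pass */
                __le64 filled = (cur->data[i] & cur_used[i]) |
                                (target->data[i] & ~cur_used[i]);

                /* Does a bit the target configuration cares about still differ? */
                if ((filled ^ target->data[i]) & target_used[i])
                        changed++;
        }
        /* A single 64-bit store is atomic for the HW, so one qword is fine */
        return changed <= 1;
}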
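
(And to double check my reading of the used-bits side: this is not the
arm_smmu_get_ste_used() from the patch, just a stripped-down illustration of
the "IGNORED fields contribute 0 to used_bits" idea, showing only V, CFG and
a couple of the stage-1 fields:)

static void example_ste_used_bits(const struct arm_smmu_ste *ent,
                                  __le64 *used_bits)
{
        memset(used_bits, 0, ARRAY_SIZE(ent->data) * sizeof(*used_bits));

        used_bits[0] = cpu_to_le64(STRTAB_STE_0_V);
        if (!(ent->data[0] & cpu_to_le64(STRTAB_STE_0_V)))
                return; /* V=0: the HW ignores every other field */

        used_bits[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
        switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent->data[0]))) {
        case STRTAB_STE_0_CFG_ABORT:
        case STRTAB_STE_0_CFG_BYPASS:
                /* Nothing beyond V and CFG is looked at */
                break;
        case STRTAB_STE_0_CFG_S1_TRANS:
                /* CD table pointer fields only matter once stage 1 is on */
                used_bits[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
                                            STRTAB_STE_0_S1CTXPTR_MASK |
                                            STRTAB_STE_0_S1CDMAX);
                used_bits[1] |= cpu_to_le64(STRTAB_STE_1_S1DSS);
                break;
        /* stage 2 and nested cases would mark their own fields here */
        }
}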