[PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers
Michael Shavit
mshavit at google.com
Mon Dec 18 04:35:30 PST 2023
On Sun, Dec 17, 2023 at 9:03 PM Jason Gunthorpe <jgg at nvidia.com> wrote:
>
> On Sat, Dec 16, 2023 at 04:26:48AM +0800, Michael Shavit wrote:
>
> > Ok, I took a proper stab at trying to unroll the loop on the github
> > version of this patch (v3+)
> > As you suspected, it's not easy to re-use the unrolled version for
> > both STE and CD writing as we'd have to pass in callbacks for syncing
> > the STE/CD and recomputing arm_smmu_{get_ste/cd}_used.
>
> Yes, that is why I structured it as an iterator
On second thought, perhaps defining a helper class implementing
entry_sync() and entry_get_used_bits() might not be so bad?
It's a little bit more verbose, but avoids duplicating the
complicated parts.
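Roughly something along these lines (only a sketch; every name below is
invented for illustration):

struct arm_smmu_entry_writer;

/*
 * Per-entry-type callbacks so one shared routine can program both STEs
 * and CDs. Sketch only, not compile-tested.
 */
struct arm_smmu_entry_writer_ops {
	unsigned int num_entry_qwords;
	__le64 v_bit;
	/* Recompute the bits the HW currently inspects in this entry. */
	void (*entry_get_used_bits)(const __le64 *entry, __le64 *used);
	/* Issue the matching sync: CMD_CFGI_STE for STEs, a CD sync for CDs. */
	void (*entry_sync)(struct arm_smmu_entry_writer *writer);
};

struct arm_smmu_entry_writer {
	const struct arm_smmu_entry_writer_ops *ops;
	struct arm_smmu_master *master;
};

The shared write routine would then take an arm_smmu_entry_writer plus the
cur/target qword arrays and only call back through the ops for the two
entry-specific pieces.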
> > I personally find this quite a bit more readable as a sequential
> > series of steps instead of a loop. It also only requires 3 STE/CD
> > syncs in the worst case compared to 4 in the loop version since we
>
> The existing version is max 3 as well, it works the same by
> checking the number of critical qwords after computing the first step
> in the hitless flow.
Gotcha. I was being lazy and assuming it was 4 based on the warning
added to the github version :).
>
> However, what you have below has the same problem as the first sketch,
> it always does 3 syncs. The existing version fully minimizes the
> syncs. It is why it is so complex to unroll it as you have to check
> before every sync if the sync is needed at all.
Hmmmm, AFAICT there are two optimizations I was missing:
1. The final clean-up loop may be a nop, in which case a sync isn't required.
2. We may be able to directly transition to the target state with a
single qword write from the very beginning, without going through any
intermediate STEs.
Both of these seem pretty easy to address in this version as well,
though; or am I still overlooking a scenario?
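Concretely, I think both can be handled with two small checks on top of
the version below (untested sketch; cleanup_needed is a new local):

	bool cleanup_needed = false;

	/*
	 * (2) Skip the staging commit and its sync entirely when filling the
	 * unused bits changed nothing, so the transition can start with the
	 * single-qword write directly.
	 */
	if (memcmp(&staging_ste, ste, sizeof(staging_ste)) != 0) {
		for (i = 0; i != ARRAY_SIZE(staging_ste.data); i++)
			WRITE_ONCE(ste->data[i], staging_ste.data[i]);
		arm_smmu_sync_ste_for_sid(smmu, sid);
	}

	...

	/* (1) Only sync the final clean-up pass if it actually wrote something. */
	for (i = 0; i != ARRAY_SIZE(ste->data); i++) {
		if (ste->data[i] != target->data[i]) {
			WRITE_ONCE(ste->data[i], target->data[i]);
			cleanup_needed = true;
		}
	}
	if (cleanup_needed)
		arm_smmu_sync_ste_for_sid(smmu, sid);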
>
> This could probably be organized like this so one shared function
> computes the "plan" and then a cd/ste copy executes the plan. It
> avoids the loop but all the code is still basically the same, there is
> just more of it.
>
> I'm fine with any of these ways
>
> Jason
>
> > static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
> > struct arm_smmu_ste *ste,
> > const struct arm_smmu_ste *target)
> > {
> > + struct arm_smmu_ste staging_ste;
> > struct arm_smmu_ste target_used;
> > + int writes_required = 0;
> > + u8 staging_used_diff = 0;
> > + int i = 0;
> >
> > + /*
> > + * Compute a staging ste that has all the bits currently unused by HW
> > + * set to their target values, such that committing it to the ste table
> > + * wouldn't disrupt the hardware.
> > + */
> > + memcpy(&staging_ste, ste, sizeof(staging_ste));
> > + arm_smmu_ste_set_unused_bits(&staging_ste, target);
> > +
> > + /*
> > + * Determine if it's possible to reach the target configuration from the
> > + * staged configuration in a single qword write (ignoring bits that are
> > + * unused under the target configuration).
> > + */
> > arm_smmu_get_ste_used(target, &target_used);
> > - WARN_ON_ONCE(target->data[i] & ~target_used.data[i]);
> > + /*
> > + * But first sanity check that masks in arm_smmu_get_ste_used() are up
> > + * to date.
> > + */
> > + for (i = 0; i != ARRAY_SIZE(target->data); i++) {
> > + if (WARN_ON_ONCE(target->data[i] & ~target_used.data[i]))
> > + target_used.data[i] |= target->data[i];
> > + }
> >
> > + for (i = 0; i != ARRAY_SIZE(staging_ste.data); i++) {
> > + /*
> > + * Each bit of staging_used_diff indicates the index of a qword
> > + * within the staged ste that is incorrect compared to the
> > + * target, considering only the used bits in the target
> > + */
> > + if ((staging_ste.data[i] &
> > + target_used.data[i]) != (target->data[i]))
> > + staging_used_diff |= 1 << i;
> > + }
> > + if (hweight8(staging_used_diff) > 1) {
> > + /*
> > + * More than 1 qword is mismatched and a hitless transition from
> > + * the current ste to the target ste is not possible. Clear the
> > + * V bit and recompute the staging STE.
> > + * Because the V bit is cleared, the staging STE will be equal
> > + * to the target STE except for the first qword.
> > + */
> > + staging_ste.data[0] &= ~cpu_to_le64(STRTAB_STE_0_V);
> > + arm_smmu_ste_set_unused_bits(&staging_ste, target);
> > + staging_used_diff = 1;
> > + }
> > +
> > + /*
> > + * Commit the staging STE.
> > + */
> > + for (i = 0; i != ARRAY_SIZE(staging_ste.data); i++)
> > + WRITE_ONCE(ste->data[i], staging_ste.data[i]);
> > + arm_smmu_sync_ste_for_sid(smmu, sid);
> > +
> > + /*
> > + * It's now possible to switch to the target configuration with a write
> > + * to a single qword. Make that switch now.
> > + */
> > + i = ffs(staging_used_diff) - 1;
> > + WRITE_ONCE(ste->data[i], target->data[i]);
> > + arm_smmu_sync_ste_for_sid(smmu, sid);
>
> The non hitless flow doesn't look right to me, it should set v=0 then
> load all qwords but 0 then load 0, in exactly that sequence. If the
> goal is clarity then the two programming flows should be explicitly
> spelled out.
Ah yeah you're right. I forgot to set the other qwords in the staging
STE to have their final values in the non-hitless case.
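So the non-hitless path ends up doing exactly the sequence you describe;
rough outline (not literal code, commit_and_sync() stands in for the
WRITE_ONCE loop plus arm_smmu_sync_ste_for_sid() below):

	/* Sync 1: V=0, with every other qword already at its final value,
	 * since with V=0 the rest of the STE is IGNORED. */
	staging_ste.data[0] &= ~cpu_to_le64(STRTAB_STE_0_V);
	arm_smmu_ste_set_unused_bits(&staging_ste, target);
	commit_and_sync(&staging_ste);

	/* Sync 2: a single 64-bit store of qword 0 makes the entry valid again. */
	WRITE_ONCE(ste->data[0], target->data[0]);
	arm_smmu_sync_ste_for_sid(smmu, sid);

	/* Sync 3: the final clean-up pass has nothing left to write here. */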
The following should address the bug and optimization you pointed out
with minimal adjustments:
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 98aa8cc17b58b..1c35599d944d7 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -971,100 +971,6 @@ void arm_smmu_tlb_inv_asid(struct arm_smmu_device *smmu, u16 asid)
arm_smmu_cmdq_issue_cmd_with_sync(smmu, &cmd);
}
-/*
- * This algorithm updates any STE/CD to any value without creating a situation
- * where the HW can percieve a corrupted entry. HW is only required to have a 64
- * bit atomicity with stores from the CPU, while entries are many 64 bit values
- * big.
- *
- * The algorithm works by evolving the entry toward the target in a series of
- * steps. Each step synchronizes with the HW so that the HW can not see an entry
- * torn across two steps. Upon each call cur/cur_used reflect the current
- * synchronized value seen by the HW.
- *
- * During each step the HW can observe a torn entry that has any combination of
- * the step's old/new 64 bit words. The algorithm objective is for the HW
- * behavior to always be one of current behavior, V=0, or new behavior, during
- * each step, and across all steps.
- *
- * At each step one of three actions is chosen to evolve cur to target:
- * - Update all unused bits with their target values.
- * This relies on the IGNORED behavior described in the specification
- * - Update a single 64-bit value
- * - Update all unused bits and set V=0
- *
- * The last two actions will cause cur_used to change, which will then allow the
- * first action on the next step.
- *
- * In the most general case we can make any update in three steps:
- * - Disrupting the entry (V=0)
- * - Fill now unused bits, all bits except V
- * - Make valid (V=1), single 64 bit store
- *
- * However this disrupts the HW while it is happening. There are several
- * interesting cases where a STE/CD can be updated without disturbing the HW
- * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
- * because the used bits don't intersect. We can detect this by calculating how
- * many 64 bit values need update after adjusting the unused bits and skip the
- * V=0 process.
- */
-static bool arm_smmu_write_entry_next(__le64 *cur, const __le64 *cur_used,
- const __le64 *target,
- const __le64 *target_used, __le64 *step,
- __le64 v_bit, unsigned int len)
-{
- u8 step_used_diff = 0;
- u8 step_change = 0;
- unsigned int i;
-
- /*
- * Compute a step that has all the bits currently unused by HW set to
- * their target values.
- */
- for (i = 0; i != len; i++) {
- step[i] = (cur[i] & cur_used[i]) | (target[i] & ~cur_used[i]);
- if (cur[i] != step[i])
- step_change |= 1 << i;
- /*
- * Each bit indicates if the step is incorrect compared to the
- * target, considering only the used bits in the target
- */
- if ((step[i] & target_used[i]) != (target[i] & target_used[i]))
- step_used_diff |= 1 << i;
- }
-
- if (hweight8(step_used_diff) > 1) {
- /*
- * More than 1 qword is mismatched, this cannot be done without
- * a break. Clear the V bit and go again.
- */
- step[0] &= ~v_bit;
- } else if (!step_change && step_used_diff) {
- /*
- * Have exactly one critical qword, all the other qwords are set
- * correctly, so we can set this qword now.
- */
- i = ffs(step_used_diff) - 1;
- step[i] = target[i];
- } else if (!step_change) {
- /* cur == target, so all done */
- if (memcmp(cur, target, len * sizeof(*cur)) == 0)
- return true;
-
- /*
- * All the used HW bits match, but unused bits are different.
- * Set them as well. Technically this isn't necessary but it
- * brings the entry to the full target state, so if there are
- * bugs in the mask calculation this will obscure them.
- */
- memcpy(step, target, len * sizeof(*step));
- }
-
- for (i = 0; i != len; i++)
- WRITE_ONCE(cur[i], step[i]);
- return false;
-}
-
static void arm_smmu_sync_cd(struct arm_smmu_master *master,
int ssid, bool leaf)
{
@@ -1398,39 +1304,135 @@ static void arm_smmu_get_ste_used(const struct arm_smmu_ste *ent,
}
}
-static bool arm_smmu_write_ste_next(struct arm_smmu_ste *cur,
- const struct arm_smmu_ste *target,
- const struct arm_smmu_ste *target_used)
+/*
+ * Make bits of the current ste that aren't in use by the hardware equal to the
+ * target's bits.
+ */
+static void arm_smmu_ste_set_unused_bits(
+ struct arm_smmu_ste *cur,
+ const struct arm_smmu_ste *target)
{
struct arm_smmu_ste cur_used;
- struct arm_smmu_ste step;
+ int i = 0;
arm_smmu_get_ste_used(cur, &cur_used);
- return arm_smmu_write_entry_next(cur->data, cur_used.data, target->data,
- target_used->data, step.data,
- cpu_to_le64(STRTAB_STE_0_V),
- ARRAY_SIZE(cur->data));
+ for (i = 0; i < ARRAY_SIZE(cur->data); i++)
+ cur->data[i] = (cur->data[i] & cur_used.data[i]) |
+ (target->data[i] & ~cur_used.data[i]);
}
+/*
+ * Update the STE to the target configuration. The transition from the current
+ * STE to the target STE takes place over multiple steps that attempt to make
+ * the transition hitless if possible. This function takes care not to create a
+ * situation where the HW can perceive a corrupted entry. HW is only required to
+ * have a 64 bit atomicity with stores from the CPU, while entries are many 64
+ * bit values big.
+ *
+ * The algorithm works by evolving the entry toward the target in a series of
+ * steps. Each step synchronizes with the HW so that the HW can not see an entry
+ * torn across two steps. During each step the HW can observe a torn entry that
+ * has any combination of the step's old/new 64 bit words. The algorithm
+ * objective is for the HW behavior to always be one of current behavior, V=0,
+ * or new behavior.
+ *
+ * In the most general case we can make any update in three steps:
+ * - Disrupting the entry (V=0)
+ * - Fill now unused bits, all bits except V
+ * - Make valid (V=1), single 64 bit store
+ *
+ * However this disrupts the HW while it is happening. There are several
+ * interesting cases where a STE/CD can be updated without disturbing the HW
+ * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
+ * because the used bits don't intersect. We can detect this by calculating how
+ * many 64 bit values need update after adjusting the unused bits and skip the
+ * V=0 process. This relies on the IGNORED behavior described in the
+ * specification
+ */
static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
struct arm_smmu_ste *ste,
const struct arm_smmu_ste *target)
{
+ struct arm_smmu_ste staging_ste;
struct arm_smmu_ste target_used;
- int i;
+ int writes_required = 0;
+ u8 staging_used_diff = 0;
+ int i = 0;
+ /*
+ * Compute a staging ste that has all the bits currently unused by HW
+ * set to their target values, such that committing it to the ste table
+ * wouldn't disrupt the hardware.
+ */
+ memcpy(&staging_ste, ste, sizeof(staging_ste));
+ arm_smmu_ste_set_unused_bits(&staging_ste, target);
+
+ /*
+ * Determine if it's possible to reach the target configuration from the
+ * staged configuration in a single qword write (ignoring bits that are
+ * unused under the target configuration).
+ */
arm_smmu_get_ste_used(target, &target_used);
- /* Masks in arm_smmu_get_ste_used() are up to date */
- for (i = 0; i != ARRAY_SIZE(target->data); i++)
- WARN_ON_ONCE(target->data[i] & ~target_used.data[i]);
+ /*
+ * But first sanity check that masks in arm_smmu_get_ste_used() are up
+ * to date.
+ */
+ for (i = 0; i != ARRAY_SIZE(target->data); i++) {
+ if (WARN_ON_ONCE(target->data[i] & ~target_used.data[i]))
+ target_used.data[i] |= target->data[i];
+ }
- for (i = 0; true; i++) {
- if (arm_smmu_write_ste_next(ste, target, &target_used))
- break;
- arm_smmu_sync_ste_for_sid(smmu, sid);
- if (WARN_ON(i == 4))
- break;
+ for (i = 0; i != ARRAY_SIZE(staging_ste.data); i++) {
+ /*
+ * Each bit of staging_used_diff indicates the index of a qword
+ * within the staged ste that is incorrect compared to the
+ * target, considering only the used bits in the target
+ */
+ if ((staging_ste.data[i] &
+ target_used.data[i]) != (target->data[i]))
+ staging_used_diff |= 1 << i;
+ }
+ if (hweight8(staging_used_diff) > 1) {
+ /*
+ * More than 1 qword is mismatched and a hitless transition from
+ * the current ste to the target ste is not possible. Clear the
+ * V bit and recompute the staging STE.
+ * Because the V bit is cleared, the staging STE will be equal
+ * to the target STE except for the first qword.
+ */
+ staging_ste.data[0] &= ~cpu_to_le64(STRTAB_STE_0_V);
+ arm_smmu_ste_set_unused_bits(&staging_ste, target);
+ staging_used_diff = 1;
+ }
+
+ /*
+ * Commit the staging STE.
+ */
+ for (i = 0; i != ARRAY_SIZE(staging_ste.data); i++)
+ WRITE_ONCE(ste->data[i], staging_ste.data[i]);
+ arm_smmu_sync_ste_for_sid(smmu, sid);
+
+ /*
+ * It's now possible to switch to the target configuration with a write
+ * to a single qword. staging_used_diff can be zero when the staging STE
+ * already matches the target's used bits (e.g. the entry was already at
+ * the target value); skip the extra write and sync in that case instead
+ * of underflowing ffs().
+ */
+ if (staging_used_diff) {
+ i = ffs(staging_used_diff) - 1;
+ WRITE_ONCE(ste->data[i], target->data[i]);
+ arm_smmu_sync_ste_for_sid(smmu, sid);
+ }
+
+ /*
+ * Some of the bits set under the previous configuration but unused
+ * under the target configuration might still be set. Clear them as
+ * well. Technically this isn't necessary but it brings the entry to
+ * the full target state, so if there are bugs in the mask calculation
+ * this will obscure them.
+ */
+ for (i = 0; i != ARRAY_SIZE(ste->data); i++) {
+ if (ste->data[i] != target->data[i]) {
+ WRITE_ONCE(ste->data[i], target->data[i]);
+ }
}
+ arm_smmu_sync_ste_for_sid(smmu, sid);
/* It's likely that we'll want to use the new STE soon */
if (!(smmu->options & ARM_SMMU_OPT_SKIP_PREFETCH)) {