[PATCH v3 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers

Jason Gunthorpe jgg at nvidia.com
Tue Dec 12 10:04:26 PST 2023


On Tue, Dec 12, 2023 at 04:23:53PM +0000, Will Deacon wrote:
> > At this point it generates the same sequence of updates as the current
> > code, except that zeroing the VMID on entry to BYPASS/ABORT will do an
> > extra sync (this seems to be an existing bug).
> 
> This is certainly very clever, but at the same time I can't help but feel
> that it's slightly over-engineered to solve the general case, whereas I'm
> struggling to see why such a level of complexity is necessary.

I'm open to any alternative that accomplishes the same outcome of
decoupling the STE programming from the STE state and adding the API
required for hitless transitions.

This is a really important part of all of this work.

IMHO this is inherently complex. The SMMU spec devotes several pages
to describing how to do tear-free updates, yet that text does not even
consider the need for hitless updates. It only contemplates the
V=0 flow.
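
For reference, the V=0 flow the spec describes looks roughly like this
(an illustrative sketch only, not the patch's code; it reuses the
driver's existing arm_smmu_sync_ste_for_sid() and the arm_smmu_ste
layout quoted further down):

	/* Break-before-make: tear the entry down, fill it, set V last */
	WRITE_ONCE(ste->data[0], 0);                 /* V=0 */
	arm_smmu_sync_ste_for_sid(smmu, sid);        /* HW stops using the entry */
	for (i = 1; i < ARRAY_SIZE(ste->data); i++)  /* fill everything but qword 0 */
		WRITE_ONCE(ste->data[i], target->data[i]);
	arm_smmu_sync_ste_for_sid(smmu, sid);
	WRITE_ONCE(ste->data[0], target->data[0]);   /* single 64 bit store sets V=1 */
	arm_smmu_sync_ste_for_sid(smmu, sid);

Any traffic that hits this STE between the first and the last store is
disrupted, which is exactly what the hitless arcs listed further down
need to avoid.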

Here we are very much adding hitless updates, and intentionally so.

> In the comment, you say:
> 
> > + * In the most general case we can make any update in three steps:
> > + *  - Disrupting the entry (V=0)
> > + *  - Fill now unused bits, all bits except V
> > + *  - Make valid (V=1), single 64 bit store
> > + *
> > + * However this disrupts the HW while it is happening. There are several
> > + * interesting cases where a STE/CD can be updated without disturbing the HW
> > + * because only a small number of bits are changing (S1DSS, CONFIG, etc) or
> > + * because the used bits don't intersect. We can detect this by calculating how
> > + * many 64 bit values need update after adjusting the unused bits and skip the
> > + * V=0 process.
> 
> Please can you spell out these "interesting cases"? For cases where we're
> changing CONFIG, I'd have thought it would be perfectly fine to go via an
> invalid STE. What am I missing?

The modern IOMMU driver API requires hitless transitions in a bunch of
cases. I think I explained it in an email to Michael:

https://lore.kernel.org/linux-iommu/20231012121616.GF3952@nvidia.com/

 - IDENTITY -> DMA -> IDENTITY hitless with RESV_DIRECT
 - STE -> S1DSS -> STE hitless (PASID upgrade)
 - S1 -> BLOCKING -> S1 with active PASID hitless (iommufd case)
 - NESTING -> NESTING (eg to change S1DSS, change CD table pointers, etc)
 - CD ASID change hitless (BTM S1 replacement)
 - CD quiet_cd hitless (SVA mm release)

I will add that list to the commit message; the cover letter talks
about it a bit too.

To properly implement the API the driver has to support this. Rather
than try to target just those few cases (and hope I even know all the
needed cases), I just did everything.

"Everything" was a reaction to the complexity, at the end of part 3 we
have 7 different categories of STEs:
 - IDENTITY
 - BLOCKING
 - S2 domain
 - IDENTITY w/ PASID
 - BLOCKING w/ PASID
 - S1 domain
 - S1 domain nested on S2 domain

(plus a bunch of those have ATS and !ATS variations, and the nesting
has its own sub-types...)

That is 7 * 6 = 42 different transition arcs (plus the CD stuff). Even
if we want to special case just the sequences above, we still have to
identify them and open code the arcs. Then the next person along has
to understand why those specific arcs are special, which is a complex
and subtle argument based on the spec saying the HW does not read
certain bits.

Finally, there is a VM emulation reason: we don't know what the VM
will do, so it may be assuming hitless changes. To emulate this
correctly we do need the general approach. Otherwise we have a weird
emulation gap where STE sequences generated by a VM that should be
hitless on real HW (eg the ones above, perhaps) will not work. Indeed,
a VM running kernels with all three parts of this series will be doing
and expecting hitless STE updates!

While this is perhaps not a real world functional concern, it falls
under the usual rubric of accurate VM emulation.

So, to my mind ignoring the general case and just doing V=0 all the
time is not OK.

> Generally, I like where the later patches in the series take things, but
> I'd really like to reduce the complexity of the strtab updating code to
> what is absolutely required.

This is the sort of thing where the algorithm in
arm_smmu_write_entry_step() is perhaps complex, but the ongoing
maintenance of it is not: just update the used-bits function when new
STE features are (rarely) introduced.

> I've left some minor comments on the code below, but I'd really like to
> see this whole thing simplified if possible.

I have no particularly good ideas on this front, sorry. I did try
several other things. Many thoughts were so difficult I couldn't even
write them down correctly :(

> > + */
> > +static bool arm_smmu_write_entry_step(__le64 *cur, const __le64 *cur_used,
> > +				      const __le64 *target,
> > +				      const __le64 *target_used, __le64 *step,
> > +				      __le64 v_bit,
> > +				      unsigned int len)
> > +{
> > +	u8 step_used_diff = 0;
> > +	u8 step_change = 0;
> > +	unsigned int i;
> > +
> > +	/*
> > +	 * Compute a step that has all the bits currently unused by HW set to
> > +	 * their target values.
> > +	 */
> 
> Well, ok, I do have a cosmetic nit here: using 'step' for both "STE pointer"
> and "incremental change" is perhaps, err, a step too far ;)

Ok.. I'll call it arm_smmu_write_entry_next()

> > +	for (i = 0; i != len; i++) {
> > +		step[i] = (cur[i] & cur_used[i]) | (target[i] & ~cur_used[i]);
> 
> Isn't 'cur[i] & cur_used[i]' always cur[i]?

No. The routine is called iteratively, as the comment explains. The
first iteration may set unused bits to their target values, so we have
to mask them away again here during the second iteration. A later
iteration may change, eg, the config, which means we will again have
unused bits carrying values left over from the prior config.
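
To make that concrete, here is a standalone toy program (not driver
code, made up field layout) showing the qword computation over the
passes. One qword with CONFIG in bits [3:0] (always read by the HW),
A in bits [11:4] (read only under the old config) and B in bits
[15:12] (read only under the new config):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint64_t cur    = 0x0551;  /* CONFIG=1 (old), A=0x55     */
		uint64_t target = 0x9002;  /* CONFIG=2 (new), B=0x9, A=0 */
		uint64_t cur_used;

		/* Pass 1: HW currently reads CONFIG and A; preserve those
		 * and set every other bit to its target value. */
		cur_used = 0x0fff;
		cur = (cur & cur_used) | (target & ~cur_used);
		printf("pass 1: %#llx\n", (unsigned long long)cur); /* 0x9551 */

		/* Pass 2: every target-used bit except CONFIG now matches,
		 * so CONFIG itself can be flipped in a single store. */
		cur = (cur & ~UINT64_C(0xf)) | 0x2;
		printf("pass 2: %#llx\n", (unsigned long long)cur); /* 0x9552 */

		/* Pass 3: under the new config only CONFIG and B are read,
		 * so A is now an unused leftover from the prior config.
		 * 'cur & cur_used' masks it away; a plain 'cur' would keep
		 * the stale 0x55 forever. */
		cur_used = 0xf00f;
		cur = (cur & cur_used) | (target & ~cur_used);
		printf("pass 3: %#llx\n", (unsigned long long)cur); /* 0x9002 */

		return 0;
	}

Since the config can change between passes, the set of bits the HW is
reading changes too, which is why that mask cannot be dropped.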

> > +	} else if (!step_change) {
> > +		/* cur == target, so all done */
> > +		if (memcmp(cur, target, len * sizeof(*cur)) == 0)
> > +			return true;
> > +
> > +		/*
> > +		 * All the used HW bits match, but unused bits are different.
> > +		 * Set them as well. Technically this isn't necessary but it
> > +		 * brings the entry to the full target state, so if there are
> > +		 * bugs in the mask calculation this will obscure them.
> > +		 */
> > +		memcpy(step, target, len * sizeof(*step));
> 
> Bah, I'm not a huge fan of this sort of defensive programming. I'd prefer
> to propagate the error rather than quietly try to cover it up.

There is no error.

This is adjusting the unused bits that were left over from the prior
config to 0; it is a similar answer to the 'cur[i]' question above.

The defensiveness is only a decision that the installed STE should
fully match the given target STE.

In principle the HW doesn't read unused bits in the target STE so we
could leave them set to 1 instead of fully matching.

"bugs in the mask calculation" means everything from the "HW should
not look at bit X but does" to "the spec was misread and the HW does
look at bit X"

> > +/*
> > + * Based on the value of ent report which bits of the STE the HW will access. It
> > + * would be nice if this was complete according to the spec, but minimally it
> > + * has to capture the bits this driver uses.
> > + */
> > +static void arm_smmu_get_ste_used(const struct arm_smmu_ste *ent,
> > +				  struct arm_smmu_ste *used_bits)
> > +{
> > +	memset(used_bits, 0, sizeof(*used_bits));
> > +
> > +	used_bits->data[0] = cpu_to_le64(STRTAB_STE_0_V);
> > +	if (!(ent->data[0] & cpu_to_le64(STRTAB_STE_0_V)))
> > +		return;
> > +
> > +	/*
> > +	 * If S1 is enabled S1DSS is valid, see 13.5 Summary of
> > +	 * attribute/permission configuration fields for the SHCFG behavior.
> > +	 */
> > +	if (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent->data[0])) & 1 &&
> > +	    FIELD_GET(STRTAB_STE_1_S1DSS, le64_to_cpu(ent->data[1])) ==
> > +		    STRTAB_STE_1_S1DSS_BYPASS)
> > +		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> > +
> > +	used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_CFG);
> > +	switch (FIELD_GET(STRTAB_STE_0_CFG, le64_to_cpu(ent->data[0]))) {
> > +	case STRTAB_STE_0_CFG_ABORT:
> > +		break;
> > +	case STRTAB_STE_0_CFG_BYPASS:
> > +		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_SHCFG);
> > +		break;
> > +	case STRTAB_STE_0_CFG_S1_TRANS:
> > +		used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
> > +						  STRTAB_STE_0_S1CTXPTR_MASK |
> > +						  STRTAB_STE_0_S1CDMAX);
> > +		used_bits->data[1] |=
> > +			cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
> > +				    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
> > +				    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
> > +		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
> > +		break;
> > +	case STRTAB_STE_0_CFG_S2_TRANS:
> > +		used_bits->data[1] |=
> > +			cpu_to_le64(STRTAB_STE_1_EATS | STRTAB_STE_1_SHCFG);
> > +		used_bits->data[2] |=
> > +			cpu_to_le64(STRTAB_STE_2_S2VMID | STRTAB_STE_2_VTCR |
> > +				    STRTAB_STE_2_S2AA64 | STRTAB_STE_2_S2ENDI |
> > +				    STRTAB_STE_2_S2PTW | STRTAB_STE_2_S2R);
> > +		used_bits->data[3] |= cpu_to_le64(STRTAB_STE_3_S2TTB_MASK);
> > +		break;
> 
> I think this is going to be a pain to maintain :/

Ah, but not your pain :)

Some day someone adds a new STE bit; they don't understand this, so
they just change one of the make functions.

They test and hit the WARN_ON, which brings them here. Then they
hopefully realize they need to read the spec and understand the new
bit, so they do that and make a try at the right conditions.
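
To make that concrete, a hypothetical hunk in arm_smmu_get_ste_used()
might look like this (STRTAB_STE_1_NEWFIELD is a made up name for some
future stage 1 field; the real condition has to come from the spec):

	case STRTAB_STE_0_CFG_S1_TRANS:
		used_bits->data[0] |= cpu_to_le64(STRTAB_STE_0_S1FMT |
						  STRTAB_STE_0_S1CTXPTR_MASK |
						  STRTAB_STE_0_S1CDMAX);
		used_bits->data[1] |=
			cpu_to_le64(STRTAB_STE_1_S1DSS | STRTAB_STE_1_S1CIR |
				    STRTAB_STE_1_S1COR | STRTAB_STE_1_S1CSH |
				    STRTAB_STE_1_S1STALLD | STRTAB_STE_1_STRW);
		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_EATS);
		/* Hypothetical: HW reads NEWFIELD only when S1 is enabled */
		used_bits->data[1] |= cpu_to_le64(STRTAB_STE_1_NEWFIELD);
		break;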

The reviewer sees the new hunk in this function and double-checks that
the conditions for the new bit are correct. The reviewer also needs to
consider whether any hitless flows become broken (I've considered
adding an explicit self-test for this).

The hard work of reviewing the spec and deciding how a new STE bit is
processed cannot be skipped, but we can make it much harder for that
skipping to go unnoticed. I think the current design makes skipping
this work too easy.

So where you are calling it a "pain to maintain", I'm calling it
"enforced rigor" going forward. Like type safety.

> > +static void arm_smmu_write_ste(struct arm_smmu_device *smmu, u32 sid,
> > +			       struct arm_smmu_ste *ste,
> > +			       const struct arm_smmu_ste *target)
> > +{
> > +	struct arm_smmu_ste target_used;
> > +	int i;
> > +
> > +	arm_smmu_get_ste_used(target, &target_used);
> > +	/* Masks in arm_smmu_get_ste_used() are up to date */
> > +	for (i = 0; i != ARRAY_SIZE(target->data); i++)
> > +		WARN_ON_ONCE(target->data[i] & ~target_used.data[i]);
> 
> That's a runtime cost on every single STE update for what would be a driver
> bug.

See above. STE programming is not on the critical path, and I think
this series makes it faster overall anyhow.

> > +	while (true) {
> > +		if (arm_smmu_write_ste_step(ste, target, &target_used))
> > +			break;
> > +		arm_smmu_sync_ste_for_sid(smmu, sid);
> > +	}
> 
> This really should be bounded...

Sure, I think the bound is 4, but I will check.
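
Something like this, assuming the bound checks out (a sketch only, on
top of the code quoted above):

	/*
	 * The sequence converges in a handful of passes (4, I believe,
	 * to be confirmed); the WARN turns a logic bug into a loud
	 * failure instead of an infinite loop.
	 */
	for (i = 0; i != 4; i++) {
		if (arm_smmu_write_ste_step(ste, target, &target_used))
			return;
		arm_smmu_sync_ste_for_sid(smmu, sid);
	}
	WARN_ON_ONCE(true);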

Thanks,
Jason


