[PATCH 04/19] iommu/arm-smmu-v3: Make STE programming independent of the callers

Michael Shavit mshavit at google.com
Wed Jan 3 08:52:48 PST 2024


On Tue, Jan 2, 2024 at 10:48 PM Jason Gunthorpe <jgg at nvidia.com> wrote:
>
> On Tue, Jan 02, 2024 at 04:08:41PM +0800, Michael Shavit wrote:
> > On Wed, Dec 27, 2023 at 11:46 PM Jason Gunthorpe <jgg at nvidia.com> wrote:
> > >
> > > On Tue, Dec 19, 2023 at 09:42:27PM +0800, Michael Shavit wrote:
> > >
> > > > +static void arm_smmu_write_entry(struct arm_smmu_entry_writer *writer,
> > > > +                                __le64 *cur, const __le64 *target,
> > > > +                                __le64 *staging_entry)
> > > > +{
> > > > +       bool cleanup_sync_required = false;
> > > > +       u8 entry_qwords_used_diff = 0;
> > > > +       int i = 0;
> > > > +
> > > > +       entry_qwords_used_diff =
> > > > +               writer->ops.get_used_qword_diff_indexes(cur, target);
> > > > +       if (WARN_ON_ONCE(entry_qwords_used_diff == 0))
> > > > +               return;
> > >
> > > A no change update is actually API legal, eg we can set the same
> > > domain twice in a row. It should just do nothing.
> > >
> > > If the goal is to improve readability I'd split this into smaller
> > > functions and have the main function look like this:
> > >
> > >        compute_used(..)
> > >        if (hweight8(entry_qwords_used_diff) > 1) {
> > >              set_v_0(..);
> > >              set(qword_start=1,qword_end=N);
> > >              set(qword_start=0,qword_end=1); // V=1
> >
> > This branch is probably a bit more complicated than that. It's a bit more like:
> >        if (hweight8(entry_qwords_used_diff) > 1) {
> >              compute_staging_entry(...);
> >              compute_used_diffs(...staging_entry...)
> >              if (hweight(entry_qwords_used_diff) > 1) {
> >                  set_v_0();
> >                  set(qword_start=1,qword_end=N);
> >                  set(qword_start=0,qword_end=1); // V=1
> >              } else {
> >                  set(qword_start=0, qword_end=N, staging_entry, entry)
> >                  critical = ffs(..);
> >                  set(qword_start=critical,qword_end=critical+1);
> >                  set(qword_start=0,qword_end=N);
> >              }
> >       }
> >
> > >        } else if (hweight8(entry_qwords_used_diff) == 1) {
> > >              set_unused(..);
> > >              critical = ffs(..);
> > >              set(qword_start=critical,qword_end=critical+1);
> > >              set(qword_start=0,qword_end=N);
> >
> > And then this branch is the case where you can directly switch to the
> > entry without first setting unused bits.
>
> Don't make that a special case, just always set the unused bits. All
> the setting functions should skip the sync if they didn't change the
> entry, so we don't need to care if we call them needlessly.
>
> There are only three programming sequences.

The different cases (ignoring clean-up), from simplest to most involved, are:
1. No change because the STE is already equal to the target.
2. Directly writing the critical word because that's the only difference.
3. Setting unused bits, then writing the critical word.
4. Installing a breaking STE, writing the other words, then writing the critical word.

Case 2. could potentially be collapsed into 3. if the routine that
sets unused bits skips over the critical word, so that it's a no-op when
the only change is in that critical word.
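The case split above hinges on a per-qword diff bitmap. A minimal userspace sketch of how such a bitmap could be computed (the `get_used_fn` callback, `used_qword_diff` name, and `NUM_ENTRY_QWORDS` are illustrative, not the kernel's actual types):

```c
#include <stdint.h>

#define NUM_ENTRY_QWORDS 8

/* Hypothetical callback: fills 'used' with a mask of the bits the HW
 * actually reads given the entry's current configuration. */
typedef void (*get_used_fn)(const uint64_t *entry, uint64_t *used);

/* Returns a bitmap with bit i set when qword i of 'cur' must change to
 * reach 'target', considering only bits used by either entry. */
static uint8_t used_qword_diff(const uint64_t *cur, const uint64_t *target,
			       get_used_fn get_used)
{
	uint64_t cur_used[NUM_ENTRY_QWORDS], target_used[NUM_ENTRY_QWORDS];
	uint8_t diff = 0;
	int i;

	get_used(cur, cur_used);
	get_used(target, target_used);
	for (i = 0; i < NUM_ENTRY_QWORDS; i++) {
		uint64_t mask = cur_used[i] | target_used[i];

		if ((cur[i] & mask) != (target[i] & mask))
			diff |= 1u << i;
	}
	return diff;
}
```

With such a bitmap, hweight8() == 0 maps to case 1, == 1 to case 2 (or 3 after melding unused bits), and > 1 to case 4.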

> entry_qwords_used_diff should reflect required changes after setting
> the unused bits.

Ohhhhhhh, I see. Your suggestion is essentially to move this block
into the first call to get_used_qword_diff_indexes:
> > > > +               /*
> > > > +                * Compute a staging entry that has all the bits currently
> > > > +                * unused by HW set to their target values, such that
> > > > +                * committing it to the entry table wouldn't disrupt the
> > > > +                * hardware.
> > > > +                */
> > > > +               memcpy(staging_entry, cur, writer->entry_length);
> > > > +               writer->ops.set_unused_bits(staging_entry, target);
> > > > +
> > > > +               entry_qwords_used_diff =
> > > > +                       writer->ops.get_used_qword_diff_indexes(staging_entry,
> > > > +                                                               target);

Such that, after the recompute against the staging entry:
if (hweight8(entry_qwords_used_diff) > 1) => non-hitless
if (hweight8(entry_qwords_used_diff) == 1) => hitless, potentially by
first setting some unused bits in non-critical qwords.
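Concretely, the staging step melds target values into the bits the hardware currently ignores; writing the result back is hitless by construction. A hedged sketch (the current entry's `used` mask is assumed to be available from the used-bits op; names are illustrative):

```c
#include <stdint.h>
#include <string.h>

#define NUM_ENTRY_QWORDS 8

/* Start from the live entry, then overwrite every bit the HW currently
 * ignores with its target value. No bit the HW reads changes, so
 * committing the result to the entry table cannot disrupt it. */
static void compute_staging_entry(uint64_t *staging, const uint64_t *cur,
				  const uint64_t *cur_used,
				  const uint64_t *target)
{
	int i;

	memcpy(staging, cur, NUM_ENTRY_QWORDS * sizeof(*staging));
	for (i = 0; i < NUM_ENTRY_QWORDS; i++)
		staging[i] = (staging[i] & cur_used[i]) |
			     (target[i] & ~cur_used[i]);
}
```

The qword diff is then recomputed between the staging entry and the target: more than one differing qword means a breaking (non-hitless) transition is unavoidable; exactly one means a single critical-word write finishes the update.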

>
> > > > +       if (hweight8(entry_qwords_used_diff) > 1) {
> > > > +               /*
> > > > +                * If transitioning to the target entry with a single qword
> > > > +                * write isn't possible, then we must first transition to an
> > > > +                * intermediate entry. The intermediate entry may either be an
> > > > +                * entry that melds bits of the target entry into the current
> > > > +                * entry without disrupting the hardware, or a breaking entry if
> > > > +                * a hitless transition to the target is impossible.
> > > > +                */
> > > > +
> > > > +               /*
> > > > +                * Compute a staging entry that has all the bits currently
> > > > +                * unused by HW set to their target values, such that
> > > > +                * committing it to the entry table wouldn't disrupt the
> > > > +                * hardware.
> > > > +                */
> > > > +               memcpy(staging_entry, cur, writer->entry_length);
> > > > +               writer->ops.set_unused_bits(staging_entry, target);
> > > > +
> > > > +               entry_qwords_used_diff =
> > > > +                       writer->ops.get_used_qword_diff_indexes(staging_entry,
> > > > +                                                               target);
> > >
> > > Put the number qwords directly in the ops struct and don't make this
> > > an op.  Above will need N=number of qwords as well.
> >
> > The reason I made get_used_qword_diff_indexes an op is because the
> > algorithm needs to compute the used_bits for entries (for the current
> > entry, the target entry as well as the melded-staging entry).
>
> Make getting the used bits the op..

Right, I initially tried making get_used_bits the op, but the problem
is where to store the output used_bits without dynamic allocation.
Introducing .get_used_qword_diff_indexes and .set_unused_bits sidesteps
the issue, though I agree it's a bit awkward.
Are you suggesting that get_used_bits()'s output would use storage
from its ops struct, with the requirement that each call to
get_used_bits() invalidates the result of the previous one?
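One allocation-free alternative (purely illustrative, not the kernel API under discussion) is for the op to fill a caller-provided buffer, typically on the caller's stack. The ops struct then stays stateless, the qword count can live directly in the struct as suggested, and no call invalidates another's result:

```c
#include <stdint.h>

#define NUM_ENTRY_QWORDS 8

/* Illustrative only: get_used() writes into caller storage, so the ops
 * struct holds no per-call state and results never alias each other. */
struct entry_writer_ops {
	unsigned int num_entry_qwords;	/* N, stored directly in the ops */
	void (*get_used)(const uint64_t *entry, uint64_t *used);
};

/* Toy implementation: pretend only the low byte of qword 0 is read. */
static void toy_get_used(const uint64_t *entry, uint64_t *used)
{
	unsigned int i;

	(void)entry;
	for (i = 0; i < NUM_ENTRY_QWORDS; i++)
		used[i] = 0;
	used[0] = 0xFF;
}

static const struct entry_writer_ops toy_ops = {
	.num_entry_qwords = NUM_ENTRY_QWORDS,
	.get_used = toy_get_used,
};
```

The three used-bits arrays the algorithm needs (current, target, staging) would simply be three stack buffers at the single call site, so the "whose result is still valid" question never arises.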
