[PATCH 1/2] RISC-V: Probe for unaligned access speed

Thu Jun 29 16:09:46 PDT 2023

On Thu, Jun 29, 2023 at 5:05 AM David Laight <David.Laight at aculab.com> wrote:
>
> From: Evan Green
> > Sent: 27 June 2023 20:12
> >
> > On Mon, Jun 26, 2023 at 2:42 PM Jessica Clarke <jrtc27 at jrtc27.com> wrote:
> > >
> > > On 23 Jun 2023, at 23:20, Evan Green <evan at rivosinc.com> wrote:
> > > >
> > > > Rather than deferring misaligned access speed determinations to a vendor
> > > > function, let's probe them and find out how fast they are. If we
> > > > determine that a misaligned word access is faster than N byte accesses,
> > > > mark the hardware's misaligned access as "fast".
> > >
> > > How sure are you that your measurements can be extrapolated and aren’t
> > > an artefact of the testing process? For example, off the top of my head:
> > >
> > > * The first run will potentially be penalised by data cache misses,
> > >   untrained prefetchers, TLB misses, branch predictors, etc. compared
> > >   with later runs. You have one warmup, but who knows how many
> > >   iterations it will take to converge?
> >
> > I'd expect the cache penalties to be reasonably covered by a single
> > warmup. You're right about branch prediction, which is why I tried to
> > use a large-ish buffer size, minimize the ratio of conditionals to
> > loads/stores, and do the test for a decent number of iterations (on my
> > THead, about 1800 and 400 for words and bytes).
> >
> > When I ran the test a handful of times, I did see variation on the
> > order of ~5%. But the comparison of the two numbers doesn't seem to be
> > anywhere near that margin (THead C906 was ~4x faster doing misaligned
> > word accesses, others with slow misaligned accesses also reporting
> > numbers not anywhere close to each other).
>
> Isn't the EMULATED case so much slower than anything else that
> it is even pretty obvious from a single access?
> (Possibly the 2nd access to avoid 'cold cache'.)
>
> One of the things that can perturb measurements is hardware
> interrupts. That can be mitigated by counting clocks for a few
> (10 is plenty) iterations of a short request and taking the
> fastest value.
> For short hot-cache code sequences you can actually compare the
> actual clock counts with theoretical minimum values.

Yeah, one thing I could do is disable interrupts, measure the cycle
count of doing an individual iteration, do this N times, and take the
minimum value as the time to compare. In the end I'll then have two
numbers to compare, like I do in this patch. In theory the variance on
that should be really tight. N will have to depend on the overall
amount of time I'm taking so as not to shut interrupts off for very
long. Let me experiment with this and see how the results look.
-Evan