[PATCH v3 3/3] riscv: optimized memset

Tue Jun 22 17:08:30 PDT 2021

On Tue, Jun 22, 2021 at 3:07 AM Nick Kossifidis <mick at ics.forth.gr> wrote:
>
> Στις 2021-06-17 18:27, Matteo Croce έγραψε:
> > +
> > +void *__memset(void *s, int c, size_t count)
> > +{
> > +     union types dest = { .u8 = s };
> > +
> > +     if (count >= MIN_THRESHOLD) {
> > +             const int bytes_long = BITS_PER_LONG / 8;
>
> You could make 'const int bytes_long = BITS_PER_LONG / 8;' and 'const
> int mask = bytes_long - 1;' from your memcpy patch visible to memset as
> well (static const...) and use them here (mask would make more sense to
> be named as word_mask).
>

I'll do

> > +             unsigned long cu = (unsigned long)c;
> > +
> > +             /* Compose an ulong with 'c' repeated 4/8 times */
> > +             cu |= cu << 8;
> > +             cu |= cu << 16;
> > +#if BITS_PER_LONG == 64
> > +             cu |= cu << 32;
> > +#endif
> > +
>
> You don't have to create cu here, you'll fill dest buffer with 'c'
> anyway so after filling up enough 'c's to be able to grab an aligned
> word full of them from dest, you can just grab that word and keep
> filling up dest with it.
>

I tried that, but this way I have to wait 8 bytes more before starting
the memset.
And, the machine code needed to generate 'cu' is just 6 instructions on riscv:

slli a5,a0,8
or a5,a5,a0
slli a0,a5,16
or a0,a0,a5
slli a5,a0,32
or a0,a5,a0

so probably it's not worth it.

> > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > +             /* Fill the buffer one byte at time until the destination
> > +              * is aligned on a 32/64 bit boundary.
> > +              */
> > +             for (; count && dest.uptr % bytes_long; count--)
>
> You could reuse & mask here instead of % bytes_long.
>

Sure, even if the machine code will be the same.

> > +                     *dest.u8++ = c;
> > +#endif
>
> I noticed you also used CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS on your
> memcpy patch, is it worth it here ? To begin with riscv doesn't set it
> and even if it did we are talking about a loop that will run just a few
> times to reach the alignment boundary (worst case scenario it'll run 7
> times), I don't think we gain much here, even for archs that have
> efficient unaligned access.

It doesn't _now_, but maybe in the future we will have a CPU which
handles unaligned accesses correctly!

-- 
per aspera ad upstream