[PATCH v3 3/3] riscv: optimized memset
Matteo Croce
mcroce at linux.microsoft.com
Tue Jun 22 17:08:30 PDT 2021
On Tue, Jun 22, 2021 at 3:07 AM Nick Kossifidis <mick at ics.forth.gr> wrote:
>
> On 2021-06-17 18:27, Matteo Croce wrote:
> > +
> > +void *__memset(void *s, int c, size_t count)
> > +{
> > + union types dest = { .u8 = s };
> > +
> > + if (count >= MIN_THRESHOLD) {
> > + const int bytes_long = BITS_PER_LONG / 8;
>
> You could make 'const int bytes_long = BITS_PER_LONG / 8;' and 'const
> int mask = bytes_long - 1;' from your memcpy patch visible to memset as
> well (static const...) and use them here (mask would make more sense to
> be named as word_mask).
>
Will do.
> > + unsigned long cu = (unsigned long)c;
> > +
> > + /* Compose an ulong with 'c' repeated 4/8 times */
> > + cu |= cu << 8;
> > + cu |= cu << 16;
> > +#if BITS_PER_LONG == 64
> > + cu |= cu << 32;
> > +#endif
> > +
>
> You don't have to create cu here, you'll fill dest buffer with 'c'
> anyway so after filling up enough 'c's to be able to grab an aligned
> word full of them from dest, you can just grab that word and keep
> filling up dest with it.
>
I tried that, but then I have to fill 8 more bytes one at a time before
the word-sized stores can start.
Also, the machine code needed to generate 'cu' is just 6 instructions on riscv:

	slli a5,a0,8
	or a5,a5,a0
	slli a0,a5,16
	or a0,a0,a5
	slli a5,a0,32
	or a0,a5,a0

so it's probably not worth it.
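
For reference, the replication step discussed above can be sketched as a
standalone helper. This is only an illustration: the name repeat_byte and
the cast to unsigned char are assumptions, not part of the patch. The cast
keeps a negative or out-of-range 'c' from setting the upper bits before the
shifts, and the loop stands in for the #if BITS_PER_LONG == 64 block.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper, not part of the patch: replicate the low byte
 * of 'c' into every byte of an unsigned long.
 */
static unsigned long repeat_byte(int c)
{
	/* Assumption: keep only the low 8 bits of 'c' before shifting */
	unsigned long cu = (unsigned char)c;
	size_t shift;

	/* Double the filled width each step: 8, 16 and, on 64 bit, 32 */
	for (shift = 8; shift < sizeof(cu) * CHAR_BIT; shift <<= 1)
		cu |= cu << shift;

	return cu;
}
```

With a constant loop bound a compiler would normally unroll this into the
same shift/or pairs shown above, but that depends on the compiler and flags.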
> > +#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
> > + /* Fill the buffer one byte at a time until the destination
> > + * is aligned on a 32/64 bit boundary.
> > + */
> > + for (; count && dest.uptr % bytes_long; count--)
>
> You could reuse & mask here instead of % bytes_long.
>
Sure, although the generated machine code will be the same.
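
A small sketch of why the two forms are interchangeable: for a power-of-two
word size, the remainder and the masked low bits are the same value. The
names below (bytes_long, word_mask, and the two helpers) are illustrative
only; the thread proposes the constants, not this code.

```c
#include <stdint.h>

/* Hypothetical shared constants, in the spirit of the review comment */
static const uintptr_t bytes_long = sizeof(long);
static const uintptr_t word_mask = sizeof(long) - 1;

/* Alignment test as written in the patch: remainder of the address */
static int misaligned_mod(uintptr_t addr)
{
	return addr % bytes_long != 0;
}

/* Alignment test as suggested in review: low bits of the address */
static int misaligned_mask(uintptr_t addr)
{
	return (addr & word_mask) != 0;
}
```

This relies on sizeof(long) being a power of two, which holds on the
32 and 64 bit targets discussed here.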
> > + *dest.u8++ = c;
> > +#endif
>
> I noticed you also used CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS on your
> memcpy patch, is it worth it here ? To begin with riscv doesn't set it
> and even if it did we are talking about a loop that will run just a few
> times to reach the alignment boundary (worst case scenario it'll run 7
> times), I don't think we gain much here, even for archs that have
> efficient unaligned access.
It doesn't _now_, but maybe in the future we will have a CPU which
handles unaligned accesses correctly!
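
To put a number on the "worst case 7 times" remark, here is a minimal
sketch that computes how many single-byte stores the alignment loop
performs for a given start address. The helper name head_bytes is
hypothetical and does not appear in the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of single-byte stores needed before
 * 'addr' reaches a long-aligned boundary.
 */
static size_t head_bytes(uintptr_t addr)
{
	const uintptr_t mask = sizeof(long) - 1;

	/* Distance to the next multiple of sizeof(long), 0 if aligned */
	return (size_t)(-addr & mask);
}
```

The result is always below sizeof(long), so at most 7 iterations with
8-byte longs and at most 3 with 4-byte longs.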
--
per aspera ad upstream