[PATCH] riscv: Optimize memset

Tue May 9 02:16:33 PDT 2023

On Tue, May 09, 2023 at 10:22:07AM +0800, zhangfei wrote:
> From: zhangfei <zhangfei at nj.iscas.ac.cn>
> 
> > >  5:
> > > -	sb a1, 0(t0)
> > > -	addi t0, t0, 1
> > > -	bltu t0, a3, 5b
> > > +        sb a1, 0(t0)
> > > +        sb a1, -1(a3)
> > > +        li a4, 2
> > > +        bgeu a4, a2, 6f
> > > +
> > > +        sb a1, 1(t0)
> > > +        sb a1, 2(t0)
> > > +        sb a1, -2(a3)
> > > +        sb a1, -3(a3)
> > > +        li a4, 6
> > > +        bgeu a4, a2, 6f
> > > +
> > > +        sb a1, 3(t0)
> > > +        sb a1, -4(a3)
> > > +        li a4, 8
> > > +        bgeu a4, a2, 6f
> > 
> > Why is this check here?
> 
> Hi,
> 
> I filled the head and tail with minimal branching. Each conditional ensures
> that all subsequently used offsets are well-defined and within the dest region.

I know. You trimmed my comment, so I'll quote myself here:

"""
After the check of a2 against 6 above we know that offsets 6(t0)
and -7(a3) are safe. Are we trying to avoid too many redundant
stores with these additional checks?
"""

So, again: why the additional check against 8 above and, in the part you
trimmed, the check against 10?
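
For reference, the quoted stores can be sketched in C roughly as below. This
is only an illustration, not the patch: the function name is hypothetical, the
n == 0 guard is an assumption about how the assembly is reached, and the final
loop is a stand-in for the part of the patch that was trimmed from the quote.

```c
#include <stddef.h>

/* Hypothetical helper mirroring the quoted head/tail stores. */
static void fill_head_tail(unsigned char *dst, unsigned char c, size_t n)
{
    unsigned char *end = dst + n;   /* plays the role of a3 */
    size_t i;

    if (n == 0)                     /* assumption: the asm is entered with n >= 1 */
        return;

    dst[0] = c;                     /* sb a1, 0(t0)  */
    end[-1] = c;                    /* sb a1, -1(a3) */
    if (n <= 2)                     /* li a4, 2; bgeu a4, a2, 6f */
        return;

    dst[1] = c;
    dst[2] = c;
    end[-2] = c;
    end[-3] = c;
    if (n <= 6)                     /* li a4, 6; bgeu a4, a2, 6f */
        return;

    dst[3] = c;
    end[-4] = c;
    if (n <= 8)                     /* li a4, 8; bgeu a4, a2, 6f */
        return;

    /* Assumption: stand-in for the trimmed remainder of the patch. */
    for (i = 4; i < n - 4; i++)
        dst[i] = c;
}
```

In this sketch, every store issued before a given check stays inside the
buffer for any n that reaches it, and for n <= 8 the four head stores plus the
four tail stores already cover every byte.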

> 
> Although this approach may issue redundant stores, compared to byte-by-byte
> stores it allows the store instructions to be executed in parallel and
> reduces the number of branches.

I understood that when I read the code, but text like this should go in
the commit message to avoid people having to think their way through
stuff.

> 
> I used the code linked below [1] for performance testing, and commented out
> the memset calls specific to the Arm architecture so that it runs properly
> on the RISC-V platform.
> 
> [1] https://github.com/ARM-software/optimized-routines/blob/master/string/bench/memset.c#L53
> 
> The test platform is a RISC-V SiFive U74. The test data is as follows:
> 
> Before optimization
> ---------------------
> Random memset (bytes/ns):
>            memset_call 32K:0.45 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.30
> 
> Medium memset (bytes/ns):
>            memset_call 8B:0.18 16B:0.48 32B:0.91 64B:1.63 128B:2.71 256B:4.40 512B:5.67
> Large memset (bytes/ns):
>            memset_call 1K:6.62 2K:7.02 4K:7.46 8K:7.70 16K:7.82 32K:7.63 64K:1.40
> 
> After optimization
> ---------------------
> Random memset (bytes/ns):
>            memset_call 32K:0.46 64K:0.35 128K:0.30 256K:0.28 512K:0.27 1024K:0.25 avg 0.31
> Medium memset (bytes/ns):
>            memset_call 8B:0.27 16B:0.48 32B:0.91 64B:1.64 128B:2.71 256B:4.40 512B:5.67
> Large memset (bytes/ns):
>            memset_call 1K:6.62 2K:7.02 4K:7.47 8K:7.71 16K:7.83 32K:7.63 64K:1.40
> 
> From the results, memset throughput at a size of 8B improves from 0.18 bytes/ns
> to 0.27 bytes/ns (about 50%); the other sizes are essentially unchanged.

And these benchmark results belong in the cover letter, which this series
is missing.

Thanks,
drew