[PATCH 1/1] arm64: Accelerate Adler32 using arm64 SVE instructions.

Wed Nov 4 21:32:34 EST 2020

On 2020/11/4 22:49, Dave Martin wrote:
> On Wed, Nov 04, 2020 at 05:19:18PM +0800, Li Qiang wrote:

...

>>>
>>> I haven't tried to understand this algorithm in detail, but there should
>>> probably be no need for this special case to handle the trailing bytes.
>>>
>>> You should search for examples of speculative vectorization using
>>> WHILELO etc., to get a better feel for how to do this.
>>
>> Yes, I have considered this problem, but I have not found a good way to solve it,
>> because the decreasing sequence used for the calculation is already fixed
>> before the end of the loop is reached.
>>
>> For example, buf is divided into 32-byte blocks, so the sequence is 32,31,...,2,1.
>> But if only 10 bytes are left at the end of the loop, the sequence
>> should be 10,9,8,...,2,1.
>>
>> Checking inside the loop body whether the end of the loop has been reached,
>> and resetting the starting point of the sequence according to the length of the tail,
>> does not seem like a good approach.
> 
> That would indeed be inefficient, since the adjustment is only needed on
> the last iteration.
> 
> Can you instead do the adjustment after the loop ends?
> 
> For example, if
> 
> 	y = x[n] * 32 + x[n+1] * 31 + x[n+2] * 30 ...
> 
> then 
> 
> 	y - (x[n] * 22 + x[n+1] * 22 + x[n+2] * 22 ...)
> 
> equals
> 
> 	x[n] * 10 + x[n+1] * 9 + x[n+2] * 8 + ...
> 
> (This isn't exactly what the algorithm demands, but hopefully you see the
> general idea.)
> 
> [...]
> 
> Cheers
> ---Dave
> .
> 

This idea seems feasible: the check only has to be made once, after the loop
ends, and the excess is subtracted there, so there is no need to enter
another loop to process the trailing bytes.
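
As a rough sketch of that adjustment in plain scalar C (this is not the SVE
code from the patch; the function names and the fixed 32-byte block are
assumptions made for illustration, and the modulo reduction that Adler32
needs is left out): the tail is weighted with the same fixed sequence
32,31,... as a full block, with the missing lanes treated as zero, and then
(32 - len) times the plain byte sum is subtracted once.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK 32u	/* assumed block size, as in the example above */

/*
 * Weighted sum with the fixed weights BLOCK, BLOCK-1, ..., 1.
 * Bytes beyond len contribute nothing, as if those lanes were zero.
 * The plain byte sum is returned through *sum. Requires len <= BLOCK.
 */
static uint32_t weighted_fixed(const uint8_t *x, size_t len, uint32_t *sum)
{
	uint32_t y = 0, s = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		y += (uint32_t)x[i] * (BLOCK - (uint32_t)i);
		s += x[i];
	}
	*sum = s;
	return y;
}

/* Tail handled with the fixed weights plus one correction afterwards. */
static uint32_t tail_weighted(const uint8_t *x, size_t len)
{
	uint32_t s;
	uint32_t y = weighted_fixed(x, len, &s);

	/* Every weight is too large by (BLOCK - len), so subtract that once. */
	return y - (BLOCK - (uint32_t)len) * s;
}

/* Reference: weights len, len-1, ..., 1 computed directly. */
static uint32_t tail_reference(const uint8_t *x, size_t len)
{
	uint32_t y = 0;
	size_t i;

	for (i = 0; i < len; i++)
		y += (uint32_t)x[i] * (uint32_t)(len - i);
	return y;
}
```

For a 10-byte tail the correction factor is 22, which matches the example
above, and both functions should give the same result. The plain byte sum is
needed for the Adler32 "A" term anyway, so the correction costs only one
extra multiply and subtract after the loop.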

I will try this solution later. Thank you! :)

-- 
Best regards,
Li Qiang