[PATCH 1/2] md/raid6: use faster multiplication for ARM NEON delta syndrome
Ard Biesheuvel
ard.biesheuvel at linaro.org
Thu Jul 13 11:00:12 PDT 2017
On 13 July 2017 at 18:51, Markus Stockhausen <stockhausen at collogia.de> wrote:
>> From: Ard Biesheuvel [ard.biesheuvel at linaro.org]
>> Sent: Thursday, 13 July 2017 19:16
>> To: linux-arm-kernel at lists.infradead.org; linux-raid at vger.kernel.org
>> Cc: shli at kernel.org; Markus Stockhausen; linux at armlinux.org.uk; will.deacon at arm.com; catalin.marinas at arm.com; Ard Biesheuvel
>> Subject: [PATCH 1/2] md/raid6: use faster multiplication for ARM NEON delta syndrome
>>
>> The P/Q left side optimization in the delta syndrome simply involves
>> repeatedly multiplying a value by polynomial 'x' in GF(2^8). Given
>> that 'x * x * x * x' equals 'x^4' even in the polynomial world, we
>> can accelerate this substantially by performing up to 4 such operations
>> at once, using the NEON instructions for polynomial multiplication.
>>
>> Results on a Cortex-A57 running in 64-bit mode:
>>
>> Before:
>> -------
>> raid6: neonx1 xor() 1680 MB/s
>> raid6: neonx2 xor() 2286 MB/s
>> raid6: neonx4 xor() 3162 MB/s
>> raid6: neonx8 xor() 3389 MB/s
>>
>> After:
>> ------
>> raid6: neonx1 xor() 2281 MB/s
>> raid6: neonx2 xor() 3362 MB/s
>> raid6: neonx4 xor() 3787 MB/s
>> raid6: neonx8 xor() 4239 MB/s
>
> Nice optimization. Nevertheless, the test algorithm favours this implementation. See:
>
> int start = (disks>>1)-1, stop = disks-3; /* work on the second half of the disks */
>
> What does the before/after test give if you work on the middle data disks rather
> than the right ones? With a 4K page size this should be start = 3, stop = 11 instead
> of start = 7, stop = 13. Given the large gain you see, the impact should be lower
> but still at least in the >10% range.
>
Relative before and after (using raid6test rather than the kernel
module this time, so they should not be compared with the numbers
above):

Before:
-------
raid6: neonx1 xor() 1773 MB/s
raid6: neonx2 xor() 2362 MB/s
raid6: neonx4 xor() 3223 MB/s
raid6: neonx8 xor() 3375 MB/s

After:
------
raid6: neonx1 xor() 2259 MB/s
raid6: neonx2 xor() 2975 MB/s
raid6: neonx4 xor() 3404 MB/s
raid6: neonx8 xor() 3788 MB/s

So your estimate is correct: 12% speedup for neonx8 (3375 -> 3788 MB/s)
in the 'start = 3, stop = 11' case.
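
For anyone following along, the idea from the patch description (four
multiplications by 'x' collapsed into one multiplication by 'x^4') can be
sketched in scalar C roughly as follows. This is only an illustration, not
the NEON code from the patch: the function names are made up for this
example, the carry-less multiply is a plain loop standing in for the NEON
polynomial multiply, and the reduction constant 0x1d assumes the usual
RAID-6 field polynomial x^8 + x^4 + x^3 + x^2 + 1.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply by x in GF(2^8), reducing by x^8 + x^4 + x^3 + x^2 + 1
 * (0x11d, assumed here as the RAID-6 field polynomial). */
static uint8_t gf_mul_x(uint8_t v)
{
    return (uint8_t)((v << 1) ^ ((v & 0x80) ? 0x1d : 0));
}

/* Carry-less (polynomial) multiply of two 8-bit values; a scalar
 * stand-in for a hardware polynomial multiply instruction. */
static uint16_t clmul8(uint8_t a, uint8_t b)
{
    uint16_t r = 0;

    for (int i = 0; i < 8; i++)
        if (b & (1u << i))
            r ^= (uint16_t)((uint16_t)a << i);
    return r;
}

/* Multiply by x^4 in a single step (hypothetical helper). */
static uint8_t gf_mul_x4(uint8_t v)
{
    uint16_t wide = clmul8(v, 0x10);   /* v * x^4, up to 12 bits */
    uint8_t hi = (uint8_t)(wide >> 8); /* the 4 bits that overflowed */
    uint8_t lo = (uint8_t)wide;

    /* x^8 is congruent to 0x1d, so hi * x^8 folds back as hi * 0x1d.
     * hi has degree <= 3 and 0x1d has degree 4, so the product fits
     * in 8 bits and needs no further reduction. */
    return (uint8_t)(lo ^ (uint8_t)clmul8(hi, 0x1d));
}

/* Exhaustively compare the shortcut with four single multiplications. */
static void check_x4_shortcut(void)
{
    for (int v = 0; v < 256; v++) {
        uint8_t step = (uint8_t)v;

        for (int i = 0; i < 4; i++)
            step = gf_mul_x(step);
        assert(gf_mul_x4((uint8_t)v) == step);
    }
}
```

Calling check_x4_shortcut() from a small test program exercises all 256
byte values. The point of the shortcut is that the number of operations
per byte no longer grows with the number of skipped disks in groups of
four, which is where the gain in the left side optimization would come
from.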
--
Ard.