[PATCH riscv/for-next] crypto: riscv - parallelize AES-CBC decryption
Jerry Shih
jerry.shih at sifive.com
Sun Feb 25 17:40:14 PST 2024
On Feb 11, 2024, at 02:12, Eric Biggers <ebiggers at kernel.org> wrote:
> On Sat, Feb 10, 2024 at 11:25:27PM +0800, Jerry Shih wrote:
>>> .macro aes_cbc_decrypt keylen
>>> + srli LEN, LEN, 2 // Convert LEN from bytes to words
>>> vle32.v v16, (IVP) // Load IV
>>> 1:
>>> - vle32.v v17, (INP) // Load ciphertext block
>>> - vmv.v.v v18, v17 // Save ciphertext block
>>> - aes_decrypt v17, \keylen // Decrypt
>>> - vxor.vv v17, v17, v16 // XOR with IV or prev ciphertext block
>>> - vse32.v v17, (OUTP) // Store plaintext block
>>> - vmv.v.v v16, v18 // Next "IV" is prev ciphertext block
>>> - addi INP, INP, 16
>>> - addi OUTP, OUTP, 16
>>> - addi LEN, LEN, -16
>>> + vsetvli t0, LEN, e32, m4, ta, ma
>>> + vle32.v v20, (INP) // Load ciphertext blocks
>>> + vslideup.vi v16, v20, 4 // Setup prev ciphertext blocks
>>> + addi t1, t0, -4
>>> + vslidedown.vx v24, v20, t1 // Save last ciphertext block
>>
>> Do we need to setup the `e32, len=t0` for next IV?
>> I think we only need 128bit IV (with VL=4).
>>
>>> + aes_decrypt v20, \keylen // Decrypt the blocks
>>> + vxor.vv v20, v20, v16 // XOR with prev ciphertext blocks
>>> + vse32.v v20, (OUTP) // Store plaintext blocks
>>> + vmv.v.v v16, v24 // Next "IV" is last ciphertext block
>>
>> Same VL issue here.
>
> It's true that the vslidedown.vx and vmv.v.v only need vl=4. But it also works
> fine with vl unchanged. It just results in some extra data being moved in the
> registers. My hypothesis is that this is going to be faster than having the
> three extra instructions per loop iteration to change the vl to 4 twice.
>
> I still have no real hardware to test on, so I have no quantitative data. All I
> can do is go with my instinct which is that the shorter version will be better.
>
> If you have access to a real CPU that supports the RISC-V vector crypto
> extensions, I'd be interested in the performance you get from each variant.
> (Of course, different RISC-V CPU implementations may have quite different
> performance characteristics, so that still won't be definitive.)
Hi Eric,
Thank you. I don't think the larger vl affects performance significantly. Most of
the work is still in the AES body.
The original implementation is good enough for now.
> In general, this level of micro-optimization probably needs to wait until
> there are a variety of CPUs to test on. We know that parallelizing the
> algorithms is helpful, so we should do that, as this patch does. But the
> effects of small variations in the instruction sequences are currently unclear.
>
> - Eric
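
For readers following the thread, the loop body quoted above can be modelled in a few lines of Python. This is only a sketch under stated assumptions: `toy_decrypt_block` is a hypothetical stand-in for `aes_decrypt` (it is not AES), vector registers are modelled as lists of 32-bit words, `max_vl=16` assumes VLEN=128 with LMUL=4, and every function name here is made up for illustration. It compares the slide-based loop against the serial loop, and compares moving all vl elements for the next IV against moving only 4.

```python
MASK = 0xFFFFFFFF  # registers are modelled as lists of 32-bit words


def toy_encrypt_block(block, key):
    # Hypothetical invertible 128-bit block function. NOT AES.
    return [(w + k) & MASK for w, k in zip(block, key)]


def toy_decrypt_block(block, key):
    # Inverse of toy_encrypt_block; stands in for aes_decrypt.
    return [(w - k) & MASK for w, k in zip(block, key)]


def cbc_encrypt(plain, iv, key):
    prev, out = list(iv), []
    for i in range(0, len(plain), 4):
        prev = toy_encrypt_block([p ^ c for p, c in zip(plain[i:i + 4], prev)], key)
        out += prev
    return out


def cbc_decrypt_serial(cipher, iv, key):
    # Models the removed one-block-per-iteration loop.
    prev, out = list(iv), []
    for i in range(0, len(cipher), 4):
        block = cipher[i:i + 4]
        out += [d ^ p for d, p in zip(toy_decrypt_block(block, key), prev)]
        prev = block  # next "IV" is the previous ciphertext block
    return out


def cbc_decrypt_vector(cipher, iv, key, max_vl=16, iv_move_vl=None):
    # Models the new loop. iv_move_vl=None keeps vl unchanged for the
    # slidedown and the move; iv_move_vl=4 models switching to vl=4 for them.
    v16, out, pos = list(iv), [], 0
    while pos < len(cipher):
        vl = min(len(cipher) - pos, max_vl)   # vsetvli t0, LEN, e32, m4, ta, ma
        v20 = cipher[pos:pos + vl]            # vle32.v v20, (INP)
        v16 = v16[:4] + v20[:vl - 4]          # vslideup.vi v16, v20, 4
        t1 = vl - 4                           # addi t1, t0, -4
        n = vl if iv_move_vl is None else iv_move_vl
        # vslidedown.vx v24, v20, t1. Source elements past vl are modelled
        # as 0 here (a simplification); they are never used afterwards.
        v24 = [v20[i + t1] if i + t1 < vl else 0 for i in range(n)]
        dec = []
        for b in range(0, vl, 4):             # aes_decrypt v20, \keylen
            dec += toy_decrypt_block(v20[b:b + 4], key)
        out += [d ^ p for d, p in zip(dec, v16)]  # vxor.vv + vse32.v
        v16 = v24 + v16[n:]                   # vmv.v.v v16, v24 (n elements)
        pos += vl
    return out


if __name__ == "__main__":
    key = [0x01234567, 0x89ABCDEF, 0x0F1E2D3C, 0x4B5A6978]
    iv = [0xDEADBEEF, 0x00C0FFEE, 0x12345678, 0x9ABCDEF0]
    plain = [(i * 2654435761) & MASK for i in range(36)]  # 9 blocks
    ct = cbc_encrypt(plain, iv, key)
    print(cbc_decrypt_serial(ct, iv, key) == plain)
    print(cbc_decrypt_vector(ct, iv, key) == plain)
    print(cbc_decrypt_vector(ct, iv, key, iv_move_vl=4) == plain)
```

In this model only elements 0..3 of v16 survive the next vslideup.vi, so the two variants differ only in how much data is moved between registers. It says nothing about which one is faster on real hardware.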