[PATCH v4] crypto: riscv/poly1305 - import OpenSSL/CRYPTOGAMS implementation

Tue Jun 24 02:13:49 PDT 2025

>> +.globl	poly1305_init
>> +.type	poly1305_init,\@function
>> +poly1305_init:
>> +#ifdef	__riscv_zicfilp
>> +	lpad	0
>> +#endif
> 
> The 'lpad' instructions aren't present in the upstream CRYPTOGAMS source.

They are.

> If they are necessary, this addition needs to be documented.
> 
> But they appear to be unnecessary.

They had better be there if Control Flow Integrity is on. It's the same
deal as with the endbranch instruction on Intel and hint #34 on ARM. It's
possible that the kernel never engages CFI for itself, in which case all
the mentioned instructions are executed as nops. But note that here
they are compiled conditionally, so if you don't compile the kernel
with -march=..._zicfilp_..., they won't be there.
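
To illustrate the conditional compilation, here is a minimal C sketch of
how the three landing-pad flavours line up. The function name
landing_pad_insn is hypothetical and is not part of the patch. I'm
assuming the usual compiler-defined macros: __riscv_zicfilp as in the
patch, __CET__ as set by -fcf-protection on x86, and
__ARM_FEATURE_BTI_DEFAULT as set by -mbranch-protection on AArch64.

```c
/* Illustration only, not part of the patch: report which landing-pad
 * instruction would be emitted for the current compiler flags. */
static const char *landing_pad_insn(void)
{
#if defined(__riscv_zicfilp)
    return "lpad 0";     /* RISC-V Zicfilp landing pad */
#elif defined(__CET__) && (__CET__ & 1)
    return "endbr64";    /* x86 indirect branch tracking (endbr32 on 32-bit) */
#elif defined(__ARM_FEATURE_BTI_DEFAULT)
    return "hint #34";   /* AArch64 BTI, same encoding as "bti c" */
#else
    return "";           /* CFI not requested: nothing is emitted */
#endif
}
```

Without the corresponding compiler flag the last branch is taken, which
mirrors the #ifdef in the patch: no flag, no instruction.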

>> +#ifndef	__CHERI_PURE_CAPABILITY__
>> +	andi	$tmp0,$inp,7		# $inp % 8
>> +	andi	$inp,$inp,-8		# align $inp
>> +	slli	$tmp0,$tmp0,3		# byte to bit offset
>> +#endif
>> +	ld	$in0,0($inp)
>> +	ld	$in1,8($inp)
>> +#ifndef	__CHERI_PURE_CAPABILITY__
>> +	beqz	$tmp0,.Laligned_key
>> +
>> +	ld	$tmp2,16($inp)
>> +	neg	$tmp1,$tmp0		# implicit &63 in sll
>> +	srl	$in0,$in0,$tmp0
>> +	sll	$tmp3,$in1,$tmp1
>> +	srl	$in1,$in1,$tmp0
>> +	sll	$tmp2,$tmp2,$tmp1
>> +	or	$in0,$in0,$tmp3
>> +	or	$in1,$in1,$tmp2
>> +
>> +.Laligned_key:
> 
> This code is going through a lot of trouble to work on RISC-V CPUs that don't
> support efficient misaligned memory accesses.  That includes issuing loads of
> memory outside the bounds of the given buffer, which is questionable (even if
> it's guaranteed to not cross a page boundary).

It's indeed guaranteed to cross neither a page *nor* even a cache-line
boundary. Hence the aligned loads can't trigger any externally observable
side effects that the corresponding unaligned loads wouldn't. What is the
concern otherwise? [Do note that the boundaries are not crossed on a
boundary-enforceable CHERI platform ;-)]
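
For readers who don't speak perlasm, the quoted sequence can be sketched
in C roughly as follows. This is an illustration under stated
assumptions, not the code in the patch: the helper name load128_aligned
is hypothetical, the target is assumed to be little-endian, and the
caller is assumed to guarantee that the aligned 8-byte words covering
the 16 input bytes are readable (which the test buffer below arranges
explicitly, since plain C has no licence to read past an object).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: fetch 16 bytes at inp using only 8-byte-aligned
 * loads. Assumes little-endian and that every aligned word touching
 * inp[0..15] is readable. */
static void load128_aligned(const unsigned char *inp, uint64_t out[2])
{
    assert(inp != NULL);

    uintptr_t misalign = (uintptr_t)inp & 7;          /* inp % 8 */
    const unsigned char *base = inp - misalign;       /* align inp */
    unsigned int shift = (unsigned int)misalign * 8;  /* byte to bit offset */
    uint64_t w0, w1, w2;

    memcpy(&w0, base, 8);
    memcpy(&w1, base + 8, 8);
    if (shift == 0) {          /* aligned input: two loads suffice */
        out[0] = w0;
        out[1] = w1;
        return;
    }
    /* The third word shares its aligned 8-byte block with inp[15]. */
    memcpy(&w2, base + 16, 8);
    out[0] = (w0 >> shift) | (w1 << (64 - shift));
    out[1] = (w1 >> shift) | (w2 << (64 - shift));
}
```

The third load only happens on the misaligned path, and it always lands
in the same aligned 8-byte word as the last input byte, which is why no
page or cache-line boundary can be crossed.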

> The rest of the kernel's RISC-V crypto code, which is based on the vector
> extension, just assumes that efficient misaligned memory accesses are supported.

Was it tested on real hardware though? I wonder what hardware is out 
there that supports the vector crypto extensions?

> On a related topic, if this patch is accepted, the result will be inconsistent
> optimization of ChaCha vs. Poly1305, which are usually paired:

https://github.com/dot-asm/cryptogams/blob/master/riscv/chacha-riscv.pl

>      (1) ChaCha optimized with the RISC-V vector extension
>      (2) Poly1305 optimized with RISC-V scalar instructions
> 
> Surely a RISC-V vector extension optimized Poly1305 is going to be needed too?

I'm a "test-on-hardware" guy. I've got a Spacemit X60, which has a working
256-bit base vector implementation. I have a "teaser" ChaCha vector
implementation that currently performs *worse* than the scalar one, by
more than a factor of two. I'm working on improving it. For reference,
one has to recognize that cryptographic algorithms customarily have short
dependencies, which means that performance is dominated by instruction
latencies. There might or might not be ways to match the scalar
performance. Of course, even if it turns out to be impossible on this
processor, it doesn't mean that it won't make sense to keep the vector
implementation, because other processors might do better. In other
words, it's coming...

> But with that being the case, will a RISC-V scalar optimized Poly1305 actually
> be worthwhile to add too?  Especially without optimized ChaCha alongside it?

Yes, because vector implementations are inefficient on short inputs, and
having a compatible scalar fall-back for short inputs is more than
appropriate. In other words, starting with scalar implementations is a
sensible and perfectly meaningful step.

Cheers.