[PATCH v4] riscv: Fixed misaligned memory access. Fixed pointer comparison.
Michael T. Kloos
michael at michaelkloos.com
Sat Jan 29 13:07:44 PST 2022
Thank you for your feedback, David.
> No point jumping to a conditional branch that jumps back.
> Make this a:
> bne t3, t6, 1b
> and move 1: down one instruction.
> (Or is the 'beq' at the top even possible - there is likely to
> be an earlier test for zero length copies.)
There are two checks on the length. One returns if it is zero. The
other forces byte-copy if the length is below SZREG. Even so, it is
possible for the 'beq' to be true on the first iteration.
This would happen with something like "memmove(0x02, 0x13, 5)" on
32-bit. At no point would there be the ability to do a full
register-width store. So the misaligned-copy path would be taken,
because src and dest are not co-aligned and the size is greater
than or equal to SZREG. The misaligned copy would first call the
byte-copy sub-function to bring dest into natural SZREG alignment.
From there, the misaligned-copy loop would start, but it would
immediately break on the 'beq' being true because there are fewer
than SZREG bytes left to copy. So it would jump to byte-copy, which
would again copy any remaining unaligned bytes (if any exist).
One alternative is to make the earlier copy-length check force
byte-copy for any size less than 2 * SZREG, instead of just SZREG.
This would guarantee that the loop would always be able to run at
least once before breaking. If you think I should make this change,
let me know.
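If I did go that way, the check would look roughly like this (a
sketch only; the a2 length register and the byte_copy label here
are assumptions about the surrounding code, not names taken from
the patch):

        li      t0, 2 * SZREG
        bltu    a2, t0, byte_copy  /* below 2 * SZREG: byte-copy only */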
I also gave a lot of thought, when implementing the register-value
rollover, to making sure I would not accidentally page fault in the
kernel. I needed to make sure that those first loads, since they
occur before the conditional checks, would never touch memory
outside the page boundaries. Given that this code path is only
taken when the copy length is greater than zero and the low bits of
dest differ from the low bits of src (they are not co-aligned),
bytes from within that SZREG-aligned region would have had to be
loaded anyway. And since pages are aligned to at least 0x1000
boundaries, I was confident that I would stay within bounds even
when right up against the edge.
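To make that concrete, take the same hypothetical call as above,
"memmove(0x02, 0x13, 5)" with SZREG == 4, and any source byte the
copy needs, say the one at 0x13:

        src                 = 0x13
        src & ~(SZREG - 1)  = 0x10   (SZREG-aligned load address)

The aligned load reads bytes 0x10..0x13. Pages are aligned to at
least 0x1000, so 0x10..0x13 sit in the same page as 0x13 itself;
the load cannot fault on a page that the copy would not have had to
touch anyway.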
I will make the change from j to bne instructions in the applicable
loops.
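Applied to the loop quoted below, my understanding of that change
is roughly the following (same registers as the v4 code, only the
branch structure moved around):

        REG_L   t0, 0x000(a3)
        beq     t3, t6, 2f      /* nothing to copy, see above */
1:
        mv      t1, t0
        REG_L   t0, SZREG(a3)
        srl     t1, t1, a6
        sll     t2, t0, a7
        or      t1, t1, t2
        REG_S   t1, 0x000(t3)
        addi    a3, a3, SZREG
        addi    t3, t3, SZREG
        bne     t3, t6, 1b
2: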
> I also suspect it is worth unrolling the loop once.
> You lose the 'mv t1, t0' and one 'addi' for each word transferred.
I'll explore this possibility, but to my previous point, I need
to be careful about bounds checking. Until I try to write it, I
can't be sure if it will make sense. I'll see what I can come up
with. No promises.
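For reference, the rough shape I have in mind for unrolling once is
below. The open question is the tail handling when an odd number of
words remain, which is exactly the part I am unsure about, so treat
this strictly as a sketch:

1:
        REG_L   t1, SZREG(a3)           /* word i+1 */
        srl     t2, t0, a6
        sll     t0, t1, a7              /* t0 dead here, reuse as scratch */
        or      t2, t2, t0
        REG_S   t2, 0x000(t3)
        REG_L   t0, 2*SZREG(a3)         /* word i+2 */
        srl     t2, t1, a6
        sll     t1, t0, a7              /* t1 reused the same way */
        or      t2, t2, t1
        REG_S   t2, SZREG(t3)
        addi    a3, a3, 2*SZREG
        addi    t3, t3, 2*SZREG
        bne     t3, t6, 1b

This drops the 'mv' and halves the 'addi' and branch overhead per
word, as you suggested, at the cost of requiring the remaining
length to be a multiple of 2 * SZREG.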
> I think someone mentioned that there is a few clocks delay before
> the data from the memory read (REG_L) is actually available.
> On an in-order cpu this is likely to be a full pipeline stall.
> So move the 'addi' up between the 'REG_L' and 'sll' instructions.
> (The offset will need to be -SZREG to match.)
I try to keep loads as far away from their first use as is
reasonable to avoid this. I'll make this change in the next version
of the patch.
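Concretely, I read the suggestion as hoisting the pointer
increments into the load-use gap and compensating in the store
offset, along these lines (my reading of it, not final code):

        mv      t1, t0
        REG_L   t0, SZREG(a3)
        addi    t3, t3, SZREG   /* hoisted into the load shadow */
        addi    a3, a3, SZREG
        srl     t1, t1, a6
        sll     t2, t0, a7
        or      t1, t1, t2
        REG_S   t1, -SZREG(t3)  /* -SZREG because t3 already advanced */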
Michael
On 1/29/22 08:41, David Laight wrote:
> From: michael at michaelkloos.com
> ...
>> [v4]
>>
>> I could not resist implementing the optimization I mentioned in
>> my v3 notes. I have implemented the rollover of data through a cpu
>> register in the misaligned fixup copy loops. Now, only one load
>> from memory is required per iteration of the loop.
> I nearly commented...
>
> ...
>> +        /*
>> +         * Fix Misalignment Copy Loop.
>> +         * load_val1 = load_ptr[0];
>> +         * while (store_ptr != store_ptr_end) {
>> +         *         load_val0 = load_val1;
>> +         *         load_val1 = load_ptr[1];
>> +         *         *store_ptr = (load_val0 >> {a6}) | (load_val1 << {a7});
>> +         *         load_ptr++;
>> +         *         store_ptr++;
>> +         * }
>> +         */
>> +        REG_L   t0, 0x000(a3)
>> +1:
>> +        beq     t3, t6, 2f
>> +        mv      t1, t0
>> +        REG_L   t0, SZREG(a3)
>> +        srl     t1, t1, a6
>> +        sll     t2, t0, a7
>> +        or      t1, t1, t2
>> +        REG_S   t1, 0x000(t3)
>> +        addi    a3, a3, SZREG
>> +        addi    t3, t3, SZREG
>> +        j       1b
> No point jumping to a conditional branch that jumps back.
> Make this a:
> bne t3, t6, 1b
> and move 1: down one instruction.
> (Or is the 'beq' at the top even possible - there is likely to
> be an earlier test for zero length copies.)
>> +2:
> I also suspect it is worth unrolling the loop once.
> You lose the 'mv t1, t0' and one 'addi' for each word transferred.
>
> I think someone mentioned that there is a few clocks delay before
> the data from the memory read (REG_L) is actually available.
> On an in-order cpu this is likely to be a full pipeline stall.
> So move the 'addi' up between the 'REG_L' and 'sll' instructions.
> (The offset will need to be -SZREG to match.)
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>