[PATCH v4] riscv: Fixed misaligned memory access. Fixed pointer comparison.
Michael T. Kloos
michael at michaelkloos.com
Sat Jan 29 13:07:44 PST 2022
Thank you for your feedback, David.
> No point jumping to a conditional branch that jumps back.
> Make this a:
> bne t3, t6, 1b
> and move 1: down one instruction.
> (Or is the 'beq' at the top even possible - there is likely to
> be an earlier test for zero length copies.)
There are two checks on the length. One returns if it is zero. The
other forces byte-copy if the length is below SZREG. Even so, it is
possible for the 'beq' to be true on the first iteration.
This would happen with something like "memmove(0x02, 0x13, 5)" on
32-bit. At no point would there be the ability to do a full
register-width store. So the misaligned-copy path would be taken,
because src and dest are not co-aligned and the size is greater
than or equal to SZREG. The misaligned copy would first call the
byte-copy sub-function to bring dest into natural SZREG alignment.
From there, the misaligned-copy loop would start, but it would
immediately break on the 'beq' being true because there are fewer
than SZREG bytes left to copy. So it would jump to byte-copy, which
would again copy any remaining unaligned bytes (if any exist).
One alternative is to make the earlier copy-length check force
byte-copy for any size less than 2 * SZREG, instead of just SZREG.
This would guarantee that the loop would always be able to run at
least once before breaking. If you think I should make this change,
let me know.
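If I did go that way, the check would look roughly like this (a
sketch only; the a2 length register and the byte_copy label here
are assumptions about the surrounding code, not names taken from
the patch):

        li      t0, 2 * SZREG
        bltu    a2, t0, byte_copy  /* below 2 * SZREG: byte-copy only */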
I also gave a lot of thought, when implementing the register-value
rollover, to making sure I would not accidentally page fault in the
kernel. I needed to make sure that those first loads, since they
occur before the conditional checks, would never touch memory
outside the page boundaries. Given that this code path is only
taken when the copy length is greater than zero and the low bits of
dest differ from the low bits of src (they are not co-aligned),
bytes from within that SZREG-aligned region would have had to be
loaded anyway. And since pages are aligned to at least 0x1000
boundaries, I was confident that I would stay within bounds even
when right up against the edge.
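To make that concrete, take the same hypothetical call as above,
"memmove(0x02, 0x13, 5)" with SZREG == 4, and any source byte the
copy needs, say the one at 0x13:

        src                 = 0x13
        src & ~(SZREG - 1)  = 0x10   (SZREG-aligned load address)

The aligned load reads bytes 0x10..0x13. Pages are aligned to at
least 0x1000, so 0x10..0x13 sit in the same page as 0x13 itself;
the load cannot fault on a page that the copy would not have had to
touch anyway.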
I will make the change from j to bne instructions in the applicable
loops.
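Applied to the loop quoted below, my understanding of that change
is roughly the following (same registers as the v4 code, only the
branch structure moved around):

        REG_L   t0, 0x000(a3)
        beq     t3, t6, 2f      /* nothing to copy, see above */
1:
        mv      t1, t0
        REG_L   t0, SZREG(a3)
        srl     t1, t1, a6
        sll     t2, t0, a7
        or      t1, t1, t2
        REG_S   t1, 0x000(t3)
        addi    a3, a3, SZREG
        addi    t3, t3, SZREG
        bne     t3, t6, 1b
2: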
> I also suspect it is worth unrolling the loop once.
> You lose the 'mv t1, t0' and one 'addi' for each word transferred.
I'll explore this possibility, but to my previous point, I need
to be careful about bounds checking. Until I try to write it, I
can't be sure if it will make sense. I'll see what I can come up
with. No promises.
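For reference, the rough shape I have in mind for unrolling once is
below. The open question is the tail handling when an odd number of
words remain, which is exactly the part I am unsure about, so treat
this strictly as a sketch:

1:
        REG_L   t1, SZREG(a3)           /* word i+1 */
        srl     t2, t0, a6
        sll     t0, t1, a7              /* t0 dead here, reuse as scratch */
        or      t2, t2, t0
        REG_S   t2, 0x000(t3)
        REG_L   t0, 2*SZREG(a3)         /* word i+2 */
        srl     t2, t1, a6
        sll     t1, t0, a7              /* t1 reused the same way */
        or      t2, t2, t1
        REG_S   t2, SZREG(t3)
        addi    a3, a3, 2*SZREG
        addi    t3, t3, 2*SZREG
        bne     t3, t6, 1b

This drops the 'mv' and halves the 'addi' and branch overhead per
word, as you suggested, at the cost of requiring the remaining
length to be a multiple of 2 * SZREG.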
> I think someone mentioned that there is a few clocks delay before
> the data from the memory read (REG_L) is actually available.
> On an in-order cpu this is likely to be a full pipeline stall.
> So move the 'addi' up between the 'REG_L' and 'sll' instructions.
> (The offset will need to be -SZREG to match.)
I try to keep loads as far away from their first use as is
reasonable to avoid this. I'll make this change in the next version
of the patch.
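Concretely, I read the suggestion as hoisting the pointer
increments into the load-use gap and compensating in the store
offset, along these lines (my reading of it, not final code):

        mv      t1, t0
        REG_L   t0, SZREG(a3)
        addi    t3, t3, SZREG   /* hoisted into the load shadow */
        addi    a3, a3, SZREG
        srl     t1, t1, a6
        sll     t2, t0, a7
        or      t1, t1, t2
        REG_S   t1, -SZREG(t3)  /* -SZREG because t3 already advanced */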
Michael
On 1/29/22 08:41, David Laight wrote:
> From: michael at michaelkloos.com
> ...
>> [v4]
>>
>> I could not resist implementing the optimization I mentioned in
>> my v3 notes. I have implemented the rollover of data through a cpu
>> register in the misaligned fixup copy loops. Now, only one load
>> from memory is required per iteration of the loop.
> I nearly commented...
>
> ...
>> +        /*
>> +         * Fix Misalignment Copy Loop.
>> +         * load_val1 = load_ptr[0];
>> +         * while (store_ptr != store_ptr_end) {
>> +         *         load_val0 = load_val1;
>> +         *         load_val1 = load_ptr[1];
>> +         *         *store_ptr = (load_val0 >> {a6}) | (load_val1 << {a7});
>> +         *         load_ptr++;
>> +         *         store_ptr++;
>> +         * }
>> +         */
>> +        REG_L   t0, 0x000(a3)
>> +1:
>> +        beq     t3, t6, 2f
>> +        mv      t1, t0
>> +        REG_L   t0, SZREG(a3)
>> +        srl     t1, t1, a6
>> +        sll     t2, t0, a7
>> +        or      t1, t1, t2
>> +        REG_S   t1, 0x000(t3)
>> +        addi    a3, a3, SZREG
>> +        addi    t3, t3, SZREG
>> +        j       1b
> No point jumping to a conditional branch that jumps back.
> Make this a:
> bne t3, t6, 1b
> and move 1: down one instruction.
> (Or is the 'beq' at the top even possible - there is likely to
> be an earlier test for zero length copies.)
>> +2:
> I also suspect it is worth unrolling the loop once.
> You lose the 'mv t1, t0' and one 'addi' for each word transferred.
>
> I think someone mentioned that there is a few clocks delay before
> the data from the memory read (REG_L) is actually available.
> On an in-order cpu this is likely to be a full pipeline stall.
> So move the 'addi' up between the 'REG_L' and 'sll' instructions.
> (The offset will need to be -SZREG to match.)
>
> David
>
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
> Registration No: 1397386 (Wales)
>