[PATCH v4] lib: sbi: Improve csr read and write

dramforever dramforever at live.com
Wed Jun 22 00:12:45 PDT 2022


On 6/21/22 23:53, Xiang W wrote:
>> On Thu, 2022-05-26 at 16:28 +0100, Jessica Clarke wrote:
>> On 26 May 2022, at 16:19, Xiang W <wxjstz at 126.com> wrote:
>>> On Thu, 2022-05-26 at 20:35 +0530, Anup Patel wrote:
>>>> On Thu, May 26, 2022 at 8:12 PM Xiang W <wxjstz at 126.com> wrote:
>>>>> <patch contents snipped>
>>>> I agree we need FENCE.I in the path, but this is
>>>> like a big hammer in the execution path.
>>>>
>>>> This is now much slower compared to the existing
>>>> approach, because every CSR access using this
>>>> function will execute FENCE.I.
>>> I passed the QEMU test, and it runs normally with
>>> or without FENCE.I. But I'm not sure whether the
>>> exception happens on other platforms, or whether we
>>> have a more lightweight way.
>> There is not. Optimising this kind of bad-practice code is not a goal
>> of modern architectures. After all, it totally precludes having W^X
>> protection in OpenSBI (e.g. via the PMP, or via CHERI), especially
>> since executable stacks form a key part of *the* classic buffer
>> overflow shellcode attack.
>>
>> Jess
> Currently OpenSBI does not have strict RWX permission control. We only
> need to ensure that the W and X permissions never exist at the same time
> to avoid security risks. If OpenSBI gains RWX permission control, we only
> need to change the permission of opcode_buff before and after running it:
> the stack defaults to RW permission, so we can change opcode_buff to RX
> before the call and back to RW after the call completes.
>
> I think this work can be done after OpenSBI has RWX permission control.
>
> Regards,
> Xiang W

But in that case, wouldn't this codegen approach be even slower? Is
there any particular reason to believe that it will be faster than the
current switch-case? A benchmark on existing silicon would be nice.
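
For reference, here is a rough sketch of my reading of the codegen
approach (not the patch itself; the function name, buffer layout and
encodings below are only illustrative):

/*
 * Sketch: emit "csrr a0, <csr>; ret" into a stack buffer, make the
 * stores visible to instruction fetch with FENCE.I, then call the
 * buffer as a function.
 */
static unsigned long csr_read_gen(unsigned int csr_num)
{
	unsigned long (*fn)(void);
	unsigned int opcode_buff[2];

	/* csrrs a0, csr_num, x0  (i.e. csrr a0, csr_num) */
	opcode_buff[0] = (csr_num << 20) | (2U << 12) | (10U << 7) | 0x73;
	/* jalr x0, 0(x1)  (i.e. ret) */
	opcode_buff[1] = 0x00008067;

	/* Every single CSR access pays for this FENCE.I ... */
	asm volatile ("fence.i" ::: "memory");

	/* ... and the stack must be executable for this call to work,
	 * which is the W^X problem raised above. */
	fn = (unsigned long (*)(void))(unsigned long)opcode_buff;
	return fn();
}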

The original implementation compiles down to a jump table, with a few
branches and bounds checks at the beginning. It really is just a hunch,
but I find it hard to see how generating code at runtime could be faster.

Optimizing the current switch-case approach into a better jump table
seems to be the way to go here. Currently the compiler does not seem to
take advantage of the fact that every csr* instruction is exactly 4
bytes. There also seem to be a few extra instructions for dealing with
int instead of long, though I am not sure. Furthermore, having separate
functions for hpmcounter, pmpcfg, etc. might help avoid some branches,
and possibly also make the code clearer at the call site.
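
As a strawman (the helper name below is hypothetical, not an existing
OpenSBI function; I am assuming the usual csr_read() macro and CSR_*
encodings), splitting the switch per CSR group could look roughly like
this:

/*
 * A dense, contiguous case range such as the hpmcounters should
 * compile to one bounds check plus a jump table, where each case body
 * is essentially a single 4-byte csrr.
 */
static unsigned long hpmcounter_read(unsigned int csr_num)
{
	switch (csr_num) {
	case CSR_HPMCOUNTER3:
		return csr_read(CSR_HPMCOUNTER3);
	case CSR_HPMCOUNTER4:
		return csr_read(CSR_HPMCOUNTER4);
	/* ... and so on through CSR_HPMCOUNTER31 ... */
	default:
		return 0;
	}
}

The call sites would then dispatch once on the CSR group instead of
funnelling every CSR number through one big switch, and each per-group
switch stays dense enough for the compiler to emit a jump table.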

Regards,
dram
