[PATCH v4] lib: sbi: Improve csr read and write

Wed Jun 22 07:57:47 PDT 2022

在 2022-06-22星期三的 15:12 +0800，dramforever写道：
> 
> On 6/21/22 23:53, Xiang W wrote:
> > 在 2022-05-26星期四的 16:28 +0100，Jessica Clarke写道：
> > > On 26 May 2022, at 16:19, Xiang W <wxjstz at 126.com> wrote:
> > > > 在 2022-05-26星期四的 20:35 +0530，Anup Patel写道：
> > > > > On Thu, May 26, 2022 at 8:12 PM Xiang W <wxjstz at 126.com> wrote:
> > > > > > <patch contents snipped>
> > > > > I agree we need FENCE.I in the path but this is
> > > > > like a big hammer in the execution path.
> > > > > 
> > > > > This is now much slower compared to the existing
> > > > > approach because every CSR access using this
> > > > > function will execute FENCE.I
> > > > I passed the qemu test, and it can be run normally with
> > > > or without fence.i. But I'm not sure if the exception
> > > > happens on other platform or we have a more lightweight
> > > > way.
> > > There is not. Optimising this kind of bad practice code is not a goal
> > > of modern architectures. After all, it totally precludes having W^X
> > > protection in OpenSBI (e.g. via the PMP, or via CHERI), especially
> > > executable stacks form a key part of *the* classic buffer overflow
> > > shell code attack.
> > > 
> > > Jess
> > Currently opensbi does not have strict RWX permission control. We only 
> > need W and X permissions not to exist at the same time to avoid security
> > risks. If opensbi has RWX permission control, we only need to change the
> > permission of opcode_buff before and after running. The stack default is
> > RW permission, we can change the permission of opcode_buff to RX before
> > the call, and change the permission to RW after the call is completed.
> > 
> > I think these jobs can be put after opensbi has rwx permission control.
> > 
> > Regards,
> > Xiang W
> 
> But in that case, wouldn't this codegen approach be even slower? Is
> there any particular reason to believe that this will be faster than the
> current switch-case? Some benchmark on existing silicon would be nice.

Yes, it will definitely be slower. This will only reduce the maintenance
cost of csr_read_num/csr_write_num, without adding code to read a new csr.

Regards,
Xiang W
> 
> The original implementation compiles down a jump table, with several
> branches and bound checks at the beginning. It really is just a hunch,
> but I find it difficult to be convinced how generating code could be faster.
> 
> Optimizing the current switch-case approach into a better jump table
> seems to be the way to go here. Currently the compiler does not seem to
> take advantage of the fact that every csr* instruction is exactly 4
> bytes. There also seems to be a few extra instructions for dealing with
> int instead of long, though I am not sure. Furthermore, having separate
> functions for hpmcounter, pmpcfg, etc. might help with avoiding some
> branches, and possibly also make code clearer at call site.
> 
> Regards,
> dram
>