[PATCH v4 2/2] riscv: Set unalignment speed at compile time

Wed Feb 28 00:04:58 PST 2024

On 27/02/2024 20:44, Evan Green wrote:
> On Tue, Feb 27, 2024 at 11:20 AM Charlie Jenkins <charlie at rivosinc.com> wrote:
>>
>> On Tue, Feb 27, 2024 at 06:48:54PM +0000, Conor Dooley wrote:
>>> On Tue, Feb 27, 2024 at 10:17:21AM -0800, Charlie Jenkins wrote:
>>>> On Tue, Feb 27, 2024 at 11:39:25AM +0000, Conor Dooley wrote:
>>>>> On Fri, Feb 16, 2024 at 12:33:19PM -0800, Charlie Jenkins wrote:
>>>
>>>>>> +config RISCV_EMULATED_UNALIGNED_ACCESS
>>>>>> +       bool "Assume the CPU expects emulated unaligned memory accesses"
>>>>>> +       depends on NONPORTABLE
>>>>>
>>>>> This is portable too, right?
>>>>
>>>> I guess so? I think I would prefer to have the probing being the only
>>>> portable option.
>>>
>>> I dunno, I think there could be value to someone in always emulating
>>> this in the kernel and I don't think that should relegate them to the
>>> naughty step, given it can work everywhere.
>>
>> Alright, I will remove the nonportable.
>>
>>>
>>>
>>>>>> +config RISCV_SLOW_UNALIGNED_ACCESS
>>>>>> +       bool "Assume the CPU supports slow unaligned memory accesses"
>>>>>> +       depends on NONPORTABLE
>>>>>> +       help
>>>>>> +         Assume that the CPU supports slow unaligned memory accesses. When
>>>>>> +         enabled, this option improves the performance of the kernel on such
>>>>>> +         CPUs.
>>>>>
>>>>> Does it? Are you sure that generating unaligned accesses on systems
>>>>> where they are slow is a performance increase?
>>>>> That said, I don't really see this option actually doing anything other
>>>>> than setting the value for hwprobe, so I don't actually know what the
>>>>> effect of this option actually is on the kernel's performance.
>>>>>
>>>>> Generally I would like to suggest a change from "CPU" to "system" here,
>>>>> since the slow cases that exist are mostly because the unaligned access
>>>>> is actually emulated in firmware.
>>>>
>>>> It would be ideal if "emulated" was used for any case of emulated
>>>> accesses (firmware or in the kernel).  Doing emulated accesses will be
>>>> orders of magnitude slower than a processor that "slowly" handles the
>>>> accesses.
>>>>
>>>> So even if the processor performs a "slow" access, it could still be
>>>> beneficial for the kernel to do the misaligned access rather than
>>>> manually do the alignment.
>>>
>>> Right. But, at least from a probing perspective, SLOW is what gets
>>> selected when firmware emulates the unaligned access so to userspace
>>> seeing slow means that the performance could be horrifically bad:
>>>
>>> |     rzfive:
>>> |         cpu0: Ratio of byte access time to unaligned word access is
>>> | 1.05, unaligned accesses are fast
>>> |
>>> |     icicle:
>>> |
>>> |         cpu1: Ratio of byte access time to unaligned word access is
>>> | 0.00, unaligned accesses are slow
>>> |         cpu2: Ratio of byte access time to unaligned word access is
>>> | 0.00, unaligned accesses are slow
>>> |         cpu3: Ratio of byte access time to unaligned word access is
>>> | 0.00, unaligned accesses are slow
>>> |
>>> |         cpu0: Ratio of byte access time to unaligned word access is
>>> | 0.00, unaligned accesses are slow
>>> |
>>> |     k210:
>>> |
>>> |         cpu1: Ratio of byte access time to unaligned word access is
>>> | 0.02, unaligned accesses are slow
>>> |         cpu0: Ratio of byte access time to unaligned word access is
>>> | 0.02, unaligned accesses are slow
>>> |
>>> |     starlight:
>>> |
>>> |         cpu1: Ratio of byte access time to unaligned word access is
>>> | 0.01, unaligned accesses are slow
>>> |         cpu0: Ratio of byte access time to unaligned word access is
>>> | 0.02, unaligned accesses are slow
>>> |
>>> |     vexriscv/orangecrab:
>>> |
>>> |         cpu0: Ratio of byte access time to unaligned word access is
>>> | 0.00, unaligned accesses are slow
>>>  https://lore.kernel.org/all/CAMuHMdVtXGjP8VFMiv-7OMFz1XvfU1cz=Fw4jL3fcp4wO1etzQ@mail.gmail.com/
>>
>> If the accesses are horrifically slow then maybe they should be flagged
>> as emulated rather than slow by the probe.
> 
> Yeah, I thought about that too. I didn't feel like I had enough info
> to come up with the delineating number for "horrifically slow". Plus
> Clement came in with a series to detect specifically that accesses are
> emulated (though it will only work on future platforms that can
> delegate the trap to the kernel).

Yes, the delegation request mechanism should be part of SBI 3.0. At that
point we should be able to properly detect whether accesses are emulated
or slow (provided the SBI implementation supports the new extension).

Clément

> 
> -Evan
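
For readers trying to interpret the probe output quoted above, here is a
minimal sketch of how a "fast" or "slow" verdict can be derived from two
timings. It assumes the verdict depends only on whether the unaligned word
accesses took fewer cycles than the equivalent byte accesses, which is
consistent with the quoted ratios (1.05 is fast, 0.02 and below are slow).
All names here are hypothetical and chosen for illustration; they are not
the kernel's identifiers, and no "emulated" threshold is modelled because
the thread does not settle on one.

```c
#include <stdint.h>

/* Hypothetical names for illustration only. */
enum misaligned_speed {
	MISALIGNED_SLOW,
	MISALIGNED_FAST,
};

/*
 * Ratio as printed in the quoted log, scaled by 100: time spent on byte
 * accesses divided by time spent on unaligned word accesses.
 * Assumes byte_cycles is small enough that byte_cycles * 100 fits in 64 bits.
 */
static uint64_t speed_ratio_x100(uint64_t byte_cycles, uint64_t word_cycles)
{
	if (word_cycles == 0)
		return UINT64_MAX;	/* avoid dividing by zero */
	return (byte_cycles * 100) / word_cycles;
}

/* Assumption: "fast" means the unaligned word accesses beat the byte accesses. */
static enum misaligned_speed classify(uint64_t byte_cycles, uint64_t word_cycles)
{
	return word_cycles < byte_cycles ? MISALIGNED_FAST : MISALIGNED_SLOW;
}
```

With this scaling, a printed ratio of 0.00 only means the unaligned word
accesses were more than 100 times slower than the byte accesses. That is the
case discussed above, where "slow" can hide firmware emulation.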