[PATCH] KVM: arm64: Make the exposed feature bits in AA64DFR0_EL1 writable from userspace

Thu Nov 28 01:31:08 PST 2024

Hi Marc,

On 11/26/24 20:29, Marc Zyngier wrote:
> On Tue, 26 Nov 2024 17:00:35 +0000,
> Sebastian Ott <sebott at redhat.com> wrote:
>>
>> Hi,
>>
>> On Wed, 14 Aug 2024, Shameerali Kolothum Thodi wrote:
>>>>
>>>> On Tue, 13 Aug 2024 15:28:35 +0100,
>>>> Shameer Kolothum <shameerali.kolothum.thodi at huawei.com> wrote:
>>>>>
>>>>> KVM exposes the OS double lock feature bit to Guests but returns
>>>>> RAZ/WI on Guest OSDLR_EL1 access. This breaks Guest migration between
>>>>> systems where support for this feature differs. Add support to make this
>>>>> feature writable from userspace by setting the mask bit. While at it,
>>>>> set the mask bits for other exposed features in the AA64DFR0_EL1
>>>>> register as well.
>>>>>
>>>>> Also update the selftest to cover these fields.
>>>>>
>>>>> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi at huawei.com>
>>>>> ---
>>>>>    This is based on the discussion here (thanks to Oliver):
>>>>>    https://lore.kernel.org/all/ZrVSlbVwnaMDShah@linux.dev/
>>>>> ---
>>>>>  arch/arm64/kvm/sys_regs.c                         | 6 +++++-
>>>>>  tools/testing/selftests/kvm/aarch64/set_id_regs.c | 4 ++++
>>>>>  2 files changed, 9 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/arch/arm64/kvm/sys_regs.c b/arch/arm64/kvm/sys_regs.c
>>>>> index c90324060436..adb49d681052 100644
>>>>> --- a/arch/arm64/kvm/sys_regs.c
>>>>> +++ b/arch/arm64/kvm/sys_regs.c
>>>>> @@ -2376,7 +2376,11 @@ static const struct sys_reg_desc sys_reg_descs[] = {
>>>>>  	  .get_user = get_id_reg,
>>>>>  	  .set_user = set_id_aa64dfr0_el1,
>>>>>  	  .reset = read_sanitised_id_aa64dfr0_el1,
>>>>> -	  .val = ID_AA64DFR0_EL1_PMUVer_MASK |
>>>>> +	  .val = ID_AA64DFR0_EL1_DoubleLock_MASK |
>>>>> +		 ID_AA64DFR0_EL1_CTX_CMPs_MASK |
>>>>> +		 ID_AA64DFR0_EL1_WRPs_MASK |
>>>>> +		 ID_AA64DFR0_EL1_BRPs_MASK |
>>>>
>>>>
>>>> I think this is going to cause some troubles.
>>>>
>>>> The issue is that context-aware breakpoints are the highest-numbered
>>>> breakpoints, right after the normal breakpoints (D2.8.3 "Breakpoint
>>>> types and linking of breakpoints"). So if you reduce the number of
>>>> normal breakpoints, you shift the context-aware ones down, and
>>>> everything breaks.
>>>
>>> Thanks Marc for explaining this. I was not aware of this one.
>>>
>>>> I really don't see how you can safely do that without completely
>>>> changing the way we handle the debug registers.
>>>
>>> Looks like Reiji attempted to do this a while back,
>>> https://lore.kernel.org/kvm/20220419065544.3616948-13-reijiw@google.com/
>>>
>>
>> I've got two machines that differ in the number of breakpoints and
>> it would be nice to be able to migrate between these. Is anything
> 
> Is that the *only* thing that differs? Do they have the same number of
> context-aware breakpoints?
> 
>> preventing us from trapping the access and making sure the correct
>> breakpoint is used? Is anyone working on this? If not, I'd like to
>> give it a shot.
> 
> Not only trapping. You also need to handle some interesting parts of
> the architecture, such as the breakpoint linking fun.
> 
> But if we are to go down that road, I really want to restrict that to
> implementations that have FEAT_FGT. Because otherwise we need to trap
> and emulate *everything*, instead of just the breakpoint registers.
> And that would be pretty bad from a performance perspective.
> 
> Another thing is that this only works because there is no report of
> the breakpoint number in ESR_ELx. The moment we offer this
> migration "feature", we are painting ourselves in a corner, should the
> architecture ever evolve to something less... bizarre.
> 
> Finally, who is going to ensure this keeps working in the foreseeable
> future? Because while this is nice, that's not what gets deployed in
> production, as it leads to unpredictable performance. My take is that
> this thing will eventually bitrot and die.

In the context of our work to define QEMU vCPU models for Arm
(https://lore.kernel.org/all/20241025101959.601048-1-eric.auger@redhat.com/),
our current approach is to try migrating between modern hardware we have
access to. The case above is migration between AmpereOne and Grace, which
should both be prevalent systems. Do you think it makes no sense at all
to try migrating between those, although this may be challenging?

Other cases we have looked at are migration within the Ampere Altra Max
system family (which should hopefully be fine now that Sebastian's
CTR_EL0 work is upstream) and migration between Graviton hosts.
Regarding Ampere Altra Max to AmpereOne, Oliver pointed out the cntfrq
issue, which is blocking.

Do you think we should restrict our studies to systems which are
"closer" to each other in terms of Arm spec revision? We thought that
migration between AmpereOne and Grace would be an interesting POC and
not totally irrelevant to the industry.

Thanks

Eric
> 
> So, do we *really* want to go down that road?
> 
> 	M.
>