[PATCH 1/2] arm64: errata: Work around AmpereOne's erratum AC03_CPU_36

Mon Apr 28 09:35:03 PDT 2025

Marc Zyngier <maz at kernel.org> writes:

> On Fri, 25 Apr 2025 03:02:29 +0100,
> D Scott Phillips <scott at os.amperecomputing.com> wrote:
>> 
>> Marc Zyngier <maz at kernel.org> writes:
>> 
>> > On Tue, 15 Apr 2025 16:47:10 +0100,
>> > D Scott Phillips <scott at os.amperecomputing.com> wrote:
>> >> 
>> >> AC03_CPU_36 can cause asynchronous exceptions to be routed to the wrong
>> >> exception level if an async exception coincides with an update to the
>> >> controls for the target exception level in HCR_EL2. On affected
>> >> machines, always do writes to HCR_EL2 with async exceptions blocked.
>> >
>> > From the actual errata document [1]:
>> >
>> > <quote>
>> > If an Asynchronous Exception to EL2 occurs, while EL2 software is
>> > changing the EL2 exception control bits from a configuration where
>> > asynchronous exceptions are routed to EL2 to a configuration where
>> > asynchronous exceptions are routed to EL1, the processor may exhibit
>> > the incorrect exception behavior of routing an interrupt taken at EL2
>> > to EL1.  The affected system register is HCR_EL2, which contains
>> > control bits for routing and enabling of EL2 exceptions.
>> > </quote>
>> >
>> > My reading is that things can go wrong when clearing the xMO bits.
>> >
>> > I don't think we need to touch the xMO bits at all when running
>> > VHE. So my preference would be to:
>> >
>> > - simply leave the xMO bits set at all times (nothing bad can happen
>> >   from that, can it?)
>> >
>> > - prevent these systems from using anything but VHE (and fail KVM init
>> >   otherwise)
>> 
>> Hi Marc, I started writing up this patch and then realized that the
>> issue also cannot happen in nVHE mode. While the xMO bits are modified
>> there, async exceptions are always masked, so the "simultaneously
>> take an async exception" part of the erratum can't happen.
>> 
>> Does that sound right to you, or are there cases that I'm missing? If
>> it's right that nVHE also can't hit the erratum case, then what do you
>> think is the right thing for me to do here?
>
> That's an interesting point. We always run the nVHE/hVHE hypervisor
> code with interrupts disabled by virtue of taking an HVC exception
> into EL2, so that particular case seems OK as it literally implements
> the proposed workaround.
>
> However, there's at least one catch: the SError handling code in
> hyp/entry.S relies on clearing PSTATE.A to take a pending abort (the
> so-called VAXorcism). I take it that this CPU implements FEAT_RAS, and
> that we don't need to worry about this code path either, and that the
> erratum cannot trigger on speculatively executed paths?

Yep, right on both counts: the CPU supports FEAT_RAS, and the erratum
case doesn't happen speculatively.

> If we're OK with that, then I don't think there is much to do, other
> than setting the xMO bits at all times, for which I already have a
> patch in review (v2 coming shortly).

OK, sounds good to me.
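
For anyone following the thread, the workaround from the original patch
description ("always do writes to HCR_EL2 with async exceptions blocked")
can be modelled roughly as below. This is only a user-space sketch, not
kernel code: struct sim_cpu, the sim_* helpers and the SIM_DAIF_* mask
values are hypothetical names made up for illustration, and the model
just counts writes that happen while async exceptions are unmasked
instead of touching any real system register. The SIM_HCR_* values
follow the architectural positions of FMO/IMO/AMO in HCR_EL2.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model bits for the async exception masks (not real encodings). */
#define SIM_DAIF_F      (1u << 0)
#define SIM_DAIF_I      (1u << 1)
#define SIM_DAIF_A      (1u << 2)
#define SIM_ASYNC_MASK  (SIM_DAIF_A | SIM_DAIF_I | SIM_DAIF_F)

/* HCR_EL2 routing bits: FMO is bit 3, IMO is bit 4, AMO is bit 5. */
#define SIM_HCR_FMO     (UINT64_C(1) << 3)
#define SIM_HCR_IMO     (UINT64_C(1) << 4)
#define SIM_HCR_AMO     (UINT64_C(1) << 5)
#define SIM_HCR_XMO     (SIM_HCR_FMO | SIM_HCR_IMO | SIM_HCR_AMO)

/* Hypothetical CPU model: a set bit in daif_mask means "that class is masked". */
struct sim_cpu {
	uint32_t daif_mask;
	uint64_t hcr_el2;
	unsigned int unsafe_writes; /* HCR_EL2 writes done with async unmasked */
};

/* Mask all async exceptions and return the previous mask state. */
static uint32_t sim_daif_save_and_mask(struct sim_cpu *cpu)
{
	uint32_t old = cpu->daif_mask;

	cpu->daif_mask |= SIM_ASYNC_MASK;
	return old;
}

/* Restore a mask state previously returned by sim_daif_save_and_mask(). */
static void sim_daif_restore(struct sim_cpu *cpu, uint32_t old)
{
	cpu->daif_mask = old;
}

/* Raw register write; records whether it raced with unmasked async exceptions. */
static void sim_raw_write_hcr(struct sim_cpu *cpu, uint64_t val)
{
	if ((cpu->daif_mask & SIM_ASYNC_MASK) != SIM_ASYNC_MASK)
		cpu->unsafe_writes++;
	cpu->hcr_el2 = val;
}

/* Workaround shape: block async exceptions around the HCR_EL2 update. */
static void sim_write_hcr_safe(struct sim_cpu *cpu, uint64_t val)
{
	uint32_t flags = sim_daif_save_and_mask(cpu);

	sim_raw_write_hcr(cpu, val);
	sim_daif_restore(cpu, flags);
}
```

In this model, clearing the xMO bits through sim_write_hcr_safe() never
bumps unsafe_writes, while calling sim_raw_write_hcr() directly with
async exceptions unmasked does. That mirrors the point made above for
nVHE: the hypervisor code already runs with async exceptions masked, so
its HCR_EL2 updates are effectively the "safe" variant.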