[PATCH v4 12/21] arm64: kernel: Survive corrected RAS errors notified by SError
James Morse
james.morse at arm.com
Thu Nov 2 05:15:11 PDT 2017
Hi Will,
On 31/10/17 13:50, Will Deacon wrote:
> On Thu, Oct 19, 2017 at 03:57:58PM +0100, James Morse wrote:
>> Prior to v8.2, SError is an uncontainable fatal exception. The v8.2 RAS
>> extensions use SError to notify software about RAS errors, these can be
>> contained by the ESB instruction.
>>
>> An ACPI system with firmware-first may use SError as its 'SEI'
>> notification. Future patches may add code to 'claim' this SError as a
>> notification.
>>
>> Other systems can distinguish these RAS errors from the SError ESR and
>> use the AET bits and additional data from RAS-Error registers to handle
>> the error. Future patches may add this kernel-first handling.
>>
>> Without support for either of these we will panic(), even if we received
>> a corrected error. Add code to decode the severity of RAS errors. We can
>> safely ignore contained errors where the CPU can continue to make
>> progress. For all other errors we continue to panic().
>> diff --git a/arch/arm64/include/asm/esr.h b/arch/arm64/include/asm/esr.h
>> index 66ed8b6b9976..8ea52f15bf1c 100644
>> --- a/arch/arm64/include/asm/esr.h
>> +++ b/arch/arm64/include/asm/esr.h
>> @@ -85,6 +85,15 @@
>> #define ESR_ELx_WNR_SHIFT (6)
>> #define ESR_ELx_WNR (UL(1) << ESR_ELx_WNR_SHIFT)
>>
>> +/* Asynchronous Error Type */
>> +#define ESR_ELx_AET (UL(0x7) << 10)
> Can you add a #define for the AET shift in the Srror ISS, please? (we have
> other blocks in this file for different abort types). e.g.
>
> /* ISS fields definitions for SError interrupts */
> #define ESR_ELx_AER_SHIFT 10
>
> then use it below.
Yes, I should have done that..
>> +#define ESR_ELx_AET_UC (UL(0) << 10) /* Uncontainable */
>> +#define ESR_ELx_AET_UEU (UL(1) << 10) /* Uncorrected Unrecoverable */
>> +#define ESR_ELx_AET_UEO (UL(2) << 10) /* Uncorrected Restartable */
>> +#define ESR_ELx_AET_UER (UL(3) << 10) /* Uncorrected Recoverable */
>> +#define ESR_ELx_AET_CE (UL(6) << 10) /* Corrected */
>> +
>> /* Shared ISS field definitions for Data/Instruction aborts */
>> #define ESR_ELx_SET_SHIFT (11)
>> #define ESR_ELx_SET_MASK (UL(3) << ESR_ELx_SET_SHIFT)
>> @@ -99,6 +108,7 @@
>> #define ESR_ELx_FSC (0x3F)
>> #define ESR_ELx_FSC_TYPE (0x3C)
>> #define ESR_ELx_FSC_EXTABT (0x10)
>> +#define ESR_ELx_FSC_SERROR (0x11)
>> #define ESR_ELx_FSC_ACCESS (0x08)
>> #define ESR_ELx_FSC_FAULT (0x04)
>> #define ESR_ELx_FSC_PERM (0x0C)
>> diff --git a/arch/arm64/include/asm/traps.h b/arch/arm64/include/asm/traps.h
>> index d131501c6222..8d2a1fff5c6b 100644
>> --- a/arch/arm64/include/asm/traps.h
>> +++ b/arch/arm64/include/asm/traps.h
>> @@ -19,6 +19,7 @@
>> #define __ASM_TRAP_H
>>
>> #include <linux/list.h>
>> +#include <asm/esr.h>
>> #include <asm/sections.h>
>>
>> struct pt_regs;
>> @@ -58,4 +59,39 @@ static inline int in_entry_text(unsigned long ptr)
>> return ptr >= (unsigned long)&__entry_text_start &&
>> ptr < (unsigned long)&__entry_text_end;
>> }
>> +
>> +static inline bool arm64_is_ras_serror(u32 esr)
>> +{
>> + bool impdef = esr & ESR_ELx_ISV; /* aka IDS */
>
> I think you should add an IDS field along with the AET one I suggested.
Sure,
>> +
>> + if (cpus_have_const_cap(ARM64_HAS_RAS_EXTN))
>> + return !impdef;
>> +
>> + return false;
>> +}
>> +
>> +/* Return the AET bits of an SError ESR, or 0/uncontainable/uncategorized */
>> +static inline u32 arm64_ras_serror_get_severity(u32 esr)
>> +{
>> + u32 aet = esr & ESR_ELx_AET;
>> +
>> + if (!arm64_is_ras_serror(esr)) {
>> + /* Not a RAS error, we can't interpret the ESR */
>> + return 0;
>> + }
>> +
>> + /*
>> + * AET is RES0 if 'the value returned in the DFSC field is not
>> + * [ESR_ELx_FSC_SERROR]'
>> + */
>> + if ((esr & ESR_ELx_FSC) != ESR_ELx_FSC_SERROR) {
>> + /* No severity information */
>> + return 0;
>> + }
> Hmm, this means we can't distinguish impdef or RES0 encodings from
> uncontainable errors. Is that desirable?
We panic for for both impdef and uncontainable ESR values, so the difference
doesn't matter. I'll remove the 'is_ras_serror()' in here and make it the
callers problem to check...
RES0 encodings?
If this is an imp-def 'all zeros', those should all be matched as impdef by
arm64_is_ras_serror().
Otherwise its a RAS encoding with {I,D}FSC bits that indicate we can't know the
severity.
The ARM-ARM calls these 'uncategorized'. Yes I'm treating them as uncontained,
(on aarch32 these share an encoding). I'll add a comment to call it out.
> Also, could we end up in a situation where some CPUs support RAS and some
> don't,
Ooer, differing CPU support. I hadn't considered that... wouldn't cpufeature
declare such a system insane?
> so arm64_is_ras_serror returns false yet a correctable error is
> reported by one the CPUs and we treat it as uncontainable?
Makeing the HAS_RAS tests use this_cpu_has_cap() should cover this, but will
cause problems for KVM as it calls these from a pre-emptible context.
>> +
>> + return aet;
>> +}
>> +
>> +bool arm64_blocking_ras_serror(struct pt_regs *regs, unsigned int esr);
>> +void __noreturn arm64_serror_panic(struct pt_regs *regs, u32 esr);
>> #endif
>> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
>> index 773aae69c376..53aeb25158b0 100644
>> --- a/arch/arm64/kernel/traps.c
>> +++ b/arch/arm64/kernel/traps.c
>> @@ -709,17 +709,65 @@ asmlinkage void handle_bad_stack(struct pt_regs *regs)
>> +bool arm64_blocking_ras_serror(struct pt_regs *regs, unsigned int esr)
>> +{
> Since you asked... what about "fatal" instead of "blocking"?
.. well that was obvious. Yes, I was looking too much at whether we could return
to the interrupted context instead of what we do next!
Thanks,
James
More information about the linux-arm-kernel
mailing list