[PATCH V5 00/10] Add UEFI 2.6 and ACPI 6.1 updates for RAS on ARM64
John Garry
john.garry at huawei.com
Tue Nov 22 03:11:55 PST 2016
We'll try and test this on our platform.
Cheers,
John
On 21/11/2016 22:35, Tyler Baicar wrote:
> When a memory error, CPU error, PCIe error, or other type of hardware error
> that's covered by RAS occurs, firmware should populate the shared GHES memory
> location with the proper GHES structures to notify the OS of the error.
> For example, platforms that implement firmware first handling may implement
> separate GHES sources for corrected errors and uncorrected errors. If the
> error is an uncorrectable error, then the firmware will notify the OS
> immediately since the error needs to be handled ASAP. The OS will then be able
> to take the appropriate action needed such as offlining a page. If the error
> is a corrected error, then the firmware will not interrupt the OS immediately.
> Instead, the OS will see and report the error the next time its GHES timer
> expires. The kernel will first parse the GHES structures and report the errors
> through the kernel logs and then notify the user space through RAS trace
> events. This allows user space applications such as RAS Daemon to see the
> errors and report them however the user desires. This patchset extends the
> kernel functionality for RAS errors based on updates in the UEFI 2.6 and
> ACPI 6.1 specifications.
>
> An example flow from firmware to user space could be:
>
>                  +---------------+
>        +-------->|               |
>        |         | GHES polling  |--+
> +-------------+  |    source     |  |   +---------------+   +------------+
> |             |  +---------------+  |   |  Kernel GHES  |   |            |
> |  Firmware   |                     +-->|  CPER AER and |-->| RAS trace  |
> |             |  +---------------+  |   |  EDAC drivers |   |   event    |
> +-------------+  |               |  |   +---------------+   +------------+
>        |         |   GHES sci    |--+
>        +-------->|    source     |
>                  +---------------+
>
> Add support for Generic Hardware Error Source (GHES) v2, which introduces the
> capability for the OS to acknowledge the consumption of the error record
> generated by the Reliability, Availability and Serviceability (RAS) controller.
> This eliminates potential race conditions between the OS and the RAS controller.
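The acknowledgment step described above can be pictured as a read-modify-write of an ack register. The following is only a minimal sketch, assuming the GHESv2 source supplies a preserve mask, a write mask, and a bit offset for that register; the struct and function names are hypothetical and are not taken from the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical model of the acknowledgment fields of a GHESv2 source.
 * Field names are illustrative, not the kernel's structure layout.
 */
struct ack_fields {
    uint64_t preserve;   /* mask of register bits to keep unchanged */
    uint64_t write;      /* bits to set to signal "record consumed" */
    unsigned bit_offset; /* position of the field within the register */
};

/* Compute the value the OS would write back after consuming a record. */
uint64_t ack_value(uint64_t current, const struct ack_fields *f)
{
    /* Shifting a 64-bit value by 64 or more is undefined in C. */
    assert(f != NULL && f->bit_offset < 64);

    /* Keep only the bits covered by the shifted preserve mask ... */
    uint64_t val = current & (f->preserve << f->bit_offset);
    /* ... then OR in the shifted acknowledgment bits. */
    val |= f->write << f->bit_offset;
    return val;
}
```

With a preserve mask of 0xFE, a write mask of 0x1, and a bit offset of 0, a register currently holding 0xA4 would be written back as 0xA5. Firmware can then treat the set bit as permission to reuse the error record buffer.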
>
> Add support for the timestamp field added to the Generic Error Data Entry v3,
> allowing the OS to log the time that the error is generated by the firmware,
> rather than the time the error is consumed. This improves the correctness of
> event sequences when analyzing error logs. The timestamp was added in
> ACPI 6.1; see Table 18-343, Generic Error Data Entry.
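To illustrate what decoding such a timestamp involves, here is a small sketch. It assumes an 8-byte BCD layout of seconds, minutes, hours, a "precise" flag, day, month, year, and century (the V2 changelog mentions the year/century split in BCD). The byte order and all names below are assumptions for illustration and should be checked against Table 18-343.

```c
#include <stdint.h>

/* Decoded form of the firmware-provided timestamp (hypothetical type). */
struct error_time {
    unsigned year, month, day, hour, minute, second;
    int precise; /* nonzero if firmware marked the time as precise */
};

/* Convert one packed BCD byte (e.g. 0x16) to its binary value (16). */
unsigned bcd_to_bin(uint8_t bcd)
{
    return (unsigned)(bcd >> 4) * 10u + (bcd & 0x0Fu);
}

/* Decode the assumed 8-byte layout: sec, min, hour, flag, day, month, year, century. */
struct error_time decode_timestamp(const uint8_t ts[8])
{
    struct error_time t;

    t.second  = bcd_to_bin(ts[0]);
    t.minute  = bcd_to_bin(ts[1]);
    t.hour    = bcd_to_bin(ts[2]);
    t.precise = ts[3] & 0x01;
    t.day     = bcd_to_bin(ts[4]);
    t.month   = bcd_to_bin(ts[5]);
    /* Year and century are separate BCD bytes; combine them. */
    t.year    = bcd_to_bin(ts[7]) * 100u + bcd_to_bin(ts[6]);
    return t;
}
```

Keeping century and year as separate BCD bytes means each byte stays within the 00-99 range that packed BCD can represent.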
>
> Add support for ARMv8 Common Platform Error Record (CPER) per UEFI 2.6
> specification. ARMv8 specific processor error information is reported as part of
> the CPER records. This provides more detail for processor error logs and
> can help describe ARMv8 cache, TLB, and bus errors.
>
> Synchronous External Abort (SEA) represents a specific processor error condition
> in ARM systems. A handler is added to recognize SEA errors, and a notifier is
> added to parse and report the errors before the process is killed. Refer to
> section N.2.1.1 in the Common Platform Error Record appendix of the UEFI 2.6
> specification.
>
> Currently the kernel ignores CPER records that are unrecognized.
> On the other hand, the UEFI spec allows for non-standard (e.g. vendor
> proprietary) error section types in CPER (Common Platform Error Record),
> as defined in section N2.3 of UEFI version 2.5. As a result, the user
> is not able to see hardware error data from a non-standard section.
>
> If the Section Type field of a Generic Error Data Entry is unrecognized,
> print out the raw data in the dmesg buffer, and also add a tracepoint
> for reporting such hardware errors.
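Printing raw section data amounts to a hex dump of the payload. A user-space sketch of formatting one line of such a dump might look like the following; `hex_line` is a hypothetical helper written for illustration and is not a function from the kernel or from this series.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format up to `len` bytes of `data` as space-separated lowercase hex
 * into `out`. Stops early if `out` is too small. Returns the number of
 * input bytes that were formatted. `out` is always NUL-terminated when
 * out_size > 0.
 */
size_t hex_line(const uint8_t *data, size_t len, char *out, size_t out_size)
{
    size_t used = 0;
    size_t i;

    if (out_size == 0)
        return 0;
    out[0] = '\0';

    for (i = 0; i < len; i++) {
        /* Two hex digits, a separator after the first byte, plus the NUL. */
        size_t need = (i ? 3u : 2u) + 1u;

        if (out_size - used < need)
            break;
        used += (size_t)snprintf(out + used, out_size - used,
                                 i ? " %02x" : "%02x", data[i]);
    }
    return i;
}
```

Given the four bytes de ad be ef and a 12-byte buffer, this produces the string "de ad be ef". A caller would invoke it once per chunk of the payload and emit each line to the log.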
>
> Currently, even if an error status block's severity is fatal, the kernel
> does not honor the severity level and does not panic. With the firmware-first
> model, the platform could inform the OS about a fatal hardware error
> through the non-NMI GHES notification type. The OS should panic when a
> hardware error record is received with this severity.
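The policy described above boils down to a mapping from the reported severity to an OS action. The sketch below uses the CPER severity encoding (0 recoverable, 1 fatal, 2 corrected, 3 informational); the action enum and function name are hypothetical and only illustrate the decision, not how the patch implements it.

```c
/* Severity values as encoded in CPER error records. */
enum cper_severity {
    SEV_RECOVERABLE   = 0,
    SEV_FATAL         = 1,
    SEV_CORRECTED     = 2,
    SEV_INFORMATIONAL = 3,
};

/* Hypothetical set of actions the OS could take for a record. */
enum os_action {
    ACT_LOG_ONLY, /* report through logs and trace events */
    ACT_RECOVER,  /* attempt recovery, e.g. offline a page */
    ACT_PANIC,    /* stop the system */
};

/* Choose an action from severity alone, regardless of notification type. */
enum os_action action_for_severity(enum cper_severity sev)
{
    switch (sev) {
    case SEV_FATAL:
        return ACT_PANIC;
    case SEV_RECOVERABLE:
        return ACT_RECOVER;
    case SEV_CORRECTED:
    case SEV_INFORMATIONAL:
    default:
        /* Assumption: unknown values are only logged in this sketch. */
        return ACT_LOG_ONLY;
    }
}
```

The point of the sketch is that the decision depends only on severity, so a fatal record delivered through a non-NMI notification still leads to a panic.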
>
> Add support to handle SEAs that occur while a KVM guest kernel is
> running. Currently these are unsupported by the guest abort handling.
>
> Depends on: [PATCH v14] acpi, apei, arm64: APEI initial support for aarch64.
> https://lkml.org/lkml/2016/8/10/231
>
> V5: Fix GHES goto logic for error conditions
>     Change ghes_do_read_ack to ghes_ack_error
>     Make sure data version check is >= 3
>     Use CPER helper functions in print functions
>     Make handle_guest_sea() dummy function static for arm
>     Add arm to subject line for KVM patch
>
> V4: Add bit offset left shift to read_ack_write value
>     Make HEST generic and generic_v2 structures a union in the ghes structure
>     Move gdata v3 helper functions into ghes.h to avoid duplication
>     Reorder the timestamp print and avoid memcpy
>     Add helper functions for gdata size checking
>     Rename the SEA functions
>     Add helper function for GHES panics
>     Set fru_id to NULL UUID at variable declaration
>     Limit ARM trace event parameters to the needed structures
>     Reorder the ARM trace event variables to save space
>     Add comment for why we don't pass SEAs to the guest when it aborts
>     Move ARM trace event call into GHES driver instead of CPER
>
> V3: Fix unmapped address to the read_ack_register in ghes.c
>     Add helper function to get the proper payload based on generic data entry
>     version
>     Move timestamp print to avoid changing function calls in cper.c
>     Remove patch "arm64: exception: handle instruction abort at current EL"
>     since the el1_ia handler is already added in 4.8
>     Add EFI and ARM64 dependencies for HAVE_ACPI_APEI_SEA
>     Add a new trace event for ARM type errors
>     Add support to handle KVM guest SEAs
>
> V2: Add PSCI state print for the ARMv8 error type.
>     Separate timestamp year into year and century using BCD format.
>     Rebase on top of ACPICA 20160318 release and remove header file changes
>     in include/acpi/actbl1.h.
>     Add panic OS with fatal error status block patch.
>     Add processing of unrecognized CPER error section patches with updates
>     from previous comments. Original patches: https://lkml.org/lkml/2015/9/8/646
>
> V1: https://lkml.org/lkml/2016/2/5/544
>
> Jonathan (Zhixiong) Zhang (1):
>   acpi: apei: panic OS with fatal error status block
>
> Tyler Baicar (9):
>   acpi: apei: read ack upon ghes record consumption
>   ras: acpi/apei: cper: generic error data entry v3 per ACPI 6.1
>   efi: parse ARMv8 processor error
>   arm64: exception: handle Synchronous External Abort
>   acpi: apei: handle SEA notification type for ARMv8
>   efi: print unrecognized CPER section
>   ras: acpi / apei: generate trace event for unrecognized CPER section
>   trace, ras: add ARM processor error trace event
>   arm/arm64: KVM: add guest SEA support
>
>  arch/arm/include/asm/kvm_arm.h       |   1 +
>  arch/arm/include/asm/system_misc.h   |   5 +
>  arch/arm/kvm/mmu.c                   |  18 ++-
>  arch/arm64/Kconfig                   |   1 +
>  arch/arm64/include/asm/kvm_arm.h     |   1 +
>  arch/arm64/include/asm/system_misc.h |  15 +++
>  arch/arm64/mm/fault.c                |  71 ++++++++++--
>  drivers/acpi/apei/Kconfig            |  14 +++
>  drivers/acpi/apei/ghes.c             | 188 ++++++++++++++++++++++++++++---
>  drivers/acpi/apei/hest.c             |   7 +-
>  drivers/firmware/efi/cper.c          | 210 ++++++++++++++++++++++++++++++++---
>  drivers/ras/ras.c                    |   2 +
>  include/acpi/ghes.h                  |  15 ++-
>  include/linux/cper.h                 |  84 ++++++++++++++
>  include/ras/ras_event.h              | 100 +++++++++++++++++
>  15 files changed, 688 insertions(+), 44 deletions(-)
>