[PATCH 0/3] riscv: log Hardware Error Exception via APEI

Ruidong Tian tianruidong at linux.alibaba.com
Fri May 8 01:20:17 PDT 2026


This series extends the handling in do_trap_hardware_error(), building
on the work in [1].

RISC-V already dispatches Hardware Error Exception (cause 19, "HEE") via
do_trap_hardware_error(), but today the trap handler has no way to learn
*what* went wrong: the offending task is killed (or the kernel panics)
with no diagnostic about the underlying hardware fault, no error record
is logged, no page is isolated, and memory_failure() is never invoked.

There are two principal ways to obtain hardware error information on
HEE:

  1. Let firmware parse platform error registers and hand the kernel a
     standardized CPER record through ACPI / APEI / GHES.
  2. Have the kernel read the error registers directly.

Option (2) is not yet viable on RISC-V: the architecture does not
define a unified, mandatory layout for hardware error status registers
across implementations, so there is nothing stable for common code to
decode. This series therefore implements option (1) and wires HEE into
the existing APEI / GHES path, mirroring how arm64 treats SEA.

Future work: option (2) is not ruled out. Once the RISC-V architecture
standardizes a common hardware error register layout (either as part
of the privileged spec or via a well-defined SBI / ACPI namespace
interface), a kernel-native decoder could be added alongside the
ACPI/APEI path. The two can then coexist and be selected per
platform through a Kconfig choice. This series keeps the door open
by routing HEE through apei_claim_hee() behind CONFIG_ACPI_APEI_HEE,
so disabling that config already restores the legacy path and does
not block a future native decoder from being wired in.

After this series:

  * Firmware reports RAS events to the OS as CPER records through a
    HEST GHES entry whose notification type is HEE (new value 13).
  * If CONFIG_ACPI_APEI_HEE is set, do_trap_hardware_error() calls
    apei_claim_hee() first. On success, GHES queues the record, drains
    irq_work inline, isolates the poisoned page with
    memory_failure(MF_ACTION_REQUIRED), and then delivers a
    BUS_MCEERR_AR SIGBUS to the faulting user task via task_work.
  * If firmware does not claim the error, or CONFIG_ACPI_APEI_HEE is not set:
      - user mode falls back to SIGBUS / BUS_MCEERR_AR via do_trap_error(),
      - kernel mode tries fixup_exception() to let MC-safe copy routines
        recover; otherwise die().
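
The decision flow in the bullets above can be modelled in a few lines of
user-space C. This is only an illustrative sketch, not code from the
series: struct hee_ctx, enum hee_outcome and hee_dispatch() are
hypothetical names, and the booleans stand in for CONFIG_ACPI_APEI_HEE,
the result of apei_claim_hee(), the privilege mode of the trap, and
whether fixup_exception() finds an entry. The cover letter describes the
claimed path only for a faulting user task, so the model collapses every
claimed case into one outcome.

```c
/* User-space model of the HEE dispatch order; NOT kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome labels for the paths named in the cover letter. */
enum hee_outcome {
	HEE_CLAIMED_BY_APEI,	/* GHES queued the CPER record */
	HEE_USER_SIGBUS,	/* fallback: SIGBUS / BUS_MCEERR_AR */
	HEE_KERNEL_FIXUP,	/* fallback: fixup_exception() recovered */
	HEE_KERNEL_DIE,		/* fallback: die() */
};

/* Hypothetical inputs; each field models one condition from the text. */
struct hee_ctx {
	bool apei_hee_enabled;	/* models CONFIG_ACPI_APEI_HEE=y */
	bool firmware_claims;	/* models apei_claim_hee() succeeding */
	bool user_mode;		/* trap taken from user mode */
	bool has_fixup;		/* an exception-table fixup exists */
};

enum hee_outcome hee_dispatch(const struct hee_ctx *c)
{
	assert(c != NULL);

	/* APEI gets the first chance, but only when it is configured in. */
	if (c->apei_hee_enabled && c->firmware_claims)
		return HEE_CLAIMED_BY_APEI;

	/* Legacy path: user tasks receive SIGBUS. */
	if (c->user_mode)
		return HEE_USER_SIGBUS;

	/* Legacy path: kernel mode tries a fixup before giving up. */
	return c->has_fixup ? HEE_KERNEL_FIXUP : HEE_KERNEL_DIE;
}
```

In this model, clearing apei_hee_enabled makes firmware_claims
irrelevant, which mirrors the statement that disabling the config
restores the legacy path.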

References:
----------
[1] [RISC-V RAS patch]: https://lore.kernel.org/all/20260109090224.3105465-1-himanshu.chauhan@oss.qualcomm.com/

Ruidong Tian (3):
  acpi: Introduce HEE in HEST notification types
  riscv: Introduce HEST HEE notification handlers for APEI
  riscv: collect hardware error information via APEI on HEE

 arch/riscv/include/asm/acpi.h   |  2 +
 arch/riscv/include/asm/fixmap.h |  3 ++
 arch/riscv/kernel/acpi.c        | 54 ++++++++++++++++++++++++++
 arch/riscv/kernel/traps.c       | 35 ++++++++++++++++-
 drivers/acpi/apei/Kconfig       | 12 ++++++
 drivers/acpi/apei/ghes.c        | 68 ++++++++++++++++++++++++++++++++-
 include/acpi/actbl1.h           |  3 +-
 include/acpi/ghes.h             |  6 +++
 8 files changed, 178 insertions(+), 5 deletions(-)

-- 
2.51.2.612.gdc70283dfc



