[PATCH v2 11/11] RAS: add DeviceTree firmware-first CPER provider
Jonathan Cameron
jonathan.cameron at huawei.com
Tue Feb 24 07:55:20 PST 2026
On Fri, 20 Feb 2026 13:42:29 +0000
Ahmed Tiba <ahmed.tiba at arm.com> wrote:
> Add a DeviceTree firmware-first CPER provider that reuses the shared
> GHES helpers, wire it into the RAS Kconfig/Makefile and document it in
> the admin guide. Update MAINTAINERS now that the driver exists.
>
> Signed-off-by: Ahmed Tiba <ahmed.tiba at arm.com>
Hi Ahmed,
Various comments inline.
Jonathan
> ---
> Documentation/admin-guide/RAS/main.rst | 18 +++
> MAINTAINERS | 1 +
> drivers/acpi/apei/apei-internal.h | 10 +-
> drivers/acpi/apei/ghes_cper.c | 2 +
> drivers/ras/Kconfig | 12 ++
> drivers/ras/Makefile | 1 +
> drivers/ras/esource-dt.c | 264 +++++++++++++++++++++++++++++++++
> include/acpi/ghes_cper.h | 9 ++
> 8 files changed, 308 insertions(+), 9 deletions(-)
>
> diff --git a/Documentation/admin-guide/RAS/main.rst b/Documentation/admin-guide/RAS/main.rst
> index 5a45db32c49b..4ffabaaeabb1 100644
> --- a/Documentation/admin-guide/RAS/main.rst
> +++ b/Documentation/admin-guide/RAS/main.rst
> @@ -205,6 +205,24 @@ Architecture (MCA)\ [#f3]_.
> .. [#f3] For more details about the Machine Check Architecture (MCA),
> please read Documentation/arch/x86/x86_64/machinecheck.rst at the Kernel tree.
>
> +Firmware-first CPER via DeviceTree
> +----------------------------------
> +
> +Some systems expose Common Platform Error Record (CPER) data
> +via DeviceTree instead of ACPI HEST tables.
I'd argue this isn't really DT specific, it's just not ACPI table.
You could for instance use PRP0001 and wire this up on ACPI with only
one trivial change to generic property.h accessor for the boolean.
Or use another firmware information source entirely.
> +Enable ``CONFIG_RAS_ESOURCE_DT`` to build the ``drivers/ras/esource-dt.c``
> +driver and describe the CPER error source buffer with the
> +``Documentation/devicetree/bindings/firmware/arm,ras-ffh.yaml`` binding.
> +The driver reuses the GHES CPER helper object in
> +``drivers/acpi/apei/ghes_cper.c`` so the logging, notifier chains, and
> +memory failure handling match the ACPI GHES behaviour even when
> +ACPI is disabled.
> +
> +Once a platform describes a firmware-first provider, both ACPI GHES and the
> +DeviceTree driver reuse the same code paths. This keeps the behaviour
> +consistent regardless of whether the error source is described via ACPI
> +tables or DeviceTree.
> diff --git a/drivers/ras/Kconfig b/drivers/ras/Kconfig
> index fc4f4bb94a4c..ea6d96713020 100644
> --- a/drivers/ras/Kconfig
> +++ b/drivers/ras/Kconfig
> @@ -34,6 +34,18 @@ if RAS
> source "arch/x86/ras/Kconfig"
> source "drivers/ras/amd/atl/Kconfig"
>
> +config RAS_ESOURCE_DT
> + bool "DeviceTree firmware-first CPER error source block provider"
It isn't really DT specific other than one call that I've suggested you
replace with a generic firmware accessor.
> + depends on OF
Generally we don't gate on OF unless there are OF specific calls. Here there
aren't so you are just reducing build coverage. || COMPILE_TEST
maybe.
> + depends on ARM64
Likewise, nothing in here is arm64 specific that I can spot.
> + select GHES_CPER_HELPERS
> + help
> + Enable support for firmware-first Common Platform Error Record (CPER)
> + error source block providers that are described via DeviceTree
> + instead of ACPI HEST tables. The driver reuses the existing GHES
> + CPER helpers so the error processing matches the ACPI code paths,
> + but it can be built even when ACPI is disabled.
> +
> diff --git a/drivers/ras/esource-dt.c b/drivers/ras/esource-dt.c
> new file mode 100644
> index 000000000000..b575a2258536
> --- /dev/null
> +++ b/drivers/ras/esource-dt.c
> @@ -0,0 +1,264 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * DeviceTree provider for firmware-first CPER error source block.
> + *
> + * This driver shares the GHES CPER helpers so we keep the reporting and
> + * notifier behaviour identical to ACPI GHES
> + *
> + * Copyright (C) 2025 ARM Ltd.
> + * Author: Ahmed Tiba <ahmed.tiba at arm.com>
> + */
> +
> +#include <linux/atomic.h>
> +#include <linux/bitops.h>
> +#include <linux/device.h>
> +#include <linux/interrupt.h>
> +#include <linux/io.h>
> +#include <linux/io-64-nonatomic-lo-hi.h>
Used?
> +#include <linux/module.h>
mod_devicetable.h for of_device_id definition.
> +#include <linux/of_address.h>
> +#include <linux/of_irq.h>
Generally very little reason to include these. Not sure why you need
them here.
> +#include <linux/panic.h>
> +#include <linux/platform_device.h>
> +#include <linux/slab.h>
> +#include <linux/spinlock.h>
> +
> +#include <acpi/ghes.h>
> +#include <acpi/ghes_cper.h>
> +
> +static atomic_t ghes_ffh_source_ids = ATOMIC_INIT(0);
I'd normally expect an IDA or similar. If nothing else it clearly
indicates we only want a unique ID.
> +
> +struct ghes_ffh_ack {
> + void __iomem *addr;
> + u64 preserve;
> + u64 set;
> + u8 width;
> + bool present;
> +};
> +
> +struct ghes_ffh {
> + struct device *dev;
> + void __iomem *status;
> + size_t status_len;
> +
> + struct ghes_ffh_ack ack;
> +
> + struct acpi_hest_generic *generic;
> + struct acpi_hest_generic_status *estatus;
> +
> + bool sync;
> + int irq;
> +
> + /* Serializes access to the firmware-owned buffer. */
If we are serializing it, in what sense is it owned by the firmware?
> + spinlock_t lock;
> +};
> +
> +static void ghes_ffh_process(struct ghes_ffh *ctx)
> +{
> + unsigned long flags;
> + int sev;
> +
> + spin_lock_irqsave(&ctx->lock, flags);
guard() + include cleanup.h. Then can do returns in error paths.
> +
> + if (ghes_ffh_copy_status(ctx))
> + goto out;
Like here to give simpler lfow.
> +
> + sev = ghes_severity(ctx->estatus->error_severity);
> + if (sev >= GHES_SEV_PANIC)
> + ghes_ffh_fatal(ctx);
> +
> + if (!ghes_estatus_cached(ctx->estatus)) {
> + if (ghes_print_estatus(NULL, ctx->generic, ctx->estatus))
Combine the two if statements with &&
> + ghes_estatus_cache_add(ctx->generic, ctx->estatus);
> + }
> +
> + ghes_cper_handle_status(ctx->dev, ctx->generic, ctx->estatus, ctx->sync);
> +
> + ghes_ffh_ack(ctx);
> +
> +out:
> + spin_unlock_irqrestore(&ctx->lock, flags);
> +}
> +
> +static irqreturn_t ghes_ffh_irq(int irq, void *data)
> +{
> + struct ghes_ffh *ctx = data;
> +
> + ghes_ffh_process(ctx);
> +
> + return IRQ_HANDLED;
> +}
> +
> +static int ghes_ffh_init_ack(struct platform_device *pdev,
> + struct ghes_ffh *ctx)
> +{
> + struct resource *res;
> + size_t size;
> +
> + res = platform_get_resource(pdev, IORESOURCE_MEM, 1);
> + if (!res)
> + return 0;
> +
> + ctx->ack.addr = devm_ioremap_resource(&pdev->dev, res);
Why not devm_platform_get_and_ioremap_resource()?
> + if (IS_ERR(ctx->ack.addr))
> + return PTR_ERR(ctx->ack.addr);
> +
> + size = resource_size(res);
> + switch (size) {
> + case 4:
> + ctx->ack.width = 32;
> + ctx->ack.preserve = ~0U;
> + break;
> + case 8:
> + ctx->ack.width = 64;
> + ctx->ack.preserve = ~0ULL;
> + break;
> + default:
> + dev_err(&pdev->dev, "Unsupported ack resource size %zu\n", size);
> + return -EINVAL;
> + }
> +
> + ctx->ack.set = BIT_ULL(0);
> + ctx->ack.present = true;
> + return 0;
> +}
> +
> +static int ghes_ffh_probe(struct platform_device *pdev)
Consider using a
struct device *dev = &pdev->dev;
given there is only one device around and it will shorten a bunch of
lines a little.
> +{
> + struct ghes_ffh *ctx;
> + struct resource *res;
> + int rc;
> +
> + ctx = devm_kzalloc(&pdev->dev, sizeof(*ctx), GFP_KERNEL);
> + if (!ctx)
> + return -ENOMEM;
> +
> + spin_lock_init(&ctx->lock);
> + ctx->dev = &pdev->dev;
> + ctx->sync = of_property_read_bool(pdev->dev.of_node, "arm,sea-notify");
Hmm. I'd allow for other firmware types with
device_property_read_bool() instead.
> +
> + res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> + if (!res) {
> + dev_err(&pdev->dev, "status region missing\n");
In probe you can always use dev_err_probe. It pretty prints the return value etc and
saves lines of code.
return dev_err_probe(&pdev->dev, -EINVAL, "status region missing\n");
Don't worry about slightly long line.
> + return -EINVAL;
> + }
> +
> + ctx->status_len = resource_size(res);
> + if (!ctx->status_len) {
> + dev_err(&pdev->dev, "Status region has zero length\n");
As above, use dev_err_probe()
> + return -EINVAL;
> + }
> +
> + ctx->status = devm_ioremap_resource(&pdev->dev, res);
I'd be tempted to use devm_platform_get_and_ioremap_resource() and just
not worry about mapping and unmapping that will unnecessarily occur in the
case of error.
> + if (IS_ERR(ctx->status))
> + return PTR_ERR(ctx->status);
> +
> + rc = ghes_ffh_init_ack(pdev, ctx);
> + if (rc)
> + return rc;
> +
> + rc = ghes_ffh_init_pool();
> + if (rc)
> + return rc;
> +
> + ctx->estatus = devm_kzalloc(&pdev->dev, ctx->status_len, GFP_KERNEL);
> + if (!ctx->estatus)
> + return -ENOMEM;
> +
> + ctx->generic = devm_kzalloc(&pdev->dev, sizeof(*ctx->generic), GFP_KERNEL);
> + if (!ctx->generic)
> + return -ENOMEM;
> +
> + ctx->generic->header.type = ACPI_HEST_TYPE_GENERIC_ERROR;
> + ctx->generic->header.source_id =
> + atomic_inc_return(&ghes_ffh_source_ids);
> + ctx->generic->notify.type = ctx->sync ?
> + ACPI_HEST_NOTIFY_SEA : ACPI_HEST_NOTIFY_EXTERNAL;
> + ctx->generic->error_block_length = ctx->status_len;
> +
> + ctx->irq = platform_get_irq_optional(pdev, 0);
> + if (ctx->irq <= 0) {
> + if (ctx->irq == -EPROBE_DEFER)
> + return ctx->irq;
> + dev_err(&pdev->dev, "interrupt is required (%d)\n", ctx->irq);
If it's required, why call get_irq_optional?
That only serves to suppress the error message inside the call. Use
the non optional version and drop this.
> + return -EINVAL;
> + }
> +
> + rc = devm_request_threaded_irq(&pdev->dev, ctx->irq,
> + NULL, ghes_ffh_irq,
> + IRQF_ONESHOT,
> + dev_name(&pdev->dev), ctx);
> + if (rc)
> + return rc;
> +
> + platform_set_drvdata(pdev, ctx);
I can't immediately spot where this is used. If it isn't don't set it as that
will mislead people into thinking it's needed.
> + dev_info(&pdev->dev, "Firmware-first CPER status provider (interrupt)\n");
Krysztof already commented on this one.
> + return 0;
> +}
> +
> +static void ghes_ffh_remove(struct platform_device *pdev)
> +{
If nothing to do, platform drivers don't need a remove so get rid of it.
> +}
> +
> +static const struct of_device_id ghes_ffh_of_match[] = {
> + { .compatible = "arm,ras-ffh" },
> + { /* sentinel */ }
> +};
> +MODULE_DEVICE_TABLE(of, ghes_ffh_of_match);
> +
> +static struct platform_driver ghes_ffh_driver = {
> + .driver = {
> + .name = "esource-dt",
> + .of_match_table = ghes_ffh_of_match,
> + },
> + .probe = ghes_ffh_probe,
> + .remove = ghes_ffh_remove,
> +};
> +
Common convention is keep this tightly coupled with the
struct platform_driver but not having a blank line here.
> +module_platform_driver(ghes_ffh_driver);
> +
> +MODULE_AUTHOR("Ahmed Tiba <ahmed.tiba at arm.com>");
> +MODULE_DESCRIPTION("Firmware-first CPER provider for DeviceTree platforms");
> +MODULE_LICENSE("GPL");
More information about the linux-arm-kernel
mailing list