[PATCH 1/2] drivers/edac: Add L1 and L2 error detection for A53 and A57
Marc Zyngier
maz at kernel.org
Fri Apr 2 11:06:56 BST 2021
On Thu, 01 Apr 2021 12:06:14 +0100,
Sascha Hauer <s.hauer at pengutronix.de> wrote:
>
> The Cortex A53 and A57 cores have error detection capabilities for the
> L1/L2 Caches, this patch adds a driver for them.
>
> Unfortunately there is no robust way to inject errors into the caches,
> so this driver doesn't contain any code to actually test it. It has
> been tested though with code taken from an older version of this driver
> found here: https://lkml.org/lkml/2018/3/14/1203. For reasons stated
> in this thread the error injection code is not suitable for mainline,
> so it is removed from the driver.
>
> Signed-off-by: Sascha Hauer <s.hauer at pengutronix.de>
> ---
> drivers/edac/Kconfig | 6 +
> drivers/edac/Makefile | 1 +
> drivers/edac/cortex_arm64_l1_l2.c | 218 ++++++++++++++++++++++++++++++
> 3 files changed, 225 insertions(+)
> create mode 100644 drivers/edac/cortex_arm64_l1_l2.c
>
> diff --git a/drivers/edac/Kconfig b/drivers/edac/Kconfig
> index 27d0c4cdc58d..b038aed35e93 100644
> --- a/drivers/edac/Kconfig
> +++ b/drivers/edac/Kconfig
> @@ -538,4 +538,10 @@ config EDAC_DMC520
> Support for error detection and correction on the
> SoCs with ARM DMC-520 DRAM controller.
>
> +config EDAC_CORTEX_ARM64_L1_L2
> + tristate "ARM Cortex A57/A53"
> + depends on ARM64
> + help
> + Support for L1/L2 cache error detection on ARM Cortex A57 and A53.
I went through the TRMs for a few other Cortex-A cores, and this
feature looks more common than this comment suggests. At least A35 and
A72 implement something similar (if not strictly identical), probably
owing to their ancestry.
> +
> endif # EDAC
> diff --git a/drivers/edac/Makefile b/drivers/edac/Makefile
> index 2d1641a27a28..5849d8bb32ae 100644
> --- a/drivers/edac/Makefile
> +++ b/drivers/edac/Makefile
> @@ -84,3 +84,4 @@ obj-$(CONFIG_EDAC_QCOM) += qcom_edac.o
> obj-$(CONFIG_EDAC_ASPEED) += aspeed_edac.o
> obj-$(CONFIG_EDAC_BLUEFIELD) += bluefield_edac.o
> obj-$(CONFIG_EDAC_DMC520) += dmc520_edac.o
> +obj-$(CONFIG_EDAC_CORTEX_ARM64_L1_L2) += cortex_arm64_l1_l2.o
> diff --git a/drivers/edac/cortex_arm64_l1_l2.c b/drivers/edac/cortex_arm64_l1_l2.c
> new file mode 100644
> index 000000000000..3b1e2f3ccab6
> --- /dev/null
> +++ b/drivers/edac/cortex_arm64_l1_l2.c
> @@ -0,0 +1,218 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Cortex A57 and A53 EDAC L1 and L2 cache error detection
> + *
> + * Copyright (c) 2020 Pengutronix, Sascha Hauer <s.hauer at pengutronix.de>
> + *
> + * Based on Code from:
> + * Copyright (c) 2018, NXP Semiconductor
> + * Author: York Sun <york.sun at nxp.com>
> + *
> + */
> +
> +#include <linux/module.h>
> +#include <linux/of_platform.h>
> +#include <linux/of_device.h>
> +#include <linux/bitfield.h>
> +#include <asm/smp_plat.h>
> +
> +#include "edac_module.h"
> +
> +#define DRVNAME "cortex-arm64-edac"
> +
> +#define CPUMERRSR_EL1_RAMID GENMASK(30, 24)
> +
> +#define CPUMERRSR_EL1_VALID BIT(31)
> +#define CPUMERRSR_EL1_FATAL BIT(63)
> +
> +#define L1_I_TAG_RAM 0x00
> +#define L1_I_DATA_RAM 0x01
> +#define L1_D_TAG_RAM 0x08
> +#define L1_D_DATA_RAM 0x09
> +#define L1_D_DIRTY_RAM 0x14
> +#define TLB_RAM 0x18
> +
> +#define L2MERRSR_EL1_VALID BIT(31)
> +#define L2MERRSR_EL1_FATAL BIT(63)
> +
> +struct merrsr {
> + u64 cpumerr;
> + u64 l2merr;
> +};
> +
> +#define MESSAGE_SIZE 64
> +
> +#define SYS_CPUMERRSR_EL1 sys_reg(3, 1, 15, 2, 2)
> +#define SYS_L2MERRSR_EL1 sys_reg(3, 1, 15, 2, 3)
> +
> +static struct cpumask compat_mask;
> +
> +static void report_errors(struct edac_device_ctl_info *edac_ctl, int cpu,
> + struct merrsr *merrsr)
> +{
> + char msg[MESSAGE_SIZE];
> + u64 cpumerr = merrsr->cpumerr;
> + u64 l2merr = merrsr->l2merr;
> +
> + if (cpumerr & CPUMERRSR_EL1_VALID) {
> + const char *str;
> + bool fatal = cpumerr & CPUMERRSR_EL1_FATAL;
> +
> + switch (FIELD_GET(CPUMERRSR_EL1_RAMID, cpumerr)) {
> + case L1_I_TAG_RAM:
> + str = "L1-I Tag RAM";
> + break;
> + case L1_I_DATA_RAM:
> + str = "L1-I Data RAM";
> + break;
> + case L1_D_TAG_RAM:
> + str = "L1-D Tag RAM";
> + break;
> + case L1_D_DATA_RAM:
> + str = "L1-D Data RAM";
> + break;
> + case L1_D_DIRTY_RAM:
> + str = "L1 Dirty RAM";
> + break;
> + case TLB_RAM:
> + str = "TLB RAM";
> + break;
> + default:
> + str = "unknown";
> + break;
> + }
> +
> + snprintf(msg, MESSAGE_SIZE, "%s %s error(s) on CPU %d",
> + str, fatal ? "fatal" : "correctable", cpu);
> +
> + if (fatal)
> + edac_device_handle_ue(edac_ctl, cpu, 0, msg);
> + else
> + edac_device_handle_ce(edac_ctl, cpu, 0, msg);
> + }
> +
> + if (l2merr & L2MERRSR_EL1_VALID) {
> + bool fatal = l2merr & L2MERRSR_EL1_FATAL;
> +
> + snprintf(msg, MESSAGE_SIZE, "L2 %s error(s) on CPU %d",
> + fatal ? "fatal" : "correctable", cpu);
The shared nature of the L2 makes the CPU it has been detected on
pretty much irrelevant. What you really want here is the CPUID+Way
that is in the register data.
> + if (fatal)
> + edac_device_handle_ue(edac_ctl, cpu, 1, msg);
> + else
> + edac_device_handle_ce(edac_ctl, cpu, 1, msg);
> + }
> +}
> +
> +static void read_errors(void *data)
> +{
> + struct merrsr *merrsr = data;
> +
> + merrsr->cpumerr = read_sysreg_s(SYS_CPUMERRSR_EL1);
> + write_sysreg_s(0, SYS_CPUMERRSR_EL1);
> + merrsr->l2merr = read_sysreg_s(SYS_L2MERRSR_EL1);
> + write_sysreg_s(0, SYS_L2MERRSR_EL1);
If an error happens between read and write, you lose it. That's not
great. You could improve things by only writing 0 if you have found an
error. You probably also need an isb after the write if you want it to
take effect in a timely manner.
I'm also not sure of how valuable it is to probe for L2 errors on each
CPU, given that it is shared with up to 3 other cores. You probably
want to use the cache topology information for this.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
More information about the linux-arm-kernel
mailing list