[PATCH 1/2] drivers/edac: Add L1 and L2 error detection for A53 and A57

Thu Apr 15 11:15:09 BST 2021

Hi Marc,

Thanks for the input.

On Fri, Apr 02, 2021 at 11:06:56AM +0100, Marc Zyngier wrote:
> > +config EDAC_CORTEX_ARM64_L1_L2
> > +	tristate "ARM Cortex A57/A53"
> > +	depends on ARM64
> > +	help
> > +	  Support for L1/L2 cache error detection on ARM Cortex A57 and A53.
> 
> I went through the TRMs for a few other Cortex-A cores, and this
> feature looks more common than this comment suggests. At least A35 and
> A72 implement something similar (if not strictly identical), probably
> owing to their ancestry.

Ok, I'll add these to the description.

> > +		}
> > +
> > +		snprintf(msg, MESSAGE_SIZE, "%s %s error(s) on CPU %d",
> > +			 str, fatal ? "fatal" : "correctable", cpu);
> > +
> > +		if (fatal)
> > +			edac_device_handle_ue(edac_ctl, cpu, 0, msg);
> > +		else
> > +			edac_device_handle_ce(edac_ctl, cpu, 0, msg);
> > +	}
> > +
> > +	if (l2merr & L2MERRSR_EL1_VALID) {
> > +		bool fatal = l2merr & L2MERRSR_EL1_FATAL;
> > +
> > +		snprintf(msg, MESSAGE_SIZE, "L2 %s error(s) on CPU %d",
> > +			 fatal ? "fatal" : "correctable", cpu);
> 
> The shared nature of the L2 makes the CPU it has been detected on
> pretty much irrelevant. What you really want here is the CPUID+Way
> that is in the register data.

You are right. For the next round I added some more code to decode the
CPUID/Way field. What's still missing then is information which L2
cache has errors in case there is more than one. I wonder if we should
add get_cpu_cacheinfo(cpu)->id to the message or if there's more to it.

> 
> > +		if (fatal)
> > +			edac_device_handle_ue(edac_ctl, cpu, 1, msg);
> > +		else
> > +			edac_device_handle_ce(edac_ctl, cpu, 1, msg);
> > +	}
> > +}
> > +
> > +static void read_errors(void *data)
> > +{
> > +	struct merrsr *merrsr = data;
> > +
> > +	merrsr->cpumerr = read_sysreg_s(SYS_CPUMERRSR_EL1);
> > +	write_sysreg_s(0, SYS_CPUMERRSR_EL1);
> > +	merrsr->l2merr = read_sysreg_s(SYS_L2MERRSR_EL1);
> > +	write_sysreg_s(0, SYS_L2MERRSR_EL1);
> 
> If an error happens between read and write, you lose it. That's not
> great. You could improve things by only writing 0 if you have found an
> error. You probably also need an isb after the write if you want it to
> take effect in a timely manner.

Ok, will change.

> 
> I'm also not sure of how valuable it is to probe for L2 errors on each
> CPU, given that it is shared with up to 3 other cores. You probably
> want to use the cache topology information for this.

I have no idea how l2merr is implemented. When there is only one
register for all CPUs sharing the same L2 cache then it shouldn't
do any harm to read it more than once. The expensive part is
probably to schedule a function on all CPUs, and we have to do that
anyway to read the L1 cache errors.

Sascha

-- 
Pengutronix e.K.                           |                             |
Steuerwalder Str. 21                       | http://www.pengutronix.de/  |
31137 Hildesheim, Germany                  | Phone: +49-5121-206917-0    |
Amtsgericht Hildesheim, HRA 2686           | Fax:   +49-5121-206917-5555 |