[PATCH v4] EDAC: Add ARM64 EDAC

Fri Oct 30 10:06:06 PDT 2015

On Fri, Oct 30, 2015 at 11:26:58AM -0500, Brijesh Singh wrote:
> Hi Mark,
> 
> >> +
> >> +Required properties:
> >> +- compatible: Should be "arm,cortex-a57-edac" or "arm,cortex-a53-edac"
> >> +
> >> +Example:
> >> +	edac {
> >> +		compatible = "arm,cortex-a57-edac";
> >> +	};
> >> +
> > 
> > This is insufficient for big.LITTLE, no interrupt is possible, and we
> > haven't defined the rules for accessing the registers (e.g. whether
> > write backs are permitted).
> > 
> > Please see my prior comments [1] on those points.
> > 
> > If we're going to use this feature directly within the kernel, we need
> > to consider the envelope of possible implementations rather than your
> > use-case alone.
> >
> 
> I have looked at possibility of pushing correctable error logging in the
> firmware; but given current hardware limitation it seems like OS is the best
> place to implement it. Let me summaries the issues we are running into:
> 
> * Correctable errors does not generate any interrupt:
>   If we have to implement error parsing inside the firmware then work need
>   to be split between OS and firmware. Maybe OS can call SMC instruction to 
>   dial into firmware and then firmware can check error syndrome registers; 
>   if it finds correctable error then build HEST table. This method will introduce
>   performance issue because it require OS executing SMC every 100ms or so to just
>   poll for correctable error. If you have any other recommendation then please share it.

I agree that this is a problem, and is an unfortunate hardware
limitation.

I am still wary of making use of IMPLEMENTATION DEFINED features like
this in the kernel.

> > * Interaction with firmware
> > - When/do we handle interrupts?
> 
> We can a properties in dt bindings:
> 
> 1) "num-interrupts = 1" - number of interrupt count. One interrupts per cluster
>     e.g if you have 4 cluster then num-interrupts=4.
> 2) interrupts = <0, 92, 0> <0, 94, 0> <0, 96, 0> <0, 98, 0>  // interrupt mapping
> 
> If num-interrupts = 0, then firmware handles interrupts. Optionally we can use HEST FIRMWARE-FIRST
> bit, if bit is set then firmware is handling the interrupt otherwise use DT information.

You won't have the HEST and DT information at the same time, given that
at runtime the kernel uses one of ACPI or DT.

This also doesn't define the affinity of interrupts (i.e. which
cluster/CPU does each interrupt get generated by), which is the more
important part.

> >  - When is it valid to write back and clear an error? We should not do
> >    this behind the back of any firmware that owns the interface.
> 
> As per A57 TRM is concerned you are right both the correctable and uncorrectable 
> error needs to clear VALID bit in L1/L2 syndrome registers. So yes we need to define
> a rule for accessing the registers. I can think of two possible approach here:
> 
> 1) add "error-syndrome-reg-write-access=1" property in dt.
>    * if '1' then OS has exclusive write backs access to error syndrome register
>    * if '0' then OS will not clear the valid bit on fatal error

What about correctable errors? What if someone wants to poll that in FW
(no matter how horrible that may be for performance)? What if a future
CPU revision adds an optional interrupt for that?

>   The handler looks like this:
> 
>   parse_error_syndrome () {
>    val = read_cpumerrsr
> 
>    if (!IS_VALID(val))
>      return
> 
>    /* log the error details */
> 
>    /* if fatal error and OS does not have exclusive write back access */
>    if (IS_FATAL(val) && !error-syndrom-reg-write-access)
>      return; 
> 
>    val = ~(1UL << 31); /* clear valid bit */
>   }

Ok.

> 2) Use HEST FIRMWARE-FIRST bit field, if the bit is set then OS should not clear
>    the valid bit on fatal error and similarly if bit is clear then OS clears the VALID bit.
> 
> Since firmware will never handle the correctable error hence its always safe to clear
> the VALID bit on non-fatal error. If you have any other suggestions then please share it.

As above, I don't think this is true in general.

> I am not pushing my use-case only; I am trying to work through current hardware
> limitation and still support all the possibilities. I am open to hear your suggestions.
> I am also not well versed on big.LITTILE CPU, so you may need to point me on right 
> direction as we progress. My testing is limited to Cortex A57.

The important thing to consider is that an arbitrary subset of CPUs may
have this feature, while another arbitrary subset may not.

> > I don't think the use of old_mask is sufficient here, given the mapping
> > of logical to physical ID is arbitrary. For example, we could have CPUs
> > 0,5,6,7 in one cluster, and CPUs 1,2,3,4 in another, and in that case
> > we'd check the first cluster twice.
> > 
> 
> Noted. I should use physical ID instead of logical mapping.

Or you could simply keep a mask of CPUs whose clusters have already been
checked...

Thanks,
Mark.