[PATCH v4] EDAC: Add ARM64 EDAC

Fri Oct 30 09:26:58 PDT 2015

Hi Mark,

>> +
>> +Required properties:
>> +- compatible: Should be "arm,cortex-a57-edac" or "arm,cortex-a53-edac"
>> +
>> +Example:
>> +	edac {
>> +		compatible = "arm,cortex-a57-edac";
>> +	};
>> +
> 
> This is insufficient for big.LITTLE, no interrupt is possible, and we
> haven't defined the rules for accessing the registers (e.g. whether
> write backs are permitted).
> 
> Please see my prior comments [1] on those points.
> 
> If we're going to use this feature directly within the kernel, we need
> to consider the envelope of possible implementations rather than your
> use-case alone.
>

I have looked at possibility of pushing correctable error logging in the
firmware; but given current hardware limitation it seems like OS is the best
place to implement it. Let me summaries the issues we are running into:

* Correctable errors does not generate any interrupt:
  If we have to implement error parsing inside the firmware then work need
  to be split between OS and firmware. Maybe OS can call SMC instruction to 
  dial into firmware and then firmware can check error syndrome registers; 
  if it finds correctable error then build HEST table. This method will introduce
  performance issue because it require OS executing SMC every 100ms or so to just
  poll for correctable error. If you have any other recommendation then please share it.

> * Interaction with firmware
> - When/do we handle interrupts?

We can a properties in dt bindings:

1) "num-interrupts = 1" - number of interrupt count. One interrupts per cluster
    e.g if you have 4 cluster then num-interrupts=4.
2) interrupts = <0, 92, 0> <0, 94, 0> <0, 96, 0> <0, 98, 0>  // interrupt mapping

If num-interrupts = 0, then firmware handles interrupts. Optionally we can use HEST FIRMWARE-FIRST
bit, if bit is set then firmware is handling the interrupt otherwise use DT information.

>
>  - When is it valid to write back and clear an error? We should not do
>    this behind the back of any firmware that owns the interface.

As per A57 TRM is concerned you are right both the correctable and uncorrectable 
error needs to clear VALID bit in L1/L2 syndrome registers. So yes we need to define
a rule for accessing the registers. I can think of two possible approach here:

1) add "error-syndrome-reg-write-access=1" property in dt.
   * if '1' then OS has exclusive write backs access to error syndrome register
   * if '0' then OS will not clear the valid bit on fatal error

  The handler looks like this:

  parse_error_syndrome () {
   val = read_cpumerrsr

   if (!IS_VALID(val))
     return

   /* log the error details */

   /* if fatal error and OS does not have exclusive write back access */
   if (IS_FATAL(val) && !error-syndrom-reg-write-access)
     return; 

   val = ~(1UL << 31); /* clear valid bit */
  }

2) Use HEST FIRMWARE-FIRST bit field, if the bit is set then OS should not clear
   the valid bit on fatal error and similarly if bit is clear then OS clears the VALID bit.

Since firmware will never handle the correctable error hence its always safe to clear
the VALID bit on non-fatal error. If you have any other suggestions then please share it.

I am not pushing my use-case only; I am trying to work through current hardware
limitation and still support all the possibilities. I am open to hear your suggestions.
I am also not well versed on big.LITTILE CPU, so you may need to point me on right 
direction as we progress. My testing is limited to Cortex A57.

> 
> I don't think the use of old_mask is sufficient here, given the mapping
> of logical to physical ID is arbitrary. For example, we could have CPUs
> 0,5,6,7 in one cluster, and CPUs 1,2,3,4 in another, and in that case
> we'd check the first cluster twice.
> 

Noted. I should use physical ID instead of logical mapping.

> This also is wrong for big.LITTLE; we can't necessarily check on every
> CPU.
>