[PATCH v3] arm64: enable EDAC on arm64

Rob Herring robherring2 at gmail.com
Tue Apr 22 09:29:52 PDT 2014


On Tue, Apr 22, 2014 at 11:01 AM, Will Deacon <will.deacon at arm.com> wrote:
> On Tue, Apr 22, 2014 at 04:23:20PM +0100, Rob Herring wrote:
>> On Tue, Apr 22, 2014 at 8:26 AM, Will Deacon <will.deacon at arm.com> wrote:
>> > On Tue, Apr 22, 2014 at 01:54:12PM +0100, Rob Herring wrote:
>> >> On Tue, Apr 22, 2014 at 5:24 AM, Will Deacon <will.deacon at arm.com> wrote:
>> >> > On Mon, Apr 21, 2014 at 05:09:16PM +0100, Rob Herring wrote:
>> >> >> +#ifndef ASM_EDAC_H
>> >> >> +#define ASM_EDAC_H
>> >> >> +/*
>> >> >> + * ECC atomic, DMA, SMP and interrupt safe scrub function.
>> >> >
>> >> > What do you mean by `DMA safe'? For coherent (cacheable) DMA buffers, this
>> >> > should work fine, but for non-coherent (and potentially non-cacheable)
>> >> > buffers, I think we'll have problems both due to the lack of guaranteed
>> >> > exclusive monitor support and also eviction of dirty lines.
>> >>
>> >> That's just copied from other implementations. I agree you could have
>> >> a problem here, although I don't see why dirty line eviction would be one.
>> >
>> > I was thinking of the case where you have an ongoing, non-coherent DMA
>> > transfer from a device and then the atomic_scrub routine runs in parallel
>> > on the CPU, targeting the same buffer. In this case, the stxr could store
>> > stale data back to the buffer, leading to corruption (since the monitor
>> > won't help). This differs from the case where the monitor could always
>> > report failure for non-cacheable regions, causing atomic_scrub to livelock.
>>
>> It is only reads that will trigger an error and scrubbing. If the DMA
>> is continuously reading (such as a framebuffer), then there would not
>> be an issue. What would be the use case where a DMA continuously writes
>> to the same location without any synchronization with the cpu? I
>> suppose one core could re-trigger a DMA while another core is doing
>> the scrubbing. You would have to read the DMA data and be finished
>> with it quicker than the scrubbing could get handled. I just wonder
>> whether this is really only a theoretical problem, but not one in
>> practice.
>
> I don't think it's all that complicated if you consider speculative reads
> from the CPU triggering the error. However, discussion with Catalin raised
> another question (see below).
>
>> >> There's not really a solution other than not doing s/w scrubbing or
>> >> doing it in h/w. So it is up to individual drivers to decide what to
>> >> do, but we have to provide this function just to enable EDAC.
>> >
>> > I think we need to avoid s/w scrubbing of non-cacheable memory altogether.
>>
>> There's not really a way to determine the memory attributes easily
>> though. Whether it works depends on the h/w. Calxeda's memory
>> controller did have an exclusive monitor so I think this would have
>> worked even in the non-coherent case.
>>
>> What exactly is your proposal to do here? I think we should assume the
>> h/w is designed correctly until we have a case that it is not.
>
> Looking at the edac_mc_scrub_block code, atomic_scrub is always called with
> a normal, cacheable mapping (kmap_atomic) so that doesn't help us (although
> it means the exclusives will at least succeed).
>
> The problem of speculative reads by the CPU could be solved by unmapping the
> DMA buffer when we transfer the ownership over to the device (instead of
> invalidating it after the transfer). However, I'm now slightly confused as
> to how atomic_scrub fixes errors reported at any cache level higher than
> L1. Do we need cache-flushing to ensure that the exclusive-store propagates
> to the point of failure?

The whole point of scrubbing is to stop repeated error reporting of
correctable errors. For example, you do a write to memory and the ECC
code is added to it. Suppose the data stored in the memory gets
corrupted either on the write or some time later you get a bit flip in
the memory cell. Then when the data is read from memory, the memory
controller will detect the error, correct it, and trigger an ECC
correctable error interrupt. It will do this every time you read that
memory location because the error occurred on the write. The only way
to clear the error is re-writing memory. As long as that cache line is
dirty, no reads from that memory location will occur as other readers
will get the line from other cores, the L2, or the line will get
pushed out to memory first. I guess you could see an invalidate on DMA
memory causing the scrub to get lost, but that doesn't really matter.
It would be harmless to get the error again other than making your
error rate seem higher (which is something OEMs are very sensitive
to). You are doing the invalidate so that DMA can write new data
anyway.
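
To make the "re-write the memory" step concrete, here is a rough
user-space sketch of what a scrub does: read each word and atomically
store the same value back, so the line becomes dirty and the corrected
data eventually gets written out. This is only an illustration, not the
code from the patch. The name scrub_sketch and the use of C11 atomics
are assumptions made for the example; a real kernel implementation
would use the exclusive load/store instructions directly in inline
assembly, since a compiler may be free to simplify a store of an
unchanged value.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical sketch: atomically rewrite every 32-bit word in a
 * buffer with its current value. The compare-and-swap loop mirrors
 * the load-exclusive/store-exclusive retry pattern: if another
 * writer changes the word in between, we retry with the new value
 * instead of storing stale data.
 */
static void scrub_sketch(_Atomic uint32_t *buf, size_t size)
{
	size_t i;

	/* The sketch only handles whole 32-bit words. */
	assert(size % sizeof(uint32_t) == 0);

	for (i = 0; i < size / sizeof(uint32_t); i++) {
		uint32_t val = atomic_load_explicit(&buf[i],
						    memory_order_relaxed);

		/* On failure, val is refreshed with the current contents. */
		while (!atomic_compare_exchange_weak_explicit(&buf[i],
							      &val, val,
							      memory_order_relaxed,
							      memory_order_relaxed))
			;
	}
}
```

The contents of the buffer are unchanged afterwards; the only intended
side effect is the write itself.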

Rob


