mvneta: oops in __rcu_read_lock on mirabox

Mon Sep 16 13:45:14 EDT 2013

On Mon, Sep 16, 2013 at 06:14:16PM +0100, Russell King - ARM Linux wrote:
> On Mon, Sep 16, 2013 at 06:24:50PM +0200, Thomas Petazzoni wrote:
> > Could this be caused by bitflips in the RAM due to bad timings, or
> > overheating or that kind of things?
> 
> Well, the SoC is an Armada 370, which uses Marvell's own Sheeva core.
> From what I understand, this is a CPU designed entirely by Marvell, so
> the interpretation of these codes may not be correct.  This is made
> harder to diagnose in that Marvell is so secret with their
> documentation; indeed for this CPU there is no information publicly
> available (there's only the product briefs).

Yes, and their salesmen have never responded despite many attempts over
more than a year now. Looks like they want to keep their chips to
themselves :-(

> Bad timings could certainly cause bitflips, as could poor routing of
> data line D8 (eg, incorrect termination or routing causing reflections
> on the data line - remember that with modern hardware, almost every
> signal is a transmission line).

This board has really clean routing and placement, and the chips are very
close together. That does not rule out a missing termination, but such a
flaw would probably affect more users.

> Marginal or noisy power supplies could also be a problem - for example,
> if the impedance of the power supply connections is too great, it may
> work with some patterns of use but not others.

We have some margin here: I measured less than 1 A at boot and something
like 600-700 mA at idle, if my memory serves me correctly. The 3 A PSU and
its thicker-than-average wires seem safe. I think Globalscale learned from
the horrible Guruplug design that this whole part needs to be done
correctly, and they did a very clean job this time.

> There's so many possibilities...

Including faulty components. I'm not aware of an equivalent of cpuburn for
ARM. It would probably help, though it's probably harder to design in a
generic way than on x86 where all systems are the same.
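
For what it's worth, here is a rough sketch of the kind of userspace check
I have in mind. It is not a cpuburn equivalent and will not load the core;
it only writes a few fixed patterns to a buffer, reads them back, and
tallies which bit positions flipped, so that a recurring position (bit 8
for a suspect D8 line, say) would stand out. All function names are made
up for illustration, and the accesses go through the caches, so the buffer
has to be much larger than the caches for DRAM to be exercised at all.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in a 32-bit word without compiler builtins. */
static unsigned popcount32(uint32_t v)
{
    unsigned n = 0;

    while (v) {
        v &= v - 1;     /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * Fill buf with one pattern, read it back, and return the number of
 * flipped bits. flips_per_bit[i] is incremented once for every word in
 * which bit i came back different from what was written.
 */
static unsigned long check_pattern(volatile uint32_t *buf, size_t words,
                                   uint32_t pattern,
                                   unsigned long flips_per_bit[32])
{
    unsigned long total = 0;
    unsigned bit;
    size_t i;

    for (i = 0; i < words; i++)
        buf[i] = pattern;

    for (i = 0; i < words; i++) {
        uint32_t diff = buf[i] ^ pattern;

        if (!diff)
            continue;
        total += popcount32(diff);
        for (bit = 0; bit < 32; bit++)
            if (diff & ((uint32_t)1 << bit))
                flips_per_bit[bit]++;
    }
    return total;
}

/* Run a small, arbitrary set of patterns; return the total flipped bits. */
static unsigned long run_patterns(volatile uint32_t *buf, size_t words,
                                  unsigned long flips_per_bit[32])
{
    static const uint32_t patterns[] = {
        0x00000000u, 0xffffffffu, 0x55555555u, 0xaaaaaaaau
    };
    unsigned long total = 0;
    size_t p;

    for (p = 0; p < sizeof(patterns) / sizeof(patterns[0]); p++)
        total += check_pattern(buf, words, patterns[p], flips_per_bit);
    return total;
}
```

One would call run_patterns() in a loop from main() on a large malloc'ed
buffer and print flips_per_bit[] whenever the return value is non-zero. A
clean run proves nothing, but a skewed tally would at least be a hint.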

> However, if the fault codes above really do equate to what's in the ARMv7
> Architecture Reference Manual, I think we can rule out the routing and
> RAM chips - because a cache parity error points to bit flips in the cache,
> or if there is no cache parity checking implemented, it means something
> is corrupting the state of the SoC - which could be due to bad power
> supplies.
> 
> How do we get to the bottom of this?  That's a very good question - one
> which is going to be very difficult to solve.  Ideally, it means working
> with the manufacturer's design team to try and work out what's going on
> at the board level, probably using logic analysers to capture the bus
> activity leading up to the failure.  Also, checking the power supplies
> at the SoC too - checking that they're within correct tolerance and
> checking the amount of noise on them.
> 
> I think all we can do at the moment is to wait for further reports to roll
> in and see whether a better pattern emerges.

Especially since there are also some heavy testers who don't seem to be
impacted :-/

> If you want to try something - and you suspect it may be heat related,
> you could try putting the board inside a container, monitor the temperature
> inside the container, and put it in your freezer!  Just be careful of the
> temperature of the other devices on the board getting too cold though -
> remember, most consumer electronics is only rated for an *operating*
> temperature range of 0°C to 70°C and your freezer will be something like
> -20°C - so don't let the ambient temperature inside the container go
> below 0°C!  If the CPU is producing lots of heat though, it may keep the
> container sufficiently warm that that's not a problem.  The theory is
> that by making the ambient 15 to 20°C cooler, you will also lower the
> temperature of the hotter parts by a similar amount.

Sometimes you can also do the opposite: heat it gently with a hair dryer
while it is running to see if problems happen more frequently. It's often
easier than working in a cold place, as you don't have issues with the
wires and it does not accumulate moisture.

I've detected some early failures this way; the NAND in my Iomega Iconnect
is extremely sensitive to heat, to the point that I had to stick a heat
sink on it and take the board out of its case to avoid hangs. The hair
dryer revealed the culprit within a few minutes, whereas it previously
took weeks to get a failure.

Willy