FP register corruption in Exynos 4210 (Cortex-A9)

Lanchon lanchon at gmail.com
Mon Dec 22 14:46:27 PST 2014

On 10/10/2014 07:01 AM, Russell King - ARM Linux wrote:
> On Fri, Oct 10, 2014 at 11:45:34AM +0200, Arnd Bergmann wrote:
>> On Thursday 09 October 2014 23:32:44 Russell King - ARM Linux wrote:
>>>> there is a new piece of information:
>>>> the FP corruption seems to only happen in these android devices if the
>>>> display is off. the charger may be connected or not, but if the display
>>>> is on, the corruption won't happen.
>>>> i wonder if the kernel could be turning off the FPU and then back on
>>>> without saving the FPU state. i would think corruption would be seen
>>>> more often then.
>>> No.  We don't "turn off" the VFP.  We disable and enable access to VFP
>>> via the coprocessor access register.  If the VFP access is disabled and
>>> then re-enabled, all state is preserved.
>>> The only time which state would be lost is if (eg) we hot-unplug the
>>> entire CPU, but that first requires a context switch which implies that
>>> the state will already be saved.
>> Could the problem be caused by a bug in the exynos CPU suspend/resume
>> path then? E.g. if we go to sleep with VFP access disabled but it
>> comes back with VFP access enabled (or vice versa) that could lead
>> to the wrong register state being seen by the user space application.
> Well, an interesting test would be to save out the entire VFP state
> both before and after the pread64 call, and then inspect that to
> determine whether it is a single register or multiple registers
> which are being corrupted.
> However, looking at the mainline code, we do the right thing with the
> CPU PM infrastructure, and that is called appropriately by the exynos
> CPU idle driver.
> So, another possible test for Lanchon would be to see whether disabling
> CPU idle support fixes the problem.

hi again! thank you all for your help. i sort of disappeared, i'm very 
sorry about that.

i never mentioned it here, but the fact was that i didn't have a device 
to test on. so all i could do was post test code and ask users for their 
help. at some point no one was helping; i waited for test results but 
they never came, so i got frustrated and abandoned the project.

but recently interest built up again and we were able to progress and 
finally fix this, so i'm writing to let you know how it turned out.

so, remember the random userland VFP register corruption? the VFP state 
was not being corrupted, neither in the registers nor in the saved state in 
ram. what happened was: the kernel tracks the leftover state in the VFP 
once the eager state save is done. in the lazy restore trap, the kernel 
optimizes away the state load and instead only enables the VFP if it can 
prove that the leftover state in the VFP hardware matches the process 
state saved in ram.

however under some circumstances the kernel did the wrong thing: it 
didn't reload the registers even though it was needed, probably because 
the hardware had been powered down and had lost state without the 
tracking code getting word of it. just disabling the optimization made 
the kernel solid.
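
to make the failure mode concrete, here is a rough userland sketch of it in C. this is a toy model under my own assumptions, not the actual kernel code: every name in it (hw_owner, eager_save, lazy_restore, power_cycle, state_survives) is made up for illustration, and the register file is shrunk to 4 doubles.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NREGS 4  /* toy register file; the real VFP has many more */

struct vfp_hw { double regs[NREGS]; bool enabled; };
struct task   { double saved[NREGS]; };

static struct vfp_hw hw;
/* hypothetical tracker: whose state is believed to be left in the hardware */
static const struct task *hw_owner;

/* eager save on the way out: copy registers to ram, disable access,
 * and remember that the hardware still holds this task's state */
void eager_save(struct task *t)
{
    memcpy(t->saved, hw.regs, sizeof hw.regs);
    hw.enabled = false;
    hw_owner = t;
}

/* lazy restore trap: skip the reload if the tracker says the hardware
 * still holds this task's state (the optimization) */
void lazy_restore(const struct task *t, bool optimize)
{
    if (!(optimize && hw_owner == t))
        memcpy(hw.regs, t->saved, sizeof hw.regs);
    hw.enabled = true;
    hw_owner = t;
}

/* the hardware loses power; the tracker only finds out if it is told */
void power_cycle(bool tracker_notified)
{
    memset(&hw, 0, sizeof hw);
    if (tracker_notified)
        hw_owner = NULL;
}

/* one save / power loss / restore round trip; returns true if the task
 * sees the same register contents afterwards */
bool state_survives(bool optimize, bool tracker_notified)
{
    static struct task t;

    hw_owner = NULL;
    for (int i = 0; i < NREGS; i++)
        hw.regs[i] = 1.5 + i;
    hw.enabled = true;

    eager_save(&t);
    power_cycle(tracker_notified);
    lazy_restore(&t, optimize);
    return memcmp(hw.regs, t.saved, sizeof hw.regs) == 0;
}
```

in this model state_survives(true, false) is the buggy case (optimization on, tracker never told about the power loss), state_survives(false, false) corresponds to just disabling the optimization, and state_survives(true, true) is what you get if the tracking is invalidated when the hardware loses state.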

a couple of days later the root cause seems to have been identified and 
fixed. i describe the whole thing here:

once again, thank you for all your help.

kind regards
