FP register corruption in Exynos 4210 (Cortex-A9)
Lanchon
lanchon at gmail.com
Tue Oct 7 14:48:23 PDT 2014
Hi,
There is a longstanding bug in all the after-market kernels (and maybe
manufacturer's kernels too) for all the Exynos 4210 (Cortex-A9)-based
devices. These include:
Samsung Galaxy S II
Samsung Galaxy Note
Samsung Galaxy Tab 7.0 Plus
...and others.
Under rare conditions which are not easy to reproduce, floating point
registers of userland processes get clobbered.
There is a vital FUSE process in Android 4.4 (called 'sdcard.c') that
mediates access to internal phone storage as an emulated sdcard, and to
external sdcards too. This process, normally compiled using
-mfloat-abi=softfp, calls pread64() after saving the value of a 64-bit
integer variable (called 'unique') in an FPU register (d8). On very rare
occasions, upon return from pread64() the value of the FP register is
corrupted; as a result the process stops responding and the device
loses access to storage.
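To make the failure mode concrete, here is a minimal userland sketch of
the pattern described above: a 64-bit value is parked in an FP-typed
variable, a pread() is issued, and the value is compared afterwards. The
helper name value_survives_pread and the 64-byte buffer are made up for
illustration and are not taken from sdcard.c, and plain pread() stands
in for pread64(). Plain C cannot pin a variable to d8, so whether the
value really sits in an FP register across the call depends on the
compiler and flags; the generated code has to be inspected to confirm
it.

```c
#define _XOPEN_SOURCE 700   /* for pread() under strict -std= modes */
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/*
 * Hypothetical helper, not taken from sdcard.c.
 * Returns 1 if the value held across the read is intact, 0 if it
 * changed, and -1 if the file could not be opened or read.
 */
int value_survives_pread(const char *path, uint64_t unique)
{
    volatile uint64_t expected = unique;  /* reference copy kept in memory */
    uint64_t bits = expected;
    uint64_t after;
    double held;                          /* FP-typed home for the 64-bit pattern */
    char buf[64];
    ssize_t n;
    int fd;

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    memcpy(&held, &bits, sizeof held);    /* bit copy, no int-to-double conversion */
    n = pread(fd, buf, sizeof buf, 0);    /* the syscall the value must survive */
    memcpy(&after, &held, sizeof after);  /* read the pattern back */

    close(fd);
    if (n < 0)
        return -1;
    return after == expected;
}
```

On a healthy system, value_survives_pread("/dev/zero",
0x0123456789abcdefULL) should return 1. A return of 0 would indicate
that the held value changed across the call.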
There are other instabilities in the platform suspected of having the
same cause. This bug has plagued the platform for years, but only
recently was FP clobbering identified as the culprit.
More context:
This only happens on 4210-based devices. The same kernel tree compiled
for 4212- and 4412-based devices does not exhibit the behavior. (The
4x12 SoCs are a newer iteration of the 4210, with the 'x' corresponding
to the number of cores. See:
http://en.wikipedia.org/wiki/Exynos#List_of_Exynos_SoCs ) This points to
a hardware issue (perhaps an erratum workaround missing from the
kernel) or to a driver issue.
Simply busy-spinning in userland waiting for FP corruption does not seem
to trigger the issue. Concurrently accessing storage in another process
while spinning does not trigger it either; power management (sleep,
etc.) may be involved.
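For reference, a busy-spin check of this kind can be sketched as
follows. This is an illustration only, not the test from the post linked
below; the name fp_spin_check and the reference values are assumptions.
The locals are plain doubles so that an optimizing compiler is likely
(but not guaranteed) to keep them in FP registers, while the reference
array is volatile so that every comparison reloads it from memory.

```c
/* Reference values live in memory; volatile forces a fresh load on
 * every check so the comparison cannot be folded away. */
static volatile double reference[4] = { 1.5, -2.25, 3.125, 1e300 };

/*
 * Hypothetical helper: spins for `iterations` rounds and returns the
 * number of rounds in which a held value differed from its reference.
 */
unsigned long fp_spin_check(unsigned long iterations)
{
    /* Plain locals: with optimization these tend to stay in FP
     * registers for the whole loop, though C does not guarantee it. */
    double a = reference[0];
    double b = reference[1];
    double c = reference[2];
    double d = reference[3];
    unsigned long mismatches = 0;
    unsigned long i;

    for (i = 0; i < iterations; i++) {
        if (a != reference[0] || b != reference[1] ||
            c != reference[2] || d != reference[3])
            mismatches++;
    }
    return mismatches;
}
```

A nonzero return from fp_spin_check() would mean a held value changed
while the loop was running. As noted above, spinning like this has not
reproduced the bug so far.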
Compiling 'sdcard.c' using -mfloat-abi=soft solves the issue (for this
vital process), since the 'unique' variable is then saved in regular
(core) registers instead of FP registers.
Objdumping the complete kernel does not show any instructions that
access 'd' registers, except in context switching code, and in the code
that implements traps that old VFP units need to handle some corner
cases. Also, objdumps of *.ko files do not reveal any instructions that
access 'd' registers.
These kernels have neither 'kernel_neon_begin' nor 'kernel_vfp_begin'
support; the code is simply not there.
Some links:
One of the affected kernel trees:
https://github.com/CyanogenMod/android_kernel_samsung_smdk4412/tree/cm-11.0
First direct observation of corruption:
http://forum.xda-developers.com/showthread.php?p=51237856&highlight=unique
The 'sdcard.c' process:
http://forum.xda-developers.com/showthread.php?p=55787440
Post showing that 'unique' is saved in 'd8':
http://forum.xda-developers.com/showthread.php?p=55783884
A busy-spin FP corruption test (that fails to reproduce the bug):
http://forum.xda-developers.com/showthread.php?p=55861206
Objdumps:
http://forum.xda-developers.com/showthread.php?p=55839635
And finally some questions:
Kernel threads (such as the worker thread of a threaded interrupt)
should guard FPU access between 'kernel_XXX_begin'/'kernel_XXX_end'
calls (which our kernels do not implement). But what if they did not?
1) What is the FPU enable state while executing a kernel thread in ARM
arch? Which of these answers is correct?
1a) the FPU is always disabled in kernel threads.
1b) the FPU might be enabled or disabled in a kernel thread, depending
on the FPU enable state of the userland context that executed before
and/or some other factors.
2) What would happen if a kernel thread executed an FPU instruction
without the 'kernel_XXX_begin'/'kernel_XXX_end' guards in ARM arch and
the FPU was disabled at the time?
2a) In the FPU trap the kernel would always detect the issue and panic
or oops or something.
2b) In the FPU trap the kernel might enable the FPU, load the FPU
context of some userland process and resume the kernel thread.
Of course an ISR should not touch the FPU at all. But what if it did?
3) What would happen if an ISR executed an FPU instruction in ARM arch
and the FPU was disabled in the context that was interrupted?
3a) In the FPU trap the kernel would always detect the issue and panic
or oops or something.
3b) In the FPU trap the kernel would react as if the interrupted context
executed the FPU instruction: If the interrupted context was user mode,
it would restore the userland process' FP context into the FPU. If the
interrupted context was kernel mode, it would react as per the answer to
question 2) above.
4) What would happen if an ISR executed an FPU instruction in ARM arch
and the FPU was enabled in the context that was interrupted?
4a) The processor would disable the FPU on ISR entry automatically and
thus the system would behave as described in the answer to question 3)
above.
4b) If the driver uses the standard kernel interrupt dispatch
architecture, the kernel would disable the FPU before dispatching the
interrupt to the driver ISR, and so the system would also behave as
described in 3).
4c) The FPU instruction would execute. There is no fail-fast or
detection of this kind of violation by the kernel.
Of course every pointer, idea, or suspicion that might seem relevant to
the case is welcome.
Thank you very much for reading and for your help.
Regards,
Lanchon