FP register corruption in Exynos 4210 (Cortex-A9)

Wed Oct 8 01:19:19 PDT 2014

On 10/07/2014 07:44 PM, Russell King - ARM Linux wrote:
> On Tue, Oct 07, 2014 at 07:35:14PM -0300, Lanchon wrote:
>>> I hope this helps; I didn't answer your specific questions because it
>>> seemed I would just end up repeating what I've said above.
>>>
>> actually no, answers to my very specific questions would help me
>> understand this: if we had a close-source driver (ISR or kernel thread)
>> that touched the FPU, how would the kernel react?
> I already covered this.  It would corrupt the VFP state, thereby
> corrupting the VFP state which userspace sees.
>
> Hence why I said:
>
> 	Which means that the kernel itself must /never/ make use of floating
> 	point itself - if it does, it /will/ corrupt the user state in the way
> 	               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> 	you are seeing.
> 	^^^^^^^^^^^^^^^
>
> How can I make that more clear?

no, actually you did not answer my questions. you stated that the end 
result would be corruption of user FP state, which i already know. i am 
inquiring as to *how* the process of corruption comes about exactly, not 
the end result.

knowing exactly how corruption can happen and how it cannot would help 
me decide where to look for the offending code.

for instance, you say that if an ISR uses the FPU it would corrupt user
FP state. fine, but it is not that simple. what if the FPU was disabled
at the time of the interrupt? (i.e., lazy restore had not yet happened
in this time slice.) then the ISR's FPU instruction would trap, not
corrupt immediately. would the kernel recognize that the trap was
generated in ISR code and panic, or just blindly restore the FP context
of the interrupted thread? if the former is true, then i can discount
ISRs as a source of corruption because i am not seeing panics, so there
is no point in instrumenting ISRs. if the latter is true, ok fine... but
what if the interrupted thread was a kernel thread? where would the
restored FP context come from?

answering these questions requires knowledge of both the linux kernel
architecture and cortex-A, and i know neither of them, which is why i am
asking on this list.

a plausible answer (which i am making up out of the blue) would be:

"each cpu is always working in the context of a 'current' or 'executing' 
userland process (which may be the idle process), with the MMU 
configured to its virtual address space and all, even when the cpu is 
executing a kernel thread. the FPU state and handling is not affected by 
user/kernel mode switches, only by userland context switches. this means 
that if a kernel thread executes FP instructions, the kernel will trap 
if the FPU is disabled and happily restore the context of the current 
userland process of the CPU for the kernel thread to corrupt next, never 
noticing that the trap originated in kernel mode.

also the arm architecture will not disable the FPU on interrupt 
processing, and the kernel will not disable the FPU prior to dispatching 
the interrupt to the registered drivers. so the same thing would happen 
in an ISR, even if the ISR is interrupting a kernel thread."
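
to make this first made-up answer concrete, here is a toy model of it in
python. it is only a sketch of the hypothesis as worded above: every name
in it (Cpu, fp_instruction, user_visible_fp, saved_fp) is invented for
illustration, and it makes no claim about what the real kernel or the
cortex-A9 actually do.

```python
# toy model of the hypothesis above. every name is invented for
# illustration; nothing here is a real kernel symbol or real ARM behaviour.

class Cpu:
    def __init__(self, user_process):
        self.current_user = user_process  # userland process that owns this cpu
        self.fpu_enabled = False          # lazy restore has not happened yet
        self.fpu_regs = None              # contents of the hardware FP registers


def fp_instruction(cpu, value, mode):
    """run an FP instruction that writes `value`.

    `mode` is "user", "kthread" or "isr". under this hypothesis the trap
    handler never looks at it.
    """
    if not cpu.fpu_enabled:
        # FPU trap: blindly restore the current userland process' context
        cpu.fpu_regs = cpu.current_user["saved_fp"]
        cpu.fpu_enabled = True
    cpu.fpu_regs = value


def user_visible_fp(cpu):
    """FP state the userland process sees when it next runs."""
    if cpu.fpu_enabled:
        return cpu.fpu_regs
    return cpu.current_user["saved_fp"]


proc = {"saved_fp": 1.5}
cpu = Cpu(proc)
fp_instruction(cpu, 99.0, "isr")   # traps, restores 1.5, then overwrites it
print(user_visible_fp(cpu))        # the process now sees the ISR's value
```

in this model there is never a panic, so the absence of panics would say
nothing about whether an ISR or kernel thread is the culprit.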

another plausible answer would be:

"the kernel always disables the FPU on scheduling a kernel thread. the 
cost of fiddling with this is low compared to the safety it provides. if 
triggered, the FPU trap will notice that the CPU is in kernel mode and 
panic, failing fast. there are no special rules applying to interrupts: 
if an ISR issues FP instructions they will be handled as if they had 
been issued in the interrupted thread (kernel: panic; user: lazy restore 
and/or execution)."
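
the same made-up answer, written as a tiny python table of outcomes. this
is just my reading of the hypothetical wording above; the function name
and the outcome strings are invented for illustration and describe no
real kernel code path.

```python
# toy model of the fail-fast hypothesis above. all names are invented for
# illustration and say nothing about what the real kernel does.

def isr_fp_outcome(interrupted, fpu_enabled):
    """outcome of an FP instruction issued by an ISR.

    `interrupted` is "kernel" or "user": the kind of thread the ISR
    interrupted. under this hypothesis the FPU is always disabled in
    kernel threads, so `fpu_enabled` is ignored for them.
    """
    if interrupted == "kernel":
        # the instruction traps and the handler sees kernel mode
        return "panic"
    if not fpu_enabled:
        # the trap is treated as the user thread's own first FP use
        return "lazy restore, then silent corruption"
    # FPU already on for the user thread: nothing traps at all
    return "silent corruption"


for interrupted in ("kernel", "user"):
    for fpu_enabled in (False, True):
        print(interrupted, fpu_enabled, "->",
              isr_fp_outcome(interrupted, fpu_enabled))
```

note that even under this fail-fast answer, an ISR that only ever
interrupts user threads would never panic, so "no panics seen" would not
fully clear ISRs.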

yet another:

"the FPU is effectively disabled on interrupt processing by the arm 
architecture. while running in interrupt mode, and independent of the 
FPU enable status, all FP instructions will trap to a different FPU 
vector which will cause the kernel to panic."

any and all of these hypothetical details would help me determine where 
*not* to look for the cause of the problem, where and what type of 
instrumentation is worth trying, etc. a simple "state would be 
corrupted" sentence does not give me any useful information that helps 
me find the source of the problem, but understanding the process of 
corruption might. (disregarding the fact that this is probably a 
hardware bug, maybe a cache coherence problem or something of the sort, 
and there might be no error in the code at all.)

this is why i will close this email with a copy of my questions for
context. maybe someone can answer some of them.

thanks again, and in advance to anyone who can help.

regards,
lanchon

--------------------------

Kernel threads (such as the worker thread of a threaded interrupt)
should guard FPU access in between 'kernel_XXX_begin'/'kernel_XXX_end'
calls (which our kernels do not implement). But what if they did not?

1) What is the FPU enable state while executing a kernel thread in ARM 
arch? Which of these answers is correct?

1a) the FPU is always disabled in kernel threads.
1b) the FPU might be enabled or disabled in a kernel thread, depending 
on the FPU enable state of the userland context that executed before 
and/or some other factors.

2) What would happen if a kernel thread executed an FPU instruction
without the 'kernel_XXX_begin'/'kernel_XXX_end' guards in ARM arch and
the FPU was disabled at the time?

2a) In the FPU trap the kernel would always detect the issue and panic 
or oops or something.
2b) In the FPU trap the kernel might enable the FPU, load the FPU 
context of some userland process and resume the kernel thread.

Of course an ISR should not touch the FPU at all. But what if it did?

3) What would happen if an ISR executed an FPU instruction in ARM arch 
and the FPU was disabled in the context that was interrupted:

3a) In the FPU trap the kernel would always detect the issue and panic 
or oops or something.
3b) In the FPU trap the kernel would react as if the interrupted context 
executed the FPU instruction: If the interrupted context was user mode, 
it would restore the userland process' FP context into the FPU. If the 
interrupted context was kernel mode, it would react as per the answer to 
question 2) above.

4) What would happen if an ISR executed an FPU instruction in ARM arch 
and the FPU was enabled in the context that was interrupted:

4a) The processor would disable the FPU on ISR entry automatically and 
thus the system would behave as described in the answer to question 3) 
above.
4b) If the driver uses the standard kernel interrupt dispatch 
architecture, the kernel would disable the FPU before dispatching the 
interrupt to the driver ISR, and so the system would also behave as 
described in 3).
4c) The FPU instruction would execute. There is no fail-fast or 
detection of this kind of violation by the kernel.
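
To show why the answers matter for deciding where to look, here is a
small Python sketch of the elimination logic. It assumes only what I
observe (corruption, but no panics or oopses) and takes the letter
answers to questions 1) to 4) as inputs. The function name and the
suspect labels are made up for illustration, and the rules encode my own
reading of the questions, not known kernel behaviour.

```python
# Which code paths could still be corrupting FP state *silently*, given
# that no panics are observed? Inputs are the letter answers to the four
# questions above. All names are invented for illustration.

def suspects_given_no_panics(q1, q2, q3, q4):
    """Return the set of code paths that cannot be ruled out."""
    suspects = set()
    # Kernel threads are cleared only if the FPU is always off there (1a)
    # and the resulting trap always panics (2a).
    if not (q1 == "a" and q2 == "a"):
        suspects.add("kernel thread")
    # ISRs are cleared only if the FPU is always off on ISR entry
    # (4a or 4b) and a trap taken in an ISR always panics (3a).
    if not (q3 == "a" and q4 in ("a", "b")):
        suspects.add("isr")
    return suspects


print(sorted(suspects_given_no_panics("a", "a", "a", "b")))
print(sorted(suspects_given_no_panics("b", "b", "b", "c")))
```

The first call prints an empty list (every violation would have
panicked), the second prints both suspects. Any combination in between
tells me which kind of code is still worth instrumenting.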

Of course every pointer, idea, or suspicion that might seem relevant to 
the case is welcome.