CPU stalls when handling PAN emulation?

Thu Jan 11 16:11:27 PST 2024

Hi,

Mark noticed that LKDTM's EXEC_USERSPACE test[1] (which trips the PAN
emulation, CONFIG_CPU_SW_DOMAIN_PAN=y, on 32-bit arm) appears to lead
to a CPU stall. make_task_dead() reports:

	note: ...[NNN] exited with preempt_count 1

which isn't seen with other Oopses (like EXEC_RODATA). (Both also
report "note: ...[NNN] exited with irqs disabled", but this seems
survivable.) I note that the ACCESS_USERSPACE test does _not_ have
this problem.

ACCESS_USERSPACE (survivable) starts with:

	lkdtm: Performing direct entry ACCESS_USERSPACE
	lkdtm: attempting bad read at 76f44000
	8<--- cut here ---
	Unhandled fault: page domain fault (0x01b) at 0x76f44000

EXEC_USERSPACE (leads to CPU stall) starts with:

	lkdtm: Performing direct entry EXEC_USERSPACE
	lkdtm: attempting ok execution at 8075bf18
	lkdtm: attempting bad execution at 76f6f000
	Unhandled prefetch abort: page domain fault (0x01b) at 0x76f6f000
	8<--- cut here ---
	Unhandled fault: page domain fault (0x01b) at 0x76f6f000

So both are getting caught as page domain faults, but there looks to
be a second fault for EXEC_USERSPACE. (more below)

For the CPU stall to appear, there needs to be (at least) a second Oops.
As an example, if I run EXEC_USERSPACE and then EXEC_RODATA, the latter
stops sending to the console very quickly, reporting only the very start
of the Oops:

	lkdtm: Performing direct entry EXEC_USERSPACE
	lkdtm: attempting ok execution at 8075bf18
	lkdtm: attempting bad execution at 76f10000
	Unhandled prefetch abort: page domain fault (0x01b) at 0x76f10000
	8<--- cut here ---
	Unhandled fault: page domain fault (0x01b) at 0x76f10000
	[76f10000] *pgd=44f6e835, *pte=469a455f, *ppte=469a4c7e
	Internal error: : 1b [#1] SMP ARM
	Modules linked in:
	...
	Stack: (0xf0959d10 to 0xf095a000)
	...
	 copy_from_kernel_nofault from is_valid_bugaddr+0x40/0x84
	 r7:f0959e08 r6:76f10000 r5:81a60000 r4:00000000
	 is_valid_bugaddr from report_bug+0x4c/0x1b8
	 r4:80e84f4c
	 report_bug from die+0xb4/0x2f0
	 r10:8100a3dc r9:80f0ead0 r8:60070193 r7:81a60000 r6:0000001b r5:80cf5e1c
	 r4:f0959e08
	 die from arm_notify_die+0x54/0x58
	 r10:81a60000 r9:81a60000 r8:80fab2ec r7:80f0ffc8 r6:f0959e08 r5:76f10000
	 r4:0000001b
	 arm_notify_die from do_PrefetchAbort+0x90/0x98
	 do_PrefetchAbort from __pabt_svc+0x5c/0xa0
	Exception stack(0xf0959e08 to 0xf0959e50)
	 r7:f0959e3c r6:ffffffff r5:60070013 r4:76f10000
	 lkdtm_EXEC_USERSPACE from lkdtm_do_action+0x2c/0x4c
	 r4:84fc6000
	 lkdtm_do_action from direct_entry+0x130/0x150
	...
	---[ end trace 0000000000000000 ]---
	note: cat[1271] exited with irqs disabled
	note: cat[1271] exited with preempt_count 1
	lkdtm: Performing direct entry EXEC_RODATA
	lkdtm: attempting ok execution at 8075bf18
	lkdtm: attempting bad execution at 80b42118
	8<--- cut here ---
	Unable to handle kernel paging request at virtual address 80b42118 when execute
	[80b42118] *pgd=40a1940e(bad)
	****nothing else****

Here the two faults in EXEC_USERSPACE are visible. Does anyone see
something obvious in the exception handling that might cause this? I'm
not sure what to do next to figure out what's going wrong.

Any help greatly appreciated! :)

-Kees

[1] To run an LKDTM test, build with CONFIG_LKDTM=y, mount debugfs and do:
	echo "EXEC_USERSPACE" | cat >/sys/kernel/debug/provoke-crash/DIRECT
    (The pipe to cat is to avoid killing your shell on Oopses and BUGs.)
    To list all available tests:
	cat /sys/kernel/debug/provoke-crash/DIRECT
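    For convenience, the EXEC_USERSPACE-then-EXEC_RODATA sequence described
    above could be scripted roughly as follows. This is only a sketch: the
    run_lkdtm helper and the DIRECT override variable are made up for
    illustration, and it assumes debugfs is already mounted at
    /sys/kernel/debug.

```shell
#!/bin/sh
# Sketch only: run two LKDTM tests back to back.
# DIRECT defaults to the path from the footnote; the override is an
# illustrative assumption so the helper can be dry-run against a file.
DIRECT="${DIRECT:-/sys/kernel/debug/provoke-crash/DIRECT}"

# Hypothetical helper: write one test name to the LKDTM trigger file.
run_lkdtm() {
	# Pipe through cat so an Oops kills cat rather than this shell.
	echo "$1" | cat >"$DIRECT"
}

if [ -w "$DIRECT" ]; then
	# The first test leaves preempt_count at 1; the second Oops is
	# where the console output stops.
	for t in EXEC_USERSPACE EXEC_RODATA; do
		run_lkdtm "$t"
	done
else
	echo "LKDTM trigger file not writable: $DIRECT" >&2
fi
```

    Run it as root on a kernel built with CONFIG_LKDTM=y; on anything else
    it just prints the "not writable" message and does nothing.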

--
Kees Cook