Child process might hang inside kernel with 2.6.33.15-rt31

Jouko Haapaluoma jouhaapa at gmail.com
Tue Jul 31 02:00:50 EDT 2012


Hello everyone

I know that the kernel 2.6.33 is very old, but since RT_PREEMPT was 
unavailable to newer versions when we were choosing the kernel, we had 
to choose this one and currently it is not possible to start converting 
our board to a newer kernel.

We have run into a problem where a pthreaded application executes shell 
commands with the system() function which runs fork/clone+execve and 
sometimes the thread that is calling the system() might hang forever. I 
attached a simple test program that reproduces the problem within 
minutes. I also tried to implement the system() function by doing a 
manual syscall(SYS_fork) call but the same problem remains.

Our boards are using the AT91SAM9260 or AT91SAM9263 ARM SoCs, glibc v2.9 
and gcc v4.3.3. We also tried with eglibc v2.12 and gcc v4.5.4. Kernel 
configuration is attached.

At first the test programs were also crashing (plus the other thread 
locking) and we encountered the problems also with the vanilla 2.6.33 
kernel with CONFIG_PREEMPT. However, I was reading the linux-arm-kernel 
archives and found the thread "cache aliasing in dup_mmap" (2009-03-07) 
where was indicated that this problem is caused by VIVT cache aliasing. 
Russell King posted a patch in that thread that fixed all the crashing 
and hanging problems on a vanilla 2.6.33 kernel. I found out that the 
patch was applied to the 2.6.34 release 
(http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=2725898fc9bb2121ac0fb1b5e4faf4fc09014729). 


The patch fixed the crashing also with the RT_PREEMPT but we still 
encounter those weird hangs. I made some tracing with sysrq-t and I 
pasted the backtrace of the locked task (the child process) in this mail 
[1]. It always seems to hang to the same place 
"get_page_from_freelist+0x31c" when running sys_execve(). It looks like 
the parent hangs because it sits on waitpid() which is called for the 
child process and the child process hangs forever inside the kernel at 
"get_page_from_freelist+0x31c". I noticed that the hung task might 
actually continue running normally for a bit after some time but that is 
very rare.

I also found that running "ps" when the task is hung might lock the 
whole system for a while. Kernel printed the hung task backtrace which 
is also pasted to this mail [2]. This could be related to the same 
problem. I have not gotten any more info about the hang with 
DETECT_SOFTLOCKUP, DEBUG_LOCKDEP and DEBUG_SPINLOCK_SLEEP options.

Could this problem still be related to VIVT cache aliasing since similar 
hanging was present with vanilla 2.6.33 without the patch or is it a 
general RT_PREEMPT problem? Jamie Lokier had some doubts if there were 
some holes left: 
http://article.gmane.org/gmane.linux.ports.arm.kernel/69927/

Has anyone else encountered this on any platform?

BR,
Jouko Haapaluoma


********[1] sysrq-t backtrace ***********
[ 1666.230000] system_hang_e R running 0 10745 743 0x00000000
[ 1666.250000] [<c0251c64>] (__schedule+0x304/0x380) from [<c0251dbc>] 
(preempt_schedule+0x64/0x90)
[ 1666.270000] [<c0251dbc>] (preempt_schedule+0x64/0x90) from 
[<c006ce98>] (get_page_from_freelist+0x31c/0x478)
[ 1666.290000] [<c006ce98>] (get_page_from_freelist+0x31c/0x478) from 
[<c006d238>] (__alloc_pages_nodemask+0x100/0x568)
[ 1666.310000] [<c006d238>] (__alloc_pages_nodemask+0x100/0x568) from 
[<c006d6b4>] (__get_free_pages+0x14/0x44)
[ 1666.330000] [<c006d6b4>] (__get_free_pages+0x14/0x44) from 
[<c002c95c>] (get_pgd_slow+0x18/0xe8)
[ 1666.350000] [<c002c95c>] (get_pgd_slow+0x18/0xe8) from [<c00363c0>] 
(mm_init+0xac/0xfc)
[ 1666.370000] [<c00363c0>] (mm_init+0xac/0xfc) from [<c0091584>] 
(bprm_mm_init+0x10/0x174)
[ 1666.370000] [<c0091584>] (bprm_mm_init+0x10/0x174) from [<c0091b58>] 
(do_execve+0xa0/0x26c)
[ 1666.390000] [<c0091b58>] (do_execve+0xa0/0x26c) from [<c00281dc>] 
(sys_execve+0x38/0x5c)
[ 1666.410000] [<c00281dc>] (sys_execve+0x38/0x5c) from [<c0024f40>] 
(ret_fast_syscall+0x0/0x28)

(gdb) list *(get_page_from_freelist+0x31c)
0xc006ce98 is in get_page_from_freelist (mm/page_alloc.c:193).
188 }
189
190 static inline void unlock_cpu_pcp(unsigned long flags, int this_cpu)
191 {
192 #ifdef CONFIG_PREEMPT_RT
193 put_cpu_var_locked(pcp_locks, this_cpu);
194 #else
195 local_irq_restore(flags);
196 #endif
197 }
(gdb)


********[2] ps hangs ***********
root at at91sam9263ek:/# ps
PID USER VSZ STAT COMMAND
1 root 1572 S init [5]
2 root 0 SW [kthreadd]
3 root 0 SW [sirq-high/0]
4 root 0 SW [sirq-timer/0]
5 root 0 SW [sirq-net-tx/0]
6 root 0 SW [sirq-net-rx/0]
7 root 0 SW [sirq-block/0]
8 root 0 SW [sirq-block-iopo]
9 root 0 SW [sirq-tasklet/0]
10 root 0 SW [sirq-sched/0]
11 root 0 SW [sirq-hrtimer/0]
12 root 0 SW [sirq-rcu/0]
13 root 0 SW [posixcputmr/0]
14 root 0 SW [watchdog/0]
15 root 0 SW< [desched/0]
16 root 0 SW< [events/0]
17 root 0 SW [khelper]
20 root 0 SW [async/mgr]
95 root 0 SW [sync_supers]
97 root 0 SW [bdi-default]
99 root 0 SW [kblockd/0]
108 root 0 SW [khubd]
111 root 0 SW [kseriod]
119 root 0 SW [cfg80211]
136 root 0 SW [khungtaskd]
137 root 0 SW [kswapd0]
138 root 0 SW [aio/0]
223 root 0 SW [mtdblockd]
256 root 0 SW [ubi_bgt0d]
259 root 0 SW [ubi_bgt1d]
260 root 0 SW [irq/14-atmel_sp]
267 root 0 SW [irq/15-atmel_sp]
274 root 0 SW [irq/29-ohci_hcd]
285 root 0 SW [irq/1-rtc0]
323 root 0 SW [usbhid_resumer]
330 root 0 SW [ubifs_bgt0_1]
354 root 1932 S < /sbin/udevd -d
459 root 1928 S < /sbin/udevd -d
460 root 1928 S < /sbin/udevd -d
614 root 0 SW [ubifs_bgt1_1]
727 root 2900 S /sbin/syslogd -n -C64 -m 20
729 root 2836 S /sbin/klogd -n
740 root 3016 S -sh
742 root 18060 R ./system_hang_echo
4170 root 3016 R ps
4175 root 0 []
[ 1560.600000] INFO: task ps:4170 blocked for more than 120 seconds.
[ 1560.600000] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
disables this message.
[ 1560.630000] ps D c0251c64 0 4170 740 0x00000000
[ 1560.660000] [<c0251c64>] (__schedule+0x304/0x380) from [<c0251e10>] 
(schedule+0x28/0x44)
[ 1560.660000] [<c0251e10>] (schedule+0x28/0x44) from [<c0252ca4>] 
(__rt_mutex_slowlock+0xa0/0xc8)
[ 1560.690000] [<c0252ca4>] (__rt_mutex_slowlock+0xa0/0xc8) from 
[<c02532d0>] (rt_mutex_slowlock+0x1dc/0x2b0)
[ 1560.720000] [<c02532d0>] (rt_mutex_slowlock+0x1dc/0x2b0) from 
[<c005c870>] (rt_down_read+0x28/0x38)
[ 1560.760000] [<c005c870>] (rt_down_read+0x28/0x38) from [<c007b7ec>] 
(access_process_vm+0x34/0x180)
[ 1560.790000] [<c007b7ec>] (access_process_vm+0x34/0x180) from 
[<c00c8320>] (proc_pid_cmdline+0x58/0xd4)
[ 1560.820000] [<c00c8320>] (proc_pid_cmdline+0x58/0xd4) from 
[<c00c986c>] (proc_info_read+0x5c/0xd8)
[ 1560.850000] [<c00c986c>] (proc_info_read+0x5c/0xd8) from [<c008c3d0>] 
(vfs_read+0xac/0x158)
[ 1560.880000] [<c008c3d0>] (vfs_read+0xac/0x158) from [<c008c534>] 
(sys_read+0x40/0x6c)
[ 1560.910000] [<c008c534>] (sys_read+0x40/0x6c) from [<c0024f40>] 
(ret_fast_syscall+0x0/0x28)
17759 root 2836 R sh -c echo HANG! > /dev/null
root at at91sam9263ek:/#
-------------- next part --------------
A non-text attachment was scrubbed...
Name: at91sam9263ek.config
Type: application/xml
Size: 7311 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-arm-kernel/attachments/20120731/3d866139/attachment.wsdl>
-------------- next part --------------
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

void *print_message_function( void *ptr )
{
    pthread_t num = pthread_self();
    while(1)
    {
        printf("Thread %d doing a system() call\n", num);
        system("echo HANG! > /dev/null");
        printf("Thread %d call finished\n", num);
    }
}

int main(int argc, char *argv[])
{
    pthread_t thread1;
    pthread_t thread2;

    pthread_create( &thread1, NULL, print_message_function, NULL);
    pthread_create( &thread2, NULL, print_message_function, NULL);

    while(1)
    {
    }
}


More information about the linux-arm-kernel mailing list