Another JFFS2 deadlock, kernel 3.4.11
wangzaiwei
wangzaiwei at top-vision.cn
Mon Oct 26 03:53:11 PDT 2015
Hello all:
First, sorry for my poor English and any mistakes in email etiquette.
I have encountered another deadlock between several JFFS2 threads.
(It is different from the one posted at
http://lists.infradead.org/pipermail/linux-mtd/2012-October/044263.html )
The target system is a SoC with a BCM6838 (dual core, MIPS32, Broadcom BMIPS4350 V8.0).
We made a jffs2 partition on flash (S34ML01G200TFI000, 1 Gbit SLC NAND chip),
then mounted this jffs2 partition on both /app and /data:
mtd:data on /data type jffs2 (rw,relatime)
mtd:data on /app type jffs2 (rw,relatime)
We first hit a jffs2 deadlock when we ran the command "reboot"
in the shell: "reboot" got stuck. We then investigated as follows:
1, We found that /app and /data could hardly be accessed at all.
"ls" and "touch" got stuck too.
2, We noticed that [sync_supers] was in state D.
3, We compiled a kernel module that can read a process's kernel stack, and found
that [sync_supers] was stuck at lock_page(), the same as described at
http://lists.infradead.org/pipermail/linux-mtd/2012-October/044263.html .
4, "reboot" called sync() and then got stuck.
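Steps 2 and 3 can be approximated without a custom kernel module. The following is a minimal sketch, not the module we used: it reads /proc/<pid>/status to find tasks in state D and, where the kernel provides /proc/<pid>/stack (this needs CONFIG_STACKTRACE and root), prints their kernel stacks. The function name list_d_state and its optional proc-root argument are hypothetical and exist only for illustration and testing.

```shell
#!/bin/sh
# List tasks in uninterruptible sleep (state D).
# $1 is an optional proc root (hypothetical parameter, defaults to /proc).
list_d_state() {
    proc_root="${1:-/proc}"
    for status in "$proc_root"/[0-9]*/status; do
        [ -r "$status" ] || continue
        # The State line looks like: "State:  D (disk sleep)"
        state=$(sed -n 's/^State:[[:space:]]*\([A-Za-z]\).*/\1/p' "$status")
        if [ "$state" != "D" ]; then
            continue
        fi
        pid=${status%/status}
        pid=${pid##*/}
        name=$(sed -n 's/^Name:[[:space:]]*//p' "$status")
        echo "$pid $name"
        # The stack file is only present on kernels built with CONFIG_STACKTRACE.
        if [ -r "$proc_root/$pid/stack" ]; then
            cat "$proc_root/$pid/stack"
        fi
    done
    return 0
}
```

Calling list_d_state with no argument on the target scans the real /proc and prints one "pid name" line per blocked task, followed by its stack if it is readable.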
So we patched our kernel with the following commit from
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
SHA-1: 5ffd3412ae5536a4c57469cb8ea31887121dcb2e
* jffs2: Fix lock acquisition order bug in jffs2_write_begin
Recently, however, we hit another deadlock:
our process got stuck in the system call unlink() when deleting a file.
The scripts below can be used to reproduce this new issue.
Script 1: testr.sh
#!/bin/sh
while [ 1 ]
do
cat /app/test.file >/dev/null
done
Script 2: testr2.sh
#!/bin/sh
while [ 1 ]
do
ls /app -al >/dev/null
ls /app/test.file -al >/dev/null
done
Script 3: tests.sh
#!/bin/sh
while [ 1 ]
do
sync
sleep 1
done
Script 4: testw.sh
#!/bin/sh
while [ 1 ]
do
cat /etc/inittab >> /app/test.file
sleep 1
done
Script 5: testw2.sh
#!/bin/sh
while [ 1 ]
do
cat /etc/config > /app/test.file
sleep 10
done
/etc/config is an ASCII text file (size: 10785 bytes).
Run the scripts like this:
# ./testw2.sh &
# ./testw2.sh &
# ./testw2.sh &
# ./testw2.sh &
# ./testw.sh &
# ./testw.sh &
# ./testw.sh &
# ./testw.sh &
# ./testr.sh &
# ./testr.sh &
# ./testr.sh &
# ./testr.sh &
# ./testr.sh &
# ./testr.sh &
# ./testr2.sh &
# ./testr2.sh &
# ./testr2.sh &
# ./testr2.sh &
# ./tests.sh &
# ./tests.sh &
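The same set of background jobs can be started from one helper. This is only a convenience sketch that mirrors the counts in the command list above (4, 4, 6, 4 and 2 copies); the function names spawn and start_all and the DRY_RUN variable are hypothetical and not part of our test setup.

```shell
#!/bin/sh
# Start COUNT background copies of SCRIPT.
# With DRY_RUN=1 (hypothetical switch) only print the command instead of running it.
spawn() {
    count=$1
    script=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "$script &"
        else
            "$script" &
        fi
        i=$((i + 1))
    done
}

# Same mix of writers, readers and sync loops as in the manual command list.
start_all() {
    spawn 4 ./testw2.sh
    spawn 4 ./testw.sh
    spawn 6 ./testr.sh
    spawn 4 ./testr2.sh
    spawn 2 ./tests.sh
}
```

Running start_all from the directory that holds the scripts launches all 20 jobs; setting DRY_RUN=1 first only lists them.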
After about 10 minutes, these test scripts are blocked in state 'D'.
We analyzed this issue again.
For [sync_supers]:
jffs2_garbage_collect_live
mutex_lock(&f->sem) (A)
jffs2_garbage_collect_dnode
jffs2_gc_fetch_page
read_cache_page_async
do_read_cache_page
lock_page(page) (B)
For other tasks:
generic_file_aio_read
do_generic_file_read
lock_page_killable(page); (B)
mapping->a_ops->readpage (jffs2_readpage )
mutex_lock(&f->sem) (A)
We noticed that jffs2_readpage is always called with the page lock
(lock_page(page)) held, but most other functions in the jffs2 module take
mutex_lock(&f->sem) first and lock_page(page) second. It is the same in the latest kernel:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
Is this intended? Or is my understanding wrong?
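The two call chains take locks (A) and (B) in opposite order, which is the classic ABBA pattern. As a rough illustration only (this is not kernel code and not lockdep), the sketch below reads one line per code path, with the path name followed by its locks in acquisition order, and reports any pair of paths that take the same two locks in reverse order. The function name find_inversions and the labels gc_path, read_path and page_lock are made up for this example.

```shell
#!/bin/sh
# Read lines of the form "<path_name> <lock1> <lock2> ..." from stdin and
# report pairs of paths that acquire the same two locks in opposite order.
find_inversions() {
    awk '
    {
        for (i = 2; i <= NF; i++)
            for (j = i + 1; j <= NF; j++) {
                pair = $i SUBSEP $j
                # Remember the first path seen taking lock $i before lock $j.
                if (!(pair in first))
                    first[pair] = $1
                rev = $j SUBSEP $i
                # An earlier path took the same locks the other way round.
                if (rev in first)
                    print "inversion: " first[rev] " takes " $j " then " $i "; " $1 " takes " $i " then " $j
            }
    }'
}
```

For example, feeding it the two orders seen in the stacks:

printf '%s\n' 'gc_path f->sem page_lock' 'read_path page_lock f->sem' | find_inversions

reports that gc_path takes f->sem then page_lock while read_path takes page_lock then f->sem.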
More info:
We have tested this issue on another target system, a SoC with a Freescale MPC8308
(PowerPC arch, e300c3), Linux version 2.6.29.6. It has a jffs2 partition on a
NOR flash (S29GL512P11TFI010).
We never reproduced any jffs2 deadlock on this PowerPC target,
even though its Linux source code has the problem mentioned in
http://lists.infradead.org/pipermail/linux-mtd/2012-October/044263.html .
What is the main difference between the two targets?
We noticed that the backing-dev code (bdi_sync_supers) came into the kernel between the
two versions, and that the two targets use different flash types (one NAND, one NOR).
----------------------------------
From: wangzaiwei
15th R&D Department
Sumavision Technologies Co., Ltd.