Another JFFS2 deadlock, kernel 3.4.11

Mon Oct 26 03:53:11 PDT 2015

Hello all:

First, sorry for my poor English and clumsy use of email.

I have encountered another deadlock between several JFFS2 threads.
(It is different from the one posted at
http://lists.infradead.org/pipermail/linux-mtd/2012-October/044263.html )

The target system is an SoC board with a BCM6838 (dual core, MIPS32, Broadcom BMIPS4350 V8.0).
We made a JFFS2 partition on the flash (S34ML01G200TFI000, 1 Gbit SLC NAND chip),
then mounted this partition on both /app and /data:
mtd:data on /data type jffs2 (rw,relatime)
mtd:data on /app type jffs2 (rw,relatime)

We first encountered a JFFS2 deadlock when we ran the command "reboot"
in a shell: "reboot" hung. We then investigated as follows:
1. We found that both /app and /data were almost completely inaccessible;
"ls" and "touch" hung too.
2. We noticed that [sync_supers] was in state D.
3. We compiled a kernel module that can read a process's kernel stack, and found
that [sync_supers] was stuck at lock_page(), exactly as described at
http://lists.infradead.org/pipermail/linux-mtd/2012-October/044263.html .
4. "reboot" called sync(), which then hung.
So we patched our kernel with the following commit from
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
SHA-1: 5ffd3412ae5536a4c57469cb8ea31887121dcb2e
* jffs2: Fix lock acquisition order bug in jffs2_write_begin

But recently we encountered another deadlock.
Our process got stuck in the system call unlink() when deleting a file.

The enclosed scripts can be used to reproduce this new issue.
Script 1: testr.sh
#!/bin/sh
while [ 1 ]
do
cat /app/test.file >/dev/null
done

Script 2: testr2.sh
#!/bin/sh
while [ 1 ]
do
ls /app -al >/dev/null
ls /app/test.file -al >/dev/null
done

Script 3: tests.sh
#!/bin/sh
while [ 1 ]
do
sync
sleep 1
done

Script 4: testw.sh
#!/bin/sh
while [ 1 ]
do
cat /etc/inittab >> /app/test.file
sleep 1
done

Script 5: testw2.sh
#!/bin/sh
while [ 1 ]
do
cat /etc/config > /app/test.file
sleep 10
done

/etc/config is an ASCII text file (size: 10785 bytes).

Run these scripts like this:
# ./testw2.sh &
# ./testw2.sh &
# ./testw2.sh &
# ./testw2.sh &
# ./testw.sh &
# ./testw.sh &
# ./testw.sh &
# ./testw.sh &
# ./testr.sh &
# ./testr.sh &
# ./testr.sh &
# ./testr.sh &
# ./testr.sh &
# ./testr.sh &
# ./testr2.sh &
# ./testr2.sh &
# ./testr2.sh &
# ./testr2.sh &
# ./tests.sh &
# ./tests.sh &
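
The same launch sequence can be written more compactly. The following is only a sketch: start_copies and run_all are hypothetical helper names, not part of the original test setup, and the counts simply mirror the list above.

```shell
#!/bin/sh
# Hypothetical helper: start COUNT background copies of a command.
# The PIDs of all started copies are collected in $PIDS.
PIDS=""
start_copies() {
    count=$1
    shift
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" &
        PIDS="$PIDS $!"
        i=$((i + 1))
    done
}

# Same mix of writers, readers and sync loops as in the list above.
# Call this from the directory that holds the test scripts.
run_all() {
    start_copies 4 ./testw2.sh
    start_copies 4 ./testw.sh
    start_copies 6 ./testr.sh
    start_copies 4 ./testr2.sh
    start_copies 2 ./tests.sh
}
```

After sourcing this file, "run_all" starts all 20 loops, and "kill $PIDS" stops those that are not already blocked.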

About 10 minutes later, these test scripts are all blocked in state 'D'.
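
One way to see which tasks are blocked, sketched here as an alternative to a custom kernel module: read the state from /proc/<pid>/status and, if available, the kernel stack from /proc/<pid>/stack. This assumes the kernel was built with CONFIG_STACKTRACE and that the script runs as root; task_state and list_blocked are hypothetical helper names.

```shell
#!/bin/sh
# Print the one-letter state (R, S, D, ...) from a /proc/<pid>/status file.
task_state() {
    sed -n 's/^State:[[:space:]]*\([A-Za-z]\).*/\1/p' "$1" 2>/dev/null
}

# For every task in state D, print its PID, name and kernel stack.
list_blocked() {
    for status in /proc/[0-9]*/status; do
        [ "$(task_state "$status")" = "D" ] || continue
        pid=${status#/proc/}
        pid=${pid%/status}
        name=$(sed -n 's/^Name:[[:space:]]*//p' "$status" 2>/dev/null)
        echo "== $pid $name"
        # Assumption: /proc/<pid>/stack exists (CONFIG_STACKTRACE) and is readable.
        cat "/proc/$pid/stack" 2>/dev/null || echo "(no /proc/$pid/stack)"
    done
}

list_blocked
```

If the kernel has CONFIG_MAGIC_SYSRQ enabled, "echo w > /proc/sysrq-trigger" should give a similar dump of blocked tasks in the kernel log.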

We analyzed this issue again.
For [sync_supers]:
jffs2_garbage_collect_live
    mutex_lock(&f->sem)                         (A)
    jffs2_garbage_collect_dnode
        jffs2_gc_fetch_page
            read_cache_page_async
                do_read_cache_page
                    lock_page(page)             (B)
For other tasks:
generic_file_aio_read
    do_generic_file_read
        lock_page_killable(page)                (B)
        mapping->a_ops->readpage (jffs2_readpage)
            mutex_lock(&f->sem)                 (A)

We noticed that jffs2_readpage is always called with the page lock held,
but most other functions in the jffs2 module take mutex_lock(&f->sem) first
and lock_page(page) second. It is the same in the latest kernel:
git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
Is this intended? Or is my understanding wrong?
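
The A/B ordering problem can be illustrated in user space. This is only an analogy under stated assumptions, not kernel code: two directories stand in for f->sem (A) and the page lock (B), mkdir acts as an atomic "try lock", and the names try_lock and task are hypothetical. Each task deliberately waits until the other one holds its first lock, which forces the bad interleaving.

```shell
#!/bin/sh
# Userspace analogy only: directories stand in for the two kernel locks.
LOCKDIR=$(mktemp -d)
A="$LOCKDIR/f_sem"       # stands in for mutex_lock(&f->sem)
B="$LOCKDIR/page_lock"   # stands in for lock_page(page)

# Try to take lock $1 up to $2 times, one second apart.
# mkdir is atomic, so success means the lock was free.
try_lock() {
    n=0
    while [ "$n" -lt "$2" ]; do
        mkdir "$1" 2>/dev/null && return 0
        n=$((n + 1))
        sleep 1
    done
    return 1
}

# Take lock $1, wait until the peer holds $2, then try to take $2.
task() {
    mkdir "$1"
    while [ ! -d "$2" ]; do sleep 1; done   # force the bad interleaving
    if try_lock "$2" 2; then
        echo "$3: got both locks"
    else
        echo "$3: stuck waiting"            # a real task would sit in state D
    fi
}

task "$A" "$B" gc   > "$LOCKDIR/gc.out" &     # GC path:   A then B
task "$B" "$A" read > "$LOCKDIR/read.out" &   # read path: B then A
wait
RESULT=$(cat "$LOCKDIR/gc.out" "$LOCKDIR/read.out")
echo "$RESULT"
rm -rf "$LOCKDIR"
```

Neither task ever releases its first lock, so both give up on the second one. If both paths took the locks in the same order, the second task would simply wait for the first to finish.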

More info:
We have tested this on another target system, an SoC board with a Freescale MPC8308
(PowerPC arch, e300c3) running Linux 2.6.29.6. It has a JFFS2 partition on a
NOR flash (S29GL512P11TFI010).
We have never reproduced any JFFS2 deadlock on this PowerPC target,
even though its Linux source code has the problem mentioned in
http://lists.infradead.org/pipermail/linux-mtd/2012-October/044263.html .

What is the main difference between the two targets?
We noticed that the backing-dev code (bdi_sync_supers) came into the kernel between the two versions,
and that the two targets use different flash types (one NAND, one NOR).

----------------------------------
From: wangzaiwei
15th R&D Department
Sumavision Technologies Co., Ltd.