Occasional crashes in suspend-resume with MMC transactions

Thu Mar 22 00:23:23 EDT 2012

Hi,

I am trying to do suspend-resume test with a file copy on MMC/SD going on
in the background. The test involves simply copying a 450MB file on an ext3
partition to the same partition under a different name.

This is on an AM335x board which uses the omap_hsmmc driver. 
The kernel is v3.2 and I have also applied the following patch 

commit 6b9dc6b5f2a1b7953e46e897eb3e91e885322ed0
Author: Guennadi Liakhovetski <g.liakhovetski at gmx.de>
Date:   Wed Jan 4 15:28:45 2012 +0100

    mmc: fix a deadlock between system suspend and MMC block IO

    Performing MMC block IO with simultaneous STR can lead to a deadlock: the
    mmc_pm_notify() function claims the host and then calls bus .remove()
    method, which lands in mmc_blk_remove(), which calls mmc_blk_remove_req()
    then it goes to -> mmc_cleanup_queue() -> kthread_stop(), which waits for
    the mmc-block thread to stop. If the mmc-block thread at that time is
    processing block requests, it will also try to claim the host in
    mmc_blk_issue_rq() and block there. This patch fixes the problem by
    calling .remove() before claiming the host.

    Signed-off-by: Guennadi Liakhovetski <g.liakhovetski at gmx.de>
    Acked-by: Arindam Nath <arindam.nath at amd.com>
    Signed-off-by: Chris Ball <cjb at laptop.org>

In some iterations the resume part gets into issues with the following dump

...
[  229.716033] PM: Syncing filesystems ... done.
[  281.389129] Freezing user space processes ... (elapsed 0.95 seconds) done.
[  282.357757] Freezing remaining freezable tasks ... (elapsed 4.36 seconds) done.
[  286.764099] PM: suspend of devices complete after 29.815 msecs
[  286.765777] PM: late suspend of devices complete after 1.647 msecs

( resume using a uart keystroke)
[  290.121765] GFX domain entered low power stat
[  290.123748] Unhandled fault: external abort on non-linefetch (0x1028) at 0xfa06022c
[  290.123809] Internal error: : 1028 [#1]
[  290.123840] Modules linked in:
[  290.123870] CPU: 0    Not tainted  (3.2.0-12280-g6b9dc6b-dirty #12)
[  290.123931] PC is at omap_hsmmc_request+0xcc/0x4d8
[  290.123962] LR is at mmc_start_request+0x110/0x20c
[  290.123992] pc : [<c030baec>]    lr : [<c02fbff8>]    psr: a0000113
[  290.124023] sp : cf373e58  ip : cf373ea0  fp : cf373e9c
[  290.124053] r10: 00000000  r9 : ce55e218  r8 : cfa6202c
[  290.124053] r7 : cfa620dc  r6 : cf3b6000  r5 : cfa6202c  r4 : cf3b6280
[  290.124084] r3 : fa060100  r2 : fa060100  r1 : 00000080  r0 : cf3b6000
[  290.124145] Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment kernel
[  290.124176] Control: 10c5387d  Table: 8edf4019  DAC: 00000015
[  290.124206] Process mmcqd/0 (pid: 666, stack limit = 0xcf3722f0)
[  290.124237] Stack: (0xcf373e58 to 0xcf374000)
[  290.124237] 3e40:                                                       cf373e9c cf373e68
[  290.124298] 3e60: c001991c c01c7898 cf362c40 00000001 00002000 cfa6202c cf3b6000 c0616550
[  290.124359] 3e80: cfa620dc cfa6202c ce55e218 00000000 cf373ed4 cf373ea0 c02fbff8 c030ba2c
[  290.124389] 3ea0: 00000001 00000000 cfa6202c 00000000 cf373ed4 cf3b6000 cfa6211c 00000000
[  290.124420] 3ec0: cf373f14 cfa6202c cf373efc cf373ed8 c02fce8c c02fbef4 00000000 cfa62400
[  290.124481] 3ee0: cfa62004 cf372000 cfa62400 cfa62000 cf373f44 cf373f00 c03089fc c02fcd88
[  290.124542] 3f00: cf373f54 00000000 cfa6223c c312e8d8 00100100 00200200 c02fc47c cfa62400
[  290.124572] 3f20: cfa62004 cf372000 cfa62000 c312e8d8 ce55e218 cfa62000 cf373f8c cf373f48
[  290.124633] 3f40: c0308e5c c0308998 00000000 c312e8d8 cf373f74 cf373f60 c01ab760 c01af878
[  290.124664] 3f60: c312e8d8 cfa62004 00000000 cf372000 00000001 cf3a8ae8 ce55e218 cfa6200c
[  290.124725] 3f80: cf373fbc cf373f90 c03095f8 c0308cd0 00000000 cf835cb0 cfa62004 c030959c
[  290.124755] 3fa0: 00000013 00000000 00000000 00000000 cf373ff4 cf373fc0 c0055a30 c03095a8
[  290.124786] 3fc0: cf835cb0 00000000 cfa62004 00000000 cf373fd0 cf373fd0 00000000 cf835cb0
[  290.124847] 3fe0: c00559a0 c004018c 00000000 cf373ff8 c004018c c00559ac f52322cc f291067e
[  290.124877] Backtrace:
[  290.124938] [<c030ba20>] (omap_hsmmc_request+0x0/0x4d8) from [<c02fbff8>] (mmc_start_request+0x110/0x20c)
[  290.125000] [<c02fbee8>] (mmc_start_request+0x0/0x20c) from [<c02fce8c>] (mmc_start_req+0x110/0x160)
[  290.125030]  r8:cfa6202c r7:cf373f14 r6:00000000 r5:cfa6211c r4:cf3b6000
[  290.125091] [<c02fcd7c>] (mmc_start_req+0x0/0x160) from [<c03089fc>] (mmc_blk_issue_rw_rq+0x70/0x338)
[  290.125122]  r8:cfa62000 r7:cfa62400 r6:cf372000 r5:cfa62004 r4:cfa62400
[  290.125183] r3:00000000
[  290.125213] [<c030898c>] (mmc_blk_issue_rw_rq+0x0/0x338) from [<c0308e5c>] (mmc_blk_issue_rq+0x198/0x460)
[  290.125274] [<c0308cc4>] (mmc_blk_issue_rq+0x0/0x460) from [<c03095f8>] (mmc_queue_thread+0x5c/0xf0)
[  290.125335] [<c030959c>] (mmc_queue_thread+0x0/0xf0) from [<c0055a30>] (kthread+0x90/0x94)
[  290.125396] [<c00559a0>] (kthread+0x0/0x94) from [<c004018c>] (do_exit+0x0/0x67c)
[  290.125427]  r6:c004018c r5:c00559a0 r4:cf835cb0
[  290.125457] Code: e5933008 e1833801 e5823104 e5943034 (e593212c)
[  290.125488] ---[ end trace 82f48b7910402960 ]---
[  290.329986] PM: early resume of devices complete after 207.732 msecs
[  290.465484] CPSW phy found : id is : 0x4dd074
[  290.467193] PHY 0:01 not found
[  290.469512] PM: resume of devices complete after 139.156 msecs
[  290.921264] Successfully transitioned all domains to low power state
[  290.928375] PM: Finishing wakeup.
[  290.932250] Restarting tasks ... 
[  290.960052] done.
[  293.461608] PHY: 0:00 - Link is Up - 100/Full

Whenever this happens, the file copy doesn't continue after resume.

A snapshot of the blocked tasks gives the following info

root at arago-armv7:/media/mmcblk0p3# echo w > /proc/sysrq-trigger
[  364.364807] SysRq : Show Blocked State
[  364.368743]   task                PC stack   pid father
[  364.374206] kworker/u:0     D c0411548     0     5      2 0x00000000
[  364.380859] Backtrace:
[  364.383422] [<c0411388>] (__schedule+0x0/0x394) from [<c0411810>] (schedule+0x50/0x68)
[  364.391693] [<c04117c0>] (schedule+0x0/0x68) from [<c02fc604>] (__mmc_claim_host+0x138/0x144)
[  364.400604] [<c02fc4cc>] (__mmc_claim_host+0x0/0x144) from [<c030184c>] (mmc_sd_detect+0x28/0x80)
[  364.409881] [<c0301824>] (mmc_sd_detect+0x0/0x80) from [<c02fdff0>] (mmc_rescan+0x120/0x250)
[  364.418670]  r5:cf8d2e00 r4:cf3b6000
[  364.422424] [<c02fded0>] (mmc_rescan+0x0/0x250) from [<c004fba0>] (process_one_work+0x124/0x384)
[  364.431579]  r6:cf834000 r5:cf8d2e00 r4:cf817f00 r3:60000013
[  364.437500] [<c004fa7c>] (process_one_work+0x0/0x384) from [<c0051a18>] (worker_thread+0x15c/0x330)
[  364.446960] [<c00518bc>] (worker_thread+0x0/0x330) from [<c0055a30>] (kthread+0x90/0x94)
[  364.455413] [<c00559a0>] (kthread+0x0/0x94) from [<c004018c>] (do_exit+0x0/0x67c)
[  364.463226]  r6:c004018c r5:c00559a0 r4:cf81dee0
[  364.468078] kjournald       D c0411548     0  1747      2 0x00000000
[  364.474700] Backtrace:
[  364.477264] [<c0411388>] (__schedule+0x0/0x394) from [<c0411810>] (schedule+0x50/0x68)
[  364.485534] [<c04117c0>] (schedule+0x0/0x68) from [<c0411890>] (io_schedule+0x68/0x90)
[  364.493804] [<c0411828>] (io_schedule+0x0/0x90) from [<c01aabc4>] (get_request_wait+0xcc/0x170)
[  364.502868]  r6:cf3a8ae8 r5:cfb53d9c r4:cfb53da8 r3:cede3c00
[  364.508789] [<c01aaaf8>] (get_request_wait+0x0/0x170) from [<c01abde8>] (blk_queue_bio+0x4c/0x298)
[  364.518157] [<c01abd9c>] (blk_queue_bio+0x0/0x298) from [<c01a9c64>] (generic_make_request+0xb4/0xd4)
[  364.527770] [<c01a9bb0>] (generic_make_request+0x0/0xd4) from [<c01a9d08>] (submit_bio+0x84/0xf8)
[  364.537017]  r6:00000211 r5:00000008 r4:c3538240
[  364.541870] [<c01a9c84>] (submit_bio+0x0/0xf8) from [<c00cd198>] (submit_bh+0x13c/0x16c)
[  364.550323] [<c00cd05c>] (submit_bh+0x0/0x16c) from [<c011ecb0>] (journal_do_submit_data.clone.23+0x3c/0x48)
[  364.560577]  r7:00000200 r6:c00d0c68 r5:cee2ee80 r4:000001a0
[  364.566497] [<c011ec74>] (journal_do_submit_data.clone.23+0x0/0x48) from [<c011fab0>] (journal_commit_transaction+0xdf4/0xf1c)
[  364.578369]  r7:c600cec0 r6:00000200 r5:c600ce88 r4:cec053c0
[  364.584289] [<c011ecbc>] (journal_commit_transaction+0x0/0xf1c) from [<c0121e54>] (kjournald+0xc4/0x1d8)
[  364.594207] [<c0121d90>] (kjournald+0x0/0x1d8) from [<c0055a30>] (kthread+0x90/0x94)
[  364.602264] [<c00559a0>] (kthread+0x0/0x94) from [<c004018c>] (do_exit+0x0/0x67c)
[  364.610076]  r6:c004018c r5:c00559a0 r4:cedfbcd8
[  364.614929] cp              D c0411548     0  1915   1877 0x00000000
[  364.621551] Backtrace:
[  364.624114] [<c0411388>] (__schedule+0x0/0x394) from [<c0411810>] (schedule+0x50/0x68)
[  364.632354] [<c04117c0>] (schedule+0x0/0x68) from [<c0411890>] (io_schedule+0x68/0x90)
[  364.640625] [<c0411828>] (io_schedule+0x0/0x90) from [<c01aabc4>] (get_request_wait+0xcc/0x170)
[  364.649719]  r6:cf3a8ae8 r5:c17aba84 r4:c17aba90 r3:cede2600
[  364.655639] [<c01aaaf8>] (get_request_wait+0x0/0x170) from [<c01abde8>] (blk_queue_bio+0x4c/0x298)
[  364.664978] [<c01abd9c>] (blk_queue_bio+0x0/0x298) from [<c01a9c64>] (generic_make_request+0xb4/0xd4)
[  364.674591] [<c01a9bb0>] (generic_make_request+0x0/0xd4) from [<c01a9d08>] (submit_bio+0x84/0xf8)
[  364.683868]  r6:00000001 r5:00000080 r4:cfac3a40
[  364.688690] [<c01a9c84>] (submit_bio+0x0/0xf8) from [<c00d9870>] (do_mpage_readpage+0x468/0x774)
[  364.697845] [<c00d9408>] (do_mpage_readpage+0x0/0x774) from [<c00d9cec>] (mpage_readpages+0xec/0x148)
[  364.707489] [<c00d9c00>] (mpage_readpages+0x0/0x148) from [<c01060a0>] (ext3_readpages+0x24/0x28)
[  364.716766] [<c010607c>] (ext3_readpages+0x0/0x28) from [<c0083448>] (__do_page_cache_readahead+0x198/0x27c)
[  364.727020] [<c00832b0>] (__do_page_cache_readahead+0x0/0x27c) from [<c00837f8>] (ra_submit+0x28/0x30)
[  364.736724] [<c00837d0>] (ra_submit+0x0/0x30) from [<c0083908>] (ondemand_readahead+0x108/0x1e8)
[  364.745910] [<c0083800>] (ondemand_readahead+0x0/0x1e8) from [<c0083ae0>] (page_cache_sync_readahead+0x4c/0x68)
[  364.756439] [<c0083a94>] (page_cache_sync_readahead+0x0/0x68) from [<c007be18>] (generic_file_aio_read+0x42c/0x770)
[  364.767333]  r5:cf576888 r4:00000000
[  364.771057] [<c007b9ec>] (generic_file_aio_read+0x0/0x770) from [<c00a7694>] (do_sync_read+0xac/0xec)
[  364.780700] [<c00a75e8>] (do_sync_read+0x0/0xec) from [<c00a7e20>] (vfs_read+0xb8/0x14c)
[  364.789123]  r8:00001000 r7:c17abf70 r6:bea1da58 r5:00001000 r4:ced3a740
[  364.796142] [<c00a7d68>] (vfs_read+0x0/0x14c) from [<c00a7ef8>] (sys_read+0x44/0x74)
[  364.804229]  r8:00001000 r7:bea1da58 r6:ced3a740 r5:00000000 r4:09adc000
[  364.811248] [<c00a7eb4>] (sys_read+0x0/0x74) from [<c0014200>] (ret_fast_syscall+0x0/0x30)
[  364.819854]  r8:c00143a8 r7:00000003 r6:bea1da58 r5:00000003 r4:00001000

Debugging a bit further I found that runtime resume of hsmmc does not take place in the
failed iterations. Since this doesn't happen in every iteration I would hazard a guess
that there's some sort of race condition between the MMC core resuming and the 
pm_runtime_get_sync() call.

Any pointers on what could be going wrong here and how to debug this further.

Regards,
Vaibhav