[REGRESSION 4.10] dw_mmc: failures on Rockchip rk3288 veyron boards

Doug Anderson dianders at google.com
Mon Apr 10 16:35:55 PDT 2017


Hi,

On Wed, Mar 29, 2017 at 6:17 PM, Brian Norris <briannorris at chromium.org> wrote:
> Hi all,
>
> I haven't managed to get as far as a bugfix for this, but I've bisected
> some issues seen on v4.10+ with a Chromebook of the Veyron family (Jaq,
> in particular). v4.9 works fine.

OK, I finally got everything up and running to test this too...


> Issue #1 - eMMC complains periodically:
>
> [    4.358135] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0)
> [    4.461466] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0)
> [    5.291450] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0)
> [    5.381471] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0)
> [   11.243337] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0)
> [   17.371628] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0)

I don't believe that this is an error, actually.  Really we just need
to quiet this message or (since I've always found it useful) move it
to a different place.  I believe that with runtime PM we're
effectively turning the clock off whenever the MMC device isn't in
use.  On dw_mmc we have a "helpful" printout every time the clock is
changed, and that's what you're seeing here.

You can see with:

while true; do
  dd if=/dev/mmcblk2 of=/dev/null bs=512 count=1 iflag=direct;
  sleep .1;
done

...that you'll get a printout every 100ms.


Ah, looks like this is in:

ce69e2fea093 mmc: dw_mmc: silent verbose log when calling from PM context

...as pointed out by Shawn Lin.  So I think in 4.10 we can just ignore
those messages and they're good on 4.11.


> and if I stress it out at all (e.g., dd if=/dev/mmcblk2 bs=1M >
> /dev/null), it will eventually croak:
>
> [  359.916315] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0)
> [  360.071378] dwmmc_rockchip ff0f0000.dwmmc: Successfully tuned phase to 153
> [  360.211351] mmcblk2: error -110 transferring data, sector 8644608, nr 2048, cmd response 0x900, card status 0x0
> [  360.221936] mmcblk2: retrying using single block read
> [  363.491362] mmcblk2: error -110 transferring data, sector 8646656, nr 2048, cmd response 0x900, card status 0x0
> [  363.531569] mmc_host mmc2: Bus speed (slot 0) = 400000Hz (slot req 400000Hz, actual 400000HZ div = 0)
> [  363.596326] mmc_host mmc2: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0)
> [  363.612712] dwmmc_rockchip ff0f0000.dwmmc: Successfully tuned phase to 152
> [  363.751351] mmcblk2: error -110 transferring data, sector 8646656, nr 2048, cmd response 0x900, card status 0x0
> [  363.761938] mmcblk2: retrying using single block read
> [  366.611356] INFO: task mmcqd/2boot1:92 blocked for more than 120 seconds.
> [  366.618134]       Not tainted 4.10.0 #284
> [  366.622146] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [  366.629960] mmcqd/2boot1    D    0    92      2 0x00000000
> [  366.635454] [<c07dc21c>] (__schedule) from [<c07dc4e0>] (schedule+0x90/0xa0)
> [  366.642497] [<c07dc4e0>] (schedule) from [<c066e8b4>] (__mmc_claim_host+0xd4/0x19c)
> [  366.650142] [<c066e8b4>] (__mmc_claim_host) from [<c066e9ac>] (mmc_get_card+0x30/0x34)
> [  366.658056] [<c066e9ac>] (mmc_get_card) from [<c067fc8c>] (mmc_blk_issue_rq+0x64/0x48c)
> [  366.666052] [<c067fc8c>] (mmc_blk_issue_rq) from [<c0680230>] (mmc_queue_thread+0x114/0x1b4)
> [  366.674484] [<c0680230>] (mmc_queue_thread) from [<c023d1b0>] (kthread+0x128/0x144)
> [  366.682134] [<c023d1b0>] (kthread) from [<c02076e8>] (ret_from_fork+0x14/0x2c)

I'm not convinced this is a regression.

I remember Heiko saying that he's heard reports that on some boards
eMMC doesn't work with high speed, and I'be believe that's what you're
seeing here.  It would be interesting to try to debug this.  I can't
personally reproduce, though.

I think veyron_minnie already has UHS turned off for eMMC upstream.  I
guess we could do it for other veyron boards too until someone can
debug?


> Issue #2 - Wifi (via SDIO, mmc1) is completely dead:
>
> [    1.444125] mmc_host mmc1: card is non-removable.
> [    1.471368] mmc_host mmc1: Bus speed (slot 0) = 400000Hz (slot req 400000Hz, actual 400000HZ div = 0)
> [    1.619553] mmc_host mmc1: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0)
> [    1.881699] mmc1: new ultra high speed SDR104 SDIO card at address 0001
> [   25.681172] mmc_host mmc1: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0)
> [   25.691666] mwifiex: rx work enabled, cpus 4
> [   26.827000] mwifiex_sdio mmc1:0001:1: info: FW download over, size 800344 bytes
> [   27.561352] mwifiex_sdio mmc1:0001:1: WLAN FW is active
> [   33.585165] mmc_host mmc1: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0)
> [   37.651344] mwifiex_sdio mmc1:0001:1: mwifiex_cmd_timeout_func: Timeout cmd id = 0xa9, act = 0x0
> [   37.660122] mwifiex_sdio mmc1:0001:1: num_data_h2c_failure = 0
> [   37.665951] mwifiex_sdio mmc1:0001:1: num_cmd_h2c_failure = 0
> [   37.671688] mwifiex_sdio mmc1:0001:1: is_cmd_timedout = 1
> [   37.677076] mwifiex_sdio mmc1:0001:1: num_tx_timeout = 0
> [   37.682380] mwifiex_sdio mmc1:0001:1: last_cmd_index = 1
> [   37.687681] mwifiex_sdio mmc1:0001:1: last_cmd_id: 00 00 a9 00 00 00 00 00 00 00
> [   37.695066] mwifiex_sdio mmc1:0001:1: last_cmd_act: 00 00 00 00 00 00 00 00 00 00
> [   37.702536] mwifiex_sdio mmc1:0001:1: last_cmd_resp_index = 0
> [   37.708269] mwifiex_sdio mmc1:0001:1: last_cmd_resp_id: 00 00 00 00 00 00 00 00 00 00
> [   37.716087] mwifiex_sdio mmc1:0001:1: last_event_index = 0
> [   37.721564] mwifiex_sdio mmc1:0001:1: last_event: 00 00 00 00 00 00 00 00 00 00
> [   37.728857] mwifiex_sdio mmc1:0001:1: data_sent=1 cmd_sent=0
> [   37.734508] mwifiex_sdio mmc1:0001:1: ps_mode=0 ps_state=0
> [   37.740016] mmc_host mmc1: Bus speed (slot 0) = 148500000Hz (slot req 150000000Hz, actual 148500000HZ div = 0)
> [   37.750268] mwifiex_sdio mmc1:0001:1: info: mwifiex_fw_dpc: unregister device

This doesn't surprise me at all.  What surprises me, though, is that
nobody else seems to be able to reproduce this.

On veyron, WiFi is connected via SDIO.  For good speed, it uses SDIO
Interrupts.  See this bit in the device tree:

  cap-sdio-irq;

SDIO interrupts (in 4-bit mode) specifically need the card clock to be
running all the time to work.  I can reproduce your regression (on
veryron-jerry, which also has Marvell WiFi) and I can also find that
the regression is "gone' if I take out the "cap-sdio-irq" in the
veyron device tree.  Ah, interestingly enough, turning off SDIO
interrupts has the side effect of sending enough (polling) traffic
that we never seem to runtime suspend, either.  ;-P

In general I'd question whether dw_mmc actually gets much power
benefit from Runtime PM in Linux.  The dw_mmc IP blocks already have a
feature in them to automatically stop and restart the card clock.  See
SDMMC_CLKEN_LOW_PWR.  Maybe you're getting the benefit of turning off
VMMC or VQMMC?  Is that really a lot of power?  Presumably those power
savings would be for eMMC or normal SD cards (not SDIO).

Maybe someone else on this thread knows how Runtime PM is supposed to
work in general for SDIO?  I notice that in sdio.c the
mmc_sdio_runtime_suspend() unconditionally calls mmc_power_off().
That seems odd since the main mmc_sdio_suspend() _doesn't_ call it if
mmc_card_keep_power().  Hrmmm...

OK, so I just tried this on veyron-minnie.  On minnie we have Broadcom
WiFi.  That actually works (!).  Presumably this is because
brcmf_sdiod_host_fixup() calls pm_runtime_forbid().  Commenting that
out breaks things.

OK, and I can make Marvell work by adding
"pm_runtime_forbid(func->card->host->parent);" to the end of
mwifiex_sdio_probe().

--

So where does that leave us?

A) Technically we can fix Marvell's driver to work like Broadcom's.
One could possibly assert that this is the wrong fix because
technically we could make Runtime PM work with SDIO with enough work.
We could theoretically move into 1-bit mode and there (I think) you
can get interrupts with the clock off.  ...or we could have a
dedicated SDIO Interrupt pin (for the embedded case), which is talked
about in the SDIO spec.

B) Technically we could hack this in the dw_mmc code to disable
Runtime PM if we see that an SDIO interrupt is used.  One advantage of
doing it here is that if we ever add support in dw_mmc for the
external SDIO interrupt we could allow Runtime PM in that case.  In
theory the dw_mmc IP block has some basic support for a dedicated SDIO
interrupt pin, but there's no code to support it.

C) Technically we could add this into the MMC core.

D) Technically we could remove Runtime PM support from dw_mmc for now
until someone can address all these issues (and ideally show a real
power savings).


I'd tend to vote for D, but I've been pretty absent from dw_mmc for a
long time, so probably my vote isn't worth that much...

Shawn: I think you actually enabled runtime PM.  Did you really see
power savings, or did it just seem like enabling Runtime PM would be a
neat thing to do?


-Doug



More information about the Linux-rockchip mailing list