[PATCH v1] mmc: sdhci: Prevent stale command and data interrupt handling

Mon Oct 7 11:58:02 PDT 2024

On 7/10/24 20:00, Michal Wilczynski wrote:
> 
> 
> On 10/7/24 18:09, Adrian Hunter wrote:
>> On 3/10/24 19:10, Michal Wilczynski wrote:
>>> While working with the T-Head 1520 LicheePi4A SoC, certain conditions
>>> arose that allowed me to reproduce a race issue in the sdhci code.
>>>
>>> To reproduce the bug, you need to enable the sdio1 controller in the
>>> device tree file
>>> `arch/riscv/boot/dts/thead/th1520-lichee-module-4a.dtsi` as follows:
>>>
>>> &sdio1 {
>>> 	bus-width = <4>;
>>> 	max-frequency = <100000000>;
>>> 	no-sd;
>>> 	no-mmc;
>>> 	broken-cd;
>>> 	cap-sd-highspeed;
>>> 	post-power-on-delay-ms = <50>;
>>> 	status = "okay";
>>> 	wakeup-source;
>>> 	keep-power-in-suspend;
>>> };
>>>
>>> When resetting the SoC using the reset button, the following messages
>>> appear in the dmesg log:
>>>
>>> [    8.164898] mmc2: Got command interrupt 0x00000001 even though no
>>> command operation was in progress.
>>> [    8.174054] mmc2: sdhci: ============ SDHCI REGISTER DUMP ===========
>>> [    8.180503] mmc2: sdhci: Sys addr:  0x00000000 | Version:  0x00000005
>>> [    8.186950] mmc2: sdhci: Blk size:  0x00000000 | Blk cnt:  0x00000000
>>> [    8.193395] mmc2: sdhci: Argument:  0x00000000 | Trn mode: 0x00000000
>>> [    8.199841] mmc2: sdhci: Present:   0x03da0000 | Host ctl: 0x00000000
>>> [    8.206287] mmc2: sdhci: Power:     0x0000000f | Blk gap:  0x00000000
>>> [    8.212733] mmc2: sdhci: Wake-up:   0x00000000 | Clock:    0x0000decf
>>> [    8.219178] mmc2: sdhci: Timeout:   0x00000000 | Int stat: 0x00000000
>>> [    8.225622] mmc2: sdhci: Int enab:  0x00ff1003 | Sig enab: 0x00ff1003
>>> [    8.232068] mmc2: sdhci: ACmd stat: 0x00000000 | Slot int: 0x00000000
>>> [    8.238513] mmc2: sdhci: Caps:      0x3f69c881 | Caps_1:   0x08008177
>>> [    8.244959] mmc2: sdhci: Cmd:       0x00000502 | Max curr: 0x00191919
>>> [    8.254115] mmc2: sdhci: Resp[0]:   0x00001009 | Resp[1]:  0x00000000
>>> [    8.260561] mmc2: sdhci: Resp[2]:   0x00000000 | Resp[3]:  0x00000000
>>> [    8.267005] mmc2: sdhci: Host ctl2: 0x00001000
>>> [    8.271453] mmc2: sdhci: ADMA Err:  0x00000000 | ADMA Ptr:
>>> 0x0000000000000000
>>> [    8.278594] mmc2: sdhci: ============================================
>>>
>>> I also enabled some traces to better understand the problem:
>>>
>>>      kworker/3:1-62      [003] .....     8.163538: mmc_request_start:
>>> mmc2: start struct mmc_request[000000000d30cc0c]: cmd_opcode=5
>>> cmd_arg=0x0 cmd_flags=0x2e1 cmd_retries=0 stop_opcode=0 stop_arg=0x0
>>> stop_flags=0x0 stop_retries=0 sbc_opcode=0 sbc_arg=0x0 sbc_flags=0x0
>>> sbc_retires=0 blocks=0 block_size=0 blk_addr=0 data_flags=0x0 tag=0
>>> can_retune=0 doing_retune=0 retune_now=0 need_retune=0 hold_retune=1
>>> retune_period=0
>>>           <idle>-0       [000] d.h2.     8.164816: sdhci_cmd_irq:
>>> hw_name=ffe70a0000.mmc quirks=0x2008008 quirks2=0x8 intmask=0x10000
>>> intmask_p=0x18000
>>>      irq/24-mmc2-96      [000] .....     8.164840: sdhci_thread_irq:
>>> msg=
>>>      irq/24-mmc2-96      [000] d.h2.     8.164896: sdhci_cmd_irq:
>>> hw_name=ffe70a0000.mmc quirks=0x2008008 quirks2=0x8 intmask=0x1
>>> intmask_p=0x1
>>>      irq/24-mmc2-96      [000] .....     8.285142: mmc_request_done:
>>> mmc2: end struct mmc_request[000000000d30cc0c]: cmd_opcode=5
>>> cmd_err=-110 cmd_resp=0x0 0x0 0x0 0x0 cmd_retries=0 stop_opcode=0
>>> stop_err=0 stop_resp=0x0 0x0 0x0 0x0 stop_retries=0 sbc_opcode=0
>>> sbc_err=0 sbc_resp=0x0 0x0 0x0 0x0 sbc_retries=0 bytes_xfered=0
>>> data_err=0 tag=0 can_retune=0 doing_retune=0 retune_now=0 need_retune=0
>>> hold_retune=1 retune_period=0
>>>
>>> Here's what happens: the __mmc_start_request function is called with
>>> opcode 5. Since the power to the Wi-Fi card, which resides on this SDIO
>>> bus, is initially off after the reset, an interrupt SDHCI_INT_TIMEOUT is
>>> triggered. Immediately after that, a second interrupt SDHCI_INT_RESPONSE
>>> is triggered. Depending on the exact timing, these conditions can
>>> trigger the following race problem:
>>>
>>> 1) The sdhci_cmd_irq top half handles the command as an error. It sets
>>>    host->cmd to NULL and host->pending_reset to true.
>>> 2) The sdhci_thread_irq bottom half is scheduled next and executes faster
>>>    than the second interrupt handler for SDHCI_INT_RESPONSE. It clears
>>>    host->pending_reset before the SDHCI_INT_RESPONSE handler runs.
>>> 3) The pending interrupt SDHCI_INT_RESPONSE handler gets called, triggering
>>>    a code path that prints: "mmc2: Got command interrupt 0x00000001 even
>>>    though no command operation was in progress."
>>>
>>> To solve this issue, we need to clear pending interrupts when resetting
>>> host->pending_reset. This ensures that after sdhci_threaded_irq restores
>>> interrupts, there are no pending stale interrupts.
>>>
>>> Signed-off-by: Michal Wilczynski <m.wilczynski at samsung.com>
>>> ---
>>>  drivers/mmc/host/sdhci.c | 4 ++++
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/drivers/mmc/host/sdhci.c b/drivers/mmc/host/sdhci.c
>>> index 4b91c9e96635..b91a6076c332 100644
>>> --- a/drivers/mmc/host/sdhci.c
>>> +++ b/drivers/mmc/host/sdhci.c
>>> @@ -3098,6 +3098,10 @@ static bool sdhci_request_done(struct sdhci_host *host)
>>>  		sdhci_reset_for(host, REQUEST_ERROR);
>>>  
>>>  		host->pending_reset = false;
>>> +
>>> +		/* Clear any pending interrupts after reset */
>>> +		sdhci_writel(host, SDHCI_INT_CMD_MASK | SDHCI_INT_DATA_MASK,
>>> +			     SDHCI_INT_STATUS);
>>
>> According to SDHCI spec, "Software Reset For CMD Line" clears
>> "Command Complete" in "Normal Interrupt Status register", so the
>> interrupt status should not need to be cleared again.
>>
>> Which SDHCI driver is it?
>>
> 
> Thank you for your review.
> 
> The driver in use is located at drivers/mmc/host/sdhci-of-dwcmshc.c,
> and the .compatible string is thead,th1520-dwcmshc.
> 
> Based on the specifications, it appears that the specific hardware used in
> the LicheePi4A SoC may not be fully compliant with the SDHCI standard.
> Therefore, the appropriate solution might be to add additional clearing
> code to the th1520_sdhci_reset() function.
> 

Seems reasonable