[PATCH v5 1/4] mtd: nand: increase ready wait timeout and report timeouts

Tue Sep 15 02:38:43 PDT 2015

On 10 September 2015 at 00:49, Brian Norris <computersforpeace at gmail.com> wrote:
> + Niklas
>
> On Tue, Sep 08, 2015 at 10:10:50AM +0100, Alex Smith wrote:
>> If nand_wait_ready() times out, this is silently ignored, and its
>> caller will then proceed to read from/write to the chip before it is
>> ready. This can potentially result in corruption with no indication as
>> to why.
>>
>> While a 20ms timeout seems like it should be plenty enough, certain
>> behaviour can cause it to timeout much earlier than expected. The
>> situation which prompted this change was that CPU 0, which is
>> responsible for updating jiffies, was holding interrupts disabled
>> for a fairly long time while writing to the console during a printk,
>> causing several jiffies updates to be delayed. If CPU 1 happens to
>> enter the timeout loop in nand_wait_ready() just before CPU 0 re-
>> enables interrupts and updates jiffies, CPU 1 will immediately time
>> out when the delayed jiffies updates are made. The result of this is
>> that nand_wait_ready() actually waits less time than the NAND chip
>> would normally take to be ready, and then read_page() proceeds to
>> read out bad data from the chip.
>>
>> The situation described above may seem unlikely, but in fact it can be
>> reproduced almost every boot on the MIPS Creator Ci20.
>>
>> Debugging this was made more difficult by the misleading comment above
>> nand_wait_ready() stating "The timeout is caught later" - no timeout
>> was ever reported, leading me away from the real source of the problem.
>>
>> Therefore, this patch increases the timeout to 200ms. This should be
>> enough to cover cases where jiffies updates get delayed. Additionally,
>> add a pr_warn() when a timeout does occur so that it is easier to
>> pinpoint any problems in future caused by the chip not becoming ready.
>
> Did you examine other solutions? I've seen patches for hrtimer support
> previously:
>
> http://patchwork.ozlabs.org/patch/160333/
> http://patchwork.ozlabs.org/patch/431066/
>
> A few things have been cleaned up since then, so some of the initial
> objections to the hrtimer patch don't make sense anymore, I believe.
>
> Anyway, I think just increasing the timeout looks OK to me (as long as
> we never have a 200ms jiffies jump... can this happen??), so hrtimer may
> be over-engineering. I just want to make sure both options have been
> considered before officially choosing one over the other.
>
> Brian

Hi Brian, Niklas,

I'm no expert in the matter but I feel like using a hrtimer here would
indeed be over-engineering and could potentially add overhead to the
"normal" case where the chip becomes ready well before the timeout
expires? Just increasing the timeout seems like a simpler solution
that solves the problem. I think that a jiffies jump of a few hundred
milliseconds is extremely unlikely and would indicate something else
that needs to be fixed (i.e. in the SMP case I had it would mean that
the CPU which is supposed to update jiffies has interrupts disabled
for hundreds of milliseconds).

Niklas: If I update the patch based on your suggestions would you be
happy to go with that rather than your hrtimer patch?

Thanks,
Alex

>
>> Signed-off-by: Alex Smith <alex.smith at imgtec.com>
>> Reviewed-by: Ezequiel Garcia <ezequiel at vanguardiasur.com.ar>
>> Cc: Zubair Lutfullah Kakakhel <Zubair.Kakakhel at imgtec.com>
>> Cc: David Woodhouse <dwmw2 at infradead.org>
>> Cc: Brian Norris <computersforpeace at gmail.com>
>> Cc: linux-mtd at lists.infradead.org
>> Cc: linux-kernel at vger.kernel.org
>> ---
>> v4 -> v5:
>>  - Remove spurious change.
>>  - Add Ezequiel's Reviewed-by.
>>
>> v3 -> v4:
>>  - New patch to fix issue encountered in external Ci20 3.18 kernel
>>    branch which also applies upstream.
>> ---
>>  drivers/mtd/nand/nand_base.c | 14 +++++++++++---
>>  1 file changed, 11 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/mtd/nand/nand_base.c b/drivers/mtd/nand/nand_base.c
>> index ceb68ca8277a..07b831b94e5c 100644
>> --- a/drivers/mtd/nand/nand_base.c
>> +++ b/drivers/mtd/nand/nand_base.c
>> @@ -543,11 +543,16 @@ static void panic_nand_wait_ready(struct mtd_info *mtd, unsigned long timeo)
>>       }
>>  }
>>
>> -/* Wait for the ready pin, after a command. The timeout is caught later. */
>> +/**
>> + * nand_wait_ready - [GENERIC] Wait for the ready pin after commands.
>> + * @mtd: MTD device structure
>> + *
>> + * Wait for the ready pin after a command, and warn if a timeout occurs.
>> + */
>>  void nand_wait_ready(struct mtd_info *mtd)
>>  {
>>       struct nand_chip *chip = mtd->priv;
>> -     unsigned long timeo = jiffies + msecs_to_jiffies(20);
>> +     unsigned long timeo = jiffies + msecs_to_jiffies(200);
>>
>>       /* 400ms timeout */
>>       if (in_interrupt() || oops_in_progress)
>> @@ -557,9 +562,12 @@ void nand_wait_ready(struct mtd_info *mtd)
>>       /* Wait until command is processed or timeout occurs */
>>       do {
>>               if (chip->dev_ready(mtd))
>> -                     break;
>> +                     goto out;
>>               touch_softlockup_watchdog();
>>       } while (time_before(jiffies, timeo));
>> +
>> +     pr_warn("timeout while waiting for chip to become ready\n");
>> +out:
>>       led_trigger_event(nand_led_trigger, LED_OFF);
>>  }
>>  EXPORT_SYMBOL_GPL(nand_wait_ready);
>> --
>> 2.5.0
>>