[PATCH] mtd: cfi_cmdset_0002: Change erase functions to retry for error

Wed May 9 01:18:33 PDT 2018

On Tue, 2018-05-08 at 19:11 +0200, Boris Brezillon wrote:
> 
> +Joakim
> 
> On Tue, 8 May 2018 16:52:55 +0000
> IKEGAMI Tokunori <ikegami at allied-telesis.co.jp> wrote:
> 
> > From: Tokunori Ikegami <ikegami at allied-telesis.co.jp>
> > 
> > The word write functions retry on error,
> > but the erase functions do not.
> > Change the erase functions to retry in the same way.
> > 
> > This is needed to recover from a flash erase error that occurred only once.
> > It was caused by the error case of chip_good() in do_erase_oneblock().
> > It was seen on the MACRONIX flash device MX29GL512FHT2I-11G,
> > but the error cannot be reproduced at this moment.
> 
> Hm, that's a problem. How can you be sure the retry approach fixes the
> issue if you can't reproduce it?
> 
> > The flash controller is parallel Flash interface integrated on BCM53003.
> > 
> > The change may also resolve other erase errors.
> > For example, the Erase suspend issue was seen on Cypress AMD NOR flash:
> >   S29GL01GS / S29GL512S / S29GL256S / S29GL128S
> 
> Is the Erase suspend issue related to this problem?
> 

My Erase suspend issue is not related to just retrying the erase.
We have seen several times that too many Erase suspends for one block will
cause "5.5.2.6 DQ5: Exceeded Timing Limits".
This condition is not checked for, nor handled in any way.
Once this condition has occurred, a simple retry won't fix it. One must
first issue a flash reset:
  map_write( map, CMD(0xF0), chip->in_progress_block_addr); /* Reset */
which will clear this error condition and return the flash to read array mode.
Now you can restart the erase and it will probably finish
this time (at least in our testing).

Testing for DQ5: Exceeded Timing Limits is not straightforward in the current
0002 driver since it requires more than just testing for toggling bits. I have
not made any progress on this due to other more pressing work.
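
For illustration only, the usual AMD-style toggle-bit check with a DQ5 test
could be sketched roughly as below. This is not driver code: the names
poll_erase_status(), status_read_fn, struct fake_chip and fake_read() are
hypothetical, the read callback merely stands in for map_read(), and the bit
positions assume a single chip whose status appears in the low byte.

```c
#include <stddef.h>
#include <stdint.h>

#define DQ6_TOGGLE  0x40u  /* flips on every read while the chip is busy */
#define DQ5_TIMEOUT 0x20u  /* set when internal timing limits were exceeded */

/* Hypothetical read callback standing in for map_read() on one chip. */
typedef uint16_t (*status_read_fn)(void *ctx);

enum erase_poll { POLL_DONE, POLL_BUSY, POLL_DQ5_ERROR };

/* One polling step: two reads to detect toggling, then a DQ5 check. */
static enum erase_poll poll_erase_status(status_read_fn rd, void *ctx)
{
    uint16_t a = rd(ctx);
    uint16_t b = rd(ctx);

    if (!((a ^ b) & DQ6_TOGGLE))
        return POLL_DONE;        /* DQ6 no longer toggles */
    if (!(b & DQ5_TIMEOUT))
        return POLL_BUSY;        /* still toggling, no timeout flagged */

    /* DQ5 is set: read twice more, the operation may have just finished. */
    a = rd(ctx);
    b = rd(ctx);
    if (!((a ^ b) & DQ6_TOGGLE))
        return POLL_DONE;
    return POLL_DQ5_ERROR;       /* caller must issue a reset (0xF0) first */
}

/* Replays a fixed list of status words; a stand-in for real hardware. */
struct fake_chip { const uint16_t *vals; size_t n, pos; };

static uint16_t fake_read(void *ctx)
{
    struct fake_chip *c = ctx;
    uint16_t v = c->vals[c->pos];

    if (c->pos + 1 < c->n)
        c->pos++;                /* keep repeating the last value at the end */
    return v;
}
```

On POLL_DQ5_ERROR the caller would have to send the reset before restarting
the erase, which is the part the driver does not do today.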

Note: In our testing we needed over 300000 suspend/resume cycles before hitting
the DQ5 limit.
Here is my test hack so far:

diff --git a/drivers/mtd/chips/cfi_cmdset_0002.c b/drivers/mtd/chips/cfi_cmdset_0002.c
index 23a893ab4264..38b52e2c6fb6 100644
--- a/drivers/mtd/chips/cfi_cmdset_0002.c
+++ b/drivers/mtd/chips/cfi_cmdset_0002.c
@@ -829,6 +829,7 @@ static int get_chip(struct map_info *map, struct flchip *chip, unsigned long adr
                chip->oldstate = FL_ERASING;
                chip->state = FL_ERASE_SUSPENDING;
                chip->erase_suspended = 1;
+               chip->in_progress_suspends++;
                for (;;) {
                        if (chip_ready(map, adr))
                                break;
@@ -839,8 +840,33 @@ static int get_chip(struct map_info *map, struct flchip *chip, unsigned long adr
                                 * there was an error (so leave the erase
                                 * routine to recover from it) or we trying to
                                 * use the erase-in-progress sector. */
+                               map_word status = map_read(map, adr);
+
+                               map_write( map, CMD(0xF0), chip->in_progress_block_addr); /* Reset */
+#if 1
+                               /* Restart erase */
+                               cfi_send_gen_cmd(0xAA, cfi->addr_unlock1, chip->start, map, cfi,
+                                                cfi->device_type, NULL);
+                               cfi_send_gen_cmd(0x55, cfi->addr_unlock2, chip->start, map, cfi,
+                                                cfi->device_type, NULL);
+                               cfi_send_gen_cmd(0x80, cfi->addr_unlock1, chip->start, map, cfi,
+                                                cfi->device_type, NULL);
+                               cfi_send_gen_cmd(0xAA, cfi->addr_unlock1, chip->start, map, cfi,
+                                                cfi->device_type, NULL);
+                               cfi_send_gen_cmd(0x55, cfi->addr_unlock2, chip->start, map, cfi,
+                                                cfi->device_type, NULL);
+                               map_write(map, cfi->sector_erase_cmd, chip->in_progress_block_addr);
+#endif
+//                             chip->oldstate = FL_READY;
+//                             chip->state = FL_READY;
+
                                put_chip(map, chip, adr);
-                               printk(KERN_ERR "MTD %s(): chip not ready after erase suspend\n", __func__);
+                               printk(KERN_ERR "MTD %s(): chip not ready after erase suspend, block_addr:0x%lx, "
+                                      "block_mask:0x%lx, adr:0x%lx, suspends:%lu, status:0x%lx\n", __func__,
+                                      chip->in_progress_block_addr, chip->in_progress_block_addr, adr,
+                                      chip->in_progress_suspends,
+                                      status.x[0]);
+
                                return -EIO;
                        }
 
@@ -2270,7 +2296,7 @@ static int __xipram do_erase_chip(struct map_info *map, struct flchip *chip)
        chip->erase_suspended = 0;
        chip->in_progress_block_addr = adr;
        chip->in_progress_block_mask = ~(map->size - 1);
-
+       chip->in_progress_suspends = 0;
        INVALIDATE_CACHE_UDELAY(map, chip,
                                adr, map->size,
                                chip->erase_time*500);
diff --git a/include/linux/mtd/flashchip.h b/include/linux/mtd/flashchip.h
index 3529683f691e..8e7ba8244ced 100644
--- a/include/linux/mtd/flashchip.h
+++ b/include/linux/mtd/flashchip.h
@@ -86,6 +86,7 @@ struct flchip {
        unsigned int erase_suspended:1;
        unsigned long in_progress_block_addr;
        unsigned long in_progress_block_mask;
+       unsigned long in_progress_suspends;
 
        struct mutex mutex;
        wait_queue_head_t wq; /* Wait on here when we're waiting for the chip


Which printed (when the error occurred):
MTD get_chip(): chip not ready after erase suspend, block_addr:0x3480000, block_mask:0x3480000, adr:0x3a7da84,
suspends:319941, status:0x28
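
To read that status value, a small standalone decoder along these lines can
help. It is only a sketch: dq_set() and print_status() are hypothetical
helpers, and the bit names follow the usual AMD-style status bit layout,
assuming a single chip with its status in the low byte.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* True if status bit DQ<bit> is set in a raw status word. */
static bool dq_set(unsigned long status, unsigned int bit)
{
    return (status >> bit) & 1UL;
}

/* Print the names of the status bits that are set (assumed AMD-style layout). */
static void print_status(unsigned long status)
{
    static const struct { unsigned int bit; const char *name; } bits[] = {
        { 7, "DQ7 Data# polling" },
        { 6, "DQ6 toggle bit I" },
        { 5, "DQ5 exceeded timing limits" },
        { 3, "DQ3 sector erase timer" },
        { 2, "DQ2 toggle bit II" },
        { 1, "DQ1 write buffer abort" },
    };
    size_t i;

    printf("status 0x%lx:\n", status);
    for (i = 0; i < sizeof(bits) / sizeof(bits[0]); i++)
        if (dq_set(status, bits[i].bit))
            printf("  %s\n", bits[i].name);
}
```

Calling print_status(0x28) lists DQ5 and DQ3, so DQ5 is set in the value
logged above. A single read says nothing about whether DQ6 is toggling, since
that needs two consecutive reads.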


 Jocke