[PATCH 02/10] mtd: nand: gpmi: Utilize hardware to detect bitflips in erased blocks

Fri Dec 8 02:35:50 PST 2017

On Fri, 8 Dec 2017 11:21:57 +0100
Sascha Hauer <s.hauer at pengutronix.de> wrote:

> On Wed, Dec 06, 2017 at 04:34:31PM +0100, Boris Brezillon wrote:
> > On Wed, 6 Dec 2017 16:28:04 +0100
> > Sascha Hauer <s.hauer at pengutronix.de> wrote:
> >   
> > > On Wed, Dec 06, 2017 at 10:27:13AM +0100, Boris Brezillon wrote:  
> > > > Hi Sascha,
> > > > 
> > > > On Wed,  6 Dec 2017 10:19:17 +0100
> > > > Sascha Hauer <s.hauer at pengutronix.de> wrote:
> > > >     
> > > > > The GPMI nand has a hardware feature to ignore bitflips in erased pages.
> > > > > Use this feature rather than the longish code we have now.
> > > > > Unfortunately the bitflips in erased pages are not corrected, so we have
> > > > > to memset the read data before passing it to the upper layers.    
> > > > 
> > > > There's a good reason we didn't use the HW bitflip detection in the
> > > > first place: we currently have no way to report the number of corrected
> > > > bitflips in an erased page, and that's a big problem, because then UBI
> > > > does not know when it should re-erase the block.    
> > > 
> > > Ah, ok.
> > >   
> > > > 
> > > > Maybe we missed something when the initial proposal was done and
> > > > there's actually a way to retrieve this information, but if that's not
> > > > the case, I'd prefer to keep the existing implementation even if it's
> > > > slower and more verbose.    
> > > 
> > > I'm not so much concerned about the verbosity of the code but rather
> > > about the magic it has inside. I have stared at this code for some time
> > > now and still only vaguely know what it does.
> > > 
> > > We could do a bit better: We can still detect the number of bitflips
> > > using nand_check_erased_ecc_chunk() without reading the oob data.
> > > That would not be 100% accurate since we do not take the oob data into
> > > account which might have bitflips aswell, but still should be good
> > > enough, no?  
> > 
> > Nope, we really have to take OOB bytes into account when counting
> > bitflips. Say that you have almost all in-band data set to 0xff, but
> > all OOB bytes reserved for ECC are set to 0 because someone decided to
> > write this portion of the page in raw mode. In this case we can't
> > consider the page as empty, otherwise we'll fail to correctly write ECC
> > bytes next time we program the page.  
> 
> I would probably be with you if at least the current code worked correctly,
> but unfortunately it doesn't.
> 
> I did a test with the attached program. What it does is that it writes
> pages nearly full of 0xff in raw mode. In the first page the first byte
> is exchanged with a bitflip, in the second page the second byte is
> exchanged with a bitflip, and so on. What I would expect that the number
> of corrected bits is the same for all bitflip positions. What I get
> instead is this:
> 
> ./a.out /dev/mtd4 0x0f
> byteno: 0x00000000 corrected: 5 failed: 0
> byteno: 0x00000001 corrected: 4 failed: 0
> byteno: 0x00000002 corrected: 5 failed: 0
> byteno: 0x00000003 corrected: 6 failed: 0
> byteno: 0x00000004 corrected: 5 failed: 0
> byteno: 0x00000005 corrected: 5 failed: 0
> byteno: 0x00000006 corrected: 4 failed: 0
> byteno: 0x00000007 corrected: 5 failed: 0
> byteno: 0x00000008 corrected: 5 failed: 0
> byteno: 0x00000009 corrected: 4 failed: 0
> ...
> 
> (Read this as: byteno <x> in page has 0x0f instead of 0xff, so number
> of corrected bits should be 4, instead we have 5)
> 
> or:
> ./a.out /dev/mtd4 0xfe
> byteno: 0x00000000 corrected: 1 failed: 0
> byteno: 0x00000001 corrected: 1 failed: 0
> byteno: 0x00000002 corrected: 1 failed: 0
> byteno: 0x00000003 corrected: 1 failed: 0
> byteno: 0x00000004 corrected: 1 failed: 0
> byteno: 0x00000005 corrected: 1 failed: 0
> byteno: 0x00000006 corrected: 1 failed: 0
> byteno: 0x00000007 corrected: 1 failed: 0
> byteno: 0x00000008 corrected: 1 failed: 0
> byteno: 0x00000009 corrected: 1 failed: 0
> byteno: 0x0000000a corrected: 1 failed: 0
> byteno: 0x0000000b corrected: 1 failed: 0
> byteno: 0x0000000c corrected: 1 failed: 0
> byteno: 0x0000000d corrected: 1 failed: 0
> byteno: 0x0000000e corrected: 2 failed: 0
> byteno: 0x0000000f corrected: 1 failed: 0
> byteno: 0x00000010 corrected: 1 failed: 0
> byteno: 0x00000011 corrected: 1 failed: 0
> byteno: 0x00000012 corrected: 2 failed: 0
> byteno: 0x00000013 corrected: 1 failed: 0
> 
> When I add a print_hex_dump to the driver I really see a buffer
> with 5 bytes flipped instead of 4. When I do a raw read of the
> page (under barebox, the Linux driver doesn't print the OOB data
> properly), then I really see what I have written: A page with four
> bitflips. The results are perfectly reproducable, so I don't expect
> that to be any random read failures.

Wow, that's really weird.

> 
> I know next to nothing about BCH, but could it be that the BCH engine
> already starts correcting the page while it still doesn't know that
> it can't be corrected at all leaving spurious cleared bits in the
> read data?

Bitflips are not supposed to be fixed by the BCH engine if an
uncorrectable error is detected. BCH is operating on a block of data
(usually 512 or 1024 bytes), and before starting to fix actual errors,
it searches their positions, and their positions is only determined
after the correctable/uncorrectable check has been done, so really, I
doubt BCH can mess up you data.

Did you find out where the extra bitflip is? Is it next to the other
bitflips or in a totally different region?

> 
> Sascha
> 
> -------------------------------8<-------------------------------
> 
> #include <stdio.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <mtd/mtd-abi.h>
> #include <sys/ioctl.h>
> #include <stdint.h>
> #include <string.h>
> 
> int fd;
> uint8_t *buf;
> uint8_t byte;
> 
> void erase_block(void)
> {
> 	struct erase_info_user req = {
> 		.start = 0,
> 		.length = 128 * 1024,
> 	};
> 	int ret;
> 
> 	ret = ioctl(fd, MEMERASE, &req);
> 	if (ret != 0) {
> 		perror("ioctl");
> 		exit (1);
> 	}
> }
> 
> void get_stats(unsigned int *__corrected, unsigned int *__failed)
> {
> 	struct mtd_ecc_stats stats;
> 	static unsigned int corrected, failed;
> 	int ret;
> 
> 	ret = ioctl(fd, ECCGETSTATS, &stats);
> 	if (ret != 0) {
> 		perror("ioctl");
> 		exit (1);
>  	}
> 
> 	*__corrected = stats.corrected - corrected;
> 	*__failed = stats.failed - failed;
> 
> 	corrected = stats.corrected;
> 	failed = stats.failed;
> }
> 
> void wr_pages(int startbyte)
> {
> 	struct mtd_write_req req = {
> 		.start = 0,
> 		.len = 2048,
> 		.ooblen = 64,
> 		.usr_data = (__u64)(unsigned long)buf,
> 		.usr_oob = (__u64)(unsigned long)(buf + 2048),
> 		.mode = MTD_OPS_RAW,
> 	};
> 	unsigned int corrected, failed;
> 	int i, ret;
> 
> 	erase_block();
> 
> 	for (i = 0; i < 64; i++) {
> 		memset(buf, 0xff, 2112);
> 
> 		buf[i + startbyte] = byte;
> 
> 		req.start = i * 2048;
> 
> 		ret = ioctl(fd, MEMWRITE, &req);
> 		if (ret != 0) {
> 			perror("ioctl");
> 			exit(1);
> 		}
> 	}
> 
> 	lseek(fd, 0, SEEK_SET);
> 
> 	for (i = 0; i < 64; i++) {
> 		int j;
> 
> 		ret = read(fd, buf, 2048);
> 		if (ret < 2048) {
> 			perror("read");
> 			exit(1);
> 		}
> 
> 		get_stats(&corrected, &failed);
> 		printf("byteno: 0x%08x corrected: %d failed: %d\n", i + startbyte, corrected, failed);
> 
> 		for (j = 0; j < 2048; j++) {
> 			if (buf[j] != 0xff) {
> 				printf("swapped: 0x%08x notff: 0x%08x instead: 0x%02x\n",
> 				       startbyte + i, j, buf[j]);
> 			}
> 		}
> 	}
> }
> 
> int main(int argc, char *argv[])
> {
> 	int i;
> 	unsigned int c, f;
> 
> 	if (argc < 3) {
> 		printf("usage: %s <mtd> <byte>\n", argv[0]);
> 		exit (1);
> 	}
> 
> 	byte = strtoul(argv[2], NULL, 0);
> 
> 	fd = open(argv[1], O_RDWR);
> 	if (fd < 0) {
> 		perror("open");
> 		exit(1);
> 	}
> 
> 	get_stats(&c, &f);
> 
> 	buf = malloc(2112);
> 
> 	for (i = 0; i < 0x840; i+= 64)
> 		wr_pages(i);
> 
> 	exit(0);
> }