MTD RAID

Boris Brezillon boris.brezillon at free-electrons.com
Fri Aug 19 04:36:10 PDT 2016


On Fri, 19 Aug 2016 18:22:25 +0800
Dongsheng Yang <dongsheng.yang at easystack.cn> wrote:

> On 08/19/2016 05:37 PM, Boris Brezillon wrote:
> > On Fri, 19 Aug 2016 17:15:56 +0800
> > Dongsheng Yang <dongsheng081251 at gmail.com> wrote:
> >  
> >> Hi Boris,
> >>
> >> On Fri, Aug 19, 2016 at 4:20 PM, Boris Brezillon <
> >> boris.brezillon at free-electrons.com> wrote:
> >>  
> >>> On Fri, 19 Aug 2016 15:08:35 +0800
> >>> Dongsheng Yang <dongsheng.yang at easystack.cn> wrote:
> >>>     
> >>>> On 08/19/2016 02:49 PM, Boris Brezillon wrote:  
> >>>>> Hi Dongsheng,
> >>>>>
> >>>>> On Fri, 19 Aug 2016 14:34:54 +0800
> >>>>> Dongsheng Yang <dongsheng081251 at gmail.com> wrote:
> >>>>>     
> >>>>>> Hi guys,
> >>>>>>       This is an email about MTD RAID.
> >>>>>>
> >>>>>> *Code:*
> >>>>>>       kernel:
> >>>>>> https://github.com/yangdongsheng/linux/tree/mtd_raid_v2-for-4.7  
> >>>>> Just had a quick look at the code, and I see at least one major problem
> >>>>> in your RAID-1 implementation: you're ignoring the fact that NAND blocks
> >>>>> can be or become bad. What's the plan for that?  
> >>>> Hi Boris,
> >>>>       Thanks for your quick reply.
> >>>>
> >>>>       When you are using RAID-1, an erase operation erases all of the
> >>>> mirrored blocks. If there is a bad block among them, mtd_raid_erase will
> >>>> return an error, and the userspace tool or UBI will mark this block as
> >>>> bad; that means mtd_raid_block_markbad() will mark all of the mirrored
> >>>> blocks as bad, even though some of them are still good.
> >>>>
> >>>> In addition, consider the case where you already have data on flash with
> >>>> RAID-1 and one block becomes bad. For example, say mtd0 and mtd1 are used
> >>>> to build a RAID-1 device mtd2, and while using mtd2 you find that a block
> >>>> has become bad. There is no need to worry about data loss: the data is
> >>>> still held by the good mirror, and you can replace the bad device with
> >>>> another new mtd device.  
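For reference, here is a minimal sketch of the erase/markbad behavior
described above. The helper names echo the description, but the signatures
and structure are assumptions for illustration, not the actual code from the
tree:

#include <linux/mtd/mtd.h>

static int raid1_erase(struct mtd_info **mirrors, int nr_mirrors,
                       struct erase_info *instr)
{
        int i, ret;

        /* The same eraseblock is erased on every mirror. */
        for (i = 0; i < nr_mirrors; i++) {
                ret = mtd_erase(mirrors[i], instr);
                if (ret)
                        return ret; /* one bad block fails the whole erase */
        }

        return 0;
}

static int raid1_block_markbad(struct mtd_info **mirrors, int nr_mirrors,
                               loff_t ofs)
{
        int i, ret;

        /*
         * The caller (UBI or the userspace tool) only sees one logical
         * block, so every mirrored block gets marked bad, even the ones
         * that are still good.
         */
        for (i = 0; i < nr_mirrors; i++) {
                ret = mtd_block_markbad(mirrors[i], ofs);
                if (ret)
                        return ret;
        }

        return 0;
}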
> >>> Okay, good to see you were aware of this problem.
> >>>     
> >>>> My plan for this feature is entirely in the userspace tool.
> >>>> (1). mtd_raid scan mtd2 <---- this will show the status of the RAID
> >>>> device and of each of its members.
> >>>> (2). mtd_raid replace mtd2 --old mtd1 --new mtd3.   <---- this will
> >>>> replace the bad member mtd1 with mtd3.
> >>>>
> >>>> What about this idea?  
> >>> Not sure I follow you on #2. And, IMO, you should not depend on a
> >>> userspace tool to detect and address this kind of problem.
> >>>
> >>> Okay, a few more questions.
> >>>
> >>> 1/ What about data retention issues? Say you read from the main MTD, and
> >>> it does not show uncorrectable errors, so you keep reading on it, but,
> >>> since you're never reading from the mirror, you can't detect if there
> >>> are some uncorrectable errors or if the number of bitflips exceeds the
> >>> threshold used to trigger a data move. If suddenly a page in your main
> >>> MTD becomes unreadable, you're not guaranteed that the mirror page will
> >>> be valid :-/.
> >>>     
> >> Yes, that could happen. But that's a case where the main MTD and the
> >> mirror become bad at the same time. Yes, that's possible, but it's much
> >> rarer than just one MTD going bad, right?  
> > Absolutely not, that's actually more likely than getting bad blocks. If
> > you're not regularly reading your data, it can go bad with no way
> > to recover from it.
> >  
> >> That's what RAID-1 is for. If you want
> >> to address this problem, just increase the number of mirrors. Then you can
> >> make your data safer and safer.  
> > Except the number of bitflips is likely to increase over time, so if
> > you never read your mirror blocks because the main MTD is working fine,
> > you may not be able to read data back when you really need it.  
> 
> Sorry, I am afraid I did not get your point. But in general, I believe it's
> safer to have two copies of the data than just one. Could you explain
> more, thanks. :)

It's safer in most cases, but if you don't make sure your mirror is
in a correct state, then it's just giving an illusion of safety that
is not necessarily there.
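
For illustration only, one way to avoid that illusion is a scrub pass that
periodically reads every copy, not just the main one, so bitflip build-up on
a mirror is noticed while the other copy is still readable. The helper below
and its signature are made up for this sketch:

#include <linux/mtd/mtd.h>

/* Hypothetical scrub helper: read the same range from every mirror. */
static int raid1_scrub(struct mtd_info **mirrors, int nr_mirrors,
                       loff_t ofs, size_t len, u_char *buf)
{
        size_t retlen;
        int i, ret;

        for (i = 0; i < nr_mirrors; i++) {
                ret = mtd_read(mirrors[i], ofs, len, &retlen, buf);
                /*
                 * -EUCLEAN: correctable bitflips above the bitflip
                 * threshold; -EBADMSG: uncorrectable data. Either way
                 * this copy should be refreshed from a good mirror
                 * before the other one degrades too.
                 */
                if (ret == -EUCLEAN || ret == -EBADMSG)
                        pr_warn("mirror %d needs refresh at %llx\n",
                                i, (unsigned long long)ofs);
                else if (ret)
                        return ret;
        }

        return 0;
}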

> >  
> >>> 2/ How do you handle write atomicity in RAID1? I don't know exactly
> >>> how RAID1 works, but I guess there's a mechanism (a journal?) to detect
> >>> that data has been written on the main MTD but not on the mirror, so
> >>> that you can replay the operation after a power-cut. Do you handle this
> >>> case correctly?
> >>>     
> >> No, but the redundancy of the RAID levels is designed to protect against
> >> a *disk* failure, not against a *power* failure; that's the responsibility
> >> of UBIFS. When UBIFS replays, any uncompleted write will be abandoned.  
> > And again, you're missing one important point. UBI and UBIFS are
> > sitting on your RAID layer. If the mirror MTD is corrupted because of
> > a power-cut, but the main one is working fine, UBI and UBIFS won't
> > notice, until you really need to use the mirror, and it's already too
> > late.  
> Actually there is already an answer to this question for RAID-1:
> 
> https://linas.org/linux/Software-RAID/Software-RAID-4.html
> 
> 
> But I am glad to figure out what we can do in this case.
> At this moment, I think doing a RAID check on all the copies of the data
> when UBIFS is recovering sounds possible.

Now you're mixing different layers. How would UBIFS/UBI inform the MTD
layer that it needs to take some corrective measures?
IMO you're heading toward something that is complex and error-prone (mainly
because of the unreliability of the NANDs).
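
To make the power-cut concern above concrete, a naive mirrored write might
look roughly like the sketch below (again, names and signatures are
assumptions, not the code from the tree). A power-cut between the two
mtd_write() calls leaves the mirrors silently out of sync, and the layers
above only ever see the single logical device:

#include <linux/mtd/mtd.h>

static int raid1_write(struct mtd_info **mirrors, int nr_mirrors,
                       loff_t to, size_t len, size_t *retlen,
                       const u_char *buf)
{
        int i, ret;

        for (i = 0; i < nr_mirrors; i++) {
                ret = mtd_write(mirrors[i], to, len, retlen, buf);
                if (ret)
                        return ret;
                /*
                 * A power-cut here leaves mirrors i+1..n with stale data,
                 * and nothing above the MTD layer can tell which copy is
                 * the right one.
                 */
        }

        return 0;
}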

> 
> >  
> >>> On a general note, I don't think it's wise to place the RAID layer at
> >>> the MTD level. How about placing it at the UBI level (pick 2 ubi
> >>> volumes to create one UBI-RAID element)? This way you don't have to
> >>> bother about bad block handling (you're manipulating logical blocks
> >>> which can be anywhere on the NAND).
> >>>     
> >>
> >> But how can we handle the multiple-chips problem? Some drivers
> >> combine multiple chips into one single mtd device, which is what
> >> mtd_concat does.  
> > You can pick 2 UBI volumes from 2 UBI devices (each one attached
> > to a different MTD device).  
> 
> Yes, but I am afraid we don't want to expose all our chips.
> 
> Please consider this scenario: one PCIe card with many chips attached, where
> we only want the user to see one mtd device, /dev/mtd0, rather than 40+ mtd
> devices. So we need to call mtd_raid_create() in the driver for this card.

Yes, I was only commenting on the RAID-1 implementation. For RAID-0, all
you need is an improved mtdconcat implementation.
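
For reference, the kernel's existing mtd_concat_create() helper already
covers the aggregation part; a driver that has probed several chips could
glue them together roughly like this (the device name and the surrounding
registration function are made up for the example):

#include <linux/mtd/mtd.h>
#include <linux/mtd/concat.h>

/* Hypothetical registration path for a card exposing several chips. */
static int my_card_register_mtd(struct mtd_info **chips, int nr_chips)
{
        struct mtd_info *concat;

        /* Concatenate all per-chip MTDs into one logical device. */
        concat = mtd_concat_create(chips, nr_chips, "pcie-nand");
        if (!concat)
                return -ENXIO;

        /* Expose the concatenated device as a single /dev/mtdX. */
        return mtd_device_register(concat, NULL, 0);
}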

> >  
> >>> One last question: what's the real goal of this MTD-RAID layer? If
> >>> that's about addressing the MLC/TLC NAND reliability problems, I don't
> >>> think it's such a good idea.
> >>>     
> >> Oh, that's not the main problem I want to solve. RAID-1 is just a possible
> >> extension based on my RAID framework.
> >>
> >> This work started with RAID-0 only, which is used to make use of lots of
> >> flash chips to improve performance. Then I refactored it into an MTD RAID
> >> framework, so that we can implement other RAID levels for mtd.
> >>
> >> Example:
> >>      In our production, there are 40+ chips attached to one PCIe card,
> >> and we need to present all of them as one mtd device. At the same time,
> >> we need to consider how to manage these chips. Finally we chose a RAID-0
> >> mode for them and got a great performance result.
> >>
> >> So the multiple-chips scenario is the original problem I want to solve,
> >> and then I found I could refactor it for other RAID levels.  
> > So all you need is a way to concatenate MTD devices (are we talking
> > about NAND devices?)? It shouldn't be too hard to define something
> > like an MTD-cluster aggregating several similar MTD devices to provide
> > a single MTD. But I'd really advise you to drop the MTD-RAID idea and
> > focus on your real/simple need: aggregating MTD devices.  
> 
> Yes, the original problem is to concatenate the NAND devices, and we
> have to use RAID-0 to improve our performance.
> 
> Later on, I found that MTD RAID is not a bad idea for solving other problems,
> so I tried to refactor it into MTD-RAID.

Except it's way more complicated than aggregating several MTD devices
to expose a single entity. So you'd better focus on the mtdconcat
feature instead of trying to implement a RAID layer possibly supporting
all kinds of RAID configs.




