Jffs2 and big file = very slow jffs2_garbage_collect_pass

Ricard Wanderlof ricard.wanderlof at axis.com
Wed Jan 23 05:41:14 EST 2008


On Wed, 23 Jan 2008, Jörn Engel wrote:

>> I would say that correctable errors occurring "soon" after writing are an
>> indication that the block is going bad. My experience has been that
>> extensive reading can cause bitflips (and it probably happens over time
>> too), but that for fresh blocks, billions of read operations need to be
>> done before a bit flips. For blocks that are nearing their best before
>> date, a couple of hundred thousand reads can cause a bit to flip. So if I
>> was implementing some sort of 'when is this block considered
>> bad'-algorithm, I'd try to keep tabs on how often the block has been
>> (read-) accessed in relation to when it was last written. If this number is
>> "low", the block should be considered bad and not used again.
>
> That sounds like an impossible strategy.  Causing a write for every read
> will significantly increase write pressure, thereby reducing flash
> lifetime, reducing performance, etc.
>
> What would be possible is a counter for soft/hard errors per physical
> block.  On soft error, move data elsewhere and reuse the block, but
> increment the error counter.  If the counter increases beyond 17 (or any
> other random number), mark the block as bad.  Limit can be an mkfs
> option.

Sorry, I didn't express myself clearly. I should have said '...keep tabs 
on how _many times_ the block has been read-accessed in relation to when
it was last written.' If a page has been read, say, 100 000 times since it 
was last written, and starts to show bit flips, it is a sign that the 
block is wearing out. If it has been read, say, 100 000 000 times since it 
was written and starts showing bit flips, it's probably sufficient just to 
do a garbage collect and rewrite the data (in the same block or 
elsewhere).
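The rule of thumb above could be sketched roughly like this. The struct, function name, and threshold value are purely my own illustration (the real cut-off would have to come from measurements on the actual chip), not anything from JFFS2:

```c
/* Sketch only: decide how to react to a correctable bit flip based on
 * how many reads the page has seen since its last write.  A flip after
 * "few" reads suggests genuine wear; a flip after a very large number
 * of reads is more likely ordinary read disturb, and rewriting the
 * data (garbage collecting) is probably enough. */
#include <stdbool.h>

/* Illustrative threshold -- a real value needs per-chip measurement. */
#define WEAR_SUSPECT_READS 1000000UL

struct page_stats {
    unsigned long reads_since_write;
};

/* Returns true if a bit flip on this page should be treated as a sign
 * of wear-out, false if a plain rewrite of the data should suffice. */
bool flip_indicates_wear(const struct page_stats *p)
{
    return p->reads_since_write < WEAR_SUSPECT_READS;
}
```

With these numbers, a flip after 100 000 reads would be treated as wear, while one after 100 000 000 reads would just trigger a rewrite, matching the two cases above.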

The algorithm you suggest also sounds reasonable. Repeatedly occurring bit 
flips (-EUCLEAN) are an indication that the block is wearing out. Probably 
more efficient than logging the number of read accesses somewhere.
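A per-block soft-error counter along the lines suggested above might look like this in outline. All the names and the limit of 17 are illustrative, echoing the "or any other random number" remark; this is not actual JFFS2 code:

```c
/* Sketch only: count correctable (-EUCLEAN) errors per physical block.
 * On every soft error the data is relocated and the block reused, but
 * once the counter exceeds a limit the block is retired as bad. */
#include <stdbool.h>

#define SOFT_ERROR_LIMIT 17   /* arbitrary; could be an mkfs option */

struct block_info {
    unsigned int soft_errors;  /* correctable errors seen so far */
    bool bad;                  /* retired from further use */
};

/* Called when a read from this block returns a correctable error.
 * Always returns true, meaning "relocate the data now"; additionally
 * marks the block bad once the error count passes the limit. */
bool on_soft_error(struct block_info *b)
{
    b->soft_errors++;
    if (b->soft_errors > SOFT_ERROR_LIMIT)
        b->bad = true;
    return true;
}
```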

One problem may be what to do when the system is powered down. If we don't 
store the error counters in the flash (or some other non-volatile place), 
then each time the system is powered up, all the error counters will be 
reset.
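One conceivable way around the power-down problem would be to serialise each block's counter into a small record kept somewhere non-volatile and read it back at mount time. The record layout here is purely hypothetical:

```c
/* Sketch only: a minimal on-flash record for persisting a block's
 * soft-error counter across power cycles.  Where such records would
 * actually live (and how they'd be kept consistent) is left open. */
#include <stdint.h>
#include <string.h>

struct err_record {
    uint32_t block_no;     /* physical erase block number */
    uint32_t soft_errors;  /* correctable errors seen so far */
};

/* Pack the record into a fixed 8-byte buffer for writing out. */
void err_record_pack(const struct err_record *r, uint8_t buf[8])
{
    memcpy(buf, &r->block_no, 4);
    memcpy(buf + 4, &r->soft_errors, 4);
}

/* Restore the record from the buffer at mount time. */
void err_record_unpack(struct err_record *r, const uint8_t buf[8])
{
    memcpy(&r->block_no, buf, 4);
    memcpy(&r->soft_errors, buf + 4, 4);
}
```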

>> We ran some tests here on a particular flash chip type to try and
>> determine at least some of the failure modes that are related to block
>> wear (due to write/erase) and bit decay (due to reading). The end result
>> was basically what I tried to describe above, but I can go into more
>> detail if you're interested.
>
> I do remember your mail describing the test.  One of the interesting
> conclusions is that even an awfully worn-out block is still good enough to
> store short-lived information.  It appears to be a surprisingly robust
> strategy to have a high wear-out, as long as you keep the wear
> constantly high and replace block contents at a high rate.

You're probably right, but I'm not sure I understand what you mean.

/Ricard
--
Ricard Wolf Wanderlöf                           ricardw(at)axis.com
Axis Communications AB, Lund, Sweden            www.axis.com
Phone +46 46 272 2016                           Fax +46 46 13 61 30



More information about the linux-mtd mailing list