UBI: Race between fastmap_write and wear_leveling_worker
Anders Olofsson
pingu at mazeda.se
Thu Aug 25 01:32:09 PDT 2016
On 2016-08-25 09:38, Richard Weinberger wrote:
> On 25.08.2016 08:52, Anders Olofsson wrote:
>> On 2016-08-24 17:04, Richard Weinberger wrote:
>>> Anders,
>>>
>>> On Wed, Aug 24, 2016 at 1:37 PM, Anders Olofsson <pingu at mazeda.se> wrote:
>>>> After enabling fastmap I sometimes get the following warning at boot:
>>>>
>>>
>>> Hehe, you're lucky I've recently fixed an issue in this area, can you
>>> please try:
>>> http://lists.infradead.org/pipermail/linux-mtd/2016-August/068919.html
>>>
>>> I did these fixes on top of a rather old customer kernel and started
>>> upstreaming them.
>>>
>>
>> Tested it and from what I can tell it solves my problem as well. I've run a bunch of reboots and the wear leveling worker no longer runs while the fastmap is being updated.
>>
>> Good work and thanks a lot for solving it so quickly.
>
> How do you test? I wonder how you can trigger this so easily.
> The said patch emerged while a customer did excessive Fastmap testing
> and the race appeared only once. I found it while staring at the code.

I don't know what I'm doing that makes my system special. I can only
guess it's related to the size of the UBI partition, since it only
happens on the smaller of the two partitions we use (160 PEBs vs. 1830
in the larger partition, where I've never seen this happen).
Having only 160 PEBs means the WL pool consists of only 4 PEBs, in case
that is any clue to the behavior I'm describing here.
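
A rough sketch of where the 4 would come from, assuming the fastmap pool
is sized as 5% of the PEB count, clamped between a minimum and a maximum,
with the WL pool being half of that. The constants and the function name
below are assumptions for illustration, not copied from the kernel source:

```python
# Assumed limits for the fastmap pool (illustrative values, please check
# them against the UBI headers of the kernel you are running).
UBI_FM_MIN_POOL_SIZE = 8
UBI_FM_MAX_POOL_SIZE = 256


def fastmap_pool_sizes(peb_count):
    """Return (fm_pool, wl_pool) maximum sizes for a device with peb_count PEBs."""
    # 5% of the PEBs, using integer division the way C code would.
    fm_pool = min((peb_count // 100) * 5, UBI_FM_MAX_POOL_SIZE)
    # Small devices are lifted up to the minimum pool size.
    fm_pool = max(fm_pool, UBI_FM_MIN_POOL_SIZE)
    # The WL pool is assumed to be half of the fastmap pool.
    wl_pool = fm_pool // 2
    return fm_pool, wl_pool


for pebs in (160, 1830):
    print(pebs, fastmap_pool_sizes(pebs))
```

Under these assumptions the 160 PEB partition ends up with a WL pool of 4,
while the 1830 PEB partition gets a much larger one, so the same PEB is
picked as the target far less often there.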

If size is the key, then the setup is a 20MB partition with an 8MB UBIFS
volume in it, and the only thing I need to do to trigger this is to
attach the partition and mount the filesystem. I think my system may
also do a small write to a file in the filesystem, but it is mostly
just reading. A clean reboot and a power cycle seem to work equally well
in triggering the fault.

What I have seen is that at every boot, the wear leveling worker wants
to relocate one PEB and always fails. The source PEB varies, but the
target PEB is always the first one from the WL pool. The relocation
fails either because the source block is unused or because it is
locked. In both cases the worker erases the destination PEB, and this
erase was happening while the fastmap was being updated.

This by itself sounds like a bug somewhere: there should be no need to
erase the destination PEB when the wear leveling was aborted before
anything was written. Since it is always the same PEB, the result is
that this PEB gets a much higher erase count than the other PEBs in the
partition.
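
To illustrate the skew, here is a toy model of the behavior described
above. It is not UBI code; the function name, the PEB numbering and the
assumption that the erased target goes back to the front of the WL pool
are all made up for the example:

```python
def simulate_boots(boots, peb_count=160, wl_pool_size=4):
    """Count erases per PEB when every boot aborts one move and erases the target."""
    erase_counts = [0] * peb_count
    # Hypothetical layout: PEBs 0..wl_pool_size-1 sit in the WL pool.
    wl_pool = list(range(wl_pool_size))
    for _ in range(boots):
        # The worker always takes the first PEB from the WL pool as target.
        target = wl_pool[0]
        # The move is aborted before any data is written, but the target
        # is erased anyway and (by assumption) returns to the same slot.
        erase_counts[target] += 1
    return erase_counts


counts = simulate_boots(1000)
print("target PEB:", counts[0], "highest other PEB:", max(counts[1:]))
```

In this model all the extra erases land on a single PEB, which matches the
one PEB with a much higher erase count that I see on the small partition.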

The wear leveling always seems to happen right after attaching, and the
fastmap is also always rewritten at this time. From what I've understood
of the fastmap logic so far, I don't see why it needs to update the map
at every boot, but it does on my partition. Since both of these happen
at the same time, the race occurs often enough to be visible as more
than just a small glitch.

The failing relocation and the fastmap rewrite at every boot are of
course still there with your patch. The only difference is that the wear
leveling worker isn't allowed to run until after the fastmap update is
completed.
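
As a user-space analogy of that ordering (not the mechanism the patch
actually uses, which I haven't reproduced here), two threads sharing one
lock behave the same way. The lock, the function names and the sleep are
assumptions made for the sketch:

```python
import threading
import time

fm_lock = threading.Lock()  # hypothetical stand-in for the kernel-side exclusion
events = []


def fastmap_update(started):
    with fm_lock:
        events.append("fastmap: begin")
        started.set()      # the worker is released while the update is running
        time.sleep(0.05)   # stands in for writing the fastmap to flash
        events.append("fastmap: end")


def wear_leveling_worker(started):
    started.wait()         # wake up in the middle of the fastmap update
    with fm_lock:          # blocks until the update has finished
        events.append("wl: run")


def run():
    events.clear()
    started = threading.Event()
    fm = threading.Thread(target=fastmap_update, args=(started,))
    wl = threading.Thread(target=wear_leveling_worker, args=(started,))
    wl.start()
    fm.start()
    fm.join()
    wl.join()
    return list(events)


print(run())
```

The worker still runs at every boot in this sketch; it is only pushed
back until after the update has finished.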

I did notice the fault happening more easily while I was debugging, so
having a lot of debug prints in the code made the race window larger.
Even before adding any prints, though, I got this in at least one boot
out of ten on the multi-core systems.

> But it is good to see that finally after years embedded Folks start
> using Fastmap and non-obvious issues can get sorted out.

I'm working on an embedded system where boot times are becoming more and
more important. Using fastmap removes a whole second from our total boot
time (half in the boot loader and half in the kernel), so this was
definitely a good feature for us.

/Anders