Lots of fastmap writes

Mon Jun 17 06:48:07 PDT 2024

On 6/17/24 15:21, Zhihao Cheng wrote:
> On 2024/6/17 19:20, Rickard x Andersson wrote:
>> On 6/14/24 14:28, Zhihao Cheng wrote:
>>> On 2024/6/14 19:42, Rickard X Andersson wrote:
>>>> On 6/4/24 03:52, Zhihao Cheng wrote:
>>>
>>> [...]
>>>>>
>>>>> BTW, after applying the patches, the kernel should be run on new
>>>>> flash; the improved wear-leveling algorithm cannot rescue an
>>>>> already worn-out image.
>>>>>
>>>>
>>>> Thanks for the patches!
>>>>
>>>> I have backported the patches to Linux kernel 6.1. Do you think the 
>>>> patches are safe to apply to Linux kernel 6.1?
>>>
>>> Yes, it's okay. I have backported the patches to our product (kernel
>>> v5.10) and it works fine.
>>
>> Thanks! I backported the patches to Linux 6.1 and ran my own stress
>> test for a few days (on another device with fresh flash memory). The
>> wear of the fastmap physical blocks (0-63) seems to be a lot lower now
>> with the patches applied, which is good.
>>
>> However, I got this problem after almost 3 days of stress testing (the
>> file system has been set to read-only mode):
>>
>>
>> [ 7885.036577][  T182] ubi2: scrubbed PEB 2904 (LEB 0:229), data moved 
>> to PEB 627
>> [83721.724621][  T182] ubi2: scrubbed PEB 983 (LEB 0:3240), data moved 
>> to PEB 7
>> [83721.832521][  T182] ubi2: scrubbed PEB 997 (LEB 0:2819), data moved 
>> to PEB 5
>> [83784.750714][  T182] ubi2: scrubbed PEB 1927 (LEB 0:10), data moved 
>> to PEB 2
>> [165812.657934][  T182] ubi2: scrubbed PEB 3691 (LEB 0:11), data moved 
>> to PEB 18
>> [166748.055242][  T182] ubi2: scrubbed PEB 3045 (LEB 0:2), data moved 
>> to PEB 837
>> [166834.742451][  T182] ubi2: scrubbed PEB 918 (LEB 0:2), data moved 
>> to PEB 43
> 
> It looks like some of the PEBs have hit bit-flip errors.

One thing struck me when looking at the scrubbing above: is it not
strange that data is moved from PEBs outside the fastmap area into the
fastmap area? For example, from PEB 997 to PEB 5?
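
To make it easier to spot such moves in a longer log, here is a rough
sketch of a filter. It assumes the fastmap area is PEBs 0-63, as mentioned
earlier in this thread, and that the messages look like the ones quoted
above. The function name and the sample are only for illustration.

```python
import re

# Assumption: the fastmap area is PEBs 0-63, as mentioned earlier in this thread.
FASTMAP_PEBS = 64

SCRUB_RE = re.compile(
    r"scrubbed PEB (\d+) \(LEB \d+:\d+\), data moved to PEB (\d+)"
)

def moves_into_fastmap_area(log_text):
    """Return (source PEB, destination PEB) pairs that cross into PEBs 0-63."""
    # Drop mail quote markers and rejoin lines wrapped by the mail client.
    flat = " ".join(line.lstrip("> ").strip() for line in log_text.splitlines())
    hits = []
    for src, dst in SCRUB_RE.findall(flat):
        src, dst = int(src), int(dst)
        if src >= FASTMAP_PEBS and dst < FASTMAP_PEBS:
            hits.append((src, dst))
    return hits

# Two of the messages quoted above, still wrapped and quoted as in the mail.
SAMPLE = """\
>> [ 7885.036577][  T182] ubi2: scrubbed PEB 2904 (LEB 0:229), data moved
>> to PEB 627
>> [83721.832521][  T182] ubi2: scrubbed PEB 997 (LEB 0:2819), data moved
>> to PEB 5
"""

print(moves_into_fastmap_area(SAMPLE))
```

For the two sample messages this prints [(997, 5)]; the move to PEB 627
is not reported because the destination is outside the first 64 PEBs.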

>> [239986.496840][T31387] UBIFS error (ubi2:0 pid 31387): ubifs_scan: 
>> corrupt empty space at LEB 3519:101376
>> [239986.506809][T31387] UBIFS error (ubi2:0 pid 31387): 
>> ubifs_scanned_corruption: corruption at LEB 3519:101376
>> [239986.519742][T31387] UBIFS error (ubi2:0 pid 31387): 
>> ubifs_scanned_corruption: first 8192 bytes from LEB 3519:101376
>> [239986.532052][T31387] 00000000: fffffffe ffffffff ffffffff ffffffff 
>> ffffffff ffffffff ffffffff ffffffff  ................................
> 
> The data content (0xfffffffe) is weird; shouldn't it be 0xffffffff? One
> bit has flipped, and there are no ECC error messages!

Yes, strange!
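
Just to double-check the "one bit" observation, a small sketch that
compares the first 32-bit word of the dump with the erased pattern. The
bit position is relative to the word as printed in the dump; which
physical flash bit that corresponds to depends on byte order, which I am
not assuming anything about here.

```python
observed = 0xFFFFFFFE  # first word of the hex dump quoted above
erased = 0xFFFFFFFF    # what empty (erased) space is expected to read as

diff = observed ^ erased                  # bits that differ from the erased pattern
flipped_bits = bin(diff).count("1")       # how many bits differ
flipped_positions = [i for i in range(32) if (diff >> i) & 1]

print(flipped_bits, flipped_positions)
```

This prints "1 [0]", so exactly one bit (the lowest bit of the printed
word) reads as 0 instead of 1.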

>>>> Another thing, would it not be possible to rescue that particular 
>>>> worn out device by simply turning fastmap off on that device?
>>>>
>>>
>>> Can I take "rescuing" to mean making the erase counters normal again
>>> (max - min <= UBI_WL_THRESHOLD)? If so, I'm afraid that not all PEBs
>>> can be rescued, according to get_peb_for_wl().
>>> For example, PB and PC cannot be rescued unless PA is taken for
>>> writing and wear-leveling happens to be scheduled right at that point.
>>>
>>> ubi->free tree:
>>>       29600(PB)
>>> 1(PA)        29600(PC)
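
To put numbers on the example tree above, a minimal sketch of the
"max - min <= UBI_WL_THRESHOLD" check. The threshold value of 4096 is my
assumption (I believe it is the default of
CONFIG_MTD_UBI_WEAR_LEVELING_THRESHOLD); the actual value depends on the
kernel configuration. This only evaluates the spread and does not model
what get_peb_for_wl() does.

```python
# Assumption: 4096 is the configured wear-leveling threshold
# (CONFIG_MTD_UBI_WEAR_LEVELING_THRESHOLD); adjust to the real config.
UBI_WL_THRESHOLD = 4096

# Erase counters of the free PEBs from the example tree above.
free_erase_counters = {"PA": 1, "PB": 29600, "PC": 29600}

spread = max(free_erase_counters.values()) - min(free_erase_counters.values())
balanced = spread <= UBI_WL_THRESHOLD

print(spread, balanced)
```

With these counters the spread is 29599, far above the assumed
threshold, so the erase counters would not count as "normal" in that
sense.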
>>
>> I mean that I think the badly worn device could be made usable again
>> by turning off fastmap. Would it not work properly? I do understand
>> that the first 64 physical erase blocks would not be used in practice,
>> since their erase counts are very high. But would the filesystem not
>> work OK? Or am I missing something?
>>
> 
> I think the first 64 PEBs could still be used when the number of free
> PEBs drops below 64, for example when UBI runs out of space or when
> many erase works are still pending at the time a free PEB is requested.

Ok, I think I understand. The device is probably usable, but if the flash
becomes almost full or the system is under pressure, I could run
into problems.

Thanks!
/Rickard A.