Orange PI 5 MAX: very unstable using kernel 6.19.0 and 6.18.10, 6.18.9 perfectly stable
Qu Wenruo
quwenruo.btrfs at gmx.com
Thu Feb 12 14:48:08 PST 2026
On 2026/2/13 08:53, David Arendt wrote:
> On 2/12/26 10:05 PM, Qu Wenruo wrote:
>>
>>
>> On 2026/2/13 06:41, David Arendt wrote:
>>> Hello,
>>>
>>> I am using a Kubernetes cluster with 3 Orange PI5 MAX nodes. The data
>>> is stored on a btrfs filesystem as the backend. With kernel 6.19.0
>>> or kernel 6.18.10 I have experienced many crashes under high IO load
>>> on all 3 nodes. Reverting back to 6.18.9 solves the problem
>>> completely. Unfortunately the crashes are spontaneous reboots that
>>> leave no trace in any logfile, so I have no stack trace of them.
>>> After a crash I sometimes have incorrect btrfs csums for a file,
>>> but these may also be the result of a partial write due to the crash.
>>> On one node I had a btrfs error logged without a crash, but I am not
>>> sure if this is the root cause or a result of a prior crash. A scrub
>>> after reboot returned no errors with 6.19.0.
>>
>> The offending tree dump items are:
>>
>> Feb 10 13:31:07 opi02 kernel: item 92 key (13218356101120
>> Feb 10 13:31:07 opi02 kernel: item 93 key (13216208642048
>> Feb 10 13:31:07 opi02 kernel: item 94 key (13218356162560
>>
>> Obviously item 93's key is smaller than both its previous and its
>> next item's key, violating the required ascending key order.
>>
>> hex(13218356101120) = 0xc05a36b8000
>> hex(13216208642048) = 0xc05236be000
>> hex(13218356162560) = 0xc05a36c7000
>>
>> It looks like something flipped: "0xc05a3" -> "0xc0523".
>>
>> 0xa -> 0x2 is exactly one bit flipped (0b1010 -> 0b0010).
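
(A quick self-check of that math, as a minimal sketch: the "expected"
value below is my reconstruction -- a key that would sort correctly
between items 92 and 94 -- not something taken from the actual dump.

    expected = 0xc05a36be000  # hypothetical original key offset
    observed = 0xc05236be000  # the out-of-order key of item 93
    diff = expected ^ observed
    print(hex(diff))              # 0x80000000
    print(bin(diff).count("1"))   # 1 -> exactly one bit cleared

So the two values differ in a single bit, bit 31.)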
>>
>> So either the memory hardware has something wrong, resulting in a
>> stuck bit (always 0), or something inside the kernel is touching
>> memory it shouldn't.
>>
>> And this exactly matches the symptom: flip random bits of your
>> kernel's memory and crashes are always expected.
>>
>>
>> Can you run a memtest first, to make sure it is not a hardware problem?
>
> Hello,
>
> I don't know of anything like memtest86 for the arm64 platform that
> tests the whole memory, so I used the user-space memtester to check
> the 14G of unused RAM on all 3 machines while running kernel 6.18.10.
>
> Here is the result of the first iteration (same on every machine):
>
> memtester version 4.7.1 (64-bit)
> Copyright (C) 2001-2024 Charles Cazabon.
> Licensed under the GNU General Public License version 2 (only).
>
> pagesize is 4096
> pagesizemask is 0xfffffffffffff000
> want 14000MB (14680064000 bytes)
> got 14000MB (14680064000 bytes), trying mlock ...locked.
> Loop 1:
> Stuck Address : ok
> Random Value : ok
> Compare XOR : ok
> Compare SUB : ok
> Compare MUL : ok
> Compare DIV : ok
> Compare OR : ok
> Compare AND : ok
> Sequential Increment: ok
> Solid Bits : ok
> Block Sequential : ok
> Checkerboard : ok
> Bit Spread : ok
> Bit Flip : ok
> Walking Ones : ok
> Walking Zeroes : ok
>
> I don't think it is a hardware failure, as it is happening on 3
> different machines. Crashes occur somewhere between 30 minutes and 12
> hours on all 3 machines, which have been running without a single
> crash for more than a year on older kernel versions, including 4 days
> on 6.18.9 and all versions from 6.18.0 to 6.18.9. So it seems to be
> caused by something that changed between 6.18.9 and 6.18.10.
Then I'm afraid you have to try bisecting.
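
Since the window is a single stable release, the bisect should be
short. A rough sketch, assuming you build your kernels from a
linux-stable checkout:

    git bisect start
    git bisect bad v6.18.10
    git bisect good v6.18.9

Then build and boot each kernel git checks out, reproduce the load,
and mark the result with `git bisect good` or `git bisect bad` until
it points at the offending commit. Given that a crash can take up to
12 hours to show up, give each candidate enough runtime before
marking it good.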
On the other hand, I also have an arm64 board (Orion O6) as a VM host.
The testing arm64 VM is running a kernel very close to v6.19.0, but it
has never hit such a crash/corruption.
So I'm wondering if it may be some driver, specific to the RK3588,
that is randomly corrupting memory and causing the problem.
In the past (several years ago), we had the amd_sfh driver causing
random memory corruption on x86_64, which led to exactly the same
problem (random crashes, btrfs corruption detected, etc.).
So I guess it could be the same situation.
Thanks,
Qu
>
> Thanks,
>
> David Arendt
>
>>
>> Thanks,
>> Qu
>>
>>
>>>
>>> Unfortunately I don't have more information at the moment.
>>>
>>> Thanks in advance,
>>>
>>> David Arendt
>>>
>>>
>>
>
>