Orange PI 5 MAX: very unstable using kernel 6.19.0 and 6.18.10, 6.18.9 perfectly stable

David Arendt admin at prnet.org
Thu Feb 12 21:09:15 PST 2026


On 2/12/26 11:48 PM, Qu Wenruo wrote:
>
>
> On 2026/2/13 08:53, David Arendt wrote:
>> On 2/12/26 10:05 PM, Qu Wenruo wrote:
>>>
>>>
>>> On 2026/2/13 06:41, David Arendt wrote:
>>>> Hello,
>>>>
>>>> I am using a Kubernetes cluster with 3 Orange PI5 MAX nodes. The 
>>>> data is stored on a btrfs filesystem as backend. When using kernel 
>>>> 6.19.0 or 6.18.10, I have experienced many crashes during high I/O 
>>>> load on all 3 nodes. Reverting to 6.18.9 solves the problems 
>>>> completely. Unfortunately the crashes are spontaneous reboots that 
>>>> leave no trace in any logfile, so I have no stack trace of them. 
>>>> After the crashes I sometimes have incorrect btrfs csums for a 
>>>> file, but these may also be the result of a partial write due to 
>>>> the crash. On one node I had a btrfs error logged without a crash, 
>>>> but I am not sure whether this is the root cause or the result of a 
>>>> prior crash. A scrub after reboot with 6.19.0 returned no errors.
>>>
>>> The offending tree dump items are:
>>>
>>> Feb 10 13:31:07 opi02 kernel:  item 92 key (13218356101120
>>> Feb 10 13:31:07 opi02 kernel:  item 93 key (13216208642048
>>> Feb 10 13:31:07 opi02 kernel:  item 94 key (13218356162560
>>>
>>> Obviously item 93's key is smaller than both its previous and next item keys.
>>>
>>> hex(13218356101120) = 0xc05a36b8000
>>> hex(13216208642048) = 0xc05236be000
>>> hex(13218356162560) = 0xc05a36c7000
>>>
>>> It looks like something flipped: "0xc05a3" -> "0xc0523"
>>>
>>> 0xa -> 0x2 is exactly one bit flipped.
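>>>
>>> Just to double-check that numerically (a minimal sketch in Python; the
>>> "expected" key below is not from the dump, it is only the value implied
>>> by the neighbouring items, which share the 0xc05a3... prefix):
>>>
>>>   k92 = 0xc05a36b8000   # item 92, from the tree dump above
>>>   k93 = 0xc05236be000   # item 93, the out-of-order key
>>>   k94 = 0xc05a36c7000   # item 94
>>>
>>>   # Hypothetical original key for item 93, assuming only the prefix
>>>   # nibble changed (0xa -> 0x2) and the low offset bits are intact.
>>>   k93_expected = 0xc05a36be000
>>>
>>>   diff = k93 ^ k93_expected
>>>   print(hex(diff))                 # 0x80000000 -> bit 31
>>>   print(bin(diff).count("1"))      # 1, i.e. a single flipped bit
>>>   print(k92 < k93_expected < k94)  # True: the restored key sorts correctly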
>>>
>>> So either the memory hardware has something wrong, resulting in a 
>>> stuck bit (always 0), or there is something inside the kernel 
>>> touching memory it shouldn't.
>>>
>>> And this exactly matches the symptom: with random bits of your 
>>> kernel memory changing, crashes are always to be expected.
>>>
>>>
>>> Can you run a memtest first to make sure it is not a hardware problem?
>>
>> Hello,
>>
>> I don't know of anything like memtest86 for the arm64 platform that 
>> can test the whole memory, so I used the user-space memtester to 
>> check the 14 GB of unused RAM on all 3 machines while running kernel 
>> 6.18.10.
>>
>> Here is the result of the first iteration (same on every machine):
>>
>> memtester version 4.7.1 (64-bit)
>> Copyright (C) 2001-2024 Charles Cazabon.
>> Licensed under the GNU General Public License version 2 (only).
>>
>> pagesize is 4096
>> pagesizemask is 0xfffffffffffff000
>> want 14000MB (14680064000 bytes)
>> got  14000MB (14680064000 bytes), trying mlock ...locked.
>> Loop 1:
>>    Stuck Address       : ok
>>    Random Value        : ok
>>    Compare XOR         : ok
>>    Compare SUB         : ok
>>    Compare MUL         : ok
>>    Compare DIV         : ok
>>    Compare OR          : ok
>>    Compare AND         : ok
>>    Sequential Increment: ok
>>    Solid Bits          : ok
>>    Block Sequential    : ok
>>    Checkerboard        : ok
>>    Bit Spread          : ok
>>    Bit Flip            : ok
>>    Walking Ones        : ok
>>    Walking Zeroes      : ok
>>
>> I don't think it is a hardware failure, as it is happening on 3 
>> different machines. Crashes occur somewhere between 30 minutes and 12 
>> hours after boot on all 3 machines. These machines have been running 
>> without a single crash for more than a year on older kernel versions, 
>> including 4 days on 6.18.9 and all versions from 6.18.0 to 6.18.9, so 
>> it seems to be caused by something that changed between 6.18.9 and 
>> 6.18.10.
>
> Then I'm afraid you have to try bisecting.
>
> On the other hand, I also have an arm64 board (Orion O6) as a VM host.
> The testing arm64 VM there is running a kernel very close to v6.19.0, 
> but has never hit such a crash/corruption.
>
> So I'm wondering whether it may be some driver, specific to RK3588, 
> that is randomly corrupting memory and causing the problem.
>
> In the past (several years ago), we had the amd_sfh driver causing 
> random corruption on x86_64, which led to exactly the same problem 
> (random crashes, btrfs corruption detected, etc.).
> So I guess it could be the same situation.
>
> Thanks,
> Qu
>
Hello,

I think that prior to bisecting, I will rebuild 6.18.9 from source 
instead of using the one compiled last week and let it run for some 
days, to be 100% sure that nothing on my build system has changed in a 
way that leads to corrupted builds. I am building using clang version 
21.1.8. One notable change on my build system between compiling 6.18.9 
and 6.18.10 was an update from glibc 2.42 to glibc 2.43. Even though 
the kernel itself doesn't use glibc, I want to make sure that there is 
no bad interaction between LLVM and the new glibc leading to bad code 
generation in some corner cases.

If the resulting 6.18.9 kernel runs stable for a few days, I will 
continue by applying the patches between 6.18.9 and 6.18.10 one by one, 
beginning with the most suspicious ones and letting each run for a day 
in between, to see when the problem appears. So it will probably take 
some time before I have a result.

Thanks,

David Arendt

>>
>> Thanks,
>>
>> David Arendt
>>
>>>
>>> Thanks,
>>> Qu
>>>
>>>
>>>>
>>>> Unfortunately I don't have more information at the moment.
>>>>
>>>> Thanks in advance,
>>>>
>>>> David Arendt
>>>>
>>>>
>>>
>>
>>
>



