Random reboots on ODROID-N2+

Fri Jul 23 08:56:01 PDT 2021

Hi Byron, Hi Robin,

Very interesting findings!

On 2021-07-23 17:36, Robin Murphy wrote:
> On 2021-07-23 15:25, Byron Stanoszek wrote:
>> On Tue, 22 Jun 2021, Stefan Agner wrote:
>>
>>> On 2021-05-17 11:14, Stefan Agner wrote:
>>>> Hi,
>>>>
>>>> We are currently testing a new release using Linux 5.10.33. I've
>>>> since received several reports of random reboots every couple of days.
>>>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>>>> at some point.
>>>>
>>>> After running serial console on several instances, I was able to catch
>>>> this stack trace:
>>>>
>>>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>>>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 #1
>>>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>>>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>>>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>>>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>>>
>>> <snip>
>>>
>>> We do see those crashes in similar frequency with Linux 5.12:
>>>
>>> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>>>
>>> It seems load and/or hardware dependent since we see it on some devices
>>> quite frequently (every few days), and on others it takes multiple weeks.
>>> Of course the ones where we see it frequently are the ones in production :).
>>>
>>> I am currently trying different stress-ng and other loads to accelerate
>>> the crash rate before trying to git bisect it.
>>
>> I have an Odroid-N2+ and was able to track this problem down. The problem is
>> related to the following dmesg line that reads "failed to reserve memory"
>> below:
>>
>> Machine model: Hardkernel ODROID-N2Plus
>> memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
>> memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
>> memblock_reserve: [0x0000000008210000-0x0000000008baffff] 0xffffffc0107e36dc
>> memblock_reserve: [0x0000000005000000-0x00000000052fffff] 0xffffffc0107feb50
>> OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB

In my 5.9 builds that line isn't present, and it seems all the logs I
stored from 5.10 builds already include the change and show this line.

>> memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ff87c
>> OF: reserved mem: node linux,cma compatible matching fail
>> memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
>> ...
>>
>> A subsequent "cat /proc/iomem" shows that this memory region is still reserved
>> and the system appears to operate normally, until eventually the SError
>> Interrupt comes in under heavy memory/page-cache usage. The difference with
>> later kernels is that now the memory at 0x5000000-0x52fffff is registered under
>> the "System RAM" memory area, whereas previous kernels had dropped it from
>> "System RAM".
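
To make that check repeatable across devices, here is a rough sketch of
a helper that scans /proc/iomem text for a "System RAM" entry covering a
given range. The function name and buffer sizes are my own choices, and
it assumes the usual "start-end : name" line format:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true if [base, base + size) lies entirely inside one entry named
 * "System RAM" in the given /proc/iomem text. Entries with any other name
 * (for example a nested "reserved" child) are not considered.
 */
static bool range_in_system_ram(const char *text, unsigned long long base,
                                unsigned long long size)
{
    unsigned long long last;

    assert(size > 0);           /* an empty range has no last byte */
    last = base + size - 1;

    while (*text) {
        unsigned long long start, end;
        char line[128], name[64];
        size_t n = strcspn(text, "\n");

        if (n < sizeof(line)) {
            memcpy(line, text, n);
            line[n] = '\0';
            /* Lines look like "05000000-052fffff : System RAM". */
            if (sscanf(line, " %llx-%llx : %63[^\n]", &start, &end, name) == 3 &&
                strcmp(name, "System RAM") == 0 &&
                start <= base && last <= end)
                return true;
        }
        text += n;
        if (*text == '\n')
            text++;
    }
    return false;
}
```

Feeding it the contents of /proc/iomem (read as root, since unprivileged
reads show zeroed addresses) with base 0x05000000 and size 0x300000
should, if Byron's description holds, return true on the affected
kernels and false on the older ones.
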
>>
>> The culprit is this new code introduced in Linux 5.12, in this function in
>> drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():

It seems that patch also got backported, which is why I see it with
5.10 as well.

>>
>> int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
>>                                          phys_addr_t size, bool nomap)
>> {
>>          if (nomap) {
>>                  /*
>>                   * If the memory is already reserved (by another region), we
>>                   * should not allow it to be marked nomap.
>>                   */
>>                  if (memblock_is_region_reserved(base, size))  <------
>>                          return -EBUSY;                        <------
>>
>>                  return memblock_mark_nomap(base, size);
>>          }
>>          return memblock_reserve(base, size);
>> }
>>
>> "nomap" is true, due to this text being present in the FDT:
>>
>>     reserved-memory {
>>       ranges;
>>       secmon_reserved: secmon@5000000 {
>>         reg = <0x0 0x05000000 0x0 0x300000>;
>>         no-map;
>>       };
>>       ...
>>
>> But memblock_is_region_reserved() is returning true because there is already an
>> entry for 0x5000000-0x52fffff in the memory map, which is already marked
>> reserved, at the time the __reserved_mem_reserve_reg() function is called.
>> (Perhaps this is being set reserved by u-boot? -- I did not research that far.)
>>
>> This function is defined as:
>>
>> bool __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size)
>> {
>>          return memblock_overlaps_region(&memblock.reserved, base, size);
>> }
>>
>> Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the existing
>> reserved region "0x5000000-0x52fffff", the function returns true.
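
For anyone following along, the logic described above can be modelled
in userspace. This is only a sketch: the struct and the model_* names
are mine, not kernel symbols, and the reserved list is a plain array
rather than the real memblock structures.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;

/* Simplified stand-in for one entry of memblock.reserved. */
struct region {
    phys_addr_t base;
    phys_addr_t size;
};

/* True if [b1, b1 + s1) and [b2, b2 + s2) share at least one byte. */
static bool model_addrs_overlap(phys_addr_t b1, phys_addr_t s1,
                                phys_addr_t b2, phys_addr_t s2)
{
    return b1 < b2 + s2 && b2 < b1 + s1;
}

/* Models the "is any part of this range already reserved?" question. */
static bool model_is_region_reserved(const struct region *reserved, size_t n,
                                     phys_addr_t base, phys_addr_t size)
{
    for (size_t i = 0; i < n; i++)
        if (model_addrs_overlap(base, size, reserved[i].base, reserved[i].size))
            return true;
    return false;
}

/* Models the nomap branch: refuse if the range is already reserved. */
static int model_reserve_nomap(const struct region *reserved, size_t n,
                               phys_addr_t base, phys_addr_t size)
{
    assert(size > 0);   /* zero-sized requests are not meaningful here */

    if (model_is_region_reserved(reserved, n, base, size))
        return -EBUSY;
    return 0;           /* the kernel would mark the range no-map here */
}
```

With a reserved list that already holds base 0x05000000, size 0x300000,
a no-map request for that same range gets -EBUSY, which matches the
"failed to reserve memory" message in the log.
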
>>
>> If I comment out the "if (memblock_is_region_reserved(base, size))" code and
>> allow it to mark the region no-map, then the memory area is properly removed
>> from the "System RAM" area and the crashing stops.
>>
>> I've had the system up and running for 15 days now under heavy load without any
>> crashes, using just the following patch as workaround:
>>
>>
>> --- linux-5.13.0/drivers/of/fdt.c.bak    2021-07-07 00:22:58.000000000 -0400
>> +++ linux-5.13.0/drivers/of/fdt.c    2021-07-07 00:23:08.000000000 -0400
>> @@ -1157,13 +1157,6 @@
>>                       phys_addr_t size, bool nomap)
>>   {
>>       if (nomap) {
>> -        /*
>> -         * If the memory is already reserved (by another region), we
>> -         * should not allow it to be marked nomap.
>> -         */
>> -        if (memblock_is_region_reserved(base, size))
>> -            return -EBUSY;
>> -
>>           return memblock_mark_nomap(base, size);
>>       }
>>       return memblock_reserve(base, size);
>>
>>
>> The above patch applies to later versions of Linux 5.10.x through 5.12.x as
>> well.

Even though it is probably not the correct solution, I'll give this a
try on my end just to verify that we are indeed experiencing the same
problem and that the change fixes it for me too.

>>
>> Perhaps a more proper fix is to allow the no-map to still proceed, in the case
>> that the existing reserved region is identical (same start/end) to the region
>> getting marked no-map.
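
A userspace sketch of what such a relaxed check might look like, purely
to illustrate the idea. The names are hypothetical and this is not a
proposed kernel patch. One caveat I am aware of: memblock merges
adjacent reserved regions, so an exact base/size match may not be
visible in the real reserved list.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;

/* Simplified stand-in for one entry of memblock.reserved. */
struct region {
    phys_addr_t base;
    phys_addr_t size;
};

/*
 * Hypothetical relaxed check: refuse no-map only when the request overlaps
 * a reserved region that is NOT the very same range.
 */
static int nomap_check_relaxed(const struct region *reserved, size_t n,
                               phys_addr_t base, phys_addr_t size)
{
    assert(size > 0);   /* zero-sized requests are not meaningful here */

    for (size_t i = 0; i < n; i++) {
        const struct region *r = &reserved[i];
        bool overlap = base < r->base + r->size && r->base < base + size;
        bool identical = r->base == base && r->size == size;

        if (overlap && !identical)
            return -EBUSY;
    }
    return 0;   /* the caller would go on to mark the range no-map */
}
```

In this model the secmon range (base 0x05000000, size 0x300000) passes
when the only overlapping reservation is identical to it, while a
partially overlapping request is still rejected.
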
> 
> If U-Boot is marking regions with the wrong type/attributes in the EFI
> memory map, then the best thing to do would be to fix that. I see a
> fairly recent commit which looks suspiciously relevant:
> 
> https://source.denx.de/u-boot/u-boot/-/commit/9ff9f4b4268946f3b73d9759766ccfcc599da004

It seems that this patch went into U-Boot 2021.04, which I am using, so
that patch alone does not seem to fix the mapping.

> 
> Booting with "efi=debug" should (among other things) print the memory
> map at boot if you want to double-check that that is the source of the
> mismatch. Our EFI code should be perfectly capable of setting the
> memblock flag if the region *is* described appropriately, see
> reserve_regions() in drivers/firmware/efi/efi-init.c.

Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
[    0.000000] Machine model: Hardkernel ODROID-N2Plus
[    0.000000] efi: Getting UEFI parameters from /chosen in DT:
[    0.000000] efi: UEFI not found.
[    0.000000] OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB

So it seems UEFI is not in play here?

--
Stefan