Random reboots on ODROID-N2+

Fri Jul 23 07:25:09 PDT 2021

On Tue, 22 Jun 2021, Stefan Agner wrote:

> On 2021-05-17 11:14, Stefan Agner wrote:
>> Hi,
>>
>> We are currently testing a new release using Linux 5.10.33. I've
>> received since several reports of random reboots every couple of days.
>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>> at some point.
>>
>> After running serial console on several instances, I was able to catch
>> this stack trace:
>>
>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 #1
>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>
> <snip>
>
> We do see those crashes in similar frequency with Linux 5.12:
>
> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>
> It seems load and/or hardware dependent since we see it on some devices
> quite frequent (every few days), and on others it takes multiple weeks.
> Of course the once we see it frequently are the ones in production :).
>
> I am currently trying different stress-ng and other load to accelerate
> the crash rate before then trying to git bisect it.

I have an Odroid-N2+ and was able to track this problem down. It is related to
the "failed to reserve memory" line in the dmesg output below:

Machine model: Hardkernel ODROID-N2Plus
memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
memblock_reserve: [0x0000000008210000-0x0000000008baffff] 0xffffffc0107e36dc
memblock_reserve: [0x0000000005000000-0x00000000052fffff] 0xffffffc0107feb50
OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ff87c
OF: reserved mem: node linux,cma compatible matching fail
memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
...

A subsequent "cat /proc/iomem" shows that this memory region is still reserved,
and the system appears to operate normally until the SError Interrupt
eventually arrives under heavy memory/page-cache usage. The difference in later
kernels is that the memory at 0x5000000-0x52fffff is now registered as part of
the "System RAM" area, whereas previous kernels dropped it from "System RAM".
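
For anyone who wants to check their own board, here is a minimal sketch of
testing whether a given physical address sits inside a "System RAM" entry. It
assumes the usual "start-end : name" layout of /proc/iomem lines; the helper
name iomem_line_is_ram_containing() is made up for this example and is not part
of any kernel or libc API.

```c
#include <inttypes.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns true if `line` looks like a /proc/iomem
 * entry named "System RAM" whose inclusive range [start, end] contains
 * `addr`. Leading indentation (nested entries) is skipped by sscanf.
 */
static bool iomem_line_is_ram_containing(const char *line, uint64_t addr)
{
    uint64_t start, end;
    char name[64];

    /* Expected layout: "<hex start>-<hex end> : <name>" */
    if (sscanf(line, " %" SCNx64 "-%" SCNx64 " : %63[^\n]",
               &start, &end, name) != 3)
        return false;

    return strcmp(name, "System RAM") == 0 && addr >= start && addr <= end;
}
```

Feeding each line of /proc/iomem to this helper with addr = 0x5000000 would
show whether the secmon region is still counted as System RAM. Note that
/proc/iomem has to be read as root, otherwise the addresses are shown as zeros.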

The culprit is the check introduced in Linux 5.12 in the following function in
drivers/of/fdt.c, which is called by __reserved_mem_reserve_reg():

int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
                                         phys_addr_t size, bool nomap)
{
         if (nomap) {
                 /*
                  * If the memory is already reserved (by another region), we
                  * should not allow it to be marked nomap.
                  */
                 if (memblock_is_region_reserved(base, size))  <------
                         return -EBUSY;                        <------

                 return memblock_mark_nomap(base, size);
         }
         return memblock_reserve(base, size);
}

"nomap" is true, due to this text being present in the FDT:

    reserved-memory {
      ranges;
      secmon_reserved: secmon@5000000 {
        reg = <0x0 0x05000000 0x0 0x300000>;
        no-map;
      };
      ...

But memblock_is_region_reserved() returns true because the memory map already
contains an entry for 0x5000000-0x52fffff that is marked reserved by the time
__reserved_mem_reserve_reg() is called. (Perhaps it is set reserved by
U-Boot? -- I did not research that far.)

This function is defined as:

bool __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size)
{
         return memblock_overlaps_region(&memblock.reserved, base, size);
}

Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the existing
reserved region "0x5000000-0x52fffff", the function returns true.
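
To illustrate why an identical region counts as overlapping, here is a small
userspace sketch of the overlap test. It is a simplified model, not the kernel
code: struct region, ranges_overlap() and overlaps_any() are stand-in names
invented for this example, and regions are treated as half-open ranges
[base, base + size).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one reserved region: [base, base + size). */
struct region {
    uint64_t base;
    uint64_t size;
};

/* Two half-open ranges overlap if each one starts before the other ends. */
static bool ranges_overlap(uint64_t base1, uint64_t size1,
                           uint64_t base2, uint64_t size2)
{
    return base1 < base2 + size2 && base2 < base1 + size1;
}

/* Returns true if [base, base + size) overlaps any region in the list. */
static bool overlaps_any(const struct region *regions, size_t count,
                         uint64_t base, uint64_t size)
{
    for (size_t i = 0; i < count; i++) {
        if (ranges_overlap(base, size, regions[i].base, regions[i].size))
            return true;
    }
    return false;
}
```

With a single reserved entry of base 0x5000000 and size 0x300000, asking about
the very same base and size returns true, which is the case that trips the
-EBUSY path above. A range starting at 0x5300000 does not overlap.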

If I comment out the "if (memblock_is_region_reserved(base, size))" code and
allow it to mark the region no-map, then the memory area is properly removed
from the "System RAM" area and the crashing stops.

I've had the system up and running for 15 days now under heavy load without any
crashes, using just the following patch as a workaround:

--- linux-5.13.0/drivers/of/fdt.c.bak	2021-07-07 00:22:58.000000000 -0400
+++ linux-5.13.0/drivers/of/fdt.c	2021-07-07 00:23:08.000000000 -0400
@@ -1157,13 +1157,6 @@
  					phys_addr_t size, bool nomap)
  {
  	if (nomap) {
-		/*
-		 * If the memory is already reserved (by another region), we
-		 * should not allow it to be marked nomap.
-		 */
-		if (memblock_is_region_reserved(base, size))
-			return -EBUSY;
-
  		return memblock_mark_nomap(base, size);
  	}
  	return memblock_reserve(base, size);


The above patch applies to later versions of Linux 5.10.x through 5.12.x as
well.

Perhaps a more proper fix would be to let the no-map marking proceed when the
existing reserved region is identical (same start and end) to the region being
marked no-map.
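
A rough userspace sketch of that idea follows. It is untested against a real
kernel and only models the decision: struct resv_range and nomap_allowed() are
hypothetical names, and the real change would have to be made against the
memblock data structures in drivers/of/fdt.c.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of one reserved range: [base, base + size). */
struct resv_range {
    uint64_t base;
    uint64_t size;
};

/*
 * Returns 0 if marking [base, base + size) no-map should proceed, or
 * -EBUSY if it partially overlaps a different reserved range.
 * An existing range with exactly the same base and size is tolerated,
 * on the assumption that it describes the same region.
 */
static int nomap_allowed(const struct resv_range *reserved, size_t count,
                         uint64_t base, uint64_t size)
{
    for (size_t i = 0; i < count; i++) {
        const struct resv_range *r = &reserved[i];
        bool overlaps = base < r->base + r->size && r->base < base + size;

        if (!overlaps)
            continue;
        if (r->base == base && r->size == size)
            continue;   /* identical region: let no-map go ahead */
        return -EBUSY;  /* overlaps some other reserved region */
    }
    return 0;
}
```

With the secmon range (base 0x5000000, size 0x300000) already in the reserved
list, a no-map request for the same base and size returns 0, while a request
that only partly overlaps it still returns -EBUSY.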

  -Byron