Random reboots on ODROID-N2+

Fri Jul 23 08:36:39 PDT 2021

On 2021-07-23 15:25, Byron Stanoszek wrote:
> On Tue, 22 Jun 2021, Stefan Agner wrote:
> 
>> On 2021-05-17 11:14, Stefan Agner wrote:
>>> Hi,
>>>
>>> We are currently testing a new release using Linux 5.10.33. Since then,
>>> I've received several reports of random reboots every couple of days.
>>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>>> at some point.
>>>
>>> After running serial console on several instances, I was able to catch
>>> this stack trace:
>>>
>>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 #1
>>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>>
>> <snip>
>>
>> We do see those crashes in similar frequency with Linux 5.12:
>>
>> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>>
>> It seems load and/or hardware dependent since we see it on some devices
>> quite frequently (every few days), and on others it takes multiple weeks.
>> Of course the ones where we see it frequently are the ones in production :).
>>
>> I am currently trying stress-ng and other loads to raise the crash rate
>> before trying to git bisect it.
> 
> I have an Odroid-N2+ and was able to track this problem down. The problem is
> related to the dmesg line that reads "failed to reserve memory" below:
> 
> Machine model: Hardkernel ODROID-N2Plus
> memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
> memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
> memblock_reserve: [0x0000000008210000-0x0000000008baffff] 0xffffffc0107e36dc
> memblock_reserve: [0x0000000005000000-0x00000000052fffff] 0xffffffc0107feb50
> OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
> memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ff87c
> OF: reserved mem: node linux,cma compatible matching fail
> memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
> ...
> 
> A subsequent "cat /proc/iomem" shows that this memory region is still reserved
> and the system appears to operate normally, until eventually the SError
> Interrupt comes in under heavy memory/page-cache usage. The difference with
> later kernels is that the memory at 0x5000000-0x52fffff is now registered under
> the "System RAM" memory area, whereas previous kernels had dropped it from
> "System RAM".
> 
> The culprit is this new code introduced in Linux 5.12, in this function in
> drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():
> 
> int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
>                                          phys_addr_t size, bool nomap)
> {
>          if (nomap) {
>                  /*
>                   * If the memory is already reserved (by another region), we
>                   * should not allow it to be marked nomap.
>                   */
>                  if (memblock_is_region_reserved(base, size))  <------
>                          return -EBUSY;                        <------
> 
>                  return memblock_mark_nomap(base, size);
>          }
>          return memblock_reserve(base, size);
> }
> 
> "nomap" is true, due to this text being present in the FDT:
> 
>     reserved-memory {
>       ranges;
>       secmon_reserved: secmon@5000000 {
>         reg = <0x0 0x05000000 0x0 0x300000>;
>         no-map;
>       };
>       ...
> 
> But memblock_is_region_reserved() returns true because, by the time the
> __reserved_mem_reserve_reg() function is called, the memory map already
> contains an entry for 0x5000000-0x52fffff that is marked reserved.
> (Perhaps this is being set reserved by u-boot? I did not research that far.)
> 
> This function is defined as:
> 
> bool __init_memblock memblock_is_region_reserved(phys_addr_t base,
>                                                  phys_addr_t size)
> {
>          return memblock_overlaps_region(&memblock.reserved, base, size);
> }
> 
> Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the existing
> reserved region "0x5000000-0x52fffff", the function returns true.
> 
> If I comment out the "if (memblock_is_region_reserved(base, size))" code and
> allow it to mark the region no-map, then the memory area is properly removed
> from the "System RAM" area and the crashing stops.
> 
> I've had the system up and running for 15 days now under heavy load without any
> crashes, using just the following patch as a workaround:
> 
> 
> --- linux-5.13.0/drivers/of/fdt.c.bak    2021-07-07 00:22:58.000000000 -0400
> +++ linux-5.13.0/drivers/of/fdt.c    2021-07-07 00:23:08.000000000 -0400
> @@ -1157,13 +1157,6 @@
>                       phys_addr_t size, bool nomap)
>   {
>       if (nomap) {
> -        /*
> -         * If the memory is already reserved (by another region), we
> -         * should not allow it to be marked nomap.
> -         */
> -        if (memblock_is_region_reserved(base, size))
> -            return -EBUSY;
> -
>           return memblock_mark_nomap(base, size);
>       }
>       return memblock_reserve(base, size);
> 
> 
> The above patch applies to later versions of Linux 5.10.x through 5.12.x as
> well.
> 
> Perhaps a more proper fix is to allow the no-map to still proceed when the
> existing reserved region is identical (same start/end) to the region being
> marked no-map.
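
To make the suggestion above concrete, here is a minimal user-space sketch of
the two checks involved. It is not kernel code: struct toy_region and the
toy_* helpers are hypothetical stand-ins for the memblock reserved table, and
the "identical region" policy is only the idea floated above, not an existing
patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for one reserved region; not the kernel's struct. */
struct toy_region {
	uint64_t base;
	uint64_t size;
};

/* True if [base, base + size) intersects any region in the table. */
static bool toy_overlaps(const struct toy_region *r, size_t n,
			 uint64_t base, uint64_t size)
{
	for (size_t i = 0; i < n; i++)
		if (base < r[i].base + r[i].size && r[i].base < base + size)
			return true;
	return false;
}

/* True if some region has exactly the same start and size. */
static bool toy_exact_match(const struct toy_region *r, size_t n,
			    uint64_t base, uint64_t size)
{
	for (size_t i = 0; i < n; i++)
		if (r[i].base == base && r[i].size == size)
			return true;
	return false;
}

/*
 * Hypothetical policy: allow no-map when nothing overlaps, or when the
 * overlap is with an identical reservation; refuse partial overlaps.
 */
static bool toy_nomap_allowed(const struct toy_region *r, size_t n,
			      uint64_t base, uint64_t size)
{
	if (!toy_overlaps(r, n, base, size))
		return true;
	return toy_exact_match(r, n, base, size);
}
```

With a table holding the two reservations from the dmesg output, a request
for base 0x05000000, size 0x300000 overlaps (which is what triggers -EBUSY
today) but would be allowed under this policy, while a request that only
partly covers that range would still be refused. One caveat: the real
memblock code merges neighbouring regions, so an exact match may not always
be visible in the reserved table; the toy table ignores that.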

If U-Boot is marking regions with the wrong type/attributes in the EFI 
memory map, then the best thing to do would be to fix that. I see a 
fairly recent commit which looks suspiciously relevant:

https://source.denx.de/u-boot/u-boot/-/commit/9ff9f4b4268946f3b73d9759766ccfcc599da004

Booting with "efi=debug" should (among other things) print the memory 
map at boot if you want to double-check that that is the source of the 
mismatch. Our EFI code should be perfectly capable of setting the 
memblock flag if the region *is* described appropriately, see 
reserve_regions() in drivers/firmware/efi/efi-init.c.
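
As a rough illustration of why the region type matters, here is a simplified
user-space model of that decision, written from memory rather than copied
from the kernel. The toy_* names are hypothetical; the type numbers and the
WB attribute bit follow the UEFI specification. ACPI reclaim, persistent and
soft-reserved memory are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Subset of UEFI memory types, numbered as in the UEFI specification. */
enum toy_efi_type {
	TOY_EFI_RESERVED_TYPE         = 0,
	TOY_EFI_LOADER_CODE           = 1,
	TOY_EFI_LOADER_DATA           = 2,
	TOY_EFI_BOOT_SERVICES_CODE    = 3,
	TOY_EFI_BOOT_SERVICES_DATA    = 4,
	TOY_EFI_RUNTIME_SERVICES_CODE = 5,
	TOY_EFI_RUNTIME_SERVICES_DATA = 6,
	TOY_EFI_CONVENTIONAL_MEMORY   = 7,
};

/* Write-back cacheable attribute bit from the UEFI specification. */
#define TOY_EFI_MEMORY_WB 0x8ULL

/*
 * Simplified assumption: memory the OS may reuse after boot services
 * exit counts as usable RAM only if it can be mapped write-back.
 */
static bool toy_is_usable(enum toy_efi_type type, uint64_t attr)
{
	switch (type) {
	case TOY_EFI_LOADER_CODE:
	case TOY_EFI_LOADER_DATA:
	case TOY_EFI_BOOT_SERVICES_CODE:
	case TOY_EFI_BOOT_SERVICES_DATA:
	case TOY_EFI_CONVENTIONAL_MEMORY:
		return (attr & TOY_EFI_MEMORY_WB) != 0;
	default:
		return false;
	}
}

/* Anything that is not usable RAM would get the nomap flag. */
static bool toy_should_mark_nomap(enum toy_efi_type type, uint64_t attr)
{
	return !toy_is_usable(type, attr);
}
```

In this model a region reported as reserved type ends up nomap, whereas the
same range reported as boot services data with the WB attribute is treated
as ordinary RAM.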

Robin.