Random reboots on ODROID-N2+

Robin Murphy robin.murphy at arm.com
Fri Jul 23 10:47:21 PDT 2021


On 2021-07-23 17:14, Robin Murphy wrote:
> On 2021-07-23 16:56, Stefan Agner wrote:
>> Hi Byron, Hi Robin,
>>
>> Very interesting findings!
>>
>> On 2021-07-23 17:36, Robin Murphy wrote:
>>> On 2021-07-23 15:25, Byron Stanoszek wrote:
>>>> On Tue, 22 Jun 2021, Stefan Agner wrote:
>>>>
>>>>> On 2021-05-17 11:14, Stefan Agner wrote:
>>>>>> Hi,
>>>>>>
>>>>>> We are currently testing a new release using Linux 5.10.33. Since then,
>>>>>> I've received several reports of random reboots every couple of days.
>>>>>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>>>>>> at some point.
>>>>>>
>>>>>> After running serial console on several instances, I was able to catch
>>>>>> this stack trace:
>>>>>>
>>>>>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>>>>>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 #1
>>>>>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>>>>>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>>>>>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>>>>>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>>>>>
>>>>> <snip>
>>>>>
>>>>> We do see those crashes in similar frequency with Linux 5.12:
>>>>>
>>>>> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>>>>>
>>>>> It seems load and/or hardware dependent, since we see it quite
>>>>> frequently on some devices (every few days), while on others it takes
>>>>> multiple weeks. Of course the ones where we see it frequently are the
>>>>> ones in production :).
>>>>>
>>>>> I am currently trying different stress-ng and other loads to increase
>>>>> the crash rate before trying to git bisect it.
>>>>
>>>> I have an Odroid-N2+ and was able to track this problem down. The problem is
>>>> related to the following dmesg line that reads "failed to reserve memory"
>>>> below:
>>>>
>>>> Machine model: Hardkernel ODROID-N2Plus
>>>> memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
>>>> memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
>>>> memblock_reserve: [0x0000000008210000-0x0000000008baffff] 0xffffffc0107e36dc
>>>> memblock_reserve: [0x0000000005000000-0x00000000052fffff] 0xffffffc0107feb50
>>>> OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>
>> In my 5.9 builds that line isn't present, and it seems all logs I stored
>> from 5.10 builds have the change already and show this line.
>>
>>>> memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ff87c
>>>> OF: reserved mem: node linux,cma compatible matching fail
>>>> memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
>>>> ...
>>>>
>>>> A subsequent "cat /proc/iomem" shows that this memory region is still reserved
>>>> and the system appears to operate normally, until eventually the SError
>>>> Interrupt comes in under heavy memory/page-cache usage. The difference with
>>>> later kernels is that now the memory at 0x5000000-0x52fffff is registered under
>>>> the "System RAM" memory area, whereas previous kernels had dropped it from
>>>> "System RAM".
>>>>
>>>> The culprit is this new code introduced in Linux 5.12, in this function in
>>>> drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():
>>
>> It seems that patch also got backported, which is why I see it with
>> 5.10 as well.
>>
>>>>
>>>> int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
>>>>                                           phys_addr_t size, bool nomap)
>>>> {
>>>>           if (nomap) {
>>>>                   /*
>>>>                    * If the memory is already reserved (by another region), we
>>>>                    * should not allow it to be marked nomap.
>>>>                    */
>>>>                   if (memblock_is_region_reserved(base, size))  <------
>>>>                           return -EBUSY;                        <------
>>>>
>>>>                   return memblock_mark_nomap(base, size);
>>>>           }
>>>>           return memblock_reserve(base, size);
>>>> }
>>>>
>>>> "nomap" is true, due to this text being present in the FDT:
>>>>
>>>>      reserved-memory {
>>>>        ranges;
>>>>        secmon_reserved: secmon@5000000 {
>>>>          reg = <0x0 0x05000000 0x0 0x300000>;
>>>>          no-map;
>>>>        };
>>>>        ...
>>>>
>>>> But memblock_is_region_reserved() is returning true because there is already an
>>>> entry for 0x5000000-0x52fffff in the memory map, which is already marked
>>>> reserved, at the time the __reserved_mem_reserve_reg() function is called.
>>>> (Perhaps this is being set reserved by u-boot? -- I did not research that far.)
>>>>
>>>> This function is defined as:
>>>>
>>>> bool __init_memblock memblock_is_region_reserved(phys_addr_t base, phys_addr_t size)
>>>> {
>>>>           return memblock_overlaps_region(&memblock.reserved, base, size);
>>>> }
>>>>
>>>> Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the existing
>>>> reserved region "0x5000000-0x52fffff", the function returns true.
>>>>
>>>> If I comment out the "if (memblock_is_region_reserved(base, size))" code and
>>>> allow it to mark the region no-map, then the memory area is properly removed
>>>> from the "System RAM" area and the crashing stops.
>>>>
>>>> I've had the system up and running for 15 days now under heavy load without any
>>>> crashes, using just the following patch as workaround:
>>>>
>>>>
>>>> --- linux-5.13.0/drivers/of/fdt.c.bak    2021-07-07 00:22:58.000000000 -0400
>>>> +++ linux-5.13.0/drivers/of/fdt.c    2021-07-07 00:23:08.000000000 -0400
>>>> @@ -1157,13 +1157,6 @@
>>>>                        phys_addr_t size, bool nomap)
>>>>    {
>>>>        if (nomap) {
>>>> -        /*
>>>> -         * If the memory is already reserved (by another region), we
>>>> -         * should not allow it to be marked nomap.
>>>> -         */
>>>> -        if (memblock_is_region_reserved(base, size))
>>>> -            return -EBUSY;
>>>> -
>>>>            return memblock_mark_nomap(base, size);
>>>>        }
>>>>        return memblock_reserve(base, size);
>>>>
>>>>
>>>> The above patch applies to later versions of Linux 5.10.x through 5.12.x as
>>>> well.
>>
>> Even though it is probably not the correct solution, I'll give this a try on
>> my end just to verify that we are indeed experiencing the same problem and
>> that the change fixes it for me too.
>>
>>>>
>>>> Perhaps a more proper fix is to allow the no-map to still proceed, in the case
>>>> that the existing reserved region is identical (same start/end) to the region
>>>> getting marked no-map.
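
For illustration only, the relaxed check proposed above could be modelled
roughly as follows. This is a userspace sketch, not kernel code and not a
tested patch: struct region, regions_overlap(), regions_identical() and
nomap_allowed() are hypothetical names, and the real memblock bookkeeping
(merged regions, flags) is more involved than a flat array.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a reserved range; not the kernel's struct. */
struct region {
	uint64_t base;
	uint64_t size;
};

/* Ranges [base, base + size) overlap if each starts before the other ends. */
bool regions_overlap(struct region a, struct region b)
{
	return a.base < b.base + b.size && b.base < a.base + a.size;
}

/* Same start and same length, i.e. same start/end. */
bool regions_identical(struct region a, struct region b)
{
	return a.base == b.base && a.size == b.size;
}

/*
 * Assumed policy: marking req no-map is allowed if it overlaps no reserved
 * region, or if every reserved region it overlaps is identical to it.
 * A partial overlap is still refused.
 */
bool nomap_allowed(const struct region *reserved, size_t n, struct region req)
{
	for (size_t i = 0; i < n; i++) {
		if (regions_overlap(reserved[i], req) &&
		    !regions_identical(reserved[i], req))
			return false;
	}
	return true;
}
```

With the reservations from the log above, a request for exactly
0x5000000-0x52fffff would be allowed under this policy, while a request that
only partially overlaps that range would still be rejected.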
>>>
>>> If U-Boot is marking regions with the wrong type/attributes in the EFI
>>> memory map, then the best thing to do would be to fix that. I see a
>>> fairly recent commit which looks suspiciously relevant:
>>>
>>> https://source.denx.de/u-boot/u-boot/-/commit/9ff9f4b4268946f3b73d9759766ccfcc599da004 
>>>
>>
>> It seems that this patch went into U-Boot 2021.04 which I am using, so
>> that (alone) seems not to fix the mapping.
>>
>>>
>>> Booting with "efi=debug" should (among other things) print the memory
>>> map at boot if you want to double-check that that is the source of the
>>> mismatch. Our EFI code should be perfectly capable of setting the
>>> memblock flag if the region *is* described appropriately, see
>>> reserve_regions() in drivers/firmware/efi/efi-init.c.
>>
>> Booting 5.12.10 with "efi=debug" on U-Boot 2021.04 gave this:
>> [    0.000000] Machine model: Hardkernel ODROID-N2Plus
>> [    0.000000] efi: Getting UEFI parameters from /chosen in DT:
>> [    0.000000] efi: UEFI not found.
>> [    0.000000] OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
>>
>> So it seems UEFI is not in play here?
> 
> Ah, OK, in that case I guess the question remains why does 
> early_init_dt_reserve_memory_arch() think the region is already 
> reserved? My instinctive assumption was an EFI memory map being present; 
> seeing that U-Boot does indeed reflect DT reservations there *and* has 
> had a likely-looking bug recently was then just overwhelmingly 
> suggestive :)

Actually, poking at U-Boot a bit more I find 
meson_board_add_reserved_memory() - can you check /sys/firmware/fdt and 
see if the region ends up being passed as a /memreserve/ as well as a 
proper reserved-memory node?

IIRC the semantics of /memreserve/ aren't really well-defined enough to 
be suitable for the kind of things which require no-map, and my new 
guess is that that's what ends up conflicting here.
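
For reference, /memreserve/ entries live in the blob's memory reservation
block, so they can be listed without decompiling the whole tree. A minimal
sketch, assuming the standard flattened devicetree header layout
(off_mem_rsvmap at byte offset 16 of a 40-byte header, entries stored as
big-endian address/size pairs terminated by an all-zero pair); struct fdt_rsv
and fdt_list_memreserve() are made-up names for illustration:

```c
#include <stddef.h>
#include <stdint.h>

#define FDT_MAGIC	0xd00dfeedu
#define FDT_HDR_SIZE	40	/* v17 header: ten 32-bit big-endian fields */

struct fdt_rsv {
	uint64_t address;
	uint64_t size;
};

/* The FDT is big-endian regardless of the CPU, so decode byte by byte. */
uint32_t be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

uint64_t be64(const uint8_t *p)
{
	return ((uint64_t)be32(p) << 32) | be32(p + 4);
}

/*
 * Copy at most max entries of the memory reservation block into out.
 * Returns the number of entries copied, or -1 if the blob has a bad magic
 * or the block runs past the end of the buffer without a terminator.
 */
int fdt_list_memreserve(const uint8_t *blob, size_t len,
			struct fdt_rsv *out, int max)
{
	size_t off;
	int n = 0;

	if (len < FDT_HDR_SIZE || be32(blob) != FDT_MAGIC)
		return -1;

	off = be32(blob + 16);		/* off_mem_rsvmap */
	while (off <= len - 16) {
		uint64_t address = be64(blob + off);
		uint64_t size = be64(blob + off + 8);

		if (address == 0 && size == 0)
			return n;	/* terminating entry */
		if (n == max)
			return n;	/* caller's buffer is full */
		out[n].address = address;
		out[n].size = size;
		n++;
		off += 16;
	}
	return -1;
}
```

Fed the contents of /sys/firmware/fdt, an entry with address 0x05000000 and
size 0x300000 would suggest the secmon region is indeed being passed as a
/memreserve/ in addition to the reserved-memory node.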

Robin.


