Random reboots on ODROID-N2+
Robin Murphy
robin.murphy at arm.com
Fri Jul 23 08:36:39 PDT 2021
On 2021-07-23 15:25, Byron Stanoszek wrote:
> On Tue, 22 Jun 2021, Stefan Agner wrote:
>
>> On 2021-05-17 11:14, Stefan Agner wrote:
>>> Hi,
>>>
>>> We are currently testing a new release using Linux 5.10.33. Since then
>>> I've received several reports of random reboots every couple of days.
>>> Unfortunately the log (journald) doesn't show anything, just a hard cut
>>> at some point.
>>>
>>> After running serial console on several instances, I was able to catch
>>> this stack trace:
>>>
>>> [202983.988153] SError Interrupt on CPU3, code 0xbf000000 -- SError
>>> [202983.988155] CPU: 3 PID: 3463 Comm: mdns-repeater Not tainted 5.10.33 #1
>>> [202983.988156] Hardware name: Hardkernel ODROID-N2Plus (DT)
>>> [202983.988157] pstate: 80000005 (Nzcv daif -PAN -UAO -TCO BTYPE=--)
>>> [202983.988158] pc : udp_send_skb.isra.0+0x178/0x390
>>> [202983.988159] lr : udp_send_skb.isra.0+0x130/0x390
>>
>> <snip>
>>
>> We do see those crashes in similar frequency with Linux 5.12:
>>
>> [129988.642342] SError Interrupt on CPU4, code 0xbf000000 -- SError
>>
>> It seems load and/or hardware dependent, since we see it on some devices
>> quite frequently (every few days), and on others it takes multiple weeks.
>> Of course the ones where we see it frequently are the ones in production :).
>>
>> I am currently trying different stress-ng and other load to accelerate
>> the crash rate before then trying to git bisect it.
>
> I have an Odroid-N2+ and was able to track this problem down. The problem is
> related to the following dmesg line that reads "failed to reserve memory"
> below:
>
> Machine model: Hardkernel ODROID-N2Plus
> memblock_remove: [0x0001000000000000-0x0000fffffffffffe] 0xffffffc0107e3604
> memblock_remove: [0x0000004000000000-0x0000003ffffffffe] 0xffffffc0107e3664
> memblock_reserve: [0x0000000008210000-0x0000000008baffff] 0xffffffc0107e36dc
> memblock_reserve: [0x0000000005000000-0x00000000052fffff] 0xffffffc0107feb50
> OF: fdt: Reserved memory: failed to reserve memory for node 'secmon@5000000': base 0x0000000005000000, size 3 MiB
> memblock_reserve: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ff87c
> OF: reserved mem: node linux,cma compatible matching fail
> memblock_free: [0x00000000e4c00000-0x00000000f4bfffff] 0xffffffc0107ffca8
> ...
>
> A subsequent "cat /proc/iomem" shows that this memory region is still reserved
> and the system appears to operate normally, until eventually the SError
> Interrupt comes in under heavy memory/page-cache usage. The difference with
> later kernels is that now the memory at 0x5000000-0x52fffff is registered under
> the "System RAM" memory area, whereas previous kernels had dropped it from
> "System RAM".
>
> The culprit is this new code introduced in Linux 5.12, in this function in
> drivers/of/fdt.c, called by function __reserved_mem_reserve_reg():
>
> int __init __weak early_init_dt_reserve_memory_arch(phys_addr_t base,
> 					phys_addr_t size, bool nomap)
> {
> 	if (nomap) {
> 		/*
> 		 * If the memory is already reserved (by another region), we
> 		 * should not allow it to be marked nomap.
> 		 */
> 		if (memblock_is_region_reserved(base, size))   <------
> 			return -EBUSY;                         <------
>
> 		return memblock_mark_nomap(base, size);
> 	}
> 	return memblock_reserve(base, size);
> }
>
> "nomap" is true, due to this text being present in the FDT:
>
> reserved-memory {
> 	ranges;
> 	secmon_reserved: secmon@5000000 {
> 		reg = <0x0 0x05000000 0x0 0x300000>;
> 		no-map;
> 	};
> 	...
>
> But memblock_is_region_reserved() is returning true because there is already an
> entry for 0x5000000-0x52fffff in the memory map, which is already marked
> reserved, at the time the __reserved_mem_reserve_reg() function is called.
> (Perhaps this is being set reserved by u-boot? -- I did not research that far.)
>
> This function is defined as:
>
> bool __init_memblock memblock_is_region_reserved(phys_addr_t base,
> phys_addr_t size)
> {
> return memblock_overlaps_region(&memblock.reserved, base, size);
> }
>
> Since the region to mark no-map, "0x5000000-0x52fffff", overlaps the existing
> reserved region "0x5000000-0x52fffff", the function returns true.
>
> If I comment out the "if (memblock_is_region_reserved(base, size))" code and
> allow it to mark the region no-map, then the memory area is properly removed
> from the "System RAM" area and the crashing stops.
>
> I've had the system up and running for 15 days now under heavy load without any
> crashes, using just the following patch as workaround:
>
>
> --- linux-5.13.0/drivers/of/fdt.c.bak	2021-07-07 00:22:58.000000000 -0400
> +++ linux-5.13.0/drivers/of/fdt.c	2021-07-07 00:23:08.000000000 -0400
> @@ -1157,13 +1157,6 @@
>  					phys_addr_t size, bool nomap)
>  {
>  	if (nomap) {
> -		/*
> -		 * If the memory is already reserved (by another region), we
> -		 * should not allow it to be marked nomap.
> -		 */
> -		if (memblock_is_region_reserved(base, size))
> -			return -EBUSY;
> -
>  		return memblock_mark_nomap(base, size);
>  	}
>  	return memblock_reserve(base, size);
>
>
> The above patch applies to later versions of Linux 5.10.x through 5.12.x as
> well.
>
> Perhaps a more proper fix is to allow the no-map to still proceed, in the case
> that the existing reserved region is identical (same start/end) to the region
> getting marked no-map.

If U-Boot is marking regions with the wrong type/attributes in the EFI
memory map, then the best thing to do would be to fix that. I see a
fairly recent commit which looks suspiciously relevant:

https://source.denx.de/u-boot/u-boot/-/commit/9ff9f4b4268946f3b73d9759766ccfcc599da004

Booting with "efi=debug" should (among other things) print the memory
map at boot if you want to double-check that that is the source of the
mismatch. Our EFI code should be perfectly capable of setting the
memblock flag if the region *is* described appropriately, see
reserve_regions() in drivers/firmware/efi/efi-init.c.

Robin.