[PATCH] riscv: Fix linear mapping checks for non-contiguous memory regions
Alexandre Ghiti
alex at ghiti.fr
Mon Jun 24 09:49:00 PDT 2024
On 24/06/2024 13:32, Alexandre Ghiti wrote:
> On 24/06/2024 13:20, Stuart Menefy wrote:
>> On Mon, Jun 24, 2024 at 8:29 AM Alexandre Ghiti <alex at ghiti.fr> wrote:
>>>
>>> On 22/06/2024 13:42, Stuart Menefy wrote:
>>>> The RISC-V kernel already has checks to ensure that memory which would
>>>> lie outside of the linear mapping is not used. However those checks
>>>> use memory_limit, which is used to implement the mem= kernel command
>>>> line option (to limit the total amount of memory, not its address
>>>> range). When memory is made up of two or more non-contiguous memory
>>>> banks this check is incorrect.
>>>>
>>>> Two changes are made here:
>>>> - add a call in setup_bootmem() to memblock_cap_memory_range(), which
>>>>   will cause any memory which falls outside the linear mapping to be
>>>>   removed from the memory regions.
>>>> - remove the check in create_linear_mapping_page_table() which was
>>>>   intended to remove memory which is outside the linear mapping based
>>>>   on memory_limit, as it is no longer needed. Note a check for mapping
>>>>   more memory than memory_limit (to implement mem=) is unnecessary
>>>>   because of the existing call to memblock_enforce_memory_limit().
>>>>
>>>> This issue was seen when booting on a SV39 platform with two memory
>>>> banks:
>>>> 0x00,80000000 1GiB
>>>> 0x20,00000000 32GiB
>>>> This memory range is 158GiB from top to bottom, but the linear mapping
>>>> is limited to 128GiB, so the lower block of RAM will be mapped at
>>>> PAGE_OFFSET, and the upper block straddles the top of the linear
>>>> mapping.
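>>>>
>>>> With the 128GiB linear mapping this works out roughly as follows
>>>> (illustrative figures derived from the layout above):
>>>>
>>>>   phys_ram_base              = 0x00,80000000
>>>>   last linearly mappable PA  = 0x00,80000000 + 128GiB = 0x20,80000000
>>>>   top of the upper bank      = 0x28,00000000
>>>>
>>>> so roughly the last 30GiB of the upper bank cannot be reached through
>>>> the linear mapping, yet memblock is still free to allocate from it.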
>>>>
>>>> This causes the following Oops:
>>>> [ 0.000000] Linux version 6.10.0-rc2-gd3b8dd5b51dd-dirty (stuart.menefy at codasip.com) (riscv64-codasip-linux-gcc (GCC) 13.2.0, GNU ld (GNU Binutils) 2.41.0.20231213) #20 SMP Sat Jun 22 11:34:22 BST 2024
>>>> [ 0.000000] memblock_add: [0x0000000080000000-0x00000000bfffffff] early_init_dt_add_memory_arch+0x4a/0x52
>>>> [ 0.000000] memblock_add: [0x0000002000000000-0x00000027ffffffff] early_init_dt_add_memory_arch+0x4a/0x52
>>>> ...
>>>> [ 0.000000] memblock_alloc_try_nid: 23724 bytes align=0x8 nid=-1 from=0x0000000000000000 max_addr=0x0000000000000000 early_init_dt_alloc_memory_arch+0x1e/0x48
>>>> [ 0.000000] memblock_reserve: [0x00000027ffff5350-0x00000027ffffaffb] memblock_alloc_range_nid+0xb8/0x132
>>>> [ 0.000000] Unable to handle kernel paging request at virtual address fffffffe7fff5350
>>>> [ 0.000000] Oops [#1]
>>>> [ 0.000000] Modules linked in:
>>>> [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.10.0-rc2-gd3b8dd5b51dd-dirty #20
>>>> [ 0.000000] Hardware name: codasip,a70x (DT)
>>>> [ 0.000000] epc : __memset+0x8c/0x104
>>>> [ 0.000000] ra : memblock_alloc_try_nid+0x74/0x84
>>>> [ 0.000000] epc : ffffffff805e88c8 ra : ffffffff806148f6 sp : ffffffff80e03d50
>>>> [ 0.000000] gp : ffffffff80ec4158 tp : ffffffff80e0bec0 t0 : fffffffe7fff52f8
>>>> [ 0.000000] t1 : 00000027ffffb000 t2 : 5f6b636f6c626d65 s0 : ffffffff80e03d90
>>>> [ 0.000000] s1 : 0000000000005cac a0 : fffffffe7fff5350 a1 : 0000000000000000
>>>> [ 0.000000] a2 : 0000000000005cac a3 : fffffffe7fffaff8 a4 : 000000000000002c
>>>> [ 0.000000] a5 : ffffffff805e88c8 a6 : 0000000000005cac a7 : 0000000000000030
>>>> [ 0.000000] s2 : fffffffe7fff5350 s3 : ffffffffffffffff s4 : 0000000000000000
>>>> [ 0.000000] s5 : ffffffff8062347e s6 : 0000000000000000 s7 : 0000000000000001
>>>> [ 0.000000] s8 : 0000000000002000 s9 : 00000000800226d0 s10: 0000000000000000
>>>> [ 0.000000] s11: 0000000000000000 t3 : ffffffff8080a928 t4 : ffffffff8080a928
>>>> [ 0.000000] t5 : ffffffff8080a928 t6 : ffffffff8080a940
>>>> [ 0.000000] status: 0000000200000100 badaddr: fffffffe7fff5350 cause: 000000000000000f
>>>> [ 0.000000] [<ffffffff805e88c8>] __memset+0x8c/0x104
>>>> [ 0.000000] [<ffffffff8062349c>] early_init_dt_alloc_memory_arch+0x1e/0x48
>>>> [ 0.000000] [<ffffffff8043e892>] __unflatten_device_tree+0x52/0x114
>>>> [ 0.000000] [<ffffffff8062441e>] unflatten_device_tree+0x9e/0xb8
>>>> [ 0.000000] [<ffffffff806046fe>] setup_arch+0xd4/0x5bc
>>>> [ 0.000000] [<ffffffff806007aa>] start_kernel+0x76/0x81a
>>>> [ 0.000000] Code: b823 02b2 bc23 02b2 b023 04b2 b423 04b2 b823 04b2 (bc23) 04b2
>>>> [ 0.000000] ---[ end trace 0000000000000000 ]---
>>>> [ 0.000000] Kernel panic - not syncing: Attempted to kill the idle task!
>>>> [ 0.000000] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---
>>>>
>>>> The problem is that memblock (unaware that some physical memory cannot
>>>> be used) has allocated memory from the top of memory but which is
>>>> outside the linear mapping region.
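>>>>
>>>> A minimal sketch of the failing sequence (hypothetical call site; the
>>>> real one is early_init_dt_alloc_memory_arch() in the log above):
>>>>
>>>>     /* memblock allocates top-down by default, so this returns a block
>>>>      * near the top of the 32GiB bank (~0x27,ffff5350 in the log)...
>>>>      */
>>>>     void *p = memblock_alloc(23724, 8);
>>>>     /* ...and memblock_alloc() zeroes the block through its linear
>>>>      * mapping address, which for that physical address lies beyond
>>>>      * the end of the linear mapping, hence the fault in __memset().
>>>>      */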
>>>>
>>>> Signed-off-by: Stuart Menefy <stuart.menefy at codasip.com>
>>>> Fixes: c99127c45248 ("riscv: Make sure the linear mapping does not use the kernel mapping")
>>>> Reviewed-by: David McKay <david.mckay at codasip.com>
>>>>
>>>> ---
>>>> arch/riscv/mm/init.c | 15 +++++++++++----
>>>> 1 file changed, 11 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
>>>> index e3405e4b99af..7e25606f858a 100644
>>>> --- a/arch/riscv/mm/init.c
>>>> +++ b/arch/riscv/mm/init.c
>>>> @@ -233,8 +233,6 @@ static void __init setup_bootmem(void)
>>>> */
>>>> memblock_reserve(vmlinux_start, vmlinux_end - vmlinux_start);
>>>>
>>>> - phys_ram_end = memblock_end_of_DRAM();
>>>> -
>>>> /*
>>>> * Make sure we align the start of the memory on a PMD boundary so that
>>>> * at worst, we map the linear mapping with PMD mappings.
>>>> @@ -249,6 +247,16 @@ static void __init setup_bootmem(void)
>>>> if (IS_ENABLED(CONFIG_64BIT) && IS_ENABLED(CONFIG_MMU))
>>>> kernel_map.va_pa_offset = PAGE_OFFSET - phys_ram_base;
>>>>
>>>> + /*
>>>> + * The size of the linear page mapping may restrict the amount of
>>>> + * usable RAM.
>>>> + */
>>>> + if (IS_ENABLED(CONFIG_64BIT)) {
>>>> + max_mapped_addr = __pa(PAGE_OFFSET) + KERN_VIRT_SIZE;
>>>
>>> This is only true for sv39 once the following patch lands
>>> https://lore.kernel.org/linux-riscv/20240514133614.87813-1-alexghiti@rivosinc.com/
>>>
>> Hi Alex
>>
>> I've seen this problem whether your patch has been applied or not.
>
>
> Sure, I just meant the use of KERN_VIRT_SIZE above, sorry it was not
> clear.
>
>
>>
>>> Otherwise, sv39 is "weirdly" restricted to 124GB. You mention in the
>>> changelog that the linear mapping size is limited to 128GB, does that
>>> mean you tested your patch on top of this one ^? If so, would you mind
>>> adding a Tested-by to it? Otherwise, would you mind testing on top of
>>> it :) ?
>> I've tested in both cases, so I'll add a Tested-by.
>
>
> Thanks!
>
>
>>
>> Note that to actually use 128GiB on Sv39 systems yet another patch is
>> needed which I'll also post.
>
>
> Ok, I'm curious! But if indeed another patch is needed, then we need
> to merge the 3 at the same time. I'll add this other patch to my fixes
> branch.
>
> Thanks,
>
> Alex
>
>
>>
>>> I'll see with Palmer, but maybe we can't take both patches to -fixes.
>> All three patches address different issues, so I think it would be
>> safe to take them in any combination. However I understand the
>> reluctance when making changes in this area.
>>
>> Thanks
>>
>> Stuart
>>
>>
>>> You can add:
>>>
>>> Reviewed-by: Alexandre Ghiti <alexghiti at rivosinc.com>
>>>
>>> Thanks,
>>>
>>> Alex
Actually, this patch breaks KASAN on sv39:
[ 0.000000] kernel BUG at arch/riscv/mm/kasan_init.c:393!
[ 0.000000] Kernel BUG [#1]
[ 0.000000] Modules linked in:
[ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 6.10.0-rc5-defconfig_kasan_sparsemem_vmemmmap-g018bb5470bac #1
[ 0.000000] Hardware name: riscv-virtio,qemu (DT)
[ 0.000000] epc : kasan_shallow_populate_pud+0xf2/0x228
[ 0.000000] ra : kasan_shallow_populate_pud+0x2a/0x228
[ 0.000000] epc : ffffffff81c0d2bc ra : ffffffff81c0d1f4 sp : ffffffff82607c60
[ 0.000000] gp : ffffffff82a79548 tp : ffffffff82615740 t0 : ffffffd7fefed000
[ 0.000000] t1 : ffffffd7fefec000 t2 : 0000000000000000 s0 : ffffffff82607ce0
[ 0.000000] s1 : ffffffff82a85ef8 a0 : 000000005fbfb001 a1 : 0000400000000000
[ 0.000000] a2 : 0000000000000000 a3 : dfffffff00000000 a4 : ffffffffffffffff
[ 0.000000] a5 : 0000000000000000 a6 : ffffffd7fefecfff a7 : 1ffffffaffdfd9ff
[ 0.000000] s2 : fffffff800000000 s3 : fffffff800000000 s4 : fffffff7ffffffff
[ 0.000000] s5 : ffffffff8248ab80 s6 : 0000002000000000 s7 : 000000003fffffff
[ 0.000000] s8 : ffffffff8248ab58 s9 : ffffffff8248ab88 s10: ffffffff8248b20c
[ 0.000000] s11: ffffffff8248b210 t3 : fffffff9ffdfda00 t4 : 0000000000000200
[ 0.000000] t5 : 0000000000000000 t6 : ffffffff82ac35e0
[ 0.000000] status: 0000000200000100 badaddr: 0000000000000000 cause: 0000000000000003
[ 0.000000] [<ffffffff81c0d2bc>] kasan_shallow_populate_pud+0xf2/0x228
[ 0.000000] [<ffffffff81c0d1aa>] kasan_shallow_populate_p4d+0x108/0x128
[ 0.000000] [<ffffffff81c0d082>] kasan_shallow_populate_pgd+0x15a/0x17a
[ 0.000000] [<ffffffff81c0bf26>] kasan_init+0x178/0x384
[ 0.000000] [<ffffffff81c060c0>] setup_arch+0x90/0x10a
[ 0.000000] [<ffffffff81c005b0>] start_kernel+0x5e/0x340
[ 0.000000] Code: 6a46 6aa6 6b06 7be2 7c42 7ca2 7d02 6de2 6109 8082 (9002) 0e13
[ 0.000000] ---[ end trace 0000000000000000 ]---
[ 0.000000] Kernel panic - not syncing: Fatal exception in interrupt
[ 0.000000] ---[ end Kernel panic - not syncing: Fatal exception in interrupt ]---
I took a quick look and the issue is that KASAN is populating the kernel
page table but runs into an already populated PUD entry (which is a PGD
entry on sv39) and then hits a BUG() since this should not happen.
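
Roughly, the invariant being violated looks like this (an illustrative
sketch only, not the exact code at kasan_init.c:393; pud_k stands for
the kernel page table entry being visited):

    /* shallow-populate: install a freshly allocated page table in each
     * next-level entry, assuming nothing has populated it yet */
    if (pud_none(pudp_get(pud_k))) {
        p = memblock_alloc(PAGE_SIZE, PAGE_SIZE);
        set_pud(pud_k, pfn_pud(PFN_DOWN(__pa(p)), PAGE_TABLE));
    } else {
        BUG();    /* entry already populated -> fires on sv39 here */
    }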
@Stuart: Can you take a look? A simple defconfig + KASAN run on qemu
with sv39 triggers the issue. I'm currently investigating another
failure with rc5 and once finished I can take over/help if needed.
Thanks,
Alex
>>>
>>>
>>>> + memblock_cap_memory_range(phys_ram_base,
>>>> + max_mapped_addr - phys_ram_base);
>>>> + }
>>>> +
>>>> /*
>>>> * Reserve physical address space that would be mapped to virtual
>>>> * addresses greater than (void *)(-PAGE_SIZE) because:
>>>> @@ -265,6 +273,7 @@ static void __init setup_bootmem(void)
>>>> memblock_reserve(max_mapped_addr,
>>>> (phys_addr_t)-max_mapped_addr);
>>>> }
>>>>
>>>> + phys_ram_end = memblock_end_of_DRAM();
>>>> min_low_pfn = PFN_UP(phys_ram_base);
>>>> max_low_pfn = max_pfn = PFN_DOWN(phys_ram_end);
>>>> high_memory = (void *)(__va(PFN_PHYS(max_low_pfn)));
>>>> @@ -1289,8 +1298,6 @@ static void __init create_linear_mapping_page_table(void)
>>>> if (start <= __pa(PAGE_OFFSET) && __pa(PAGE_OFFSET) < end)
>>>> start = __pa(PAGE_OFFSET);
>>>> - if (end >= __pa(PAGE_OFFSET) + memory_limit)
>>>> - end = __pa(PAGE_OFFSET) + memory_limit;
>>>>
>>>> create_linear_mapping_range(start, end, 0);
>>>> }