[PATCH V1 0/3] Revert huge-paged linear mapping and its related fixups
Song Shuai
suagrfillet at gmail.com
Tue Jun 27 08:13:11 PDT 2023
Hi Alex,
在 2023/6/27 19:47, Alexandre Ghiti 写道:
> On Sun, Jun 25, 2023 at 10:36 PM Alexandre Ghiti <alexghiti at rivosinc.com> wrote:
>>
>> Hi Song,
>>
>> On Sun, Jun 25, 2023 at 4:10 PM Song Shuai <songshuaishuai at tinylab.org> wrote:
>>>
>>> We have encountered these two issues about huge-paged linear mapping since v6.4-rc1:
>>>
>>> 1. Bug report: kernel paniced when system hibernates[1]
>>>
>>> OpenSbi [v0.8,v1.3) set the PMP regions as !no-map, and the commit 3335068f8721
>>> ("riscv: Use PUD/P4D/PGD pages for the linear mapping") mapped them in linear mapping.
>>> The hibernation process attempted to save/restore these mapped regions resulting in access fault.
>>>
>>> This issue was temporarily fixed by commit ed309ce52218 ("RISC-V: mark hibernation as nonportable").
>>> But as Alex's RFC and Rob's response stats in another thread [2] ,
>>> "Hibernation is only one case. Speculative accesses could also occur."
>>> So this fixing commit seems not the perfect answer to this issue.
>>>
>>>
>>> 2. Bug report: kernel paniced while booting (with UEFI )[3]
>>>
>>> During the booting with UEFI, UEFI Memory Mapping overwrote the memblock.
>>> The phys_ram_base was set as the end address of mmoderes0 (like 0x80040000 for 256 KiB mmoderes0 at 80000000),
>>> which resulted the VA based on 2M-aligned PA was not 2M-aligned using va_pa_offset
>>> (PAGE_OFFSET - phys_ram_base) to translate.
>>>
>>> The best_map_size() from commit 3335068f8721 didn't check the virtual alignment
>>> before choosing a map size. and then a "VA hole" was created where page faults always occurred.
>>>
>>> This issue was fixed by commit 49a0a3731596 ("riscv: Check the virtual alignment before choosing a map size"),
>>> But this fixing commit has a side-effect ("the possible third one" as Alex said in this thread).
>>> There are numerous PTE allocations slowing down the boot time and consuming some system memory when UEFI booting
>>> (Note that it's not involved when booting directly with OpenSbi, where phys_ram_base is the 2M-aligned base of DRAM).
>>>
>>> In my test, compared with/out reverting both commit 49a0a3731596 and commit 3335068f8721,
>>> I must wait ~20s for the linear mapping creation and mem_init_print_info() reported ~8M extra reserved memory.
>>
>> Indeed, phys_ram_base is not aligned on a 2MB boundary when booting
>> with EDK2, IIRC that's because the first chunk of memory at
>> 0x8000_0000 is marked as UC and is then completely evicted.
>>
>>>
>>> To eliminate this side-effect, We should find a way to align VA and PA on a 2MB boundary.
>>> The simplest way is reverting the commit 3335068f8721 ("riscv: Use PUD/P4D/PGD pages for the linear mapping").
>>>
>>
>> I disagree, the simplest way is to align phys_ram_base on a 2MB
>> boundary, either by aligning to the upper boundary (and then wastes
>> memory, like we used to) or by aligning to the lower boundary (but I
>> want to make sure it works).
>>
>> I'll come up with something tomorrow.
>
> @Song Shuai : can you test the following please?
>
> commit a35b5e5e3f29e415f54fea064176315e79e21fb7 (HEAD ->
> dev/alex/align_va_pa_v1)
> Author: Alexandre Ghiti <alexghiti at rivosinc.com>
> Date: Mon Jun 5 14:26:55 2023 +0000
>
> riscv: Start of DRAM should at least be aligned on PMD size for
> the direct mapping
>
> So that we do not end up mapping the whole linear mapping using 4K
> pages, which is slow at boot time, and also very likely at runtime.
>
> So make sure we align the start of DRAM on a PMD boundary.
>
> Signed-off-by: Alexandre Ghiti <alexghiti at rivosinc.com>
>
> diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
> index 4fa420faa780..4a43ec275c6d 100644
> --- a/arch/riscv/mm/init.c
> +++ b/arch/riscv/mm/init.c
> @@ -214,8 +214,13 @@ static void __init setup_bootmem(void)
> memblock_reserve(vmlinux_start, vmlinux_end - vmlinux_start);
>
> phys_ram_end = memblock_end_of_DRAM();
> +
> + /*
> + * Make sure we align the start of the memory on a PMD boundary so that
> + * at worst, we map the linear mapping with PMD mappings.
> + */
> if (!IS_ENABLED(CONFIG_XIP_KERNEL))
> - phys_ram_base = memblock_start_of_DRAM();
> + phys_ram_base = memblock_start_of_DRAM() & PMD_MASK;
>
> /*
> * In 64-bit, any use of __va/__pa before this point is wrong as we
>
I tested your patch and it really fixed what I concerned :
`There are numerous PTE allocations slowing down the boot time and
consuming some system memory when UEFI booting.`
FYI, I posted the `ptdmp` and the available memory with/out the patch
from my test:
```
>> v6.4
---[ Linear mapping ]---
0xff60000000000000-0xff600000001c0000 0x0000000080040000 1792K
PTE D A G . . W R V
0xff600000001c0000-0xff60000000bc0000 0x0000000080200000 10M
PTE D A G . . . R V
0xff60000000bc0000-0xff60000000fc0000 0x0000000080c00000 4M
PTE D A G . . W R V
0xff60000000fc0000-0xff600000015c0000 0x0000000081000000 6M
PTE D A G . . . R V
0xff600000015c0000-0xff600000ffdfd000 0x0000000081600000 4169972K
PTE D A G . . W R V
0xff600000fffbf000-0xff600000fffc0000 0x000000017ffff000 4K
PTE D A G . . W R V
>> v6.4 with the patch
---[ Linear mapping ]---
0xff60000000040000-0xff60000000200000 0x0000000080040000 1792K
PTE D A G . . W R V
0xff60000000200000-0xff60000000c00000 0x0000000080200000 10M
PMD D A G . . . R V
0xff60000000c00000-0xff60000001000000 0x0000000080c00000 4M
PMD D A G . . W R V
0xff60000001000000-0xff60000001600000 0x0000000081000000 6M
PMD D A G . . . R V
0xff60000001600000-0xff60000040000000 0x0000000081600000 1002M
PMD D A G . . W R V
0xff60000040000000-0xff600000c0000000 0x00000000c0000000 2G
PUD D A G . . W R V
0xff600000c0000000-0xff600000ffe00000 0x0000000140000000 1022M
PMD D A G . . W R V
0xff600000ffe00000-0xff600000ffe3d000 0x000000017fe00000 244K
PTE D A G . . W R V
0xff600000fffff000-0xff60000100000000 0x000000017ffff000 4K
PTE D A G . . W R V
```
```
>> v6.4
Memory: 3945340K/4194048K available (8391K kernel code, 4959K rwdata,
4096K rodata, 2195K init, 476K bss, 248708K reserved, 0K cma-reserved)
>> v6.4 with the patch
Memory: 3953524K/4194048K available (8391K kernel code, 4959K rwdata,
4096K rodata, 2195K init, 476K bss, 240524K reserved, 0K cma-reserved)
```
> An example of output when phys_ram_base is not aligned on a 2MB boundary:
>
> ---[ Linear mapping ]---
> 0xffffaf8000008000-0xffffaf8000200000 0x0000000080008000 2016K
> PTE D A G . . W R V
> 0xffffaf8000200000-0xffffaf8000e00000 0x0000000080200000 12M
> PMD D A G . . . R V
> 0xffffaf8000e00000-0xffffaf8001200000 0x0000000080e00000 4M
> PMD D A G . . W R V
> 0xffffaf8001200000-0xffffaf8001a00000 0x0000000081200000 8M
> PMD D A G . . . R V
> 0xffffaf8001a00000-0xffffaf807e600000 0x0000000081a00000 1996M
> PMD D A G . . W R V
> 0xffffaf807e600000-0xffffaf807e714000 0x00000000fe600000 1104K
> PTE D A G . . W R V
> 0xffffaf807e715000-0xffffaf807e718000 0x00000000fe715000 12K
> PTE D A G . . W R V
> 0xffffaf807e71b000-0xffffaf807e71c000 0x00000000fe71b000 4K
> PTE D A G . . W R V
> 0xffffaf807e720000-0xffffaf807e800000 0x00000000fe720000 896K
> PTE D A G . . W R V
> 0xffffaf807e800000-0xffffaf807fe00000 0x00000000fe800000 22M
> PMD D A G . . W R V
> 0xffffaf807fe00000-0xffffaf807ff54000 0x00000000ffe00000 1360K
> PTE D A G . . W R V
> 0xffffaf807ff55000-0xffffaf8080000000 0x00000000fff55000 684K
> PTE D A G . . W R V
> 0xffffaf8080000000-0xffffaf83c0000000 0x0000000100000000 13G
> PUD D A G . . W R V
> 0xffffaf83c0000000-0xffffaf83ffe00000 0x0000000440000000 1022M
> PMD D A G . . W R V
> 0xffffaf83ffe00000-0xffffaf8400000000 0x000000047fe00000 2M
> PTE D A G . . W R V
>
> I found that it was easier to align phys_ram_base on the lower 2MB
> boundary. Aligning on the upper boundary is more complicated to me
> since there may be "something" between phys_ram_base and the upper 2MB
> boundary that needs to be accessed using the linear mapping (DT is
> accessed using fixmap so not a problem, initrd? ACPI tables? I don't
> know actually).
>
And would there be possible influence from the diff between
phys_ram_base and memblock_start_of_DRAM()?
> Weirdly simple though, I may be missing something, so any comment/test
> is welcome, it is currently running our internal CI.
>
> Thanks,
>
> Alex
>
>>
>> Thanks,
>>
>> Alex
>>
>>>
>>>
>>> Using PUD/P4D/PGD pages for the linear mapping to improve the performance is marginal from a recent talk [4]
>>> from Mike Rapoport. OpenSbi had marked all the PMP-protected regions as "no-map" [5] to practice this talk.
>>>
I noticed best_map_size() still creates some huge-paged mapping (like
above 2G PUD) with this patch.
How about to revert best_map_size() to disable huge-paged mapping to
practice the Mike's talk.
>>> For all those reasons, let's revert these related commits:
>>>
>>> - commit 3335068f8721 ("riscv: Use PUD/P4D/PGD pages for the linear mapping")
>>> - commit 49a0a3731596 ("riscv: Check the virtual alignment before choosing a map size")
>>> - commit ed309ce52218 ("RISC-V: mark hibernation as nonportable")
>>>
>>> [1]: https://lore.kernel.org/linux-riscv/CAAYs2=gQvkhTeioMmqRDVGjdtNF_vhB+vm_1dHJxPNi75YDQ_Q@mail.gmail.com/
>>> [2]: https://lore.kernel.org/linux-kernel/20230530080425.18612-1-alexghiti@rivosinc.com/
>>> [3]: https://lore.kernel.org/linux-riscv/tencent_7C3B580B47C1B17C16488EC1@qq.com/
>>> [4]: https://lwn.net/Articles/931406/
>>> [5]: https://github.com/riscv-software-src/opensbi/commit/8153b2622b08802cc542f30a1fcba407a5667ab9
>>>
>>> Song Shuai (3):
>>> Revert "RISC-V: mark hibernation as nonportable"
>>> Revert "riscv: Check the virtual alignment before choosing a map size"
>>> Revert "riscv: Use PUD/P4D/PGD pages for the linear mapping"
>>>
>>> arch/riscv/Kconfig | 5 +---
>>> arch/riscv/include/asm/page.h | 16 -------------
>>> arch/riscv/mm/init.c | 43 +++++++----------------------------
>>> arch/riscv/mm/physaddr.c | 16 -------------
>>> drivers/of/fdt.c | 11 ++++-----
>>> 5 files changed, 14 insertions(+), 77 deletions(-)
>>>
>>> --
>>> 2.20.1
>>>
>
--
Thanks
Song Shuai
More information about the linux-riscv
mailing list