Bug report: kernel paniced when system hibernates

Alexandre Ghiti alex at ghiti.fr
Thu May 18 05:09:24 PDT 2023


On 5/18/23 08:53, Anup Patel wrote:
> On Wed, May 17, 2023 at 8:26 PM Alexandre Ghiti <alexghiti at rivosinc.com> wrote:
>> On Wed, May 17, 2023 at 1:28 PM Conor Dooley <conor.dooley at microchip.com> wrote:
>>> Hey Alex,
>>>
>>> On Wed, May 17, 2023 at 10:58:02AM +0200, Alexandre Ghiti wrote:
>>>> On Tue, May 16, 2023 at 1:12 PM Alexandre Ghiti <alexghiti at rivosinc.com> wrote:
>>>>> On Tue, May 16, 2023 at 11:24 AM Song Shuai <suagrfillet at gmail.com> wrote:
>>>>> I actually removed this flag a few years ago, and I have to admit that
>>>>> I need to check if that's necessary: the goal of commit 3335068f8721
>>>>> ("riscv: Use PUD/P4D/PGD pages for the linear mapping") is to expose
>>>>> the "right" start of DRAM so that we can align virtual and physical
>>>>> addresses on a 1GB boundary.
>>>>>
>>>>> So I have to check if a nomap region is actually added as a
>>>>> memblock.memory.regions[] or not: if yes, that's perfect, let's add
>>>>> the nomap attributes to the PMP regions, otherwise, I don't think that
>>>>> is a good solution.
>>>> So here is the current linear mapping without nomap in openSBI:
>>>>
>>>> ---[ Linear mapping ]---
>>>> 0xff60000000000000-0xff60000000200000    0x0000000080000000         2M
>>>> PMD     D A G . . W R V
>>>> 0xff60000000200000-0xff60000000e00000    0x0000000080200000        12M
>>>> PMD     D A G . . . R V
>>>>
>>>> And below the linear mapping with nomap in openSBI:
>>>>
>>>> ---[ Linear mapping ]---
>>>> 0xff60000000080000-0xff60000000200000    0x0000000080080000      1536K
>>>> PTE     D A G . . W R V
>>>> 0xff60000000200000-0xff60000000e00000    0x0000000080200000        12M
>>>> PMD     D A G . . . R V
>>>>
>>>> So adding nomap does not misalign virtual and physical addresses, it
>>>> prevents the usage of 1GB page for this area though, so that's a
>>>> solution, we just lose this 1GB page here.
>>>>
>>>> But even though that may be the fix, I think we also need to fix that
>>>> in the kernel as it would break compatibility with certain versions of
>>>> openSBI *if* we fix openSBI...So here are a few solutions:
>>>>
>>>> 1. we can mark all "mmode_resv" nodes in the device tree as nomap,
>>>> before the linear mapping is established (IIUC, those nodes are added
>>>> by openSBI to advertise PMP regions)
>>>>      -> This amounts to the same fix as opensbi and we lose the 1GB hugepage.
>>> AFAIU, losing the 1 GB hugepage is a regression, which would make this
>>> not an option, right?
>> Not sure this is a real regression, I'd rather avoid it, but as
>> mentioned in my first answer, Mike Rapoport showed that it was making
>> no difference performance-wise...
>>
>>>> 2. we can tweak pfn_is_nosave function to *not* save pfn corresponding
>>>> to PMP regions
>>>>      -> We don't lose the 1GB hugepage \o/
>>>> 3. we can use register_nosave_region() to not save the "mmode_resv"
>>>> regions (x86 does that
>>>> https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/x86/kernel/e820.c#L753)
>>>>      -> We don't lose the 1GB hugepage \o/
>>>> 4. Given JeeHeng pointer to
>>>> https://elixir.bootlin.com/linux/v6.4-rc1/source/kernel/power/snapshot.c#L1340,
>>>> we can mark those pages as non-readable and make the hibernation
>>>> process not save those pages
>>>>      -> Very late-in-the-day idea, not sure what it's worth, we also
>>>> lose the 1GB hugepage...
>>> Ditto here re: introducing another regression.
>>>
>>>> To me, the best solution is 3 as it would prepare for other similar
>>>> issues later, it is similar to x86 and it allows us to keep 1GB
>>>> hugepages.
>>>>
>>>> I have been thinking, and to me nomap does not provide anything since
>>>> the kernel should not address this memory range, so if it does, we
>>>> must fix the kernel.
>>>>
>>>> Let me know what you all think, I'll be preparing a PoC of 3 in the meantime!
>>> #3 would probably get my vote too. It seems like you could use it
>>> dynamically if there was to be a future other provider of "mmode_resv"
>>> regions, rather than doing something location-specific.
>>>
>>> We should probably document these opensbi reserved memory nodes though
>>> in a dt-binding or w/e if we are going to be relying on them to not
>>> crash!
> Depending on a particular node name is fragile. If we really need
> information from DT then I suggest adding "no-save-restore" DT
> property in reserved memory nodes.


I understand your point, the node name is the only thing I found that 
would work with current opensbi: any other idea what we could use instead?


>> Yes, you're right, let's see what Atish and Anup think!
> I think we have two possible approaches:
>
> 1) Update OpenSBI to set "no-map" DT property for firmware
>      reserved regions. We were doing this previously but removed
>      it later for performance reasons mentioned by Alex. It is also
>      worth mentioning that ARM Trusted Firmware also sets "no-map"
>      DT property for firmware reserved regions.
>
> 2) Add a new "no-save-restore" DT property in the reserved
>      memory DT bindings. The hibernate support of Linux arch/riscv
>      will use this DT property to exclude memory regions from
>      save-restore. The EFI implementation of EDK2 and U-Boot
>      should do the following:
>      1) Treat all memory having "no-map" DT property as EFI
>          reserved memory
>      2) Treat all memory not having "no-map" DT property and
>          not having "no-save-restore" DT property as EfiBootServicesData
>      3) Treat all memory not having "no-map" DT property and
>           having "no-save-restore" DT property as EfiRuntimeServiceData
>           (Refer,
> https://devicetree-specification.readthedocs.io/en/latest/chapter3-devicenodes.html#reserved-memory-and-uefi)
>
> Personally, I am leaning towards approach#1 since approach#2
> will require changing DeviceTree specification as well.


If needed, indeed #1 is the simplest, but I insist, to me it is not 
needed (and we don't have it in the current opensbi), if you have 
another opinion, I'm open to discuss it!

Thanks for your quick answer Anup,

Alex


>
> Regards,
> Anup
>
>> Thanks for your quick answers Conor and Song, really appreciated!
>>
>> Alex
>>
>>> Thanks for working on this,
>>> Conor.
>>>
> _______________________________________________
> linux-riscv mailing list
> linux-riscv at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-riscv



More information about the linux-riscv mailing list