[REGRESSION] kexec does firmware reboot in kernel v6.7.6

Wed Oct 23 06:29:38 PDT 2024

On 10/23/2024 6:39 AM, David Woodhouse wrote:
> On Wed, 2024-10-23 at 06:07 -0500, Kalra, Ashish wrote:
>>
>> As mentioned above, about the same 2MB page containing the end portion of the RMP table and a page allocated for kexec and 
>> looking at the e820 memory map dump here: 
>>
>>>>> [    0.000000] BIOS-e820: [mem 0x000000bfbe000000-0x000000c1420fffff] reserved
>>>>> [    0.000000] BIOS-e820: [mem 0x000000c142100000-0x000000fc7fffffff] usable
>>
>> As seen here in the e820 memory map, the end range of the RMP table is not
>> aligned to 2MB and not reserved but it is usable as RAM.
>>
>> Subsequently, kexec-ed kernel could try to allocate from within that chunk
>> which then causes a fatal RMP fault.
> 
> Well, allocating within that chunk would be just fine. It *is* usable
> as RAM, as the e820 table says. It works fine most of the time.
> 
> You've missed a step out of the story. The problem is that for kexec we
> map it with an "overreaching" 2MiB PTE which also covers the reserved
> regions, and *that* is what causes the RMP violation fault.
> 

Actually, the RMP entry covering the end of the RMP table will be a
2MB (large) entry. That means the whole 2MB region, including the usable
1MB memory range here, is marked as reserved in the RMP table, and hence
any host write into this memory range will trigger an RMP violation.
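
To make the overlap concrete, here is a small sketch of the arithmetic
using the addresses from the e820 dump quoted above. It is illustration
only, not kernel code: it assumes the reserved e820 range ends exactly
where the RMP table ends, and the helper name covering_2m_page() is made
up for this example.

```python
SZ_2M = 2 * 1024 * 1024

def covering_2m_page(addr):
    """Return (start, end_inclusive) of the 2MB-aligned page containing addr."""
    start = addr & ~(SZ_2M - 1)
    return start, start + SZ_2M - 1

# Values taken from the e820 dump quoted above.
reserved_end = 0xc1420fffff  # last reserved byte (assumed to be the RMP table end)
usable_start = 0xc142100000  # first byte of the following usable range

# The 2MB page that holds the tail of the reserved range.
page_start, page_end = covering_2m_page(reserved_end)

# Usable RAM that shares this 2MB page with the reserved tail.
shared_usable_bytes = page_end - usable_start + 1

print(hex(page_start), hex(page_end), hex(shared_usable_bytes))
```

With these inputs the 2MB page is 0xc142000000-0xc1421fffff, so the upper
1MB of it (0xc142100000-0xc1421fffff) is reported as usable RAM while
sharing a 2MB page with the reserved tail.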

> We could take two possible viewpoints here. I was taking the viewpoint
> that this is a kernel bug, that it *shouldn't* be setting up 2MiB pages
> which include a reserved region, and should break those down to 4KiB
> pages.
> 
> The alternative view would be to consider it a BIOS bug, and to say
> that the BIOS really *ought* to have reserved the whole 2MiB region to
> avoid the 'sharing'.  Since the hardware apparently already breaks down
> 1GiB pages to 2MiB TLB entries in order to avoid triggering the problem
> on 1GiB mappings.
> 
>> This issue has been fixed with the following patch: 
>> https://lore.kernel.org/lkml/171438476623.10875.16783275868264913579.tip-bot2@tip-bot2/
> 
> Thanks for pointing that patch out! Should it have been Cc:stable?
> 

This can only happen once SNP host support, which got merged in 6.11, is
present and SNP is enabled, therefore the patch was not marked Cc:stable.

I am trying to understand the scenario here: you have SNP enabled in the BIOS and you also
have SNP support added in the host kernel, which means that the following logs are seen:
..
SEV-SNP: RMP table physical range [0x000000xxxxxxxxxx - 0x000000yyyyyyyyyy]
..

> It seems to be taking the latter of the above two viewpoints, that this
> is a BIOS bug and that the BIOS *should* have reserved the whole 2MiB.
> 
> In that case are fixed BIOSes available already? 

Our view has been that it is easier to fix this in the kernel, by
aligning the e820 ranges covering the start and end of the RMP table to
2MB boundaries, rather than trusting the BIOS to do it correctly.
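
As a rough sketch of what that alignment means for the range in the e820
dump quoted above (illustration only; this is not the code from the
patch, the function names are hypothetical, and it assumes the reserved
e820 range matches the RMP table exactly):

```python
SZ_2M = 2 * 1024 * 1024

def align_down(addr, size=SZ_2M):
    """Round addr down to a multiple of size (size must be a power of two)."""
    return addr & ~(size - 1)

def align_up(addr, size=SZ_2M):
    """Round addr up to a multiple of size (size must be a power of two)."""
    return (addr + size - 1) & ~(size - 1)

def rmp_reserved_range(rmp_start, rmp_end_exclusive):
    """Widen [rmp_start, rmp_end_exclusive) outward to 2MB boundaries."""
    return align_down(rmp_start), align_up(rmp_end_exclusive)

# Reserved range from the e820 dump, with the end expressed as exclusive.
start, end = rmp_reserved_range(0xbfbe000000, 0xc142100000)
print(hex(start), hex(end))
```

Here the start is already 2MB-aligned and stays at 0xbfbe000000, while
the exclusive end moves from 0xc142100000 up to 0xc142200000, so the
extra 1MB at 0xc142100000-0xc1421fffff would no longer be handed out as
usable RAM.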

Here is a link to a discussion on the same:
https://lore.kernel.org/all/2ab14f6f-2690-056b-cf9e-38a12dafd728@amd.com/

Thanks, 
Ashish

> This patch makes sense
> as a temporary workaround (we have ways to print warnings about BIOS
> bugs, btw), but I don't really like it as a longer-term "fix". What if
> the BIOS had put *other* things into that other 1MiB of address space?
> What if the bootloader had loaded something there? 
> 
> I'm still inclined to suggest that kexec *shouldn't* use over-reaching
> large pages which cover anything that isn't marked as usable RAM.