[REGRESSION] kexec does firmware reboot in kernel v6.7.6

Kalra, Ashish ashish.kalra at amd.com
Wed Oct 23 04:07:37 PDT 2024


On 10/23/2024 2:39 AM, David Woodhouse wrote:
> On Tue, 2024-10-22 at 17:06 -0500, Steve Wahl wrote:
>> On Tue, Oct 22, 2024 at 07:51:38PM +0100, David Woodhouse wrote:
>>> I spent all of Monday setting up a full GDT, IDT and exception handler
>>> for the relocate_kernel() environment¹, and I think these reports may
>>> have been the same as what I've been debugging.
>>
>> David,
>>
>> My original problem involved UV platform hardware catching a
>> speculative access into the reserved areas, which caused a BIOS HALT.
>> Reducing the use of gbpages in the page table kept the speculation
>> from hitting those areas.  I would believe this sort of thing might be
>> unique to the UV platform.
>>
>> The regression reports I got from Pavin and others were due to my
>> original patch trimming down the page tables to the point where they
>> didn't include some memory that was actually referenced (not processor
>> speculation), because those regions were not explicitly included in the
>> creation of the kexec page map.  This was fixed by explicitly
>> including those regions when creating the map.
> 
> Hm, I didn't see that part of the discussion. I saw that such was a
> theory, but haven't seen specific confirmation and fixes. And your
> original patch was reverted and still not reapplied, AFAICT.
> 
> I did note that the victims all seemed to be using AMD CPUs, so it
> seemed likely that at least *some* of them were suffering the same
> problem that I've found.
> 
> Do you have references please? 
> 
> If anyone is still seeing such problems either with or without your
> patch, they can run with my exception handler and get an actual dump
> instead of a triple-fault.
> 
> (I'm also pushing CPU vendors to give us information from the triple-
> fault through the machine check architecture. It's awful having to do
> this blind. For VMs, I also had plans to register a crashdump kernel
> entry point with the hypervisor, so that on a triple fault the
> *hypervisor* could dump the state of all the vCPUs to the configured
> location, then restart one CPU in the crash kernel for it to do its own
> dump). 
> 
>> Can you dump the page tables to see if the address you're referencing
>> is included in those tables (or maybe you already did)?  Can you give
>> symbols and code around the RIP when you hit the #PF?  It looks like
>> this is in the region mentioned as the "Control page", so it's probably
>> trampoline code that has been copied from somewhere else.  I'm using
>> my copy of perhaps different kernel source than you have, given your
>> exception handler modification.
>>
>> Wait, I can't make sense of the dump. See more below.
>>
>> What platform are you running on?  And under what conditions (is this
>> bare metal)? Is it really speculation that's causing your #PF?  If so,
>> you could cause it deterministically by, say, doing a quick checksum
>> on that area you're not supposed to touch (0xc142000000 -
>> 0xc1420fffff) and see if it faults every time.  (As I said, I was
>> thinking faults from speculation might be unique to the UV platform.)
> 
> Yes, it's bare metal. AMD Genoa. No, it's not speculation. It's because
> we have a single 2MiB page which covers *both* the RMP table (1MiB
> reserved by BIOS in e820 as I showed), and a page that was allocated
> for the kimage. If I understand correctly, the hardware raises that
> fault (with bit 31 in the error code) when refusing to populate that
> TLB entry for writing.

As mentioned above, the same 2MB page contains the end portion of the RMP
table and a page allocated for kexec. Looking at the e820 memory map dump here: 

>>> [    0.000000] BIOS-e820: [mem 0x000000bfbe000000-0x000000c1420fffff] reserved
>>> [    0.000000] BIOS-e820: [mem 0x000000c142100000-0x000000fc7fffffff] usable

As seen here in the e820 memory map, the end of the RMP table is not
aligned to 2MB, so the remainder of that 2MB page is not reserved but
usable as RAM.

Subsequently, the kexec-ed kernel can try to allocate from within that
chunk, which then causes a fatal RMP fault.

This issue has been fixed with the following patch: 
https://lore.kernel.org/lkml/171438476623.10875.16783275868264913579.tip-bot2@tip-bot2/

Thanks, 
Ashish

> 
> According to the AMD manual we're allowed to *read* but not write.
> 
>>> We end up taking a #PF, usually on one of the 'rep mov's, one time on
>>> the 'pushq %r8' right before using it to 'ret' to identity_mapped. In
>>> each case it happens on the first *write* to a page.
>>>
>>> Now I can print %cr2 when it happens (instead of just going straight to
>>> triple-fault), I spot an interesting fact about the address. It's
>>> always *adjacent* to a region reserved by BIOS in the e820 data, and
>>> within the same 2MiB page.

>>
>> I'm not at all certain, but this feels like a red herring.  Be cautious.
> 
> It wouldn't be our first in this journey, but I'm actually fairly
> confident this time. :)
> 
>>> [    0.000000] BIOS-e820: [mem 0x000000bfbe000000-0x000000c1420fffff] reserved
>>> [    0.000000] BIOS-e820: [mem 0x000000c142100000-0x000000fc7fffffff] usable
>>>
>>>
>>> 2024-10-22 17:09:14.291000 kern NOTICE [   58.996257] kexec: Control page at c149431000
>>> 2024-10-22 17:09:14.291000 Y
>>> 2024-10-22 17:09:14.291000 rip:000000c1494312f8
>>> 2024-10-22 17:09:14.291000 rsp:000000c149431f90
>>> 2024-10-22 17:09:14.291000 Exc:000000000000000e
>>> 2024-10-22 17:09:14.291000 Err:0000000080000003
>>> 2024-10-22 17:09:14.291000 rax:000000c142130000
>>> 2024-10-22 17:09:14.291000 rbx:000000010d4b8020
>>> 2024-10-22 17:09:14.291000 rcx:0000000000000200
>>> 2024-10-22 17:09:14.291000 rdx:000000000009c000
>>> 2024-10-22 17:09:14.291000 rsi:000000000009c000
>>> 2024-10-22 17:09:14.291000 rdi:000000c142130000
>>> 2024-10-22 17:09:14.291000 r8 :000000c149431000
>>> 2024-10-22 17:09:14.291000 r9 :000000c149430000
>>> 2024-10-22 17:09:14.291000 r10:000000010d4bc000
>>> 2024-10-22 17:09:14.291000 r11:0000000000000000
>>> 2024-10-22 17:09:14.291000 r12:0000000000000000
>>> 2024-10-22 17:09:14.291000 r13:0000000000770ef0
>>> 2024-10-22 17:09:14.291000 r14:ffff8c82c0000000
>>> 2024-10-22 17:09:14.291000 r15:0000000000000000
>>> 2024-10-22 17:09:14.291000 cr2:000000c142130000
>>>
>>> And bit 31 in the error code is set, which means it's an RMP
>>> violation. 
>>
>> RMP is AMD SEV related, right?  I'm not familiar with SEV operation,
>> but I have an itchy feeling it's involved in this problem.
>>
>> I am having a hard time with the RIP listed above.  Maybe your
>> exception handler has affected it?  My disassembly seems to show this
>> address should be in a sea of 0xCC / int3 bytes past the end of
>> swap_pages.
> 
> You'd have to have access to my kernel binary to have a hope of knowing
> that, surely? I don't think I checked that particular one, but it's
> normally one of the 'rep mov's in relocate_kernel_64.S.
> 
>>> Looks like we set up a 2MiB page covering the whole range from
>>> 0xc142000000 to 0xc142200000, but we aren't allowed to touch the first
>>> half of that.
>>
>> Is it possible that, instead, some SEV tag is hanging around (TLB not
>> completely cleared?) and a page that was otherwise free is causing the
>> problem.  Are you using SEV/SME in your system, and if you stop using
>> it does it go away?  (Although I have a feeling the answer is no and
>> I'm barking up the wrong tree.)
>>
>> The target of the pages above is c142130000.  Have you checked to
>> make sure that's a valid address in the page map?
> 
> Yeah, we dumped the page tables and it's present.
> 
>>> For me it happens either with or without Steve's last patch, *but*
>>> clearing direct_gbpages did seem to make it go away (or at least
>>> reduced the incident rate far below the 1-crash-in-1000-kexecs which I
>>> was seeing before).
>>
>> I assume you're referring to the "nogbpages" kernel option?  
> 
> Nah, I just commented out the lines in init_pgtable() which set
> info.direct_gbpages=true.
> 
> 
>> My patch
>> and the nogbpages option should have the exact same pages mapped in
>> the page table.  The difference being my patch would still use gbpages
>> in places where a whole gbpage region is included in the map, while
>> nogbpages would use 2M pages to fill out the region.  This *would*
>> allocate more pages to the page table, which might be shifting things
>> around on you.
> 
> Right. In fact the first trigger for this, in our case, was an
> innocuous change to the NMI watchdog period — which sent us on a *long*
> wild goose chase based on the assumption that it was a stray perf NMI
> causing the triple-faults, when in fact that was just shifting things
> around on us too, and causing pages in that dangerous 1MiB to be chosen
> for the kimage.
> 
>>> I think Steve's original patch was just moving things around a little
>>> and, because it allocated more pages for page tables, just happened
>>> to leave pages in the offending range free to be allocated for
>>> writing, for the unlucky victims.
>>>
>>> I think the patch was actually along the right lines, although
>>> it needs to go all the way down to 4KiB PTEs in some cases. And it
>>> could probably map anything that the e820 calls 'usable RAM', rather
>>> than really restricting itself to precisely the ranges which it's
>>> requested to map. 
>>>
>>>
>>>
>>> ¹ I'll post that exception handler at some point once I've tidied it
>>> up.
>>
>> I hope this might be of some help.  Good luck, I'll pitch in any way I
>> can.
> 
> Thanks.
> 
