[PATCH] kvm: pass the virtual SEI syndrome to guest OS
Laszlo Ersek
lersek at redhat.com
Wed Mar 29 04:58:29 PDT 2017
(This ought to be one of the longest address lists I've ever seen :)
Thanks for the CC. I'm glad Shannon is already on the CC list. For good
measure, I'm adding MST and Igor.)
On 03/29/17 12:36, Achin Gupta wrote:
> Hi gengdongjiu,
>
> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>>
>> Hi Laszlo/Biesheuvel/Qemu developer,
>>
>> I've run into an issue on the ARM64 platform that I'd like to consult you about, as described below:
>>
>> When the guest OS takes a synchronous or asynchronous abort, KVM needs
>> to send the error address to Qemu or UEFI through SIGBUS so that the
>> APEI table can be generated dynamically. From my investigation, there
>> are two ways:
>>
>> (1) Qemu gets the error address and generates the APEI table, then
>> notifies UEFI of the generated table, then injects the abort error
>> into the guest OS; the guest OS reads the APEI table.
>> (2) Qemu gets the error address and lets UEFI generate the APEI
>> table, then injects the abort error into the guest OS; the guest OS
>> reads the APEI table.
>
> Just being pedantic! I don't think we are talking about creating the APEI table
> dynamically here. The issue is: Once KVM has received an error that is destined
> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the error
> into the guest OS, a CPER (Common Platform Error Record) has to be generated
> corresponding to the error source (GHES corresponding to memory subsystem,
> processor etc) to allow the guest OS to do anything meaningful with the
> error. So who should create the CPER is the question.
>
> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives
> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
> responsible for creating the CPER. ARM is experimenting with using a Standalone
> MM EDK2 image in the secure world to do the CPER creation. This will avoid
> adding the same code in ARM TF in EL3 (better for security). The error will then
> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted
> Firmware.
>
> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
> interface (as discussed with Christoffer below). So it should generate the CPER
> before injecting the error.
>
> This corresponds to (1) above, apart from notifying UEFI (I am assuming you
> mean guest UEFI). At this time, the guest OS already knows where to pick up the
> CPER from through the HEST. Qemu has to create the CPER and populate its address
> at the address exported in the HEST. Guest UEFI should not be involved in this
> flow. Its job was to create the HEST at boot and that has been done by this
> stage.
>
> Qemu folks will be able to add more here, but it looks like support for
> CPER generation will need to be added to Qemu. We need to resolve this.
>
> Do shout if I am missing anything above.
After reading this email, I think the use case looks *very* similar to what
we've just done with VMGENID for QEMU 2.9.
We have a facility between QEMU and the guest firmware, called "ACPI
linker/loader", with which QEMU instructs the firmware to
- allocate and download blobs into guest RAM (AcpiNVS type memory) --
ALLOCATE command,
- relocate pointers in those blobs, to fields in other (or the same)
blobs -- ADD_POINTER command,
- set ACPI table checksums -- ADD_CHECKSUM command,
- and send GPAs of fields within such blobs back to QEMU --
WRITE_POINTER command.
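To make that command vocabulary concrete, here's a minimal, self-contained
model of what one such command carries. It's only an illustration of the
concepts, with made-up type and field names; the real wire format lives in
QEMU's hw/acpi/bios-linker-loader.c and in its firmware counterpart:

/* Conceptual model of the linker/loader commands -- NOT the real wire
 * format; that lives in hw/acpi/bios-linker-loader.c and the firmware. */
#include <stdint.h>
#include <stdio.h>

enum loader_cmd_type {
    LOADER_ALLOCATE,      /* download a fw_cfg blob into guest RAM           */
    LOADER_ADD_POINTER,   /* patch a field in one blob with the guest address
                             of a location in another (or the same) blob     */
    LOADER_ADD_CHECKSUM,  /* recompute an ACPI table checksum after patching */
    LOADER_WRITE_POINTER, /* write a guest address back into a writeable
                             fw_cfg file, so that QEMU learns it             */
};

struct loader_cmd {
    enum loader_cmd_type type;
    const char *dest_file; /* blob (or fw_cfg file) being patched or written */
    const char *src_file;  /* blob whose guest address is taken              */
    uint32_t dest_offset;  /* offset of the field to patch in dest_file      */
    uint32_t src_offset;   /* offset of the pointed-to object in src_file    */
    uint8_t size;          /* width of the patched field, in bytes           */
};

int main(void)
{
    /* Example: make an 8-byte field at (made-up) offset 0x40 of
     * "etc/acpi/tables" point to offset 0 of "etc/hardware_errors". */
    struct loader_cmd cmd = {
        .type = LOADER_ADD_POINTER,
        .dest_file = "etc/acpi/tables",     .dest_offset = 0x40,
        .src_file  = "etc/hardware_errors", .src_offset  = 0,
        .size = 8,
    };
    printf("%s+0x%x -> %s+0x%x (%u bytes)\n",
           cmd.dest_file, (unsigned)cmd.dest_offset,
           cmd.src_file, (unsigned)cmd.src_offset, (unsigned)cmd.size);
    return 0;
}
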
This is how I imagine we can map the facility to the current use case
(note that this is the first time I read about HEST / GHES / CPER):
etc/acpi/tables etc/hardware_errors
================ ==========================================
+-----------+
+--------------+ | address | +-> +--------------+
| HEST + | registers | | | Error Status |
+ +------------+ | +---------+ | | Data Block 1 |
| | GHES | --> | | address | --------+ | +------------+
| | GHES | --> | | address | ------+ | | CPER |
| | GHES | --> | | address | ----+ | | | CPER |
| | GHES | --> | | address | -+ | | | | CPER |
+-+------------+ +-+---------+ | | | +-+------------+
| | |
| | +---> +--------------+
| | | Error Status |
| | | Data Block 2 |
| | | +------------+
| | | | CPER |
| | | | CPER |
| | +-+------------+
| |
| +-----> +--------------+
| | Error Status |
| | Data Block 3 |
| | +------------+
| | | CPER |
| +-+------------+
|
+--------> +--------------+
| Error Status |
| Data Block 4 |
| +------------+
| | CPER |
| | CPER |
| | CPER |
+-+------------+
(1) QEMU generates the HEST ACPI table. This table goes in the current
"etc/acpi/tables" fw_cfg blob. Given N error sources, there will be N
GHES objects in the HEST.
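For reference, each GHES entry is a fixed-size structure in the HEST. By my
reading of ACPI 6.1 (please double-check the field widths against the spec),
it looks like the sketch below; the "Error Status Address" GAS is the field
we will patch later on:

/* Sketch of one HEST "Generic Hardware Error Source" (GHES) entry, per my
 * reading of ACPI 6.1 -- verify the field widths against the spec. */
#include <stdint.h>
#include <stdio.h>

struct acpi_generic_address {            /* "GAS", Generic Address Structure */
    uint8_t  space_id;
    uint8_t  bit_width;
    uint8_t  bit_offset;
    uint8_t  access_size;
    uint64_t address;
} __attribute__((packed));

struct acpi_ghes {
    uint16_t type;                       /* 9 = Generic Hardware Error Source */
    uint16_t source_id;
    uint16_t related_source_id;
    uint8_t  flags;
    uint8_t  enabled;
    uint32_t records_to_preallocate;
    uint32_t max_sections_per_record;
    uint32_t max_raw_data_length;
    struct acpi_generic_address error_status_address; /* <- ADD_POINTER target */
    uint8_t  notification[28];           /* Hardware Error Notification struct */
    uint32_t error_status_block_length;
} __attribute__((packed));

int main(void)
{
    /* Expected: 12 and 64, if I've read the spec right. */
    printf("sizeof(GAS) = %zu, sizeof(GHES) = %zu\n",
           sizeof(struct acpi_generic_address), sizeof(struct acpi_ghes));
    return 0;
}
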
(2) We introduce a new fw_cfg blob called "etc/hardware_errors". QEMU
also populates this blob.
(2a) Given N error sources, the (unnamed) table of address registers
will contain N address registers.
(2b) Given N error sources, the "etc/hardware_errors" fw_cfg blob will
also contain N Error Status Data Blocks.
I don't know about the sizing (number of CPERs) each Error Status Data
Block has to contain, but I understand it is all pre-allocated as far as
the OS is concerned, which matches our capabilities well.
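As a strawman, and purely to make the layout concrete, here's how I picture
the blob being put together (the source count and per-block size are made-up
placeholders; the real pre-allocation policy is exactly the open question
above):

/* Rough sketch of the proposed "etc/hardware_errors" blob layout.  N and
 * the per-block size are example values only. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NUM_SOURCES    4        /* N error sources (example)               */
#define ERR_BLOCK_SIZE 4096     /* bytes per Error Status Data Block (ex.) */

/* Layout:
 *   [0, N*8)                       N 8-byte "address registers"
 *   [N*8, N*8 + N*ERR_BLOCK_SIZE)  N pre-allocated Error Status Data Blocks
 */
static size_t addr_register_offset(unsigned k)
{
    return (size_t)k * 8;
}

static size_t error_block_offset(unsigned k)
{
    return (size_t)NUM_SOURCES * 8 + (size_t)k * ERR_BLOCK_SIZE;
}

int main(void)
{
    size_t blob_size = (size_t)NUM_SOURCES * 8
                       + (size_t)NUM_SOURCES * ERR_BLOCK_SIZE;
    uint8_t *blob = calloc(1, blob_size);   /* all zeros until patched */

    /* QEMU would expose 'blob' as the "etc/hardware_errors" fw_cfg file;
     * the address registers get filled in by the firmware, per the
     * linker/loader commands described in steps (3)-(6) below. */
    printf("blob size %zu, addr reg 0 at %zu, block 0 at %zu\n",
           blob_size, addr_register_offset(0), error_block_offset(0));
    free(blob);
    return 0;
}
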
(3) QEMU generates the ACPI linker/loader script for the firmware, as
always.
(3a) The HEST table is part of "etc/acpi/tables", which the firmware
already allocates memory for and downloads (because QEMU already
generates an ALLOCATE linker/loader command for it).
(3b) QEMU will have to create another ALLOCATE command for the
"etc/hardware_errors" blob. The firmware allocates memory for this blob,
and downloads it.
(4) QEMU generates, in the ACPI linker/loader script for the firmware, N
ADD_POINTER commands, which point the GHES."Error Status
Address" fields in the HEST table, to the corresponding address
registers in the downloaded "etc/hardware_errors" blob.
(5) QEMU generates an ADD_CHECKSUM command for the firmware, so that the
HEST table is correctly checksummed after executing the N ADD_POINTER
commands from (4).
(6) QEMU generates N ADD_POINTER commands for the firmware, pointing the
address registers (located in guest memory, in the downloaded
"etc/hardware_errors" blob) to the respective Error Status Data Blocks.
(7) (This is the trick.) For this step, we need a third, write-only
fw_cfg blob, called "etc/hardware_errors_addr". Through that blob, the
firmware can send back the guest-side allocation addresses to QEMU.
Namely, the "etc/hardware_errors_addr" blob contains N 8-byte entries.
QEMU generates N WRITE_POINTER commands for the firmware.
For error source K (0 <= K < N), QEMU instructs the firmware to
calculate the guest address of Error Status Data Block K, from the
QEMU-dictated offset within "etc/hardware_errors", and from the
guest-determined allocation base address for "etc/hardware_errors". The
firmware then writes the calculated address back to fw_cfg file
"etc/hardware_errors_addr", at offset K*8, according to the
WRITE_POINTER command.
This way QEMU will know the GPA of each Error Status Data Block.
(In fact this can be simplified to a single WRITE_POINTER command: the
address of the "address register table" can be sent back to QEMU as
well, which already contains all Error Status Data Block addresses.)
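Here's a minimal sketch of the bookkeeping QEMU ends up with, assuming the
simplified single-WRITE_POINTER variant (the address register table sits at
offset 0 of the blob, and the base address below is a made-up example):

/* Once the firmware writes back the guest base address of
 * "etc/hardware_errors", QEMU can derive the GPA of every Error Status
 * Data Block by itself. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_SOURCES    4
#define ERR_BLOCK_SIZE 4096

static uint64_t error_block_gpa(uint64_t hardware_errors_base, unsigned k)
{
    return hardware_errors_base
           + NUM_SOURCES * 8                  /* skip the address registers */
           + (uint64_t)k * ERR_BLOCK_SIZE;
}

int main(void)
{
    uint64_t base = 0x47f00000;     /* example: written back by the firmware */
    unsigned k;

    for (k = 0; k < NUM_SOURCES; k++) {
        printf("Error Status Data Block %u at GPA 0x%" PRIx64 "\n",
               k, error_block_gpa(base, k));
    }
    return 0;
}
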
(8) When QEMU gets SIGBUS from the kernel -- I hope that's going to come
through a signalfd -- QEMU can format the CPER right into guest memory,
and then inject whatever interrupt (or assert whatever GPIO line) is
necessary for notifying the guest.
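A very rough sketch of that QEMU-side flow, leaving aside whether a
synchronous SIGBUS (BUS_MCEERR_AR) can really be routed through a signalfd,
and omitting the host-virtual-address to GPA translation and the actual CPER
formatting:

/* Sketch only: block SIGBUS, take it through a signalfd, and pull out the
 * faulting host address. */
#include <sys/signalfd.h>
#include <signal.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGBUS);
    sigprocmask(SIG_BLOCK, &mask, NULL); /* deliver via the fd, not a handler */

    int fd = signalfd(-1, &mask, 0);
    struct signalfd_siginfo si;

    /* Blocks until a SIGBUS arrives for this process. */
    if (read(fd, &si, sizeof(si)) == (ssize_t)sizeof(si) &&
        si.ssi_signo == SIGBUS) {
        /* ssi_addr is the host virtual address the error was reported on;
         * QEMU would translate it to a GPA, format a CPER into the right
         * pre-allocated Error Status Data Block, and then inject the
         * notification (interrupt / GPIO) into the guest. */
        printf("SIGBUS: code=%d addr=0x%llx\n",
               (int)si.ssi_code, (unsigned long long)si.ssi_addr);
    }
    return 0;
}
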
(9) This notification (in virtual hardware) can either be handled by the
guest kernel stand-alone, or else the guest kernel can invoke an ACPI
event handler method with it (which would be in the DSDT or one of the
SSDTs, also generated by QEMU). The ACPI event handler method could
invoke the specific guest kernel driver for error handling via a
Notify() operation.
I'm attracted to the above design because:
- it would leave the firmware alone after OS boot, and
- it would leave the firmware blissfully ignorant about HEST, GHES,
CPER, and the like. (That's why QEMU's ACPI linker/loader was invented
in the first place.)
Thanks
Laszlo
>> Which module do you think should generate the APEI table: UEFI or Qemu?
>>
>>
>>
>>
>> On 2017/3/28 21:40, James Morse wrote:
>>> Hi gengdongjiu,
>>>
>>> On 28/03/17 13:16, gengdongjiu wrote:
>>>> On 2017/3/28 19:54, Achin Gupta wrote:
>>>>> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
>>>>>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>>>>>>> On the host, part of UEFI is involved to generate the CPER records.
>>>>>>> In a guest?, I don't know.
>>>>>>> Qemu could generate the records, or drive some other component to do it.
>>>>>>
>>>>>> I think I am beginning to understand this a bit. Since the guest UEFI
>>>>>> instance is specifically built for the machine it runs on, QEMU's virt
>>>>>> machine in this case, they could simply agree (by some contract) to
>>>>>> place the records at some specific location in memory, and if the guest
>>>>>> kernel asks its guest UEFI for that location, things should just work by
>>>>>> having logic in QEMU to process error reports and populate guest memory.
>>>>>>
>>>>>> Is this how others see the world too?
>>>>>
>>>>> I think so!
>>>>>
>>>>> AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
>>>>> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
>>>>> HEST for the guest Kernel?
>>>>>
>>>>> If so, then the question is how the guest UEFI finds out where QEMU (acting as
>>>>> EL3 firmware) will populate the CPERs. This could either be a contract between
>>>>> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
>>>>> where the memory is.
>>>>
>>>> Wouldn't involving the guest UEFI be complex? I don't see the advantage. It
>>>> seems x86 Qemu generates the ACPI tables directly, but I am not sure; we are
>>>> checking the qemu logic.
>>>> Letting Qemu generate the CPER record may be clearer.
>>>
>>> At boot UEFI in the guest will need to make sure the areas of memory that may be
>>> used for CPER records are reserved. Whether UEFI or Qemu decides where these are
>>> needs deciding (but probably not here)...
>>>
>>> At runtime, when an error has occurred, I agree it would be simpler (fewer
>>> components involved) if Qemu generates the CPER records. But if UEFI made the
>>> memory choice above they need to interact and it gets complicated again. The
>>> CPER records are defined in the UEFI spec, so I would expect UEFI to contain
>>> code to generate/parse them.
>>>
>>>
>>> Thanks,
>>>
>>> James
>>>
>>>
>>> .
>>>
>>