[PATCH] kvm: pass the virtual SEI syndrome to guest OS

Thu Apr 6 19:52:10 PDT 2017

Hi Laszlo,
  thanks.

On 2017/4/7 2:55, Laszlo Ersek wrote:
> On 04/06/17 14:35, gengdongjiu wrote:
>> Dear, Laszlo
>>    Thanks for your detailed explanation.
>>
>> On 2017/3/29 19:58, Laszlo Ersek wrote:
>>> (This ought to be one of the longest address lists I've ever seen :)
>>> Thanks for the CC. I'm glad Shannon is already on the CC list. For good
>>> measure, I'm adding MST and Igor.)
>>>
>>> On 03/29/17 12:36, Achin Gupta wrote:
>>>> Hi gengdongjiu,
>>>>
>>>> On Wed, Mar 29, 2017 at 05:36:37PM +0800, gengdongjiu wrote:
>>>>>
>>>>> Hi Laszlo/Biesheuvel/Qemu developer,
>>>>>
>>>>>    Now I encounter a issue and want to consult with you in ARM64 platform， as described below:
>>>>>
>>>>> when guest OS happen synchronous or asynchronous abort, kvm needs
>>>>> to send the error address to Qemu or UEFI through sigbus to
>>>>> dynamically generate APEI table. from my investigation, there are
>>>>> two ways:
>>>>>
>>>>> (1) Qemu get the error address, and generate the APEI table, then
>>>>> notify UEFI to know this generation, then inject abort error to
>>>>> guest OS, guest OS read the APEI table.
>>>>> (2) Qemu get the error address, and let UEFI to generate the APEI
>>>>> table, then inject abort error to guest OS, guest OS read the APEI
>>>>> table.
>>>>
>>>> Just being pedantic! I don't think we are talking about creating the APEI table
>>>> dynamically here. The issue is: Once KVM has received an error that is destined
>>>> for a guest it will raise a SIGBUS to Qemu. Now before Qemu can inject the error
>>>> into the guest OS, a CPER (Common Platform Error Record) has to be generated
>>>> corresponding to the error source (GHES corresponding to memory subsystem,
>>>> processor etc) to allow the guest OS to do anything meaningful with the
>>>> error. So who should create the CPER is the question.
>>>>
>>>> At the EL3/EL2 interface (Secure Firmware and OS/Hypervisor), an error arrives
>>>> at EL3 and secure firmware (at EL3 or a lower secure exception level) is
>>>> responsible for creating the CPER. ARM is experimenting with using a Standalone
>>>> MM EDK2 image in the secure world to do the CPER creation. This will avoid
>>>> adding the same code in ARM TF in EL3 (better for security). The error will then
>>>> be injected into the OS/Hypervisor (through SEA/SEI/SDEI) through ARM Trusted
>>>> Firmware.
>>>>
>>>> Qemu is essentially fulfilling the role of secure firmware at the EL2/EL1
>>>> interface (as discussed with Christoffer below). So it should generate the CPER
>>>> before injecting the error.
>>>>
>>>> This is corresponds to (1) above apart from notifying UEFI (I am assuming you
>>>> mean guest UEFI). At this time, the guest OS already knows where to pick up the
>>>> CPER from through the HEST. Qemu has to create the CPER and populate its address
>>>> at the address exported in the HEST. Guest UEFI should not be involved in this
>>>> flow. Its job was to create the HEST at boot and that has been done by this
>>>> stage.
>>>>
>>>> Qemu folk will be able to add but it looks like support for CPER generation will
>>>> need to be added to Qemu. We need to resolve this.
>>>>
>>>> Do shout if I am missing anything above.
>>>
>>> After reading this email, the use case looks *very* similar to what
>>> we've just done with VMGENID for QEMU 2.9.
>>>
>>> We have a facility between QEMU and the guest firmware, called "ACPI
>>> linker/loader", with which QEMU instructs the firmware to
>>>
>>> - allocate and download blobs into guest RAM (AcpiNVS type memory) --
>>> ALLOCATE command,
>>>
>>> - relocate pointers in those blobs, to fields in other (or the same)
>>> blobs -- ADD_POINTER command,
>>>
>>> - set ACPI table checksums -- ADD_CHECKSUM command,
>>>
>>> - and send GPAs of fields within such blobs back to QEMU --
>>> WRITE_POINTER command.
>>>
>>> This is how I imagine we can map the facility to the current use case
>>> (note that this is the first time I read about HEST / GHES / CPER):
>>>
>>>     etc/acpi/tables                 etc/hardware_errors
>>>     ================     ==========================================
>>>                          +-----------+
>>>     +--------------+     | address   |         +-> +--------------+
>>>     |    HEST      +     | registers |         |   | Error Status |
>>>     + +------------+     | +---------+         |   | Data Block 1 |
>>>     | | GHES       | --> | | address | --------+   | +------------+
>>>     | | GHES       | --> | | address | ------+     | |  CPER      |
>>>     | | GHES       | --> | | address | ----+ |     | |  CPER      |
>>>     | | GHES       | --> | | address | -+  | |     | |  CPER      |
>>>     +-+------------+     +-+---------+  |  | |     +-+------------+
>>>                                         |  | |
>>>                                         |  | +---> +--------------+
>>>                                         |  |       | Error Status |
>>>                                         |  |       | Data Block 2 |
>>>                                         |  |       | +------------+
>>>                                         |  |       | |  CPER      |
>>>                                         |  |       | |  CPER      |
>>>                                         |  |       +-+------------+
>>>                                         |  |
>>>                                         |  +-----> +--------------+
>>>                                         |          | Error Status |
>>>                                         |          | Data Block 3 |
>>>                                         |          | +------------+
>>>                                         |          | |  CPER      |
>>>                                         |          +-+------------+
>>>                                         |
>>>                                         +--------> +--------------+
>>>                                                    | Error Status |
>>>                                                    | Data Block 4 |
>>>                                                    | +------------+
>>>                                                    | |  CPER      |
>>>                                                    | |  CPER      |
>>>                                                    | |  CPER      |
>>>                                                    +-+------------+
>>>
>>> (1) QEMU generates the HEST ACPI table. This table goes in the current
>>> "etc/acpi/tables" fw_cfg blob. Given N error sources, there will be N
>>> GHES objects in the HEST.
>>>
>>> (2) We introduce a new fw_cfg blob called "etc/hardware_errors". QEMU
>>> also populates this blob.
>>>
>>> (2a) Given N error sources, the (unnamed) table of address registers
>>> will contain N address registers.
>>>
>>> (2b) Given N error sources, the "etc/hardwre_errors" fw_cfg blob will
>>> also contain N Error Status Data Blocks.
>>>
>>> I don't know about the sizing (number of CPERs) each Error Status Data
>>> Block has to contain, but I understand it is all pre-allocated as far as
>>> the OS is concerned, which matches our capabilities well.
>> here I have a question. as you comment: " 'etc/hardwre_errors' fw_cfg blob will also contain N Error Status Data Blocks",
>> Because the CPER numbers is not fixed, how to assign each "Error Status Data Block" size using one "etc/hardwre_errors" fw_cfg blob.
>> when use one etc/hardwre_errors, will the N Error Status Data Block use one continuous buffer? as shown below. if so, maybe it not convenient for each data block size extension.
>> I see the bios_linker_loader_alloc will allocate one continuous buffer for a blob(such as VMGENID_GUID_FW_CFG_FILE)
>>
>>     /* Allocate guest memory for the Data fw_cfg blob */
>>     bios_linker_loader_alloc(linker, VMGENID_GUID_FW_CFG_FILE, guid, 4096,
>>                              false /* page boundary, high memory */);
>>
>>
>>
>> -> +--------------+
>>      |    HEST      +     | registers |             | Error Status |
>>      + +------------+     | +---------+             | Data Block  |
>>      | | GHES       | --> | | address | --------+-->| +------------+
>>      | | GHES       | --> | | address | ------+     | |  CPER      |
>>      | | GHES       | --> | | address | ----+ |     | |  CPER      |
>>      | | GHES       | --> | | address | -+  | |     | |  CPER      |
>>      +-+------------+     +-+---------+  |  | +---> +--------------+
>>                                          |  |       | |  CPER      |
>>                                          |  |       | |  CPER      |
>>                                          |  +-----> +--------------+
>>                                          |          | |  CPER      |
>>                                          +--------> +--------------+
>>                                                     | |  CPER      |
>>                                                     | |  CPER      |
>>                                                     | |  CPER      |
>>                                                     +-+------------+
>>
>>
>>
>> so how about we use separate etc/hardwre_errorsN for each Error Status status Block? then
>>
>> etc/hardwre_errors0
>> etc/hardwre_errors1
>> ...................
>> etc/hardwre_errors10
>> (the max N is 10)
>>
>>
>> the N can be one of below values, according to ACPI spec "Table 18-345 Hardware Error Notification Structure"
>> 0 – Polled
>> 1 – External Interrupt
>> 2 – Local Interrupt
>> 3 – SCI
>> 4 – NMI
>> 5 - CMCI
>> 6 - MCE
>> 7 - GPIO-Signal
>> 8 - ARMv8 SEA
>> 9 - ARMv8 SEI
>> 10 - External Interrupt - GSIV
> 
> I'm unsure if, by "not fixed", you are saying
> 
>   the number of CPER entries that fits in Error Status Data Block N is
>   not *uniform* across 0 <= N <= 10 [1]
> 
> or
> 
>   the number of CPER entries that fits in Error Status Data Block N is
>   not *known* in advance, for all of 0 <= N <= 10 [2]
> 
> Which one is your point?
> 
> If [1], that's no problem; you can simply sum the individual error
> status data block sizes in advance, and allocate "etc/hardware_errors"
> accordingly, using the total size.
> 
> (Allocating one shared fw_cfg blob for all status data blocks is more
> memory efficient, as each ALLOCATE command will allocate whole pages
> (rounded up from the actual blob size).)
> 
> If your point is [2], then splitting the error status data blocks to
> separate fw_cfg blobs makes no difference: regardless of whether we try
> to place all the error status data blocks in a single fw_cfg blob, or in
> separate fw_cfg blobs, the individual data block cannot be resized at OS
> runtime, so there's no way to make it work.
>
My Point is [2]. The HEST(Hardware Error Source Table) table format is here:
https://wiki.linaro.org/LEG/Engineering/Kernel/RAS/APEITables#Hardware_Error_Source_Table_.28HEST.29

Now I understand your thought.

> Thanks,
> Laszlo
> 
>>
>>
>>
>>
>>>
>>> (3) QEMU generates the ACPI linker/loader script for the firmware, as
>>> always.
>>>
>>> (3a) The HEST table is part of "etc/acpi/tables", which the firmware
>>> already allocates memory for, and downloads (because QEMU already
>>> generates an ALLOCATE linker/loader command for it already).
>>>
>>> (3b) QEMU will have to create another ALLOCATE command for the
>>> "etc/hardware_errors" blob. The firmware allocates memory for this blob,
>>> and downloads it.
>>>
>>> (4) QEMU generates, in the ACPI linker/loader script for the firwmare, N
>>> ADD_POINTER commands, which point the GHES."Error Status
>>> Address" fields in the HEST table, to the corresponding address
>>> registers in the downloaded "etc/hardware_errors" blob.
>>>
>>> (5) QEMU generates an ADD_CHECKSUM command for the firmware, so that the
>>> HEST table is correctly checksummed after executing the N ADD_POINTER
>>> commands from (4).
>>>
>>> (6) QEMU generates N ADD_POINTER commands for the firmware, pointing the
>>> address registers (located in guest memory, in the downloaded
>>> "etc/hardware_errors" blob) to the respective Error Status Data Blocks.
>>>
>>> (7) (This is the trick.) For this step, we need a third, write-only
>>> fw_cfg blob, called "etc/hardware_errors_addr". Through that blob, the
>>> firmware can send back the guest-side allocation addresses to QEMU.
>>>
>>> Namely, the "etc/hardware_errors_addr" blob contains N 8-byte entries.
>>> QEMU generates N WRITE_POINTER commands for the firmware.
>>>
>>> For error source K (0 <= K < N), QEMU instructs the firmware to
>>> calculate the guest address of Error Status Data Block K, from the
>>> QEMU-dictated offset within "etc/hardware_errors", and from the
>>> guest-determined allocation base address for "etc/hardware_errors". The
>>> firmware then writes the calculated address back to fw_cfg file
>>> "etc/hardware_errors_addr", at offset K*8, according to the
>>> WRITE_POINTER command.
>>>
>>> This way QEMU will know the GPA of each Error Status Data Block.
>>>
>>> (In fact this can be simplified to a single WRITE_POINTER command: the
>>> address of the "address register table" can be sent back to QEMU as
>>> well, which already contains all Error Status Data Block addresses.)
>>>
>>> (8) When QEMU gets SIGBUS from the kernel -- I hope that's going to come
>>> through a signalfd -- QEMU can format the CPER right into guest memory,
>>> and then inject whatever interrupt (or assert whatever GPIO line) is
>>> necessary for notifying the guest.
>>>
>>> (9) This notification (in virtual hardware) can either be handled by the
>>> guest kernel stand-alone, or else the guest kernel can invoke an ACPI
>>> event handler method with it (which would be in the DSDT or one of the
>>> SSDTs, also generated by QEMU). The ACPI event handler method could
>>> invoke the specific guest kernel driver for errror handling via a
>>> Notify() operation.
>>>
>>> I'm attracted to the above design because:
>>> - it would leave the firmware alone after OS boot, and
>>> - it would leave the firmware blissfully ignorant about HEST, GHES,
>>> CPER, and the like. (That's why QEMU's ACPI linker/loader was invented
>>> in the first place.)
>>>
>>> Thanks
>>> Laszlo
>>>
>>>>>    Do you think which modules generates the APEI table is better? UEFI or Qemu?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 2017/3/28 21:40, James Morse wrote:
>>>>>> Hi gengdongjiu,
>>>>>>
>>>>>> On 28/03/17 13:16, gengdongjiu wrote:
>>>>>>> On 2017/3/28 19:54, Achin Gupta wrote:
>>>>>>>> On Tue, Mar 28, 2017 at 01:23:28PM +0200, Christoffer Dall wrote:
>>>>>>>>> On Tue, Mar 28, 2017 at 11:48:08AM +0100, James Morse wrote:
>>>>>>>>>> On the host, part of UEFI is involved to generate the CPER records.
>>>>>>>>>> In a guest?, I don't know.
>>>>>>>>>> Qemu could generate the records, or drive some other component to do it.
>>>>>>>>>
>>>>>>>>> I think I am beginning to understand this a bit.  Since the guet UEFI
>>>>>>>>> instance is specifically built for the machine it runs on, QEMU's virt
>>>>>>>>> machine in this case, they could simply agree (by some contract) to
>>>>>>>>> place the records at some specific location in memory, and if the guest
>>>>>>>>> kernel asks its guest UEFI for that location, things should just work by
>>>>>>>>> having logic in QEMU to process error reports and populate guest memory.
>>>>>>>>>
>>>>>>>>> Is this how others see the world too?
>>>>>>>>
>>>>>>>> I think so!
>>>>>>>>
>>>>>>>> AFAIU, the memory where CPERs will reside should be specified in a GHES entry in
>>>>>>>> the HEST. Is this not the case with a guest kernel i.e. the guest UEFI creates a
>>>>>>>> HEST for the guest Kernel?
>>>>>>>>
>>>>>>>> If so, then the question is how the guest UEFI finds out where QEMU (acting as
>>>>>>>> EL3 firmware) will populate the CPERs. This could either be a contract between
>>>>>>>> the two or a guest DXE driver uses the MM_COMMUNICATE call (see [1]) to ask QEMU
>>>>>>>> where the memory is.
>>>>>>>
>>>>>>> whether invoke the guest UEFI will be complex? not see the advantage. it seems x86 Qemu
>>>>>>> directly generate the ACPI table, but I am not sure, we are checking the qemu
>>>>>> logical.
>>>>>>> let Qemu generate CPER record may be clear.
>>>>>>
>>>>>> At boot UEFI in the guest will need to make sure the areas of memory that may be
>>>>>> used for CPER records are reserved. Whether UEFI or Qemu decides where these are
>>>>>> needs deciding, (but probably not here)...
>>>>>>
>>>>>> At runtime, when an error has occurred, I agree it would be simpler (fewer
>>>>>> components involved) if Qemu generates the CPER records. But if UEFI made the
>>>>>> memory choice above they need to interact and it gets complicated again. The
>>>>>> CPER records are defined in the UEFI spec, so I would expect UEFI to contain
>>>>>> code to generate/parse them.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> James
>>>>>>
>>>>>>
>>>>>> .
>>>>>>
>>>>>
>>>
>>>
>>> .
>>>
>>
> 
> 
> .
>