[Question] How to test the SDEI client driver

James Morse james.morse at arm.com
Thu Jul 9 14:31:30 EDT 2020


Hi Gavin,

On 09/07/2020 06:33, Gavin Shan wrote:
> On 7/9/20 2:49 AM, Paolo Bonzini wrote:
>> On 08/07/20 18:11, James Morse wrote:
>>> On 03/07/2020 01:26, Gavin Shan wrote:

>>>> For the SDEI
>>>> events needed by the async page fault, they originate from KVM (the host). To
>>>> achieve this, KVM needs some code so that SDEI events can be injected and
>>>> delivered. The SDEI-related hypercalls also need to be handled.
>>>
>>> I avoided doing this because it makes it massively complicated for the VMM. All that
>>> in-kernel state now has to be migrated. KVM has to expose APIs to let the VMM inject
>>> events, which gets nasty for shared events where some CPUs are masked, and others aren't.
>>>
>>> Having something like Qemu drive the reference code from TFA is the right thing to do for
>>> SDEI.
>>
>> Are there usecases for injecting SDEIs from QEMU?
>>
>> If not, it can be done much more easily with KVM (and it would also
>> be really, really slow if each page fault had to be redirected
>> through QEMU), which wouldn't have more than a handful of SDEI events.
>> The in-kernel state is 4 64-bit values (EP address and argument, flags,
>> affinity) per event.
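(For reference, the four per-event values listed above would look something like this; the struct and its field names are my guesses for illustration, not taken from any posted code:)

```c
#include <stdint.h>

/* Hypothetical sketch of the per-event state described above: four
 * 64-bit values per registered SDEI event. All names are illustrative
 * only, not from any real KVM implementation. */
struct kvm_sdei_event_state {
	uint64_t ep_address;	/* handler entry point (SDEI_EVENT_REGISTER) */
	uint64_t ep_arg;	/* opaque argument passed to the handler */
	uint64_t flags;		/* registered/enabled bits, routing mode, ... */
	uint64_t affinity;	/* target PE for routed shared events */
};
```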

> I don't think there is existing usercase to inject SDEIs from qemu.

use-case or user-space?

There was a series to add support for emulating firmware-first RAS. I think it got stuck
in the wider problem of how Qemu can consume reference code from TFA (the EL3 firmware) to
reduce the maintenance overhead. (Every time Arm adds something else up there, Qemu would
need to emulate it; it should be possible to consume the TFA reference code instead.)


> However, one ioctl command is reserved for this purpose in my code,
> so that QEMU can inject an SDEI event if needed.
> 
> Yes, my implementation injects the SDEI event directly from KVM, on
> request from a consumer like APF.

> By the way, I just finished splitting the code into RFC patches.
> Please let me know whether I should post it now to provide more details,
> or defer it until this discussion is finished.

I haven't been through the SDEI patches you posted yet. If you post a link to the branch I
can have a look to get a better idea of the shape of this thing...

(I've not gone looking for the x86 code yet)


>>>> Yes, the SDEI specification already mentions
>>>> this: the client handler should have all required resources in place before
>>>> the handler runs. However, I don't see that it's a problem so far.
>>>
>>> What if they are swapped out? This thing becomes re-entrant ... which the spec forbids.
>>> The host has no clue what is in guest memory.
>>
>> On x86 we don't do the notification if interrupts are disabled.  On ARM
>> I guess you'd do the same until SDEI_EVENT_COMPLETE (so yeah that would
>> be some state that has to be migrated).  In fact it would be nice if
>> SDEI_EVENT_COMPLETE meant "wait for synchronous page-in" while
>> SDEI_EVENT_COMPLETE_AND_RESUME meant "handle it asynchronously".

> I'm not sure I understand this issue completely. When the vCPU is preempted,
> all registers should have been saved to vcpu->arch.ctxt. The SDEI context is
> saved to vcpu->arch.ctxt as well. They will be restored when the vCPU runs
> again. From that point of view, it's not broken.
> 
> Yes, I plan to use private event, which is only visible to kvm and guest.
> Also, it has critical priority. The new SDEI event can't be delivered until
> the previous critical event is finished.
> 
> Paolo, it's an interesting idea to reuse SDEI_EVENT_COMPLETE/AND_RESUME. Do you
> mean to use these two hypercalls to designate PAGE_NOT_READY and PAGE_READY
> respectively? If so, please provide more details.

No, I think this suggestion is for the guest to hint back to the hypervisor whether it can
take this stage2 delay, or it must have the page to make progress.
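
If that's the intent, a toy model of the convention (all names here are invented for illustration):

```c
#include <stdbool.h>

/* Toy model of the suggested hint: which completion hypercall the guest
 * picks tells the host whether the vCPU can tolerate the stage2 delay.
 * The enum and helper are invented for illustration, not real KVM code. */
enum apf_policy {
	APF_PAGE_IN_SYNC,	/* SDEI_EVENT_COMPLETE: guest can't schedule(), resolve now */
	APF_PAGE_IN_ASYNC,	/* SDEI_EVENT_COMPLETE_AND_RESUME: guest can reschedule */
};

static enum apf_policy apf_policy_from_completion(bool complete_and_resume)
{
	return complete_and_resume ? APF_PAGE_IN_ASYNC : APF_PAGE_IN_SYNC;
}
```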

SDEI_EVENT_COMPLETE returns to wherever we came from; the arch code uses it if the
interrupted context couldn't have taken an IRQ. If it could have taken an IRQ, the arch
code uses SDEI_EVENT_COMPLETE_AND_RESUME to exit through the interrupt vector.

This trick gives us two things: a KVM guest exit when this is in use on real hardware,
and a run of the irq-work handler to do the work we couldn't do in NMI context, both
before we return to the context that triggered the fault in the first place.
Both are needed for the RAS support.
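
That exit-path choice can be modelled as below (the function IDs are from the SDEI spec, Arm DEN 0054; the helper itself is just a sketch, not the actual arch code):

```c
#include <stdbool.h>
#include <stdint.h>

/* SMC function IDs from the SDEI specification (Arm DEN 0054). */
#define SDEI_EVENT_COMPLETE            0xC4000025ULL
#define SDEI_EVENT_COMPLETE_AND_RESUME 0xC4000026ULL

/* Sketch of the decision above: if the interrupted context had IRQs
 * masked, return straight back; otherwise exit through the IRQ vector
 * so any irq-work queued by the handler runs first. */
static uint64_t sdei_completion_call(bool irqs_were_enabled)
{
	return irqs_were_enabled ? SDEI_EVENT_COMPLETE_AND_RESUME
				 : SDEI_EVENT_COMPLETE;
}
```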


The problem is invoking this whole thing when the guest can't do anything about it,
because it can't schedule(). You can't know this from outside the guest.


Thanks,

James



More information about the linux-arm-kernel mailing list