[Question] How to test the SDEI client driver

Gavin Shan gshan at redhat.com
Fri Jul 10 05:08:28 EDT 2020


Hi James,

On 7/10/20 4:31 AM, James Morse wrote:
> On 09/07/2020 06:33, Gavin Shan wrote:
>> On 7/9/20 2:49 AM, Paolo Bonzini wrote:
>>> On 08/07/20 18:11, James Morse wrote:
>>>> On 03/07/2020 01:26, Gavin Shan wrote:
> 
>>>>> For the SDEI
>>>>> events needed by the async page fault, they originate from KVM (the host). In order
>>>>> to achieve this, KVM needs some code so that SDEI events can be injected and
>>>>> delivered. Also, the SDEI-related hypercalls need to be handled as well.
>>>>
>>>> I avoided doing this because it makes it massively complicated for the VMM. All that
>>>> in-kernel state now has to be migrated. KVM has to expose APIs to let the VMM inject
>>>> events, which gets nasty for shared events where some CPUs are masked, and others aren't.
>>>>
>>>> Having something like Qemu drive the reference code from TFA is the right thing to do for
>>>> SDEI.
>>>
>>> Are there usecases for injecting SDEIs from QEMU?
>>>
>>> If not, it can be done much more easily with KVM (and it would also
>>> be really, really slow if each page fault had to be redirected
>>> through QEMU), which wouldn't have more than a handful of SDEI events.
>>> The in-kernel state is 4 64-bit values (EP address and argument, flags,
>>> affinity) per event.
> 
>> I don't think there is an existing use case to inject SDEIs from QEMU.
> 
> use-case or user-space?
> 
> There was a series to add support for emulating firmware-first RAS. I think it got stuck
> in the wider problem of how Qemu can consume reference code from TFA (the EL3 firmware) to
> reduce the maintenance overhead. (Every time Arm adds something else up there, Qemu would
> need to emulate it. It should be possible to consume the TFA reference code)
> 

I'm not sure if that patchset was ever posted. If so, could you
share a link to it? I'll take a look when I get a chance.

>> However, one ioctl command is reserved for this purpose
>> in my code, so that QEMU can inject SDEI event if needed.
>>
>> Yes, my implementation injects SDEI events directly from KVM, on
>> request received from a consumer like APF.
> 
>> By the way, I just finished splitting the code into RFC patches.
>> Please let me know if I should post it to provide more details, or if it
>> should be deferred until this discussion is finished.
> 
> I need to go through the SDEI patches you posted yet. If you post a link to the branch I
> can have a look to get a better idea of the shape of this thing...
> 
> (I've not gone looking for the x86 code yet)
> 

Sure. Here is the link to the git repo:

https://github.com/gwshan/linux.git

branch ("sdei_client"): the sdei client driver rework series I posted.
branch ("sdei"): the patches to make SDEI virtualized, which bases on "sdei_client".

>>>>> Yes, the SDEI specification already mentions
>>>>> this: the client handler should have all required resources in place before
>>>>> the handler runs. However, I don't see it as a problem so far.
>>>>
>>>> What if they are swapped out? This thing becomes re-entrant ... which the spec forbids.
>>>> The host has no clue what is in guest memory.
>>>
>>> On x86 we don't do the notification if interrupts are disabled.  On ARM
>>> I guess you'd do the same until SDEI_EVENT_COMPLETE (so yeah that would
>>> be some state that has to be migrated).  In fact it would be nice if
>>> SDEI_EVENT_COMPLETE meant "wait for synchronous page-in" while
>>> SDEI_EVENT_COMPLETE_AND_RESUME meant "handle it asynchronously".
> 
>> I'm not sure I understand this issue completely. When the vCPU is preempted,
>> all registers should have been saved to vcpu->arch.ctxt. The SDEI context is
>> saved to vcpu->arch.ctxt as well. They will be restored when the vCPU runs
>> again afterwards. From the semantics perspective, nothing is broken.
>>
>> Yes, I plan to use private event, which is only visible to kvm and guest.
>> Also, it has critical priority. The new SDEI event can't be delivered until
>> the previous critical event is finished.
>>
>> Paolo, it's an interesting idea to reuse SDEI_EVENT_COMPLETE/AND_RESUME. Do you
>> mean to use these two hypercalls to designate PAGE_NOT_READY and PAGE_READY
>> respectively? If so, please provide more details.
> 
> No, I think this suggestion is for the guest to hint back to the hypervisor whether it can
> take this stage2 delay, or it must have the page to make progress.
> 
> SDEI_EVENT_COMPLETE returns to wherever we came from, the arch code will do this if it
> couldn't have taken an IRQ. If it could have taken an IRQ, it uses
> SDEI_EVENT_COMPLETE_AND_RESUME to exit through the interrupt vector.
> 
> This is a trick that gives us two things: KVM guest exit when this is in use on real
> hardware, and the irq-work handler runs to do the work we couldn't do in NMI context, both
> before we return to the context that triggered the fault in the first place.
> Both are needed for the RAS support.
> 

Ok, thanks for the information, which makes things much clearer.
So SDEI_EVENT_COMPLETE or SDEI_EVENT_COMPLETE_AND_RESUME is issued
depending on whether the current process can be rescheduled. I think
that was Paolo's idea?
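
To make that concrete, here is a minimal sketch of the guest-side
decision, modeled loosely on do_sdei_event() in arch/arm64/kernel/sdei.c
(the sdei_api_event_complete*() wrappers and irq_vector_entry are
hypothetical names for the two hypercalls and the IRQ vector address):

    /* Runs in the NMI-like SDEI context; must not sleep. */
    static void sdei_apf_handler(struct pt_regs *regs)
    {
            if (!interrupts_enabled(regs)) {
                    /*
                     * We interrupted a context that couldn't have taken an
                     * IRQ, so return straight back to where we came from.
                     */
                    sdei_api_event_complete();
            } else {
                    /*
                     * We could have taken an IRQ: exit through the interrupt
                     * vector so pending work (including schedule()) runs
                     * before the faulting context resumes.
                     */
                    sdei_api_event_complete_and_resume(irq_vector_entry);
            }
    }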

> 
> The problem is invoking this whole thing when the guest can't do anything about it,
> because it can't schedule(). You can't know this from outside the guest.
> 

Yes, the interrupted process can't call schedule() before SDEI_EVENT_COMPLETE,
at least because the SDEI event handler has to finish as quickly as possible.
The flow looks like this:

              process  ->         SDEI event triggered
                                        |
                                  SDEI event handler is called
                                        |
             schedule() <-        SDEI_EVENT_COMPLETE

As we don't have a schedule() call in place in advance, we might need to
figure out a way for the SDEI event handler to arrange for schedule() to run.
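
For example, the handler could queue an irq_work (which James mentioned
above) and complete with SDEI_EVENT_COMPLETE_AND_RESUME, so the wakeup
runs from the IRQ exit path. A rough sketch, where apf_waiter is a
hypothetical pointer to the task parked on the faulting page:

    #include <linux/arm_sdei.h>
    #include <linux/irq_work.h>
    #include <linux/sched.h>

    static struct task_struct *apf_waiter;  /* hypothetical waiter */

    static void apf_wake_fn(struct irq_work *work)
    {
            /* Runs from the IRQ path after the SDEI handler completed,
             * so waking the task (and a later schedule()) is safe. */
            if (apf_waiter)
                    wake_up_process(apf_waiter);
    }

    static DEFINE_IRQ_WORK(apf_wake_work, apf_wake_fn);

    /* SDEI callback, invoked in NMI-like context. */
    static int sdei_apf_event_handler(u32 event, struct pt_regs *regs,
                                      void *arg)
    {
            irq_work_queue(&apf_wake_work);  /* NMI-safe deferral */
            return 0;
    }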

Thanks,
Gavin



