[Question] How to test SDEI client driver

Paolo Bonzini pbonzini at redhat.com
Thu Jul 9 14:50:21 EDT 2020


On 09/07/20 20:30, James Morse wrote:
> I can see why this would be useful. Is it widely used, or a bit of a niche sport?
> I don't recall seeing anything about it last time I played with migration...

It's widely used (at least Google uses it a lot).

>> The main showstopper is that you cannot rely on guest cooperation (also
>> because it works surprisingly well without).
> 
> Aren't we changing the guest kernel to support this? Certainly I agree the guest may not
> know about anything.

Yes, but the fallback is synchronous page faults, so if the guest
doesn't cooperate it's business as usual.

Now I see better what you meant: the "fake swap" device would not
prevent *other* memory from being swapped even without the guest's
consent.  However, in the case of e.g. postcopy live migration you're
in a bit of a bind, because the working set is both 1) exactly the set
of pages that are unlikely to be ready on the destination, and 2) the
set of pages that the guest would choose not to place in the "fake
swap".

>> Are there usecases for injecting SDEIs from QEMU?
> 
> Yes. RAS.
> 
> When the VMM takes a SIGBUS:MCEERR_AO it can decide if and how to report this to the
> guest. If it advertised firmware-first support at boot, there are about five options, of
> which SDEI is one. It could emulate something platform specific, or do nothing at all.

Ok, for x86 there's an ioctl (KVM_X86_SET_MCE) to inject MCEs, and I
was not really sure how ARM does it.  But it could still be a
kernel-managed event, just one that QEMU can trigger at will.
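
For reference, the x86 path from the VMM side looks roughly like the
sketch below: KVM_X86_SET_MCE is issued on the vCPU file descriptor
with a struct kvm_x86_mce describing the error.  The bank number, the
choice of IA32_MCi_STATUS bits and the omitted MCG_STATUS handling are
illustrative only; a real VMM such as QEMU derives them from the
SIGBUS siginfo and the guest's MCE capabilities.

/*
 * Hedged sketch: inject an uncorrected, "action optional" memory error
 * into a guest vCPU with KVM_X86_SET_MCE.  Bank 9 and the status bits
 * chosen here are illustrative; mcg_status handling is omitted.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#define MCI_STATUS_VAL   (1ULL << 63)   /* status register contents are valid */
#define MCI_STATUS_UC    (1ULL << 61)   /* uncorrected error */
#define MCI_STATUS_EN    (1ULL << 60)   /* error reporting enabled */
#define MCI_STATUS_ADDRV (1ULL << 58)   /* the addr field is valid */
#define MCI_STATUS_S     (1ULL << 56)   /* signalled via machine check */

static int inject_memory_error(int vcpu_fd, uint64_t guest_paddr)
{
        struct kvm_x86_mce mce = {
                .status = MCI_STATUS_VAL | MCI_STATUS_UC | MCI_STATUS_EN |
                          MCI_STATUS_ADDRV | MCI_STATUS_S,
                .addr   = guest_paddr,   /* guest physical address of the bad page */
                .bank   = 9,             /* illustrative bank number */
        };

        return ioctl(vcpu_fd, KVM_X86_SET_MCE, &mce);
}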

>> If not, it can be done much more easily with KVM
> 
> The SDEI state would need to be exposed to Qemu to be migrated. If Qemu wants to use it
> for a shared event which is masked on the local vCPU, we'd need to force the other vCPUs to
> exit to see if they can take it. It's not impossible, just very fiddly.
> 
> The mental-model I try to stick to is the VMM is the firmware for the guest, and KVM
> 'just' does the stuff it has to to maintain the illusion of real hardware, e.g. plumbing
> stage2 page faults into mm as if they were taken from the VMM, and making the
> timers+counters work.

Actually, I try to make the firmware for the guest... the actual
firmware of the guest (with paravirtualized help from the hypervisor or
VMM when needed).  I think Marc and I have argued about that a lot,
though, so it may not be the most common view in the ARM world!

My model is that KVM does the processor stuff, while the VMM does
everything else.  It doesn't always match; for example, KVM does more
GIC emulation than would fit this model (IIRC it handles the
distributor?).  But this is why I would prefer to put the system
register manipulation in KVM rather than the VMM, possibly with ioctls
on the vCPU file descriptor for use from the VMM.
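
This split already exists for individual registers: KVM exposes guest
system registers to the VMM through KVM_GET_ONE_REG/KVM_SET_ONE_REG on
the vCPU file descriptor, and SDEI state could in principle be
surfaced the same way.  A minimal sketch, assuming an arm64 host;
MPIDR_EL1 is used purely as an example of the sysreg encoding:

/*
 * Hedged sketch: read an AArch64 system register of a guest vCPU from
 * the VMM with KVM_GET_ONE_REG.  MPIDR_EL1 (op0=3, op1=0, CRn=0,
 * CRm=0, op2=5) is just an example; any sysreg KVM exposes works the
 * same way.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>   /* pulls in asm/kvm.h, which defines ARM64_SYS_REG() */

static int read_mpidr_el1(int vcpu_fd, uint64_t *value)
{
        struct kvm_one_reg reg = {
                .id   = ARM64_SYS_REG(3, 0, 0, 0, 5),   /* MPIDR_EL1 */
                .addr = (uint64_t)(uintptr_t)value,     /* userspace buffer to fill */
        };

        return ioctl(vcpu_fd, KVM_GET_ONE_REG, &reg);
}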

> 
>> (and it would also
>> would be really, really slow if each page fault had to be redirected
>> through QEMU),
> 
> Isn't this already true for any post-copy live migration?
> There must be some way of telling Qemu that this page is urgently needed ahead of whatever
> it is copying at the moment.

It's done with userfaultfd, so it's entirely asynchronous.  It's
important to get the fault delivered to the guest quickly, however,
because that affects the latency.
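
For context, the destination side of postcopy follows roughly the
pattern below.  This is a trimmed-down sketch, not QEMU's code: error
handling is dropped and fetch_page_from_source() is a placeholder for
the migration transport that actually pulls the page from the source
host.

/*
 * Hedged sketch of the userfaultfd loop on the postcopy destination:
 * register the guest RAM range for MISSING faults, then resolve each
 * fault by fetching the page out of order and installing it with
 * UFFDIO_COPY.
 */
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <linux/userfaultfd.h>

/* Placeholder for the migration transport (urgent, out-of-order request). */
extern void fetch_page_from_source(uint64_t guest_addr, void *buf, size_t len);

static void serve_postcopy_faults(void *guest_ram, size_t ram_size, size_t page_size)
{
        static char buf[65536] __attribute__((aligned(65536)));   /* staging page */
        int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

        struct uffdio_api api = { .api = UFFD_API };
        ioctl(uffd, UFFDIO_API, &api);

        struct uffdio_register reg = {
                .range = { .start = (uintptr_t)guest_ram, .len = ram_size },
                .mode  = UFFDIO_REGISTER_MODE_MISSING,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        for (;;) {
                struct uffd_msg msg;

                if (read(uffd, &msg, sizeof(msg)) != sizeof(msg) ||
                    msg.event != UFFD_EVENT_PAGEFAULT)
                        continue;

                uint64_t fault = msg.arg.pagefault.address & ~(uint64_t)(page_size - 1);

                /* The vCPU is blocked on this page: ask for it ahead of the
                 * background copy, then wake the faulting thread. */
                fetch_page_from_source(fault, buf, page_size);

                struct uffdio_copy copy = {
                        .dst = fault,
                        .src = (uintptr_t)buf,
                        .len = page_size,
                };
                ioctl(uffd, UFFDIO_COPY, &copy);
        }
}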

> There are always going to be pages we must have, and can't make progress until we do. (The
> vectors, the irq handlers .. in modules ..)

Yup, but fortunately those don't change often.  For postcopy the
problematic pages are ironically the ones in the working set, not the
ones that never change (because those can be migrated just fine)!

>>>> Yes, the SDEI specification already mentions
>>>> this: the client handler should have all required resources in place before
>>>> the handler runs.  However, I don't see it as a problem so far.
>>>
>>> What if they are swapped out? This thing becomes re-entrant ... which the spec forbids.
>>> The host has no clue what is in guest memory.
> 
>> On x86 we don't do the notification if interrupts are disabled. 
> 
> ... because you can't schedule()? What about CONFIG_PREEMPT? arm64 has that enabled in its
> defconfig. I anticipate it's the most common option.

No, not because we can't schedule(), but because it would be a
reentrancy nightmare.  Actually it's even stricter: we don't do the
notification at all if we're in supervisor mode.  As I said above, we
don't expect that to be a big deal, because the pages with the most
churn during live migration will be userspace data.
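
Spelled out as code, the rule is just a predicate the host evaluates
before turning a fault on a not-yet-present page into a notification
rather than a plain stall.  This is a self-contained illustration with
a made-up vcpu_state struct, not KVM's actual code:

/*
 * Hedged illustration of the delivery policy above; the struct is made
 * up for the example and is not KVM's struct kvm_vcpu.
 */
#include <stdbool.h>

struct vcpu_state {
        bool async_pf_enabled;   /* guest opted in to async page faults */
        bool in_user_mode;       /* the fault was taken from CPL 3 / EL0 */
};

static bool can_notify_async_pf(const struct vcpu_state *vcpu)
{
        if (!vcpu->async_pf_enabled)
                return false;    /* no guest support: just stall the vCPU */
        if (!vcpu->in_user_mode)
                return false;    /* supervisor mode: reentrancy hazard, stall */
        return true;             /* inject "page not present", let the guest run */
}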

>> On ARM
>> I guess you'd do the same until SDEI_EVENT_COMPLETE (so yeah that would
>> be some state that has to be migrated).
> 
> My problem with SDEI is the extra complexity it brings for features you don't want.
> It's an NMI; that's the last thing you want, as you can't schedule()

Scheduling can be done outside the NMI handler as long as it's done
before returning to EL3.  But yeah, on x86 it's nice that the page fault
exception handler can schedule() just fine (see
kvm_async_pf_task_wait_schedule in arch/x86/kernel/kvm.c).
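
A heavily simplified sketch of the idea behind
kvm_async_pf_task_wait_schedule(), assuming kernel context: the "page
not present" path sleeps on a token, and the "page ready" path wakes
it.  The real code keys a hash table on the full token and handles
racing notifications; here a small bitmap of slots stands in for that.

/*
 * Hedged sketch, not the real arch/x86/kernel/kvm.c code: park the
 * faulting task on a token until the host reports the page is ready.
 * Tokens that collide on a slot only cause a spurious wakeup here.
 */
#include <linux/bitmap.h>
#include <linux/types.h>
#include <linux/wait.h>

#define APF_SLOTS 64

static DECLARE_BITMAP(apf_ready, APF_SLOTS);
static DECLARE_WAIT_QUEUE_HEAD(apf_wq);

/* "Page not present": called from the fault handler, may schedule(). */
static void apf_wait(u32 token)
{
        wait_event(apf_wq, test_and_clear_bit(token % APF_SLOTS, apf_ready));
}

/* "Page ready": called when the host signals the page has been paged in. */
static void apf_wake(u32 token)
{
        set_bit(token % APF_SLOTS, apf_ready);
        wake_up(&apf_wq);
}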

> PPI are a scarce resource, so this would need some VMM involvement at boot to say which
> PPI can be used. We do this for the PMU too.

Yep, this is part of why we didn't consider PPIs.

> ... I'd like to look into how x86 uses this, and what other hypervisors may do in this
> area. (another nightmare is supporting similar but different things for KVM, Xen, HyperV
> and VMWare. I'm sure other hypervisors are available...)

I'm not sure any hypervisor other than KVM does it.  Well, IBM's
proprietary hypervisors do, but only on POWER or s390.

>> In fact it would be nice if
>> SDEI_EVENT_COMPLETE meant "wait for synchronous page-in" while
>> SDEI_EVENT_COMPLETE_AND_RESUME meant "handle it asynchronously".
> 
> Sneaky. How does x86 do this? I assume there is a hypercall for 'got it' or 'not now'.
> If we go the PPI route we could use the same. (Ideally as much as possible is done in
> common code)

It doesn't do it yet, but the plan is to have a hypercall to inform
KVM of the choice.  I explained to Gavin what v2.0 of the x86 interface
will look like, so that ARM can do it the same way from the start and
perhaps even share some code or data structures.
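
For completeness, here is roughly what issuing either completion call
looks like from an SDEI client, assuming kernel context and the SMC
conduit (a guest would typically use HVC, as negotiated with the
firmware or hypervisor).  Note that the "complete means wait,
complete-and-resume means asynchronous" meaning discussed above is only
a proposal from this thread; today both calls simply end the handler,
with the latter also redirecting the interrupted context to
resume_addr.

/*
 * Hedged sketch of the guest-side completion calls; arm_smccc_smc() is
 * used for brevity, although the conduit (SMC vs. HVC) depends on how
 * the SDEI interface was discovered.
 */
#include <linux/arm-smccc.h>
#include <linux/arm_sdei.h>
#include <linux/types.h>

static void complete_sdei_event(bool resume, unsigned long resume_addr)
{
        struct arm_smccc_res res;

        if (resume)
                arm_smccc_smc(SDEI_1_0_FN_SDEI_EVENT_COMPLETE_AND_RESUME,
                              resume_addr, 0, 0, 0, 0, 0, 0, &res);
        else
                arm_smccc_smc(SDEI_1_0_FN_SDEI_EVENT_COMPLETE,
                              0, 0, 0, 0, 0, 0, 0, &res);
}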

Paolo



