[Question] How to test the SDEI client driver

James Morse james.morse at arm.com
Thu Jul 9 14:30:06 EDT 2020


Hi Paolo, Gavin,

On 08/07/2020 17:49, Paolo Bonzini wrote:
> On 08/07/20 18:11, James Morse wrote:
>> On 03/07/2020 01:26, Gavin Shan wrote:
>>> On 7/1/20 9:57 PM, James Morse wrote:
>>>> On 30/06/2020 06:17, Gavin Shan wrote:
>>>>> I'm currently looking into the SDEI client driver and reworking it so that
>>>>> it can provide capabilities/services to arm64/KVM to get it virtualized.
>>>>
>>>> What do you mean by virtualised? The expectation is the VMM would implement the 'firmware'
>>>> side of this. 'events' are most likely to come from the VMM, and having to handshake with
>>>> the kernel to work out if the event you want to inject is registered and enabled is
>>>> over-complicated. Supporting it in the VMM means you can notify a different vCPU if that
>>>> is appropriate, or take a different action if the event isn't registered.
>>>>
>>>> This was all blocked on finding a future-proof way for tools like Qemu to consume
>>>> reference code from ATF.
>>
>>> Sorry that I didn't explain the background last time. We plan to use SDEI to
>>> deliver the notification (signal) from host to guest, needed by the asynchronous
>>> page fault feature. The RFCv2 patchset was posted a while ago [1].
>>
>> Thanks. So this is to hint to the guest that you'd swapped its memory to disk. Yuck.
>>
>> When would you do this?

> These days, the main reason is on-demand paging with live migration.
> Instead of waiting to have a consistent version of guest memory on the
> destination, memory that the guest has dirtied can be copied on demand
> from source to destination while the guest is running.  Letting the
> guest reschedule is surprisingly effective in this case, especially with
> workloads that have a lot of threads.

Aha, so nothing to do with swap. This makes more sense.
New bedtime reading: "Post-Copy Live Migration of Virtual Machines" [0]

I can see why this would be useful. Is it widely used, or a bit of a niche sport?
I don't recall seeing anything about it last time I played with migration...
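
For anyone following along: on Linux this on-demand paging is built on userfaultfd, which
is what Qemu's postcopy uses. A rough sketch of the destination side, with error handling
and the actual fetch from the source elided (fetch_page_from_source() is a stand-in):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Stand-in for pulling one page out of the migration stream. */
extern void fetch_page_from_source(uint64_t ram_offset, void *buf, size_t len);

static void postcopy_serve(void *guest_ram, size_t ram_size,
                           void *page_buf, size_t page_size)
{
    int uffd = syscall(__NR_userfaultfd, O_CLOEXEC);

    struct uffdio_api api = { .api = UFFD_API };
    ioctl(uffd, UFFDIO_API, &api);

    /* Register the (still missing) guest RAM range. */
    struct uffdio_register reg = {
        .range = { .start = (uintptr_t)guest_ram, .len = ram_size },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
    };
    ioctl(uffd, UFFDIO_REGISTER, &reg);

    for (;;) {
        struct uffd_msg msg;

        /* Blocks until something touches a page that isn't there yet. */
        if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
            continue;
        if (msg.event != UFFD_EVENT_PAGEFAULT)
            continue;

        uint64_t addr = msg.arg.pagefault.address & ~(uint64_t)(page_size - 1);

        /* Pull this page out of order, ahead of the background copy. */
        fetch_page_from_source(addr - (uintptr_t)guest_ram, page_buf, page_size);

        /* Install it and wake the faulting vCPU thread. */
        struct uffdio_copy copy = {
            .dst = addr,
            .src = (uintptr_t)page_buf,
            .len = page_size,
        };
        ioctl(uffd, UFFDIO_COPY, &copy);
    }
}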


>> Isn't this roughly equivalent to SMT CPUs taking a cache-miss? ...
>> If you pinned two vCPUs to one physical CPU, the host:scheduler would multiplex between
>> them. If one couldn't do useful work because it was waiting for memory, the other gets
>> all the slack time. (the TLB maintenance would hurt, but not as much as waiting for the disk)
>> The good news is the guest:scheduler already knows how to deal with this!
>> (and, it works for other OS too)
> 
> The order of magnitude of both the wait and the reschedule is too
> different for SMT heuristics to be applicable here.  In particular, two SMT
> pCPUs compete equally for fetch resources, while two vCPUs pinned to the
> same pCPU would only reschedule a few hundred times per second.  Latency
> would be in the milliseconds and jitter would be horrible.
> 
>> Wouldn't it be better to let the guest make the swapping decision? 
>> You could provide a fast virtio swap device to the guest that is
>> backed by maybe-swapped host memory.

> I think you are describing something similar to "transcendent memory",
> which Xen implemented about 10 years ago
> (https://lwn.net/Articles/454795/).  Unfortunately you've probably never
> heard about it for good reasons. :)

Heh. With a name like that I expect it to solve all my problems!

I'm trying to work out what the problem with existing ways of doing this would be...


> The main showstopper is that you cannot rely on guest cooperation (also
> because it works surprisingly well without).

Aren't we changing the guest kernel to support this? Certainly I agree the guest may not
know anything about it.


>>> The SDEI events needed by
>>> async page fault originate from KVM (the host). In order
>>> to achieve the goal, KVM needs some code so that SDEI events can be injected and
>>> delivered. Also, the SDEI-related hypercalls need to be handled as well.
>>
>> I avoided doing this because it makes it massively complicated for the VMM. All that
>> in-kernel state now has to be migrated. KVM has to expose APIs to let the VMM inject
>> events, which gets nasty for shared events where some CPUs are masked, and others aren't.
>>
>> Having something like Qemu drive the reference code from TFA is the right thing to do for
>> SDEI.

> Are there usecases for injecting SDEIs from QEMU?

Yes. RAS.

When the VMM takes a SIGBUS:MCEERR_AO it can decide if and how to report this to the
guest. If it advertised firmware-first support at boot, there are about five options, of
which SDEI is one. It could emulate something platform specific, or do nothing at all.

The VMM owns the ACPI tables/DT, which advertise whether SDEI is supported, and for ACPI,
where the firmware-first CPER regions are and how they are notified. We don't pass RAS
stuff into the guest directly, we treat the VMM like any other user-space process.
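
Concretely, the asynchronous error shows up in the VMM as a SIGBUS with si_code
BUS_MCEERR_AO, and everything after that is VMM policy. A rough sketch, where
report_to_guest() is a stand-in for whichever notification mechanism was advertised
at boot (CPER + SDEI, a GPIO-signalled GHES, an emulated platform error, or nothing):

#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>

/* Stand-in for the VMM's guest-visible error reporting. */
extern void report_to_guest(void *poisoned_hva, size_t length);

static void vmm_sigbus_handler(int sig, siginfo_t *info, void *ucontext)
{
    if (info->si_code != BUS_MCEERR_AO)
        return;     /* BUS_MCEERR_AR (synchronous) needs different handling */

    /* si_addr is the poisoned host VA, si_addr_lsb the granule size. */
    report_to_guest(info->si_addr, (size_t)1 << info->si_addr_lsb);
}

static void vmm_install_sigbus_handler(void)
{
    struct sigaction sa = { 0 };

    sa.sa_sigaction = vmm_sigbus_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGBUS, &sa, NULL);
}

(A real VMM would defer the heavy lifting out of the signal handler; this just shows
where the decision point sits.)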


> If not, it can be done much more easily with KVM

The SDEI state would need to be exposed to Qemu to be migrated. If Qemu wants to use it
for a shared event which is masked on the local vCPU, we'd need to force the other vCPUs to
exit to see if they can take it. It's not impossible, just very fiddly.


The mental-model I try to stick to is the VMM is the firmware for the guest, and KVM
'just' does the stuff it has to do to maintain the illusion of real hardware, e.g. plumbing
stage2 page faults into mm as if they were taken from the VMM, and making the
timers+counters work.

Supporting SDEI on real hardware is done by manipulating system registers from EL3 firmware.
This falls firmly in the 'VMM is the firmware' court. It's possible for the VMM to inject
events using the existing KVM APIs, all that is missing is routing HVC to user-space for
the VMM to handle.
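
To show what that routing has to carry: from the guest's point of view an SDEI
registration is just an SMCCC HVC, with the function ID in x0 and the per-event state in
x1..x5. Roughly the sketch below, a simplified version of what drivers/firmware/arm_sdei.c
does (function IDs are the ones from include/uapi/linux/arm_sdei.h):

#include <linux/arm-smccc.h>
#include <linux/arm_sdei.h>
#include <linux/types.h>

/* Guest-side view of SDEI_EVENT_REGISTER: one HVC the 'firmware' must catch. */
static long sdei_event_register_hvc(u32 event_num, unsigned long ep_address,
                                    unsigned long ep_arg, unsigned long flags,
                                    unsigned long affinity)
{
    struct arm_smccc_res res;

    /* x0 = function ID, x1..x5 = event, entry point, argument, flags, affinity */
    arm_smccc_1_1_hvc(SDEI_1_0_FN_SDEI_EVENT_REGISTER, event_num, ep_address,
                      ep_arg, flags, affinity, &res);

    /* Negative x0 is an SDEI error code (the real driver maps these to errno). */
    return (long)res.a0;
}

Today KVM handles these HVCs (or doesn't) in the kernel; routing them out to the VMM is
the missing piece.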


> (and it would also
> be really, really slow if each page fault had to be redirected
> through QEMU),

Isn't this already true for any post-copy live migration?
There must be some way of telling Qemu that this page is urgently needed ahead of whatever
it is copying at the moment.

There are always going to be pages we must have, and can't make progress until we do. (The
vectors, the irq handlers .. in modules ..)


> which wouldn't have more than a handful of SDEI events.
> The in-kernel state is 4 64-bit values (EP address and argument, flags,
> affinity) per event.

flags: normal/critical, registered, enabled, in-progress and pending.
Pending might be backed by an IRQ that changes behind your back.
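
Put together with Paolo's four values, the per-event state a VMM would need to
save/restore looks roughly like this. (A hypothetical layout for illustration, not an
existing KVM ABI.)

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-event SDEI state to be migrated. */
struct sdei_event_state {
    uint64_t ep_address;   /* handler entry point from SDEI_EVENT_REGISTER */
    uint64_t ep_arg;       /* opaque argument handed to the handler */
    uint64_t flags;        /* routing mode: RM_ANY vs RM_PE, etc. */
    uint64_t affinity;     /* target PE for directed events */

    /* The run-time state listed above: */
    bool     critical;     /* normal vs critical priority */
    bool     registered;
    bool     enabled;
    bool     in_progress;
    bool     pending;      /* may be backed by an IRQ that changes behind your back */
};

/* Plus per-vCPU masking state (PE_MASK/PE_UNMASK), which matters for shared events. */
struct sdei_vcpu_state {
    bool     masked;
};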


>>> Yes, the SDEI specification already mentions
>>> this: the client handler should have all required resources in place before
>>> the handler runs. However, I don't see it as a problem so far.
>>
>> What if they are swapped out? This thing becomes re-entrant ... which the spec forbids.
>> The host has no clue what is in guest memory.

> On x86 we don't do the notification if interrupts are disabled. 

... because you can't schedule()? What about CONFIG_PREEMPT? arm64 has that enabled in its
defconfig. I anticipate it's the most common option.

Even outside that, we have pseudo-NMI which means interrupts are unmasked at the CPU, even
in spin_lock_irqsave() regions, but instead filtered out at the interrupt controller.

I don't think KVM can know what this means by inspection: the guest chooses interrupt
priorities to separate the 'common' IRQs from the 'important' ones, but KVM can't tell
whether what is filtered out is merely 'common', or devices being deliberately ignored.


> On ARM
> I guess you'd do the same until SDEI_EVENT_COMPLETE (so yeah that would
> be some state that has to be migrated).

My problem with SDEI is the extra complexity it brings for features you don't want.
It's an NMI, which is the last thing you want as you can't schedule().
This use of SDEI is really for its synchronous exit through the irq handler, which you can
re-schedule from ... iff you took the event from a pre-emptible context...

Can we bypass the unnecessary NMI, and come in straight at the irq handler?

IRQ are asynchronous, but as this is a paravirt interface, the hypervisor can try to
guarantee a particular PPI (per cpu interrupt) that it generates is taken synchronously.
(I've yet to work out if the vGIC already does this, or we'd need to fake it in software)

By having a virtual-IRQ that the guest has registered, we can interpret the guest's
pseudo-NMI settings to know if this virtual-IRQ could be taken right now, which tells us
if the guest can handle the deferred stage2 fault, or it needs fixing before the guest can
make progress.
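
What "interpret the guest's pseudo-NMI settings" boils down to is checking both PSTATE.I
and the GIC priority mask, since with pseudo-NMI the guest masks normal interrupts by
raising ICC_PMR_EL1 instead of setting PSTATE.I. A sketch, with illustrative names and
values rather than anything KVM defines today:

#include <stdbool.h>
#include <stdint.h>

#define PSTATE_I  (1u << 7)   /* IRQs masked at the PE */

/* Guest state visible to the host at the time of the stage-2 fault. */
struct vcpu_irq_view {
    uint64_t pstate;      /* guest PSTATE */
    uint8_t  pmr;         /* guest's ICC_PMR_EL1 priority mask */
    uint8_t  ppi_prio;    /* priority the guest configured for our PPI */
};

/* Could the guest take the paravirt "page not present" PPI right now? */
static bool guest_can_take_apf_ppi(const struct vcpu_irq_view *v)
{
    if (v->pstate & PSTATE_I)
        return false;                 /* masked the classic way */

    /* GIC: lower value = higher priority; only priorities below PMR are signalled. */
    return v->ppi_prio < v->pmr;
}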

PPI are a scarce resource, so this would need some VMM involvement at boot to say which
PPI can be used. We do this for the PMU too.

... I'd like to look into how x86 uses this, and what other hypervisors may do in this
area. (another nightmare is supporting similar but different things for KVM, Xen, HyperV
and VMWare. I'm sure other hypervisors are available...)


> In fact it would be nice if
> SDEI_EVENT_COMPLETE meant "wait for synchronous page-in" while
> SDEI_EVENT_COMPLETE_AND_RESUME meant "handle it asynchronously".

Sneaky. How does x86 do this? I assume there is a hypercall for 'got it' or 'not now'.
If we go the PPI route we could use the same. (Ideally as much as possible is done in
common code)
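
To make that concrete, the guest-side handler could pick the completion call based on
whether it is safe to sleep. Conceptual only: the real SDEI client completes the event
from assembly in the kernel's SDEI entry code, not from C, and sdei_async_resume_vector
below is made up:

#include <linux/arm-smccc.h>
#include <linux/arm_sdei.h>
#include <linux/preempt.h>

/* Hypothetical IRQ-path entry point to resume at for the asynchronous case. */
extern unsigned long sdei_async_resume_vector;

static void apf_sdei_handler(void)
{
    struct arm_smccc_res res;

    if (preemptible()) {
        /* Safe to sleep: handle it asynchronously and let the scheduler run. */
        arm_smccc_1_1_hvc(SDEI_1_0_FN_SDEI_EVENT_COMPLETE_AND_RESUME,
                          sdei_async_resume_vector, 0, 0, 0, 0, &res);
    } else {
        /* Can't schedule here: ask the host to page it in before returning. */
        arm_smccc_1_1_hvc(SDEI_1_0_FN_SDEI_EVENT_COMPLETE, 0, 0, 0, 0, 0, &res);
    }
}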



Thanks,

James

[0] https://kartikgopalan.github.io/publications/hines09postcopy_osr.pdf


