[Question] How to testing SDEI client driver

Wed Jul 8 12:11:48 EDT 2020

Hi Gavin,

On 03/07/2020 01:26, Gavin Shan wrote:
> On 7/1/20 9:57 PM, James Morse wrote:
>> On 30/06/2020 06:17, Gavin Shan wrote:
>>> I'm currently looking into SDEI client driver and reworking on it so that
>>> it can provide capability/services to arm64/kvm to get it virtualized.
>>
>> What do you mean by virtualised? The expectation is the VMM would implement the 'firmware'
>> side of this. 'events' are most likely to come from the VMM, and having to handshake with
>> the kernel to work out if the event you want to inject is registered and enabled is
>> over-complicated. Supporting it in the VMM means you can notify a different vCPU if that
>> is appropriate, or take a different action if the event isn't registered.
>>
>> This was all blocked on finding a future-proof way for tools like Qemu to consume
>> reference code from ATF.

> Sorry that I didn't mention the story a bit last time. We plan to use SDEI to
> deliver the notification (signal) from host to guest, needed by the asynchronous
> page fault feature. The RFCv2 patchset was posted a while ago [1].

Thanks. So this is to hint to the guest that you'd swapped its memory to disk. Yuck.

When would you do this?

Surely this is "performance of an over-committed host sucks".

~

Isn't this roughly equivalent to SMT CPUs taking a cache-miss? ...
If you pinned two vCPUs to one physical CPU, the host:scheduler would multiplex between
them. If one couldn't do useful work because it was waiting for memory, the other gets
all the slack time. (the TLB maintenance would hurt, but not as much as waiting for the disk)
The good news is the guest:scheduler already knows how to deal with this!
(and, it works for other OS too)

Wouldn't it be better to let the guest make the swapping decision? You could provide a
fast virtio swap device to the guest that is backed by maybe-swapped host memory. (you'd
need to get the host to swap the block device in preference to the guest memory, or
mlock() it)
The guest gets great performance, unless its swap was actually swapped. It might even be
possible to do this without a guest exit!
(I'm not aware of a way for user-space to give a preference on what gets swapped)
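
A minimal sketch of the mlock() option, reading "it" above as the guest memory. The helper
name alloc_pinned_region() is hypothetical, and the anonymous mapping only stands in for
however a real VMM allocates guest RAM:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Hypothetical VMM helper (not from any real VMM): allocate a region that
 * stands in for guest RAM and ask the host not to swap it out.
 * Returns the mapping, or NULL if the allocation failed. *pinned is set to
 * 1 if mlock() succeeded and 0 if the region is mapped but still swappable.
 */
void *alloc_pinned_region(size_t len, int *pinned)
{
	void *buf;

	assert(pinned != NULL);
	*pinned = 0;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;

	/* mlock() can fail (e.g. RLIMIT_MEMLOCK too low); report, don't abort. */
	if (mlock(buf, len) == 0)
		*pinned = 1;

	return buf;
}
```

A caller would check 'pinned' and fall back to unpinned memory when the memlock limit is
too low. This does nothing to express a swap preference between two unpinned regions.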

Done like this, you don't pay the penalty when the guest tries to swap out a page that the
host had already swapped.

I think re-using some of these existing concepts would be better than something that is
linux+kvm+aarch64 specific.

> For the SDEI
> events needed by the async page fault, it's originated from KVM (host). In order
> to achieve the goal, KVM needs some code so that SDEI event can be injected and
> delivered. Also, the SDEI related hypercalls need to be handled as well.

I avoided doing this because it makes it massively complicated for the VMM. All that
in-kernel state now has to be migrated. KVM has to expose APIs to let the VMM inject
events, which gets nasty for shared events where some CPUs are masked, and others aren't.
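
To illustrate the shared-event case, here is a toy model of the decision whoever implements
the 'firmware' side has to make. All the names are hypothetical, and it ignores routing
modes, priorities and events that are already running:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-vCPU state: has this vCPU masked SDEI events? */
struct vcpu_sdei_state {
	bool masked;
};

/* Hypothetical state for one shared event. */
struct shared_event {
	bool registered;
	bool enabled;
};

/*
 * Pick a vCPU to deliver a shared event to.
 * Returns the index of the first unmasked vCPU, or -1 if the event is not
 * registered and enabled, or if every vCPU currently has events masked.
 * A return of -1 is where the caller gets to take a different action.
 */
int pick_target_vcpu(const struct shared_event *ev,
		     const struct vcpu_sdei_state *vcpus, size_t nr_vcpus)
{
	size_t i;

	assert(ev != NULL && vcpus != NULL);

	if (!ev->registered || !ev->enabled)
		return -1;

	for (i = 0; i < nr_vcpus; i++) {
		if (!vcpus[i].masked)
			return (int)i;
	}

	return -1;
}
```

In the VMM this is a lookup in its own data. With the state in the kernel, every field
above has to be exposed through an API and migrated.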

Having something like Qemu drive the reference code from TFA is the right thing to do for
SDEI.

> Since we're here, I plan to expand the scope so that the firmware owned SDEI
> events (private/shared) can be passed through to multiple VMs. Let's say they're
> passthrough events. These passthrough events can also be shared by multiple VMs.

Why? Do you have an example where that is necessary?

This stuff is for things firmware needs to tell the OS urgently, e.g. RAS events,
platform over-temperature or the reboot watchdog is about to fire.

I can't think of anything that firmware would know about, that a guest needs to know. It
violates the isolation and abstraction that running stuff in a guest is all about!

RAS events come the closest. For RAS events the host has to handle the error first, then
it notifies the VMM like linux would for any user-space process. The VMM can then, at its
option, replay the event into the guest using whatever mechanism it likes.
This decoupling is important to ensure the VMM does not need to know how the host learns
about RAS errors, and has free choice over how it tells the guest.

>>> The
>>> primary reason is we want to use SDEI to deliver the asynchronous page fault
>>> notification from host to guest.
>>
>> As an NMI?! Yuck!
>> The SDEI handler reads memory, you'd need to stop it being re-entrant. It exits through
>> the IRQ vector, (which is necessary for forward-progress given a synchronous RAS event,
>> and for KVM to trigger guest-exit before the 'real' work that is offloaded to an irq
>> handler can run), it's going to be 'fun' to have any guarantee of forward-progress if this
>> is involved with stage2.

> Yeah, it's something similar to NMI.

AArch64 doesn't define an NMI, but we use the term for anything that interrupts IRQ-masked
code. You want to schedule(), which you can't do from an NMI.

> The notification (signal) has to be delivered in synchronous mode.

Heh, so you're using SDEI to get into the IRQ handler synchronously, so you can
reschedule. You don't actually want the NMI properties, only the software defined
synchronous exception.

> Yes, the SDEI specification already mentioned
> this: the client handler should have all required resources in place before
> the handler is going to run. However, I don't see it as a problem so far.

What if they are swapped out? This thing becomes re-entrant ... which the spec forbids.
The host has no clue what is in guest memory.

> Let's wait and see whether it's a real issue once I post the RFC patchset :)

It's not really a try-it-and-see thing!

[...]

>>> It seems that TRF (Trusted Firmware) is the only firmware with SDEI service
>>> implemented and supported.
>>
>> This project calls itself TF-A. ATF is the other widely used name. (I've never seen TRF
>> before)
>>
> 
> Yeah, I must have provided the wrong name. Here is the git repo I was
> looking into:
> 
>    https://github.com/ARM-software/arm-trusted-firmware
> 
>>
>>> If so, does it mean I need to install TRF on my bare metal machine?
>>> I'm wondering how it can be installed and not sure if
>>> there is any document about this.
>>
>> Firmware should come with the platform. You'd need to know intricate details about power
>> management and initialising parts of the SoC to port it.
>>
>> ATF has a port for the fast-model/foundation model. I test this with ATF in the fast-model.

> I have no idea of fast-model and foundation model, and I got nothing
> from below commands in ATF git repo:

What are you using to test your kernel changes? Can it run EL3 software?

The foundation model can be downloaded here:
https://developer.arm.com/tools-and-software/simulation-models/fixed-virtual-platforms/arm-ecosystem-models

> [gwshan@localhost atf]$ git grep -i fast | grep -i model
> [gwshan@localhost atf]$ git grep -i fundation | grep -i model

The typo is why. Swap:

| morse@eglon:~/model/mpam/arm-trusted-firmware$ git grep -i foundation | grep model
| fdts/fvp-foundation-gicv2-psci.dts:     model = "FVP Foundation";
| fdts/fvp-foundation-gicv3-psci.dts:     model = "FVP Foundation";

'fvp' is the name atf uses for the platform.

The runes I had to build it with SDEI support are:
| make DEBUG=1 PLAT=fvp SDEI_SUPPORT=1 EL3_EXCEPTION_HANDLING=1 fip all

>>> Besides, GHES seems the only user of SDEI in the linux kernel. If so, is
>>> there a way to inject the relevant errors and how?
>>
>> It is, and unfortunately last time I checked, upstream ATF doesn't have the firmware-first
>> stuff for this. It's too SoC specific.
>>
>> I test this by binding the fast-model's SP804 one-shot interrupt controller as an event,
>> then plumbing that into GHES. It's more of a case-study in why the bindable-irq stuff is
>> nasty than a usable error injection method.
>> I can push the most recently rebased version of this, but you'd also need to hack-up a
>> HEST table with GHES entries to actually get it running.
>>
>> But, unless you are working on EL3 firmware, or a VMM, I don't think SDEI is what you
>> want.
>> What problem are you trying to solve?

> Thanks for the information. It seems I also need to emulate SDEI event by
> myself in order to test it. The best way for me is to inject SDEI event
> from KVM. By the way, is the code you had part of the firmware used by
> a bare-metal machine or a VM?
> 
> The issue we want to resolve is to deliver async page fault notification
> as mentioned above. Please let me know if there are more concerns :)

Re-entrance and forward progress.

I'd love to know why additional complexity to tell the guest this stuff is better than the
two approaches described above.

Thanks,

James