[PATCH 08/16] KVM: arm64: timers: Allow userspace to set the counter offsets

Fri Feb 17 14:11:36 PST 2023

On Fri, Feb 17, 2023 at 10:17:27AM +0000, Marc Zyngier wrote:
> Hi Oliver,
> 
> On Thu, 16 Feb 2023 22:09:47 +0000,
> Oliver Upton <oliver.upton at linux.dev> wrote:
> > 
> > Hi Marc,
> > 
> > On Thu, Feb 16, 2023 at 02:21:15PM +0000, Marc Zyngier wrote:
> > > And this is the moment you have all been waiting for: setting the
> > > counter offsets from userspace.
> > > 
> > > We expose a brand new capability that reports the ability to set
> > > the offsets for both the virtual and physical sides, independently.
> > > 
> > > In keeping with the architecture, the offsets are expressed as
> > > a delta that is subtracted from the physical counter value.
> > > 
> > > Once this new API is used, there is no going back, and the counters
> > > cannot be written to to set the offsets implicitly (the writes
> > > are instead ignored).
> > 
> > Is there any particular reason to use an explicit ioctl as opposed to
> > the KVM_{GET,SET}_DEVICE_ATTR ioctls? Dunno where you stand on it, but I
> > quite like that interface for simple state management. We also avoid
> > eating up more UAPI bits in the global namespace.
> 
> The problem with that is that it requires yet another KVM device for
> this, and I'm lazy. It also makes it a bit harder for the VMM to buy
> into this (need to track another FD, for example).

You can also accept the device ioctls on the actual VM FD, quite like
we do for the vCPU right now. And hey, I've got a patch that gets you
most of the way there!

https://lore.kernel.org/kvmarm/20230211013759.3556016-3-oliver.upton@linux.dev/

> > Is there any reason why we can't just order this ioctl before vCPU
> > creation altogether, or is there a need to do this at runtime? We're
> > about to tolerate multiple writers to the offset value, and I think the
> > only thing we need to guarantee is that the below flag is set before
> > vCPU ioctls have a chance to run.
> 
> Again, we don't know for sure whether the final offset is available
> before vcpu creation time. My idea for QEMU would be to perform the
> offset adjustment as late as possible, right before executing the VM,
> after having restored the vcpus with whatever value they had.

So how does userspace work out an offset based on available information?
The part that hasn't clicked for me yet is where userspace gets the
current value of the true physical counter to calculate an offset.

We could make it ABI that the guest's physical counter matches that of
the host by default. Of course, that has been the case since the
beginning of time but it is now directly user-visible.

The only part I don't like about that is that we aren't fully creating
an abstraction around host and guest system time. So here's my current
mental model of how we represent the generic timer to userspace:

                                +-----------------------+
                                |                       |
                                | Host System Counter   |
                                |          (1)          |
                                +-----------------------+
                                           |
                               +-----------+-----------+
                               |                       |
       +-----------------+  +-----+                 +-----+  +--------------------+
       | (2) CNTPOFF_EL2 |--| sub |                 | sub |--| (3) CNTVOFF_EL2    |
       +-----------------+  +-----+                 +-----+  +--------------------+
                               |                       |
                               |                       |
                     +-----------------+         +----------------+
                     | (5) CNTPCT_EL0  |         | (4) CNTVCT_EL0 |
                     +-----------------+         +----------------+

AFAICT, this UAPI exposes abstractions for (2) and (3) to userspace, but
userspace cannot directly get at (1).

Chewing on this a bit more, I don't think userspace has any business
messing with virtual and physical time independently, especially when
nested virtualization comes into play.

I think the illusion to userspace needs to be built around the notion of
a system counter:

                                +-----------------------+
                                |                       |
                                | Host System Counter   |
                                |          (1)          |
                                +-----------------------+
                                           |
                                           |
                                        +-----+   +-------------------+
                                        | sub |---| (6) system_offset |
                                        +-----+   +-------------------+
                                           |
                                           |
                                +-----------------------+
                                |                       |
                                | Guest System Counter  |
                                |          (7)          |
                                +-----------------------+
                                           |
                               +-----------+-----------+
                               |                       |
       +-----------------+  +-----+                 +-----+  +--------------------+
       | (2) CNTPOFF_EL2 |--| sub |                 | sub |--| (3) CNTVOFF_EL2    |
       +-----------------+  +-----+                 +-----+  +--------------------+
                               |                       |
                               |                       |
                     +-----------------+         +----------------+
                     | (5) CNTPCT_EL0  |         | (4) CNTVCT_EL0 |
                     +-----------------+         +----------------+

And from a UAPI perspective, we would either expose (1) and (6) to let
userspace calculate an offset or simply allow (7) to be directly
read/written.
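
To make the first option concrete: if userspace could read (1) and
write (6), then restoring a guest whose system counter should resume at
a saved value is a single wrapping subtraction. A minimal sketch, with
made-up helper names (none of this is proposed UAPI, and how userspace
would actually sample (1) is exactly the open question):

```c
#include <stdint.h>

/*
 * Hypothetical helper, not part of any KVM UAPI.
 * Returns the system_offset (6) that makes the guest system counter (7)
 * read 'guest_target' at the instant the host counter (1) reads
 * 'host_now'. Unsigned 64-bit arithmetic wraps, so this also works when
 * the guest value is ahead of the host value.
 */
static inline uint64_t system_offset_for(uint64_t host_now,
					 uint64_t guest_target)
{
	return host_now - guest_target;
}

/* Guest system counter (7) = host system counter (1) - system_offset (6). */
static inline uint64_t guest_system_counter(uint64_t host_now,
					    uint64_t system_offset)
{
	return host_now - system_offset;
}
```

The alternative, letting userspace read/write (7) directly, would push
the same subtraction into the kernel and spare userspace from having to
sample (1) at all.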

That frees up the meaning of the counter offsets as being purely a
virtual EL2 thing. These registers would reset to 0, and non-NV guests
could never change their value.

Under the hood KVM would program the true offset registers as:

	CNT{P,V}OFF_EL2 = 'virtual CNT{P,V}OFF_EL2' + system_offset

With this we would effectively configure CNTPCT = CNTVCT = 0 at the
point of VM creation. Only crappy thing is it requires full physical
counter/timer emulation for non-ECV systems, but the guest shouldn't be
using the physical counter in the first place.

Yes, this sucks for guests running on hosts w/ NV but not ECV. If anyone
can tell me how an L0 hypervisor is supposed to do NV without ECV, I'm
all ears.
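
For what it's worth, here is the programming rule above as a toy model.
This is only a sketch under my own assumptions: the struct and function
names are invented for illustration and do not correspond to anything
in KVM.

```c
#include <stdint.h>

/* Hypothetical model of the proposal; names are illustrative only. */
struct counter_model {
	uint64_t system_offset;	/* (6), set on behalf of userspace */
	uint64_t virt_cntpoff;	/* virtual EL2's CNTPOFF_EL2, resets to 0 */
	uint64_t virt_cntvoff;	/* virtual EL2's CNTVOFF_EL2, resets to 0 */
};

/* Value that would be programmed into the real CNTPOFF_EL2. */
static inline uint64_t hw_cntpoff(const struct counter_model *m)
{
	return m->virt_cntpoff + m->system_offset;
}

/* Value that would be programmed into the real CNTVOFF_EL2. */
static inline uint64_t hw_cntvoff(const struct counter_model *m)
{
	return m->virt_cntvoff + m->system_offset;
}

/* What the guest observes in CNTPCT_EL0 (5) for a given host count (1). */
static inline uint64_t guest_cntpct(const struct counter_model *m,
				    uint64_t host_counter)
{
	return host_counter - hw_cntpoff(m);
}

/* What the guest observes in CNTVCT_EL0 (4) for a given host count (1). */
static inline uint64_t guest_cntvct(const struct counter_model *m,
				    uint64_t host_counter)
{
	return host_counter - hw_cntvoff(m);
}
```

With system_offset set to the host counter value sampled at VM creation
and both virtual offsets at their reset value of 0, both guest counters
read 0 at that instant. A non-zero virtual CNTVOFF_EL2 written by an NV
guest then only shifts the virtual view.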

Does any of what I've written make even remote sense, or have I gone
entirely off the rails with my ASCII art? :)

-- 
Thanks,
Oliver