[PATCH v3 7/8] KVM: kvm-vfio: generic forwarding control

Mon Dec 8 08:54:44 PST 2014

On Mon, 2014-12-08 at 13:22 +0100, Eric Auger wrote:
> On 11/25/2014 08:00 PM, Alex Williamson wrote:
> > On Tue, 2014-11-25 at 19:20 +0100, Eric Auger wrote:
> >> On 11/24/2014 09:56 PM, Alex Williamson wrote:
> >>> On Sun, 2014-11-23 at 19:35 +0100, Eric Auger wrote:
> >>>> This patch introduces a new KVM_DEV_VFIO_DEVICE group.
> >>>>
> >>>> This is a new control channel which enables KVM to cooperate with
> >>>> viable VFIO devices.
> >>>>
> >>>> Functions are introduced to check the validity of a VFIO device
> >>>> file descriptor, increment/decrement the ref counter of the VFIO
> >>>> device.
> >>>>
> >>>> The patch introduces 2 attributes for this new device group:
> >>>> KVM_DEV_VFIO_DEVICE_FORWARD_IRQ, KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ.
> >>>> Their purpose is to turn a VFIO device IRQ into a forwarded IRQ and
> >>>> unset respectively unset the feature.
> >>>>
> >>>> The VFIO device stores a list of registered forwarded IRQs. The reference
> >>>> counter of the device is incremented each time a new IRQ is forwarded.
> >>>> Reference counter is decremented when the IRQ forwarding is unset.
> >>>>
> >>>> The forwarding programmming is architecture specific, implemented in
> >>>> kvm_arch_set_fwd_state function. Architecture specific implementation is
> >>>> enabled when __KVM_HAVE_ARCH_KVM_VFIO_FORWARD is set. When not set those
> >>>> functions are void.
> >>>>
> >>>> Signed-off-by: Eric Auger <eric.auger at linaro.org>
> >>>>
> >>>> ---
> >>>>
> >>>> v2 -> v3:
> >>>> - add API comments in kvm_host.h
> >>>> - improve the commit message
> >>>> - create a private kvm_vfio_fwd_irq struct
> >>>> - fwd_irq_action replaced by a bool and removal of VFIO_IRQ_CLEANUP. This
> >>>>   latter action will be handled in vgic.
> >>>> - add a vfio_device handle argument to kvm_arch_set_fwd_state. The goal is
> >>>>   to move platform specific stuff in architecture specific code.
> >>>> - kvm_arch_set_fwd_state renamed into kvm_arch_vfio_set_forward
> >>>> - increment the ref counter each time we do an IRQ forwarding and decrement
> >>>>   this latter each time one IRQ forward is unset. Simplifies the whole
> >>>>   ref counting.
> >>>> - simplification of list handling: create, search, removal
> >>>>
> >>>> v1 -> v2:
> >>>> - __KVM_HAVE_ARCH_KVM_VFIO renamed into __KVM_HAVE_ARCH_KVM_VFIO_FORWARD
> >>>> - original patch file separated into 2 parts: generic part moved in vfio.c
> >>>>   and ARM specific part(kvm_arch_set_fwd_state)
> >>>> ---
> >>>>  include/linux/kvm_host.h |  28 ++++++
> >>>>  virt/kvm/vfio.c          | 249 ++++++++++++++++++++++++++++++++++++++++++++++-
> >>>>  2 files changed, 274 insertions(+), 3 deletions(-)
> >>>>
> >>>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> >>>> index ea53b04..0b9659d 100644
> >>>> --- a/include/linux/kvm_host.h
> >>>> +++ b/include/linux/kvm_host.h
> >>>> @@ -1076,6 +1076,15 @@ struct kvm_device_ops {
> >>>>  		      unsigned long arg);
> >>>>  };
> >>>>  
> >>>> +/* internal self-contained structure describing a forwarded IRQ */
> >>>> +struct kvm_fwd_irq {
> >>>> +	struct kvm *kvm; /* VM to inject the GSI into */
> >>>> +	struct vfio_device *vdev; /* vfio device the IRQ belongs to */
> >>>> +	__u32 index; /* VFIO device IRQ index */
> >>>> +	__u32 subindex; /* VFIO device IRQ subindex */
> >>>> +	__u32 gsi; /* gsi, ie. virtual IRQ number */
> >>>> +};
> >>>> +
> >>>>  void kvm_device_get(struct kvm_device *dev);
> >>>>  void kvm_device_put(struct kvm_device *dev);
> >>>>  struct kvm_device *kvm_device_from_filp(struct file *filp);
> >>>> @@ -1085,6 +1094,25 @@ void kvm_unregister_device_ops(u32 type);
> >>>>  extern struct kvm_device_ops kvm_mpic_ops;
> >>>>  extern struct kvm_device_ops kvm_xics_ops;
> >>>>  
> >>>> +#ifdef __KVM_HAVE_ARCH_KVM_VFIO_FORWARD
> >>>> +/**
> >>>> + * kvm_arch_vfio_set_forward - changes the forwarded state of an IRQ
> >>>> + *
> >>>> + * @fwd_irq: handle to the forwarded irq struct
> >>>> + * @forward: true means forwarded, false means not forwarded
> >>>> + * returns 0 on success, < 0 on failure
> >>>> + */
> >>>> +int kvm_arch_vfio_set_forward(struct kvm_fwd_irq *fwd_irq,
> >>>> +			      bool forward);
> >>>
> >>> We could add a struct device* to the args list or into struct
> >>> kvm_fwd_irq so that arch code doesn't need to touch the vdev.  arch code
> >>> has no business dealing with references to the vfio_device.
> >> Hi Alex,
> >>
> >> Currently It can't put struct device* into the kvm_fwd_irq struct since
> >> I need to release the vfio_device with
> >> vfio_device_put_external_user(struct vfio_device *vdev)
> >> typically in kvm_vfio_clean_fwd_irq. So I need to store the pointers to
> >> the vfio_device somewhere.
> >>
> >> I see 2 solutions: change the proto of
> >> vfio_device_put_external_user(struct vfio_device *vdev) and pass a
> >> struct device* (??)
> >>
> >> or change the proto of kvm_arch_vfio_set_forward into
> >>
> >> kvm_arch_vfio_set_forward(struct kvm *kvm, struct device *dev, int
> >> index, [int subindex], int gsi, bool forward) or using index/start/count
> >> but loosing the interest of having a self-contained internal struct.
> > 
> > The latter is sort of what I was assuming, I think the interface between
> > VFIO and KVM-VFIO is good, we just don't need to expose VFIO-isms out to
> > the arch KVM code.  KVM-VFIO should be the barrier layer.  In that
> > spirit, maybe it should be kvm_arch_set_forward() and the KVM-VFIO code
> > should do the processing of index/subindex sort of like how Feng did for
> > PCI devices.
> 
> Hi Alex,
> 
> In Feng's series, host irq is retrieved in the generic part while in
> mine it is retrieved in arch specific part, as encouraged at some point.
> From the above comment I understand the right API between generic and
> arch specific parts may be <operation>(kvm, host_irq, guest_irq) in
> which case I should revert the platform specific IRQ retrieval in the
> generic part. Is it the correct understanding?

Hi Eric,

Sorry if I'm flip-flopping on any of this, but I think you're right that
we want some sort of <operation>(kvm, host_irq, guest_irq) callout in
the kvm direction.  The parsing of vfio index/sub-index is vfio specific
and needs to happen in either kvm-vfio or deeper on the vfio side of the
equation.  We don't want things like vfio-pci encoding of
index/sub-index leaking out into the non-vfio related parts of the code.

In a perfectly abstracted world, that might mean that in addition to the
vfio external user interface to get the base struct device, we also have
a mechanism that takes a vfio_device, index, and sub-index and returns a
host irq, encapsulating that code in vfio-pci or vfio-platform rather
than having it leak into kvm-vfio.  Our layering should be:

vfio bus driver - vfio -  kvm-vfio  -  kvm - arch

And ideally the data goes like this:

  host irq --------------------------------->
                 device --------------------> 
                          guest irq -------->

I'm willing to accept though that kvm-vfio might have everything it
needs to determine host irq and the routing back to the bus driver is
more effort than simply decoding it in kvm-vfio.  Thanks,

Alex

> >>>> +
> >>>> +#else
> >>>> +static inline int kvm_arch_vfio_set_forward(struct kvm_fwd_irq *fwd_irq,
> >>>> +					    bool forward)
> >>>> +{
> >>>> +	return 0;
> >>>> +}
> >>>> +#endif
> >>>> +
> >>>>  #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
> >>>>  
> >>>>  static inline void kvm_vcpu_set_in_spin_loop(struct kvm_vcpu *vcpu, bool val)
> >>>> diff --git a/virt/kvm/vfio.c b/virt/kvm/vfio.c
> >>>> index 6f0cc34..af178bb 100644
> >>>> --- a/virt/kvm/vfio.c
> >>>> +++ b/virt/kvm/vfio.c
> >>>> @@ -25,8 +25,16 @@ struct kvm_vfio_group {
> >>>>  	struct vfio_group *vfio_group;
> >>>>  };
> >>>>  
> >>>> +/* private linkable kvm_fwd_irq struct */
> >>>> +struct kvm_vfio_fwd_irq_node {
> >>>> +	struct list_head link;
> >>>> +	struct kvm_fwd_irq fwd_irq;
> >>>> +};
> >>>> +
> >>>>  struct kvm_vfio {
> >>>>  	struct list_head group_list;
> >>>> +	/* list of registered VFIO forwarded IRQs */
> >>>> +	struct list_head fwd_node_list;
> >>>>  	struct mutex lock;
> >>>>  	bool noncoherent;
> >>>>  };
> >>>> @@ -247,12 +255,239 @@ static int kvm_vfio_set_group(struct kvm_device *dev, long attr, u64 arg)
> >>>>  	return -ENXIO;
> >>>>  }
> >>>>  
> >>>> +/**
> >>>> + * kvm_vfio_get_vfio_device - Returns a handle to a vfio-device
> >>>> + *
> >>>> + * Checks it is a valid vfio device and increments its reference counter
> >>>> + * @fd: file descriptor of the vfio platform device
> >>>> + */
> >>>> +static struct vfio_device *kvm_vfio_get_vfio_device(int fd)
> >>>> +{
> >>>> +	struct fd f = fdget(fd);
> >>>> +	struct vfio_device *vdev;
> >>>> +
> >>>> +	if (!f.file)
> >>>> +		return NULL;
> >>>
> >>> ERR_PTR(-EINVAL)?
> >>>
> >>> ie. propagate errors from the point where they're encountered so we
> >>> don't need to make up an errno later.
> >> yes thanks
> >>>
> >>>> +	vdev = kvm_vfio_device_get_external_user(f.file);
> >>>> +	fdput(f);
> >>>> +	return vdev;
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * kvm_vfio_put_vfio_device: decrements the reference counter of the
> >>>> + * vfio platform * device
> >>>> + *
> >>>> + * @vdev: vfio_device handle to release
> >>>> + */
> >>>> +static void kvm_vfio_put_vfio_device(struct vfio_device *vdev)
> >>>> +{
> >>>> +	kvm_vfio_device_put_external_user(vdev);
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * kvm_vfio_find_fwd_irq - checks whether a forwarded IRQ already is
> >>>> + * registered in the list of forwarded IRQs
> >>>> + *
> >>>> + * @kv: handle to the kvm-vfio device
> >>>> + * @fwd: handle to the forwarded irq struct
> >>>> + * In the positive returns the handle to its node in the kvm-vfio
> >>>> + * forwarded IRQ list, returns NULL otherwise.
> >>>> + * Must be called with kv->lock hold.
> >>>> + */
> >>>> +static struct kvm_vfio_fwd_irq_node *kvm_vfio_find_fwd_irq(
> >>>> +				struct kvm_vfio *kv,
> >>>> +				struct kvm_fwd_irq *fwd)
> >>>> +{
> >>>> +	struct kvm_vfio_fwd_irq_node *node;
> >>>> +
> >>>> +	list_for_each_entry(node, &kv->fwd_node_list, link) {
> >>>> +		if ((node->fwd_irq.index == fwd->index) &&
> >>>> +		    (node->fwd_irq.subindex == fwd->subindex) &&
> >>>> +		    (node->fwd_irq.vdev == fwd->vdev))
> >>>> +			return node;
> >>>> +	}
> >>>> +	return NULL;
> >>>> +}
> >>>> +/**
> >>>> + * kvm_vfio_register_fwd_irq - Allocates, populates and registers a
> >>>> + * forwarded IRQ
> >>>> + *
> >>>> + * @kv: handle to the kvm-vfio device
> >>>> + * @fwd: handle to the forwarded irq struct
> >>>> + * In case of success returns a handle to the new list node,
> >>>> + * NULL otherwise.
> >>>> + * Must be called with kv->lock hold.
> >>>> + */
> >>>> +static struct kvm_vfio_fwd_irq_node *kvm_vfio_register_fwd_irq(
> >>>> +				struct kvm_vfio *kv,
> >>>> +				struct kvm_fwd_irq *fwd)
> >>>> +{
> >>>> +	struct kvm_vfio_fwd_irq_node *node;
> >>>> +
> >>>> +	node = kmalloc(sizeof(*node), GFP_KERNEL);
> >>>> +	if (!node)
> >>>> +		return NULL;
> >>>
> >>> ERR_PTR(-ENOMEM)?
> >>>
> >> OK
> >>>> +
> >>>> +	node->fwd_irq = *fwd;
> >>>> +
> >>>> +	list_add(&node->link, &kv->fwd_node_list);
> >>>> +
> >>>> +	return node;
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * kvm_vfio_unregister_fwd_irq - unregisters and frees a forwarded IRQ
> >>>> + *
> >>>> + * @node: handle to the node struct
> >>>> + * Must be called with kv->lock hold.
> >>>> + */
> >>>> +static void kvm_vfio_unregister_fwd_irq(struct kvm_vfio_fwd_irq_node *node)
> >>>> +{
> >>>> +	list_del(&node->link);
> >>>> +	kfree(node);
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * kvm_vfio_set_forward - turns a VFIO device IRQ into a forwarded IRQ
> >>>> + * @kv: handle to the kvm-vfio device
> >>>> + * @fd: file descriptor of the vfio device the IRQ belongs to
> >>>> + * @fwd: handle to the forwarded irq struct
> >>>> + *
> >>>> + * Registers an IRQ as forwarded and calls the architecture specific
> >>>> + * implementation of set_forward. In case of operation failure, the IRQ
> >>>> + * is unregistered. In case of success, the vfio device ref counter is
> >>>> + * incremented.
> >>>> + */
> >>>> +static int kvm_vfio_set_forward(struct kvm_vfio *kv, int fd,
> >>>> +				struct kvm_fwd_irq *fwd)
> >>>> +{
> >>>> +	int ret;
> >>>> +	struct kvm_vfio_fwd_irq_node *node =
> >>>> +			kvm_vfio_find_fwd_irq(kv, fwd);
> >>>> +
> >>>> +	if (node)
> >>>> +		return -EINVAL;
> >>>
> >>> I assume you're saving -EBUSY for arch code for the case where the IRQ
> >>> is already active?
> >> yes. -EBUSY now corresponds to the case where the VFIO signaling is
> >> already setup.
> >>>
> >>>> +	node = kvm_vfio_register_fwd_irq(kv, fwd);
> >>>> +	if (!node)
> >>>> +		return -ENOMEM;
> >>>
> >>> if (IS_ERR(node))
> >>> 	return PTR_ERR(node);
> >>>
> >>>> +	ret = kvm_arch_vfio_set_forward(fwd, true);
> >>>> +	if (ret < 0)  {
> >>>> +		kvm_vfio_unregister_fwd_irq(node);
> >>>> +		return ret;
> >>>> +	}
> >>>> +	/* increment the ref counter */
> >>>> +	kvm_vfio_get_vfio_device(fd);
> >>>
> >>> Wouldn't it be easier if the reference counting were coupled with the
> >>> register/unregister_fwd_irq?
> >> ok
> >>   I'd be tempted to pass your user_fwd_irq
> >>> to this function instead of a kvm_fwd_irq.
> >> Well in that case I would use kvm_arch_forwarded_irq for both set and
> >> unset. Then comes the problem of kvm_vfio_clean_fwd_irq which
> >> manipulates only internal structs.
> >>
> >>>> +	return ret;
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * kvm_vfio_unset_forward - Sets a VFIO device IRQ as non forwarded
> >>>> + * @kv: handle to the kvm-vfio device
> >>>> + * @fwd: handle to the forwarded irq struct
> >>>> + *
> >>>> + * Calls the architecture specific implementation of set_forward and
> >>>> + * unregisters the IRQ from the forwarded IRQ list. Decrements the vfio
> >>>> + * device reference counter.
> >>>> + */
> >>>> +static int kvm_vfio_unset_forward(struct kvm_vfio *kv,
> >>>> +				  struct kvm_fwd_irq *fwd)
> >>>> +{
> >>>> +	int ret;
> >>>> +	struct kvm_vfio_fwd_irq_node *node =
> >>>> +			kvm_vfio_find_fwd_irq(kv, fwd);
> >>>> +	if (!node)
> >>>> +		return -EINVAL;
> >>>> +	ret = kvm_arch_vfio_set_forward(fwd, false);
> >>>
> >>> Whoa, if the unforward fails we continue to undo everything else?  How
> >>> does userspace cleanup from this?  We need a guaranteed shutdown path
> >>> for cleanup code, we can never trust userspace to do things in the
> >>> correct order.  Can we really preclude the user calling unforward with
> >>> an active IRQ?  Maybe _forward() and _unforward() need to be separate
> >>> functions so that 'un' can be made void to indicate it can't fail.
> >> If I accept an unforward while the VFIO signaling mechanism is set, the
> >> wrong handler will be setup on VFIO driver. So should I put in place a
> >> mechanism that changes the VFIO handler for that irq (causing VFIO
> >> driver free_irq/request_irq), using another external API function?
> > 
> > Yep, it seems like we need to re-evaluate whether forwarding can be
> > changed on a running IRQ.  Disallowing it doesn't appear to support KVM
> > initiated shutdown, only user initiated configuration.  So the
> > vfio-platform interrupt handler probably needs to be bi-modal.  Maybe
> > the forwarding state of the IRQ can use RCU to avoid locks.
> > 
> >>>> +	kvm_vfio_unregister_fwd_irq(node);
> >>>> +
> >>>> +	/* decrement the ref counter */
> >>>> +	kvm_vfio_put_vfio_device(fwd->vdev);
> >>>
> >>> Again, seems like the unregister should do this.
> >> ok
> >>>
> >>>> +	return ret;
> >>>> +}
> >>>> +
> >>>> +static int kvm_vfio_control_irq_forward(struct kvm_device *kdev, long attr,
> >>>> +					int32_t __user *argp)
> >>>> +{
> >>>> +	struct kvm_arch_forwarded_irq user_fwd_irq;
> >>>> +	struct kvm_fwd_irq fwd;
> >>>> +	struct vfio_device *vdev;
> >>>> +	struct kvm_vfio *kv = kdev->private;
> >>>> +	int ret;
> >>>> +
> >>>> +	if (copy_from_user(&user_fwd_irq, argp, sizeof(user_fwd_irq)))
> >>>> +		return -EFAULT;
> >>>> +
> >>>> +	vdev = kvm_vfio_get_vfio_device(user_fwd_irq.fd);
> >>>> +	if (IS_ERR(vdev)) {
> >>>> +		ret = PTR_ERR(vdev);
> >>>> +		goto out;
> >>>> +	}
> >>>> +
> >>>> +	fwd.vdev =  vdev;
> >>>> +	fwd.kvm =  kdev->kvm;
> >>>> +	fwd.index = user_fwd_irq.index;
> >>>> +	fwd.subindex = user_fwd_irq.subindex;
> >>>> +	fwd.gsi = user_fwd_irq.gsi;
> >>>> +
> >>>> +	switch (attr) {
> >>>> +	case KVM_DEV_VFIO_DEVICE_FORWARD_IRQ:
> >>>> +		mutex_lock(&kv->lock);
> >>>> +		ret = kvm_vfio_set_forward(kv, user_fwd_irq.fd, &fwd);
> >>>> +		mutex_unlock(&kv->lock);
> >>>> +		break;
> >>>> +	case KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ:
> >>>> +		mutex_lock(&kv->lock);
> >>>> +		ret = kvm_vfio_unset_forward(kv, &fwd);
> >>>> +		mutex_unlock(&kv->lock);
> >>>> +		break;
> >>>> +	}
> >>>> +out:
> >>>> +	kvm_vfio_put_vfio_device(vdev);
> >>>
> >>> It might add a little extra code, but logically the reference being tied
> >>> to the register/unregister feels a bit cleaner than this.
> >> Sorry I do not catch your comment here. Since i called
> >> kvm_vfio_get_vfio_device at the beg of the function, I need to release
> >> at the end of the function, besides the ref counter incr/decr at
> >> registration?
> > 
> > If we do reference counting on register/unregister, I'd think that the
> > get/put in this function would go away.  Having the reference here seems
> > redundant.
> > 
> >>>
> >>>> +	return ret;
> >>>> +}
> >>>> +
> >>>> +static int kvm_vfio_set_device(struct kvm_device *kdev, long attr, u64 arg)
> >>>> +{
> >>>> +	int32_t __user *argp = (int32_t __user *)(unsigned long)arg;
> >>>> +	int ret;
> >>>> +
> >>>> +	switch (attr) {
> >>>> +	case KVM_DEV_VFIO_DEVICE_FORWARD_IRQ:
> >>>> +	case KVM_DEV_VFIO_DEVICE_UNFORWARD_IRQ:
> >>>> +		ret = kvm_vfio_control_irq_forward(kdev, attr, argp);
> >>>> +		break;
> >>>> +	default:
> >>>> +		ret = -ENXIO;
> >>>> +	}
> >>>> +	return ret;
> >>>> +}
> >>>> +
> >>>> +/**
> >>>> + * kvm_vfio_clean_fwd_irq - Unset forwarding state of all
> >>>> + * registered forwarded IRQs and free their list nodes.
> >>>> + * @kv: kvm-vfio device
> >>>> + *
> >>>> + * Loop on all registered device/IRQ combos, reset the non forwarded state,
> >>>> + * void the lists and release the reference
> >>>> + */
> >>>> +static int kvm_vfio_clean_fwd_irq(struct kvm_vfio *kv)
> >>>> +{
> >>>> +	struct kvm_vfio_fwd_irq_node *node, *tmp;
> >>>> +
> >>>> +	list_for_each_entry_safe(node, tmp, &kv->fwd_node_list, link) {
> >>>> +		kvm_vfio_unset_forward(kv, &node->fwd_irq);
> >>>> +	}
> >>>> +	return 0;
> >>>
> >>> This shouldn't be able to fail, make it void.
> >> see above questioning? But you're right, I am too much virtualization
> >> oriented. Currently my cleanup is even done in the virtual interrupt
> >> controller when the VM dies because nobody unsets the VFIO signaling.
> > 
> > Yep, being a kernel interface it needs to be hardened against careless
> > and malicious users.  The kernel should return to the correct state if
> > we kill -9 QEMU at any point.  Thanks,
> > 
> > Alex
> > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo at vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/