[PATCH v2 15/15] Drivers: hv: Add modules to expose /dev/mshv to VMMs running on Hyper-V

Fri Aug 18 06:08:49 PDT 2023

> -----Original Message-----
> From: Nuno Das Neves <nunodasneves at linux.microsoft.com>
> Sent: Friday, August 18, 2023 3:32 AM
> To: linux-hyperv at vger.kernel.org; linux-kernel at vger.kernel.org;
> x86 at kernel.org; linux-arm-kernel at lists.infradead.org; linux-
> arch at vger.kernel.org
> Cc: patches at lists.linux.dev; Michael Kelley (LINUX)
> <mikelley at microsoft.com>; KY Srinivasan <kys at microsoft.com>;
> wei.liu at kernel.org; Haiyang Zhang <haiyangz at microsoft.com>; Dexuan Cui
> <decui at microsoft.com>; apais at linux.microsoft.com; Tianyu Lan
> <Tianyu.Lan at microsoft.com>; ssengar at linux.microsoft.com; MUKESH
> RATHOR <mukeshrathor at microsoft.com>; stanislav.kinsburskiy at gmail.com;
> jinankjain at linux.microsoft.com; vkuznets <vkuznets at redhat.com>;
> tglx at linutronix.de; mingo at redhat.com; bp at alien8.de;
> dave.hansen at linux.intel.com; hpa at zytor.com; will at kernel.org;
> catalin.marinas at arm.com
> Subject: [PATCH v2 15/15] Drivers: hv: Add modules to expose /dev/mshv to
> VMMs running on Hyper-V
> 
> Add mshv, mshv_root, and mshv_vtl modules:
> 
> Module mshv is the parent module to the other two. It provides /dev/mshv,
> plus some common hypercall helper code. When one of the child modules is
> loaded, it is registered with the mshv module, which then provides entry
> point(s) to the child module via the IOCTLs defined in uapi/linux/mshv.h.
> 
> E.g. When the mshv_root module is loaded, it registers itself, and the
> MSHV_CREATE_PARTITION IOCTL becomes available in /dev/mshv. That is used
> to get a partition fd managed by mshv_root.
> 
> Similarly for mshv_vtl module, there is MSHV_CREATE_VTL, which creates
> an fd representing the lower vtl, managed by mshv_vtl.
> 
> Module mshv_root provides APIs for creating and managing child partitions.
> It defines abstractions for partitions (vms), vps (vcpus), and other things
> related to running a guest. It exposes the userspace interfaces for a VMM to
> manage the guest.
> 
> Module mshv_vtl provides VTL (Virtual Trust Level) support for VMMs. In
> this scenario, the host kernel and VMM run in a higher trust level than the
> guest, but within the same partition. This provides better isolation and
> performance.
> 
> Signed-off-by: Nuno Das Neves <nunodasneves at linux.microsoft.com>
> ---
>  drivers/hv/Kconfig             |   50 +
>  drivers/hv/Makefile            |   20 +
>  drivers/hv/hv_call.c           |  119 ++
>  drivers/hv/hv_common.c         |    4 +
>  drivers/hv/mshv.h              |  156 +++
>  drivers/hv/mshv_eventfd.c      |  758 ++++++++++++
>  drivers/hv/mshv_eventfd.h      |   80 ++
>  drivers/hv/mshv_main.c         |  208 ++++
>  drivers/hv/mshv_msi.c          |  129 +++
>  drivers/hv/mshv_portid_table.c |   84 ++
>  drivers/hv/mshv_root.h         |  194 ++++
>  drivers/hv/mshv_root_hv_call.c | 1064 +++++++++++++++++
>  drivers/hv/mshv_root_main.c    | 1964 ++++++++++++++++++++++++++++++++
>  drivers/hv/mshv_synic.c        |  689 +++++++++++
>  drivers/hv/mshv_vtl.h          |   52 +
>  drivers/hv/mshv_vtl_main.c     | 1542 +++++++++++++++++++++++++
>  drivers/hv/xfer_to_guest.c     |   28 +
>  include/uapi/linux/mshv.h      |  298 +++++
>  18 files changed, 7439 insertions(+)
>  create mode 100644 drivers/hv/hv_call.c
>  create mode 100644 drivers/hv/mshv.h
>  create mode 100644 drivers/hv/mshv_eventfd.c
>  create mode 100644 drivers/hv/mshv_eventfd.h
>  create mode 100644 drivers/hv/mshv_main.c
>  create mode 100644 drivers/hv/mshv_msi.c
>  create mode 100644 drivers/hv/mshv_portid_table.c
>  create mode 100644 drivers/hv/mshv_root.h
>  create mode 100644 drivers/hv/mshv_root_hv_call.c
>  create mode 100644 drivers/hv/mshv_root_main.c
>  create mode 100644 drivers/hv/mshv_synic.c
>  create mode 100644 drivers/hv/mshv_vtl.h
>  create mode 100644 drivers/hv/mshv_vtl_main.c
>  create mode 100644 drivers/hv/xfer_to_guest.c
>  create mode 100644 include/uapi/linux/mshv.h
> 
> diff --git a/drivers/hv/Kconfig b/drivers/hv/Kconfig
> index 00242107d62e..0d9aefc07b15 100644
> --- a/drivers/hv/Kconfig
> +++ b/drivers/hv/Kconfig
> @@ -54,4 +54,54 @@ config HYPERV_BALLOON
>  	help
>  	  Select this option to enable Hyper-V Balloon driver.
> 
> +config MSHV
> +	tristate "Microsoft Hypervisor root partition interfaces: /dev/mshv"
> +	depends on X86_64 && HYPERV
> +	select EVENTFD
> +	select MSHV_XFER_TO_GUEST_WORK
> +	help
> +	  Select this option to enable core functionality for managing guest
> +	  virtual machines running under the Microsoft Hypervisor.
> +
> +	  The interfaces are provided via a device named /dev/mshv.
> +
> +	  To compile this as a module, choose M here.
> +
> +	  If unsure, say N.
> +
> +config MSHV_ROOT
> +	tristate "Microsoft Hyper-V root partition APIs driver"
> +	depends on MSHV
> +	help
> +	  Select this option to provide /dev/mshv interfaces specific to
> +	  running as the root partition on Microsoft Hypervisor.
> +
> +	  To compile this as a module, choose M here.
> +
> +	  If unsure, say N.
> +
> +config MSHV_VTL
> +	tristate "Microsoft Hyper-V VTL driver"
> +	depends on MSHV
> +	select HYPERV_VTL_MODE
> +	select TRANSPARENT_HUGEPAGE

TRANSPARENT_HUGEPAGE can be avoided for now.

> +	help
> +	  Select this option to enable Hyper-V VTL driver.
> +	  Virtual Secure Mode (VSM) is a set of hypervisor capabilities and
> +	  enlightenments offered to host and guest partitions which enables
> +	  the creation and management of new security boundaries within
> +	  operating system software.
> +
> +	  VSM achieves and maintains isolation through Virtual Trust Levels
> +	  (VTLs). Virtual Trust Levels are hierarchical, with higher levels
> +	  being more privileged than lower levels. VTL0 is the least privileged
> +	  level, and currently only other level supported is VTL2.
> +
> +	  To compile this as a module, choose M here.
> +
> +	  If unsure, say N.
> +
> +config MSHV_XFER_TO_GUEST_WORK
> +	bool
> +
>  endmenu
> diff --git a/drivers/hv/Makefile b/drivers/hv/Makefile
> index d76df5c8c2a9..da7aa7542b05 100644
> --- a/drivers/hv/Makefile
> +++ b/drivers/hv/Makefile
> @@ -2,10 +2,30 @@
>  obj-$(CONFIG_HYPERV)		+= hv_vmbus.o
>  obj-$(CONFIG_HYPERV_UTILS)	+= hv_utils.o
>  obj-$(CONFIG_HYPERV_BALLOON)	+= hv_balloon.o
> +obj-$(CONFIG_MSHV)			+= mshv.o
> +obj-$(CONFIG_MSHV_VTL)		+= mshv_vtl.o
> +obj-$(CONFIG_MSHV_ROOT)		+= mshv_root.o
> 
>  CFLAGS_hv_trace.o = -I$(src)
>  CFLAGS_hv_balloon.o = -I$(src)
> 
> +CFLAGS_mshv_main.o			= -DHV_HYPERV_DEFS
> +CFLAGS_hv_call.o			= -DHV_HYPERV_DEFS
> +CFLAGS_mshv_root_main.o		= -DHV_HYPERV_DEFS
> +CFLAGS_mshv_root_hv_call.o	= -DHV_HYPERV_DEFS
> +CFLAGS_mshv_synic.o			= -DHV_HYPERV_DEFS
> +CFLAGS_mshv_portid_table.o	= -DHV_HYPERV_DEFS
> +CFLAGS_mshv_eventfd.o		= -DHV_HYPERV_DEFS
> +CFLAGS_mshv_msi.o			= -DHV_HYPERV_DEFS
> +CFLAGS_mshv_vtl_main.o		= -DHV_HYPERV_DEFS
> +
> +mshv-y				+= mshv_main.o
> +mshv_root-y			:= mshv_root_main.o mshv_synic.o mshv_portid_table.o \
> +						mshv_eventfd.o mshv_msi.o mshv_root_hv_call.o hv_call.o
> +mshv_vtl-y			:= mshv_vtl_main.o hv_call.o
> +
> +obj-$(CONFIG_MSHV_XFER_TO_GUEST_WORK) += xfer_to_guest.o
> +
>  hv_vmbus-y := vmbus_drv.o \
>  		 hv.o connection.o channel.o \
>  		 channel_mgmt.o ring_buffer.o hv_trace.o
> diff --git a/drivers/hv/hv_call.c b/drivers/hv/hv_call.c
> new file mode 100644
> index 000000000000..4455001d8545
> --- /dev/null
> +++ b/drivers/hv/hv_call.c
> @@ -0,0 +1,119 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (c) 2023, Microsoft Corporation.
> + *
> + * Hypercall helper functions shared between mshv modules.
> + *
> + * Authors:
> + *   Nuno Das Neves <nunodasneves at linux.microsoft.com>
> + */
> +
> +#include <linux/kernel.h>
> +#include <linux/mm.h>
> +#include <asm/mshyperv.h>
> +
> +#define HV_GET_REGISTER_BATCH_SIZE	\
> +	(HV_HYP_PAGE_SIZE / sizeof(union hv_register_value))
> +#define HV_SET_REGISTER_BATCH_SIZE	\
> +	((HV_HYP_PAGE_SIZE - sizeof(struct hv_input_set_vp_registers)) \
> +		/ sizeof(struct hv_register_assoc))
> +
> +int hv_call_get_vp_registers(
> +		u32 vp_index,
> +		u64 partition_id,
> +		u16 count,
> +		union hv_input_vtl input_vtl,
> +		struct hv_register_assoc *registers)
> +{
> +	struct hv_input_get_vp_registers *input_page;
> +	union hv_register_value *output_page;
> +	u16 completed = 0;
> +	unsigned long remaining = count;
> +	int rep_count, i;
> +	u64 status;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +
> +	input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +	output_page = *this_cpu_ptr(hyperv_pcpu_output_arg);
> +
> +	input_page->partition_id = partition_id;
> +	input_page->vp_index = vp_index;
> +	input_page->input_vtl.as_uint8 = input_vtl.as_uint8;
> +	input_page->rsvd_z8 = 0;
> +	input_page->rsvd_z16 = 0;
> +
> +	while (remaining) {
> +		rep_count = min(remaining, HV_GET_REGISTER_BATCH_SIZE);
> +		for (i = 0; i < rep_count; ++i)
> +			input_page->names[i] = registers[i].name;
> +
> +		status = hv_do_rep_hypercall(HVCALL_GET_VP_REGISTERS, rep_count,
> +					     0, input_page, output_page);

Is there any possibility that count is passed as 0 by mistake? In that case
the loop never runs, so status remains uninitialized when it is passed to
hv_status_to_errno() at the end of the function.
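
A minimal userspace sketch of the concern, not the patch's code: every
sketch_* name and constant below is a hypothetical stand-in (for
hv_do_rep_hypercall(), HV_STATUS_SUCCESS and HV_GET_REGISTER_BATCH_SIZE).
It assumes that treating count == 0 as a successful no-op is acceptable;
returning -EINVAL early would be another option.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_STATUS_SUCCESS	0u	/* stand-in for HV_STATUS_SUCCESS */
#define SKETCH_BATCH_SIZE	4u	/* stand-in for HV_GET_REGISTER_BATCH_SIZE */

/* Stand-in for hv_do_rep_hypercall(): fills one batch and reports success. */
static uint64_t sketch_rep_call(const uint32_t *names, uint64_t *values,
				unsigned int rep_count)
{
	for (unsigned int i = 0; i < rep_count; i++)
		values[i] = (uint64_t)names[i] * 2;
	return SKETCH_STATUS_SUCCESS;
}

/* Returns 0 on success and -1 on failure; count == 0 succeeds trivially. */
int sketch_get_registers(const uint32_t *names, uint64_t *values,
			 uint16_t count)
{
	/* Initialized up front so it is defined even if the loop never runs. */
	uint64_t status = SKETCH_STATUS_SUCCESS;
	size_t remaining = count;

	assert(count == 0 || (names && values));

	while (remaining) {
		unsigned int rep_count = remaining < SKETCH_BATCH_SIZE ?
					 (unsigned int)remaining : SKETCH_BATCH_SIZE;

		status = sketch_rep_call(names, values, rep_count);
		if (status != SKETCH_STATUS_SUCCESS)
			break;

		names += rep_count;
		values += rep_count;
		remaining -= rep_count;
	}

	return status == SKETCH_STATUS_SUCCESS ? 0 : -1;
}
```

The same initialization would apply to hv_call_set_vp_registers(), which
has the same loop shape.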

> +		if (!hv_result_success(status)) {
> +			pr_err("%s: completed %li out of %u, %s\n",
> +			       __func__,
> +			       count - remaining, count,
> +			       hv_status_to_string(status));
> +			break;
> +		}
> +		completed = hv_repcomp(status);
> +		for (i = 0; i < completed; ++i)
> +			registers[i].value = output_page[i];
> +
> +		registers += completed;
> +		remaining -= completed;
> +	}
> +	local_irq_restore(flags);
> +
> +	return hv_status_to_errno(status);
> +}
> +
> +int hv_call_set_vp_registers(
> +		u32 vp_index,
> +		u64 partition_id,
> +		u16 count,
> +		union hv_input_vtl input_vtl,
> +		struct hv_register_assoc *registers)
> +{
> +	struct hv_input_set_vp_registers *input_page;
> +	u16 completed = 0;
> +	unsigned long remaining = count;
> +	int rep_count;
> +	u64 status;
> +	unsigned long flags;
> +
> +	local_irq_save(flags);
> +	input_page = *this_cpu_ptr(hyperv_pcpu_input_arg);
> +
> +	input_page->partition_id = partition_id;
> +	input_page->vp_index = vp_index;
> +	input_page->input_vtl.as_uint8 = input_vtl.as_uint8;
> +	input_page->rsvd_z8 = 0;
> +	input_page->rsvd_z16 = 0;
> +
> +	while (remaining) {
> +		rep_count = min(remaining, HV_SET_REGISTER_BATCH_SIZE);
> +		memcpy(input_page->elements, registers,
> +			sizeof(struct hv_register_assoc) * rep_count);
> +
> +		status = hv_do_rep_hypercall(HVCALL_SET_VP_REGISTERS, rep_count,
> +					     0, input_page, NULL);
> +		if (!hv_result_success(status)) {
> +			pr_err("%s: completed %li out of %u, %s\n",
> +			       __func__,
> +			       count - remaining, count,
> +			       hv_status_to_string(status));
> +			break;
> +		}
> +		completed = hv_repcomp(status);
> +		registers += completed;
> +		remaining -= completed;
> +	}
> +
> +	local_irq_restore(flags);
> +
> +	return hv_status_to_errno(status);
> +}
> +
> diff --git a/drivers/hv/hv_common.c b/drivers/hv/hv_common.c
> index 13f972e72375..ccd76f30a638 100644
> --- a/drivers/hv/hv_common.c
> +++ b/drivers/hv/hv_common.c
> @@ -62,7 +62,11 @@ EXPORT_SYMBOL_GPL(hyperv_pcpu_output_arg);
>   */
>  static inline bool hv_output_arg_exists(void)
>  {
> +#ifdef CONFIG_MSHV_VTL

Today both options work together, but I wonder which is more accurate
here, CONFIG_HYPERV_VTL_MODE or CONFIG_MSHV_VTL, considering the
scalability of VTL modules.

> +	return true;
> +#else
>  	return hv_root_partition ? true : false;
> +#endif
>  }
> 
>  static void hv_kmsg_dump_unregister(void);
> diff --git a/drivers/hv/mshv.h b/drivers/hv/mshv.h
> new file mode 100644
> index 000000000000..166480a73f3f
> --- /dev/null
> +++ b/drivers/hv/mshv.h
> @@ -0,0 +1,156 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright (c) 2023, Microsoft Corporation.
> + */
> +
> +#ifndef _MSHV_H_
> +#define _MSHV_H_
> +
> +#include <linux/spinlock.h>
> +#include <linux/mutex.h>
> +#include <linux/semaphore.h>
> +#include <linux/sched.h>
> +#include <linux/srcu.h>
> +#include <linux/wait.h>
> +#include <uapi/linux/mshv.h>
> +
> +/*
> + * Hyper-V hypercalls
> + */
> +
> +int hv_call_withdraw_memory(u64 count, int node, u64 partition_id);
> +int hv_call_create_partition(
> +		u64 flags,
> +		struct hv_partition_creation_properties creation_properties,
> +		union hv_partition_isolation_properties isolation_properties,
> +		u64 *partition_id);
> +int hv_call_initialize_partition(u64 partition_id);
> +int hv_call_finalize_partition(u64 partition_id);
> +int hv_call_delete_partition(u64 partition_id);
> +int hv_call_map_gpa_pages(
> +		u64 partition_id,
> +		u64 gpa_target,
> +		u64 page_count, u32 flags,
> +		struct page **pages);
> +int hv_call_unmap_gpa_pages(
> +		u64 partition_id,
> +		u64 gpa_target,
> +		u64 page_count, u32 flags);
> +int hv_call_get_vp_registers(
> +		u32 vp_index,
> +		u64 partition_id,
> +		u16 count,
> +		union hv_input_vtl input_vtl,
> +		struct hv_register_assoc *registers);
> +int hv_call_get_gpa_access_states(
> +		u64 partition_id,
> +		u32 count,
> +		u64 gpa_base_pfn,
> +		u64 state_flags,
> +		int *written_total,
> +		union hv_gpa_page_access_state *states);
> +
> +int hv_call_set_vp_registers(
> +		u32 vp_index,
> +		u64 partition_id,
> +		u16 count,
> +		union hv_input_vtl input_vtl,
> +		struct hv_register_assoc *registers);

Nit: There is an opportunity to fix many of the checkpatch.pl warnings
related to line breaks, here and in many other places.

> +int hv_call_install_intercept(u64 partition_id, u32 access_type,
> +		enum hv_intercept_type intercept_type,
> +		union hv_intercept_parameters intercept_parameter);
> +int hv_call_assert_virtual_interrupt(
> +		u64 partition_id,
> +		u32 vector,
> +		u64 dest_addr,
> +		union hv_interrupt_control control);
> +int hv_call_clear_virtual_interrupt(u64 partition_id);
> +
> +#ifdef HV_SUPPORTS_VP_STATE
> +int hv_call_get_vp_state(
> +		u32 vp_index,
> +		u64 partition_id,
> +		enum hv_get_set_vp_state_type type,
> +		struct hv_vp_state_data_xsave xsave,
> +		/* Choose between pages and ret_output */
> +		u64 page_count,
> +		struct page **pages,
> +		union hv_output_get_vp_state *ret_output);
> +int hv_call_set_vp_state(
> +		u32 vp_index,
> +		u64 partition_id,
> +		enum hv_get_set_vp_state_type type,
> +		struct hv_vp_state_data_xsave xsave,
> +		/* Choose between pages and bytes */
> +		u64 page_count,
> +		struct page **pages,
> +		u32 num_bytes,
> +		u8 *bytes);
> +#endif
> +
> +int hv_call_map_vp_state_page(u64 partition_id, u32 vp_index, u32 type,
> +				struct page **state_page);
> +int hv_call_unmap_vp_state_page(u64 partition_id, u32 vp_index, u32 type);
> +int hv_call_get_partition_property(
> +		u64 partition_id,
> +		u64 property_code,
> +		u64 *property_value);
> +int hv_call_set_partition_property(
> +	u64 partition_id, u64 property_code, u64 property_value,
> +	void (*completion_handler)(void * /* data */, u64 * /* status */),
> +	void *completion_data);
> +int hv_call_translate_virtual_address(
> +		u32 vp_index,
> +		u64 partition_id,
> +		u64 flags,
> +		u64 gva,
> +		u64 *gpa,
> +		union hv_translate_gva_result *result);
> +int hv_call_get_vp_cpuid_values(
> +		u32 vp_index,
> +		u64 partition_id,
> +		union hv_get_vp_cpuid_values_flags values_flags,
> +		struct hv_cpuid_leaf_info *info,
> +		union hv_output_get_vp_cpuid_values *result);
> +
> +int hv_call_create_port(u64 port_partition_id, union hv_port_id port_id,
> +			u64 connection_partition_id, struct hv_port_info *port_info,
> +			u8 port_vtl, u8 min_connection_vtl, int node);
> +int hv_call_delete_port(u64 port_partition_id, union hv_port_id port_id);
> +int hv_call_connect_port(u64 port_partition_id, union hv_port_id port_id,
> +			 u64 connection_partition_id,
> +			 union hv_connection_id connection_id,
> +			 struct hv_connection_info *connection_info,
> +			 u8 connection_vtl, int node);
> +int hv_call_disconnect_port(u64 connection_partition_id,
> +			    union hv_connection_id connection_id);
> +int hv_call_notify_port_ring_empty(u32 sint_index);
> +#ifdef HV_SUPPORTS_REGISTER_INTERCEPT
> +int hv_call_register_intercept_result(u32 vp_index,
> +				  u64 partition_id,
> +				  enum hv_intercept_type intercept_type,
> +				  union hv_register_intercept_result_parameters *params);
> +#endif
> +int hv_call_signal_event_direct(u32 vp_index,
> +				u64 partition_id,
> +				u8 vtl,
> +				u8 sint,
> +				u16 flag_number,
> +				u8 *newly_signaled);
> +int hv_call_post_message_direct(u32 vp_index,
> +				u64 partition_id,
> +				u8 vtl,
> +				u32 sint_index,
> +				u8 *message);
> +
> +struct mshv_partition *mshv_partition_find(u64 partition_id) __must_hold(RCU);
> +
> +int mshv_xfer_to_guest_mode_handle_work(unsigned long ti_work);
> +
> +typedef long (*mshv_create_func_t)(void __user *user_arg);
> +typedef long (*mshv_check_ext_func_t)(u32 arg);
> +int mshv_setup_vtl_func(const mshv_create_func_t create_vtl,
> +			const mshv_check_ext_func_t check_ext);
> +int mshv_set_create_partition_func(const mshv_create_func_t func);
> +
> +#endif /* _MSHV_H */
> diff --git a/drivers/hv/mshv_eventfd.c b/drivers/hv/mshv_eventfd.c
> new file mode 100644
> index 000000000000..ddc64fe3920e
> --- /dev/null
> +++ b/drivers/hv/mshv_eventfd.c
> @@ -0,0 +1,758 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * eventfd support for mshv
> + *
> + * Heavily inspired from KVM implementation of irqfd/ioeventfd. The basic
> + * framework code is taken from the kvm implementation.
> + *
> + * All credits to kvm developers.
> + */
> +
> +#include <linux/syscalls.h>
> +#include <linux/wait.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/list.h>
> +#include <linux/workqueue.h>
> +#include <linux/eventfd.h>
> +
> +#include "mshv_eventfd.h"
> +#include "mshv.h"
> +#include "mshv_root.h"
> +
> +static struct workqueue_struct *irqfd_cleanup_wq;
> +
> +void
> +mshv_register_irq_ack_notifier(struct mshv_partition *partition,
> +			       struct mshv_irq_ack_notifier *mian)
> +{
> +	mutex_lock(&partition->irq_lock);
> +	hlist_add_head_rcu(&mian->link, &partition->irq_ack_notifier_list);
> +	mutex_unlock(&partition->irq_lock);
> +}
> +
> +void
> +mshv_unregister_irq_ack_notifier(struct mshv_partition *partition,
> +				 struct mshv_irq_ack_notifier *mian)
> +{
> +	mutex_lock(&partition->irq_lock);
> +	hlist_del_init_rcu(&mian->link);
> +	mutex_unlock(&partition->irq_lock);
> +	synchronize_rcu();
> +}
> +
> +bool
> +mshv_notify_acked_gsi(struct mshv_partition *partition, int gsi)
> +{
> +	struct mshv_irq_ack_notifier *mian;
> +	bool acked = false;
> +
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(mian, &partition->irq_ack_notifier_list,
> +			link) {
> +		if (mian->gsi == gsi) {
> +			mian->irq_acked(mian);
> +			acked = true;
> +		}
> +	}
> +	rcu_read_unlock();
> +
> +	return acked;
> +}
> +
> +static inline bool hv_should_clear_interrupt(enum hv_interrupt_type type)
> +{
> +	return type == HV_X64_INTERRUPT_TYPE_EXTINT;
> +}
> +
> +static void
> +irqfd_resampler_ack(struct mshv_irq_ack_notifier *mian)
> +{
> +	struct mshv_kernel_irqfd_resampler *resampler;
> +	struct mshv_partition *partition;
> +	struct mshv_kernel_irqfd *irqfd;
> +	int idx;
> +
> +	resampler = container_of(mian,
> +			struct mshv_kernel_irqfd_resampler, notifier);
> +	partition = resampler->partition;
> +
> +	idx = srcu_read_lock(&partition->irq_srcu);
> +
> +	hlist_for_each_entry_rcu(irqfd, &resampler->irqfds_list, resampler_hnode) {
> +		if (hv_should_clear_interrupt(irqfd->lapic_irq.control.interrupt_type))
> +			hv_call_clear_virtual_interrupt(partition->id);
> +
> +		eventfd_signal(irqfd->resamplefd, 1);
> +	}
> +
> +	srcu_read_unlock(&partition->irq_srcu, idx);
> +}
> +
> +static void
> +irqfd_assert(struct work_struct *work)
> +{
> +	struct mshv_kernel_irqfd *irqfd =
> +		container_of(work, struct mshv_kernel_irqfd, assert);
> +	struct mshv_lapic_irq *irq = &irqfd->lapic_irq;
> +
> +	hv_call_assert_virtual_interrupt(irqfd->partition->id,
> +					 irq->vector, irq->apic_id,
> +					 irq->control);
> +}
> +
> +static void
> +irqfd_inject(struct mshv_kernel_irqfd *irqfd)
> +{
> +	struct mshv_partition *partition = irqfd->partition;
> +	struct mshv_lapic_irq *irq = &irqfd->lapic_irq;
> +	unsigned int seq;
> +	int idx;
> +
> +	WARN_ON(irqfd->resampler &&
> +		!irq->control.level_triggered);
> +
> +	idx = srcu_read_lock(&partition->irq_srcu);
> +	if (irqfd->msi_entry.gsi) {
> +		if (!irqfd->msi_entry.entry_valid) {
> +			pr_warn("Invalid routing info for gsi %u",
> +				irqfd->msi_entry.gsi);
> +			srcu_read_unlock(&partition->irq_srcu, idx);
> +			return;
> +		}
> +
> +		do {
> +			seq = read_seqcount_begin(&irqfd->msi_entry_sc);
> +		} while (read_seqcount_retry(&irqfd->msi_entry_sc, seq));
> +	}
> +
> +	srcu_read_unlock(&partition->irq_srcu, idx);
> +
> +	schedule_work(&irqfd->assert);
> +}
> +
> +static void
> +irqfd_resampler_shutdown(struct mshv_kernel_irqfd *irqfd)
> +{
> +	struct mshv_kernel_irqfd_resampler *resampler = irqfd->resampler;
> +	struct mshv_partition *partition = resampler->partition;
> +
> +	mutex_lock(&partition->irqfds.resampler_lock);
> +
> +	hlist_del_rcu(&irqfd->resampler_hnode);
> +	synchronize_srcu(&partition->irq_srcu);
> +
> +	if (hlist_empty(&resampler->irqfds_list)) {
> +		hlist_del(&resampler->hnode);
> +		mshv_unregister_irq_ack_notifier(partition, &resampler->notifier);
> +		kfree(resampler);
> +	}
> +
> +	mutex_unlock(&partition->irqfds.resampler_lock);
> +}
> +
> +/*
> + * Race-free decouple logic (ordering is critical)
> + */
> +static void
> +irqfd_shutdown(struct work_struct *work)
> +{
> +	struct mshv_kernel_irqfd *irqfd =
> +		container_of(work, struct mshv_kernel_irqfd, shutdown);
> +
> +	/*
> +	 * Synchronize with the wait-queue and unhook ourselves to prevent
> +	 * further events.
> +	 */
> +	remove_wait_queue(irqfd->wqh, &irqfd->wait);
> +
> +	if (irqfd->resampler) {
> +		irqfd_resampler_shutdown(irqfd);
> +		eventfd_ctx_put(irqfd->resamplefd);
> +	}
> +
> +	/*
> +	 * We know no new events will be scheduled at this point, so block
> +	 * until all previously outstanding events have completed
> +	 */
> +	flush_work(&irqfd->assert);
> +
> +	/*
> +	 * It is now safe to release the object's resources
> +	 */
> +	eventfd_ctx_put(irqfd->eventfd);
> +	kfree(irqfd);
> +}
> +
> +/* assumes partition->irqfds.lock is held */
> +static bool
> +irqfd_is_active(struct mshv_kernel_irqfd *irqfd)
> +{
> +	return !hlist_unhashed(&irqfd->hnode);
> +}
> +
> +/*
> + * Mark the irqfd as inactive and schedule it for removal
> + *
> + * assumes partition->irqfds.lock is held
> + */
> +static void
> +irqfd_deactivate(struct mshv_kernel_irqfd *irqfd)
> +{
> +	WARN_ON(!irqfd_is_active(irqfd));
> +
> +	hlist_del(&irqfd->hnode);
> +
> +	queue_work(irqfd_cleanup_wq, &irqfd->shutdown);
> +}
> +
> +/*
> + * Called with wqh->lock held and interrupts disabled
> + */
> +static int
> +irqfd_wakeup(wait_queue_entry_t *wait, unsigned int mode,
> +		int sync, void *key)
> +{
> +	struct mshv_kernel_irqfd *irqfd =
> +		container_of(wait, struct mshv_kernel_irqfd, wait);
> +	unsigned long flags = (unsigned long)key;
> +	int idx;
> +	unsigned int seq;
> +	struct mshv_partition *partition = irqfd->partition;
> +	int ret = 0;
> +
> +	if (flags & POLLIN) {
> +		u64 cnt;
> +
> +		eventfd_ctx_do_read(irqfd->eventfd, &cnt);
> +		idx = srcu_read_lock(&partition->irq_srcu);
> +		do {
> +			seq = read_seqcount_begin(&irqfd->msi_entry_sc);
> +		} while (read_seqcount_retry(&irqfd->msi_entry_sc, seq));
> +
> +		/* An event has been signaled, inject an interrupt */
> +		irqfd_inject(irqfd);
> +		srcu_read_unlock(&partition->irq_srcu, idx);
> +
> +		ret = 1;
> +	}
> +
> +	if (flags & POLLHUP) {
> +		/* The eventfd is closing, detach from Partition */
> +		unsigned long flags;
> +
> +		spin_lock_irqsave(&partition->irqfds.lock, flags);
> +
> +		/*
> +		 * We must check if someone deactivated the irqfd before
> +		 * we could acquire the irqfds.lock since the item is
> +		 * deactivated from the mshv side before it is unhooked from
> +		 * the wait-queue.  If it is already deactivated, we can
> +		 * simply return knowing the other side will cleanup for us.
> +		 * We cannot race against the irqfd going away since the
> +		 * other side is required to acquire wqh->lock, which we hold
> +		 */
> +		if (irqfd_is_active(irqfd))
> +			irqfd_deactivate(irqfd);
> +
> +		spin_unlock_irqrestore(&partition->irqfds.lock, flags);
> +	}
> +
> +	return ret;
> +}
> +
> +/* Must be called under irqfds.lock */
> +static void irqfd_update(struct mshv_partition *partition,
> +			 struct mshv_kernel_irqfd *irqfd)
> +{
> +	write_seqcount_begin(&irqfd->msi_entry_sc);
> +	irqfd->msi_entry = mshv_msi_map_gsi(partition, irqfd->gsi);
> +	mshv_set_msi_irq(&irqfd->msi_entry, &irqfd->lapic_irq);
> +	write_seqcount_end(&irqfd->msi_entry_sc);
> +}
> +
> +void mshv_irqfd_routing_update(struct mshv_partition *partition)
> +{
> +	struct mshv_kernel_irqfd *irqfd;
> +
> +	spin_lock_irq(&partition->irqfds.lock);
> +	hlist_for_each_entry(irqfd, &partition->irqfds.items, hnode)
> +		irqfd_update(partition, irqfd);
> +	spin_unlock_irq(&partition->irqfds.lock);
> +}
> +
> +static void
> +irqfd_ptable_queue_proc(struct file *file, wait_queue_head_t *wqh,
> +			poll_table *pt)
> +{
> +	struct mshv_kernel_irqfd *irqfd =
> +		container_of(pt, struct mshv_kernel_irqfd, pt);
> +
> +	irqfd->wqh = wqh;
> +	add_wait_queue_priority(wqh, &irqfd->wait);
> +}
> +
> +static int
> +mshv_irqfd_assign(struct mshv_partition *partition,
> +		  struct mshv_irqfd *args)
> +{
> +	struct eventfd_ctx *eventfd = NULL, *resamplefd = NULL;
> +	struct mshv_kernel_irqfd *irqfd, *tmp;
> +	unsigned int events;
> +	struct fd f;
> +	int ret;
> +	int idx;
> +
> +	irqfd = kzalloc(sizeof(*irqfd), GFP_KERNEL);
> +	if (!irqfd)
> +		return -ENOMEM;
> +
> +	irqfd->partition = partition;
> +	irqfd->gsi = args->gsi;
> +	INIT_WORK(&irqfd->shutdown, irqfd_shutdown);
> +	INIT_WORK(&irqfd->assert, irqfd_assert);
> +	seqcount_spinlock_init(&irqfd->msi_entry_sc,
> +			       &partition->irqfds.lock);
> +
> +	f = fdget(args->fd);
> +	if (!f.file) {
> +		ret = -EBADF;
> +		goto out;
> +	}
> +
> +	eventfd = eventfd_ctx_fileget(f.file);
> +	if (IS_ERR(eventfd)) {
> +		ret = PTR_ERR(eventfd);
> +		goto fail;
> +	}
> +
> +	irqfd->eventfd = eventfd;
> +
> +	if (args->flags & MSHV_IRQFD_FLAG_RESAMPLE) {
> +		struct mshv_kernel_irqfd_resampler *resampler;
> +
> +		resamplefd = eventfd_ctx_fdget(args->resamplefd);
> +		if (IS_ERR(resamplefd)) {
> +			ret = PTR_ERR(resamplefd);
> +			goto fail;
> +		}
> +
> +		irqfd->resamplefd = resamplefd;
> +
> +		mutex_lock(&partition->irqfds.resampler_lock);
> +
> +		hlist_for_each_entry(resampler,
> +				    &partition->irqfds.resampler_list, hnode) {
> +			if (resampler->notifier.gsi == irqfd->gsi) {
> +				irqfd->resampler = resampler;
> +				break;
> +			}
> +		}
> +
> +		if (!irqfd->resampler) {
> +			resampler = kzalloc(sizeof(*resampler),
> +					    GFP_KERNEL_ACCOUNT);
> +			if (!resampler) {
> +				ret = -ENOMEM;
> +				mutex_unlock(&partition->irqfds.resampler_lock);
> +				goto fail;
> +			}
> +
> +			resampler->partition = partition;
> +			INIT_HLIST_HEAD(&resampler->irqfds_list);
> +			resampler->notifier.gsi = irqfd->gsi;
> +			resampler->notifier.irq_acked = irqfd_resampler_ack;
> +
> +			hlist_add_head(&resampler->hnode, &partition->irqfds.resampler_list);
> +			mshv_register_irq_ack_notifier(partition,
> +						      &resampler->notifier);
> +			irqfd->resampler = resampler;
> +		}
> +
> +		hlist_add_head_rcu(&irqfd->resampler_hnode, &irqfd->resampler->irqfds_list);
> +
> +		mutex_unlock(&partition->irqfds.resampler_lock);
> +	}
> +
> +	/*
> +	 * Install our own custom wake-up handling so we are notified via
> +	 * a callback whenever someone signals the underlying eventfd
> +	 */
> +	init_waitqueue_func_entry(&irqfd->wait, irqfd_wakeup);
> +	init_poll_funcptr(&irqfd->pt, irqfd_ptable_queue_proc);
> +
> +	spin_lock_irq(&partition->irqfds.lock);
> +	if (args->flags & MSHV_IRQFD_FLAG_RESAMPLE &&
> +	    !irqfd->lapic_irq.control.level_triggered) {
> +		/*
> +		 * Resample Fd must be for level triggered interrupt
> +		 * Otherwise return with failure
> +		 */
> +		spin_unlock_irq(&partition->irqfds.lock);
> +		ret = -EINVAL;
> +		goto fail;
> +	}
> +	ret = 0;
> +	hlist_for_each_entry(tmp, &partition->irqfds.items, hnode) {
> +		if (irqfd->eventfd != tmp->eventfd)
> +			continue;
> +		/* This fd is used for another irq already. */
> +		ret = -EBUSY;
> +		spin_unlock_irq(&partition->irqfds.lock);
> +		goto fail;
> +	}
> +
> +	idx = srcu_read_lock(&partition->irq_srcu);
> +	irqfd_update(partition, irqfd);
> +	hlist_add_head(&irqfd->hnode, &partition->irqfds.items);
> +	spin_unlock_irq(&partition->irqfds.lock);
> +
> +	/*
> +	 * Check if there was an event already pending on the eventfd
> +	 * before we registered, and trigger it as if we didn't miss it.
> +	 */
> +	events = vfs_poll(f.file, &irqfd->pt);
> +
> +	if (events & POLLIN)
> +		irqfd_inject(irqfd);
> +
> +	srcu_read_unlock(&partition->irq_srcu, idx);
> +	/*
> +	 * do not drop the file until the irqfd is fully initialized, otherwise
> +	 * we might race against the POLLHUP
> +	 */
> +	fdput(f);
> +
> +	return 0;
> +
> +fail:
> +	if (irqfd->resampler)
> +		irqfd_resampler_shutdown(irqfd);
> +
> +	if (resamplefd && !IS_ERR(resamplefd))
> +		eventfd_ctx_put(resamplefd);
> +
> +	if (eventfd && !IS_ERR(eventfd))
> +		eventfd_ctx_put(eventfd);
> +
> +	fdput(f);
> +
> +out:
> +	kfree(irqfd);
> +	return ret;
> +}
> +
> +/*
> + * shutdown any irqfd's that match fd+gsi
> + */
> +static int
> +mshv_irqfd_deassign(struct mshv_partition *partition,
> +		    struct mshv_irqfd *args)
> +{
> +	struct mshv_kernel_irqfd *irqfd;
> +	struct hlist_node *n;
> +	struct eventfd_ctx *eventfd;
> +
> +	eventfd = eventfd_ctx_fdget(args->fd);
> +	if (IS_ERR(eventfd))
> +		return PTR_ERR(eventfd);
> +
> +	hlist_for_each_entry_safe(irqfd, n, &partition->irqfds.items, hnode) {
> +		if (irqfd->eventfd == eventfd && irqfd->gsi == args->gsi)
> +			irqfd_deactivate(irqfd);
> +	}
> +
> +	eventfd_ctx_put(eventfd);
> +
> +	/*
> +	 * Block until we know all outstanding shutdown jobs have completed
> +	 * so that we guarantee there will not be any more interrupts on this
> +	 * gsi once this deassign function returns.
> +	 */
> +	flush_workqueue(irqfd_cleanup_wq);
> +
> +	return 0;
> +}
> +
> +int
> +mshv_irqfd(struct mshv_partition *partition, struct mshv_irqfd *args)
> +{
> +	if (args->flags & MSHV_IRQFD_FLAG_DEASSIGN)
> +		return mshv_irqfd_deassign(partition, args);
> +
> +	return mshv_irqfd_assign(partition, args);
> +}
> +
> +/*
> + * This function is called as the mshv VM fd is being released.
> + * Shutdown all irqfds that still remain open
> + */
> +static void
> +mshv_irqfd_release(struct mshv_partition *partition)
> +{
> +	struct mshv_kernel_irqfd *irqfd;
> +	struct hlist_node *n;
> +
> +	spin_lock_irq(&partition->irqfds.lock);
> +
> +	hlist_for_each_entry_safe(irqfd, n, &partition->irqfds.items, hnode)
> +		irqfd_deactivate(irqfd);
> +
> +	spin_unlock_irq(&partition->irqfds.lock);
> +
> +	/*
> +	 * Block until we know all outstanding shutdown jobs have completed
> +	 * since we do not take a mshv_partition* reference.
> +	 */
> +	flush_workqueue(irqfd_cleanup_wq);
> +
> +}
> +
> +int mshv_irqfd_wq_init(void)
> +{
> +	irqfd_cleanup_wq = alloc_workqueue("mshv-irqfd-cleanup", 0, 0);
> +	if (!irqfd_cleanup_wq)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +
> +void mshv_irqfd_wq_cleanup(void)
> +{
> +	destroy_workqueue(irqfd_cleanup_wq);
> +}
> +
> +/*
> + * --------------------------------------------------------------------
> + * ioeventfd: translate a MMIO memory write to an eventfd signal.
> + *
> + * userspace can register a MMIO address with an eventfd for receiving
> + * notification when the memory has been touched.
> + *
> + * TODO: Implement eventfd for PIO as well.
> + * --------------------------------------------------------------------
> + */
> +
> +static void
> +ioeventfd_release(struct kernel_mshv_ioeventfd *p, u64 partition_id)
> +{
> +	if (p->doorbell_id > 0)
> +		mshv_unregister_doorbell(partition_id, p->doorbell_id);
> +	eventfd_ctx_put(p->eventfd);
> +	kfree(p);
> +}
> +
> +/* MMIO writes trigger an event if the addr/val match */
> +static void
> +ioeventfd_mmio_write(int doorbell_id, void *data)
> +{
> +	struct mshv_partition *partition = (struct mshv_partition *)data;
> +	struct kernel_mshv_ioeventfd *p;
> +
> +	rcu_read_lock();
> +	hlist_for_each_entry_rcu(p, &partition->ioeventfds.items, hnode) {
> +		if (p->doorbell_id == doorbell_id) {
> +			eventfd_signal(p->eventfd, 1);
> +			break;
> +		}
> +	}
> +	rcu_read_unlock();
> +}
> +
> +static bool
> +ioeventfd_check_collision(struct mshv_partition *partition,
> +			  struct kernel_mshv_ioeventfd *p)
> +	__must_hold(&partition->mutex)
> +{
> +	struct kernel_mshv_ioeventfd *_p;
> +
> +	hlist_for_each_entry(_p, &partition->ioeventfds.items, hnode)
> +		if (_p->addr == p->addr && _p->length == p->length &&
> +		    (_p->wildcard || p->wildcard ||
> +		     _p->datamatch == p->datamatch))
> +			return true;
> +
> +	return false;
> +}
> +
> +static int
> +mshv_assign_ioeventfd(struct mshv_partition *partition,
> +		      struct mshv_ioeventfd *args)
> +	__must_hold(&partition->mutex)
> +{
> +	struct kernel_mshv_ioeventfd *p;
> +	struct eventfd_ctx *eventfd;
> +	u64 doorbell_flags = 0;
> +	int ret;
> +
> +	/* This mutex is currently protecting ioeventfd.items list */
> +	WARN_ON_ONCE(!mutex_is_locked(&partition->mutex));
> +
> +	if (args->flags & MSHV_IOEVENTFD_FLAG_PIO)
> +		return -EOPNOTSUPP;
> +
> +	/* must be natural-word sized */
> +	switch (args->len) {
> +	case 0:
> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_ANY;
> +		break;
> +	case 1:
> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_BYTE;
> +		break;
> +	case 2:
> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_WORD;
> +		break;
> +	case 4:
> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_DWORD;
> +		break;
> +	case 8:
> +		doorbell_flags = HV_DOORBELL_FLAG_TRIGGER_SIZE_QWORD;
> +		break;
> +	default:
> +		pr_warn("ioeventfd: invalid length specified\n");
> +		return -EINVAL;
> +	}
> +
> +	/* check for range overflow */
> +	if (args->addr + args->len < args->addr)
> +		return -EINVAL;
> +
> +	/* check for extra flags that we don't understand */
> +	if (args->flags & ~MSHV_IOEVENTFD_VALID_FLAG_MASK)
> +		return -EINVAL;
> +
> +	eventfd = eventfd_ctx_fdget(args->fd);
> +	if (IS_ERR(eventfd))
> +		return PTR_ERR(eventfd);
> +
> +	p = kzalloc(sizeof(*p), GFP_KERNEL);
> +	if (!p) {
> +		ret = -ENOMEM;
> +		goto fail;
> +	}
> +
> +	p->addr    = args->addr;
> +	p->length  = args->len;
> +	p->eventfd = eventfd;
> +
> +	/* The datamatch feature is optional, otherwise this is a wildcard */
> +	if (args->flags & MSHV_IOEVENTFD_FLAG_DATAMATCH)
> +		p->datamatch = args->datamatch;
> +	else {
> +		p->wildcard = true;
> +		doorbell_flags |= HV_DOORBELL_FLAG_TRIGGER_ANY_VALUE;
> +	}
> +
> +	if (ioeventfd_check_collision(partition, p)) {
> +		ret = -EEXIST;
> +		goto unlock_fail;
> +	}
> +
> +	ret = mshv_register_doorbell(partition->id, ioeventfd_mmio_write,
> +				     (void *)partition, p->addr,
> +				     p->datamatch, doorbell_flags);
> +	if (ret < 0) {
> +		pr_err("Failed to register ioeventfd doorbell!\n");

Nit: Do we want to print the function name at the start of the pr_err()
message?
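
For illustration only, a userspace sketch of the "%s: " plus __func__
prefix that hv_call.c in this same patch already uses. The sketch_* names
are hypothetical, the macro writes into a buffer instead of the kernel
log, and ##__VA_ARGS__ is a GNU extension (as also used in the kernel).
In the driver itself a file-level pr_fmt() definition would be another
way to get a consistent prefix.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for the kernel log: last formatted message. */
static char sketch_log[128];

/* Prefix every message with the calling function's name. */
#define sketch_pr_err(fmt, ...) \
	snprintf(sketch_log, sizeof(sketch_log), "%s: " fmt, __func__, ##__VA_ARGS__)

/* Mimics the error path after a failed doorbell registration. */
int sketch_register_doorbell(int ret)
{
	if (ret < 0) {
		int n = sketch_pr_err("failed to register ioeventfd doorbell (%d)\n", ret);

		/* The message must have been formatted without truncation. */
		assert(n > 0 && (size_t)n < sizeof(sketch_log));
		return ret;
	}
	return 0;
}

/* Helper for checking what was logged. */
int sketch_log_starts_with(const char *prefix)
{
	return strncmp(sketch_log, prefix, strlen(prefix)) == 0;
}
```

With the prefix in place, the message identifies its origin even when
several call sites share similar wording.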

- Saurabh