[PATCH v13 15/15] vfio/type1: Return the MSI geometry through VFIO_IOMMU_GET_INFO capability chains

Mon Oct 10 08:01:47 PDT 2016

Hi Alex,
On 07/10/2016 22:38, Alex Williamson wrote:
> On Fri, 7 Oct 2016 19:10:27 +0200
> Auger Eric <eric.auger at redhat.com> wrote:
> 
>> Hi Alex,
>>
>> On 06/10/2016 22:42, Alex Williamson wrote:
>>> On Thu, 6 Oct 2016 14:20:40 -0600
>>> Alex Williamson <alex.williamson at redhat.com> wrote:
>>>   
>>>> On Thu,  6 Oct 2016 08:45:31 +0000
>>>> Eric Auger <eric.auger at redhat.com> wrote:
>>>>  
>>>>> This patch allows the user-space to retrieve the MSI geometry. The
>>>>> implementation is based on capability chains, now also added to
>>>>> VFIO_IOMMU_GET_INFO.
>>>>>
>>>>> The returned info comprise:
>>>>> - whether the MSI IOVA are constrained to a reserved range (x86 case) and
>>>>>   in the positive, the start/end of the aperture,
>>>>> - or whether the IOVA aperture need to be set by the userspace. In that
>>>>>   case, the size and alignment of the IOVA window to be provided are
>>>>>   returned.
>>>>>
>>>>> In case the userspace must provide the IOVA aperture, we currently report
>>>>> a size/alignment based on all the doorbells registered by the host kernel.
>>>>> This may exceed the actual needs.
>>>>>
>>>>> Signed-off-by: Eric Auger <eric.auger at redhat.com>
>>>>>
>>>>> ---
>>>>> v11 -> v11:
>>>>> - msi_doorbell_pages was renamed msi_doorbell_calc_pages
>>>>>
>>>>> v9 -> v10:
>>>>> - move cap_offset after iova_pgsizes
>>>>> - replace __u64 alignment by __u32 order
>>>>> - introduce __u32 flags in vfio_iommu_type1_info_cap_msi_geometry and
>>>>>   fix alignment
>>>>> - call msi-doorbell API to compute the size/alignment
>>>>>
>>>>> v8 -> v9:
>>>>> - use iommu_msi_supported flag instead of programmable
>>>>> - replace IOMMU_INFO_REQUIRE_MSI_MAP flag by a more sophisticated
>>>>>   capability chain, reporting the MSI geometry
>>>>>
>>>>> v7 -> v8:
>>>>> - use iommu_domain_msi_geometry
>>>>>
>>>>> v6 -> v7:
>>>>> - remove the computation of the number of IOVA pages to be provisionned.
>>>>>   This number depends on the domain/group/device topology which can
>>>>>   dynamically change. Let's rely instead rely on an arbitrary max depending
>>>>>   on the system
>>>>>
>>>>> v4 -> v5:
>>>>> - move msi_info and ret declaration within the conditional code
>>>>>
>>>>> v3 -> v4:
>>>>> - replace former vfio_domains_require_msi_mapping by
>>>>>   more complex computation of MSI mapping requirements, especially the
>>>>>   number of pages to be provided by the user-space.
>>>>> - reword patch title
>>>>>
>>>>> RFC v1 -> v1:
>>>>> - derived from
>>>>>   [RFC PATCH 3/6] vfio: Extend iommu-info to return MSIs automap state
>>>>> - renamed allow_msi_reconfig into require_msi_mapping
>>>>> - fixed VFIO_IOMMU_GET_INFO
>>>>> ---
>>>>>  drivers/vfio/vfio_iommu_type1.c | 78 ++++++++++++++++++++++++++++++++++++++++-
>>>>>  include/uapi/linux/vfio.h       | 32 ++++++++++++++++-
>>>>>  2 files changed, 108 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
>>>>> index dc3ee5d..ce5e7eb 100644
>>>>> --- a/drivers/vfio/vfio_iommu_type1.c
>>>>> +++ b/drivers/vfio/vfio_iommu_type1.c
>>>>> @@ -38,6 +38,8 @@
>>>>>  #include <linux/workqueue.h>
>>>>>  #include <linux/dma-iommu.h>
>>>>>  #include <linux/msi-doorbell.h>
>>>>> +#include <linux/irqdomain.h>
>>>>> +#include <linux/msi.h>
>>>>>  
>>>>>  #define DRIVER_VERSION  "0.2"
>>>>>  #define DRIVER_AUTHOR   "Alex Williamson <alex.williamson at redhat.com>"
>>>>> @@ -1101,6 +1103,55 @@ static int vfio_domains_have_iommu_cache(struct vfio_iommu *iommu)
>>>>>  	return ret;
>>>>>  }
>>>>>  
>>>>> +static int compute_msi_geometry_caps(struct vfio_iommu *iommu,
>>>>> +				     struct vfio_info_cap *caps)
>>>>> +{
>>>>> +	struct vfio_iommu_type1_info_cap_msi_geometry *vfio_msi_geometry;
>>>>> +	unsigned long order = __ffs(vfio_pgsize_bitmap(iommu));
>>>>> +	struct iommu_domain_msi_geometry msi_geometry;
>>>>> +	struct vfio_info_cap_header *header;
>>>>> +	struct vfio_domain *d;
>>>>> +	bool reserved;
>>>>> +	size_t size;
>>>>> +
>>>>> +	mutex_lock(&iommu->lock);
>>>>> +	/* All domains have same require_msi_map property, pick first */
>>>>> +	d = list_first_entry(&iommu->domain_list, struct vfio_domain, next);
>>>>> +	iommu_domain_get_attr(d->domain, DOMAIN_ATTR_MSI_GEOMETRY,
>>>>> +			      &msi_geometry);
>>>>> +	reserved = !msi_geometry.iommu_msi_supported;
>>>>> +
>>>>> +	mutex_unlock(&iommu->lock);
>>>>> +
>>>>> +	size = sizeof(*vfio_msi_geometry);
>>>>> +	header = vfio_info_cap_add(caps, size,
>>>>> +				   VFIO_IOMMU_TYPE1_INFO_CAP_MSI_GEOMETRY, 1);
>>>>> +
>>>>> +	if (IS_ERR(header))
>>>>> +		return PTR_ERR(header);
>>>>> +
>>>>> +	vfio_msi_geometry = container_of(header,
>>>>> +				struct vfio_iommu_type1_info_cap_msi_geometry,
>>>>> +				header);
>>>>> +
>>>>> +	vfio_msi_geometry->flags = reserved;    
>>>>
>>>> Use the bit flag VFIO_IOMMU_MSI_GEOMETRY_RESERVED
>>>>  
>>>>> +	if (reserved) {
>>>>> +		vfio_msi_geometry->aperture_start = msi_geometry.aperture_start;
>>>>> +		vfio_msi_geometry->aperture_end = msi_geometry.aperture_end;    
>>>>
>>>> But maybe nobody has set these, did you intend to use
>>>> iommu_domain_msi_aperture_valid(), which you defined early on but never
>>>> used?
>>>>  
>>>>> +		return 0;
>>>>> +	}
>>>>> +
>>>>> +	vfio_msi_geometry->order = order;    
>>>>
>>>> I'm tempted to suggest that a user could do the same math on their own
>>>> since we provide the supported bitmap already... could it ever not be
>>>> the same? 
>>>>  
>>>>> +	/*
>>>>> +	 * we compute a system-wide requirement based on all the registered
>>>>> +	 * doorbells
>>>>> +	 */
>>>>> +	vfio_msi_geometry->size =
>>>>> +		msi_doorbell_calc_pages(order) * ((uint64_t) 1 << order);
>>>>> +
>>>>> +	return 0;
>>>>> +}
>>>>> +
>>>>>  static long vfio_iommu_type1_ioctl(void *iommu_data,
>>>>>  				   unsigned int cmd, unsigned long arg)
>>>>>  {
>>>>> @@ -1122,8 +1173,10 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>>>>  		}
>>>>>  	} else if (cmd == VFIO_IOMMU_GET_INFO) {
>>>>>  		struct vfio_iommu_type1_info info;
>>>>> +		struct vfio_info_cap caps = { .buf = NULL, .size = 0 };
>>>>> +		int ret;
>>>>>  
>>>>> -		minsz = offsetofend(struct vfio_iommu_type1_info, iova_pgsizes);
>>>>> +		minsz = offsetofend(struct vfio_iommu_type1_info, cap_offset);
>>>>>  
>>>>>  		if (copy_from_user(&info, (void __user *)arg, minsz))
>>>>>  			return -EFAULT;
>>>>> @@ -1135,6 +1188,29 @@ static long vfio_iommu_type1_ioctl(void *iommu_data,
>>>>>  
>>>>>  		info.iova_pgsizes = vfio_pgsize_bitmap(iommu);
>>>>>  
>>>>> +		ret = compute_msi_geometry_caps(iommu, &caps);
>>>>> +		if (ret)
>>>>> +			return ret;
>>>>> +
>>>>> +		if (caps.size) {
>>>>> +			info.flags |= VFIO_IOMMU_INFO_CAPS;
>>>>> +			if (info.argsz < sizeof(info) + caps.size) {
>>>>> +				info.argsz = sizeof(info) + caps.size;
>>>>> +				info.cap_offset = 0;
>>>>> +			} else {
>>>>> +				vfio_info_cap_shift(&caps, sizeof(info));
>>>>> +				if (copy_to_user((void __user *)arg +
>>>>> +						sizeof(info), caps.buf,
>>>>> +						caps.size)) {
>>>>> +					kfree(caps.buf);
>>>>> +					return -EFAULT;
>>>>> +				}
>>>>> +				info.cap_offset = sizeof(info);
>>>>> +			}
>>>>> +
>>>>> +			kfree(caps.buf);
>>>>> +		}
>>>>> +
>>>>>  		return copy_to_user((void __user *)arg, &info, minsz) ?
>>>>>  			-EFAULT : 0;
>>>>>  
>>>>> diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
>>>>> index 4a9dbc2..8dae013 100644
>>>>> --- a/include/uapi/linux/vfio.h
>>>>> +++ b/include/uapi/linux/vfio.h
>>>>> @@ -488,7 +488,35 @@ struct vfio_iommu_type1_info {
>>>>>  	__u32	argsz;
>>>>>  	__u32	flags;
>>>>>  #define VFIO_IOMMU_INFO_PGSIZES (1 << 0)	/* supported page sizes info */
>>>>> -	__u64	iova_pgsizes;		/* Bitmap of supported page sizes */
>>>>> +#define VFIO_IOMMU_INFO_CAPS	(1 << 1)	/* Info supports caps */
>>>>> +	__u64	iova_pgsizes;	/* Bitmap of supported page sizes */
>>>>> +	__u32	__resv;
>>>>> +	__u32   cap_offset;	/* Offset within info struct of first cap */
>>>>> +};    
>>>>
>>>> I understand the padding, but not the ordering.  Why not end with
>>>> padding?
>>>>  
>>>>> +
>>>>> +#define VFIO_IOMMU_TYPE1_INFO_CAP_MSI_GEOMETRY	1
>>>>> +
>>>>> +/*
>>>>> + * The MSI geometry capability allows to report the MSI IOVA geometry:
>>>>> + * - either the MSI IOVAs are constrained within a reserved IOVA aperture
>>>>> + *   whose boundaries are given by [@aperture_start, @aperture_end].
>>>>> + *   this is typically the case on x86 host. The userspace is not allowed
>>>>> + *   to map userspace memory at IOVAs intersecting this range using
>>>>> + *   VFIO_IOMMU_MAP_DMA.
>>>>> + * - or the MSI IOVAs are not requested to belong to any reserved range;
>>>>> + *   in that case the userspace must provide an IOVA window characterized by
>>>>> + *   @size and @alignment using VFIO_IOMMU_MAP_DMA with RESERVED_MSI_IOVA flag.
>>>>> + */
>>>>> +struct vfio_iommu_type1_info_cap_msi_geometry {
>>>>> +	struct vfio_info_cap_header header;
>>>>> +	__u32 flags;
>>>>> +#define VFIO_IOMMU_MSI_GEOMETRY_RESERVED (1 << 0) /* reserved geometry */
>>>>> +	/* not reserved */
>>>>> +	__u32 order; /* iommu page order used for aperture alignment*/
>>>>> +	__u64 size; /* IOVA aperture size (bytes) the userspace must provide */
>>>>> +	/* reserved */
>>>>> +	__u64 aperture_start;
>>>>> +	__u64 aperture_end;    
>>>>
>>>> Should these be a union?  We never set them both.  Should the !reserved
>>>> case have a flag as well, so the user can positively identify what's
>>>> being provided?  
>>>
>>> Actually, is there really any need to fit both of these within the same
>>> structure?  Part of the idea of the capability chains is we can create
>>> a capability for each new thing we want to describe.  So, we could
>>> simply define a generic reserved IOVA range capability with a 'start'
>>> and 'end' and then another capability to define MSI mapping
>>> requirements.  Thanks,  
>> Yes your suggested approach makes sense to me.
>>
>> One reason why I proceeded that way is we are mixing things at iommu.h
>> level too. Personally I would have preferred to separate things:
>> 1) add a new IOMMU_CAP_TRANSLATE_MSI capability in iommu_cap
>> 2) rename iommu_msi_supported into "programmable" bool: reporting
>> whether the aperture is reserved or programmable.
>>
>> In the early releases I think it was as above but slightly we moved to a
>> mixed description.
>>
>> What do you think?
> 
> The API certainly doesn't seem like it has a cohesive feel to me.  It's
> not entirely clear to me how we know when we need to register a DMA MSI
> cookie, or how we know that the MSI doorbell API is actually
> initialized and in use by the MSI/IOMMU layer, or exactly what is the
> MSI geometry telling me.  Perhaps this is why the code doesn't seem to
> have a good rejection mechanism for architectures that need it versus
> those that don't, it's too hard to tell.
> 
> Maybe we can look at what we think the user API should be and work
> backwards.  For x86 we simply have a reserved range of IOVA.  I'm not
> entirely sure it adds to the user API to know that it's for MSI, it's
> just a range of IOVAs that we cannot allocate for regular DMA.  In
> fact, we currently lack a mechanism for describing the IOVA space of
> the IOMMU at all, so rather than focusing on a mechanism to describe a
> hole in the IOVA space, we might simply want to focus on a mechanism to
> describe the available IOVA space.  Everybody needs that, not just
> x86.  That sort of sounds like a VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE
> that perhaps looks like:
> 
> struct vfio_iommu_type1_info_cap_iova_range {
> 	struct vfio_info_cap_header header;
> 	u64 start;
> 	u64 end;
> };
> 
> Clearly we need to allow multiple of these in the capability chain
> since the existing x86 MSI range bisects this address space.
> 
> To support this, we basically need the same information from the IOMMU
> API.  We already have DOMAIN_ATTR_GEOMETRY, which should give us the
> base IOVA range, but we don't have anything describing the gaps.  We
> don't know how many sources of gaps we'll have in the future, but let's
> keep it simple and assume we can look for MSI gaps and add other
> possible sources of gaps in the future, it's an internal API after all.
> So we can use DOMAIN_ATTR_MSI_GEOMETRY to tell us about the (we assume
> one) MSI range of reserved IOVA within DOMAIN_ATTR_GEOMETRY.  For x86
> this is fixed, for SMMU this is a zero range until someone programs it.
> 
> Now, what does a user need to know to add a reserved MSI IOVA range?
> They need to know a) that it needs to be done, and b) how big to make
> it (and maybe alignment requirements).  Really all we need to describe
> then is b) since b) implies a). So maybe that gives us another
> capability chain entry:
> 
> struct vfio_iommu_type1_info_cap_msi_resv {
> 	struct vfio_info_cap_header header;
> 	u64 size;
> 	u64 alignment;
> };

I like the approach and I like the idea to separate the 2 issues in
separate structs, both at VFIO level and IOMMU level. It makes even more
sense now we have the other requirement to handle host PCIe host bridge
window.
> 
> It doesn't seem like we need to waste a flag bit on
> vfio_iommu_type1_info.flags for this since the existence of this
> capability would imply that VFIO_IOMMU_MAP_DMA supports an MSI_RESV
> flag.
I agree.
> 
> So what do we need from the kernel infrastructure to make that happen?
> Well, we need a) and b) above, and again b) can imply a), so if the
> IOMMU API provided a DOMAIN_ATTR_MSI_RESV, providing the same
> size/alignment, then we're nearly there.
Agreed
  Then we just need a way to
> set that range, which I'd probably try to plumb through the IOMMU API
> rather than pulling in separate doorbell APIs and DMA cookie APIs.  If
> it's going to pull together all those different things, let's at least
> only do that in one place so we can expose a consistent API through the
> IOMMU API.  Obviously once a range is set, DOMAIN_ATTR_MSI_RESV should
> report that range, so if the user were to look at the type1 info
> capability chain again, the available IOVA ranges would reflect the now
> reserved range.
So my plan is to respin the passthrough series with
vfio_iommu_type1_info_cap_msi_resv and associated iommu struct.

I would prefer to send a separate series to report IOVA  usable address
space.

Thanks

Eric
> 
> Maybe that's more than you're asking for, but that's the approach I
> would take to solidify the API.  Thanks,
> 
> Alex
> 
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel
>