[PATCH v2 2/4] iommu/io-pgtable-arm: Add read_and_clear_dirty() support

Shameerali Kolothum Thodi shameerali.kolothum.thodi at huawei.com
Wed Apr 24 01:01:50 PDT 2024



> -----Original Message-----
> From: Ryan Roberts <ryan.roberts at arm.com>
> Sent: Tuesday, April 23, 2024 4:56 PM
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi at huawei.com>;
> iommu at lists.linux.dev; linux-arm-kernel at lists.infradead.org
> Cc: joro at 8bytes.org; jgg at nvidia.com; kevin.tian at intel.com;
> nicolinc at nvidia.com; mshavit at google.com; robin.murphy at arm.com;
> will at kernel.org; joao.m.martins at oracle.com; jiangkunkun
> <jiangkunkun at huawei.com>; zhukeqian <zhukeqian1 at huawei.com>;
> Linuxarm <linuxarm at huawei.com>
> Subject: Re: [PATCH v2 2/4] iommu/io-pgtable-arm: Add
> read_and_clear_dirty() support
> 
> On 22/02/2024 09:49, Shameer Kolothum wrote:
> > .read_and_clear_dirty() IOMMU domain op takes care of reading the
> > dirty bits (i.e. PTE has both DBM and AP[2] set) and marshalling into
> > a bitmap of
> 
> Err... a PTE is HW-dirty if DBM is set and AP[2] is *clear*, right? (AP[2] is
> "RDONLY" bit, and is initially set, then the HW clears it on first write. See
> pte_hw_dirty() in linux/arch/arm64/include/asm/pgtable.h).

Oops..yes, the comment here is indeed wrong.

I will update and add something like below to make it clear: 

                        DBM bit     AP[2] ("RDONLY" bit)
 1. writeable_clean        1                 1
 2. writeable_dirty        1                 0
 3. read-only              0                 1
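
So for dirty tracking we only care about state 2 above (DBM set, AP[2]
clear). Roughly, the check ends up as below (just a sketch using the macro
names from this patch; the helper name is made up for illustration):

/* writeable-dirty: DBM set and AP[2] ("RDONLY") clear */
static inline bool iopte_hw_dirty(arm_lpae_iopte pte)
{
	return (pte & ARM_LPAE_PTE_AP_WRITABLE) == ARM_LPAE_PTE_DBM;
}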

> 
> > a given page size.
> >
> > While reading the dirty bits we also clear the PTE AP[2] bit to mark
> > it as writable-clean depending on read_and_clear_dirty() flags.
> 
> You would need to *set* AP[2] to mark it clean.
> 
> >
> > Structure it in a way that the IOPTE walker is generic, and so we pass
> > a function pointer over what to do on a per-PTE basis.
> >
> > Signed-off-by: Joao Martins <joao.m.martins at oracle.com>
> > Signed-off-by: Shameer Kolothum
> <shameerali.kolothum.thodi at huawei.com>
> > ---
> >  drivers/iommu/io-pgtable-arm.c | 128
> > ++++++++++++++++++++++++++++++++-
> >  1 file changed, 127 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
> > index f7828a7aad41..1ce7b7a3c1e8 100644
> > --- a/drivers/iommu/io-pgtable-arm.c
> > +++ b/drivers/iommu/io-pgtable-arm.c
> > @@ -75,6 +75,7 @@
> >
> >  #define ARM_LPAE_PTE_NSTABLE		(((arm_lpae_iopte)1) << 63)
> >  #define ARM_LPAE_PTE_XN			(((arm_lpae_iopte)3) << 53)
> > +#define ARM_LPAE_PTE_DBM		(((arm_lpae_iopte)1) << 51)
> >  #define ARM_LPAE_PTE_AF			(((arm_lpae_iopte)1) << 10)
> >  #define ARM_LPAE_PTE_SH_NS		(((arm_lpae_iopte)0) << 8)
> >  #define ARM_LPAE_PTE_SH_OS		(((arm_lpae_iopte)2) << 8)
> > @@ -84,7 +85,7 @@
> >
> >  #define ARM_LPAE_PTE_ATTR_LO_MASK	(((arm_lpae_iopte)0x3ff) << 2)
> >  /* Ignore the contiguous bit for block splitting */
> > -#define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)6) << 52)
> > +#define ARM_LPAE_PTE_ATTR_HI_MASK	(((arm_lpae_iopte)0xd) << 51)
> 
> Perhaps this is easier to understand:
> 
> #define ARM_LPAE_PTE_ATTR_HI_MASK	(ARM_LPAE_PTE_XN | ARM_LPAE_PTE_DBM)

Ok.

> 
> >  #define ARM_LPAE_PTE_ATTR_MASK		(ARM_LPAE_PTE_ATTR_LO_MASK |	\
> >  					 ARM_LPAE_PTE_ATTR_HI_MASK)
> >  /* Software bit for solving coherency races */
> > @@ -93,6 +94,9 @@
> >  /* Stage-1 PTE */
> >  #define ARM_LPAE_PTE_AP_UNPRIV		(((arm_lpae_iopte)1) << 6)
> >  #define ARM_LPAE_PTE_AP_RDONLY		(((arm_lpae_iopte)2) << 6)
> > +#define ARM_LPAE_PTE_AP_RDONLY_BIT	7
> 
> Perhaps:
> 
> #define ARM_LPAE_PTE_AP_RDONLY_BIT	7
> #define ARM_LPAE_PTE_AP_RDONLY		(((arm_lpae_iopte)1) << ARM_LPAE_PTE_AP_RDONLY_BIT)

Ok.

> 
> > +#define ARM_LPAE_PTE_AP_WRITABLE	(ARM_LPAE_PTE_AP_RDONLY | \
> > +					 ARM_LPAE_PTE_DBM)
> 
> Perhaps: ARM_LPAE_PTE_AP_WRITABLE_CLEAN ?

Sure. That’s more clear.
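
For v3 I'll rework the defines roughly along these lines (sketch only;
final placement/naming as per the review comments):

#define ARM_LPAE_PTE_DBM		(((arm_lpae_iopte)1) << 51)
#define ARM_LPAE_PTE_ATTR_HI_MASK	(ARM_LPAE_PTE_XN | ARM_LPAE_PTE_DBM)
#define ARM_LPAE_PTE_AP_RDONLY_BIT	7
#define ARM_LPAE_PTE_AP_RDONLY		(((arm_lpae_iopte)1) << \
					 ARM_LPAE_PTE_AP_RDONLY_BIT)
#define ARM_LPAE_PTE_AP_WRITABLE_CLEAN	(ARM_LPAE_PTE_AP_RDONLY | \
					 ARM_LPAE_PTE_DBM)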

 > 
> >  #define ARM_LPAE_PTE_ATTRINDX_SHIFT	2
> >  #define ARM_LPAE_PTE_nG			(((arm_lpae_iopte)1) << 11)
> >
> > @@ -729,6 +733,127 @@ static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
> >  	return iopte_to_paddr(pte, data) | iova;
> >  }
> >
> > +struct arm_lpae_iopte_read_dirty {
> > +	unsigned long flags;
> > +	struct iommu_dirty_bitmap *dirty;
> > +};
> > +
> > +typedef int (*io_pgtable_visit_fn_t)(unsigned long iova, size_t size,
> > +				     arm_lpae_iopte *ptep, void *opaque);
> > +struct io_pgtable_walk_data {
> > +	const io_pgtable_visit_fn_t	fn;
> > +	void				*opaque;
> > +	u64				addr;
> > +	const u64			end;
> > +};
> > +
> > +static int __arm_lpae_iopte_walk(struct arm_lpae_io_pgtable *data,
> > +				 struct io_pgtable_walk_data *walk_data,
> > +				 arm_lpae_iopte *ptep,
> > +				 int lvl);
> > +
> > +static int __arm_lpae_read_and_clear_dirty(unsigned long iova, size_t size,
> > +					   arm_lpae_iopte *ptep, void *opaque)
> > +{
> > +	struct arm_lpae_iopte_read_dirty *arg = opaque;
> > +	struct iommu_dirty_bitmap *dirty = arg->dirty;
> > +	arm_lpae_iopte pte;
> > +
> > +	pte = READ_ONCE(*ptep);
> > +	if (WARN_ON(!pte))
> > +		return -EINVAL;
> 
> You've already read and checked it's not zero in io_pgtable_visit(). Either the
> walker is considered generic and probably shouldn't care if the pte is 0 (so
> just do the check here), or the walker is specific to this use case, in which
> case there is no need for the function pointer callback inefficiencies, and
> you've already READ_ONCE() the ptep so there is no need to repeat it. (Multiple
> ptep_get()s do make a meaningful performance impact in the core-mm.)

Yes, that check is a duplication. Will remove it. I do have plans to support block
page split/merge functionality at migration start, and with that in mind I tried
to make the walker generic with a callback pointer. Maybe it is not worth it at
this time; I will revisit it when we add the split/merge functionality.
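
Something like the below is what I have in mind for the visit path once the
duplicate READ_ONCE() and the callback are gone (rough sketch only, assuming
dirty/flags are carried directly in io_pgtable_walk_data and the
WRITABLE_CLEAN rename):

static int io_pgtable_visit(struct arm_lpae_io_pgtable *data,
			    struct io_pgtable_walk_data *walk_data,
			    arm_lpae_iopte *ptep, int lvl)
{
	struct io_pgtable *iop = &data->iop;
	arm_lpae_iopte pte = READ_ONCE(*ptep);

	if (WARN_ON(!pte))
		return -EINVAL;

	if (iopte_leaf(pte, lvl, iop->fmt)) {
		size_t size = ARM_LPAE_BLOCK_SIZE(lvl, data);

		/* writeable-dirty: DBM set, AP[2]/RDONLY clear */
		if ((pte & ARM_LPAE_PTE_AP_WRITABLE_CLEAN) == ARM_LPAE_PTE_DBM) {
			iommu_dirty_bitmap_record(walk_data->dirty,
						  walk_data->addr, size);
			if (!(walk_data->flags & IOMMU_DIRTY_NO_CLEAR))
				set_bit(ARM_LPAE_PTE_AP_RDONLY_BIT,
					(unsigned long *)ptep);
		}
		walk_data->addr += size;
		return 0;
	}

	ptep = iopte_deref(pte, data);
	return __arm_lpae_iopte_walk(data, walk_data, ptep, lvl + 1);
}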

> > +
> > +	if ((pte & ARM_LPAE_PTE_AP_WRITABLE) == ARM_LPAE_PTE_AP_WRITABLE)
> > +		return 0;
> 
> What about RO ptes? This check only checks that it is writable-clean. So you
> have 2 remaining options; writable-dirty and read-only. Suggest:

Ok. Got it.

> 
> if (pte_hw_dirty(pte)) {
> 	iommu_dirty_bitmap_record(dirty, iova, size);
> 	if (!(arg->flags & IOMMU_DIRTY_NO_CLEAR))
> 		set_bit(ARM_LPAE_PTE_AP_RDONLY_BIT, (unsigned long *)ptep);
> }
> 
> > +
> > +	iommu_dirty_bitmap_record(dirty, iova, size);
> > +	if (!(arg->flags & IOMMU_DIRTY_NO_CLEAR))
> > +		set_bit(ARM_LPAE_PTE_AP_RDONLY_BIT, (unsigned long *)ptep);
> > +	return 0;
> > +}
> > +
> > +static int io_pgtable_visit(struct arm_lpae_io_pgtable *data,
> > +			    struct io_pgtable_walk_data *walk_data,
> > +			    arm_lpae_iopte *ptep, int lvl)
> > +{
> > +	struct io_pgtable *iop = &data->iop;
> > +	arm_lpae_iopte pte = READ_ONCE(*ptep);
> > +
> > +	if (WARN_ON(!pte))
> > +		return -EINVAL;
> > +
> > +	if (iopte_leaf(pte, lvl, iop->fmt)) {
> > +		size_t size = ARM_LPAE_BLOCK_SIZE(lvl, data);
> > +
> > +		walk_data->fn(walk_data->addr, size, ptep, walk_data->opaque);
> > +		walk_data->addr += size;
> > +		return 0;
> > +	}
> > +
> > +	ptep = iopte_deref(pte, data);
> > +	return __arm_lpae_iopte_walk(data, walk_data, ptep, lvl + 1);
> > +}
> > +
> > +static int __arm_lpae_iopte_walk(struct arm_lpae_io_pgtable *data,
> > +				 struct io_pgtable_walk_data *walk_data,
> > +				 arm_lpae_iopte *ptep,
> > +				 int lvl)
> > +{
> > +	u32 idx;
> > +	int max_entries, ret;
> > +
> > +	if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
> > +		return -EINVAL;
> > +
> > +	if (lvl == data->start_level)
> > +		max_entries = ARM_LPAE_PGD_SIZE(data) / sizeof(arm_lpae_iopte);
> > +	else
> > +		max_entries = ARM_LPAE_PTES_PER_TABLE(data);
> > +
> > +	for (idx = ARM_LPAE_LVL_IDX(walk_data->addr, lvl, data);
> > +	     (idx < max_entries) && (walk_data->addr < walk_data->end); ++idx) {
> > +		arm_lpae_iopte *pteref = &ptep[idx];
> 
> nit: Personally I would get rid of this and just pass `ptep + idx` to
> io_pgtable_visit()

Ack.
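
i.e. the loop body simply becomes (sketch):

		ret = io_pgtable_visit(data, walk_data, ptep + idx, lvl);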

> > +
> > +		ret = io_pgtable_visit(data, walk_data, pteref, lvl);
> > +		if (ret)
> > +			return ret;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +static int arm_lpae_read_and_clear_dirty(struct io_pgtable_ops *ops,
> > +					 unsigned long iova, size_t size,
> > +					 unsigned long flags,
> > +					 struct iommu_dirty_bitmap *dirty)
> > +{
> > +	struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
> > +	struct io_pgtable_cfg *cfg = &data->iop.cfg;
> > +	struct arm_lpae_iopte_read_dirty arg = {
> > +		.flags = flags, .dirty = dirty,
> > +	};
> > +	struct io_pgtable_walk_data walk_data = {
> > +		.fn = __arm_lpae_read_and_clear_dirty,
> > +		.opaque = &arg,
> 
> Do you have plans to reuse the walker? If not, then I wonder if separating
> the argument and callback function are valuable? It will certainly be more
> efficient without the per-pte indirect call.

Ok. I will remove the indirection for now and may revisit it when we add the
split/merge functionality.
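
With the indirection gone, walk_data would roughly become (sketch only;
field names/order may change in v3):

struct io_pgtable_walk_data {
	struct iommu_dirty_bitmap	*dirty;
	unsigned long			flags;
	u64				addr;
	const u64			end;
};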

> 
> > +		.addr = iova,
> > +		.end = iova + size - 1,
> 
> Bug?: You are initializing end to be inclusive (i.e. last valid address in range).
> But __arm_lpae_iopte_walk() is using it as exclusive (i.e. first address after
> range) - see "walk_data->addr < walk_data->end".
> 
> mm code usually treats end as exclusive, so suggest removing the "- 1".

I will double check this.
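
If end is indeed meant to be exclusive, the init would simply be (sketch):

		.addr = iova,
		.end = iova + size,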

> 
> > +	};
> > +	arm_lpae_iopte *ptep = data->pgd;
> > +	int lvl = data->start_level;
> > +	long iaext = (s64)iova >> cfg->ias;
> 
> Why cast this to signed? And why is the cast s64 and the result long? Suggest
> they should both be consistent at least. But probably clearer to just do:
> 
> WARN_ON((iova + size - 1) & (BIT(cfg->ias) - 1))
> 
> in the below if()? That way you are checking the full range too.

Ok.
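
Something along these lines is what I'll try for v3 (sketch; I think only
the address bits above cfg->ias should trip the warning, so a shift may be
the simplest way to check the full range):

	if (WARN_ON((iova + size - 1) >> cfg->ias))
		return -EINVAL;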

Thanks,
Shameer
 

