[PATCH v2 2/4] iommu/io-pgtable-arm: Add read_and_clear_dirty() support
Ryan Roberts
ryan.roberts at arm.com
Wed Apr 24 01:36:28 PDT 2024
On 24/04/2024 09:01, Shameerali Kolothum Thodi wrote:
>
>
>> -----Original Message-----
>> From: Ryan Roberts <ryan.roberts at arm.com>
>> Sent: Tuesday, April 23, 2024 4:56 PM
>> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi at huawei.com>;
>> iommu at lists.linux.dev; linux-arm-kernel at lists.infradead.org
>> Cc: joro at 8bytes.org; jgg at nvidia.com; kevin.tian at intel.com;
>> nicolinc at nvidia.com; mshavit at google.com; robin.murphy at arm.com;
>> will at kernel.org; joao.m.martins at oracle.com; jiangkunkun
>> <jiangkunkun at huawei.com>; zhukeqian <zhukeqian1 at huawei.com>;
>> Linuxarm <linuxarm at huawei.com>
>> Subject: Re: [PATCH v2 2/4] iommu/io-pgtable-arm: Add
>> read_and_clear_dirty() support
>>
>> On 22/02/2024 09:49, Shameer Kolothum wrote:
>>> .read_and_clear_dirty() IOMMU domain op takes care of reading the
>>> dirty bits (i.e. PTE has both DBM and AP[2] set) and marshalling into
>>> a bitmap of
>>
>> Err... a PTE is HW-dirty if DBM is set and AP[2] is *clear*, right? (AP[2] is
>> "RDONLY" bit, and is initially set, then the HW clears it on first write. See
>> pte_hw_dirty() in linux/arch/arm64/include/asm/pgtable.h).
>
> Oops... yes, the comment here is indeed wrong.
>
> I will update and add something like below to make it clear:
>
>                        DBM bit    AP[2] ("RDONLY" bit)
> 1. writeable_clean        1              1
> 2. writeable_dirty        1              0
> 3. read-only              0              1
Yes, sounds good!
>
>>
>>> a given page size.
>>>
>>> While reading the dirty bits we also clear the PTE AP[2] bit to mark
>>> it as writable-clean depending on read_and_clear_dirty() flags.
>>
>> You would need to *set* AP[2] to mark it clean.
>>
>>>
>>> Structure it in a way that the IOPTE walker is generic, and so we pass
>>> a function pointer over what to do on a per-PTE basis.
>>>
>>> Signed-off-by: Joao Martins <joao.m.martins at oracle.com>
>>> Signed-off-by: Shameer Kolothum
>> <shameerali.kolothum.thodi at huawei.com>
>>> ---
>>> drivers/iommu/io-pgtable-arm.c | 128 ++++++++++++++++++++++++++++++++-
>>> 1 file changed, 127 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/drivers/iommu/io-pgtable-arm.c b/drivers/iommu/io-pgtable-arm.c
>>> index f7828a7aad41..1ce7b7a3c1e8 100644
>>> --- a/drivers/iommu/io-pgtable-arm.c
>>> +++ b/drivers/iommu/io-pgtable-arm.c
>>> @@ -75,6 +75,7 @@
>>>
>>> #define ARM_LPAE_PTE_NSTABLE (((arm_lpae_iopte)1) << 63)
>>> #define ARM_LPAE_PTE_XN (((arm_lpae_iopte)3) << 53)
>>> +#define ARM_LPAE_PTE_DBM (((arm_lpae_iopte)1) << 51)
>>> #define ARM_LPAE_PTE_AF (((arm_lpae_iopte)1) << 10)
>>> #define ARM_LPAE_PTE_SH_NS (((arm_lpae_iopte)0) << 8)
>>> #define ARM_LPAE_PTE_SH_OS (((arm_lpae_iopte)2) << 8)
>>> @@ -84,7 +85,7 @@
>>>
>>> #define ARM_LPAE_PTE_ATTR_LO_MASK (((arm_lpae_iopte)0x3ff) << 2)
>>> /* Ignore the contiguous bit for block splitting */
>>> -#define ARM_LPAE_PTE_ATTR_HI_MASK (((arm_lpae_iopte)6) << 52)
>>> +#define ARM_LPAE_PTE_ATTR_HI_MASK (((arm_lpae_iopte)0xd) << 51)
>>
>> Perhaps this is easier to understand:
>>
>> #define ARM_LPAE_PTE_ATTR_HI_MASK (ARM_LPAE_PTE_XN | ARM_LPAE_PTE_DBM)
>
> Ok.
>
>>
>>> #define ARM_LPAE_PTE_ATTR_MASK (ARM_LPAE_PTE_ATTR_LO_MASK | \
>>> 				ARM_LPAE_PTE_ATTR_HI_MASK)
>>> /* Software bit for solving coherency races */
>>> @@ -93,6 +94,9 @@
>>> /* Stage-1 PTE */
>>> #define ARM_LPAE_PTE_AP_UNPRIV (((arm_lpae_iopte)1) << 6)
>>> #define ARM_LPAE_PTE_AP_RDONLY (((arm_lpae_iopte)2) << 6)
>>> +#define ARM_LPAE_PTE_AP_RDONLY_BIT 7
>>
>> Perhaps:
>>
>> #define ARM_LPAE_PTE_AP_RDONLY_BIT 7
>> #define ARM_LPAE_PTE_AP_RDONLY (((arm_lpae_iopte)1) << ARM_LPAE_PTE_AP_RDONLY_BIT)
>
> Ok.
>
>>
>>> +#define ARM_LPAE_PTE_AP_WRITABLE (ARM_LPAE_PTE_AP_RDONLY | \
>>> +					ARM_LPAE_PTE_DBM)
>>
>> Perhaps: ARM_LPAE_PTE_AP_WRITABLE_CLEAN ?
>
> Sure. That’s clearer.
>
>>> #define ARM_LPAE_PTE_ATTRINDX_SHIFT 2
>>> #define ARM_LPAE_PTE_nG (((arm_lpae_iopte)1) << 11)
>>>
>>> @@ -729,6 +733,127 @@ static phys_addr_t arm_lpae_iova_to_phys(struct io_pgtable_ops *ops,
>>> 	return iopte_to_paddr(pte, data) | iova;
>>> }
>>>
>>> +struct arm_lpae_iopte_read_dirty {
>>> + unsigned long flags;
>>> + struct iommu_dirty_bitmap *dirty;
>>> +};
>>> +
>>> +typedef int (*io_pgtable_visit_fn_t)(unsigned long iova, size_t size,
>>> +				     arm_lpae_iopte *ptep, void *opaque);
>>> +
>>> +struct io_pgtable_walk_data {
>>> +	const io_pgtable_visit_fn_t fn;
>>> +	void *opaque;
>>> +	u64 addr;
>>> +	const u64 end;
>>> +};
>>> +
>>> +static int __arm_lpae_iopte_walk(struct arm_lpae_io_pgtable *data,
>>> + struct io_pgtable_walk_data *walk_data,
>>> + arm_lpae_iopte *ptep,
>>> + int lvl);
>>> +
>>> +static int __arm_lpae_read_and_clear_dirty(unsigned long iova, size_t size,
>>> +					   arm_lpae_iopte *ptep, void *opaque)
>>> +{
>>> + struct arm_lpae_iopte_read_dirty *arg = opaque;
>>> + struct iommu_dirty_bitmap *dirty = arg->dirty;
>>> + arm_lpae_iopte pte;
>>> +
>>> + pte = READ_ONCE(*ptep);
>>> + if (WARN_ON(!pte))
>>> + return -EINVAL;
>>
>> You've already read and checked its not zero in io_pgtable_visit(). Either the
>> walker is considered generic and probably shouldn't care if the pte is 0, (so
>> just do check here). Or the walker is specific for this use case, in which case
>> there is no need for the function pointer callback inefficiencies (and you've
>> already READ_ONCE() the ptep so no need to repeat. (Multiple ptep_get()s
>> does make a meaningful performance impact in the core-mm).
>
> Yes, that check is a duplication. Will remove that. I do have plans to support block
> page split/merge functionality during migration start, and with that in mind I tried
> to make it generic with a callback ptr. Maybe it is not worth it at this time; I will
> revisit it when we add the split/merge functionality.
Ahh ok. My preference would be to keep it simple while this is the only user,
then generalize it when you add a second user. But if you plan to add the second
user imminently, then feel free to leave it general.
>
>>> +
>>> +	if ((pte & ARM_LPAE_PTE_AP_WRITABLE) == ARM_LPAE_PTE_AP_WRITABLE)
>>> + return 0;
>>
>> What about RO ptes? This check only checks that it is writable-clean. So you
>> have 2 remaining options; writable-dirty and read-only. Suggest:
>
> Ok. Got it.
>
>>
>> if (pte_hw_dirty(pte)) {
>> 	iommu_dirty_bitmap_record(dirty, iova, size);
>> 	if (!(arg->flags & IOMMU_DIRTY_NO_CLEAR))
>> 		set_bit(ARM_LPAE_PTE_AP_RDONLY_BIT, (unsigned long *)ptep);
>> }
>>
>>> +
>>> + iommu_dirty_bitmap_record(dirty, iova, size);
>>> + if (!(arg->flags & IOMMU_DIRTY_NO_CLEAR))
>>> +	set_bit(ARM_LPAE_PTE_AP_RDONLY_BIT, (unsigned long *)ptep);
>>> + return 0;
>>> +}
>>> +
>>> +static int io_pgtable_visit(struct arm_lpae_io_pgtable *data,
>>> + struct io_pgtable_walk_data *walk_data,
>>> +			    arm_lpae_iopte *ptep, int lvl)
>>> +{
>>> + struct io_pgtable *iop = &data->iop;
>>> + arm_lpae_iopte pte = READ_ONCE(*ptep);
>>> +
>>> + if (WARN_ON(!pte))
>>> + return -EINVAL;
>>> +
>>> + if (iopte_leaf(pte, lvl, iop->fmt)) {
>>> + size_t size = ARM_LPAE_BLOCK_SIZE(lvl, data);
>>> +
>>> +		walk_data->fn(walk_data->addr, size, ptep, walk_data->opaque);
>>> + walk_data->addr += size;
>>> + return 0;
>>> + }
>>> +
>>> + ptep = iopte_deref(pte, data);
>>> +	return __arm_lpae_iopte_walk(data, walk_data, ptep, lvl + 1);
>>> +}
>>> +
>>> +static int __arm_lpae_iopte_walk(struct arm_lpae_io_pgtable *data,
>>> + struct io_pgtable_walk_data *walk_data,
>>> + arm_lpae_iopte *ptep,
>>> + int lvl)
>>> +{
>>> + u32 idx;
>>> + int max_entries, ret;
>>> +
>>> + if (WARN_ON(lvl == ARM_LPAE_MAX_LEVELS))
>>> + return -EINVAL;
>>> +
>>> + if (lvl == data->start_level)
>>> +		max_entries = ARM_LPAE_PGD_SIZE(data) / sizeof(arm_lpae_iopte);
>>> + else
>>> + max_entries = ARM_LPAE_PTES_PER_TABLE(data);
>>> +
>>> + for (idx = ARM_LPAE_LVL_IDX(walk_data->addr, lvl, data);
>>> +	     (idx < max_entries) && (walk_data->addr < walk_data->end); ++idx) {
>>> + arm_lpae_iopte *pteref = &ptep[idx];
>>
>> nit: Personally I would get rid of this and just pass `ptep + idx` to
>> io_pgtable_visit()
>
> Ack.
>
>>> +
>>> + ret = io_pgtable_visit(data, walk_data, pteref, lvl);
>>> + if (ret)
>>> + return ret;
>>> + }
>>> +
>>> + return 0;
>>> +}
>>> +
>>> +static int arm_lpae_read_and_clear_dirty(struct io_pgtable_ops *ops,
>>> + unsigned long iova, size_t size,
>>> + unsigned long flags,
>>> +					 struct iommu_dirty_bitmap *dirty)
>>> +{
>>> + struct arm_lpae_io_pgtable *data = io_pgtable_ops_to_data(ops);
>>> + struct io_pgtable_cfg *cfg = &data->iop.cfg;
>>> + struct arm_lpae_iopte_read_dirty arg = {
>>> + .flags = flags, .dirty = dirty,
>>> + };
>>> + struct io_pgtable_walk_data walk_data = {
>>> + .fn = __arm_lpae_read_and_clear_dirty,
>>> + .opaque = &arg,
>>
>> Do you have plans to reuse the walker? If not, then I wonder if separating
>> the argument and callback function are valuable? It will certainly be more
>> efficient without the per-pte indirect call.
>
> Ok. I will remove indirection for now and may revisit it when we add split/merge
> functionality.
>
>>
>>> + .addr = iova,
>>> + .end = iova + size - 1,
>>
>> Bug?: You are initializing end to be inclusive (i.e. last valid address in range).
>> But __arm_lpae_iopte_walk() is using it as exclusive (i.e. first address after
>> range) - see "walk_data->addr < walk_data->end".
>>
>> mm code usually treats end as exclusive, so suggest removing the "- 1".
>
> I will double check this.
Yes please do. It certainly looks like a bug, but I may be wrong!
>
>>
>>> + };
>>> + arm_lpae_iopte *ptep = data->pgd;
>>> + int lvl = data->start_level;
>>> + long iaext = (s64)iova >> cfg->ias;
>>
>> Why cast this to signed? And why is the cast s64 and the result long? Suggest
>> they should both be consistent at least. But probably clearer to just do:
>>
>> WARN_ON((iova + size - 1) & ~(BIT(cfg->ias) - 1))
>>
>> in the below if()? That way you are checking the full range too.
>
> Ok.
>
> Thanks,
> Shameer
>