[PATCH] mmu_notifiers: Notify on pte permission upgrades

Alistair Popple apopple at nvidia.com
Mon May 22 16:50:02 PDT 2023


Sean Christopherson <seanjc at google.com> writes:

> On Mon, May 22, 2023, Alistair Popple wrote:
>> Some architectures, specifically ARM and perhaps Sparc and IA64,
>> require TLB invalidates when upgrading pte permission from read-only
>> to read-write.
>> 
>> The current mmu_notifier implementation assumes that upgrades do not
>> need notifications. Typically though mmu_notifiers are used to
>> implement TLB invalidations for secondary MMUs that comply with the
>> main CPU architecture.
>> 
>> Therefore if the main CPU architecture requires an invalidation for
>> permission upgrade the secondary MMU will as well and an mmu_notifier
>> should be sent for the upgrade.
>> 
>> Currently CPU invalidations for permission upgrade occur in
>> ptep_set_access_flags(). Unfortunately MMU notifiers cannot be called
>> directly from this architecture specific code as the notifier
>> callbacks can sleep, and ptep_set_access_flags() is usually called
>> whilst holding the PTL spinlock. Therefore add the notifier calls
>> after the PTL is dropped and only if the PTE actually changed. This
>> will allow secondary MMUs to obtain an updated PTE with appropriate
>> permissions.
>> 
>> This problem was discovered during testing of an ARM SMMU
>> implementation that does not support broadcast TLB maintenance
>> (BTM). In this case the SMMU driver uses notifiers to issue TLB
>> invalidates. For read-only to read-write pte upgrades the SMMU
>> continually returned a read-only PTE to the device, even though the
>> CPU had a read-write PTE installed.
>> 
>> Sending a mmu notifier event to the SMMU driver fixes the problem by
>> flushing secondary TLB entries. A new notifier event type is added so
>> drivers may filter out these invalidations if not required. Note a
>> driver should never upgrade or install a PTE in response to this mmu
>> notifier event as it is not synchronised against other PTE operations.
>> 
>> Signed-off-by: Alistair Popple <apopple at nvidia.com>
>> ---
>>  include/linux/mmu_notifier.h |  6 +++++
>>  mm/hugetlb.c                 | 24 ++++++++++++++++++-
>>  mm/memory.c                  | 45 ++++++++++++++++++++++++++++++++++--
>>  3 files changed, 72 insertions(+), 3 deletions(-)
>> 
>> diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
>> index d6c06e140277..f14d68f119d8 100644
>> --- a/include/linux/mmu_notifier.h
>> +++ b/include/linux/mmu_notifier.h
>> @@ -31,6 +31,11 @@ struct mmu_interval_notifier;
>>   * pages in the range so to mirror those changes the user must inspect the CPU
>>   * page table (from the end callback).
>>   *
>> + * @MMU_NOTIFY_PROTECTION_UPGRADE: update is due to a change from read-only to
>> + * read-write for pages in the range. This must not be used to upgrade
>> + * permissions on secondary PTEs, rather it should only be used to invalidate
>> + * caches such as secondary TLBs that may cache old read-only entries.
>
> This is a poor fit for invalidate_range_{start,end}().  All other uses bookend
> the primary MMU update, i.e. call start() _before_ changing PTEs.  The comments
> are somewhat stale as they talk only about "unmapped", but the contract between
> the primary MMU and the secondary MMU is otherwise quite clear on when the primary
> MMU will invoke start() and end().
>
> 	 * invalidate_range_start() is called when all pages in the
> 	 * range are still mapped and have at least a refcount of one.
> 	 *
> 	 * invalidate_range_end() is called when all pages in the
> 	 * range have been unmapped and the pages have been freed by
> 	 * the VM.
>
> I'm also confused as to how this actually fixes ARM's SMMU.  Unless I'm looking
> at the wrong SMMU implementation, the SMMU implements only invalidate_range(),
> not the start()/end() variants.

mmu_notifier_invalidate_range_end() calls the invalidate_range()
callback if the start()/end() variants aren't set.

> 	static const struct mmu_notifier_ops arm_smmu_mmu_notifier_ops = {
> 		.invalidate_range	= arm_smmu_mm_invalidate_range,
> 		.release		= arm_smmu_mm_release,
> 		.free_notifier		= arm_smmu_mmu_notifier_free,
> 	};
>
> Again from include/linux/mmu_notifier.h, not implementing the start()/end() hooks
> is perfectly valid.  And AFAICT, the existing invalidate_range() hook is pretty
> much a perfect fit for what you want to achieve.

Right, I didn't take that approach because invalidate_range() doesn't
allow an event type to be passed, which would let drivers filter these
invalidations out on platforms that don't require them.

I had also assumed the invalidate_range() callbacks were allowed to
sleep, and hence couldn't be called under the PTL. That's certainly true
of mmu interval notifier callbacks, but Catalin reminded me that calls
such as ptep_clear_flush_notify() already invoke the invalidate_range()
callback under the PTL, so we already assume drivers don't sleep in
their invalidate_range() callbacks. I will update the comments to
reflect that.

> 	 * If invalidate_range() is used to manage a non-CPU TLB with
> 	 * shared page-tables, it not necessary to implement the
> 	 * invalidate_range_start()/end() notifiers, as
> 	 * invalidate_range() already catches the points in time when an
> 	 * external TLB range needs to be flushed. For more in depth
> 	 * discussion on this see Documentation/mm/mmu_notifier.rst
>
> Even worse, this change may silently regress performance for secondary MMUs that
> haven't yet taken advantage of the event type, e.g. KVM will zap all of KVM's PTEs
> in response to the upgrade, instead of waiting until the guest actually tries to
> utilize the new protections.

Yeah, I like the idea of introducing a
ptep_set_access_flags_notify(). That way this won't regress performance
on platforms that don't need the invalidation. Note this isn't a new
feature but rather a bugfix. It's unclear to me why KVM on ARM hasn't
already run into this issue, but I'm no KVM expert. Thanks for the
feedback.

 - Alistair


