[PATCH v3 09/21] KVM: arm64: Convert unmap_stage2_range() to generic page-table API

Alexandru Elisei alexandru.elisei at arm.com
Tue Sep 8 09:07:37 EDT 2020


Hi Will,

On 9/3/20 6:57 PM, Will Deacon wrote:
> On Wed, Sep 02, 2020 at 05:23:08PM +0100, Alexandru Elisei wrote:
>> On 8/25/20 10:39 AM, Will Deacon wrote:
>>> Convert unmap_stage2_range() to use kvm_pgtable_stage2_unmap() instead
>>> of walking the page-table directly.
>>>
>>> Cc: Marc Zyngier <maz at kernel.org>
>>> Cc: Quentin Perret <qperret at google.com>
>>> Signed-off-by: Will Deacon <will at kernel.org>
>>> ---
>>>  arch/arm64/kvm/mmu.c | 57 +++++++++++++++++++++++++-------------------
>>>  1 file changed, 32 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
>>> index 704b471a48ce..751ce2462765 100644
>>> --- a/arch/arm64/kvm/mmu.c
>>> +++ b/arch/arm64/kvm/mmu.c
>>> @@ -39,6 +39,33 @@ static bool is_iomap(unsigned long flags)
>>>  	return flags & KVM_S2PTE_FLAG_IS_IOMAP;
>>>  }
>>>  
>>> +/*
>>> + * Release kvm_mmu_lock periodically if the memory region is large. Otherwise,
>>> + * we may see kernel panics with CONFIG_DETECT_HUNG_TASK,
>>> + * CONFIG_LOCKUP_DETECTOR, CONFIG_LOCKDEP. Additionally, holding the lock too
>>> + * long will also starve other vCPUs. We have to also make sure that the page
>>> + * tables are not freed while we released the lock.
>>> + */
>>> +#define stage2_apply_range(kvm, addr, end, fn, resched)			\
>>> +({									\
>>> +	int ret;							\
>>> +	struct kvm *__kvm = (kvm);					\
>>> +	bool __resched = (resched);					\
>>> +	u64 next, __addr = (addr), __end = (end);			\
>>> +	do {								\
>>> +		struct kvm_pgtable *pgt = __kvm->arch.mmu.pgt;		\
>>> +		if (!pgt)						\
>>> +			break;						\
>> I'm 100% sure there's a reason why we've dropped the READ_ONCE, but it still looks
>> to me like the compiler might decide to optimize by reading pgt once at the start
>> of the loop and stashing it in a register. Would you mind explaining what I am
>> missing?
> The load always happens with the mmu_lock held, so I think it's not a
> problem because it means that the pointer is stable.
> spin_lock()/spin_unlock() imply compiler barriers.

I think you are correct: if this always executes with kvm->mmu_lock held, then pgt
cannot change between iterations. It didn't immediately occur to me that this is the
case because pgt is reloaded and checked against NULL on every iteration. If we are
relying on the lock being held, maybe we should hoist the pgt load and the NULL check
out of the loop, something like the sketch below? That would avoid the confusion and
make the code ever so slightly faster.
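
Roughly what I have in mind, as an untested sketch on top of the hunk above (the
elided parts stay exactly as they are in the patch; I've also initialised ret so the
statement expression still yields something sane when pgt is NULL):

#define stage2_apply_range(kvm, addr, end, fn, resched)			\
({									\
	int ret = 0;							\
	struct kvm *__kvm = (kvm);					\
	struct kvm_pgtable *pgt = __kvm->arch.mmu.pgt;			\
	bool __resched = (resched);					\
	u64 next, __addr = (addr), __end = (end);			\
	if (pgt) {							\
		do {							\
			/* loop body as in the patch, minus the	*/	\
			/* pgt load and the NULL check		*/	\
		} while (/* loop condition as in the patch */);		\
	}								\
	ret;								\
})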

Also, I see that __unmap_stage2_range() asserts that the mmu_lock is held, but we
don't do that at the other call sites (for example, in stage2_wp_range()). I realize
this is me bikeshedding, but it looks a bit asymmetrical. Should we move the
assert_spin_locked(&kvm->mmu_lock) statement into stage2_apply_range() itself, since
it relies on pgt remaining unchanged? What do you think?
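
For illustration (untested), I mean something like adding this at the top of the
macro body, right after the locals:

	assert_spin_locked(&__kvm->mmu_lock);				\

That way none of the callers need their own assertion.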

Thanks,
Alex


