[PATCH v2] arm64: tlb: Fix TLBI RANGE operand
Gavin Shan
gshan at redhat.com
Thu Apr 4 03:26:20 PDT 2024
Hi Marc,
On 4/4/24 19:10, Marc Zyngier wrote:
> On Thu, 04 Apr 2024 06:36:24 +0100,
> Gavin Shan <gshan at redhat.com> wrote:
>>
>> KVM/arm64 relies on TLBI RANGE feature to flush TLBs when the dirty
>> bitmap is collected by VMM and the corresponding PTEs need to be
>> write-protected during live migration. Unfortunately, the operand
>> passed to the TLBI RANGE instruction isn't correctly sorted out by
>> commit d1d3aa98b1d4 ("arm64: tlb: Use the TLBI RANGE feature in arm64").
>
> This isn't the offending commit. See below.
>
Yes, I agree with you; more explanation below.
>> It leads to crash on the destination VM after live migration because
>> TLBs aren't flushed completely and some of the dirty pages are missed.
>>
>> For example, I have a VM where 8GB memory is assigned, starting from
>> 0x40000000 (1GB). Note that the host has 4KB as the base page size.
>> All TLBs for VM can be covered by one TLBI RANGE operation. However,
>> the operand 0xffff708000040000 is set for scale -9, and -1 is returned
>
> It's not scale, as it is limited to 2 bits. It's a random value that
> actively corrupts adjacent fields because it is wrongly sign-extended.
> ASID and TG are now utter bollocks, and the CPU is within its rights
> to totally ignore the TLBI (TG indicates 64kB translation granule...).
>
> We really should fix __TLBI_VADDR_RANGE() to use proper bit fields
> instead of a bunch of shifts that lead to this mess.
>
Well, I meant the variable @scale rather than the SCALE field of the
TLBI RANGE operand. Let me make that clear in the next revision.
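Just to illustrate the corruption, here is a minimal userspace sketch of
the same shift arithmetic the current __TLBI_VADDR_RANGE() uses (assuming
a 4KB granule, i.e. get_trans_granule() == 1, ASID/VMID 0 and TTL 0; the
NUM value is the one implied by the operand quoted above). Once @scale
has underflowed to -9, the sign extension lands in the TG/ASID fields:

#include <stdio.h>

/* Userspace copy of the shift-based __TLBI_VADDR_RANGE() encoding. */
static unsigned long tlbi_vaddr_range(unsigned long baddr, unsigned long asid,
                                      int scale, int num, unsigned long ttl)
{
        unsigned long ta = baddr & ((1UL << 37) - 1);   /* BADDR, bits [36:0]  */

        ta |= ttl << 37;                                /* TTL,   bits [38:37] */
        ta |= (unsigned long)num << 39;                 /* NUM,   bits [43:39] */
        ta |= (unsigned long)scale << 44;               /* SCALE, bits [45:44] */
        ta |= 1UL << 46;                                /* TG: get_trans_granule() == 1 for 4KB */
        ta |= asid << 48;                               /* ASID,  bits [63:48] */
        return ta;
}

int main(void)
{
        /* baddr for IPA 0x40000000 with 4KB pages; sane scale 3, num 31 */
        printf("scale  3: %#lx\n", tlbi_vaddr_range(0x40000, 0, 3, 31, 0));
        /* scale underflowed to -9: prints 0xffff708000040000 */
        printf("scale -9: %#lx\n", tlbi_vaddr_range(0x40000, 0, -9, 1, 0));
        return 0;
}

The bit-field masks in your diff below would at least confine a bogus
@scale to its own field instead of letting it spill into TG/ASID.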
Yes, __TLBI_VADDR_RANGE() can be improved with bit-field masks. However, the
source of the mess is 117940aa6e5f8 ("KVM: arm64: Define kvm_tlb_flush_vmid_range()"),
because it can pass a @pages value (e.g. 0x200000) that __flush_tlb_range_op()
can't handle. That may be another hint that 117940aa6e5f8 should be the fix target.
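To make the 0x200000 case concrete, here is a quick userspace sketch comparing
the current macro with the v2 one (the only assumptions are a copy of
__TLBI_RANGE_PAGES() and an open-coded clamp standing in for the kernel's min()):

#include <stdio.h>

#define __TLBI_RANGE_PAGES(num, scale)  ((unsigned long)((num) + 1) << (5 * (scale) + 1))

/* Current form: masking to 5 bits makes 0x20 wrap around to 0. */
#define __TLBI_RANGE_NUM_OLD(pages, scale) \
        ((((pages) >> (5 * (scale) + 1)) & 0x1f) - 1)

/* v2 form: clamp @pages first, so no mask (and no wrap-around) is needed. */
#define __TLBI_RANGE_NUM_NEW(pages, scale)                              \
        ({                                                              \
                unsigned long __pages = (pages);                        \
                if (__pages > __TLBI_RANGE_PAGES(31, (scale)))          \
                        __pages = __TLBI_RANGE_PAGES(31, (scale));      \
                (int)(__pages >> (5 * (scale) + 1)) - 1;                \
        })

int main(void)
{
        unsigned long pages = 0x200000; /* 8GB of 4KB pages */
        int scale;

        /* Current macro: -1 at every scale, so the flush loop makes no progress. */
        for (scale = 3; scale >= 0; scale--)
                printf("old, scale %d: num = %d\n", scale,
                       (int)__TLBI_RANGE_NUM_OLD(pages, scale));

        /* Fixed macro: 31 at scale 3, i.e. one TLBI RANGE covers the whole 8GB. */
        printf("new, scale 3: num = %d\n", __TLBI_RANGE_NUM_NEW(pages, 3));
        return 0;
}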
>> from __TLBI_RANGE_NUM() for scale 3/2/1/0 and rejected by the loop in
>> __flush_tlb_range_op(). __TLBI_RANGE_NUM() isn't expected to work
>> like this because all the pages should be covered by scale 3/2/1/0,
>> plus an additional page if needed.
>>
>> Fix the macro __TLBI_RANGE_NUM() so that the correct NUM and TLBI RANGE
>> operand are provided for each scale level. With the changes, [-1 31]
>> instead of [-1 30] can be returned from the macro, meaning the TLBs for
>> 0x200000 pages (8GB memory) can be flushed in one shot at scale 3. The
>> macro TLBI_RANGE_MASK is dropped since no one uses it any more.
>>
>> Fixes: d1d3aa98b1d4 ("arm64: tlb: Use the TLBI RANGE feature in arm64")
>> Cc: stable at kernel.org # v5.10+
>
> I don't think this is right. The problem only occurs since
> 117940aa6e5f8 ("KVM: arm64: Define kvm_tlb_flush_vmid_range()"), which
> is the only case where we try to use NUM=31 (the rest of the kernel is
> using (MAX_TLBI_RANGE_PAGES - 1), which results in NUM=30 at most).
>
> Also, before e2768b798a19 ("arm64/mm: Modify range-based tlbi to
> decrement scale"), we used a different algorithm to perform the
> invalidation (increasing scale instead of decreasing), so this
> probably doesn't hit the same way.
>
> In any case, this is a KVM-only problem that will never show before
> v6.6. So 5.10 really isn't a place where we need to backport anything.
>
Agreed. It's more precise to set 117940aa6e5f8 as the fix target. I will
correct it in v3.
Fixes: 117940aa6e5f ("KVM: arm64: Define kvm_tlb_flush_vmid_range()")
>> Reported-by: Yihuang Yu <yihyu at redhat.com>
>> Suggested-by: Marc Zyngier <maz at kernel.org>
>> Signed-off-by: Gavin Shan <gshan at redhat.com>
>> ---
>> v2: Improve __TLBI_RANGE_NUM() as Marc suggested
>> ---
>> arch/arm64/include/asm/tlbflush.h | 13 +++++++++----
>> 1 file changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>> index 3b0e8248e1a4..cd9b71c30366 100644
>> --- a/arch/arm64/include/asm/tlbflush.h
>> +++ b/arch/arm64/include/asm/tlbflush.h
>> @@ -161,12 +161,17 @@ static inline unsigned long get_trans_granule(void)
>> #define MAX_TLBI_RANGE_PAGES __TLBI_RANGE_PAGES(31, 3)
>>
>> /*
>> - * Generate 'num' values from -1 to 30 with -1 rejected by the
>> + * Generate 'num' values from -1 to 31 with -1 rejected by the
>> * __flush_tlb_range() loop below.
>> */
>> -#define TLBI_RANGE_MASK GENMASK_ULL(4, 0)
>> -#define __TLBI_RANGE_NUM(pages, scale) \
>> - ((((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK) - 1)
>> +#define __TLBI_RANGE_NUM(pages, scale) \
>> + ({ \
>> + int __pages = min((pages), \
>> + __TLBI_RANGE_PAGES(31, (scale))); \
>> + int __numplus1 = __pages >> (5 * (scale) + 1); \
>> + \
>> + (__numplus1 - 1); \
>> + })
>
> This was only intended as a way to convey the general idea. __numplus1
> can obviously be removed and the right-shifting expression promoted as
> the return value.
>
> Next, the comments in this file need adjustments to reflect the
> supported invalidation range, as my original patch did (plus some
> more).
>
> Finally, and since we can now handle the full range of invalidation,
> it would make sense to fix __flush_tlb_range_nosync() to allow
> MAX_TLBI_RANGE_PAGES ranges (potentially in a separate patch).
>
> In the end, my sandbox contains the following, which should probably
> be split in 3 patches:
>
Ok, I thought @__numplus1 was there for better code readability, since it's
likely optimized away by GCC anyway. I will drop @__numplus1. And the comments
need to be improved for sure.
Yeah, I've noticed that __flush_tlb_range_nosync() needs to allow
MAX_TLBI_RANGE_PAGES pages for a range-based TLB flush.
In summary, we need 3 patches. The one fixing __TLBI_RANGE_NUM() needs to be
PATCH[1/3] so that it can be easily picked up by the stable kernels. PATCH[2/3]
would improve __TLBI_VADDR_RANGE() with bit-field masks. PATCH[3/3] will allow
__flush_tlb_range_nosync() to do a range-based TLB flush for up to
MAX_TLBI_RANGE_PAGES pages.
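For reference, a tiny sketch of the arithmetic behind PATCH[3/3] (assuming
4KB pages and copying the kernel definitions of __TLBI_RANGE_PAGES() and
MAX_TLBI_RANGE_PAGES): the 8GB guest is exactly the boundary case, so relaxing
the check from '>=' to '>' keeps it on the range-based path instead of falling
back to flush_tlb_mm().

#include <stdio.h>

#define __TLBI_RANGE_PAGES(num, scale)  ((unsigned long)((num) + 1) << (5 * (scale) + 1))
#define MAX_TLBI_RANGE_PAGES            __TLBI_RANGE_PAGES(31, 3)

int main(void)
{
        unsigned long pages = (8UL << 30) >> 12;        /* 8GB of 4KB pages */

        /* Both print 0x200000: the 8GB range is exactly num 31 at scale 3. */
        printf("pages = %#lx, MAX_TLBI_RANGE_PAGES = %#lx\n",
               pages, MAX_TLBI_RANGE_PAGES);
        printf("pages >= MAX: %d\n", pages >= MAX_TLBI_RANGE_PAGES); /* 1: old check bails out  */
        printf("pages >  MAX: %d\n", pages >  MAX_TLBI_RANGE_PAGES); /* 0: new check uses range */
        return 0;
}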
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 3b0e8248e1a4..bcbe697ed191 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -142,16 +142,22 @@ static inline unsigned long get_trans_granule(void)
> * EL1, Inner Shareable".
> *
> */
> +#define TLBIR_ASID_MASK GENMASK_ULL(63, 48)
> +#define TLBIR_TG_MASK GENMASK_ULL(47, 46)
> +#define TLBIR_SCALE_MASK GENMASK_ULL(45, 44)
> +#define TLBIR_NUM_MASK GENMASK_ULL(43, 39)
> +#define TLBIR_TTL_MASK GENMASK_ULL(38, 37)
> +#define TLBIR_BADDR_MASK GENMASK_ULL(36, 0)
> +
> #define __TLBI_VADDR_RANGE(baddr, asid, scale, num, ttl) \
> ({ \
> - unsigned long __ta = (baddr); \
> + unsigned long __ta = FIELD_PREP(TLBIR_BADDR_MASK, (baddr)); \
> unsigned long __ttl = (ttl >= 1 && ttl <= 3) ? ttl : 0; \
> - __ta &= GENMASK_ULL(36, 0); \
> - __ta |= __ttl << 37; \
> - __ta |= (unsigned long)(num) << 39; \
> - __ta |= (unsigned long)(scale) << 44; \
> - __ta |= get_trans_granule() << 46; \
> - __ta |= (unsigned long)(asid) << 48; \
> + __ta |= FIELD_PREP(TLBIR_TTL_MASK, __ttl); \
> + __ta |= FIELD_PREP(TLBIR_NUM_MASK, (unsigned long)(num)); \
> + __ta |= FIELD_PREP(TLBIR_SCALE_MASK, (unsigned long)(scale)); \
> + __ta |= FIELD_PREP(TLBIR_TG_MASK, get_trans_granule()); \
> + __ta |= FIELD_PREP(TLBIR_ASID_MASK, (unsigned long)(asid)); \
> __ta; \
> })
>
> @@ -161,12 +167,17 @@ static inline unsigned long get_trans_granule(void)
> #define MAX_TLBI_RANGE_PAGES __TLBI_RANGE_PAGES(31, 3)
>
> /*
> - * Generate 'num' values from -1 to 30 with -1 rejected by the
> - * __flush_tlb_range() loop below.
> + * Generate 'num' values from -1 to 31 with -1 rejected by the
> + * __flush_tlb_range() loop below. Its return value is only
> + * significant for a maximum of MAX_TLBI_RANGE_PAGES pages. If 'pages'
> + * is more than that, you must iterate over the overall range.
> */
> -#define TLBI_RANGE_MASK GENMASK_ULL(4, 0)
> -#define __TLBI_RANGE_NUM(pages, scale) \
> - ((((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK) - 1)
> +#define __TLBI_RANGE_NUM(pages, scale) \
> + ({ \
> + int __pages = min((pages), \
> + __TLBI_RANGE_PAGES(31, (scale))); \
> + (__pages >> (5 * (scale) + 1)) - 1; \
> + })
>
> /*
> * TLB Invalidation
> @@ -379,10 +390,6 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> * 3. If there is 1 page remaining, flush it through non-range operations. Range
> * operations can only span an even number of pages. We save this for last to
> * ensure 64KB start alignment is maintained for the LPA2 case.
> - *
> - * Note that certain ranges can be represented by either num = 31 and
> - * scale or num = 0 and scale + 1. The loop below favours the latter
> - * since num is limited to 30 by the __TLBI_RANGE_NUM() macro.
> */
> #define __flush_tlb_range_op(op, start, pages, stride, \
> asid, tlb_level, tlbi_user, lpa2) \
> @@ -437,11 +444,11 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
> * When not uses TLB range ops, we can handle up to
> * (MAX_DVM_OPS - 1) pages;
> * When uses TLB range ops, we can handle up to
> - * (MAX_TLBI_RANGE_PAGES - 1) pages.
> + * MAX_TLBI_RANGE_PAGES pages.
> */
> if ((!system_supports_tlb_range() &&
> (end - start) >= (MAX_DVM_OPS * stride)) ||
> - pages >= MAX_TLBI_RANGE_PAGES) {
> + pages > MAX_TLBI_RANGE_PAGES) {
> flush_tlb_mm(vma->vm_mm);
> return;
> }
>
Thanks,
Gavin