[PATCH v2] arm64: tlb: Fix TLBI RANGE operand

Thu Apr 4 02:10:23 PDT 2024

On Thu, 04 Apr 2024 06:36:24 +0100,
Gavin Shan <gshan at redhat.com> wrote:
> 
> KVM/arm64 relies on TLBI RANGE feature to flush TLBs when the dirty
> bitmap is collected by VMM and the corresponding PTEs need to be
> write-protected during live migration. Unfortunately, the operand
> passed to the TLBI RANGE instruction isn't correctly sorted out by
> commit d1d3aa98b1d4 ("arm64: tlb: Use the TLBI RANGE feature in arm64").

This isn't the offending commit. See below.

> It leads to crash on the destination VM after live migration because
> TLBs aren't flushed completely and some of the dirty pages are missed.
> 
> For example, I have a VM where 8GB memory is assigned, starting from
> 0x40000000 (1GB). Note that the host has 4KB as the base page size.
> All TLBs for VM can be covered by one TLBI RANGE operation. However,
> the operand 0xffff708000040000 is set for scale -9, and  -1 is returned

It's not scale, as it is limited to 2 bits. It's a random value that
actively corrupts adjacent fields because it is wrongly sign-extended.
ASID and TG are now utter bollocks, and the CPU is within its rights
to totally ignore the TLBI (TG indicates 64kB translation granule...).

We really should fix __TLBI_VADDR_RANGE() to use proper bit fields
instead of a bunch of shifts that lead to this mess.

> from __TLBI_RANGE_NUM() for scale 3/2/1/0 and rejected by the loop in
> __flush_tlb_range_op(). __TLBI_RANGE_NUM() isn't expected to work
> like this because all the pages should be covered by scale 3/2/1/0,
> plus an additional page if needed.
> 
> Fix the macro __TLBI_RANGE_NUM() so that the correct NUM and TLBI RANGE
> operand are provided for each scale level. With the changes, [-1 31]
> instead of [-1 30] can be returned from the macro, meaning the TLBs for
> 0x200000 pages (8GB memory) can be flushed in one shoot at scale 3. The
> macro TLBI_RANGE_MASK is dropped since no one uses it any more.
> 
> Fixes: d1d3aa98b1d4 ("arm64: tlb: Use the TLBI RANGE feature in arm64")
> Cc: stable at kernel.org # v5.10+

I don't think this is right. The problem only occurs since
117940aa6e5f8 ("KVM: arm64: Define kvm_tlb_flush_vmid_range()"), which
is the only case where we try to use NUM=31 (the rest of the kernel is
using (MAX_TLBI_RANGE_PAGES - 1), which results in NUM=30 at most).

Also, before e2768b798a19 ("arm64/mm: Modify range-based tlbi to
decrement scale"), we used a different algorithm to perform the
invalidation (increasing scale instead of decreasing), so this
probably doesn't hit the same way.

In any case, this is a KVM-only problem that will never show before
v6.6. So 5.10 really isn't a place where we need to backport anything.

> Reported-by: Yihuang Yu <yihyu at redhat.com>
> Suggested-by: Marc Zyngier <maz at kernel.org>
> Signed-off-by: Gavin Shan <gshan at redhat.com>
> ---
> v2: Improve __TLBI_RANGE_NUM() as Marc suggested
> ---
>  arch/arm64/include/asm/tlbflush.h | 13 +++++++++----
>  1 file changed, 9 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 3b0e8248e1a4..cd9b71c30366 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -161,12 +161,17 @@ static inline unsigned long get_trans_granule(void)
>  #define MAX_TLBI_RANGE_PAGES		__TLBI_RANGE_PAGES(31, 3)
>  
>  /*
> - * Generate 'num' values from -1 to 30 with -1 rejected by the
> + * Generate 'num' values from -1 to 31 with -1 rejected by the
>   * __flush_tlb_range() loop below.
>   */
> -#define TLBI_RANGE_MASK			GENMASK_ULL(4, 0)
> -#define __TLBI_RANGE_NUM(pages, scale)	\
> -	((((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK) - 1)
> +#define __TLBI_RANGE_NUM(pages, scale)					\
> +	({								\
> +		int __pages = min((pages),				\
> +				  __TLBI_RANGE_PAGES(31, (scale)));	\
> +		int __numplus1 = __pages >> (5 * (scale) + 1);		\
> +									\
> +		(__numplus1 - 1);					\
> +	})

This was only intended as a way to convey the general idea. __numplus1
can obviously be removed and the right-shifting expression promoted as
the return value.

Next, the comments in this file need adjustments to reflect the
supported invalidation range, as my original patch did (plus some
more).

Finally, and since we can now handle the full range of invalidation,
it would make sense to fix __flush_tlb_range_nosync() to allow
MAX_TLBI_RANGE_PAGES ranges (potentially in a separate patch).

In the end, my sandbox contains the following, which should probably
be split in 3 patches:

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 3b0e8248e1a4..bcbe697ed191 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -142,16 +142,22 @@ static inline unsigned long get_trans_granule(void)
  * EL1, Inner Shareable".
  *
  */
+#define TLBIR_ASID_MASK		GENMASK_ULL(63, 48)
+#define TLBIR_TG_MASK		GENMASK_ULL(47, 46)
+#define TLBIR_SCALE_MASK	GENMASK_ULL(45, 44)
+#define TLBIR_NUM_MASK		GENMASK_ULL(43, 39)
+#define TLBIR_TTL_MASK		GENMASK_ULL(38, 37)
+#define TLBIR_BADDR_MASK	GENMASK_ULL(36,  0)
+
 #define __TLBI_VADDR_RANGE(baddr, asid, scale, num, ttl)			\
 	({									\
-		unsigned long __ta = (baddr);					\
+		unsigned long __ta = FIELD_PREP(TLBIR_BADDR_MASK, (baddr)); 	\
 		unsigned long __ttl = (ttl >= 1 && ttl <= 3) ? ttl : 0;		\
-		__ta &= GENMASK_ULL(36, 0);					\
-		__ta |= __ttl << 37;						\
-		__ta |= (unsigned long)(num) << 39;				\
-		__ta |= (unsigned long)(scale) << 44;				\
-		__ta |= get_trans_granule() << 46;				\
-		__ta |= (unsigned long)(asid) << 48;				\
+		__ta |= FIELD_PREP(TLBIR_TTL_MASK, __ttl);			\
+		__ta |= FIELD_PREP(TLBIR_NUM_MASK, (unsigned long)(num));	\
+		__ta |= FIELD_PREP(TLBIR_SCALE_MASK, (unsigned long)(scale));	\
+		__ta |= FIELD_PREP(TLBIR_TG_MASK, get_trans_granule());		\
+		__ta |= FIELD_PREP(TLBIR_ASID_MASK, (unsigned long)(asid));	\
 		__ta;								\
 	})
 
@@ -161,12 +167,17 @@ static inline unsigned long get_trans_granule(void)
 #define MAX_TLBI_RANGE_PAGES		__TLBI_RANGE_PAGES(31, 3)
 
 /*
- * Generate 'num' values from -1 to 30 with -1 rejected by the
- * __flush_tlb_range() loop below.
+ * Generate 'num' values from -1 to 31 with -1 rejected by the
+ * __flush_tlb_range() loop below. Its return value is only
+ * significant for a maximum of MAX_TLBI_RANGE_PAGES pages. If 'pages'
+ * is more than that, you must iterate over the overall range.
  */
-#define TLBI_RANGE_MASK			GENMASK_ULL(4, 0)
-#define __TLBI_RANGE_NUM(pages, scale)	\
-	((((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK) - 1)
+#define __TLBI_RANGE_NUM(pages, scale)					\
+	({								\
+		int __pages = min((pages),				\
+				  __TLBI_RANGE_PAGES(31, (scale)));	\
+		(__pages >> (5 * (scale) + 1)) - 1;			\
+	})
 
 /*
  *	TLB Invalidation
@@ -379,10 +390,6 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
  * 3. If there is 1 page remaining, flush it through non-range operations. Range
  *    operations can only span an even number of pages. We save this for last to
  *    ensure 64KB start alignment is maintained for the LPA2 case.
- *
- * Note that certain ranges can be represented by either num = 31 and
- * scale or num = 0 and scale + 1. The loop below favours the latter
- * since num is limited to 30 by the __TLBI_RANGE_NUM() macro.
  */
 #define __flush_tlb_range_op(op, start, pages, stride,			\
 				asid, tlb_level, tlbi_user, lpa2)	\
@@ -437,11 +444,11 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
 	 * When not uses TLB range ops, we can handle up to
 	 * (MAX_DVM_OPS - 1) pages;
 	 * When uses TLB range ops, we can handle up to
-	 * (MAX_TLBI_RANGE_PAGES - 1) pages.
+	 * MAX_TLBI_RANGE_PAGES pages.
 	 */
 	if ((!system_supports_tlb_range() &&
 	     (end - start) >= (MAX_DVM_OPS * stride)) ||
-	    pages >= MAX_TLBI_RANGE_PAGES) {
+	    pages > MAX_TLBI_RANGE_PAGES) {
 		flush_tlb_mm(vma->vm_mm);
 		return;
 	}

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.