[PATCH v2] arm64: tlb: Fix TLBI RANGE operand
Gavin Shan
gshan at redhat.com
Thu Apr 4 03:26:20 PDT 2024
Hi Marc,
On 4/4/24 19:10, Marc Zyngier wrote:
> On Thu, 04 Apr 2024 06:36:24 +0100,
> Gavin Shan <gshan at redhat.com> wrote:
>>
>> KVM/arm64 relies on TLBI RANGE feature to flush TLBs when the dirty
>> bitmap is collected by VMM and the corresponding PTEs need to be
>> write-protected during live migration. Unfortunately, the operand
>> passed to the TLBI RANGE instruction isn't correctly sorted out by
>> commit d1d3aa98b1d4 ("arm64: tlb: Use the TLBI RANGE feature in arm64").
>
> This isn't the offending commit. See below.
>
Yes, I agree with you; more explanation below.
>> It leads to crash on the destination VM after live migration because
>> TLBs aren't flushed completely and some of the dirty pages are missed.
>>
>> For example, I have a VM where 8GB memory is assigned, starting from
>> 0x40000000 (1GB). Note that the host has 4KB as the base page size.
>> All TLBs for VM can be covered by one TLBI RANGE operation. However,
>> the operand 0xffff708000040000 is set for scale -9, and -1 is returned
>
> It's not scale, as it is limited to 2 bits. It's a random value that
> actively corrupts adjacent fields because it is wrongly sign-extended.
> ASID and TG are now utter bollocks, and the CPU is within its rights
> to totally ignore the TLBI (TG indicates 64kB translation granule...).
>
> We really should fix __TLBI_VADDR_RANGE() to use proper bit fields
> instead of a bunch of shifts that lead to this mess.
>
Well, I meant the variable @scale rather than the SCALE field of the
TLBI RANGE operand. Let me make that clear in the next revision.
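Just to illustrate the corruption, here is a minimal userspace sketch of
the same shift arithmetic the current __TLBI_VADDR_RANGE() uses (assuming
a 4KB granule, i.e. get_trans_granule() == 1, ASID/VMID 0 and TTL 0; the
NUM value is the one implied by the operand quoted above). Once @scale
has underflowed to -9, the sign extension lands in the TG/ASID fields:

#include <stdio.h>

/* Userspace copy of the shift-based __TLBI_VADDR_RANGE() encoding. */
static unsigned long tlbi_vaddr_range(unsigned long baddr, unsigned long asid,
                                      int scale, int num, unsigned long ttl)
{
        unsigned long ta = baddr & ((1UL << 37) - 1);   /* BADDR, bits [36:0]  */

        ta |= ttl << 37;                                /* TTL,   bits [38:37] */
        ta |= (unsigned long)num << 39;                 /* NUM,   bits [43:39] */
        ta |= (unsigned long)scale << 44;               /* SCALE, bits [45:44] */
        ta |= 1UL << 46;                                /* TG: get_trans_granule() == 1 for 4KB */
        ta |= asid << 48;                               /* ASID,  bits [63:48] */
        return ta;
}

int main(void)
{
        /* baddr for IPA 0x40000000 with 4KB pages; sane scale 3, num 31 */
        printf("scale  3: %#lx\n", tlbi_vaddr_range(0x40000, 0, 3, 31, 0));
        /* scale underflowed to -9: prints 0xffff708000040000 */
        printf("scale -9: %#lx\n", tlbi_vaddr_range(0x40000, 0, -9, 1, 0));
        return 0;
}

The bit-field masks in your diff below would at least confine a bogus
@scale to its own field instead of letting it spill into TG/ASID.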
Yes, __TLBI_VADDR_RANGE() can be improved with bit-field masks. However, the
source of the mess is 117940aa6e5f8 ("KVM: arm64: Define kvm_tlb_flush_vmid_range()"),
because it can pass a @pages value (e.g. 0x200000) that __flush_tlb_range_op()
can't handle. That may be another hint that 117940aa6e5f8 should be the fix target.
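To make the 0x200000 case concrete, here is a quick userspace sketch comparing
the current macro with the v2 one (the only assumptions are a copy of
__TLBI_RANGE_PAGES() and an open-coded clamp standing in for the kernel's min()):

#include <stdio.h>

#define __TLBI_RANGE_PAGES(num, scale)  ((unsigned long)((num) + 1) << (5 * (scale) + 1))

/* Current form: masking to 5 bits makes 0x20 wrap around to 0. */
#define __TLBI_RANGE_NUM_OLD(pages, scale) \
        ((((pages) >> (5 * (scale) + 1)) & 0x1f) - 1)

/* v2 form: clamp @pages first, so no mask (and no wrap-around) is needed. */
#define __TLBI_RANGE_NUM_NEW(pages, scale)                              \
        ({                                                              \
                unsigned long __pages = (pages);                        \
                if (__pages > __TLBI_RANGE_PAGES(31, (scale)))          \
                        __pages = __TLBI_RANGE_PAGES(31, (scale));      \
                (int)(__pages >> (5 * (scale) + 1)) - 1;                \
        })

int main(void)
{
        unsigned long pages = 0x200000; /* 8GB of 4KB pages */
        int scale;

        /* Current macro: -1 at every scale, so the flush loop makes no progress. */
        for (scale = 3; scale >= 0; scale--)
                printf("old, scale %d: num = %d\n", scale,
                       (int)__TLBI_RANGE_NUM_OLD(pages, scale));

        /* Fixed macro: 31 at scale 3, i.e. one TLBI RANGE covers the whole 8GB. */
        printf("new, scale 3: num = %d\n", __TLBI_RANGE_NUM_NEW(pages, 3));
        return 0;
}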
>> from __TLBI_RANGE_NUM() for scale 3/2/1/0 and rejected by the loop in
>> __flush_tlb_range_op(). __TLBI_RANGE_NUM() isn't expected to work
>> like this because all the pages should be covered by scale 3/2/1/0,
>> plus an additional page if needed.
>>
>> Fix the macro __TLBI_RANGE_NUM() so that the correct NUM and TLBI RANGE
>> operand are provided for each scale level. With the changes, [-1 31]
>> instead of [-1 30] can be returned from the macro, meaning the TLBs for
>> 0x200000 pages (8GB memory) can be flushed in one shot at scale 3. The
>> macro TLBI_RANGE_MASK is dropped since no one uses it any more.
>>
>> Fixes: d1d3aa98b1d4 ("arm64: tlb: Use the TLBI RANGE feature in arm64")
>> Cc: stable at kernel.org # v5.10+
>
> I don't think this is right. The problem only occurs since
> 117940aa6e5f8 ("KVM: arm64: Define kvm_tlb_flush_vmid_range()"), which
> is the only case where we try to use NUM=31 (the rest of the kernel is
> using (MAX_TLBI_RANGE_PAGES - 1), which results in NUM=30 at most).
>
> Also, before e2768b798a19 ("arm64/mm: Modify range-based tlbi to
> decrement scale"), we used a different algorithm to perform the
> invalidation (increasing scale instead of decreasing), so this
> probably doesn't hit the same way.
>
> In any case, this is a KVM-only problem that will never show before
> v6.6. So 5.10 really isn't a place where we need to backport anything.
>
Agreed. It's more precise to set 117940aa6e5f8 as the fix target. I will
correct it in v3.
Fixes: 117940aa6e5f ("KVM: arm64: Define kvm_tlb_flush_vmid_range()")
>> Reported-by: Yihuang Yu <yihyu at redhat.com>
>> Suggested-by: Marc Zyngier <maz at kernel.org>
>> Signed-off-by: Gavin Shan <gshan at redhat.com>
>> ---
>> v2: Improve __TLBI_RANGE_NUM() as Marc suggested
>> ---
>> arch/arm64/include/asm/tlbflush.h | 13 +++++++++----
>> 1 file changed, 9 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
>> index 3b0e8248e1a4..cd9b71c30366 100644
>> --- a/arch/arm64/include/asm/tlbflush.h
>> +++ b/arch/arm64/include/asm/tlbflush.h
>> @@ -161,12 +161,17 @@ static inline unsigned long get_trans_granule(void)
>> #define MAX_TLBI_RANGE_PAGES __TLBI_RANGE_PAGES(31, 3)
>>
>> /*
>> - * Generate 'num' values from -1 to 30 with -1 rejected by the
>> + * Generate 'num' values from -1 to 31 with -1 rejected by the
>> * __flush_tlb_range() loop below.
>> */
>> -#define TLBI_RANGE_MASK GENMASK_ULL(4, 0)
>> -#define __TLBI_RANGE_NUM(pages, scale) \
>> - ((((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK) - 1)
>> +#define __TLBI_RANGE_NUM(pages, scale) \
>> + ({ \
>> + int __pages = min((pages), \
>> + __TLBI_RANGE_PAGES(31, (scale))); \
>> + int __numplus1 = __pages >> (5 * (scale) + 1); \
>> + \
>> + (__numplus1 - 1); \
>> + })
>
> This was only intended as a way to convey the general idea. __numplus1
> can obviously be removed and the right-shifting expression promoted as
> the return value.
>
> Next, the comments in this file need adjustments to reflect the
> supported invalidation range, as my original patch did (plus some
> more).
>
> Finally, and since we can now handle the full range of invalidation,
> it would make sense to fix __flush_tlb_range_nosync() to allow
> MAX_TLBI_RANGE_PAGES ranges (potentially in a separate patch).
>
> In the end, my sandbox contains the following, which should probably
> be split in 3 patches:
>
Ok, I thought @__numplus1 was there for better code readability, since it's
likely optimized away by GCC anyway. I will drop @__numplus1. And the comments
need to be improved for sure.
Yeah, I've noticed that __flush_tlb_range_nosync() needs to allow
MAX_TLBI_RANGE_PAGES pages for a range-based TLB flush.
In summary, we need 3 patches. The one fixing __TLBI_RANGE_NUM() needs to be
PATCH[1/3] so that it can be easily picked up by the stable kernels. PATCH[2/3]
would improve __TLBI_VADDR_RANGE() with bit-field masks. PATCH[3/3] will allow
__flush_tlb_range_nosync() to do a range-based TLB flush for up to
MAX_TLBI_RANGE_PAGES pages.
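For reference, a tiny sketch of the arithmetic behind PATCH[3/3] (assuming
4KB pages and copying the kernel definitions of __TLBI_RANGE_PAGES() and
MAX_TLBI_RANGE_PAGES): the 8GB guest is exactly the boundary case, so relaxing
the check from '>=' to '>' keeps it on the range-based path instead of falling
back to flush_tlb_mm().

#include <stdio.h>

#define __TLBI_RANGE_PAGES(num, scale)  ((unsigned long)((num) + 1) << (5 * (scale) + 1))
#define MAX_TLBI_RANGE_PAGES            __TLBI_RANGE_PAGES(31, 3)

int main(void)
{
        unsigned long pages = (8UL << 30) >> 12;        /* 8GB of 4KB pages */

        /* Both print 0x200000: the 8GB range is exactly num 31 at scale 3. */
        printf("pages = %#lx, MAX_TLBI_RANGE_PAGES = %#lx\n",
               pages, MAX_TLBI_RANGE_PAGES);
        printf("pages >= MAX: %d\n", pages >= MAX_TLBI_RANGE_PAGES); /* 1: old check bails out  */
        printf("pages >  MAX: %d\n", pages >  MAX_TLBI_RANGE_PAGES); /* 0: new check uses range */
        return 0;
}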
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 3b0e8248e1a4..bcbe697ed191 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -142,16 +142,22 @@ static inline unsigned long get_trans_granule(void)
> * EL1, Inner Shareable".
> *
> */
> +#define TLBIR_ASID_MASK GENMASK_ULL(63, 48)
> +#define TLBIR_TG_MASK GENMASK_ULL(47, 46)
> +#define TLBIR_SCALE_MASK GENMASK_ULL(45, 44)
> +#define TLBIR_NUM_MASK GENMASK_ULL(43, 39)
> +#define TLBIR_TTL_MASK GENMASK_ULL(38, 37)
> +#define TLBIR_BADDR_MASK GENMASK_ULL(36, 0)
> +
> #define __TLBI_VADDR_RANGE(baddr, asid, scale, num, ttl) \
> ({ \
> - unsigned long __ta = (baddr); \
> + unsigned long __ta = FIELD_PREP(TLBIR_BADDR_MASK, (baddr)); \
> unsigned long __ttl = (ttl >= 1 && ttl <= 3) ? ttl : 0; \
> - __ta &= GENMASK_ULL(36, 0); \
> - __ta |= __ttl << 37; \
> - __ta |= (unsigned long)(num) << 39; \
> - __ta |= (unsigned long)(scale) << 44; \
> - __ta |= get_trans_granule() << 46; \
> - __ta |= (unsigned long)(asid) << 48; \
> + __ta |= FIELD_PREP(TLBIR_TTL_MASK, __ttl); \
> + __ta |= FIELD_PREP(TLBIR_NUM_MASK, (unsigned long)(num)); \
> + __ta |= FIELD_PREP(TLBIR_SCALE_MASK, (unsigned long)(scale)); \
> + __ta |= FIELD_PREP(TLBIR_TG_MASK, get_trans_granule()); \
> + __ta |= FIELD_PREP(TLBIR_ASID_MASK, (unsigned long)(asid)); \
> __ta; \
> })
>
> @@ -161,12 +167,17 @@ static inline unsigned long get_trans_granule(void)
> #define MAX_TLBI_RANGE_PAGES __TLBI_RANGE_PAGES(31, 3)
>
> /*
> - * Generate 'num' values from -1 to 30 with -1 rejected by the
> - * __flush_tlb_range() loop below.
> + * Generate 'num' values from -1 to 31 with -1 rejected by the
> + * __flush_tlb_range() loop below. Its return value is only
> + * significant for a maximum of MAX_TLBI_RANGE_PAGES pages. If 'pages'
> + * is more than that, you must iterate over the overall range.
> */
> -#define TLBI_RANGE_MASK GENMASK_ULL(4, 0)
> -#define __TLBI_RANGE_NUM(pages, scale) \
> - ((((pages) >> (5 * (scale) + 1)) & TLBI_RANGE_MASK) - 1)
> +#define __TLBI_RANGE_NUM(pages, scale) \
> + ({ \
> + int __pages = min((pages), \
> + __TLBI_RANGE_PAGES(31, (scale))); \
> + (__pages >> (5 * (scale) + 1)) - 1; \
> + })
>
> /*
> * TLB Invalidation
> @@ -379,10 +390,6 @@ static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
> * 3. If there is 1 page remaining, flush it through non-range operations. Range
> * operations can only span an even number of pages. We save this for last to
> * ensure 64KB start alignment is maintained for the LPA2 case.
> - *
> - * Note that certain ranges can be represented by either num = 31 and
> - * scale or num = 0 and scale + 1. The loop below favours the latter
> - * since num is limited to 30 by the __TLBI_RANGE_NUM() macro.
> */
> #define __flush_tlb_range_op(op, start, pages, stride, \
> asid, tlb_level, tlbi_user, lpa2) \
> @@ -437,11 +444,11 @@ static inline void __flush_tlb_range_nosync(struct vm_area_struct *vma,
> * When not uses TLB range ops, we can handle up to
> * (MAX_DVM_OPS - 1) pages;
> * When uses TLB range ops, we can handle up to
> - * (MAX_TLBI_RANGE_PAGES - 1) pages.
> + * MAX_TLBI_RANGE_PAGES pages.
> */
> if ((!system_supports_tlb_range() &&
> (end - start) >= (MAX_DVM_OPS * stride)) ||
> - pages >= MAX_TLBI_RANGE_PAGES) {
> + pages > MAX_TLBI_RANGE_PAGES) {
> flush_tlb_mm(vma->vm_mm);
> return;
> }
>
Thanks,
Gavin