[PATCH] riscv: Flush stale TLB entry with VMAP_STACK enabled

Alexandre Ghiti alex at ghiti.fr
Wed Aug 2 07:49:13 PDT 2023


On 02/08/2023 14:46, dylan wrote:
> On Tue, Aug 01, 2023 at 03:56:06PM +0200, Alexandre Ghiti wrote:
>
> Hi Alex,
>
> Thanks for the comment.
>
>> Hi Dylan,
>>
>>
>> On 01/08/2023 11:09, Dylan Jhong wrote:
>>> When VMAP_STACK is enabled, the kernel stack will be obtained through
>>> vmalloc(). Normally, we rely on the logic in vmalloc_fault() to update stale
>>> P*D entries covering the vmalloc space in a task's page tables when it first
>>> accesses the problematic region.
>>
>> I guess that's for rv32 right? Because vmalloc_fault() has been removed for
>> rv64 in 6.5.
>>
>> Here you describe the issue as being caused by the vmap stack being in a new
>> PGD which then needs a page table synchronization in vmalloc_fault(), which
>> can't happen since vmalloc_fault() needs this same stack in the current page
>> table.
>>
> No. This problem only occurs on rv64. It is triggered when the kernel stack is
> allocated via vmalloc(), which only happens when VMAP_STACK is enabled.
> According to riscv/Kconfig, VMAP_STACK can only be enabled on rv64.
>
> Sorry, I forgot to mention that I was developing on Linux 6.1 LTS, and I did not
> notice that vmalloc_fault() was removed in Linux 6.5. But this should not impact
> the problem this patch aims to solve, as the kernel stack is accessed in
> handle_exception()[1] long before vmalloc_fault() is called.
>
> To verify this, I also backported the "remove vmalloc_fault()" patch [2] to
> Linux 6.1 LTS. The experiment revealed that the problem still persists.
>
> [*1]: https://github.com/torvalds/linux/blob/master/arch/riscv/kernel/entry.S#L42
> [*2]: https://lore.kernel.org/linux-mm/871qiyfhpe.fsf@all.your.base.are.belong.to.us/T/
>
>>>    Unfortunately, this is not sufficient when
>>> the kernel stack resides in the vmalloc region, because vmalloc_fault() is a
>>> C function that needs a stack to run. So we need to ensure that these P*D
>>> entries are up to date *before* the MM switch.
>>>
>>> Here's our symptom:
>>> core 0: A speculative load brings the kernel stack address into the TLB
>>>           before the corresponding kernel stack's page table entry is created.
>>> core 1: Create the page table mapping of that kernel stack.
>>> core 0: After a context switch, the kernel attempts to use the stack region.
>>>           However, even though the page table is now correct, the stack address
>>>           mapping cached in the TLB is invalid, leading to subsequent nested
>>>           exceptions.
>>
>> But here the problem you describe is different since it seems to be caused
>> by the TLB caching invalid entries which then needs a sfence.vma for the
>> page table walker to see the new correct entry.
>>
> The TLB caching invalid entries is exactly the problem we encountered, so my
> modification updates the TLB through arch_sync_kernel_mappings() after vmalloc()
> creates the page table entries.
>
> Let me describe the situation we encountered more clearly:
>
> When we execute a multi-threaded user space program on an rv64 SMP system, we
> eventually run into nested exceptions caused by the TLB not being updated.
>
>
> CPU0                                           CPU1
> - (Thread 1) Create page table
> - (Thread 1) Do plist_check_list() to
>    check that the kernel stack used by
>    thread 1 is valid. A speculative load
>    is triggered here, loading the kernel
>    stack address of thread 2 into CPU0's
>    TLB, even though thread 2's mapping
>    has not been established yet.
>                                                 - (Thread 2) Create page table
>                                                 - (Thread 2) Do plist_check_list().
>                                                 - Switch MM
> - (Thread 2) Keep executing the user
>    program, then enter handle_exception().
>    handle_exception() tries to access the
>    kernel stack to store registers, but the
>    kernel stack mapping in the TLB is invalid,
>    so another page fault exception is raised.
>    The nested exception occurs here. System hang.
>
> Therefore, the main intention of this patch is to flush the TLB of all CPUs
> after vmalloc() creates the page table entries, and before the MM switch.
>
>>> This fix is inspired by ARM's approach[*1], commit a1c510d0adc6 ("ARM:
>>> implement support for vmap'ed stacks"), which also performs a TLB flush after
>>> setting up the page tables in vmalloc().
>>> Fixes: 31da94c25aea ("riscv: add VMAP_STACK overflow detection")
>>> Signed-off-by: Dylan Jhong <dylan at andestech.com>
>>> ---
>>>    arch/riscv/include/asm/page.h |  4 ++++
>>>    arch/riscv/mm/tlbflush.c      | 16 ++++++++++++++++
>>>    2 files changed, 20 insertions(+)
>>>
>>> diff --git a/arch/riscv/include/asm/page.h b/arch/riscv/include/asm/page.h
>>> index 349fad5e35de..c9b080a72855 100644
>>> --- a/arch/riscv/include/asm/page.h
>>> +++ b/arch/riscv/include/asm/page.h
>>> @@ -21,6 +21,10 @@
>>>    #define HPAGE_MASK              (~(HPAGE_SIZE - 1))
>>>    #define HUGETLB_PAGE_ORDER      (HPAGE_SHIFT - PAGE_SHIFT)
>>> +#ifdef CONFIG_VMAP_STACK
>>> +#define ARCH_PAGE_TABLE_SYNC_MASK	PGTBL_PTE_MODIFIED
>>> +#endif
>>> +
>>>    /*
>>>     * PAGE_OFFSET -- the first address of the first page of memory.
>>>     * When not using MMU this corresponds to the first free page in
>>> diff --git a/arch/riscv/mm/tlbflush.c b/arch/riscv/mm/tlbflush.c
>>> index ef701fa83f36..0799978913ee 100644
>>> --- a/arch/riscv/mm/tlbflush.c
>>> +++ b/arch/riscv/mm/tlbflush.c
>>> @@ -86,3 +86,19 @@ void flush_pmd_tlb_range(struct vm_area_struct *vma, unsigned long start,
>>>    	__sbi_tlb_flush_range(vma->vm_mm, start, end - start, PMD_SIZE);
>>>    }
>>>    #endif
>>> +
>>> +#ifdef CONFIG_VMAP_STACK
>>> +/*
>>> + * Normally, we rely on the logic in vmalloc_fault() to update stale P*D
>>> + * entries covering the vmalloc space in a task's page tables when it first
>>> + * accesses the problematic region. Unfortunately, this is not sufficient when
>>> + * the kernel stack resides in the vmalloc region, because vmalloc_fault() is a
>>> + * C function that needs a stack to run. So we need to ensure that these P*D
>>> + * entries are up to date *before* the MM switch.
>>> + */
>>> +void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
>>> +{
>>> +	if (start < VMALLOC_END && end > VMALLOC_START)
>>> +		flush_tlb_all();
>>> +}
>>> +#endif
>>
>> And if that works for you, I'd say the problem is the latter: the TLB
>> caching invalid entries, since you don't synchronize the page tables here.
>> That looks a lot like the patch I proposed here https://patchwork.kernel.org/project/linux-riscv/patch/20230725132246.817726-1-alexghiti@rivosinc.com/
>> that implements flush_cache_vmap().
>>
> Updating the TLB in flush_cache_vmap() might address the problem I encountered,
> and I noticed that ARM accomplishes this through arch_sync_kernel_mappings()[3].
> As a result, I chose to adopt a similar approach to ARM's solution.
>
> [3]: https://github.com/torvalds/linux/blob/5d0c230f1de8c7515b6567d9afba1f196fb4e2f4/arch/arm/kernel/traps.c#L962


That being said, the problem is more general than the one you are trying
to fix here: we should not restrict arch_sync_kernel_mappings() to
VMAP_STACK, since any vmalloc mapping could lead to such a bug (even more
so since the removal of vmalloc_fault()). For microarchitectures that
allow caching invalid entries in the TLB, we must emit a "preventive"
sfence.vma (unfortunately on all cores) for any new kernel mapping, since
we would not be able to recover from the kernel page fault (unless we find
another solution, but given 6.5 is approaching, I doubt it).

I don't know which patch is "better" than the other; I'm looking into
that. Maybe you could send a v2 in the meantime, where the commit log
does not mention vmalloc_fault() or the synchronization of PGD entries,
and which removes the #ifdef CONFIG_VMAP_STACK?
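
To illustrate what I mean, here is a rough, untested sketch: it is simply
your patch with the CONFIG_VMAP_STACK guard dropped and the comment
reworded, so the hook fires for every new vmalloc mapping:

  /* arch/riscv/include/asm/page.h */
  #define ARCH_PAGE_TABLE_SYNC_MASK	PGTBL_PTE_MODIFIED

  /* arch/riscv/mm/tlbflush.c */
  void arch_sync_kernel_mappings(unsigned long start, unsigned long end)
  {
  	/*
  	 * A new kernel mapping may already be cached as an invalid entry
  	 * in some hart's TLB, so flush all harts before the new range can
  	 * be used (e.g. as a vmap'ed stack).
  	 */
  	if (start < VMALLOC_END && end > VMALLOC_START)
  		flush_tlb_all();
  }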

Thanks,

Alex


>
>> If I'm not mistaken, we have another problem on 32-bit: I guess that in your
>> example core 0 and core 1 execute in the same address space (i.e. the same
>> page table), so a simple sfence.vma gets rid of the invalid entry and things
>> can go on. But what if two page tables are created with the same vmalloc
>> mappings: one adds a PGD in the vmalloc mapping, then the other one does not
>> have it in its page table but still allocates its vmap stack in this new PGD
>> => the latter would never be able to recover from the vmalloc fault, since it
>> needs to update its page table with the new PGD and it needs the new PGD for
>> that.
>>
>> Let me know if I'm completely wrong here!
>>
>> Thanks,
>>
>> Alex
>>
>>
> Best regards,
> Dylan Jhong
>
>
>


