[PATCH] riscv: pageattr: Fixup synchronization problem between init_mm and active_mm

Mon Jul 3 19:25:33 PDT 2023

On Mon, Jul 3, 2023 at 6:17 PM Alexandre Ghiti <alex at ghiti.fr> wrote:
>
> Hi Guo,
>
> On 29/06/2023 10:20, guoren at kernel.org wrote:
> > From: Guo Ren <guoren at linux.alibaba.com>
> >
> > The machine_kexec() uses set_memory_x to add the executable attribute to the
> > page table entry of control_code_buffer. It only modifies the init_mm but not
> > the current->active_mm. The current kexec process won't use init_mm directly,
> > and it depends on minor_pagefault, which is removed by commit 7d3332be011e4
>
>
> Is the removal of minor_pagefault an issue? I'm not sure I understand
> this part of the changelog.
I use two different work-around patches to answer your question:
1st:
-----

diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c
index 705d63a59aec..b8b200c81606 100644
--- a/arch/riscv/mm/fault.c
+++ b/arch/riscv/mm/fault.c
@@ -249,7 +249,7 @@ void handle_page_fault(struct pt_regs *regs)
* only copy the information from the master page table,
* nothing more.
*/
- if (unlikely((addr >= VMALLOC_START) && (addr < VMALLOC_END))) {
+ if (unlikely(addr >= 0x8000000000000000UL)) {
vmalloc_fault(regs, code, addr);
return;
}
------

2nd:
------
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 8e65f0a953e5..270f50852886 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1387,7 +1387,7 @@ static void __init create_linear_mapping_page_table(void)
if (end >= __pa(PAGE_OFFSET) + memory_limit)
end = __pa(PAGE_OFFSET) + memory_limit;

- create_linear_mapping_range(start, end, 0);
+ create_linear_mapping_range(start, end, PMD_SIZE);
}

#ifdef CONFIG_STRICT_KERNEL_RWX
-----

The removal of minor_pagefault could be an issue, but in this case
it's the VMALLOC_START/END which prevents the minor_pagefault at
first. I didn't say commit 7d3332be011e4 is the problem.

>
>
> > ("riscv: mm: Pre-allocate PGD entries for vmalloc/modules area") of 64BIT. So,
> > when it met pud mapping on an MMU_SV39 machine, it caused the following:
> >
> >   kexec_core: Starting new kernel
> >   Will call new kernel at 00300000 from hart id 0
> >   FDT image at 747c7000
> >   Bye...
> >   Unable to handle kernel paging request at virtual address ffffffda23b0d000
> >   Oops [#1]
> >   Modules linked in:
> >   CPU: 0 PID: 53 Comm: uinit Not tainted 6.4.0-rc6 #15
> >   Hardware name: Sophgo Mango (DT)
> >   epc : 0xffffffda23b0d000
> >    ra : machine_kexec+0xa6/0xb0
> >   epc : ffffffda23b0d000 ra : ffffffff80008272 sp : ffffffc80c173d10
> >    gp : ffffffff8150e1e0 tp : ffffffd9073d2c40 t0 : 0000000000000000
> >    t1 : 0000000000000042 t2 : 6567616d69205444 s0 : ffffffc80c173d50
> >    s1 : ffffffd9076c4800 a0 : ffffffd9076c4800 a1 : 0000000000300000
> >    a2 : 00000000747c7000 a3 : 0000000000000000 a4 : ffffffd800000000
> >    a5 : 0000000000000000 a6 : ffffffd903619c40 a7 : ffffffffffffffff
> >    s2 : ffffffda23b0d000 s3 : 0000000000300000 s4 : 00000000747c7000
> >    s5 : 0000000000000000 s6 : 0000000000000000 s7 : 0000000000000000
> >    s8 : 0000000000000000 s9 : 0000000000000000 s10: 0000000000000000
> >    s11: 0000003f940001a0 t3 : ffffffff815351af t4 : ffffffff815351af
> >    t5 : ffffffff815351b0 t6 : ffffffc80c173b50
> >   status: 0000000200000100 badaddr: ffffffda23b0d000 cause: 000000000000000c
> >
> > Yes, Using set_memory_x API after boot has the limitation, and at least we
> > should synchronize the current->active_mm to fix the problem.
> >
> > Fixes: d3ab332a5021 ("riscv: add ARCH_HAS_SET_MEMORY support")
> > Signed-off-by: Guo Ren <guoren at linux.alibaba.com>
> > Signed-off-by: Guo Ren <guoren at kernel.org>
> > ---
> >   arch/riscv/mm/pageattr.c | 7 +++++++
> >   1 file changed, 7 insertions(+)
> >
> > diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
> > index ea3d61de065b..23d169c4ee81 100644
> > --- a/arch/riscv/mm/pageattr.c
> > +++ b/arch/riscv/mm/pageattr.c
> > @@ -123,6 +123,13 @@ static int __set_memory(unsigned long addr, int numpages, pgprot_t set_mask,
> >                                    &masks);
> >       mmap_write_unlock(&init_mm);
> >
> > +     if (current->active_mm != &init_mm) {
> > +             mmap_write_lock(current->active_mm);
> > +             ret =  walk_page_range_novma(current->active_mm, start, end,
> > +                                          &pageattr_ops, NULL, &masks);
> > +             mmap_write_unlock(current->active_mm);
> > +     }
> > +
> >       flush_tlb_kernel_range(start, end);
> >
> >       return ret;
>
>
> I don't understand: any page table inherits the entries of
> swapper_pg_dir (see pgd_alloc()), so any kernel page table entry is
> "automatically" synchronized, so why should we synchronize one 4K entry
> explicitly? A PGD entry would need to be synced, but not a PTE entry.
The purpose of the second walk_page_range_novma() is for pgd's entries
synchronization. I'm a bit lazy here, I agree, it's unnecessary to
write lower level entries again. So I would use a simple pgd entries
synchronization from vmalloc_fault in the next version of patch, all
right?


>


--
Best Regards
 Guo Ren