traversing vma on nommu

Tue Nov 5 15:48:28 PST 2024

Hello Liam,

Thank you for taking the time to give me useful information, even though my
report was insufficient for the situation; I apologize for my laziness.

On Wed, 06 Nov 2024 00:04:25 +0900,
Liam R. Howlett wrote:
> 
> * Hajime Tazaki <thehajime at gmail.com> [241104 21:24]:
> > 
> > Hello,
> > 
> > I'd like to ask your help to debug my out-of-tree kernel extension,
> > which is still in an RFC stage.  The RFC is about nommu extension to
> > UML (user mode linux).
> > 
> > https://lore.kernel.org/linux-um/m2y129hpx5.wl-thehajime@gmail.com/T/#t
> > 
> > I've been using v6.10 tag for the base branch for a while, mostly
> > works fine but after I rebased my branch to v6.12-rc2, I faced a crash
> > upon a process exit during vma iteration.
> 
> Does this happen in later v6.12-rc releases as well?
>
> There are a few issues fixed in later releases which may solve your
> problem.

I pulled/rebased v6.12-rc6 and the issue still persists.

Thread 1 "vmlinux" received signal SIGSEGV, Segmentation fault.
acct_collect (exitcode=exitcode@entry=256, group_dead=group_dead@entry=1) at ../kernel/acct.c:565
565      vsize += vma->vm_end - vma->vm_start;
(gdb) p vma
$4 = (struct vm_area_struct *) 0xe
(gdb) p vmi
$5 = {
  mas = {
    tree = 0x703e8708,
    index = 1886371840,
    last = 1886371839,
    node = 0x70bef30c,
    min = 0,
    max = 1886371839,
    alloc = 0x0,
    status = ma_active,
    depth = 1 '\001',
    offset = 15 '\017',
    mas_flags = 0 '\000',
    end = 14 '\016',
    store_type = wr_invalid
  }
}
(gdb) bt
#0  acct_collect (exitcode=exitcode@entry=256, group_dead=group_dead@entry=1)
    at ../kernel/acct.c:565
#1  0x000000006003de9d in do_exit (code=code@entry=256) at ../kernel/exit.c:918
#2  0x000000006003e842 in do_group_exit (exit_code=256) at ../kernel/exit.c:1088
#3  0x000000006003e85c in __do_sys_exit_group (error_code=<optimized out>) at ../kernel/exit.c:1099
#4  __se_sys_exit_group (error_code=<optimized out>) at ../kernel/exit.c:1097
#5  0x0000000060038996 in do_syscall_64 (regs=0x705aadd0) at ../arch/x86/um/do_syscall_64.c:83
#6  0x0000000060038b03 in __kernel_vsyscall () at ../arch/x86/um/entry_64.S:73
#7  0x00010102464c457f in ?? ()
#8  0x0000000000000000 in ?? ()

> There was a patch for a numa scheduler issue with iterations discovered
> with stress-ng - which I introduced in my conversion to the maple tree in
> kernel/sched/fair.c [1].  I'm pretty sure this has nothing to do with
> what you are seeing, but worth a mention.

thanks.

> Does it happen without your patches?  It is not clear from your
> statement above - but maybe you can't check considering it's nommu and
> you are trying to get um linux going.

I should have tried this before sending my previous email.

I'm preparing a test environment for another nommu arch with buildroot
and will let you know how it goes.

> > I bisected this issue and found that if I reverted the 4 commits
> > below, the issue is gone.
> > 
> > ed4dfd9aa1b1 maple_tree: make write helper functions void
> > c27e6183c654 maple_tree: remove unneeded mas_wr_walk() in mas_store_prealloc()
> > add60ea5f6d8 maple_tree: remove repeated sanity checks from write helper functions
> > 9155e8433498 maple_tree: remove node allocations from various write helper functions
> 
> None of these patches will change the way any of this works - they are
> all on the write path and you are reading the vmas.

Thanks for the information.
Indeed, the data structure (vma_iterator->ma_state) somehow already seems
to be broken before this iteration:

From the gdb output above, when this issue happened, the ma_state->end
value is smaller than the ->offset value.

(gdb) p vmi
$5 = {
  mas = {
    tree = 0x703e8708,
    index = 1886371840,
    last = 1886371839,
    node = 0x70bef30c,
    min = 0,
    max = 1886371839,
    alloc = 0x0,
    status = ma_active,
    depth = 1 '\001',
    offset = 15 '\017',
    mas_flags = 0 '\000',
    end = 14 '\016',
    store_type = wr_invalid
  }
}

When other programs pass through this path without crashing, end > offset
always seems to hold.
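
To make the comparison concrete, here is a minimal user-space sketch (not
kernel code) of the check I have in mind.  The struct and function names are
hypothetical and only mirror the fields printed by gdb.  Treating
offset > end or index > last as suspicious is my assumption from the
observations above, not a documented maple tree invariant.

```c
#include <stdbool.h>

/*
 * Hypothetical user-space copy of the ma_state fields shown by gdb.
 * This is NOT the kernel's struct ma_state, only the values of interest.
 */
struct mas_snapshot {
	unsigned long index;   /* start of the range the state points at */
	unsigned long last;    /* end of that range */
	unsigned char offset;  /* slot offset within the current node */
	unsigned char end;     /* last used slot offset in the node */
};

/*
 * Return true when the snapshot differs from what the non-crashing runs
 * showed.  Both conditions are assumptions based on observation only.
 */
static bool mas_snapshot_suspicious(const struct mas_snapshot *s)
{
	if (s->offset > s->end)   /* iterator sits past the last used slot */
		return true;
	if (s->index > s->last)   /* range is inverted */
		return true;
	return false;
}
```

With the values from the gdb output above (offset = 15, end = 14,
index = 1886371840, last = 1886371839), this check flags the state as
suspicious.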

> It seems like this might be some sort of a race.  How many times were
> you able to recreate it with/without those changes?

With my UML patches, one specific process (in the gdb session above,
`apk add pkgname`) always triggers this issue.  Without my patches,
I'll try to test it on another arch.

> It is also on task teardown, which uses a special path when no one has
> the mm struct reference and so we can be sure there are no readers
> within the tree.

I see.

> > I'd like to debug what's wrong with my code but no luck so far.
> > I thought it is related with nommu code (mm/nommu.c) but didn't find
> > any useful hints for me.
> > 
> > It'd be very great if you have similar experience on this kind of
> > issue (tree iteration over vma, etc), or share some common pitfall
> > when using maple tree library.
> 
> 
> > 
> > below is the log of a gdb session.
> > 
> > ```
> > ${HOSTNAME%%.*}:$PWD# apk add se
> > afetch https://dl-cdn.alpinelinux.org/alpine/v3.20/main/x86_64/APKINDEX.tar.gz
> > fetch https://dl-cdn.alpinelinux.org/alpine/v3.20/community/x86_64/APKINDEX.tar.gz
> > ERROR: unable to select packages:
> >   se (no such package):
> >     required by: world[se]
> 
> Not entirely sure on alpine linux, is this relevant to the issue?

Currently nommu UML only works on Alpine with a rebuilt libc/busybox.
Not all processes trigger this issue, but the apk command always
shows it on process exit.

> > Thread 1 "vmlinux" received signal SIGSEGV, Segmentation fault.
> 
> is the bug that you got a sigsegv or is that the test (ie, are you
> sending a segv to make sure task teardown is okay?)

I guess this SIGSEGV is generated because the variable vma holds an invalid
address.

(gdb) p vma
$4 = (struct vm_area_struct *) 0xe
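
As a rough illustration of why 0xe cannot be a usable vma pointer, here is a
minimal user-space sketch.  The helper name and the 4096-byte page size are
assumptions for illustration; this is not an existing kernel helper.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed page size for this sketch; the real value depends on the arch. */
#define SKETCH_PAGE_SIZE 4096UL

/*
 * Hypothetical plausibility check for an object pointer:
 * - addresses inside the first page are treated as invalid,
 * - a struct pointer is expected to be at least pointer-size aligned.
 */
static bool pointer_is_plausible(uintptr_t p)
{
	if (p < SKETCH_PAGE_SIZE)
		return false;
	if (p % sizeof(void *) != 0)
		return false;
	return true;
}
```

The value 0xe fails the first condition, while an address such as the tree
pointer 0x703e8708 from the gdb output passes both.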

> > acct_collect (exitcode=exitcode@entry=256, group_dead=1) at ../kernel/acct.c:565
> > 565      vsize += vma->vm_end - vma->vm_start;
> > (gdb) l
> > 560     VMA_ITERATOR(vmi, mm, 0);
> > 561     struct vm_area_struct *vma;
> > 562   
> > 563     mmap_read_lock(mm);
> > 564     for_each_vma(vmi, vma)
> > 565      vsize += vma->vm_end - vma->vm_start;
> > 566     mmap_read_unlock(mm);
> > 567    }
> 
> This is fine.  You have the mm locked (from current->mm) so the vma tree
> should not be modified.

Thanks, this is a part of kernel/acct.c that I didn't touch.

> > 568   
> > 569    spin_lock_irq(&current->sighand->siglock);
> > (gdb) bt
> > #0  acct_collect (exitcode=exitcode@entry=256, group_dead=1) at ../kernel/acct.c:565
> > #1  0x000000006003d5af in do_exit (code=code@entry=256) at ../kernel/exit.c:918
> 
> do_exit uses the current->mm after here as well, so the current->mm
> seems safe.  We'd know if the referen

I see.

> > #2  0x000000006003deca in do_group_exit (exit_code=256) at ../kernel/exit.c:1088
> > #3  0x000000006003dee4 in __do_sys_exit_group (error_code=<optimized out>) at ../kernel/exit.c:1099
> > #4  __se_sys_exit_group (error_code=<optimized out>) at ../kernel/exit.c:1097
> > #5  0x000000006003840d in do_syscall_64 (regs=0x705aadd8) at ../arch/x86/um/do_syscall_64.c:83
> > #6  0x000000006003855e in __kernel_vsyscall () at ../arch/x86/um/entry_64.S:73
> > #7  0x00000087000081ed in ?? ()
> > #8  0x671cad19671cad19 in ?? ()
> > #9  0x00000000671ca871 in ?? ()
> > #10 0x0000000800010000 in ?? ()
> > #11 0x0000000400080000 in ?? ()
> > #12 0x000000040001f30a in ?? ()
> > #13 0x0000000000000000 in ?? ()
> > ```
> 
> I suspect that something went wrong in nommu in regards to
> adding/removing/splitting vmas, but you only discover the issue much
> later.

indeed.

> There is a fix in v6.12-rc4 bea07fd63192b ("maple_tree: correct tree
> corruption on spanning store") which you may need.  Can you please try a
> newer version of the kernel?
> 
> [1] https://lore.kernel.org/all/173012156393.1442.1751639070858226239.tip-bot2@tip-bot2/

Thanks for the info.
As I noted above, the newer rc6 still causes the same issue.

I'll keep investigating what's going on and get back here once I have
some news.

Again, thank you so much for your help.

-- Hajime