WARNING in __do_kernel_fault

Wed Jan 27 12:34:47 EST 2021

On Wed, Jan 27, 2021 at 06:24:22PM +0100, Dmitry Vyukov wrote:
> On Wed, Jan 27, 2021 at 6:15 PM Will Deacon <will at kernel.org> wrote:
> >
> > On Wed, Jan 27, 2021 at 06:00:30PM +0100, Dmitry Vyukov wrote:
> > > On Wed, Jan 27, 2021 at 5:56 PM syzbot
> > > <syzbot+45b6fce29ff97069e2c5 at syzkaller.appspotmail.com> wrote:
> > > >
> > > > Hello,
> > > >
> > > > syzbot found the following issue on:
> > > >
> > > > HEAD commit:    2ab38c17 mailmap: remove the "repo-abbrev" comment
> > > > git tree:       upstream
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=15a25264d00000
> > > > kernel config:  https://syzkaller.appspot.com/x/.config?x=ad43be24faf1194c
> > > > dashboard link: https://syzkaller.appspot.com/bug?extid=45b6fce29ff97069e2c5
> > > > userspace arch: arm64
> > > >
> > > > Unfortunately, I don't have any reproducer for this issue yet.
> > > >
> > > > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > > > Reported-by: syzbot+45b6fce29ff97069e2c5 at syzkaller.appspotmail.com
> > >
> > > This happens on arm64 instance with mte enabled.
> > > There is a GPF in reiserfs_xattr_init on x86_64 reported:
> > > https://syzkaller.appspot.com/bug?id=8abaedbdeb32c861dc5340544284167dd0e46cde
> > > so I would assume it's just a plain NULL deref. Is this WARNING not
> > > indicative of a kernel bug? Or there is something special about this
> > > particular NULL deref?
> >
> > Congratulations, you're the first person to trigger this warning!
> >
> > This fires if we take an unexpected data abort in the kernel but when we
> > get into the fault handler the page-table looks ok (according to the CPU via
> > an 'AT' instruction). Are you using QEMU system emulation? Perhaps its
> > handling of AT isn't quite right.
> 
> Yes, it's qemu-system-aarch64 5.2 with -machine virt,mte=on -cpu max.
> Do you see any way forward for this issue? Can somehow prove/disprove
> it's qemu at fault?
> The instance just started running, but it seems to be the most common
> crash so far and it seems to happen on _all_ gpf's.
> You can see all arm64 crashes so far here:
> https://syzkaller.appspot.com/upstream?manager=ci-qemu2-arm64-mte
> They all happen in reiserfs_security_init, but locally I got a bunch
> of different stacks, e.g.:

Your best bet is to hack is_spurious_el1_translation_fault() to dump addr,
es and par, then we can help decipher the logs here. It could also easily be
a bug in that code, since it hasn't been run before (well, other than
contrived testing when I wrote it).

Will