[PATCH 00/11] arm64: entry lockdep/rcu/tracing fixes

Mark Rutland mark.rutland at arm.com
Mon Nov 30 11:54:27 EST 2020


On Mon, Nov 30, 2020 at 02:32:45PM +0100, Marco Elver wrote:
> On Mon, Nov 30, 2020 at 12:38PM +0000, Mark Rutland wrote:
> > On Mon, Nov 30, 2020 at 01:03:05PM +0100, Marco Elver wrote:

> > > So, I was hoping that this would fix all the problems I was seeing when
> > > running the ftrace tests ... unfortunately, it didn't. :-( Perhaps the
> > > WIP version you had only worked because it ended up disabling lockdep
> > > early?
> > 
> > Possibly, yes. Either that or the way we do / do-not treat debug
> > exceptions as true NMIs. Either way this appears to be a latent issue
> > rather than something introduced by this series.
> > 
> > From the log below I see you're using:
> > 
> >   5.10.0-rc4-next-20201119-00002-gc88aca8827ce #1 Not tainted
> > 
> > ... and it's possible that the issue you're seeing now is a delta
> > between v5.10-rc3 and what's queued in linux-next -- I've been running
> > the ftrace tests locally without issue atop v5.10-rc3 and v5.10-rc5.
> > 
> > Are you able to reproduce this on my branch alone? If so that gives us a
> > stable tree to investigate, and if not that gives us a stable base for a
> > bisect against linux-next.
> 
> It's the same problem as before, the one I've been reporting in the
> other thread [1]. We know mainline is fine; -next, however, is broken.
> We also know that next-20201105 was still fine, and next-202010 started
> breaking:
> 
> 	https://lkml.kernel.org/r/20201111133813.GA81547@elver.google.com
> 
> The recent tests have been on next-20201119 (including the logs from
> previous email).
>
> I tried bisection, but the results were never conclusive (the closest
> I got was a -rcu merge commit). As discussed in the thread at [1] (and
> its ancestors), we never really got anywhere despite exhausting all
> options (several bisection attempts, etc.).

Ah; I'd lost track and missed that you'd already identified this was
introduced in linux-next, and that bisection wasn't getting anywhere.
Thanks for bearing with me! :)

> > This area is really sensitive to config options, so if you can reproduce
> > this on a stable base, could you share your exact config?
> 
> No, it's not reproducible on mainline.
> 
> Which might also mean that it's something else in -next and your work is
> unrelated.
> 
> But I was surprised your WIP series fixed the problems on next-20201119
> (or so it seemed). So, given all the confusion in [1], I was really
> hoping this would be it...

The major difference between that WIP version and the series as posted
is that the WIP handled debug exceptions (including BRKs) as true NMIs,
which hints that there could be a subtle interaction in that area (or
that the lockdep disable calls in the NMI paths simply masked the
problem).

One simple thing to try would be to hack the debug exception cases to
enter/exit as true NMIs and see whether that hides the issue again. If
so, we can start teasing that apart to narrow it down.
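To be concrete, something like the below is roughly what I have in
mind. This is an untested, debug-only sketch against
arch/arm64/kernel/entry-common.c, assuming the el1_dbg() handler and
the arm64_enter_nmi()/arm64_exit_nmi() helpers this series adds; the
exact names may differ in your tree:

  /*
   * Debug hack: treat EL1 debug exceptions as true NMIs, as the WIP
   * branch effectively did, instead of using the dedicated EL1 debug
   * accounting.
   */
  static void noinstr el1_dbg(struct pt_regs *regs, unsigned long esr)
  {
          unsigned long far = read_sysreg(far_el1);

          arm64_enter_nmi(regs);          /* was: arm64_enter_el1_dbg(regs); */
          do_debug_exception(far, esr, regs);
          arm64_exit_nmi(regs);           /* was: arm64_exit_el1_dbg(regs); */
  }

If that makes the splat disappear on next-20201119, it points at the
EL1 debug accounting rather than the rest of the series; if not, we can
rule that area out.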

> > > I've attached the log and the symbolized report.
> > 
> > Thanks for all this. I'll see if I can tickle this locally while waiting
> > for the above. If you could share your config from this time around
> > that'd be a great head-start!
> 
> It's the same as I've been using for the work in
> 
> 	[1] https://lore.kernel.org/r/20201119193819.GA2601289@elver.google.com
> 
> In summary, to repro:
> 
> 	1. Switch to next-20201119 (possibly even latest, but I haven't tested)
> 
> 	2. Apply provoke-bug.diff
> 
> 	3. Use the attached .config
> 
> 	4. Run with 
> 
> 	   qemu-system-aarch64 -kernel $KERNEL_WORKTREE/arch/arm64/boot/Image \
> 		-append "console=ttyAMA0 root=/dev/sda debug earlycon earlyprintk=serial workqueue.watchdog_thresh=10" \
> 		-nographic -smp 1 -machine virt -cpu cortex-a57 -m 2G

Thanks for the comprehensive repro information!

I note that you're using QEMU in TCG mode, whereas I've been testing
with KVM acceleration. Those differ in speed by orders of magnitude, so
I wonder if the stalls you see are down to TCG simply being slow, and my
patches just happened to shuffle where that slowness was felt.

I gave the above a go, but I wasn't able to reproduce the issue under
either TCG or KVM acceleration after a few attempts. I'm not sure
whether this is intermittent and I'm just getting lucky, or whether
something differs between our setups that keeps me from hitting it.

FWIW I'm testing on a ThunderX2 workstation running Debian 10.6, using
the packaged GCC 8.3.0-6, and a locally-built QEMU 5.1.50
(v5.1.0-2347-g1f3081f6de). The QEMU has a couple of test patches atop
upstream commit ba2a9a9e6318bfd93a2306dec40137e198205b86.

> The tests I ran on your WIP series and just now were applied on top of
> next-20201119+provoke-bug.diff. Your WIP series seemed to fix whatever
> it was we were debugging in [1] (but with some new warnings), but this
> latest series shows no difference and behaviour is unchanged again.
> 
> I also want to emphasize that it is really hard to say whether your
> series here is related, or whether the WIP series appearing to work was
> just some other side-effect we don't understand.

Sure; I think we're aligned on that. There are enough moving parts
here that the WIP might have been
masking a problem, or might have unintentionally solved a problem we
haven't realised exists.

> So I leave it to your judgement to decide to what extent this series
> could possibly help, because I wouldn't want to make you go down a
> rabbit hole that doesn't lead anywhere (as I had already done to
> somehow debug the problem in [1]).

I think as you say it's not at all clear, but I'd hope this series at
least removes a number of potential problems from the search space.

Thanks,
Mark.
