Problem with nbcon console and amba-pl011 serial port

Michael Kelley mhklinux at outlook.com
Sun Jun 8 20:38:07 PDT 2025


From: John Ogness <john.ogness at linutronix.de> Sent: Thursday, June 5, 2025 12:43 AM
> 
> On 2025-06-05, "Toshiyuki Sato (Fujitsu)" <fj6611ie at fujitsu.com> wrote:
> >> I've tested the fix in my primary environment (ARM64 VM in the Azure
> >> cloud), and I've seen no failures to stop a CPU. I kept my custom
> >> logging in place, so I could confirm that the problem path is still
> >> happening, and the fix recovers from the problem path. So the good
> >> results are not due to just a timing change. The "pr/ttyAMA0" task is
> >> still looping forever trying to get ownership of the console, but it
> >> is doing so at a higher level in nbcon_kthread_func() and in calling
> >> nbcon_emit_one(), and interrupts are enabled for part of the loop.
> >>
> >> Full disclosure: I have a secondary environment, also an ARM64 VM in
> >> the Azure cloud, but running on an older version of Hyper-V. In this
> >> environment I see the same custom logging results, and the
> >> "pr/ttyAMA0" task is indeed looping with interrupts enabled. But for
> >> some reason, the CPU doesn't stop in response to IPI_CPU_STOP. I don't
> >> see any evidence that this failure to stop is due to the Linux pl011
> >> driver or nbcon. This older version of Hyper-V has a known problem in
> >> pl011 UART emulation, and I have a theory on how that problem may be
> >> causing the failure to stop. It will take me some time to investigate
> >> further, but based on what I know now, that investigation should not
> >> hold up this fix.
> >>
> >> Michael
> >
> > Thank you for testing the patch.
> > I'm concerned about the thread looping...
> 
> The thread would only loop if there is a backlog. But that backlog
> should have been flushed atomically by the panic CPU.
> 
> Are you able to dump the kernel buffer and see if there are trailing
> messages in the kernel buffer that did not get printed? I wonder if the
> atomic printing is hanging or something.
> 

Getting back to your question: there are 24 lines of console output
in the panic path with sysrq, up to and including the "SMP: stopping
secondary CPUs" line. The nbcon kthread starts to output the
first line, which is at INFO level. Then the panic() function outputs
the 2nd line at EMERGENCY level, and in doing so it takes control
of the console and re-outputs the 1st line followed by the 2nd line.
The panic() function then outputs the remaining 22 lines. What I see
is that in nbcon_kthread_func(), the call to rcuwait_wait_event()
completes about 80,000 times after the panic() path takes control
of the console. The rcuwait_wait_event() call stops completing sometime
between when the panic path calls nbcon_emit_next_record() for
the 2nd line and when it calls it for the 3rd line. Then nbcon_kthread_func()
remains quiescent as the panic path outputs the remaining lines in
successive calls to nbcon_emit_next_record(). Of course, the other
CPUs then get stopped, and the kthread can't do anything anyway. I
haven't tried to track down all the nuances of the expected behavior,
and my custom tracing has limitations. But maybe the kthread looping
behavior is as expected?
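
To make the sequence above easier to reason about, here is a rough
userspace sketch of how I picture the ownership handover. It is not
kernel code: struct console_model, the PRIO_* values, and every
function name below are hypothetical, and the only rule it models is
"a higher-priority context can take the console away, and a record
only counts as printed if the emitter still owns the console when it
finishes." Under that assumption, the record the kthread had started
is printed again by the panic path, and the kthread's later attempts
to get ownership all fail.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical priorities; the real nbcon priority names may differ. */
enum model_prio { PRIO_NONE, PRIO_NORMAL, PRIO_ELEVATED };

struct console_model {
	enum model_prio owner;	/* current owner, PRIO_NONE when free */
	unsigned int seq;	/* next record not yet fully printed */
};

struct scenario_result {
	unsigned int kthread_emitted;		/* records finished by the kthread */
	unsigned int panic_emitted;		/* records finished by the panic path */
	unsigned int kthread_failed_acquires;	/* kthread attempts that were refused */
};

/* Succeeds if the console is free or the caller outranks the owner. */
bool model_try_acquire(struct console_model *con, enum model_prio prio)
{
	if (con->owner != PRIO_NONE && prio <= con->owner)
		return false;
	con->owner = prio;
	return true;
}

/* Releases the console only if the caller is still the owner. */
void model_release(struct console_model *con, enum model_prio prio)
{
	if (con->owner == prio)
		con->owner = PRIO_NONE;
}

/* A record counts as printed only if the caller still owns the console. */
bool model_finish_record(struct console_model *con, enum model_prio prio)
{
	if (con->owner != prio)
		return false;	/* ownership was taken away mid-record */
	con->seq++;
	return true;
}

/*
 * The kthread starts record 0, the panic path takes over before the
 * kthread finishes, then the panic path emits all nr_records records.
 * The kthread retries acquisition kthread_retries times along the way.
 */
struct scenario_result run_takeover_scenario(unsigned int nr_records,
					     unsigned int kthread_retries)
{
	struct console_model con = { .owner = PRIO_NONE, .seq = 0 };
	struct scenario_result res = { 0, 0, 0 };
	bool ok;

	ok = model_try_acquire(&con, PRIO_NORMAL);	/* kthread begins record 0 */
	assert(ok);
	ok = model_try_acquire(&con, PRIO_ELEVATED);	/* panic path takes over */
	assert(ok);

	/* The kthread lost ownership, so record 0 is not marked printed. */
	if (model_finish_record(&con, PRIO_NORMAL))
		res.kthread_emitted++;

	while (con.seq < nr_records) {
		if (kthread_retries > 0) {
			kthread_retries--;
			if (!model_try_acquire(&con, PRIO_NORMAL))
				res.kthread_failed_acquires++;
		}
		/* Starts again at record 0, hence the repeated first line. */
		if (model_finish_record(&con, PRIO_ELEVATED))
			res.panic_emitted++;
	}

	model_release(&con, PRIO_ELEVATED);
	return res;
}
```

In this model, run_takeover_scenario(24, 3) reports all 24 records
finished by the panic path, none by the kthread, and 3 refused kthread
acquisitions. The model says nothing about why the real kthread's
rcuwait_wait_event() completes so many times or why it eventually
stops; it only shows that repeated failed attempts are consistent with
a backlog existing while a higher-priority owner holds the console.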

Separately, I see that you have posted a patch that solves the
original problem in a different way. I'll test that tonight or
tomorrow.

Michael


