Problem with nbcon console and amba-pl011 serial port
Toshiyuki Sato (Fujitsu)
fj6611ie at fujitsu.com
Thu Jun 5 23:46:27 PDT 2025
Hi Petr,
Thanks for your comment.
> On Thu 2025-06-05 05:27:56, Toshiyuki Sato (Fujitsu) wrote:
> > Hi John and Petr,
> >
> > > On Wed 2025-06-04 13:56:34, John Ogness wrote:
> > > > On 2025-06-04, Petr Mladek <pmladek at suse.com> wrote:
> > > > > On Wed 2025-06-04 04:11:10, Toshiyuki Sato (Fujitsu) wrote:
> > > > >> > On 2025-06-03, John Ogness <john.ogness at linutronix.de> wrote:
> > > > >> > > On 2025-06-03, "Toshiyuki Sato (Fujitsu)" <fj6611ie at fujitsu.com> wrote:
> > > > >> > >>> 4. pr_emerg() has a high logging level, and it effectively steals the console
> > > > >> > >>> from the "pr/ttyAMA0" task, which I believe is intentional in the nbcon
> > > > >> > design.
> > > > >> > >>> Down in pl011_console_write_thread(), the "pr/ttyAMA0" task is doing
> > > > >> > >>> nbcon_enter_unsafe() and nbcon_exit_unsafe() around each character
> > > > >> > >>> that it outputs. When pr_emerg() steals the console, nbcon_exit_unsafe()
> > > > >> > >>> returns 0, so the "for" loop exits. pl011_console_write_thread() then
> > > > >> > >>> enters a busy "while" loop waiting to reclaim the console. It's doing this
> > > > >> > >>> busy "while" loop with interrupts disabled, and because of the panic,
> > > > >> > >>> it never succeeds.
> > > > >
> > > > > I am a bit surprised that it never succeeds. The panic CPU takes over
> > > > > the ownership but it releases it when the messages are flushed. And
> > > > > the original owner should be able to reacquire it in this case.
> > > >
> > > > The problem is that other_cpu_in_panic() will return true forever, which
> > > > will cause _all_ acquires to fail forever. Originally we did allow
> > > > non-panic to take over again after panic releases ownership. But IIRC we
> > > > removed that capability because it allowed us to reduce a lot of
> > > > complexity. And now nbcon_waiter_matches() relies on "Lower priorities
> > > > are ignored during panic() until reboot."
> > >
> > > Great catch! I forgot it. And it explains everything.
> > >
> > > It would be nice to mention this in the commit message or
> > > in the comment above nbcon_reacquire_nobuf().
> > >
> > > My updated prosal of the comment is:
> > >
> > > * Return: True when the context reacquired the owner ship. The caller
> > > * might try entering the unsafe state and restore the original
> > > * console device setting. It must not access the output buffer
> > > * anymore.
> > > *
> > > * False when another CPU is in panic(). nbcon_try_acquire()
> > > * would never succeed and the infinite loop would prevent
> > > * stopping this CPU on architectures without proper NMI.
> > > * The caller should bail out immediately without
> > > * touching the console device or the output buffer.
> > >
> > > Best Regards,
> > > Petr
> >
> > Thank you for your comments and suggestions.
> >
> > After consideration,
> > I believe that there is no problem with forcibly terminating the process when
> > nbcon_reacquire_nobuf returns false at the pl011 driver level,
> > as in the test patch.
> >
> > It feels a bit harsh that a thread which started processing before the panic
> > and then transferred ownership to an atomic operation isn't allowed to perform
> > cleanup during panic handling or the grace period before the CPU halts.
> >
> > I would like to hear your opinion on this.
> > If nbcon_reacquire_nobuf() could acquire ownership even after the panic,
> > then driver-side modifications might not be necessary.
> > (The responsibility to complete the process without hindering the panic process
> > would still remain.)
> >
> > Would it be difficult to make an exception to the rule,
> > "Lower priorities are ignored during panic() until reboot,"
> > depending on the situation?
>
> Good question.
>
> The following two problems came to my mind:
>
> 1. As John, pointed out, the fact that non-panic CPUs are not
> able to acquire the context allowed to simplify the implementation.
>
> I think that it is primary about nbcon_waiter_matches(),
> nbcon_owner_matches(), and the related logic. It was documented by
> the commit 8c9dab2c55ad7 ("printk: nbcon: Clarify rules of
> the owner/waiter matching").
>
> But it seems that nbcon_owner_matches() is safe even without the rule.
> The race is prevented either by disabling interrupts and preemption
> or by taking device_lock().
>
> The rule prevents a race in nbcon_waiter_matches(). But it seems
> that in the worst case, more CPUs might end up busy waiting.
> And it would be acceptable during panic().
>
> So, this need not be a big problem in the end.
>
>
> 2. If we allowed non-panic() CPUs to acquire the ownership, it would
> increase the risk that the panic CPU will not be able to
> flush the messages.
>
> But maybe, the problem is only when the architecture supports
> proper NMI and non-panic CPUs might be stopped anywhere.
>
> It should be less problem on architectures without proper NMI
> where the non-panic CPU could not be stopped in the problematic
> situation.
>
> So, maybe, we could relax the rule on architectures without
> proper NMI.
>
>
> The question is if it is worth it. Is the clean up really important?
>
> Note that the clean up will never be guaranteed on architectures with
> a proper NMI. They would stop the non-panic CPUs, including the printk
> kthread, anywhere.
>
> And I guess that the console devices will be initialized after the
> reboot anyway.
>
Understood.
At least for pl011, a forced shutdown does not cause any problems.
(Since clk remains enabled and CR maintains a transmittable state,
it does not hinder the panic handling process.)
I'll create a revised patch based on the patch I posted for testing this time.
Please allow me some time.
When I submit the patch,
I would appreciate your continued assistance with reviews and other aspects.
Regards,
Toshiyuki Sato
More information about the linux-arm-kernel
mailing list