Problem with nbcon console and amba-pl011 serial port

Thu Jun 5 23:46:27 PDT 2025

Hi Petr,

Thanks for your comment.

> On Thu 2025-06-05 05:27:56, Toshiyuki Sato (Fujitsu) wrote:
> > Hi John and Petr,
> >
> > > On Wed 2025-06-04 13:56:34, John Ogness wrote:
> > > > On 2025-06-04, Petr Mladek <pmladek at suse.com> wrote:
> > > > > On Wed 2025-06-04 04:11:10, Toshiyuki Sato (Fujitsu) wrote:
> > > > >> > On 2025-06-03, John Ogness <john.ogness at linutronix.de> wrote:
> > > > >> > > On 2025-06-03, "Toshiyuki Sato (Fujitsu)" <fj6611ie at fujitsu.com> wrote:
> > > > >> > >>> 4. pr_emerg() has a high logging level, and it effectively steals the console
> > > > >> > >>> from the "pr/ttyAMA0" task, which I believe is intentional in the nbcon
> > > > >> > design.
> > > > >> > >>> Down in pl011_console_write_thread(), the "pr/ttyAMA0" task is doing
> > > > >> > >>> nbcon_enter_unsafe() and nbcon_exit_unsafe() around each character
> > > > >> > >>> that it outputs.  When pr_emerg() steals the console, nbcon_exit_unsafe()
> > > > >> > >>> returns 0, so the "for" loop exits. pl011_console_write_thread() then
> > > > >> > >>> enters a busy "while" loop waiting to reclaim the console. It's doing this
> > > > >> > >>> busy "while" loop with interrupts disabled, and because of the panic,
> > > > >> > >>> it never succeeds.
> > > > >
> > > > > I am a bit surprised that it never succeeds. The panic CPU takes over
> > > > > the ownership but it releases it when the messages are flushed. And
> > > > > the original owner should be able to reacquire it in this case.
> > > >
> > > > The problem is that other_cpu_in_panic() will return true forever, which
> > > > will cause _all_ acquires to fail forever. Originally we did allow
> > > > non-panic to take over again after panic releases ownership. But IIRC we
> > > > removed that capability because it allowed us to reduce a lot of
> > > > complexity. And now nbcon_waiter_matches() relies on "Lower priorities
> > > > are ignored during panic() until reboot."
> > >
> > > Great catch! I forgot it. And it explains everything.
> > >
> > > It would be nice to mention this in the commit message or
> > > in the comment above nbcon_reacquire_nobuf().
> > >
> > > My updated prosal of the comment is:
> > >
> > >  * Return:	True when the context reacquired the owner ship. The caller
> > >  *		might try entering the unsafe state and restore the original
> > >  *		console device setting. It must not access the output buffer
> > >  *		anymore.
> > >  *
> > >  *		False when another CPU is in panic(). nbcon_try_acquire()
> > >  *		would never succeed and the infinite loop would	prevent
> > >  *		stopping this CPU on architectures without proper NMI.
> > >  *		The caller should bail out immediately without
> > >  *		touching the console device or the output buffer.
> > >
> > > Best Regards,
> > > Petr
> >
> > Thank you for your comments and suggestions.
> >
> > After consideration,
> > I believe that there is no problem with forcibly terminating the process when
> > nbcon_reacquire_nobuf returns false at the pl011 driver level,
> > as in the test patch.
> >
> > It feels a bit harsh that a thread which started processing before the panic
> > and then transferred ownership to an atomic operation isn't allowed to perform
> > cleanup during panic handling or the grace period before the CPU halts.
> >
> > I would like to hear your opinion on this.
> > If nbcon_reacquire_nobuf() could acquire ownership even after the panic,
> > then driver-side modifications might not be necessary.
> > (The responsibility to complete the process without hindering the panic process
> > would still remain.)
> >
> > Would it be difficult to make an exception to the rule,
> >   "Lower priorities are ignored during panic() until reboot,"
> > depending on the situation?
> 
> Good question.
> 
> The following two problems came to my mind:
> 
> 1. As John, pointed out, the fact that non-panic CPUs are not
>    able to acquire the context allowed to simplify the implementation.
> 
>    I think that it is primary about nbcon_waiter_matches(),
>    nbcon_owner_matches(), and the related logic. It was documented by
>    the commit 8c9dab2c55ad7 ("printk: nbcon: Clarify rules of
>    the owner/waiter matching").
> 
>    But it seems that nbcon_owner_matches() is safe even without the rule.
>    The race is prevented either by disabling interrupts and preemption
>    or by taking device_lock().
> 
>    The rule prevents a race in nbcon_waiter_matches(). But it seems
>    that in the worst case, more CPUs might end up busy waiting.
>    And it would be acceptable during panic().
> 
>    So, this need not be a big problem in the end.
> 
> 
> 2. If we allowed non-panic() CPUs to acquire the ownership, it would
>    increase the risk that the panic CPU will not be able to
>    flush the messages.
> 
>    But maybe, the problem is only when the architecture supports
>    proper NMI and non-panic CPUs might be stopped anywhere.
> 
>    It should be less problem on architectures without proper NMI
>    where the non-panic CPU could not be stopped in the problematic
>    situation.
> 
>    So, maybe, we could relax the rule on architectures without
>    proper NMI.
> 
> 
> The question is if it is worth it. Is the clean up really important?
> 
> Note that the clean up will never be guaranteed on architectures with
> a proper NMI. They would stop the non-panic CPUs, including the printk
> kthread, anywhere.
> 
> And I guess that the console devices will be initialized after the
> reboot anyway.
> 

Understood.

At least for pl011, a forced shutdown does not cause any problems.
(Since clk remains enabled and CR maintains a transmittable state,
it does not hinder the panic handling process.)

I'll create a revised patch based on the patch I posted for testing this time.
Please allow me some time.

When I submit the patch,
I would appreciate your continued assistance with reviews and other aspects.

Regards,
Toshiyuki Sato