Problem with nbcon console and amba-pl011 serial port

Fri Jun 6 07:01:25 PDT 2025

On Fri 2025-06-06 12:25:35, John Ogness wrote:
> On 2025-06-05, Petr Mladek <pmladek at suse.com> wrote:
> > The question is if it is worth it. Is the clean up really important?
> 
> I must admit that I am not happy about encouraging the proposed solution
> so far

I am not super happy either. But let me play the devil advocate
for a bit longer ;-)

> (i.e. expecting driver authors to create special unsafe code in
> the panic situation).

The "usafe code" sounds too strong to me. The console driver is
supposed to "do nothing" and just leave when nbcon_reacquire() fails.

> It leads down the "hope and pray" path that nbcon
> was designed to fix.

My understanding is that we "hope and pray" to show the messages
on the console. And it should work because the ownership was
lost only in a safe state.

If the ownership was lost in the unsafe state then it might
be even bigger gamble to do anything in parallel.

Of course, another panic priority is to provide a crash dump
or reboot. But it hopefully should not depend on a console
driver clean up.

> What if during non-panic-CPU shutdown, we allow reacquires to succeed
> only for _direct_ acquires? The below diff shows how this could be
> implemented. Since it only supports direct acquires, it does not violate
> any state rules. And also, since it only involves the reacquire, there
> is no ugly battling for console printing between the panic and non-panic
> CPUs.

Interesting idea. I thought a lot about it, see below.

> diff --git a/include/linux/printk.h b/include/linux/printk.h
> index 5b462029d03c1..d58ebdc8170b3 100644
> --- a/include/linux/printk.h
> +++ b/include/linux/printk.h
> diff --git a/kernel/panic.c b/kernel/panic.c
> index b0b9a8bf4560d..8f572630c9f7e 100644
> --- a/kernel/panic.c
> +++ b/kernel/panic.c
> @@ -304,6 +310,8 @@ static void panic_other_cpus_shutdown(bool crash_kexec)
>  		smp_send_stop();
>  	else
>  		crash_smp_send_stop();
> +
> +	nbcon_panic_allow_reacquire_set(false);
>  }

I have two concerns here:

1. I wonder whether this is reliable enough. It seems that
   smp_send_stop() waits at least 1 sec until the CPUs
   get stopped. But is this enough on virtualized systems?

2. It might increase a risk when CPUs are stopped using NMI.
   The change would allow a non-panic CPU to reacquire the ownership
   and enter _unsafe_ section right before being stopped by NMI.

The 1st problem might be avoided by allowing the reacquire all
the time unless an "unsafe" takeover happened.

The 2nd problem is worse. But allowing the reacquire all the time
might actually help as well.

Note that the information about the "unsafe_takeover" is stored
in struct console so that we even won't need a new global
variable.

>  /**
> diff --git a/kernel/printk/nbcon.c b/kernel/printk/nbcon.c
> index d60596777d278..d960cb8a05558 100644
> --- a/kernel/printk/nbcon.c
> +++ b/kernel/printk/nbcon.c
> @@ -235,7 +235,8 @@ static void nbcon_seq_try_update(struct nbcon_context *ctxt, u64 new_seq)
>   *			the handover acquire method.
>   */
>  static int nbcon_context_try_acquire_direct(struct nbcon_context *ctxt,
> -					    struct nbcon_state *cur)
> +					    struct nbcon_state *cur,
> +					    bool ignore_other_cpu_in_panic)
>  {
>  	unsigned int cpu = smp_processor_id();
>  	struct console *con = ctxt->console;
> @@ -249,7 +250,7 @@ static int nbcon_context_try_acquire_direct(struct nbcon_context *ctxt,
>  		 * nbcon_waiter_matches(). In particular, the assumption that
>  		 * lower priorities are ignored during panic.
>  		 */
> -		if (other_cpu_in_panic())
> +		if (other_cpu_in_panic() && !ignore_other_cpu_in_panic)

If you agree with allowing the reacquire all the time then I would
rename the parameter to @is_reacquire and do something like:

	if (other_cpu_in_panic() &&
	   (!is_reacquire || cur->unsafe_takeover))

>  			return -EPERM;
>  
>  		if (ctxt->prio <= cur->prio || ctxt->prio <= cur->req_prio)
> @@ -913,7 +920,7 @@ void nbcon_reacquire_nobuf(struct nbcon_write_context *wctxt)
>  {
>  	struct nbcon_context *ctxt = &ACCESS_PRIVATE(wctxt, ctxt);
>  
> -	while (!nbcon_context_try_acquire(ctxt))
> +	while (!nbcon_context_try_acquire(ctxt, READ_ONCE(nbcon_panic_allow_reacquire)))

And here it would be:

	while (!nbcon_context_try_acquire(ctxt, true))

>  		cpu_relax();
>  
>  	nbcon_write_context_set_buf(wctxt, NULL, 0);

Summary:

I open to give this alternative approach a chance when we allow the
reacquire all the time. It might work well. And it won't require
any special "panic" handling in all console drivers.

Best Regards,
Petr