[PATCH] firmware: arm_scmi: Queue in scmi layer for mailbox implementation
Cristian Marussi
cristian.marussi at arm.com
Tue Oct 8 06:37:31 PDT 2024
On Tue, Oct 08, 2024 at 02:34:59PM +0100, Sudeep Holla wrote:
> On Tue, Oct 08, 2024 at 02:23:00PM +0100, Sudeep Holla wrote:
> > On Tue, Oct 08, 2024 at 01:10:39PM +0100, Cristian Marussi wrote:
> > > On Mon, Oct 07, 2024 at 10:58:47AM -0700, Justin Chen wrote:
> > > > Thanks for the response. I'll try to elaborate.
> > > >
> > > > When comparing the SMC and mailbox transports, we noticed the mailbox transport
> > > > times out much more quickly when under load. Originally we thought this was the
> > > > latency of the mailbox implementation, but after debugging we noticed a
> > > > weird behavior. We saw SCMI transactions timing out before the mailbox even
> > > > transmitted the message.
> > > >
> > > > The issue lies in the SCMI layer, in the do_xfer() function in
> > > > drivers/firmware/arm_scmi/driver.c.
> > > >
> > > > The fundamental issue is that send_message() blocks for the SMC transport,
> > > > but doesn't block for the mailbox transport. So if send_message() doesn't
> > > > block, we can have multiple messages waiting at scmi_wait_for_message_response().
> > > >
> > >
> > > oh...yes...now I can see it...tx_prepare is really never called, given
> > > how the mailbox subsystem de-queues messages one at a time...so we end up
> > > waiting for a reply to some message that is still to be sent...so the
> > > message in flight is really NOT corrupted, because the next ones remain pending
> > > until the reply in the shmem is read back, BUT the timeout will drift away
> > > if multiple in-flight messages are pending to be sent...
> > >
> >
> > Indeed.
> >
> > > > SMC looks like this:
> > > > CPU #0 SCMI message 0 -> calls send_message() then calls
> > > > scmi_wait_for_message_response(), times out after 30ms.
> > > > CPU #1 SCMI message 1 -> blocks at send_message() waiting for SCMI message 0
> > > > to complete.
> > > >
> > > > Mailbox looks like this:
> > > > CPU #0 SCMI message 0 -> calls send_message(), mailbox layer queues up the
> > > > message, mailbox layer sees no message is outgoing and sends it. CPU waits
> > > > at scmi_wait_for_message_response(), times out after 30ms.
> > > > CPU #1 SCMI message 1 -> calls send_message(), mailbox layer queues up the
> > > > message, mailbox layer sees a message pending and holds the message in the
> > > > queue. CPU waits at scmi_wait_for_message_response(), times out after 30ms.
> > > >
> > > > Let's say the transport takes 25ms. The first message would succeed; the
> > > > second message would time out, since it is left with only 5ms of its 30ms
> > > > window once the first message completes.
> > > >
> > > > Hopefully this makes sense.
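
Right, and just to spell out for the archives what "blocking in send_message()"
would look like on the mailbox side, here is an untested sketch of the idea
(NOT necessarily identical to your patch, and the 'chan_lock' name and the
trimmed-down scmi_mailbox struct are purely illustrative): serialize senders on
the channel itself, so that send_message() blocks until the previous exchange
has fully completed, as already effectively happens for the SMC transport:

/*
 * Sketch only: serialize transmissions on the shmem channel inside the
 * mailbox transport, instead of letting the mailbox framework queue them.
 * This way the timeout in do_xfer() really measures this message's own
 * round-trip time.
 */
struct scmi_mailbox {
	struct mbox_client cl;
	struct mbox_chan *chan;
	struct scmi_chan_info *cinfo;
	struct scmi_shared_mem __iomem *shmem;
	struct mutex chan_lock;		/* serializes access to the channel */
};

static int mailbox_send_message(struct scmi_chan_info *cinfo,
				struct scmi_xfer *xfer)
{
	struct scmi_mailbox *smbox = cinfo->transport_info;
	int ret;

	/* Block here until the previous message has been fully processed */
	mutex_lock(&smbox->chan_lock);

	ret = mbox_send_message(smbox->chan, xfer);
	/* mbox_send_message() returns a non-negative token on success */
	if (ret < 0) {
		mutex_unlock(&smbox->chan_lock);
		return ret;
	}

	return 0;
}

static void mailbox_mark_txdone(struct scmi_chan_info *cinfo, int ret,
				struct scmi_xfer *__unused)
{
	struct scmi_mailbox *smbox = cinfo->transport_info;

	mbox_client_txdone(smbox->chan, ret);

	/* Only now let the next pending sender actually transmit */
	mutex_unlock(&smbox->chan_lock);
}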
> > >
> > > Yes, of course, thanks for reporting this and taking the time to
> > > explain...
> > >
> > > ...in general the patch LGTM...I think your patch is also good because it
> > > could be easily backported as a fix...can you add a Fixes tag in your
> > > next version?
> > >
> >
> > Are you seeing this issue a lot? IOW, do we need this to be backported?
> >
> > > Also, can you explain the issue and the solution in more detail in the commit
> > > message...that will help with getting it merged as a fix into the stable trees...
> > >
> > > ...for the future (definitely NOT in this series) we could probably think
> > > about getting rid of the sleeping mutex in favour of some other non-sleeping
> > > form of mutual exclusion around the channel (like in the SMC transport), and
> > > optionally enable Atomic transmission support, AND also review whether the
> > > shmem layer busy-waiting in tx_prepare is needed at all anymore...
> > >
> >
> > Agreed. If we are locking the channel in SCMI, we can drop the busy-waiting
> > in tx_prepare and the associated details in the comment, as this locking
> > makes them unnecessary. It is better to have both changes in the same patch
> > to indicate the relation between them.
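
To give a rough idea of the non-sleeping direction mentioned above (purely
illustrative, nothing like this exists in the mailbox transport today; the
'inflight'/'channel_*' names and INFLIGHT_NONE are invented for this sketch),
a channel guard usable also from atomic context could look roughly like:

#include <linux/atomic.h>
#include <linux/processor.h>

/* Illustrative non-sleeping channel guard, in the style of the SMC transport */
#define INFLIGHT_NONE	-1

static atomic_t inflight = ATOMIC_INIT(INFLIGHT_NONE);

/* Try to claim the channel for message @seq; true if we got it */
static bool channel_claim(int seq)
{
	return atomic_cmpxchg(&inflight, INFLIGHT_NONE, seq) == INFLIGHT_NONE;
}

static void channel_acquire(int seq)
{
	/* Busy-wait (no sleeping) until the previous owner releases the channel */
	spin_until_cond(channel_claim(seq));
}

static void channel_release(void)
{
	atomic_set(&inflight, INFLIGHT_NONE);
}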
>
> Actually, scratch that last point. The waiting in tx_prepare until the platform
> marks the shmem free for the agent to use is still needed. One use case is when
> the agent/OS times out but the platform continues to process and eventually
> releases the shmem. Sorry, I completely forgot about that.
>
Yes, indeed, that is the mechanism by which we avoid forcibly reclaiming the shmem
if the transmission times out...and we should keep it to avoid
corruption of newer messages by late replies from earlier ones that
have timed out.
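
For reference, the wait we are talking about is the polling on the channel
status that the shmem layer does in the tx_prepare path before writing a new
payload; simplified, it is something like the following (only a sketch, the
helper name is made up, not the exact code):

static void shmem_wait_for_free(struct scmi_shared_mem __iomem *shmem,
				unsigned int timeout_ms)
{
	ktime_t stop = ktime_add_ms(ktime_get(), timeout_ms);

	/*
	 * Wait (bounded) for the platform to mark the shared memory area
	 * free again, i.e. until it has finally read back the reply to a
	 * previously timed-out command, before overwriting it.
	 */
	spin_until_cond((ioread32(&shmem->channel_status) &
			 SCMI_SHMEM_CHAN_STAT_CHANNEL_FREE) ||
			ktime_after(ktime_get(), stop));

	/* If it is still busy the platform never released it: warn and bail */
	WARN_ON_ONCE(!(ioread32(&shmem->channel_status) &
		       SCMI_SHMEM_CHAN_STAT_CHANNEL_FREE));
}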
Thanks,
Cristian