[PATCH] firmware: arm_scmi: Queue in scmi layer for mailbox implementation
Justin Chen
justin.chen at broadcom.com
Tue Oct 8 12:40:37 PDT 2024
On 10/8/24 12:23 PM, Justin Chen wrote:
>
>
> On 10/8/24 6:37 AM, Cristian Marussi wrote:
>> On Tue, Oct 08, 2024 at 02:34:59PM +0100, Sudeep Holla wrote:
>>> On Tue, Oct 08, 2024 at 02:23:00PM +0100, Sudeep Holla wrote:
>>>> On Tue, Oct 08, 2024 at 01:10:39PM +0100, Cristian Marussi wrote:
>>>>> On Mon, Oct 07, 2024 at 10:58:47AM -0700, Justin Chen wrote:
>>>>>> Thanks for the response. I'll try to elaborate.
>>>>>>
>>>>>> When comparing SMC and mailbox transport, we noticed mailbox
>>>>>> transport
>>>>>> timesout much quicker when under load. Originally we thought this
>>>>>> was the
>>>>>> latency of the mailbox implementation, but after debugging we
>>>>>> noticed a
>>>>>> weird behavior. We saw SMCI transactions timing out before the
>>>>>> mailbox even
>>>>>> transmitted the message.
>>>>>>
>>>>>> This issue lies in the SCMI layer. drivers/firmware/arm_scmi/driver.c
>>>>>> do_xfer() function.
>>>>>>
>>>>>> The fundamental issue is send_message() blocks for SMC transport, but
>>>>>> doesn't block for mailbox transport. So if send_message() doesn't
>>>>>> block we
>>>>>> can have multiple messages waiting at
>>>>>> scmi_wait_for_message_response().
>>>>>>
>>>>>
>>>>> oh...yes...now I can see it...tx_prepare is really never called given
>>>>> how the mailbox subsystem de-queues messages once at time...so we
>>>>> end up
>>>>> waiting for a reply to some message that is still to be sent...so the
>>>>> message inflight is really NOT corrupted because the next remain
>>>>> pending
>>>>> until the reply in the shmem is read back , BUT the timeout will
>>>>> drift away
>>>>> if you multiple inflights are pending to be sent...
>>>>>
>>>>
>>>> Indeed.
>>>>
>>>>>> SMC looks like this
>>>>>> CPU #0 SCMI message 0 -> calls send_message() then calls
>>>>>> scmi_wait_for_message_response(), timesout after 30ms.
>>>>>> CPU #1 SCMI message 1 -> blocks at send_message() waiting for SCMI
>>>>>> message 0
>>>>>> to complete.
>>>>>>
>>>>>> Mailbox looks like this
>>>>>> CPU #0 SCMI message 0 -> calls send_message(), mailbox layer
>>>>>> queues up
>>>>>> message, mailbox layer sees no message is outgoing and sends it.
>>>>>> CPU waits
>>>>>> at scmi_wait_for_message_response(), timesout after 30ms
>>>>>> CPU #1 SCMI message 1 -> calls send_message(), mailbox layer
>>>>>> queues up
>>>>>> message, mailbox layer sees message pending, hold message in
>>>>>> queue. CPU
>>>>>> waits at scmi_wait_for_message_response(), timesout after 30ms.
>>>>>>
>>>>>> Lets say if transport takes 25ms. The first message would succeed,
>>>>>> the
>>>>>> second message would timeout after 5ms.
>>>>>>
>>>>>> Hopefully this makes sense.
>>>>>
>>>>> Yes, of course, thanks, for reporting this, and taking time to
>>>>> explain...
>>>>>
>>>>> ...in general the patch LGTM...I think your patch is good also
>>>>> because it
>>>>> could be easily backported as a fix....can you add a Fixes tag in your
>>>>> next version ?
>>>>>
>>>>
>>>> Are you seeing this issue a lot ? IOW, do we need this to be
>>>> backported ?
>>>>
>
> I wouldn't say a lot. But we are seeing it with standard use of our
> devices running over an extended amount of time. Yes we would like this
> backported.
>
Any advice on what Fixes tag to add here? Since the driver was moved to
the transport folder, I assume this is going to be a manual backport.
Not sure if the Fixes tag should be when the SCMI MBOX transport was
moved to a standalone driver in the transport folder or the very first
SCMI commits since the MBOX transport was added then.
Thanks,
Justin
<snip>
More information about the linux-arm-kernel
mailing list