[BUG] New arm scmi check in linux-next causing rk3568 not to boot due to firmware bug

Thu May 5 02:40:09 PDT 2022

Hello Nicolas, Cristian,

On Thu, 5 May 2022 at 10:03, Cristian Marussi <cristian.marussi at arm.com> wrote:
>
> On Wed, May 04, 2022 at 07:51:45PM +0200, Nicolas Frattaroli wrote:
> > On Mittwoch, 4. Mai 2022 15:21:30 CEST Sudeep Holla wrote:
> > > + Cristian
>
> +Etieenne
>
> Hi Nicolas,
>
> > >
> > > Hi Nicolas,
> > >
> > > Thanks for the formal report.
> > >
> > > On Wed, May 04, 2022 at 02:49:07PM +0200, Nicolas Frattaroli wrote:
> > > > Good day,
> > > >
> > > > a user on the #linux-rockchip channel on the Libera.chat IRC network
> > > > reported that their RK3568 was no longer getting a CPU and GPU clock
> > > > from scmi and consequently not booting when using linux-next. This
> > > > was bisected down to the following commit:
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/drivers/firmware/arm_scmi/base.c?h=next-20220503&id=3b0041f6e10e5bdbb646d98172be43e88734ed62
> > > >
> > > > The error message in the log is as follows:
> > > >
> > > > arm-scmi firmware:scmi: Malformed reply - real_sz:8  calc_sz:4, t->rx.len is 12, sizeof(u32) is 4, loop_num_ret is 3
> > > >
> > > > The rockchip firmware (bl31) being used was v1.32, from here:
> > > >
> > > > https://github.com/JeffyCN/rockchip_mirrors/blob/rkbin/bin/rk35/rk3568_bl31_v1.32.elf
> > > >
> > >
> > > So this platform is not supported in upstream TF-A like its predecessors ?
> >
> > Hello,
> >
> > it is not yet supported by upstream. Rockchip plans to release the sources
> > for it at some point if I recall correctly, but I believe their software
> > team has been very busy due to new hardware releases, so it hasn't happened
> > yet. I hope we'll see an open source release of the TF-A sources eventually,
> > so that for bugs like this we can always fix them without the vendor needing
> > to do it for us.
> >
> > >
> > > > This seems like a non-fatal firmware bug, for which a kernel workaround is
> > > > certainly possible, but it would be good if rockchip could fix this in their
> > > > firmware.
> > > >
> > >
> > > Indeed, we added this check finding issue in one of our tests. Luckily
> > > it helped to unearth the same issue on this platform, but due to the
> > > nature of its f/w release, it is bit unfortunate that it can't be fixed
> > > easily and quickly. But I really wish this gets fixed in the firmware.
> > > Are there any other f/w bugs reported so far ? If so how are they fixed
> > > as I don't expect all such bugs can be worked around in the kernel though
> > > this might be. I would like to hear details there if possible.
> >
> > I'm not aware of how the rockchip bug report workflow works. They seemingly
> > did update the firmware multiple times, last in October of 2021.
> >
> > The official rockchip repository at [1] hasn't been kept as up to date as
> > the mirror by a rockchip employee at [2], so most people seem to have been
> > using the latter. Speaking of which, I'll add the owner of that repo to
> > the CC of this thread to make sure this doesn't get lost.
> >
> > Rockchip lists an e-mail at [3] for reporting issues at, but this seems to
> > relate to their open-source documentation. The official github repository
> > of "rkbin" on the "rockchip-linux" organisation does not have issues
> > enabled, so submitting a bug report through that is unfortunately not
> > possible.
>
> Having a quick look at TF-A SCMI code in charge of this message (at least in
> the upstream):
>
> https://git.trustedfirmware.org/TF-A/trusted-firmware-a.git/tree/drivers/scmi-msg/base.c#n136
>
> it seems to me that the bug lies in the fact that the BASE_DISCOVER_PROTOCOLS
> response message built by the function above is not properly sized: the trailing
> message payload carrying the list of protocols (after returned_protos field) returns
> always a fixed arbitrarily sized payload, possibly zeroed if fewer protocols are
> carried.
>
> IOW, even though the answer in this case carries 3 items/protocols, the payload
> is anyway 8 bytes (with the last 5 bytes zeroed), while by the spec it should have
> been just 4 bytes.
>
> (in fact testing the kernel fix on a JUNO with last SCP fw release did NOT expose
> any issue...)
>
> I think a fix FW side could be something along these lines (UNTESTED NOR
> BUILT ! ... I Cc'ed Etienne that seems the author of this bit)
>
> This basically mirrors the same checks introduced in kernel...if someone
> is fancy/able to test it....

Indeed the firmware implementation is wrong in TF-A.
And also in OP-TEE by the way:
https://github.com/OP-TEE/optee_os/blob/3.17.0/core/drivers/scmi-msg/base.c#L163-L166

@Nicoals, do you want to send a patch to TF-A, or do you want me to do it?

I can fix the optee_os implementation. I'll tell you when I'll have
created a P-R.
The fix is the same for TF-A and OP-TEE.
Proposal from Cristian looks good to me, maybe simplified:

```patch
         memcpy(outargs, &p2a, sizeof(p2a));
         memcpy(outargs + sizeof(p2a), list + a2p->skip, count);

-        scmi_write_response(msg, outargs, sizeof(outargs));
+        list_sz = (1 + (count - 1) / sizeof(uint32_t)) * sizeof(uint32_t);
+        scmi_write_response(msg, outargs, sizeof(p2a) + list_sz);
```

>
> diff --git a/drivers/scmi-msg/base.c b/drivers/scmi-msg/base.c
> index 2d7203451..35c99e308 100644
> --- a/drivers/scmi-msg/base.c
> +++ b/drivers/scmi-msg/base.c
> @@ -142,6 +142,7 @@ static void discover_list_protocols(struct scmi_msg *msg)
>         uint8_t outargs[sizeof(p2a) + MAX_PROTOCOL_IN_LIST] = { 0U };
>         const uint8_t *list = NULL;
>         unsigned int count = 0U;
> +       size_t list_sz = 0U;
>
>         if (msg->in_size != sizeof(*a2p)) {
>                 scmi_status_response(msg, SCMI_PROTOCOL_ERROR);
> @@ -163,9 +164,12 @@ static void discover_list_protocols(struct scmi_msg *msg)
>         p2a.num_protocols = count;
>
>         memcpy(outargs, &p2a, sizeof(p2a));
> -       memcpy(outargs + sizeof(p2a), list + a2p->skip, count);
> -
> -       scmi_write_response(msg, outargs, sizeof(outargs));
> +       if (count) {
> +               memcpy(outargs + sizeof(p2a), list + a2p->skip, count);
> +               list_sz = (1 + (count - 1) / sizeof(uint32_t)) *
> +                          sizeof(uint32_t);
> +       }
> +       scmi_write_response(msg, outargs, sizeof(p2a) + list_sz);
>  }
>
>
> >
> > >
> > > > The user going by "amazingfate" reported that commenting out the
> > > >   ret = -EPROTO; break;
> > > > fixes the issue for them.
> > > >
> > >
> > > Sure, or we could relax the check as calc_sz <= real_sz or something so
> > > that the reverse is still caught and handled as OS might read junk data in
> > > the later case.
> >
> > This seems like a good solution, that way we're unlikely to ever run into a
> > situation where the kernel does the wrong thing here even if we're less
> > strict about the check. In either case, it should print a dev_err though,
> > it's still an error even if we can tolerate it in some cases.
> >
> > >
>
> Beside fixing the FW, adding a workaround like the one Sudeep mentioned to
> avoid killing old fw plaform seems reasonable: we can just _err() as long as
> Kernel is not put into peril (i.e. calc_sz <= real_sz ) without
> necessarily bail out if an out of spec, but harmless, message is
> spotted.
>
> > > > I'm writing here to get the discussion started on how we can resolve this
> > > > before the Linux 5.19 release.
> > > >
> > >
> > > Agreed, I just sent by pull request for this literally few hours ago.
> > >
> > > > Sudeep Holla has already told me they'll gladly add a workaround before
> > > > the 5.19 release, but would rather see this fixed in the vendor firmware
> > > > first. Would rockchip be able and willing to fix it and publish a new
> > > > bl31 for rk3568?
> > > >
> > >
> > > Indeed and as mentioned above details on how other such f/w bugs are dealt
> > > in general esp that the firmware is blob release and one can't fix it easily.
> > > Do we have a bugzilla kind of setup to report and get the bugs fixed ?
> >
> > It's worth mentioning that I think even if we get Rockchip to fix the bug in
> > the firmware, I believe Linux should still add a workaround, as otherwise
> > people running older firmware who are upgrading their kernels could suddenly
> > have unbootable systems and don't know why that happened.
> >
>
> Agree.

Agree also. Sorry for that bug in TF-A/OP-TEE.

BR,
Etienne

>
> Thanks,
> Cristian