❌ FAIL: Test report for kernel 5.13.0-rc4 (arm-next, 8124c8a6)

Will Deacon will at kernel.org
Fri Jun 11 03:34:57 PDT 2021


On Thu, Jun 10, 2021 at 01:59:12PM +0200, Veronika Kabatova wrote:
> On Thu, Jun 3, 2021 at 12:44 PM Veronika Kabatova <vkabatov at redhat.com> wrote:
> >
> > On Wed, Jun 2, 2021 at 7:10 PM Will Deacon <will at kernel.org> wrote:
> > >
> > > On Wed, Jun 02, 2021 at 01:00:47PM +0200, Veronika Kabatova wrote:
> > > > On Wed, Jun 2, 2021 at 12:51 PM Will Deacon <will at kernel.org> wrote:
> > > > > On Wed, Jun 02, 2021 at 12:40:07PM +0200, Ard Biesheuvel wrote:
> > > > > > On Wed, 2 Jun 2021 at 12:12, Will Deacon <will at kernel.org> wrote:
> > > > > > > On Wed, Jun 02, 2021 at 01:35:01AM -0000, CKI Project wrote:
> > > > > > > >      stress: stress-ng
> > > > > > >
> > > > > > > This explodes pretty badly. Some CPUs detect RCU stalls when trying to use
> > > > > > > the EFI "efi_read_time" service, which eventually fails, but soon afterwards
> > > > > > > we explode trying to access memory which I think is mapped by
> > > > > > > acpi_os_ioremap(), so it looks like the f/w might be the culprit here. Is
> > > > > > > the "HPE Apollo 70" machine known to have bad EFI firmware?
> > > > > > >
> > > > > > > https://arr-cki-prod-datawarehouse-public.s3.amazonaws.com/datawarehouse-public/2021/06/01/313156257/build_aarch64_redhat%3A1310052388/tests/stress_stress_ng/10079827_aarch64_2_dmesg.log
> > > > > > >
> > > > > > > (scroll to the end for the fireworks)
> > > > > > >
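[Note: the "efi_read_time" mentioned above is the EFI RTC read path: the
rtc-efi driver fetches the time through the kernel's efi.get_time hook, which
ends up in the firmware's GetTime() runtime service. The sketch below is a
simplified illustration of that path, not the exact drivers/rtc/rtc-efi.c
source; since the CPU is executing firmware code for the duration of the
call, a GetTime() implementation that never returns or scribbles on memory
can show up as RCU stalls on the calling CPU.]

/*
 * Simplified sketch of the EFI RTC read path (modelled on drivers/rtc/rtc-efi.c,
 * not a verbatim copy). efi.get_time is the kernel hook for the firmware's
 * GetTime() runtime service, so the calling CPU is at the mercy of the
 * firmware until this call returns.
 */
#include <linux/efi.h>
#include <linux/rtc.h>

static int efi_read_time_sketch(struct device *dev, struct rtc_time *tm)
{
	efi_time_t eft;
	efi_time_cap_t cap;
	efi_status_t status;

	/* Call into UEFI runtime services, i.e. firmware code. */
	status = efi.get_time(&eft, &cap);
	if (status != EFI_SUCCESS)
		return -EINVAL;

	/* Translate the EFI time structure into the kernel's rtc_time. */
	tm->tm_year = eft.year - 1900;
	tm->tm_mon  = eft.month - 1;
	tm->tm_mday = eft.day;
	tm->tm_hour = eft.hour;
	tm->tm_min  = eft.minute;
	tm->tm_sec  = eft.second;

	return 0;
}
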
> > > > > >
> > > > > > Wow that looks pretty horrible. I take it this tree has your MAIR changes?
> > > > >
> > > > > Nope, this is just vanilla -rc4! I'm trying to get a "known good" base
> > > > > before I throw all the new things at it :)
> > > > >
> > > > > > Would be useful to have a log with efi=debug, to see what the EFI
> > > > > > memory map looks like.
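[Note: "efi=debug" on the kernel command line sets the EFI_DBG flag in
efi.flags, which makes the early EFI code print every descriptor of the
firmware-provided memory map during boot. Below is a rough sketch of the
idea using the generic helpers from <linux/efi.h>; the kernel's own printing
is more detailed and includes decoded type/attribute strings.]

#include <linux/efi.h>
#include <linux/init.h>
#include <linux/printk.h>

/*
 * Rough illustration only: walk the EFI memory map and print each region
 * when the kernel was booted with efi=debug (EFI_DBG set).
 */
static void __init dump_efi_memmap_sketch(void)
{
	efi_memory_desc_t *md;

	if (!efi_enabled(EFI_DBG))
		return;

	pr_info("Processing EFI memory map:\n");
	for_each_efi_memory_desc(md) {
		u64 start = md->phys_addr;
		u64 end = start + (md->num_pages << EFI_PAGE_SHIFT) - 1;

		pr_info("  0x%012llx-0x%012llx type=%u attr=0x%llx\n",
			start, end, md->type, md->attribute);
	}
}
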
> > > > >
> > > > > Veronika -- please could you help us with that?
> > > >
> > > > Sure, I'll get a rerun with that option and report back when I have any
> > > > results. I'm also planning a plain rerun on the machine to see if it
> > > > reproduces somewhat reliably; however, the machine is taken up by
> > > > other automation now, so it will take a while.
> > >
> > > Thanks. In the meantime, I've pushed a bunch of new stuff into for-kernelci,
> > > so I can at least see if it regresses when compared to the three failures
> > > we're seeing here.
> > >
> >
> > Hi,
> >
> > I don't have very good news so far. We did 4 targeted runs with the machine
> > and weren't able to reproduce the panic. However, a panic was hit in the new
> > test run you should have in your inbox, and it also reproduced in a
> > completely unrelated test run with *this* kernel (not the new one). In all 3
> > cases the HW model is the same, but they were all different machines.
> >
> > I'm currently doing a full run which includes all the tests from that run,
> > not just stress-ng, to see if it reproduces that way - there was a panic case
> > last year (not ARM specific :) that we weren't able to pin down to a nice
> > reproducer and had to run multiple tests to trigger, so it's possible this
> > one is similar. If this strategy works, I'll try to pare down the tests and
> > keep you updated.
> >
> 
> I just wanted to follow up here. Outside of the single run I mentioned
> previously, we are still unable to reproduce the panic. We tried a lot of
> runs on various machines of the model that hit it, with both full test
> runs and stress-ng-only runs.
> 
> We'll still reach out if we manage to hit it in the future, but it looks like
> a race condition that's not easy to reproduce. Of course, if anyone has
> an idea we should try (whether for reproducing it or for debugging the
> underlying problem), we can try that.

Thanks for the follow-up, Veronika. I also noticed that it seems to have
disappeared from subsequent runs :/

Will


