[EXT] Re: [PATCH] clocksource: Add Marvell Errata-38627 workaround

Marc Zyngier maz at kernel.org
Sun Jul 11 02:57:34 PDT 2021


On Thu, 08 Jul 2021 11:48:18 +0100,
Bharat Bhushan <bbhushan2 at marvell.com> wrote:
> 
> Hi Marc,
> 
> Similar questions are asked by Mark, response might be duplicated.

Mark had a ton of very good questions, so I won't repeat them. Some
more below though:

> > -----Original Message-----
> > From: Marc Zyngier <maz at kernel.org>
> > Sent: Monday, July 5, 2021 2:57 PM
> > To: Bharat Bhushan <bbhushan2 at marvell.com>
> > Cc: catalin.marinas at arm.com; will at kernel.org; daniel.lezcano at linaro.org;
> > mark.rutland at arm.com; konrad.dybcio at somainline.org;
> > saiprakash.ranjan at codeaurora.org; robh at kernel.org; marcan at marcan.st;
> > suzuki.poulose at arm.com; broonie at kernel.org;
> > linux-arm-kernel at lists.infradead.org; linux-kernel at vger.kernel.org;
> > Linu Cherian <lcherian at marvell.com>
> > Subject: [EXT] Re: [PATCH] clocksource: Add Marvell Errata-38627 workaround
> > 
> > On Mon, 05 Jul 2021 07:08:43 +0100,
> > Bharat Bhushan <bbhushan2 at marvell.com> wrote:
> > >
> > > The CPU pipeline has unpredictable behavior when a timer interrupt appears
> > > and then disappears prior to the exception happening.
> > 
> > What kind of unpredictable behaviours?  
> 
> This is a race condition where an instruction (except store, system,
> load atomic and load exclusive) becomes a "nop" if the interrupt appears
> and disappears before being taken by the CPU. This can lead to GPR
> corruption. For example, the interrupt appears after the atomic load
> instruction starts executing and disappears before the atomic load
> instruction completes; in that case an instruction (not all) can become
> a "nop". As the interrupt disappears before the atomic instruction
> completes, the CPU continues to execute and will take a stale value from
> a register, as another dependent instruction became a "nop".

So here's what I understand from the above:

- Interrupts being a context synchronisation event, the CPU deals with
  them by preventing in-flight instructions from having any effect
  (what you above describe as becoming NOP).

- If the interrupt is recalled before the exception entry can take
  place, the exception doesn't occur, but the discarded instructions
  are not replayed, leaving the program in an inconsistent state.
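
A toy C model of that reading, purely to make the two bullets above
concrete. Everything in it (struct toy_cpu, toy_run(), the outcome names)
is hypothetical and says nothing about how the real pipeline is built:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: one register and a count of exceptions taken. */
struct toy_cpu {
	int reg;
	unsigned int exceptions;
};

enum toy_outcome {
	TOY_RETIRED,	/* no interrupt: the instruction completes normally */
	TOY_REPLAYED,	/* exception taken, instruction re-executed on return */
	TOY_LOST,	/* interrupt recalled: discarded and never replayed */
};

/*
 * Model a single in-flight "reg = value" instruction.
 * irq_asserted: an interrupt shows up while the instruction is in flight.
 * irq_recalled: the interrupt goes away before exception entry.
 */
static enum toy_outcome toy_run(struct toy_cpu *cpu, int value,
				bool irq_asserted, bool irq_recalled)
{
	assert(cpu != NULL);

	if (!irq_asserted) {
		cpu->reg = value;
		return TOY_RETIRED;
	}

	/* Interrupt seen: the in-flight instruction is discarded. */
	if (!irq_recalled) {
		/* Exception entry happens; the instruction runs again later. */
		cpu->exceptions++;
		cpu->reg = value;
		return TOY_REPLAYED;
	}

	/* Suspected erratum: no exception and no replay, reg stays stale. */
	return TOY_LOST;
}
```

In this model, only the (irq_asserted, irq_recalled) = (true, true) case
leaves the register with its old value.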

Is this interpretation correct? If so, I have more questions:

- Does the erratum trigger when interrupts are masked in PSTATE? Can
  this erratum be triggered by masking interrupts in PSTATE?

- What makes this specific to the timer? Why can't this be triggered
  with any other interrupt? Spurious interrupts do exist, and happen
  all the time, especially with level-triggered interrupts.

- What if *another* CPU masks the interrupt at the GIC redistributor
  level?

> > What happens if a guest isn't aware of the erratum or actively
> > tries to trigger it?
> 
> Errata applies to VM (EL1 virtual timer) as well. In addition
> extending the workaround to timer context save/restore in kvm seems
> to work.  Can you help if we are missing something in VM?

Maybe. First, I want to understand why this is specific to the timer,
and whether this can have any impact when already in an exception
context. I'm not convinced that this issue is specific to the timer
either.

Which revision of the architecture does this CPU implement? Depending
on whether the CPU runs VHE or not, we handle things slightly differently.

> > > The timer interrupt appears on timer
> > > expiry and disappears when the timer is reprogrammed or disabled. This
> > > typically can happen when a load instruction misses in the cache,
> > > which can take a few hundred cycles, and an interrupt appears after
> > > the load instruction starts executing but disappears before the load
> > > instruction completes.
> > >
> > > Workaround of this is to ensure maximum 2us of time
> > 
> > maximum? I'm not sure how you can bound this. Or did you mean *minimum*?
> 
> It is minimum
> 
> > 
> > How was this value obtained? What guarantees that it is safe?
> 
> H/w team suggested same

This doesn't answer my question.
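
For scale, what a 2us minimum means in counter ticks depends entirely on
the counter frequency. A minimal sketch of that conversion, assuming a
hypothetical helper name (toy_us_to_ticks()) that does not come from the
patch, and leaving the frequency to the caller:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of counter ticks needed to cover at least
 * `us` microseconds at `freq_hz`, rounding up so the bound is never
 * undershot. No overflow handling: fine for small `us` values only.
 */
static uint64_t toy_us_to_ticks(uint64_t us, uint64_t freq_hz)
{
	assert(freq_hz != 0);
	return (us * freq_hz + 999999u) / 1000000u;
}
```

With an assumed 100MHz counter, toy_us_to_ticks(2, 100000000) gives 200
ticks. That only restates the bound in other units; it says nothing
about why 2us would be sufficient.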

Thanks,

	M.

-- 
Without deviation from the norm, progress is not possible.


