[PATCH] arm64: head: Fix cache inconsistency of the identity-mapped region

Thu Apr 21 04:47:20 PDT 2022

Hi Shanker,

First of all, thanks for continuing to share so many useful details here.
I'm pretty confident we'll eventually get to the bottom of this.

On Wed, Apr 20, 2022 at 10:02:13AM -0500, Shanker R Donthineni wrote:
> On 4/20/22 5:08 AM, Will Deacon wrote:
> > Non-coherent? You mean non-cacheable, right? At this stage, we only
> >>> have a single CPU, so I'm not sure coherency is the problem here. When
> >>> you say 'drop', is that an eviction? Replaced by what? By the previous
> >>> version of the cache line, containing the stale value?
> >> Yes,non-cacheable. The cache line corresponding to function enter_vhe() was
> >> marked with shareable & WB-cache as a result of write to RELA, the same cache
> >> line is being fetched with non-shareable & non-cacheable. The eviction is
> >> not pushed to PoC and got dropped because of cache-line attributes mismatch.
> > I'm really struggling to understand this. Why is the instruction fetch
> > non-shareable? I'm trying to align your observations with the rules about
> > mismatched aliases in the architecture and I'm yet to satisfy myself that
> > the CPU is allowed to drop a dirty line on the floor in response to an
> > unexpected hit.
> >
> > My mental model (which seems to align with Marc) is something like:
> >
> >
> > 1. The boot CPU fetches the line via a cacheable mapping and dirties it
> >    in its L1 as part of applying the relocations.
> 
> ARM-CPU core is sending ReadNotSharedDirty CHI command to LLC (last-level-cache).
> This cache-line is marked as checked-out in LLC, would be used to keep track
> of coherency.

Oh man, that CHI stuff always sends me into a spin so I'm probably not the
right person to debug it, but let's see...

> > 2. The boot CPU then enters EL2 with the MMU off and fetches the same
> >    line on the I-side. AFAICT, the architecture says this is done with
> >    outer-shareable, non-cacheable attributes.
> 
> ARM-CPU core is sending ReadNoSnoop CHI command when MMU disabled. The
> marked cache-line from the step 1) become invalid in LLC.

When you say "invalid" here, do you mean in the same sense as "cache
invalidation"? Why is that the right thing to do in response to a _read_?

Would the same invalidation happen if the ReadNoSnoop originated from a
different CPU?

> As per the ARM-ARM specification, CMO is recommended whenever memory
> attributes are changed for a given memory region.

The mismatched attribute section (B2.8) is quite a lot more nuanced than
that, and in particular I don't see it allowing for writes to be lost
in the presence of a mismatched read. Then again, the whole section is
hard to follow and doesn't really call out instruction fetch other than
in a note at the end.

> With my understating the CPU core must generate coherent accesses for
> Shared+Cacheable memory but not clear for OSH+non-cacheable regions
> in the spec.

Ok, but we don't need the reloc to be visible to the MMU-off fetch in
the case you're describing so coherency between the two aliases is not
required. The problem is that the cacheable data is being lost forever,
so even cacheable accesses can't retrieve it.

> Are you expecting "OSH+non-cacheable" must generate coherent accesses?
> 
> > 3. !!! Somehow the instruction fetch results in the _dirty_ line from (1)
> >        being discarded !!!
> 
> The ARM-CPU is sending WritebackFull CHI command to LLC for the cache line
> which was marked as invalid from step 2). The write CMD is ignored/dropped.

Is this because the LLC is inclusive and doesn't know what to do with the
writeback after invalidating the line earlier on? When else would a
writeback get silently dropped?

Stepping back a bit, my concern with all of this is that it seems as though
non-cacheable instruction fetches that alias with dirty lines on the data
side can lead to data loss. This feels like a significantly broader issue
than the relocation case you've run into, particularly as instruction fetch
with the MMU off is still permitted to fetch speculatively from within a
couple of pages of the PC. Cache maintenance doesn't feel like a viable
solution in general, as we'd need to ensure that no fetch occurs between
the write and the CMO, and this could occur on another CPU, off the back
of an IRQ or even a FIQ (with the fetch being made at a different exception
level).

Given that you have all the low-level details about the problem, it would
probably be worth pulling Arm into this so we can figure out what the right
solution is. I'm happy to be involved with that, if you like.

Will