[PATCH] ARM: decompressor: cover BSS in cache clean and reorder with MMU disable on v7

Russell King - ARM Linux admin linux at armlinux.org.uk
Sun Jan 24 10:21:27 EST 2021


On Sun, Jan 24, 2021 at 02:35:31PM +0100, Ard Biesheuvel wrote:
> So what I think is happening is the following:
> 
> In v5.7 and before, the set/way operations trap into KVM, which sets
> another trap bit to ensure that second trap occurs the next time the
> MMU is disabled. So if any cachelines are allocated after the call to
> cache_clean_flush(), they will be invalidated again when KVM
> invalidates the VM's entire IPA space.
> 
> According to DDI0406C.d paragraph B3.2.1, it is implementation defined
> whether non-cacheable accesses that occur with MMU/caches disabled may
> hit in the data cache.
> 
> So after v5.7, without set/way instructions being issued, the second
> trap is never set, and so the only cache clean+invalidate that occurs
> is the one that the decompressor performs itself, and the one that KVM
> does on the guest's behalf at cache_off() time is omitted. This
> results in clean cachelines being allocated that shadow the
> mini-stack, which are hit by the non-cacheable accesses that occur
> before the kernel proper enables the MMU again.
> 
> Reordering the clean+invalidate with the MMU/cache disabling prevent
> the issue, as disabling the MMU and caches first disables any mappings
> that the cache could perform speculative linefills from, and so the
> mini-stack memory access cannot be served from the cache.

This may be part of the story, but it doesn't explain all of the
observed behaviour.

First, some background...

We have three levels of cache on the Armada 8040 - there are the two
levels inside the A72 clusters, as designed by Arm Ltd. There is a
third level designed by Marvell which is common to all CPUs, which is
an exclusive cache. This means that if the higher levels of cache
contain a cache line, the L3 cache will not.
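As a toy illustration of the exclusivity property (a hypothetical Python sketch, not a model of Marvell's actual implementation): allocating a line into L1/L2 removes it from L3, and a clean victim evicted from L1/L2 is installed into L3.

```python
# Toy model of an exclusive last-level cache: a line lives either in
# the A72-internal levels (L1/L2) or in the L3, never in both.
class ExclusiveHierarchy:
    def __init__(self):
        self.l1l2 = set()   # lines held in the core-side caches
        self.l3 = set()     # lines held in the exclusive L3

    def fill(self, line):
        """Allocate a line into L1/L2; exclusivity removes it from L3."""
        self.l3.discard(line)
        self.l1l2.add(line)

    def evict_from_l1l2(self, line):
        """A clean victim leaving L1/L2 is installed into L3."""
        if line in self.l1l2:
            self.l1l2.remove(line)
            self.l3.add(line)

h = ExclusiveHierarchy()
h.fill(0x40e69400)              # line cached by the core: not in L3
h.evict_from_l1l2(0x40e69400)   # evicted clean: now only in L3
```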

Next, consider the state leading up to this point inside the guest:

- the decompressor code has been copied, overlapping the BSS and the
  mini-stack.
- the decompressor code and data have been cleaned and invalidated
  using the by-MVA instructions. This should push the data out to DDR.
- the decompressor has run, writing a large amount of data (that being
  the decompressed kernel image.)

At this precise point where we write to the mini-stack, the data cache
and MMU are both turned off, but the instruction cache is left enabled.

The action here involves writing the following hex values to the
mini-stack, located at 0x40e69420 - note its alignment:

   ffffffff 48000000 09000401 40003000 00000000 4820071d 40008090
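To make the alignment remark concrete (the arithmetic below is illustrative and assumes the Cortex-A72's 64-byte cache line size):

```python
# Where the mini-stack sits relative to cache lines, assuming a
# 64-byte line size.
LINE = 64
base = 0x40e69420
words = 7

line_start = base & ~(LINE - 1)     # 0x40e69400
offset = base - line_start          # 32: the stack starts mid-line
fits = offset + 4 * words <= LINE   # True: all 7 words share one line

# In case (1) below, the corrupt values begin at word 3, i.e. byte
# 32 + 3 * 4 = 44 into the line - part way through the cache line.
corrupt_offset = offset + 3 * 4
```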

It has been observed that, immediately after writing, the values read
back can be incorrect; here are a couple of examples:

   ffffffff 48000000 09000401 ee020f30 ee030f10 e3a00903 ee050f30 (1)
   ffffffff 48000000 09000401 ee020f30 00000000 4820071d 40008090 (2)

and after v1_invalidate_l1, it always seems to be:

   ee060f37 e3a00080 ee020f10 ee020f30 ee030f10 e3a00903 ee050f30

v1_invalidate_l1 operates by issuing set/way instructions that target
only the L1 cache - its purpose is to initialise the at-reset undefined
state of the L1 cache. These invalidates must not target lower level
caches, since these may contain valid data from other CPUs already
brought up in the system.
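For illustration, the set/way operand encodes the target cache level in bits [3:1] (per the ARM ARM), which is how v1_invalidate_l1 can restrict itself to L1. The helper below is hypothetical and assumes an A72-like L1 geometry (2-way, 64-byte lines) for the set and way field shifts, which in reality come from CCSIDR:

```python
import math

def dc_isw_operand(level, setidx, way, line_bytes=64, n_ways=2):
    """Sketch of a set/way operand: level-1 in bits [3:1], set index
    shifted by log2(line size), way index packed at the top of the
    word. Field positions depend on the cache geometry (CCSIDR)."""
    l = int(math.log2(line_bytes))               # set field shift
    a = 32 - int(math.ceil(math.log2(n_ways)))   # way field shift
    return (way << a) | (setidx << l) | ((level - 1) << 1)

# Targeting level 1 only (level field = 0), as v1_invalidate_l1 does:
op = dc_isw_operand(level=1, setidx=3, way=1)
```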

To be absolutely clear about these two observed cases:

case 1:
write: ffffffff 48000000 09000401 40003000 00000000 4820071d 40008090
read : ffffffff 48000000 09000401 ee020f30 ee030f10 e3a00903 ee050f30
read : ee060f37 e3a00080 ee020f10 ee020f30 ee030f10 e3a00903 ee050f30

case 2:
write: ffffffff 48000000 09000401 40003000 00000000 4820071d 40008090
read : ffffffff 48000000 09000401 ee020f30 00000000 4820071d 40008090
read : ee060f37 e3a00080 ee020f10 ee020f30 ee030f10 e3a00903 ee050f30

If we look at the captured data above, there are a few things to note:
1) the point at which we read back wrong data is part way through
   a cache line.
2) case 2 shows only one value is wrong initially, mid-way through the
   stack.
3) after v1_invalidate_l1, it seems that all data is incorrect. This
   could be a result of the actions of v1_invalidate_l1, or merely
   due to time passing and there being pressure from other system
   activity to evict lines from the various levels of caches.

If we consider your theory - that there are clean cache lines
overlapping the mini-stack, and that non-cacheable accesses hit those
cache lines - then the stmia write should hit those cache lines and
mark them dirty.
The subsequent read-back should also hit those cache lines, and return
consistent data. If the cache lines are evicted back to RAM, then a
read will not hit any cache lines, and should still return the data
that was written. Therefore, we should not be seeing any effects at
all, and the data should be consistent. This does not fit with the
observations.

Now consider an alternative theory: there are clean cache lines
overlapping the mini-stack, and non-cacheable accesses do not hit
those cache lines. In that case, the stmia write bypasses the caches
and hits the RAM directly, and reads would also fetch from the RAM. The
only way in this case that we would see data change is if the cache
line were in fact dirty, and it gets written back to RAM between our
non-cacheable write and a subsequent non-cacheable read. This also does
not fit the observations, particularly case (2) that I highlight above
where only _one_ value was seen to be incorrect.

There is another theory along these lines, though - the L1 and L2 have
differing behaviour from the L3 for non-cacheable accesses, and when a
clean cache line is discarded from L1/L2, it is placed in L3. Suppose,
for example, that non-cacheable accesses bypass L1 and L2 but not L3.
Now we have a _possibility_ of explaining this behaviour. Initially, L1/L2
contain a clean cache line overlapping this area. Accesses initially
bypass the clean cache line, until it gets evicted into L3, where
accesses hit it instead. When it gets evicted from L3, as it was clean,
it doesn't get written back, and we see the in-DDR data. The reverse
could also be true - L1/L2 could be hit by an uncached access but not
L3, and I'd suggest similar effects would be possible. However, this
does not fully explain case (2).
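The bypass theory can be walked through as a toy sequence (purely illustrative Python - the bypass rule and the eviction order are exactly the assumptions under discussion, not confirmed Armada 8040 behaviour):

```python
# Toy walk-through of the hypothesis: non-cacheable accesses bypass
# L1/L2 but hit the exclusive L3. A stale clean line shadowing the
# mini-stack migrates L1/L2 -> L3 -> discarded, changing what a
# non-cacheable read returns at each stage.
stale = "old bytes"          # clean line left over in L1/L2
ddr = "mini-stack values"    # what the stmia wrote to DDR

def nc_read(l1l2_line, l3_line):
    """Non-cacheable read under this theory: bypasses L1/L2
    unconditionally, but is served by L3 if the line is present."""
    return l3_line if l3_line is not None else ddr

# Stage 1: stale clean line sits in L1/L2; reads bypass it and see DDR.
stage1 = nc_read(stale, None)    # correct data

# Stage 2: line evicted clean from L1/L2 into the exclusive L3;
# reads now hit the stale copy, so wrong data appears.
stage2 = nc_read(None, stale)    # stale data

# Stage 3: line evicted from L3; it was clean, so no write-back
# occurs, and reads see the in-DDR data again.
stage3 = nc_read(None, None)     # correct data
```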

So, I don't think we have a full and proper idea of what is really
behind this.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


