[PATCH] ARM: decompressor: cover BSS in cache clean and reorder with MMU disable on v7

Ard Biesheuvel ardb at kernel.org
Sun Jan 24 10:45:14 EST 2021


On Sun, 24 Jan 2021 at 16:21, Russell King - ARM Linux admin
<linux at armlinux.org.uk> wrote:
>
> On Sun, Jan 24, 2021 at 02:35:31PM +0100, Ard Biesheuvel wrote:
> > So what I think is happening is the following:
> >
> > In v5.7 and before, the set/way operations trap into KVM, which sets
> > another trap bit to ensure that a second trap occurs the next time the
> > MMU is disabled. So if any cachelines are allocated after the call to
> > cache_clean_flush(), they will be invalidated again when KVM
> > invalidates the VM's entire IPA space.
> >
> > According to DDI0406C.d paragraph B3.2.1, it is implementation defined
> > whether non-cacheable accesses that occur with MMU/caches disabled may
> > hit in the data cache.
> >
> > So after v5.7, without set/way instructions being issued, the second
> > trap is never set, and so the only cache clean+invalidate that occurs
> > is the one that the decompressor performs itself, and the one that KVM
> > does on the guest's behalf at cache_off() time is omitted. This
> > results in clean cachelines being allocated that shadow the
> > mini-stack, which are hit by the non-cacheable accesses that occur
> > before the kernel proper enables the MMU again.
> >
> > Reordering the clean+invalidate with the MMU/cache disabling prevents
> > the issue: disabling the MMU and caches first removes any mappings
> > that the cache could perform speculative linefills from, and so the
> > mini-stack memory access cannot be served from the cache.
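
To make the reordering concrete, here it is as rough pseudo-C (the
real code in arch/arm/boot/compressed/head.S is assembly; the function
names below simply mirror the labels discussed in this thread):

    /* old order, since the switch to by-MVA maintenance: */
    cache_clean_flush();   /* clean+invalidate by VA */
    cache_off();           /* MMU+D-cache stay on until here, so the
                              caches may speculatively re-allocate
                              lines covering the mini-stack */

    /* order after this patch: */
    cache_off();           /* MMU off first: no cacheable mappings
                              remain to do linefills from */
    cache_clean_flush();   /* nothing can re-populate the cache now */
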
>
> This may be part of the story, but it doesn't explain all of the
> observed behaviour.
>
> First, some background...
>
> We have three levels of cache on the Armada 8040 - there are the two
> levels inside the A72 clusters, as designed by Arm Ltd. There is a
> third level, designed by Marvell, which is common to all CPUs and is
> an exclusive cache. This means that if the higher levels of cache
> contain a cache line, the L3 cache will not.
>
> Next, consider the state leading up to this point inside the guest:
>
> - the decompressor code has been copied, overlapping the BSS and the
>   mini-stack.
> - the decompressor code and data have been cleaned+invalidated (C+I)
>   using the by-MVA instructions. This should push the data out to DDR.
> - the decompressor has run, writing a large amount of data (that being
>   the decompressed kernel image).
>
> At this precise point where we write to the mini-stack, the data cache
> and MMU are both turned off, but the instruction cache is left enabled.
>
> The action around the mini-stack involves writing the following hex
> values to the mini-stack, located at 0x40e69420 - note its alignment:
>
>    ffffffff 48000000 09000401 40003000 00000000 4820071d 40008090
>
> It has been observed that, immediately after writing, the values read
> back can be incorrect; these are a couple of examples:
>
>    ffffffff 48000000 09000401 ee020f30 ee030f10 e3a00903 ee050f30 (1)
>    ffffffff 48000000 09000401 ee020f30 00000000 4820071d 40008090 (2)
>
> and after v7_invalidate_l1, it always seems to be:
>
>    ee060f37 e3a00080 ee020f10 ee020f30 ee030f10 e3a00903 ee050f30
>
> v7_invalidate_l1 operates by issuing set/way instructions that target
> only the L1 cache - its purpose is to initialise the at-reset undefined
> state of the L1 cache. These invalidates must not target lower level
> caches, since these may contain valid data from other CPUs already
> brought up in the system.
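
As an aside, a rough kernel-style C rendering of what v7_invalidate_l1
does; the real implementation in arch/arm/mm/cache-v7.S is assembly
and derives the set/way geometry from CCSIDR, so the function name and
parameters here are placeholders:

    /* sketch only: invalidate all sets/ways of the L1 D-cache */
    static void invalidate_l1_sketch(u32 sets, u32 set_shift,
                                     u32 ways, u32 way_shift)
    {
            u32 set, way, sw;

            for (way = 0; way < ways; way++)
                    for (set = 0; set < sets; set++) {
                            /* level field == 0 selects L1, so L2 and
                             * the Marvell L3 are left untouched */
                            sw = (way << way_shift) | (set << set_shift);

                            /* DCISW: invalidate D-cache by set/way */
                            asm volatile("mcr p15, 0, %0, c7, c6, 2"
                                         : : "r" (sw));
                    }
            asm volatile("dsb");    /* complete the invalidates */
    }
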
>
> To be absolutely clear about these two observed cases:
>
> case 1:
> write: ffffffff 48000000 09000401 40003000 00000000 4820071d 40008090
> read : ffffffff 48000000 09000401 ee020f30 ee030f10 e3a00903 ee050f30
> read : ee060f37 e3a00080 ee020f10 ee020f30 ee030f10 e3a00903 ee050f30
>
> case 2:
> write: ffffffff 48000000 09000401 40003000 00000000 4820071d 40008090
> read : ffffffff 48000000 09000401 ee020f30 00000000 4820071d 40008090
> read : ee060f37 e3a00080 ee020f10 ee020f30 ee030f10 e3a00903 ee050f30
>
> If we look at the captured data above, there are a few things to note:
> 1) the point at which we read back wrong data is partway through
>    a cache line.
> 2) case 2 shows only one value wrong initially, midway through the
>    stack.
> 3) after v7_invalidate_l1, it seems that all data is incorrect. This
>    could be a result of the actions of v1_invalidate_l1, or merely
>    due to time passing and there being pressure from other system
>    activity to evict lines from the various levels of caches.
>
> If we consider your theory - that there are clean cache lines
> overlapping the mini-stack, and that non-cacheable accesses hit those
> cache lines - then the stmia write should hit those cache lines and
> mark them dirty.
> The subsequent read-back should also hit those cache lines, and return
> consistent data. If the cache lines are evicted back to RAM, then a
> read will not hit any cache lines, and should still return the data
> that was written. Therefore, we should not be seeing any effects at
> all, and the data should be consistent. This does not fit with the
> observations.
>
> Now consider an alternative theory - that there are clean cache lines
> overlapping the mini-stack, and non-cacheable accesses do not hit
> those cache lines. This means that the stmia write bypasses the
> caches and hits the RAM directly, and reads would also fetch from
> the RAM. The
> only way in this case that we would see data change is if the cache
> line were in fact dirty, and it gets written back to RAM between our
> non-cacheable write and a subsequent non-cacheable read. This also does
> not fit the observations, particularly case (2) that I highlight above
> where only _one_ value was seen to be incorrect.
>
> There is another theory along these lines, though - L1 and L2 may
> respond differently to non-cacheable accesses than L3 does, and when
> a clean cache line is discarded from L1/L2, it is placed in L3. For
> example, non-cacheable accesses may bypass L1 and L2 but not L3. Now we
> have a _possibility_ to explain this behaviour. Initially, L1/L2
> contain a clean cache line overlapping this area. Accesses initially
> bypass the clean cache line, until it gets evicted into L3, where
> accesses hit it instead. When it gets evicted from L3, as it was clean,
> it doesn't get written back, and we see the in-DDR data. The reverse
> could also be true - L1/L2 could be hit by an uncached access but not
> L3, and I'd suggest similar effects would be possible. However, this
> does not fully explain case (2).
>
> So, I don't think we have a full and proper idea of what is really
> behind this.
>

Fair enough. I don't think we will ever be able to get to the bottom of this.

But I do stand by my conclusion that the likely reason that this patch
fixes the issue is that it reinstates the additional cache
clean+invalidate occurring *after* cache_off(), which was carried out
on the guest's behalf before, due to the way KVM handles set/way
operations and the subsequent disabling of the MMU. I.e.,

kvm_set_way_flush() in arch/arm64/kvm/mmu.c, which is called every
time the guest performs a cache op by set/way:

/*
 * If this is the first time we do a S/W operation
 * (i.e. HCR_TVM not set) flush the whole memory, and set the
 * VM trapping.
 *
 * Otherwise, rely on the VM trapping to wait for the MMU +
 * Caches to be turned off. At that point, we'll be able to
 * clean the caches again.
 */

(so the significance here is *not* that it flushes the whole memory
but that it sets HCR_TVM - this is why the experiment we did with
adding a single set/way op fixed the issue but changing it to another
mcr instruction that is also known to trap to KVM did not make a
difference)
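
Abridged from memory (so treat this as a sketch of the logic rather
than an exact quote of the tree), the function looks like:

    void kvm_set_way_flush(struct kvm_vcpu *vcpu)
    {
            unsigned long hcr = *vcpu_hcr(vcpu);

            /* first S/W op from this guest: flush the whole IPA
             * space, and start trapping VM register writes so that
             * the eventual MMU disable is visible to KVM */
            if (!(hcr & HCR_TVM)) {
                    stage2_flush_vm(vcpu->kvm);
                    *vcpu_hcr(vcpu) = hcr | HCR_TVM;
            }
    }
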

Then, in kvm_toggle_cache(),

/*
 * If switching the MMU+caches on, need to invalidate the caches.
 * If switching it off, need to clean the caches.
 * Clean + invalidate does the trick always.
 */

which is why cache_off() used to result in a full clean+invalidate
under KVM, but not after my by-MVA patch was applied.
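
Again abridged from memory, the toggle side is roughly:

    void kvm_toggle_cache(struct kvm_vcpu *vcpu, bool was_enabled)
    {
            bool now_enabled = vcpu_has_cache_enabled(vcpu);

            /* MMU+caches flipped on or off: clean+invalidate the
             * whole IPA space on the guest's behalf */
            if (now_enabled != was_enabled)
                    stage2_flush_vm(vcpu->kvm);

            /* caches back on: stop trapping VM register writes
             * until the next S/W op */
            if (now_enabled)
                    *vcpu_hcr(vcpu) &= ~HCR_TVM;
    }

Note that this is only reached when the guest's write to SCTLR
actually traps, i.e. when HCR_TVM was set by an earlier set/way op -
without that, cache_off() goes entirely unnoticed by KVM.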

So on the basis that this patch adds back a full cache
clean+invalidate that was inadvertently removed by my previous patch,
I think we should apply this patch. Given that this change can be
traced back to commit 401b368caaecdce1cf8f05bab448172752230cb0, we
should probably include

Fixes: 401b368caaec ("ARM: decompressor: switch to by-VA cache maintenance for v7 cores")

Are you on board with that?


