[PATCH v4 11/13] KVM: arm64: Improve CONFIG_CFI_CLANG error message

Thu Jun 6 09:22:26 PDT 2024

On Tue, Jun 04, 2024 at 05:05:59PM +0100, Pierre-Clément Tosi wrote:
> On Mon, Jun 03, 2024 at 03:48:08PM +0100, Will Deacon wrote:
> > On Wed, May 29, 2024 at 01:12:17PM +0100, Pierre-Clément Tosi wrote:
> > > For kCFI, the compiler encodes in the immediate of the BRK (which the
> > > CPU places in ESR_ELx) the indices of the two registers it used to hold
> > > (resp.) the function pointer and expected type. Therefore, the kCFI
> > > handler must be able to parse the contents of the register file at the
> > > point where the exception was triggered.
> > > 
> > > To achieve this, introduce a new hypervisor panic path that first stores
> > > the CPU context in the per-CPU kvm_hyp_ctxt before calling (directly or
> > > indirectly) hyp_panic() and execute it from all EL2 synchronous
> > > exception handlers i.e.
> > > 
> > > - call it directly in host_el2_sync_vect (__kvm_hyp_host_vector, EL2t&h)
> > > - call it directly in el2t_sync_invalid (__kvm_hyp_vector, EL2t)
> > > - set ELR_EL2 to it in el2_sync (__kvm_hyp_vector, EL2h), which ERETs
> > > 
> > > Teach hyp_panic() to decode the kCFI ESR and extract the target and type
> > > from the saved CPU context. In VHE, use that information to panic() with
> > > a specialized error message. In nVHE, only report it if the host (EL1)
> > > has access to the saved CPU context i.e. iff CONFIG_NVHE_EL2_DEBUG=y,
> > > which aligns with the behavior of CONFIG_PROTECTED_NVHE_STACKTRACE.
> > > 
> > > Signed-off-by: Pierre-Clément Tosi <ptosi at google.com>
> > > ---
> > >  arch/arm64/kvm/handle_exit.c            | 30 +++++++++++++++++++++++--
> > >  arch/arm64/kvm/hyp/entry.S              | 24 +++++++++++++++++++-
> > >  arch/arm64/kvm/hyp/hyp-entry.S          |  2 +-
> > >  arch/arm64/kvm/hyp/include/hyp/switch.h |  4 ++--
> > >  arch/arm64/kvm/hyp/nvhe/host.S          |  2 +-
> > >  arch/arm64/kvm/hyp/vhe/switch.c         | 26 +++++++++++++++++++--
> > >  6 files changed, 79 insertions(+), 9 deletions(-)
> > 
> > This quite a lot of work just to print out some opaque type numbers
> > when CONFIG_NVHE_EL2_DEBUG=y. Is it really worth it? How would I use
> > this information to debug an otherwise undebuggable kcfi failure at EL2?
> 
> The type ID alone might not be worth it but what about the target?
> 
> In my experience, non-malicious kCFI failures are often caused by an issue with
> the callee, not the caller. Without this patch, only the source of the exception
> is reported but, with it, the panic handler also prints the kCFI target (i.e.
> value of the function pointer) as a symbol.

I think it's less of an issue for EL2, as we don't have tonnes of
indirections, but I agree that the target is nice to have.

> With the infrastructure for the target in place, it isn't much more work to also
> report the type ID. Although it is rarely informative (as you noted), there are
> some situations where it can still be useful e.g. if reported as zero and/or has
> been corrupted and does not match its value from the ELF.

So looking at the implementation, I'm not a huge fan of saving off all
the GPRs and then relying on the stage-2 being disabled so that the host
can fish out the registers it cares about. I think I'd prefer to provide
the target as an additional argument to nvhe_hyp_panic_handler(), meaning
that we could even print the VA when CONFIG_NVHE_EL2_DEBUG is disabled.

But for now, I suggest we drop this patch along with the testing patches
because I think the rest of the series is nearly there and it's a useful
change on its own.

Will