[PATCH] arm64: errata: Add NXP iMX8QM workaround for A53 Cache coherency issue

Mark Rutland mark.rutland at arm.com
Thu Apr 13 04:19:39 PDT 2023


Hi,

On Wed, Apr 12, 2023 at 03:55:06PM +0300, Ivan T. Ivanov wrote:
> According to the NXP errata document[1], the i.MX8QuadMax SoC suffers
> from a serious cache coherence issue. It was also mentioned in the
> initial support[2] for the imx8qm mek machine.

That's a fairly horrid bug.

> I chose to use the ALTERNATIVE() framework, instead of the downstream
> solution[3], for this issue in the hope of reducing the effect of this
> fix on unaffected platforms.
> 
> Unfortunately I was unable to find a way to identify the SoC ID using
> registers. The boot CPU MIDR_EL1 is equal to 0x410fd034. So I fall back
> to using devicetree compatible strings for this.

How does this work with KVM?

VMs have no idea what the host platform is, and so will have no idea that this
erratum applies, so they're going to blow up spectacularly.

So we should probably be disabling KVM (or at the very least, printing a
gigantic warning).
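
As a stop-gap, something like the following (completely untested sketch; the
hook name here is invented) as the erratum's cpu_enable callback would at
least make the problem visible, though disabling KVM would be safer:

	static void __maybe_unused
	cpu_enable_nxp_err050104(const struct arm64_cpu_capabilities *cap)
	{
		/* Keep the EL0 IC IVAU trap from the proposed patch */
		cpu_enable_cache_maint_trap(cap);

		/* Guests cannot be told about this erratum */
		pr_warn_once("NXP ERR050104: TLBI/IC broadcast is unreliable; KVM guests are unaware of this erratum\n");
	}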

> I know this fix is a suboptimal solution for affected machines, but I
> haven't been able to come up with a less intrusive fix. And I hope that
> once the TLB caches are invalidated, any immediate attempt to invalidate
> them again will be close to a NOP operation (flush_tlb_kernel_range()).
> 
> I have run a few simple benchmarks and perf tests on affected and
> unaffected machines and I was not able to see any obvious issues. iMX8QM
> "performance" was nearly doubled with the 2 A72 cores brought online.
> 
> The following is an excerpt from the NXP IMX8_1N94W "Mask Set Errata"
> document, Rev. 5, 3/2023. Just in case it gets lost somehow.
> 
> ---
> "ERR050104: Arm/A53: Cache coherency issue"
> 
> Description
> 
> Some maintenance operations exchanged between the A53 and A72
> core clusters, involving some Translation Look-aside Buffer
> Invalidate (TLBI) and Instruction Cache (IC) instructions can
> be corrupted. The upper bits, above bit-35, of ARADDR and ACADDR
> buses within the Arm A53 sub-system have been incorrectly connected.
> Therefore ARADDR and ACADDR address bits above bit-35 should not
> be used.
> 
> Workaround
> 
> The following software instructions are required to be downgraded
> to TLBI VMALLE1IS:  TLBI ASIDE1, TLBI ASIDE1IS, TLBI VAAE1,
> TLBI VAAE1IS, TLBI VAALE1, TLBI VAALE1IS, TLBI VAE1, TLBI VAE1IS,
> TLBI VALE1, TLBI VALE1IS
> 
> The following software instructions are required to be downgraded
> to TLBI VMALLS12E1IS: TLBI IPAS2E1IS, TLBI IPAS2LE1IS
> 
> The following software instructions are required to be downgraded
> to TLBI ALLE2IS: TLBI VAE2IS, TLBI VALE2IS.
> 
> The following software instructions are required to be downgraded
> to TLBI ALLE3IS: TLBI VAE3IS, TLBI VALE3IS.
> 
> The following software instructions are required to be downgraded
> to TLBI VMALLE1IS when the Force Broadcast (FB) bit [9] of the
> Hypervisor Configuration Register (HCR_EL2) is set:
> TLBI ASIDE1, TLBI VAAE1, TLBI VAALE1, TLBI VAE1, TLBI VALE1
> 
> The following software instruction is required to be downgraded
> to IC IALLUIS: IC IVAU, Xt
> 
> Specifically for the IC IVAU, Xt downgrade, setting SCTLR_EL1.UCI
> to 0 will disable EL0 access to this instruction. Any attempt to
> execute from EL0 will generate an EL1 trap, where the downgrade to
> IC ALLUIS can be implemented.
> --
> 
> [1] https://www.nxp.com/docs/en/errata/IMX8_1N94W.pdf
> [2] 307fd14d4b14 ("arm64: dts: imx: add imx8qm mek support")
> [3] https://github.com/nxp-imx/linux-imx/blob/lf-6.1.y/arch/arm64/include/asm/tlbflush.h#L19
> 
> Signed-off-by: Ivan T. Ivanov <iivanov at suse.de>
> ---
>  Documentation/arm64/silicon-errata.rst |  2 ++
>  arch/arm64/Kconfig                     | 10 ++++++++++
>  arch/arm64/include/asm/cpufeature.h    |  3 ++-
>  arch/arm64/include/asm/tlbflush.h      |  6 +++++-
>  arch/arm64/kernel/cpu_errata.c         | 18 ++++++++++++++++++
>  arch/arm64/kernel/traps.c              | 22 +++++++++++++++++++++-
>  arch/arm64/tools/cpucaps               |  1 +
>  7 files changed, 59 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/arm64/silicon-errata.rst b/Documentation/arm64/silicon-errata.rst
> index ec5f889d7681..fce231797184 100644
> --- a/Documentation/arm64/silicon-errata.rst
> +++ b/Documentation/arm64/silicon-errata.rst
> @@ -175,6 +175,8 @@ stable kernels.
>  +----------------+-----------------+-----------------+-----------------------------+
>  | Freescale/NXP  | LS2080A/LS1043A | A-008585        | FSL_ERRATUM_A008585         |
>  +----------------+-----------------+-----------------+-----------------------------+
> +| Freescale/NXP  | i.MX 8QuadMax   | ERR050104       | NXP_IMX8QM_ERRATUM_ERR050104|
> ++----------------+-----------------+-----------------+-----------------------------+
>  +----------------+-----------------+-----------------+-----------------------------+
>  | Hisilicon      | Hip0{5,6,7}     | #161010101      | HISILICON_ERRATUM_161010101 |
>  +----------------+-----------------+-----------------+-----------------------------+
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 1023e896d46b..437cb53f8753 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -1159,6 +1159,16 @@ config SOCIONEXT_SYNQUACER_PREITS
>  
>  	  If unsure, say Y.
>  
> +config NXP_IMX8QM_ERRATUM_ERR050104
> +	bool "NXP iMX8QM: Workaround for Arm/A53 Cache coherency issue"

Please use the erratum number, e.g.

	bool "NXP iMX8QM ERR050104: broken cache/tlb invalidation"

> +	default n
> +	help
> +	  Some maintenance operations exchanged between the A53 and A72 core
> +	  clusters, involving some Translation Look-aside Buffer Invalidate
> +	  (TLBI) and Instruction Cache (IC) instructions can be corrupted.

Likewise, please add a more complete description here, e.g.

	help
	  On iMX8QM, addresses above bit 35 are not broadcast correctly for
	  TLBI or IC operations, making TLBI and IC unreliable.

	  Work around this erratum by using TLBI *ALL*IS and IC IALLUIS
	  operations. EL0 use of IC IVAU is trapped and upgraded to IC IALLUIS.

> +
> +	  If unsure, say N.
> +
>  endmenu # "ARM errata workarounds via the alternatives framework"
>  
>  choice
> diff --git a/arch/arm64/include/asm/cpufeature.h b/arch/arm64/include/asm/cpufeature.h
> index 6bf013fb110d..1ed648f7f29a 100644
> --- a/arch/arm64/include/asm/cpufeature.h
> +++ b/arch/arm64/include/asm/cpufeature.h
> @@ -835,7 +835,8 @@ static inline bool system_supports_bti(void)
>  static inline bool system_supports_tlb_range(void)
>  {
>  	return IS_ENABLED(CONFIG_ARM64_TLB_RANGE) &&
> -		cpus_have_const_cap(ARM64_HAS_TLB_RANGE);
> +		cpus_have_const_cap(ARM64_HAS_TLB_RANGE) &&
> +		!cpus_have_const_cap(ARM64_WORKAROUND_NXP_ERR050104);
>  }

It'd be better to handle this in the detection of ARM64_HAS_TLB_RANGE, as we
have for CNP where has_useable_cnp() checks for ARM64_WORKAROUND_NVIDIA_CARMEL_CNP.
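
e.g. something like this (completely untested sketch; the function name is
invented here), wired up as the .matches callback for ARM64_HAS_TLB_RANGE so
that system_supports_tlb_range() can be left alone:

	static bool
	has_usable_tlb_range(const struct arm64_cpu_capabilities *entry,
			     int scope)
	{
		/* The erratum makes broadcast range invalidation unusable */
		if (cpus_have_const_cap(ARM64_WORKAROUND_NXP_ERR050104))
			return false;

		return has_cpuid_feature(entry, scope);
	}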

>  
>  int do_emulate_mrs(struct pt_regs *regs, u32 sys_reg, u32 rt);
> diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
> index 412a3b9a3c25..12055b859ce3 100644
> --- a/arch/arm64/include/asm/tlbflush.h
> +++ b/arch/arm64/include/asm/tlbflush.h
> @@ -37,7 +37,11 @@
>  			    : : )
>  
>  #define __TLBI_1(op, arg) asm (ARM64_ASM_PREAMBLE			       \
> -			       "tlbi " #op ", %0\n"			       \
> +		   ALTERNATIVE("nop\n	nop\n  tlbi " #op ", %0",              \
> +			       "tlbi vmalle1is\n  dsb ish\n  isb",	       \
> +			       ARM64_WORKAROUND_NXP_ERR050104)		       \
> +			    : : "r" (arg));				       \

Why do you need the DSB ISH + ISB here? It's up to the caller to issue those,
and the ARM64_WORKAROUND_REPEAT_TLBI workaround only has DSB ISH to ensure that
the first op completes before the second is issued.
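
FWIW, I'd expect the erratum alternative to just swap the TLBI op (untested
sketch below), leaving the existing REPEAT_TLBI alternative and the caller's
DSB ISH as they are:

	#define __TLBI_1(op, arg) asm (ARM64_ASM_PREAMBLE		       \
			   ALTERNATIVE("tlbi " #op ", %0",		       \
				       "tlbi vmalle1is",		       \
				       ARM64_WORKAROUND_NXP_ERR050104)	       \
				    : : "r" (arg));			       \
				  asm (ARM64_ASM_PREAMBLE		       \
			   ALTERNATIVE("nop\n			nop",	       \
				       "dsb ish\n		tlbi " #op ", %0", \
				       ARM64_WORKAROUND_REPEAT_TLBI,	       \
				       CONFIG_ARM64_WORKAROUND_REPEAT_TLBI)    \
				    : : "r" (arg))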

> +			  asm (ARM64_ASM_PREAMBLE			       \
>  		   ALTERNATIVE("nop\n			nop",		       \
>  			       "dsb ish\n		tlbi " #op ", %0",     \
>  			       ARM64_WORKAROUND_REPEAT_TLBI,		       \
> diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
> index 307faa2b4395..7b702a79bf60 100644
> --- a/arch/arm64/kernel/cpu_errata.c
> +++ b/arch/arm64/kernel/cpu_errata.c
> @@ -8,6 +8,7 @@
>  #include <linux/arm-smccc.h>
>  #include <linux/types.h>
>  #include <linux/cpu.h>
> +#include <linux/of.h>
>  #include <asm/cpu.h>
>  #include <asm/cputype.h>
>  #include <asm/cpufeature.h>
> @@ -55,6 +56,14 @@ is_kryo_midr(const struct arm64_cpu_capabilities *entry, int scope)
>  	return model == entry->midr_range.model;
>  }
>  
> +static bool __maybe_unused
> +is_imx8qm_soc(const struct arm64_cpu_capabilities *entry, int scope)
> +{
> +	WARN_ON(preemptible());
> +
> +	return of_machine_is_compatible("fsl,imx8qm");
> +}

As above, what is going to be done for VMs, where this won't be present?

> +
>  static bool
>  has_mismatched_cache_type(const struct arm64_cpu_capabilities *entry,
>  			  int scope)
> @@ -729,6 +738,15 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
>  		MIDR_FIXED(MIDR_CPU_VAR_REV(1,1), BIT(25)),
>  		.cpu_enable = cpu_clear_bf16_from_user_emulation,
>  	},
> +#endif
> +#ifdef CONFIG_NXP_IMX8QM_ERRATUM_ERR050104
> +	{
> +		.desc = "NXP A53 cache coherency issue",

Please use the erratum number, i.e. 

		.desc = "NXP erratum ERR050104",

> +		.capability = ARM64_WORKAROUND_NXP_ERR050104,
> +		.type = ARM64_CPUCAP_STRICT_BOOT_CPU_FEATURE,
> +		.matches = is_imx8qm_soc,
> +		.cpu_enable = cpu_enable_cache_maint_trap,
> +	},
>  #endif
>  	{
>  	}
> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> index 4a79ba100799..4858f8c86fd5 100644
> --- a/arch/arm64/kernel/traps.c
> +++ b/arch/arm64/kernel/traps.c
> @@ -529,6 +529,26 @@ void do_el1_fpac(struct pt_regs *regs, unsigned long esr)
>  		uaccess_ttbr0_disable();			\
>  	}
>  
> +#define __user_instruction_cache_maint(address, res)		\
> +do {								\
> +	if (address >= TASK_SIZE_MAX) {				\
> +		res = -EFAULT;					\
> +	} else {						\
> +		uaccess_ttbr0_enable();				\
> +		asm volatile (					\
> +			"1:\n"					\
> +	    ALTERNATIVE("	ic	ivau, %1\n",		\
> +			"	ic	ialluis\n",		\
> +			ARM64_WORKAROUND_NXP_ERR050104)		\
> +			"	mov	%w0, #0\n"		\
> +			"2:\n"					\
> +			_ASM_EXTABLE_UACCESS_ERR(1b, 2b, %w0)	\
> +			: "=r" (res)				\
> +			: "r" (address));			\
> +		uaccess_ttbr0_disable();			\
> +	}							\
> +} while (0)
> +
>  static void user_cache_maint_handler(unsigned long esr, struct pt_regs *regs)
>  {
>  	unsigned long tagged_address, address;
> @@ -556,7 +576,7 @@ static void user_cache_maint_handler(unsigned long esr, struct pt_regs *regs)
>  		__user_cache_maint("dc civac", address, ret);
>  		break;
>  	case ESR_ELx_SYS64_ISS_CRM_IC_IVAU:	/* IC IVAU */
> -		__user_cache_maint("ic ivau", address, ret);
> +		__user_instruction_cache_maint(address, ret);
>  		break;

Hmm... this will silently change any 'IC IVAU' to never fault. That's probably
not the end of the world, since it's IMP-DEF whether IC would raise a
permission fault, but it is a change of behaviour.

It would be a bit simpler to handle this inline within the switch, e.g.

	case ESR_ELx_SYS64_ISS_CRM_IC_IVAU:     /* IC IVAU */
		if (cpus_have_final_cap(ARM64_WORKAROUND_NXP_ERR050104)) {
			asm volatile("ic ialluis");
			ret = 0;
			break;
		}
		__user_cache_maint("ic ivau", address, ret);
		break;

... as that would avoid duplicating the bulk of the __user_cache_maint() macro.


>  	default:
>  		force_signal_inject(SIGILL, ILL_ILLOPC, regs->pc, 0);
> diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
> index 37b1340e9646..e225f1cd1005 100644
> --- a/arch/arm64/tools/cpucaps
> +++ b/arch/arm64/tools/cpucaps
> @@ -90,3 +90,4 @@ WORKAROUND_NVIDIA_CARMEL_CNP
>  WORKAROUND_QCOM_FALKOR_E1003
>  WORKAROUND_REPEAT_TLBI
>  WORKAROUND_SPECULATIVE_AT
> +WORKAROUND_NXP_ERR050104

These definitions are expected to be ordered alphabetically, so this should be
earlier in the list.
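
i.e.:

	WORKAROUND_NVIDIA_CARMEL_CNP
	WORKAROUND_NXP_ERR050104
	WORKAROUND_QCOM_FALKOR_E1003
	WORKAROUND_REPEAT_TLBI
	WORKAROUND_SPECULATIVE_AT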

Thanks,
Mark.

> -- 
> 2.35.3
> 


