[PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum

Thu Jun 11 18:13:48 PDT 2026

Hi Will,

On 6/11/2026 8:39 AM, sdonthineni at nvidia.com wrote:
>
> -----Original Message-----
> From: Will Deacon <will at kernel.org>
> Sent: Thursday, June 11, 2026 8:34 AM
> To: Shanker Donthineni <sdonthineni at nvidia.com>
> Cc: Catalin Marinas <catalin.marinas at arm.com>; Vladimir Murzin <vladimir.murzin at arm.com>; Jason Gunthorpe <jgg at nvidia.com>; linux-arm-kernel at lists.infradead.org; Mark Rutland <mark.rutland at arm.com>; linux-kernel at vger.kernel.org; linux-doc at vger.kernel.org; Vikram Sethi <vsethi at nvidia.com>; Jason Sequeira <jsequeira at nvidia.com>
> Subject: Re: [PATCH v3] arm64: errata: Workaround NVIDIA Olympus device store/load ordering erratum
>
> On Wed, Jun 10, 2026 at 11:48:22AM -0500, Shanker Donthineni wrote:
>> On systems with NVIDIA Olympus cores, a Device-nGnR* load can be
>> observed by a peripheral before an older, non-overlapping Device-nGnR*
>> store to the same peripheral. This breaks the program-order guarantee
>> that software expects for Device-nGnR* accesses and can leave a
>> peripheral in an incorrect state, as a load is observed before an
>> earlier store takes effect.
>>
>> The erratum can occur only when all of the following apply:
>>
>>    - A PE executes a Device-nGnR* store followed by a younger
>>      Device-nGnR* load.
>>    - The store is not a store-release.
>>    - The accesses target the same peripheral and do not overlap in bytes.
>>    - There is at most one intervening Device-nGnR* store in program
>>      order, and there are no intervening Device-nGnR* loads.
>>    - There is no DSB, and no DMB that orders loads, between the store and
>>      the load.
>>    - Specific micro-architectural and timing conditions occur.
>>
>> Promote the raw MMIO store helpers (__raw_writeb/w/l/q) from plain
>> str* to stlr* (Store-Release), which removes the "store is not a
>> store-release" condition for every device write the kernel issues.
>> Because writel() and writel_relaxed() are both built on __raw_writel()
>> in asm-generic/io.h, patching the raw variants covers both the
>> non-relaxed and relaxed APIs without touching the higher layers. Note
>> that writel()'s own barrier sits before the store, so it does not
>> order the store against a subsequent readl(); the store-release
>> promotion is what provides that ordering.
>>
>> Like ARM64_ERRATUM_832075 on the load side, the change is gated on a
>> new ARM64_WORKAROUND_DEVICE_STORE_RELEASE capability and only
>> activated on parts that match MIDR_NVIDIA_OLYMPUS, so unaffected CPUs
>> continue to use the plain str* sequence.
>>
>> Note: stlr* only supports base-register addressing, so affected CPUs
>> use a base-register stlr* path. Unaffected CPUs keep the original
>> offset-addressed str* sequence introduced by commit d044d6ba6f02
>> ("arm64: io: permit offset addressing").
>>
>> The __const_memcpy_toio_aligned32() and
>> __const_memcpy_toio_aligned64() helpers are left unchanged. These
>> helpers are intended for write-combining mappings, which are Normal-NC
>> on arm64. Replacing their contiguous str* groups would defeat the
>> write-combining behavior used to improve store performance.
>>
>> Co-developed-by: Vikram Sethi <vsethi at nvidia.com>
>> Signed-off-by: Vikram Sethi <vsethi at nvidia.com>
>> Signed-off-by: Shanker Donthineni <sdonthineni at nvidia.com>
>> Reviewed-by: Catalin Marinas <catalin.marinas at arm.com>
>> ---
>> Changes since v2:
>>    - Reworked the raw MMIO write helpers so unaffected CPUs keep the
>>      existing offset-addressed STR sequence, while affected CPUs use the
>>      base-register STLR path.
>>    - Updated the commit message to match the code changes.
>>    - Rebased on top of the arm64 for-next/errata branch:
>>      https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git/log/?h=for-next/errata
>>
>> Changes since v1:
>>    - Updated the commit message based on feedback from Vladimir Murzin.
>>
>>   Documentation/arch/arm64/silicon-errata.rst |  2 ++
>>   arch/arm64/Kconfig                          | 23 ++++++++++++++++
>>   arch/arm64/include/asm/io.h                 | 30 +++++++++++++++++++++
>>   arch/arm64/kernel/cpu_errata.c              |  8 ++++++
>>   arch/arm64/tools/cpucaps                    |  1 +
>>   5 files changed, 64 insertions(+)
>>
>> diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
>> index ad09bbb10da80..fc45125dc2f80 100644
>> --- a/Documentation/arch/arm64/silicon-errata.rst
>> +++ b/Documentation/arch/arm64/silicon-errata.rst
>> @@ -298,6 +298,8 @@ stable kernels.
>>   +----------------+-----------------+-----------------+-----------------------------+
>>   | NVIDIA         | Carmel Core     | N/A             | NVIDIA_CARMEL_CNP_ERRATUM   |
>>   +----------------+-----------------+-----------------+-----------------------------+
>> +| NVIDIA         | Olympus core    | T410-OLY-1027   | NVIDIA_OLYMPUS_1027_ERRATUM |
>> ++----------------+-----------------+-----------------+-----------------------------+
>>   | NVIDIA         | Olympus core    | T410-OLY-1029   | ARM64_ERRATUM_4118414       |
>>   +----------------+-----------------+-----------------+-----------------------------+
>>   | NVIDIA         | T241 GICv3/4.x  | T241-FABRIC-4   | N/A                         |
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index c65cef81be86a..d633eb70de1ac 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -564,6 +564,29 @@ config ARM64_ERRATUM_832075
>>
>>          If unsure, say Y.
>>
>> +config NVIDIA_OLYMPUS_1027_ERRATUM
>> +     bool "NVIDIA Olympus: device store/load ordering erratum"
>> +     default y
>> +     help
>> +       This option adds an alternative code sequence to work around an
>> +       NVIDIA Olympus core erratum where a Device-nGnR* store can be
>> +       observed by a peripheral after a younger Device-nGnR* load to the
>> +       same peripheral. This breaks the program order that drivers rely
>> +       on for MMIO and can leave a device in an incorrect state.
>> +
>> +       The workaround promotes the raw MMIO store helpers
>> +       (__raw_writeb/w/l/q) to Store-Release (STLR), which restores the
>> +       required ordering. Because writel() and writel_relaxed() are built
>> +       on __raw_writel(), both are covered without changes to the higher
>> +       layers.
>> +
>> +       The fix is applied through the alternatives framework, so enabling
>> +       this option does not by itself activate the workaround: it is
>> +       patched in only when an affected CPU is detected, and is a no-op on
>> +       unaffected CPUs.
>> +
>> +       If unsure, say Y.
>> +
>>   config ARM64_ERRATUM_834220
>>        bool "Cortex-A57: 834220: Stage 2 translation fault might be incorrectly reported in presence of a Stage 1 fault (rare)"
>>        depends on KVM
>> diff --git a/arch/arm64/include/asm/io.h b/arch/arm64/include/asm/io.h
>> index 8cbd1e96fd50b..801223e754c90 100644
>> --- a/arch/arm64/include/asm/io.h
>> +++ b/arch/arm64/include/asm/io.h
>> @@ -22,10 +22,22 @@
>>   /*
>>    * Generic IO read/write.  These perform native-endian accesses.
>>    */
>> +static __always_inline bool arm64_needs_device_store_release(void)
>> +{
>> +     return alternative_has_cap_unlikely(
>> +                             ARM64_WORKAROUND_DEVICE_STORE_RELEASE);
>> +}
>> +
>>   #define __raw_writeb __raw_writeb
>>   static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
>>   {
>>        volatile u8 __iomem *ptr = addr;
>> +
>> +     if (arm64_needs_device_store_release()) {
>> +             asm volatile("stlrb %w0, [%1]" : : "rZ" (val), "r" (addr));
>> +             return;
>> +     }
>> +
>>        asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
>>   }
> Use an 'else' clause instead of the early return? (similarly for the other changes).
>
> I still reckon you should do something with the memcpy-to-io routines.
> A simple option could be to make dgh() a dmb on parts with the erratum?
> That at least moves the barrier out of the loop.

Thanks Will. I looked again at both the arm64 comments and the generic iomap_copy.c
contract, and I’m not convinced that making dgh() a dmb is the right fit for this
path. Based on the documented comments, callers should not assume ordering from
these helpers; if ordering is required around a memcpy, the call site should already
be providing the necessary barriers.
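
As a rough illustration of that split of responsibilities, here is a
user-space sketch in which the copy helper promises nothing and the call site
supplies the barrier before ringing a doorbell. Every name in it
(model_iowrite64_copy(), model_wmb(), model_doorbell_relaxed(),
post_descriptor()) is a hypothetical stand-in that only records the order of
calls; it does not emit real barriers or touch MMIO.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical user-space stand-ins; none of these are kernel helpers. */
enum ev { EV_COPY, EV_BARRIER, EV_DOORBELL };

#define TRACE_MAX 8
static enum ev trace[TRACE_MAX];
static size_t trace_len;

/* Append one event to the trace so the call order can be inspected. */
static void record(enum ev e)
{
	assert(trace_len < TRACE_MAX);
	trace[trace_len++] = e;
}

/* Models a copy helper that gives no ordering and performs no barrier. */
static void model_iowrite64_copy(uint64_t *to, const uint64_t *from,
				 size_t count)
{
	memcpy(to, from, count * sizeof(*to));
	record(EV_COPY);
}

/* Models the barrier that the call site is responsible for. */
static void model_wmb(void)
{
	record(EV_BARRIER);
}

/* Models a doorbell write through a relaxed accessor. */
static void model_doorbell_relaxed(uint32_t *db, uint32_t val)
{
	*db = val;
	record(EV_DOORBELL);
}

/* The caller, not the copy helper, supplies the ordering it needs. */
static void post_descriptor(uint64_t *wc_buf, const uint64_t *desc,
			    size_t count, uint32_t *db)
{
	model_iowrite64_copy(wc_buf, desc, count);
	model_wmb();
	model_doorbell_relaxed(db, 1);
}
```

A call site that does not need the copy ordered against anything else would
simply omit the model_wmb() step, which is the flexibility the documented
contract leaves to callers.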

Related data point in generic lib/iomap_copy.c:

/**
  * __iowrite32_copy - copy data to MMIO space, in 32-bit units
  * @to: destination, in MMIO space (must be 32-bit aligned)
  * @from: source (must be 32-bit aligned)
  * @count: number of 32-bit quantities to copy
  *
  * Copy data from kernel space to MMIO space, in units of 32 bits at a
  * time.  Order of access is not guaranteed, nor is a memory barrier
  * performed afterwards.
  */
#ifndef __iowrite32_copy
void __iowrite32_copy(void __iomem *to, const void *from, size_t count)

/**
  * __iowrite64_copy - copy data to MMIO space, in 64-bit or 32-bit units
  * @to: destination, in MMIO space (must be 64-bit aligned)
  * @from: source (must be 64-bit aligned)
  * @count: number of 64-bit quantities to copy
  *
  * Copy data from kernel space to MMIO space, in units of 32 or 64 bits at a
  * time.  Order of access is not guaranteed, nor is a memory barrier
  * performed afterwards.
  */
#ifndef __iowrite64_copy
void __iowrite64_copy(void __iomem *to, const void *from, size_t count)

The arm64 comment in arch/arm64/include/asm/io.h says:

/*
  * The ARM64 iowrite implementation is intended to support drivers that want to
  * use write combining. For instance PCI drivers using write combining with a 64
  * byte __iowrite64_copy() expect to get a 64 byte MemWr TLP on the PCIe bus.
  *
  * Newer ARM core have sensitive write combining buffers, it is important that
  * the stores be contiguous blocks of store instructions. Normal memcpy
  * approaches have a very low chance to generate write combining.
  *
  * Since this is the only API on ARM64 that should be used with write combining
  * it also integrates the DGH hint which is supposed to lower the latency to
  * emit the large TLP from the CPU.
  */

So my reading is that dgh() in the arm64 implementation is there for the
write-combining/gathering behavior. Replacing it with dmb would make this
path stronger than the generic API contract and could penalize performance
of the WC use case.

For the scalar MMIO helpers, the workaround promotes the raw writes to
store-release on affected CPUs, as in v1/v2 and as shown below. For the
memcpy-to-IO helpers, could you please clarify the specific reason for adding
a dmb despite the documented no-ordering contract? Is the concern that some
drivers may be relying on ordering across memcpy_toio_*() today even though
the API does not guarantee it, and that we should cover those cases
defensively?

I would prefer to avoid replacing dgh() with a dmb unless there is a strong
reason to do so. Please let me know whether I can post v4 with the change
below, keeping dgh() as-is in the memcpy-to-IO path.

  #define __raw_writeb __raw_writeb
  static __always_inline void __raw_writeb(u8 val, volatile void __iomem *addr)
  {
-       volatile u8 __iomem *ptr = addr;
-       asm volatile("strb %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+       asm volatile(ALTERNATIVE("strb %w0, [%1]",
+                                "stlrb %w0, [%1]",
+                                ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+                    : : "rZ" (val), "r" (addr));
  }

  #define __raw_writew __raw_writew
  static __always_inline void __raw_writew(u16 val, volatile void __iomem *addr)
  {
-       volatile u16 __iomem *ptr = addr;
-       asm volatile("strh %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+       asm volatile(ALTERNATIVE("strh %w0, [%1]",
+                                "stlrh %w0, [%1]",
+                                ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+                    : : "rZ" (val), "r" (addr));
  }

  #define __raw_writel __raw_writel
  static __always_inline void __raw_writel(u32 val, volatile void __iomem *addr)
  {
-       volatile u32 __iomem *ptr = addr;
-       asm volatile("str %w0, %1" : : "rZ" (val), "Qo" (*ptr));
+       asm volatile(ALTERNATIVE("str %w0, [%1]",
+                                "stlr %w0, [%1]",
+                                ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+                    : : "rZ" (val), "r" (addr));
  }

  #define __raw_writeq __raw_writeq
  static __always_inline void __raw_writeq(u64 val, volatile void __iomem *addr)
  {
-       volatile u64 __iomem *ptr = addr;
-       asm volatile("str %x0, %1" : : "rZ" (val), "Qo" (*ptr));
+       asm volatile(ALTERNATIVE("str %x0, [%1]",
+                                "stlr %x0, [%1]",
+                                ARM64_WORKAROUND_DEVICE_STORE_RELEASE)
+                    : : "rZ" (val), "r" (addr));
  }
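
To make the intended behaviour of the ALTERNATIVE() form concrete, here is a
small user-space model. It is only a sketch under loud assumptions: a function
pointer swapped once at init stands in for boot-time instruction patching, and
C11 relaxed/release stores stand in for str/stlr. C11 atomics do not model
Device memory or the erratum itself, and all names in it are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the plain str path used on unaffected CPUs. */
static void store_plain(uint32_t val, _Atomic uint32_t *addr)
{
	atomic_store_explicit(addr, val, memory_order_relaxed);
}

/* Stand-in for the stlr path; an analogy only, not Device semantics. */
static void store_release(uint32_t val, _Atomic uint32_t *addr)
{
	atomic_store_explicit(addr, val, memory_order_release);
}

/* Default sequence, as if the capability had not been detected. */
static void (*raw_writel_impl)(uint32_t, _Atomic uint32_t *) = store_plain;

/* Hypothetical model of patching: swaps the store only on affected parts. */
static void model_apply_alternatives(bool cpu_affected)
{
	if (cpu_affected)
		raw_writel_impl = store_release;
}

/* Callers see one helper; which store it issues was decided at init. */
static void model_raw_writel(uint32_t val, _Atomic uint32_t *addr)
{
	assert(addr != NULL);
	raw_writel_impl(val, addr);
}
```

The point of the model is that callers never test the capability themselves:
the selection happens once, and unaffected parts keep the plain store.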

-Shanker