[PATCH v2 1/7] iommu: Do not call pci_dev_reset_iommu_done() unless reset succeeds
Shuai Xue
xueshuai at linux.alibaba.com
Wed Mar 18 01:02:01 PDT 2026
On 3/18/26 3:15 AM, Nicolin Chen wrote:
> IOMMU drivers handle ATC cache maintenance. They may encounter ATC-related
> errors (e.g., ATC invalidation request timeout), indicating that ATC cache
> might have stale entries that can corrupt the memory. In this case, IOMMU
> driver has no choice but to block the device's ATS function and wait for a
> device recovery.
>
> The pci_dev_reset_iommu_done() called at the end of a reset function could
> serve as a reliable signal to the IOMMU subsystem that the physical device
> cache is completely clean. However, the function is called unconditionally
> even if the reset operation had actually failed, which would re-attach the
> faulty device back to a normal translation domain. And this will leave the
> system highly exposed, creating vulnerabilities for data corruption:
> IOMMU blocks RID/ATS
> pci_reset_function():
> pci_dev_reset_iommu_prepare(); // Block RID/ATS
> __reset(); // Failed (ATC is still stale)
> pci_dev_reset_iommu_done(); // Unblock RID/ATS (ah-ha)
>
> The simplest fix is to use pci_dev_reset_iommu_done() only on a successful
> reset:
> IOMMU blocks RID/ATS
> pci_reset_function():
> pci_dev_reset_iommu_prepare(); // Block RID/ATS
> if (!__reset())
> pci_dev_reset_iommu_done(); // Unblock RID/ATS
> else
> // Keep the device blocked by IOMMU
>
> However, this breaks the symmetric requirement of these reset APIs so that
> we have to allow a re-entry to pass a second reset attempt:
> IOMMU blocks RID/ATS
> pci_reset_function():
> pci_dev_reset_iommu_prepare(); // Block RID/ATS
> __reset(); // Failed (ATC is still stale)
> // Keep the device blocked by IOMMU
> Second reset:
> pci_reset_function():
> pci_dev_reset_iommu_prepare(); // Re-entry (!)
>
> Update the function kdocs and all the existing callers to only unblock ATS
> when the reset succeeds. Drop the WARN_ON in pci_dev_reset_iommu_prepare()
> to allow re-entries.
>
> Signed-off-by: Nicolin Chen <nicolinc at nvidia.com>
> ---
> drivers/iommu/iommu.c | 16 +++++++++-----
> drivers/pci/pci-acpi.c | 11 +++++++++-
> drivers/pci/pci.c | 50 +++++++++++++++++++++++++++++++++++++-----
> drivers/pci/quirks.c | 11 +++++++++-
> 4 files changed, 75 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index 35db517809540..40a15c9360bd1 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -3938,8 +3938,10 @@ EXPORT_SYMBOL_NS_GPL(iommu_replace_group_handle, "IOMMUFD_INTERNAL");
> * IOMMU activity while leaving the group->domain pointer intact. Later when the
> * reset is finished, pci_dev_reset_iommu_done() can restore everything.
> *
> - * Caller must use pci_dev_reset_iommu_prepare() with pci_dev_reset_iommu_done()
> - * before/after the core-level reset routine, to unset the resetting_domain.
> + * Caller must use pci_dev_reset_iommu_done() after a successful PCI-level reset
> + * to unset the resetting_domain. If the reset fails, caller can choose to keep
> + * the device in the resetting_domain to protect system memory using IOMMU from
> + * any bad ATS.
I think you mean ...to protect system memory from DMA via stale ATC entries.
> *
> * Return: 0 on success or negative error code if the preparation failed.
> *
> @@ -3961,9 +3963,9 @@ int pci_dev_reset_iommu_prepare(struct pci_dev *pdev)
>
> guard(mutex)(&group->mutex);
>
> - /* Re-entry is not allowed */
> - if (WARN_ON(group->resetting_domain))
> - return -EBUSY;
> + /* Already prepared */
> + if (group->resetting_domain)
> + return 0;
Could you elaborate more on why Re-entry is allowed now? For example:
/*
* Allow re-entry: if a previous reset failed, the device remains in
* resetting_domain. A subsequent reset attempt must be able to call
* prepare() again without failing.
*/
>
> ret = __iommu_group_alloc_blocking_domain(group);
> if (ret)
> @@ -4001,7 +4003,9 @@ EXPORT_SYMBOL_GPL(pci_dev_reset_iommu_prepare);
> * re-attaching all RID/PASID of the device's back to the domains retained in
> * the core-level structure.
> *
> - * Caller must pair it with a successful pci_dev_reset_iommu_prepare().
> + * This is a pairing function for pci_dev_reset_iommu_prepare(). Caller should
> + * use it on a successful PCI-level reset. Otherwise, it's suggested for caller
> + * to keep the device in the resetting_domain to protect system memory.
> *
> * Note that, although unlikely, there is a risk that re-attaching domains might
> * fail due to some unexpected happening like OOM.
> diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
> index 4d0f2cb6c695b..f1a918938242c 100644
> --- a/drivers/pci/pci-acpi.c
> +++ b/drivers/pci/pci-acpi.c
> @@ -16,6 +16,7 @@
> #include <linux/pci_hotplug.h>
> #include <linux/module.h>
> #include <linux/pci-acpi.h>
> +#include <linux/pci-ats.h>
> #include <linux/pci-ecam.h>
> #include <linux/pm_runtime.h>
> #include <linux/pm_qos.h>
> @@ -977,7 +978,15 @@ int pci_dev_acpi_reset(struct pci_dev *dev, bool probe)
> ret = -ENOTTY;
> }
>
> - pci_dev_reset_iommu_done(dev);
> + /*
> + * The reset might be invoked to recover a serious error. E.g. when the
> + * ATC failed to invalidate its stale entries, which can result in data
> + * corruption. Thus, do not unblock ATS until a successful reset.
> + */
> + if (!ret || !pci_ats_supported(dev))
> + pci_dev_reset_iommu_done(dev);
pci_dev_reset_iommu_prepare() already does an early return for non-ATS
devices, and pci_dev_reset_iommu_done() also early-returns
when !pci_ats_supported(). So for non-ATS devices, done()
is always a no-op regardless of whether it's called or not.
Do we really need check pci_ats_supported(dev) here?
> + else
> + pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
> return ret;
> }
>
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index 8479c2e1f74f1..80c5cf6eeebdc 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -4358,7 +4358,15 @@ int pcie_flr(struct pci_dev *dev)
>
> ret = pci_dev_wait(dev, "FLR", PCIE_RESET_READY_POLL_MS);
> done:
> - pci_dev_reset_iommu_done(dev);
> + /*
> + * The reset might be invoked to recover a serious error. E.g. when the
> + * ATC failed to invalidate its stale entries, which can result in data
> + * corruption. Thus, do not unblock ATS until a successful reset.
> + */
> + if (!ret || !pci_ats_supported(dev))
> + pci_dev_reset_iommu_done(dev);
> + else
> + pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
> return ret;
> }
> EXPORT_SYMBOL_GPL(pcie_flr);
> @@ -4436,7 +4444,15 @@ static int pci_af_flr(struct pci_dev *dev, bool probe)
>
> ret = pci_dev_wait(dev, "AF_FLR", PCIE_RESET_READY_POLL_MS);
> done:
> - pci_dev_reset_iommu_done(dev);
> + /*
> + * The reset might be invoked to recover a serious error. E.g. when the
> + * ATC failed to invalidate its stale entries, which can result in data
> + * corruption. Thus, do not unblock ATS until a successful reset.
> + */
> + if (!ret || !pci_ats_supported(dev))
> + pci_dev_reset_iommu_done(dev);
> + else
> + pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
> return ret;
> }
>
> @@ -4490,7 +4506,15 @@ static int pci_pm_reset(struct pci_dev *dev, bool probe)
> pci_dev_d3_sleep(dev);
>
> ret = pci_dev_wait(dev, "PM D3hot->D0", PCIE_RESET_READY_POLL_MS);
> - pci_dev_reset_iommu_done(dev);
> + /*
> + * The reset might be invoked to recover a serious error. E.g. when the
> + * ATC failed to invalidate its stale entries, which can result in data
> + * corruption. Thus, do not unblock ATS until a successful reset.
> + */
> + if (!ret || !pci_ats_supported(dev))
> + pci_dev_reset_iommu_done(dev);
> + else
> + pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
> return ret;
> }
>
> @@ -4933,7 +4957,15 @@ static int pci_reset_bus_function(struct pci_dev *dev, bool probe)
>
> rc = pci_parent_bus_reset(dev, probe);
> done:
> - pci_dev_reset_iommu_done(dev);
> + /*
> + * The reset might be invoked to recover a serious error. E.g. when the
> + * ATC failed to invalidate its stale entries, which can result in data
> + * corruption. Thus, do not unblock ATS until a successful reset.
> + */
> + if (!rc || !pci_ats_supported(dev))
> + pci_dev_reset_iommu_done(dev);
> + else
> + pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
> return rc;
> }
>
> @@ -4978,7 +5010,15 @@ static int cxl_reset_bus_function(struct pci_dev *dev, bool probe)
> pci_write_config_word(bridge, dvsec + PCI_DVSEC_CXL_PORT_CTL,
> reg);
>
> - pci_dev_reset_iommu_done(dev);
> + /*
> + * The reset might be invoked to recover a serious error. E.g. when the
> + * ATC failed to invalidate its stale entries, which can result in data
> + * corruption. Thus, do not unblock ATS until a successful reset.
> + */
> + if (!rc || !pci_ats_supported(dev))
> + pci_dev_reset_iommu_done(dev);
> + else
> + pci_warn(dev, "Reset failed. Blocking ATS to protect memory\n");
> return rc;
> }
cxl_reset_bus_function() calls pci_reset_bus_function() which already
contains the conditional pci_dev_reset_iommu_done() logic. Then the
outer function calls done() again. The second call is a no-op because
resetting_domain has already been cleared, but this is confusing.
Additionally, cxl_reset_bus_function() does its own
pci_dev_reset_iommu_prepare(), while pci_reset_bus_function() also
calls prepare() internally. With the re-entry change in this patch,
the nested prepare() silently succeeds.
Is it a expected behavior?
Thanks.
Shuai
More information about the linux-arm-kernel
mailing list