[PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA
Alex Williamson
alex.williamson at redhat.com
Tue Jul 30 23:11:09 EDT 2013
On Wed, 2013-07-31 at 09:35 +0900, Takao Indoh wrote:
> (2013/07/31 0:59), Bjorn Helgaas wrote:
> > On Tue, Jul 30, 2013 at 12:09 AM, Takao Indoh
> > <indou.takao at jp.fujitsu.com> wrote:
> >> (2013/07/29 23:17), Bjorn Helgaas wrote:
> >>> On Sun, Jul 28, 2013 at 6:37 PM, Takao Indoh <indou.takao at jp.fujitsu.com> wrote:
> >>>> (2013/07/26 2:00), Bjorn Helgaas wrote:
> >
> >>>>> My point about IOMMU and PCI initialization order doesn't go away just
> >>>>> because it doesn't fit "kdump policy." Having system initialization
> >>>>> occur in a logical order is far more important than making kdump work.
> >>>>
> >>>> My next plan is as follows. I think this is matched to logical order
> >>>> on boot.
> >>>>
> >>>> drivers/pci/pci.c
> >>>> - Add function to reset bus, for example, pci_reset_bus(struct pci_bus *bus)
> >>>>
> >>>> drivers/iommu/intel-iommu.c
> >>>> - On initialization, if IOMMU is already enabled, call this bus reset
> >>>> function before disabling and re-enabling IOMMU.
> >>>
> >>> I raised this issue because of arches like sparc that enumerate the
> >>> IOMMU before the PCI devices that use it. In that situation, I think
> >>> you're proposing this:
> >>>
> >>> panic kernel
> >>> enable IOMMU
> >>> panic
> >>> kdump kernel
> >>> initialize IOMMU (already enabled)
> >>> pci_reset_bus
> >>> disable IOMMU
> >>> enable IOMMU
> >>> enumerate PCI devices
> >>>
> >>> But the problem is that when you call pci_reset_bus(), you haven't
> >>> enumerated the PCI devices, so you don't know what to reset.
> >>
> >> Right, so my idea is adding reset code into "intel-iommu.c". intel-iommu
> >> initialization is based on the assumption that enumeration of PCI devices
> >> is already done. We can find target device from IOMMU page table instead
> >> of scanning all devices in pci tree.
> >>
> >> Therefore, this idea is only for intel-iommu. Other architectures need
> >> to implement their own reset code.
> >
> > That's my point. I'm opposed to adding code to PCI when it only
> > benefits x86 and we know other arches will need a fundamentally
> > different design. I would rather have a design that can work for all
> > arches.
> >
> > If your implementation is totally implemented under arch/x86 (or in
> > intel-iommu.c, I guess), I can't object as much. However, I think
> > that eventually even x86 should enumerate the IOMMUs via ACPI before
> > we enumerate PCI devices.
> >
> > It's pretty clear that's how BIOS designers expect the OS to work.
> > For example, sec 8.7.3 of the Intel Virtualization Technology for
> > Directed I/O spec, rev 1.3, shows the expectation that remapping
> > hardware (IOMMU) is initialized before discovering the I/O hierarchy
> > below a hot-added host bridge. Obviously you're not talking about a
> > hot-add scenario, but I think the same sequence should apply at
> > boot-time as well.
>
> Of course I won't add something just for x86 into common PCI layer. I
> attach my new patch, though it is not well tested yet.
>
> On x86, currently IOMMU initialization run *after* PCI enumeration, but
> what you are talking about is that it should be changed so that x86
> IOMMU initialization is done *before* PCI enumeration like sparc, right?
>
> Hmm, ok, I think I need to post attached patch to iommu list and
> discuss it including current order of x86 IOMMU initialization.
>
> Thanks,
> Takao Indoh
> ---
> drivers/iommu/intel-iommu.c | 55 +++++++++++++++++++++++++++++++++-
> drivers/pci/pci.c | 53 ++++++++++++++++++++++++++++++++
> include/linux/pci.h | 1
> 3 files changed, 108 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/iommu/intel-iommu.c b/drivers/iommu/intel-iommu.c
> index eec0d3e..fb8a546 100644
> --- a/drivers/iommu/intel-iommu.c
> +++ b/drivers/iommu/intel-iommu.c
> @@ -3663,6 +3663,56 @@ static struct notifier_block device_nb = {
> .notifier_call = device_notifier,
> };
>
> +/* Reset PCI device if its entry exists in DMAR table */
> +static void __init iommu_reset_devices(struct intel_iommu *iommu, u16 segment)
> +{
> + u64 addr;
> + struct root_entry *root;
> + struct context_entry *context;
> + int bus, devfn;
> + struct pci_dev *dev;
> +
> + addr = dmar_readq(iommu->reg + DMAR_RTADDR_REG);
> + if (!addr)
> + return;
> +
> + /*
> + * In the case of kdump, ioremap is needed because root-entry table
> + * exists in first kernel's memory area which is not mapped in second
> + * kernel
> + */
> + root = (struct root_entry*)ioremap(addr, PAGE_SIZE);
> + if (!root)
> + return;
> +
> + for (bus=0; bus<ROOT_ENTRY_NR; bus++) {
> + if (!root_present(&root[bus]))
> + continue;
> +
> + context = (struct context_entry *)ioremap(
> + root[bus].val & VTD_PAGE_MASK, PAGE_SIZE);
> + if (!context)
> + continue;
> +
> + for (devfn=0; devfn<ROOT_ENTRY_NR; devfn++) {
> + if (!context_present(&context[devfn]))
> + continue;
> +
> + dev = pci_get_domain_bus_and_slot(segment, bus, devfn);
> + if (!dev)
> + continue;
> +
> + if (!pci_reset_bus(dev->bus)) /* go to next bus */
> + break;
> + else /* Try per-function reset */
> + pci_reset_function(dev);
> +
> + }
> + iounmap(context);
> + }
> + iounmap(root);
> +}
> +
> int __init intel_iommu_init(void)
> {
> int ret = 0;
> @@ -3687,8 +3737,11 @@ int __init intel_iommu_init(void)
> continue;
>
> iommu = drhd->iommu;
> - if (iommu->gcmd & DMA_GCMD_TE)
> + if (iommu->gcmd & DMA_GCMD_TE) {
> + if (reset_devices)
> + iommu_reset_devices(iommu, drhd->segment);
> iommu_disable_translation(iommu);
> + }
> }
>
> if (dmar_dev_scope_init() < 0) {
> diff --git a/drivers/pci/pci.c b/drivers/pci/pci.c
> index e37fea6..c595997 100644
> --- a/drivers/pci/pci.c
> +++ b/drivers/pci/pci.c
> @@ -3392,6 +3392,59 @@ int pci_reset_function(struct pci_dev *dev)
> EXPORT_SYMBOL_GPL(pci_reset_function);
>
> /**
> + * pci_reset_bus - reset a PCI bus
> + * @bus: PCI bus to reset
> + *
> + * Returns 0 if the bus was successfully reset or negative if failed.
> + */
> +int pci_reset_bus(struct pci_bus *bus)
> +{
> + struct pci_dev *pdev;
> + u16 ctrl;
> +
> + if (!bus->self)
> + return -ENOTTY;
> +
> + list_for_each_entry(pdev, &bus->devices, bus_list)
> + if (pdev->subordinate)
> + return -ENOTTY;
> +
> + /* Save config registers of children */
> + list_for_each_entry(pdev, &bus->devices, bus_list) {
> + dev_info(&pdev->dev, "Save state\n");
> + pci_save_state(pdev);
> + }
> +
> + dev_info(&bus->self->dev, "Reset Secondary bus\n");
> +
> + /* Assert Secondary Bus Reset */
> + pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl);
> + ctrl |= PCI_BRIDGE_CTL_BUS_RESET;
> + pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl);
> +
> + /* Read config again to flush previous write */
> + pci_read_config_word(bus->self, PCI_BRIDGE_CONTROL, &ctrl);
> +
> + msleep(2);
> +
> + /* De-assert Secondary Bus Reset */
> + ctrl &= ~PCI_BRIDGE_CTL_BUS_RESET;
> + pci_write_config_word(bus->self, PCI_BRIDGE_CONTROL, ctrl);
> +
> + /* Wait for completion */
> + msleep(1000);
We already have secondary bus reset code in this file, why are we
duplicating it here? Also, why are these delays different from the
existing code? I'm also in need of a bus reset interface for when we
assign all of the devices on a bus to userspace and do not have working
function level resets per device. I'll post my patch series and perhaps
we can collaborate on a pci bus reset interface. Thanks,
Alex
> +
> + /* Restore config registers of children */
> + list_for_each_entry(pdev, &bus->devices, bus_list) {
> + dev_info(&pdev->dev, "Restore state\n");
> + pci_restore_state(pdev);
> + }
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(pci_reset_bus);
> +
> +/**
> * pcix_get_max_mmrbc - get PCI-X maximum designed memory read byte count
> * @dev: PCI device to query
> *
> diff --git a/include/linux/pci.h b/include/linux/pci.h
> index 0fd1f15..125fbc6 100644
> --- a/include/linux/pci.h
> +++ b/include/linux/pci.h
> @@ -924,6 +924,7 @@ int pcie_set_mps(struct pci_dev *dev, int mps);
> int __pci_reset_function(struct pci_dev *dev);
> int __pci_reset_function_locked(struct pci_dev *dev);
> int pci_reset_function(struct pci_dev *dev);
> +int pci_reset_bus(struct pci_bus *bus);
> void pci_update_resource(struct pci_dev *dev, int resno);
> int __must_check pci_assign_resource(struct pci_dev *dev, int i);
> int __must_check pci_reassign_resource(struct pci_dev *dev, int i, resource_size_t add_size, resource_size_t align);
>
More information about the kexec
mailing list