[PATCH v2] PCI: Reset PCIe devices to stop ongoing DMA

Thu Jun 13 22:11:02 EDT 2013

(2013/06/13 12:41), Bjorn Helgaas wrote:
> On Wed, Jun 12, 2013 at 8:44 PM, Takao Indoh <indou.takao at jp.fujitsu.com> wrote:
>> (2013/06/12 13:45), Bjorn Helgaas wrote:
>>> [+cc Vivek, Haren; sorry I didn't think to add you earlier]
>>>
>>> On Tue, Jun 11, 2013 at 12:08 AM, Takao Indoh
>>> <indou.takao at jp.fujitsu.com> wrote:
>>>> (2013/06/11 11:20), Bjorn Helgaas wrote:
>>>
>>>>> I'm not sure you need to reset legacy devices (or non-PCI devices)
>>>>> yet, but the current hook isn't anchored anywhere -- it's just an
>>>>> fs_initcall() that doesn't give the reader any clue about the
>>>>> connection between the reset and the problem it's solving.
>>>>>
>>>>> If we do something like this patch, I think it needs to be done at the
>>>>> point where we enable or disable the IOMMU.  That way, it's connected
>>>>> to the important event, and there's a clue about how to make
>>>>> corresponding fixes for other IOMMUs.
>>>>
>>>> Ok. pci_iommu_init() is appropriate place to add this hook?
>>>
>>> I looked at various IOMMU init places today, and it's far more
>>> complicated and varied than I had hoped.
>>>
>>> This reset scheme depends on enumerating PCI devices before we
>>> initialize the IOMMU used by those devices.  x86 works that way today,
>>> but not all architectures do (see the sparc pci_fire_pbm_init(), for
>>
>> Sorry, could you tell me which part depends on architecture?
> 
> Your patch works if PCIe devices are reset before the kdump kernel
> enables the IOMMU.  On x86, this is possible because PCI enumeration
> happens before the IOMMU initialization.  On sparc, the IOMMU is
> initialized before PCI devices are enumerated, so there would still be
> a window where ongoing DMA could cause an IOMMU error.

Ok, understood, thanks.

Hmmm, it seems to be difficult to find out method which is common to
all architectures. So, what I can do for now is introducing reset scheme
which is only for x86.

1) Change this patch so that it work only on x86 platform. For example
   call this reset code from x86_init.iommu.iommu_init() instead of
   fs_initcall.

Or another idea is:

2) Enumerate PCI devices in IOMMU layer. That is:
   PCI layer
     Just provide interface to reset given strcut pci_dev. Maybe
     pci_reset_function() looks good for this purpose.
   IOMMU layer
     Determine which devices should be reset. On kernel boot, check if
     IOMMU is already active or not, and if active, check IOMMU page
     table and reset devices whose entry exists there.

> Of course, it might be possible to reorganize the sparc code to to the
> IOMMU init *after* it enumerates PCI devices.  But I think that change
> would be hard to justify.
> 
> And I think even on x86, it would be better if we did the IOMMU init
> before PCI enumeration -- the PCI devices depend on the IOMMU, so
> logically the IOMMU should be initialized first so the PCI devices can
> be associated with it as they are enumerated.

So third idea is:

3) Do reset before PCI enumeration(arch_initcall_sync or somewhere). We
   need to implement new code to enumerate PCI devices and reset them
   for this purpose.

Idea 2 is not difficult to implement, but one problem is that this
method may be dangerous. We need to scan IOMMU page table which is used
in previous kernel, but it may be broken. Idea 3 seems to be difficult
to implement...

> 
>>> example).  And I think conceptually, the IOMMU should be enumerated
>>> and initialized *before* the devices that use it.
>>>
>>> So I'm uncomfortable with that aspect of this scheme.
>>>
>>> It would be at least conceivable to reset the devices in the system
>>> kernel, before the kexec.  I know we want to do as little as possible
>>> in the crashing kernel, but it's at least a possibility, and it might
>>> be cleaner.
>>
>> I bet this will be not accepted by kdump maintainer. Everything in panic
>> kernel is unreliable.
> 
> kdump is inherently unreliable.  The kdump kernel doesn't start from
> an arbitrary machine state.  We don't expect it to tolerate all CPUs
> running, for example.  Maybe it should be expected to tolerate PCI
> devices running, either.

What I wanted to say is that any resources of first kernel are
unreliable. Under panic situation, struct pci_dev tree may be broken, or
pci_lock may be already hold by someone, etc. So, if we do this in first
kernel, maybe kdump needs its own code to enumerate PCI devices and
reset them.

Vivek?

Thanks,
Takao Indoh