[PATCH 1/3] x86/quirks: Scan all busses for early PCI quirks

Mon Nov 16 15:31:36 EST 2020

First of all, thanks everybody for the great insights/discussion! This
thread ended-up being a great learning for (at least) me.

Given the big amount of replies and intermixed comments, I wouldn't be
able to respond inline to all, so I'll try another approach below.

>From Bjorn:
"I think [0] proposes using early_quirks() to disable MSIs at boot-time.
That doesn't seem like a robust solution because (a) the problem affects
all arches but early_quirks() is x86-specific and (b) even on x86
early_quirks() only works for PCI segment 0 because it relies on the
0xCF8/0xCFC I/O ports."

Ah. I wasn't aware of that limitation, I thought enhancing the
early_quirks() search to go through all buses would fix that, thanks for
the clarification! And again, worth to clarify that this is not a
problem affecting all arches _in practice_ - PowerPC for example has the
FW primitives allowing a powerful PCI controller (out-of-band) reset,
preventing this kind of issue usually.

[0]
https://lore.kernel.org/linux-pci/20181018183721.27467-1-gpiccoli@canonical.com

>From Bjorn:
"A crash_device_shutdown() could do something at the host bridge level
if that's possible, or reset/disable bus mastering/disable MSI/etc on
individual PCI devices if necessary."

>From Lukas:
"Guilherme's original patches from 2018 iterate over all 256 PCI buses.
That might impact boot time negatively.  The reason he has to do that is
because the crashing kernel doesn't know which devices exist and which
have interrupts enabled.  However the crashing kernel has that
information.  It should either disable interrupts itself or pass the
necessary information to the crashing kernel as setup_data or whatever.

Guilherme's patches add a "clearmsi" command line parameter.  I guess
the idea is that the parameter is always passed to the crash kernel but
the patches don't do that, which seems odd."

Very good points Lukas, thanks! The reason of not adding the clearmsi
thing as a default kdump procedure is kinda related to your first point:
it impacts a bit boot-time, also it's an additional logic in the kdump
kernel, which we know is (sometimes) the last resort in gathering
additional data to debug a potentially complex issue. That said, I'd not
like to enforce this kind of "policy" to everybody, so my implementation
relies on having it as a parameter, and the kdump userspace counter-part
could then have a choice in adding or not such mechanism to the kdump
kernel parameter list.

About passing the data to next kernel, this is very interesting, we
could do something like that either through setup_data (as you said) or
using a new proposal, the PKRAM thing [a].
This way we wouldn't need a crash_device_shutdown(), but instead when
kernel is loading a crash kernel (through kexec -p) we can collect all
the necessary information that'll be passed to the crash kernel
(obviously that if we are collecting PCI topology information, we need
callbacks in the PCI hotplug add/del path to update this information).

[a]
https://lore.kernel.org/lkml/1588812129-8596-1-git-send-email-anthony.yznaga@oracle.com/

Below, inline reply to Eric's last message.

On 15/11/2020 17:46, Eric W. Biederman wrote:
>> [...]
>> An MSI interrupt is a (DMA) write to the local APIC base address
>> 0xfeexxxxx which has the target CPU and control bits encoded in the
>> lower bits. The written data is the vector and more control bits.
>>
>> The only way to stop that is to shut it up at the PCI device level.
> 
> Or to write to magic chipset registers that will stop transforming DMA
> writes to 0xfeexxxxx into x86 interrupts.  With an IOMMU I know x86 has
> such registers (because the point of the IOMMU is to limit the problems
> rogue devices can cause).  Without an IOMMU I don't know if x86 has any
> such registers.  I remember that other platforms have an interrupt
> controller that does provide some level of control.  That x86 does not
> is what makes this an x86 specific problem.
> [...]
> Looking at patch 3/3 what this patchset does is an early disable of
> of the msi registers.  Which is mostly reasonable.  Especially as has
> been pointed out the only location the x86 vector and x86 cpu can
> be found is in the msi configuration registers.
> 
> That also seems reasonable.  But Bjorn's concern about not finding all
> devices in all domains does seem real.
> [...]
> So if we can safely and reliably disable DMA and MSI at the generic PCI
> device level during boot up I am all for it.
> 
> 
> How difficult would it be to tell the IOMMUs to stop passing traffic
> through in an early pci quirk?  The problem setup was apparently someone
> using the device directly from a VM.  So I presume there is an IOMMU
> in that configuration.

This is a good idea I think - we considered something like that in
theory while working the problem back in 2018, but given I'm even less
expert in IOMMU that I already am in PCI, the option was to go with the
PCI approach. And you are right, the original problem is a device in
PCI-PT to a VM, and a tool messing with the PF device in the SR-IOV (to
collect some stats) from the host side, through VFIO IIRC.

Anyway, we could split the problem in two after all the considerations
in the thread, I believe:

(1) If possible to set-up the IOMMU to prevent MSIs, by "blocking" the
DMA writes for PCI devices *until* PCI core code properly initialize the
devices, that'd handle the majority of the cases I guess (IOMMU usage is
quite common nowadays).

(2) Collecting PCI topology information in a running kernel and passing
that to the kdump kernel would allow us to disable the PCI devices MSI
capabilities, the only problem here is that I couldn't see how to do
that early enough, before the 'sti' executes, without relying in
early_quirks(). ACPI maybe? As per Bjorn comment when explaining the
limitations of early_quirks(), this problem seems not to be trivial.

I'm a bit against doing that in crash_kexec() for the reasons mentioned
in the thread, specially by Thomas and Eric - but if there's no other
way...maybe this is the way to go.

Cheers,

Guilherme

P.S. Fixed Gavin's bouncing address, sorry for that.