[PATCH v4 2/2] PCI: quirks: Fix ThunderX2 dma alias handling

Jayachandran C jnair at caviumnetworks.com
Wed Apr 12 13:41:20 PDT 2017


On Wed, Apr 12, 2017 at 02:11:38PM -0500, Bjorn Helgaas wrote:
> On Wed, Apr 12, 2017 at 06:10:34PM +0000, Jayachandran C wrote:
> > On Wed, Apr 12, 2017 at 11:21:18AM -0500, Bjorn Helgaas wrote:
> > > On Tue, Apr 11, 2017 at 03:27:02PM +0000, Jayachandran C wrote:
> > > > On Tue, Apr 11, 2017 at 08:41:25AM -0500, Bjorn Helgaas wrote:
> > > > > [+cc Joerg]
> > > > > 
> > > > > On Tue, Apr 11, 2017 at 07:10:48AM +0000, Jayachandran C wrote:
> > > > > > On Mon, Apr 10, 2017 at 08:28:47PM -0500, Bjorn Helgaas wrote:
> > > > > > > Hi Jayachandran,
> > > > > > > 
> > > > > > > On Mon, Apr 03, 2017 at 01:15:04PM +0000, Jayachandran C wrote:
> > > > > > > > The Cavium ThunderX2 arm64 SoCs (called Broadcom Vulcan earlier), the PCI
> > > > > > > > topology is slightly unusual. For a multi-node system, it looks like:
> > > > > > > > 
> > > > > > > > [node level PCI bridges - one per node]
> > > > > > > >     [SoC PCI devices with MSI-X but no IOMMU]
> > > > > > > >     [PCI-PCIe "glue" bridges - upto 14, one per real port below]
> > > > > > > >         [PCIe real root ports associated with IOMMU and GICv3 ITS]
> > > > > > > >             [External PCI devices connected to PCIe links]
> > > > > > > > 
> > > > > > > > The top two levels of bridges should have introduced aliases since they
> > > > > > > > are PCI and PCI/PCIe bridges, but in the case of ThunderX2 they do not.
> > > > > > > > In the case of external PCIe devices, the "real" root ports are connected
> > > > > > > > to the SMMU and the GIC ITS, so PCI-PCIe bridge does not introduce an
> > > > > > > > alias. The SoC PCI devices are directly connected to the GIC ITS, so the
> > > > > > > > node level bridges do not introduce an alias either.
> > > > > > > > 
> > > > > > > > To handle this quirk, we mark the real PCIe root ports and node level
> > > > > > > > PCI bridges with the flag PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT.  With this,
> > > > > > > > pci_for_each_dma_alias() works correctly for external PCIe devices and
> > > > > > > > SoC PCI devices.
> > > > > > > > 
> > > > > > > > For the current revision of Cavium ThunderX2, the VendorID and Device ID
> > > > > > > > are from Broadcom Vulcan (14e4:90XX).
> > > > > > > 
> > > > > > > Can you supply some text here about why we want to apply this patch?
> > > > > > > E.g., does it avoid making unnecessary IOMMU mappings, improve
> > > > > > > performance, avoid a crash, etc?
> > > > > > 
> > > > > > If this is for the commit message, I hope the following is ok:
> > > > > > 
> > > > > > "With this change, both MSI-X and IO virtualization work correctly on
> > > > > > Cavium ThunderX2. The GIC ITS driver gets the correct device ID to
> > > > > > configure MSI-X, the SMMUv3 driver gets the correct Stream IDs for PCI
> > > > > > devices, and the IOMMU groups are setup correctly."
> > > > > 
> > > > > This doesn't get at what the actual problem is.  I'm hoping for
> > > > > something like "without this change, we set up an IOMMU mapping for
> > > > > requestor ID X, but device DMA uses requestor ID Y because ...., which
> > > > > results in an IOMMU fault"
> > > > 
> > > > Ok. I hope this would be better:
> > > > 
> > > > "Without this change, the last alias seen while traversing the PCI
> > > > hierarchy will be used as the RID to generate the device ID for ITS
> > > > and stream ID for SMMU. This in turn causes the MSI-X generated by the
> > > > device to fail since the ITS expects to have translation tables based
> > > > on the actual PCIe RID and not the (irrelevant) alias. Similarly, the
> > > > device DMA also fails when SMMU is enabled due to incorrect value in
> > > > SMMU translation tables"
> > > 
> > > This description is true, but I don't think it addresses the real
> > > problem.  I think the real problem is that your IOMMU code doesn't
> > > handle aliases correctly, and by ignoring these invalid aliases, we
> > > happen to map an alias that works for the builtin devices.  But that's
> > > only because we got lucky (those devices use a single RID and they're
> > > not behind bridges that optionally take ownership).
> > > 
> > > It would make sense to me if we fixed the IOMMU code to map *all* the
> > > aliases, which should be enough to make your devices work.  If we then
> > > wanted to apply a patch like this on top, it would be simply an
> > > optimization that avoids unnecessary IOMMU mappings.
> > 
> > The issue that the IOMMU code does not handle valid aliases is
> > unrelated to what I am trying to fix. The quirk is to make sure
> > that invalid aliases are not seen on ThunderX2 while doing
> > pci_for_each_dma_alias().
> > 
> > The DMA and MSI-X requests leave the PCI/PCIe hierarchy at the point
> > where the SMMU (or ITS) is attached, i.e. at the bridge marked with
> > PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT. The quirk ensures that we don't look
> > for aliases above that point.
> > 
> > The toplevel bridge is marked PCI_DEV_FLAGS_BRIDGE_XLATE_ROOT since
> > the on-chip devices are directly connected to the ITS (they do not
> > use SMMU).
> > 
> > The top two levels of bridges are not real PCI bridges but just
> > PCI bridge-like things that were added to tie the whole hierarchy
> > together for configuration and enumeration. They do not handle
> > PCI/PCIe transactions in the traditional sense.
> > 
> > I think my problem description is still not correct, maybe:
> > "The SMMU (and ITS) expects to device tables to use the RID seen
> > at the bridge they are associated with. Currently the
> > pci_for_each_dma_alias() code traverses beyond this point and
> > generates incorrect aliases due to the PCI and PCI/PCIe bridges
> > above. This causes MSI-X interrupts and device DMA to fail since
> > the SMMU and ITS tables to be setup with incorrect IDs"
> 
> I haven't tried to figure out the MSI-X piece of this, but let me try
> to come up with a concrete DMA example.  Assume this topology:
> 
>   00:00.0 bridge to [bus 01-1e]
>   01:0a.0 bridge to [bus 04-05]
>   04:00.0 [14e4:9000 or 9084] bridge to [bus 05] (XLATE_ROOT)
>   05:00.0 endpoint
> 
> Assume 05:00.0 generates a DMA request.  Assume the top two bridges
> are such that pci_for_each_dma_alias() includes them as well, so it
> iterates through 05:00.0, 01:0a.0, and 00:00.0.
> 
> When the driver for 05:00.0 makes a DMA mapping, the current code
> apparently makes an IOMMU mapping for requester ID 00:00.0 because
> that's the last alias.  Obviously this doesn't work because the IOMMU
> at 04:00.0 will see a requester ID of 05:00.0, not 00:00.0.
> 
> With this quirk, we'll omit 01:0a.0 and 00:00.0, so we'll make an
> IOMMU mapping for requester ID 05:00.0, which will work fine.  I think
> it would *also* work fine if we made IOMMU mappings for 05:00.0,
> 01:0a.0, and 00:0.0.  The last two are unnecessary, but probably not
> harmful.
 
Ok. The last two are not harmful but still incorrect. The bridges
0:0.0 and 1:a.0 are outside the path with the PCI transactions takes
(they go from 4:0.0 to the SMMU).

> Now assume 05:00 is a multi-function device that has a DMA alias
> quirk, e.g., see quirk_dma_func0_alias().  It has another function:
> 
>   05:00.3 endpoint
> 
> DMA from 05:00.3 may use a requester ID of either 05:00.3 or 05:00.0.
> The driver makes a DMA mapping, pci_for_each_dma_alias() iterates
> through 05:00.3, 05:00.0, 01:0a.0, and 00:00.0, and we again map only
> 00:00.0, which again doesn't work.
> 
> With this quirk, we create a single mapping for 05:00.0.  That will
> work sometimes, but the device may also generate DMA with a requester
> ID of 05:00.3, and that won't work.

Note that here, pci_for_each_dma_alias() works correctly with the quirk
and incorrectly without the quirk. With the quirk, the callback function
is involved on 05:00.3 and 05:00.0 as expected. Without the quirk the
call back is invoked on 05:00.3, 05:00.0, 01:0a.0, and 00:00.0, which
includes aliases that are not valid.

The idea behind the quirk is to get pci_for_each_dma_alias work correctly
on ThunderX2, and invoke the function only on the valid aliases. With the
quirk, the code using it - like the SMMU code - gets the correct aliases
to process, and do not have to deal with or filter out incorrect aliases.

I don't think it would be safe to assume that all callers of
pci_for_each_dma_alias will be ok to handle invalid aliases. Even in
the IOMMU case, the case which calculates stream IDs is one usage, the
usage for iommu groups also did not deal with invalid aliases correctly
when I tried it. Then there are the MSI cases too, I had done some work
on trying to filter out invalid aliases in the relevant code[1], but
in my opinion, getting pci_for_each_dma_alias() do the right thing is
the correct solution.

> If your on-chip device is, e.g., 01:04.0, pci_for_each_dma_alias()
> probably iterates through 01:04.0, 00:00.0.  Today we make an IOMMU
> mapping for 00:00.0, which doesn't work.  With this quirk, we'll
> ignore 00:00.0 and make a mapping for 01:04.0, which does work.  But I
> think if you made the IOMMU code add mappings for each of the aliases,
> i.e., for both 01:04.0 and 00:00.0, that device *would* work even
> without this quirk.

There is no IOMMU for the on-chip devices, but the MSI-X interrupts are
looked up in a table based on DeviceID(RID) by the ARM interrupt
controller. The issue here is similar.

JC.

[1]https://www.spinics.net/lists/arm-kernel/msg573744.html



More information about the linux-arm-kernel mailing list