[PATCH 0/8] io-pgtable lock removal
Robin Murphy
robin.murphy at arm.com
Tue Jun 20 06:37:51 PDT 2017
On 15/06/17 01:40, Ray Jui wrote:
> Hi Robin,
>
> I have applied this patch series on top of v4.12-rc4, and ran various
> Ethernet and NVMf target throughput tests on it.
>
> To give you some background of my setup:
>
> The system is an ARMv8-based system with 8 cores. It has various PCIe
> root complexes that can be used to connect to PCIe endpoint devices,
> including NIC cards and NVMe SSDs.
>
> I'm particularly interested in the performance of the PCIe root complex
> that connects to the NIC card, so during my tests the IOMMU is
> enabled/disabled only for that particular root complex. The root
> complexes connected to the NVMe SSDs remain unchanged (i.e., without IOMMU).
>
> For the Ethernet throughput out of 50G link:
>
> Note that during the multiple-TCP-session tests, the sessions are
> spread across different CPU cores for optimal performance.
>
> Without IOMMU:
>
> TX TCP x1 - 29.7 Gbps
> TX TCP x4 - 30.5 Gbps
> TX TCP x8 - 28 Gbps
>
> RX TCP x1 - 15 Gbps
> RX TCP x4 - 33.7 Gbps
> RX TCP x8 - 36 Gbps
>
> With IOMMU, but without your latest patch:
>
> TX TCP x1 - 15.2 Gbps
> TX TCP x4 - 14.3 Gbps
> TX TCP x8 - 13 Gbps
>
> RX TCP x1 - 7.88 Gbps
> RX TCP x4 - 13.2 Gbps
> RX TCP x8 - 12.6 Gbps
>
> With IOMMU and your latest patch:
>
> TX TCP x1 - 21.4 Gbps
> TX TCP x4 - 30.5 Gbps
> TX TCP x8 - 21.3 Gbps
>
> RX TCP x1 - 7.7 Gbps
> RX TCP x4 - 20.1 Gbps
> RX TCP x8 - 27.1 Gbps
Cool, those seem more or less in line with expectations. Nate's
currently cooking a patch to further reduce the overhead when unmapping
multi-page buffers, which we believe should make up most of the rest of
the difference.
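
For anyone curious about the crux of the series: the per-domain page
table lock can go away because a newly-allocated next-level table is
published with a single atomic compare-and-swap, so concurrent mappers
of disjoint IOVAs never contend. A minimal standalone sketch of that
idea (illustrative only - the names and types below are not the actual
io-pgtable code):

    #include <stdatomic.h>
    #include <stdint.h>

    typedef uint64_t iopte_t;

    /*
     * Publish a new next-level table with one CAS. If another CPU
     * beat us to it, we simply walk through the winner's table
     * instead (and free ours) - no lock required.
     */
    static iopte_t install_table(_Atomic iopte_t *ptep, iopte_t new_table)
    {
            iopte_t expected = 0;

            if (atomic_compare_exchange_strong_explicit(ptep, &expected,
                            new_table, memory_order_release,
                            memory_order_relaxed))
                    return new_table;   /* we won the race */

            return expected;            /* lost the race; use the winner's table */
    }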
> For the NVMf target test with 4 SSDs (fio-based, random read, 4K block
> size, 8 jobs):
>
> Without IOMMU:
>
> IOPS = 1080K
>
> With IOMMU, but without your latest patch:
>
> IOPS = 520K
>
> With IOMMU and your latest patch:
>
> IOPS = 500K ~ 850K (a lot of variation observed during the same test run)
That does seem a bit off - are you able to try some perf profiling to
get a better idea of where the overhead appears to be?
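
If you can, something like the following while the fio test is in
flight would be a good start (standard perf usage, nothing specific to
this series - adjust the duration and options as needed):

    # system-wide profile with call graphs, for 30 seconds of the run
    perf record -a -g -- sleep 30
    perf report --sort symbol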
> As you can see, performance has improved significantly with this patch
> series! That is very impressive!
>
> However, it is still off, compared to the test runs without the IOMMU.
> I'm wondering if more improvement is expected.
>
> In addition, a much larger throughput variation is observed in the tests
> with these latest patches, when multiple CPUs are involved. I'm
> wondering if that is caused by some remaining lock in the driver?
Assuming this is the platform with MMU-500, there shouldn't be any locks
left, since that SMMU shouldn't implement the hardware ATOS registers
used for iova_to_phys().
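
(For reference, the reason the ATOS path needs serialising at all is
that a hardware translation is a stateful write-then-poll-then-read
sequence on shared per-context registers. The sketch below shows the
general shape only - the cb_writeq()/cb_readl()/cb_readq() helpers,
lock and register names are hypothetical, not the driver's actual code.)

    /*
     * Illustrative-only sketch of hardware iova_to_phys() via the
     * ATOS registers: the request write, status poll and result read
     * must not interleave between CPUs, hence the spinlock.
     */
    static phys_addr_t atos_iova_to_phys(struct ctx_bank *cb, unsigned long iova)
    {
            phys_addr_t phys;

            spin_lock(&cb->atos_lock);
            cb_writeq(cb, CB_ATS1PR, iova);         /* start the translation */
            while (cb_readl(cb, CB_ATSR) & ATSR_ACTIVE)
                    cpu_relax();                    /* wait for completion */
            phys = cb_readq(cb, CB_PAR);            /* fetch the result */
            spin_unlock(&cb->atos_lock);

            return phys;
    }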
> Also, on a few occasions I observed the following message during the
> tests when multiple cores are involved:
>
> arm-smmu 64000000.mmu: TLB sync timed out -- SMMU may be deadlocked
That's particularly worrying, because it means we spent over a second
waiting for something that normally shouldn't take more than a few
hundred cycles. The only time I've ever actually seen that happen is
when TLBSYNC is issued while a context fault is pending - on MMU-500
the sync seems not to proceed until the fault is cleared - but that
stemmed from interrupts not being wired up correctly (on FPGAs), such
that we never saw the fault reported in the first place :/
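
For context, that message comes from the driver's sync path giving up
after polling for around a second; paraphrased from memory (so not a
verbatim quote of arm-smmu.c), it looks roughly like this:

    static void __arm_smmu_tlb_sync(struct arm_smmu_device *smmu)
    {
            void __iomem *gr0_base = ARM_SMMU_GR0(smmu);
            int count = 0;

            /* Kick off the sync, then poll until the ACTIVE bit clears */
            writel_relaxed(0, gr0_base + ARM_SMMU_GR0_sTLBGSYNC);
            while (readl_relaxed(gr0_base + ARM_SMMU_GR0_sTLBGSTATUS)
                            & sTLBGSTATUS_GSACTIVE) {
                    cpu_relax();
                    /* ~1s of 1us delays before declaring a deadlock */
                    if (++count == TLB_LOOP_TIMEOUT) {
                            dev_err_ratelimited(smmu->dev,
                                    "TLB sync timed out -- SMMU may be deadlocked\n");
                            return;
                    }
                    udelay(1);
            }
    }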
Robin.
>
> Thanks,
>
> Ray
>
> On 6/9/17 12:28 PM, Nate Watterson wrote:
>> Hi Robin,
>>
>> On 6/8/2017 7:51 AM, Robin Murphy wrote:
>>> Hi all,
>>>
>>> Here's the cleaned-up, nominally-final version of the patches everybody's
>>> keen to see. #1 is just a non-critical thing-I-spotted-in-passing fix,
>>> #2-#4 do some preparatory work (and bid farewell to everyone's least
>>> favourite bit of code, hooray!), and #5-#8 do the dirty deed itself.
>>>
>>> The branch I've previously shared has been updated too:
>>>
>>> git://linux-arm.org/linux-rm iommu/pgtable
>>>
>>> All feedback welcome, as I'd really like to land this for 4.13.
>>>
>>
>> I tested the series on a QDF2400 development platform and saw notable
>> performance improvements, particularly in workloads that make
>> concurrent accesses to a single iommu_domain.
>>
>>> Robin.
>>>
>>>
>>> Robin Murphy (8):
>>> iommu/io-pgtable-arm-v7s: Check table PTEs more precisely
>>> iommu/io-pgtable-arm: Improve split_blk_unmap
>>> iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
>>> iommu/io-pgtable: Introduce explicit coherency
>>> iommu/io-pgtable-arm: Support lockless operation
>>> iommu/io-pgtable-arm-v7s: Support lockless operation
>>> iommu/arm-smmu: Remove io-pgtable spinlock
>>> iommu/arm-smmu-v3: Remove io-pgtable spinlock
>>>
>>>  drivers/iommu/arm-smmu-v3.c        |  36 ++-----
>>>  drivers/iommu/arm-smmu.c           |  48 ++++------
>>>  drivers/iommu/io-pgtable-arm-v7s.c | 173 +++++++++++++++++++++------------
>>>  drivers/iommu/io-pgtable-arm.c     | 190 ++++++++++++++++++++++++-------------
>>>  drivers/iommu/io-pgtable.h         |   6 ++
>>>  5 files changed, 268 insertions(+), 185 deletions(-)
>>>
>>