[PATCH 0/8] io-pgtable lock removal
Robin Murphy
robin.murphy at arm.com
Tue Jun 20 06:37:51 PDT 2017
On 15/06/17 01:40, Ray Jui wrote:
> Hi Robin,
>
> I have applied this patch series on top of v4.12-rc4, and ran various
> Ethernet and NVMf target throughput tests on it.
>
> To give you some background of my setup:
>
> The system is an ARMv8-based system with 8 cores. It has various PCIe
> root complexes that can be used to connect to PCIe endpoint devices,
> including NIC cards and NVMe SSDs.
>
> I'm particularly interested in the performance of the PCIe root complex
> that connects to the NIC card, so during my tests the IOMMU is
> enabled/disabled only for that particular root complex. The root
> complexes connected to the NVMe SSDs remain unchanged (i.e., without IOMMU).
>
> For the Ethernet throughput out of 50G link:
>
> Note that during the multiple-TCP-session tests, the sessions are
> spread across different CPU cores for optimal performance.
>
> Without IOMMU:
>
> TX TCP x1 - 29.7 Gbps
> TX TCP x4 - 30.5 Gbps
> TX TCP x8 - 28 Gbps
>
> RX TCP x1 - 15 Gbps
> RX TCP x4 - 33.7 Gbps
> RX TCP x8 - 36 Gbps
>
> With IOMMU, but without your latest patch:
>
> TX TCP x1 - 15.2 Gbps
> TX TCP x4 - 14.3 Gbps
> TX TCP x8 - 13 Gbps
>
> RX TCP x1 - 7.88 Gbps
> RX TCP x4 - 13.2 Gbps
> RX TCP x8 - 12.6 Gbps
>
> With IOMMU and your latest patch:
>
> TX TCP x1 - 21.4 Gbps
> TX TCP x4 - 30.5 Gbps
> TX TCP x8 - 21.3 Gbps
>
> RX TCP x1 - 7.7 Gbps
> RX TCP x4 - 20.1 Gbps
> RX TCP x8 - 27.1 Gbps
Cool, those seem more or less in line with expectations. Nate's
currently cooking a patch to further reduce the overhead when unmapping
multi-page buffers, which we believe should make up most of the rest of
the difference.
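
For anyone curious about the crux of the series: the per-domain page
table lock can go away because a newly-allocated next-level table is
published with a single atomic compare-and-swap, so concurrent mappers
of disjoint IOVAs never contend. A minimal standalone sketch of that
idea (illustrative only - the names and types below are not the actual
io-pgtable code):

    #include <stdatomic.h>
    #include <stdint.h>

    typedef uint64_t iopte_t;

    /*
     * Publish a new next-level table with one CAS. If another CPU
     * beat us to it, we simply walk through the winner's table
     * instead (and free ours) - no lock required.
     */
    static iopte_t install_table(_Atomic iopte_t *ptep, iopte_t new_table)
    {
            iopte_t expected = 0;

            if (atomic_compare_exchange_strong_explicit(ptep, &expected,
                            new_table, memory_order_release,
                            memory_order_relaxed))
                    return new_table;   /* we won the race */

            return expected;            /* lost the race; use the winner's table */
    }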
> For the NVMf target test with 4 SSDs (fio-based, random read, 4K block
> size, 8 jobs):
>
> Without IOMMU:
>
> IOPS = 1080K
>
> With IOMMU, but without your latest patch:
>
> IOPS = 520K
>
> With IOMMU and your latest patch:
>
> IOPS = 500K ~ 850K (a lot of variation observed during the same test run)
That does seem a bit off - are you able to try some perf profiling to
get a better idea of where the overhead appears to be?
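
If you can, something like the following while the fio test is in
flight would be a good start (standard perf usage, nothing specific to
this series - adjust the duration and options as needed):

    # system-wide profile with call graphs, for 30 seconds of the run
    perf record -a -g -- sleep 30
    perf report --sort symbol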
> As you can see, performance has improved significantly with this patch
> series! That is very impressive!
>
> However, it is still off, compared to the test runs without the IOMMU.
> I'm wondering if more improvement is expected.
>
> In addition, a much larger throughput variation is observed in the tests
> with these latest patches, when multiple CPUs are involved. I'm
> wondering if that is caused by some remaining lock in the driver?
Assuming this is the platform with MMU-500, there shouldn't be any locks
left, since that SMMU shouldn't implement the hardware ATOS registers
used for iova_to_phys().
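
(For reference, the reason the ATOS path needs serialising at all is
that a hardware translation is a stateful write-then-poll-then-read
sequence on shared per-context registers. The sketch below shows the
general shape only - the cb_writeq()/cb_readl()/cb_readq() helpers,
lock and register names are hypothetical, not the driver's actual code.)

    /*
     * Illustrative-only sketch of hardware iova_to_phys() via the
     * ATOS registers: the request write, status poll and result read
     * must not interleave between CPUs, hence the spinlock.
     */
    static phys_addr_t atos_iova_to_phys(struct ctx_bank *cb, unsigned long iova)
    {
            phys_addr_t phys;

            spin_lock(&cb->atos_lock);
            cb_writeq(cb, CB_ATS1PR, iova);         /* start the translation */
            while (cb_readl(cb, CB_ATSR) & ATSR_ACTIVE)
                    cpu_relax();                    /* wait for completion */
            phys = cb_readq(cb, CB_PAR);            /* fetch the result */
            spin_unlock(&cb->atos_lock);

            return phys;
    }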
> Also, on a few occasions I observed the following message during the
> tests when multiple cores are involved:
>
> arm-smmu 64000000.mmu: TLB sync timed out -- SMMU may be deadlocked
That's particularly worrying, because it means we spent over a second
waiting for something that normally shouldn't take more than a few
hundred cycles. The only time I've ever actually seen that happen is
when TLBSYNC is issued while a context fault is pending - on MMU-500
the sync seems not to proceed until the fault is cleared - but that
stemmed from interrupts not being wired up correctly (on FPGAs), such
that we never saw the fault reported in the first place :/
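
For context, that message comes from the driver's sync path giving up
after polling for around a second; paraphrased from memory (so not a
verbatim quote of arm-smmu.c), it looks roughly like this:

    static void __arm_smmu_tlb_sync(struct arm_smmu_device *smmu)
    {
            void __iomem *gr0_base = ARM_SMMU_GR0(smmu);
            int count = 0;

            /* Kick off the sync, then poll until the ACTIVE bit clears */
            writel_relaxed(0, gr0_base + ARM_SMMU_GR0_sTLBGSYNC);
            while (readl_relaxed(gr0_base + ARM_SMMU_GR0_sTLBGSTATUS)
                            & sTLBGSTATUS_GSACTIVE) {
                    cpu_relax();
                    /* ~1s of 1us delays before declaring a deadlock */
                    if (++count == TLB_LOOP_TIMEOUT) {
                            dev_err_ratelimited(smmu->dev,
                                    "TLB sync timed out -- SMMU may be deadlocked\n");
                            return;
                    }
                    udelay(1);
            }
    }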
Robin.
>
> Thanks,
>
> Ray
>
> On 6/9/17 12:28 PM, Nate Watterson wrote:
>> Hi Robin,
>>
>> On 6/8/2017 7:51 AM, Robin Murphy wrote:
>>> Hi all,
>>>
>>> Here's the cleaned-up, nominally-final version of the patches everybody's
>>> keen to see. #1 is just a non-critical thing-I-spotted-in-passing fix,
>>> #2-#4 do some preparatory work (and bid farewell to everyone's least
>>> favourite bit of code, hooray!), and #5-#8 do the dirty deed itself.
>>>
>>> The branch I've previously shared has been updated too:
>>>
>>> git://linux-arm.org/linux-rm iommu/pgtable
>>>
>>> All feedback welcome, as I'd really like to land this for 4.13.
>>>
>>
>> I tested the series on a QDF2400 development platform and saw notable
>> performance improvements, particularly in workloads that make
>> concurrent accesses to a single iommu_domain.
>>
>>> Robin.
>>>
>>>
>>> Robin Murphy (8):
>>> iommu/io-pgtable-arm-v7s: Check table PTEs more precisely
>>> iommu/io-pgtable-arm: Improve split_blk_unmap
>>> iommu/io-pgtable-arm-v7s: Refactor split_blk_unmap
>>> iommu/io-pgtable: Introduce explicit coherency
>>> iommu/io-pgtable-arm: Support lockless operation
>>> iommu/io-pgtable-arm-v7s: Support lockless operation
>>> iommu/arm-smmu: Remove io-pgtable spinlock
>>> iommu/arm-smmu-v3: Remove io-pgtable spinlock
>>>
>>>  drivers/iommu/arm-smmu-v3.c        |  36 ++-----
>>>  drivers/iommu/arm-smmu.c           |  48 ++++------
>>>  drivers/iommu/io-pgtable-arm-v7s.c | 173 +++++++++++++++++++++------------
>>>  drivers/iommu/io-pgtable-arm.c     | 190 ++++++++++++++++++++++++-------------
>>>  drivers/iommu/io-pgtable.h         |   6 ++
>>>  5 files changed, 268 insertions(+), 185 deletions(-)
>>>
>>