[PATCH v7 00/15] Optimizing iommu_[map/unmap] performance

Wed Jul 14 18:51:13 PDT 2021

On 2021/7/15 9:23, Lu Baolu wrote:
> On 7/14/21 10:24 PM, Georgi Djakov wrote:
>> On 16.06.21 16:38, Georgi Djakov wrote:
>>> When unmapping a buffer from an IOMMU domain, the IOMMU framework unmaps
>>> the buffer at a granule of the largest page size that is supported by
>>> the IOMMU hardware and fits within the buffer. For every block that
>>> is unmapped, the IOMMU framework will call into the IOMMU driver, and
>>> then the io-pgtable framework to walk the page tables to find the entry
>>> that corresponds to the IOVA, and then unmaps the entry.
>>>
>>> This can be suboptimal in scenarios where a buffer or a piece of a
>>> buffer can be split into several contiguous page blocks of the same size.
>>> For example, consider an IOMMU that supports 4 KB page blocks, 2 MB page
>>> blocks, and 1 GB page blocks, and a buffer that is 4 MB in size is being
>>> unmapped at IOVA 0. The current call-flow will result in 4 indirect calls,
>>> and 2 page table walks, to unmap 2 entries that are next to each other in
>>> the page-tables, when both entries could have been unmapped in one shot
>>> by clearing both page table entries in the same call.
>>>
>>> The same optimization is applicable to mapping buffers as well, so
>>> these patches implement a set of callbacks called unmap_pages and
>>> map_pages to the io-pgtable code and IOMMU drivers which unmaps or maps
>>> an IOVA range that consists of a number of pages of the same
>>> page size that is supported by the IOMMU hardware, and allows for
>>> manipulating multiple page table entries in the same set of indirect
>>> calls. The reason for introducing these callbacks is to give other IOMMU
>>> drivers/io-pgtable formats time to change to using the new callbacks, so
>>> that the transition to using this approach can be done piecemeal.
>>
>> Hi Will,
>>
>> Did you get a chance to look at this patchset? Most patches are already
>> acked/reviewed and all still apply cleanly on rc1.
>
> I also have the ops->[un]map_pages implementation for the Intel IOMMU
> driver. I will post them once the iommu/core part gets applied.

I have also implemented those callbacks for ARM SMMUv3 on top of this
series, and used dma_map_benchmark to measure map/unmap latency. The
results are below; I think the series improves map/unmap latency
considerably. I also plan to post the ARM SMMUv3 implementation after
this series is applied.

t = 1 (1 thread), map/unmap latency in us:
                    before opt       after opt
g=1(4K size)        0.1/1.3          0.1/0.8
g=2(8K size)        0.2/1.5          0.2/0.9
g=4(16K size)       0.3/1.9          0.1/1.1
g=8(32K size)       0.5/2.7          0.2/1.4
g=16(64K size)      1.0/4.5          0.2/2.0
g=32(128K size)     1.8/7.9          0.2/3.3
g=64(256K size)     3.7/14.8         0.4/6.1
g=128(512K size)    7.1/14.7         0.5/10.4
g=256(1M size)      14.0/53.9        0.8/19.3
g=512(2M size)      0.2/0.9          0.2/0.9
g=1024(4M size)     0.5/1.5          0.4/1.0

t = 10 (10 threads), map/unmap latency in us:
                    before opt       after opt
g=1(4K size)        0.3/7.0          0.1/5.8
g=2(8K size)        0.4/6.7          0.3/6.0
g=4(16K size)       0.5/6.3          0.3/5.6
g=8(32K size)       0.5/8.3          0.2/6.3
g=16(64K size)      1.0/17.3         0.3/12.4
g=32(128K size)     1.8/36.0         0.2/24.2
g=64(256K size)     4.3/67.2         1.2/46.4
g=128(512K size)    7.8/93.7         1.3/94.2
g=256(1M size)      14.7/280.8       1.8/191.5
g=512(2M size)      3.6/3.2          1.5/2.5
g=1024(4M size)     2.0/3.1          1.8/2.6
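
To make the call counts from the quoted cover letter concrete, here is a
rough user-space sketch (not kernel code). It assumes the page sizes from
the cover letter's example (4 KB, 2 MB, 1 GB); pick_pgsize(),
calls_per_block() and calls_batched() are hypothetical helper names made
up for this illustration, not functions from the series. Each "call" in
the model stands for one trip from the IOMMU core into the driver; in the
real flow every such trip also costs a further indirect call into
io-pgtable and a page table walk.

```c
#include <stddef.h>

/* Assumed page sizes, largest first (from the cover letter's example). */
static const size_t pgsizes[] = {
    (size_t)1 << 30, (size_t)1 << 21, (size_t)1 << 12
};
#define NPGSIZES (sizeof(pgsizes) / sizeof(pgsizes[0]))

/* Largest page size that fits in 'size' and that 'iova' is aligned to. */
static size_t pick_pgsize(unsigned long iova, size_t size)
{
    for (size_t i = 0; i < NPGSIZES; i++)
        if (pgsizes[i] <= size && (iova % pgsizes[i]) == 0)
            return pgsizes[i];
    return 0; /* nothing fits (size below the smallest page) */
}

/* Per-block flow: one driver call for every block that is unmapped. */
static unsigned int calls_per_block(unsigned long iova, size_t size)
{
    unsigned int calls = 0;

    while (size) {
        size_t pgsize = pick_pgsize(iova, size);

        if (!pgsize)
            break;
        iova += pgsize;
        size -= pgsize;
        calls++;
    }
    return calls;
}

/* Batched flow: one driver call for each run of same-size blocks. */
static unsigned int calls_batched(unsigned long iova, size_t size)
{
    unsigned int calls = 0;

    while (size) {
        size_t pgsize = pick_pgsize(iova, size);

        if (!pgsize)
            break;
        /* Consume blocks while the same page size keeps being chosen. */
        while (size && pick_pgsize(iova, size) == pgsize) {
            iova += pgsize;
            size -= pgsize;
        }
        calls++;
    }
    return calls;
}
```

Under these assumptions, a 1 MB buffer at IOVA 0 takes 256 calls in the
per-block model and 1 in the batched model, a 4 MB buffer takes 2 versus
1, and a 2 MB buffer takes 1 either way. That is at least consistent with
the t = 1 numbers above, where the 2M row does not change and the sizes
below 2M gain the most.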

>
> Best regards,
> baolu