[PATCH 3/4] iommu/arm-smmu-v3: Use NUMA memory allocations for stream tables and command queues

Ganapatrao Kulkarni gklkml16 at gmail.com
Mon Nov 6 01:04:52 PST 2017


On Wed, Oct 18, 2017 at 7:06 PM, Robin Murphy <robin.murphy at arm.com> wrote:
> On 04/10/17 14:53, Ganapatrao Kulkarni wrote:
>> Hi Robin,
>>
>>
>> On Thu, Sep 21, 2017 at 5:28 PM, Robin Murphy <robin.murphy at arm.com> wrote:
>>> [+Christoph and Marek]
>>>
>>> On 21/09/17 09:59, Ganapatrao Kulkarni wrote:
>>>> Introduce smmu_alloc_coherent and smmu_free_coherent functions to
>>>> allocate/free dma coherent memory from NUMA node associated with SMMU.
>>>> Replace all calls of dmam_alloc_coherent with smmu_alloc_coherent
>>>> for SMMU stream tables and command queues.
>>>
>>> This doesn't work - not only do you lose the 'managed' aspect and risk
>>> leaking various tables on probe failure or device removal, but more
>>> importantly, unless you add DMA syncs around all the CPU accesses to the
>>> tables, you lose the critical 'coherent' aspect, and that's a horribly
>>> invasive change that I really don't want to make.
>>
>> This implementation is similar to the function used to allocate memory
>> for the translation tables.
>
> The concept is similar, yes, and would work if implemented *correctly*
> with the aforementioned comprehensive and hugely invasive changes. The
> implementation as presented in this patch, however, is incomplete and
> badly broken.
>
> By way of comparison, the io-pgtable implementations contain all the
> necessary dma_sync_* calls, never relied on devres, and only have one
> DMA direction to worry about (hint: the queues don't all work
> identically). There are also a couple of practical reasons for using
> streaming mappings with the DMA == phys restriction there - tracking
> both the CPU and DMA addresses for each table would significantly
> increase the memory overhead, and using the cacheable linear map address
> in all cases sidesteps any potential problems with the atomic PTE
> updates. Neither of those concerns apply to the SMMUv3 data structures,
> which are textbook coherent DMA allocations (being tied to the lifetime
> of the device, rather than transient).
>
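
To illustrate what that implies: here is a hypothetical sketch of the sync
helpers that streaming-mapped tables on a non-coherent SMMU would need
around every CPU access. The helper names are invented; this is not code
from io-pgtable or from this patch:

#include <linux/dma-mapping.h>

/*
 * Hypothetical sketch only: with streaming mappings on a non-coherent
 * SMMU, every CPU-side update to a table or queue entry must be made
 * visible to the device explicitly.
 */
static void smmu_sync_ent_for_device(struct device *dev, dma_addr_t base,
				     unsigned long off, size_t len)
{
	/* Publish the CPU's writes (e.g. a new STE or CMDQ entry). */
	dma_sync_single_for_device(dev, base + off, len, DMA_TO_DEVICE);
}

static void smmu_sync_ent_for_cpu(struct device *dev, dma_addr_t base,
				  unsigned long off, size_t len)
{
	/*
	 * The converse for device-written queues (event/PRI), whose
	 * mappings would need DMA_FROM_DEVICE rather than the single
	 * DMA_TO_DEVICE direction this patch uses everywhere - the
	 * queues do not all work identically.
	 */
	dma_sync_single_for_cpu(dev, base + off, len, DMA_FROM_DEVICE);
}

Every existing table and queue accessor in arm-smmu-v3.c would need such
calls added, which is the invasive part.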
>> Why do you see it affecting the stream tables and not the page tables?
>> At runtime, both tables are accessed only by the SMMU.
>>
>> As mentioned in the cover letter, allocating the stream table from the
>> respective NUMA node yields around a 30% performance improvement!
>> Please suggest if there is any better way to address this issue?
>
> I fully agree that NUMA-aware allocations are a worthwhile thing that we
> want. I just don't like the idea of going around individual drivers
> replacing coherent API usage with bodged-up streaming mappings - I
> really think it's worth making the effort to tackle it once, in the
> proper place, in a way that benefits all users together.
>
> Robin.
>
>>>
>>> Christoph, Marek; how reasonable do you think it is to expect
>>> dma_alloc_coherent() to be inherently NUMA-aware on NUMA-capable
>>> systems? SWIOTLB looks fairly straightforward to fix up (for the simple
>>> allocation case; I'm not sure it's even worth it for bounce-buffering),
>>> but the likes of CMA might be a little trickier...

IIUC, per-node DMA allocation may become a problem for 32-bit PCI
devices attached to node 1 on IOMMU-less platforms. On most such
platforms, node 1 RAM is located entirely above 4GB, so allocating DMA
memory above the 32-bit boundary for devices on node 1 (and beyond)
would make 32-bit PCI devices unusable.

DMA/IOMMU experts, please advise?
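
For what it's worth, here is a hypothetical sketch of how the common DMA
layer could prefer the device's node while still honouring a 32-bit
coherent mask. The function name is invented, and this is not actual
dma-direct code:

#include <linux/dma-mapping.h>
#include <linux/gfp.h>

/*
 * Hypothetical sketch, not actual dma-direct code: prefer the device's
 * NUMA node, but fall back when node-local memory lies above the
 * device's coherent DMA mask (e.g. a 32-bit device on node 1).
 */
static struct page *dma_alloc_pages_node_first(struct device *dev,
					       size_t size, gfp_t gfp)
{
	int node = dev_to_node(dev);
	unsigned int order = get_order(size);
	struct page *page;

	page = alloc_pages_node(node, gfp, order);
	if (page && page_to_phys(page) + size - 1 > dev->coherent_dma_mask) {
		/* Node-local memory is out of this device's reach. */
		__free_pages(page, order);
		page = NULL;
	}
	if (!page && dev->coherent_dma_mask <= DMA_BIT_MASK(32))
		/* Let the zonelist fall back to a node with low memory. */
		page = alloc_pages_node(node, gfp | GFP_DMA32, order);

	return page;
}

CMA regions would presumably need per-node handling on top of this,
which may be the trickier part alluded to above.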

>>>
>>> Robin.
>>>
>>>> Signed-off-by: Ganapatrao Kulkarni <ganapatrao.kulkarni at cavium.com>
>>>> ---
>>>>  drivers/iommu/arm-smmu-v3.c | 57 ++++++++++++++++++++++++++++++++++++++++-----
>>>>  1 file changed, 51 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/drivers/iommu/arm-smmu-v3.c b/drivers/iommu/arm-smmu-v3.c
>>>> index e67ba6c..bc4ba1f 100644
>>>> --- a/drivers/iommu/arm-smmu-v3.c
>>>> +++ b/drivers/iommu/arm-smmu-v3.c
>>>> @@ -1158,6 +1158,50 @@ static void arm_smmu_init_bypass_stes(u64 *strtab, unsigned int nent)
>>>>       }
>>>>  }
>>>>
>>>> +static void *smmu_alloc_coherent(struct arm_smmu_device *smmu, size_t size,
>>>> +             dma_addr_t *dma_handle, gfp_t gfp)
>>>> +{
>>>> +     struct device *dev = smmu->dev;
>>>> +     void *pages;
>>>> +     dma_addr_t dma;
>>>> +     int numa_node = dev_to_node(dev);
>>>> +
>>>> +     pages = alloc_pages_exact_nid(numa_node, size, gfp | __GFP_ZERO);
>>>> +     if (!pages)
>>>> +             return NULL;
>>>> +
>>>> +     if (!(smmu->features & ARM_SMMU_FEAT_COHERENCY)) {
>>>> +             dma = dma_map_single(dev, pages, size, DMA_TO_DEVICE);
>>>> +             if (dma_mapping_error(dev, dma))
>>>> +                     goto out_free;
>>>> +             /*
>>>> +              * We depend on the SMMU being able to work with any physical
>>>> +              * address directly, so if the DMA layer suggests otherwise by
>>>> +              * translating or truncating them, that bodes very badly...
>>>> +              */
>>>> +             if (dma != virt_to_phys(pages))
>>>> +                     goto out_unmap;
>>>> +     }
>>>> +
>>>> +     *dma_handle = (dma_addr_t)virt_to_phys(pages);
>>>> +     return pages;
>>>> +
>>>> +out_unmap:
>>>> +     dev_err(dev, "Cannot accommodate DMA translation for IOMMU page tables\n");
>>>> +     dma_unmap_single(dev, dma, size, DMA_TO_DEVICE);
>>>> +out_free:
>>>> +     free_pages_exact(pages, size);
>>>> +     return NULL;
>>>> +}
>>>> +
>>>> +static void smmu_free_coherent(struct arm_smmu_device *smmu, size_t size,
>>>> +             void *pages, dma_addr_t dma_handle)
>>>> +{
>>>> +     if (!(smmu->features & ARM_SMMU_FEAT_COHERENCY))
>>>> +             dma_unmap_single(smmu->dev, dma_handle, size, DMA_TO_DEVICE);
>>>> +     free_pages_exact(pages, size);
>>>> +}
>>>> +
>>>>  static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
>>>>  {
>>>>       size_t size;
>>>> @@ -1172,7 +1216,7 @@ static int arm_smmu_init_l2_strtab(struct arm_smmu_device *smmu, u32 sid)
>>>>       strtab = &cfg->strtab[(sid >> STRTAB_SPLIT) * STRTAB_L1_DESC_DWORDS];
>>>>
>>>>       desc->span = STRTAB_SPLIT + 1;
>>>> -     desc->l2ptr = dmam_alloc_coherent(smmu->dev, size, &desc->l2ptr_dma,
>>>> +     desc->l2ptr = smmu_alloc_coherent(smmu, size, &desc->l2ptr_dma,
>>>>                                         GFP_KERNEL | __GFP_ZERO);
>>>>       if (!desc->l2ptr) {
>>>>               dev_err(smmu->dev,
>>>> @@ -1487,7 +1531,7 @@ static void arm_smmu_domain_free(struct iommu_domain *domain)
>>>>               struct arm_smmu_s1_cfg *cfg = &smmu_domain->s1_cfg;
>>>>
>>>>               if (cfg->cdptr) {
>>>> -                     dmam_free_coherent(smmu_domain->smmu->dev,
>>>> +                     smmu_free_coherent(smmu,
>>>>                                          CTXDESC_CD_DWORDS << 3,
>>>>                                          cfg->cdptr,
>>>>                                          cfg->cdptr_dma);
>>>> @@ -1515,7 +1559,7 @@ static int arm_smmu_domain_finalise_s1(struct arm_smmu_domain *smmu_domain,
>>>>       if (asid < 0)
>>>>               return asid;
>>>>
>>>> -     cfg->cdptr = dmam_alloc_coherent(smmu->dev, CTXDESC_CD_DWORDS << 3,
>>>> +     cfg->cdptr = smmu_alloc_coherent(smmu, CTXDESC_CD_DWORDS << 3,
>>>>                                        &cfg->cdptr_dma,
>>>>                                        GFP_KERNEL | __GFP_ZERO);
>>>>       if (!cfg->cdptr) {
>>>> @@ -1984,7 +2028,7 @@ static int arm_smmu_init_one_queue(struct arm_smmu_device *smmu,
>>>>  {
>>>>       size_t qsz = ((1 << q->max_n_shift) * dwords) << 3;
>>>>
>>>> -     q->base = dmam_alloc_coherent(smmu->dev, qsz, &q->base_dma, GFP_KERNEL);
>>>> +     q->base = smmu_alloc_coherent(smmu, qsz, &q->base_dma, GFP_KERNEL);
>>>>       if (!q->base) {
>>>>               dev_err(smmu->dev, "failed to allocate queue (0x%zx bytes)\n",
>>>>                       qsz);
>>>> @@ -2069,7 +2113,7 @@ static int arm_smmu_init_strtab_2lvl(struct arm_smmu_device *smmu)
>>>>                        size, smmu->sid_bits);
>>>>
>>>>       l1size = cfg->num_l1_ents * (STRTAB_L1_DESC_DWORDS << 3);
>>>> -     strtab = dmam_alloc_coherent(smmu->dev, l1size, &cfg->strtab_dma,
>>>> +     strtab = smmu_alloc_coherent(smmu, l1size, &cfg->strtab_dma,
>>>>                                    GFP_KERNEL | __GFP_ZERO);
>>>>       if (!strtab) {
>>>>               dev_err(smmu->dev,
>>>> @@ -2097,8 +2141,9 @@ static int arm_smmu_init_strtab_linear(struct arm_smmu_device *smmu)
>>>>       u32 size;
>>>>       struct arm_smmu_strtab_cfg *cfg = &smmu->strtab_cfg;
>>>>
>>>> +
>>>>       size = (1 << smmu->sid_bits) * (STRTAB_STE_DWORDS << 3);
>>>> -     strtab = dmam_alloc_coherent(smmu->dev, size, &cfg->strtab_dma,
>>>> +     strtab = smmu_alloc_coherent(smmu, size, &cfg->strtab_dma,
>>>>                                    GFP_KERNEL | __GFP_ZERO);
>>>>       if (!strtab) {
>>>>               dev_err(smmu->dev,
>>>>
>>>
>>
>> thanks
>> Ganapat
>>
>

thanks
Ganapat


