NVMe vs DMA addressing limitations

Nikita Yushchenko nikita.yoush at cogentembedded.com
Mon Jan 9 23:31:47 PST 2017


Christoph, thanks for the clear input.

Arnd, I think that, given this discussion, the best short-term solution
is indeed the patch I submitted yesterday. That is, your version plus
coherent mask support.  With that, dma_set_mask(DMA_BIT_MASK(64)) will
succeed and the hardware will work with swiotlb.
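
For reference, the driver-side pattern that keeps working with this
approach is roughly the following (a sketch; example_setup_dma() is
illustrative, only dma_set_mask_and_coherent() is the real API):

#include <linux/dma-mapping.h>
#include <linux/pci.h>

/*
 * Sketch only: ask for full 64-bit streaming and coherent masks.  With
 * the patch applied this succeeds, and buffers the bus cannot reach
 * directly are bounced through swiotlb by the arch DMA ops.
 */
static int example_setup_dma(struct pci_dev *pdev)
{
        int ret;

        ret = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
        if (ret)
                return ret;     /* platform cannot support the device */

        /*
         * From here on, dma_map_*() either returns a bus address the
         * device can reach, or transparently bounces via swiotlb.
         */
        return 0;
}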

A possible next step is to teach swiotlb to dynamically allocate bounce
buffers within the whole of arm64's ZONE_DMA.
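
Roughly the kind of fallback I have in mind (a hypothetical helper, not
existing swiotlb code; the function name, and the reliance on GFP_DMA
covering the low 4G on arm64, are my assumptions):

#include <linux/gfp.h>
#include <linux/io.h>
#include <linux/mm.h>

/*
 * Hypothetical: when the fixed swiotlb pool is exhausted, grab a
 * temporary bounce buffer straight from ZONE_DMA instead of failing.
 */
static phys_addr_t example_alloc_dynamic_bounce(size_t size)
{
        unsigned long vaddr;

        vaddr = __get_free_pages(GFP_ATOMIC | GFP_DMA, get_order(size));
        if (!vaddr)
                return 0;       /* 0 = "no memory" in this sketch */

        return virt_to_phys((void *)vaddr);
}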

There is also some hope that R-Car *can* IOMMU-translate the addresses
that the PCIe module issues to the system bus, although previous
attempts to make that work have failed.  Additional research is needed
here.

Nikita

> On Tue, Jan 10, 2017 at 09:47:21AM +0300, Nikita Yushchenko wrote:
>> I'm now working with HW that:
>> - is in no way "low end" or "obsolete": it has 4G of RAM and 8 CPU cores,
>> and is still being manufactured and developed,
>> - has 75% of its RAM located beyond the first 4G of address space,
>> - can't physically handle incoming PCIe transactions addressed to memory
>> beyond 4G.
> 
> It might not be low end or obsolete, but it's absolutely braindead.
> Your I/O performance will suffer badly for the life of the platform
> because someone tried to save 2 cents, and there is not much we can do
> about it.
> 
>> (1) it constantly runs out of swiotlb space, and the logs are full of
>> warnings despite rate limiting,
> 
>> Per my current understanding, blk-level bounce buffering will at least
>> help with (1) - if done properly, it will allocate bounce buffers within
>> the entire memory below 4G, not within the dedicated swiotlb space (which
>> is small, and enlarging it makes memory permanently unavailable for other
>> use).  This looks simple and safe (in the sense of not breaking unrelated
>> use cases in any way).
> 
> Yes.  Although there is absolutely no reason why swiotlb could not
> do the same.
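
To illustrate what I mean by "done properly": the per-queue limit can
already be declared (sketch below; the queue pointer is illustrative),
the open part is making the bounce pages themselves come from anywhere
below 4G rather than from a small dedicated pool.

#include <linux/blkdev.h>
#include <linux/dma-mapping.h>

/*
 * Sketch: declare that this queue can only reach addresses below 4G.
 * Whether the current bounce code then actually bounces on a 64-bit,
 * highmem-less kernel is part of what needs to be "done properly".
 */
static void example_limit_queue(struct request_queue *q)
{
        blk_queue_bounce_limit(q, DMA_BIT_MASK(32));
}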
> 
>> (2) it runs far below optimal due to bounce-buffering almost all I/O,
>> despite lots of free memory in the area where direct DMA is possible.
> 
>> Addressing (2) looks much more difficult, because a different memory
>> allocation policy is required for that.
> 
> It's basically not possible.  Every piece of memory in a Linux
> kernel is a possible source of I/O, and depending on the workload
> type it might even be the prime source of I/O.
> 
>>> NVMe should never bounce, the fact that it currently possibly does
>>> for highmem pages is a bug.
>>
>> The entire topic is absolutely not related to highmem (i.e. memory not
>> directly addressable by a 32-bit kernel).
> 
> I did not say this affects you, but thanks to your mail I noticed that
> NVMe has a suboptimal setting there.  Also note that highmem does not
> have to imply a 32-bit kernel, just physical memory that is not in the
> kernel mapping.
> 
>> What we are discussing is a hw-originated restriction on where DMA is
>> possible.
> 
> Yes, where hw means the SOC, and not the actual I/O device, which is an
> important distinction.
> 
>>> Or, even better, remove the call to dma_set_mask_and_coherent with
>>> DMA_BIT_MASK(32).  NVMe is designed around having proper 64-bit DMA
>>> addressing; there is no point in trying to pretend it works without that.
>>
>> Are you claiming that the NVMe driver in mainline is intentionally
>> designed not to work on HW that can't do DMA to the entire 64-bit space?
> 
> It is not intended to handle the case where the SOC / chipset
> can't handle DMA to all physical memory, yes.
> 
>> Such setups do exist, and there is interest in making them work.
> 
> Sure, but it's not the job of the NVMe driver to work around such a broken
> system.  It's something your architecture code needs to do, maybe with
> a bit of core kernel support.
> 
>> Quite a few pages used for block I/O are allocated by the filemap code -
>> and at the allocation point it is known which inode the page is being
>> allocated for.  If this inode is from a filesystem located on a known
>> device with known DMA limitations, this knowledge can be used to allocate
>> a page that can be DMAed directly.
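
To make this concrete, the kind of hint I have in mind is roughly the
following (sketch only; it reuses the existing per-mapping gfp mask, and
the question of where to call it from is deliberately left open):

#include <linux/fs.h>
#include <linux/gfp.h>
#include <linux/pagemap.h>

/*
 * Sketch: bias page-cache allocations for this inode towards memory
 * below 4G, so a device limited to 32-bit DMA can reach the pages
 * directly.
 */
static void example_restrict_mapping(struct inode *inode)
{
        gfp_t gfp = mapping_gfp_mask(inode->i_mapping);

        /* Drop other zone modifiers before asking for ZONE_DMA32. */
        gfp &= ~(__GFP_HIGHMEM | __GFP_DMA);
        mapping_set_gfp_mask(inode->i_mapping, gfp | __GFP_DMA32);
}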
> 
> But in other cases we might never DMA to it.  Or we rarely DMA to it, say
> for a machine running databases or qemu and using lots of direct I/O.  Or
> a storage target using its local alloc_pages buffers.
> 
>> Sure, there are lots of cases where at allocation time there is no idea
>> which device will run DMA on the page being allocated, or perhaps the
>> page is going to be shared, or whatever.  Such cases unavoidably require
>> bounce buffers if the page ends up being used with a device with DMA
>> limitations.  But still, there are cases where better allocation can
>> remove the need for bounce buffers - without hurting other cases at all.
> 
> It takes your at most 1GB of DMA-addressable memory away from other uses,
> and introduces the crazy highmem VM tuning issues we had with big
> 32-bit x86 systems in the past.
> 


