radeon ring 0 test failed on arm64

Wed Mar 16 20:07:50 PDT 2022

Hi Peter,

On 2022/3/17 08:14, Peter Geis wrote:
> Good Evening,
>
> I apologize for raising this email chain from the dead, but there have
> been some developments that have introduced even more questions.
> I've looped the Rockchip mailing list into this too, as this affects
> rk356x, and likely the upcoming rk3588 if [1] is to be believed.
>
> TLDR for those not familiar: It seems the rk356x series (and possibly
> the rk3588) were built without any outer coherent cache.
> This means (unless Rockchip wants to clarify here) devices such as the
> ITS and PCIe cannot utilize cache snooping.
> This is based on the results of the email chain [2].
>
> The new circumstances are as follows:
> The RPi CM4 Adventure Team, as I've taken to calling them, has been
> attempting to get a dGPU working with the very broken Broadcom
> controller in the RPi CM4.
> Recently they acquired a SoQuartz rk3566 module which is pin
> compatible with the CM4, and have taken to trying it out as well.
>
> This is how I got involved.
> It seems they found a trivial way to force the Radeon R600 driver to
> use Non-Cached memory for everything.
> This single line change, combined with using memset_io instead of
> memset, allows the ring tests to pass and the card probes successfully
> (minus the DMA limitations of the rk356x due to the 32 bit
> interconnect).
> I discovered using this method that we start having unaligned I/O
> memory access faults (bus errors) when running glmark2-drm (running
> glmark2 directly was impossible, as both X and Wayland crashed too
> early).
> I traced this to using what I thought at the time was an unsafe memcpy
> in the mesa stack.
> Rewriting this function to force aligned writes solved the problem and
> allows glmark2-drm to run to completion.
> With some extensive debugging, I found about half a dozen memcpy
> functions in mesa that if forced to be aligned would allow Wayland to
> start, but with hilarious display corruption (see [3], [4]).
> The CM4 team is convinced this is an issue with memcpy in glibc, but
> I'm not convinced it's that simple.
>
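
For readers following along, here is a minimal sketch of what "forcing
aligned writes" in a copy routine can look like. It is not the mesa change
described above, and copy_to_io_aligned is a hypothetical name. It assumes
the destination is a mapping on which unaligned wide accesses fault (as
with Device-type memory on arm64), so it only ever issues byte stores or
naturally aligned 32-bit stores to the destination.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from mesa, glibc, or the kernel).
 * Copies n bytes from normal memory to a destination that must only
 * see naturally aligned stores. "volatile" keeps the compiler from
 * merging the stores into wider, possibly unaligned, accesses.
 */
static void copy_to_io_aligned(volatile void *dst, const void *src, size_t n)
{
    volatile uint8_t *d = dst;
    const uint8_t *s = src;

    /* Byte stores until the destination is 4-byte aligned. */
    while (n > 0 && ((uintptr_t)d & 3u) != 0) {
        *d++ = *s++;
        n--;
    }

    /*
     * Aligned 32-bit stores. The source may still be unaligned, so
     * each word is first assembled in a local via memcpy.
     */
    while (n >= 4) {
        uint32_t w;
        memcpy(&w, s, sizeof(w));
        *(volatile uint32_t *)d = w;
        d += 4;
        s += 4;
        n -= 4;
    }

    /* Trailing bytes. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
}
```

A plain memcpy is free to use unaligned or overlapping wide stores, which
is harmless on cacheable Normal memory but not on a mapping like the one
assumed here. Whether that is the actual failure in this case is exactly
the open question in the quoted mail.
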
> On my two hour drive in to work this morning, I got to thinking.
> If this was a memcpy fault, this would be universally broken on arm64
> which is obviously not the case.
> So I started thinking, what is different here than with systems known to work:
> 1. No IOMMU for the PCIe controller.
> 2. The Outer Cache Issue.
>
> Robin:
> My questions for you, since you're the smartest person I know about
> arm64 memory management:
> Could cache snooping permit unaligned accesses to IO to be safe?
> Or
> Is it the lack of an IOMMU that's causing the alignment faults to become fatal?
> Or
> Am I insane here?
>
> Rockchip:
> Please update on the status for the Outer Cache errata for ITS services.

Our SoC design team has double-checked with the ARM GIC/ITS IP team many
times. The GITS_CBASER of the GIC600 IP does not support being bound in
hardware or configured to a fixed value, so they insist this is an IP
limitation rather than a SoC bug, and that software should take care of it :(
I will check again whether we can provide an erratum for this issue.

> Please provide an answer to the errata of the PCIe controller, in
> regard to cache snooping and buffering, for both the rk356x and the
> upcoming rk3588.

Sorry, what is this?

Thanks,
- Kever
>
> [1] https://github.com/JeffyCN/mirrors/commit/0b985f29304dcb9d644174edacb67298e8049d4f
> [2] https://lore.kernel.org/lkml/871rbdt4tu.wl-maz@kernel.org/T/
> [3] https://cdn.discordapp.com/attachments/926487797844541510/953414755970850816/unknown.png
> [4] https://cdn.discordapp.com/attachments/926487797844541510/953424952042852422/unknown.png
>
> Thank you everyone for your time.
>
> Very Respectfully,
> Peter Geis
>
> On Wed, May 26, 2021 at 7:21 AM Christian König
> <christian.koenig at amd.com> wrote:
>> Hi Robin,
>>
>> Am 26.05.21 um 12:59 schrieb Robin Murphy:
>>> On 2021-05-26 10:42, Christian König wrote:
>>>> Hi Robin,
>>>>
>>>> Am 25.05.21 um 22:09 schrieb Robin Murphy:
>>>>> On 2021-05-25 14:05, Alex Deucher wrote:
>>>>>> On Tue, May 25, 2021 at 8:56 AM Peter Geis <pgwipeout at gmail.com>
>>>>>> wrote:
>>>>>>> On Tue, May 25, 2021 at 8:47 AM Alex Deucher
>>>>>>> <alexdeucher at gmail.com> wrote:
>>>>>>>> On Tue, May 25, 2021 at 8:42 AM Peter Geis <pgwipeout at gmail.com>
>>>>>>>> wrote:
>>>>>>>>> Good Evening,
>>>>>>>>>
>>>>>>>>> I am stress testing the pcie controller on the rk3566-quartz64
>>>>>>>>> prototype SBC.
>>>>>>>>> This device has 1GB available at <0x3 0x00000000> for the PCIe
>>>>>>>>> controller, which makes a dGPU theoretically possible.
>>>>>>>>> While attempting to light off a HD7570 card I manage to get a
>>>>>>>>> modeset
>>>>>>>>> console, but ring0 test fails and disables acceleration.
>>>>>>>>>
>>>>>>>>> Note, we do not have UEFI, so all PCIe setup is from the Linux
>>>>>>>>> kernel.
>>>>>>>>> Any insight you can provide would be much appreciated.
>>>>>>>> Does your platform support PCIe cache coherency with the CPU?  I.e.,
>>>>>>>> does the CPU allow cache snoops from PCIe devices?  That is required
>>>>>>>> for the driver to operate.
>>>>>>> Ah, most likely not.
>>>>>>> This issue has come up already as the GIC isn't permitted to snoop on
>>>>>>> the CPUs, so I doubt the PCIe controller can either.
>>>>>>>
>>>>>>> Is there no way to work around this or is it dead in the water?
>>>>>> It's required by the pcie spec.  You could potentially work around it
>>>>>> if you can allocate uncached memory for DMA, but I don't think that is
>>>>>> possible currently.  Ideally we'd figure out some way to detect if a
>>>>>> particular platform supports cache snooping or not as well.
>>>>> There's device_get_dma_attr(), although I don't think it will work
>>>>> currently for PCI devices without an OF or ACPI node - we could
>>>>> perhaps do with a PCI-specific wrapper which can walk up and defer
>>>>> to the host bridge's firmware description as necessary.
>>>>>
>>>>> The common DMA ops *do* correctly keep track of per-device coherency
>>>>> internally, but drivers aren't supposed to be poking at that
>>>>> information directly.
>>>> That sounds like you underestimate the problem. ARM has unfortunately
>>>> made the coherency for PCI an optional IP.
>>> Sorry to be that guy, but I'm involved a lot internally with our
>>> system IP and interconnect, and I probably understand the situation
>>> better than 99% of the community ;)
>> I need to apologize, didn't realize who was answering :)
>>
>> It just sounded to me that you wanted to suggest to the end user that
>> this is fixable in software and I really wanted to avoid even more
>> customers coming around asking how to do this.
>>
>>> For the record, the SBSA specification (the closest thing we have to a
>>> "system architecture") does require that PCIe is integrated in an
>>> I/O-coherent manner, but we don't have any control over what people do
>>> in embedded applications (note that we don't make PCIe IP at all, and
>>> there is plenty of 3rd-party interconnect IP).
>> So basically it is not the fault of the ARM IP-core, but people are just
>> stitching together PCIe interconnect IP with a core it is not
>> supposed to be used with.
>>
>> Do I get that correctly? That's an interesting puzzle piece in the picture.
>>
>>>> So we are talking about a hardware limitation which potentially can't
>>>> be fixed without replacing the hardware.
>>> You expressed interest in "some way to detect if a particular platform
>>> supports cache snooping or not", by which I assumed you meant a
>>> software method for the amdgpu/radeon drivers to call, rather than,
>>> say, a website that driver maintainers can look up SoC names on. I'm
>>> saying that that API already exists (just may need a bit more work).
>>> Note that it is emphatically not a platform-level thing since
>>> coherency can and does vary per device within a system.
>> Well, I think this is not something an individual driver should mess
>> with. What the driver should do is just express that it needs coherent
>> access to all of system memory and if that is not possible fail to load
>> with a warning why it is not possible.
>>
>>> I wasn't suggesting that Linux could somehow make coherency magically
>>> work when the signals don't physically exist in the interconnect - I
>>> was assuming you'd merely want to do something like throw a big
>>> warning and taint the kernel to help triage bug reports. Some drivers
>>> like ahci_qoriq and panfrost simply need to know so they can program
>>> their device to emit the appropriate memory attributes either way, and
>>> rely on the DMA API to hide the rest of the difference, but if you
>>> want to treat non-coherent use as unsupported because it would require
>>> too invasive changes that's fine by me.
>> Yes exactly that please. I mean not sure how panfrost is doing it, but
>> at least the Vulkan userspace API specification requires devices to have
>> coherent access to system memory.
>>
>> So even if I would want to do this it is simply not possible because the
>> application doesn't tell the driver which memory is accessed by the
>> device and which by the CPU.
>>
>> Christian.
>>
>>> Robin.