Issues with DRM_ACCEL_ROCKET (Oops, power-domain, probe-ordering)

Fri Dec 12 02:25:35 PST 2025

Hi Tomeu,

On 12/10/25 5:39 PM, Tomeu Vizoso wrote:
> On Wed, Dec 10, 2025 at 11:47 AM Chaoyi Chen <chaoyi.chen at rock-chips.com> wrote:
>>
>> Hello Tomeu,
>>
>> On 12/10/2025 2:50 PM, Tomeu Vizoso wrote:
>>> On Tue, Dec 9, 2025 at 10:31 AM Chaoyi Chen <chaoyi.chen at rock-chips.com> wrote:
>>>>
>>>> Hello Tomeu,
>>>>
>>>> On 12/9/2025 2:55 PM, Tomeu Vizoso wrote:
>>>>> On Tue, Dec 2, 2025 at 8:00 AM Chaoyi Chen <chaoyi.chen at rock-chips.com> wrote:
>>>>
>>>> [...]
>>>>
>>>>>
>>>>> This need was questioned during the review process, and during my
>>>>> stress testing I didn't find a need for core 0 to be treated
>>>>> specially. So it was considered unneeded complexity and the code was
>>>>> dropped.
>>>>>
>>>>
>>>> Oh, I just noticed that the commit comment in the devicetree still
>>>> contains a relevant description.
>>>>
>>>>> My testing involves running the supported models as a whole, and also
>>>>> their individual operations as individual tests. I run them in 8
>>>>> parallel processes, in batches so we have continuous bring up and down
>>>>> of clients.
>>>>>
>>>>> I'm a bit surprised that that testing worked stably but we have still
>>>>> failures in some setups.
>>>>
>>>> Could you please describe the failures that still exist? Thank you.
>>>
>>> I was referring to the ones that Quentin reported when starting this thread.
>>>
>>
>> I believe this pmdomain patch can alleviate some of the problems [0].
>>
>> However, the problem Quentin mentioned still exists. For example, if
>> there is an invalid device on core0, while core1 and core2 are valid,
>> accessing the relevant data on core0 will cause problems. Quentin's
>> patch can fix this, but he doesn't seem to have time for it lately...
> 
> Has anybody reproduced this problem with the pmdomain fix applied?
> 

Assuming I've written my reboot-loop script properly, I got 715
successful boots and 0 failed with the patch applied, so I guess this
works.

The script:
"""
#!/usr/bin/env bash

set -ux

myreboot() {
	/bin/sync
	echo rebooting
	echo 1 > /proc/sys/kernel/sysrq
	echo b > /proc/sysrq-trigger
}

mount -t proc none /proc
mount -t sysfs none /sys

FAIL_TXT=/root/fail.txt
SUCCESS_TXT=/root/success.txt

[ -d /sys/bus/platform/drivers/rocket/fdab0000.npu/ ] && \
	[ -d /sys/bus/platform/drivers/rocket/fdac0000.npu/ ] && \
	[ -d /sys/bus/platform/drivers/rocket/fdad0000.npu/ ]
RES=$?
if [ $RES -ne 0 ]; then
	VAL=$(cat $FAIL_TXT)
	VAL=$(( VAL + 1 ))
	echo -n $VAL > $FAIL_TXT
	echo "FAILED!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
	myreboot
fi
TEFLON_DEBUG=verbose /root/mesa/venv/bin/python3 \
	/root/mesa/src/gallium/frontends/teflon/tests/classification.py \
	-i /root/mesa/grace_hopper.bmp \
	-m /root/mesa/src/gallium/targets/teflon/tests/models/mobilenetv1/mobilenet_v1_1_224_quant.tflite \
	-l /root/mesa/src/gallium/frontends/teflon/tests/labels_mobilenet_quant_v1_224.txt \
	-e /root/mesa/builddir/src/gallium/targets/teflon/libteflon.so
RES=$?
if [ $RES -ne 0 ]; then
	VAL=$(cat $FAIL_TXT)
	VAL=$(( VAL + 1 ))
	echo -n $VAL > $FAIL_TXT
	echo "FAILED!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"
else
	VAL=$(cat $SUCCESS_TXT)
	VAL=$(( VAL + 1 ))
	echo -n $VAL > $SUCCESS_TXT
fi
myreboot
"""
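
The three counter updates in the script repeat the same read, increment
and write steps. A minimal sketch of how they could be folded into one
helper follows. The name bump_counter is hypothetical and not part of
the script above, and treating a missing or empty file as 0 is an
assumption:

```shell
# Hypothetical helper: increment the integer stored in a counter file.
bump_counter() {
	file=$1
	val=0
	# Assumption: a missing or empty counter file counts as 0.
	if [ -s "$file" ]; then
		val=$(cat "$file")
	fi
	val=$(( val + 1 ))
	# Write the new value back without a trailing newline, like `echo -n`.
	printf '%s' "$val" > "$file"
}
```

With this, each branch would reduce to something like
`bump_counter "$FAIL_TXT"` or `bump_counter "$SUCCESS_TXT"`.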

It seems the cores are always listed in
/sys/bus/platform/drivers/rocket/ even when one fails to probe, so the
directory check in the script does not catch that case. classification.py
then triggers an Oops, and the exit code of that script appears to be
non-zero when that happens. I gathered this from the one boot I tested
*without* the patch applied.
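
Since the sysfs check does not catch a failed probe, one possible extra
guard is to scan the kernel log for an Oops instead of relying only on
the exit code of classification.py. This is a sketch, not part of the
script above: the function name has_oops is hypothetical, and the two
patterns are an assumption about what the Oops output looks like on
this board.

```shell
# Hypothetical helper: succeed if the kernel log on stdin mentions an Oops.
has_oops() {
	# Assumption: these two patterns cover the Oops reports of interest.
	grep -q -E 'Oops|Unable to handle kernel'
}
```

It could be used as `dmesg | has_oops`, counting the boot as failed
when it returns 0.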

Cheers,
Quentin