[RFC PATCH v3 0/9] accel: rocket: Add RK3568 NPU support
Chaoyi Chen
chaoyi.chen at rock-chips.com
Thu Jun 4 18:36:28 PDT 2026
Hello Midgy,
On 6/4/2026 9:52 PM, Midgy BALON wrote:
> RFC, not for merge. End-to-end inference does not produce correct output
> yet (see Status), so per the v2 discussion this is a request for design
> feedback. It now probes, attaches, and submits cleanly on a stock
> v7.1-rc6 tree; what remains is one hardware-internal issue.
>
> The RK3568 has a single NVDLA-derived NPU core, the same IP family as the
> RK3588 NPU the driver already supports; the register layout matches. The
> RK3568 differences are a 32-bit NPU AXI/IOMMU (vs 40-bit) and explicit
> PVTPLL/PMU bring-up to power and de-idle the NPU before it is reachable.
>
> Patches:
> 1-2 rocket: per-SoC data struct, then derive DMA width and core count
> from match data (refactors, no functional change).
> 3 rocket: RK3568 SoC data + PVTPLL/PMU/NOC bring-up.
> 4 rocket: reset the NPU before detaching the IOMMU on a job timeout
> (the detach otherwise stalls a wedged AXI master and WARNs).
> 5 rocket: keep the IOMMU domain attached across jobs instead of
> re-attaching per job (the per-job rk_iommu handshake on the idle
> NPU MMU is slow and noisy).
> 6 iommu/rockchip: clear AUTO_GATING bit 1 on the RK356x v1 IOMMU so
> the page-walker keeps its clock (else a TLB-miss walk never
> completes).
> 7 dt-bindings: add the RK3568 NPU compatible.
> 8-9 arm64 dts: add the NPU and its IOMMU, and enable them on ROCK 3B.
>
> Dependency. The NPU MMU is rockchip-iommu v1 (32-bit) while the rest of
> the RK3568 uses v2 (40-bit). They cannot coexist until the driver carries
> per-device ops; this series is developed on top of Simon Xue's
> "iommu/rockchip: Drop global rk_ops in favor of per-device ops" [1].
> Without it the NPU IOMMU fails to probe on a full RK3568 boot.
>
Hmmm. If I understand correctly, the NPU IOMMU is v2 rather than v1,
which means it should support 40-bit physical addresses. Note, however,
that the DTE address is still limited to 32 bits.
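To make the distinction concrete, here is a minimal userspace sketch of the
two limits, assuming the 40-bit PA and 32-bit DTE widths mentioned above. All
names (SKETCH_PA_BITS, SKETCH_DTE_BITS, addr_mask, addr_fits, layout_ok) are
hypothetical and exist only for illustration; this is not code from
rockchip-iommu.c or from the rocket driver.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed widths, taken from the discussion above and not from a datasheet. */
#define SKETCH_PA_BITS  40u	/* physical addresses a mapping may target */
#define SKETCH_DTE_BITS 32u	/* limit on the directory table address */

/* Mask with the low 'bits' bits set, the same idea as DMA_BIT_MASK(). */
static uint64_t addr_mask(unsigned int bits)
{
	assert(bits >= 1 && bits <= 64);
	return bits == 64 ? UINT64_MAX : (UINT64_C(1) << bits) - 1;
}

/* True if 'addr' has no bits set above the given width. */
static bool addr_fits(uint64_t addr, unsigned int bits)
{
	return (addr & ~addr_mask(bits)) == 0;
}

/*
 * The directory table has to sit below 4 GiB, while the pages it
 * maps may live anywhere in the 40-bit physical address space.
 */
static bool layout_ok(uint64_t dt_base, uint64_t page_pa)
{
	return addr_fits(dt_base, SKETCH_DTE_BITS) &&
	       addr_fits(page_pa, SKETCH_PA_BITS);
}
```

With these assumptions, layout_ok(0x80000000, 0x2000000000) holds, while a
directory table at 0x100000000 is rejected even though that address is a
valid 40-bit PA.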
> Power bring-up. The NPU is brought up through the power-domain layer (no
> driver hack): the NPU power-domain keeps its clocks but drops the pm_qos
> phandle (qos_npu sits behind the gated NPU NoC, so genpd's power-off QoS
> save faults reading it), and vdd_npu is marked always-on so the rail is
> up before genpd de-idles the NoC at power-on. The PMU de-idle then ACKs
> without PVTPLL running; PVTPLL is only needed for compute.
>
Can't these operations be handled by the pmdomain driver?
If some of them are controlled by TF-A, are you using the
open-source TF-A? Thank you.
> Status. On v7.1-rc6 the driver probes, creates /dev/accel/accel0,
> attaches an IOMMU domain, and submits jobs; the program controller
> fetches and broadcasts the command list. Inference output is still wrong,
> and the cause is split across three layers:
> - kernel (this series): the RK3568 differences appear handled;
> - mesa/Teflon userspace: still emits RK3588-tuned config, wrong for
> RK3568 (to be filed separately on mesa-dev);
> - hardware: with corrected config the NPU's DMA reads the full input
> and weight tensors (confirmed via its DMA bandwidth counters), but
> the MAC/output stage never completes, the job times out, and the
> output stays at the buffer's zero-point. I have not found the missing
> step; it is not in the command list (replaying the vendor's
> byte-exact command list behaves the same). Pointers welcome,
> especially from anyone with RK3568 NPU experience.
>
> Known residual. On the first IOMMU attach the NPU MMU is idle with paging
> already enabled; the rk_iommu stall/reset handshake does not complete in
> that state and logs one burst of timeouts before the (kept) domain
> settles. It is harmless here because the job times out regardless, but it
> points at an idle-MMU reconfiguration corner the rk_iommu code does not
> handle on this block.
>
> [1] https://lore.kernel.org/linux-rockchip/20260310105303.128859-1-xxm@rock-chips.com/
>
> Changes since v2:
> - Tagged RFC; now tested on a stock v7.1-rc6 tree.
> - Bring-up moved into the power-domain/DT layer (no initcall hack).
> - Added the IOMMU detach-on-timeout and attach-once driver fixes.
> - Split the driver patch (Heiko): soc_data / match-data / RK3568.
> - Derive DMA width and core count from match data; drop the DT rescans.
> - Binding describes the hardware; added the missing $ref on rockchip,pmu.
> - Disclosed the per-device-ops IOMMU dependency.
>
> Midgy BALON (9):
> accel: rocket: Introduce per-SoC rocket_soc_data
> accel: rocket: Derive DMA width and core count from match data
> accel: rocket: Add RK3568 SoC support
> accel: rocket: Reset the NPU before detaching the IOMMU on timeout
> accel: rocket: Keep the IOMMU domain attached across jobs
> iommu/rockchip: Clear AUTO_GATING bit 1 on the RK356x v1 IOMMU
> dt-bindings: npu: rockchip,rk3588-rknn-core: Add RK3568
> arm64: dts: rockchip: rk356x: Add the NPU and its IOMMU
> arm64: dts: rockchip: rk3568-rock-3b: Enable the NPU
>
> .../npu/rockchip,rk3588-rknn-core.yaml | 18 ++++-
> .../boot/dts/rockchip/rk3568-rock-3b.dts | 14 +++-
> arch/arm64/boot/dts/rockchip/rk356x-base.dtsi | 38 +++++++++++
> drivers/accel/rocket/rocket_core.c | 22 ++++++-
> drivers/accel/rocket/rocket_core.h | 19 ++++++
> drivers/accel/rocket/rocket_device.c | 15 ++---
> drivers/accel/rocket/rocket_device.h | 3 +-
> drivers/accel/rocket/rocket_drv.c | 66 ++++++++++++++++++-
> drivers/accel/rocket/rocket_job.c | 35 ++++++++--
> drivers/iommu/rockchip-iommu.c | 12 ++++
> 10 files changed, 219 insertions(+), 23 deletions(-)
>
>
> base-commit: 52c800fdcf11888ebeb50c3d707f782cc15b66eb
--
Best,
Chaoyi