[PATCH v6 06/10] accel/rocket: Add IOCTL for BO creation

Tomeu Vizoso tomeu at tomeuvizoso.net
Thu Jun 5 00:41:21 PDT 2025


On Wed, Jun 4, 2025 at 7:03 PM Robin Murphy <robin.murphy at arm.com> wrote:
>
> On 2025-06-04 5:18 pm, Daniel Stone wrote:
> > Hi Tomeu,
> > I have some bad news ...
> >
> > On Wed, 4 Jun 2025 at 08:57, Tomeu Vizoso <tomeu at tomeuvizoso.net> wrote:
> >> +int rocket_ioctl_create_bo(struct drm_device *dev, void *data, struct drm_file *file)
> >> +{
> >> +       [...]
> >> +
> >> +       /* This will map the pages to the IOMMU linked to core 0 */
> >> +       sgt = drm_gem_shmem_get_pages_sgt(shmem_obj);
> >> +       if (IS_ERR(sgt)) {
> >> +               ret = PTR_ERR(sgt);
> >> +               goto err;
> >> +       }
> >> +
> >> +       /* Map the pages to the IOMMUs linked to the other cores, so all cores can access this BO */
> >
> > So, uh, this is not great.
> >
> > We only have a single IOMMU context (well, one per core, but one
> > effective VMA) for the whole device. Every BO that gets created gets
> > mapped into the IOMMU until it's been destroyed. Given that there is
> > no client isolation and no CS validation, that means that every client
> > has RW access to every BO created by any other client, for the
> > lifetime of that BO.
> >
> > I really don't think that this is tractable, given that anyone with
> > access to the device can exfiltrate anything that anyone else has
> > provided to the device.
> >
> > I also don't think that CS validation is tractable tbh.
> >
> > So I guess that leaves us with the third option: enforcing context
> > separation within the kernel driver.
> >
> > The least preferable option I can think of is that rkt sets up and
> > tears down MMU mappings for each job, according to the BO list
> > provided for it. This seems like way too much overhead - especially
> > with RK IOMMU ops having been slow enough within DRM that we expended
> > a lot of effort in Weston doing caching of DRM BOs to avoid doing this
> > unless completely necessary. It also seems risky wrt allocating memory
> > in drm_sched paths to ensure forward progress.
> >
> > Slightly more preferable than this would be that rkt kept a
> > per-context list of BOs and their VA mappings, and when switching
> > between different contexts, would tear down all MMU mappings from the
> > old context and set up mappings from the new. But this has the same
> > issues with drm_sched.
> >
> > The most preferable option from where I sit is that we could have an
> > explicit notion of driver-managed IOMMU contexts, such that rkt could
> > prepare the IOMMU for each context, and then switching contexts at
> > job-run time would be a matter of changing the root DTE pointer and
> > issuing a flush. But I don't see that anywhere in the user-facing
> > IOMMU API, and I'm sure Robin (CCed) will be along shortly to explain
> > why it's not possible ...
>
> On the contrary, it's called iommu_attach_group() :)
>
> In fact this is precisely the usage model I would suggest for this sort
> of thing, and IIRC I had a similar conversation with the Ethos driver
> folks a few years back. Running your own IOMMU domain is no biggie; see
> several other DRM drivers (including rockchip). As long as you have a
> separate struct device per NPU core then indeed it should be perfectly
> straightforward to maintain distinct IOMMU domains per job, and
> attach/detach them as part of scheduling the jobs on and off the cores.
> Looks like rockchip-iommu supports cross-instance attach, so if
> necessary you should already be OK to have multiple cores working on the
> same job at once, without needing extra work at the IOMMU end.

Ok, so if I understood it correctly, the plan would be for each DRM
client to have one IOMMU domain per core (each core has its own
IOMMU), and to map all of its buffers into all of these domains.

Then, when a job is about to be scheduled on a given core, make sure
that the IOMMU for that core is attached to the domain of the client
that submitted the job.

Did I get that right?
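
If so, something roughly like the sketch below is what I have in mind.
All the rocket_* names are made up for illustration and none of this is
tested; it just tries to show the per-client domain plus the
attach-at-job-run-time idea in terms of the existing IOMMU API:

#include <linux/device.h>
#include <linux/err.h>
#include <linux/iommu.h>
#include <linux/scatterlist.h>

/* Made-up structures, just enough for the sketch. */
struct rocket_core {
	struct device *dev;		/* one struct device per NPU core */
};

struct rocket_client {
	struct iommu_domain *domain;	/* per-DRM-client VA space */
};

/* open(): allocate one paging domain per DRM client. */
static int rocket_client_init(struct rocket_core *any_core,
			      struct rocket_client *client)
{
	client->domain = iommu_paging_domain_alloc(any_core->dev);
	return PTR_ERR_OR_ZERO(client->domain);
}

/* BO creation: map the pages into the owning client's domain only. */
static int rocket_client_map_bo(struct rocket_client *client,
				unsigned long iova, struct sg_table *sgt)
{
	ssize_t mapped;

	mapped = iommu_map_sgtable(client->domain, iova, sgt,
				   IOMMU_READ | IOMMU_WRITE);
	return mapped < 0 ? mapped : 0;
}

/*
 * Job run: point the core's IOMMU at the submitting client's domain.
 * The domain of the previously scheduled client would have to be
 * detached first with iommu_detach_group().
 */
static int rocket_core_attach_client(struct rocket_core *core,
				     struct rocket_client *client)
{
	struct iommu_group *group = iommu_group_get(core->dev);
	int ret;

	if (!group)
		return -ENODEV;

	ret = iommu_attach_group(client->domain, group);
	iommu_group_put(group);
	return ret;
}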

> > Either way, if we have fully per-context mappings, I wonder whether
> > userspace shouldn't manage IOVAs in the VM_BIND style common to newer
> > drivers, rather than relying on the kernel to do VA allocation and
> > inform userspace of them?
>
> Indeed if you're using the IOMMU API directly then you need to do your
> own address space management either way, so matching bits of process VA
> space is pretty simple to put on the table.

My impression was that the VM_BIND facility is intended for SVM as in
OpenCL and maybe Vulkan?
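
Just to make sure we mean the same thing, this is roughly the kind of
uapi I understand by "VM_BIND style" (made-up names, for illustration
only), where userspace picks the device virtual address instead of the
kernel allocating one and returning it:

#include <linux/types.h>

/* Hypothetical uapi sketch, names invented for illustration only. */
struct drm_rocket_vm_bind {
	__u32 handle;	/* GEM handle to bind */
	__u32 flags;	/* e.g. read-only, or unbind */
	__u64 offset;	/* offset into the BO */
	__u64 iova;	/* device virtual address chosen by userspace */
	__u64 size;	/* length of the mapping in bytes */
};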

I guess my question is: what would such an accelerator driver gain by
letting userspace manage the address space?

Thanks,

Tomeu


