[RFC 1/1] drm/pl111: Initial drm/kms driver for pl111

Fri Aug 9 12:15:33 EDT 2013

Hi Daniel, Rob.

Thank you both for your reviews - greatly appreciated!

> > > Known issues:
> > >  * It still includes code to use KDS, which is not going upstream.
> >
> > review's on <http://lists.freedesktop.org/archives/dri-devel/2013-
> > July/042462.html> can't hurt
> >
> > although you might consider submitting a reduced functionality driver
> > w/ KDS bits removed in the mean time.. then when the fence stuff is
> > merged it is just an incremental patch rather than a whole driver ;-)
> 
> Yeah, I think the KDS bits and comments need to go first before
> merginge.

Right, as I expected really. Though as I said we'll probably wait for
fences to land and switch over to that before asking for it to be
merged. A pl111 KMS driver with neither KDS nor implicit fences is 
useless to us. Having said that, if someone else would have a use for
a fence/KDS-less pl111 KMS driver, please let me know!

> > > +/*
> > > + * Number of flips allowed in flight at any one time. Any more
> > > + * flips requested beyond this value will cause the caller to 
> > > + * block until earlier flips have completed.
> > > + *
> > > + * For performance reasons, this must be greater than the number
> > > + * of buffers used in the rendering pipeline. Note that the 
> > > + * rendering pipeline can contain different types of buffer, e.g.:
> > > + * - 2 final framebuffers
> > > + * - >2 geometry buffers for GPU use-cases
> > > + * - >2 vertex buffers for GPU use-cases
> > > + *
> > > + * For example, a system using 5 geometry buffers could have 5
> > > + * flips in flight, and so NR_FLIPS_IN_FLIGHT_THRESHOLD must be 
> > > + * 5 or greater.
> > > + *
> > > + * Whilst there may be more intermediate buffers (such as
> > > + * vertex/geometry) than final framebuffers, KDS is used to 
> > > + * ensure that GPU rendering waits for the next off-screen 
> > > + * buffer, so it doesn't overwrite an on-screen buffer and 
> > > + * produce tearing.
> > > + */
> > > +
> >
> > fwiw, this is at least different from how other drivers do triple
> > (or > double) buffering.  In other drivers (intel, omap, and
> > msm/freedreno, that I know of, maybe others too) the xorg driver
> > dri2 bits implement the double buffering (ie. send flip event back
> > to client immediately and queue up the flip and call page-flip
> > after the pageflip event back from kernel.
> >
> > I'm not saying not to do it this way, I guess I'd like to hear
> > what other folks think.  I kinda prefer doing this in userspace 
> > as it keeps the kernel bits simpler (plus it would then work 
> > properly on exynosdrm or other kms drivers).
> 
> Yeah, if this is just a sw queue then I don't think it makes sense
> to have it in the kernel. Afaik the current pageflip interface drm
> exposes allows one oustanding flip only, and you _must_ wait for
> the flip complete event before you can submit the second one.

Right, I'll have a think about this. I think our idea was to issue
enough page-flips into the kernel to make sure that any process
scheduling latencies on a heavily loaded system don't cause us to
miss a v_sync deadline. At the moment we issue the page flip from DRI2
schedule_swap. If we were to move that to the page flip event handler
of the previous page-flip, we're potentially adding in extra latency.

I.e. Currently we have:

DRI2SwapBuffers
 - drm_mode_page_flip to buffer B
DRI2SwapBuffers
 - drm_mode_page_flip to buffer A (gets queued in kernel)
...
v_sync! (at this point buffer B is scanned out)
 - release buffer A's KDS resource/signal buffer A's fence
    - queued GPU job to render next frame to buffer A scheduled on HW
...
GPU interrupt! (at this point buffer A is ready to be scanned out)
 - release buffer A's KDS resource/signal buffer A's fence
    - second page flip executed, buffer A's address written to scanout
      register, takes effect on next v_sync.

So in the above, after X receives the second DRI2SwapBuffers, it
doesn't need to get scheduled again for the next frame to be both
rendered by the GPU and issued to the display for scanout.

If we were to move to a user-space queue, I think we have something
like this:

DRI2SwapBuffers
 - drm_mode_page_flip to buffer B
DRI2SwapBuffers
 - queue page flip to buffer A in DDX
...
v_sync! (at this point buffer B is scanned out)
 - release buffer A's KDS resource/signal buffer A's fence
    - queued GPU job to render next frame to buffer A scheduled on HW
 - Send page flip event to X
...
GPU interrupt! (at this point buffer A is ready to be scanned out)
 - Release buffer A's KDS resource/signal buffer A's fence - but nothing
   is waiting on it....
...
X gets scheduled, runs page flip handler
 - drm_mode_page_flip to buffer A
   - buffer A's address written to scanout register, takes effect on
     next v_sync.

So here, X must get scheduled again after processing the second
DRI2SwapBuffers in order to have the next frame displayed. This
increases the likely-hood that we're not able to write the address of
buffer A to the display HW's scan-out buffer in time to catch the next
v_sync, especially on a loaded system.

Anyway, I think that's our rational for keeping the queue in kernel
space, but I don't see there's much value in queuing more than 2 page
flips in kernel space.

> Ofc if your hardware as a hw-based flip queue (maybe even with frame
> targets) that's a different matter, but currently we don't have a drm
> interface to expose this. I'd say for merging the basic driver first we
> should go with the existing simple pageflip semantics.

Sure - I think it would mean slightly increased jank, but probably
something we can address later.

> > > +enum pl111_bo_type {
> > > +       PL111_BOT_DMA,
> > > +       PL111_BOT_SHM
> > > +};
> > > +
> > > +struct pl111_gem_bo_dma {
> > > +       dma_addr_t fb_dev_addr;
> > > +       void *fb_cpu_addr;
> > > +};
> > > +
> > > +struct pl111_gem_bo_shm {
> > > +       struct page **pages;
> > > +       dma_addr_t *dma_addrs;
> > > +};
> > > +
> > > +struct pl111_gem_bo {
> > > +       struct drm_gem_object gem_object;
> > > +       enum pl111_bo_type type;
> > > +       union {
> > > +               struct pl111_gem_bo_dma dma;
> > > +               struct pl111_gem_bo_shm shm;
> > > +       } backing_data;
> > > +       struct drm_framebuffer *fb;
> >
> > this is at least a bit odd.. normally the fb has ref to the bo(s) and
> > not the other way around.  And the same bo could be referenced by
> > multiple fb's which would kinda fall down with this approach.
> 
> I'd say that's just backwards, framebuffers are created from backing
> storage objects (which for a gem based driver is a gem object), not the
> other way round. What's this exactly used for?

Yup.

> > > +static void vsync_worker(struct work_struct *work)
> > > +{
> > > +       struct pl111_drm_flip_resource *flip_res;
> > > +       struct pl111_gem_bo *bo;
> > > +       struct pl111_drm_crtc *pl111_crtc;
> > > +       struct drm_device *dev;
> > > +       int flips_in_flight;
> > > +       flip_res =
> > > +               container_of(work, struct pl111_drm_flip_resource,
> > > +                            vsync_work);
> > > +
> > > +       pl111_crtc = to_pl111_crtc(flip_res->crtc);
> > > +       dev = pl111_crtc->crtc.dev;
> > > +
> > > +       DRM_DEBUG_KMS("DRM Finalizing flip_res=%p\n", flip_res);
> > > +
> > > +       bo = PL111_BO_FROM_FRAMEBUFFER(flip_res->fb);
> > > +#ifdef CONFIG_DMA_SHARED_BUFFER_USES_KDS
> > > +       if (flip_res->worker_release_kds == true) {
> > > +               spin_lock(&pl111_crtc->current_displaying_lock);
> > > +               release_kds_resource_and_display(flip_res);
> > > +               spin_unlock(&pl111_crtc->current_displaying_lock);
> > > +       }
> > > +#endif
> > > +       /* Release DMA buffer on this flip */
> > > +       if (bo->gem_object.export_dma_buf != NULL)
> > > +               dma_buf_put(bo->gem_object.export_dma_buf);
> >
> > I think you just want to unref the outgoing bo, and let it drop the
> > dmabuf ref when the file ref of the imported bo goes.  Or actually,
> > it would be better to hold/drop ref's to the fb, rather than the bo.
> > At least this will make things simpler if you ever have multi-planar
> > support.
> 
> Drivers have no business frobbing around the dma-buf refcount of
> imported objects imo, at least if they use all the standard drm 
> prime infrastructure. And if they're bugs they need to be fixed 
> there, not in drivers.

Good point. I guess the fb holds a ref on the bo and the bo holds a
ref on the imported dma_buf. Don't know what this was for...

> > > +       BUG_ON(bo->type != PL111_BOT_DMA);
> >
> > umm, no BUG_ON() is not really a good way to validate userspace
> > input..
> 
> Yep.

:-D

> > > +
> > > +       switch ((char)(mode_cmd->pixel_format & 0xFF)) {
> > > +       case 'Y':
> > > +       case 'U':
> > > +       case 'V':
> > > +       case 'N':
> > > +       case 'T':
> >
> > perhaps we should instead add a drm_format_is_yuv().. or you could
> > (ab)use drm_fb_get_bpp_depth()..
> 
> Yeah, I think a new drm_format_is_yuv is asked-for here. Now the bigger
> question is why you need this, since the drm core should filter out
> formats not in your list of supported ones. Or at least it should ...

Probably unnecessary belts & braces. I'll see if I can find some DRM
test which tries to create an fb using a yuv format and see where, if
anywhere, it gets rejected.

Thanks again!!

Cheers,

Tom