Kernel oops in vc4_overflow_mem_work

Mon Nov 6 17:33:01 PST 2017

Hey everyone,

I wanted to follow up on an issue that was first reported here:

https://github.com/anholt/linux/issues/114

A patch was proposed here, but it doesn't appear to address the issue:

https://patchwork.freedesktop.org/patch/181157/

Looking at the stack trace, it's a NULL pointer dereference at +0x88,
which doesn't seem symptomatic of an uninitialized spin lock. On my
kernel, which occasionally showed this issue, the PC matched up with
line 95 of vc4_irq.c:

   V3D_WRITE(V3D_BPOA, bo->base.paddr + bin_bo_slot * vc4->bin_alloc_size);

More specifically, it points at the bo->base.paddr expression, which
indicates that bo (from vc4->bin_bo, the binner BO) is NULL when the
problem occurs.

There are only two places where bin_bo is explicitly set to NULL. The
first is line 964 of vc4_gem.c (vc4_gem_destroy). This is invoked when
binding fails, but it has a comment stating that the interrupts should
be disabled at this point.

The second occurrence is in line 307 of vc4_v3d.c
(vc4_v3d_runtime_suspend). This comes immediately after
vc4_irq_uninstall, which first acks and disables all interrupts in the
VC4 registers, then synchronously cancels the overflow work queue (so
all pending or running vc4_overflow_mem_work is completed before
returning). The problem with this order is that, since binner overflow
interrupt handling is offloaded to a work queue, the worker
vc4_overflow_mem_work explicitly enables the interrupt bit again. So
in the rare case that there was pending overflow work during
vc4_irq_uninstall, we first disable the interrupt, then wait for the
remaining work to complete, which enables the interrupt again.
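
To make the ordering concrete, here is a minimal user-space sketch of
the sequence described above. It is not driver code: the mock_* names,
the struct layout, and the bit position of MOCK_INT_OUTOMEM are
hypothetical stand-ins, and the "wait for work" step is modeled as
simply running the outstanding worker.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit for the binner overflow interrupt (assumption). */
#define MOCK_INT_OUTOMEM (1u << 2)

/* Hypothetical model of the state involved; not the real driver struct. */
struct mock_irq_state {
	uint32_t int_enable;   /* stands in for the interrupt enable bits */
	bool work_pending;     /* overflow work outstanding at uninstall */
};

/* Models the worker: after handling the overflow it re-arms the bit. */
static void mock_overflow_work(struct mock_irq_state *s)
{
	s->int_enable |= MOCK_INT_OUTOMEM;
	s->work_pending = false;
}

/* Models the uninstall order: disable first, then let the work finish. */
static void mock_irq_uninstall(struct mock_irq_state *s)
{
	s->int_enable = 0;              /* step 1: disable all interrupts */
	if (s->work_pending)            /* step 2: outstanding work completes */
		mock_overflow_work(s);  /* ...and re-enables the overflow bit */
}
```

With work outstanding, the model ends with MOCK_INT_OUTOMEM set again
after the uninstall; with no work outstanding, everything stays
disabled.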

This is probably harmless, though, because at this point we shouldn't
be queuing jobs anymore that could trigger the binner overflow
interrupt in the first place, and because the work queue is already
canceled (so if the interrupt happens, we still can't queue work and
cause the worker to be invoked).

At this point I unfortunately don't have a setup that can reliably
reproduce the issue, so I can't narrow it down further. My quick fix
right now is to simply check for bo == NULL in vc4_overflow_mem_work
and avoid the NULL pointer dereference. This should be fine because
vc4 can't operate without a binner BO, so if we don't have one we are
in the process of shutting down anyway.
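
For illustration, the shape of that check can be sketched in user
space roughly as follows. This is only a mock under stated
assumptions: the mock_* types and the bpoa field are hypothetical
stand-ins for the driver structures and the V3D_BPOA register write,
not the actual patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the binner BO; only paddr is modeled. */
struct mock_bo {
	uint32_t paddr;
};

/* Hypothetical stand-in for the device state used by the worker. */
struct mock_vc4 {
	struct mock_bo *bin_bo;   /* NULL once the binner BO is gone */
	uint32_t bin_alloc_size;
	uint32_t bpoa;            /* models the value written to V3D_BPOA */
};

/*
 * Models the guarded worker. Returns true if the overflow address was
 * programmed, false if the worker bailed out because bin_bo was NULL.
 */
static bool mock_overflow_mem_work(struct mock_vc4 *vc4, int bin_bo_slot)
{
	struct mock_bo *bo = vc4->bin_bo;

	/* The quick fix: skip the register write instead of dereferencing. */
	if (!bo)
		return false;

	vc4->bpoa = bo->paddr + (uint32_t)bin_bo_slot * vc4->bin_alloc_size;
	return true;
}
```

The check only hides the symptom; it does not explain how the worker
ends up running after bin_bo has been cleared.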

I wanted to post this in case someone more in tune with the inner
workings of the vc4 kernel parts has some spontaneous insight.

Thank you,
Stefan