next/master boot: 273 boots: 63 failed, 209 passed with 1 untried/unknown (next-20171106)

Fri Nov 10 03:26:28 PST 2017

On 10/11/17 09:18, Jon Hunter wrote:

...

> Thanks Ben. However, looking at next-20171109 this one is already in.
> So maybe the bisect is still not getting me to the current issue. When
> booting next-20171109 the last thing I see is ...
> 
> [    2.228178] nouveau 57000000.gpu: NVIDIA GK20A (0ea000a1)
> [    2.233634] nouveau 57000000.gpu: imem: using IOMMU
> [    2.238572] nouveau 57000000.gpu: Direct firmware load for nvidia/gk20a/fecs_inst.bin failed with error -2
> [    2.248295] nouveau 57000000.gpu: Direct firmware load for nouveau/nvea_fuc409c failed with error -2
> [    2.257479] nouveau 57000000.gpu: Direct firmware load for nouveau/fuc409c failed with error -2
> [    2.266189] nouveau 57000000.gpu: gr: failed to load fuc409c
> 
> So no crash. I did see the crash after the bisect, but not at top of
> tree. It appears to hang after the nouveau probe fails. Any thoughts
> on how to debug further?

So this is probably wrong, but here is a clue about what is happening.
It appears that the error code is not being propagated from
gk20a_gr_new(). gk20a_gr_new() is returning -ENODEV due to the firmware
loading failure...

        if (gf100_gr_ctor_fw(gr, "fecs_inst", &gr->fuc409c) ||
            gf100_gr_ctor_fw(gr, "fecs_data", &gr->fuc409d) ||
            gf100_gr_ctor_fw(gr, "gpccs_inst", &gr->fuc41ac) ||
            gf100_gr_ctor_fw(gr, "gpccs_data", &gr->fuc41ad))
                return -ENODEV;

... but this is ignored by nvkm_device_ctor() (probably for good
reason). If I make the following change the hang no longer occurs
(although I realise this is probably wrong as it has been there for
 years!) ...

diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c b/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c
index e14643615698..a611615d3ce7 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c
@@ -2869,7 +2869,7 @@ struct nvkm_engine *
                        subdev = nvkm_device_subdev(device, (s));              \
                        nvkm_subdev_del(&subdev);                              \
                        device->m = NULL;                                      \
-                       if (ret != -ENODEV) {                                  \
+                       if (ret == -ENODEV) {                                  \
                                nvdev_error(device, "%s ctor failed, %d\n",    \
                                            nvkm_subdev_name[s], ret);         \
                                goto done;                                     \

So is gk20a_gr_new() returning the wrong error code when the
firmware load fails?
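
To make the question concrete, below is a minimal stand-alone sketch of
the check as it reads before the change above. It is not the nouveau
code: construct_one() and the ctor_* stubs are made-up names for
illustration, and only the "ret != -ENODEV" comparison is taken from the
macro in the diff. The assumption is that -ENODEV is meant to say "this
unit is not present, carry on without it", while any other error aborts
device construction.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-ins for a subdev constructor such as gk20a_gr_new(). */
typedef int (*ctor_fn)(void);

int ctor_ok(void)     { return 0; }
int ctor_no_fw(void)  { return -ENODEV; } /* e.g. firmware could not be loaded */
int ctor_no_mem(void) { return -ENOMEM; }

/*
 * Simplified model of the existing check: returns 0 when device
 * construction may continue, or the error that aborts it. *present
 * reports whether the unit was actually created.
 */
int construct_one(const char *name, ctor_fn ctor, int *present)
{
        int ret = ctor();

        *present = (ret == 0);
        if (ret == 0)
                return 0;

        if (ret != -ENODEV) {
                /* Any other error is reported and propagated. */
                fprintf(stderr, "%s ctor failed, %d\n", name, ret);
                return ret;
        }

        /* -ENODEV: unit is dropped without a message, construction goes on. */
        return 0;
}
```

In this model a constructor that returns -ENODEV for missing firmware is
indistinguishable from hardware that simply is not there, so the caller
continues with that unit absent.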

I have not gone back to see what has changed in this regard, but I
can, probably next week.

Cheers
Jon

-- 
nvpublic