next/master boot: 273 boots: 63 failed, 209 passed with 1 untried/unknown (next-20171106)
Jon Hunter
jonathanh at nvidia.com
Fri Nov 10 03:26:28 PST 2017
On 10/11/17 09:18, Jon Hunter wrote:
...
> Thanks Ben. However, looking at next-20171109 this one is already in.
> So maybe the bisect is still not getting me to the current issue. When
> booting next-20171109 the last thing I see is ...
>
> [ 2.228178] nouveau 57000000.gpu: NVIDIA GK20A (0ea000a1)
> [ 2.233634] nouveau 57000000.gpu: imem: using IOMMU
> [ 2.238572] nouveau 57000000.gpu: Direct firmware load for nvidia/gk20a/fecs_inst.bin failed with error -2
> [ 2.248295] nouveau 57000000.gpu: Direct firmware load for nouveau/nvea_fuc409c failed with error -2
> [ 2.257479] nouveau 57000000.gpu: Direct firmware load for nouveau/fuc409c failed with error -2
> [ 2.266189] nouveau 57000000.gpu: gr: failed to load fuc409c
>
> So no crash. I did see the crash after the bisect, but not in top of
> tree. It appears to hang after the nouveau probe fails. Any thoughts
> on how to debug further?
So this is probably wrong, but here is a clue about what is happening.
It appears that the error code is not being propagated from
gk20a_gr_new(). gk20a_gr_new() is returning -ENODEV due to the firmware
loading failure...
	if (gf100_gr_ctor_fw(gr, "fecs_inst", &gr->fuc409c) ||
	    gf100_gr_ctor_fw(gr, "fecs_data", &gr->fuc409d) ||
	    gf100_gr_ctor_fw(gr, "gpccs_inst", &gr->fuc41ac) ||
	    gf100_gr_ctor_fw(gr, "gpccs_data", &gr->fuc41ad))
		return -ENODEV;
... but this is ignored by nvkm_device_ctor() (probably for good
reason). If I make the following change the hang no longer occurs
(although I realise this is probably wrong as it has been there for
years!) ...
diff --git a/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c b/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c
index e14643615698..a611615d3ce7 100644
--- a/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c
+++ b/drivers/gpu/drm/nouveau/nvkm/engine/device/base.c
@@ -2869,7 +2869,7 @@ struct nvkm_engine *
 			subdev = nvkm_device_subdev(device, (s)); \
 			nvkm_subdev_del(&subdev); \
 			device->m = NULL; \
-			if (ret != -ENODEV) { \
+			if (ret == -ENODEV) { \
 				nvdev_error(device, "%s ctor failed, %d\n", \
 					    nvkm_subdev_name[s], ret); \
 				goto done; \
So is gk20a_gr_new() returning the wrong error code for when the
firmware load fails?
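To make the behaviour described above concrete, here is a minimal
standalone sketch of the pattern, not the actual nouveau code. All of the
fake_* names are hypothetical. It assumes the firmware loader reports
-ENOENT (the "error -2" in the log), that the engine constructor folds any
such failure into -ENODEV, and that the device constructor treats -ENODEV
as "engine not present" and carries on without reporting it.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for an engine constructor such as gk20a_gr_new(). */
typedef int (*ctor_fn)(void);

/*
 * Models the firmware failure: the loader reported -ENOENT (-2), but the
 * constructor returns -ENODEV for any failure, so the original code is lost.
 */
static int fake_gr_ctor_fw_missing(void)
{
	int fw_ret = -ENOENT;	/* assumed result of the firmware load */

	if (fw_ret)
		return -ENODEV;
	return 0;
}

static int fake_ctor_ok(void)
{
	return 0;
}

static int fake_ctor_nomem(void)
{
	return -ENOMEM;
}

/*
 * Models the filtering in the device constructor: -ENODEV is silently
 * skipped, any other error is reported and aborts construction.
 * *skipped counts the constructors that returned -ENODEV.
 */
static int fake_device_ctor(const ctor_fn *ctors, int n, int *skipped)
{
	int i, ret;

	*skipped = 0;
	for (i = 0; i < n; i++) {
		ret = ctors[i]();
		if (!ret)
			continue;
		if (ret != -ENODEV) {
			fprintf(stderr, "ctor %d failed, %d\n", i, ret);
			return ret;
		}
		(*skipped)++;	/* treated as "not present", no error raised */
	}
	return 0;
}
```

In this sketch, a list of { fake_ctor_ok, fake_gr_ctor_fw_missing,
fake_ctor_ok } makes fake_device_ctor() return 0 with one constructor
skipped, so the firmware failure never reaches the caller, whereas
fake_ctor_nomem makes it return -ENOMEM.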
I have not gone back to see what has changed in this regard, but I
can, probably next week.
Cheers
Jon
--
nvpublic