amdgpu problem after kexec

Mon Feb 8 16:43:27 EST 2021

On Mon, Feb 8, 2021 at 1:34 AM Alexander E. Patrakov <patrakov at gmail.com> wrote:
>
> On Mon, Feb 8, 2021 at 08:32, Alexander E. Patrakov <patrakov at gmail.com> wrote:
> >
> > On Thu, Feb 4, 2021 at 09:31, Alex Deucher <alexdeucher at gmail.com> wrote:
> > >
> > > On Wed, Feb 3, 2021 at 7:56 PM Eric W. Biederman <ebiederm at xmission.com> wrote:
> > > >
> > > > Alex Deucher <alexdeucher at gmail.com> writes:
> > > >
> > > > > On Wed, Feb 3, 2021 at 3:36 AM Dave Young <dyoung at redhat.com> wrote:
> > > > >>
> > > > >> Hi Baoquan,
> > > > >>
> > > > >> Thanks for ccing.
> > > > >> On 01/28/21 at 01:29pm, Baoquan He wrote:
> > > > >> > On 01/11/21 at 01:17pm, Alexander E. Patrakov wrote:
> > > > >> > > Hello,
> > > > >> > >
> > > > > >> > > I was trying out kexec on my new laptop, which is an HP EliteBook 735
> > > > >> > > G6. The problem is, amdgpu does not have hardware acceleration after
> > > > >> > > kexec. Also, strangely, the lines about BlueTooth are missing from
> > > > >> > > dmesg after kexec, but I have not tried to use BlueTooth on this
> > > > >> > > laptop yet. I don't know how to debug this, the relevant amdgpu lines
> > > > >> > > in dmesg are:
> > > > >> > >
> > > > >> > > amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB
> > > > >> > > test failed on gfx (-110).
> > > > >> > > [drm:process_one_work] *ERROR* ib ring test failed (-110).
> > > > >> > >
> > > > >> > > The good and bad dmesg files are attached. Is it a kexec problem (and
> > > > >> > > amdgpu is only a victim), or should I take it to amdgpu lists? Do I
> > > > >> > > need to provide some extra kernel arguments for debugging?
> > > >
> > > > The best debugging step I can think of: can you arrange to have the
> > > > amdgpu modules removed before the final kexec -e?
> > > >
> > > > That would tell us if the code to shut down the GPU exists in the rmmod
> > > > path, aka the .remove method, and is simply missing in the kexec path,
> > > > aka the .shutdown method.
> > > >
> > > >
> > > > > >> > I am not familiar with the graphics component. Adding Dave to CC to see
> > > > > >> > if he has some comments. It would be great if an amdgpu expert could have a look.
> > > > >>
> > > > > >> It needs amdgpu driver people to help.  Since kexec bypasses
> > > > > >> BIOS/UEFI initialization, we require drivers to implement a .shutdown
> > > > > >> method and test it so that the 2nd kernel works correctly.
> > > > >
> > > > > kexec is tricky to make work properly on our GPUs.  The problem is
> > > > > that there are some engines on the GPU that cannot be re-initialized
> > > > > once they have been initialized without an intervening device reset.
> > > > > APUs are even trickier because they share a lot of hardware state with
> > > > > the CPU.  Doing lots of extra resets adds latency.  The driver has
> > > > > code to try and detect if certain engines are running at driver load
> > > > > time and do a reset before initialization to make this work, but it
> > > > > apparently is not working properly on your system.
> > > >
> > > > There are two cases that I think sometimes get mixed up.
> > > >
> > > > There is kexec-on-panic in which case all of the work needs to happen in
> > > > the driver initialization.
> > > >
> > > > There is also a simple kexec in which case some of the work can happen
> > > > in the kernel that is being shut down and sometimes that is easier.
> > > >
> > > > Does it make sense to reset your device unconditionally on driver removal?
> > >
> > > I think we tried that at some point in the past but users complained
> > > that it added latency or artifacts on the display at shutdown or
> > > reboot time.
> > >
> > > > Would it make sense to reset your device unconditionally on driver add?
> > >
> > > Pretty much the same issue there.  It adds latency and you get
> > > artifacts on the display when the reset happens.
> > >
> > > >
> > > > How can someone debug the smart logic of reset on driver load?
> > >
> > > See this block of code in amdgpu_device.c:
> > >         /* check if we need to reset the asic
> > >          *  E.g., driver was not cleanly unloaded previously, etc.
> > >          */
> > >         if (!amdgpu_sriov_vf(adev) && amdgpu_asic_need_reset_on_init(adev)) {
> > >                 r = amdgpu_asic_reset(adev);
> > >                 if (r) {
> > >                         dev_err(adev->dev, "asic reset on init failed\n");
> > >                         goto failed;
> > >                 }
> > >         }
> > >
> > > You'll want to see if amdgpu_asic_need_reset_on_init() was able to
> > > determine that the asic needs a reset.  If it does,
> > > amdgpu_asic_reset() gets called to reset it.
> > > The tricky thing is that some reset methods require a fair amount of
> > > driver state and so, they are only possible when the driver is up and
> > > running.  Those methods are not necessarily available at driver load
> > > time because we need to reset the GPU before we can initialize it and
> > > determine that state, so we end up in a kind of catch-22.
> > > Unfortunately, generic PCI resets don't necessarily work on many of
> > > our GPUs so that's not an option either.
> > >
> > > Alex
> >
> > Sorry for the delay with the reply, I was distracted.
> >
> > Anyway, I managed to unload the amdgpu module successfully, using this
> > script (as /usr/lib/systemd/system-shutdown/debug.sh):
> >
> > #!/bin/sh
> > mount -o remount,rw /
> > echo 0 > /sys/class/vtconsole/vtcon1/bind
> > rmmod amdgpu && echo '<4>==== Succeeded removing amdgpu module ====' > /dev/kmsg
> > dmesg > /var/log/shutdown-log-$(date +%Y%m%d-%H%M%S)
> > mount -o remount,ro /
> >
> > At the end of a non-kexec boot, it logs this:
> >
> > [  116.512621] Console: switching to colour dummy device 80x25
> > [  116.518591] amdgpu 0000:04:00.0: amdgpu: amdgpu: finishing device.
> > [  116.644899] [drm:dal_irq_service_dummy_ack [amdgpu]] *ERROR*
> > dal_irq_service_dummy_ack: called for non-implemented irq source
> > [  116.645168] [drm:dal_irq_service_dummy_set [amdgpu]] *ERROR*
> > dal_irq_service_dummy_set: called for non-implemented irq source
> > [  116.658515] [drm] free PSP TMR buffer
> > [  116.706265] [TTM] Zone  kernel: Used memory at exit: 0 KiB
> > [  116.706276] [TTM] Zone   dma32: Used memory at exit: 0 KiB
> > [  116.706280] [drm] amdgpu: ttm finalized
> > [  116.740460] ==== Succeeded removing amdgpu module ====
> >
> > However, the next kexec-based boot still misses hardware acceleration.
>
> Regarding the reset considerations.
>
> The amdgpu driver contains some logic to reset the card on init if
> needed. However, for all APU chipsets, it says that reset on init is
> not needed. So I tried to force this. In amdgpu_device_init(), I
> changed:

Ah, right.  On dGPUs we check whether the SMU and PSP are running to
decide if a reset is needed.  On APUs those blocks are always running
because they are shared with the CPU, so it doesn't make sense to
check them.
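
To make that logic concrete, here is a minimal userspace sketch of the
decision, assuming the check boils down to "are the SMU/PSP running?".
struct mock_gpu and mock_need_reset_on_init() are hypothetical names made
up for illustration; this is not the actual amdgpu code, which reads
hardware registers through per-ASIC callbacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified device model (NOT the real struct amdgpu_device). */
struct mock_gpu {
	bool is_apu;      /* APU: SMU/PSP are shared with the CPU */
	bool smu_running; /* SMU observed running at driver load */
	bool psp_running; /* PSP observed running at driver load */
};

/*
 * Sketch of the "need reset on init" heuristic described above.
 * On a dGPU, a running SMU or PSP at load time suggests the previous
 * driver instance did not shut the chip down, so a reset is requested.
 * On an APU those blocks are always running, so the check carries no
 * information and the heuristic reports that no reset is needed.
 */
static bool mock_need_reset_on_init(const struct mock_gpu *gpu)
{
	assert(gpu != NULL);

	if (gpu->is_apu)
		return false;

	return gpu->smu_running || gpu->psp_running;
}
```

With this model, a dGPU left with its PSP running reports that a reset
is needed, while an APU never does, which matches the behaviour you saw
before forcing the condition to 1.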

>
>         if (!amdgpu_sriov_vf(adev) && (1 || amdgpu_asic_need_reset_on_init(adev))) {
> ...
>         }
>
>         pci_enable_pcie_error_reporting(adev->ddev.pdev);
>
>         /* Post card if necessary */
>         if (1 || amdgpu_device_need_post(adev)) {
> ...
>         }
>
> Then it tried to reset the card using the MODE2 method, which failed:
>
> [    1.467192] amdgpu 0000:04:00.0: amdgpu: MODE2 reset
> [    1.467194] amdgpu 0000:04:00.0: amdgpu: asic reset on init failed
> [    1.467197] amdgpu 0000:04:00.0: amdgpu: Fatal error during GPU init
>
> The only reset method which doesn't fail is BACO
> (amdgpu.reset_method=4) but unfortunately it doesn't help either. The
> dmesg after kexec is attached. The old workaround that removed the
> amdgpu module on reboot (and thus before kexec) is still active.

mode2 reset is the only reset available on APUs.  The others are not valid.

Did this ever work in the past on this platform?

Alex