amdgpu problem after kexec

Mon Feb 8 19:21:05 EST 2021

On Tue, Feb 9, 2021 at 02:43, Alex Deucher <alexdeucher at gmail.com> wrote:
>
> On Mon, Feb 8, 2021 at 1:34 AM Alexander E. Patrakov <patrakov at gmail.com> wrote:
> >
> > On Mon, Feb 8, 2021 at 08:32, Alexander E. Patrakov <patrakov at gmail.com> wrote:
> > >
> > > On Thu, Feb 4, 2021 at 09:31, Alex Deucher <alexdeucher at gmail.com> wrote:
> > > >
> > > > On Wed, Feb 3, 2021 at 7:56 PM Eric W. Biederman <ebiederm at xmission.com> wrote:
> > > > >
> > > > > Alex Deucher <alexdeucher at gmail.com> writes:
> > > > >
> > > > > > On Wed, Feb 3, 2021 at 3:36 AM Dave Young <dyoung at redhat.com> wrote:
> > > > > >>
> > > > > >> Hi Baoquan,
> > > > > >>
> > > > > >> Thanks for ccing.
> > > > > >> On 01/28/21 at 01:29pm, Baoquan He wrote:
> > > > > >> > On 01/11/21 at 01:17pm, Alexander E. Patrakov wrote:
> > > > > >> > > Hello,
> > > > > >> > >
> > > > > > >> > > I was trying out kexec on my new laptop, which is an HP EliteBook 735
> > > > > > >> > > G6. The problem is, amdgpu does not have hardware acceleration after
> > > > > > >> > > kexec. Also, strangely, the lines about Bluetooth are missing from
> > > > > > >> > > dmesg after kexec, but I have not tried to use Bluetooth on this
> > > > > > >> > > laptop yet. I don't know how to debug this. The relevant amdgpu lines
> > > > > > >> > > in dmesg are:
> > > > > >> > >
> > > > > > >> > > amdgpu 0000:04:00.0: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* IB test failed on gfx (-110).
> > > > > >> > > [drm:process_one_work] *ERROR* ib ring test failed (-110).
> > > > > >> > >
> > > > > >> > > The good and bad dmesg files are attached. Is it a kexec problem (and
> > > > > >> > > amdgpu is only a victim), or should I take it to amdgpu lists? Do I
> > > > > >> > > need to provide some extra kernel arguments for debugging?
> > > > >
> > > > > > The best debugging step I can think of: can you arrange to have the
> > > > > > amdgpu modules removed before the final kexec -e?
> > > > > >
> > > > > > That would tell us whether the code to shut down the GPU exists in the
> > > > > > rmmod path (the .remove method) and is simply missing in the kexec path
> > > > > > (the .shutdown method).
> > > > >
> > > > >
> > > > > > >> > I am not familiar with the graphics component. Adding Dave to CC to
> > > > > > >> > see if he has some comments. It would be great if an amdgpu expert
> > > > > > >> > could have a look.
> > > > > >>
> > > > > > >> It needs amdgpu driver people to help.  Since kexec bypasses
> > > > > > >> BIOS/UEFI initialization, we require drivers to implement the
> > > > > > >> .shutdown method and test it so that the 2nd kernel works correctly.
> > > > > >
> > > > > > kexec is tricky to make work properly on our GPUs.  The problem is
> > > > > > that there are some engines on the GPU that cannot be re-initialized
> > > > > > once they have been initialized without an intervening device reset.
> > > > > > APUs are even trickier because they share a lot of hardware state with
> > > > > > the CPU.  Doing lots of extra resets adds latency.  The driver has
> > > > > > code to try and detect if certain engines are running at driver load
> > > > > > time and do a reset before initialization to make this work, but it
> > > > > > apparently is not working properly on your system.
> > > > >
> > > > > There are two cases that I think sometimes get mixed up.
> > > > >
> > > > > There is kexec-on-panic in which case all of the work needs to happen in
> > > > > the driver initialization.
> > > > >
> > > > > > There is also a simple kexec in which case some of the work can happen
> > > > > > in the kernel that is being shut down and sometimes that is easier.
> > > > >
> > > > > Does it make sense to reset your device unconditionally on driver removal?
> > > >
> > > > I think we tried that at some point in the past but users complained
> > > > that it added latency or artifacts on the display at shutdown or
> > > > reboot time.
> > > >
> > > > > Would it make sense to reset your device unconditionally on driver add?
> > > >
> > > > Pretty much the same issue there.  It adds latency and you get
> > > > artifacts on the display when the reset happens.
> > > >
> > > > >
> > > > > How can someone debug the smart logic of reset on driver load?
> > > >
> > > > See this block of code in amdgpu_device.c:
> > > >         /* check if we need to reset the asic
> > > >          *  E.g., driver was not cleanly unloaded previously, etc.
> > > >          */
> > > >         if (!amdgpu_sriov_vf(adev) && amdgpu_asic_need_reset_on_init(adev)) {
> > > >                 r = amdgpu_asic_reset(adev);
> > > >                 if (r) {
> > > >                         dev_err(adev->dev, "asic reset on init failed\n");
> > > >                         goto failed;
> > > >                 }
> > > >         }
> > > >
> > > > You'll want to see if amdgpu_asic_need_reset_on_init() was able to
> > > > determine that the asic needs a reset.  If it does,
> > > > amdgpu_asic_reset() gets called to reset it.
> > > > The tricky thing is that some reset methods require a fair amount of
> > > > driver state and so, they are only possible when the driver is up and
> > > > running.  Those methods are not necessarily available at driver load
> > > > time because we need to reset the GPU before we can initialize it and
> > > > determine that state, so we end up in a kind of catch-22.
> > > > Unfortunately, generic PCI resets don't necessarily work on many of
> > > > our GPUs so that's not an option either.
> > > >
> > > > Alex
> > >
> > > Sorry for the delay with the reply, I was distracted.
> > >
> > > Anyway, I managed to unload the amdgpu module successfully, using this
> > > script (as /usr/lib/systemd/system-shutdown/debug.sh):
> > >
> > > #!/bin/sh
> > > mount -o remount,rw /
> > > echo 0 > /sys/class/vtconsole/vtcon1/bind
> > > rmmod amdgpu && echo '<4>==== Succeeded removing amdgpu module ====' > /dev/kmsg
> > > dmesg > /var/log/shutdown-log-$(date +%Y%m%d-%H%M%S)
> > > mount -o remount,ro /
> > >
> > > At the end of a non-kexec boot, it logs this:
> > >
> > > [  116.512621] Console: switching to colour dummy device 80x25
> > > [  116.518591] amdgpu 0000:04:00.0: amdgpu: amdgpu: finishing device.
> > > [  116.644899] [drm:dal_irq_service_dummy_ack [amdgpu]] *ERROR* dal_irq_service_dummy_ack: called for non-implemented irq source
> > > [  116.645168] [drm:dal_irq_service_dummy_set [amdgpu]] *ERROR* dal_irq_service_dummy_set: called for non-implemented irq source
> > > [  116.658515] [drm] free PSP TMR buffer
> > > [  116.706265] [TTM] Zone  kernel: Used memory at exit: 0 KiB
> > > [  116.706276] [TTM] Zone   dma32: Used memory at exit: 0 KiB
> > > [  116.706280] [drm] amdgpu: ttm finalized
> > > [  116.740460] ==== Succeeded removing amdgpu module ====
> > >
> > > However, the next kexec-based boot still misses hardware acceleration.
> >
> > Regarding the reset considerations.
> >
> > The amdgpu driver contains some logic to reset the card on init if
> > needed. However, for all APU chipsets, it says that reset on init is
> > not needed. So I tried to force this. In amdgpu_device_init(), I
> > changed:
>
> Ah, right.  The SMU and PSP, which are what we check on dGPUs to see
> whether they are running, are always running on APUs since they are
> shared with the CPU, so it doesn't make sense to check them there.
>
> >
> >         if (!amdgpu_sriov_vf(adev) &&
> >             (1 || amdgpu_asic_need_reset_on_init(adev))) {
> > ...
> >         }
> >
> >         pci_enable_pcie_error_reporting(adev->ddev.pdev);
> >
> >         /* Post card if necessary */
> >         if (1 || amdgpu_device_need_post(adev)) {
> > ...
> >         }
> >
> > Then it tried to reset the card using the MODE2 method, which failed:
> >
> > [    1.467192] amdgpu 0000:04:00.0: amdgpu: MODE2 reset
> > [    1.467194] amdgpu 0000:04:00.0: amdgpu: asic reset on init failed
> > [    1.467197] amdgpu 0000:04:00.0: amdgpu: Fatal error during GPU init
> >
> > The only reset method which doesn't fail is BACO
> > (amdgpu.reset_method=4) but unfortunately it doesn't help either. The
> > dmesg after kexec is attached. The old workaround that removed the
> > amdgpu module on reboot (and thus before kexec) is still active.
>
> mode2 reset is the only reset available on APUs.  The others are not valid.
>
> Did this ever work in the past on this platform?

I don't think so. This laptop's GPU has been supported since linux-5.7
(before that, there were BIOS fetching problems), and the only kernels
where I tested kexec were 5.10 and 5.11-rc{6,7}.
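
For repeating the good/bad comparison on other kernels, here is a minimal
sketch of a helper that classifies a saved dmesg file by the error strings
quoted earlier in this thread. The function name check_amdgpu_log and the
three result labels are made up for illustration; only the matched strings
come from the logs above, and other failure modes are not covered.

```shell
#!/bin/sh
# Hypothetical helper: classify a saved dmesg log by the amdgpu error
# messages quoted in this thread. $1 is the path to the saved log.
check_amdgpu_log() {
    if grep -q 'asic reset on init failed' "$1"; then
        # The reset attempted during driver init did not succeed.
        echo 'reset-failed'
    elif grep -q -e 'IB test failed on' -e 'ib ring test failed' "$1"; then
        # The driver loaded, but the IB ring test timed out or failed.
        echo 'ib-test-failed'
    else
        # None of the messages from this thread were found.
        echo 'no-known-error'
    fi
}
```

It could be run after each boot as "dmesg > /tmp/dmesg.txt" followed by
"check_amdgpu_log /tmp/dmesg.txt" (the path is only an example).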

-- 
Alexander E. Patrakov
CV: http://u.pc.cd/wT8otalK