kdump: need help with kexec -p

Prabhakar Kushwaha prabhakar.kushwaha at nxp.com
Fri Oct 13 02:41:37 PDT 2017


> -----Original Message-----
> From: James Morse [mailto:james.morse at arm.com]
> Sent: Thursday, October 12, 2017 5:11 PM
> To: Prabhakar Kushwaha <prabhakar.kushwaha at nxp.com>;
> takahiro.akashi at linaro.org
> Cc: linux-arm-kernel at lists.infradead.org; Poonam Aggrwal
> <poonam.aggrwal at nxp.com>; Scott Wood <oss at buserror.net>; Abhimanyu
> Saini <abhimanyu.saini at nxp.com>
> Subject: Re: kdump: need help with kexec -p
> 
> Hi Prabhakar,
> 
> (+CC: Akashi Takahiro, who wrote the arm64 kdump support)
> 
> On 11/10/17 10:11, Prabhakar Kushwaha wrote:
> > We are facing some issues while using  kexec -p on ARM64 NXP platforms.
> >
> > 1) After calling kexec -p, if immediately "panic" is triggered the crash kernel
> > does not boot. If we run few commands and wait for atleast (20-30 secs), before
> > triggering the panic, the crash kernel boots.
> 
> What kernel version do you see this on? 
linux-linaro-lsk-v4.4  (f3b1dec5e8f2b4d17442a79bcb1f15953056519d)

> Can you log the kernel output in each
> case, (do you get a 'bye' message even when the new kernel doesn't boot).
> 
Yes, I get the 'bye' message in all cases.

> Does 'kexec -p' report success in both cases? ($? == 0)
> 
> 

Unfortunately, this command is not supported in my root file system.

I always get the prompt back, so I assume kexec runs successfully.
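
Another way to confirm that the crash kernel was loaded, independent of the exit status, is to read /sys/kernel/kexec_crash_loaded, which contains 1 while a crash kernel is loaded. A minimal sketch, assuming a POSIX shell; the function name and its optional file argument are made up for illustration (the argument only lets the function be tried against an ordinary file):

```shell
# Sketch: succeed (status 0) if a crash (kdump) kernel is currently loaded.
# The optional argument is hypothetical and exists only for testing; by
# default the kernel's own sysfs file is read.
crash_kernel_loaded() {
    state_file="${1:-/sys/kernel/kexec_crash_loaded}"
    # Status 2 means the state file could not be read at all.
    [ -r "$state_file" ] || return 2
    # The file holds "1" when a crash kernel is loaded, "0" otherwise.
    [ "$(cat "$state_file")" = "1" ]
}
```

After running kexec -p with the usual arguments, something like `crash_kernel_loaded && echo "crash kernel loaded"` would then report the state.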

> kdump can take many seconds in purgatory, it checksums the kdump image to check
> it didn't get corrupted between 'kexec -p' and crash time, but it doesn't sound
> like this is what you're seeing.
> 
> 

Yes, that understanding is correct.
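
To illustrate only the idea behind that check (this is not the purgatory code): a digest is recorded when the image is loaded, recomputed later, and the image is trusted only if the two match. A minimal sketch using sha256sum as a convenient stand-in; both function names are made up for illustration:

```shell
# Sketch: print the SHA-256 digest of a file (digest only, no file name).
image_digest() {
    sha256sum "$1" | awk '{ print $1 }'
}

# Sketch: succeed only if the file still has the digest recorded earlier.
# $1 = file, $2 = digest recorded at load time.
image_unchanged() {
    [ "$(image_digest "$1")" = "$2" ]
}
```

The digest would be recorded once with `d=$(image_digest file)` and checked later with `image_unchanged file "$d"`.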

> > 2) We do not see the issue ("1" ), when we do umount -a, before calling the panic
> > after kexec-p.
> 
> What filesystems (ext4, nfs etc) do you have mounted, and which ones does
> 'umount -a' get rid of?

root at ls1043ardb:~# mkdir temp; mount -t ext4 /dev/mmcblk0p3 temp/
[   27.786681] EXT4-fs (mmcblk0p3): mounted filesystem with ordered data mode. Opts: (null)
root at ls1043ardb:~# cat /proc/mounts
/dev/root / ext4 rw,relatime,block_validity,delalloc,barrier,user_xattr,acl 0 0
devtmpfs /dev devtmpfs rw,relatime,mode=0755 0 0
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
debugfs /sys/kernel/debug debugfs rw,relatime 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
/dev/mmcblk0p3 /home/root/temp ext4 rw,relatime,data=ordered 0 0

root at ls1043ardb:~# umount -a
umount: /dev: target is busy
        (In some cases useful info about processes that
         use the device is found by lsof(8) or fuser(1).)
umount: /: target is busy
        (In some cases useful info about processes that
         use the device is found by lsof(8) or fuser(1).)

root at ls1043ardb:~# cat /proc/mounts
/dev/root / ext4 rw,relatime,block_validity,delalloc,barrier,user_xattr,acl 0 0
devtmpfs /dev devtmpfs rw,relatime,mode=0755 0 0
proc /proc proc rw,relatime 0 0
sysfs /sys sysfs rw,relatime 0 0
devpts /dev/pts devpts rw,relatime,gid=5,mode=620,ptmxmode=000 0 0
root at ls1043ardb:~#
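
Comparing the two listings, umount -a removed the debugfs mount on /sys/kernel/debug and the ext4 mount on /home/root/temp. A minimal sketch of how that comparison could be automated, assuming the two /proc/mounts snapshots were saved to files first; the function name is made up for illustration:

```shell
# Sketch: list mount points present in a "before" snapshot of /proc/mounts
# but absent from an "after" snapshot. Both arguments are ordinary files.
removed_mounts() {
    before="$1"
    after="$2"
    # While reading the first file (after): remember each mount point (field 2).
    # While reading the second file (before): print mount points not remembered.
    awk 'FILENAME == ARGV[1] { seen[$2] = 1; next }
         !($2 in seen)       { print $2 }' "$after" "$before"
}
```

For example, `cat /proc/mounts > before.txt; umount -a; cat /proc/mounts > after.txt; removed_mounts before.txt after.txt`.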


> Where are these filesystems stored?
> 

We are using a ramdisk.

Bootargs: ttyS0,115200 root=/dev/ram0 earlycon=uart8250,mmio,0x21c0500 crashkernel=512M loglevel=8 ramdisk_size=0x20000000

> How many CPUs does your platform have?
> 

4

> (...does crashing on a different CPU change the behaviour?)
> > taskset -c 1 bash -c "echo c > /proc/sysrq-trigger"
> 

I tried taskset -c 1 bash -c "echo c > /proc/sysrq-trigger" and taskset -c 2 bash -c "echo c > /proc/sysrq-trigger".
Both worked, i.e. the crash kernel booted.

One strange observation: the very first time, the crash kernel never boots. If I restart and try again, it starts working.
I tried 3 iterations. Iteration 1 of 3 failed for both core 1 and core 2; after a restart, subsequent attempts always worked.

I am not able to correlate this with anything.

> 
> > The issue does not seem to pertain to the NXP software it seems.   (because this
> > observation has been observed on very simple kernel, where most of the
> > controllers have been removed from device tree).
> 
> > Also found some info related to this on  internet where it is mentioned that
> > without un-mounting the mounted filesystems, the boot of next kernel is not
> > recommended. (this is in context of kexec -e though)
> > https://www.linux.com/news/reboot-racecar-kexec.
> 
> This is because the filesystem is marked as mounted on-disk, and there may be
> vital data you've written but hasn't made it to the disk yet.
> 
> For 'kexec -e' I think it tries to shutdown and reboot, then jumps to the new
> kernel instead of calling the firmware. This means all filesystems should be
> sync()d, umounted or at least remounted read-only.

OK, understood.

> 
> For kdump, we've already crashed, so you've already lost data. Its a best effort
> can we get to a point where you can debug the original crash.
> 

It looks like umount -a is not mandatory for kexec -p.


Further observations
--------------------
** On upstream, the dump-capture kernel boots (the issue is not observed) **
Default config + enabled RAM Block Device
The commit details are below:
commit 569dbb88e80deb68974ef6fdd6a13edb9d686261
Author: Linus Torvalds <torvalds at linux-foundation.org>
Date:   Sun Sep 3 13:56:17 2017 -0700

    Linux 4.13

commit 5e3b19d8165c2af2afee313c9b40eee55cf27a55
Merge: d0fa6ea 2c0e838
Author: Linus Torvalds <torvalds at linux-foundation.org>
Date:   Sun Sep 3 09:50:26 2017 -0700

    Merge branch 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus
    
    Pull MIPS fixes from Ralf Baechle:
     "The two indirect syscall fixes have sat in linux-next for a few days.
      I did check back with a hardware designer to ensure a SYNC is really
      what's required for the GIC fix and so the GIC fix didn't make it into
      to linux-next in time for this final pull request.
    
      It builds in local build tests and passes Imagination's test system"
    
    * 'upstream' of git://git.linux-mips.org/pub/scm/ralf/upstream-linus:
      irqchip: mips-gic: SYNC after enabling GIC region
      MIPS: Remove pt_regs adjustments in indirect syscall handler
      MIPS: seccomp: Fix indirect syscall args


** On 4.4 LSK: (default defconfig + enabled RAM Block Device); issue is observed **
 commit f3b1dec5e8f2b4d17442a79bcb1f15953056519d
 Merge: f5ca0eb 09e6960
 Author: Alex Shi <alex.shi at linaro.org>
 Date:   Mon Aug 7 12:02:09 2017 +0800
 
      Merge tag 'v4.4.80' into linux-linaro-lsk-v4.4
    
      This is the 4.4.80 stable release

--prabhakar



