kexec: purgatory hang
Cliff Wickman
cpw at sgi.com
Tue Jun 11 18:54:30 EDT 2013
I'm getting a hang when trying to enter a high-memory crash kernel,
and I'm at a loss as to how to debug this.
This is a 3.10.0-rc3 kernel, set up as the crash kernel by kexec 2.0.4.
The machine is an SGI UV1000.
[ 164.027275] SysRq : Trigger a crash
[ 164.031136] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 164.031136] IP: [<ffffffff81397771>] sysrq_handle_crash+0x11/0x20
[ 164.031136] PGD 1fbe835067 PUD 1fbc2e8067 PMD 0
[ 164.031136] Oops: 0002 [#1] SMP
[ 164.031136] xpc : all partitions have deactivated
[ 164.031136] Modules linked in: autofs4 binfmt_misc af_packet rdma_ucm rdma_cm iw_cm ib_addr ib_srp scsi_transport_srp scsi_tgt ib_ipoib ib_cm ib_uverbs ib_umad iw_cxgb3 cxgb3 mdio mlx4_en mlx4_ib ib_sa mlx4_core ib_mthca ib_mad ib_core fuse nls_iso8859_1 nls_cp437 vfat fat loop uv_mmtimer dm_mod sr_mod cdrom usb_storage iTCO_wdt iTCO_vendor_support coretemp mperf kvm_intel ipv6 kvm igb sg crc32c_intel lpc_ich pcspkr mptctl i2c_algo_bit ptp i2c_i801 microcode xhci_hcd joydev ioatdma ehci_pci hid_generic pps_core i2c_core rtc_cmos mfd_core button dca usbhid hid uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif scsi_dh_hp_sw scsi_dh_alua scsi_dh_emc scsi_dh_rdac scsi_dh thermal sata_nv processor piix mptsas mptscsih scsi_transport_sas mptbase megaraid_sas fan thermal_sys hwmon ext3 jbd ata_piix ahci libahci libata scsi_mod
[ 164.031136] CPU: 10 PID: 9299 Comm: dopanic Not tainted 3.10.0-rc3-linus-cpw+ #17
[ 164.031136] Hardware name: Intel Corp. Stoutland Platform, BIOS 2.16 UEFI2.10 PI1.0 X64 2012-04-27
[ 164.031136] task: ffff88203df94440 ti: ffff88203d5c2000 task.ti: ffff88203d5c2000
[ 164.031136] RIP: 0010:[<ffffffff81397771>] [<ffffffff81397771>] sysrq_handle_crash+0x11/0x20
[ 164.031136] RSP: 0018:ffff88203d5c3e68 EFLAGS: 00010092
[ 164.031136] RAX: 000000000000000f RBX: ffffffff81a974e0 RCX: 0000000000000004
[ 164.031136] RDX: 0000000000000000 RSI: ffff881fffd0ef48 RDI: 0000000000000063
[ 164.031136] RBP: ffff88203d5c3e68 R08: ffff881fffd0d3e8 R09: 000000000004268c
[ 164.031136] R10: 0000000000000b8b R11: 0000000000000000 R12: 0000000000000063
[ 164.031136] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000296
[ 164.031136] FS: 00007ffff7fb5700(0000) GS:ffff881fffd00000(0000) knlGS:0000000000000000
[ 164.031136] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 164.031136] CR2: 0000000000000000 CR3: 0000001fbea6c000 CR4: 00000000000007e0
[ 164.031136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 164.031136] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 164.031136] Stack:
[ 164.031136] ffff88203d5c3ea8 ffffffff81398008 01ff88203d5c3e88 0000000000000002
[ 164.031136] ffff895f9d478380 ffff88203d5c3f40 00007ffff7ff8000 ffff88203d5c3f40
[ 164.031136] ffff88203d5c3ec8 ffffffff813980ad ffff88203d5c3ee8 fffffffffffffffb
[ 164.031136] Call Trace:
[ 164.031136] [<ffffffff81398008>] __handle_sysrq+0x128/0x190
[ 164.031136] [<ffffffff813980ad>] write_sysrq_trigger+0x3d/0x40
[ 164.031136] [<ffffffff811c323f>] proc_reg_write+0x4f/0x80
[ 164.031136] [<ffffffff8115f107>] vfs_write+0xe7/0x190
[ 164.031136] [<ffffffff8115f8ec>] SyS_write+0x5c/0xa0
[ 164.031136] [<ffffffff8153c092>] system_call_fastpath+0x16/0x1b
[ 164.031136] Code: 00 48 8b 75 e8 48 81 c7 08 08 00 00 e8 09 c6 19 00 31 d2 eb 95 90 90 90 90 90 55 c7 05 f5 74 96 00 01 00 00 00 48 89 e5 0f ae f8 <c6> 04 25 00 00 00 00 01 c9 c3 0f 1f 44 00 00 8d 47 d0 55 83 f8
[ 164.031136] RIP [<ffffffff81397771>] sysrq_handle_crash+0x11/0x20
[ 164.031136] RSP <ffff88203d5c3e68>
[ 164.031136] CR2: 0000000000000000
The oops above is always the last output; nothing appears after it.
Can anyone suggest any way to debug this problem?
I suppose I can hang the processor just before it executes machine_kexec()
and look at it with crash. Any suggestions as to what to look at?
----------------------------------------------------------------------
The problem does not occur if the crash kernel is loaded below 4G, or if
the machine is an SGI UV2000.
When loaded into low memory it works, as shown below:
[ 517.278643] SysRq : Trigger a crash
[ 517.282507] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 517.282507] IP: [<ffffffff81397771>] sysrq_handle_crash+0x11/0x20
[ 517.282507] PGD 203e209067 PUD 203e6dc067 PMD 0
[ 517.282507] Oops: 0002 [#1] SMP
[ 517.282507] xpc : all partitions have deactivated
[ 517.282507] Modules linked in: autofs4 binfmt_misc af_packet rdma_ucm rdma_cm iw_cm ib_addr ib_srp scsi_transport_srp scsi_tgt ib_ipoib ib_cm ib_uverbs ib_umad iw_cxgb3 cxgb3 mdio mlx4_en mlx4_ib ib_sa mlx4_core ib_mthca ib_mad ib_core fuse nls_iso8859_1 nls_cp437 vfat fat loop sr_mod cdrom uv_mmtimer dm_mod usb_storage coretemp mperf kvm_intel kvm crc32c_intel ipv6 microcode igb ioatdma joydev i2c_algo_bit ptp xhci_hcd hid_generic sg iTCO_wdt i2c_i801 pcspkr i2c_core iTCO_vendor_support ehci_pci lpc_ich pps_core mfd_core dca mptctl button rtc_cmos usbhid hid uhci_hcd ehci_hcd usbcore usb_common sd_mod crc_t10dif scsi_dh_hp_sw scsi_dh_alua scsi_dh_emc scsi_dh_rdac scsi_dh thermal sata_nv processor piix mptsas mptscsih scsi_transport_sas mptbase megaraid_sas fan thermal_sys hwmon ext3 jbd ata_piix ahci libahci libata scsi_mod
[ 517.282507] CPU: 4 PID: 9989 Comm: dopanic Not tainted 3.10.0-rc3-linus-cpw+ #17
[ 517.282507] Hardware name: Intel Corp. Stoutland Platform, BIOS 2.16 UEFI2.10 PI1.0 X64 2012-04-27
[ 517.282507] task: ffff881fbe55a300 ti: ffff881fbc7fc000 task.ti: ffff881fbc7fc000
[ 517.282507] RIP: 0010:[<ffffffff81397771>] [<ffffffff81397771>] sysrq_handle_crash+0x11/0x20
[ 517.282507] RSP: 0018:ffff881fbc7fde68 EFLAGS: 00010092
[ 517.282507] RAX: 000000000000000f RBX: ffffffff81a974e0 RCX: 0000000000000004
[ 517.282507] RDX: 0000000000000000 RSI: ffff88207fd0ef48 RDI: 0000000000000063
[ 517.282507] RBP: ffff881fbc7fde68 R08: ffff88207fd0d3e8 R09: 0000000000042424
[ 517.282507] R10: 0000000000000b83 R11: 0000000000000000 R12: 0000000000000063
[ 517.282507] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000296
[ 517.282507] FS: 00007ffff7fb5700(0000) GS:ffff88207fd00000(0000) knlGS:0000000000000000
[ 517.282507] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 517.282507] CR2: 0000000000000000 CR3: 000000203dbea000 CR4: 00000000000007e0
[ 517.282507] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 517.282507] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 517.282507] Stack:
[ 517.282507] ffff881fbc7fdea8 ffffffff81398008 01ff881fbc7fde88 0000000000000002
[ 517.282507] ffff891fbdaea6c0 ffff881fbc7fdf40 00007ffff7ff8000 ffff881fbc7fdf40
[ 517.282507] ffff881fbc7fdec8 ffffffff813980ad ffff881fbc7fdee8 fffffffffffffffb
[ 517.282507] Call Trace:
[ 517.282507] [<ffffffff81398008>] __handle_sysrq+0x128/0x190
[ 517.282507] [<ffffffff813980ad>] write_sysrq_trigger+0x3d/0x40
[ 517.282507] [<ffffffff811c323f>] proc_reg_write+0x4f/0x80
[ 517.282507] [<ffffffff8115f107>] vfs_write+0xe7/0x190
[ 517.282507] [<ffffffff8115f8ec>] SyS_write+0x5c/0xa0
[ 517.282507] [<ffffffff8153c092>] system_call_fastpath+0x16/0x1b
[ 517.282507] Code: 00 48 8b 75 e8 48 81 c7 08 08 00 00 e8 09 c6 19 00 31 d2 eb 95 90 90 90 90 90 55 c7 05 f5 74 96 00 01 00 00 00 48 89 e5 0f ae f8 <c6> 04 25 00 00 00 00 01 c9 c3 0f 1f 44 00 00 8d 47 d0 55 83 f8
[ 517.282507] RIP [<ffffffff81397771>] sysrq_handle_crash+0x11/0x20
[ 517.282507] RSP <ffff881fbc7fde68>
[ 517.282507] CR2: 0000000000000000
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Initializing cgroup subsys cpuacct
[ 0.000000] Linux version 3.10.0-rc3-linus-cpw+ (cpw@gulag1) (gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #17 SMP Fri Jun 7 10:47:45 CDT 2013
[ 0.000000] Command line: root=LABEL=uv21-sysR13 kdb=on pcie_aspm=on add_efi_memmap cgroup_disable=memory earlyprintk=ttyS0,115200n8 log_buf_len=8M processor.max_cstate=1 stop_machine.lazy=1 nobau console=ttyS0,115200n8 rcutree.rcu_cpu_stall_suppress=1 nortsched cpuidle_sysfs_switch ipmi_si.trydefaults=0 intel_idle.max_cstate=0 nmi_watchdog=0 pci=hpiosize=0,hpmemsize=0,nobar udev.children_max=128 skew_tick=1 relax_domain_level=2 nohz=off highres=off elevator=deadline sysrq=yes reset_devices irqpoll maxcpus=1 noefi acpi_rsdp=0x78d30014 memmap=exactmap memmap=568K@4K memmap=523696K@393216K acpi_rsdp=0x78d30014 elfcorehdr=916912K memmap=4K$0K memmap=4K#572K memmap=4K$1926652K memmap=4K$1926660K memmap=36K$1929652K memmap=8K$1930380K memmap=12K$1930392K memmap=48K$1937884K memmap=8K$1977476K memmap=8K$1977492K memmap=152K$1977540K memmap=64K$1977732K memmap=320K$1978116K memmap=396K#1978436K memmap=468K#1978832K memmap=160K#1979300K memmap=128K#1979460K memmap=4K$2045576K memmap=512K$20456
[ 0.000000] e820: BIOS-provided physical RAM map:
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x0000000000000fff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000001000-0x000000000008efff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000008f000-0x000000000008ffff] ACPI NVS
[ 0.000000] BIOS-e820: [mem 0x0000000000090000-0x000000000009ffff] usable
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000075958fff] usable
[ 0.000000] BIOS-e820: [mem 0x000000007597d000-0x000000007597efff] usable
[ 0.000000] BIOS-e820: [mem 0x000000007597f000-0x000000007597ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000075980000-0x0000000075980fff] usable
[ 0.000000] BIOS-e820: [mem 0x0000000075981000-0x0000000075981fff] reserved
...
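In case it helps anyone reading the crash kernel command line above: the memmap= entries use the kernel's size@start (usable), size#start (ACPI data) and size$start (reserved) syntax. The following is only a rough sketch for decoding them and checking whether a region ends below 4G. The function name parse_memmap and the sample string are made up for illustration, and it assumes decimal values with an optional K/M/G suffix (the kernel also accepts hex, which this ignores). It cannot tell whether the final entry on a truncated command line is complete.

```python
import re

# Size suffixes as powers of 1024 (decimal values only in this sketch).
_UNITS = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}

# memmap=<size><sep><start>: '@' usable RAM, '#' ACPI data, '$' reserved.
_KINDS = {"@": "usable", "#": "ACPI data", "$": "reserved"}

_MEMMAP_RE = re.compile(r"\bmemmap=(\d+)([KMG]?)([@#$])(\d+)([KMG]?)(?=\s|$)")

FOUR_GIB = 1 << 32


def parse_memmap(cmdline):
    """Return (kind, start_bytes, size_bytes) for each memmap=size<sep>start entry."""
    regions = []
    for size, size_unit, sep, start, start_unit in _MEMMAP_RE.findall(cmdline):
        size_bytes = int(size) * _UNITS[size_unit]
        start_bytes = int(start) * _UNITS[start_unit]
        regions.append((_KINDS[sep], start_bytes, size_bytes))
    return regions


# A few entries taken from the low-memory command line in the log.
sample = ("memmap=exactmap memmap=568K@4K memmap=523696K@393216K "
          "memmap=4K$0K memmap=4K#572K")

for kind, start, size in parse_memmap(sample):
    end = start + size  # first byte past the region
    below = end <= FOUR_GIB
    print(f"{kind:9s} start={start:#x} end={end:#x} below_4G={below}")
```

Note that memmap=exactmap carries no range, so it is skipped.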
Thanks for any suggestions.
-Cliff Wickman