kexec fails (pretty often)

Sun Jul 8 21:45:56 EDT 2007

> > > > > > I came across a report about panics on an IA64 system that happen when
> > > > > > kexec is being executed. FSB parity errors get generated:
> > > > > >
> > > > > > BRLD / UC to x8208208208,   A43:41 = x0,  FSB Parity Error detected on Processor Request
> > > > > > BRLC / UC to xFFFF2000000,  A43:41 = x7,  FSB Parity Error detected on the Deferred Reply
> > > > > > BRLD / WB to xFFFFFFF0028,  A43:41 = x7,  FSB Parity Error detected on the Deferred Reply
> > > > > > BRLD / WB to xFFFFFFF0028,  A43:41 = x7,  FSB Parity Error detected on the Deferred Reply
> > > > > > BRLC / UC to xFFFF2000000,  A43:41 = x7,  FSB Parity Error detected on the Deferred Reply
> > > > > > BRLD / UC to x8208208208,   A43:41 = x0,  FSB Parity Error detected on Processor Request
> > > > > >
> > > > > >
> > > > > > And the pattern of the address on the bus actually comes from the
> > > > > > piece of code in arch/ia64/kernel/gate.S that calculates ar.bspstore:
> > > > > >
> > > > > > ...
> > > > > >        sub r14=r14,r17         // r14 <- -rse_num_regs(bspstore1, bsp1)
> > > > > >        movl r17=0x8208208208208209
> > > > > >        ;;
> > > > > >        add r18=r18,r14         // r18 (delta) <- rse_slot_num(bsp0) - rse_num_regs(bspstore1,bsp1)
> > > > > >        setf.sig f7=r17
> > > > > >        cmp.lt p7,p0=r14,r0     // p7 <- (r14 < 0)?
> > > > > >        ;;
> > > > > > ...
> > > > > >
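
The link between the bus address and the gate.S constant can be checked with a few lines of arithmetic. The sketch below is only an illustration: it shows that the 40-bit address x8208208208 from the log equals the top 40 bits of 0x8208208208208209. It also shows that the constant equals ceil(2^69 / 63), which suggests (an assumption here, not something stated in this thread) that it serves as a multiply-high constant for dividing by 63 in the rse_num_regs computation. The helper names top_bits, ceil_div and div63 are made up for this example.

```python
# Constant loaded into r17 in the quoted gate.S excerpt.
MAGIC = 0x8208208208208209

# Address reported in the "BRLD / UC to x8208208208" log lines.
BUS_ADDR = 0x8208208208


def top_bits(value: int, width: int, total: int = 64) -> int:
    """Return the `width` most significant bits of a `total`-bit value."""
    return value >> (total - width)


def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up, for non-negative a and positive b."""
    return -(-a // b)


def div63(n: int) -> int:
    """Divide n by 63 via multiply-high: (n * MAGIC) >> 69.

    Assumed use of MAGIC; exact for the small non-negative n used here.
    """
    high = (n * MAGIC) >> 64  # upper 64 bits of the 128-bit product
    return high >> 5          # remaining shift, 64 + 5 = 69


# The 40-bit bus address equals the top 40 bits of the constant.
matches_prefix = top_bits(MAGIC, 40) == BUS_ADDR

# The constant equals ceil(2**69 / 63).
is_div63_magic = MAGIC == ceil_div(2**69, 63)

if __name__ == "__main__":
    print("bus address matches constant prefix:", matches_prefix)
    print("constant is ceil(2**69 / 63):", is_div63_magic)
```

If the match holds, the address on the bus looks like data (a register value) being used as an address, rather than a legitimate memory reference.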
> > > >
> > > >
> > > > Hi,
> > > >
> > > > Is the problem reproducible? Is there any special configuration or kexec
> > > > command line option to reproduce it?
> > > > On which platform and which version of kernel did you see the issue?
> > > >
> > > > It looks like there may be something wrong with the memory map setting
> > > > of the second kernel.
> > > > Can you send me copies of /proc/iomem of the first kernel and the second
> > > > kernel?
> > > >
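
Once both /proc/iomem dumps are available, comparing them by hand is error-prone. A minimal sketch of such a comparison follows, assuming the usual "start-end : name" layout with two spaces of indentation per nesting level; the function names parse_iomem and diff_iomem are hypothetical, and any sample input is made up rather than taken from the ES7000.

```python
import re
from typing import List, NamedTuple, Tuple


class Region(NamedTuple):
    start: int
    end: int
    name: str
    depth: int  # nesting level, derived from leading indentation


# Matches lines such as "  00001000-00002fff : Kernel code".
_LINE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.*)$")


def parse_iomem(text: str) -> List[Region]:
    """Parse the text of a /proc/iomem dump into a list of regions."""
    regions = []
    for line in text.splitlines():
        m = _LINE.match(line)
        if not m:
            continue  # skip anything that is not a resource line
        indent, start, end, name = m.groups()
        regions.append(
            Region(int(start, 16), int(end, 16), name.strip(), len(indent) // 2)
        )
    return regions


Key = Tuple[int, int, str]


def diff_iomem(first: str, second: str) -> Tuple[List[Key], List[Key]]:
    """Return (regions only in first, regions only in second)."""
    a = {(r.start, r.end, r.name) for r in parse_iomem(first)}
    b = {(r.start, r.end, r.name) for r in parse_iomem(second)}
    return sorted(a - b), sorted(b - a)
```

Calling diff_iomem with the saved contents of the two dumps lists the ranges that appear in only one of the kernels, which is where a memory-map discrepancy would show up.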
> > >
> > > Thanks! I will try to get as much information as I can.
> > > It is 100% reproducible but intermittent; in other words, it happens
> > > with each run, but not predictably (I will get a more precise scenario).
> > > This is a large ES7000 server with up to 512 processors; I will find
> > > out whether this happens only with a large configuration or with any.
> > > The kernel is SLES10 or RHEL4U5; they use both.
> > > I will provide the iomem, not sure how soon - either tomorrow or after
> > > the holiday...
> > >
> > Zou,
> >
> > I got this information. Actually the situation is even worse than I imagined.
> >
> > According to Ben, who is working on this, the details are:
> >
> > --------------------
> > "To sum up what happens, I do this using the default kernel command line
> > (and also one with "debug console=uart,io,0x3f8,115200n8 console=tty0"
> > added to it):
> >
> > # kexec -l /boot/efi/efi/redhat/vmlinuz-2.6.18-8.el5
> > --append=`cat /proc/cmdline`
> > --initrd=/boot/efi/efi/redhat/initrd-2.6.18-8.el5.img
> >
> > # kexec -e
> >
> > The old kernel shuts down and boots the new one successfully, but the
> > new kernel causes a fault during its boot. I can't positively identify
> > the exact spot it crashes because the serial output stops. Going by the
> > screen, it is either during or immediately after the ACPI system tries
> > to detect all of the CPUs. On a couple occasions I've seen it spit out
> > something along the lines of "EFI Time driver" before it blanks the
> > screen out, but it does it very quickly and the Raritan doesn't update
> > fast enough, even if I'm sitting at the cold floor display.
> >
> > The system is configured with a single CPU, as multiple CPUs cause a
> > different error, something along the lines of "huh? CPU #0x200 is
> > already present" - but this also happened on the system without the
> > capacitor fix. Turning on and off hyperthreading doesn't seem to matter
> > either.
> >
> > Here is the entire log and a screen capture of the last things that
> > show up on the video console. The /proc/iomem contents are at line 460
> > in the log, and the kexec I used is at the very end. The second kernel
> > doesn't get far enough to enter any commands, so I'm
> > afraid I can't get you the /proc/iomem for that."
> > --------------
> >
> > Please advise where he can look to analyse this.
> > Thanks!
> > --Natalie
>
>
> From the log it is very hard to tell what is going wrong.
> Is the "acpi=debug" on the command line intended to be there?
> Could you try the latest base kernel?
> Also could you test if "kexec -p" works?
>
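
For reference, the load commands discussed in this thread can be assembled programmatically. This is a minimal sketch, not a tested procedure for the ES7000: the function name kexec_load_argv and the example command line are made up, while -l, -p, --append= and --initrd= are the kexec options already used above. Building an argument list keeps the whole kernel command line in a single --append= argument, which an unquoted `cat /proc/cmdline` in a shell would not do because of word splitting. Note also that kexec -p normally requires memory reserved with the crashkernel= boot parameter in the first kernel.

```python
import shlex
from typing import List


def kexec_load_argv(kernel: str, initrd: str, cmdline: str,
                    panic: bool = False) -> List[str]:
    """Build the argv for loading a kernel with kexec.

    panic=False gives "kexec -l" (load for a normal "kexec -e"),
    panic=True gives "kexec -p" (load a crash kernel for use on panic).
    """
    mode = "-p" if panic else "-l"
    return [
        "kexec",
        mode,
        kernel,
        f"--append={cmdline}",  # one argv element, spaces preserved
        f"--initrd={initrd}",
    ]


if __name__ == "__main__":
    # Paths are the ones quoted in the thread; the command line is an
    # invented placeholder standing in for the contents of /proc/cmdline.
    argv = kexec_load_argv(
        "/boot/efi/efi/redhat/vmlinuz-2.6.18-8.el5",
        "/boot/efi/efi/redhat/initrd-2.6.18-8.el5.img",
        "root=/dev/sda2 ro debug",
        panic=True,
    )
    # Print a shell-safe rendering instead of executing anything.
    print(shlex.join(argv))
```

The list can be handed to subprocess.run as-is when run as root; printing it first makes it easy to include the exact command in a report.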
I will ask them to run kexec -p. I think he intended to use apic=debug;
I will mention that as well.
I see that they load the same kernel as the first one. It is a good
idea to test with the latest vanilla kernel (2.6.22 as of today) and see
if the problem is still there.
Thanks, I will pass information on to you as it becomes available.
Regards,
--Natalie