kexec fails (pretty often)
Natalie Protasevich
protasnb at gmail.com
Wed Jul 4 00:22:06 EDT 2007
On 04 Jul 2007 08:29:39 +0800, Zou Nan hai <nanhai.zou at intel.com> wrote:
> On Wed, 2007-07-04 at 04:24, Eric W. Biederman wrote:
> > "Natalie Protasevich" <protasnb at gmail.com> writes:
> >
> > > I came across a report about panics on an IA64 system that happen when
> > > kexec is being executed. An FSB parity error gets generated:
> > >
> > > BRLD / UC to x8208208208, A43:41 = x0, FSB Parity Error detected on Processor Request
> > > BRLC / UC to xFFFF2000000, A43:41 = x7, FSB Parity Error detected on the Deferred Reply
> > > BRLD / WB to xFFFFFFF0028, A43:41 = x7, FSB Parity Error detected on the Deferred Reply
> > > BRLD / WB to xFFFFFFF0028, A43:41 = x7, FSB Parity Error detected on the Deferred Reply
> > > BRLC / UC to xFFFF2000000, A43:41 = x7, FSB Parity Error detected on the Deferred Reply
> > > BRLD / UC to x8208208208, A43:41 = x0, FSB Parity Error detected on Processor Request
> > >
> > >
> > > And the pattern of the address on the bus is actually coming from the
> > > piece of code in arch/ia64/kernel/gate.S, calculating ar.bspstore:
> > >
> > > ...
> > > sub r14=r14,r17 // r14 <- -rse_num_regs(bspstore1, bsp1)
> > > movl r17=0x8208208208208209
> > > ;;
> > > add r18=r18,r14 // r18 (delta) <- rse_slot_num(bsp0) - rse_num_regs(bspstore1,bsp1)
> > > setf.sig f7=r17
> > > cmp.lt p7,p0=r14,r0 // p7 <- (r14 < 0)?
> > > ;;
> > > ...
> > >
>
>
> Hi,
>
> Is the problem reproducible? Is there any special configuration or kexec
> command line option to reproduce it?
> On which platform and which version of kernel did you see the issue?
>
> It looks like there may be something wrong with the memory map setting
> of the second kernel.
> Can you send me copies of /proc/iomem of the first kernel and the second
> kernel?
>
Thanks! I will try to get as much information as I can.
It is 100% reproducible, but intermittent; in other words, it happens
with each run, but not predictably (I will get a more precise scenario).
This is a large ES7000 server with up to 512 processors; I will find
out whether this happens only with large configurations or with any.
The kernel is from SLES10 or RHEL4U5; they use both.
I will provide the /proc/iomem copies, but I am not sure how soon:
either tomorrow or after the holiday...
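
Once both /proc/iomem captures are available, they could be compared
along these lines. This is only a minimal sketch: the function names
and the sample ranges below are made up for illustration and are not
taken from the ES7000.

```python
import re

# Each /proc/iomem line looks like "  start-end : name", with hex addresses;
# leading indentation marks nested resources.
LINE_RE = re.compile(r"^(\s*)([0-9a-fA-F]+)-([0-9a-fA-F]+)\s*:\s*(.*)$")

def parse_iomem(text):
    """Return a set of (start, end, name) tuples from /proc/iomem text."""
    regions = set()
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            _, start, end, name = m.groups()
            regions.add((int(start, 16), int(end, 16), name.strip()))
    return regions

def diff_iomem(first, second):
    """Regions present only in the first or only in the second kernel."""
    a, b = parse_iomem(first), parse_iomem(second)
    return sorted(a - b), sorted(b - a)

# Hypothetical sample data, not from a real machine.
FIRST = """\
00000000-0009ffff : System RAM
00100000-3fffffff : System RAM
  04000000-07ffffff : Crash kernel
"""
SECOND = """\
00000000-0009ffff : System RAM
04000000-07ffffff : System RAM
"""

if __name__ == "__main__":
    only_first, only_second = diff_iomem(FIRST, SECOND)
    for start, end, name in only_first:
        print(f"- {start:#x}-{end:#x} {name}")
    for start, end, name in only_second:
        print(f"+ {start:#x}-{end:#x} {name}")
```

Lines marked "-" exist only in the first kernel's map and lines marked
"+" only in the second kernel's, so any unexpected range should stand out.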
Regards,
--Natalie
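
P.S. A quick way to check the address pattern, as a sketch only: the
logged address x8208208208 has the same hex digits as the start of the
0x8208208208208209 constant loaded by the movl in the gate.S excerpt.
That constant also equals the ceiling of 2^69/63. My assumption (the
excerpt does not show the multiply) is that it serves as a reciprocal
for a divide by 63. The names below are mine.

```python
# Constant loaded by "movl r17=0x8208208208208209" in the gate.S excerpt.
MAGIC = 0x8208208208208209
# Address digits from the "BRLD / UC to x8208208208" log line.
LOGGED_ADDR = "8208208208"

def is_prefix_of_magic(addr_hex, magic=MAGIC):
    """True if the logged hex digits match the leading digits of the constant."""
    return format(magic, "x").startswith(addr_hex.lower())

def ceil_div(n, d):
    """Integer ceiling division without floating point."""
    return -(-n // d)

if __name__ == "__main__":
    print(is_prefix_of_magic(LOGGED_ADDR))
    print(MAGIC == ceil_div(2**69, 63))
```

If both checks hold, the value on the bus looks like register data
being used as an address, rather than a real memory location.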
> Thanks
> Zou Nan hai
>
>
> > > Have you seen such error before? What would you recommend for debugging this?
> >
> > Not really.
> >
> > However this sounds fairly deterministic on the hardware involved.
> > So I would recommend a code audit.
> >
> > With low-level kexec code like this it really requires someone who knows
> > the architecture to think through the code.
> >
> > Adding in serial output into the assembly and what not can help to
> > isolate the piece of the code causing the problem. But it looks
> > like you have done that.
> >
> > You haven't provided quite enough context for me to understand how
> > this code sequence is reproduced. I would certainly need more
> > information than you have given to even locate the code path this is
> > coming from, as it has been a long time since I looked at ia64.
> >
> > I have CC'd a few likely suspects and the kexec list so with a little
> > luck if anyone is familiar with this they can answer you.
> >
> > Eric
>