[PATCH] aacraid: fails to initialize after a kexec operation
mark_salyzyn at adaptec.com
Mon Apr 30 10:11:03 EDT 2007
Foreign arrays are arrays configured on another adapter then moved over
to the current host adapter. I do not know why this may be the case in
your situation, but it had the smell of behaving like a foreign array
and thus my suggestion. We use commit=1 for all situations where the
importation of an array is not considered an error and there is no BIOS
to intervene prior to driver load. Typically we advise setting this flag
on embedded systems or on non-Intel based architectures. Normally on
Intel based systems the card's BIOS prompts the user at boot (to answer
yes) to accept the array configuration should it be detected as
foreign.
I see some problems with declaring aacraid.commit=1 for kdump: you are
changing the storage system conditions, and the fact that you have a
foreign array may have been the cause of the primary kernel's failure.
You would be rubbing out a factor in the system's failure. I would also
hate to store a kernel dump on an array whose status or origin one does
not know.
If there is a clean shutdown, and there are no outstanding commands from
the OS (including the ioctl, so make sure the management software
commands are shut down), I do not see a reason to reset the adapter.
I agree, the irqpoll is troublesome! Could something else in the kexec
kernel be catching the interrupts and dropping them on the floor? Are
there any other devices sharing that same interrupt line that may be
holding the interrupt asserted? What do /proc/irq/* and /proc/interrupts
show? I did not make it clear, but by routing I mean that there is more
than just the PCI hardware in control of the path of an interrupt from
the controller hardware to the interrupt service routine, so this may
not be purely an issue of the PCI configuration being corrupted.
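As a rough way to answer the shared-interrupt question, the following is a minimal sketch that scans the text of /proc/interrupts for numbered IRQ lines on which the aacraid handler is registered alongside other handlers. The function name shared_irq_lines and the SAMPLE text are hypothetical and only for illustration; the parsing assumes the usual layout of per-CPU counts, then the interrupt controller/type, then a comma-separated list of handler names, which can differ between kernel versions.

```python
def shared_irq_lines(interrupts_text, driver="aacraid"):
    """Return {irq_number: [handler names]} for IRQs that `driver` shares."""
    shared = {}
    for line in interrupts_text.splitlines():
        irq, sep, rest = line.partition(":")
        irq = irq.strip()
        if not sep or not irq.isdigit():
            continue  # header row, or named rows such as NMI/LOC/ERR
        fields = rest.split()
        # Drop the leading per-CPU interrupt counts.
        while fields and fields[0].isdigit():
            fields.pop(0)
        if not fields:
            continue
        # What remains is "<controller/type> handler1, handler2, ...".
        pieces = " ".join(fields).split(",")
        first = pieces[0].split()
        if not first:
            continue
        handlers = [first[-1]] + [p.strip() for p in pieces[1:] if p.strip()]
        if driver in handlers and len(handlers) > 1:
            shared[int(irq)] = handlers
    return shared


# Made-up sample in the /proc/interrupts layout (not output from a real machine).
SAMPLE = """\
           CPU0       CPU1
  0:     123456          0   IO-APIC-edge      timer
 16:       4021        310   IO-APIC-fasteoi   aacraid, uhci_hcd:usb1
 17:         12          0   IO-APIC-fasteoi   eth0
NMI:          0          0   Non-maskable interrupts
"""

print(shared_irq_lines(SAMPLE))
```

On a live system the same function could be fed the contents of /proc/interrupts read from the kexec'd kernel; an empty result would suggest that no other registered handler shares the line, although it says nothing about a device that has no driver loaded and is still asserting the interrupt.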
Sincerely -- Mark Salyzyn
> -----Original Message-----
> From: Vivek Goyal [mailto:vgoyal at in.ibm.com]
> Sent: Monday, April 30, 2007 5:54 AM
> To: Salyzyn, Mark
> Cc: James Bottomley; Kexec Mailing List; Judith Lebzelter;
> linux-scsi at vger.kernel.org; Darrick J. Wong
> Subject: Re: [PATCH] aacraid: fails to initialize after a
> kexec operation
> On Tue, Apr 24, 2007 at 09:21:35AM -0400, Salyzyn, Mark wrote:
> > The system BIOS sets up the card's PCI configuration and there is
> > code in the kernel that is capable of picking up some of the BIOS'
> > information from the BIOS Data Space (not sure if it is actively
> > collected in your configuration, you need a kernel flag to pick this
> > up). On kexec this BIOS Data Space information is missing (?) and if
> > there was any reconfiguration of the PCI space going on (I think
> > only the Linux BIOS project does this), kexec will inherit it. This
> > issue strikes me as a corrupted PCI configuration inherited in the
> > kexec case, such corrupted PCI configurations could be a motherboard
> > specific issue and can be related to the BIOS' initial setup for the
> > initial kernel. At least that is my thought process in questioning
> > the motherboard BIOS or hardware.
> > Another possibility is that after you have patched over the
> > routing issues (a PCI configuration problem), the card has a foreign
> > array, and the reset and reconfiguration is taking arrays offline.
> > Add 'aacraid.commit=1' to force the foreign arrays to be accepted by
> > the card.
> Hi Mark,
> So aacraid.commit=1 and irqpoll combination has done the trick. I can
> kexec/kdump into second kernel. I am using an IBM x366 series machine.
> There is one array and three disks behind it.
> Now few queries.
> - What is the concept of foreign arrays?
> - Should we pass aacraid.commit=1 all the time or this is only for
> some special cases? What's the point in resetting an adapter if it
> does not online the array it is managing?
> - For kexec, it calls the device shutdown routine (aac_shutdown) in
> this case. If this is the case for normal kexec (not kdump) adapter
> should not be reset?
> - Still needs to be found out why PCI configuration is getting
> corrupted and why irq routing is not proper and irqpoll is required.