[PATCH] aacraid: fails to initialize after a kexec operation
vgoyal at in.ibm.com
Wed May 2 00:21:15 EDT 2007
On Mon, Apr 30, 2007 at 10:11:03AM -0400, Salyzyn, Mark wrote:
> Foreign arrays are arrays configured on another adapter then moved over
> to the current host adapter. I do not know why this may be the case in
> your situation, but it had the smell of behaving like a foreign array
> and thus my suggestion. We use commit=1 for all situations where the
> importation of an array is not considered an error and there is no BIOS
> to intervene prior to driver load. Typically we advise to set this flag
> in embedded systems, or in non-Intel based architectures. Normally on
> Intel based systems you get a query from the card's BIOS as you boot
> that queries the user (to answer yes) to accept the array configuration
> should it be detected as foreign.
> I see some problems with declaring aacraid.commit=1 for kdump, you are
> changing the storage system conditions and the fact you have a foreign
> array may have been the cause of the primary kernel's failure. You are
> rubbing out a factor in the system's failure? I would also hate to store
> a kernel dump over an array one does not know the status or origin of.
How does one find out from the BIOS whether an array is local or foreign?
On this machine I have not done any migration. I have not even configured
the array; I think I am using the default factory settings. If I get into
the controller BIOS and query the arrays, it shows me one array of type Volume.
So if an adapter is managing both local and foreign arrays, it would online
the local one upon reset but offline the foreign one. So we can continue to save
By the way, you say that foreign arrays are configured on another adapter
and then moved to the current host adapter. Once the move is complete (I am
assuming it will happen in the first kernel), what is the issue with saving
the dump on a foreign array? I think if applications are actively using the
disks behind the foreign array, then it should not be unreliable to save the
dump on those disks?
> If there is a clean shutdown, and there are no outstanding commands from
> the OS (including the ioctl, so make sure the management software
> commands are shut down), I do not see a reason to reset the adapter.
In the case of a normal kexec (not kdump), a clean shutdown takes place. All
filesystems are unmounted, processes are stopped, and from the kernel we call
device_shutdown(), which should shut down the device and leave no pending
interrupt. I am wondering why this does not happen in the case of aacraid, and
why we end up restarting the adapter even after a clean shutdown using kexec.
> I agree, the irqpoll is troublesome! Could something else in the kexec
> kernel be catching the interrupts and dropping them on the floor? Are
> there any other devices sharing that same interrupt line that may be
> holding the interrupt asserted? /proc/irq/*, /proc/interrupts? By
> routing, I did not make it clear, but there is more than just the PCI
> hardware in control of the path of an Interrupt from the controller
> hardware to the interrupt service routine ... this may not be a pure
> issue with PCI configuration being corrupted.
I will look more into it.
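
As a starting point for the shared-interrupt question, here is a rough
sketch that lists the IRQ lines in /proc/interrupts with more than one
handler registered. The function name shared_irqs is made up for
illustration, and the parsing is a heuristic: it assumes the usual layout
(a header row of CPU columns, then per-IRQ counts, then the interrupt chip
type followed by a comma-separated list of handler names). The exact
layout varies between kernel versions, so treat the output as a hint only.

```python
import os


def shared_irqs(text):
    """Return {irq_number: [handler names]} for IRQ lines with several handlers.

    Heuristic parser for /proc/interrupts-style text; the assumed layout is
    described above and is not guaranteed on every kernel.
    """
    lines = text.splitlines()
    if not lines:
        return {}
    # The header row has one column per CPU (CPU0, CPU1, ...).
    ncpu = len(lines[0].split())
    shared = {}
    for line in lines[1:]:
        label, sep, rest = line.partition(":")
        label = label.strip()
        # Skip summary rows such as NMI, LOC, ERR that have no numeric IRQ.
        if not sep or not label.isdigit():
            continue
        # Drop the per-CPU counters; keep the remainder of the line intact.
        fields = rest.split(None, ncpu)
        if len(fields) <= ncpu:
            continue
        tail = fields[ncpu]
        # Several handlers on one line are separated by commas.
        if "," not in tail:
            continue
        parts = [p.strip() for p in tail.split(",")]
        # The first part still carries the chip type; keep its last token.
        parts[0] = parts[0].split()[-1]
        shared[int(label)] = parts
    return shared


if __name__ == "__main__" and os.path.exists("/proc/interrupts"):
    with open("/proc/interrupts") as f:
        for irq, names in sorted(shared_irqs(f.read()).items()):
            print(irq, ", ".join(names))
```

If aacraid shows up on a line together with another driver in the kexec
kernel, that other device would be a candidate for holding the line
asserted.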