[PATCH 18/19] arm64: kdump: update a kernel doc

Wed Jan 20 18:54:22 PST 2016

On 01/20/16 at 11:28am, Mark Rutland wrote:
> On Wed, Jan 20, 2016 at 10:49:46AM +0800, Dave Young wrote:
> > On 01/19/16 at 02:01pm, Mark Rutland wrote:
> > > On Tue, Jan 19, 2016 at 09:45:53PM +0800, Dave Young wrote:
> > > > On 01/19/16 at 12:51pm, Mark Rutland wrote:
> > > > > On Tue, Jan 19, 2016 at 08:28:48PM +0800, Dave Young wrote:
> > > > > > On 01/19/16 at 02:35pm, AKASHI Takahiro wrote:
> > > > > > > On 01/19/2016 10:43 AM, Dave Young wrote:
> > > > > > > >X86 takes another way in latest kexec-tools and kexec_file_load, that is
> > > > > > > >recreating E820 table and pass it to kexec/kdump kernel, if the entries
> > > > > > > >are over E820 limitation then turn to use setup_data list for remain
> > > > > > > >entries.
> > > > > > > 
> > > > > > > Thanks. I will visit x86 code again.
> > > > > > > 
> > > > > > > >I think it is X86 specific. Personally I think device tree property is
> > > > > > > >better.
> > > > > > > 
> > > > > > > Do you think so?
> > > > > > 
> > > > > > I'm not sure it is the best way. For X86 we run into problem with
> > > > > > memmap= design, one example is pci domain X (X>1) need the pci memory
> > > > > > ranges being passed to kdump kernel. When we passed reserved ranges in /proc/iomem
> > > > > > to 2nd kernel we find that cmdline[] array is not big enough.
> > > > > 
> > > > > I'm not sure how PCI ranges relate to the memory map used for normal
> > > > > memory (i.e. RAM), though I'm probably missing some caveat with the way
> > > > > ACPI and UEFI describe PCI. Why does memmap= affect PCI memory?
> > > > 
> > > > Here is the old patch which was rejected in kexec-tools:
> > > > http://lists.infradead.org/pipermail/kexec/2013-February/007924.html
> > > > 
> > > > > 
> > > > > If the kernel got the rest of its system topology from DT, the PCI
> > > > > regions would be described there.
> > > > 
> > > > Yes, if kdump kernel use same DT as 1st kernel.
> > > 
> > > Other than for testing purposes, I don't see why you'd pass the kdump
> > > kernel a DTB inconsistent with that the 1st kernel was passsed (other
> > > than some proerties under /chosen).
> > > 
> > > We added /sys/firmware/fdt specifically to allow the kexec tools to get
> > > the exact DTB the first kernel used. There's no reason for tools to have
> > > to make something up.
> > 
> > Agreed, but kexec-tools has an option to pass in any dtb files. Who knows
> > how one will use it unless dropping the option and use /sys/firmware/fdt
> > unconditionally. 
> 
> I think this is a tangential discussion. I think it's fine to say that
> for kdump we do not expect this -- a user would be shooting themselves
> in the foot if they did. Regardless, I was under the impression that
> kdump was usually set up by distribution-provided init code.
> 
> or kdump, which typically is set up automatically by the OS, 

Yes, usually os setup kdump but with some config file user can tune the kexec
arguments. Anyway I agree that one should do right but if we are sure exact
fdt in 1st kernel is needed we can drop kexec-tools --dtb option. 

> 
> > If we choose to implement kexec_file_load only in kernel, the interfaces
> > provided are kernel, initrd and cmdline. We can always use same dtb.
> 
> There are use-cases where being in complete control of the purgatory
> code is necessary. For example, the next OS might not be Linux (and
> might not accept a DTB, or have different requirements on the initial
> register state).
> 
> Regardless of the need for something like kexec_file_load for kdump in
> Secure Boot environments, there is also a need for kexec_load with the
> user having complete control.

I'm not sure if there are such use cases in arm64 in real life.
But indeed it is a reason kexec_load can exist if there really are such requests. 

> 
> > > > > > Do you think for arm64 only usable memory is necessary to let kdump kernel
> > > > > > know? I'm curious about how arm64 kernel get all memory layout from boot loader,
> > > > > > via UEFI memmap?
> > > > > 
> > > > > When booted via EFI, we use the EFI memory map. The EFI stub handles
> > > > > acquring the relevant information and passing that to the first kernel
> > > > > in the DTB (see Documentation/arm/uefi.txt).
> > > > 
> > > > Ok, thanks for the pointer. So in dt we are just have uefi memmap infomation
> > > > instead of memory nodes details.. 
> > > 
> > > When booted via EFI, yes.
> > > 
> > > For NUMA topology in !ACPI kernels, we might need to also retain and
> > > parse memory nodes, but only for toplogy information. The kernel would
> > > still only use memory as described by the EFI memory map.
> > > 
> > > There's a horrible edge case I've spotted if performing a chain of
> > > cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
> > > respect the EFI memory map so as to avoid corrupting it for the
> > > subsequent LE kernel. Other than this I believe everything should just
> > > work.
> > 
> > Firmware do not know kernel endianniess, kernel should respect firmware
> > maps and adapt to it, it sounds like a generic issue not specfic to kexec.
> 
> I agree that this isn't kexec's fault as such, but in the absence of
> kexec, the above issue does not exist, so one can't consider it in
> isolation.
> 
> > > > > A kexec'd kernel should simply inherit that. So long as the DTB and/or
> > > > > UEFI tables in memory are the same, it would be the same as a cold boot.
> > > > 
> > > > For kexec all memory ranges are same, for kdump we need use original reserved
> > > > range with crashkernel= as usable memory and all other orignal usable ranges
> > > > are not usable anymore. 
> > > 
> > > Sure. This is what I believe we should expose with an additional
> > > property under /chosen, while keeping everything else pristine.
> > > 
> > > The crash kernel can then limit itself to that region, while it would
> > > have the information of the full memory map (which it could log and/or
> > > use to drive other dumping).
> > 
> > In this way kernel should be aware it is a kdump booting, it is doable though
> > I feel it is better for kdump kernel in a black box with infomations it
> > can use just like the 1st kernel. Things here is where we choose to cook
> > the memory infomation in boot loader or in kernel itself.
> 
> Sorry, I can't follow what you are trying to say here. Could you
> elaborate?

Hmm, I menas if we prepare a kdump usable uefi memmap then we do not need introduce
the extra dtb property and kdump kernel just works like a normal boot.
I think we have understand each other upon latter part of this mail :)

One additianl issue with the simple way is if it can be used only in kdump kernel
Or it applys to both normal boot and kdump kernel boot so that it becomes a
general interface instead of only for kdump.

That means in 1st kernel we need override all system ram sections from uefi if
the usable chosen property is provided in 1st kernel.

> 
> > > > Is it possible to modify uefi memmap for kdump case?
> > > 
> > > Technically it would be possible, however I don't think it's necessary,
> > > and I think it would be disadvantageous to do so.
> > > 
> > > Describing the range(s) the crash kernel can use in separate properties
> > > under /chosen has a number of advantages.
> > 
> > Ok, I got the points. We have a is_kdump_kernel() by checking if there is
> > elfcorehdr_addr kernel cmdline. This is mainly for some drivers which
> > do not work well in kdump kernel some uncertain reasons. But ideally I
> > think kernel should handle things just like in 1st kernel and avoid to use
> > it. 
> 
> I agree that we should not have kexec/kdump knowledge spread throughout
> the kernel, and that the boot protocol should be uniform with a cold
> boot as far as possible.
> 
> However, requiring userspace or the first kernel to modify
> firmware-provided information has a number of risks and reduces the
> amount of information available to the kdump kernel. To that end I am
> opposed to modifying the memory nodes in the DTB, or to modifying the
> EFI memory map.
> 
> Having a property in the DTB describing the range(s) of memory reserved
> for use by the kdump kernel is vastly simpler, and avoids those risks:
> 
> * It requires a tiny amount of self-contained code in the kdump kernel
>   to parse the property and apply the constraints imposed (i.e. carve up
>   memblock).
> 
>   This is easy to contain in a single function (or at least within a
>   single file), and need not affect drivers or other code.
> 
> * It is uniform regardless of whether the EFI memory map, DT memory
>   nodes, or some other mechanism is used to discover memory in the
>   systems.
> 
>   This makes it easy to impose the restrictions consistently, and is
>   somewhat future-proof.

Ok, considering arm64 specific complexity of the several combind cases
especially this one I would say choosing a simple solution may be the
best choice.

> 
> * Userspace or the first kernel to not need to parse and modify an
>   arbitrary amount of data (which might be in an extended format it
>   doesn't fully understand). There is less risk for this to go wrong.
> 
>   It is far easier to add a property than it is to correctly modify the
>   EFI memory map, memory nodes, or some other data structure. There is
>   less risk, and it is somewhat future-proof.
>   
> * The original memory map information is preserved, even though unused.
>   This may be useful for debugging, and it may turn out that the kdump
>   kernel needs to know about certain portions of the original memory
>   map, even if we are not currently aware of why we would need this.

Thanks
Dave

> 
> Thanks,
> Mark.