[PATCH 18/19] arm64: kdump: update a kernel doc

Thu Jan 21 22:23:14 PST 2016

On 01/21/2016 09:02 PM, Mark Rutland wrote:
> On Thu, Jan 21, 2016 at 03:53:42PM +0900, AKASHI Takahiro wrote:
>> On 01/20/2016 08:49 PM, Mark Rutland wrote:
>>> On Wed, Jan 20, 2016 at 03:07:53PM +0900, AKASHI Takahiro wrote:
>>>> On 01/20/2016 11:49 AM, Dave Young wrote:
>>>>> On 01/19/16 at 02:01pm, Mark Rutland wrote:
>>>>>> On Tue, Jan 19, 2016 at 09:45:53PM +0800, Dave Young wrote:
>>>>>>> On 01/19/16 at 12:51pm, Mark Rutland wrote:
>>>>>>>> On Tue, Jan 19, 2016 at 08:28:48PM +0800, Dave Young wrote:
>>>>>>>>> On 01/19/16 at 02:35pm, AKASHI Takahiro wrote:
>>>>>>>>>> On 01/19/2016 10:43 AM, Dave Young wrote:
>>>>>>>>>>> X86 takes another way in latest kexec-tools and kexec_file_load, that is
>>>>>>>>>>> recreating E820 table and pass it to kexec/kdump kernel, if the entries
>>>>>>>>>>> are over E820 limitation then turn to use setup_data list for remain
>>>>>>>>>>> entries.
>>>>>>>>>>
>>>>>>>>>> Thanks. I will visit x86 code again.
>>>>>>>>>>
>>>>>>>>>>> I think it is X86 specific. Personally I think device tree property is
>>>>>>>>>>> better.
>>>>>>>>>>
>>>>>>>>>> Do you think so?
>>>>>>>>>
>>>>>>>>> I'm not sure it is the best way. For X86 we run into problem with
>>>>>>>>> memmap= design, one example is pci domain X (X>1) need the pci memory
>>>>>>>>> ranges being passed to kdump kernel. When we passed reserved ranges in /proc/iomem
>>>>>>>>> to 2nd kernel we find that cmdline[] array is not big enough.
>>>>>>>>
>>>>>>>> I'm not sure how PCI ranges relate to the memory map used for normal
>>>>>>>> memory (i.e. RAM), though I'm probably missing some caveat with the way
>>>>>>>> ACPI and UEFI describe PCI. Why does memmap= affect PCI memory?
>>>>>>>
>>>>>>> Here is the old patch which was rejected in kexec-tools:
>>>>>>> http://lists.infradead.org/pipermail/kexec/2013-February/007924.html
>>>>>>>
>>>>>>>>
>>>>>>>> If the kernel got the rest of its system topology from DT, the PCI
>>>>>>>> regions would be described there.
>>>>>>>
>>>>>>> Yes, if kdump kernel use same DT as 1st kernel.
>>>>>>
>>>>>> Other than for testing purposes, I don't see why you'd pass the kdump
>>>>>> kernel a DTB inconsistent with that the 1st kernel was passsed (other
>>>>>> than some proerties under /chosen).
>>>>>>
>>>>>> We added /sys/firmware/fdt specifically to allow the kexec tools to get
>>>>>> the exact DTB the first kernel used. There's no reason for tools to have
>>>>>> to make something up.
>>>>>
>>>>> Agreed, but kexec-tools has an option to pass in any dtb files. Who knows
>>>>> how one will use it unless dropping the option and use /sys/firmware/fdt
>>>>> unconditionally.
>>>>
>>>> As a matter of fact, specifying proper command line parameters as well as
>>>> dtb is partly users' responsibility for kdump to work correctly.
>>>> (especially for BE kernel)
>>>>
>>>>> If we choose to implement kexec_file_load only in kernel, the interfaces
>>>>> provided are kernel, initrd and cmdline. We can always use same dtb.
>>>>
>>>> I would say that we can always use the same dtb even with kexec_load
>>> >from user's perspective. Right?
>>>
>>> No.
>>>
>>> This breaks using kexec for boot-loader purposes, and imposes a policy.
>>
>> What kind of policy?
>> I said "can", but if we want to use other setting/configuration, we can
>> still have a full control.
>
> Apologies, I misunderstood.
>
> In most cases, using the existing DTB (with minor modifications to
> /chosen for bootargs and such) is fine. If the user just wants to boot
> another Linux kernel, that's generally what they should do.
>
> I think we're agreed on that.

Yes.

> However, there are cases when the user may want to use a different DTB,
> or use a different purgatory. So we cannot mandate that the existing DTB
> is reused, nor that an in-kernel purgatory must be used, as that imposes
> a policy.

Agree!

>>> For better or worse kexec_file_load has always imposed a constrained
>>> Linux-only policy, so that's a different story.
>>>
>>>>>> There's a horrible edge case I've spotted if performing a chain of
>>>>>> cross-endian kexecs: LE -> BE -> LE, as the BE kernel would have to
>>>>>> respect the EFI memory map so as to avoid corrupting it for the
>>>>>> subsequent LE kernel. Other than this I believe everything should just
>>>>>> work.
>>>>>
>>>>> Firmware do not know kernel endianniess, kernel should respect firmware
>>>>> maps and adapt to it, it sounds like a generic issue not specfic to kexec.
>>>>
>>>> On arm64, a kernel image header has a bit field to specify the image's endianness.
>>>> Anyway, our current implementation replies on a user-supplied dtb to start BE kernel.
>>>
>>> The firmware should _never_ care about the kernel's endianness. The
>>> bootlaoder or first kernel shouldn't care about the next kernel's
>>> endianness apart from in exceptional circumstances. The DTB for a LE
>>> kernel should look identical to that passed to a BE kernel.
>>
>> Please note that I didn't say anything different from your last two statements.
>> The current arm64 kexec implementation doesn't do anything specific to BE,
>> but as far as BE kernel doesn't support UEFI, users are responsible for
>> providing a proper dtb.
>
> I'm just confused as to what you mean by a "proper dtb" in that case.
>
> If you just mean one with memory nodes hacked in, then that would
> currently be a way to make that work, yes.

One of useful cases that I have in my mind is kdump.
We may want to use a small sub-set of dtb, especially devices, to
make the reboot more reliable. Device drivers are likely to be vulnerable
at crash.

> It seems like the better option is to fix the BE kernel to support a
> UEFI memory map, as that solves other issues.

Why did Ard throw away his patch?

>>
>>> In my mind, the only valid reason to look at that bit is so that
>>> bootloaders can provide a warning if the CPU does not implement that
>>> endianness.
>>>
>>> The issue I mention above should be solved by changes to the BE kernel.
>>>
>>>>>>> Is it possible to modify uefi memmap for kdump case?
>>>>>>
>>>>>> Technically it would be possible, however I don't think it's necessary,
>>>>>> and I think it would be disadvantageous to do so.
>>>>>>
>>>>>> Describing the range(s) the crash kernel can use in separate properties
>>>>>> under /chosen has a number of advantages.
>>>>>
>>>>> Ok, I got the points. We have a is_kdump_kernel() by checking if there is
>>>>> elfcorehdr_addr kernel cmdline. This is mainly for some drivers which
>>>>> do not work well in kdump kernel some uncertain reasons. But ideally I
>>>>> think kernel should handle things just like in 1st kernel and avoid to use
>>>>> it.
>>>>
>>>> So I'm not still sure about what are advantages of a property under /chosen
>>>> over "memmap=" kernel parameter.
>>>> Both are simple and can have the same effect with minimizing changes to dtb.
>>>> (But if, in the latter case, we have to provide *all* the memory-related information
>>>> through "memmap=" parameters, it would be much complicated.)
>>>
>>> The reason I prefer a property over command line additions include:
>>
>> Take some examples:
>> (a) a property under /chosen
>>    {
>>      chosen = {
>>        cmdline = "elfcorehdr=AA at BB maxcpus=1 ...";
>>      }
>>      usable-memory = <XX YY>;
>>      memory {
>>        ...
>>      }
>>    }
>>
>> (b) a kernel command line parameter
>>    (I use the same name, "usable-memory", to show the similarity. may use another name though.)
>>    {
>>      chosen = {
>>        cmdline = "elfcorehdr=AA at BB maxcpus=1 usable-memory=YY at XX ...";
>>      }
>>      memory {
>>        ...
>>      }
>>    }
>>
>>> * It keeps the command line simple (as you mention the opposite is
>>>    "complicated").
>>
>> I think both are simple.
>>
>>> * It is logically separate from options the user may pass to the kernel
>>>    in that the restricted region(s) of memory avaialble are effectively
>>>    properties of the system (in that the crashed OS is part of the system
>>>    state).
>>
>> "elfcorehdr=" parameter already breaks your point.
>> "elfcorehdr=" looks to be, what you say, a system property, and is actually
>> added by kexec-tools on all architectures, and "usable-memory", whether it is
>> a DT property or a kernel parameter, will also be added by kexec-tools.
>> (Users don't have to care.)
>
> Just because architectures do one thing today does not mean that we have
> to follow it.
>
> I don't think that breaks my point so much as shows that a different
> approach is taken by others today.
>
> There's also no reason this cannot be a property under /chosen.

No, but no strong reason to be so IMO.

>>> * The semantics of the command line parsing can change subtly over time
>>>    (for example, see 51e158c12aca3c9a, which terminates command line
>>>    parseing at "--"). Maknig sure that a command line option will
>>>    actually be parsed by the next kernel is not trivial.
>>>
>>>    Keeping this information isolated from the command line is more
>>>    robust.
>>
>> Even so, who wants to use kdump without testing?
>> and this is not a kdump specific issue.
>>
>>> * Addition of a property is a self-contained operation, that doesn't
>>>    require any knowledge about the command line.
>>
>> I don't get your point here.
>> For a kernel parameter, early_param() can encapsulate all the stuffs necessary.
>> Once the kernel recognizes a usable memory region, limiting available
>> memory should be done in the exact same way.
>
> I mean when modifying the command line.

OK, I understand what you mean.

> To place "elfcorehdr=" or "memmap="/"usable-memory=" into the command
> line, one needs to know where it is valid to place it. Appending doesn't
> always work, as per the example above with 51e158c12aca3c9a.

So, are you now suggesting that we put both "elfcorehdr=" and
"usable-memory=" under /chosen in dtb? That's fair enough.
(as far as nobody cares about incompatibility with other archs.)

-Takahiro AKASHI

> For both of these my point was that generally there is some fragility in
> this area. While it's easy to say that breaking this would be someone
> else's fault, we can easily avoid the possibility of that happening, and
> avoid a set of problems trying to maintain backwards compatibility if
> there were a sensible change that happened to break things.
>
> Thanks,
> Mark.
>