[PATCH 0/4] kdump: crashkernel reservation from CMA
Jiri Bohac
jbohac at suse.cz
Fri Dec 1 04:35:41 PST 2023
On Thu, Nov 30, 2023 at 12:01:36PM +0800, Baoquan He wrote:
> On 11/29/23 at 11:51am, Jiri Bohac wrote:
> > We get a lot of problems reported by partners testing kdump on
> > their setups prior to release. But even if we tune the reserved
> > size up, OOM is still the most common reason for kdump to fail
> > when the product starts getting used in real life. It's been
> > pretty frustrating for a long time.
>
> I remember SUSE engineers once said you would boot a kernel, estimate
> the kdump kernel's memory usage, and then set crashkernel according to
> that estimate. Is OOM still triggered even when that approach is
> taken? Just curious, not questioning the benefit of using ,cma to
> save memory.
Yes, we do that during the kdump package build. We use it to
establish a baseline for the memory requirements of the kdump
kernel and tools on that specific product. Using these numbers we
estimate the requirements on the system where kdump is
configured, adding extra memory for the amount of RAM, number of
SCSI devices, etc. But apparently we get this wrong in too many
cases, because the actual hardware differs too much from the
virtual environment we used to obtain the baseline numbers. We've
been adding silly constants to the calculations and we still get
OOMs on the one hand and people hesitant to sacrifice the
calculated amount of memory on the other.
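
To make the kind of calculation concrete, here is a minimal sketch in
Python of such a heuristic; the baseline and the per-RAM/per-device
constants are made up for illustration, they are not the values our
tooling actually uses:

# Hypothetical sketch of a crashkernel size heuristic: a measured
# baseline plus increments scaled by system properties. All constants
# are invented for illustration only.
def estimate_crashkernel_mb(ram_gb, scsi_devices, baseline_mb=192):
    """Return an estimated crashkernel reservation in MB."""
    extra_for_ram = ram_gb // 64 * 16   # e.g. 16 MB per 64 GB of RAM
    extra_for_devices = scsi_devices    # e.g. ~1 MB per SCSI device
    safety_margin = 64                  # flat safety margin
    return baseline_mb + extra_for_ram + extra_for_devices + safety_margin

# A 256 GB host with 40 SCSI devices under these made-up constants:
print(estimate_crashkernel_mb(256, 40))   # -> 360

Whatever the exact formula, it bakes in constants measured in one
environment, which is exactly where it goes wrong once the real
hardware differs.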
The result is that kdump basically cannot be trusted unless the
user verifies that the sacrificed memory is still enough after
every major upgrade.
This is the main motivation behind the CMA idea: to safely give
kdump enough memory, including a safe margin, without sacrificing
too much memory.
> > I feel the exact opposite about VMs. Reserving hundreds of MB for
> > crash kernel on _every_ VM on a busy VM host wastes the most
> > memory. VMs are often tuned to a well-defined task and can be set
> > up with very little memory, so the ~256 MB can be a huge part of
> > that. And while it's theoretically better to dump from the
> > hypervisor, users still often prefer kdump because the hypervisor
> > may not be under their control. Also, in a VM it should be much
> > easier to be sure the machine is safe WRT the potential DMA
> > corruption as it has fewer HW drivers. So I actually thought the
> > CMA reservation could be most useful on VMs.
>
> Hmm, we once discussed this upstream with David Hildenbrand, who
> works in the virt team. The VM problem is much easier to solve if
> people complain that the default crashkernel value is wasteful. The
> shrinking interface is for them. The crashkernel value can't be
> enlarged, but shrinking the existing crashkernel memory works
> smoothly. They can adjust it in a script in a very simple way.
The shrinking does not solve this problem at all. It solves a
different problem: the virtual hardware configuration can easily
vary between boots, and so will the crashkernel size requirements.
And since crashkernel needs to be passed on the command line, once
the system is booted it's impossible to change it without a
reboot. Here the shrinking mechanism comes in handy:
we reserve enough for all configurations on the command line, and
during boot the requirements of the currently booted
configuration can be determined and the reservation shrunk to
that value. But determining this value is the same unsolved
problem as above, and CMA could help in exactly the same way.
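
For reference, the shrinking interface is the
/sys/kernel/kexec_crash_size sysfs file; a minimal sketch of how a
boot-time script might use it (the hard part, computing the target
size, is exactly what remains unsolved):

# Sketch: shrink the boot-time crashkernel reservation to an estimated
# requirement via /sys/kernel/kexec_crash_size. Typically done before
# the kdump kernel is loaded; the reservation can only be shrunk, not
# grown.
PATH = "/sys/kernel/kexec_crash_size"

def shrink_crashkernel(target_bytes):
    with open(PATH) as f:
        current = int(f.read())
    if target_bytes < current:
        with open(PATH, "w") as f:
            f.write(str(target_bytes))

# Example: shrink to 256 MB if the boot-time reservation was larger.
shrink_crashkernel(256 * 1024 * 1024)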
> Anyway, let's discuss and figure out any risk of ,cma. If in the end
> all worries and concerns prove unnecessary, then let's have a great
> new feature. But we can't afford the risk if the ,cma area could
> become entangled with the 1st kernel's ongoing activity. As we know,
> unlike a kexec reboot, we only shut down CPUs and interrupts; most
> devices stay alive. And many of them may not be reset and
> reinitialized in the kdump kernel if the relevant driver is not
> included.
Well, since my patchset makes the use of ,cma completely optional
and has _absolutely_ _no_ _effect_ on users that don't opt in,
I think you're not taking any risk at all. We will never know
how much of a problem DMA is in practice unless we give users or
distros a way to try it and come up with good ways of determining
whether it's safe on a specific system, based on the hardware,
drivers, etc.
I've successfully tested the patches on a few systems, physical
and virtual. Of course this is not proof that the DMA problem
does not exist, but it shows that this may be a solution that
mostly works. If nothing else, on systems where sacrificing
~400 MB of memory prevents the user from having any dump at all,
having a dump that mostly works with a sacrifice of ~100 MB may
be useful.
Thanks,
--
Jiri Bohac <jbohac at suse.cz>
SUSE Labs, Prague, Czechia