[RFC][nvdimm][crash] pmem memmap dump support

HAGIO KAZUHITO(萩尾 一仁) k-hagio-ab at nec.com
Mon Mar 6 18:05:04 PST 2023


On 2023/02/23 15:24, lizhijian at fujitsu.com wrote:
> Hello folks,
> 
> This mail raises a requirement for pmem memmap dump support and some possible solutions, all of
> which are still preliminary. I really hope you can provide some feedback.
> 
> pmem memmap can also be called pmem metadata here.
> 
> ### Background and motivation overview ###
> ---
> Crash dump is an important feature for troubleshooting the kernel. It is the last resort for finding
> out what happened at a kernel panic, slowdown, and so on. It is the most important tool for customer
> support. However, part of the data on pmem is not included in the crash dump, which can make it
> difficult to analyze problems around pmem (especially Filesystem-DAX).
> 
> 
> A pmem namespace in "fsdax" or "devdax" mode requires allocation of per-page metadata[1]. The allocation
> can be drawn from either mem (system memory) or dev (the pmem device); see `ndctl help create-namespace`
> for more details. In fsdax, the struct page array becomes very important: it is one of the key pieces
> of data for finding the status of the reverse map.
> 
> So, when the metadata is stored in pmem, even pmem's per-page metadata is not dumped. That means
> troubleshooters are unable to check further details about pmem from the dumpfile.
> 
> ### Make pmem memmap dump support ###
> ---
> Our goal is that, whether the metadata is stored in mem or in pmem, it can be dumped so that the
> crash utilities can read more details about the pmem. Of course, this feature can be enabled/disabled.
> 
> First, based on our previous investigation, we can divide the problem into the following four cases
> (A, B, C, D) according to the location of the metadata and the scope of the dump.
> Note that although cases A and B are mentioned below, we do not want them to be part of this
> feature, because dumping the entire pmem would consume a lot of space and, more importantly,
> the dump may contain sensitive user data.
> 
> +-------------+-----------------------+
> |             |   metadata location   |
> | dump scope  +----------+------------+
> |             |   mem    |    pmem    |
> +-------------+----------+------------+
> | entire pmem |    A     |     B      |
> +-------------+----------+------------+
> | metadata    |    C     |     D      |
> +-------------+----------+------------+
> 
> Case A&B: unsupported
> - Only the regions listed in PT_LOAD in vmcore are dumpable. This can be resolved by adding the pmem
> region to vmcore's PT_LOADs in kexec-tools.
> - makedumpfile assumes that all page objects of the entire region described in PT_LOADs are readable,
> and then skips/excludes specific pages according to their attributes. But in the case of pmem, the
> 1st kernel only allocates page objects for the pmem namespaces, so makedumpfile will throw
> errors[2] when specific -d options are specified.
> Accordingly, we should make makedumpfile ignore these errors for pmem regions.
> 
> Because the above cases are not part of our goal, we must consider how to prevent the data part of
> pmem from being read by the dump application (makedumpfile).
> 
> Case C: natively supported
> The metadata is stored in mem, and the entire mem/ram is dumpable.
> 
> Case D: unsupported && need your input
> To support this case, makedumpfile needs to know, for each pmem namespace, where the metadata is
> located and its address range [start, end) within the pmem.
> 
> We have thought of a few possible options:
> 
> 1) In the 2nd kernel, with the help of the information exported by the pmem drivers under
> /sys/bus/nd/devices/{namespaceX.Y, daxX.Y, pfnX.Y}, makedumpfile can calculate the address and size
> of the metadata.
> 2) In the 1st kernel, add a new symbol to the vmcore. The symbol is associated with the layout of
> each namespace. makedumpfile reads the symbol and figures out the address and size of the metadata.

Hi Zhijian,

Sorry, I probably don't understand this well enough, but does this mean that
  1. /proc/vmcore exports pmem regions as PT_LOADs, which contain
     unreadable parts, and
  2. makedumpfile gets to know the readable regions somehow?

If so, /proc/vmcore with pmem could no longer be captured by other commands,
e.g. the cp command?
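
To make the question concrete, here is a minimal sketch in Python (purely
illustrative, not makedumpfile code). It assumes a tool is handed one PT_LOAD
range covering a pmem region plus a list of readable metadata ranges, and has
to clip its reads to the overlap. The function name and all addresses are
hypothetical. A generic copier such as cp has no such list and would simply
read the whole segment.

```python
from typing import List, Tuple

# Half-open physical address range [start, end).
Range = Tuple[int, int]


def readable_parts(segment: Range, readable: List[Range]) -> List[Range]:
    """Return the pieces of `segment` that overlap a range in `readable`.

    Assumes the ranges in `readable` do not overlap each other.
    """
    seg_start, seg_end = segment
    parts = []
    for start, end in sorted(readable):
        lo = max(seg_start, start)
        hi = min(seg_end, end)
        if lo < hi:  # keep only non-empty overlaps
            parts.append((lo, hi))
    return parts


# Hypothetical layout: one 1 GiB PT_LOAD covering a pmem region, with two
# 16 MiB metadata areas (one per namespace) that are safe to read.
pmem_segment = (0x1_0000_0000, 0x1_4000_0000)
metadata_ranges = [
    (0x1_0000_0000, 0x1_0100_0000),
    (0x1_2000_0000, 0x1_2100_0000),
]

safe = readable_parts(pmem_segment, metadata_ranges)
for lo, hi in safe:
    print(f"read {hi - lo:#x} bytes at {lo:#x}")
```

Everything in the segment outside these pieces would have to be skipped, which
is exactly the knowledge a plain copy of /proc/vmcore does not have.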

Thanks,
Kazu

> 3) others ?
> 
> But then we found that we had overlooked one use case: the user could save the dumpfile to the
> pmem. Neither of these two options can solve this problem, because the pmem drivers re-initialize
> the metadata while they are loading, so the metadata we dump is inconsistent with the metadata at
> the moment of the crash.
> Put simply, can we just disable the pmem in the 2nd kernel so that the previous metadata will not
> be destroyed? But this has the drawback that the 2nd kernel would not allow the user to store the
> dumpfile on a filesystem/partition backed by pmem.
> 
> So I hope you can provide some ideas about this feature/requirement and about possible solutions
> for cases A, B and D mentioned above. It would be greatly appreciated.
> 
> If I’m missing something, feel free to let me know. Any feedback and comments are very welcome.
> 
> 
> [1] Pmem region layout:
>     ^<--namespace0.0---->^<--namespace0.1------>^
>     |                    |                      |
>     +--+m----------------+--+m------------------+---------------------+-+a
>     |++|e                |++|e                  |                     |+|l
>     |++|t                |++|t                  |                     |+|i
>     |++|a                |++|a                  |                     |+|g
>     |++|d  namespace0.0  |++|d  namespace0.1    |     un-allocated    |+|n
>     |++|a    fsdax       |++|a     devdax       |                     |+|m
>     |++|t                |++|t                  |                     |+|e
>     +--+a----------------+--+a------------------+---------------------+-+n
>     |                                                                   |t
>     v<-----------------------pmem region------------------------------->v
> 
> [2] https://lore.kernel.org/linux-mm/70F971CF-1A96-4D87-B70C-B971C2A1747C@roc.cs.umass.edu/T/
> 
> 
> Thanks
> Zhijian

