[RFC][nvdimm][crash] pmem memmap dump support

HAGIO KAZUHITO(萩尾 一仁) k-hagio-ab at nec.com
Tue Mar 7 00:31:01 PST 2023


On 2023/03/07 11:49, lizhijian at fujitsu.com wrote:
> On 07/03/2023 10:05, HAGIO KAZUHITO(萩尾 一仁) wrote:
>> On 2023/02/23 15:24, lizhijian at fujitsu.com wrote:
>>> Hello folks,
>>>
>>> This mail raises a requirement for dumping the pmem memmap and sketches some possible solutions, all of
>>> which are still premature. I would really appreciate your feedback.
>>>
>>> In this mail, "pmem memmap" and "pmem metadata" refer to the same thing.
>>>
>>> ### Background and motivation overview ###
>>> ---
>>> Crash dump is an important feature for kernel troubleshooting. It is the last resort for finding out
>>> what happened at a kernel panic, slowdown, and so on, and it is the most important tool for customer
>>> support. However, part of the data on pmem is not included in the crash dump, which makes it difficult
>>> to analyze problems around pmem (especially filesystem-DAX).
>>>
>>>
>>> A pmem namespace in "fsdax" or "devdax" mode requires the allocation of per-page metadata[1]. The
>>> allocation can be drawn from either mem (system memory) or dev (the pmem device itself); see
>>> `ndctl help create-namespace` for more details. In fsdax mode, the struct page array becomes very
>>> important; it is one of the key pieces of data for finding the status of the reverse map.
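>>>
>>> For scale, the struct page array dominates this metadata. A back-of-the-envelope sketch (assuming a
>>> 64-byte struct page and 4 KiB pages, and ignoring the info block and alignment padding):
>>>
>>>     /* ~64 bytes of struct page per 4 KiB page: roughly 1.6% of the
>>>      * namespace, e.g. ~16 GiB of memmap for a 1 TiB namespace. */
>>>     #include <stdio.h>
>>>
>>>     int main(void)
>>>     {
>>>             unsigned long long ns_bytes = 1ULL << 40;        /* 1 TiB namespace  */
>>>             unsigned long long nr_pages = ns_bytes / 4096;   /* 4 KiB pages      */
>>>             unsigned long long memmap = nr_pages * 64;       /* struct page size */
>>>
>>>             printf("memmap: %llu MiB\n", memmap >> 20);      /* prints 16384     */
>>>             return 0;
>>>     }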
>>>
>>> So when the metadata is stored on pmem, even the pmem's per-page metadata will not be dumped. That means
>>> troubleshooters are unable to check further details about the pmem from the dumpfile.
>>>
>>> ### Adding pmem memmap dump support ###
>>> ---
>>> Our goal is that, whether the metadata is stored in mem or on pmem, it can be dumped, so that the
>>> crash utilities can read more details about the pmem. Of course, this feature can be enabled/disabled.
>>>
>>> First, based on our previous investigation, we can divide the problem into the following four cases
>>> (A, B, C, D) according to the location of the metadata and the scope of the dump.
>>> Note that although cases A and B are described below, we do not want them to be part of this feature:
>>> dumping the entire pmem would consume a lot of space, and more importantly, it may contain sensitive
>>> user data.
>>>
>>> +--------------+-----------------------+
>>> |              |   metadata location   |
>>> |  dump scope  +----------+------------+
>>> |              |   mem    |    pmem    |
>>> +--------------+----------+------------+
>>> | entire pmem  |    A     |     B      |
>>> +--------------+----------+------------+
>>> | metadata     |    C     |     D      |
>>> +--------------+----------+------------+
>>>
>>> Case A&B: unsupported
>>> - Only the regions listed in the PT_LOADs of the vmcore are dumpable (a minimal sketch of this coverage
>>> check follows below). This can be resolved by adding the pmem regions to the vmcore's PT_LOADs in
>>> kexec-tools.
>>> - makedumpfile assumes that the page objects for the entire regions described in the PT_LOADs are
>>> readable, and skips/excludes specific pages according to their attributes. But in the case of pmem, the
>>> 1st kernel only allocates page objects for the namespaces of the pmem, so makedumpfile throws errors[2]
>>> when certain -d options are specified.
>>> Accordingly, we should make makedumpfile ignore these errors for pmem regions.
>>>
>>> Because the cases above are not our goal, we must consider how to prevent the data part of the pmem
>>> from being read by the dump application (makedumpfile).
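>>>
>>> For reference, a minimal sketch of such a PT_LOAD coverage check over /proc/vmcore. This is plain ELF64
>>> program-header walking, an illustration only, not makedumpfile's actual code:
>>>
>>>     #include <elf.h>
>>>     #include <stdio.h>
>>>     #include <string.h>
>>>
>>>     /* Return 1 if [paddr, paddr + len) lies inside some PT_LOAD of the
>>>      * ELF core open at fp, i.e. the range is dumpable. */
>>>     static int covered_by_pt_load(FILE *fp, unsigned long long paddr,
>>>                                   unsigned long long len)
>>>     {
>>>             Elf64_Ehdr ehdr;
>>>             Elf64_Phdr phdr;
>>>             int i;
>>>
>>>             rewind(fp);
>>>             if (fread(&ehdr, sizeof(ehdr), 1, fp) != 1 ||
>>>                 memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0)
>>>                     return 0;
>>>
>>>             for (i = 0; i < ehdr.e_phnum; i++) {
>>>                     if (fseek(fp, ehdr.e_phoff + i * sizeof(phdr), SEEK_SET))
>>>                             return 0;
>>>                     if (fread(&phdr, sizeof(phdr), 1, fp) != 1)
>>>                             return 0;
>>>                     if (phdr.p_type == PT_LOAD &&
>>>                         paddr >= phdr.p_paddr &&
>>>                         paddr + len <= phdr.p_paddr + phdr.p_memsz)
>>>                             return 1;
>>>             }
>>>             return 0;
>>>     }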
>>>
>>> Case C: natively supported
>>> The metadata is stored in mem, and the entire mem/RAM is dumpable.
>>>
>>> Case D: unsupported; input needed
>>> To support this case, makedumpfile needs to know the location of the metadata for each pmem namespace,
>>> i.e. the [start, end) address range of the metadata on the pmem.
>>>
>>> We have thought of a few possible options:
>>>
>>> 1) In the 2nd kernel, with the help of the information in /sys/bus/nd/devices/{namespaceX.Y, daxX.Y, pfnX.Y}
>>> exported by the pmem drivers, makedumpfile calculates the address and size of the metadata (a sketch of
>>> this calculation follows below).
>>> 2) In the 1st kernel, add a new symbol to the vmcore. The symbol is associated with the layout of each
>>> namespace; makedumpfile reads the symbol and figures out the address and size of the metadata.
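>>>
>>> A minimal sketch of option 1. It assumes the namespaceX.Y and pfnX.Y devices expose "resource"
>>> attributes (the physical base of the namespace and of the first data page, respectively) and that the
>>> metadata occupies the gap between them; both assumptions would need to be verified against the pmem
>>> drivers:
>>>
>>>     #include <stdio.h>
>>>
>>>     /* Read one hex-formatted sysfs attribute; returns 0 on failure. */
>>>     static unsigned long long read_nd_attr(const char *dev, const char *attr)
>>>     {
>>>             char path[256];
>>>             unsigned long long val = 0;
>>>             FILE *f;
>>>
>>>             snprintf(path, sizeof(path),
>>>                      "/sys/bus/nd/devices/%s/%s", dev, attr);
>>>             f = fopen(path, "r");
>>>             if (!f)
>>>                     return 0;
>>>             if (fscanf(f, "%llx", &val) != 1)
>>>                     val = 0;
>>>             fclose(f);
>>>             return val;
>>>     }
>>>
>>>     /* Compute the [start, end) physical range of the per-page metadata
>>>      * of one fsdax namespace, e.g. ("namespace0.0", "pfn0.0"). */
>>>     static int pmem_meta_range(const char *ns, const char *pfn,
>>>                                unsigned long long *start,
>>>                                unsigned long long *end)
>>>     {
>>>             unsigned long long ns_base = read_nd_attr(ns, "resource");
>>>             unsigned long long data_base = read_nd_attr(pfn, "resource");
>>>
>>>             if (!ns_base || !data_base || data_base <= ns_base)
>>>                     return -1;
>>>             *start = ns_base;       /* metadata sits at the head...        */
>>>             *end = data_base;       /* ...and ends where data pages begin  */
>>>             return 0;
>>>     }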
>>
>> Hi Zhijian,
>>
>> sorry, probably I don't understand enough, but do these mean that
>>     1. /proc/vmcore exports pmem regions with PT_LOADs, which contain
>>        unreadable ones, and
>>     2. makedumpfile gets to know the readable regions somehow?
> 
> Kazu,
> 
> Generally, only the metadata of the pmem needs to be readable by the crash utilities, because the metadata
> contains its own memmap (page array). The rest of the pmem may be used as a block device (DAX filesystem) or
> for other purposes, so it is not much help for troubleshooting.
> 
> In my understanding, PT_LOADs are part of the ELF format, and the vmcore complies with it.
> My current thoughts are:
> 1. The crash tool will export the entire pmem region to /proc/vmcore, so makedumpfile/cp and other
> commands can read the entire pmem region directly.
> 2. Export the namespace layout to the vmcore as a symbol, so that dumping applications (makedumpfile)
> can figure out where the metadata is and read the metadata only (a hypothetical sketch follows below).
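> 
> A hypothetical kernel-side sketch of thought 2. The structure, the table, and the hook point are all
> assumptions rather than an existing interface; only the VMCOREINFO_* macros themselves are real:
> 
>     #include <linux/crash_core.h>
> 
>     /* Hypothetical: filled in by the pmem drivers as namespaces are set up. */
>     struct pmem_meta_range {
>             unsigned long long start;   /* physical start of metadata */
>             unsigned long long end;     /* physical end (exclusive)   */
>     };
> 
>     #define PMEM_META_MAX 16
>     struct pmem_meta_range pmem_meta_ranges[PMEM_META_MAX];
>     int pmem_meta_count;
> 
>     /* Would be called from (or alongside) crash_save_vmcoreinfo_init() so
>      * that makedumpfile can resolve the table via the vmcoreinfo note. */
>     void pmem_meta_save_vmcoreinfo(void)
>     {
>             VMCOREINFO_SYMBOL(pmem_meta_ranges);
>             VMCOREINFO_SYMBOL(pmem_meta_count);
>             VMCOREINFO_STRUCT_SIZE(pmem_meta_range);
>             VMCOREINFO_OFFSET(pmem_meta_range, start);
>             VMCOREINFO_OFFSET(pmem_meta_range, end);
>             VMCOREINFO_NUMBER(PMEM_META_MAX);
>     }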

Ah, got it, thanks!

My understanding is that makedumpfile/cp will be able to read the entire
pmem, but with some makedumpfile -d option values it cannot get the
physical address of the struct page for data pages and throws an error.  So
you think there will be a need to export the ranges of the allocated metadata.

Thanks,
Kazu

> 
> Not sure whether the reply is helpful; if you have any other questions, feel free to let me know. :)
> 
> 
> Thanks
> Zhijian
> 
>>
>> Then /proc/vmcore with pmem cannot be captured by other commands,
>> e.g. the cp command?
>>
>> Thanks,
>> Kazu
>>
>>> 3) others ?
>>>
>>> But then we found that we had overlooked a use case: the user could save the dumpfile to the pmem itself.
>>> Neither of the two options above can handle this, because the pmem drivers re-initialize the metadata
>>> while they are loaded in the 2nd kernel, which makes the metadata we dump inconsistent with the metadata
>>> at the moment the crash happened.
>>> Could we simply disable the pmem in the 2nd kernel so that the previous metadata is not destroyed? But
>>> that brings the inconvenience that the 2nd kernel would no longer allow the user to store the dumpfile
>>> on a filesystem/partition backed by pmem.
>>>
>>> So I hope you can provide some ideas about this feature/requirement and about possible solutions for
>>> cases A, B, and D mentioned above; it would be greatly appreciated.
>>>
>>> If I'm missing something, feel free to let me know. Any feedback and comments are very welcome.
>>>
>>>
>>> [1] Pmem region layout:
>>>       ^<--namespace0.0---->^<--namespace0.1------>^
>>>       |                    |                      |
>>>       +--+m----------------+--+m------------------+---------------------+-+a
>>>       |++|e                |++|e                  |                     |+|l
>>>       |++|t                |++|t                  |                     |+|i
>>>       |++|a                |++|a                  |                     |+|g
>>>       |++|d  namespace0.0  |++|d  namespace0.1    |     un-allocated    |+|n
>>>       |++|a    fsdax       |++|a     devdax       |                     |+|m
>>>       |++|t                |++|t                  |                     |+|e
>>>       +--+a----------------+--+a------------------+---------------------+-+n
>>>       |                                                                   |t
>>>       v<-----------------------pmem region------------------------------->v
>>>
>>> [2] https://lore.kernel.org/linux-mm/70F971CF-1A96-4D87-B70C-B971C2A1747C@roc.cs.umass.edu/T/
>>>
>>>
>>> Thanks
>>> Zhijian

