[RFC PATCH v2 0/3] pmem memmap dump support

Li, Zhijian lizhijian at fujitsu.com
Wed May 24 22:36:46 PDT 2023


Ping

Baoquan, Dan

Sorry to bother you again.

Could you further comment a word or two on this set?


Thanks
Zhijian


on 5/10/2023 6:41 PM, Zhijian Li (Fujitsu) wrote:
> Hi Dan
>
>
> on 5/8/2023 5:45 PM, Zhijian Li (Fujitsu) wrote:
>> Dan,
>>
>>
>> On 29/04/2023 02:59, Dan Williams wrote:
>>> Li Zhijian wrote:
>>>> Hello folks,
>>>>
>>>> About 2 months ago, we posted our first RFC[3] and received your kindly feedback. Thank you :)
>>>> Now, I'm back with the code.
>>>>
>>>> Currently, this RFC has already implemented to supported case D*. And the case A&B is disabled
>>>> deliberately in makedumpfile. It includes changes in 3 source code as below:
>>> I think the reason this patchkit is difficult to follow is that it
>>> spends a lot of time describing a chosen solution, but not enough time
>>> describing the problem and the tradeoffs.
>>>
>>> For example why is updating /proc/vmcore with pmem metadata the chosen
>>> solution? Why not leave the kernel out of it and have makedumpfile
>>> tooling aware of how to parse persistent memory namespace info-blocks
>>> and retrieve that dump itself? This is what I proposed here:
>>>
>>> http://lore.kernel.org/r/641484f7ef780_a52e2940@dwillia2-mobl3.amr.corp.intel.com.notmuch
>> Sorry for the late reply. I'm just back from the vacation.
>> And sorry again for missing your previous *important* information in V1.
>>
>> Your proposal also sounds to me with less kernel changes, but more ndctl coupling with makedumpfile tools.
>> In my current understanding, it will includes following source changes.
> The kernel and makedumpfile has updated. It's still in a early stage, but in order to make sure I'm following your proposal.
> i want to share the changes with you early. Alternatively, you are able to refer to my github for the full details.
> https://github.com/zhijianli88/makedumpfile/commit/8ebfe38c015cfca0545cb3b1d7a6cc9a58fc9bb3
>
> If I'm going the wrong way, fee free to let me know :)
>
>
>> -----------+-------------------------------------------------------------------+
>> Source     |                      changes                                      |
>> -----------+-------------------------------------------------------------------+
>> I.         | 1. enter force_raw in kdump kernel automatically(avoid metadata being updated again)|
> kernel should adapt it so that the metadata of pmem will be updated again in the kdump kernel:
>
> diff --git a/drivers/nvdimm/namespace_devs.c b/drivers/nvdimm/namespace_devs.c
> index c60ec0b373c5..2e59be8b9c78 100644
> --- a/drivers/nvdimm/namespace_devs.c
> +++ b/drivers/nvdimm/namespace_devs.c
> @@ -8,6 +8,7 @@
>    #include <linux/slab.h>
>    #include <linux/list.h>
>    #include <linux/nd.h>
> +#include <linux/crash_dump.h>
>    #include "nd-core.h"
>    #include "pmem.h"
>    #include "pfn.h"
> @@ -1504,6 +1505,8 @@ struct nd_namespace_common *nvdimm_namespace_common_probe(struct device *dev)
>                           return ERR_PTR(-ENODEV);
>           }
>    
> +       if (is_kdump_kernel())
> +               ndns->force_raw = true;
>           return ndns;
>    }
>    EXPORT_SYMBOL(nvdimm_namespace_common_probe);
>
>> kernel     |                                                                   |
>>               | 2. mark the whole pmem's PT_LOAD for kexec_file_load(2) syscall   |
>> -----------+-------------------------------------------------------------------+
>> II. kexec- | 1. mark the whole pmem's PT_LOAD for kexe_load(2) syscall         |
>> tool       |                                                                   |
>> -----------+-------------------------------------------------------------------+
>> III.       | 1. parse the infoblock and calculate the boundaries of userdata and metadata   |
>> makedump-  | 2. skip pmem userdata region                                      |
>> file       | 3. exclude pmem metadata region if needed                         |
>> -----------+-------------------------------------------------------------------+
>>
>> I will try rewrite it with your proposal ASAP
> inspect_pmem_namespace() will walk the namespaces and the read its resource.start and infoblock. With this
> information, we can calculate the boundaries of userdata and metadata easily. But currently this changes are
> strongly coupling with the ndctl/pmem which looks a bit messy and ugly.
>
> ============makedumpfile=======
>
> diff --git a/Makefile b/Makefile
> index a289e41ef44d..4b4ded639cfd 100644
> --- a/Makefile
> +++ b/Makefile
> @@ -50,7 +50,7 @@ OBJ_PART=$(patsubst %.c,%.o,$(SRC_PART))
>    SRC_ARCH = arch/arm.c arch/arm64.c arch/x86.c arch/x86_64.c arch/ia64.c arch/ppc64.c arch/s390x.c arch/ppc.c arch/sparc64.c arch/mips64.c arch/loongarch64.c
>    OBJ_ARCH=$(patsubst %.c,%.o,$(SRC_ARCH))
>    
> -LIBS = -ldw -lbz2 -ldl -lelf -lz
> +LIBS = -ldw -lbz2 -ldl -lelf -lz -lndctl
>    ifneq ($(LINKTYPE), dynamic)
>    LIBS := -static $(LIBS) -llzma
>    endif
> diff --git a/makedumpfile.c b/makedumpfile.c
> index 98c3b8c7ced9..db68d05a29f9 100644
> --- a/makedumpfile.c
> +++ b/makedumpfile.c
> @@ -27,6 +27,8 @@
>    #include <limits.h>
>    #include <assert.h>
>    #include <zlib.h>
> +#include <sys/types.h>
> +#include <ndctl/libndctl.h>
>
> +
> +#define INFOBLOCK_SZ (8192)
> +#define SZ_4K (4096)
> +#define PFN_SIG_LEN 16
> +
> +typedef uint64_t u64;
> +typedef int64_t s64;
> +typedef uint32_t u32;
> +typedef int32_t s32;
> +typedef uint16_t u16;
> +typedef int16_t s16;
> +typedef uint8_t u8;
> +typedef int8_t s8;
> +
> +typedef int64_t le64;
> +typedef int32_t le32;
> +typedef int16_t le16;
> +
> +struct pfn_sb {
> +       u8 signature[PFN_SIG_LEN];
> +       u8 uuid[16];
> +       u8 parent_uuid[16];
> +       le32 flags;
> +       le16 version_major;
> +       le16 version_minor;
> +       le64 dataoff; /* relative to namespace_base + start_pad */
> +       le64 npfns;
> +       le32 mode;
> +       /* minor-version-1 additions for section alignment */
> +       le32 start_pad;
> +       le32 end_trunc;
> +       /* minor-version-2 record the base alignment of the mapping */
> +       le32 align;
> +       /* minor-version-3 guarantee the padding and flags are zero */
> +       /* minor-version-4 record the page size and struct page size */
> +       le32 page_size;
> +       le16 page_struct_size;
> +       u8 padding[3994];
> +       le64 checksum;
> +};
> +
> +static int nd_read_infoblock_dataoff(struct ndctl_namespace *ndns)
> +{
> +       int fd, rc;
> +       char path[50];
> +       char buf[INFOBLOCK_SZ + 1];
> +       struct pfn_sb *pfn_sb = (struct pfn_sb *)(buf + SZ_4K);
> +
> +       sprintf(path, "/dev/%s", ndctl_namespace_get_block_device(ndns));
> +
> +       fd = open(path, O_RDONLY|O_EXCL);
> +       if (fd < 0)
> +               return -1;
> +
> +
> +       rc = read(fd, buf, INFOBLOCK_SZ);
> +       if (rc < INFOBLOCK_SZ) {
> +               return -1;
> +       }
> +
> +       return pfn_sb->dataoff;
> +}
> +
> +int inspect_pmem_namespace(void)
> +{
> +       struct ndctl_ctx *ctx;
> +       struct ndctl_bus *bus;
> +       int rc = -1;
> +
> +       fprintf(stderr, "\n\ninspect_pmem_namespace!!\n\n");
> +       rc = ndctl_new(&ctx);
> +       if (rc)
> +               return -1;
> +
> +       ndctl_bus_foreach(ctx, bus) {
> +               struct ndctl_region *region;
> +
> +               ndctl_region_foreach(bus, region) {
> +                       struct ndctl_namespace *ndns;
> +
> +                       ndctl_namespace_foreach(region, ndns) {
> +                               enum ndctl_namespace_mode mode;
> +                               long long start, end_metadata;
> +
> +                               mode = ndctl_namespace_get_mode(ndns);
> +                               /* kdump kernel should set force_raw, mode become *safe* */
> +                               if (mode == NDCTL_NS_MODE_SAFE) {
> +                                       fprintf(stderr, "Only raw can be dumpable\n");
> +                                       continue;
> +                               }
> +
> +                               start = ndctl_namespace_get_resource(ndns);
> +                               end_metadata = nd_read_infoblock_dataoff(ndns);
> +
> +                               /* metadata really starts from 2M alignment */
> +                               if (start != ULLONG_MAX && end_metadata > 2 * 1024 * 1024) // 2M
> +                                       pmem_add_next(start, end_metadata);
> +                       }
> +               }
> +       }
> +
> +       ndctl_unref(ctx);
> +       return 0;
> +}
> +
>
> Thanks
> Zhijian
>
>
>
>> Thanks again
>>
>> Thanks
>> Zhijian
>>
>>> ...but never got an answer, or I missed the answer.
>> _______________________________________________
>> kexec mailing list
>> kexec at lists.infradead.org
>> http://lists.infradead.org/mailman/listinfo/kexec
> _______________________________________________
> kexec mailing list
> kexec at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/kexec




More information about the kexec mailing list