[RFC PATCH v1 0/3] kdump, vmcore: Map vmcore memory in direct mapping region

Thu Jan 10 06:59:34 EST 2013

Currently, kdump reads the 1st kernel's memory, called old memory in
the source code, by calling ioremap for every single page. This causes
a big performance degradation because the page tables are modified and
the TLB is flushed each time a page is read.

This issue came to light through Cliff's kernel-space filtering work.

To avoid calling ioremap, this patch set maps all of the 1st kernel's
memory that is exported as vmcore regions into the direct mapping
region. This gives a big performance improvement; see the following
simple benchmark.

Machine spec:

| CPU    | Intel(R) Xeon(R) CPU E7- 4820  @ 2.00GHz (4 sockets, 8 cores) (*) |
| Memory | 32 GB                                                             |
| Kernel | 3.7 vanilla and with this patch set                               |

 (*) Only 1 CPU is used in the 2nd kernel for now.

Benchmark:

I executed the following command on the 2nd kernel with several values
of n and recorded the real time.

  $ time dd bs=$((4096 * n)) if=/proc/vmcore of=/dev/null
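
As a rough sketch of what this measurement does, the following Python
function reads a file sequentially with a given block size and returns
the number of bytes read and the elapsed time, which is what dd and
time do together here. The function name and the use of Python are
assumptions for illustration only; the numbers below were taken with
dd as shown above.

```python
import time


def read_throughput(path, block_size):
    """Read `path` sequentially in `block_size` chunks.

    Returns (bytes_read, elapsed_seconds). Hypothetical helper that
    mirrors `time dd bs=<block_size> if=<path> of=/dev/null`.
    """
    total = 0
    start = time.perf_counter()
    # buffering=0 so that each read() maps to one read system call,
    # as dd does with its block size.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.perf_counter() - start
```

Dividing the returned byte count by the elapsed time gives the
performance column of the tables.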

[3.7 vanilla]

| block size | time      | performance |
|       [KB] |           |    [MB/sec] |
|------------+-----------+-------------|
|          4 | 5m 46.97s |       93.56 |
|          8 | 4m 20.68s |      124.52 |
|         16 | 3m 37.85s |      149.01 |

[3.7 with this patch]

| block size | time   | performance |
|       [KB] |        |    [GB/sec] |
|------------+--------+-------------|
|          4 | 17.59s |        1.85 |
|          8 | 14.73s |        2.20 |
|         16 | 14.26s |        2.28 |
|         32 | 13.38s |        2.43 |
|         64 | 12.77s |        2.54 |
|        128 | 12.41s |        2.62 |
|        256 | 12.50s |        2.60 |
|        512 | 12.37s |        2.62 |
|       1024 | 12.30s |        2.65 |
|       2048 | 12.29s |        2.64 |
|       4096 | 12.32s |        2.63 |

[perf bench]

I also ran perf bench mem memcpy -o on the 2nd kernel as follows:

# /var/crash/perf bench mem memcpy -o -l 128MB
# Running mem/memcpy benchmark...
# Copying 128MB Bytes ...

       2.854337 GB/Sec (with prefault)

Several trials consistently showed around 2.85 [GB/sec]. For
comparison, the best /proc/vmcore result above is 2.65 [GB/sec].
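
To make the comparison concrete, here is a small sketch that derives
the speedup for the block sizes measured on both kernels, and the
ratio of the best patched read speed to the memcpy result. All input
figures are copied from the tables and the perf output above; the
helper names are made up for this illustration.

```python
def to_seconds(text):
    """Convert a time string such as '5m 46.97s' or '17.59s' to seconds."""
    seconds = 0.0
    for part in text.split():
        if part.endswith("m"):
            seconds += 60.0 * float(part[:-1])
        elif part.endswith("s"):
            seconds += float(part[:-1])
    return seconds


# block size [KB] -> (vanilla time, patched time), from the tables above
TIMES = {
    4: ("5m 46.97s", "17.59s"),
    8: ("4m 20.68s", "14.73s"),
    16: ("3m 37.85s", "14.26s"),
}

MEMCPY_GBPS = 2.854337   # perf bench mem memcpy result
BEST_VMCORE_GBPS = 2.65  # patched kernel, 1024 KB block size


def speedups(times=TIMES):
    """Return {block size: vanilla time / patched time}."""
    return {kb: to_seconds(v) / to_seconds(p) for kb, (v, p) in times.items()}


def fraction_of_memcpy():
    """Best patched /proc/vmcore read speed relative to memcpy."""
    return BEST_VMCORE_GBPS / MEMCPY_GBPS


if __name__ == "__main__":
    for kb, ratio in speedups().items():
        print(f"{kb:>4} KB: {ratio:.1f}x faster")
    print(f"best vmcore read = {fraction_of_memcpy():.0%} of memcpy")
```

With these inputs the speedup is roughly 15x to 20x, and the best
patched read reaches about 93% of the memcpy figure.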

Notes:

* Why the direct mapping region

  I chose the direct mapping region because this address space is 64TB
  long and can cover the whole of physical memory, while the
  vmalloc-and-ioremap region is only 16TB. For some machines with huge
  memory, the latter is already problematic.

  In the near future, machines with more than 64TB could appear, but
  then the direct mapping space would also be extended to follow.

* Memory consumption issue on the 2nd kernel

  The typical reserved memory size for the 2nd kernel is 512MB. But if
  terabytes of memory are mapped with 4kB pages, the page tables alone
  amount to more than a gigabyte.

  However, the direct mapping region is mapped using 1GB and 2MB pages
  where alignment allows. This minimizes the memory consumed by page
  tables in most cases.
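
  The arithmetic behind this can be sketched as follows. This is only
  an estimate under stated assumptions: x86-64 4-level paging with
  8-byte entries and 512 entries per 4kB table, one contiguous and
  suitably aligned region, and the already existing top-level table
  not counted. The function name is made up for this illustration.

```python
ENTRY_SIZE = 8           # bytes per page-table entry on x86-64
ENTRIES_PER_TABLE = 512  # entries in one 4 KiB table page
TABLE_SIZE = ENTRY_SIZE * ENTRIES_PER_TABLE  # 4096 bytes

KIB, MIB, GIB, TIB = 1 << 10, 1 << 20, 1 << 30, 1 << 40


def ceil_div(a, b):
    return -(-a // b)


def page_table_bytes(mapped_bytes, page_size):
    """Estimate bytes of page-table pages below the top level.

    Assumes a contiguous, aligned mapping that uses a single page size.
    """
    # Number of table levels below the top-level table that are needed:
    # 4k pages need PTE, PMD and PUD tables; 2M pages need PMD and PUD
    # tables; 1G pages need PUD tables only.
    levels = {4 * KIB: 3, 2 * MIB: 2, 1 * GIB: 1}[page_size]
    entries = ceil_div(mapped_bytes, page_size)
    total = 0
    for _ in range(levels):
        tables = ceil_div(entries, ENTRIES_PER_TABLE)
        total += tables * TABLE_SIZE
        entries = tables  # each table needs one entry one level up
    return total


if __name__ == "__main__":
    for name, size in (("4k", 4 * KIB), ("2M", 2 * MIB), ("1G", 1 * GIB)):
        print(f"1 TiB with {name} pages: {page_table_bytes(TIB, size)} bytes")
```

  Under these assumptions, mapping 1 TiB needs about 2 GiB of page
  tables with 4kB pages, about 4 MiB with 2MB pages, and 8 KiB with
  1GB pages, compared with a 512MB reservation for the 2nd kernel.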

  The boot debug messages show how each region is mapped:

vmcore: [oldmem 0000000027000000-000000002708afff]
vmcore: [oldmem 0000000000100000-0000000026ffffff]
vmcore: [oldmem 0000000037000000-000000007b00cfff]
vmcore: [oldmem 0000000100000000-000000087fffffff]
 [mem 0x27000000-0x2708afff] page 4k
 [mem 0x00100000-0x001fffff] page 4k
 [mem 0x00200000-0x26ffffff] page 2M
 [mem 0x37000000-0x7affffff] page 2M
 [mem 0x7b000000-0x7b00cfff] page 4k
 [mem 0x100000000-0x87fffffff] page 1G

  where each [oldmem <start>-<end>] is a mapped region. I omitted some
  other messages.
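
  The page sizes in the log follow from the alignment of each region.
  The sketch below reproduces that split for illustration. It is not
  the code in this patch set; it is a simplified model that assumes
  page-aligned regions and always picks the largest page size that the
  alignment permits. All names in it are hypothetical.

```python
KB4, MB2, GB1 = 1 << 12, 1 << 21, 1 << 30
NAMES = {KB4: "4k", MB2: "2M", GB1: "1G"}


def align_up(x, a):
    return (x + a - 1) & ~(a - 1)


def align_down(x, a):
    return x & ~(a - 1)


def split_range(start, end):
    """Split [start, end) into (start, end, page_size) pieces.

    Assumes start and end are 4k-aligned.
    """
    pieces = []

    def emit(s, e, size):
        if s < e:
            pieces.append((s, e, size))

    pos = start
    # Unaligned head: 4k pages up to the next 2M boundary.
    head_end = min(align_up(pos, MB2), end)
    emit(pos, head_end, KB4)
    pos = head_end
    # 1G body, preceded by 2M pages up to the first 1G boundary.
    g_start, g_end = align_up(pos, GB1), align_down(end, GB1)
    if g_start < g_end:
        emit(pos, g_start, MB2)
        emit(g_start, g_end, GB1)
        pos = g_end
    # Remaining 2M-aligned part, then the 4k tail.
    m_end = align_down(end, MB2)
    if pos < m_end:
        emit(pos, m_end, MB2)
        pos = m_end
    emit(pos, end, KB4)
    return pieces


def describe(first, last):
    """Format the pieces of the inclusive region [first, last]."""
    return [f"[mem {s:#010x}-{e - 1:#010x}] page {NAMES[size]}"
            for s, e, size in split_range(first, last + 1)]


if __name__ == "__main__":
    for first, last in ((0x27000000, 0x2708afff), (0x100000, 0x26ffffff),
                        (0x37000000, 0x7b00cfff), (0x100000000, 0x87fffffff)):
        print("\n".join(describe(first, last)))
```

  For the four oldmem regions above, this model yields the same six
  [mem ...] lines as the boot log.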

TODO:

* Use of init_memory_mapping

  init_memory_mapping is used to map memory in the direct mapping
  region both at boot time and in the memory hot-plug code. It should
  be used here too, but as I explain in the patch description, I hit
  some page-fault related bugs after calling it during the 2nd kernel
  boot. This means the page table mapping is not done correctly.

  As a workaround, I wrote code that constructs the page tables from
  scratch, just like Cliff's patch, and it apparently works well now.

  But ideally we need to know why init_memory_mapping doesn't work
  well. I am continuing to debug this. Suggestions about this would be
  very helpful. This issue comes purely from my lack of familiarity
  with this area (^^;

* Benchmark of Cliff's kernel-space filtering

  He has attempted kernel-space filtering for makedumpfile to improve
  performance. I noticed the ioremap issue through this work of his.

  I now think the bad performance is mainly caused by the ioremap
  issue. I don't know how much filtering performance is improved by
  doing it in kernel space. I guess the improvement is similar to
  that of increasing the block size, as in the above benchmark.

  Anyway, we first need to compare kernel-space filtering with
  user-space filtering.

  Note that this work is orthogonal to kernel-space filtering and can
  proceed separately.

---

HATAYAMA Daisuke (3):
      vmcore: read vmcore through direct mapping region
      vmcore: map vmcore memory in direct mapping region
      vmcore: Add function to merge memory mapping of vmcore

 fs/proc/vmcore.c |  420 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 419 insertions(+), 1 deletions(-)

-- 

Thanks.
HATAYAMA, Daisuke