[RESEND] Benchmark of I/O multiplexing with LZO/snappy compression

HATAYAMA Daisuke d.hatayama at jp.fujitsu.com
Mon Jun 18 02:38:24 EDT 2012


I evaluated I/O multiplexing together with parallel compression in two
formats: lzo and snappy.

In summary:
- With 8-dimensional I/O multiplexing, throughput is about 5 times that
  of the 1-dimensional case: for snappy, a 1TB copy takes about 25min.
- For randomized data, snappy is as quick as raw, i.e. no compression.
- lzo consumes more CPU time than snappy, but it could probably do
  better with quicker CPUs and sparser data; another kind of benchmark
  is required.

Thanks in advance; any comments are appreciated.

Notice: I'm sorry that I cannot attach the source files I used in this
benchmark, fakevmcore.c and nsplit.c, due to a ``suspicious header''
warning from the mail server; the mail then requires a moderator's
approval to get posted, but no one has reacted... So I resend this mail
without any attachment.

* Environments

  - CPU: Intel Xeon E7-8870 (10core/2.4GHz) x 2 sockets
  - RAM: 32GB
  - MBD2147RC (10025rpm) x 4
  - ETERNUS DX440: Emulex 8Gb/s fiber adapters x 4

(*) To get 8-dimensional I/O multiplexing, I used 4 disks and 4 SAN
    LUNs, simply because I didn't have enough disks available (^^;

* What is measured, and how

  This benchmark measures the real time consumed for copying 10GB of
  on-memory data, simulating /proc/vmcore, into multiple different
  disks with no compression, or with LZO or snappy compression.

  The data is randomized enough that compression gains almost nothing;
  the I/O workload therefore does not change with compression, so this
  benchmark covers the worst case only.


  - Parameters
    - number of writing/compressing threads (and thus the degree of I/O
      multiplexing)
      - 1 ~ 8
    - compression format
      - raw
      - lzo
      - snappy
    - kernel versions
      - v3.4
      - RHEL6.2
      - RHEL5.8 (2.6.18-238)

    - Let fakevmcore be 10GB with a block size of 4kB.
    - Split I/O into two different disks: /mnt/disk{0,1}.
    - The block size for compression is 4kB.
    - Compress the data with LZO: -c selects LZO, -s selects snappy.
    - Flush the page cache after nsplit.

  $ insmod ./fakevmcore.ko fakevmcore_size=$((10*1024*1024*1024)) fakevmcore_block_size=4096
  $ time { nsplit -c --blocksize=4096 /proc/fakevmcore /mnt/disk0/a /mnt/disk1/a ; \
           echo 3 > /proc/sys/vm/drop_caches; }

  To build nsplit.c on fc16, the following compression libraries are required:
    - lzo-devel, lzo-minilzo, lzo
    - snappy-devel, snappy

* Results

n: number of writing and compressing threads

- upstream v3.4 kernel

n  raw        lzo        snappy
1  1m29.617s  2m41.979s  1m9.592s
2  1m8.519s   1m26.555s  1m26.902s
3  0m48.653s  1m0.462s   0m35.172s
4  0m28.039s  0m47.248s  0m28.430s
5  0m23.491s  0m37.181s  0m23.435s
6  0m18.202s  0m28.428s  0m18.580s
7  0m15.897s  0m29.873s  0m16.678s
8  0m13.659s  0m23.180s  0m13.922s

- RHEL6.2

n  raw        lzo        snappy
1  0m53.119s  2m36.603s  1m33.061s
2  1m31.578s  1m28.808s  0m49.492s
3  0m31.675s  0m57.540s  0m33.795s
4  0m37.714s  0m45.035s  0m32.871s
5  0m20.363s  0m34.988s  0m21.894s
6  0m22.602s  0m31.216s  0m19.195s
7  0m18.837s  0m25.204s  0m15.906s
8  0m13.715s  0m22.228s  0m13.884s

- RHEL5.8 (2.6.18-238)

n  raw        lzo        snappy
1  0m55.144s  1m20.771s  1m4.140s
2  0m52.157s  1m8.336s   1m1.089s
3  0m50.172s  0m41.329s  0m47.859s
4  0m35.409s  0m28.764s  0m43.286s
5  0m22.974s  0m20.501s  0m20.197s
6  0m17.430s  0m18.072s  0m19.524s
7  0m14.222s  0m14.936s  0m15.603s
8  0m13.071s  0m14.755s  0m13.313s

- With 8-dimensional I/O multiplexing, throughput improves by a factor
  of 4~5 for raw, 5~6 for lzo, and 6~8 for snappy.

  - 10GB per 15sec corresponds to 1TB per 25min 36sec.

- snappy is as quick as raw. I think snappy can be used with very low
  risk even in the worst case.

- lzo is slower than raw and snappy, but parallel compression works
  well. Although lzo is the worst of the three in this benchmark, I
  expect it could beat the other two given a better CPU and sparser
  data.

- For LZO, RHEL5.8's result is better than those of v3.4 and RHEL6.2.
  Perhaps due to the I/O workload situation, but I don't know
  precisely.

* TODO

- Retry the benchmark using disks only.
- Evaluate btrfs's transparent compression for large data; for very
  large data, compression in kernel space has an advantage over
  compression in user space.


More information about the kexec mailing list