[RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log

Oliver Upton oliver.upton at linux.dev
Wed Sep 13 10:30:03 PDT 2023


Hi Shameer,

On Fri, Aug 25, 2023 at 10:35:20AM +0100, Shameer Kolothum wrote:
> Hi,
> 
> This is to revive the RFC series[1] sent out by Zhu Keqian some time
> back, which makes use of the hardware Dirty Bit Modifier (DBM) feature
> (FEAT_HAFDBS) for dirty page tracking.
> 
> One of the main drawbacks of using the hardware DBM feature for dirty
> page tracking is the additional overhead of scanning the PTEs for dirty
> pages[2]. Also, there are no vCPU page faults once we set the DBM bit,
> which may result in a higher convergence time during guest migration.
> 
> This series tries to reduce these overheads by not setting the DBM bit
> for all the writable pages during migration, and instead uses a
> combined software (the current page fault mechanism) and hardware
> (set DBM) approach for dirty page tracking.
> 
> As noted in RFC v1[1],
> "The core idea is that we do not enable hardware dirty tracking at the
> start (we do not add the DBM bit). When a fault occurs on an arbitrary
> PT, we execute software tracking for this PT and enable hardware
> tracking for its *nearby* PTs (e.g. add the DBM bit for the nearby 64
> PTs). Then when we sync the dirty log, we already know all the PTs with
> hardware dirty enabled, so we do not need to scan all the PTs."
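
If I'm reading that right, the scheme boils down to something like the
following userspace model (the names, the 64-PTE block size, and the
bookkeeping are mine, purely to illustrate the idea; this is not the
actual stage-2 code from the series):

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define NR_PTES         4096
#define BLOCK_PTES      64                      /* "nearby" granule */
#define NR_BLOCKS       (NR_PTES / BLOCK_PTES)

struct pte {
        bool dbm;               /* hardware dirty tracking armed */
        bool hw_dirty;          /* set by "hardware" on a later write */
};

static struct pte s2_pte[NR_PTES];
static bool block_armed[NR_BLOCKS];     /* blocks with DBM set */
static bool dirty_bitmap[NR_PTES];      /* what a dirty log sync reports */

/* Write permission fault on @idx: soft-track it, arm DBM on its block. */
static void handle_write_fault(size_t idx)
{
        size_t blk = idx / BLOCK_PTES;
        size_t i;

        dirty_bitmap[idx] = true;               /* software tracking */
        for (i = blk * BLOCK_PTES; i < (blk + 1) * BLOCK_PTES; i++)
                s2_pte[i].dbm = true;
        block_armed[blk] = true;
}

/* Dirty log sync only visits the blocks that were armed with DBM. */
static void sync_dirty_log(void)
{
        size_t blk, i;

        for (blk = 0; blk < NR_BLOCKS; blk++) {
                if (!block_armed[blk])
                        continue;
                for (i = blk * BLOCK_PTES; i < (blk + 1) * BLOCK_PTES; i++)
                        if (s2_pte[i].hw_dirty)
                                dirty_bitmap[i] = true;
        }
}

int main(void)
{
        size_t i;

        handle_write_fault(100);        /* guest write faults on PTE 100 */
        s2_pte[101].hw_dirty = true;    /* hardware dirties a neighbour */

        sync_dirty_log();
        for (i = 0; i < NR_PTES; i++)
                if (dirty_bitmap[i])
                        printf("page %zu dirty\n", i);
        return 0;
}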

I'm unconvinced of the value of such a change.

What you're proposing here is complicated, and I fear not easily
maintainable. Keeping the *two* sources of dirty state coherent seems
likely to fail (eventually), with some very unfortunate consequences.

The optimization of enabling DBM on neighboring PTEs is presumptive of
the guest access pattern and could incur unnecessary scans of the
stage-2 page table w/ a sufficiently sparse guest access pattern.

> Tests with dirty_log_perf_test using anonymous THP pages show a
> significant improvement in "dirty memory time" as expected, but with a
> hit on "get dirty log time".
> 
> ./dirty_log_perf_test -b 512MB -v 96 -i 5 -m 2 -s anonymous_thp
> 
> +---------------------------+----------------+------------------+
> |                           |   6.5-rc5      | 6.5-rc5 + series |
> |                           |     (s)        |       (s)        |
> +---------------------------+----------------+------------------+
> |    dirty memory time      |    4.22        |          0.41    |
> |    get dirty log time     |    0.00047     |          3.25    |
> |    clear dirty log time   |    0.48        |          0.98    |
> +---------------------------+----------------+------------------+

The vCPU:memory ratio you're testing doesn't seem representative of what
a typical cloud provider would be configuring, and the dirty log
collection is going to scale linearly with the size of guest memory.

Slow dirty log collection is going to matter a lot for VM blackout,
which from experience tends to be the most sensitive period of live
migration for guest workloads.

At least in our testing, the split GET/CLEAR dirty log ioctls
dramatically improved the performance of a write-protection-based dirty
tracking scheme, as the false-positive rate for dirtied pages is
significantly reduced. FWIW, this is what we use for doing live
migration on arm64, as opposed to the D-bit implementation that we use
on x86.
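
For reference, the VMM side of that flow looks roughly like the below
(error handling, bitmap allocation, and the memslot layout are all
elided or placeholders):

#include <linux/kvm.h>
#include <stdint.h>
#include <sys/ioctl.h>

static void enable_manual_protect(int vm_fd)
{
        struct kvm_enable_cap cap = {
                .cap = KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2,
                .args[0] = KVM_DIRTY_LOG_MANUAL_PROTECT_ENABLE,
        };

        ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}

static void collect_dirty_log(int vm_fd, uint32_t slot, void *bitmap,
                              uint32_t npages)
{
        struct kvm_dirty_log get = {
                .slot = slot,
                .dirty_bitmap = bitmap,
        };
        struct kvm_clear_dirty_log clear = {
                .slot = slot,
                .first_page = 0,
                .num_pages = npages,
                .dirty_bitmap = bitmap,
        };

        /* With the cap enabled, GET no longer write-protects anything. */
        ioctl(vm_fd, KVM_GET_DIRTY_LOG, &get);

        /* ... transmit the dirty pages ... */

        /*
         * CLEAR re-protects only the pages we actually harvested, which
         * is what cuts down on the false-positive dirty rate.
         */
        ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
}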
       
> In order to get some idea of actual live migration performance, I
> created a VM (96 vCPUs, 1GB), ran a redis-benchmark test, and initiated
> a (local) live migration while the test was in progress.
> 
> redis-benchmark -t set -c 900 -n 5000000 --threads 96
> 
> The average of 5 runs shows that the benchmark finishes ~10% faster,
> with an ~8% increase in the "total time" for migration.
> 
> +---------------------------+----------------+------------------+
> |                           |   6.5-rc5      | 6.5-rc5 + series |
> |                           |     (s)        |    (s)           |
> +---------------------------+----------------+------------------+
> | [redis]5000000 requests in|    79.428      |      71.49       |
> | [info migrate]total time  |    8438        |      9097        |
> +---------------------------+----------------+------------------+

Faster pre-copy performance would help the benchmark complete faster,
but the goal for a live migration should be to minimize the lost
computation for the entire operation. You'd need to test with a
continuous workload rather than one with a finite amount of work.

Also, do you know what live migration scheme you're using here?

-- 
Thanks,
Oliver


