[RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log

Shameerali Kolothum Thodi shameerali.kolothum.thodi at huawei.com
Thu Sep 14 02:47:48 PDT 2023


Hi Oliver,

> -----Original Message-----
> From: Oliver Upton [mailto:oliver.upton at linux.dev]
> Sent: 13 September 2023 18:30
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi at huawei.com>
> Cc: kvmarm at lists.linux.dev; kvm at vger.kernel.org;
> linux-arm-kernel at lists.infradead.org; maz at kernel.org; will at kernel.org;
> catalin.marinas at arm.com; james.morse at arm.com;
> suzuki.poulose at arm.com; yuzenghui <yuzenghui at huawei.com>; zhukeqian
> <zhukeqian1 at huawei.com>; Jonathan Cameron
> <jonathan.cameron at huawei.com>; Linuxarm <linuxarm at huawei.com>
> Subject: Re: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined
> dirty log
> 
> Hi Shameer,
> 
> On Fri, Aug 25, 2023 at 10:35:20AM +0100, Shameer Kolothum wrote:
> > Hi,
> >
> > This is to revive the RFC series[1], which makes use of hardware dirty
> > bit modifier(DBM) feature(FEAT_HAFDBS) for dirty page tracking, sent
> > out by Zhu Keqian sometime back.
> >
> > One of the main drawbacks in using the hardware DBM feature for dirty
> > page tracking is the additional overhead in scanning the PTEs for dirty
> > pages[2]. Also there are no vCPU page faults when we set the DBM bit,
> > which may result in higher convergence time during guest migration.
> >
> > This series tries to reduce these overheads by not setting the
> > DBM for all the writeable pages during migration and instead uses a
> > combined software(current page fault mechanism) and hardware
> approach
> > (set DBM) for dirty page tracking.
> >
> > As noted in RFC v1[1],
> > "The core idea is that we do not enable hardware dirty at start (do not
> > add DBM bit). When an arbitrary PT occurs fault, we execute soft tracking
> > for this PT and enable hardware tracking for its *nearby* PTs (e.g. Add
> > DBM bit for nearby 64PTs). Then when sync dirty log, we have known all
> > PTs with hardware dirty enabled, so we do not need to scan all PTs."
> 
> I'm unconvinced of the value of such a change.
> 
> What you're proposing here is complicated and I fear not easily
> maintainable. Keeping the *two* sources of dirty state seems likely to
> fail (eventually) with some very unfortunate consequences.

It does add complexity to the dirty state management code. I have tried
to separate the code paths using appropriate FLAGS etc. to keep it
manageable, but this is certainly an area we can improve on if the overall
approach turns out to have benefits.
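
To illustrate what the combined path boils down to, here is a simplified
sketch of the fault side. This is not the actual patch code: the helpers
marked "hypothetical", the walk entry point and the knob are made-up names
used only to show the idea.

/*
 * Illustrative sketch only (not the patch code): on a stage-2 write
 * fault, dirty the faulting page through the existing software path
 * and arm hardware DBM tracking on a window of neighbouring PTEs so
 * that later writes to them do not fault.
 */
#define DBM_GROUP_PTES		64	/* "nearby" window from the cover letter */

static bool combined_dirty_log = true;		/* the on/off knob */

static void handle_stage2_write_fault(struct kvm *kvm, u64 fault_ipa)
{
	u64 start;

	/* Software tracking for the faulting PTE, exactly as today. */
	mark_page_dirty(kvm, fault_ipa >> PAGE_SHIFT);
	stage2_make_pte_writable(kvm, fault_ipa);	/* hypothetical */

	if (!combined_dirty_log)
		return;

	/*
	 * Hardware tracking for the neighbours: set the DBM bit on the
	 * surrounding window and remember the range, so the dirty log
	 * sync only has to scan armed windows rather than the whole
	 * stage-2 table.
	 */
	start = ALIGN_DOWN(fault_ipa, DBM_GROUP_PTES * PAGE_SIZE);
	stage2_set_dbm_range(kvm, start, DBM_GROUP_PTES * PAGE_SIZE);	/* hypothetical */
	record_hw_tracked_range(kvm, start, DBM_GROUP_PTES * PAGE_SIZE); /* hypothetical */
}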

> The optimization of enabling DBM on neighboring PTEs is presumptive of
> the guest access pattern and could incur unnecessary scans of the
> stage-2 page table w/ a sufficiently sparse guest access pattern.

Agreed. This may not work as intended for all workloads, especially if the
access pattern is sparse. But I am still hopeful that it will benefit
workloads with continuous write patterns, and we do have a knob to turn it
on or off.
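
On the cost for sparse patterns: the sync side only visits the windows that
were armed on a fault, so a sparse guest only pays for the windows it
actually touched. Again a simplified sketch only; the hw_tracked list and
the stage2_* helpers below are hypothetical names, not the actual patch
code.

/*
 * Illustrative sketch only: at dirty-log sync time, walk just the
 * ranges previously armed with DBM. Hardware sets the stage-2 write
 * permission when it dirties a DBM-armed page, so that is the "dirty"
 * test here.
 */
static void sync_hw_dirty_log(struct kvm *kvm, struct kvm_memory_slot *slot)
{
	struct hw_tracked_range *r;
	u64 ipa;

	list_for_each_entry(r, &slot->arch.hw_tracked, link) {	/* hypothetical list */
		for (ipa = r->start; ipa < r->end; ipa += PAGE_SIZE) {
			if (!stage2_pte_hw_dirty(kvm, ipa))	/* hypothetical */
				continue;

			mark_page_dirty(kvm, ipa >> PAGE_SHIFT);
			/* Write-protect again for the next iteration. */
			stage2_wp_pte(kvm, ipa);		/* hypothetical */
		}
	}
}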

> > Tests with dirty_log_perf_test with anonymous THP pages show significant
> > improvement in "dirty memory time" as expected, but with a hit on
> > "get dirty log time".
> >
> > ./dirty_log_perf_test -b 512MB -v 96 -i 5 -m 2 -s anonymous_thp
> >
> > +---------------------------+----------------+------------------+
> > |                           |   6.5-rc5      | 6.5-rc5 + series |
> > |                           |     (s)        |       (s)        |
> > +---------------------------+----------------+------------------+
> > |    dirty memory time      |    4.22        |          0.41    |
> > |    get dirty log time     |    0.00047     |          3.25    |
> > |    clear dirty log time   |    0.48        |          0.98    |
> > +---------------------------------------------------------------+
> 
> The vCPU:memory ratio you're testing doesn't seem representative of what
> a typical cloud provider would be configuring, and the dirty log
> collection is going to scale linearly with the size of guest memory.

I was limited by the test setup I had. I will give it a go on a system with
more memory.
 
> Slow dirty log collection is going to matter a lot for VM blackout,
> which from experience tends to be the most sensitive period of live
> migration for guest workloads.
> 
> At least in our testing, the split GET/CLEAR dirty log ioctls
> dramatically improved the performance of a write-protection-based dirty
> tracking scheme, as the false positive rate for dirtied pages is
> significantly reduced. FWIW, this is what we use for doing LM on arm64 as
> opposed to the D-bit implementation that we use on x86.

I guess by D-bit on x86 you mean the PML feature. Unfortunately, that is
something we still lack on ARM.
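
For anyone following along, the split flow you describe is the
KVM_GET_DIRTY_LOG/KVM_CLEAR_DIRTY_LOG pair with
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 enabled on the VM. One collection pass in
userspace looks roughly like the sketch below (simplified: no error
handling, and the KVM_CLEAR_DIRTY_LOG alignment rules are glossed over).

#include <sys/ioctl.h>
#include <linux/kvm.h>

/*
 * One dirty-log collection pass with the split GET/CLEAR ioctls.
 * Assumes 'bitmap' is large enough for the memslot and that manual
 * dirty-log protection has already been enabled on the VM.
 */
static void collect_and_clear(int vm_fd, __u32 slot, void *bitmap,
			      __u64 first_page, __u32 num_pages)
{
	struct kvm_dirty_log get = {
		.slot = slot,
		.dirty_bitmap = bitmap,
	};
	struct kvm_clear_dirty_log clear = {
		.slot = slot,
		.first_page = first_page,
		.num_pages = num_pages,
		.dirty_bitmap = bitmap,
	};

	/* Snapshot the dirty bitmap; write protection is left alone. */
	ioctl(vm_fd, KVM_GET_DIRTY_LOG, &get);

	/* ... send the pages marked in 'bitmap' to the destination ... */

	/* Re-protect only the pages that were actually dirtied. */
	ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
}

The CLEAR side is where the false-positive reduction you mention comes
from: only the pages reported dirty get write-protected again.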
 
> > In order to get some idea on actual live migration performance,
> > I created a VM (96vCPUs, 1GB), ran a redis-benchmark test and
> > while the test was in progress, initiated a (local) live migration.
> >
> > redis-benchmark -t set -c 900 -n 5000000 --threads 96
> >
> > Average of 5 runs shows that the benchmark finishes ~10% faster, with
> > an ~8% increase in "total time" for migration.
> >
> > +---------------------------+----------------+------------------+
> > |                           |   6.5-rc5      | 6.5-rc5 + series |
> > |                           |     (s)        |       (s)        |
> > +---------------------------+----------------+------------------+
> > | [redis]5000000 requests in|    79.428      |      71.49       |
> > | [info migrate]total time  |    8438        |      9097        |
> > +---------------------------------------------------------------+
> 
> Faster pre-copy performance would help the benchmark complete faster,
> but the goal for a live migration should be to minimize the lost
> computation for the entire operation. You'd need to test with a
> continuous workload rather than one with a finite amount of work.

Ok. Though the above is not representative of a real workload, I thought
it gives some idea of how the improved guest uptime benefits the overall
availability of the workload during migration. I will check within our
wider team to see if I can set up a more suitable test/workload to
demonstrate the improvement with this approach.

Please let me know if there is a specific workload you have in mind.

> Also, do you know what live migration scheme you're using here?

The above is the default one (pre-copy).

Thanks for getting back on this. I would appreciate it if you could take a
quick glance through the rest of the patches as well for any gross errors,
especially with respect to page table walk locking, usage of the DBM FLAGS,
etc.

Thanks,
Shameer 


