[RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined dirty log

Shameerali Kolothum Thodi shameerali.kolothum.thodi at huawei.com
Mon Sep 18 02:55:22 PDT 2023



> -----Original Message-----
> From: Oliver Upton [mailto:oliver.upton at linux.dev]
> Sent: 15 September 2023 01:36
> To: Shameerali Kolothum Thodi <shameerali.kolothum.thodi at huawei.com>
> Cc: kvmarm at lists.linux.dev; kvm at vger.kernel.org;
> linux-arm-kernel at lists.infradead.org; maz at kernel.org; will at kernel.org;
> catalin.marinas at arm.com; james.morse at arm.com;
> suzuki.poulose at arm.com; yuzenghui <yuzenghui at huawei.com>; zhukeqian
> <zhukeqian1 at huawei.com>; Jonathan Cameron
> <jonathan.cameron at huawei.com>; Linuxarm <linuxarm at huawei.com>
> Subject: Re: [RFC PATCH v2 0/8] KVM: arm64: Implement SW/HW combined
> dirty log
> 
> On Thu, Sep 14, 2023 at 09:47:48AM +0000, Shameerali Kolothum Thodi
> wrote:
> 
> [...]
> 
> > > What you're proposing here is complicated and I fear not easily
> > > maintainable. Keeping the *two* sources of dirty state seems likely to
> > > fail (eventually) with some very unfortunate consequences.
> >
> > It does add complexity to the dirty state management code. I have tried
> > to separate the code paths using appropriate FLAGS, etc., to make it more
> > manageable. But this is probably one area we can work on if the overall
> > approach does have some benefits.
> 
> I'd be a bit more amenable to a solution that would select either
> write-protection or dirty state management, but not both.
> 
> > > The vCPU:memory ratio you're testing doesn't seem representative of what
> > > a typical cloud provider would be configuring, and the dirty log
> > > collection is going to scale linearly with the size of guest memory.
> >
> > I was limited by the test setup I had. I will give it a go on a system
> > with more memory.
> 
> Thanks. Dirty log collection needn't be single-threaded, but the
> fundamental concern of dirty log collection time scaling linearly w.r.t.
> the size of memory remains. Write-protection helps spread the cost of
> collecting dirty state out across all the vCPU threads.
> 
> There could be some value in giving userspace the ability to parallelize
> calls to dirty log ioctls to work on non-intersecting intervals.
> 
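That could help. If userspace were able to issue these in parallel, I would
imagine something along the lines of the below, with one worker per memslot
(untested userspace sketch against the KVM dirty log UAPI; today the calls
would presumably just serialise inside KVM, and memslot setup plus error
handling are omitted):

#include <linux/kvm.h>
#include <pthread.h>
#include <sys/ioctl.h>

struct slot_worker {
        int vm_fd;
        __u32 slot;
        void *bitmap;   /* pre-allocated, one bit per page in the slot */
};

static void *collect_one_slot(void *arg)
{
        struct slot_worker *w = arg;
        struct kvm_dirty_log log = {
                .slot = w->slot,
                .dirty_bitmap = w->bitmap,
        };

        /* Each worker touches only its own slot, so the calls operate on
         * non-intersecting intervals of guest memory. */
        ioctl(w->vm_fd, KVM_GET_DIRTY_LOG, &log);
        return NULL;
}

static void collect_all_slots(struct slot_worker *workers, int nr_slots)
{
        pthread_t tids[nr_slots];
        int i;

        for (i = 0; i < nr_slots; i++)
                pthread_create(&tids[i], NULL, collect_one_slot, &workers[i]);
        for (i = 0; i < nr_slots; i++)
                pthread_join(tids[i], NULL);
}
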
> > > Slow dirty log collection is going to matter a lot for VM blackout,
> > > which from experience tends to be the most sensitive period of live
> > > migration for guest workloads.
> > >
> > > At least in our testing, the split GET/CLEAR dirty log ioctls
> > > dramatically improved the performance of a write-protection based dirty
> > > tracking scheme, as the false positive rate for dirtied pages is
> > > significantly reduced. FWIW, this is what we use for doing LM on arm64
> > > as opposed to the D-bit implementation that we use on x86.
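
Just to check my understanding of that GET/CLEAR usage pattern, I have in mind
something like the below on the userspace side (rough sketch, assuming
KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 has been enabled on the VM; the chunking and
names are made up):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/* first_gfn must be a multiple of 64, and num_pages a multiple of 64 unless
 * the chunk runs to the end of the slot. */
static void get_then_clear_chunk(int vm_fd, __u32 slot, __u64 *bitmap,
                                 __u64 first_gfn, __u32 num_pages)
{
        struct kvm_dirty_log get = {
                .slot = slot,
                .dirty_bitmap = bitmap,         /* covers the whole slot */
        };
        struct kvm_clear_dirty_log clear = {
                .slot = slot,
                .first_page = first_gfn,
                .num_pages = num_pages,
                /* bit 0 of this bitmap corresponds to first_page */
                .dirty_bitmap = bitmap + first_gfn / 64,
        };

        /* With manual protect enabled, GET only snapshots the bitmap... */
        ioctl(vm_fd, KVM_GET_DIRTY_LOG, &get);

        /* ...send the pages marked dirty in this chunk here... */

        /* ...and re-arm write-protection for just this chunk, close to the
         * time it is copied, so pages are not re-flagged as dirty long
         * before they have even been transmitted. */
        ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear);
}
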
> >
> > I guess by D-bit on x86 you mean the PML feature. Unfortunately, that is
> > something we still lack on ARM.
> 
> Sorry, this was rather nonspecific. I was describing the pre-copy
> strategies we're using at Google (out of tree). We're carrying patches
> to use EPT D-bit for exitless dirty tracking.

Just curious: how does it handle the overhead of scanning for dirty pages, and
convergence when the guest dirties memory at a high rate, in exitless mode?
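
For context, what I mean by "scanning" is a walk of roughly this shape over
every leaf entry per collection pass (a simplified model only, not the actual
patch code; the bit positions are the stage-2 descriptor DBM and S2AP write
bits):

#include <stdint.h>
#include <stdbool.h>

#define S2_DESC_S2AP_W  (UINT64_C(1) << 7)      /* S2AP[1]: write permission */
#define S2_DESC_DBM     (UINT64_C(1) << 51)     /* Dirty Bit Modifier */

static bool leaf_test_and_clean(uint64_t *desc)
{
        /* With DBM set, hardware upgrades a read-only entry to writable on
         * the first write instead of faulting, so "writable" means dirty. */
        bool dirty = (*desc & S2_DESC_DBM) && (*desc & S2_DESC_S2AP_W);

        if (dirty)
                *desc &= ~S2_DESC_S2AP_W;       /* back to read-only == clean */
        return dirty;
}

static unsigned long scan_range(uint64_t *descs, unsigned long nr,
                                unsigned long *bitmap)
{
        unsigned long i, nr_dirty = 0;

        /* O(number of leaf entries) per pass, however few pages were
         * actually written -- that is the overhead I am asking about. */
        for (i = 0; i < nr; i++) {
                if (leaf_test_and_clean(&descs[i])) {
                        bitmap[i / 64] |= 1UL << (i % 64);
                        nr_dirty++;
                }
        }
        return nr_dirty;
}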
 
> > > Faster pre-copy performance would help the benchmark complete faster,
> > > but the goal for a live migration should be to minimize the lost
> > > computation for the entire operation. You'd need to test with a
> > > continuous workload rather than one with a finite amount of work.
> >
> > Ok. Though the above is not representative of a real workload, I thought
> > it gives some idea of how the "Guest up time improvement" benefits the
> > overall availability of the workload during migration. I will check within our
> > wider team to see if I can set up a more suitable test/workload to show some
> > improvement with this approach.
> >
> > Please let me know if there is a specific workload you have in mind.
> 
> No objection to the workload you've chosen; I'm more concerned about the
> benchmark finishing before live migration completes.
> 
> What I'm looking for is something like this:
> 
>  - Calculate the ops/sec your benchmark completes in steady state
> 
>  - Do a live migration and sample the rate throughout the benchmark,
>    accounting for VM blackout time
> 
>  - Calculate the area under the curve of:
> 
>      y = steady_state_rate - live_migration_rate(t)
> 
>  - Compare the area under the curve for write-protection and your DBM
>    approach.

Ok. Got it.
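
Concretely, something like the below is how I plan to post-process the samples
(rough sketch with made-up numbers: sample ops/sec at a fixed interval across
the whole migration, count blackout as zero, and take the trapezoidal sum of
the shortfall against the steady-state rate):

#include <stdio.h>

static double lost_work(double steady_rate, const double *rate,
                        int nr_samples, double dt)
{
        double lost = 0.0;
        int i;

        /* Area under y(t) = steady_rate - rate(t), trapezoidal rule. */
        for (i = 0; i < nr_samples - 1; i++) {
                double y0 = steady_rate - rate[i];
                double y1 = steady_rate - rate[i + 1];

                lost += 0.5 * (y0 + y1) * dt;
        }
        return lost;    /* ops that would otherwise have completed */
}

int main(void)
{
        /* e.g. 1s samples over a window covering the whole migration,
         * with the zero samples standing in for VM blackout */
        double rate[] = { 50000, 48000, 30000, 0, 0,
                          20000, 47000, 50000, 50000, 50000 };

        printf("lost ops: %.0f\n", lost_work(50000.0, rate, 10, 1.0));
        return 0;
}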

> > Thanks for getting back on this. I'd appreciate it if you could take a quick
> > glance through the rest of the patches as well for any gross errors, especially
> > with respect to page-table walk locking, usage of the DBM FLAGS, etc.
> 
> I'll give it a read when I have some spare cycles. To be entirely clear,
> I don't have any fundamental objections to using DBM for dirty tracking.
> I just want to make sure that all alternatives have been considered
> in the current scheme before we seriously consider a new approach with
> its own set of tradeoffs.

Thanks for taking a look.

Shameer


