[PATCH v6 3/4] live migration support for VM dirty log management

Christoffer Dall christoffer.dall at linaro.org
Wed May 28 02:08:34 PDT 2014


On Tue, May 27, 2014 at 02:55:21PM -0700, Mario Smarduch wrote:
> On 05/27/2014 01:12 PM, Christoffer Dall wrote:
> > On Thu, May 15, 2014 at 11:27:30AM -0700, Mario Smarduch wrote:

[...]

> >> +
> >> +	/*
> >> +	 * If the pgd, pud or pmd entry is not present and the range
> >> +	 * crosses a pmd boundary, check the next index.
> >> +	 */
> >> +	pgd = pgdp + pgd_index(ipa);
> >> +	if (unlikely(crosses_pmd && !pgd_present(*pgd))) {
> >> +		pgd = pgdp + pgd_index(next);
> >> +		if (!pgd_present(*pgd))
> >> +			return;
> >> +	}
> >> +
> >> +	pud = pud_offset(pgd, ipa);
> >> +	if (unlikely(crosses_pmd && !pud_present(*pud))) {
> >> +		pud = pud_offset(pgd, next);
> >> +		if (!pud_present(*pud))
> >> +			return;
> >> +	}
> >> +
> >> +	pmd = pmd_offset(pud, ipa);
> >> +	if (unlikely(crosses_pmd && !pmd_present(*pmd))) {
> >> +		pmd = pmd_offset(pud, next);
> >> +		if (!pmd_present(*pmd))
> >> +			return;
> >> +	}
> >> +
> >> +	for (;;) {
> >> +		pte = pte_offset_kernel(pmd, ipa);
> >> +		if (!pte_present(*pte))
> >> +			goto next_ipa;
> >> +
> >> +		if (kvm_s2pte_readonly(pte))
> >> +			goto next_ipa;
> >> +		kvm_set_s2pte_readonly(pte);
> >> +next_ipa:
> >> +		mask &= mask - 1;
> >> +		if (!mask)
> >> +			break;
> >> +
> >> +		/* find next page */
> >> +		ipa = (gfnofst + __ffs(mask)) << PAGE_SHIFT;
> >> +
> >> +		/* skip upper page table lookups */
> >> +		if (!crosses_pmd)
> >> +			continue;
> >> +
> >> +		pgd = pgdp + pgd_index(ipa);
> >> +		if (unlikely(!pgd_present(*pgd)))
> >> +			goto next_ipa;
> >> +		pud = pud_offset(pgd, ipa);
> >> +		if (unlikely(!pud_present(*pud)))
> >> +			goto next_ipa;
> >> +		pmd = pmd_offset(pud, ipa);
> >> +		if (unlikely(!pmd_present(*pmd)))
> >> +			goto next_ipa;
> >> +	}
> > 
> > So I think the reason this is done separately on x86 is that they have
> > an rmap structure for their gfn mappings so that they can quickly lookup
> > ptes based on a gfn and write-protect it without having to walk the
> > stage-2 page tables.
> 
> Yes, they also use rmaps for mmu notifiers; invalidations on huge VMs and
> large ranges resulted in excessive times.
> > 
> > Unless you want to introduce this on ARM, I think you will be much
> 
> Eventually yes, but that would also require reworking mmu notifiers.  I had
> a two-step approach in mind.  Initially, get dirty page marking to work,
> along with TLB flushing and GIC/arch-timer migration; validate migration
> under various stress loads (page reclaim) with mmu notifiers; and test
> several VMs and migration times.
> 
> Then get rmap (or something similar) working - eventually it's needed for
> huge VMs.  In short, two phases.
> 
> > better off just having a single (properly written) iterating
> > write-protect function, that takes a start and end IPA and a bitmap for
> > which pages to actually write-protect, which can then handle the generic
> > case (either NULL or all-ones bitmap) or a specific case, which just
> > traverses the IPA range given as input.  Such a function should follow
> > the model of page table walk functions discussed previously
> > (separate functions: wp_pgd_entries(), wp_pud_entries(),
> > wp_pmd_entries(), wp_pte_entries()).
> > 
> > However, you may want to verify my assumption above with the x86 people
> > and look at sharing the rmap logic between architectures.
> > 
> > In any case, this code is very difficult to read and understand, and it
> > doesn't look at all like the other code we have to walk page tables.  I
> > understand you are trying to optimize for performance (by skipping some
> > intermediate page table level lookups), but you never declare that goal
> > anywhere in the code or in the commit message.
> 
> Marc noted in his comment that I was walking a small range (128K) using
> upper table iterations that cover 1G and 2MB ranges.  As you mention, the
> code tries to optimize away the upper table lookups.  Yes, the function is
> too bulky, but I'm not sure how to remove the upper table checks, since the
> page tables may change between the time pages are marked dirty and the time
> the log is retrieved.  And if a memory slot is very dirty, walking the upper
> tables will impact performance.  I'll think some more about this function.
> 
I think you should first aim for the simplest possible implementation
that works functionally.  Let's verify that this thing works and have
clean, working code that is implementation-wise as minimal as possible.

Then we can run perf on that to see whether our migrations are very slow
and where we are actually spending time, and only then optimize.

The solution to this specific problem for the time being appears quite
clear to me: follow the exact same scheme as unmap_range (the one I
sent out here:
https://lists.cs.columbia.edu/pipermail/kvmarm/2014-May/009592.html; the
diff is hard to read, so I recommend you apply the patch and look at the
resulting code).  Write a similar set of walkers, call the entry point
wp_ipa_range() or something like that, and use that for now.
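
Roughly something like the sketch below - completely untested, just to
illustrate the structure.  It reuses the kvm_s2pte_readonly() and
kvm_set_s2pte_readonly() helpers from your patch and mirrors the
per-level split of unmap_range; huge page handling is omitted, and the
caller is assumed to hold kvm->mmu_lock and to flush the stage-2 TLBs
once afterwards:

	static void wp_pte_entries(pmd_t *pmd, phys_addr_t addr,
				   phys_addr_t end)
	{
		pte_t *pte = pte_offset_kernel(pmd, addr);

		do {
			/* Make every present, writable pte read-only. */
			if (pte_present(*pte) && !kvm_s2pte_readonly(pte))
				kvm_set_s2pte_readonly(pte);
		} while (pte++, addr += PAGE_SIZE, addr != end);
	}

	static void wp_pmd_entries(pud_t *pud, phys_addr_t addr,
				   phys_addr_t end)
	{
		pmd_t *pmd = pmd_offset(pud, addr);
		phys_addr_t next;

		do {
			next = pmd_addr_end(addr, end);
			if (pmd_present(*pmd))
				wp_pte_entries(pmd, addr, next);
		} while (pmd++, addr = next, addr != end);
	}

	static void wp_pud_entries(pgd_t *pgd, phys_addr_t addr,
				   phys_addr_t end)
	{
		pud_t *pud = pud_offset(pgd, addr);
		phys_addr_t next;

		do {
			next = pud_addr_end(addr, end);
			if (pud_present(*pud))
				wp_pmd_entries(pud, addr, next);
		} while (pud++, addr = next, addr != end);
	}

	/*
	 * Walk the stage-2 page tables for [addr, end) and write-protect
	 * every present pte, descending only into present upper entries.
	 */
	static void wp_ipa_range(struct kvm *kvm, phys_addr_t addr,
				 phys_addr_t end)
	{
		pgd_t *pgd = kvm->arch.pgd + pgd_index(addr);
		phys_addr_t next;

		do {
			next = pgd_addr_end(addr, end);
			if (pgd_present(*pgd))
				wp_pud_entries(pgd, addr, next);
		} while (pgd++, addr = next, addr != end);
	}

For the dirty-log case you can then compute the IPA range covered by
each bitmap word and call wp_ipa_range() on it, instead of open-coding
the upper table lookups inside the loop.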

-Christoffer
