[PATCH] kvm: arm64: Enable hardware updates of the Access Flag for Stage 2 page tables

Catalin Marinas catalin.marinas at arm.com
Mon May 9 09:50:56 PDT 2016


On Mon, May 09, 2016 at 05:33:10PM +0200, Christoffer Dall wrote:
> On Wed, Apr 13, 2016 at 05:57:37PM +0100, Catalin Marinas wrote:
> > The ARMv8.1 architecture extensions introduce support for hardware
> > updates of the access and dirty information in page table entries. With
> > VTCR_EL2.HA enabled (bit 21), when the CPU accesses an IPA whose stage 2
> > page table entry has the PTE_AF bit cleared, instead of raising an
> > Access Flag fault to EL2 the CPU sets the AF bit (bit 10) in the entry
> > itself. To ensure that kernel modifications to the page table do not
> > inadvertently revert a bit set by hardware, certain Stage 2 software
> > pte/pmd operations must be performed atomically.
> > 
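As an aside, the detection/enable step amounts to roughly the following
(a sketch only, not the exact hunk; the variable names and the use of
read_sysreg() here are illustrative):

	/*
	 * Set VTCR_EL2.HA (bit 21) when ID_AA64MMFR1_EL1.HAFDBS
	 * (bits [3:0]) advertises at least hardware AF updates (a value
	 * of 1; a value of 2 adds hardware dirty state management).
	 */
	u64 mmfr1 = read_sysreg(id_aa64mmfr1_el1);

	if ((mmfr1 & 0xf) >= 1)
		vtcr |= VTCR_EL2_HA;
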
> > The main user of the AF bit is the kvm_age_hva() mechanism. The
> > kvm_age_hva_handler() function performs a "test and clear young" action
> > on the pte/pmd. This needs to be atomic with respect to automatic hardware
> > updates of the AF bit. Since the AF bit is in the same position for both
> > Stage 1 and Stage 2, the patch reuses the existing
> > ptep_test_and_clear_young() functionality if
> > __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG is defined. Otherwise, the
> > existing pte_young/pte_mkold mechanism is preserved.
> > 
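Roughly, the two paths look like this (a sketch; treat the exact
function and helper names as illustrative):

	#ifdef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
	/* Atomic test-and-clear, safe against hardware AF updates. */
	static int stage2_ptep_test_and_clear_young(pte_t *pte)
	{
		return __ptep_test_and_clear_young(pte);
	}
	#else
	/* Non-atomic fallback: fine when hardware never sets the AF bit. */
	static int stage2_ptep_test_and_clear_young(pte_t *pte)
	{
		if (pte_young(*pte)) {
			*pte = pte_mkold(*pte);
			return 1;
		}
		return 0;
	}
	#endif
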
> > kvm_set_s2pte_readonly() (and the corresponding pmd equivalent) has to
> > perform atomic modifications in order to avoid racing with updates of
> > the AF bit. The arm64 implementation has been re-written using
> > exclusives.
> > 
> > Currently, kvm_set_s2pte_writable() (and pmd equivalent) take a pointer
> > argument and modify the pte/pmd in place. However, these functions are
> > only used on local variables rather than actual page table entries, so
> > it makes more sense to follow the pte_mkwrite() approach used for stage 1
> > attributes. The rename to kvm_s2pte_mkwrite() makes it clear that these
> > functions do not modify the actual page table entries.
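For reference, the by-value form is simply (a sketch of the helper):

	/* Return a modified copy, mirroring the stage 1 pte_mkwrite() style. */
	static inline pte_t kvm_s2pte_mkwrite(pte_t pte)
	{
		pte_val(pte) |= PTE_S2_RDWR;
		return pte;
	}
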
> 
> so if one CPU takes a permission fault and is in the process of updating
> the page table, what prevents another CPU from setting the access flag
> (on a read, done by HW) on that entry between us reading the old pte and
> replacing it with the new pte?

There isn't anything that would prevent this. However, as in the stage 1
scenarios, explicitly writing a PTE as a result of a permission fault
should also mark the entry as accessed (young): either implicitly via the
AF bit in the PROT_DEFAULT attributes or explicitly via pte_mkyoung().
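Concretely, the fault path ends up doing something along these lines (a
sketch from memory of user_mem_abort(), not the exact code; the argument
list of stage2_set_pte() may differ):

	pte_t new_pte = pfn_pte(pfn, PAGE_S2);	/* PROT_DEFAULT => AF set */

	if (writable)
		new_pte = kvm_s2pte_mkwrite(new_pte);
	/*
	 * The entry we write back is always young, so this store cannot
	 * revert a concurrent hardware AF update.
	 */
	stage2_set_pte(kvm, memcache, fault_ipa, &new_pte, 0);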

> Don't we lose the AF information in that case too?

Since the hardware never clears the AF bit automatically, there is no
information to lose in this race as long as kvm_set_s2pte_writable()
(now kvm_s2pte_mkwrite()) produces an entry with the AF bit set. AFAICT,
the PAGE_S2 and PAGE_S2_DEVICE definitions already use PROT_DEFAULT,
which has the AF bit set.
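That is (quoting arch/arm64/include/asm/pgtable.h from memory, so worth
double-checking against the tree):

	#define PROT_DEFAULT	(PTE_TYPE_PAGE | PTE_AF | PTE_SHARED)
	#define PAGE_S2		__pgprot(PROT_DEFAULT | PTE_S2_MEMATTR(MT_S2_NORMAL) | PTE_S2_RDONLY)

PTE_AF is part of PROT_DEFAULT, so any entry created with PAGE_S2 (or
PAGE_S2_DEVICE) starts out young.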

> >  static inline void kvm_set_s2pte_readonly(pte_t *pte)
> >  {
> > -	pte_val(*pte) = (pte_val(*pte) & ~PTE_S2_RDWR) | PTE_S2_RDONLY;
> > +	pteval_t pteval;
> > +	unsigned long tmp;
> > +
> > +	asm volatile("//	kvm_set_s2pte_readonly\n"
> > +	"	prfm	pstl1strm, %2\n"
> > +	"1:	ldxr	%0, %2\n"
> > +	"	and	%0, %0, %3		// clear PTE_S2_RDWR\n"
> > +	"	orr	%0, %0, %4		// set PTE_S2_RDONLY\n"
> > +	"	stxr	%w1, %0, %2\n"
> > +	"	cbnz	%w1, 1b\n"
> > +	: "=&r" (pteval), "=&r" (tmp), "+Q" (pte_val(*pte))
> 
> so "+Q" means "pass the memory address of the value, and the content
> (not the address itself) can be updated by the assembly code"?

Yes. I think "info gcc" isn't clear enough on this constraint (at least
to me), but that's something we learnt from the tools guys in the early
days of the AArch64 kernel port.
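A minimal standalone illustration of the pattern (hypothetical helper,
not part of the patch):

	/*
	 * Atomically set bit 10 (the AF position) in *p. "+Q" makes the
	 * compiler emit an addressable memory operand for %2 and treat
	 * the pointed-to value as both read and written by the asm.
	 */
	static inline void set_bit10_atomic(unsigned long *p)
	{
		unsigned long val, tmp;

		asm volatile(
		"	prfm	pstl1strm, %2\n"
		"1:	ldxr	%0, %2\n"
		"	orr	%0, %0, #(1 << 10)\n"
		"	stxr	%w1, %0, %2\n"
		"	cbnz	%w1, 1b\n"
		: "=&r" (val), "=&r" (tmp), "+Q" (*p));
	}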

-- 
Catalin


