Oops in guest after ioremap() on ARMv7

Thu Dec 22 13:13:56 EST 2011

On Thu, Dec 22, 2011 at 04:38:23PM +0000, David Vrabel wrote:
> On 22/12/11 14:49, Catalin Marinas wrote:
> > On Thu, Dec 22, 2011 at 12:08:07PM +0000, David Vrabel wrote:
> >> When running the Linux kernel on the ARMv7 envelope model as a guest
> >> under the Xen hypervisor there is an oops (see below for an example of
> >> the page translation fault) when trying to access ioremap()'d memory.
> 
> The translation tables for userspace seem to be also affected.  The
> program repeatedly faults with a translation fault on the same address.
>  Putting a flush_cache_all() after the call to handle_mm_fault() in
> __do_page_fault() makes userspace work as well.

With the classic page tables, on A15 we need this patch:

http://git.kernel.org/?p=linux/kernel/git/cmarinas/linux.git;a=commitdiff_plain;h=27cbbe6b1e17fa0b954edd37e26d601bdd7766a6

But that has to do with TLBs rather than caches, and it only shows up on
real hardware rather than on the model.

> >> The same kernel works fine when not running under the hypervisor.
> >>
> >> It's a 3.2.0-rc5+ kernel with the two additional linux-arch-arm
> >> branches: arm-arch/vexpress and arm-arch/arm-lpae.
> >>
> >> Calling flush_cache_all() in flush_cache_vmap() makes it work.  What
> >> isn't being correctly flushed?  I see that flush_pmd_entry() and
> >> cpu_v7_set_pte_ext() already flush the L1 and L2 translation table
> >> entries and I can't think of anything else that would need to be flushed
> >> (unless the mapped virtual addresses need to be flushed as well?)
> >>
> >> The "Barrier Litmus Tests and Cookbook" says that a TLB flush and a
> >> branch predictor flush are required after a translation table entry
> >> update.  This seems not to be done but adding this didn't seem to help
> >> (and using local_flush_tlb_all() in flush_cache_vmap() didn't help either).
> >>
> >> I don't see anything in the hypervisor that could be causing this as the
> >> fault is occurring at stage 1 and not stage 2 translation.
> > 
> > Interesting error, I don't have an immediate idea of what might be
> > wrong, just some questions.
> > 
> > What's the value of the VTCR register for this guest? Are the
> > translation table walks marked as cacheable? Also, are the page table
> > attributes Normal Cacheable in the stage 2 translation? The processor
> > chooses the more restrictive attribute between stage 1 and stage 2.
> 
> VTCR = 0x80002558 which is: Outer Shareable; Normal memory, outer
> write-back write-allocate cacheable; Normal memory, inner write-back,
> write-allocate cacheable.
> 
> L3 TT entries for stage 2 have the following attributes:
> Outer-Shareable; Normal, inner write-back cacheable; Normal, outer
> write-back cacheable.
> 
> These look sensible to me.

They look fine (UP system). BTW, I assume that the hypervisor also
flushes the caches and TLBs for the stage 2 translation tables.
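
For reference, the VTCR value quoted above can be decoded field by field.
The following is only a minimal sketch, assuming the ARMv7-A
Virtualization Extensions layout of VTCR (T0SZ in bits [3:0], SL0 in
[7:6], IRGN0 in [9:8], ORGN0 in [11:10], SH0 in [13:12]). The struct and
function names are made up for illustration and are not taken from Xen or
Linux.

```c
#include <stdint.h>

/* Decoded VTCR fields (struct and function names are illustrative only). */
struct vtcr_fields {
    int      t0sz;   /* signed size offset; stage 2 input size is 32 - t0sz bits */
    unsigned sl0;    /* bits [7:6]: starting level of the stage 2 walk */
    unsigned irgn0;  /* bits [9:8]: inner cacheability of table walks */
    unsigned orgn0;  /* bits [11:10]: outer cacheability of table walks */
    unsigned sh0;    /* bits [13:12]: shareability of table walks */
};

struct vtcr_fields vtcr_decode(uint32_t vtcr)
{
    struct vtcr_fields f;
    unsigned raw_t0sz = vtcr & 0xfu;   /* bits [3:0], 4-bit two's complement */

    /* Sign-extend the 4-bit T0SZ value (bit 3 is the sign bit). */
    f.t0sz  = (raw_t0sz & 0x8u) ? (int)raw_t0sz - 16 : (int)raw_t0sz;
    f.sl0   = (vtcr >> 6) & 0x3u;
    f.irgn0 = (vtcr >> 8) & 0x3u;
    f.orgn0 = (vtcr >> 10) & 0x3u;
    f.sh0   = (vtcr >> 12) & 0x3u;
    return f;
}

/* Cacheability encodings shared by IRGN0 and ORGN0. */
const char *vtcr_rgn_name(unsigned rgn)
{
    static const char *const names[4] = {
        "non-cacheable",
        "write-back write-allocate",
        "write-through",
        "write-back no write-allocate",
    };
    return names[rgn & 0x3u];
}
```

With vtcr_decode(0x80002558) this gives SH0 = 2 (Outer Shareable),
ORGN0 = IRGN0 = 1 (write-back write-allocate), SL0 = 1 and T0SZ = -8
(a 40-bit input address), which agrees with the description above.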

It could just as well be a model bug, but people are on holiday at the moment
(and I'm off shortly as well, until 3rd of January). Could you try to
disable the cacheability of the page table walks for both stage 1 (TTBRx
with classic page tables or TTBCR with LPAE) and stage 2 (VTCR)? Since
Linux does the correct cache flushing and I assume the hypervisor as
well, this may work around a possible model bug.
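
As a rough sketch of that experiment for the LPAE and stage 2 cases:
making the walks non-cacheable amounts to clearing the IRGN/ORGN fields,
since an encoding of 0b00 means Normal memory, Non-cacheable. The code
below only computes the new register values. It assumes the
long-descriptor TTBCR layout (IRGN0 [9:8], ORGN0 [11:10], IRGN1 [25:24],
ORGN1 [27:26]) and the VTCR layout (IRGN0 [9:8], ORGN0 [11:10]). The
macro and function names are hypothetical, and the actual MRC/MCR
accesses plus the required barriers and TLB maintenance are left out.
The classic TTBRx case is not covered here.

```c
#include <stdint.h>

/* Walk cacheability fields, 2 bits each (names are illustrative only). */
#define WALK_IRGN0_MASK  (UINT32_C(3) << 8)   /* TTBCR and VTCR */
#define WALK_ORGN0_MASK  (UINT32_C(3) << 10)  /* TTBCR and VTCR */
#define WALK_IRGN1_MASK  (UINT32_C(3) << 24)  /* TTBCR only (TTBR1 walks) */
#define WALK_ORGN1_MASK  (UINT32_C(3) << 26)  /* TTBCR only (TTBR1 walks) */

/* Return a VTCR value with stage 2 table walks marked non-cacheable. */
uint32_t vtcr_walks_noncacheable(uint32_t vtcr)
{
    return vtcr & ~(WALK_IRGN0_MASK | WALK_ORGN0_MASK);
}

/*
 * Return a long-descriptor (LPAE) TTBCR value with stage 1 table walks
 * marked non-cacheable for both TTBR0 and TTBR1.
 */
uint32_t ttbcr_lpae_walks_noncacheable(uint32_t ttbcr)
{
    return ttbcr & ~(WALK_IRGN0_MASK | WALK_ORGN0_MASK |
                     WALK_IRGN1_MASK | WALK_ORGN1_MASK);
}
```

Applied to the VTCR value quoted earlier, 0x80002558 becomes 0x80002058:
only bits [11:8] change, so the shareability and size fields are kept.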

-- 
Catalin