Unhandled level 2 translation fault on A72 board.

Tue Jan 26 03:44:45 PST 2016

On Tue, Jan 26, 2016 at 07:33:17PM +0800, Ding Tianhong wrote:
> On 2016/1/26 19:03, Catalin Marinas wrote:
> > On Tue, Jan 26, 2016 at 03:37:42PM +0800, Ding Tianhong wrote:
> >> I met this problem when running the hackbench test on A72 chip board:
> >>
> >> sh[4779]: unhandled level 2 translation fault (11) at 0x7f96be0c80, esr 0x83000006 
> >> pgd = ffffffc01a1f0000 
> >> [7f96be0c80] *pgd=0000000084a20003, *pud=0000000084a20003, *pmd=0000000000000000
[...]
> > I can't tell for sure it's a TLB issue. The kernel page table dump shows
> > *pmd being 0, so the fault is correctly called "level 2 translation
> > fault". It also seems that there is no vma at this address, hence the
> > kernel reports it as unhandled. It looks like data corruption which
> > could be caused by cache or TLB incoherence. Just make sure the
> > interconnect linking the two clusters is configured correctly by
> > _firmware_ before Linux starts.
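
For reference, the reported ESR value can be decoded by hand. The sketch below is a user-space illustration, not kernel code: it extracts the exception class and fault status code from the ESR using the ARMv8-A field layout, and computes the level 2 table index for the faulting address. The helper names are hypothetical, and the index calculation assumes a 4KB granule with 9 index bits per level, which the log above does not state.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* ESR_ELx layout (ARMv8-A): EC in bits [31:26], fault status code in bits [5:0]. */
#define ESR_EC_SHIFT	26
#define ESR_EC_MASK	0x3fu
#define ESR_FSC_MASK	0x3fu

#define EC_IABT_LOWER	0x20u	/* instruction abort from a lower exception level */
#define EC_DABT_LOWER	0x24u	/* data abort from a lower exception level */

/* Hypothetical helpers, named for this illustration only. */
static inline unsigned int esr_ec(uint32_t esr)
{
	return (esr >> ESR_EC_SHIFT) & ESR_EC_MASK;
}

static inline unsigned int esr_fsc(uint32_t esr)
{
	return esr & ESR_FSC_MASK;
}

/* Translation faults are encoded as FSC 0b0001LL; return LL, or -1 otherwise. */
static inline int translation_fault_level(uint32_t esr)
{
	unsigned int fsc = esr_fsc(esr);

	return (fsc & 0x3cu) == 0x04u ? (int)(fsc & 0x3u) : -1;
}

/*
 * Assumption: 4KB granule, so each level indexes 9 bits of the address
 * and level 3 covers bits [20:12]. Level 2 then covers bits [29:21].
 */
static inline unsigned int table_index(uint64_t va, int level)
{
	assert(level >= 1 && level <= 3);
	return (unsigned int)((va >> (12 + 9 * (3 - level))) & 0x1ffu);
}

/* Print the decoded fields for one fault report. */
static inline void describe_fault(uint64_t far, uint32_t esr)
{
	printf("EC=0x%02x FSC=0x%02x level=%d level-2 index=%u\n",
	       esr_ec(esr), esr_fsc(esr),
	       translation_fault_level(esr), table_index(far, 2));
}
```

Calling describe_fault(0x7f96be0c80, 0x83000006) reports EC 0x20 (an instruction abort taken from EL0) and FSC 0x06 (translation fault, level 2), which matches the zero *pmd in the dump. Under the 4KB granule assumption, the empty entry would be index 181 in the level 2 table.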
> 
> Thanks for the reply. I tried applying this patch as a test:
> 
> ---
>  arch/arm64/kernel/process.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
>  
> diff --git a/arch/arm64/kernel/process.c b/arch/arm64/kernel/process.c
> index 6391485..d7d8439 100644
> --- a/arch/arm64/kernel/process.c
> +++ b/arch/arm64/kernel/process.c
> @@ -283,6 +283,13 @@ static void tls_thread_switch(struct task_struct *next)
>  	: : "r" (tpidr), "r" (tpidrro));
>  }
> +static void tlb_flush_thread(struct task_struct *prev)
> +{
> +	/* Flush the prev task's TLB entries */
> +	if (prev->mm)
> +		flush_tlb_mm(prev->mm);
> +}
> +
>  /*
>   * Thread switching.
>   */
> @@ -296,6 +303,8 @@ struct task_struct *__switch_to(struct task_struct *prev,
>  	hw_breakpoint_thread_switch(next);
>  	contextidr_thread_switch(next);
> +	tlb_flush_thread(prev);
> +
>  	/*
>  	 * Complete any pending TLB or cache maintenance on this CPU in case
>  	 * the thread migrates to a different CPU.
> 
> hackbench runs fine with this patch applied, so I guess the previous thread's TLB
> entries are not being invalidated soon enough, but I don't know why. Everything
> works fine on A57. Am I missing something?

It looks like the TLB invalidation messages may not get across the CCI
between clusters. I don't have the TRMs at hand but make sure all the
relevant bits in the CPUs and CCI are enabled.

BTW, which kernel version are you running? Is the firmware your own or
built around ARM Trusted Firmware?

-- 
Catalin