答复: [PATCH RESEND] arm64: fault: avoid send SIGBUS two times

Mon Dec 11 07:38:44 PST 2017

Hi James,
  Thanks for your review and suggestion.

> Hi gengdongjiu,
> 
> On 08/12/17 04:43, gengdongjiu wrote:
> > by the way, I think also change the info.si_code to "BUS_MCEERR_AR" is better, as shown [1].
> > BUS_MCEERR_AR can tell user space  "Hardware memory error consumed on a error; action required".
> 
> Today its also used as the last-resort. This signal tells user-space the page can't be re-read from disk/swap, and its been unmapped from all
> affected processes.
> 
> I think using it like this (tempting as it is) changes the meaning.

Consider again, I think what is your said is reasonable, when the ghes_notify_sea() return failure, it means the meory_failure() does not handler the error and even not 
unmapped the affected processes. 
So setting the si_code to BUS_MCEERR_AR may not better, I will set the si_code to 0.
Thanks for your suggestion and reminder.

> 
> 
> > so it is better than "0". In the X86 platform, it also use the "BUS_MCEERR_AR" for si_code[2] in "arch/x86/mm/fault.c".
> > what do you think about it?
> 
> This is heading into kernel-first territory, I'd prefer we do that all at once so we know everything is covered.

Yes, it is.

> 
> 
> > [2]:
> > arch/x86/mm/fault.c:
> >
> > static void
> > do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
> >       u32 *pkey, unsigned int fault)
> > {
> >   ......
> > #ifdef CONFIG_MEMORY_FAILURE
> >     if (fault & (VM_FAULT_HWPOISON|VM_FAULT_HWPOISON_LARGE)) {
> 
> These VM_FAULT flags indicate memory_failure() has run, tried to re-read the memory from disk/swap, failed, and unmapped the page
> from all affected processes.

Understand, These VM_FAULT flags is different with the do_sea() handling. In the do_sea(), the memory_failure() may not run.  

> 
> 
> >         printk(KERN_ERR
> >     "MCE: Killing %s:%d due to hardware memory corruption fault at %lx\n",
> >             tsk->comm, tsk->pid, address);
> >         code = BUS_MCEERR_AR;
> >     }
> > #endif
> >     force_sig_info_fault(SIGBUS, code, address, tsk, pkey, fault); }
> 
> This is x86's page fault handler, not its Machine-Check-Exception handler.
> 
> arm64's page fault handler does this too, from do_page_fault():

Yes, indeed. 
just now I check the code, you are right. 

> >	} else if (fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE)) {
> >		sig = SIGBUS;
> >		code = BUS_MCEERR_AR;
> 
> 
> If you're seeing this, its likely due to the race Xie XiuQi spotted where the recovery action has been queued, then we return to user-space
> before its done.
> 
> I had a go at tackling this, adding helpers to kick the assorted queues, which we can do if we took the exception from user-space. Where I
> got stuck is whether we should still force a signal, and how signals get merged. I'll try and spend some more time on that this week.

Understand, thanks for tacking and effort.

> 
> 
> 
> Thanks,
> 
> James