Re: Reply: [PATCH v6 6/7] KVM: arm64: allow get exception information from userspace

Wed Oct 18 22:48:49 PDT 2017

Hi James,
   Thanks for the mail, and sorry for my late response.

On 2017/10/7 0:46, James Morse wrote:
>>> will give you an BUS_MCEERR_AO signal for any guest memory that is 
>>> affected, and a BUS_MCEERR_AR if the guest directly accesses a page of affected memory.
>>>
>>> What Qemu/kvmtool do with this is up to them. If they're emulating a 
>>> machine with no RAS features, they can print an error and exit.
>>>
>>> Otherwise BUS_MCEERR_AR could be notified as one of the flavours of 
>>> IRQ, unless the affected vcpu has interrupts masked, in which case an 
>>> SEA notification gives you some NMI-like behaviour.
>> Thanks for explanation. 
>> Since SEA notification can provide NMI-like behaviour, how about we use it for
>> BUS_MCEERR_AR to avoid checking the interrupt mask status?
> ... yes, that's the suggestion. Qemu/kvmtool can take MCEERR_AR from KVM and
> SET_ONE_REG the vcpu into taking a synchronous-external-abort instead.
> 
> (but there is trouble ahead where this suggestion unravels)
> 
> 
>> Even if the guest OS does not support SEA notification, it is still a valid
>> guest Synchronous-external-abort
> When it comes from KVM, Yes.
> 
> If you think of linux+KVM's stage2 translation as your guest's memory controller,
> it is effectively guaranteeing to:
> 1. 'interrupt' userspace with MCEERR_AO when it detects an error,
> 2. contain it (by unmapping and marking the page PG_poison) then,
> 3. give you a synchronous external abort with MCEERR_AR if the guest touches the
> contained memory.
> 
> This is fine when the MCEERR_AR came from KVM, what we need to find out is
> whether this will still be true for some future kernel code that sends MCEERR_AR...

Ok
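
For reference, a minimal sketch of how Qemu/kvmtool could tell the two signals apart, based on the steps above. BUS_MCEERR_AO and BUS_MCEERR_AR are the real si_code values from <signal.h>; the enum ras_action names and both helper functions are hypothetical, only for illustration, and not taken from Qemu or kvmtool.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <signal.h>
#include <stdint.h>

/* Hypothetical action names, for illustration only. */
enum ras_action {
    RAS_NOT_MEMORY_ERROR,  /* some other kind of SIGBUS */
    RAS_NOTIFY_LATER,      /* BUS_MCEERR_AO: error detected, action optional */
    RAS_INJECT_SEA,        /* BUS_MCEERR_AR: guest touched contained memory */
};

/* Map the si_code of a SIGBUS to what the VMM would do with it. */
static enum ras_action classify_sigbus(int si_code)
{
    switch (si_code) {
    case BUS_MCEERR_AO:
        return RAS_NOTIFY_LATER;
    case BUS_MCEERR_AR:
        return RAS_INJECT_SEA;
    default:
        return RAS_NOT_MEMORY_ERROR;
    }
}

/*
 * si_addr_lsb reports the least significant valid bit of si_addr, so the
 * affected range is (1 << lsb) bytes. This returns the start of that range.
 */
static uintptr_t affected_base(uintptr_t addr, unsigned int lsb)
{
    assert(lsb < sizeof(uintptr_t) * 8);
    return addr & ~(((uintptr_t)1 << lsb) - 1);
}
```

A real handler would be installed with sigaction() and SA_SIGINFO and would read si_code, si_addr and si_addr_lsb from the siginfo_t it receives.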

> 
> 
>>> For BUS_MCEERR_AO you could use SEI, IRQ or polled notification. My 
>>> choice would be IRQ, as you can't know if the guest supports SEI and 
>>> it would be a shame to
>> How about we first check whether user space can specify the virtual SError Exception Syndrome (i.e. whether there is a vsesr_el2 register)?
>> If it can, use SEI notification; otherwise use IRQ notification.
>> The advantage is that SEI can pass more error information than IRQ if the syndrome can be specified.
> If this is for APEI firmware-first, it's the severity in the CPER records that
> matters, so the SEI/IRQ choice doesn't matter if the guest supports both.
> 
> But it's unlikely the guest supports both. (v4.14 doesn't).
> If you pick IRQ and the guest doesn't support it, nothing happens, the guest
> never takes the IRQ. You take the memory-error as a MCEERR_AR some time later.
> 
> If you pick SEI and the guest doesn't support it, the guest dies of an unknown
> SError.
> 
> I would pick POLLed by default, as this would let the Qemu/kvmtool code that
> does this work be portable between x86 and arm64.
> 
> [...]

As we discussed in another mail, we can use GPIO-signaled events or GPIO interrupts, which the APEI spec
also suggests and which do not cause the guest to die, as shown in [1].

[1]
HW-reduced ACPI platforms signal the error using a GPIO interrupt or another interrupt declared
under a generic event device (Section 5.6.9). In the case of GPIO-signaled events, an _AEI object
lists the appropriate GPIO pin, while for Interrupt-signaled events a _CRS object is used to list
the interrupt:

> 
>>>> If call memory_failure(), memory_failure can translate the PA to host 
>>>> VA, then deliver host VA to Qemu.
>>> Yes, this is how it works for any user-space process as two processes 
>>> sharing the same page may map it in different locations.
>>>
>>>
>>>> Qemu can translate the host VA to IPA. so we rely on memory_failure() 
>>>> to get the IPA.
>>> Yes. I don't see why this is a problem: The kernel isn't going to pass 
>>> RAS events into the guest, so it never needs to know the IPA.
>>>
>>> Instead we notify user-space about ranges of memory affected by 
>>> memory_failure(), KVM's user-space isn't a special case here.
>>>
>>> As you point out, if Qemu wants to notify the guest it can calculate 
>>> the IPA and either use CPER for firmware-first, or in the future, 
>>> update some representation of the v8.2 ERR records once we can virtualise kernel-first.
>>>
>>> (I'm not sure I understand your point here, but I don't think we 
>>> disagree,)
>> Yes, I only describe the workflow, not think we do not disagree.
> (double negative?)
> 
> 
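
To make the host VA to IPA step concrete, here is a minimal sketch of the lookup user space could do with the address it gets from memory_failure(). It assumes guest RAM is described by a small table of slots; struct guest_ram_slot and hva_to_ipa() are hypothetical names for illustration, not Qemu's actual data structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one region of guest RAM. */
struct guest_ram_slot {
    uintptr_t hva_base;  /* where the region is mapped in the VMM */
    uint64_t  ipa_base;  /* where the guest sees the region */
    uint64_t  size;      /* length of the region in bytes */
};

/*
 * Translate a host VA reported with the signal into a guest IPA.
 * Returns false when the address does not belong to guest RAM, in which
 * case there is nothing to report to the guest.
 */
static bool hva_to_ipa(const struct guest_ram_slot *slots, size_t nslots,
                       uintptr_t hva, uint64_t *ipa)
{
    assert(slots != NULL && ipa != NULL);

    for (size_t i = 0; i < nslots; i++) {
        if (hva >= slots[i].hva_base &&
            hva - slots[i].hva_base < slots[i].size) {
            *ipa = slots[i].ipa_base + (hva - slots[i].hva_base);
            return true;
        }
    }
    return false;
}
```

The false return is the "does this address belong to the guest" check: only addresses inside a slot would be turned into a guest notification.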
>> If we do not pass exception information to user space, there is another issue.
>> As we agreed, if we want to inject a Synchronous-external-abort, we let Qemu/kvmtool inject it.
> Yes, because all we do for user space is send it MCEERR_AR when this happens.
> 
> KVM only does this to make guest-accesses the same as user-space accesses.
> (See arm64's do_page_fault() and x86's do_sigbus() for the regular path)
> 
> 
>> When Qemu injects it, it needs to set FAR_EL1 to the value of FAR_EL2. But if we do not
>> pass the value of far_el2 to user space, Qemu will have to set FAR_EL1 to 0, and then FAR_EL1's value is invalid.
>> FAR_EL1 is usually used to hold the faulting guest VA.
> Yes, Qemu doesn't know the guest VA, but it doesn't need to for this
> mechanism to work: Set the DFSC to '0b010000 Synchronous external abort, not on
> translation table walk', then set 'FnV - Far Not Valid'.

Yes, I think so. Thanks for your good suggestion.
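
As I understand the suggestion, the syndrome bits could be built roughly like this. It is only a sketch: the bit positions (FnV at bit 10, fault status code in bits [5:0]) follow my reading of the ESR_ELx layout for aborts, and the macro and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative names; bit positions as in the ESR_ELx ISS for aborts. */
#define ESR_ISS_FNV       (UINT32_C(1) << 10)  /* FnV: FAR not valid */
#define ESR_ISS_FSC_MASK  UINT32_C(0x3f)       /* DFSC/IFSC, bits [5:0] */
#define ESR_FSC_SEA       UINT32_C(0x10)       /* 0b010000: synchronous external
                                                  abort, not on a table walk */

/* Build the ISS for an abort whose FAR_EL1 contents must not be trusted. */
static uint32_t sea_iss_far_not_valid(void)
{
    uint32_t fsc = ESR_FSC_SEA;

    assert((fsc & ~ESR_ISS_FSC_MASK) == 0);
    return ESR_ISS_FNV | fsc;
}
```

With FnV set, the guest is told not to use FAR_EL1, so Qemu does not need the guest VA at all.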

> 
> ...
> 
> But there is one piece of information Qemu needs here but doesn't have: whether
> this was an instruction or a data abort, which you need to pick the correct EC.
So this patch aims to pass more information to user space, but you have some concerns.
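
To show why the instruction/data distinction is needed, here is a sketch of how the EC would be picked once user space has that information. The EC values are the architectural ones for aborts taken to EL1, as far as I know. The function name, its parameters, and the choice to always set IL are my assumptions for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ESR_EC_SHIFT     26
#define ESR_IL           (UINT32_C(1) << 25)
#define ESR_ISS_MASK     UINT32_C(0x01ffffff)

#define ESR_EC_IABT_LOW  UINT32_C(0x20)  /* instruction abort from EL0 */
#define ESR_EC_IABT_CUR  UINT32_C(0x21)  /* instruction abort from EL1 */
#define ESR_EC_DABT_LOW  UINT32_C(0x24)  /* data abort from EL0 */
#define ESR_EC_DABT_CUR  UINT32_C(0x25)  /* data abort from EL1 */

/*
 * Hypothetical helper: compose ESR_EL1 for an injected abort.
 * is_iabt is exactly the piece of information user space lacks today.
 */
static uint32_t compose_abort_esr(bool is_iabt, bool guest_was_el0,
                                  uint32_t iss)
{
    uint32_t ec;

    assert((iss & ~ESR_ISS_MASK) == 0);

    if (is_iabt)
        ec = guest_was_el0 ? ESR_EC_IABT_LOW : ESR_EC_IABT_CUR;
    else
        ec = guest_was_el0 ? ESR_EC_DABT_LOW : ESR_EC_DABT_CUR;

    /* Assumption: IL is set, as for a 32-bit instruction. */
    return (ec << ESR_EC_SHIFT) | ESR_IL | iss;
}
```

Without is_iabt, user space can only guess between 0x20/0x21 and 0x24/0x25.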

> 
> Ideally, this would come with the signal, but that doesn't happen today.
I mentioned before that the information is not enough, but you replied that user space should not need it.

> 
> I don't think user-space generally needs to know this, only a JIT is likely to
> be able to handle these errors, and it would know which regions were code or data.
Why is this related to a JIT?

> 
> So it looks like we really do need more information from KVM, and guests have to
> be a special-case here. Bother.

I think my current patch can pass the needed information. In fact, Qemu
will check whether the fault address belongs to the guest, and only if it does will it
get this information, so there is no stale-information issue. How about we keep this patch?

For the guest APEI CPER records, if we want to make them better, more information also needs to be passed.

> 
> I'll try and put together an RFC to let user-space generate a believable
> architectural exception when it gets this signal.

I think this patch can pass enough information for the guest. Why would you put together another patch?

> 
> 
>> Of course, if the guest cannot get the fault VA from FAR_EL1, it can still read the CPER to get
>> the guest fault PA and translate it to a fault VA.
> _A_ fault VA, there may be a set of them. This is why memory_failure() walks the
> rmap to find all the users of the page, and why the faulting VA isn't really
> relevant for this sort of RAS error.

Yes.

> 
> 
> Thanks,