qemu

Ard Biesheuvel ard.biesheuvel at linaro.org
Thu Sep 11 11:35:07 PDT 2014


On 11 September 2014 20:09, Christoffer Dall
<christoffer.dall at linaro.org> wrote:
> On Thu, Sep 11, 2014 at 07:06:17PM +0200, Ard Biesheuvel wrote:
>> On 11 September 2014 18:48, Marc Zyngier <marc.zyngier at arm.com> wrote:
>> > On 11/09/14 16:26, Ard Biesheuvel wrote:
>> >> On 11 September 2014 16:52, Marc Zyngier <marc.zyngier at arm.com> wrote:
>> >>> On 11/09/14 15:41, Ard Biesheuvel wrote:
>> >>>> Hello all,
>> >>>>
>> >>>> I spent most of the day chasing a particularly weird heisenbug in the
>> >>>> QEMU+KVM+UEFI combo.
>> >>>> The symptom was that UEFI init would hang on the first write to the
>> >>>> second NOR flash (to initialize the variable store) but *only* when
>> >>>> using the -bios option (instead of -pflash) and a boot image of
>> >>>> exactly 64 MB in size. Note that this implies that the second NOR
>> >>>> flash was not file backed.
>> >>>>
>> >>>> As it turns out, the choice of the -bios option and the size of the
>> >>>> file affect whether KVM ends up using sections or pages to back the
>> >>>> NOR flash, and in my failure case, it was using the latter. That
>> >>>> resulted in KVM going down a code path where the memory backing the
>> >>>> NOR was writable, which breaks the MMIO emulation, and resulted in the
>> >>>> hang on init of the variable store.
>> >>>>
>> >>>> The patch below fixes it for me.
>> >>>>
>> >>>> diff --git a/arch/arm/kvm/mmu.c b/arch/arm/kvm/mmu.c
>> >>>> index c68ec28f17c3..121abc6fef97 100644
>> >>>> --- a/arch/arm/kvm/mmu.c
>> >>>> +++ b/arch/arm/kvm/mmu.c
>> >>>> @@ -817,7 +817,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu,
>> >>>> phys_addr_t fault_ipa,
>> >>>>  pfn = gfn_to_pfn_prot(kvm, gfn, write_fault, &writable);
>> >>>>  if (is_error_pfn(pfn))
>> >>>>   return -EFAULT;
>> >>>>
>> >>>> - if (kvm_is_mmio_pfn(pfn))
>> >>>> + if (writable && kvm_is_mmio_pfn(pfn))
>> >>>>   mem_type = PAGE_S2_DEVICE;
>> >>>>
>> >>>>   spin_lock(&kvm->mmu_lock);
>> >>>>
>> >>>> Here is the definition of kvm_is_mmio() for completeness. I am a bit
>> >>>> out of my depth here, so perhaps someone else can shed some light on
>> >>>> this?
>> >>>>
>> >>>> bool kvm_is_mmio_pfn(pfn_t pfn)
>> >>>> {
>> >>>>         if (pfn_valid(pfn))
>> >>>>                 return PageReserved(pfn_to_page(pfn));
>> >>>>
>> >>>>         return true;
>> >>>> }
>> >>>>
>> >>>> To me, it is particularly puzzling what PageReserved() has to do with
>> >>>> anything, as I couldn't find any other uses of it under kvm/
>> >>>>
>> >>>
>> >>> My understanding is that kvm_is_mmio_pfn() is used for *devices* that
>> >>> are mapped directly mapped (think device assignment). PageReserved()
>> >>> would make sense there.
>> >>>
>> >>> Now, I'm not familiar with the whole QEMU setup, so maybe you could
>> >>> describe how things are mapped, and what is supposed to happen?
>> >>>
>> >>
>> >> When running QEMU using the -bios <file> option, the file is exposed
>> >> to the guest as an emulated NOR flash at 0x0, so that you can boot
>> >> from it directly.
>> >> In my case, the NOR is used for the boot image itself, and for a
>> >> non-volatile variable store at 0x400_0000, which is initialized by
>> >> UEFI when it boots.
>> >>
>> >> In order for the NOR emulation to work, writes to the NOR need to
>> >> trap, so that QEMU can take down the whole memory region, trapping all
>> >> reads and writes, until a command is issued that puts it back into
>> >> array mode, and the memory region is created again. In this mode, the
>> >> guest reads to the NOR go straight to host RAM. (Or they should: the
>> >> patch you merged today fixed and issue where reads were mistaken for
>> >> writes and sent to QEMU instead)
>> >>
>> >> So when UEFI enters it non-volatile variable store driver, it first
>> >> issues a read to the base of the second half of the NOR (0x400_0000),
>> >> resulting in some host RAM to be pinned to back it up. However, for
>> >> some reason, it is mapped as PAGE_S2_DEVICE, which is read-write, so
>> >> when subsequently a write is issued to kick the NOR into command mode,
>> >> it just gets sent to host RAM as well.
>> >>
>> >> Indeed, by the looks of it, kvm_is_mmio_pfn() is intended to map host
>> >> device memory straight into the guest physical address space, but this
>> >> is not what I am doing, so why is it returning 'true' here?
>> >
>> > That's the bit I do not understand. Any chance you can find out who sets
>> > this bit in the page tables?
>> >
>>
>> That is also the bit *I* don't understand, and that is why I am asking
>> you guys :-)
>>
>> Let me first double check that PageReserved is actually the culprit here ...
>
> Yeah, so assuming that Linux is not broken, we either have a bug in how
> we resolve the pfn or QEMU puts an incorrect address for the backing of
> the flash into the memory region.
>
> Another question here is whether we should be always setting the RDWR
> bits for PAGE_S2_DEVICE?  Is it conceivable that we would want to expose
> some device MMIO region as a read-only region and still fault on writes?
>

Let's see if the holes in our knowledge line up :-)

I noticed that the call to kvm_is_mmio_pfn() was only recently added
by Kim Phillips (b88657674d39 ARM: KVM: user_mem_abort: support stage
2 MMIO page mapping), and there are no other references to it in
KVM/ARM

I am willing to accept that checking for PG_reserved makes sense in
the context of deciding whether some pfn is backed by normal RAM
(aiui, there may be a hole there), but returning writable RAM when
writable = false is arguably a bug. Due to the fact that we are only
introducing KVM_CAP_READONLY_MEM now, and due to the
read-as-ROM/write-as-control-reg nature of NOR, it is no surprise that
this is where the issues present themselves.

So could anyone shed some light on the nature of the memory returned
by hva_to_pfn() and why it would have the PG_reserved flag set?

-- 
Ard.



More information about the linux-arm-kernel mailing list