[PATCH] KVM: arm/arm64: fix unaligned hva start and end in handle_hva_to_gpa

Tue May 15 19:47:58 PDT 2018

On 5/16/2018 12:21 AM, Suzuki K Poulose Wrote:
> Hi Jia,
> 
> On 15/05/18 14:15, Jia He wrote:
>>
>>
>> On 5/15/2018 8:38 PM, Jia He Wrote:
>>> Hi Suzuki
>>>
>>> On 5/15/2018 4:36 PM, Suzuki K Poulose Wrote:
>>>>
>>>> Hi Jia
>>>>
>>>> On 05/15/2018 03:03 AM, Jia He wrote:
>>>>> Hi Suzuki
>>>>>
>>>>> I will merge the other thread into this, and add the necessary CC list
>>>>>
>>>>> That WARN_ON call trace is very easy to reproduce in my armv8a server after I
>>>>> start 20 guests
>>>>>
>>>>> and run memhog in the host. Of course, ksm should be enabled
>>>>>
>>>>> For you question about my inject fault debug patch:
>>>>
>>>>
>>>> Thanks for the patch, comments below.
>>>>
>>>>>
>>>>
>>>> ...
>>>>
>>>>> index 7f6a944..ab8545e 100644
>>>>> --- a/virt/kvm/arm/mmu.c
>>>>> +++ b/virt/kvm/arm/mmu.c
>>>>> @@ -290,12 +290,17 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t
>>>>> *pgd,
>>>>>     * destroying the VM), otherwise another faulting VCPU may come in and mess
>>>>>     * with things behind our backs.
>>>>>     */
>>>>> +extern int trigger_by_ksm;
>>>>>    static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64
>>>>> size)
>>>>>    {
>>>>>           pgd_t *pgd;
>>>>>           phys_addr_t addr = start, end = start + size;
>>>>>           phys_addr_t next;
>>>>>
>>>>> +       if(trigger_by_ksm) {
>>>>> +               end -= 0x200;
>>>>> +       }
>>>>> +
>>>>>           assert_spin_locked(&kvm->mmu_lock);
>>>>>           pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>>>>>           do {
>>>>>
>>>>> I need to point out that I never reproduced it without this debugging patch.
>>>>
>>>> That could trigger the panic iff your "size" <= 0x200, leading to the
>>>> condition (end < start), which can make the loop go forever, as we
>>>> do while(addr < end) and end up accessing something which may not be PGD entry
>>>> and thus get a bad page with bad numbers all around. This case could be hit
>>>> only
>>>> with your change and the bug in the KSM which gives us an address near the page
>>>> boundary.
>>> No, I injected the fault on purpose to simulate the case when size is less than
>>> PAGE_SIZE(eg. PAGE_SIZE-0x200=65024)
>>> I ever got the panic info [1] *without* the debugging patch only once
>>>
>>> [1] https://lkml.org/lkml/2018/5/9/992
>>>>
>>>> So, I think we can safely ignore the PANIC().
>>>> More below.
> 
> I am a bit confused now. Do you mean, the panic was triggered *without* the
> debugging
> patch ? If that is the case, it is really worrying.
> 
Hi Suzuki, sorry for my unclear decription before.
Without ksm fixing patch and injecting fault debugging patch, I ever met WARN_ON
storm(easily reproduced) and panic on unmap_stage2_range(only once).
After the injecting fault debugging patch, the panic can be easily reproduced,
hence I thought the panic was also caused by the same root cause.

After ksm fixing patch, the WARN_ON and panic are both disappearred.

What I mean is:
1. if PAGE_SIZE alignment *should* be guaranteed in handle_hva_to_gpa, it would
be better to add a WARN_ON or WARN_ON_ONCE in unmap_stage2_range()

2. if PAGE_SIZE alignment is *not* guaranteed, handle_hva_to_gpa needs to handle
the unalignment cases, I will propose to change the codes in handle_hva_to_gpa.
eg. use (gpa_end-gpa) instead of (hva_end - hva_start)

Cheers,
Jia
>>>>
>>>>
>>>>>>> Suzuki, thanks for the comments.
>>>>>>>
>>>>>>> I proposed another ksm patch https://lkml.org/lkml/2018/5/3/1042
>>>>>>> The root cause is ksm will add some extra flags to indicate that the page
>>>>>>> is in/not_in the stable tree. This makes address not be aligned with
>>>>>>> PAGE_SIZE.
>>>>>> Thanks for the pointer. In the future, please Cc the people relevant to the
>>>>>> discussion in the patches.
>>>>>>
>>>>>>>    From arm kvm mmu point of view, do you think handle_hva_to_gpa still need
>>>>>>> to handle
>>>>>>> the unalignment case?
>>>>>> I don't think we should do that. Had we done this, we would never have caught
>>>>>> this bug
>>>>>> in KSM. Eventually if some other new implementation comes up with the a new
>>>>>> notifier
>>>>>> consumer which doesn't check alignment and doesn't WARN, it could simply do
>>>>>> the wrong
>>>>>> thing. So I believe what we have is a good measure to make sure that
>>>>>> things are
>>>>>> in the right order.
>>>>>>
>>>>>>> IMO, the PAGE_SIZE alignment is still needed because we should not let the
>>>>>>> bottom function
>>>>>>> kvm_age_hva_handler to handle the exception. Please refer to the
>>>>>>> implementation in X86 and
>>>>>>> powerpc kvm_handle_hva_range(). They both aligned the hva with
>>>>>>> hva_to_gfn_memslot.
>>>>>>>
>>>>>>    From an API perspective, you are passed on a "start" and "end" address.
>>>>>> So,
>>>>>> you could potentially
>>>>>> do the wrong thing if you align the "start" and "end". May be those handlers
>>>>>> should also do the
>>>>>> same thing as we do.
>>>>
>>>>> But handle_hva_to_gpa has partially adjusted the alignment possibly:
>>>>>      1750         kvm_for_each_memslot(memslot, slots) {
>>>>>      1751                 unsigned long hva_start, hva_end;
>>>>>      1752                 gfn_t gpa;
>>>>>      1753
>>>>>      1754                 hva_start = max(start, memslot->userspace_addr);
>>>>>      1755                 hva_end = min(end, memslot->userspace_addr +
>>>>>      1756                             (memslot->npages << PAGE_SHIFT));
>>>>>
>>>>> at line 1755, let us assume that end=0x12340200 and
>>>>> memslot->userspace_addr + (memslot->npages << PAGE_SHIFT)=0x12340000
>>>>> Then, hva_start is not page_size aligned and hva_end is aligned, and the size
>>>>> will be PAGE_SIZE-0x200,
>>>>> just as what I had done in the inject fault debugging patch.
>>>>
>>>> Thats because we want to limit the handling of the hva/gpa range by memslot.
>>>> So,
>>>> we make sure we pass on the range within the given memslot
>>>> to hva_to_gfn_memslot(). But we do iterate over the next memslot if the
>>>> original range falls in to the next slot. So, in practice, there is no
>>>> alignment/trimming of the range. Its just that we pass on the appropriate range
>>>> for each slot.
>>>>
>>> Yes, I understand what the codes did in hva_to_gfn_memslot(). What I mean is
>>> hva_end may be changed and (hva_end - hva_start) will not be same as the
>>> parameter _size_ ?
>>>
>>>> ret |= handler(kvm, gpa, (u64)(hva_end - hva_start), data);
>>>
>>> Anyway, I have to admit that all the exceptions are originally caused by the
>>> STABLE_FLAG in ksm code. What I want to discuss here is how to make arm kvm
>>> handle the exception more gracefully.
> 
> Thats my point. We need the fix in ksm. Once we have the fix in place, do
> we hit the WARN_ON() any more ?
> 
>>>
>> Hi Suzuki
>> How about this patch (balance of avoiding the WARN_ON storm and debugging
>> convenience):
> 
> The problem with WARN_ON_ONCE() is, it could potentially suppress the different
> call paths that could lead to the triggers. e.g, unmap_stage2_range() could be
> called from different paths and having a WARN_ON_ONCE() could potentially
> hide the other call paths.
> 
>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>> index 7f6a944..4033946 100644
>> --- a/virt/kvm/arm/mmu.c
>> +++ b/virt/kvm/arm/mmu.c
>> @@ -297,6 +297,8 @@ static void unmap_stage2_range(struct kvm *kvm, phys_addr_t
>> start, u64 size)
>>       phys_addr_t next;
>>
>>       assert_spin_locked(&kvm->mmu_lock);
>> +
>> +    WARN_ON_ONCE(size & ~PAGE_MASK);
>>       pgd = kvm->arch.pgd + stage2_pgd_index(addr);
>>       do {
>>           /*
>> @@ -1792,7 +1794,7 @@ static int kvm_set_spte_handler(struct kvm *kvm, gpa_t
>> gpa, u64 size, void *data
>>   {
>>       pte_t *pte = (pte_t *)data;
>>
>> -    WARN_ON(size != PAGE_SIZE);
>> +    WARN_ON_ONCE(size != PAGE_SIZE);
>>       /*
>>        * We can always call stage2_set_pte with KVM_S2PTE_FLAG_LOGGING_ACTIVE
>>        * flag clear because MMU notifiers will have unmapped a huge PMD before
>> @@ -1823,7 +1825,7 @@ static int kvm_age_hva_handler(struct kvm *kvm, gpa_t gpa,
>> u64 size, void *data)
>>       pmd_t *pmd;
>>       pte_t *pte;
>>
>> -    WARN_ON(size != PAGE_SIZE && size != PMD_SIZE);
>> +    WARN_ON_ONCE(size != PAGE_SIZE && size != PMD_SIZE);
>>       pmd = stage2_get_pmd(kvm, NULL, gpa);
>>       if (!pmd || pmd_none(*pmd))    /* Nothing there */
>>           return 0;
>> @@ -1843,7 +1845,7 @@ static int kvm_test_age_hva_handler(struct kvm *kvm, gpa_t
>> gpa, u64 size, void *
>>       pmd_t *pmd;
>>       pte_t *pte;
>>
>> -    WARN_ON(size != PAGE_SIZE && size != PMD_SIZE);
>> +    WARN_ON_ONCE(size != PAGE_SIZE && size != PMD_SIZE);
>>       pmd = stage2_get_pmd(kvm, NULL, gpa);
>>       if (!pmd || pmd_none(*pmd))    /* Nothing there */
>>           return 0;
>>
> 
> 
> Cheers
> Suzuki
>