[PATCH] KVM: arm/arm64: fix unaligned hva start and end in handle_hva_to_gpa

Mon May 14 19:03:30 PDT 2018

Hi Suzuki

I will merge the other thread into this, and add the necessary CC list

That WARN_ON call trace is very easy to reproduce in my armv8a server after I 
start 20 guests

and run memhog in the host. Of course, ksm should be enabled

For you question about my inject fault debug patch:

diff --git a/mm/ksm.c b/mm/ksm.c
index e3cbf9a..876bec8 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -43,6 +43,8 @@
  #include <asm/tlbflush.h>
  #include "internal.h"

+int trigger_by_ksm = 0;
+
  #ifdef CONFIG_NUMA
  #define NUMA(x)                (x)
  #define DO_NUMA(x)     do { (x); } while (0)
@@ -2587,11 +2589,14 @@ void rmap_walk_ksm(struct page *page, struct 
rmap_walk_control *rwc)
                         if (rwc->invalid_vma && rwc->invalid_vma(vma, rwc->arg))
                                 continue;

+trigger_by_ksm = 1;
                         if (!rwc->rmap_one(page, vma,
                                         rmap_item->address, rwc->arg)) {
                                 anon_vma_unlock_read(anon_vma);
+trigger_by_ksm = 0;
                                 return;
                         }
+trigger_by_ksm = 0;
                         if (rwc->done && rwc->done(page)) {
                                 anon_vma_unlock_read(anon_vma);
                                 return;
diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
index 7f6a944..ab8545e 100644
--- a/virt/kvm/arm/mmu.c
+++ b/virt/kvm/arm/mmu.c
@@ -290,12 +290,17 @@ static void unmap_stage2_puds(struct kvm *kvm, pgd_t *pgd,
   * destroying the VM), otherwise another faulting VCPU may come in and mess
   * with things behind our backs.
   */
+extern int trigger_by_ksm;
  static void unmap_stage2_range(struct kvm *kvm, phys_addr_t start, u64 size)
  {
         pgd_t *pgd;
         phys_addr_t addr = start, end = start + size;
         phys_addr_t next;

+       if(trigger_by_ksm) {
+               end -= 0x200;
+       }
+
         assert_spin_locked(&kvm->mmu_lock);
         pgd = kvm->arch.pgd + stage2_pgd_index(addr);
         do {

I need to point out that I never reproduced it without this debugging patch.

Please also see my comments below

On 5/14/2018 6:06 PM, Suzuki K Poulose Wrote:
> On 14/05/18 03:30, Jia He wrote:
>>
>> On 5/11/2018 9:39 PM, Suzuki K Poulose Wrote:
>>> Marc
>>>
>>> Thanks for looping me in. Comments below.
>>>
>>>
>>> On 03/05/18 03:02, Jia He wrote:
>>>> Hi Marc
>>>>
>>>> Thanks for the review
>>>>
>>>>
>>>> On 5/2/2018 10:26 PM, Marc Zyngier Wrote:
>>>>> [+ Suzuki]
>>>>>
>>>>> On 02/05/18 08:08, Jia He wrote:
>>>>>> From: Jia He <jia.he at hxt-semitech.com>
>>>>>>
>>>>>> In our armv8a server (QDF2400), I noticed a WARN_ON as follows:
>>>>>>
>>>>>> [  800.202850] WARNING: CPU: 33 PID: 255 at arch/arm64/kvm/../../../virt/kvm/arm/mmu.c:1670 kvm_age_hva_handler+0xcc/0xd4
>>>>> Which kernel version is that? I don't have a WARN_ON() at this line in
>>>>> 4.17. Do you have a reproducer?
>>>> My running kernel version is v4.14-15, but I can reproduced it in 4.17 (start 20 guests and run memhog in the host)
>>>> In 4.17, the warn_on is at
>>>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/virt/kvm/arm/mmu.c#n1826
>>>>>> [  800.213535] Modules linked in: vhost_net vhost tap xt_CHECKSUM ipt_MASQUERADE nf_nat_masquerade_ipv4 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm vfat fat iw_cm mlx5_ib ib_core dm_mirror dm_region_hash dm_log dm_mod crc32_ce ipmi_ssif sg nfsd
>>>>>> [  800.284115]  auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c mlx5_core ixgbe mlxfw devlink mdio ahci_platform libahci_platform qcom_emac libahci hdma hdma_mgmt i2c_qup
>>>>>> [  800.300382] CPU: 33 PID: 255 Comm: khugepaged Tainted: G        W       4.14.36+ #6
>>>>>> [  800.308030] Hardware name: <snip for confidential issues>
>>>>> Well, that's QDF2400, right? ;-)
>>>> yes, exactly :)
>>>>>> [  800.318717] task: ffff8017c949c000 task.stack: ffff8017c9498000
>>>>>> [  800.324629] PC is at kvm_age_hva_handler+0xcc/0xd4
>>>>>> [  800.329412] LR is at handle_hva_to_gpa+0xec/0x15c
>>>>>> [  800.334109] pc : [<ffff0000080b4f2c>] lr : [<ffff0000080b4838>] pstate: 20400145
>>>>>> [  800.341496] sp : ffff8017c949b260
>>>>>> [  800.344804] x29: ffff8017c949b260 x28: ffff801663e25008
>>>>>> [  800.350110] x27: 0000000000020000 x26: 00000001fb1a0000
>>>>>> [  800.355416] x25: 0000ffff605b0200 x24: 0000ffff605a0200
>>>>>> [  800.360722] x23: 0000000000000000 x22: 000000000000ffff
>>>>>> [  800.366028] x21: 00000001fb1a0000 x20: ffff8017c085a000
>>>>>> [  800.371334] x19: ffff801663e20008 x18: 0000000000000000
>>>>>> [  800.376641] x17: 0000000000000000 x16: 0000000000000000
>>>>>> [  800.381947] x15: 0000000000000000 x14: 3d646e655f617668
>>>>>> [  800.387254] x13: 2c30303230623530 x12: 36666666663d646e
>>>>>> [  800.392560] x11: 652c303032306135 x10: 3036666666663d74
>>>>>> [  800.397867] x9 : 0000000000003796 x8 : 655f6e66672c3030
>>>>>> [  800.403173] x7 : ffff00000859434c x6 : ffff8017f9c30cb8
>>>>>> [  800.408479] x5 : ffff8017f9c30cb8 x4 : ffff0000080b4e60
>>>>>> [  800.413786] x3 : 0000000000000000 x2 : 0000000000020000
>>>>>> [  800.419092] x1 : 00000001fb1a0000 x0 : 0000000020000000
>>>>>> [  800.424398] Call trace:
>>>>>> [  800.426838] Exception stack(0xffff8017c949b120 to 0xffff8017c949b260)
>>>>>> [  800.433272] b120: 0000000020000000 00000001fb1a0000 0000000000020000 0000000000000000
>>>>>> [  800.441095] b140: ffff0000080b4e60 ffff8017f9c30cb8 ffff8017f9c30cb8 ffff00000859434c
>>>>>> [  800.448918] b160: 655f6e66672c3030 0000000000003796 3036666666663d74 652c303032306135
>>>>>> [  800.456740] b180: 36666666663d646e 2c30303230623530 3d646e655f617668 0000000000000000
>>>>>> [  800.464563] b1a0: 0000000000000000 0000000000000000 0000000000000000 ffff801663e20008
>>>>>> [  800.472385] b1c0: ffff8017c085a000 00000001fb1a0000 000000000000ffff 0000000000000000
>>>>>> [  800.480208] b1e0: 0000ffff605a0200 0000ffff605b0200 00000001fb1a0000 0000000000020000
>>>>>> [  800.488030] b200: ffff801663e25008 ffff8017c949b260 ffff0000080b4838 ffff8017c949b260
>>>>>> [  800.495853] b220: ffff0000080b4f2c 0000000020400145 0000000000000001 ffff8017c949b2a0
>>>>>> [  800.503676] b240: ffffffffffffffff ffff8017c949b260 ffff8017c949b260 ffff0000080b4f2c
>>>>>> [  800.511498] [<ffff0000080b4f2c>] kvm_age_hva_handler+0xcc/0xd4
>>>>>> [  800.517324] [<ffff0000080b4838>] handle_hva_to_gpa+0xec/0x15c
>>>>>> [  800.523063] [<ffff0000080b6c5c>] kvm_age_hva+0x5c/0xcc
>>>>>> [  800.528194] [<ffff0000080a7c3c>] kvm_mmu_notifier_clear_flush_young+0x54/0x90
>>>>>> [  800.535324] [<ffff00000827a0e8>] __mmu_notifier_clear_flush_young+0x6c/0xa8
>>>>>> [  800.542279] [<ffff00000825a644>] page_referenced_one+0x1e0/0x1fc
>>>>>> [  800.548279] [<ffff00000827e8f8>] rmap_walk_ksm+0x124/0x1a0
>>>>>> [  800.553759] [<ffff00000825c974>] rmap_walk+0x94/0x98
>>>>>> [  800.558717] [<ffff00000825ca98>] page_referenced+0x120/0x180
>>>>>> [  800.564369] [<ffff000008228c58>] shrink_active_list+0x218/0x4a4
>>>>>> [  800.570281] [<ffff000008229470>] shrink_node_memcg+0x58c/0x6fc
>>>>>> [  800.576107] [<ffff0000082296c4>] shrink_node+0xe4/0x328
>>>>>> [  800.581325] [<ffff000008229c9c>] do_try_to_free_pages+0xe4/0x3b8
>>>>>> [  800.587324] [<ffff00000822a094>] try_to_free_pages+0x124/0x234
>>>>>> [  800.593150] [<ffff000008216aa0>] __alloc_pages_nodemask+0x564/0xf7c
>>>>>> [  800.599412] [<ffff000008292814>] khugepaged_alloc_page+0x38/0xb8
>>>>>> [  800.605411] [<ffff0000082933bc>] collapse_huge_page+0x74/0xd70
>>>>>> [  800.611238] [<ffff00000829470c>] khugepaged_scan_mm_slot+0x654/0xa98
>>>>>> [  800.617585] [<ffff000008294e0c>] khugepaged+0x2bc/0x49c
>>>>>> [  800.622803] [<ffff0000080ffb70>] kthread+0x124/0x150
>>>>>> [  800.627762] [<ffff0000080849f0>] ret_from_fork+0x10/0x1c
>>>>>> [  800.633066] ---[ end trace 944c130b5252fb01 ]---
>>>>>> -------------------------------------------------------------------------
>>>>>>
>>>>>> The root cause might be: we can't guarantee that the parameter start and end
>>>>>> in handle_hva_to_gpa is PAGE_SIZE aligned, let alone hva_start and hva_end.
>>>>> So why not aligning them the first place?
>>>> at the first place of handle_hva_to_gpa()?
>>>> but boundary check is needed in each loop of kvm_for_each_memslot. Am I missing anything here?
>>>>>> This bug is introduced by commit 056aad67f836 ("kvm: arm/arm64: Rework gpa
>>>>>> callback handlers")
>>>>>>
>>>>>> It fixes the bug by use pfn size converted.
>>>>>>
>>>>>> Fixes: 056aad67f836 ("kvm: arm/arm64: Rework gpa callback handlers")
>>>>>>
>>>>>> Signed-off-by: jia.he at hxt-semitech.com
>>>>>> Signed-off-by: li.zhang at hxt-semitech.com
>>>>>> ---
>>>>>>     virt/kvm/arm/mmu.c | 10 ++++++----
>>>>>>     1 file changed, 6 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/virt/kvm/arm/mmu.c b/virt/kvm/arm/mmu.c
>>>>>> index 7f6a944..9dd7ae4 100644
>>>>>> --- a/virt/kvm/arm/mmu.c
>>>>>> +++ b/virt/kvm/arm/mmu.c
>>>>>> @@ -1744,7 +1744,7 @@ static int handle_hva_to_gpa(struct kvm *kvm,
>>>>>>         /* we only care about the pages that the guest sees */
>>>>>>         kvm_for_each_memslot(memslot, slots) {
>>>>>>             unsigned long hva_start, hva_end;
>>>>>> -        gfn_t gpa;
>>>>>> +        gpa_t gpa, gpa_end;
>>>>>>             hva_start = max(start, memslot->userspace_addr);
>>>>>>             hva_end = min(end, memslot->userspace_addr +
>>>>>> @@ -1753,7 +1753,9 @@ static int handle_hva_to_gpa(struct kvm *kvm,
>>>>>>                 continue;
>>>>>>             gpa = hva_to_gfn_memslot(hva_start, memslot) << PAGE_SHIFT;
>>>>>> -        ret |= handler(kvm, gpa, (u64)(hva_end - hva_start), data);
>>>>>> +        gpa_end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, memslot)
>>>>>> +                             << PAGE_SHIFT;
>>>>>> +        ret |= handler(kvm, gpa, (u64)(gpa_end - gpa), data);
>>>>> But we're looking for the mapping in the same memslot, so the distance
>>>>> between hva and hva_end is the same as the one between gpa and gpa_end
>>>>> if you didn't align it.
>>>> maybe not, sometimes hva_end-hva != gpa_end-gpa
>>>> start=fffdc37f0200,hva_start=fffdc37f0200,end=fffdc3800200,hva_end=fffdc3800000,gpa=3ff0000,gfn_end=4000000
>>>>
>>>> but sometimes it is:
>>>> start=ffff60590200,hva_start=ffff60590200,end=ffff605a0200,hva_end=ffff605a0200,gpa=1fb190000,gfn_end=1fb1b0000
>>>>
>>>> IMO, the unalignment is caused by the ksm stable page flag STABLE_FLAG. I will
>>>> propose another ksm patch to fix it。
>>>> But from handle_hva_to_gpa's point of view, arm kvm needs to void the followup
>>>> exception, just like what powerpc andx86 have done.
>>>
>>> As far as I can see this is triggered by someone (in this page_referenced_one via ksm?)
>>> triggering a clear_flush_young for a page, with a non-aligned page address.
>>>
>>> If you look at the code path, the __mmu_notifier_clear_flush_young is invoked
>>> via 2 code paths with the "given" address.
>>>
>>> ptep_clear_flush_young_notify(), in which case the end is set to + PAGE_SIZE
>>> pmdp_clear_flush_young_notify(), in which case the end is set to + PMD_SIZE
>>>
>>> We were supposed to only clear_flush_young for *the page* containing
>>> address (start), but we do a clear_flush_young for the next page
>>> as well, which (I think) is not something intended. So to me, it looks like, either
>>> page_referenced_one() or its caller must align the address to the PAGE_SIZE
>>> or PMD_SIZE depending on what it really wants to do, to avoid touching
>>> the adjacent entries (page or block pages).
>>>
>>> Suzuki
> Jia He,
>
>> Suzuki, thanks for the comments.
>>
>> I proposed another ksm patch https://lkml.org/lkml/2018/5/3/1042
>> The root cause is ksm will add some extra flags to indicate that the page
>> is in/not_in the stable tree. This makes address not be aligned with PAGE_SIZE.
> Thanks for the pointer. In the future, please Cc the people relevant to the
> discussion in the patches.
>
>>   From arm kvm mmu point of view, do you think handle_hva_to_gpa still need to handle
>> the unalignment case?
> I don't think we should do that. Had we done this, we would never have caught this bug
> in KSM. Eventually if some other new implementation comes up with the a new notifier
> consumer which doesn't check alignment and doesn't WARN, it could simply do the wrong
> thing. So I believe what we have is a good measure to make sure that things are
> in the right order.
>
>> IMO, the PAGE_SIZE alignment is still needed because we should not let the bottom function
>> kvm_age_hva_handler to handle the exception. Please refer to the implementation in X86 and
>> powerpc kvm_handle_hva_range(). They both aligned the hva with hva_to_gfn_memslot.
>>
>   From an API perspective, you are passed on a "start" and "end" address. So, you could potentially
> do the wrong thing if you align the "start" and "end". May be those handlers should also do the
> same thing as we do.
But handle_hva_to_gpa has partially adjusted the alignment possibly:
    1750         kvm_for_each_memslot(memslot, slots) {
    1751                 unsigned long hva_start, hva_end;
    1752                 gfn_t gpa;
    1753
    1754                 hva_start = max(start, memslot->userspace_addr);
    1755                 hva_end = min(end, memslot->userspace_addr +
    1756                             (memslot->npages << PAGE_SHIFT));

at line 1755, let us assume that end=0x12340200 and
memslot->userspace_addr + (memslot->npages << PAGE_SHIFT)=0x12340000
Then, hva_start is not page_size aligned and hva_end is aligned, and the size 
will be PAGE_SIZE-0x200,
just as what I had done in the inject fault debugging patch.

-- 
Cheers,
Jia