[PATCH v3 3/6] mm: handle poisoning of pfn without struct pages

HORIGUCHI NAOYA(堀口 直也) naoya.horiguchi at nec.com
Mon May 22 22:43:11 PDT 2023


On Mon, May 15, 2023 at 11:18:16AM +0000, Ankit Agrawal wrote:
> Thanks, Naoya for reviewing the patch. Comments inline.
> 
> > -----Original Message-----
> > From: HORIGUCHI NAOYA(堀口 直也) <naoya.horiguchi at nec.com>
> > Sent: Tuesday, May 9, 2023 2:51 AM
> > To: Ankit Agrawal <ankita at nvidia.com>
> > Cc: Jason Gunthorpe <jgg at nvidia.com>; alex.williamson at redhat.com;
> > maz at kernel.org; oliver.upton at linux.dev; Aniket Agashe
> > <aniketa at nvidia.com>; Neo Jia <cjia at nvidia.com>; Kirti Wankhede
> > <kwankhede at nvidia.com>; Tarun Gupta (SW-GPU) <targupta at nvidia.com>;
> > Vikram Sethi <vsethi at nvidia.com>; Andy Currid <acurrid at nvidia.com>;
> > Alistair Popple <apopple at nvidia.com>; John Hubbard
> > <jhubbard at nvidia.com>; Dan Williams <danw at nvidia.com>;
> > kvm at vger.kernel.org; linux-kernel at vger.kernel.org; linux-arm-
> > kernel at lists.infradead.org; linux-mm at kvack.org
> > Subject: Re: [PATCH v3 3/6] mm: handle poisoning of pfn without struct pages
> > 
> > On Wed, Apr 05, 2023 at 11:01:31AM -0700, ankita at nvidia.com wrote:
> > > From: Ankit Agrawal <ankita at nvidia.com>
> > >
> > > The kernel MM does not currently handle ECC errors / poison on a
> > > memory region that is not backed by struct pages. In this series,
> > > the mapping request from QEMU for the device memory is fulfilled
> > > using remap_pfn_range(). Hence a new mechanism is added to handle
> > > memory failure on such memory.
> > >
> > > Make the kernel MM expose a function to allow modules managing the
> > > device memory to register a failure handler and the address space
> > > associated with the device memory. MM maintains this information as
> > > an interval tree. The registered memory failure handler is used by
> > > MM to notify the module of the faulty PFN, so that the module may
> > > take any required action. The module may, for example, use the
> > > information to track the poisoned pages.
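
For context, the registration interface implied by the hunks below is
roughly the following (a sketch reconstructed from the usage; the exact
layout in the patch may differ):

	struct pfn_address_space_ops {
		/* called on poison detection for a pfn inside the region */
		void (*failure)(struct pfn_address_space *pfn_space,
				unsigned long pfn);
	};

	struct pfn_address_space {
		/* pfn interval [start, last] covered by this region */
		struct interval_tree_node node;
		const struct pfn_address_space_ops *ops;
		/* address_space used to unmap and to collect processes */
		struct address_space *mapping;
	};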
> > >
> > > In this implementation, kernel MM follows a sequence (mostly)
> > > similar to the memory_failure() handler for struct-page-backed
> > > memory:
> > > 1. memory_failure() is triggered on reception of a poison error. The
> > > absence of a struct page is detected and consequently
> > > memory_failure_pfn() is executed.
> > > 2. memory_failure_pfn() calls the newly introduced failure handler
> > > exposed by the module managing the poisoned memory to notify it of
> > > the problematic PFN.
> > > 3. memory_failure_pfn() unmaps the stage-2 mapping to the PFN.
> > > 4. memory_failure_pfn() collects the processes mapped to the PFN.
> > > 5. memory_failure_pfn() sends SIGBUS (BUS_MCEERR_AO) to all the
> > > processes mapping the faulty PFN using kill_procs().
> > > 6. A later access to the faulty PFN by the VM is trapped and
> > > user_mem_abort() is called.
> > > 7. user_mem_abort() calls __gfn_to_pfn_memslot() on the PFN, and the
> > > following execution path is followed: __gfn_to_pfn_memslot() ->
> > > hva_to_pfn() -> hva_to_pfn_remapped() -> fixup_user_fault() ->
> > > handle_mm_fault() -> handle_pte_fault() -> do_fault(). do_fault() is
> > > expected to return VM_FAULT_HWPOISON for the PFN (it currently does
> > > not and is fixed as part of another patch in this series).
> > > 8. __gfn_to_pfn_memslot() then returns KVM_PFN_ERR_HWPOISON, which
> > > causes SIGBUS (BUS_MCEERR_AR) to be sent to the QEMU process through
> > > kvm_send_hwpoison_signal() (see the snippet below).
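
For context, the KVM/arm64 side of steps 6-8 already handles a poisoned
pfn roughly as follows (a paraphrase of user_mem_abort() in
arch/arm64/kvm/mmu.c as of this cycle; exact signatures vary across
kernel versions):

	pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
				   write_fault, &writable, NULL);
	if (pfn == KVM_PFN_ERR_HWPOISON) {
		/* deliver BUS_MCEERR_AR to the faulting (QEMU) thread */
		kvm_send_hwpoison_signal(hva, vma_shift);
		return 0;
	}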
> > >
> > > Signed-off-by: Ankit Agrawal <ankita at nvidia.com>
> > > ---
> > >  include/linux/memory-failure.h |  22 +++++
> > >  include/linux/mm.h             |   1 +
> > >  include/ras/ras_event.h        |   1 +
> > >  mm/memory-failure.c            | 148 +++++++++++++++++++++++++++++----
> > >  4 files changed, 154 insertions(+), 18 deletions(-)
> > >  create mode 100644 include/linux/memory-failure.h
> > >
...

> > > @@ -2052,6 +2063,99 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
> > >       return rc;
> > >  }
> > >
> > > +/**
> > > + * register_pfn_address_space - Register PA region for poison notification.
> > > + * @pfn_space: structure containing region range and callback function on
> > > + *             poison detection.
> > > + *
> > > + * This function is called by a kernel module to register a PA region and
> > > + * a callback function with the kernel. On detection of poison, the
> > > + * kernel code will go through all registered regions and call the
> > > + * appropriate callback function associated with the range. The kernel
> > > + * module is responsible for tracking the poisoned pages.
> > > + *
> > > + * Return: 0 if successfully registered,
> > > + *         -EBUSY if the region is already registered
> > > + */
> > > +int register_pfn_address_space(struct pfn_address_space *pfn_space)
> > > +{
> > > +     if (!request_mem_region(pfn_space->node.start << PAGE_SHIFT,
> > > +             (pfn_space->node.last - pfn_space->node.start + 1) << PAGE_SHIFT, ""))
> > > +             return -EBUSY;
> > > +
> > > +     mutex_lock(&pfn_space_lock);
> > > +     interval_tree_insert(&pfn_space->node, &pfn_space_itree);
> > > +     mutex_unlock(&pfn_space_lock);
> > > +
> > > +     return 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(register_pfn_address_space);
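
As a driver-side usage sketch (the nvgrace_* names are hypothetical, not
from this series):

	static void nvgrace_pfn_failure(struct pfn_address_space *pfn_space,
					unsigned long pfn)
	{
		/* e.g. record the pfn so that later faults can be failed */
	}

	static const struct pfn_address_space_ops nvgrace_pfn_ops = {
		.failure = nvgrace_pfn_failure,
	};

	/* at mmap time, after mapping device memory via remap_pfn_range() */
	pfn_space->node.start = base_pfn;
	pfn_space->node.last = base_pfn + nr_pfns - 1;
	pfn_space->ops = &nvgrace_pfn_ops;
	pfn_space->mapping = vma->vm_file->f_mapping;
	ret = register_pfn_address_space(pfn_space);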
> > 
> > register_pfn_address_space and unregister_pfn_address_space are not
> > compiled if CONFIG_MEMORY_FAILURE is not set, so maybe your new driver
> > might need to depend on this config.
> > 
> 
> Yeah, that's right. Shall I put the parts of the vfio-pci variant driver
> related to poison handling inside #ifdef CONFIG_MEMORY_FAILURE?
> Or would you rather prefer I make the NVGPU_VFIO_PCI config depend
> on CONFIG_MEMORY_FAILURE?

If the vfio-pci variant driver plans to provide features unrelated to poison
handling, the "ifdef" option looks fine to me.  Otherwise, the second option
would also work.  Both options look comparably reasonable to me.
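
Concretely, the first option would wrap only the poison-handling pieces
in the driver (nvdev here is a hypothetical driver-private structure):

	#ifdef CONFIG_MEMORY_FAILURE
		ret = register_pfn_address_space(&nvdev->pfn_space);
	#endif

while the second option would be a one-line Kconfig dependency:

	config NVGPU_VFIO_PCI
		depends on MEMORY_FAILURE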

> 
> > > +
> > > +/**
> > > + * unregister_pfn_address_space - Unregister a PA region from poison notification.
> > > + * @pfn_space: structure containing region range to be unregistered.
> > > + *
> > > + * This function is called by a kernel module to unregister the PA region
> > > + * from the kernel, removing it from poison tracking.
> > > + */
> > > +void unregister_pfn_address_space(struct pfn_address_space *pfn_space)
> > > +{
> > > +     mutex_lock(&pfn_space_lock);
> > > +     interval_tree_remove(&pfn_space->node, &pfn_space_itree);
> > > +     mutex_unlock(&pfn_space_lock);
> > > +     release_mem_region(pfn_space->node.start << PAGE_SHIFT,
> > > +             (pfn_space->node.last - pfn_space->node.start + 1) << PAGE_SHIFT);
> > > +}
> > > +EXPORT_SYMBOL_GPL(unregister_pfn_address_space);
> > > +
> > > +static int memory_failure_pfn(unsigned long pfn, int flags)
> > > +{
> > > +     struct interval_tree_node *node;
> > > +     int rc = -EBUSY;
> > > +     LIST_HEAD(tokill);
> > > +
> > > +     mutex_lock(&pfn_space_lock);
> > > +     /*
> > > +      * Modules register with MM the address space mapping to the
> > > +      * device memory they manage. Iterate to identify exactly which
> > > +      * address space has mapped to this failing PFN.
> > > +      */
> > > +     for (node = interval_tree_iter_first(&pfn_space_itree, pfn, pfn); node;
> > > +          node = interval_tree_iter_next(node, pfn, pfn)) {
> > > +             struct pfn_address_space *pfn_space =
> > > +                     container_of(node, struct pfn_address_space, node);
> > > +             rc = 0;
> > > +
> > > +             /*
> > > +              * Modules managing the device memory need to be notified
> > > +              * of the memory failure so that the poisoned PFN can be
> > > +              * tracked.
> > > +              */
> > > +             pfn_space->ops->failure(pfn_space, pfn);
> > > +
> > > +             collect_procs_pgoff(NULL, pfn_space->mapping, pfn, &tokill);
> > > +
> > > +             unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT,
> > > +                                 PAGE_SIZE, 0);
> > > +     }
> > > +     mutex_unlock(&pfn_space_lock);
> > 
> > If rc == 0 at this point, the given pfn seems to be outside the GPU memory, so
> > that should be considered as the "Invalid address" case whose check is removed
> > by patch 5/6.  So it might be better to separate that case from the "do
> > handling for non struct page pfn" case.
> > 
> 
> Sorry, did you mean rc != 0 here? But yeah, you are right that I should add a
> check for the case where a region with the desired PFN isn't found above.
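
For example (just a sketch), the not-found case could reuse the existing
"memory outside kernel control" handling after the loop:

	mutex_unlock(&pfn_space_lock);

	if (rc) {
		/* no registered region covers this pfn */
		pr_err("%#lx: memory outside kernel control\n", pfn);
		return -ENXIO;
	}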
> 
> > > +
> > > +     /*
> > > +      * Unlike System-RAM there is no possibility to swap in a different
> > > +      * physical page at a given virtual address, so all userspace
> > > +      * consumption of direct PFN memory necessitates SIGBUS (i.e.
> > > +      * MF_MUST_KILL)
> > > +      */
> > > +     flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
> > > +     kill_procs(&tokill, true, false, pfn, flags);
> > > +
> > > +     pr_err("%#lx: recovery action for %s: %s\n",
> > > +                     pfn, action_page_types[MF_MSG_PFN],
> > > +                     action_name[rc ? MF_FAILED : MF_RECOVERED]);
> > 
> > Could you call action_result() to print out the summary line?
> > It also handles some other common things like accounting and the tracepoint.
> > 
> 
> Ack.
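
For reference, the replacement could look something like this (a sketch
assuming the current action_result() signature; MF_MSG_PFN is the message
type added by this patch):

	return action_result(pfn, MF_MSG_PFN, rc ? MF_FAILED : MF_RECOVERED);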
> 
> > > +
> > > +     return rc;
> > > +}
> > > +
> > >  static DEFINE_MUTEX(mf_mutex);
> > >
> > >  /**
> > > @@ -2093,6 +2197,11 @@ int memory_failure(unsigned long pfn, int flags)
> > >       if (!(flags & MF_SW_SIMULATED))
> > >               hw_memory_failure = true;
> > >
> > > +     if (!pfn_valid(pfn) && !arch_is_platform_page(PFN_PHYS(pfn))) {
> > > +             res = memory_failure_pfn(pfn, flags);
> > > +             goto unlock_mutex;
> > > +     }
> > 
> > This might break an existing corner case with DAX devices, so maybe this
> > should be checked after confirming that pfn_to_online_page() returns NULL.
> > 
> 
> Sorry, it isn't clear to me which corner case could break here. Can pfn_valid()
> be false on a DAX device page?

I thought of that possibility, but my comment relied on my vague/wrong memory, sorry.
pfn_valid() should always be true for DAX device pages, so the above check should
not break anything.

> 
> > > +
> > >       p = pfn_to_online_page(pfn);
> > >       if (!p) {
> > >               res = arch_memory_failure(pfn, flags);
> > > @@ -2106,6 +2215,9 @@ int memory_failure(unsigned long pfn, int flags)
> > >                                                                pgmap);
> > >                               goto unlock_mutex;
> > >                       }
> > > +
> > > +                     res = memory_failure_pfn(pfn, flags);
> > > +                     goto unlock_mutex;
> > 
> > This path is taken when pfn_valid() returns true, which cannot happen in
> > the GPU memory case, can it?
> > 
> 
> Good catch. That needs to be removed.
> 
> > Thanks,
> > Naoya Horiguchi
> > 
> > >               }
> > >               pr_err("%#lx: memory outside kernel control\n", pfn);
> > >               res = -ENXIO;
> > > --
> > > 2.17.1
> 
> On a separate note, would you prefer that I split out the poison handling
> parts (i.e. 3/6 and 5/6) into their own series? Or should I just keep the
> whole series together?

The new poison handling code is only used once the vfio-pci driver is available,
so I think they may as well be merged together.

Thanks,
Naoya Horiguchi

