[PATCH v2 00/19] mm: Support huge pfnmaps
Jiaqi Yan
jiaqiyan at google.com
Thu Aug 29 12:21:39 PDT 2024
On Wed, Aug 28, 2024 at 4:50 PM Jason Gunthorpe <jgg at nvidia.com> wrote:
>
> On Wed, Aug 28, 2024 at 09:10:34AM -0700, Jiaqi Yan wrote:
> > On Wed, Aug 28, 2024 at 7:24 AM Jason Gunthorpe <jgg at nvidia.com> wrote:
> > >
> > > On Tue, Aug 27, 2024 at 05:42:21PM -0700, Jiaqi Yan wrote:
> > >
> > > > Instead of removing the whole pud, can driver or memory_failure do
> > > > something similar to non-struct-page-version of split_huge_page? So
> > > > driver doesn't need to re-fault good pages back?
> > >
> > > It would be far nicer if we didn't have to poke a hole in a 1G mapping
> > > just for memory failure reporting.
> >
> > If I follow this, which of the following sounds better? 1. remove pud
> > and rely on the driver to re-fault PFNs that it knows are not poisoned
> > (what Peter suggested), or 2. keep the pud and allow access to both
> > good and bad PFNs.
>
> In practice I think people will need 2, as breaking up a 1G mapping
> just because a few bits are bad will destroy the VM performance.
>
Totally agreed.
> For this the expectation would be for the VM to co-operate and not
> keep causing memory failures, or perhaps for the platform to spare in
> good memory somehow.
Yes, whether a VM gets into a memory-error-consumption loop
maliciously or accidentally, a reasonable VMM should have a way to
detect the loop and break it.
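(Purely as an illustration of what "detect and break it" could look
like on the VMM side, and not taken from any existing VMM: the sketch
below counts how often the guest trips over the same poisoned GPA and
tells the caller to stop re-entering the vCPU once a threshold is
crossed. All names here -- poison_tracker, vmm_should_stop_vcpu,
REPEAT_LIMIT -- are made up.)

/*
 * Hypothetical VMM-side helper, for illustration only: count hits on
 * each poisoned guest physical address and ask the caller to stop
 * re-entering the vCPU once the same GPA keeps being consumed.
 */
#include <stdbool.h>
#include <stddef.h>

#define MAX_TRACKED_GPAS 64
#define REPEAT_LIMIT      3   /* same GPA hit this many times => break the loop */

struct poison_tracker {
        unsigned long gpa[MAX_TRACKED_GPAS];
        unsigned int hits[MAX_TRACKED_GPAS];
        size_t nr;
};

/* Called after the SIGBUS/MCE path has resolved the faulting GPA. */
static bool vmm_should_stop_vcpu(struct poison_tracker *pt, unsigned long gpa)
{
        size_t i;

        for (i = 0; i < pt->nr; i++)
                if (pt->gpa[i] == gpa)
                        return ++pt->hits[i] >= REPEAT_LIMIT;

        if (pt->nr < MAX_TRACKED_GPAS) {
                pt->gpa[pt->nr] = gpa;
                pt->hits[pt->nr] = 1;
                pt->nr++;
        }
        return false;
}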
>
> > Or provide some knob (configured by ?) so that kernel + driver can
> > switch between the two?
>
> This is also sounding reasonable, especially if we need some
> alternative protocol to signal userspace about the failed memory
> besides fault and SIGBUS.
To clarify, what I have in mind is a knob, say one named
"sysctl_enable_hard_offline", configured by userspace.
Applied to Ankit's memory_failure_pfn patch[*], it would look like this:
static int memory_failure_pfn(unsigned long pfn, int flags)
{
        struct interval_tree_node *node;
        int res = MF_FAILED;
        LIST_HEAD(tokill);

        mutex_lock(&pfn_space_lock);
        for (node = interval_tree_iter_first(&pfn_space_itree, pfn, pfn); node;
             node = interval_tree_iter_next(node, pfn, pfn)) {
                struct pfn_address_space *pfn_space =
                        container_of(node, struct pfn_address_space, node);

                /* Let the driver know about the poisoned PFN first. */
                if (pfn_space->ops)
                        pfn_space->ops->failure(pfn_space, pfn);

                collect_procs_pgoff(NULL, pfn_space->mapping, pfn, &tokill);

                /* Only poke a hole in the mapping if hard offline is enabled. */
                if (sysctl_enable_hard_offline)
                        unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT,
                                            PAGE_SIZE, 0);

                res = MF_RECOVERED;
        }
        mutex_unlock(&pfn_space_lock);

        if (res == MF_FAILED)
                return action_result(pfn, MF_MSG_PFN_MAP, res);

        flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
        kill_procs(&tokill, true, false, pfn, flags);

        return action_result(pfn, MF_MSG_PFN_MAP, MF_RECOVERED);
}
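(To show what I mean by "configured by userspace": the knob could be
exposed like other vm sysctls, roughly as below. This is only a sketch,
not part of Ankit's patch; the name, default value, and the exact
registration interface depend on the kernel version and are up for
discussion.)

static int sysctl_enable_hard_offline = 1;

static struct ctl_table memory_failure_sysctls[] = {
        {
                .procname       = "enable_hard_offline",
                .data           = &sysctl_enable_hard_offline,
                .maxlen         = sizeof(sysctl_enable_hard_offline),
                .mode           = 0644,
                .proc_handler   = proc_dointvec_minmax,
                .extra1         = SYSCTL_ZERO,
                .extra2         = SYSCTL_ONE,
        },
};

static int __init memory_failure_sysctl_init(void)
{
        register_sysctl_init("vm", memory_failure_sysctls);
        return 0;
}
late_initcall(memory_failure_sysctl_init);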
I think we still want to attempt to SIGBUS userspace, regardless of
whether we do the unmap_mapping_range or not.
[*] https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@nvidia.com/#t
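(On the userspace side, that SIGBUS already carries everything the VMM
needs: si_code distinguishes action-required from action-optional,
si_addr gives the poisoned user address, and si_addr_lsb the
granularity. A rough sketch of the handler, with made-up function
names, just to show how the siginfo is decoded:)

#define _GNU_SOURCE
#include <signal.h>

/* Hypothetical VMM SIGBUS handler; only shows how siginfo is decoded. */
static void vmm_sigbus_handler(int sig, siginfo_t *si, void *ucontext)
{
        if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
                void *addr = si->si_addr;             /* poisoned user address */
                unsigned long lsb = si->si_addr_lsb;  /* e.g. PAGE_SHIFT */

                /*
                 * Translate addr to a GPA, record it (see the loop-breaking
                 * sketch above), and inject an MCE into the guest or pause
                 * the vCPU as policy dictates.
                 */
                (void)addr;
                (void)lsb;
        }
}

static void vmm_install_sigbus_handler(void)
{
        struct sigaction sa = {
                .sa_sigaction = vmm_sigbus_handler,
                .sa_flags     = SA_SIGINFO,
        };

        sigaction(SIGBUS, &sa, NULL);
}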
>
> Jason