[PATCH v2 00/19] mm: Support huge pfnmaps
Jiaqi Yan
jiaqiyan at google.com
Thu Aug 29 12:21:39 PDT 2024
On Wed, Aug 28, 2024 at 4:50 PM Jason Gunthorpe <jgg at nvidia.com> wrote:
>
> On Wed, Aug 28, 2024 at 09:10:34AM -0700, Jiaqi Yan wrote:
> > On Wed, Aug 28, 2024 at 7:24 AM Jason Gunthorpe <jgg at nvidia.com> wrote:
> > >
> > > On Tue, Aug 27, 2024 at 05:42:21PM -0700, Jiaqi Yan wrote:
> > >
> > > > Instead of removing the whole pud, can driver or memory_failure do
> > > > something similar to non-struct-page-version of split_huge_page? So
> > > > driver doesn't need to re-fault good pages back?
> > >
> > > It would be far nicer if we didn't have to poke a hole in a 1G mapping
> > > just for memory failure reporting.
> >
> > If I follow this, which of the following sounds better? 1. remove pud
> > and rely on the driver to re-fault PFNs that it knows are not poisoned
> > (what Peter suggested), or 2. keep the pud and allow access to both
> > good and bad PFNs.
>
> In practice I think people will need 2, as breaking up a 1G mapping
> just because a few bits are bad will destroy the VM performance.
>
Totally agreed.
> For this the expectation would be for the VM to co-operate and not
> keep causing memory failures, or perhaps for the platform to spare in
> good memory somehow.
Yes, whether a VM gets into a memory-error-consumption loop
maliciously or accidentally, a reasonable VMM should have a way to
detect the loop and break it.
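(Purely as an illustration of what "detect and break it" could look
like on the VMM side, and not taken from any existing VMM: the sketch
below counts how often the guest trips over the same poisoned GPA and
tells the caller to stop re-entering the vCPU once a threshold is
crossed. All names here -- poison_tracker, vmm_should_stop_vcpu,
REPEAT_LIMIT -- are made up.)

/*
 * Hypothetical VMM-side helper, for illustration only: count hits on
 * each poisoned guest physical address and ask the caller to stop
 * re-entering the vCPU once the same GPA keeps being consumed.
 */
#include <stdbool.h>
#include <stddef.h>

#define MAX_TRACKED_GPAS 64
#define REPEAT_LIMIT      3   /* same GPA hit this many times => break the loop */

struct poison_tracker {
        unsigned long gpa[MAX_TRACKED_GPAS];
        unsigned int hits[MAX_TRACKED_GPAS];
        size_t nr;
};

/* Called after the SIGBUS/MCE path has resolved the faulting GPA. */
static bool vmm_should_stop_vcpu(struct poison_tracker *pt, unsigned long gpa)
{
        size_t i;

        for (i = 0; i < pt->nr; i++)
                if (pt->gpa[i] == gpa)
                        return ++pt->hits[i] >= REPEAT_LIMIT;

        if (pt->nr < MAX_TRACKED_GPAS) {
                pt->gpa[pt->nr] = gpa;
                pt->hits[pt->nr] = 1;
                pt->nr++;
        }
        return false;
}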
>
> > Or provide some knob (configured by ?) so that kernel + driver can
> > switch between the two?
>
> This is also sounding reasonable, especially if we need some
> alternative protocol to signal userspace about the failed memory
> besides fault and SIGBUS.
To clarify, what I have in mind is a knob, say one named
"sysctl_enable_hard_offline", configured by userspace.
Applied to Ankit's memory_failure_pfn patch[*], it would look like this:
static int memory_failure_pfn(unsigned long pfn, int flags)
{
        struct interval_tree_node *node;
        int res = MF_FAILED;
        LIST_HEAD(tokill);

        mutex_lock(&pfn_space_lock);
        for (node = interval_tree_iter_first(&pfn_space_itree, pfn, pfn); node;
             node = interval_tree_iter_next(node, pfn, pfn)) {
                struct pfn_address_space *pfn_space =
                        container_of(node, struct pfn_address_space, node);

                /* Let the driver know about the poisoned PFN first. */
                if (pfn_space->ops)
                        pfn_space->ops->failure(pfn_space, pfn);

                collect_procs_pgoff(NULL, pfn_space->mapping, pfn, &tokill);

                /* Only poke a hole in the mapping if hard offline is enabled. */
                if (sysctl_enable_hard_offline)
                        unmap_mapping_range(pfn_space->mapping, pfn << PAGE_SHIFT,
                                            PAGE_SIZE, 0);

                res = MF_RECOVERED;
        }
        mutex_unlock(&pfn_space_lock);

        if (res == MF_FAILED)
                return action_result(pfn, MF_MSG_PFN_MAP, res);

        flags |= MF_ACTION_REQUIRED | MF_MUST_KILL;
        kill_procs(&tokill, true, false, pfn, flags);

        return action_result(pfn, MF_MSG_PFN_MAP, MF_RECOVERED);
}
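(To show what I mean by "configured by userspace": the knob could be
exposed like other vm sysctls, roughly as below. This is only a sketch,
not part of Ankit's patch; the name, default value, and the exact
registration interface depend on the kernel version and are up for
discussion.)

static int sysctl_enable_hard_offline = 1;

static struct ctl_table memory_failure_sysctls[] = {
        {
                .procname       = "enable_hard_offline",
                .data           = &sysctl_enable_hard_offline,
                .maxlen         = sizeof(sysctl_enable_hard_offline),
                .mode           = 0644,
                .proc_handler   = proc_dointvec_minmax,
                .extra1         = SYSCTL_ZERO,
                .extra2         = SYSCTL_ONE,
        },
};

static int __init memory_failure_sysctl_init(void)
{
        register_sysctl_init("vm", memory_failure_sysctls);
        return 0;
}
late_initcall(memory_failure_sysctl_init);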
I think we still want to attempt to SIGBUS userspace, regardless of
whether we do the unmap_mapping_range or not.
[*] https://lore.kernel.org/lkml/20231123003513.24292-2-ankita@nvidia.com/#t
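(On the userspace side, that SIGBUS already carries everything the VMM
needs: si_code distinguishes action-required from action-optional,
si_addr gives the poisoned user address, and si_addr_lsb the
granularity. A rough sketch of the handler, with made-up function
names, just to show how the siginfo is decoded:)

#define _GNU_SOURCE
#include <signal.h>

/* Hypothetical VMM SIGBUS handler; only shows how siginfo is decoded. */
static void vmm_sigbus_handler(int sig, siginfo_t *si, void *ucontext)
{
        if (si->si_code == BUS_MCEERR_AR || si->si_code == BUS_MCEERR_AO) {
                void *addr = si->si_addr;             /* poisoned user address */
                unsigned long lsb = si->si_addr_lsb;  /* e.g. PAGE_SHIFT */

                /*
                 * Translate addr to a GPA, record it (see the loop-breaking
                 * sketch above), and inject an MCE into the guest or pause
                 * the vCPU as policy dictates.
                 */
                (void)addr;
                (void)lsb;
        }
}

static void vmm_install_sigbus_handler(void)
{
        struct sigaction sa = {
                .sa_sigaction = vmm_sigbus_handler,
                .sa_flags     = SA_SIGINFO,
        };

        sigaction(SIGBUS, &sa, NULL);
}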
>
> Jason