[PATCH AUTOSEL 6.15 6/8] PM: Restrict swap use to later in the suspend sequence
Eric W. Biederman
ebiederm at xmission.com
Tue Jul 8 12:32:02 PDT 2025
Wow!
Sasha I think an impersonator has gotten into your account, and
is just making nonsense up.
At first glance this reads like an impassioned plea to backport this
change, from someone who has actually dealt with it.
Unfortunately reading the justification in detail is an exercise
in reading falsehoods.
If this does not come from an impersonator then:
- If this comes from a human being, I recommend you have a talk with
them.
- If this comes from a machine I recommend you take it out of commission
and rework it.
At best all of this appears to be an effort to get someone else to
do necessary thinking for you. As my time for kernel work is very
limited I expect I will auto-nack any such future attempts to outsource
someone else's thinking on me.
Eric
Sasha Levin <sashal at kernel.org> writes:
> From: Mario Limonciello <mario.limonciello at amd.com>
>
> [ Upstream commit 12ffc3b1513ebc1f11ae77d053948504a94a68a6 ]
>
> Currently swap is restricted before drivers have had a chance to do
> their prepare() PM callbacks. Restricting swap this early means that if
> a driver needs to evict some content from memory into sawp in it's
> prepare callback, it won't be able to.
>
> On AMD dGPUs this can lead to failed suspends under memory pressure
> situations as all VRAM must be evicted to system memory or swap.
>
> Move the swap restriction to right after all devices have had a chance
> to do the prepare() callback. If there is any problem with the sequence,
> restore swap in the appropriate dpm resume callbacks or error handling
> paths.
>
> Closes: https://github.com/ROCm/ROCK-Kernel-Driver/issues/174
> Closes: https://gitlab.freedesktop.org/drm/amd/-/issues/2362
> Signed-off-by: Mario Limonciello <mario.limonciello at amd.com>
> Tested-by: Nat Wittstock <nat at fardog.io>
> Tested-by: Lucian Langa <lucilanga at 7pot.org>
> Link: https://patch.msgid.link/20250613214413.4127087-1-superm1@kernel.org
> Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki at intel.com>
> Signed-off-by: Sasha Levin <sashal at kernel.org>
> ---
>
> **YES**
>
> This commit should be backported to stable kernel trees for the
> following reasons:
Really? And when those reasons turn out to be baloney?
> ## Critical Bug Fix for Real User Issues
>
> 1. **Fixes Actual Suspend Failures**: The commit addresses real-world
> suspend failures under memory pressure on systems with AMD discrete
> GPUs. The linked issues (ROCm/ROCK-Kernel-Driver#174 and
> freedesktop.org/drm/amd#2362) indicate this affects actual users.
Those linked issues are completely corrupted in the paragraph above.
>From the original commit the proper issues are:
https://github.com/ROCm/ROCK-Kernel-Driver/issues/174
https://gitlab.freedesktop.org/drm/amd/-/issues/2362
Which indicate that something is going on, but are old enough and
long enough coming to any kind of conclusion from them is not easy.
> 2. **Regression Fix**: This is effectively a regression fix. The PM
> subsystem's early swap restriction prevents AMD GPU drivers from
> properly evicting VRAM during their prepare() callbacks, which is a
> requirement that has become more critical as GPU VRAM sizes have
> increased.
There is no indication that this used to work, or that an earlier
kernel change caused this to stop working. This is not a regression.
> ## Small, Contained Change
>
> 3. **Minimal Code Changes**: The fix is remarkably simple - it just
> moves the `pm_restrict_gfp_mask()` call from early in the suspend
> sequence to after `dpm_prepare()` completes. The changes are:
> - Move `pm_restrict_gfp_mask()` from multiple early locations to
> inside `dpm_suspend_start()` after `dpm_prepare()` succeeds
> - Add corresponding `pm_restore_gfp_mask()` calls in error paths and
> resume paths
> - Remove the now-redundant calls from hibernate.c and suspend.c
Reworking how different layers of the kernel interact is not minimal,
and it not self contained.
> 4. **Low Risk of Regression**: The change maintains the original intent
> of preventing I/O during the critical suspend phase while allowing it
> during device preparation. The swap restriction still happens before
> `dpm_suspend()`, just after `dpm_prepare()`.
There is no analysis anywhere on what happens to the code with
code that might expect the old behavior.
So it is not possible to conclude a low risk of regression,
in fact we can't conclude anything.
> ## Follows Stable Rules
>
> 5. **Meets Stable Criteria**:
> - Fixes a real bug that bothers people (suspend failures)
Addresses a real bug, yes. Fixes?
> - Small change (moves function calls, doesn't introduce new logic)
No.
> - Obviously correct (allows drivers to use swap during their
> designated preparation phase)
Not at all. It certainly isn't obvious to me what is going on.
> - Already tested by users (Tested-by tags from affected users)
Yes there are Tested-by tags.
> ## Similar to Other Backported Commits
>
> 6. **Pattern Matches**: Looking at the similar commits provided, this
> follows the same pattern as the AMD GPU eviction commits that were
> backported. Those commits also addressed the same fundamental issue -
> ensuring GPU VRAM can be properly evicted during suspend/hibernation.
Which other commits are those?
> ## Critical Timing
Timing?
> 7. **Error Path Handling**: The commit properly handles error paths by
> adding `pm_restore_gfp_mask()` calls in:
> - `dpm_resume_end()` for normal resume
> - `platform_recover()` error path in suspend.c
> - `pm_restore_gfp_mask()` in kexec_core.c for kexec flows
I don't see anything in this change that has to do with error paths.
> The commit is well-tested, addresses a real problem affecting users, and
> makes a minimal, obviously correct change to fix suspend failures on
> systems with discrete GPUs under memory pressure.
The evidence that a 3 week old change is well tested, simply
because it has been merged into Linus's change seems lacking.
Tested yes, but is it well tested? Are there any possible side
effects?
I certainly see no evidence of any testing or any exercise at
all of the kexec path modified. I wasn't even away of this
change until this backport came in.
Eric
More information about the kexec
mailing list