[PATCH AUTOSEL 7.0-6.18] nvmet-tcp: Don't clear tls_key when freeing sq
Sasha Levin
sashal at kernel.org
Tue May 5 02:51:28 PDT 2026
From: Alistair Francis <alistair.francis at wdc.com>
[ Upstream commit 5fc422951c962cc01e654950fc043ebd8fadd865 ]
Curently after the host sends a REPLACETLSPSK we free the TLS keys as
part of calling nvmet_auth_sq_free() on success. This means when the
host sends a follow up REPLACETLSPSK we return CONCAT_MISMATCH as the
check for !nvmet_queue_tls_keyid(req->sq) fails.
A previous attempt to fix this involed not calling nvmet_auth_sq_free()
on successful connections, but that results in memory leaks. Instead we
should not clear `tls_key` in nvmet_auth_sq_free(), as that was
incorrectly wiping the tls keys which are used for the session.
This patch ensures we correctly free the ephemeral session key on
connection, yet we don't free the TLS key unless closing the connection.
Reviewed-by: Chris Leech <cleech at redhat.com>
Reviewed-by: Hannes Reinecke <hare at suse.de>
Signed-off-by: Alistair Francis <alistair.francis at wdc.com>
Signed-off-by: Keith Busch <kbusch at kernel.org>
Signed-off-by: Sasha Levin <sashal at kernel.org>
---
LLM Generated explanations, may be completely bogus:
## Phase 1: Commit Message Forensics
Step 1.1 Record: subsystem `nvmet-tcp` / NVMe target TCP; action verb
“Don’t clear”; intent is to stop `nvmet_auth_sq_free()` from wiping
`sq->tls_key` during authentication cleanup.
Step 1.2 Record: tags in upstream commit `5fc422951c962`: `Reviewed-by:
Chris Leech <cleech at redhat.com>`, `Reviewed-by: Hannes Reinecke
<hare at suse.de>`, `Signed-off-by: Alistair Francis
<alistair.francis at wdc.com>`, `Signed-off-by: Keith Busch
<kbusch at kernel.org>`. No `Fixes:`, no `Reported-by:`, no `Cc: stable`.
Step 1.3 Record: the described bug is deterministic: after a host sends
`REPLACETLSPSK`, successful auth cleanup calls `nvmet_auth_sq_free()`,
which clears `sq->tls_key`; a later `REPLACETLSPSK` then fails the
`!nvmet_queue_tls_keyid(req->sq)` check and returns `CONCAT_MISMATCH`.
The commit also states the earlier approach of not calling
`nvmet_auth_sq_free()` leaked memory.
Step 1.4 Record: this is not a hidden cleanup; it is an explicit
correctness fix for broken TLS PSK replacement and for preserving the
key until connection close.
## Phase 2: Diff Analysis
Step 2.1 Record: one file changed, `drivers/nvme/target/auth.c`;
upstream diff is 3 deletions. Modified function: `nvmet_auth_sq_free()`.
Scope: single-file surgical fix.
Step 2.2 Record: before, `nvmet_auth_sq_free()` canceled auth timeout
work, cleared `sq->tls_key`, then freed DH-CHAP ephemeral buffers.
After, it still cancels work and frees `dhchap_c1`, `dhchap_c2`, and
`dhchap_skey`, but no longer clears the TCP TLS key.
Step 2.3 Record: bug category is logic/resource lifetime.
`nvmet_tcp_tls_key_lookup()` stores a key from `nvme_tls_key_lookup()`
into `queue->nvme_sq.tls_key`; `key_lookup()` increments the key
refcount. `nvmet_sq_put_tls_key()` later does `key_put()` only if
`sq->tls_key` is still set. Clearing the pointer in
`nvmet_auth_sq_free()` both breaks later `REPLACETLSPSK` validation and
prevents the normal close path from seeing the key pointer.
Step 2.4 Record: fix quality is high. The patch removes only incorrect
ownership cleanup from the per-auth cleanup helper. Regression risk is
low because the verified TCP queue release path calls
`nvmet_sq_put_tls_key()` before `nvmet_sq_destroy()`, and non-TCP
transports do not populate `sq->tls_key`.
## Phase 3: Git History Investigation
Step 3.1 Record: `git blame` on the changed area shows `sq->tls_key`
handling was introduced by `fa2e0f8bbc689` (“nvmet-tcp: support secure
channel concatenation”), first contained at `v6.15-rc1~166^2^2~13`;
`sq->tls_key = NULL` was later style-adjusted by `b1efcc470eb30`. The
older `nvmet_auth_sq_free()` DH-CHAP cleanup itself dates to the
original auth code around v6.0, but the TLS-specific bug begins with
`fa2e0f8bbc689`.
Step 3.2 Record: no `Fixes:` tag is present. I manually inspected the
likely introducing commit `fa2e0f8bbc689`, which added secure channel
concatenation, `sq->tls_key`, `nvmet_queue_tls_keyid()`,
`nvmet_sq_put_tls_key()`, and the clear in `nvmet_auth_sq_free()`.
Step 3.3 Record: recent related upstream history includes
`2e6eb6b277f59` (“Don’t free SQ on authentication success”),
`f920ebd03cd13` reverting that due to leaks, and this commit
`5fc422951c962`. This commit was submitted as patch 2/2 after the
revert. On the checked stable branches, `2e6eb6b277f59` is not an
ancestor, so this patch can apply standalone there.
Step 3.4 Record: Alistair Francis has multiple recent NVMe target
auth/TLS commits in this subsystem, including `ecf4d2d883515`,
`2e6eb6b277f59`, `f920ebd03cd13`, `5fc422951c962`, and `5d10069e1a169`.
Step 3.5 Record: dependencies are minimal. The patch depends on
`sq->tls_key` and `nvmet_auth_sq_free()` existing, which local stable
refs show in `6.15.y` and newer. If a target stable tree had already
taken `2e6eb6b277f59`, then the paired revert `f920ebd03cd13` would also
be needed; in the local stable refs checked, that prerequisite is not
needed.
## Phase 4: Mailing List And External Research
Step 4.1 Record: `b4 dig -c 5fc422951c962...` found the original
submission at `https://patch.msgid.link/20260417004809.2894745-2-
alistair.francis at wdc.com`. `b4 dig -a` found only a v1 two-patch series.
The committed version is the reviewed/applied version.
Step 4.2 Record: `b4 dig -w` showed the patch was sent to the expected
NVMe/block audience: Hannes Reinecke, Christoph Hellwig, Sagi Grimberg,
Chaitanya Kulkarni, Keith Busch, linux-nvme, linux-block, linux-kernel,
Yi Zhang, Maurizio Lombardi, and Shinichiro Kawasaki. Lore replies show
`Reviewed-by` from Hannes Reinecke and Chris Leech, and Keith Busch
replied that patches 1 and 2 were applied to `nvme-7.1`.
Step 4.3 Record: no direct bug-report link is in this commit. The paired
revert references a kmemleak report from Yi Zhang during blktests
(`nvme/041`, `nvme/042`, `nvme/043`, `nvme/044`, `nvme/045`, `nvme/051`,
`nvme/052`), confirming the earlier “don’t call cleanup” approach leaked
DH-CHAP allocations.
Step 4.4 Record: series context is important: patch 1 reverts the
earlier workaround that skipped `nvmet_auth_sq_free()` on success; patch
2, this commit, fixes the root cause by keeping cleanup but no longer
clearing `tls_key`.
Step 4.5 Record: stable-specific web search through
`lore.kernel.org/stable` was blocked by Anubis; the `yhbt.net` stable
path returned 404. I found an AUTOSEL posting for the earlier
`2e6eb6b277f59` workaround, but did not use that as selection evidence.
## Phase 5: Code Semantic Analysis
Step 5.1 Record: modified function is `nvmet_auth_sq_free()`.
Step 5.2 Record: callers found by `rg`: `nvmet_execute_auth_send()`,
`nvmet_execute_auth_receive()`, and `nvmet_sq_destroy()`. Auth
send/receive are assigned as handlers for `nvme_fabrics_type_auth_send`
and `nvme_fabrics_type_auth_receive` in `fabrics-cmd.c`.
Step 5.3 Record: key callees around the affected flow include
`nvmet_queue_tls_keyid()`, `nvmet_auth_insert_psk()`,
`nvmet_tcp_tls_key_lookup()`, `nvme_tls_key_lookup()`, `key_lookup()`,
`nvmet_sq_put_tls_key()`, and `key_put()`.
Step 5.4 Record: reachability is verified through the NVMe/TCP receive
path: `nvmet_tcp_done_recv_pdu()` initializes a request, assigns/uses
the fabrics auth execute handler, and runs `req->execute()`. This is
reachable to an NVMe-oF host using target auth/TLS, not a theoretical
internal path.
Step 5.5 Record: similar lifetime pattern found: the only proper TCP TLS
SQ key release helper is `nvmet_sq_put_tls_key()`, called from
`nvmet_tcp_release_queue_work()`. `nvmet_auth_sq_free()` is a DH-CHAP
ephemeral cleanup helper and should not own the TLS key.
## Phase 6: Cross-Referencing And Stable Tree Analysis
Step 6.1 Record: local stable refs show `sq->tls_key = NULL` in
`nvmet_auth_sq_free()` exists in `stable/linux-6.15.y`, `6.16.y`,
`6.17.y`, `6.18.y`, `6.19.y`, and `7.0.y`; it is absent from `6.14.y`,
`6.13.y`, `6.12.y`, `6.6.y`, and `6.1.y`.
Step 6.2 Record: `git apply --check` of the candidate patch succeeds on
local `stable/linux-6.15.y` through `stable/linux-7.0.y`, and fails on
`6.14.y` because the buggy TLS key clearing code is not there. Expected
backport difficulty: clean for affected local refs.
Step 6.3 Record: bounded related-history searches found no existing
equivalent fix in `stable/linux-7.0.y`. Upstream has the earlier
attempted fix `2e6eb6b277f59`, its revert `f920ebd03cd13`, and this
final fix.
## Phase 7: Subsystem And Maintainer Context
Step 7.1 Record: subsystem is NVMe target over TCP, with authentication
and TLS secure channel concatenation. Criticality: IMPORTANT, because it
affects networked storage authentication/TLS behavior, though not a
universal core-kernel path.
Step 7.2 Record: subsystem is active. Recent history in
`drivers/nvme/target` includes multiple auth/TLS fixes and refactors,
including secure concatenation support and follow-up fixes through the
6.15 to 7.1 development window.
## Phase 8: Impact And Risk Assessment
Step 8.1 Record: affected users are systems using NVMe target TCP with
`CONFIG_NVME_TARGET_AUTH` and `CONFIG_NVME_TARGET_TCP_TLS`, specifically
secure channel concatenation / TLS PSK replacement.
Step 8.2 Record: trigger is a host sending a follow-up `REPLACETLSPSK`
after successful authentication on a TLS-enabled admin queue. The
failure is deterministic from the verified code path. I did not verify
that an unprivileged local user can trigger it; this is a
remote/protocol operation by an NVMe host with access to the target.
Step 8.3 Record: failure mode is protocol/authentication failure with
`CONCAT_MISMATCH`, breaking TLS PSK replacement/key rotation.
Additionally, because `key_lookup()` increments the key refcount and the
close path only `key_put()`s if `sq->tls_key` remains set, clearing the
pointer early risks leaking the key reference. Severity: HIGH for
affected NVMe/TCP secure-channel users; not a crash/data-corruption fix.
Step 8.4 Record: benefit is high for affected users because it fixes a
broken authenticated TLS key replacement flow and restores the verified
close-time key release path. Risk is very low: 3 deleted lines, no new
API, no new feature, no locking changes, no cross-subsystem behavior
change.
## Phase 9: Final Synthesis
Step 9.1 Record: evidence for backporting: real deterministic protocol
failure; possible key-ref lifetime leak; tiny one-function patch;
reviewed by relevant NVMe/TLS people; applied by NVMe maintainer;
cleanly applies to affected stable refs; affects released stable
branches from 6.15.y onward. Evidence against: affects a relatively
specialized NVMe/TCP TLS configuration and needs no backport before
6.15.y. Unresolved: I could not access `lore.kernel.org/stable` directly
due Anubis, and I did not run runtime blktests.
Step 9.2 Stable rules checklist: obviously correct and reviewed: yes.
Fixes a real bug: yes. Important issue: yes for NVMe/TCP TLS key
replacement and key lifetime. Small and contained: yes, 3 deleted lines
in one function. No new features/APIs: yes. Applies to stable trees: yes
for local `6.15.y` through `7.0.y`; not applicable to older checked
branches.
Step 9.3 Exception category: not a device ID, quirk, DT, build, or
documentation exception.
Step 9.4 Decision: backport to stable trees that contain `fa2e0f8bbc689`
/ the `sq->tls_key` clearing code, namely local refs `6.15.y` and newer.
Do not apply to older trees where the code is absent. If a target branch
contains the earlier workaround `2e6eb6b277f59`, include/reconcile the
paired revert `f920ebd03cd13`; otherwise this commit stands alone.
## Verification
- Phase 1: Found upstream commit `5fc422951c962` with `git log
origin/master --grep`.
- Phase 1: Parsed commit tags from `git show --format=fuller --stat
--patch 5fc422951c962`.
- Phase 2: Verified the diff removes only the
`CONFIG_NVME_TARGET_TCP_TLS` block clearing `sq->tls_key`.
- Phase 2: Verified `nvmet_queue_tls_keyid()` returns 0 when
`sq->tls_key` is NULL.
- Phase 2: Verified `REPLACETLSPSK` returns `CONCAT_MISMATCH` when
`nvmet_queue_tls_keyid(req->sq)` is false.
- Phase 2: Verified `nvmet_sq_put_tls_key()` calls `key_put()` and NULLs
the key on TCP queue release.
- Phase 2: Verified `nvme_tls_key_lookup()` uses `key_lookup()`, and
`key_lookup()` increments the key refcount.
- Phase 3: Ran `git blame` on `auth.c`; TLS key clearing originates from
`fa2e0f8bbc689`/`b1efcc470eb30`.
- Phase 3: Ran `git describe --contains fa2e0f8bbc689`; first contained
at `v6.15-rc1`.
- Phase 3: Inspected `fa2e0f8bbc689`, `b1efcc470eb30`, `2e6eb6b277f59`,
`f920ebd03cd13`, and `ecf4d2d883515`.
- Phase 4: Ran `b4 dig -c`, `b4 dig -a`, and `b4 dig -w`; found v1 two-
patch lore series and recipient list.
- Phase 4: Fetched lore mirror thread; verified Hannes Reinecke and
Chris Leech reviewed, Keith Busch applied patches 1 and 2.
- Phase 4: Fetched Yi Zhang kmemleak report linked by the paired revert.
- Phase 5: Used `rg` and file reads to trace auth command handlers, TCP
receive execution, TLS key lookup, and queue release.
- Phase 6: Checked local stable refs with scripted `git show`; buggy
code exists in `6.15.y` through `7.0.y`, absent in older checked
stable refs.
- Phase 6: Ran `git apply --check` against local stable worktrees; clean
for `6.15.y` through `7.0.y`, not applicable to `6.14.y`.
- Phase 7: Reviewed recent `drivers/nvme/target` history showing active
auth/TLS development.
- Phase 8: Verified trigger and failure path from
`nvmet_auth_negotiate()` and auth send/receive call flow.
- Unverified: no runtime blktests were run.
- Unverified: direct `lore.kernel.org/stable` search was blocked by
Anubis; the mirror stable search path returned 404.
**YES**
drivers/nvme/target/auth.c | 3 ---
1 file changed, 3 deletions(-)
diff --git a/drivers/nvme/target/auth.c b/drivers/nvme/target/auth.c
index 2eadeb7e06f26..3a905124afdee 100644
--- a/drivers/nvme/target/auth.c
+++ b/drivers/nvme/target/auth.c
@@ -239,9 +239,6 @@ u8 nvmet_setup_auth(struct nvmet_ctrl *ctrl, struct nvmet_sq *sq)
void nvmet_auth_sq_free(struct nvmet_sq *sq)
{
cancel_delayed_work(&sq->auth_expired_work);
-#ifdef CONFIG_NVME_TARGET_TCP_TLS
- sq->tls_key = NULL;
-#endif
kfree(sq->dhchap_c1);
sq->dhchap_c1 = NULL;
kfree(sq->dhchap_c2);
--
2.53.0
More information about the Linux-nvme
mailing list