[PATCH net v3 1/8] net/handshake: Use spin_lock_bh for hn_lock

Wed May 27 07:24:11 PDT 2026

On Wed, May 27, 2026, at 4:59 AM, Hannes Reinecke wrote:
> On 5/25/26 18:51, Chuck Lever wrote:
>> From: Chuck Lever <chuck.lever at oracle.com>
>> 
>> nvmet_tcp_state_change(), a socket callback that runs in BH context,
>> can reach handshake_req_cancel() via nvmet_tcp_schedule_release_queue()
>> and tls_handshake_cancel().  handshake_req_cancel() acquires
>> hn->hn_lock with plain spin_lock().  If a process-context thread on
>> the same CPU holds hn->hn_lock when a softirq invokes the cancel path,
>> the lock attempt deadlocks.  This is the only caller that invokes
>> tls_handshake_cancel() from BH context; every other consumer calls it
>> from process context.
>> 
>> Deferring the cancel to process context in the NVMe target is not
>> straightforward: nvmet_tcp_schedule_release_queue() must call
>> tls_handshake_cancel() atomically with its state transition to
>> DISCONNECTING.  If the cancel were deferred, the handshake completion
>> callback could fire in the window before the cancel runs, observe the
>> unexpected state, and return without dropping its kref on the queue.
>> Reworking that interlock is considerably more invasive than hardening
>> the handshake lock.  Convert all hn->hn_lock acquisitions from
>> spin_lock/spin_unlock to spin_lock_bh/spin_unlock_bh so the lock is
>> never taken with softirqs enabled.
>> 
>> Fixes: 675b453e0241 ("nvmet-tcp: enable TLS handshake upcall")
>> Signed-off-by: Chuck Lever <chuck.lever at oracle.com>
>> ---
>>   net/handshake/netlink.c |  4 ++--
>>   net/handshake/request.c | 14 +++++++-------
>>   net/handshake/tlshd.c   |  2 ++
>>   3 files changed, 11 insertions(+), 9 deletions(-)
>> 
> ... and there is always the question whather we should be calling
> tls_handshake_cancel() in the first place.
> We already call 'tls_handshake_cancel()' from the handshake timeout
> handler, and this instance of 'tls_handshake_cancel()' is called
> from a socket callback (ie when the soecket is closed or something).
> I would have expected that the handshake code cleans up outstanding
> requests on socket close already; if so, we can delete the cancel
> here.

The handshake layer doesn't auto-cancel on socket close.

handshake_sk_destruct() fires when the sock refcount reaches
zero, but it only frees the handshake_req and removes it from
the hash table. It does not set HANDSHAKE_F_REQ_COMPLETED or
invoke the consumer's done callback. It assumes the request
was already completed or canceled before the socket reached
final destruction.

The timeout handler and the state-change path cover different
failure modes. The timeout fires when tlshd is slow or absent.
The state-change callback fires when the remote peer resets or
closes the connection while the handshake is still in
progress -- that can happen well before the timeout expires.

Without the cancel in nvmet_tcp_schedule_release_queue(), the
race plays out as follows: the state-change callback
transitions the queue to DISCONNECTING and drops its kref.
Then tlshd finishes and the handshake layer calls
nvmet_tcp_tls_handshake_done(). That callback sees the queue
is not in TLS_HANDSHAKE state, hits the WARN_ON, and returns
without dropping its own kref -- leaking the queue.

The cancel makes the teardown race-free: either cancel wins
and the completion callback never fires, or completion already
won and cancel returns false. Either way, exactly one side
drops the kref.

The SunRPC consumers (client and server) avoid this problem by
blocking a thread on wait_for_completion() during the
handshake. A socket state change wakes the waiter indirectly
through the completion callback, and the waiter handles
cleanup. nvme-tcp's handshake is fully asynchronous -- there
is no blocked thread to wake, so the state-change callback is
the only context that can cancel the handshake before queue
teardown begins.

-- 
Chuck Lever