[PATCH] nvme-host: tcp: do not read from the socket if we are performing a reset

Sagi Grimberg sagi at grimberg.me
Tue May 3 04:48:10 PDT 2022


> nvme_tcp_io_work() may be scheduled while the controller is resetting,
> triggering kernel crashes because of invalid memory accesses.
> 
> general protection fault, probably for non-canonical
> address 0x82f4228d6e5c6

Can you please provide more details on which kernel this happened on?
Does this reproduce on an upstream kernel?

>   [exception RIP: _copy_to_iter+1344]
>   [ffffa8a9e5c07c90] __skb_datagram_iter at ffffffffb0c339d8
>   [ffffa8a9e5c07cf8] skb_copy_datagram_iter at ffffffffb0c33c83
>   [ffffa8a9e5c07d28] nvme_tcp_recv_data at ffffffffc04dbff7 [nvme_tcp]
>   [ffffa8a9e5c07d78] nvme_tcp_recv_skb at ffffffffc04dc90e [nvme_tcp]
>   [ffffa8a9e5c07dc0] tcp_read_sock at ffffffffb0cfedbe
>   [ffffa8a9e5c07e10] nvme_tcp_try_recv at ffffffffc04da518 [nvme_tcp]
>   [ffffa8a9e5c07e58] nvme_tcp_io_work at ffffffffc04dbc54 [nvme_tcp]
>   [ffffa8a9e5c07e88] process_one_work at ffffffffb0505408
>   [ffffa8a9e5c07ed0] worker_thread at ffffffffb05059a0
> 
> Fix this bug by preventing nvme_tcp_io_work() from running if the queue
> is not flagged as "LIVE" and is not in the process of
> connecting to the target.

The current controller teardown flow is supposed to guarantee that
io_work will not run after a queue has been stopped. See
nvme_tcp_stop_queue(): it shuts down the socket so no new datagrams
go out to (or come in from) the network, restores the socket
callbacks so data_ready/write_space are not called again, and
cancels io_work. New submissions are prevented because the queue
state clears LIVE.
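
In code, that ordering looks roughly like this (a paraphrased sketch,
not an exact quote of tcp.c; details vary across kernel versions):

static void nvme_tcp_restore_sock_calls(struct nvme_tcp_queue *queue)
{
	struct socket *sock = queue->sock;

	write_lock_bh(&sock->sk->sk_callback_lock);
	sock->sk->sk_user_data = NULL;
	sock->sk->sk_data_ready = queue->data_ready;
	sock->sk->sk_state_change = queue->state_change;
	sock->sk->sk_write_space = queue->write_space;
	write_unlock_bh(&sock->sk->sk_callback_lock);
}

static void __nvme_tcp_stop_queue(struct nvme_tcp_queue *queue)
{
	/* no new datagrams go out to or come in from the network */
	kernel_sock_shutdown(queue->sock, SHUT_RDWR);
	/* data_ready/write_space can no longer rearm io_work */
	nvme_tcp_restore_sock_calls(queue);
	/* and any io_work already running is flushed out */
	cancel_work_sync(&queue->io_work);
}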

So in theory, io_work should not be running in conjunction with
queue shutdown. Can you explain how it ends up running there?

> 
> Signed-off-by: Maurizio Lombardi <mlombard at redhat.com>
> ---
>   drivers/nvme/host/tcp.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
> index ad3a2bf2f1e9..e3ef014bbc0b 100644
> --- a/drivers/nvme/host/tcp.c
> +++ b/drivers/nvme/host/tcp.c
> @@ -103,6 +103,7 @@ enum nvme_tcp_queue_flags {
>   	NVME_TCP_Q_ALLOCATED	= 0,
>   	NVME_TCP_Q_LIVE		= 1,
>   	NVME_TCP_Q_POLLING	= 2,
> +	NVME_TCP_Q_CONNECTING	= 3,
>   };
>   
>   enum nvme_tcp_recv_state {
> @@ -1213,6 +1214,10 @@ static void nvme_tcp_io_work(struct work_struct *w)
>   		bool pending = false;
>   		int result;
>   
> +		if (unlikely(!test_bit(NVME_TCP_Q_LIVE, &queue->flags) &&
> +				!test_bit(NVME_TCP_Q_CONNECTING, &queue->flags)))
> +			return;

How is this protecting anything? If the queue is torn down we may be
accessing other freed memory, not just the socket.
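
For illustration, a hypothetical interleaving (my sketch, assuming
io_work really does run concurrently with teardown) that the proposed
flag test cannot close:

/*
 *   nvme_tcp_io_work()               queue teardown
 *   ------------------               --------------
 *   test_bit(LIVE) -> still set
 *                                    clear_bit(LIVE)
 *                                    socket shutdown, memory freed
 *   nvme_tcp_try_recv(queue)         <- touches freed memory
 */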

> +
>   		if (mutex_trylock(&queue->send_mutex)) {
>   			result = nvme_tcp_try_send(queue);
>   			mutex_unlock(&queue->send_mutex);
> @@ -1670,6 +1675,8 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
>   	struct nvme_tcp_ctrl *ctrl = to_tcp_ctrl(nctrl);
>   	int ret;
>   
> +	set_bit(NVME_TCP_Q_CONNECTING, &ctrl->queues[idx].flags);
> +
>   	if (idx)
>   		ret = nvmf_connect_io_queue(nctrl, idx);
>   	else
> @@ -1683,6 +1690,7 @@ static int nvme_tcp_start_queue(struct nvme_ctrl *nctrl, int idx)
>   		dev_err(nctrl->device,
>   			"failed to connect queue: %d ret=%d\n", idx, ret);
>   	}
> +	clear_bit(NVME_TCP_Q_CONNECTING, &ctrl->queues[idx].flags);
>   	return ret;
>   }
>   

Question: can you tell from your stack dump whether this queue is the
admin queue or an I/O queue?

If this is indeed the admin queue, please check if you have a related
fix applied:
0fa0f99fc84e ("nvme: fix a possible use-after-free in controller reset 
during load")


