[PATCH v3 08/16] nvmet-fc: untangle cross refcounting objects

Mon Dec 18 23:59:10 PST 2023

On 12/18/23 16:30, Daniel Wagner wrote:
> Associations take a refcount on queues, queues take a refcount on
> associations.
> 
> The existing code lead to the situation that the target executes a
> disconnect and the host triggers a reconnect immediately. The reconnect
> command still finds an existing association and uses this. Though the
> reconnect crashes later on because nvmet_fc_delete_target_assoc()
> blindly goes ahead and removes resources while the reconnect code wants
> to use it. The problem is that nvmet_fc_find_target_assoc() is able to
> lookup an association which is being removed.
> 
> So the first thing to address nvmet_fc_find_target_queue() is to remove
> the association out of the list and wait a RCU cycle and free resources
> in the free function callback of the kref_put().
> 
> The live time of the queues are strictly bound to the lifetime of an
> association. Thus we don't need to take reverse refcounts (queue ->
> association).
> 
> Furthermore, streamline the cleanup code by using the workqueue for
> delete the association in nvmet_fc_ls_disconnect. This ensures, that we
> run through the same shutdown path in all non error cases.
> 
> Reproducer: nvme/003
> 
> Signed-off-by: Daniel Wagner <dwagner at suse.de>
> ---
>   drivers/nvme/target/fc.c | 83 +++++++++++++++++++---------------------
>   1 file changed, 40 insertions(+), 43 deletions(-)
> 
> diff --git a/drivers/nvme/target/fc.c b/drivers/nvme/target/fc.c
> index 28e432e62361..db992df13c73 100644
> --- a/drivers/nvme/target/fc.c
> +++ b/drivers/nvme/target/fc.c
> @@ -166,6 +166,7 @@ struct nvmet_fc_tgt_assoc {
>   	struct nvmet_fc_hostport	*hostport;
>   	struct nvmet_fc_ls_iod		*rcv_disconn;
>   	struct list_head		a_list;
> +	struct nvmet_fc_tgt_queue	*_queues[NVMET_NR_QUEUES + 1];
>   	struct nvmet_fc_tgt_queue __rcu	*queues[NVMET_NR_QUEUES + 1];
>   	struct kref			ref;
>   	struct work_struct		del_work;
> @@ -803,14 +804,11 @@ nvmet_fc_alloc_target_queue(struct nvmet_fc_tgt_assoc *assoc,
>   	if (!queue)
>   		return NULL;
>   
> -	if (!nvmet_fc_tgt_a_get(assoc))
> -		goto out_free_queue;
> -
>   	queue->work_q = alloc_workqueue("ntfc%d.%d.%d", 0, 0,
>   				assoc->tgtport->fc_target_port.port_num,
>   				assoc->a_id, qid);
>   	if (!queue->work_q)
> -		goto out_a_put;
> +		goto out_free_queue;
>   
>   	queue->qid = qid;
>   	queue->sqsize = sqsize;
> @@ -831,7 +829,8 @@ nvmet_fc_alloc_target_queue(struct nvmet_fc_tgt_assoc *assoc,
>   	if (ret)
>   		goto out_fail_iodlist;
>   
> -	WARN_ON(assoc->queues[qid]);
> +	WARN_ON(assoc->_queues[qid]);
> +	assoc->_queues[qid] = queue;
>   	rcu_assign_pointer(assoc->queues[qid], queue);
>   
Is it really worth is using an rcu pointer here?
In the end, creating and deleting queues for an association don't happen 
that often, and involve some synchronization points anyway.
IE the lifetime of the queue is actually bounded by the lifetime of the
association itself, so if the association is valid the queues will be
valid, too.

With that reasoning can't we drop rcu above and use the array directly,
delegating any synchronization to the association itself?

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                Kernel Storage Architect
hare at suse.de                              +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Ivo Totev, Andrew McDonald,
Werner Knoblich