[PATCH v2] um: vector: Replace locks guarding queue depth with atomics

Thu Jul 4 09:12:04 PDT 2024

On Thu, 2024-07-04 at 13:22 +0100, anton.ivanov at cambridgegreys.com
wrote:
> 
> @@ -675,11 +657,20 @@ static void prep_queue_for_rx(struct vector_queue *qi)
>  	struct vector_private *vp = netdev_priv(qi->dev);
>  	struct mmsghdr *mmsg_vector = qi->mmsg_vector;
>  	void **skbuff_vector = qi->skbuff_vector;
> -	int i;
> +	int i, queue_depth;
> +
> +	queue_depth = atomic_read(&qi->queue_depth);
>  
> -	if (qi->queue_depth == 0)
> +	if (queue_depth == 0)
>  		return;
> -	for (i = 0; i < qi->queue_depth; i++) {
> +
> +	/* RX is always emptied 100% during each cycle, so we do not
> +	 * want to do the tail wraparound math for it
> +	 */
> +
> +	qi->head = qi->tail = 0;
> +
> +	for (i = 0; i < queue_depth; i++) {
>  		/* it is OK if allocation fails - recvmmsg with NULL data in
>  		 * iov argument still performs an RX, just drops the packet
>  		 * This allows us stop faffing around with a "drop buffer"
> @@ -689,7 +680,8 @@ static void prep_queue_for_rx(struct vector_queue *qi)
>  		skbuff_vector++;
>  		mmsg_vector++;
>  	}
> -	qi->queue_depth = 0;
> +	wmb(); /* Ensure that RX processing elsewhere sees the changes */
> +	atomic_set(&qi->queue_depth, 0);
>  }

I don't understand this.

prep_queue_for_rx() is called by vector_mmsg_rx(), not in parallel or in
a different thread or something, inside the NAPI polling. All it does is
reset the queue to empty with all the SKBs allocated.

After that, prep_queue_for_rx() calls uml_vector_recvmmsg() [1] which
fills the SKBs, and then vector_mmsg_rx() itself consumes them by going
over it and calling napi_gro_receive() (or whatever else).

There's no parallelism here? The RX queue wouldn't even need locking or
atomic instructions at all at this point, since it's just "refill
buffers, fill buffers, release buffers to network stack".

What am I missing?

[1] btw, it should pass 'budget' to uml_vector_recvmmsg(), not the depth
(qi->max_depth), and possibly even pass that to prep_queue_for_rx(), but
that's a different patch.

>  static struct vector_device *find_device(int n)
> @@ -987,7 +979,7 @@ static int vector_mmsg_rx(struct vector_private *vp, int budget)
>  	 * many do we need to prep the next time prep_queue_for_rx() is called.
>  	 */
>  
> -	qi->queue_depth = packet_count;
> +	atomic_add(packet_count, &qi->queue_depth);

and that also means my earlier analysis was just confused - there's no
need for _add here, _set is fine - in fact even non-atomic would be
fine.

> @@ -1176,6 +1168,7 @@ static int vector_poll(struct napi_struct *napi, int budget)
>  
>  	if ((vp->options & VECTOR_TX) != 0)
>  		tx_enqueued = (vector_send(vp->tx_queue) > 0);
> +	spin_lock(&vp->rx_queue->head_lock);
>  	if ((vp->options & VECTOR_RX) > 0)
>  		err = vector_mmsg_rx(vp, budget);
>  	else {
> @@ -1183,6 +1176,7 @@ static int vector_poll(struct napi_struct *napi, int budget)
>  		if (err > 0)
>  			err = 1;
>  	}
> +	spin_unlock(&vp->rx_queue->head_lock);

that's needed for ... stats?

johannes