[PATCH v2] um: vector: Replace locks guarding queue depth with atomics

Anton Ivanov anton.ivanov at kot-begemot.co.uk
Thu Jul 4 13:16:45 PDT 2024


On 04/07/2024 17:12, Johannes Berg wrote:
> On Thu, 2024-07-04 at 13:22 +0100, anton.ivanov at cambridgegreys.com
> wrote:
>> @@ -675,11 +657,20 @@ static void prep_queue_for_rx(struct vector_queue *qi)
>>   	struct vector_private *vp = netdev_priv(qi->dev);
>>   	struct mmsghdr *mmsg_vector = qi->mmsg_vector;
>>   	void **skbuff_vector = qi->skbuff_vector;
>> -	int i;
>> +	int i, queue_depth;
>> +
>> +	queue_depth = atomic_read(&qi->queue_depth);
>>   
>> -	if (qi->queue_depth == 0)
>> +	if (queue_depth == 0)
>>   		return;
>> -	for (i = 0; i < qi->queue_depth; i++) {
>> +
>> +	/* RX is always emptied 100% during each cycle, so we do not
>> +	 * want to do the tail wraparound math for it
>> +	 */
>> +
>> +	qi->head = qi->tail = 0;
>> +
>> +	for (i = 0; i < queue_depth; i++) {
>>   		/* it is OK if allocation fails - recvmmsg with NULL data in
>>   		 * iov argument still performs an RX, just drops the packet
>>   		 * This allows us stop faffing around with a "drop buffer"
>> @@ -689,7 +680,8 @@ static void prep_queue_for_rx(struct vector_queue *qi)
>>   		skbuff_vector++;
>>   		mmsg_vector++;
>>   	}
>> -	qi->queue_depth = 0;
>> +	wmb(); /* Ensure that RX processing elsewhere sees the changes */
>> +	atomic_set(&qi->queue_depth, 0);
>>   }
> I don't understand this.
>
> prep_queue_for_rx() is called by vector_mmsg_rx(), not in parallel or in
> a different thread or something, inside the NAPI polling. All it does is
> reset the queue to empty with all the SKBs allocated.
>
> After that, prep_queue_for_rx() calls uml_vector_recvmmsg() [1] which
> fills the SKBs, and then vector_mmsg_rx() itself consumes them by going
> over it and calling napi_gro_receive() (or whatever else).
>
> There's no parallelism here? The RX queue wouldn't even need locking or
> atomic instructions at all at this point, since it's just "refill
> buffers, fill buffers, release buffers to network stack".
>
> What am I missing?

You are not missing anything.

RX and TX use the same infrastructure to map vectors onto skb arrays.

I can give RX its own set of lock-free and atomic-free functions, but
that means duplicating some of the code.
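For what it's worth, the single-context RX cycle described above (refill buffers, fill buffers, release to the stack, all inside one NAPI poll) could be sketched roughly as below. This is a userspace illustration with made-up names, not the driver's actual code; the real driver works on struct vector_queue with mmsghdr/skb vectors. The point is that a plain int depth suffices when every access happens in one execution context:

```c
#include <assert.h>

/* Illustrative sketch only: a miniature of the sequential RX cycle.
 * All three steps run back to back in one context, so no lock or
 * atomic is needed on ->depth. */
struct rx_queue {
	int max_depth;
	int depth;	/* plain int is enough: single-context access */
};

/* prep_queue_for_rx() analogue: reset to empty */
static void rx_prep(struct rx_queue *q)
{
	q->depth = 0;
}

/* uml_vector_recvmmsg() analogue: pretend n datagrams arrived */
static int rx_fill(struct rx_queue *q, int n)
{
	q->depth = (n < q->max_depth) ? n : q->max_depth;
	return q->depth;
}

/* napi_gro_receive() loop analogue: consume everything */
static int rx_consume(struct rx_queue *q)
{
	int handled = q->depth;

	q->depth = 0;
	return handled;
}
```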

>
>
> [1] btw, it should pass 'budget' to uml_vector_recvmmsg(), not the depth
> (qi->max_depth), and possibly even pass that to prep_queue_for_rx(), but
> that's a different patch.
Good point. Let's sort out the locking and parallelism first.
>
>>   static struct vector_device *find_device(int n)
>> @@ -987,7 +979,7 @@ static int vector_mmsg_rx(struct vector_private *vp, int budget)
>>   	 * many do we need to prep the next time prep_queue_for_rx() is called.
>>   	 */
>>   
>> -	qi->queue_depth = packet_count;
>> +	atomic_add(packet_count, &qi->queue_depth);
> and that also means my earlier analysis was just confused - there's no
> need for _add here, _set is fine - in fact even non-atomic would be
> fine.
>
>> @@ -1176,6 +1168,7 @@ static int vector_poll(struct napi_struct *napi, int budget)
>>   
>>   	if ((vp->options & VECTOR_TX) != 0)
>>   		tx_enqueued = (vector_send(vp->tx_queue) > 0);
>> +	spin_lock(&vp->rx_queue->head_lock);
>>   	if ((vp->options & VECTOR_RX) > 0)
>>   		err = vector_mmsg_rx(vp, budget);
>>   	else {
>> @@ -1183,6 +1176,7 @@ static int vector_poll(struct napi_struct *napi, int budget)
>>   		if (err > 0)
>>   			err = 1;
>>   	}
>> +	spin_unlock(&vp->rx_queue->head_lock);
> that's needed for ... stats?

Yes, unfortunately.

I am reusing the lock to guard the stats, and those may be accessed in
parallel :(
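A rough userspace sketch of why the stats still need the lock even though the queue itself does not (hypothetical names; a pthread mutex stands in for the kernel spinlock): the counter update is a read-modify-write, so two contexts bumping it concurrently must be serialized.

```c
#include <assert.h>
#include <pthread.h>

/* Illustrative only: stats touched from more than one context need
 * serialization. Field names are made up, not the driver's. */
struct rx_stats {
	unsigned long rx_packets;
	pthread_mutex_t lock;
};

static void stats_add(struct rx_stats *s, unsigned long n)
{
	pthread_mutex_lock(&s->lock);
	s->rx_packets += n;	/* read-modify-write: not atomic on its own */
	pthread_mutex_unlock(&s->lock);
}

/* One "context" bumping the counter many times */
static void *bump(void *arg)
{
	struct rx_stats *s = arg;

	for (int i = 0; i < 100000; i++)
		stats_add(s, 1);
	return NULL;
}
```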

>
> johannes
>
>

-- 
Anton R. Ivanov
https://www.kot-begemot.co.uk/
