[RFC 1/7] swiotlb: Introduce swiotlb throttling

Tue Aug 27 23:14:54 PDT 2024

From: Petr Tesařík <petr at tesarici.cz> Sent: Tuesday, August 27, 2024 10:16 PM
> 
> On Tue, 27 Aug 2024 17:30:59 +0000
> Michael Kelley <mhklinux at outlook.com> wrote:
> 
> > From: Petr Tesařík <petr at tesarici.cz> Sent: Tuesday, August 27, 2024 8:56 AM
> > >
> > > On Fri, 23 Aug 2024 20:41:15 +0000
> > > Michael Kelley <mhklinux at outlook.com> wrote:
> > >
> > > > From: Petr Tesařík <petr at tesarici.cz> Sent: Friday, August 23, 2024 12:41 AM
> > > > >
> > > > > On Thu, 22 Aug 2024 11:37:12 -0700
> > > > > mhkelley58 at gmail.com wrote:
> > > >[...]
> > > > > > @@ -71,12 +72,15 @@
> > > > > >   *		from each index.
> > > > > >   * @pad_slots:	Number of preceding padding slots. Valid only in the first
> > > > > >   *		allocated non-padding slot.
> > > > > > + * @throttled:  Boolean indicating the slot is used by a request that was
> > > > > > + *		throttled. Valid only in the first allocated non-padding slot.
> > > > > >   */
> > > > > >  struct io_tlb_slot {
> > > > > >  	phys_addr_t orig_addr;
> > > > > >  	size_t alloc_size;
> > > > > >  	unsigned short list;
> > > > > > -	unsigned short pad_slots;
> > > > > > +	u8 pad_slots;
> > > > > > +	u8 throttled;
> > > > >
> > > > > I'm not sure this flag is needed for each slot.
> > > > >
> > > > > SWIOTLB mappings should be throttled when the total SWIOTLB usage is
> > > > > above a threshold. Conversely, it can be unthrottled when the total
> > > > > usage goes below a threshold, and it should not matter if that happens
> > > > > due to an unmap of the exact buffer which previously pushed the usage
> > > > > over the edge, or due to an unmap of any other unrelated buffer.
> > > >
> > > > I think I understand what you are proposing. But I don't see a way
> > > > to make it work without adding global synchronization beyond
> > > > the current atomic counter for the number of uI'm sed slabs. At a minimum
> > > > we would need a global spin lock instead of the atomic counter. The spin
> > > > lock would protect the (non-atomic) slab count along with some other
> > > > accounting, and that's more global references. As described in the
> > > > cover letter, I was trying to avoid doing that.
> > >
> > > I have thought about this for a few days. And I'm still not convinced.
> > > You have made it clear in multiple places that the threshold is a soft
> > > limit, and there are many ways the SWIOTLB utilization may exceed the
> > > threshold. In fact I'm not even 100% sure that an atomic counter is
> > > needed, because the check is racy anyway.
> >
> > Atomic operations are expensive at the memory bus level, particularly
> > in high CPU count systems with NUMA topologies. However,
> 
> Sure, the CPU must ensure exclusive access to the underlying memory and
> cache coherency across all CPUs. I know how these things work...
> 
> > maintaining an imprecise global count doesn't work because the
> > divergence from reality can become unbounded over time. The
> > alternative is to sum up all the per-area counters each time a
> > reasonably good global value is needed, and that can be expensive itself
> > with high area counts. A hybrid might be to maintain an imprecise global
> > count, but periodically update it by summing up all the per-area counters
> > so that the divergence from reality isn't unbounded.
> 
> Yes, this is what I had in mind, but I'm not sure which option is
> worse. Let me run a micro-benchmark on a 192-core AmpereOne system.
> 
> > > Another task may increase
> > > (or decrease) the counter between atomic_long_read(&mem->total_used)
> > > and a subsequent down(&mem->throttle_sem).
> > >
> > > I consider it a feature, not a flaw, because the real important checks
> > > happen later while searching for free slots, and those are protected
> > > with a spinlock.
> > >
> > > > If you can see how to do what you propose with just the current
> > > > atomic counter, please describe.
> > >
> > > I think I'm certainly missing something obvious, but let me open the
> > > discussion to improve my understanding of the matter.
> > >
> > > Suppose we don't protect the slab count with anything. What is the
> > > worst possible outcome? IIUC the worst scenario is that multiple tasks
> > > unmap swiotlb buffers simultaneously and all of them believe that their
> > > action made the total usage go below the low threshold, so all of them
> > > try to release the semaphore.
> > >
> > > That's obviously not good, but AFAICS all that's needed is a
> > > test_and_clear_bit() on a per-io_tlb_mem throttled flag just before
> > > calling up(). Since up() would acquire the semaphore's spinlock, and
> > > there's only one semaphore per io_tlb_mem, adding an atomic flag doesn't
> > > look like too much overhead to me, especially if it ends up in the same
> > > cache line as the semaphore.
> >
> > Yes, the semaphore management is the problem. Presumably we want
> > each throttled request to wait on the semaphore, forming an ordered
> > queue of waiters. Each up() on the semaphore releases one of those
> > waiters. We don’t want to release all the waiters when the slab count
> > transitions from "above throttle" to "below throttle" because that
> > creates a thundering herd problem.
> >
> > So consider this example scenario:
> > 1) Two waiters ("A" and "B") are queued the semaphore, each wanting 2 slabs.
> > 2) An unrelated swiotlb unmap frees 10 slabs, taking the slab count
> > from 2 above threshold to 8 below threshold. This does up() on
> > the semaphore and awakens "A".
> > 3) "A" does his request for 2 slabs, and the slab count is now 6 below
> > threshold.
> > 4) "A" does swiotlb unmap.  The slab count goes from 6 below threshold back
> > to 8 below threshold, so no semaphore operation is done. "B" is still waiting.
> > 5) System-wide, swiotlb requests decline, and the slab count never goes above
> > the threshold again. At this point, "B" is still waiting and never gets awakened.
> >
> > An ordered queue of waiters is incompatible with wakeups determined solely
> > on whether the slab count is below the threshold after swiotlb unmap. You
> > would have to wait up all waiters and let them re-contend for the slots that
> > are available below the threshold, with most probably losing out and going
> > back on the semaphore wait queue (i.e., a thundering herd).
> 
> Ah, right, the semaphore must be released as many times as it is
> acquired. Thank you for your patience.
> 
> > Separately, what does a swiotlb unmap do if it takes the slab count from above
> > threshold to below threshold, and there are no waiters? It should not do up()
> > in that case, but how can it make that decision in a way that doesn't race
> > with a swiotlb map operation running at the same time?
> 
> Hm, this confirms my gut feeling that the atomic counter alone would
> not be sufficient.
> 
> I think I can follow your reasoning now:
> 
> 1. Kernels which enable CONFIG_SWIOTLB_THROTTLE are likely to have
>    CONFIG_DEBUG_FS as well, so the price for an atomic operation on
>    total_used is already paid.

I'm unsure if that is true. But my thinking that the atomic total_used is
needed by throttling may have been faulty.  Certainly, if CONFIG_DEBUG_FS
is set, then the cost is already paid. But if not, CONFIG_SWIOTLB_THROTTLE
in my current code adds the atomic total_used cost for *all* swiotlb map
and unmap requests. But the cost of a computed-on-the-fly value (by
summing across all areas) would be paid only by MAY_BLOCK map
requests (and not on unmap), so that decreases the overall cost. And I
had not thought of the hybrid approach until I wrote my previous
response to you. Both seem worth further thinking/investigation.

> 2. There are no pre-existing per-io_tlb_mem ordering constraints on
>    unmap, except the used counter, which is insufficient.

Agreed.

> 3. Slot data is already protected by its area spinlock, so adding
>    something there does not increase the price.

Agreed.

> 
> I don't have an immediate idea, but I still believe we can do better.
> For one thing, your scheme is susceptible to excessive throttling in
> degenerate cases, e.g.:
> 
> 1. A spike in network traffic temporarily increases swiotlb usage above
>    the threshold, but it is not throttled because the network driver
>    does not use SWIOTLB_ATTR_MAY_BLOCK.
> 2. A slow disk "Snail" maps a buffer and acquires the semaphore.
> 3. A fast disk "Cheetah" tries to map a buffer and goes on the
>    semaphore wait queue.
> 4. Network buffers are unmapped, dropping usage below the threshold,
>    but since the throttle flag was not set, the semaphore is not
>    touched.
> 5. "Cheetah" is unnecessarily waiting for "Snail" to finish.
> 
> You may have never hit this scenario in your testing, because you
> presumably had only fast virtual block devices.

My approach was to explicitly not worry about this scenario. :-)  I
stated in the patch set cover letter that throttled requests are
serialized (though maybe not clearly enough). And if a workload
regularly runs above the threshold, the size of the swiotlb memory
should probably be increased. I'm open to an approach that does
better than serialization of throttled requests if it doesn't get
too complicated, but I think it's of secondary importance.

> 
> I'm currently thinking along the lines of waking up the semaphore
> on unmap whenever current usage is above the threshold and there is a
> waiter.
> 
> As a side note, I get your concerns about the thundering herd effect,
> but keep in mind that bounce buffers are not necessarily equal. If four
> devices are blocked on mapping a single slot, you can actually wake up
> all of them after you release four slots. 

Agreed. But the accounting to do that correctly probably requires
a spin lock, and I didn't want to go there.

> For SG lists, you even add
> explicit logic to trigger the wakeup only on the last segment...

Yes. I'm my thinking, that's just part of the serialization of throttled
requests. Throttled request "A", which used an SGL, shouldn't release
the semaphore and hand off ownership to request "B" until all
the swiotlb memory allocated by "A"s SGL has been released.

> 
> BTW as we talk about the semaphore queue, it reminds me of an issue I
> had with your proposed patch:
> 
> > @@ -1398,6 +1431,32 @@ phys_addr_t swiotlb_tbl_map_single(struct device *dev, phys_addr_t orig_addr,
> >  	dev_WARN_ONCE(dev, alloc_align_mask > ~PAGE_MASK,
> >  		"Alloc alignment may prevent fulfilling requests with max mapping_size\n");
> >
> > +	if (IS_ENABLED(CONFIG_SWIOTLB_THROTTLE) && attrs & DMA_ATTR_MAY_BLOCK) {
> > +		unsigned long used = atomic_long_read(&mem->total_used);
> > +
> > +		/*
> > +		 * Determining whether to throttle is intentionally done without
> > +		 * atomicity. For example, multiple requests could proceed in
> > +		 * parallel when usage is just under the threshold, putting
> > +		 * usage above the threshold by the aggregate size of the
> > +		 * parallel requests. The thresholds must already be set
> > +		 * conservatively because of drivers that can't enable
> > +		 * throttling, so this slop in the accounting shouldn't be
> > +		 * problem. It's better than the potential bottleneck of a
> > +		 * globally synchronzied reservation mechanism.
> > +		 */
> > +		if (used > mem->high_throttle) {
> > +			throttle = true;
> > +			mem->high_throttle_count++;
> > +		} else if ((used > mem->low_throttle) &&
> > +					(mem->throttle_sem.count <= 0)) {
>                                               ^^^^^^^^^^^^^^^^^^
> 
> Is it safe to access the semaphore count like this without taking the
> semaphore spinlock? If it is, then it deserves a comment to explain why
> you can ignore this comment in include/linux/semaphore.h:
> 
> /* Please don't access any members of this structure directly */
> 
> Petr T

Yes, this is a bit of a hack for the RFC patch set. The semaphore code
doesn't offer an API to find out if a semaphore is held. In my mind, the
right solution is to add a semaphore API to get the current "count"
of the semaphore (or maybe just a boolean indicating if it is held),
and then use that API. I would add the API as this patch set goes
from RFC to PATCH status. (Mutex's have such an API.)

The API would provide only an instantaneous value, and in the absence
of any higher-level synchronization, the value could change immediately
after it is read. But that's OK in the swiotlb throttling use case because
the throttling tolerates "errors" due to such a change. The
implementation of the API doesn't need to obtain the semaphore spin
lock as long as the read of the count field is atomic (i.e., doesn't tear),
which it should be.

Michael

> 
> > +			throttle = true;
> > +			mem->low_throttle_count++;
> > +		}
> > +		if (throttle)
> > +			down(&mem->throttle_sem);
> > +	}
> > +
> >  	offset = swiotlb_align_offset(dev, alloc_align_mask, orig_addr);
> >  	size = ALIGN(mapping_size + offset, alloc_align_mask + 1);
> >  	index = swiotlb_find_slots(dev, orig_addr, size, alloc_align_mask, &pool);