[LSF/MM/BPF TOPIC] Improving Zoned Storage Support
Jens Axboe
axboe at kernel.dk
Wed Jan 17 13:14:42 PST 2024
On 1/17/24 2:02 PM, Jens Axboe wrote:
> On 1/17/24 1:20 PM, Jens Axboe wrote:
>> On 1/17/24 1:18 PM, Bart Van Assche wrote:
>>> On 1/17/24 12:06, Jens Axboe wrote:
>>>> Case in point, I spent 10 min hacking up some smarts on the insertion
>>>> and dispatch side, and then we get:
>>>>
>>>> IOPS=2.54M, BW=1240MiB/s, IOS/call=32/32
>>>>
>>>> or about a 63% improvement when running the _exact same thing_. Looking
>>>> at profiles:
>>>>
>>>> - 13.71% io_uring [kernel.kallsyms] [k] queued_spin_lock_slowpath
>>>>
>>>> reducing the > 70% of locking contention down to ~14%. No change in data
>>>> structures, just an ugly hack that:
>>>>
>>>> - Serializes dispatch, no point having someone hammer on dd->lock for
>>>> dispatch when already running
>>>> - Serialize insertions, punt to one of N buckets if insertion is already
>>>> busy. Current insertion will notice someone else did that, and will
>>>> prune the buckets and re-run insertion.
>>>>
>>>> And while I seriously doubt that my quick hack is 100% foolproof, it
>>>> works as a proof of concept. If we can get that kind of reduction with
>>>> minimal effort, well...
>>>
>>> If nobody else beats me to it then I will look into using separate
>>> locks in the mq-deadline scheduler for insertion and dispatch.
>>
>> That's not going to help by itself, as most of the contention (as I
>> showed in the profile trace in the email) is from dispatch competing
>> with itself, and not necessarily dispatch competing with insertion. And
>> not sure how that would even work, as insert and dispatch are working on
>> the same structures.
>>
>> Do some proper analysis first, then that will show you where the problem
>> is.
>
> Here's a quick'n dirty that brings it from 1.56M to:
>
> IOPS=3.50M, BW=1711MiB/s, IOS/call=32/32
>
> by just doing something stupid - if someone is already dispatching, then
> don't dispatch anything. Clearly shows that this is just dispatch
> contention. But a 160% improvement from looking at the initial profile I

Make that 2.24x (a 124% improvement), not 160% - not sure where that
math came from...
Anyway, just replying as I sent out the wrong patch. Here's the one I
tested.
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f958e79277b8..133ab4a2673b 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -80,6 +80,13 @@ struct dd_per_prio {
 };
 
 struct deadline_data {
+	struct {
+		spinlock_t lock;
+		spinlock_t zone_lock;
+	} ____cacheline_aligned_in_smp;
+
+	unsigned long dispatch_state;
+
 	/*
 	 * run time data
 	 */
@@ -100,9 +107,6 @@ struct deadline_data {
 	int front_merges;
 	u32 async_depth;
 	int prio_aging_expire;
-
-	spinlock_t lock;
-	spinlock_t zone_lock;
 };
 
 /* Maps an I/O priority class to a deadline scheduler priority. */
@@ -600,6 +604,10 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	struct request *rq;
 	enum dd_prio prio;
 
+	if (test_bit(0, &dd->dispatch_state) ||
+	    test_and_set_bit(0, &dd->dispatch_state))
+		return NULL;
+
 	spin_lock(&dd->lock);
 	rq = dd_dispatch_prio_aged_requests(dd, now);
 	if (rq)
@@ -616,6 +624,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	}
 
 unlock:
+	clear_bit(0, &dd->dispatch_state);
 	spin_unlock(&dd->lock);
 
 	return rq;
--
Jens Axboe