[LSF/MM/BPF TOPIC] Improving Zoned Storage Support
Jens Axboe
axboe at kernel.dk
Wed Jan 17 13:02:51 PST 2024
On 1/17/24 1:20 PM, Jens Axboe wrote:
> On 1/17/24 1:18 PM, Bart Van Assche wrote:
>> On 1/17/24 12:06, Jens Axboe wrote:
>>> Case in point, I spent 10 min hacking up some smarts on the insertion
>>> and dispatch side, and then we get:
>>>
>>> IOPS=2.54M, BW=1240MiB/s, IOS/call=32/32
>>>
>>> or about a 63% improvement when running the _exact same thing_. Looking
>>> at profiles:
>>>
>>> - 13.71% io_uring [kernel.kallsyms] [k] queued_spin_lock_slowpath
>>>
>>> reducing the > 70% of locking contention down to ~14%. No change in data
>>> structures, just an ugly hack that:
>>>
>>> - Serializes dispatch, no point having someone hammer on dd->lock for
>>> dispatch when already running
>>> - Serialize insertions, punt to one of N buckets if insertion is already
>>> busy. Current insertion will notice someone else did that, and will
>>> prune the buckets and re-run insertion.
>>>
>>> And while I seriously doubt that my quick hack is 100% fool proof, it
>>> works as a proof of concept. If we can get that kind of reduction with
>>> minimal effort, well...
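To make the bucketed-insertion idea quoted above a bit more concrete, here
is a rough userspace sketch of it using pthreads instead of kernel
spinlocks. It is an illustration only, not the actual hack: the names
(struct sched_queue, NR_BUCKETS, the thread-id bucket pick) are made up,
and the "notice and re-run insertion" step is simplified to "whoever next
grabs the main lock drains the buckets".

#include <pthread.h>

#define NR_BUCKETS	8

struct req {
	struct req *next;
	int data;
};

struct bucket {
	pthread_mutex_t lock;
	struct req *head;
};

struct sched_queue {
	pthread_mutex_t lock;		/* analogue of the contended dd->lock */
	struct req *head;		/* stand-in for the real sort/fifo lists */
	struct bucket buckets[NR_BUCKETS];
};

static void sched_queue_init(struct sched_queue *q)
{
	pthread_mutex_init(&q->lock, NULL);
	q->head = NULL;
	for (int i = 0; i < NR_BUCKETS; i++) {
		pthread_mutex_init(&q->buckets[i].lock, NULL);
		q->buckets[i].head = NULL;
	}
}

/*
 * Splice everything punted to the side buckets onto the main list.
 * Caller holds q->lock.
 */
static void drain_buckets(struct sched_queue *q)
{
	for (int i = 0; i < NR_BUCKETS; i++) {
		pthread_mutex_lock(&q->buckets[i].lock);
		struct req *r = q->buckets[i].head;
		q->buckets[i].head = NULL;
		pthread_mutex_unlock(&q->buckets[i].lock);

		while (r) {
			struct req *next = r->next;
			r->next = q->head;
			q->head = r;
			r = next;
		}
	}
}

/*
 * If the main lock is free, take it, drain any punted requests and do the
 * insertion. If it is busy, don't pile onto it: drop the request into one
 * of the per-bucket lists and let the current lock holder (or the next
 * insertion) pick it up.
 */
static void insert_request(struct sched_queue *q, struct req *r,
			   unsigned int tid)
{
	if (pthread_mutex_trylock(&q->lock) == 0) {
		drain_buckets(q);
		r->next = q->head;
		q->head = r;
		pthread_mutex_unlock(&q->lock);
		return;
	}

	struct bucket *b = &q->buckets[tid % NR_BUCKETS];

	pthread_mutex_lock(&b->lock);
	r->next = b->head;
	b->head = r;
	pthread_mutex_unlock(&b->lock);
}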
>>
>> If nobody else beats me to it then I will look into using separate
>> locks in the mq-deadline scheduler for insertion and dispatch.
>
> That's not going to help by itself, as most of the contention (as I
> showed in the profile trace in the email) is from dispatch competing
> with itself, and not necessarily dispatch competing with insertion. And
> not sure how that would even work, as insert and dispatch are working on
> the same structures.
>
> Do some proper analysis first, then that will show you where the problem
> is.
Here's a quick'n dirty that brings it from 1.56M to:

IOPS=3.50M, BW=1711MiB/s, IOS/call=32/32

by just doing something stupid: if someone is already dispatching, then
don't dispatch anything. That clearly shows this is just dispatch
contention. But a ~124% improvement from looking at the initial profile I
sent and hacking up something stupid in a few minutes does show that
there's a ton of low hanging fruit here.
This is run on nvme, so there are going to be lots of hardware queues.
This may even be worth solving in blk-mq rather than trying to hack around
it in the scheduler, as blk-mq has no idea that mq-deadline is serializing
all hardware queues like this. Or we just solve it in the IO scheduler,
since that's the side with the knowledge.
diff --git a/block/mq-deadline.c b/block/mq-deadline.c
index f958e79277b8..822337521fc5 100644
--- a/block/mq-deadline.c
+++ b/block/mq-deadline.c
@@ -80,6 +80,11 @@ struct dd_per_prio {
 };
 
 struct deadline_data {
+	spinlock_t lock;
+	spinlock_t zone_lock ____cacheline_aligned_in_smp;
+
+	unsigned long dispatch_state;
+
 	/*
 	 * run time data
 	 */
@@ -100,9 +105,6 @@ struct deadline_data {
 	int front_merges;
 	u32 async_depth;
 	int prio_aging_expire;
-
-	spinlock_t lock;
-	spinlock_t zone_lock;
 };
 
 /* Maps an I/O priority class to a deadline scheduler priority. */
@@ -600,6 +602,10 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	struct request *rq;
 	enum dd_prio prio;
 
+	if (test_bit(0, &dd->dispatch_state) ||
+	    test_and_set_bit(0, &dd->dispatch_state))
+		return NULL;
+
 	spin_lock(&dd->lock);
 	rq = dd_dispatch_prio_aged_requests(dd, now);
 	if (rq)
@@ -616,6 +622,7 @@ static struct request *dd_dispatch_request(struct blk_mq_hw_ctx *hctx)
 	}
 
 unlock:
+	clear_bit(0, &dd->dispatch_state);
 	spin_unlock(&dd->lock);
 
 	return rq;
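
The gate added in the last two hunks is the usual "test, then
test-and-set" pattern. As a rough userspace analogue, using C11 atomics
instead of the kernel bitops (all names here are made up for
illustration, this is not kernel code):

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool dispatch_busy;

/* Stand-in for the real dispatch work done under dd->lock. */
static void do_dispatch(void)
{
}

static bool try_dispatch(void)
{
	/*
	 * Plain read first, so contended callers don't keep bouncing the
	 * cacheline with atomic RMW ops (mirrors test_bit()).
	 */
	if (atomic_load_explicit(&dispatch_busy, memory_order_relaxed))
		return false;

	/*
	 * Atomically claim the dispatcher role (mirrors test_and_set_bit());
	 * if someone beat us to it, bail out instead of piling onto the lock.
	 */
	if (atomic_exchange_explicit(&dispatch_busy, true,
				     memory_order_acquire))
		return false;

	do_dispatch();

	/* Mirrors the clear_bit() at the unlock label. */
	atomic_store_explicit(&dispatch_busy, false, memory_order_release);
	return true;
}

One caveat, and part of why this is only a proof of concept: a caller that
bails out does not retry, so work queued right before the current owner
clears the flag may have to wait for the next queue run.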
--
Jens Axboe