dm-multipath low performance with blk-mq
Mike Snitzer
snitzer at redhat.com
Fri Jan 29 15:35:05 PST 2016
On Wed, Jan 27 2016 at 12:56pm -0500,
Sagi Grimberg <sagig at dev.mellanox.co.il> wrote:
>
>
> On 27/01/2016 19:48, Mike Snitzer wrote:
> >On Wed, Jan 27 2016 at 6:14am -0500,
> >Sagi Grimberg <sagig at dev.mellanox.co.il> wrote:
> >
> >>
> >>>>I don't think this is going to help __multipath_map() without some
> >>>>configuration changes. Now that we're running on already merged
> >>>>requests instead of bios, the m->repeat_count is almost always set to 1,
> >>>>so we call the path_selector every time, which means that we'll always
> >>>>need the write lock. Bumping up the number of IOs we send before calling
> >>>>the path selector again will give this patch a chance to do some good
> >>>>here.
> >>>>
> >>>>To do that you need to set:
> >>>>
> >>>> rr_min_io_rq <something_bigger_than_one>
> >>>>
> >>>>in the defaults section of /etc/multipath.conf and then reload the
> >>>>multipathd service.
> >>>>
> >>>>The patch should hopefully help in multipath_busy() regardless of
> >>>>the rr_min_io_rq setting.
> >>>
> >>>This patch, while generic, is meant to help the blk-mq case. A blk-mq
> >>>request_queue doesn't have an elevator so the requests will not have
> >>>seen merging.
> >>>
> >>>But yes, implied in the patch is the requirement to increase
> >>>m->repeat_count via multipathd's rr_min_io_rq (I'll backfill a proper
> >>>header once it is tested).
> >>
> >>I'll test it once I get some spare time (hopefully soon...)
> >
> >OK thanks.
> >
> >BTW, I _cannot_ get null_blk to come even close to your reported 1500K+
> >IOPs on 2 "fast" systems I have access to. Which arguments are you
> >loading the null_blk module with?
> >
> >I've been using:
> >modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=2 submit_queues=12
>
> $ for f in /sys/module/null_blk/parameters/*; do echo $f; cat $f; done
> /sys/module/null_blk/parameters/bs
> 512
> /sys/module/null_blk/parameters/completion_nsec
> 10000
> /sys/module/null_blk/parameters/gb
> 250
> /sys/module/null_blk/parameters/home_node
> -1
> /sys/module/null_blk/parameters/hw_queue_depth
> 64
> /sys/module/null_blk/parameters/irqmode
> 1
> /sys/module/null_blk/parameters/nr_devices
> 2
> /sys/module/null_blk/parameters/queue_mode
> 2
> /sys/module/null_blk/parameters/submit_queues
> 24
> /sys/module/null_blk/parameters/use_lightnvm
> N
> /sys/module/null_blk/parameters/use_per_node_hctx
> N
>
> $ fio --group_reporting --rw=randread --bs=4k --numjobs=24
> --iodepth=32 --runtime=99999999 --time_based --loops=1
> --ioengine=libaio --direct=1 --invalidate=1 --randrepeat=1
> --norandommap --exitall --name task_nullb0 --filename=/dev/nullb0
> task_nullb0: (g=0): rw=randread, bs=4K-4K/4K-4K/4K-4K,
> ioengine=libaio, iodepth=32
> ...
> fio-2.1.10
> Starting 24 processes
> Jobs: 24 (f=24): [rrrrrrrrrrrrrrrrrrrrrrrr] [0.0% done]
> [7234MB/0KB/0KB /s] [1852K/0/0 iops] [eta 1157d:09h:46m:22s]
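To make the rr_min_io_rq suggestion above concrete, the
/etc/multipath.conf change is a defaults-section entry like the
following (100 is only an example value, tune it for your workload),
followed by reloading multipathd:

defaults {
        rr_min_io_rq 100
}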
Your test above is prone to exhaust the dm-mpath blk-mq tags (128)
because 24 threads * an iodepth of 32 easily exceeds 128 (by a factor
of 6).
I found that we were context switching (via bt_get's io_schedule)
waiting for tags to become available.
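The arithmetic of the oversubscription, spelled out:

$ echo "$((24 * 32)) in-flight requests vs 128 tags (factor $((24 * 32 / 128)))"
768 in-flight requests vs 128 tags (factor 6)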
This is embarrassing but, until Jens told me today, I was oblivious to
the fact that the number of blk-mq tags per hw_queue is defined by
tag_set.queue_depth.
Previously request-based DM's blk-mq support had:
md->tag_set.queue_depth = BLKDEV_MAX_RQ; (again: 128)
Now I have a patch that allows tuning queue_depth via a dm_mod module
parameter, and I'll likely bump the default to 4096 or so (doing so
eliminated the blocking in bt_get).
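Once that lands, testing it would look something like the following
(the parameter name below is a placeholder until the patch is posted):

# placeholder name; the final patch may spell the parameter differently
$ modprobe -r dm_mod
$ modprobe dm_mod dm_mq_queue_depth=4096
$ cat /sys/module/dm_mod/parameters/dm_mq_queue_depth
4096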
But eliminating the tags bottleneck only raised my read IOPs from ~600K
to ~800K (using 1 hw_queue for both null_blk and dm-mpath).
When I raise nr_hw_queues to 4 for null_blk (keeping dm-mq at 1) I see a
whole lot more context switching due to request-based DM's use of
ksoftirqd (and kworkers) for request completion.
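For anyone wanting to reproduce that comparison: reload null_blk with 4
submit queues, leave dm-mq at 1, rerun the fio job against the mpath
device, and watch the completion-side threads.  pidstat is just one way
to see the switch rates:

$ modprobe -r null_blk
$ modprobe null_blk gb=4 bs=4096 nr_devices=1 queue_mode=2 submit_queues=4
$ pidstat -wt 1 | grep -E 'ksoftirqd|kworker'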
So I'm moving on to optimizing the completion path. But at least some
progress was made, more to come...
Mike