[PATCH 1/3] block: introduce blk_queue_nr_active()

Uday Shankar ushankar at purestorage.com
Tue Oct 3 13:11:02 PDT 2023


Hi Ewan,

> For NVMe/TCP, at least on the system I used that has 80 cores, there
> were only 1 hctxs

This sounds like a target limitation and/or an effect of the command
used to create the connection. The host driver will respect any limits
imposed by the target or the command creating the connection, but will
otherwise prefer to create one I/O queue per CPU (see nvmf_nr_io_queues,
nvme_set_queue_count, and nvme_fc_create_io_queues). This is also the
behavior suggested by the spec (section 2.1):

> For example, on a four core processor based system, there may be a
> queue pair per core to avoid locking and ensure data structures are
> created in the appropriate processor core’s cache.

Targets aiming to provide high performance should therefore allow the
host to create more than one I/O queue. This is the case for the
in-kernel target (whose default queue count limit is NVMET_NR_QUEUES,
i.e. 128) as well as Pure's FlashArray.
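
Roughly, the host/target interplay works out to something like the
following (a simplified sketch with made-up names, not the kernel's
verbatim code; the real logic lives in nvmf_nr_io_queues(),
nvme_set_queue_count(), and the transports' create_io_queues helpers):

static unsigned int sketch_nr_io_queues(unsigned int opt_nr_io_queues,
					unsigned int target_granted)
{
	/* Host side: never ask for more I/O queues than online CPUs. */
	unsigned int wanted = min(opt_nr_io_queues, num_online_cpus());

	/*
	 * The target answers Set Features (Number of Queues) with what
	 * it is willing to grant, and the host respects that.
	 */
	return min(wanted, target_granted);
}

Here opt_nr_io_queues stands in for whatever the connect command
requested (e.g. nvme connect --nr-io-queues=N), and target_granted for
the target's limit.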

Testing against such a multi-queue target shows that the approach of
looping through all hctxs and summing nr_active on every I/O comes with
a very significant latency hit. I ran fio benchmarks (jobfile attached)
against the in-kernel NVMe-TCP target, whose generous limit of 128
allows the host to create one queue per CPU, for 64 queues total on my
test system. I used a kernel built with two queue-depth path selectors
(patch against Linux 6.5 attached) - one is the approach you posted, and
the other is the atomics-based one that Thomas wrote (with a small
change so that path selectors not using the atomic do not have to pay
its cost). Here are the submission latency statistics with the
looping-over-hctx approach (the full fio output is also attached):

    slat (usec): min=70, max=598, avg=78.35, stdev= 7.76

And with atomics:

    slat (usec): min=24, max=491, avg=32.51, stdev= 6.47

Note the ~46us difference in average submission latency. I used ftrace
to measure the time spent in the path selector functions, and the
results confirm that essentially all of this additional latency comes
from the path selector. Average time per call, in microseconds, with
the hctx-looping approach:

# grep 'nvme_mpath_start_request' ftrace_queue-depth-hctx | grep us | tr -s ' ' | cut -d ' ' -f3 | jq -s add/length
0.20338877564886418
# grep 'nvme_mpath_end_request' ftrace_queue-depth-hctx | grep us | tr -s ' ' | cut -d ' ' -f3 | jq -s add/length
0.5282572360375025
# grep 'nvme_find_path' ftrace_queue-depth-hctx | grep us | tr -s ' ' | cut -d ' ' -f3 | jq -s add/length
46.14676273950285

and with the atomics approach:

# grep 'nvme_mpath_start_request' ftrace_queue-depth-atomic | grep us | tr -s ' ' | cut -d ' ' -f3 | jq -s add/length
0.20764315633022504
# grep 'nvme_mpath_end_request' ftrace_queue-depth-atomic | grep us | tr -s ' ' | cut -d ' ' -f3 | jq -s add/length
0.5173406296446426
# grep 'nvme_find_path' ftrace_queue-depth-atomic | grep us | tr -s ' ' | cut -d ' ' -f3 | jq -s add/length
0.4562795568166467

I don't know if this is due to the "using xarray on 'small' arrays is
horrible for performance" issue that Hannes mentioned. Maybe reverting
that patch would help things, but I still prefer the atomics approach
for its simplicity, and because the data does not indicate that the two
new RMW ops per I/O are a source of issues. If contention on the atomic
is still considered a problem, we can "split" it into pieces along:

- namespace boundaries, so that the atomic lives in struct nvme_ns
  instead of struct nvme_ctrl. If different CPUs are doing I/O to
  different namespaces (which may be a common access pattern), this
  reduces contention on the atomic. It would give us pretty much a 1:1
  translation of the queue-length path selector from dm (rough sketch
  below).
- hctx boundaries - when we calculate the queue depth, instead of
  summing the depths of every path's hctxs, only consider the hctx
  associated with the local CPU.
- CPU boundaries - make the queue depth variable fully per-CPU. This
  should be equivalent to the previous approach in the case where the
  host can create one hctx per CPU.

Of course, all of these come with the disadvantage of sharing less
information about path quality. But that's what it boils down to -
sharing of rapidly-changing information between CPUs is expensive.
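
For concreteness, here is a rough, untested sketch of the
namespace-boundary variant, written against the attached patch. The
helper name and the new nr_active member of struct nvme_ns are
hypothetical; the selector body mirrors nvme_queue_depth_atomic_path()
from the patch below:

/*
 * Sketch only: struct nvme_ns would gain an "atomic_t nr_active" member
 * (zeroed at namespace allocation), and nvme_mpath_start_request() /
 * nvme_mpath_end_request() would atomic_inc()/atomic_dec() that member
 * instead of ns->ctrl->nr_active. The selector then reads the per-path
 * counter:
 */
static struct nvme_ns *nvme_queue_depth_ns_path(struct nvme_ns_head *head)
{
	struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
	unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
	unsigned int depth;

	list_for_each_entry_rcu(ns, &head->list, siblings) {
		if (nvme_path_is_disabled(ns))
			continue;

		/* per-namespace counter instead of the per-ctrl one */
		depth = atomic_read(&ns->nr_active);

		if (ns->ana_state == NVME_ANA_OPTIMIZED &&
		    depth < min_depth_opt) {
			min_depth_opt = depth;
			best_opt = ns;
		}

		if (ns->ana_state == NVME_ANA_NONOPTIMIZED &&
		    depth < min_depth_nonopt) {
			min_depth_nonopt = depth;
			best_nonopt = ns;
		}
	}

	return best_opt ? best_opt : best_nonopt;
}

The hctx and per-CPU variants would change where the counters live and
how "depth" is obtained inside the loop, but the selection logic would
stay the same.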

We can also consider "standard" optimizations around the atomic(s), like
ensuring each one gets its own cache line.
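
A minimal sketch of that kind of padding, assuming the counter stays in
struct nvme_ctrl (the same applies wherever the counter ends up living):

/*
 * Sketch only: in struct nvme_ctrl, ____cacheline_aligned_in_smp (from
 * <linux/cache.h>) aligns the member to a cache line boundary.
 * Neighboring fields would still need their own alignment or padding
 * to stay off that line.
 */
	atomic_t nr_active ____cacheline_aligned_in_smp;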

---

fio jobfile:

[global]
ioengine=libaio
randrepeat=1
direct=1
group_reporting
blocksize=512k
iodepth=1
numjobs=1
runtime=15
time_based
rw=randread

[JOB1]
cpus_allowed=1
filename=/dev/nvme4n1

full fio results with hctx-looping approach:

JOB1: (g=0): rw=randread, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=libaio, iodepth=1
fio-3.19
Starting 1 process

JOB1: (groupid=0, jobs=1): err= 0: pid=185197: Mon Oct  2 15:22:39 2023
  read: IOPS=490, BW=245MiB/s (257MB/s)(3680MiB/15001msec)
    slat (usec): min=70, max=598, avg=78.35, stdev= 7.76
    clat (usec): min=1662, max=2743, avg=1958.35, stdev=135.18
     lat (usec): min=1739, max=2901, avg=2036.92, stdev=135.45
    clat percentiles (usec):
     |  1.00th=[ 1729],  5.00th=[ 1778], 10.00th=[ 1811], 20.00th=[ 1844],
     | 30.00th=[ 1876], 40.00th=[ 1909], 50.00th=[ 1942], 60.00th=[ 1975],
     | 70.00th=[ 2008], 80.00th=[ 2057], 90.00th=[ 2147], 95.00th=[ 2212],
     | 99.00th=[ 2376], 99.50th=[ 2442], 99.90th=[ 2540], 99.95th=[ 2671],
     | 99.99th=[ 2737]
   bw (  KiB/s): min=248832, max=254976, per=100.00%, avg=251490.55, stdev=1468.84, samples=29
   iops        : min=  486, max=  498, avg=491.17, stdev= 2.90, samples=29
  lat (msec)   : 2=68.68%, 4=31.32%
  cpu          : usr=0.19%, sys=4.08%, ctx=7359, majf=0, minf=138
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=7359,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=245MiB/s (257MB/s), 245MiB/s-245MiB/s (257MB/s-257MB/s), io=3680MiB (3858MB), run=15001-15001msec

Disk stats (read/write):
  nvme4n1: ios=7274/0, merge=0/0, ticks=14134/0, in_queue=14134, util=99.44%

full fio results with atomics approach:

JOB1: (g=0): rw=randread, bs=(R) 512KiB-512KiB, (W) 512KiB-512KiB, (T) 512KiB-512KiB, ioengine=libaio, iodepth=1
fio-3.19
Starting 1 process

JOB1: (groupid=0, jobs=1): err= 0: pid=185287: Mon Oct  2 15:23:03 2023
  read: IOPS=493, BW=247MiB/s (259MB/s)(3701MiB/15002msec)
    slat (usec): min=24, max=491, avg=32.51, stdev= 6.47
    clat (usec): min=1648, max=24772, avg=1992.77, stdev=329.21
     lat (usec): min=1679, max=24804, avg=2025.49, stdev=329.31
    clat percentiles (usec):
     |  1.00th=[ 1745],  5.00th=[ 1795], 10.00th=[ 1827], 20.00th=[ 1876],
     | 30.00th=[ 1909], 40.00th=[ 1942], 50.00th=[ 1975], 60.00th=[ 2008],
     | 70.00th=[ 2040], 80.00th=[ 2089], 90.00th=[ 2180], 95.00th=[ 2245],
     | 99.00th=[ 2409], 99.50th=[ 2474], 99.90th=[ 2606], 99.95th=[ 2868],
     | 99.99th=[24773]
   bw (  KiB/s): min=235520, max=260096, per=100.00%, avg=252927.10, stdev=4438.49, samples=29
   iops        : min=  460, max=  508, avg=493.97, stdev= 8.68, samples=29
  lat (msec)   : 2=60.26%, 4=39.70%, 10=0.01%, 20=0.01%, 50=0.01%
  cpu          : usr=0.15%, sys=1.91%, ctx=7401, majf=0, minf=139
  IO depths    : 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     issued rwts: total=7401,0,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
   READ: bw=247MiB/s (259MB/s), 247MiB/s-247MiB/s (259MB/s-259MB/s), io=3701MiB (3880MB), run=15002-15002msec

Disk stats (read/write):
  nvme4n1: ios=7316/0, merge=0/0, ticks=14470/0, in_queue=14470, util=99.24%

patch against Linux 6.5 used for testing. This needs polish and is
buggy if the iopolicy changes while I/O is outstanding, but I avoided
hitting that bug during the experiments by not doing I/O while changing
the iopolicy:

diff --git a/block/blk-mq.h b/block/blk-mq.h
index 1743857e0b01..fbc65eefa017 100644
--- a/block/blk-mq.h
+++ b/block/blk-mq.h
@@ -214,11 +214,6 @@ static inline bool blk_mq_tag_is_reserved(struct blk_mq_tags *tags,
 	return tag < tags->nr_reserved_tags;
 }
 
-static inline bool blk_mq_is_shared_tags(unsigned int flags)
-{
-	return flags & BLK_MQ_F_TAG_HCTX_SHARED;
-}
-
 static inline struct blk_mq_tags *blk_mq_tags_from_data(struct blk_mq_alloc_data *data)
 {
 	if (data->rq_flags & RQF_SCHED_TAGS)
diff --git a/drivers/nvme/host/multipath.c b/drivers/nvme/host/multipath.c
index 0a88d7bdc5e3..f71aa1e6f537 100644
--- a/drivers/nvme/host/multipath.c
+++ b/drivers/nvme/host/multipath.c
@@ -17,6 +17,8 @@ MODULE_PARM_DESC(multipath,
 static const char *nvme_iopolicy_names[] = {
 	[NVME_IOPOLICY_NUMA]	= "numa",
 	[NVME_IOPOLICY_RR]	= "round-robin",
+	[NVME_IOPOLICY_QD_ATOMIC]      = "queue-depth-atomic",
+	[NVME_IOPOLICY_QD_EWAN] = "queue-depth-ewan",
 };
 
 static int iopolicy = NVME_IOPOLICY_NUMA;
@@ -29,6 +31,10 @@ static int nvme_set_iopolicy(const char *val, const struct kernel_param *kp)
 		iopolicy = NVME_IOPOLICY_NUMA;
 	else if (!strncmp(val, "round-robin", 11))
 		iopolicy = NVME_IOPOLICY_RR;
+	else if (!strncmp(val, "queue-depth-atomic", 18))
+		iopolicy = NVME_IOPOLICY_QD_ATOMIC;
+	else if (!strncmp(val, "queue-depth-ewan", 16))
+		iopolicy = NVME_IOPOLICY_QD_EWAN;
 	else
 		return -EINVAL;
 
@@ -126,6 +132,11 @@ void nvme_mpath_start_request(struct request *rq)
 {
 	struct nvme_ns *ns = rq->q->queuedata;
 	struct gendisk *disk = ns->head->disk;
+	struct nvme_subsystem *subsys = ns->ctrl->subsys;
+
+	if (READ_ONCE(subsys->iopolicy) == NVME_IOPOLICY_QD_ATOMIC) {
+		atomic_inc(&ns->ctrl->nr_active);
+	}
 
 	if (!blk_queue_io_stat(disk->queue) || blk_rq_is_passthrough(rq))
 		return;
@@ -139,9 +150,15 @@ EXPORT_SYMBOL_GPL(nvme_mpath_start_request);
 void nvme_mpath_end_request(struct request *rq)
 {
 	struct nvme_ns *ns = rq->q->queuedata;
+	struct nvme_subsystem *subsys = ns->ctrl->subsys;
+
+	if (READ_ONCE(subsys->iopolicy) == NVME_IOPOLICY_QD_ATOMIC) {
+		atomic_dec(&ns->ctrl->nr_active);
+	}
 
 	if (!(nvme_req(rq)->flags & NVME_MPATH_IO_STATS))
 		return;
+
 	bdev_end_io_acct(ns->head->disk->part0, req_op(rq),
 			 blk_rq_bytes(rq) >> SECTOR_SHIFT,
 			 nvme_req(rq)->start_time);
@@ -329,13 +346,69 @@ static struct nvme_ns *nvme_round_robin_path(struct nvme_ns_head *head,
 	return found;
 }
 
+static struct nvme_ns *nvme_queue_depth_atomic_path(struct nvme_ns_head *head)
+{
+	struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
+	unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
+	unsigned int depth;
+
+	list_for_each_entry_rcu(ns, &head->list, siblings) {
+		if (nvme_path_is_disabled(ns))
+			continue;
+
+		depth = atomic_read(&ns->ctrl->nr_active);
+
+		if (ns->ana_state == NVME_ANA_OPTIMIZED &&
+		    depth < min_depth_opt) {
+			min_depth_opt = depth;
+			best_opt = ns;
+		}
+
+		if (ns->ana_state == NVME_ANA_NONOPTIMIZED &&
+		    depth < min_depth_nonopt) {
+			min_depth_nonopt = depth;
+			best_nonopt = ns;
+		}
+	}
+
+	return best_opt ? best_opt : best_nonopt;
+}
+
+static struct nvme_ns *nvme_queue_depth_ewan_path(struct nvme_ns_head *head)
+{
+	struct nvme_ns *best_opt = NULL, *best_nonopt = NULL, *ns;
+	unsigned int min_depth_opt = UINT_MAX, min_depth_nonopt = UINT_MAX;
+	unsigned int depth;
+
+	list_for_each_entry_rcu(ns, &head->list, siblings) {
+		if (nvme_path_is_disabled(ns))
+			continue;
+
+		depth = blk_mq_queue_nr_active(ns->queue);
+
+		if (ns->ana_state == NVME_ANA_OPTIMIZED &&
+		    depth < min_depth_opt) {
+			min_depth_opt = depth;
+			best_opt = ns;
+		}
+
+		if (ns->ana_state == NVME_ANA_NONOPTIMIZED &&
+		    depth < min_depth_nonopt) {
+			min_depth_nonopt = depth;
+			best_nonopt = ns;
+		}
+	}
+
+	return best_opt ? best_opt : best_nonopt;
+}
+
 static inline bool nvme_path_is_optimized(struct nvme_ns *ns)
 {
 	return ns->ctrl->state == NVME_CTRL_LIVE &&
 		ns->ana_state == NVME_ANA_OPTIMIZED;
 }
 
-inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
+noinline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
 {
 	int node = numa_node_id();
 	struct nvme_ns *ns;
@@ -346,6 +419,11 @@ inline struct nvme_ns *nvme_find_path(struct nvme_ns_head *head)
 
 	if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_RR)
 		return nvme_round_robin_path(head, node, ns);
+	else if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_QD_ATOMIC)
+		return nvme_queue_depth_atomic_path(head);
+	else if (READ_ONCE(head->subsys->iopolicy) == NVME_IOPOLICY_QD_EWAN)
+		return nvme_queue_depth_ewan_path(head);
+
 	if (unlikely(!nvme_path_is_optimized(ns)))
 		return __nvme_find_path(head, node);
 	return ns;
@@ -900,6 +978,7 @@ void nvme_mpath_remove_disk(struct nvme_ns_head *head)
 
 void nvme_mpath_init_ctrl(struct nvme_ctrl *ctrl)
 {
+	atomic_set(&ctrl->nr_active, 0);
 	mutex_init(&ctrl->ana_lock);
 	timer_setup(&ctrl->anatt_timer, nvme_anatt_timeout, 0);
 	INIT_WORK(&ctrl->ana_work, nvme_ana_work);
diff --git a/drivers/nvme/host/nvme.h b/drivers/nvme/host/nvme.h
index f35647c470af..a2e5d09d697d 100644
--- a/drivers/nvme/host/nvme.h
+++ b/drivers/nvme/host/nvme.h
@@ -342,6 +342,7 @@ struct nvme_ctrl {
 	u8 anatt;
 	u32 anagrpmax;
 	u32 nanagrpid;
+	atomic_t nr_active;
 	struct mutex ana_lock;
 	struct nvme_ana_rsp_hdr *ana_log_buf;
 	size_t ana_log_size;
@@ -389,6 +390,8 @@ struct nvme_ctrl {
 enum nvme_iopolicy {
 	NVME_IOPOLICY_NUMA,
 	NVME_IOPOLICY_RR,
+	NVME_IOPOLICY_QD_ATOMIC,
+	NVME_IOPOLICY_QD_EWAN,
 };
 
 struct nvme_subsystem {
diff --git a/include/linux/blk-mq.h b/include/linux/blk-mq.h
index 495ca198775f..a7a32a1fd16c 100644
--- a/include/linux/blk-mq.h
+++ b/include/linux/blk-mq.h
@@ -718,6 +718,33 @@ int blk_rq_poll(struct request *rq, struct io_comp_batch *iob,
 
 bool blk_mq_queue_inflight(struct request_queue *q);
 
+#define queue_for_each_hw_ctx(q, hctx, i)				\
+	xa_for_each(&(q)->hctx_table, (i), (hctx))
+
+#define hctx_for_each_ctx(hctx, ctx, i)					\
+	for ((i) = 0; (i) < (hctx)->nr_ctx &&				\
+	     ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
+
+static inline bool blk_mq_is_shared_tags(unsigned int flags)
+{
+	return flags & BLK_MQ_F_TAG_HCTX_SHARED;
+}
+
+static inline unsigned int blk_mq_queue_nr_active(struct request_queue *q)
+{
+	unsigned int nr_active = 0;
+	struct blk_mq_hw_ctx *hctx;
+	unsigned long i;
+
+	queue_for_each_hw_ctx(q, hctx, i) {
+		if (unlikely(blk_mq_is_shared_tags(hctx->flags)))
+			return atomic_read(&q->nr_active_requests_shared_tags);
+		nr_active += atomic_read(&hctx->nr_active);
+	}
+	return nr_active;
+}
+
+
 enum {
 	/* return when out of requests */
 	BLK_MQ_REQ_NOWAIT	= (__force blk_mq_req_flags_t)(1 << 0),
@@ -943,13 +970,6 @@ static inline void *blk_mq_rq_to_pdu(struct request *rq)
 	return rq + 1;
 }
 
-#define queue_for_each_hw_ctx(q, hctx, i)				\
-	xa_for_each(&(q)->hctx_table, (i), (hctx))
-
-#define hctx_for_each_ctx(hctx, ctx, i)					\
-	for ((i) = 0; (i) < (hctx)->nr_ctx &&				\
-	     ({ ctx = (hctx)->ctxs[(i)]; 1; }); (i)++)
-
 static inline void blk_mq_cleanup_rq(struct request *rq)
 {
 	if (rq->q->mq_ops->cleanup_rq)



