[PATCH] iosched: Add i10 I/O Scheduler
Jens Axboe
axboe at kernel.dk
Thu Nov 12 13:02:19 EST 2020
On 11/12/20 7:07 AM, Rachit Agarwal wrote:
> From: Rachit Agarwal <rach4x0r at gmail.com>
>
> Hi All,
>
> I/O batching is beneficial for optimizing IOPS and throughput for
> various applications. For instance, several kernel block drivers would
> benefit from batching, including mmc [1] and tcp-based storage drivers
> like nvme-tcp [2,3]. While we have support for batching dispatch [4],
> we need an I/O scheduler to efficiently enable batching. Such a
> scheduler is particularly interesting for disaggregated storage, where
> the access latency of remote storage may be higher than that of local
> storage; batching can thus significantly help amortize the remote
> access latency while increasing throughput.
>
> This patch introduces the i10 I/O scheduler, which performs batching
> per hctx in terms of #requests, #bytes, and timeouts (at microsecond
> granularity). i10 starts dispatching only when #requests or #bytes is
> larger than a default threshold or when a timer expires. After that,
> batching dispatch [4] happens, allowing batching at device drivers
> along with "bd->last" and ".commit_rqs".
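>
> Conceptually, the dispatch trigger works along the lines of the
> following sketch (a simplified illustration; the names below are not
> the exact ones used in the patch):
>
> 	/* Illustrative per-hctx batching state. */
> 	struct batch_state {
> 		unsigned int	nr_reqs;	/* requests queued since last dispatch */
> 		unsigned int	nr_bytes;	/* bytes queued since last dispatch */
> 		bool		timer_fired;	/* batching timer has expired */
> 	};
>
> 	/* Dispatch only once a threshold is crossed or the timer expires. */
> 	static bool batch_should_dispatch(const struct batch_state *b,
> 					  unsigned int thresh_reqs,
> 					  unsigned int thresh_bytes)
> 	{
> 		return b->nr_reqs >= thresh_reqs ||
> 		       b->nr_bytes >= thresh_bytes ||
> 		       b->timer_fired;
> 	}
>
> Once this returns true for a hctx, the queued requests are handed to
> the driver as a batch, so the driver can act on the whole batch via
> "bd->last"/".commit_rqs" rather than on each request individually.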
>
> The i10 I/O scheduler builds upon the recent work in [6]. We have
> tested the i10 I/O scheduler with the nvme-tcp optimizations [2,3] and
> batching dispatch [4], varying the number of cores, read/write ratios,
> and request sizes, with both an NVMe SSD and a RAM block device. For
> NVMe SSDs, the i10 I/O scheduler achieves ~60% improvement in IOPS per
> core over the "noop" I/O scheduler. These results are available at
> [5], and many additional results are presented in [6].
>
> While other schedulers may also batch I/O (e.g., mq-deadline), the
> optimization target in the i10 I/O scheduler is throughput
> maximization. Hence there is no latency target nor a need for a global
> tracking context, so a new scheduler is needed rather than building
> this functionality into an existing scheduler.
>
> We currently use fixed default values as batching thresholds (e.g., 16
> for #requests, 64KB for #bytes, and 50us for timeout). These default
> values are based on sensitivity tests in [6]. For our future work, we
> plan to support adaptive batching according to system load and to
> extend the scheduler to support isolation in multi-tenant deployments
> (to simultaneously achieve low tail latency for latency-sensitive
> applications and high throughput for throughput-bound applications).
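>
> As a rough sketch (again with illustrative names rather than the exact
> code in the patch), the defaults and the timeout path look roughly
> like this:
>
> 	#include <linux/blk-mq.h>
> 	#include <linux/hrtimer.h>
> 	#include <linux/kernel.h>
> 	#include <linux/ktime.h>
>
> 	#define DEF_BATCH_NR		16		/* #requests threshold */
> 	#define DEF_BATCH_BYTES		(64 * 1024)	/* #bytes threshold */
> 	#define DEF_BATCH_TIMEOUT_US	50		/* timeout, in microseconds */
>
> 	struct batch_timer {
> 		struct hrtimer		timer;
> 		struct blk_mq_hw_ctx	*hctx;
> 	};
>
> 	/* Arm the timer when the first request of a new batch is queued. */
> 	static void batch_arm_timer(struct batch_timer *bt)
> 	{
> 		if (!hrtimer_active(&bt->timer))
> 			hrtimer_start(&bt->timer,
> 				      ns_to_ktime(DEF_BATCH_TIMEOUT_US * NSEC_PER_USEC),
> 				      HRTIMER_MODE_REL);
> 	}
>
> 	/*
> 	 * On expiry, kick the hardware queue even if the #requests/#bytes
> 	 * thresholds were not reached, so small batches are not held back.
> 	 */
> 	static enum hrtimer_restart batch_timer_fn(struct hrtimer *timer)
> 	{
> 		struct batch_timer *bt = container_of(timer, struct batch_timer, timer);
>
> 		blk_mq_run_hw_queue(bt->hctx, true);
> 		return HRTIMER_NORESTART;
> 	}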
I haven't taken a close look at the code yet, but one quick note:
patches like this should be against the branches for 5.11. In fact,
this one doesn't even compile against current -git, as
blk_mq_bio_list_merge is now called blk_bio_list_merge.
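The fixup is just the rename at the call site, something along these
lines (the variable and list names here are placeholders, not the ones
in the patch):

	/* placeholder names; only the function name changes */
-	merged = blk_mq_bio_list_merge(hctx->queue, &ihd->rq_list, bio, nr_segs);
+	merged = blk_bio_list_merge(hctx->queue, &ihd->rq_list, bio, nr_segs);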
In any case, I did run this through some quick peak testing as I was
curious, and I'm seeing about a 20% drop in peak IOPS running this
compared to "none". Perf diff:
10.71% -2.44% [kernel.vmlinux] [k] read_tsc
2.33% -1.99% [kernel.vmlinux] [k] _raw_spin_lock
Also:
> [5] https://github.com/i10-kernel/upstream-linux/blob/master/dss-evaluation.pdf
Was curious and wanted to look it up, but it doesn't exist.
--
Jens Axboe