[LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload
Mikulas Patocka
mpatocka at redhat.com
Wed Mar 9 00:51:41 PST 2022
On Tue, 8 Mar 2022, Nikos Tsironis wrote:
> My work focuses mainly on improving the IOPs and latency of the
> dm-snapshot target, in order to bring the performance of short-lived
> snapshots as close as possible to bare-metal performance.
>
> My initial performance evaluation of dm-snapshot revealed a big
> performance drop while the snapshot is active; a drop that is not
> justified by COW alone.
>
> Using fio with blktrace, I noticed that the per-CPU I/O distribution
> was uneven. Although many threads were doing I/O, only a couple of the
> CPUs ended up submitting I/O requests to the underlying device.
>
> The same issue also affects dm-clone, when doing I/O with sizes smaller
> than the target's region size, where kcopyd is used for COW.
>
> The bottleneck here is kcopyd serializing all I/O. Users of kcopyd, such
> as dm-snapshot and dm-clone, cannot take advantage of the increased I/O
> parallelism that comes with using blk-mq in modern multi-core systems,
> because I/Os are issued only by a single CPU at a time, the one on which
> kcopyd’s thread happens to be running.
>
> So, I experimented with redesigning kcopyd to prevent I/O serialization
> by respecting thread locality for I/Os and their completions. This made
> the distribution of I/O processing uniform across CPUs.
>
> My measurements showed that scaling kcopyd, in combination with
> scaling dm-snapshot itself [1] [2], can lead to a ~300% increase in
> sustained throughput and an ~80% decrease in I/O latency for transient
> snapshots over the null_blk device.
>
> The work for scaling dm-snapshot has been merged [1], but,
> unfortunately, I haven't been able to send my work on kcopyd upstream
> yet, because I have been busy with other things for the last couple
> of years.
>
> I haven't looked into the details of copy offload yet, but it would be
> really interesting to see how it affects the performance of random and
> sequential workloads, and to check whether, and how, scaling kcopyd
> affects performance in combination with copy offload.
>
> Nikos
Hi
Note that you must submit kcopyd callbacks from a single thread; otherwise
there's a race condition in the snapshot target.
The snapshot code doesn't take locks in the copy_callback and it expects
that the callbacks are serialized.
Maybe adding locks to copy_callback would solve it.
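For illustration only, a rough sketch of that idea. The lock is hypothetical
and the structures are simplified stand-ins for what's in
drivers/md/dm-snap.c; this is not a tested patch:

/*
 * Sketch only: simplified stand-ins for the real dm-snap.c structures.
 * callback_lock is a hypothetical new field, not something that exists today.
 */
struct dm_snapshot {
	spinlock_t callback_lock;		/* hypothetical: serializes callbacks */
	u64 exception_start_sequence;		/* next sequence expected to complete */
	struct list_head out_of_order_list;	/* completions that arrived early */
	/* ... */
};

static void copy_callback(int read_err, unsigned long write_err, void *context)
{
	struct dm_snap_pending_exception *pe = context;
	struct dm_snapshot *s = pe->snap;
	unsigned long flags;

	/*
	 * Today this relies on kcopyd invoking all callbacks from one
	 * thread. If callbacks could arrive concurrently on several CPUs,
	 * the sequence counter and the out-of-order list would need a lock:
	 */
	spin_lock_irqsave(&s->callback_lock, flags);

	if (pe->exception_sequence == s->exception_start_sequence) {
		s->exception_start_sequence++;
		complete_exception(pe);
		/* ... then drain out_of_order_list entries that are now in order ... */
	} else {
		/* ... insert pe into out_of_order_list, sorted by sequence ... */
	}

	spin_unlock_irqrestore(&s->callback_lock, flags);
}

Whether complete_exception() can actually be called under a spinlock, or run
concurrently for different chunks, would need a separate look; the point is
only that the ordering state is currently unprotected and assumes serialized
callbacks.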
Mikulas