[LSF/MM/BFP ATTEND] [LSF/MM/BFP TOPIC] Storage: Copy Offload

Bart Van Assche bvanassche at acm.org
Mon May 17 17:15:11 PDT 2021


On 5/10/21 5:15 PM, Chaitanya Kulkarni wrote:
> I'd like to propose a session to go over this topic to understand:
> 
> 1. What are the blockers for Copy Offload implementation?
> 2. Discussion about having a file system interface.
> 3. Discussion about having the right system call for user-space.
> 4. What is the right way to move this work forward?
> 5. How can we help to contribute and move this work forward?

We need to reach agreement on an approach. The text below is my attempt at
guiding the discussion. An HTML version is available at
https://github.com/bvanassche/linux-kernel-copy-offload. As usual,
feedback is welcome.

Bart.


# Implementing Copy Offloading in the Linux Kernel

## Introduction

Efforts to add copy offloading support to the Linux kernel started a
considerable time ago. Despite this, copy offloading support is not yet
upstream and there is no detailed plan yet for how to implement it.

This document outlines a possible implementation. Its purpose is to help guide
the conversations around copy offloading.

## Block Layer

We need an interface for passing copy offload requests from user space or file
systems to block drivers. Although the first copy offloading implementation
added a single operation to the block layer, there seems to be agreement today
to implement copy offloading as two operations, namely `REQ_COPY_IN` and
`REQ_COPY_OUT`.

A possible approach is as follows:

* Fall back to a non-offloaded copy operation if necessary, e.g. if copy
  offloading is not supported or if data is encrypted and the ciphertext
  depends on the LBA. The following code may be a good starting point:
  `drivers/md/dm-kcopyd.c`.
* If the block driver supports copy offloading, submit the `REQ_COPY_IN`
  operation first. The block driver stores the data ranges associated with the
  `REQ_COPY_IN` operation.
* Wait for completion of the `REQ_COPY_IN` operation.
* After the `REQ_COPY_IN` operation has completed, submit the `REQ_COPY_OUT`
  operation and include a reference to the `REQ_COPY_IN` operation. If the
  block driver that receives the `REQ_COPY_OUT` operation has received a
  matching `REQ_COPY_IN` operation, it offloads the copy operation. Otherwise
  it reports that no data has been copied and the block layer performs a
  non-offloaded copy operation. A sketch of this two-phase submission follows
  this list.
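
The sketch below illustrates the two-phase submission. Everything specific to
copy offloading here is hypothetical: `REQ_COPY_IN`, `REQ_COPY_OUT` and the
`bi_copy_token` linkage between the two bios do not exist upstream.

```
/*
 * Hypothetical sketch of the two-phase copy submission. REQ_COPY_IN,
 * REQ_COPY_OUT and bi_copy_token are placeholders.
 */
static int blkdev_copy_offload(struct block_device *bdev, sector_t src,
			       sector_t dst, unsigned int nr_sects)
{
	struct bio *in_bio, *out_bio;
	int ret;

	/* Phase 1: let the driver record the source range. */
	in_bio = bio_alloc(GFP_KERNEL, 0);
	bio_set_dev(in_bio, bdev);
	in_bio->bi_opf = REQ_COPY_IN;		/* hypothetical */
	in_bio->bi_iter.bi_sector = src;
	in_bio->bi_iter.bi_size = nr_sects << SECTOR_SHIFT;
	ret = submit_bio_wait(in_bio);
	if (ret)
		goto put_in;

	/* Phase 2: reference the REQ_COPY_IN and trigger the data movement. */
	out_bio = bio_alloc(GFP_KERNEL, 0);
	bio_set_dev(out_bio, bdev);
	out_bio->bi_opf = REQ_COPY_OUT;		/* hypothetical */
	out_bio->bi_iter.bi_sector = dst;
	out_bio->bi_iter.bi_size = nr_sects << SECTOR_SHIFT;
	out_bio->bi_copy_token = in_bio;	/* hypothetical linkage */
	ret = submit_bio_wait(out_bio);
	bio_put(out_bio);
put_in:
	bio_put(in_bio);
	/* On error the caller falls back to a non-offloaded copy. */
	return ret;
}
```

In this sketch a failure of either phase simply propagates an error; a real
implementation would have to distinguish "offload not supported" from genuine
I/O errors before falling back.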

The operation type is stored in the least significant bits of the `bi_opf`
member of struct bio (see `bio_op()`). With each bio a single data buffer and a
single contiguous byte range on the storage medium are associated. Pointers to
the data buffer occur in `bi_io_vec[]`. The affected byte range is represented
by `bi_iter.bi_sector` and `bi_iter.bi_size`.
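
For reference, these are the struct bio members mentioned above, abridged from
`include/linux/blk_types.h`:

```
/* Abridged from include/linux/blk_types.h. */
struct bio {
	/* ... */
	unsigned int		bi_opf;		/* low bits: REQ_OP; upper bits: flags */
	struct bvec_iter	bi_iter;	/* bi_sector, bi_size, ... */
	unsigned short		bi_vcnt;	/* how many bio_vec's */
	struct bio_vec		*bi_io_vec;	/* the actual vec list */
	/* ... */
};
```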

The NVMe and SCSI copy offload commands both support multiple source ranges.
They differ on the destination side: SCSI XCOPY supports multiple destination
ranges, whereas the NVMe simple copy command supports a single destination
range.

Possible approaches for passing the data ranges involved in a copy operation
from the block layer to block drivers are as follows:

* Attach a bio to each copy offload request and encode all relevant copy
  offload parameters in that data buffer. These parameters include the source
  device and source ranges for `REQ_COPY_IN` and the destination device and
  destination ranges for `REQ_COPY_OUT`. Let the block drivers translate these
  parameters into something the storage device understands (NVMe simple copy
  parameters or SCSI XCOPY parameters). Fill in the parameter structure size
  in `bi_iter.bi_size`. Set `bi_vcnt` to 1 and fill in `bio->bi_io_vec[0]`. A
  sketch of such a parameter layout follows this list.
* Map each source range and each destination range onto a different bio. Link
  all the bios with the `bi_next` pointer and attach these bios to the copy
  offload requests. Leave `bi_vcnt` zero. This is related but not identical to
  the approach followed by `__blkdev_issue_discard()`.
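
For the first approach, the parameter buffer attached to the bio could look as
follows. This layout is purely illustrative and not a proposed ABI:

```
/* Hypothetical copy parameter layout for the first approach. */
struct blk_copy_range {
	__u64	sector;			/* first sector of this range */
	__u64	nr_sectors;		/* number of sectors in this range */
};

struct blk_copy_payload {
	__u32			dev;	/* source device for REQ_COPY_IN;
					   destination device for REQ_COPY_OUT */
	__u32			nr_ranges;
	struct blk_copy_range	ranges[];
};
```

The submitter would set `bi_iter.bi_size` to `struct_size(payload, ranges,
nr_ranges)`, set `bi_vcnt` to 1 and point `bio->bi_io_vec[0]` at this buffer.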

I think that the first approach would require more changes in the device mapper
than the second approach since the device mapper code knows how to split bios
but not how to split a buffer with LBA range descriptors.

The following code needs to be modified no matter how copy offloading is
implemented:

* Request cloning. The code that checks the limits before requests are cloned
  compares `blk_rq_sectors()` with `max_sectors`. This is inappropriate for
  `REQ_COPY_*` requests.
* Request splitting. `bio_split()` assumes that `bi_iter.bi_size` represents
  the number of bytes affected on the medium.
* Code related to retrying the original requests of a merged request with
  mixed failfast attributes, e.g. `blk_rq_err_bytes()`.
* Code related to partially completing a request, e.g. `blk_update_request()`.
* The code for merging block layer requests.
* `blk_mq_end_request()` since it calls `blk_update_request()` and
  `blk_rq_bytes()`.
* The plugging code because of the following test in the plugging code:
  `blk_rq_bytes(last) >= BLK_PLUG_FLUSH_SIZE`.
* The I/O accounting code (`task_io_account_read()`) since that code uses
  `bio_has_data()` and hence skips discard, secure erase and write zeroes
  requests:
```
static inline bool bio_has_data(struct bio *bio)
{
	return bio && bio->bi_iter.bi_size &&
	    bio_op(bio) != REQ_OP_DISCARD &&
	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
	    bio_op(bio) != REQ_OP_WRITE_ZEROES;
}
```
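
With the two new operations, `bio_has_data()` would have to be extended along
these lines (`REQ_OP_COPY_IN` and `REQ_OP_COPY_OUT` are the hypothetical
opcodes behind `REQ_COPY_IN` and `REQ_COPY_OUT`):

```
/* Hypothetical: copy bios carry a parameter buffer rather than file data,
 * so I/O accounting must skip them just like discard and write zeroes. */
static inline bool bio_has_data(struct bio *bio)
{
	return bio && bio->bi_iter.bi_size &&
	    bio_op(bio) != REQ_OP_DISCARD &&
	    bio_op(bio) != REQ_OP_SECURE_ERASE &&
	    bio_op(bio) != REQ_OP_WRITE_ZEROES &&
	    bio_op(bio) != REQ_OP_COPY_IN &&	/* hypothetical */
	    bio_op(bio) != REQ_OP_COPY_OUT;	/* hypothetical */
}
```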

Block drivers will need to use the `special_vec` member of struct request to
pass the copy offload parameters to the storage device. That member is used
e.g. when a `REQ_OP_DISCARD` operation is submitted to an NVMe driver. The SCSI
sd driver uses `special_vec` while processing an UNMAP or WRITE SAME command.
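
The discard path shows the pattern. Simplified from `nvme_setup_discard()` in
`drivers/nvme/host/core.c`, where the DSM range buffer travels with the
request:

```
/* Simplified from nvme_setup_discard(); a copy implementation could attach
 * its parameter buffer to the request the same way. */
req->special_vec.bv_page = virt_to_page(range);
req->special_vec.bv_offset = offset_in_page(range);
req->special_vec.bv_len = alloc_size;
req->rq_flags |= RQF_SPECIAL_PAYLOAD;
```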

## Device Mapper

The device mapper may have to split a request. As an example, LVM is
based on the dm-linear driver. A request that is submitted to an LVM volume
has to be split if it affects multiple block devices. Copy offload requests
that affect multiple block devices should be split or should be onloaded.

The call chain for bio-based dm drivers is as follows:
```
dm_submit_bio(bio)
-> __split_and_process_bio(md, map, bio)
  -> __split_and_process_non_flush(clone_info)
    -> __clone_and_map_data_bio(clone_info, target_info, sector, len)
      -> clone_bio(dm_target_io, bio, sector, len)
      -> __map_bio(dm_target_io)
        -> ti->type->map(dm_target_io, clone)
```
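
Returning to the splitting problem: a bio-based target could refuse copy bios
that it cannot map onto a single underlying device, so that the submitter falls
back to onloading the copy. A hypothetical sketch for dm-linear:

```
/* Hypothetical handling of copy bios in linear_map() (drivers/md/dm-linear.c).
 * op_is_copy() and copy_fits_in_target() are placeholders. */
static int linear_map(struct dm_target *ti, struct bio *bio)
{
	if (op_is_copy(bio->bi_opf) && !copy_fits_in_target(ti, bio))
		return DM_MAPIO_KILL;	/* fail the bio; the submitter falls back */

	linear_map_bio(ti, bio);
	return DM_MAPIO_REMAPPED;
}
```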

## NVMe

Process copy offload commands by translating `REQ_COPY_OUT` requests into NVMe
simple copy commands.
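
A minimal sketch of that translation, assuming the TP 4065 command layout. The
`nvme_copy_range` structure and the `copy` member of the command union below
are illustrative, not upstream definitions:

```
/* Source range descriptor (format 0h) per TP 4065; illustrative layout. */
struct nvme_copy_range {
	__u8	rsvd0[8];
	__le64	slba;		/* starting source LBA */
	__le16	nlb;		/* number of logical blocks, 0-based */
	__u8	rsvd18[14];
};

/* Hypothetical translation of a REQ_COPY_OUT request. */
cmnd->copy.opcode = nvme_cmd_copy;	/* 0x19 */
cmnd->copy.nsid = cpu_to_le32(ns->head->ns_id);
cmnd->copy.sdlba = cpu_to_le64(nvme_sect_to_lba(ns, blk_rq_pos(req)));
cmnd->copy.nr_range = nr_ranges - 1;	/* 0-based count of ranges */
```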

## SCSI

From inside `sd_revalidate_disk()`, query the third-party copy VPD page.
Extract the following parameters (see also SPC-6). A sketch of querying this
page follows the list:

* MAXIMUM CSCD DESCRIPTOR COUNT
* MAXIMUM SEGMENT DESCRIPTOR COUNT
* MAXIMUM DESCRIPTOR LIST LENGTH
* Supported third-party copy commands.
* SUPPORTED CSCD DESCRIPTOR ID (0 or more)
* ROD type descriptor (0 or more)
* TOTAL CONCURRENT COPIES
* MAXIMUM IDENTIFIED CONCURRENT COPIES
* MAXIMUM SEGMENT LENGTH
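
Querying the page can use the existing `scsi_get_vpd_page()` helper. The loop
below only walks the third-party copy descriptor list; the offsets follow my
reading of SPC-6 and should be double-checked:

```
/* Sketch: read the third-party copy VPD page (8Fh) and walk its
 * descriptor list. */
unsigned char buf[512];
int len, off;

if (scsi_get_vpd_page(sdev, 0x8f, buf, sizeof(buf)))
	return;
len = get_unaligned_be16(&buf[2]) + 4;		/* page length + header */
for (off = 4; off + 4 <= len; ) {
	u16 desc_type = get_unaligned_be16(&buf[off]);
	u16 desc_len = get_unaligned_be16(&buf[off + 2]);

	/* Dispatch on desc_type; see the SPC-6 table of third-party
	 * copy descriptor type codes. */
	off += desc_len + 4;
}
```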

From inside `sd_init_command()`, translate `REQ_COPY_OUT` into either EXTENDED
COPY or POPULATE TOKEN + WRITE USING TOKEN.

Set the parameters in the copy offload commands as follows (a sketch of the
resulting parameter list header follows this list):

* We may have to set the STR bit. From SPC-6: "A sequential striped (STR) bit
  set to one specifies to the copy manager that the majority of the block
  device references in the parameter list represent sequential access of
  several block devices that are striped. This may be used by the copy manager
  to perform reads from a copy source block device at any time and in any
  order during processing of an EXTENDED COPY command as described in
  6.6.5.3. A STR bit set to zero specifies to the copy manager that disk
  references, if any, may not be sequential."
* Set the LIST ID USAGE field to 3 and the LIST ID to 0. This means that
  neither "held data" nor the RECEIVE COPY STATUS command are supported. This
  improves security because the data that is being copied cannot be accessed
  via the LIST ID.
* We may have to set the G_SENSE (good with sense data) bit. From SPC-6: "If
  the G_SENSE bit is set to one and the copy manager completes the EXTENDED
  COPY command with GOOD status, then the copy manager shall include sense
  data with the GOOD status in which the sense key is set to COMPLETED, the
  additional sense code is set to EXTENDED COPY INFORMATION AVAILABLE, and the
  COMMAND-SPECIFIC INFORMATION field is set to the number of segment
  descriptors the copy manager has processed."
* Clear the IMMED bit.
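
As a sketch, and assuming the SPC-4 LID1 parameter list header layout (byte 0 =
LIST IDENTIFIER; byte 1 carries the STR bit, the LIST ID USAGE field and the
PRIORITY field), the settings above translate to:

```
/* Sketch of an EXTENDED COPY (LID1) parameter list header; the byte
 * layout below is my reading of SPC-4 and should be verified. */
memset(param, 0, param_len);
param[0] = 0;				/* LIST IDENTIFIER = 0 */
param[1] = (str << 5) | (3 << 3);	/* STR, LIST ID USAGE = 3, PRIORITY = 0 */
put_unaligned_be16(cscd_len, &param[2]); /* CSCD descriptor list length */
put_unaligned_be32(seg_len, &param[8]);	/* segment descriptor list length */
```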

## System Call Interface

To submit copy offload requests from user space, we need the following (a
user-space example follows this list):

* A system call for passing these requests, e.g. `copy_file_range()` or
  io_uring.
* A copy offload parameter format description as part of the user-space ABI.
  The parameters include source device, source ranges, destination device and
  destination ranges.
* A flag that indicates whether or not it is acceptable to fall back to
  onloading the copy operation.
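
With `copy_file_range()`, the user-space side could look like the program
below. The assumption is that the kernel wires block device file descriptors up
to `REQ_COPY_IN`/`REQ_COPY_OUT`; today `copy_file_range()` only accepts regular
files.

```
/* Sketch: copy 1 MiB between two block devices via copy_file_range().
 * /dev/sda and /dev/sdb are placeholders. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
	int in = open("/dev/sda", O_RDONLY);
	int out = open("/dev/sdb", O_WRONLY);
	loff_t off_in = 0, off_out = 0;
	ssize_t ret;

	if (in < 0 || out < 0) {
		perror("open");
		return 1;
	}
	ret = copy_file_range(in, &off_in, out, &off_out, 1 << 20, 0);
	if (ret < 0)
		perror("copy_file_range");
	else
		printf("copied %zd bytes\n", ret);
	return ret < 0;
}
```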

## Sysfs Interface

To do: define which aspects of copy offloading should be configurable through
new sysfs parameters under /sys/block/*/queue/.
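
One possibility is to follow the pattern of the existing attributes in
`block/blk-sysfs.c`. The attribute names and the `QUEUE_FLAG_COPY` and
`limits.max_copy_sectors` members below are placeholders:

```
/* Hypothetical attributes under /sys/block/<dev>/queue/. */
static ssize_t queue_copy_offload_show(struct request_queue *q, char *page)
{
	/* QUEUE_FLAG_COPY is a placeholder queue flag. */
	return sprintf(page, "%u\n",
		       test_bit(QUEUE_FLAG_COPY, &q->queue_flags) ? 1 : 0);
}

static ssize_t queue_max_copy_sectors_show(struct request_queue *q, char *page)
{
	/* limits.max_copy_sectors is a placeholder limit. */
	return sprintf(page, "%u\n", q->limits.max_copy_sectors);
}
```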

## See Also

* Martin Petersen, [Copy
  Offload](https://www.mail-archive.com/linux-scsi@vger.kernel.org/msg28998.html),
  linux-scsi, 2014-05-28.
* Mikulas Patocka, [ANNOUNCE: SCSI XCOPY support for the kernel and device
  mapper](https://www.mail-archive.com/linux-kernel@vger.kernel.org/msg686111.html),
  2014-07-15.
* [kcopyd documentation](https://www.kernel.org/doc/html/latest/admin-guide/device-mapper/kcopyd.html), kernel.org.
* Martin K. Petersen, [Copy Offload - Here Be Dragons](http://mkp.net/pubs/xcopy.pdf), 2019-08-21.
* Martin K. Petersen, [Re: [dm-devel] [RFC PATCH v2 1/2] block: add simple copy
  support](https://lore.kernel.org/linux-nvme/yq1blf3smcl.fsf@ca-mkp.ca.oracle.com/),
  linux-nvme, 2020-12-08.
* NVM Express Organization, [NVMe - TP 4065b Simple Copy Command 2021.01.25 -
  Ratified.pdf](https://workspace.nvmexpress.org/apps/org/workgroup/allmembers/download.php/4773/NVMe%20-%20TP%204065b%20Simple%20Copy%20Command%202021.01.25%20-%20Ratified.pdf), 2021-01-25.
* Selvakumar S, [[RFC PATCH v5 0/4] add simple copy
  support](https://lore.kernel.org/linux-nvme/20210219124517.79359-1-selvakuma.s1@samsung.com/),
  linux-nvme, 2021-02-19.



