[LSF/MM/BPF TOPIC] Block storage copy offloading

Fri Jan 23 14:19:44 PST 2026

Adoption of zoned storage is increasing in mobile devices. Log-
structured filesystems are better suited for zoned storage than
traditional filesystems. These filesystems perform garbage collection.
Garbage collection involves copying data on the storage medium.
Offloading the copying operation to the storage device reduces energy
consumption. Hence this proposal to discuss integrating copy offloading
into the Linux kernel block, SCSI and NVMe layers.

Other use cases for copy offloading include reducing network traffic in
NVMe-oF setups and increasing throughput while copying data.

Note: when using fscrypt, the contents of files can be copied without
decrypting the data because the encryption depends on the file offset
and not on the LBA at which the data is stored. See also
https://docs.kernel.org/filesystems/fscrypt.html.

My goal is to publish a patch series before the LSF/MM/BPF summit starts
that implements the following approach, which has not been proposed yet
as far as I know:
* Filesystems call a block layer function that initiates a copy offload
   operation asynchronously. This function takes a source block device,
   a source offset, a destination block device, a destination offset and
   the number of bytes to be copied.
* That block layer function submits separate REQ_OP_COPY_SRC and
   REQ_OP_COPY_DST operations. In both bios bi_private is set such that
   it points at copy offloading metadata. The bi_private pointer is used
   to associate the REQ_OP_COPY_SRC and REQ_OP_COPY_DST operations that
   are involved in the same copying operation.
* There are two reasons why the choice has been made to have two copy
   operations instead of one:
   - Each bio supports a single offset and size (bi_iter). Copying data
     involves a source offset and a destination offset. Although it would
     be possible to store all the copying metadata in the bio data
     buffer, this approach is not compatible with the existing bio
     splitting code.
   - Device mapper drivers only support a single LBA range per bio.
* After a device mapper driver has finished mapping a bio, the result of
   the map operation is stored in the copy offloading metadata. This
   can probably be realized by intercepting dm_submit_bio_remap() calls.
* The device mapper mapping process is repeated until all input and
   output ranges have been mapped onto ranges not associated with a
   device mapper device. Repeating this process is necessary in case of
   stacked device mapper devices, e.g. dm-crypt on top of dm-linear.
* After the mapping process is finished, the block layer checks whether
   all LBA ranges are associated with the same non-stacking block driver
   (NVMe, SCSI, ...). If not, the copy offload operation fails and the
   block layer falls back to REQ_OP_READ and REQ_OP_WRITE operations.
* One or more copy operations are submitted to the block driver. The
   block driver is responsible for checking whether the copy operation
   can be offloaded. While the SCSI EXTENDED COPY command supports
   copying between logical units, whether the NVMe Copy command supports
   copying across namespaces depends on the version of the NVMe
   specification supported by the controller.
* After completion, the block layer verifies whether the copy operation
   copied all data. If not, the block layer falls back to REQ_OP_READ
   and REQ_OP_WRITE.
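
To make the steps above more concrete, here is a minimal userspace
model of the flow. It is only a sketch and not kernel code: every type
and function name in it (copy_ctx, model_bio, map_to_bottom(),
choose_path(), check_result(), ...) is hypothetical and made up for
illustration. It also assumes that each stacking device maps linearly
onto exactly one lower device, which real device mapper targets do not
guarantee.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace model only; none of these types exist in the kernel. */
enum model_op { COPY_SRC, COPY_DST };
enum model_drv { DRV_DM, DRV_NVME, DRV_SCSI };
enum copy_path { PATH_OFFLOAD, PATH_READ_WRITE };

struct blk_dev {                  /* hypothetical stand-in for a block device */
	const char *name;
	enum model_drv drv;
	const struct blk_dev *lower;  /* non-NULL for a stacking (dm-like) device */
	uint64_t lower_offset;        /* byte offset added when mapping downwards */
};

struct copy_range {
	const struct blk_dev *dev;
	uint64_t offset;              /* in bytes */
};

/* Shared copy metadata; both model bios point at it via bi_private. */
struct copy_ctx {
	struct copy_range src, dst;
	uint64_t len;                 /* number of bytes to copy */
};

struct model_bio {
	enum model_op op;
	void *bi_private;             /* points at the shared struct copy_ctx */
};

/* A SRC and a DST bio belong together if they share the metadata pointer. */
static bool same_copy(const struct model_bio *a, const struct model_bio *b)
{
	return a->op != b->op && a->bi_private == b->bi_private;
}

/* Repeat the mapping step until the range is on a non-stacking device. */
static void map_to_bottom(struct copy_range *r)
{
	while (r->dev->lower) {
		r->offset += r->dev->lower_offset;
		r->dev = r->dev->lower;
	}
}

/* Offload only if both ranges end up on the same non-stacking driver. */
static enum copy_path choose_path(struct copy_ctx *ctx)
{
	map_to_bottom(&ctx->src);
	map_to_bottom(&ctx->dst);
	assert(!ctx->src.dev->lower && !ctx->dst.dev->lower);
	return ctx->src.dev->drv == ctx->dst.dev->drv ?
		PATH_OFFLOAD : PATH_READ_WRITE;
}

/* Final step: fall back if fewer bytes were copied than requested. */
static enum copy_path check_result(const struct copy_ctx *ctx, uint64_t copied)
{
	return copied == ctx->len ? PATH_OFFLOAD : PATH_READ_WRITE;
}
```

In this model, a source range on a two-level stack (for example one
dm-like device on top of another on top of an NVMe namespace) is
resolved by map_to_bottom() in two iterations, after which
choose_path() compares only the bottom-level drivers. Whether the
driver can actually offload the copy is a separate check that this
sketch leaves out.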

Thanks,

Bart.