[LSF/MM/BPF TOPIC] Block storage copy offloading
Bart Van Assche
bvanassche at acm.org
Fri Jan 23 14:19:44 PST 2026
Adoption of zoned storage is increasing in mobile devices. Log-
structured filesystems are better suited for zoned storage than
traditional filesystems. These filesystems perform garbage collection.
Garbage collection involves copying data on the storage medium.
Offloading the copying operation to the storage device reduces energy
consumption. Hence the proposal to discuss integration of copy
offloading in the Linux kernel block, SCSI and NVMe layers.
Other use cases for copy offloading include reducing network traffic in
NVMe-oF setups and increasing throughput while copying data.
Note: when using fscrypt, the contents of files can be copied without
decrypting the data because the encryption depends on the file offset
and not on the LBA at which the data is stored. See also
https://docs.kernel.org/filesystems/fscrypt.html.
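To illustrate the note above, here is a toy user-space model. It is not
fscrypt's real algorithm: the XOR "cipher", the toy_* names and the
in-memory disk are all made up for this example. The only property it
shares with the note is that the keystream is derived from the block
index within the file and never from the LBA, so ciphertext can be moved
between LBAs without the key.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16
#define TOY_NR_LBAS    8

/* Toy keystream (NOT a real cipher): depends on the key and on the
 * block index within the file. The LBA is not an input. */
static uint8_t toy_keystream(uint8_t key, uint64_t file_block, size_t i)
{
	return (uint8_t)(key ^ (file_block * 31u) ^ (i * 7u));
}

/* XOR is its own inverse, so this function encrypts and decrypts. */
static void toy_crypt(uint8_t key, uint64_t file_block,
		      const uint8_t *in, uint8_t *out)
{
	for (size_t i = 0; i < TOY_BLOCK_SIZE; i++)
		out[i] = in[i] ^ toy_keystream(key, file_block, i);
}

/* Toy storage medium, indexed by LBA. */
static uint8_t toy_disk[TOY_NR_LBAS][TOY_BLOCK_SIZE];

/* Model of an offloaded copy: raw ciphertext moves from one LBA to
 * another. No key is involved. */
static void toy_copy_lba(unsigned int src_lba, unsigned int dst_lba)
{
	assert(src_lba < TOY_NR_LBAS && dst_lba < TOY_NR_LBAS);
	memcpy(toy_disk[dst_lba], toy_disk[src_lba], TOY_BLOCK_SIZE);
}
```

Encrypting file block 3 into LBA 1, calling toy_copy_lba(1, 6) and then
decrypting LBA 6 as file block 3 yields the original plaintext in this
model.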
My goal is to publish, before the LSF/MM/BPF summit starts, a patch
series that implements the following approach. As far as I know, this
approach has not been proposed yet:
* Filesystems call a block layer function that initiates a copy offload
operation asynchronously. This function takes a source block device, a
source offset, a destination block device, a destination offset and the
number of bytes to be copied.
* That block layer function submits separate REQ_OP_COPY_SRC and
REQ_OP_COPY_DST operations. In both bios bi_private is set such that
it points at copy offloading metadata. The bi_private pointer is used
to associate the REQ_OP_COPY_SRC and REQ_OP_COPY_DST operations that
are involved in the same copying operation.
* Two copy operations are used instead of one for two reasons:
- Each bio supports a single offset and size (bi_iter). Copying data
involves a source offset and a destination offset. Although it would
be possible to store all the copying metadata in the bio data
buffer, this approach is not compatible with the existing bio
splitting code.
- Device mapper drivers only support a single LBA range per bio.
* After a device mapper driver has finished mapping a bio, the result of
the map operation is stored in the copy offloading metadata. This can
probably be realized by intercepting dm_submit_bio_remap() calls.
* The device mapper mapping process is repeated until all input and
output ranges have been mapped onto ranges not associated with a
device mapper device. Repeating this process is necessary in case of
stacked device mapper devices, e.g. dm-crypt on top of dm-linear.
* After the mapping process is finished, the block layer checks whether
all LBA ranges are associated with the same non-stacking block driver
(NVMe, SCSI, ...). If not, the copy offload operation fails and the
block layer falls back to REQ_OP_READ and REQ_OP_WRITE operations.
* One or more copy operations are submitted to the block driver. The
block driver is responsible for checking whether the copy operation
can be offloaded. While the SCSI EXTENDED COPY command supports
copying between logical units, whether the NVMe Copy command supports
copying across namespaces depends on the version of the NVMe
specification supported by the controller.
* After completion, the block layer verifies whether the copy operation
copied all data. If not, it falls back to REQ_OP_READ and REQ_OP_WRITE.
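
The mapping and fallback steps above can be modeled in user space
roughly as follows. This is only a sketch and not the planned kernel
code: every model_* name is hypothetical, each stacking device is
assumed to remap linearly onto exactly one lower device (real device
mapper tables can split a range across several targets), and comparing
driver names stands in for the "same non-stacking block driver" check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a block device: either a stacking device that
 * remaps onto a lower device or a bottom-level device. */
struct model_bdev {
	const char *driver;             /* e.g. "dm-linear", "nvme", "scsi" */
	const struct model_bdev *lower; /* NULL for a non-stacking device */
	uint64_t lower_start;           /* offset added when remapping */
};

/* One side (source or destination) of a copy operation. */
struct model_range {
	const struct model_bdev *bdev;
	uint64_t offset;
};

/* Shared copy metadata. In the proposal, bi_private of both the
 * REQ_OP_COPY_SRC and the REQ_OP_COPY_DST bio would point at such an
 * object; here it simply holds both ranges. */
struct model_copy_ctx {
	struct model_range src, dst;
	uint64_t len;
};

enum model_result { MODEL_OFFLOAD, MODEL_FALLBACK_READ_WRITE };

/* Repeat the mapping step until the range no longer refers to a
 * stacking device, recording each result in the metadata. */
static void model_resolve(struct model_range *r)
{
	while (r->bdev->lower) {
		r->offset += r->bdev->lower_start;
		r->bdev = r->bdev->lower;
	}
}

/* Map both ranges, then decide between offload and fallback. */
static enum model_result model_copy(struct model_copy_ctx *ctx)
{
	assert(ctx && ctx->src.bdev && ctx->dst.bdev);
	model_resolve(&ctx->src);
	model_resolve(&ctx->dst);
	/* Different bottom-level drivers: fall back to read + write. */
	if (strcmp(ctx->src.bdev->driver, ctx->dst.bdev->driver) != 0)
		return MODEL_FALLBACK_READ_WRITE;
	/* The driver would still have to check whether it can offload. */
	return MODEL_OFFLOAD;
}
```

In this model, a source on two stacked linear devices above an NVMe
device and a destination directly on that NVMe device resolve to the
same driver and yield MODEL_OFFLOAD, while an NVMe source combined with
a SCSI destination yields MODEL_FALLBACK_READ_WRITE.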
Thanks,
Bart.