[LSF/MM/BPF TOPIC] Block storage copy offloading

Viacheslav Dubeyko slava at dubeyko.com
Mon Jan 26 10:18:43 PST 2026


On Fri, 2026-01-23 at 14:19 -0800, Bart Van Assche wrote:
> Adoption of zoned storage is increasing in mobile devices. Log-
> structured filesystems are better suited for zoned storage than
> traditional filesystems. These filesystems perform garbage
> collection.
> Garbage collection involves copying data on the storage medium.
> Offloading the copying operation to the storage device reduces energy
> consumption. Hence the proposal to discuss integration of copy
> offloading in the Linux kernel block, SCSI and NVMe layers.
> 
> Other use-cases for copy offloading include reducing network traffic in
> NVMeOF setups while copying data and also increasing throughput while
> copying data.
> 

The idea is interesting, but...

I am not completely sure that offloading the copy to the storage device
can reduce energy consumption. The storage device still needs to spend
energy on executing the operation anyway. Do you have any numbers that
support your point?

Also, I don't see how an LFS file system can manage this. An LFS file
system contains a sequence of logs, and a log contains both metadata and
user data. Even if one log contains only metadata and another one
contains only user data, the user data locations have to be known and
stored in the metadata log(s) before the metadata log is written to the
volume. So, what is your vision of the collaboration model between an
LFS file system and the block layer? Which file system have you
considered as a working model for your approach?

Thanks,
Slava.


> Note: when using fscrypt, the contents of files can be copied without
> decrypting the data since how data is encrypted depends on the file
> offset and not on the LBA at which data is stored. See also
> https://docs.kernel.org/filesystems/fscrypt.html.
> 
> My goal is to publish a patch series before the LSF/MM/BPF summit starts
> that implements the following approach, an approach that hasn't been
> proposed yet as far as I know:
> * Filesystems call a block layer function that initiates a copy offload
>    operation asynchronously. This function supports a source block
>    device, a source offset, a destination block device, a destination
>    offset and the number of bytes to be copied.
> * That block layer function submits separate REQ_OP_COPY_SRC and
>    REQ_OP_COPY_DST operations. In both bios bi_private is set such that
>    it points at copy offloading metadata. The bi_private pointer is used
>    to associate the REQ_OP_COPY_SRC and REQ_OP_COPY_DST operations that
>    are involved in the same copying operation.
> * There are two reasons why the choice has been made to have two copy
>    operations instead of one:
>    - Each bio supports a single offset and size (bi_iter). Copying data
>      involves a source offset and a destination offset. Although it would
>      be possible to store all the copying metadata in the bio data
>      buffer, this approach is not compatible with the existing bio
>      splitting code.
>    - Device mapper drivers only support a single LBA range per bio.
> * After a device mapper driver has finished mapping a bio, the result of
>    the map operation is stored in the copy offloading metadata. This
>    probably can be realized by intercepting dm_submit_bio_remap() calls.
> * The device mapper mapping process is repeated until all input and
>    output ranges have been mapped onto ranges not associated with a
>    device mapper device. Repeating this process is necessary in case of
>    stacked device mapper devices, e.g. dm-crypt on top of dm-linear.
> * After the mapping process is finished, the block layer checks whether
>    all LBA ranges are associated with the same non-stacking block driver
>    (NVMe, SCSI, ...). If not, the copy offload operation fails and the
>    block layer falls back to REQ_OP_READ and REQ_OP_WRITE operations.
> * One or more copy operations are submitted to the block driver. The
>    block driver is responsible for checking whether the copy operation
>    can be offloaded. While the SCSI EXTENDED COPY command supports
>    copying between logical units, whether the NVMe Copy command supports
>    copying across namespaces depends on the version of the NVMe
>    specification supported by the controller.
> * It is verified whether the copy operation copied all data.
>    If not, the block layer falls back to REQ_OP_READ and REQ_OP_WRITE.
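
If I understand the design correctly, the filesystem-facing entry point
would look roughly like the sketch below. The function name, the
completion callback and the parameter list are my guesses for the sake
of discussion, not something taken from your (not yet posted) series:

#include <linux/blkdev.h>
#include <linux/types.h>

/* Completion callback: reports how many bytes were actually copied. */
typedef void (*copy_offload_endio_t)(void *private, int error,
				     ssize_t copied);

/*
 * Hypothetical asynchronous entry point: copy @len bytes from @src_pos
 * on @src_bdev to @dst_pos on @dst_bdev and call @endio once the copy
 * (or its read/write fallback) has completed.
 */
int blkdev_copy_offload_async(struct block_device *src_bdev, loff_t src_pos,
			      struct block_device *dst_bdev, loff_t dst_pos,
			      size_t len, copy_offload_endio_t endio,
			      void *private, gfp_t gfp_mask);

An LFS garbage collector could then relocate a segment along the lines
of the call below, gc_segment_moved() being its completion callback:

	err = blkdev_copy_offload_async(bdev, src_lba << SECTOR_SHIFT,
					bdev, dst_lba << SECTOR_SHIFT,
					seg_bytes, gc_segment_moved,
					seg, GFP_NOFS);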
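
Internally, the REQ_OP_COPY_SRC and REQ_OP_COPY_DST halves could then be
tied together through a context reached from bi_private in both bios.
Again, all names below are invented for illustration, and I am glossing
over whatever payload or token the real bios may need to carry:

#include <linux/bio.h>
#include <linux/blkdev.h>

struct copy_offload_ctx {
	struct block_device	*src_bdev, *dst_bdev;
	sector_t		src_sector, dst_sector;
	unsigned int		nr_sectors;
	atomic_t		pending;	/* SRC + DST bios in flight */
	int			status;
	void			(*endio)(void *private, int error,
					 ssize_t copied);
	void			*private;
};

static void copy_offload_end_io(struct bio *bio)
{
	struct copy_offload_ctx *ctx = bio->bi_private;

	if (bio->bi_status)
		ctx->status = blk_status_to_errno(bio->bi_status);
	bio_put(bio);

	/* Complete the copy only after both halves have finished. */
	if (atomic_dec_and_test(&ctx->pending))
		ctx->endio(ctx->private, ctx->status,
			   ctx->status ? 0 :
			   (ssize_t)ctx->nr_sectors << SECTOR_SHIFT);
}

static void copy_offload_submit(struct copy_offload_ctx *ctx)
{
	struct bio *src = bio_alloc(ctx->src_bdev, 0, REQ_OP_COPY_SRC,
				    GFP_NOIO);
	struct bio *dst = bio_alloc(ctx->dst_bdev, 0, REQ_OP_COPY_DST,
				    GFP_NOIO);

	src->bi_iter.bi_sector = ctx->src_sector;
	dst->bi_iter.bi_sector = ctx->dst_sector;
	src->bi_iter.bi_size = dst->bi_iter.bi_size =
		ctx->nr_sectors << SECTOR_SHIFT;

	/* The shared context is what associates the two halves. */
	src->bi_private = dst->bi_private = ctx;
	src->bi_end_io = dst->bi_end_io = copy_offload_end_io;

	atomic_set(&ctx->pending, 2);
	submit_bio(src);
	submit_bio(dst);
}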
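
And the decision whether to offload at all, once device mapper has fully
remapped both ranges, might boil down to a check like the one below,
where bdev_same_lld() and copy_offload_fallback() are placeholders for
whatever the real driver check and the read/write emulation path turn
out to be:

/* Placeholder: true if both devices belong to the same non-stacking
 * low-level driver (NVMe, SCSI, ...). */
static bool bdev_same_lld(struct block_device *a, struct block_device *b);

/* Placeholder: emulate the copy with REQ_OP_READ + REQ_OP_WRITE. */
static void copy_offload_fallback(struct copy_offload_ctx *ctx);

static void copy_offload_dispatch(struct copy_offload_ctx *ctx)
{
	if (bdev_same_lld(ctx->src_bdev, ctx->dst_bdev))
		copy_offload_submit(ctx);
	else
		copy_offload_fallback(ctx);
}

Does that match what you have in mind, or does the series take a
different route?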
> 
> Thanks,
> 
> Bart.


