[RFC PATCH v4 2/3] block: add simple copy support

Damien Le Moal Damien.LeMoal at wdc.com
Tue Jan 5 18:33:57 EST 2021


On 2021/01/05 21:24, Selva Jove wrote:
> Thanks for the review, Damien.
> 
> On Mon, Jan 4, 2021 at 6:17 PM Damien Le Moal <Damien.LeMoal at wdc.com> wrote:
>>
>> On 2021/01/04 19:48, SelvaKumar S wrote:
>>> Add a new BLKCOPY ioctl that offloads copying of one or more source
>>> ranges to a destination on the device. It accepts a copy_ranges
>>> structure that contains the destination, the number of sources, and a
>>> pointer to the array of source ranges. Each range_entry contains the
>>> start and length of a source range (in bytes).
>>>
>>> Introduce REQ_OP_COPY, a no-merge copy offload operation. Create
>>> bio with control information as payload and submit to the device.
>>> REQ_OP_COPY(19) is a write op and takes zone_write_lock when submitted
>>> to zoned device.
>>>
>>> If the device doesn't support copy, or copy offload is disabled, the
>>> copy is emulated by allocating memory of the total copy size. The
>>> source ranges are read into memory by chaining a bio for each source
>>> range, submitting them asynchronously, with the last bio submitted
>>> synchronously to wait for completion. After the data is read, it is
>>> written to the destination.
>>>
>>> bio_map_kern() is used to allocate the bio and add the pages of the
>>> copy buffer to it. Since bio->bi_private and bio->bi_end_io are needed
>>> for chaining and get overwritten, invalidate_kernel_vmap_range() for
>>> the read is called in the caller.
>>>
>>> Introduce queue limits for simple copy and other helper functions.
>>> Add device limits as sysfs entries.
>>>       - copy_offload
>>>       - max_copy_sectors
>>>       - max_copy_ranges_sectors
>>>       - max_copy_nr_ranges
>>>
>>> copy_offload(= 0) is disabled by default.
>>> max_copy_sectors = 0 indicates the device doesn't support native copy.
>>>
>>> Native copy offload is not supported for stacked devices; for them the
>>> copy is done via emulation.
>>>
>>> Signed-off-by: SelvaKumar S <selvakuma.s1 at samsung.com>
>>> Signed-off-by: Kanchan Joshi <joshi.k at samsung.com>
>>> Signed-off-by: Nitesh Shetty <nj.shetty at samsung.com>
>>> Signed-off-by: Javier González <javier.gonz at samsung.com>
>>> ---
>>>  block/blk-core.c          |  94 ++++++++++++++--
>>>  block/blk-lib.c           | 223 ++++++++++++++++++++++++++++++++++++++
>>>  block/blk-merge.c         |   2 +
>>>  block/blk-settings.c      |  10 ++
>>>  block/blk-sysfs.c         |  50 +++++++++
>>>  block/blk-zoned.c         |   1 +
>>>  block/bounce.c            |   1 +
>>>  block/ioctl.c             |  43 ++++++++
>>>  include/linux/bio.h       |   1 +
>>>  include/linux/blk_types.h |  15 +++
>>>  include/linux/blkdev.h    |  13 +++
>>>  include/uapi/linux/fs.h   |  13 +++
>>>  12 files changed, 458 insertions(+), 8 deletions(-)
>>>
>>> diff --git a/block/blk-core.c b/block/blk-core.c
>>> index 96e5fcd7f071..4a5cd3f53cd2 100644
>>> --- a/block/blk-core.c
>>> +++ b/block/blk-core.c
>>> @@ -719,6 +719,17 @@ static noinline int should_fail_bio(struct bio *bio)
>>>  }
>>>  ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO);
>>>
>>> +static inline int bio_check_copy_eod(struct bio *bio, sector_t start,
>>> +             sector_t nr_sectors, sector_t maxsector)
>>> +{
>>> +     if (nr_sectors && maxsector &&
>>> +         (nr_sectors > maxsector || start > maxsector - nr_sectors)) {
>>> +             handle_bad_sector(bio, maxsector);
>>> +             return -EIO;
>>> +     }
>>> +     return 0;
>>> +}
>>> +
>>>  /*
>>>   * Check whether this bio extends beyond the end of the device or partition.
>>>   * This may well happen - the kernel calls bread() without checking the size of
>>> @@ -737,6 +748,65 @@ static inline int bio_check_eod(struct bio *bio, sector_t maxsector)
>>>       return 0;
>>>  }
>>>
>>> +/*
>>> + * Check for copy limits and remap source ranges if needed.
>>> + */
>>> +static int blk_check_copy(struct bio *bio)
>>> +{
>>> +     struct block_device *p = NULL;
>>> +     struct request_queue *q = bio->bi_disk->queue;
>>> +     struct blk_copy_payload *payload;
>>> +     int i, maxsector, start_sect = 0, ret = -EIO;
>>> +     unsigned short nr_range;
>>> +
>>> +     rcu_read_lock();
>>> +
>>> +     p = __disk_get_part(bio->bi_disk, bio->bi_partno);
>>> +     if (unlikely(!p))
>>> +             goto out;
>>> +     if (unlikely(should_fail_request(p, bio->bi_iter.bi_size)))
>>> +             goto out;
>>> +     if (unlikely(bio_check_ro(bio, p)))
>>> +             goto out;
>>> +
>>> +     maxsector =  bdev_nr_sectors(p);
>>> +     start_sect = p->bd_start_sect;
>>> +
>>> +     payload = bio_data(bio);
>>> +     nr_range = payload->copy_range;
>>> +
>>> +     /* cannot handle copy crossing nr_ranges limit */
>>> +     if (payload->copy_range > q->limits.max_copy_nr_ranges)
>>> +             goto out;
>>
>> If payload->copy_range indicates the number of ranges to be copied, you should
>> name it payload->copy_nr_ranges.
>>
> 
> Agreed. Will rename the entries.
> 
>>> +
>>> +     /* cannot handle copy more than copy limits */
>>
>> Why ? You could loop... Similarly to discard, zone reset etc.
>>
> 
> Sure. Will add the support for handling copy larger than device limits.
> 
>>
>>> +     if (payload->copy_size > q->limits.max_copy_sectors)
>>> +             goto out;
>>
>> Shouldn't the list of source ranges give the total size to be copied ?
>> Otherwise, if payload->copy_size is user provided, you would need to check that
>> the sum of the source ranges actually is equal to copy_size, no ? That means
>> that dropping copy_size sounds like the right thing to do here.
>>
> 
> payload->copy_size is used in copy emulation to allocate the buffer.
> Let me check
> one more time if it is possible to do without this.

If this is used for the emulation only, then it should be a local variable in
the emulation helper.
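For illustration, the emulation helper could recompute the buffer size from the source ranges themselves rather than carrying a user-supplied copy_size in the payload. A standalone sketch, using simplified stand-ins for the structures in this patch (field names mirror the patch; the helper name is made up for this demo):

```c
#include <stdint.h>

/* Simplified stand-ins for the patch's structures (sector units). */
struct range_entry {
	uint64_t src;
	uint64_t len;
};

struct blk_copy_payload {
	uint16_t copy_range;		/* number of source ranges */
	struct range_entry range[];	/* flexible array of sources */
};

/*
 * Derive the emulation buffer size (in sectors) from the ranges
 * themselves instead of trusting a separately stored copy_size field.
 */
static uint64_t copy_emulate_buf_sectors(const struct blk_copy_payload *p)
{
	uint64_t total = 0;

	for (uint16_t i = 0; i < p->copy_range; i++)
		total += p->range[i].len;
	return total;
}
```

With this, copy_size never has to round-trip through the payload at all; the emulation path computes it locally and the offload path never needs it.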

[...]
>>
>>> +             if (unlikely(blk_check_copy(bio)))
>>> +                     goto end_io;
>>> +             break;
>>>       case REQ_OP_SECURE_ERASE:
>>>               if (!blk_queue_secure_erase(q))
>>>                       goto not_supported;
>>> diff --git a/block/blk-lib.c b/block/blk-lib.c
>>> index 752f9c722062..4c0f12e2ed7c 100644
>>> --- a/block/blk-lib.c
>>> +++ b/block/blk-lib.c
>>> @@ -150,6 +150,229 @@ int blkdev_issue_discard(struct block_device *bdev, sector_t sector,
>>>  }
>>>  EXPORT_SYMBOL(blkdev_issue_discard);
>>>
>>> +int blk_copy_offload(struct block_device *bdev, struct blk_copy_payload *payload,
>>> +             sector_t dest, gfp_t gfp_mask)
>>> +{
>>> +     struct bio *bio;
>>> +     int ret, total_size;
>>> +
>>> +     bio = bio_alloc(gfp_mask, 1);
>>> +     bio->bi_iter.bi_sector = dest;
>>> +     bio->bi_opf = REQ_OP_COPY | REQ_NOMERGE;
>>> +     bio_set_dev(bio, bdev);
>>> +
>>> +     total_size = struct_size(payload, range, payload->copy_range);
>>> +     ret = bio_add_page(bio, virt_to_page(payload), total_size,
>>
>> How is payload allocated ? If it is a structure on-stack in the caller, I am not
>> sure it would be wise to do an IO using the thread stack page...
>>
>>> +                                        offset_in_page(payload));
>>> +     if (ret != total_size) {
>>> +             ret = -ENOMEM;
>>> +             bio_put(bio);
>>> +             goto err;
>>> +     }
>>> +
>>> +     ret = submit_bio_wait(bio);
>>> +err:
>>> +     bio_put(bio);
>>> +     return ret;
>>> +
>>> +}
>>> +
>>> +int blk_read_to_buf(struct block_device *bdev, struct blk_copy_payload *payload,
>>> +             gfp_t gfp_mask, void **buf_p)
>>> +{
>>> +     struct request_queue *q = bdev_get_queue(bdev);
>>> +     struct bio *bio, *parent = NULL;
>>> +     void *buf = NULL;
>>> +     bool is_vmalloc;
>>> +     int i, nr_srcs, copy_len, ret, cur_size, t_len = 0;
>>> +
>>> +     nr_srcs = payload->copy_range;
>>> +     copy_len = payload->copy_size << SECTOR_SHIFT;
>>> +
>>> +     buf = kvmalloc(copy_len, gfp_mask);
>>> +     if (!buf)
>>> +             return -ENOMEM;
>>> +     is_vmalloc = is_vmalloc_addr(buf);
>>> +
>>> +     for (i = 0; i < nr_srcs; i++) {
>>> +             cur_size = payload->range[i].len << SECTOR_SHIFT;
>>> +
>>> +             bio = bio_map_kern(q, buf + t_len, cur_size, gfp_mask);
>>> +             if (IS_ERR(bio)) {
>>> +                     ret = PTR_ERR(bio);
>>> +                     goto out;
>>> +             }
>>> +
>>> +             bio->bi_iter.bi_sector = payload->range[i].src;
>>> +             bio->bi_opf = REQ_OP_READ;
>>> +             bio_set_dev(bio, bdev);
>>> +             bio->bi_end_io = NULL;
>>> +             bio->bi_private = NULL;
>>> +
>>> +             if (parent) {
>>> +                     bio_chain(parent, bio);
>>> +                     submit_bio(parent);
>>> +             }
>>> +
>>> +             parent = bio;
>>> +             t_len += cur_size;
>>> +     }
>>> +
>>> +     ret = submit_bio_wait(bio);
>>> +     bio_put(bio);
>>> +     if (is_vmalloc)
>>> +             invalidate_kernel_vmap_range(buf, copy_len);
>>> +     if (ret)
>>> +             goto out;
>>> +
>>> +     *buf_p = buf;
>>> +     return 0;
>>> +out:
>>> +     kvfree(buf);
>>> +     return ret;
>>> +}
>>> +
>>> +int blk_write_from_buf(struct block_device *bdev, void *buf, sector_t dest,
>>> +             int copy_len, gfp_t gfp_mask)
>>> +{
>>> +     struct request_queue *q = bdev_get_queue(bdev);
>>> +     struct bio *bio;
>>> +     int ret;
>>> +
>>> +     bio = bio_map_kern(q, buf, copy_len, gfp_mask);
>>> +     if (IS_ERR(bio)) {
>>> +             ret = PTR_ERR(bio);
>>> +             goto out;
>>> +     }
>>> +     bio_set_dev(bio, bdev);
>>> +     bio->bi_opf = REQ_OP_WRITE;
>>> +     bio->bi_iter.bi_sector = dest;
>>> +
>>> +     bio->bi_end_io = NULL;
>>> +     ret = submit_bio_wait(bio);
>>> +     bio_put(bio);
>>> +out:
>>> +     return ret;
>>> +}
>>> +
>>> +int blk_prepare_payload(struct block_device *bdev, int nr_srcs, struct range_entry *rlist,
>>> +             gfp_t gfp_mask, struct blk_copy_payload **payload_p)
>>> +{
>>> +
>>> +     struct request_queue *q = bdev_get_queue(bdev);
>>> +     struct blk_copy_payload *payload;
>>> +     sector_t bs_mask;
>>> +     sector_t src_sects, len = 0, total_len = 0;
>>> +     int i, ret, total_size;
>>> +
>>> +     if (!q)
>>> +             return -ENXIO;
>>> +
>>> +     if (!nr_srcs)
>>> +             return -EINVAL;
>>> +
>>> +     if (bdev_read_only(bdev))
>>> +             return -EPERM;
>>> +
>>> +     bs_mask = (bdev_logical_block_size(bdev) >> 9) - 1;
>>> +
>>> +     total_size = struct_size(payload, range, nr_srcs);
>>> +     payload = kmalloc(total_size, gfp_mask);
>>> +     if (!payload)
>>> +             return -ENOMEM;
>>
>> OK. So this is what goes into the bio. The function blk_copy_offload() assumes
>> this is at most one page, so total_size <= PAGE_SIZE. Where is that checked ?
>>
> 
> CMIIW. As payload was allocated via kmalloc, it would be represented by a single
> contiguous segment. In case it crosses one page, rely on multi-page bio_vec to
> cover it.

That is not how I understand the code. If the allocation is more than one page,
you still need to add ALL pages to the BIO. What the multi-page bvec code will
do is to use a single bvec for all physically contiguous pages instead of adding
one bvec per page.

Thinking more about it, I think the function blk_copy_offload() could simply use
bio_map_kern() to allocate and prepare the BIO. That will avoid the need for the
add page loop.
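That refactoring might look roughly like this. An uncompiled sketch against the structures in this patch, assuming blk_copy_payload keeps its current layout; bio_map_kern() maps the kmalloc'ed payload into the bio and handles multi-page payloads, so the manual bio_add_page() call goes away:

```c
int blk_copy_offload(struct block_device *bdev,
		     struct blk_copy_payload *payload, sector_t dest,
		     gfp_t gfp_mask)
{
	struct request_queue *q = bdev_get_queue(bdev);
	struct bio *bio;
	int ret;

	/* Map the entire payload, however many pages it spans. */
	bio = bio_map_kern(q, payload,
			   struct_size(payload, range, payload->copy_range),
			   gfp_mask);
	if (IS_ERR(bio))
		return PTR_ERR(bio);

	bio->bi_iter.bi_sector = dest;
	bio->bi_opf = REQ_OP_COPY | REQ_NOMERGE;
	bio_set_dev(bio, bdev);

	ret = submit_bio_wait(bio);
	bio_put(bio);
	return ret;
}
```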

> 
>>> +
>>> +     for (i = 0; i < nr_srcs; i++) {
>>> +             /*  copy payload provided are in bytes */
>>> +             src_sects = rlist[i].src;
>>> +             if (src_sects & bs_mask) {
>>> +                     ret =  -EINVAL;
>>> +                     goto err;
>>> +             }
>>> +             src_sects = src_sects >> SECTOR_SHIFT;
>>> +
>>> +             if (len & bs_mask) {
>>> +                     ret = -EINVAL;
>>> +                     goto err;
>>> +             }
>>> +
>>> +             len = rlist[i].len >> SECTOR_SHIFT;
>>> +
>>> +             total_len += len;
>>> +
>>> +             WARN_ON_ONCE((src_sects << 9) > UINT_MAX);
>>> +
>>> +             payload->range[i].src = src_sects;
>>> +             payload->range[i].len = len;
>>> +     }
>>> +
>>> +     /* storing # of source ranges */
>>> +     payload->copy_range = i;
>>> +     /* storing copy len so far */
>>> +     payload->copy_size = total_len;
>>
>> The comments here make the code ugly. Rename the variables and structure fields
>> to have something self explanatory.
>>
> 
> Agreed.
> 
>>> +
>>> +     *payload_p = payload;
>>> +     return 0;
>>> +err:
>>> +     kfree(payload);
>>> +     return ret;
>>> +}
>>> +
>>> +/**
>>> + * blkdev_issue_copy - queue a copy
>>> + * @bdev:       source block device
>>> + * @nr_srcs: number of source ranges to copy
>>> + * @rlist:   array of source ranges (in bytes)
>>> + * @dest_bdev:       destination block device
>>> + * @dest:    destination (in bytes)
>>> + * @gfp_mask:   memory allocation flags (for bio_alloc)
>>> + *
>>> + * Description:
>>> + *   Copy array of source ranges from source block device to
>>> + *   destination block device.
>>> + */
>>> +
>>> +
>>> +int blkdev_issue_copy(struct block_device *bdev, int nr_srcs,
>>> +             struct range_entry *src_rlist, struct block_device *dest_bdev,
>>> +             sector_t dest, gfp_t gfp_mask)
>>> +{
>>> +     struct request_queue *q = bdev_get_queue(bdev);
>>> +     struct blk_copy_payload *payload;
>>> +     sector_t bs_mask, dest_sect;
>>> +     void *buf = NULL;
>>> +     int ret;
>>> +
>>> +     ret = blk_prepare_payload(bdev, nr_srcs, src_rlist, gfp_mask, &payload);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     bs_mask = (bdev_logical_block_size(dest_bdev) >> 9) - 1;
>>> +     if (dest & bs_mask) {
>>> +             return -EINVAL;
>>> +             goto out;
>>> +     }
>>> +     dest_sect = dest >> SECTOR_SHIFT;
>>
>> dest should already be a 512B sector as all block layer function interfaces use
>> this unit. What is this doing ?
>>
> 
> As the source ranges and lengths received were in bytes, to keep things
> consistent, dest was
> also chosen to be in bytes. Seems wrong. Will change it to a 512B sector.

Using a byte interface seems very dangerous since writes can at best be LBA
sized, and must be physical sector size aligned for zoned block devices. I
strongly suggest that the interface use the sector_t 512B unit.
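To illustrate the alignment concern in plain C (a standalone demo, not kernel code; the mask computation mirrors the bs_mask logic in this patch, with a 512B sector interface assumed):

```c
#include <stdbool.h>
#include <stdint.h>

#define SECTOR_SHIFT 9

/*
 * With a 512B-sector interface, a copy destination is valid only if it
 * is aligned to the device's logical block size. lbs_bytes is the
 * logical block size in bytes (e.g. 512 or 4096).
 */
static bool copy_dest_aligned(uint64_t dest_sector, unsigned int lbs_bytes)
{
	uint64_t bs_mask = (lbs_bytes >> SECTOR_SHIFT) - 1;

	return (dest_sector & bs_mask) == 0;
}
```

A byte-granular interface would silently accept offsets that fail this check once converted, which is why a sector-unit interface catches misuse earlier.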

> 
>>
>>> +
>>> +     if (bdev == dest_bdev && q->limits.copy_offload) {
>>> +             ret = blk_copy_offload(bdev, payload, dest_sect, gfp_mask);
>>> +             if (ret)
>>> +                     goto out;
>>> +     } else {
>>> +             ret = blk_read_to_buf(bdev, payload, gfp_mask, &buf);
>>> +             if (ret)
>>> +                     goto out;
>>> +             ret = blk_write_from_buf(dest_bdev, buf, dest_sect,
>>> +                             payload->copy_size << SECTOR_SHIFT, gfp_mask);
>>
>> Arg... This is really not pretty. At the very least, this should all be in a
>> blk_copy_emulate() helper or something named like that.
>>
> 
> Okay. Will move this to a helper.
> 
>> My worry though is that this is likely slow for large copies, not to mention that
>> buf is likely to fail to allocate for large copy cases. As commented previously,
>> dm-kcopyd already does such copy well, with a read-write streaming pipeline and
>> zone support too for the write side. This should really be reused, at least
>> partly, instead of re-implementing this read-write here.
>>
> 
> I was a bit hesitant to use dm layer code in the block layer. It makes sense to
> reuse the code as much as possible. Will try to reuse dm-kcopyd code for copy
> emulation part.

You missed my point. I never suggested that you use DM code in the block layer.
That would be wrong. What I suggested is that you move the dm-kcopyd code from
DM into the block layer, changing the function names to blk_copy_xxx(). See the
current interface in include/linux/dm-kcopyd.h: there is absolutely nothing that
is DM specific in there, so moving that code into block/blk-copy.c should be
straightforward, mostly.

The way I see it, your series should look something like this:
1) A prep patch that moves dm-kcopyd to the block layer, changing the API names
and converting all DM driver users to the new API. This may be a very large
patch though, so splitting it into multiple pieces may be required to facilitate
review.
2) A prep patch that introduces queue flags for devices to advertise their
support for simple copy and a generic api for simple copy BIOs
3) A patch that adds the use of simple copy BIO into the moved dm-kcopyd code
4) The NVMe driver bits that probably will look like what you have now

With this, all DM drivers that currently use dm-kcopyd (RAID and dm-zoned at
least) will get free offload using simple copy commands for sector copies within
the same device. All other copy cases will still work as per kcopyd code. That
is very neat I think.

And you can add one more patch that uses this generic copy block interface to
implement copy_file_range for raw block devices as Darrick suggested.
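For reference, copy_file_range() as it exists today (on regular files) shows the user-facing semantics such a patch would extend to block devices. An ordinary userspace demo of the syscall's offset-advancing loop; the helper name is made up here:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/*
 * Loop until len bytes are copied: copy_file_range() may copy fewer
 * bytes than requested, and it advances the offsets it is passed.
 */
static ssize_t copy_range(int fd_in, off_t off_in, int fd_out,
			  off_t off_out, size_t len)
{
	ssize_t copied = 0;

	while ((size_t)copied < len) {
		ssize_t n = copy_file_range(fd_in, &off_in, fd_out,
					    &off_out, len - copied, 0);
		if (n < 0)
			return n;	/* error */
		if (n == 0)
			break;		/* EOF on source */
		copied += n;
	}
	return copied;
}
```

With block device support wired up, the same call on two raw device fds could be serviced by a simple copy command instead of a read-write loop.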


[...]
>>> +static ssize_t queue_copy_offload_store(struct request_queue *q,
>>> +                                    const char *page, size_t count)
>>> +{
>>> +     unsigned long copy_offload;
>>> +     ssize_t ret = queue_var_store(&copy_offload, page, count);
>>> +
>>> +     if (ret < 0)
>>> +             return ret;
>>> +
>>> +     if (copy_offload < 0 || copy_offload > 1)
>>
>> err... "copy_offload != 1" ?
> 
> Initial thought was to keep it either 0 or 1.  I'll change it to 0 or else.

If you use 0 and 1, then make copy_offload a bool. That is more natural given
the variable name.
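A bool-typed store could look roughly like this (an uncompiled sketch; it assumes q->limits.copy_offload is converted to bool as suggested):

```c
static ssize_t queue_copy_offload_store(struct request_queue *q,
					const char *page, size_t count)
{
	bool copy_offload;
	/* kstrtobool() accepts 0/1, y/n, on/off and rejects anything else */
	int ret = kstrtobool(page, &copy_offload);

	if (ret)
		return ret;

	q->limits.copy_offload = copy_offload;
	return count;
}
```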

[...]
>>> +static int blk_ioctl_copy(struct block_device *bdev, fmode_t mode,
>>> +             unsigned long arg)
>>> +{
>>> +     struct copy_range crange;
>>> +     struct range_entry *rlist;
>>> +     struct request_queue *q = bdev_get_queue(bdev);
>>> +     sector_t dest;
>>> +     int ret;
>>> +
>>> +     if (!(mode & FMODE_WRITE))
>>> +             return -EBADF;
>>> +
>>> +     if (!blk_queue_copy(q))
>>> +             return -EOPNOTSUPP;
>>
>> But you added the read-write emulation. So what is the point here ? This ioctl
>> should work for any device, with or without simple copy support. Did you test that ?
>>
> 
> Sorry. It was a mistake. Will fix this.

Please make sure to test to catch such mistakes before sending. It does sound
like your read-write manual copy has not been tested.


-- 
Damien Le Moal
Western Digital Research


