[PATCH v7 4/9] block: Add core atomic write support
Hannes Reinecke
hare at suse.de
Mon Jun 3 02:26:46 PDT 2024
On 6/2/24 16:09, John Garry wrote:
> Add atomic write support, as follows:
> - add helper functions to get request_queue atomic write limits
> - report request_queue atomic write support limits to sysfs and update Doc
> - support to safely merge atomic writes
> - deal with splitting atomic writes
> - misc helper functions
> - add a per-request atomic write flag
>
> New request_queue limits are added, as follows:
> - atomic_write_hw_max is set by the block driver and is the maximum length
> of an atomic write which the device may support. It is not
> necessarily a power-of-2.
> - atomic_write_max_sectors is derived from atomic_write_hw_max_sectors and
> max_hw_sectors. It is always a power-of-2. Atomic writes may be merged,
> and atomic_write_max_sectors would be the limit on a merged atomic write
> request size. This value is not capped at max_sectors, as the value in
> max_sectors can be controlled from userspace, and it would only cause
> trouble if userspace could limit atomic_write_unit_max_bytes and the
> other atomic write limits.
> - atomic_write_hw_unit_{min,max} are set by the block driver and are the
> min/max length of an atomic write unit which the device may support. They
> both must be a power-of-2. Typically atomic_write_hw_unit_max will hold
> the same value as atomic_write_hw_max.
> - atomic_write_unit_{min,max} are derived from
> atomic_write_hw_unit_{min,max}, max_hw_sectors, and block core limits.
> Both min and max values must be a power-of-2.
> - atomic_write_hw_boundary is set by the block driver. If non-zero, it
> indicates an LBA space boundary at which an atomic write straddles no
> longer is atomically executed by the disk. The value must be a
> power-of-2. Note that it would be acceptable to enforce a rule that
> atomic_write_hw_boundary_sectors is a multiple of
> atomic_write_hw_unit_max, but the resultant code would be more
> complicated.
>
> All atomic writes limits are by default set 0 to indicate no atomic write
> support. Even though it is assumed by Linux that a logical block can always
> be atomically written, we ignore this as it is not of particular interest.
> Stacked devices are just not supported either for now.
>
> An atomic write must always be submitted to the block driver as part of a
> single request. As such, only a single BIO must be submitted to the block
> layer for an atomic write. When a single atomic write BIO is submitted, it
> cannot be split. As such, atomic_write_unit_{max, min}_bytes are limited
> by the maximum guaranteed BIO size which will not be required to be split.
> This max size is calculated by request_queue max segments and the number
> of bvecs a BIO can fit, BIO_MAX_VECS. Currently we rely on userspace
> issuing a write with iovcnt=1 for pwritev2() - as such, we can rely on each
> segment containing PAGE_SIZE of data, apart from the first+last, which each
> can fit logical block size of data. The first+last will be LBS
> length/aligned as we rely on direct IO alignment rules also.
>
> New sysfs files are added to report the following atomic write limits:
> - atomic_write_unit_max_bytes - same as atomic_write_unit_max_sectors in
> bytes
> - atomic_write_unit_min_bytes - same as atomic_write_unit_min_sectors in
> bytes
> - atomic_write_boundary_bytes - same as atomic_write_hw_boundary_sectors in
> bytes
> - atomic_write_max_bytes - same as atomic_write_max_sectors in bytes
>
> Atomic writes may only be merged with other atomic writes and only under
> the following conditions:
> - total resultant request length <= atomic_write_max_bytes
> - the merged write does not straddle a boundary
>
> Helper function bdev_can_atomic_write() is added to indicate whether
> atomic writes may be issued to a bdev. If a bdev is a partition, the
> partition start must be aligned with both atomic_write_unit_min_sectors
> and atomic_write_hw_boundary_sectors.
>
> FSes will rely on the block layer to validate that an atomic write BIO
> submitted will be of valid size, so add blk_validate_atomic_write_op_size()
> for this purpose. Userspace expects an atomic write which is of invalid
> size to be rejected with -EINVAL, so add BLK_STS_INVAL for this. Also use
> BLK_STS_INVAL for when a BIO needs to be split, as this should mean an
> invalid size BIO.
>
> Flag REQ_ATOMIC is used for indicating an atomic write.
>
> Co-developed-by: Himanshu Madhani <himanshu.madhani at oracle.com>
> Signed-off-by: Himanshu Madhani <himanshu.madhani at oracle.com>
> Signed-off-by: John Garry <john.g.garry at oracle.com>
> ---
> Documentation/ABI/stable/sysfs-block | 53 ++++++++++++++++
> block/blk-core.c | 19 ++++++
> block/blk-merge.c | 95 +++++++++++++++++++++++++++-
> block/blk-settings.c | 52 +++++++++++++++
> block/blk-sysfs.c | 33 ++++++++++
> block/blk.h | 3 +
> include/linux/blk_types.h | 8 ++-
> include/linux/blkdev.h | 54 ++++++++++++++++
> 8 files changed, 315 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
> index 831f19a32e08..cea8856f798d 100644
> --- a/Documentation/ABI/stable/sysfs-block
> +++ b/Documentation/ABI/stable/sysfs-block
> @@ -21,6 +21,59 @@ Description:
> device is offset from the internal allocation unit's
> natural alignment.
>
> +What: /sys/block/<disk>/atomic_write_max_bytes
> +Date: February 2024
> +Contact: Himanshu Madhani <himanshu.madhani at oracle.com>
> +Description:
> + [RO] This parameter specifies the maximum atomic write
> + size reported by the device. This parameter is relevant
> + for merging of writes, where a merged atomic write
> + operation must not exceed this number of bytes.
> + This parameter may be greater than the value in
> + atomic_write_unit_max_bytes as
> + atomic_write_unit_max_bytes will be rounded down to a
> + power-of-two and atomic_write_unit_max_bytes may also be
> + limited by some other queue limits, such as max_segments.
> + This parameter - along with atomic_write_unit_min_bytes
> + and atomic_write_unit_max_bytes - will not be larger than
> + max_hw_sectors_kb, but may be larger than max_sectors_kb.
> +
> +
> +What: /sys/block/<disk>/atomic_write_unit_min_bytes
> +Date: February 2024
> +Contact: Himanshu Madhani <himanshu.madhani at oracle.com>
> +Description:
> + [RO] This parameter specifies the smallest block which can
> + be written atomically with an atomic write operation. All
> + atomic write operations must begin at a
> + atomic_write_unit_min boundary and must be multiples of
> + atomic_write_unit_min. This value must be a power-of-two.
> +
> +
> +What: /sys/block/<disk>/atomic_write_unit_max_bytes
> +Date: February 2024
> +Contact: Himanshu Madhani <himanshu.madhani at oracle.com>
> +Description:
> + [RO] This parameter defines the largest block which can be
> + written atomically with an atomic write operation. This
> + value must be a multiple of atomic_write_unit_min and must
> + be a power-of-two. This value will not be larger than
> + atomic_write_max_bytes.
> +
> +
> +What: /sys/block/<disk>/atomic_write_boundary_bytes
> +Date: February 2024
> +Contact: Himanshu Madhani <himanshu.madhani at oracle.com>
> +Description:
> + [RO] A device may need to internally split an atomic write I/O
> + which straddles a given logical block address boundary. This
> + parameter specifies the size in bytes of the atomic boundary if
> + one is reported by the device. This value must be a
> + power-of-two and at least the size as in
> + atomic_write_unit_max_bytes.
> + Any attempt to merge atomic write I/Os must not result in a
> + merged I/O which crosses this boundary (if any).
> +
>
> What: /sys/block/<disk>/diskseq
> Date: February 2021
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 82c3ae22d76d..d9f58fe71758 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -174,6 +174,8 @@ static const struct {
> /* Command duration limit device-side timeout */
> [BLK_STS_DURATION_LIMIT] = { -ETIME, "duration limit exceeded" },
>
> + [BLK_STS_INVAL] = { -EINVAL, "invalid" },
> +
> /* everything else not covered above: */
> [BLK_STS_IOERR] = { -EIO, "I/O" },
> };
> @@ -739,6 +741,18 @@ void submit_bio_noacct_nocheck(struct bio *bio)
> __submit_bio_noacct(bio);
> }
>
> +static blk_status_t blk_validate_atomic_write_op_size(struct request_queue *q,
> + struct bio *bio)
> +{
> + if (bio->bi_iter.bi_size > queue_atomic_write_unit_max_bytes(q))
> + return BLK_STS_INVAL;
> +
> + if (bio->bi_iter.bi_size % queue_atomic_write_unit_min_bytes(q))
> + return BLK_STS_INVAL;
> +
> + return BLK_STS_OK;
> +}
> +
> /**
> * submit_bio_noacct - re-submit a bio to the block device layer for I/O
> * @bio: The bio describing the location in memory and on the device.
> @@ -797,6 +811,11 @@ void submit_bio_noacct(struct bio *bio)
> switch (bio_op(bio)) {
> case REQ_OP_READ:
> case REQ_OP_WRITE:
> + if (bio->bi_opf & REQ_ATOMIC) {
> + status = blk_validate_atomic_write_op_size(q, bio);
> + if (status != BLK_STS_OK)
> + goto end_io;
> + }
> break;
> case REQ_OP_FLUSH:
> /*
> diff --git a/block/blk-merge.c b/block/blk-merge.c
> index 8957e08e020c..ad07759ca147 100644
> --- a/block/blk-merge.c
> +++ b/block/blk-merge.c
> @@ -18,6 +18,46 @@
> #include "blk-rq-qos.h"
> #include "blk-throttle.h"
>
> +/*
> + * rq_straddles_atomic_write_boundary - check for boundary violation
> + * @rq: request to check
> + * @front: data size to be appended to front
> + * @back: data size to be appended to back
> + *
> + * Determine whether merging a request or bio into another request will result
> + * in a merged request which straddles an atomic write boundary.
> + *
> + * The value @front_adjust is the data which would be appended to the front of
> + * @rq, while the value @back_adjust is the data which would be appended to the
> + * back of @rq. Callers will typically only have either @front_adjust or
> + * @back_adjust as non-zero.
> + *
> + */
> +static bool rq_straddles_atomic_write_boundary(struct request *rq,
> + unsigned int front_adjust,
> + unsigned int back_adjust)
> +{
> + unsigned int boundary = queue_atomic_write_boundary_bytes(rq->q);
> + u64 mask, start_rq_pos, end_rq_pos;
> +
> + if (!boundary)
> + return false;
> +
> + start_rq_pos = blk_rq_pos(rq) << SECTOR_SHIFT;
> + end_rq_pos = start_rq_pos + blk_rq_bytes(rq) - 1;
> +
> + start_rq_pos -= front_adjust;
> + end_rq_pos += back_adjust;
> +
> + mask = ~(boundary - 1);
> +
> + /* Top bits are different, so crossed a boundary */
> + if ((start_rq_pos & mask) != (end_rq_pos & mask))
> + return true;
> +
> + return false;
> +}
But isn't that precisely what 'chunk_sectors' is doing?
IE ensuring that requests never cross that boundary?
Q1: Shouldn't we rather use/modify/adapt chunk_sectors for this thing?
Q2: If we don't, shouldn't we align the atomic write boundary to the
chunk_sectors setting to ensure both match up?
Cheers,
Hannes
More information about the Linux-nvme
mailing list