[PATCH v4 05/11] block: Add core atomic write support
John Garry
john.g.garry at oracle.com
Mon Feb 26 01:23:35 PST 2024
On 25/02/2024 12:09, Ritesh Harjani (IBM) wrote:
> John Garry <john.g.garry at oracle.com> writes:
>
>> Add atomic write support as follows:
>> - report request_queue atomic write support limits to sysfs and update Doc
>> - add helper functions to get request_queue atomic write limits
>> - support to safely merge atomic writes
>> - add a per-request atomic write flag
>> - deal with splitting atomic writes
>> - misc helper functions
>>
>> New sysfs files are added to report the following atomic write limits:
>> - atomic_write_boundary_bytes
>> - atomic_write_max_bytes
>> - atomic_write_unit_max_bytes
>> - atomic_write_unit_min_bytes
>>
>> atomic_write_unit_{min,max}_bytes report the min and max atomic write
>> support size, inclusive, and are primarily dictated by HW capability. Both
>> values must be a power-of-2. atomic_write_boundary_bytes, if non-zero,
>> indicates an LBA space boundary; an atomic write which straddles it is no
>> longer atomically executed by the disk. atomic_write_max_bytes is the
>> maximum merged size for an atomic write. Often it will be the same value as
>> atomic_write_unit_max_bytes.
>
> Instead of explaining sysfs outputs which are derivatives of HW
> and request_queue limits (and also defined in Documentation), maybe we
> could explain how those sysfs values are derived instead -
>
> struct queue_limits {
> <...>
> unsigned int atomic_write_hw_max_sectors;
> unsigned int atomic_write_max_sectors;
> unsigned int atomic_write_hw_boundary_sectors;
> unsigned int atomic_write_hw_unit_min_sectors;
> unsigned int atomic_write_unit_min_sectors;
> unsigned int atomic_write_hw_unit_max_sectors;
> unsigned int atomic_write_unit_max_sectors;
> <...>
>
> 1. atomic_write_hw_max_sectors comes directly from hw and it need
> not be a power of 2.
>
> 2. atomic_write_hw_unit_min_sectors and atomic_write_hw_unit_max_sectors
> are again defined/derived from hw limits, but they are rounded down so that
> they are always a power of 2.
>
> 3. atomic_write_hw_boundary_sectors again comes from HW boundary limit.
> It could either be 0 (which means the device specifies no boundary limit) or a
> multiple of unit_max. It need not be power of 2, however the current
> code assumes it to be a power of 2 (check callers of blk_queue_atomic_write_boundary_bytes())
>
> 4. atomic_write_max_sectors, atomic_write_unit_min_sectors
> and atomic_write_unit_max_sectors are all derived out of above hw limits
> inside function blk_atomic_writes_update_limits() based on request_queue
> limits.
> a. atomic_write_max_sectors is derived from atomic_write_hw_unit_max_sectors and
> request_queue's max_hw_sectors limit. It also guarantees max
> sectors that can be fit in a single bio.
> b. atomic_write_unit_[min|max]_sectors are derived from atomic_write_hw_unit_[min|max]_sectors,
> request_queue's max_hw_sectors & blk_queue_max_guaranteed_bio_sectors(). Both of these limits
> are kept as a power of 2.
>
> Now coming to sysfs outputs -
> 1. atomic_write_unit_max_bytes: Same as atomic_write_unit_max_sectors in bytes
> 2. atomic_write_unit_min_bytes: Same as atomic_write_unit_min_sectors in bytes
> 3. atomic_write_boundary_bytes: same as atomic_write_hw_boundary_sectors
> in bytes
> 4. atomic_write_max_bytes: Same as atomic_write_max_sectors in bytes
>
ok, I can look to incorporate the advised formatting changes
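As a rough illustration of the derivation described above (a user-space sketch, not the code from the patch): the helper names below are hypothetical, and taking the minimum of the three inputs before rounding down is an assumption about how the limits combine. Only the "kept as a power of 2" property and the sectors-to-bytes conversion come from the description.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors, as in the kernel */

/* Round v down to the nearest power of two; 0 stays 0. */
static unsigned int rounddown_pow2(unsigned int v)
{
	unsigned int p = 1;

	if (!v)
		return 0;
	while (p <= v / 2)
		p <<= 1;
	/* Postcondition: the result is a power of two. */
	assert((p & (p - 1)) == 0);
	return p;
}

/*
 * Hypothetical helper: cap the HW unit max by the queue's max_hw_sectors
 * and by the number of sectors guaranteed to fit in a single bio, then
 * round down so the result stays a power of two.
 * ASSUMPTION: a plain minimum of the three inputs.
 */
static unsigned int derive_unit_max_sectors(unsigned int hw_unit_max,
					    unsigned int max_hw_sectors,
					    unsigned int bio_guaranteed_sectors)
{
	unsigned int v = hw_unit_max;

	if (v > max_hw_sectors)
		v = max_hw_sectors;
	if (v > bio_guaranteed_sectors)
		v = bio_guaranteed_sectors;
	return rounddown_pow2(v);
}

/* The sysfs files report bytes; the limits are stored in sectors. */
static uint64_t sectors_to_bytes(unsigned int sectors)
{
	return (uint64_t)sectors << SECTOR_SHIFT;
}
```

For example, a HW unit max of 128 sectors on a queue where only 96 sectors are guaranteed to fit in a bio would be reported as 64 sectors, i.e. 32768 bytes.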
>>
>> atomic_write_unit_max_bytes is capped at the maximum data size which we are
>> guaranteed to be able to fit in a BIO, as an atomic write must always be
>> submitted as a single BIO. This BIO max size is dictated by the number of
>
> Here it says that the atomic write must always be submitted as a single
> bio. From where to where?
submitted to the block layer/core
> I think you meant from FS to block layer.
sure, or also block device file operations (in fops.c) to block core
> Because otherwise we still allow request/bio merging inside block layer
> based on the request queue limits we defined above. i.e. bio can be
> chained to form
> rq->biotail->bi_next = next_rq->bio
> as long as the merged request is within the queue_limits.
>
> i.e. atomic write requests can be merged as long as -
> - both rqs have REQ_ATOMIC set
> - blk_rq_sectors(final_rq) <= q->limits.atomic_write_max_sectors
> - final rq formed should not straddle limits->atomic_write_hw_boundary_sectors
>
> However, splitting of an atomic write request is not allowed. And if it
> happens, we fail the I/O req & return -EINVAL.
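The three merge conditions listed above could be sketched in user space roughly as follows. This is an illustration only: struct toy_rq, struct toy_limits and toy_atomic_can_back_merge() are hypothetical stand-ins rather than kernel types, everything is in sector units, the boundary is assumed to be a power of 2, and the contiguity check is an added assumption about back merges.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct request. */
struct toy_rq {
	uint64_t pos;		/* start sector */
	unsigned int sectors;	/* length in sectors */
	bool atomic;		/* REQ_ATOMIC set */
};

/* Hypothetical stand-in for the relevant queue_limits members. */
struct toy_limits {
	unsigned int atomic_write_max_sectors;
	unsigned int atomic_write_hw_boundary_sectors;	/* 0 = no boundary */
};

/* Can 'next' be appended to 'rq' while keeping the write atomic? */
static bool toy_atomic_can_back_merge(const struct toy_rq *rq,
				      const struct toy_rq *next,
				      const struct toy_limits *lim)
{
	uint64_t boundary = lim->atomic_write_hw_boundary_sectors;
	uint64_t total, last;

	/* Both requests must carry the atomic flag. */
	if (!rq->atomic || !next->atomic)
		return false;

	/* ASSUMPTION: a back merge needs 'next' to start where 'rq' ends. */
	if (rq->pos + rq->sectors != next->pos)
		return false;

	/* The merged request must fit in atomic_write_max_sectors. */
	total = (uint64_t)rq->sectors + next->sectors;
	if (total > lim->atomic_write_max_sectors)
		return false;

	if (!boundary)
		return true;

	/* The merged request must not straddle the HW boundary. */
	assert((boundary & (boundary - 1)) == 0);
	last = rq->pos + total - 1;	/* last sector covered, inclusive */
	return rq->pos / boundary == last / boundary;
}
```

With a boundary and max of 16 sectors, two 8-sector atomic requests at sectors 0 and 8 can merge, while the same pair at sectors 8 and 16 cannot, since the result would cross sector 16.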
...
>
> IMHO, the commit message can definitely use a re-write. I agree that you
> have put in a lot of information, but I think it can be more organized.
ok, fine. I'll look at this. Thanks.
>
>>
>> Contains significant contributions from:
>> Himanshu Madhani <himanshu.madhani at oracle.com>
>
> Maybe it can use a better tag then.
> "Documentation/process/submitting-patches.rst"
ok
>
>>
>> Signed-off-by: John Garry <john.g.garry at oracle.com>
>> ---
>> Documentation/ABI/stable/sysfs-block | 52 ++++++++++++++
>> block/blk-merge.c | 91 ++++++++++++++++++++++-
>> block/blk-settings.c | 103 +++++++++++++++++++++++++++
>> block/blk-sysfs.c | 33 +++++++++
>> block/blk.h | 3 +
>> include/linux/blk_types.h | 2 +
>> include/linux/blkdev.h | 60 ++++++++++++++++
>> 7 files changed, 343 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
>> index 1fe9a553c37b..4c775f4bdefe 100644
>> --- a/Documentation/ABI/stable/sysfs-block
>> +++ b/Documentation/ABI/stable/sysfs-block
>> @@ -21,6 +21,58 @@ Description:
>> device is offset from the internal allocation unit's
>> natural alignment.
...
>>
>
> /* A comment explaining this function and arguments could be helpful */
already addressed according to earlier review
>
>> +static bool rq_straddles_atomic_write_boundary(struct request *rq,
>> + unsigned int front,
>> + unsigned int back)
>
> Perhaps better naming would be start_adjust, end_adjust?
ok
>
>> +{
>> + unsigned int boundary = queue_atomic_write_boundary_bytes(rq->q);
>> + unsigned int mask, imask;
>> + loff_t start, end;
>
> start_rq_pos, end_rq_pos maybe?
ok
>
>> +
>> + if (!boundary)
>> + return false;
>> +
>> + start = rq->__sector << SECTOR_SHIFT;
>
> blk_rq_pos(rq) perhaps?
ok
>
>> + end = start + rq->__data_len;
>
> blk_rq_bytes(rq) perhaps? It should be..
ok
>> +
>> + start -= front;
>> + end += back;
>> +
>> + /* We're longer than the boundary, so must be crossing it */
>> + if (end - start > boundary)
>> + return true;
>> +
>> + mask = boundary - 1;
>> +
>> + /* start/end are boundary-aligned, so cannot be crossing */
>> + if (!(start & mask) || !(end & mask))
>> + return false;
>> +
>> + imask = ~mask;
>> +
>> + /* Top bits are different, so crossed a boundary */
>> + if ((start & imask) != (end & imask))
>> + return true;
>
> The last condition looks wrong. Shouldn't it be end - 1?
>
>> +
>> + return false;
>> +}
>
> Can we do something like this?
>
> static bool rq_straddles_atomic_write_boundary(struct request *rq,
> unsigned int start_adjust,
> unsigned int end_adjust)
> {
> unsigned int boundary = queue_atomic_write_boundary_bytes(rq->q);
> unsigned long boundary_mask;
> unsigned long start_rq_pos, end_rq_pos;
>
> if (!boundary)
> return false;
>
> start_rq_pos = blk_rq_pos(rq) << SECTOR_SHIFT;
> end_rq_pos = start_rq_pos + blk_rq_bytes(rq);
>
> start_rq_pos -= start_adjust;
> end_rq_pos += end_adjust;
>
> boundary_mask = boundary - 1;
>
> if ((start_rq_pos | boundary_mask) != (end_rq_pos | boundary_mask))
> return true;
>
> return false;
> }
>
> I was thinking this check should cover all cases? Thoughts?
that looks ok (apart from issue already detected later). It is quite
similar to how I coded it in the NVMe driver, apart from the initial >
boundary check.
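For reference, a stand-alone sketch of the single-mask check, assuming the issue in question is the "end - 1" one raised above, so the comparison uses the last byte covered rather than the exclusive end. straddles_boundary() is a hypothetical user-space helper taking plain byte offsets, and the boundary is assumed to be a power of 2.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if the byte range [start, start + len) crosses a multiple
 * of 'boundary'. A boundary of 0 means "no boundary limit".
 */
static bool straddles_boundary(uint64_t start, uint64_t len, uint64_t boundary)
{
	uint64_t mask, last;

	if (!boundary || !len)
		return false;

	/* The mask trick below only works for a power-of-2 boundary. */
	assert((boundary & (boundary - 1)) == 0);

	mask = boundary - 1;
	last = start + len - 1;	/* inclusive last byte, not the exclusive end */

	/*
	 * OR-ing in the low bits leaves only the boundary "window" index;
	 * if first and last byte sit in different windows, we straddle.
	 */
	return (start | mask) != (last | mask);
}
```

With the exclusive end instead of "last", a 4096-byte write at offset 0 against a 4096-byte boundary would wrongly be reported as straddling. A range longer than the boundary always lands in two windows, so the separate "end - start > boundary" test is covered as well.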
>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>> index f288c94374b3..cd7cceb8565d 100644
>> --- a/include/linux/blk_types.h
>> +++ b/include/linux/blk_types.h
>> @@ -422,6 +422,7 @@ enum req_flag_bits {
>> __REQ_DRV, /* for driver use */
>> __REQ_FS_PRIVATE, /* for file system (submitter) use */
>>
>> + __REQ_ATOMIC, /* for atomic write operations */
>> /*
>> * Command specific flags, keep last:
>> */
>> @@ -448,6 +449,7 @@ enum req_flag_bits {
>> #define REQ_RAHEAD (__force blk_opf_t)(1ULL << __REQ_RAHEAD)
>> #define REQ_BACKGROUND (__force blk_opf_t)(1ULL << __REQ_BACKGROUND)
>> #define REQ_NOWAIT (__force blk_opf_t)(1ULL << __REQ_NOWAIT)
>> +#define REQ_ATOMIC (__force blk_opf_t)(1ULL << __REQ_ATOMIC)
>
> Let's add this in the same order as of __REQ_ATOMIC i.e. after
> REQ_FS_PRIVATE macro
ok, fine
>> @@ -299,6 +299,14 @@ struct queue_limits {
>> unsigned int discard_alignment;
>> unsigned int zone_write_granularity;
>>
>> + unsigned int atomic_write_hw_max_sectors;
>> + unsigned int atomic_write_max_sectors;
>> + unsigned int atomic_write_hw_boundary_sectors;
>> + unsigned int atomic_write_hw_unit_min_sectors;
>> + unsigned int atomic_write_unit_min_sectors;
>> + unsigned int atomic_write_hw_unit_max_sectors;
>> + unsigned int atomic_write_unit_max_sectors;
>> +
> 1 liner comment for above members please?
ok
>> +static inline bool bdev_can_atomic_write(struct block_device *bdev)
>> +{
>> + struct request_queue *bd_queue = bdev->bd_queue;
>> + struct queue_limits *limits = &bd_queue->limits;
>> +
>> + if (!limits->atomic_write_unit_min_sectors)
>> + return false;
>> +
>> + if (bdev_is_partition(bdev)) {
>> + sector_t bd_start_sect = bdev->bd_start_sect;
>> + unsigned int granularity = max(
>
> atomic_align perhaps?
or just "align"
>
>> + limits->atomic_write_unit_min_sectors,
>> + limits->atomic_write_hw_boundary_sectors);
>> + if (do_div(bd_start_sect, granularity))
>> + return false;
>> + }
>
> Since atomic_align is a power of 2, why not use IS_ALIGNED()
> (bitwise operation instead of div)?
already changed as advised
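For illustration, the bitwise form of the partition-start check might look like the sketch below. start_is_aligned() is a hypothetical user-space helper, not the function from the patch; it assumes the larger of the two limits is non-zero and a power of 2, and open-codes the mask test that IS_ALIGNED() performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Is a partition starting at 'start_sect' suitably aligned for atomic
 * writes? All values are in sectors.
 */
static bool start_is_aligned(uint64_t start_sect, unsigned int unit_min,
			     unsigned int boundary)
{
	/* Align to the larger of the unit min and the HW boundary. */
	unsigned int align = unit_min > boundary ? unit_min : boundary;

	/* ASSUMPTION: align is a non-zero power of 2. */
	assert(align && (align & (align - 1)) == 0);

	/* Same test as IS_ALIGNED(start_sect, align): low bits must be clear. */
	return (start_sect & ((uint64_t)align - 1)) == 0;
}
```

Besides avoiding a division, this leaves the start sector untouched, whereas do_div() overwrites its first argument with the quotient.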
Thanks,
John