[PATCH v4 05/11] block: Add core atomic write support

Mon Feb 26 01:23:35 PST 2024

On 25/02/2024 12:09, Ritesh Harjani (IBM) wrote:
> John Garry <john.g.garry at oracle.com> writes:
> 
>> Add atomic write support as follows:
>> - report request_queue atomic write support limits to sysfs and udpate Doc
>> - add helper functions to get request_queue atomic write limits
>> - support to safely merge atomic writes
>> - add a per-request atomic write flag
>> - deal with splitting atomic writes
>> - misc helper functions
>>
>> New sysfs files are added to report the following atomic write limits:
>> - atomic_write_boundary_bytes
>> - atomic_write_max_bytes
>> - atomic_write_unit_max_bytes
>> - atomic_write_unit_min_bytes
>>
>> atomic_write_unit_{min,max}_bytes report the min and max atomic write
>> support size, inclusive, and are primarily dictated by HW capability. Both
>> values must be a power-of-2. atomic_write_boundary_bytes, if non-zero,
>> indicates an LBA space boundary at which an atomic write straddles no
>> longer is atomically executed by the disk. atomic_write_max_bytes is the
>> maximum merged size for an atomic write. Often it will be the same value as
>> atomic_write_unit_max_bytes.
> 
> Instead of explaining sysfs outputs which are deriviatives of HW
> and request_queue limits (and also defined in Documentation), maybe we
> could explain how those sysfs values are derived instead -
> 
> struct queue_limits {
> <...>
> 	unsigned int		atomic_write_hw_max_sectors;
> 	unsigned int		atomic_write_max_sectors;
> 	unsigned int		atomic_write_hw_boundary_sectors;
> 	unsigned int		atomic_write_hw_unit_min_sectors;
> 	unsigned int		atomic_write_unit_min_sectors;
> 	unsigned int		atomic_write_hw_unit_max_sectors;
> 	unsigned int		atomic_write_unit_max_sectors;
> <...>
> 
> 1. atomic_write_unit_hw_max_sectors comes directly from hw and it need
> not be a power of 2.
> 
> 2. atomic_write_hw_unit_min_sectors and atomic_write_hw_unit_max_sectors
> is again defined/derived from hw limits, but it is rounded down so that
> it is always a power of 2.
> 
> 3. atomic_write_hw_boundary_sectors again comes from HW boundary limit.
> It could either be 0 (which means the device specify no boundary limit) or a
> multiple of unit_max. It need not be power of 2, however the current
> code assumes it to be a power of 2 (check callers of blk_queue_atomic_write_boundary_bytes())
> 
> 4. atomic_write_max_sectors, atomic_write_unit_min_sectors
> and atomic_write_unit_max_sectors are all derived out of above hw limits
> inside function blk_atomic_writes_update_limits() based on request_queue
> limits.
>      a. atomic_write_max_sectors is derived from atomic_write_hw_unit_max_sectors and
>         request_queue's max_hw_sectors limit. It also guarantees max
>         sectors that can be fit in a single bio.
>      b. atomic_write_unit_[min|max]_sectors are derived from atomic_write_hw_unit_[min|max]_sectors,
>         request_queue's max_hw_sectors & blk_queue_max_guaranteed_bio_sectors(). Both of these limits
>         are kept as a power of 2.
> 
> Now coming to sysfs outputs -
> 1. atomic_write_unit_max_bytes: Same as atomic_write_unix_max_sectors in bytes
> 2. atomic_write_unit_min_bytes: Same as atomic_write_unit_min_sectors in bytes
> 3. atomic_write_boundary_bytes: same as atomic_write_hw_boundary_sectors
> in bytes
> 4. atomic_write_max_bytes: Same as atomic_write_max_sectors in bytes
> 

ok, I can look to incorporate the advised formatting changes

>>
>> atomic_write_unit_max_bytes is capped at the maximum data size which we are
>> guaranteed to be able to fit in a BIO, as an atomic write must always be
>> submitted as a single BIO. This BIO max size is dictated by the number of
> 
> Here it says that the atomic write must always be submitted as a single
> bio. From where to where?

submitted to the block layer/core

> I think you meant from FS to block layer.

sure, or also block device file operations (in fops.c) to block core

> Because otherwise we still allow request/bio merging inside block layer
> based on the request queue limits we defined above. i.e. bio can be
> chained to form
>        rq->biotail->bi_next = next_rq->bio
> as long as the merged requests is within the queue_limits.
> 
> i.e. atomic write requests can be merged as long as -
>      - both rqs have REQ_ATOMIC set
>      - blk_rq_sectors(final_rq) <= q->limits.atomic_write_max_sectors
>      - final rq formed should not straddle limits->atomic_write_hw_boundary_sectors
> 
> However, splitting of an atomic write requests is not allowed. And if it
> happens, we fail the I/O req & return -EINVAL.

...

> 
> IMHO, the commit message can definitely use a re-write. I agree that you
> have put in a lot of information, but I think it can be more organized.#

ok, fine. I'll look at this. Thanks.

> 
>>
>> Contains significant contributions from:
>> Himanshu Madhani <himanshu.madhani at oracle.com>
> 
> Myabe it can use a better tag then.
> "Documentation/process/submitting-patches.rst"

ok

> 
>>
>> Signed-off-by: John Garry <john.g.garry at oracle.com>
>> ---
>>   Documentation/ABI/stable/sysfs-block |  52 ++++++++++++++
>>   block/blk-merge.c                    |  91 ++++++++++++++++++++++-
>>   block/blk-settings.c                 | 103 +++++++++++++++++++++++++++
>>   block/blk-sysfs.c                    |  33 +++++++++
>>   block/blk.h                          |   3 +
>>   include/linux/blk_types.h            |   2 +
>>   include/linux/blkdev.h               |  60 ++++++++++++++++
>>   7 files changed, 343 insertions(+), 1 deletion(-)
>>
>> diff --git a/Documentation/ABI/stable/sysfs-block b/Documentation/ABI/stable/sysfs-block
>> index 1fe9a553c37b..4c775f4bdefe 100644
>> --- a/Documentation/ABI/stable/sysfs-block
>> +++ b/Documentation/ABI/stable/sysfs-block
>> @@ -21,6 +21,58 @@ Description:
>>   		device is offset from the internal allocation unit's
>>   		natural alignment.

...

>>   
> 
> /* A comment explaining this function and arguments could be helpful */

already addressed according to earlier review

> 
>> +static bool rq_straddles_atomic_write_boundary(struct request *rq,
>> +					unsigned int front,
>> +					unsigned int back)
> 
> A better naming perhaps be start_adjust, end_adjust?

ok

> 
>> +{
>> +	unsigned int boundary = queue_atomic_write_boundary_bytes(rq->q);
>> +	unsigned int mask, imask;
>> +	loff_t start, end;
> 
> start_rq_pos, end_rq_pos maybe?

ok

> 
>> +
>> +	if (!boundary)
>> +		return false;
>> +
>> +	start = rq->__sector << SECTOR_SHIFT;
> 
> blk_rq_pos(rq) perhaps?

ok

> 
>> +	end = start + rq->__data_len;
> 
> blk_rq_bytes(rq) perhaps? It should be..

ok

>> +
>> +	start -= front;
>> +	end += back;
>> +
>> +	/* We're longer than the boundary, so must be crossing it */
>> +	if (end - start > boundary)
>> +		return true;
>> +
>> +	mask = boundary - 1;
>> +
>> +	/* start/end are boundary-aligned, so cannot be crossing */
>> +	if (!(start & mask) || !(end & mask))
>> +		return false;
>> +
>> +	imask = ~mask;
>> +
>> +	/* Top bits are different, so crossed a boundary */
>> +	if ((start & imask) != (end & imask))
>> +		return true;
> 
> The last condition looks wrong. Shouldn't it be end - 1?
> 
>> +
>> +	return false;
>> +}
> 
> Can we do something like this?
> 
> static bool rq_straddles_atomic_write_boundary(struct request *rq,
> 					       unsigned int start_adjust,
> 					       unsigned int end_adjust)
> {
> 	unsigned int boundary = queue_atomic_write_boundary_bytes(rq->q);
> 	unsigned long boundary_mask;
> 	unsigned long start_rq_pos, end_rq_pos;
> 
> 	if (!boundary)
> 		return false;
> 
> 	start_rq_pos = blk_rq_pos(rq) << SECTOR_SHIFT;
> 	end_rq_pos = start_rq_pos + blk_rq_bytes(rq);
> 
> 	start_rq_pos -= start_adjust;
> 	end_rq_pos += end_adjust;
> 
> 	boundary_mask = boundary - 1;
> 
> 	if ((start_rq_pos | boundary_mask) != (end_rq_pos | boundary_mask))
> 		return true;
> 
> 	return false;
> }
> 
> I was thinking this check should cover all cases? Thoughts?

that looks ok (apart from issue already detected later). It is quite 
similar to how I coded it in the NVMe driver, apart from the initial > 
boundary check.

>> diff --git a/include/linux/blk_types.h b/include/linux/blk_types.h
>> index f288c94374b3..cd7cceb8565d 100644
>> --- a/include/linux/blk_types.h
>> +++ b/include/linux/blk_types.h
>> @@ -422,6 +422,7 @@ enum req_flag_bits {
>>   	__REQ_DRV,		/* for driver use */
>>   	__REQ_FS_PRIVATE,	/* for file system (submitter) use */
>>
>> +	__REQ_ATOMIC,		/* for atomic write operations */
>>   	/*
>>   	 * Command specific flags, keep last:
>>   	 */
>> @@ -448,6 +449,7 @@ enum req_flag_bits {
>>   #define REQ_RAHEAD	(__force blk_opf_t)(1ULL << __REQ_RAHEAD)
>>   #define REQ_BACKGROUND	(__force blk_opf_t)(1ULL << __REQ_BACKGROUND)
>>   #define REQ_NOWAIT	(__force blk_opf_t)(1ULL << __REQ_NOWAIT)
>> +#define REQ_ATOMIC	(__force blk_opf_t)(1ULL << __REQ_ATOMIC)
> 
> Let's add this in the same order as of __REQ_ATOMIC i.e. after
> REQ_FS_PRIVATE macro

ok, fine

 >> @@ -299,6 +299,14 @@ struct queue_limits {
 >>   	unsigned int		discard_alignment;
 >>   	unsigned int		zone_write_granularity;
 >>
 >> +	unsigned int		atomic_write_hw_max_sectors;
 >> +	unsigned int		atomic_write_max_sectors;
 >> +	unsigned int		atomic_write_hw_boundary_sectors;
 >> +	unsigned int		atomic_write_hw_unit_min_sectors;
 >> +	unsigned int		atomic_write_unit_min_sectors;
 >> +	unsigned int		atomic_write_hw_unit_max_sectors;
 >> +	unsigned int		atomic_write_unit_max_sectors;
 >> +
 > 1 liner comment for above members please?

ok

>> +static inline bool bdev_can_atomic_write(struct block_device *bdev)
>> +{
>> +	struct request_queue *bd_queue = bdev->bd_queue;
>> +	struct queue_limits *limits = &bd_queue->limits;
>> +
>> +	if (!limits->atomic_write_unit_min_sectors)
>> +		return false;
>> +
>> +	if (bdev_is_partition(bdev)) {
>> +		sector_t bd_start_sect = bdev->bd_start_sect;
>> +		unsigned int granularity = max(
> 
> atomic_align perhaps?

or just "align"

> 
>> +				limits->atomic_write_unit_min_sectors,
>> +				limits->atomic_write_hw_boundary_sectors);
>> +		if (do_div(bd_start_sect, granularity))
>> +			return false;
>> +	}
> 
> since atomic_align is a power of 2. Why not use IS_ALIGNED()?
> (bitwise operation instead of div)?

already changed as advised

Thanks,
John