[PATCH 1/1] block: Check the queue limit before bio submitting

Tue Oct 31 19:23:26 PDT 2023

On Wed, 2023-10-25 at 17:22 +0800, ed.tsai at mediatek.com wrote:
> From: Ed Tsai <ed.tsai at mediatek.com>
> 
> Referring to commit 07173c3ec276 ("block: enable multipage bvecs"),
> each bio_vec now holds more than one page, potentially exceeding
> 1MB in size and causing alignment issues with the queue limit.
> 
> In a sequential read/write scenario, the file system maximizes the
> bio's capacity before submitting. However, misalignment with the
> queue limit can result in the bio being split into smaller I/O
> operations.
> 
> For instance, assume the maximum I/O size is set to 512KB and the
> memory is highly fragmented, so that each bio contains only
> one 2-page bio_vec (i.e., bi_size = 1028KB). This would cause the
> bio to be split into two 512KB portions and one 4KB portion. As a
> result, the originally expected continuous large I/O operations are
> interspersed with many small I/O operations.
> 
> To address this issue, this patch adds a check for the max_sectors
> before submitting the bio. This allows the upper layers to proactively
> detect and handle alignment issues.
> 
> I performed the Antutu V10 Storage Test on a UFS 4.0 device, which
> resulted in a significant improvement in the Sequential test:
> 
> Sequential Read (average of 5 rounds):
> Original: 3033.7 MB/sec
> Patched: 3520.9 MB/sec
> 
> Sequential Write (average of 5 rounds):
> Original: 2225.4 MB/sec
> Patched: 2800.3 MB/sec
> 
> Signed-off-by: Ed Tsai <ed.tsai at mediatek.com>
> ---
>  block/bio.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/block/bio.c b/block/bio.c
> index 816d412c06e9..a4a1f775b9ea 100644
> --- a/block/bio.c
> +++ b/block/bio.c
> @@ -1227,6 +1227,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  	iov_iter_extraction_t extraction_flags = 0;
>  	unsigned short nr_pages = bio->bi_max_vecs - bio->bi_vcnt;
>  	unsigned short entries_left = bio->bi_max_vecs - bio->bi_vcnt;
> +	struct queue_limits *lim = &bdev_get_queue(bio->bi_bdev)->limits;
>  	struct bio_vec *bv = bio->bi_io_vec + bio->bi_vcnt;
>  	struct page **pages = (struct page **)bv;
>  	ssize_t size, left;
> @@ -1275,6 +1276,11 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
>  		struct page *page = pages[i];
>  
>  		len = min_t(size_t, PAGE_SIZE - offset, left);
> +		if (bio->bi_iter.bi_size + len >
> +		    lim->max_sectors << SECTOR_SHIFT) {
> +			ret = left;
> +			break;
> +		}
>  		if (bio_op(bio) == REQ_OP_ZONE_APPEND) {
>  			ret = bio_iov_add_zone_append_page(bio, page, len,
>  					offset);
> -- 
> 2.18.0
> 

Hi Jens,

Just to clarify any potential confusion, I would like to provide
further details based on the assumed scenario mentioned above.

When the upper layer continuously sends 1028KB full-sized bios for
sequential reads, the Block Layer sees the following sequence:
	submit bio: size = 1028KB, start LBA = n
	submit bio: size = 1028KB, start LBA = n + 1028KB 
	submit bio: size = 1028KB, start LBA = n + 2056KB
	...

However, because the queue limit restricts the I/O size to a maximum
of 512KB, the Block Layer splits them into the following sequence:
	submit bio: size = 512KB, start LBA = n
	submit bio: size = 512KB, start LBA = n +  512KB
	submit bio: size =   4KB, start LBA = n + 1024KB	
	submit bio: size = 512KB, start LBA = n + 1028KB
	submit bio: size = 512KB, start LBA = n + 1540KB
	submit bio: size =   4KB, start LBA = n + 2052KB
	submit bio: size = 512KB, start LBA = n + 2056KB
	submit bio: size = 512KB, start LBA = n + 2568KB
	submit bio: size =   4KB, start LBA = n + 3080KB
	...
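
The arithmetic behind the split sequence above can be sketched in a few
lines of userspace C. This is only an illustration, not kernel code:
split_bio() is a hypothetical helper, and sizes are kept in KB to match
the listing.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a kernel function): split one bio of bio_kb
 * KB into chunks of at most max_kb KB. Chunk sizes are written to
 * out[], and the number of chunks is returned (at most cap).
 */
static int split_bio(unsigned int bio_kb, unsigned int max_kb,
		     unsigned int *out, int cap)
{
	int n = 0;

	assert(max_kb > 0);
	while (bio_kb > 0 && n < cap) {
		/* Carve off a full-size chunk, or whatever is left. */
		unsigned int chunk = bio_kb < max_kb ? bio_kb : max_kb;

		out[n++] = chunk;
		bio_kb -= chunk;
	}
	return n;
}
```

With bio_kb = 1028 and max_kb = 512 this yields chunks of 512, 512 and
4, i.e. one 4KB request after every two 512KB requests.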

The original expectation was for the storage to receive large,
contiguous requests. However, due to this misalignment, many small I/O
requests are generated. The problem is easy to hit because the
user pages passed in are often allocated by the buddy system as order-0
pages during page faults, resulting in highly non-contiguous memory.

The trace below, from the Antutu Sequential Read test, matches the
description above: the splitting caused by the queue limit leaves
small requests sandwiched in between the large ones:

block_bio_queue: 8,32 R 86925864 + 2144 [Thread-51]
block_split: 8,32 R 86925864 / 86926888 [Thread-51]
block_split: 8,32 R 86926888 / 86927912 [Thread-51]
block_rq_issue: 8,32 R 524288 () 86925864 + 1024 [Thread-51]
block_rq_issue: 8,32 R 524288 () 86926888 + 1024 [Thread-51]
block_bio_queue: 8,32 R 86928008 + 2144 [Thread-51]
block_split: 8,32 R 86928008 / 86929032 [Thread-51]
block_split: 8,32 R 86929032 / 86930056 [Thread-51]
block_rq_issue: 8,32 R 524288 () 86928008 + 1024 [Thread-51]
block_rq_issue: 8,32 R 49152 () 86927912 + 96 [Thread-51]
block_rq_issue: 8,32 R 524288 () 86929032 + 1024 [Thread-51]
block_bio_queue: 8,32 R 86930152 + 2112 [Thread-51]
block_split: 8,32 R 86930152 / 86931176 [Thread-51]
block_split: 8,32 R 86931176 / 86932200 [Thread-51]
block_rq_issue: 8,32 R 524288 () 86930152 + 1024 [Thread-51]
block_rq_issue: 8,32 R 49152 () 86930056 + 96 [Thread-51]
block_rq_issue: 8,32 R 524288 () 86931176 + 1024 [Thread-51]
block_bio_queue: 8,32 R 86932264 + 2096 [Thread-51]
block_split: 8,32 R 86932264 / 86933288 [Thread-51]
block_split: 8,32 R 86933288 / 86934312 [Thread-51]
block_rq_issue: 8,32 R 524288 () 86932264 + 1024 [Thread-51]
block_rq_issue: 8,32 R 32768 () 86932200 + 64 [Thread-51]
block_rq_issue: 8,32 R 524288 () 86933288 + 1024 [Thread-51]
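
To read the trace: block_bio_queue gives a start sector and a length in
512-byte sectors, and block_rq_issue gives the request size in bytes
first. The small requests are the leftover tails of each queued bio.
A minimal sketch of that calculation, assuming a limit of 1024 sectors
(524288 bytes, as seen in the trace); bio_tail() and struct tail are
hypothetical names used only for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SHIFT 9	/* 512-byte sectors */

/* Leftover part of a bio after full-size requests are carved off. */
struct tail {
	uint64_t start;		/* first sector of the tail */
	uint32_t sectors;	/* tail length in sectors */
	uint32_t bytes;		/* tail length in bytes */
};

/* Hypothetical helper: locate the tail of a queued bio. */
static struct tail bio_tail(uint64_t start, uint32_t nr_sectors,
			    uint32_t max_sectors)
{
	struct tail t;
	uint32_t full;

	assert(max_sectors > 0);
	full = nr_sectors / max_sectors;	/* full-size requests */
	t.sectors = nr_sectors % max_sectors;
	t.start = start + (uint64_t)full * max_sectors;
	t.bytes = t.sectors << SECTOR_SHIFT;
	return t;
}
```

For the first queued bio (86925864 + 2144) this gives a tail of 96
sectors (49152 bytes) starting at sector 86927912, which is the small
request issued between the 512KB requests of the next bio.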

The patch simply prevents such non-aligned bios from being built in
bio_iov_iter_get_pages. Besides making the upper layer application
aware of the queue limit, I would appreciate any other directions or
suggestions you may have.
Thank you!
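
P.S. For reference, the effect of the added check can be modelled in
userspace as below. This is a simplified sketch and not the kernel
code: struct model_bio, model_add_pages() and MODEL_PAGE_SIZE are
made-up names, 4KB pages are assumed, and only the size accounting of
the page loop is reproduced.

```c
#include <assert.h>
#include <stddef.h>

#define SECTOR_SHIFT	9	/* 512-byte sectors */
#define MODEL_PAGE_SIZE	4096u	/* assumption: 4KB pages */

/* Stand-in for the size accounting of a bio; not the kernel struct. */
struct model_bio {
	size_t bi_size;		/* bytes added so far */
};

/*
 * Add up to `left` bytes one page at a time. Stop as soon as the next
 * page would push the bio past max_sectors, as the patch's check does.
 * Returns the number of bytes that were not added.
 */
static size_t model_add_pages(struct model_bio *bio, size_t left,
			      unsigned int max_sectors)
{
	size_t limit = (size_t)max_sectors << SECTOR_SHIFT;

	while (left > 0) {
		size_t len = left < MODEL_PAGE_SIZE ? left : MODEL_PAGE_SIZE;

		if (bio->bi_size + len > limit)
			break;	/* bio is full with respect to the limit */
		bio->bi_size += len;
		left -= len;
	}
	return left;
}
```

Feeding 1028KB into an empty model bio with max_sectors = 1024 stops at
exactly 512KB and leaves 516KB for the following bios, so no 4KB tail
is produced.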