[RFC PATCH v4 2/5] mm/readahead: Terminate async readahead on natural boundary
Jan Kara
jack at suse.cz
Mon May 5 02:37:38 PDT 2025
On Mon 05-05-25 11:13:26, Jan Kara wrote:
> On Wed 30-04-25 15:59:15, Ryan Roberts wrote:
> > Previously, asynchronous readahead would read ra_pages (usually 128K)
> > directly after the end of the synchronous readahead, and given that the
> > synchronous readahead portion had no alignment guarantees (beyond page
> > boundaries), it is possible (and likely) that the end of the initial
> > 128K region would not fall on a natural boundary for the folio size
> > being used. Therefore smaller folios were used to align down to the
> > required boundary, both at the end of the previous readahead block and
> > at the start of the new one.
> >
> > In the worst cases, this can result in never properly ramping up the
> > folio size, and instead getting stuck oscillating between order-0, -1
> > and -2 folios. The next readahead will try to use folios whose order
> > is 2 larger than that of the folio that had the readahead marker. But
> > because of the alignment requirements, that folio (the first one in
> > the readahead block) can end up being order-0 in some cases.
> >
> > There will be 2 modifications to solve this issue:
> >
> > 1) Calculate the readahead size so the end is aligned to a folio
> > boundary. This prevents needing to allocate small folios to align
> > down at the end of the window and fixes the oscillation problem.
> >
> > 2) Remember the "preferred folio order" in the ra state instead of
> > inferring it from the folio with the readahead marker. This solves
> > the slow ramp up problem (discussed in a subsequent patch).
> >
> > This patch addresses (1) only. A subsequent patch will address (2).
> >
> > Worked example:
> >
> > The following shows the previous pathological behaviour when the initial
> > synchronous readahead is unaligned. We start reading at page 17 in the
> > file and read sequentially from there. I'm showing a dump of the pages
> > in the page cache just after we read the first page of the folio with
> > the readahead marker.
<snip>
> Looks good. When I was reading this code some time ago, I also felt we
> should rather do some rounding instead of creating small folios so thanks
> for working on this. Feel free to add:
>
> Reviewed-by: Jan Kara <jack at suse.cz>
But now I've also remembered why what you do here isn't an obvious win.
There are storage devices (mostly RAID arrays) where the optimum read size
isn't a power of 2. Think for example of a RAID-0 device composed of three
disks. It will have max_pages something like 384 (512k * 3). Suppose we are
on x86 and max_order is 9. Then previously (if we were lucky with
alignment) we were alternating between order-7 and order-8 folios in the
page cache and doing optimally sized IOs of 1536k. Now you will allocate
all folios of order 8 (nice) but reads will be just 1024k and you'll see a
noticeable drop in read throughput (not nice). Note that this is not just
a theoretical example but a real case we hit when doing performance
testing of servers and for which I was tweaking the readahead code in the
past.
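
To spell out the numbers (assuming 4k pages):

	max_pages = 384 pages = 1536k
	before: one order-8 + one order-7 folio = 256 + 128 pages = 1536k IOs
	after:  window trimmed to an order-8 boundary = 256 pages = 1024k IOs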
So I think we need to tweak this logic a bit. Perhaps we should round_down
end to the minimum alignment dictated by 'order' and max_pages? Like:
	1 << min(order, ffs(max_pages) - 1)
If you set a badly aligned readahead size manually, you will get small
folios in the page cache but that's just you being stupid. In practice,
hardware-induced readahead sizes need not be powers of 2 but they are
*sane* :).
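
Roughly, in the readit path of the patch quoted below, that would look
something like this (just a sketch, untested; align_order is a new local
I'm introducing for illustration):

	unsigned int align_order = min(order, (unsigned int)(ffs(max_pages) - 1));

	end = ra->start + ra->size;
	/*
	 * Trim the window end to the largest boundary, at most 1 << order,
	 * that still divides max_pages, so e.g. a 384-page window can keep
	 * issuing 1536k IOs instead of shrinking to 1024k.
	 */
	aligned_end = round_down(end, 1UL << align_order);
	if (aligned_end > ra->start)
		ra->size -= end - aligned_end;
	ra->async_size = ra->size;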
Honza
> > diff --git a/mm/readahead.c b/mm/readahead.c
> > index 8bb316f5a842..82f9f623f2d7 100644
> > --- a/mm/readahead.c
> > +++ b/mm/readahead.c
> > @@ -625,7 +625,7 @@ void page_cache_async_ra(struct readahead_control *ractl,
> > unsigned long max_pages;
> > struct file_ra_state *ra = ractl->ra;
> > pgoff_t index = readahead_index(ractl);
> > - pgoff_t expected, start;
> > + pgoff_t expected, start, end, aligned_end;
> > unsigned int order = folio_order(folio);
> >
> > /* no readahead */
> > @@ -657,7 +657,6 @@ void page_cache_async_ra(struct readahead_control *ractl,
> > * the readahead window.
> > */
> > ra->size = max(ra->size, get_next_ra_size(ra, max_pages));
> > - ra->async_size = ra->size;
> > goto readit;
> > }
> >
> > @@ -678,9 +677,13 @@ void page_cache_async_ra(struct readahead_control *ractl,
> > ra->size = start - index; /* old async_size */
> > ra->size += req_count;
> > ra->size = get_next_ra_size(ra, max_pages);
> > - ra->async_size = ra->size;
> > readit:
> > order += 2;
> > + end = ra->start + ra->size;
> > + aligned_end = round_down(end, 1UL << order);
> > + if (aligned_end > ra->start)
> > + ra->size -= end - aligned_end;
> > + ra->async_size = ra->size;
> > ractl->_index = ra->start;
> > page_cache_ra_order(ractl, ra, order);
> > }
> > --
> > 2.43.0
> >
> --
> Jan Kara <jack at suse.com>
> SUSE Labs, CR
--
Jan Kara <jack at suse.com>
SUSE Labs, CR