[LSF/MM/BPF TOPIC] Memory fragmentation with large block sizes

Tue Jun 9 01:39:14 PDT 2026

On 6/9/26 09:28, Christoph Hellwig wrote:
> Hannes,
> 
> can you share your results on the mailing list?
> 
I sure can.

We have run a simple testcase with one fio job on an LBS-enabled device,
and another job permanently allocating and deallocating arrays of pages
of various array lengths.

We then took snapshots of /proc/buddyinfo to track memory pressure
over time.
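
For anyone who wants to repeat the measurement, here is a minimal sketch
of how a /proc/buddyinfo snapshot could be parsed. It is not the script
we used; the function names are made up for illustration, and it assumes
the usual "Node N, zone NAME count0 count1 ..." line layout, where the
n-th count is the number of free blocks of order n.

```python
def parse_buddyinfo(text):
    """Return {(node, zone): [free block count for order 0, 1, 2, ...]}."""
    snapshot = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumed line shape: "Node 0, zone Normal c0 c1 c2 ..."
        if len(fields) < 5 or fields[0] != "Node" or fields[2] != "zone":
            continue
        node = int(fields[1].rstrip(","))
        zone = fields[3]
        snapshot[(node, zone)] = [int(count) for count in fields[4:]]
    return snapshot


def free_pages_per_order(counts):
    """Convert block counts to base pages: an order-n block spans 2**n pages."""
    return [count << order for order, count in enumerate(counts)]


def read_buddyinfo(path="/proc/buddyinfo"):
    """Take one snapshot from the running system (Linux only)."""
    with open(path) as f:
        return parse_buddyinfo(f.read())
```

Calling read_buddyinfo() at a fixed interval while the test runs and
storing the results with a timestamp gives the raw data for a plot of
free blocks per order over time.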

Results are visualized in the attached plot.

With 4k block sizes we have seen a high number of order-0 and order-1
pages, and then the expected decline towards higher orders.

With 8k and 16k block sizes a noticeable 'bump' in free pages was
developing at order-2 and order-3 pages, which we think is down to
compaction trying to merge pages together.
The number of order-0 pages increased slightly, but reached only about
half of the maximum number of pages in the 'bump'.

With 32k block sizes the picture changed completely; the 'bump'
vanished, and there was only a pronounced spike at order-0 pages
(about four times the size of the spike with 4k block sizes).

This led me to assume that compaction broke down at 32k block sizes;
this assumption was confirmed by Vlastimil Babka who pointed out that
there is a maximum order to which page compaction is attempted:

include/linux/mmzone.h:
/*
 * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
 * costly to service.  That is between allocation orders which should
 * coalesce naturally under reasonable reclaim pressure and those which
 * will not.
 */
#define PAGE_ALLOC_COSTLY_ORDER 3

and its main usage is 'order > PAGE_ALLOC_COSTLY_ORDER',
which ties in directly with what we're seeing.

It will probably make sense to align the maximum block size which we
currently support (i.e. 64k) with this value to ensure that compaction
works with larger block sizes. Or maybe even the other way round;
tie the maximum block size which we support to PAGE_ALLOC_COSTLY_ORDER.
But that would mean restricting the block size to 16k, whereas xfs
works happily with 32k. So we might want to raise PAGE_ALLOC_COSTLY_ORDER.
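
As a reference for the numbers involved, here is a small sketch of the
arithmetic that maps a block size to a page allocation order and applies
the 'order > PAGE_ALLOC_COSTLY_ORDER' comparison quoted above. It assumes
4 KiB base pages; the helper names are made up for illustration. It
covers only the minimum order implied by the block size, not whatever
larger orders were actually requested during the test.

```python
PAGE_SIZE = 4096             # assumption: 4 KiB base pages
PAGE_ALLOC_COSTLY_ORDER = 3  # value quoted from include/linux/mmzone.h


def block_size_to_order(block_size, page_size=PAGE_SIZE):
    """Return the order n such that block_size == page_size * 2**n."""
    if block_size < page_size or block_size % page_size:
        raise ValueError("block size must be a multiple of the page size")
    pages = block_size // page_size
    if pages & (pages - 1):
        raise ValueError("block size must be a power-of-two number of pages")
    return pages.bit_length() - 1


def is_costly(order):
    """Mirror the 'order > PAGE_ALLOC_COSTLY_ORDER' comparison."""
    return order > PAGE_ALLOC_COSTLY_ORDER


def describe(block_sizes_kib=(4, 8, 16, 32, 64)):
    """Print the order and the comparison result for each block size."""
    for kib in block_sizes_kib:
        order = block_size_to_order(kib * 1024)
        print(f"{kib}k -> order {order}, costly: {is_costly(order)}")
```

Calling describe() prints one line per block size; changing PAGE_SIZE
shows how the mapping shifts on systems with larger base pages.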

Question is, though, how could we measure the impact?
This particular value has been in since 2007 (commit 5ad333eb66ff1 
'lumpy reclaim V4'), and it might well be that the original
reasoning doesn't apply anymore.

At the same time, this value is tied to a _LOT_ of things
(not to mention the page allocator itself), so increasing it
to '4' has an extremely high chance of impacting mm performance.

I'll probably run mmtests and see what I get.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke                  Kernel Storage Architect
hare at suse.de                                +49 911 74053 688
SUSE Software Solutions GmbH, Frankenstr. 146, 90461 Nürnberg
HRB 36809 (AG Nürnberg), GF: I. Totev, A. McDonald, W. Knoblich
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fragmentation.png
Type: image/png
Size: 7250 bytes
Desc: not available
URL: <http://lists.infradead.org/pipermail/linux-nvme/attachments/20260609/6cf85ff9/attachment-0001.png>