[Lsf-pc] [LSF/MM/BPF TOPIC] Memory fragmentation with large block sizes
Vlastimil Babka
vbabka at suse.com
Tue Jun 9 02:37:54 PDT 2026
On 6/9/26 10:39, Hannes Reinecke wrote:
> On 6/9/26 09:28, Christoph Hellwig wrote:
>> Hannes,
>>
>> can you share your results on the mailing list?
>>
> I sure can.
>
> We have run a simple test case with one fio job on an LBS-enabled device,
> and another job continuously allocating and freeing arrays of pages
> of various lengths.
>
> We then took snapshots of /proc/buddyinfo to track memory pressure
> over time.
>
> Results are visualized in the attached plot.
>
> With 4k block sizes we saw a high number of order-0 and order-1 pages,
> and then the expected decline towards higher orders.
>
> With 8k and 16k block sizes a noticeable 'bump' in free pages
> developed at orders 2 and 3, which we think is down to
> compaction trying to merge pages together.
> The number of order-0 pages increased slightly, but only to half
> of the peak count in the 'bump'.
>
> With 32k block sizes the picture changed completely; the 'bump'
> vanished, and there was only a pronounced spike in order-0 pages
> (about four times the size of the spike with 4k block sizes).
>
> This led me to assume that compaction broke down at 32k block sizes;
> this assumption was confirmed by Vlastimil Babka who pointed out that
> there is a maximum order to which page compaction is attempted:
Yep, but after the LSF/MM session I've realized I made an off-by-one error
thanks to the misleading name of the define.
> include/linux/mmzone.h:
> /*
> * PAGE_ALLOC_COSTLY_ORDER is the order at which allocations are deemed
> * costly to service. That is between allocation orders which should
> * coalesce naturally under reasonable reclaim pressure and those which
> * will not.
> */
> #define PAGE_ALLOC_COSTLY_ORDER 3
Which is 32k (with 4k base pages)
> and its main usage is 'order > PAGE_ALLOC_COSTLY_ORDER'.
And indeed, this means any changes in behavior due to this should only
happen at 64k (order 4) or more. The value of 3 is in fact something like
"PAGE_ALLOC_MAX_CHEAP_ORDER". I'd send a rename patch (Kiryl fixed a similar
off-by-one gotcha with MAX_ORDER), but I suspect we'll be making more
involved changes here so I'd wait for that first.
> Which ties in directly with what we're seeing.
So it's probably not that straightforward. We should investigate more first?
> It will probably make sense to align the maximum block size which we
> currently support (i.e. 64k) with this value to ensure that compaction
> works with larger block sizes. Or maybe even the other way round:
> tie the maximum block size which we support to PAGE_ALLOC_COSTLY_ORDER.
> But that would mean restricting the block size to 16k, whereas xfs
> works happily with 32k. So we might want to raise PAGE_ALLOC_COSTLY_ORDER.
>
> Question is, though, how could we measure the impact?
> This particular value has been in since 2007 (commit 5ad333eb66ff1
> 'lumpy reclaim V4'), and it might well be that the original
> reasoning doesn't apply anymore.
>
> At the same time, this value is tied to a _LOT_ of things
> (not to mention the page allocator itself), so increasing it
> to '4' has an extremely high chance of impacting mm performance.
>
> I'll probably run mmtests and see what I get.
>
> Cheers,
>
> Hannes
>
>
> fragmentation.png