Tracking free pages

Fri Aug 30 10:00:34 EDT 2013

In my last missive, I talked about how to lay out pages in use by inodes.
Now, let's talk about tracking free pages.  (Tracking in-use pages can be
done two ways; either as the complement of the free pages, or by walking
every inode in the filesystem and recording which pages are in use).

Today's pmfs does not track free pages.  It scans the inode trees at
startup and constructs a list of free/used regions.  The startup time
for this makes it impractical for production.  So we need a persistent
data structure that tracks free pages.

The Linux DRAM page allocator is a buddy allocator.  See __free_one_page()
in mm/page_alloc.c.  This allows it to efficiently hand out pages
in powers of two (known as 'order' in that function).  This has the
advantage of only needing to check one other page at each level in
order to coalesce them.  For our purposes, it has the disadvantages
of requiring the auxiliary 'struct page' data structure, and the extra
overhead of maintaining lists for each power of two.  It's not usable
as-is for our purposes.

The great thing about tracking free pages is that there's no user data in
them.  We can use their contents to record information about them so that
we don't need auxiliary data structures (or the auxiliary data structures
shrink as more of the free pages are allocated).  More importantly,
with a completely full filesystem, it must be possible to free a single
page and record that it is free.

Delete performance can be quite important for a filesystem (eg rm -rf
linux-2.6 or 'make clean'), so returning pages to the free pool should
be quite fast.  On the other hand, it's not as important as the speed
of allocating new pages, and for this filesystem we particularly want
to be able to allocate new 1GB pages.

We need to be able to find free pages in two different ways.  We need
to find "a free page of size N" (for values of N in { 4k, 2M, 1G }), and
we need to be able to find whether "the page at address N is free" for
the purpose of coalescing freed pages together, and for the purpose of
allocating the next page in a file.

However, we never need to coalesce free pages larger than 1GB, and we're
quite willing to move to a non-contiguous address on 1GB boundaries.  So
the free 1GB pages don't need to be findable through the second method;
they just need to be on a list of free 1GB pages.

The 2MB and 4k pages need to be findable through both methods.  When we
try to extend a file, we first look for "the next page", and then we
look for "any page of the right size".  If that fails, we break down a
page of the next size up.

The data structure required to support the first requirement is trivial;
a doubly-linked list suffices.  We actually want three; one for 4k pages,
one for 2MB pages and one for 1GB pages.  Using the last few bytes of
the page for the list (next & previous pointers, a magic number and
an fs UUID) should be sufficient.  When taking a page off the list,
we adjust the pointers on the prev/next nodes, zero the last few bytes
of the page and return the now-zero page to the caller.

We could use the same data structure described for the inode to track free
pages, but that might require allocating memory in order to free memory,
which is generally considered a bad idea (eg if only block 1 is already
free and we free block ffff'ffff'ffff, we will need to allocate some
number of intermediate nodes in order to represent the two free blocks).
This could be fixable by preallocating or reserving some number of
intermediate nodes for the free inode, but it feels somewhat fragile.

We really only need a single bit to represent whether a 4k page is free
or used.  For a 1GB superpage, that would be 2^18 bits, which is 1kB.
Assuming a maximum filesystem size of 64TB (the current maximum addressable
physical memory on x86_64 systems), we could statically consume 64MB for
a bitmap (which is not an ridiculous overhead!), but we can slim it down
(not to mention making it more scalable) by using a tree.

At the highest level, we have up to 64k 1GB pages.  An 8-byte DPR for
each of them consumes 512kB.  The DPR may be a hole, or it may point to
any 4k page within the 1GB superpage.  The first 1k of the page pointed to
by the DPR is a bitmap (the bit is set if the page is used, clear if the
page is freed).

When we free a page, we look at the DPR for the superpage it is in.
If it represents a hole, this is the first page to be freed within the
superpage, so we memset(page, 0xff, 1024); memset(page+1024, 0, 3072);,
then replace the DPR with a DPR that represents the page we are freeing.
If the DPR is pointing to data, we memset(page, 0, 4096);.  For all
pages, we then clear the bit in the bitmap that corresponds to the
page we are freeing.  We also add the page to the first data structure,
the list of free pages.

Once all of the pages within a superpage are free, the bitmap is all zero,
so the superpage can be taken out of the bitmap tracking data structure
and placed on the list of free 1GB pages.