New inode layout
Matthew Wilcox
willy at linux.intel.com
Wed Aug 28 09:51:06 EDT 2013
I feel we need to change the data structure pmfs uses to describe the
user data stored in an inode. It's similar to the current data structure
in that it's a radix tree, but has some important advantages:
- Transparently supports 4k/2M/1G pages without application intervention
- Supports files up to the maximum Linux file size
- Only requires atomic 8-byte writes to update the inode root, rather
than 16-byte writes.
- Doesn't confusingly claim to be a btree when it isn't :-)
- Doesn't permit the confusion about whether a 'block' is 4k/2M/1G
- Permits expansion to 512GB/256TB page sizes in future
I've written up this design document; it's been through a couple of rounds of
review internally, so I think it's time to invite criticism from the world
at large.
The PMFS inode data layout
~~~~~~~~~~~~~~~~~~~~~~~~~~
This design is based heavily on the same kind of radix tree used by
most CPUs to manage page tables. See, for example, Chapter 4 of the
Intel Software Developer's Manual, Volume 3. It's modified slightly to
account for the variable size of files.
All pages are 4kB in size. Pages can be clustered together into
superpages on naturally aligned boundaries (2MB, 1GB and 512GB).
The inode contains a Data Page Reference (DPR). This is a 64-bit
quantity stored in little-endian byte order. The top 12 bits are
available for use as flags.
The bottom 52 bits reference a page through a scheme determined by the
values of the fields in the top 12 bits. This allows us to address
every page in a 2^64 address space.
Each DPR either points to the user's data (if bit 63 is set) or it
points to an intermediate node in the radix tree. Intermediate nodes
are an array of 512 DPRs, indexed by the appropriate bits in the offset.
The bits to use are determined by the Level field of the DPR that
references the node: ((Offset >> (9 * Level - 6)) & 0x1ff).
DPR format:
Bit 63: Data. If this bit is set, the DPR references data directly and
the low-order bits of the offset index into the referenced page or
superpage. If clear, the DPR references an array of 512 DPRs.
Bits 60-62: Level. The maximum number of bytes referenced by this DPR.
0: No data is present (a hole in the file).
1: 4 KiB
2: 2 MiB
3: 1 GiB
4: 512 GiB
5: 256 TiB
6: 128 PiB
7: 64 EiB
Bit 59: Volatile. If this bit is set, the page referenced is in DRAM,
not persistent memory.
Bits 52-58: Reserved
Bits 0-51: Page Number
Note that Level 1 implies Data 1. A DPR with Level 1, Data 0 is invalid.
All non-hole DPRs in an intermediate node have the same Level,
which is one less than the Level of their parent DPR. If the DPR
represents a hole, then the hole is the same size as the other DPRs
in that intermediate node. If the Level of a DPR is neither zero nor
one less than the parent DPR, then the filesystem contains an error.
The implementation should check this invariant to ensure that a corrupt
tree does not contain a cycle.
This tree is the static representation of the file data. For the
filesystem to be truly useful, it has to be paired with a page allocation
strategy that works well for a wide range of workloads. That strategy
will be the subject of a future email.
I haven't figured out exactly how I want the 'Volatile' bit to work yet.
I envisage it being used like a delayed-allocation scheme that some
filesystems have. We can accumulate writes in DRAM and write them all
out to persistent memory in a single shot. The volatile bit would also
change the interpretation of the Page Number; it wouldn't be fs-relative,
but a PFN.
I'm considering some interesting uses for bits 52-55. That's 256
bytes of data scattered across the 4k page. We could put some useful
repairability data in there (eg put a UUID in the inode and copy it into
all of the intermediate nodes, a CRC of the page, etc.).
Layout examples
~~~~~~~~~~~~~~~
Example 1: The inode of a file with no data has a DPR of
0x0000'0000'0000'0000.
Example 2: A file with up to 4k of data in it might have an inode
DPR of 0x900d'cba9'8765'4321. Its data would be found in page
0x000d'cba9'8765'4321, which is at address FSBASE + 0xdcba'9876'5432'1000.
DPR L1D1 -> 4kB user data
(Level 1, Data 1)
Example 3: A file with up to 2MB of data stored in 4k pages has a DPR
of the form 0x200x'xxxx'xxxx'xxxx. The 4k of data found in that page
contains 512 DPRs, each of which has the form 0x900x'xxxx'xxxx'xxxx.
DPR L2D0 -> DPR L1D1 -> 4kB user data
-> DPR L1D1 -> 4kB user data
...
Example 4: A file with up to 2MB of data stored in a single 2MB superpage
has an inode DPR of the form 0xa00f'ffff'ffff'fe00. The 2MB of data
found in that superpage is the file data.
DPR L2D1 -> 2MB user data
Example 5: A file with up to 1GB of data stored in pages smaller than 1GB
has an inode DPR of the form 0x300x'xxxx'xxxx'xxxx. The 4k of data found
in that page contains 512 DPRs, each of which may or may not point
directly to user data, or contain holes.
DPR L3D0 -> DPR L2D0 -> DPR L1D1 -> 4kB user data
-> DPR L1D1 -> 4kB user data
...
-> 0
-> DPR L2D1 -> 2MB user data
-> DPR L2D0 -> DPR L1D1 -> 4kB user data
-> DPR L1D1 -> 4kB user data
...
...
Example 6: A file with up to 1GB of data stored in a single 1GB superpage
has an inode DPR of the form 0xb00f'ffff'fffc'0000. The 1GB of data
found in that superpage is the file data.
DPR L3D1 -> 1GB user data
Example 7: One extra 4k page is appended to the file in Example 6.
We allocate an L3 intermediate node, move the inode DPR into it and update
the inode DPR with the L4 DPR pointing to the new intermediate node.
We then allocate an L1 intermediate node, and set the first entry to
point to the new file data. We allocate an L2 intermediate node and set
the first entry to point to the L1 intermediate node. We set the second
entry of the L3 intermediate node to point to the L2 intermediate node.
DPR L4D0 -> DPR L3D1 -> 1GB user data
-> DPR L3D0 -> DPR L2D0 -> DPR L1D1 -> 4kB data
-> 0
...
-> 0
...
-> 0
...
Example 8: A 16 Exabyte file (pwrite(fd, buf, 1, (loff_t)-2);)
has a DPR of 0x700d'cba9'8765'4321 in the inode. That intermediate
node has as its 127th DPR 0x600d'cba9'8765'4322 (at Level 7, only 7
of the 9 index bits exist within a 64-bit offset). That node has as
its 511th DPR 0x500d'cba9'8765'4323. That node has as its 511th DPR
0x400d'cba9'8765'4324. That node has as its 511th DPR
0x300d'cba9'8765'4325. That node has as its 511th DPR
0x200d'cba9'8765'4326. That node has as its 511th DPR
0x900d'cba9'8765'4327. And page 0x000d'cba9'8765'4327 is where you
find the actual data for this file. This is the worst-case scenario,
and it only takes 6 indirections.
DPR L7D0 -> 0
...
-> DPR L6D0 -> 0
...
-> DPR L5D0 -> 0
...
-> DPR L4D0 -> 0
...
-> DPR L3D0
(etc)
Example 9. A read from a fragmented file up to 1GB in size
DPR L3D0 -> DPR L2D0 -> DPR L1D1 -> 4kB user data
-> DPR L1D1 -> 4kB user data
-> DPR L1D1 -> 4kB user data
...
-> DPR L2D1 -> 2MB user data
-> DPR L2D1 -> 2MB user data
The above layout represents a file up to 1GB in size. A read of 8192
bytes from file offset 4096 would first fetch the DPR from the inode.
Seeing a Level of 3, it would check that bits 30-63 of the file offset
were clear, and load the new DPR at (FSBASE + (DPR << 12) + 0) (the
value in bits 21-29 of file offset 4096). Seeing a Level of 2 and Data
cleared to 0, it would load the new DPR from (FSBASE + (DPR << 12) + 8)
(the value in bits 12-20 of file offset 4096, multiplied by the size of
a DPR). Seeing a Level of 1, it copies 4096 bytes of data from (FSBASE +
(DPR << 12) + 0) to userspace. A naive implementation would then start
again at the inode DPR, but a smarter implementation would remember the
parent DPR and merely go one level back up the tree to read the next
DPR and copy the remaining 4k of data to userspace.
A read of 8k from offset 5MB would first fetch the DPR from the inode
and check that bits 30-63 of the offset were 0, as above. Then it
would load the new DPR from byte address (FSBASE + (DPR << 12) + 16).
Seeing a Level of 2 and Data set to 1, it copies the 8k from (FSBASE +
(DPR << 12) + 1MB) to userspace.
More information about the Linux-pmfs mailing list