New inode layout

Zuckerman, Boris borisz at hp.com
Thu Aug 29 22:13:45 EDT 2013


On pre-allocation: if space is reliably zeroed, we can simply allocate more than we need and then (if desired) "punch holes" to return the unused space. With that, we do not need to know that the space was pre-allocated.
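
For illustration only, a userspace sketch of that allocate-then-punch pattern using fallocate() (not pmfs internals; the helper name is made up):

#define _GNU_SOURCE
#include <fcntl.h>

/*
 * Reserve more than we expect to need, then punch out the unused tail
 * so the space goes back to the filesystem.  'written' is however many
 * bytes actually landed inside the reservation.
 */
static int reserve_then_trim(int fd, off_t start, off_t reserved, off_t written)
{
        if (fallocate(fd, 0, start, reserved) < 0)
                return -1;
        /* ... data gets written somewhere inside [start, start + written) ... */
        return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                         start + written, reserved - written);
}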

On tree-based range accounting: we may need a similar mechanism to track overall space. ReFS pushes that approach down to sub-spaces to get better resiliency. We are thinking about "shared PM", so multi-tree (or, as Ross said, "forest") based accounting can help turn pmfs into a "clustered" fs...

Regards, Boris

> -----Original Message-----
> From: Matthew Wilcox [mailto:willy at linux.intel.com]
> Sent: Thursday, August 29, 2013 4:11 PM
> To: linux-pmfs at lists.infradead.org
> Cc: Zuckerman, Boris
> Subject: Re: New inode layout
> 
> 
> I got this response from Boris; resending to the mailing list for him:
> 
> > It would be beneficial to encode a notion of pre-allocated space. A
> > DPR can have user data, as described by bits 60-62, or it can point
> > to reserved but not-yet-initialized space. Pre-allocating space becomes
> > useful when multiple files are written simultaneously, when the network
> > reorders messages, or when a remote flush daemon pushes ranges out of
> > order. When pre-allocated space is read into a user buffer it has to
> > be treated as a hole. If it's directly mapped, it has to be zeroed.
> 
> > Have you considered a scheme where accounting is done in ranges rather
> > than in individual pages (blocks)? It's probably more challenging from
> > an atomicity point of view, but may be more efficient...
> 
> My primary consideration when designing this layout was that files were going to be
> directly memory mapped without an intervening page cache.
> To facilitate using 2MB/1GB pages on x86, we want to make it as easy as possible for
> the filesystem to allocate aligned superpages for files which are large enough to make
> it worth doing.
> 
> I also intend for pre-zeroing to be done as a background task in the filesystem so
> there's usually a large chunk of zero-filled memory around to be allocated.  For your
> specific use-case (data arriving out of order over the network), I would expect the
> filesystem to either allocate pages that can later be coalesced, or greedily allocate a
> 2MB/1GB page, write pieces to it, and free the pieces it doesn't need on memory
> pressure or file close (or keep the whole thing if a sufficient percentage of it has been
> used).
> 
> So I do expect inodes to keep a certain amount of unwritten space around, but not
> using a preallocation scheme like extN uses.  I need to finish off the other two
> documents I'm working on and I think it'll be more clear.
> 
> On Wed, Aug 28, 2013 at 09:51:06AM -0400, Matthew Wilcox wrote:
> >
> > I feel we need to change the data structure pmfs uses to describe the
> > user data stored in an inode.  It's similar to the current data
> > structure in that it's a radix tree, but has some important advantages:
> >
> >  - Transparently supports 4k/2M/1G pages without application
> > intervention
> >  - Supports files up to the maximum Linux file size
> >  - Only requires atomic 8-byte writes to update the inode root, rather
> >    than 16-byte writes.
> >  - Doesn't confusingly claim to be a btree when it isn't :-)
> >  - Doesn't permit the confusion about whether a 'block' is 4k/2M/1G
> >  - Permits expansion to 512GB/256TB page sizes in future
> >
> > I've written up this design document; it's been through a couple of
> > rounds of review internally, so I think it's time to invite criticism
> > from the world at large.
> >
> > The PMFS inode data layout
> > ~~~~~~~~~~~~~~~~~~~~~~~~~~
> >
> > This design is based heavily on the same kind of radix tree used by
> > most CPUs to manage page tables.  See, for example, Chapter 4 of the
> > Intel Software Developer's Manual, Volume 3.  It's modified slightly
> > to account for the variable size of files.
> >
> > All pages are 4kB in size.  Pages can be clustered together into
> > superpages on naturally aligned boundaries (2MB, 1GB and 512GB).
> > The inode contains a Data Page Reference (DPR).  This is a 64-bit
> > quantity stored in little-endian.  The top 12 bits are available for use as flags.
> > The bottom 52 bits reference a page through a scheme determined by the
> > values of the fields in the top 12 bits.  This allows us to address
> > every page in a 2^64 address space.
> >
> > Each DPR either points to the user's data (if bit 63 is set) or it
> > points to an intermediate node in the radix tree.  Intermediate nodes
> > are an array of 512 DPRs, indexed by the appropriate bits in the offset.
> > The bits to use are determined by the Level field of the DPR being
> > followed: ((Offset >> (9 * Level - 6)) & 0x1ff).
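> >
> > For illustration, that index computation might look like this (a rough
> > sketch, not actual pmfs code):
> >
> >   /* Index into a 512-entry intermediate node for 'offset'.  'level' is
> >    * the Level field of the DPR that points at the node (2..7); Level 1
> >    * DPRs reference data directly and never index an intermediate node. */
> >   static inline unsigned int dpr_index(unsigned long long offset,
> >                                        unsigned int level)
> >   {
> >           return (offset >> (9 * level - 6)) & 0x1ff;
> >   }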
> >
> > DPR format:
> >
> > Bit 63: Data.  If this bit is set, the DPR references data directly and
> > 	the remaining bits of the offset are used as a byte offset within
> > 	the referenced page.  If clear, the DPR references an array of DPRs.
> > Bits 60-62: Level.  The maximum number of bytes referenced by this DPR.
> > 	0: No data is present (a hole in the file).
> > 	1:   4 KiB
> > 	2:   2 MiB
> > 	3:   1 GiB
> > 	4: 512 GiB
> > 	5: 256 TiB
> > 	6: 128 PiB
> > 	7:  64 EiB
> > Bit 59: Volatile.  If this bit is set, the page referenced is in DRAM,
> > 	not persistent memory.
> > Bits 52-58: Reserved
> > Bits 0-51: Page Number
> >
> > Note that Level 1 implies Data 1.  A DPR with Level 1, Data 0 is invalid.
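> >
> > For illustration, the field extraction could look like this (a sketch;
> > the macro names are made up).  The on-media DPR is little-endian, so a
> > real implementation would load it with le64_to_cpu() first:
> >
> >   #define DPR_DATA(d)      (((d) >> 63) & 0x1)          /* bit 63 */
> >   #define DPR_LEVEL(d)     (((d) >> 60) & 0x7)          /* bits 60-62 */
> >   #define DPR_VOLATILE(d)  (((d) >> 59) & 0x1)          /* bit 59 */
> >   #define DPR_PAGENO(d)    ((d) & ((1ULL << 52) - 1))   /* bits 0-51 */
> >
> >   /* Byte address of the referenced page within the filesystem image */
> >   #define DPR_ADDR(fsbase, d)  ((fsbase) + (DPR_PAGENO(d) << 12))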
> >
> > All non-hole DPRs in an intermediate node have the same Level, which
> > is one less than the Level of their parent DPR.  If the DPR represents
> > a hole, then the hole is the same size as the other DPRs in that
> > intermediate node.  If the Level of a DPR is neither zero nor one less
> > than the parent DPR, then the filesystem contains an error.
> > The implementation should check this invariant to ensure that a
> > corrupt tree does not contain a cycle.
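> >
> > For example, a check along those lines (reusing the DPR_LEVEL sketch
> > above) might be:
> >
> >   /* A child DPR must be a hole (Level 0) or exactly one level below
> >    * its parent; anything else means the tree is corrupt, and rejecting
> >    * it also stops a corrupt tree from sending us around a cycle. */
> >   static bool dpr_child_valid(u64 parent, u64 child)
> >   {
> >           unsigned int clevel = DPR_LEVEL(child);
> >
> >           return clevel == 0 || clevel == DPR_LEVEL(parent) - 1;
> >   }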
> >
> > This tree is the static representation of the file data.  For the
> > filesystem to be truly useful, it has to be paired with a page
> > allocation strategy that works well for a wide range of workloads.
> > That strategy will be the subject of a future email.
> >
> > I haven't figured out exactly how I want the 'Volatile' bit to work yet.
> > I envisage it being used like a delayed-allocation scheme that some
> > filesystems have.  We can accumulate writes in DRAM and write them all
> > out to persistent memory in a single shot.  The volatile bit would
> > also change the interpretation of the Page Number; it wouldn't be
> > fs-relative, but a PFN.
> >
> > I'm considering some interesting uses for bits 52-55.  That's 256
> > bytes of data scattered across the 4k page.  We could put some useful
> > repairability data in there (eg put a UUID in the inode and copy it
> > into all of the intermediate nodes, a CRC of the page, etc.).
> >
> > Layout examples
> > ~~~~~~~~~~~~~~~
> >
> > Example 1: The inode of a file with no data has a DPR of
> > 0x0000'0000'0000'0000.
> >
> > Example 2: A file with up to 4k of data in it might have an inode DPR
> > of 0x900d'cba9'8765'4321.  Its data would be found in page
> > 0x000d'cba9'8765'4321 which is at address FSBASE + 0xdcba'9876'5432'1000.
> >
> > DPR L1D1	-> 4kB user data
> > (Level 1, Data 1)
> >
> > Example 3: A file with up to 2MB of data stored in 4k pages has a DPR
> > of the form 0x200x'xxxx'xxxx'xxxx.  The 4k of data found in that page
> > contains 512 DPRs, each of which has the form 0x900x'xxxx'xxxx'xxxx.
> >
> > DPR L2D0	-> DPR L1D1	-> 4kB user data
> > 		-> DPR L1D1	-> 4kB user data
> > 		...
> >
> > Example 4: A file with up to 2MB of data stored in a single 2MB
> > superpage has an inode DPR of the form 0xa00f'ffff'ffff'fe00.  The 2MB
> > of data found in that superpage is the file data.
> >
> > DPR L2D1	-> 2MB user data
> >
> > Example 5: A file with up to 1GB of data stored in pages smaller than
> > 1GB has an inode DPR of the form 0x300x'xxxx'xxxx'xxxx.  The 4k of
> > data found in that page contains 512 DPRs, each of which may point to
> > an intermediate node, point directly to user data, or be a hole.
> >
> > DPR L3D0	-> DPR L2D0	-> DPR L1D1	-> 4kB user data
> > 				-> DPR L1D1	-> 4kB user data
> > 				...
> > 		-> 0
> > 		-> DPR L2D1	-> 2MB user data
> > 		-> DPR L2D0	-> DPR L1D1	-> 4kB user data
> > 				-> DPR L1D1	-> 4kB user data
> > 				...
> > 		...
> >
> > Example 6: A file with up to 1GB of data stored in a single 1GB
> > superpage has an inode DPR of the form 0xb00f'ffff'fffc'0000.  The 1GB
> > of data found in that superpage is the file data.
> >
> > DPR L3D1	-> 1GB user data
> >
> > Example 7: One extra 4k page is appended to the file in Example 6.
> > We allocate an L3 intermediate node, move the inode DPR into it and
> > update the inode DPR with the L4 DPR pointing to the new intermediate node.
> > We then allocate an L1 intermediate node, and set the first entry to
> > point to the new file data.  We allocate an L2 intermediate node and
> > set the first entry to point to the L1 intermediate node.  We set the
> > second entry of the L3 intermediate node to point to the L2 intermediate node.
> >
> > DPR L4D0	-> DPR L3D1	-> 1GB user data
> > 		-> DPR L3D0	-> DPR L2D0	-> DPR L1D1	-> 4kB data
> > 						-> 0
> > 						...
> > 				-> 0
> > 				...
> > 		-> 0
> > 		...
> >
> > Example 8: A 16 Exabyte file (pwrite(fd, buf, 1, (loff_t)-2);) has a
> > DPR of 0x700d'cba9'8765'4321 in the inode.  That data block has as its
> > 511th DPR 0x600d'cba9'8765'4322.  That data block has as its 511th DPR
> > 0x500d'cba9'8765'4323.  That data block has as its 511th DPR
> > 0x400d'cba9'8765'4324.  That data block has as its 511th DPR
> > 0x300d'cba9'8765'4325.  That data block has as its 511th DPR
> > 0x200d'cba9'8765'4326.  That data block has as its 511th DPR
> > 0x900d'cba9'8765'4327.  And data block 0x000d'cba9'8765'4327 is where
> > you find the actual data for this file.  This is the worst case
> > scenario, and it only takes 6 indirections.
> >
> > DPR L7D0	-> 0
> > 		...
> > 		-> DPR L6D0	-> 0
> > 				...
> > 				-> DPR L5D0	-> 0
> > 						...
> > 						-> DPR L4D0	-> 0
> > 								...
> > 								-> DPR L3D0
> > (etc)
> >
> > Example 9: A read from a fragmented file up to 1GB in size
> >
> > DPR L3D0	-> DPR L2D0	-> DPR L1D1	-> 4kB user data
> > 				-> DPR L1D1	-> 4kB user data
> > 				-> DPR L1D1	-> 4kB user data
> > 				...
> > 		-> DPR L2D1	-> 2MB user data
> > 		-> DPR L2D1	-> 2MB user data
> >
> > The above layout represents a file up to 1GB in size.  A read of 8192
> > bytes from file offset 4096 would first fetch the DPR from the inode.
> > Seeing a Level of 3, it would check that bits 30-63 of the file offset
> > were clear, and load the new DPR at (FSBASE + (DPR << 12) + 0) (the
> > value in bits 21-29 of file offset 4096).  Seeing a Level of 2 and
> > Data cleared to 0, it would load the new DPR from (FSBASE + (DPR <<
> > 12) + 8) (the value in bits 12-20 of file offset 4096, multiplied by
> > the size of a DPR).  Seeing a Level of 1, it copies 4096 bytes of data
> > from (FSBASE + (DPR << 12) + 0) to userspace.  A naive implementation
> > would then start again at the inode DPR, but a smarter implementation
> > would remember the parent DPR and merely go one level back up the tree
> > to read the next DPR and copy the remaining 4k of data to userspace.
> >
> > A read of 8k from offset 5MB would first fetch the DPR from the inode
> > and check that bits 30-63 of the offset were 0, as above.  Then it
> > would load the new DPR from byte address (FSBASE + (DPR << 12) + 16).
> > Seeing a Level of 2 and Data set to 1, it copies the 8k from (FSBASE +
> > (DPR << 12) + 1MB) to userspace.
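> >
> > To make the walk concrete, here is a rough sketch of that lookup in C,
> > using the helpers sketched earlier.  It ignores the Volatile bit, does
> > no corruption checking, and assumes the caller has already rejected
> > offsets beyond what the root DPR's Level can cover (the "bits 30-63
> > are clear" check above):
> >
> >   /* Resolve a file offset to a byte address inside the fs image,
> >    * or return 0 for a hole. */
> >   static unsigned long long dpr_resolve(unsigned long long fsbase,
> >                                         unsigned long long dpr,
> >                                         unsigned long long offset)
> >   {
> >           for (;;) {
> >                   unsigned int level = DPR_LEVEL(dpr);
> >
> >                   if (level == 0)
> >                           return 0;               /* hole */
> >                   if (DPR_DATA(dpr)) {
> >                           /* Data (super)page: the low bits of the offset
> >                            * are the byte offset inside it (Level 7 spans
> >                            * the whole 64-bit space). */
> >                           unsigned long long mask = level == 7 ? ~0ULL :
> >                                   (1ULL << (9 * level + 3)) - 1;
> >                           return DPR_ADDR(fsbase, dpr) + (offset & mask);
> >                   }
> >                   /* Intermediate node: follow one of its 512 DPRs. */
> >                   dpr = *(unsigned long long *)(DPR_ADDR(fsbase, dpr) +
> >                                                 dpr_index(offset, level) * 8);
> >           }
> >   }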
> >
> >
> > _______________________________________________
> > Linux-pmfs mailing list
> > Linux-pmfs at lists.infradead.org
> > http://lists.infradead.org/mailman/listinfo/linux-pmfs


