New inode layout
Matthew Wilcox
willy at linux.intel.com
Thu Aug 29 16:10:35 EDT 2013
I got this response from Boris; resending to the mailing list for him:
> It would be beneficial to encode a notion of pre-allocated space. A DPR
> can reference user data, as described by bits 60-62, or it can point to
> reserved but uninitialized space. Pre-allocating space becomes useful
> when multiple files are written simultaneously, when the network reorders
> messages, or when a remote flush daemon pushes ranges out of order. When
> pre-allocated space is read into a user buffer, it has to be treated as
> a hole. If it's directly mapped, it has to be zeroed.
> Have you considered a scheme where accounting is done in ranges rather
> than in individual pages (blocks)? It's probably more challenging from
> an atomicity point of view, but may be more efficient.
My primary consideration when designing this layout was that files were
going to be directly memory mapped without an intervening page cache.
To facilitate using 2MB/1GB pages on x86, we want to make it as easy
as possible for the filesystem to allocate aligned superpages for files
which are large enough to make it worth doing.
I also intend for pre-zeroing to be done as a background task in the
filesystem so there's usually a large chunk of zero-filled memory around
to be allocated. For your specific use-case (data arriving out of order
over the network), I would expect the filesystem to either allocate pages
that can later be coalesced, or greedily allocate a 2MB/1GB page, write
pieces to it, and free the pieces it doesn't need on memory pressure
or file close (or keep the whole thing if a sufficient percentage of it
has been used).
So I do expect inodes to keep a certain amount of unwritten space around,
but not via a preallocation scheme like the one extN uses. I need to
finish off the other two documents I'm working on; I think things will
be clearer then.
On Wed, Aug 28, 2013 at 09:51:06AM -0400, Matthew Wilcox wrote:
>
> I feel we need to change the data structure pmfs uses to describe the
> user data stored in an inode. It's similar to the current data structure
> in that it's a radix tree, but has some important advantages:
>
> - Transparently supports 4k/2M/1G pages without application intervention
> - Supports files up to the maximum Linux file size
> - Only requires atomic 8-byte writes to update the inode root, rather
> than 16-byte writes.
> - Doesn't confusingly claim to be a btree when it isn't :-)
> - Doesn't permit the confusion about whether a 'block' is 4k/2M/1G
> - Permits expansion to 512GB/256TB page sizes in future
>
> I've written up this design document; it's been through a couple of rounds of
> review internally, so I think it's time to invite criticism from the world
> at large.
>
> The PMFS inode data layout
> ~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> This design is based heavily on the same kind of radix tree used by
> most CPUs to manage page tables. See, for example, Chapter 4 of the
> Intel Software Developer's Manual, Volume 3. It's modified slightly to
> account for the variable size of files.
>
> All pages are 4kB in size. Pages can be clustered together into
> superpages on naturally aligned boundaries (2MB, 1GB and 512GB).
> The inode contains a Data Page Reference (DPR). This is a 64-bit quantity
> stored in little-endian. The top 12 bits are available for use as flags.
> The bottom 52 bits reference a page through a scheme determined by the
> values of the fields in the top 12 bits. This allows us to address
> every page in a 2^64 address space.
>
> Each DPR either points to the user's data (if bit 63 is set) or it
> points to an intermediate node in the radix tree. Intermediate nodes
> are an array of 512 DPRs, indexed by the appropriate bits in the offset.
> The bits to use are determined by the Level field
> ((Offset >> (9 * Level - 6)) & 0x1ff).
>
> DPR format:
>
> Bit 63: Data. If this bit is set, the DPR references data directly and
> 	the low bits of the file offset index into the referenced
> 	(super)page. If clear, the DPR references an array of DPRs.
> Bits 60-62: Level. The maximum number of bytes referenced by this DPR.
> 0: No data is present (a hole in the file).
> 1: 4 KiB
> 2: 2 MiB
> 3: 1 GiB
> 4: 512 GiB
> 5: 256 TiB
> 6: 128 PiB
> 7: 64 EiB
> Bit 59: Volatile. If this bit is set, the page referenced is in DRAM,
> not persistent memory.
> Bits 52-58: Reserved
> Bits 0-51: Page Number
>
> Note that Level 1 implies Data 1. A DPR with Level 1, Data 0 is invalid.
>
> All non-hole DPRs in an intermediate node have the same Level,
> which is one less than the Level of their parent DPR. If the DPR
> represents a hole, then the hole is the same size as the other DPRs
> in that intermediate node. If the Level of a DPR is neither zero nor
> one less than the parent DPR, then the filesystem contains an error.
> The implementation should check this invariant to ensure that a corrupt
> tree does not contain a cycle.
>
> This tree is the static representation of the file data. For the
> filesystem to be truly useful, it has to be paired with a page allocation
> strategy that works well for a wide range of workloads. That strategy
> will be the subject of a future email.
>
> I haven't figured out exactly how I want the 'Volatile' bit to work yet.
> I envisage it being used like a delayed-allocation scheme that some
> filesystems have. We can accumulate writes in DRAM and write them all
> out to persistent memory in a single shot. The volatile bit would also
> change the interpretation of the Page Number; it wouldn't be fs-relative,
> but a PFN.
>
> I'm considering some interesting uses for bits 52-55. That's 256
> bytes of data scattered across the 4k page. We could put some useful
> repairability data in there (eg put a UUID in the inode and copy it into
> all of the intermediate nodes, a CRC of the page, etc.).
>
> Layout examples
> ~~~~~~~~~~~~~~~
>
> Example 1: The inode of a file with no data has a DPR of
> 0x0000'0000'0000'0000.
>
> Example 2: A file with up to 4k of data in it might have an inode
> DPR of 0x900d'cba9'8765'4321. Its data would be found in page
> 0x000d'cba9'8765'4321, which is at address FSBASE + 0xdcba'9876'5432'1000.
>
> DPR L1D1 -> 4kB user data
> (Level 1, Data 1)
>
> Example 3: A file with up to 2MB of data stored in 4k pages has a DPR
> of the form 0x200x'xxxx'xxxx'xxxx. The 4k of data found in that page
> contains 512 DPRs, each of which has the form 0x900x'xxxx'xxxx'xxxx.
>
> DPR L2D0 -> DPR L1D1 -> 4kB user data
> -> DPR L1D1 -> 4kB user data
> ...
>
> Example 4: A file with up to 2MB of data stored in a single 2MB superpage
> has an inode DPR of the form 0xa00f'ffff'ffff'fe00. The 2MB of data
> found in that superpage is the file data.
>
> DPR L2D1 -> 2MB user data
>
> Example 5: A file with up to 1GB of data stored in pages smaller than 1GB
> has an inode DPR of the form 0x300x'xxxx'xxxx'xxxx. The 4k of data found
> in that page contains 512 DPRs, each of which may point directly to
> user data, reference a further intermediate node, or represent a hole.
>
> DPR L3D0 -> DPR L2D0 -> DPR L1D1 -> 4kB user data
> -> DPR L1D1 -> 4kB user data
> ...
> -> 0
> -> DPR L2D1 -> 2MB user data
> -> DPR L2D0 -> DPR L1D1 -> 4kB user data
> -> DPR L1D1 -> 4kB user data
> ...
> ...
>
> Example 6: A file with up to 1GB of data stored in a single 1GB superpage
> has an inode DPR of the form 0xb00f'ffff'fffc'0000. The 1GB of data
> found in that superpage is the file data.
>
> DPR L3D1 -> 1GB user data
>
> Example 7: One extra 4k page is appended to the file in Example 6.
> We allocate an L3 intermediate node, move the inode DPR into it and update
> the inode DPR with the L4 DPR pointing to the new intermediate node.
> We then allocate an L1 intermediate node, and set the first entry to
> point to the new file data. We allocate an L2 intermediate node and set
> the first entry to point to the L1 intermediate node. We set the second
> entry of the L3 intermediate node to point to the L2 intermediate node.
>
> DPR L4D0 -> DPR L3D1 -> 1GB user data
> -> DPR L3D0 -> DPR L2D0 -> DPR L1D1 -> 4kB data
> -> 0
> ...
> -> 0
> ...
> -> 0
> ...
>
> Example 8: A 16 Exabyte file (pwrite(fd, buf, 1, (loff_t)-2);)
> has a DPR of 0x700d'cba9'8765'4321 in the inode. That data block
> has as its 511th DPR 0x600d'cba9'8765'4322. That data block has
> as its 511th DPR 0x500d'cba9'8765'4323. That data block has as
> its 511th DPR 0x400d'cba9'8765'4324. That data block has as its
> 511th DPR 0x300d'cba9'8765'4325. That data block has as its 511th
> DPR 0x200d'cba9'8765'4326. That data block has as its 511th DPR
> 0x900d'cba9'8765'4327. And data block 0x000d'cba9'8765'4327 is where you
> find the actual data for this file. This is the worst case scenario,
> and it only takes 6 indirections.
>
> DPR L7D0 -> 0
> ...
> -> DPR L6D0 -> 0
> ...
> -> DPR L5D0 -> 0
> ...
> -> DPR L4D0 -> 0
> ...
> -> DPR L3D0
> (etc)
>
> Example 9: A read from a fragmented file up to 1GB in size
>
> DPR L3D0 -> DPR L2D0 -> DPR L1D1 -> 4kB user data
> -> DPR L1D1 -> 4kB user data
> -> DPR L1D1 -> 4kB user data
> ...
> -> DPR L2D1 -> 2MB user data
> -> DPR L2D1 -> 2MB user data
>
> The above layout represents a file up to 1GB in size. A read of 8192
> bytes from file offset 4096 would first fetch the DPR from the inode.
> Seeing a Level of 3, it would check that bits 30-63 of the file offset
> were clear, and load the new DPR at (FSBASE + (PageNumber << 12) + 0)
> (0 being the value in bits 21-29 of file offset 4096, multiplied by the
> size of a DPR). Seeing a Level of 2 and Data clear, it would load the
> new DPR from (FSBASE + (PageNumber << 12) + 8) (the value in bits 12-20
> of file offset 4096, multiplied by the size of a DPR). Seeing a Level
> of 1, it copies 4096 bytes of data from (FSBASE + (PageNumber << 12) +
> 0) to userspace. A naive implementation would then start again at the
> inode DPR, but a smarter implementation would remember the parent DPR
> and merely go back up one level of the tree to read the next DPR and
> copy the remaining 4k of data to userspace.
>
> A read of 8k from offset 5MB would first fetch the DPR from the inode
> and check that bits 30-63 of the offset were 0, as above. Then it
> would load the new DPR from byte address (FSBASE + (PageNumber << 12)
> + 16). Seeing a Level of 2 and Data set to 1, it copies the 8k from
> (FSBASE + (PageNumber << 12) + 1MB) to userspace.
>
>
> _______________________________________________
> Linux-pmfs mailing list
> Linux-pmfs at lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-pmfs