mtd/Documentation/jffs3 JFFS3design.tex,1.24,1.25

Artem Bityutskiy dedekind at infradead.org
Sat Nov 12 07:27:26 EST 2005


Update of /home/cvs/mtd/Documentation/jffs3
In directory phoenix.infradead.org:/tmp/cvs-serv17608/jffs3

Modified Files:
	JFFS3design.tex 
Log Message:
Various minor fixes. Add objects description.
Switch version to 0.28.



Index: JFFS3design.tex
===================================================================
RCS file: /home/cvs/mtd/Documentation/jffs3/JFFS3design.tex,v
retrieving revision 1.24
retrieving revision 1.25
diff -u -r1.24 -r1.25
--- JFFS3design.tex	3 Nov 2005 14:25:39 -0000	1.24
+++ JFFS3design.tex	12 Nov 2005 12:27:23 -0000	1.25
@@ -57,9 +57,9 @@
 \large{Artem B. Bityuckiy\\
 dedekind at infradead.org}\\
 \vspace{13cm}
-\large{Version 0.27 (draft)}\\
+\large{Version 0.28 (draft)}\\
 \vspace{0.5cm}
-November 3, 2005
+November 12, 2005
 \end{center}
 \end{titlepage}
 %\maketitle
@@ -71,14 +71,17 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 \pagestyle{empty}
 
-\begin{abstract} JFFS2, the Journalling Flash File System version 2, is widely
-used in the embedded systems world. It was designed for relatively small flash
-chips and has serious problems when it is used on large flash devices.
-Unfortunately, these scalability problems are deep inside the design of the
-file system, and cannot be solved without full redesign.
+\begin{abstract}
+
+JFFS2, the Journalling Flash File System version 2, is widely used in the
+embedded systems world. It was designed for relatively small flash chips and
+has serious problems when it is used on large flash devices. Unfortunately,
+these scalability problems are deep inside the design of the file system, and
+cannot be solved without full redesign.
+
+This document describes JFFS3~-- a new flash file system which is designed to
+be scalable.
 
-This document describes JFFS3~-- a new flash file system which is designed
-to be scalable.
 \end{abstract}
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -111,7 +114,7 @@
 whole file system may be regarded as one large log. Any file system
 modification (i.e., file change, directory creation, changing file's
 attributes, etc) is appended to the log. The log is the only data structure on
-the flash media.  Modifications are encapsulated into small data structures
+the flash media. Modifications are encapsulated into small data structures
 called \emph{nodes}.
 
 So, JFFS2 is roughly a log, the log consists of nodes, each node contains a
@@ -169,7 +172,7 @@
 
 \end{itemize}
 
-So, it may be stood that JFFS2 \emph{does not scale}. But in spite of the
+So, it may be stated that JFFS2 \emph{does not scale}. But despite the
 scalability problems, JFFS2 has many advantages, for example:
 
 \begin{itemize}
@@ -200,7 +203,7 @@
 nodes in this eraseblocks. So, when JFFS2 mounts the file system, it needs to
 glance to the end of each eraseblock and read the summary node. This results in
 that JFFS2 only needs to read one or few NAND pages from the end of each
-eraseblock.  Instead, when there is no summary, JFFS2 reads almost every NAND
+eraseblock. In contrast, when there is no summary, JFFS2 reads almost every NAND
 page of the eraseblock, because node headers are spread more or less evenly
 over the eraseblock.
 
@@ -265,6 +268,11 @@
 \item[\textbf{R12}]
 JFFS3 has to support \mbox{on-the-fly} compression.
 
+\item[\textbf{R13}]
+JFFS3 should provide good concurrency: for example, it should be possible to
+read the file system during Garbage Collection, to read and write during the
+Journal Commit, to read and write the file system simultaneously, etc.
+
 \end{enumerate}
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -488,11 +496,11 @@
 used by the \emph{Reiser4} file system (see [\ref{ref_Reiser4}]). All the file
 system objects (inodes, files, directory entries, extended attributes, etc) are
 kept in one large \mbox{$B^+$-tree}. Effectively, the whole JFFS3 file system
-may be regarded as one large \mbox{$B^+$-tree}.  This tree is further referred
+may be regarded as one large \mbox{$B^+$-tree}. This tree is further referred
 to just as "\emph{the tree}".
 
-Every object which is stored in the tree has a \emph{key}, and the object is
-found in the tree by this key. To make it clearer what are object keys, the
+Every object which is stored in the tree has a \emph{key}, and objects are
+found in the tree by their keys. To make it clearer what object keys are, the
 following is an example of how they may look like:
 
 \begin{itemize}
@@ -500,10 +508,7 @@
 \item file data key: \{\texttt{inode number}, \texttt{offset}\};
 
 \item directory entry key: \{\texttt{parent directory inode number},
-\texttt{direntry name hash}\};
-
-\item extended attribute key: \{\texttt{target inode number}, \texttt{xattr
-name hash}\} and the like.
+\texttt{direntry name hash}\} and the like.
 
 \end{itemize}
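To make the key scheme concrete, here is a minimal sketch in C; the struct layout, field widths and names are illustrative assumptions, not the actual JFFS3 on-flash format:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical key layout: each object is addressed by a pair,
 * e.g. {inode number, offset} for file data or
 * {parent inode number, direntry name hash} for directory entries. */
struct key {
	uint32_t major;		/* inode number / parent inode number */
	uint32_t minor;		/* offset / direntry name hash */
};

/* Total ordering over keys: the property a B+-tree lookup needs. */
static int key_cmp(const struct key *a, const struct key *b)
{
	if (a->major != b->major)
		return a->major < b->major ? -1 : 1;
	if (a->minor != b->minor)
		return a->minor < b->minor ? -1 : 1;
	return 0;
}
```

With such an ordering, all objects belonging to one inode end up clustered together in the tree, sorted by the minor component (e.g., file data sorted by offset).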
 
@@ -584,15 +589,15 @@
 \includegraphics[width=159mm,height=30mm]{pics/flash-01.pdf}
 %end{latexonly}
 \end{center}
-\caption{Illustration of leaf and indexing nodes separation.}
+\caption{An illustration of leaf and indexing nodes separation.}
 \label{ref_FigureFlash_01}
 \end{figure}
 
 Eraseblocks which contain only indexing nodes are called \emph{indexing
 eraseblocks} and those with leaf nodes are called \emph{leaf eraseblocks}.
 
-The depth of the tree depends on how many objects are kept on the file system.
-The more files, directories, etc are present on the file system, the deeper is
+The depth of the tree depends on how many objects are kept in the file system.
+The more files, directories, etc are present in the file system, the deeper is
 the tree. Fortunately, the number of tree levels grows very slowly with the
 growing number of file system objects and the tree lookup scales as
 $O(log_n{S})$ (logarithmically).
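As a worked example of this logarithmic scaling (the fanout value below is an illustrative assumption, not a JFFS3 design constant):

```latex
L \approx \log_n{S}, \qquad
\log_{64}{10^6} = \frac{\ln{10^6}}{\ln{64}} \approx 3.3
```

so with a fanout of 64, four tree levels are enough for a million file system objects.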
@@ -660,7 +665,7 @@
 
 When something is read from the file system, JFFS3 first glimpses at the
 \mbox{in-RAM} journal tree to figure out if the needed data is in the journal.
-If the data are  there, the journal is read, otherwise JFFS3 performs the usual
+If the data are there, the journal is read, otherwise JFFS3 performs the usual
 tree lookup (see figure \ref{ref_FigureJournal_02}).
 
 %
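The read path just described can be sketched as follows (a simplified model in which function pointers stand in for the real journal-tree and B+-tree lookup routines; all names are assumptions):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of the read path: the in-RAM journal tree is
 * consulted first, and the on-flash tree only on a journal miss. */
struct node { uint32_t inode; uint32_t offset; const char *data; };

typedef const struct node *(*lookup_fn)(uint32_t inode, uint32_t offset);

static const struct node *jffs3_read(lookup_fn journal_lookup,
				     lookup_fn tree_lookup,
				     uint32_t inode, uint32_t offset)
{
	/* Uncommitted changes live only in the journal, so it wins. */
	const struct node *n = journal_lookup(inode, offset);
	if (n)
		return n;
	return tree_lookup(inode, offset);
}

/* Demo lookup routines for one data node at {inode 1, offset 0}. */
static const struct node jnode = { 1, 0, "journal" };
static const struct node tnode = { 1, 0, "tree" };

static const struct node *journal_hit(uint32_t i, uint32_t o)
{
	return (i == 1 && o == 0) ? &jnode : NULL;
}

static const struct node *journal_empty(uint32_t i, uint32_t o)
{
	(void)i; (void)o;
	return NULL;	/* nothing uncommitted */
}

static const struct node *tree_hit(uint32_t i, uint32_t o)
{
	return (i == 1 && o == 0) ? &tnode : NULL;
}
```

The point of the ordering is correctness, not speed: data in the journal is newer than anything in the tree, so it must shadow the committed copy.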
@@ -675,13 +680,13 @@
 \includegraphics[width=130mm,height=60mm]{pics/journal-02.pdf}
 %end{latexonly}
 \end{center}
-\caption{The read request processing in JFFS3.}
+\caption{Read request processing in JFFS3.}
 \label{ref_FigureJournal_02}
 \end{figure}
 
 The journal is \emph{committed} when it is full or in some other appropriate
 for JFFS3 time. This means, that the indexing nodes corresponding to the
-journal changes are updated and written to the flash.  The committed
+journal changes are updated and written to the flash. The committed
 journal eraseblocks are then treated as leaf eraseblocks and new journal
 eraseblocks are picked by JFFS3 using the common JFFS3
 \mbox{wear-levelling} algorithm.
@@ -757,6 +762,92 @@
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 %
+% THE TREE
+%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{The tree}
+
+%
+% OBJECTS
+%
+\subsection{Objects} \label{ref_SectionObjects}
+
+JFFS3 keeps file system objects in the leaf level of the tree (in leaf nodes);
+the following is the list of supported objects.
+
+\begin{enumerate}
+
+\item \emph{Data} objects contain files' data and are kept in \emph{data
+nodes}. Data nodes typically hold one \mbox{RAM page} of data (i.e., usually
+4K), but for small files and for files' tails they may hold less data.
+
+%
+% Figure with a data objects illustration
+%
+\begin{figure}[h]
+\begin{center}
+\begin{htmlonly}
+\includegraphics{pics/dataobj-01.png}
+\end{htmlonly}
+%begin{latexonly}
+\includegraphics[width=130mm,height=90mm]{pics/dataobj-01.pdf}
+%end{latexonly}
+\end{center}
+\caption{An illustration of files' data representation.}
+\label{ref_FigureDataObj-01}
+\end{figure}
+
+Figure \ref{ref_FigureDataObj-01} illustrates the correspondence between files'
+contents and data objects. Because of compression, the actual data size of
+data nodes is less than the size of the corresponding file fragments.
+
+Also, in order to optimize flash utilization, it is possible in JFFS3 that for
+large, rarely changed files there is a multiple of the \mbox{RAM page} size of
+data in one data node.
+
+\item \emph{Direntry} objects contain the correspondence between directory
+entry names and inode numbers. Direntry objects are stored in \emph{direntry
+nodes}. Every directory entry in the file system has a corresponding direntry
+object.
+
+\item \emph{Attr-data} objects contain attributes of inodes~-- both standard
+Unix attributes like user~ID, last modification time, inode
+length, etc and \mbox{JFFS3-specific} attributes like the type of compression,
+etc. Each inode has only one corresponding \mbox{attr-data} object.
+
+\item \emph{Xentry} objects contain the correspondence between names of
+extended attributes and \emph{xattr~IDs} (analogous to direntry objects, but
+there is \{xattr~name$\Leftrightarrow$xattr~ID\} mapping instead of
+\{direntry~name$\Leftrightarrow$inode~number\} mapping in the xentry). Each
+extended attribute in the JFFS3 file system has its own unique number, just
+like every inode has its own unique inode number. In fact, JFFS3 utilizes
+the same space of numbers to enumerate inodes and extended attributes. Every
+extended attribute in the file system has a corresponding xentry object.
+
+Xentry objects are stored in \emph{xentry nodes}.
+
+\item \emph{Xattr-data} objects contain the data of extended attributes. The
+way \mbox{xattr-data} objects are kept in the tree is equivalent to the way
+data objects are kept. \emph{Xattr-data} objects are stored in
+\emph{xattr-data} nodes.
+
+\item \emph{Acl} objects contain Access Control Lists (ACL) of inodes (refer to
+[\ref{ref_ACL}] for information about ACLs). Acl objects are stored in
+\emph{acl nodes}.
+
+In \mbox{real-world} systems many files have equivalent ACLs while only a few
+files have unique ACLs. For the former group of files (or more strictly~--
+inodes) JFFS3 makes use of \emph{shared~acl} objects. This means that there is
+only one acl object instance for all of these inodes. Shared acls are referred
+to from the \mbox{attr-data} objects of these inodes. If a shared acl is
+written to, a new acl object is created (the \mbox{copy-on-write} mechanism).
+Conversely, for the latter group there is a distinct acl object for each inode.
+
+\end{enumerate}
+
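The shared-acl copy-on-write behaviour described above can be sketched like this (a simplified in-memory model; the field names and the explicit reference count are assumptions, not the on-flash representation):

```c
#include <assert.h>
#include <stdlib.h>

/* In-memory model of a shared acl object. */
struct acl {
	int nshared;	/* number of inodes referring to this acl */
	unsigned mode;	/* stand-in for the real ACL payload */
};

/* Copy-on-write: if the acl is shared, the writer gets a private
 * copy and leaves the shared group; a sole owner updates in place. */
static struct acl *acl_write(struct acl *a, unsigned new_mode)
{
	if (a->nshared > 1) {
		struct acl *copy = malloc(sizeof(*copy));

		if (!copy)
			return NULL;
		copy->nshared = 1;
		copy->mode = new_mode;
		a->nshared--;	/* the writer leaves the shared group */
		return copy;	/* private acl for the writing inode */
	}
	a->mode = new_mode;	/* sole owner: update in place */
	return a;
}

/* Demo: three inodes share one acl; one of them changes its mode.
 * Returns 1 when the other sharers keep the old acl untouched and
 * the writer ends up with a private copy. */
static int cow_demo(void)
{
	struct acl shared = { 3, 0644 };
	struct acl *priv = acl_write(&shared, 0600);
	int ok = priv && priv != &shared && priv->mode == 0600 &&
		 shared.mode == 0644 && shared.nshared == 2;

	if (priv && priv != &shared)
		free(priv);
	return ok;
}
```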
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%
 % THE SUPERBLOCK
 %
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -877,7 +968,7 @@
 chain eraseblock~2 while the first sector is dirty.
 
 \item And analogously, after many superblock updates, the chain eraseblock~1
-was updated many times and when it became full it changed its position.  Sure,
+was updated many times and when it became full it changed its position. Of course,
 the chain eraseblock~2 and the super eraseblock changed their positions many
 times as well. So, at the moment, the chain eraseblock~1 is at the eraseblock
 number~65, the chain eraseblock~2 is at the eraseblock~44 and the super
@@ -917,7 +1008,7 @@
 
 \item When a new reference is written to anchor/chain eraseblocks, the previous
 reference becomes dirty and on mount JFFS3 should find the valid reference. To
-facilitate this, each reference has its version number.  Each subsequent
+facilitate this, each reference has its version number. Each subsequent
 reference has a higher version than the previous one. Hence, JFFS3 may use the
 binary search algorithm to quickly find the valid reference.
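The version-based binary search can be sketched as follows (a simplified model: the array of per-sector versions stands in for reading reference headers from flash, and EMPTY marks a still-erased sector):

```c
#include <assert.h>

enum { EMPTY = -1 };	/* sector still containing only 0xFF words */

/* versions[i] is the version of the reference in sector i of the
 * eraseblock, or EMPTY for not-yet-written sectors. References are
 * written sequentially with increasing versions, so the array is a
 * written prefix followed by an empty suffix, and the valid (newest)
 * reference is the last written one: a binary-search target. */
static int find_valid_ref(const int versions[], int nsectors)
{
	int lo = 0, hi = nsectors - 1, valid = -1;

	while (lo <= hi) {
		int mid = lo + (hi - lo) / 2;

		if (versions[mid] == EMPTY) {
			hi = mid - 1;	/* valid ref is to the left */
		} else {
			valid = mid;	/* written; newer may be right */
			lo = mid + 1;
		}
	}
	return valid;	/* sector index of valid ref, -1 if none */
}

/* Demo eraseblocks: a partly written one and a fully erased one. */
static const int demo_partial[] = { 10, 11, 12, EMPTY, EMPTY, EMPTY };
static const int demo_erased[]  = { EMPTY, EMPTY };
```

This costs $O(\log{n})$ sector reads per eraseblock instead of scanning every sector linearly on mount.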
 
@@ -1078,7 +1169,7 @@
 To find the superblock during mount, JFFS3 finds the valid reference in the
 anchor eraseblocks, then finds the valid reference in chain erase blocks
 $1$,~$2$,~$\ldots$,~$m-1$, and finally finds the valid superblock in the super
-eraseblock.  Since JFFS3 assigns versions to records in anchor/chain/super
+eraseblock. Since JFFS3 assigns versions to records in anchor/chain/super
 eraseblocks and the versions are increased by one on every update, the binary
 search algorithm may be used to quickly find the valid sector.
 
@@ -1133,7 +1224,7 @@
 
 \begin{enumerate}
 
-\item Quota support. Will quota be supported? How will it look like -- lust
+\item Quota support. Will quota be supported? What will it look like~-- just
 generic linux quota or something better?
 
 \item Transactions:\\
@@ -1238,10 +1329,18 @@
 mechanism of owner/group/others permissions, see [\ref{ref_ACL}] for more
 details.
 
+\item \textbf{Acl}~-- an object in the tree containing an inode's ACL. Refer to
+section \ref{ref_SectionObjects} for more information.
+
 \item \textbf{Anchor eraseblock, anchor area}~-- the second and the third
 \emph{good} eraseblocks of the JFFS3 partition which are reserved for the
 superblock management. See section \ref{ref_SectionSBAlg} for more details.
 
+\item \textbf{Attr-data}~-- an object in the tree where an inode's attributes are
+stored (standard Unix attributes like creation time, owner ID, etc and other
+attributes like the type of compression, etc). Refer to section
+\ref{ref_SectionObjects} for more information.
+
 \item \textbf{$B$-tree}~-- a balanced search tree where each node has many
 children. See section \ref{ref_SectionBTrees}.
 
@@ -1259,6 +1358,9 @@
 (see section \ref{ref_SectionSBAlg}). The main reason why chain eraseblocks are
 needed is the need to provide good flash \mbox{wear-levelling}.
 
+\item \textbf{Clean eraseblock}~-- an eraseblock which contains no garbage, only
+valid information.
+
 \item \textbf{Directory entry, direntry}~-- basically an association between
 the name and the inode number. As any other object direntries are stored at the
 leaf level of the tree in direntry nodes.
@@ -1277,10 +1379,16 @@
 to \mbox{out-of-place} updates or objects deletion. It is the aim of the
 Garbage Collector to reclaim the space occupied by dirt.
 
+\item \textbf{Dirty eraseblock}~-- an eraseblock which contains some dirt along
+with valid nodes.
+
 \item \textbf{Dirty sector}~-- a sector which contains dirt.
 
 \item \textbf{Fanout}~-- the same as \textbf{branching factor}.
 
+\item \textbf{Free eraseblock}~-- an erased eraseblock (contains only
+\texttt{0xFF} words).
+
 \item \textbf{Garbage}~-- the same as \textbf{dirt}.
 
 \item \textbf{Garbage Collector}~-- a part of any Flash File System which
@@ -1319,7 +1427,7 @@
 
 \item \textbf{Journal tree}~-- an \mbox{in-memory} tree referring Journal nodes
 which were not committed so far. When JFFS3 reads, it first looks up the
-journal tree to find out whether the searched information is there.  See
+journal tree to find out whether the searched information is there. See
 section \ref{ref_SectionJournalIntro} for more details.
 
 \item \textbf{Key}~-- an identifier of objects in the tree.
@@ -1346,6 +1454,10 @@
 from JFFS3's viewpoint. May be equivalent to the minimal physical input/output
 unit (like in case of NAND flashes) or larger (like in case of NOR flashes).
 
+\item \textbf{Shared acl}~-- an acl object which is shared by many inodes for
+optimization purposes. Refer to section \ref{ref_SectionObjects} for more
+information.
+
 \item \textbf{Static eraseblock}~-- the first good eraseblock of the JFFS3
 partition where the file system static data is stored. JFFS3 may
 only read it and it is created/changed by external formatting tools.
@@ -1358,6 +1470,10 @@
 \item \textbf{Super eraseblock}~-- an eraseblock where the superblock is kept.
 See section \ref{ref_SectionSBAlg} for details.
 
+\item \textbf{Target eraseblock}~-- the eraseblock which is currently being
+processed by the Garbage Collector, i.e., nodes are moved from the \emph{target
+eraseblock} and it is erased afterwards.
+
 \item \textbf{Quota}~-- a mechanism which allows to assign different limits on
 the file system (e.g., restrict users in the number of files they may create or
 in the amount of space they may consume, etc). See [\ref{ref_Quota}] for more
@@ -1370,6 +1486,9 @@
 \item \textbf{Twig nodes}~-- nodes which reside one level above the leaf nodes
 (level 1).
 
+\item \textbf{Valid nodes}~-- nodes which contain current information;
+\mbox{non-obsolete} nodes.
+
 \item \textbf{Wandering tree}~-- a method of updating trees when there is no
 possibility to perform \mbox{in-place} updates. The JFFS3 tree is a wandering
 \mbox{$B^+$-tree}. See section \ref{ref_SectionWandTrees} for more information.
@@ -1377,6 +1496,16 @@
 \item \textbf{Xattr}~-- a widely used contracted form for \textbf{extended
 attributes}.
 
+\item \textbf{Xattr~ID}~-- a unique identifier of an extended attribute. Refer to
+section \ref{ref_SectionObjects} for more information.
+
+\item \textbf{Xattr-data}~-- an object in the tree containing the data of an
+extended attribute.
+
+\item \textbf{Xentry}~-- an object in the tree which stores the association
+between the name of an extended attribute and its xattr~ID. Refer to section
+\ref{ref_SectionObjects} for more information.
+
 \end{enumerate}
 
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%




