mtd/Documentation/jffs3 JFFS3design.tex, 1.6, 1.7 sb_mgmt.eps, 1.1,
1.2 sb_mgmt.pdf, 1.1, 1.2
Artem Bityuckiy
dedekind at infradead.org
Mon Aug 1 07:37:45 EDT 2005
Update of /home/cvs/mtd/Documentation/jffs3
In directory phoenix.infradead.org:/tmp/cvs-serv30967
Modified Files:
JFFS3design.tex sb_mgmt.eps sb_mgmt.pdf
Log Message:
Very small update. Almost nothing new.
Index: JFFS3design.tex
===================================================================
RCS file: /home/cvs/mtd/Documentation/jffs3/JFFS3design.tex,v
retrieving revision 1.6
retrieving revision 1.7
diff -u -r1.6 -r1.7
--- JFFS3design.tex 16 Jun 2005 07:18:30 -0000 1.6
+++ JFFS3design.tex 1 Aug 2005 11:37:39 -0000 1.7
@@ -81,18 +81,17 @@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
-% JFFS3 MOTIVATION AND GOALS
+% MOTIVATION AND GOALS
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\section{JFFS3 motivation and goals}
+\section{Motivation and goals}
JFFS2, the Journalling Flash File System version 2 [\ref{ref_JFFSdwmw2}]
is widely used in the embedded systems world. JFFS2 was originally designed for
small NOR flashes ($<$ 32MB), and NAND support was added later. The first NAND
flashes were also small enough, but grew in size very quickly and are currently
much larger than the 32MB JFFS2 was originally designed for (e.g., Samsung produces 2GB NAND
flashes [\ref{ref_SamsungNANDlist}]). Unfortunately, owing to its design,
-JFFS2 cannot work well on such large flash chips. The following problems
-arise:
+JFFS2 has serious problems on large flash chips:
\begin{itemize}
\item the \emph{mount time} becomes too high;
@@ -102,7 +101,8 @@
\end{itemize}
Due to the design of JFFS2, the three above parameters (mount time, memory
-consumption and the file access time) depend linearly on the size of flash and files.
+consumption and the file access time) depend linearly on the size of the flash
+partition and files.
To put it differently, JFFS2 \emph{does not scale}. But in spite of the scalability
problems, JFFS2 has many advantages:
@@ -123,18 +123,18 @@
\begin{enumerate}
\item provide fast mounting;
-\item consume few memory although it isn't forbidden to consume much RAM providing
-the RAM is used \emph{for caching} and may be any time freed;
+\item consume few memory, although it isn't forbidden to consume much RAM providing
+the RAM is used \emph{for caching} and may be freed on any demand;
\item be tolerant to unclean reboots;
\item all the JFFS3 characteristics must scale well up to 1TB flash chips.
\end{enumerate}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
-% JFFS3 design overview
+% INTRODUCTION
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\section{JFFS3 design overview}
+\section{Introduction}
This section overviews the main JFFS3 design aspects without any thorough
description and is to be regarded as an introduction to the design of JFFS3.
@@ -143,27 +143,27 @@
improvement over the design of JFFS2 (which is not scalable) -- some completely
different principles must be applied to JFFS3.
-The following are the most essential things we want JFFS3 to meet:
+The following are the most essential requirements we want JFFS3 to meet:
\begin{itemize}
\item The JFFS3 memory consumption must not depend on the size of the JFFS3
partition, the number of inodes in the file system, size of files and
directories and the like. Of course, JFFS3 must be able to take advantage
of the available
-RAM, but only for different kinds of \emph{caches} which may be flushed back to
-flash any time in case of memory pressure.
+RAM, but only for different kinds of \emph{caches} which may be freed
+any time in case of high memory pressure.
\item All the JFFS3 characteristics ought to vary not faster than $\log(S)$,
-where $S$ is the size of
-JFFS3 partition. JFFS2-like linear dependencies are forbidden.
+where $S$ is the size of the
+JFFS3 partition. JFFS2-like linear dependencies are not acceptable.
\item The flash wear-levelling has to always be kept in mind and must not
suffer.
-\item The fact that flash devices may be not very reliable - bits may flip, bad
-blocks may appear, etc should be taken into account.
+\item The fact that flash devices may not be very reliable (bits may flip, bad
+blocks may appear, etc.) should be taken into account.
-\item It is obligatory to make JFFS3 be very tolerant to unclean reboots,
+\item It is obligatory for JFFS3 to be very tolerant to unclean reboots,
just like JFFS2.
\end{itemize}
@@ -174,57 +174,107 @@
disk drives (in contrast to JFFS2, which treats flash eraseblocks as contiguous
space).
-\item JFFS3 orients to \emph{large-scale NAND flashes}. Other flash types may be
-supported too, but they should emulate NAND flash - e.g., JFFS3 assumes there is
-an OOB area on each NAND page (sector), etc.
+\item JFFS3 orients to \emph{large-scale flashes}. For small flashes
+(presumably less then 128MB), JFFS2 is the best choice.
\item There has to exist a \emph{superblock} containing a sufficient amount of
-information to perform quick JFFS3 mount. As flash chips don't admit of
-rewriting of the same place several times without the erasure of the whole
-eraseblock and because of the the wear-levelling
-issues, some complex superblock management schemes are compulsory.
+information to perform a quick JFFS3 mount. Flash chips do not admit of
+rewriting the same place several times without erasing the whole
+eraseblock, and they need wear-levelling support. This means that
+a fairly complex superblock management scheme is compulsory.
+
+\item Since compression is difficult to implement in the block-based
+approach, it was decided not to support it for now. This may be regarded as a future
+optimization.
-\item Since the compression is very difficult to implement in the block-based
-approach, it was decided not to support it.
+\end{itemize}
+
+It is important to note that we are using somewhat unusual
+terminology:
+
+\begin{itemize}
+\item \emph{erasable block, eraseblock} -- the minimal erasable unit of the
+flash chip; an eraseblock consists of several sectors;
+\item \emph{sector} -- the smallest writable unit of the \emph{flash chip},
+i.e. the NAND page in case of NAND; the
+minimal size of the sector is assumed to be 512 bytes;
+
+\item \emph{block} -- multiple adjacent sectors whose total size equals the size
+of the RAM page (\texttt{PAGE\_SIZE}); most often the block size is 4KB.
\end{itemize}
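To make the relationship between the three units concrete, here is a tiny sketch; only the 512-byte minimal sector and the 4KB block come from the definitions above, while the 16KB eraseblock is a hypothetical example size (real chips vary):

```python
# Unit sizes: the 512-byte sector and 4KB block follow the definitions
# above; the 16KB eraseblock is a hypothetical example (real chips vary).
SECTOR_SIZE = 512              # smallest writable unit of the chip (NAND page)
BLOCK_SIZE = 4096              # file system block, equal to PAGE_SIZE
ERASEBLOCK_SIZE = 16 * 1024    # smallest erasable unit of the chip

SECTORS_PER_BLOCK = BLOCK_SIZE // SECTOR_SIZE            # 8 sectors per block
SECTORS_PER_ERASEBLOCK = ERASEBLOCK_SIZE // SECTOR_SIZE  # N = 32 in this example
BLOCKS_PER_ERASEBLOCK = ERASEBLOCK_SIZE // BLOCK_SIZE    # 4 blocks per eraseblock
```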
+The main reason why we chose this terminology is to be consistent with the
+terminology used by ordinary file systems.
+Note that all the terms used in this document are
+explained in section~\ref{ref_SectDefinitions}. Please glance at that
+section each time you meet a new term.
+
The way JFFS3 stores the filesystem data is similar to the
-approach of \emph{\mbox{ReiserFS}} file system [\ref{ref_Reiser4}].
+approach used by the \emph{Reiser4} file system (see [\ref{ref_Reiser4}]).
All the file system
objects (inodes, file data blocks, directory entries, extended attributes, etc)
are kept in one large $B^+$-tree. Effectively, the whole JFFS3 file system may be
-regarded as one large $B^+$-tree, excluding some superblock-related things,
-which don't belong to the tree. Please, refer to [\ref{ref_Reiser4}] if you are
-interested in the merits of such a tree-based approach.
-
-The tree consists of \emph{nodes}, which may occupy either one sector
-(\emph{branch nodes}) or one block (\emph{leaf nodes})
-(please, refer to the definitions chapter \ref{ref_SectDefinitions}
-for the definitions of \emph{sector}, \emph{block}, \emph{branch node}, etc).
-All the objects stored in the tree are actually kept in the tree's leaf nodes,
-while the non-leaf nodes contain only (\emph{key}, \emph{link}) pairs (see the
-definition of $B^+$-tree in any book devoted to computer algorithms).
-
-Each file system object has an associated key which provides fast object search.
-Different objects has different keys, but keys may occasionally collide. In
-that case, all the colliding objects are kept together and JFFS3 selects the
-right one by means of the simple linear search.
-
-When files are created, changed or deleted, the corresponding leaf nodes of the
-tree are changed. When a node is changed, JFFS3 writes the node update to
-another place on flash, not to the same place where it was before.
-To keep the tree consistent, the upper
-node, which points to the leaf, have to also be changed (the link update), and
-so forth up to the root node. I.e., the more levels the tree has,
-the more writes any single node update requires. The same story with Garbage
-Collector - if it moves any tree node, the whole path up to the root must be
-updated.
-
-With the tree-based approach, mount essentially means locating the root node
-of the tree. The root node is referred from the JFFS3 superblock. The superblock
-is found relatively quickly (see section \ref{ref_SectionSB}), so the file
-system mount is also fast.
+regarded as one large $B^+$-tree (excluding some superblock-related things
+which do not belong to the tree). Please refer to [\ref{ref_Reiser4}] for
+information about the merits of such an approach.
+
+There are basically four main entities which constitute the JFFS3 file system:
+
+\begin{itemize}
+\item \emph{the tree} - contains files, directories, attributes, etc; this
+is the main, the largest and the most complex entity in JFFS3; \TODO{insert
+reference}
+
+\item \emph{the map} - contains information about each flash
+eraseblock (wear-levelling data, the bad block flag, etc) and about each
+sector (\emph{pristine} or \emph{dirty}); the map is mainly used by the
+\emph{garbage collector} and the \emph{space allocator}; \TODO{insert
+reference}
+
+\item \emph{the superblock} - contains the reference to the \emph{root node} of
+the tree, the map, etc., as well as static data about the whole file system
+(the version number, etc); this also includes the supplementary eraseblocks which
+make it possible to quickly find the superblock (see
+section~\ref{ref_SectionSB});
+
+\item \emph{the journal} - the area where all file system changes are
+written first; since
+transaction support is not planned for JFFS3 (but may be added later),
+the journal may be regarded as an optimization which helps to reduce the amount
+of flash I/O. \TODO{insert reference}
+\end{itemize}
+
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+%
+% TREE DESIGN
+%
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\section{Tree design} \label{ref_SectionTree}
+This section describes the design of the $B^+$-tree of the JFFS3 file system.
+
+% BASICS
+\subsection{Basics}
+An inexact definition of a $B$-tree is a balanced search tree
+where each node may have many children. The \emph{branching factor} or the
+\emph{fanout} defines the maximal number of a node's children. While $B$-trees
+may contain both useful data and (key, link) pairs in non-leaf nodes, $B^+$-trees
+are $B$-trees which store data only in leaf nodes, while non-leaf nodes
+contain only keys and links. Please refer to Donald Knuth's
+books for more information about $B$-trees.
+
+The tree may have several levels, depending on how many objects are stored in
+it, i.e., how much data is stored in the file system. JFFS3 makes no assumption
+about the maximum number of levels. The levels of the tree are numbered starting
+from the leaf level (level 0) and ending at the root level.
+
+The nodes of level 0 are called \emph{leaf nodes}. The nodes of level
+1 are called \emph{twig} nodes. The top level contains only the \emph{root}
+node. The other levels contain \emph{branch} nodes (note that we use the same
+terminology as the Reiser4 file system [\ref{ref_Reiser4}]).
+Leaf nodes may only contain objects, the other nodes may only contain
+\emph{keys} and \emph{links}.
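A lookup in such a tree descends from the root through branch and twig nodes to a leaf; a minimal sketch follows (the in-memory node layout here is illustrative, not the JFFS3 on-flash format):

```python
import bisect

class Node:
    """One tree node: non-leaf nodes hold (key, link) pairs, leaves hold objects."""
    def __init__(self, keys, children=None, objects=None):
        self.keys = keys          # sorted keys
        self.children = children  # child links (root/branch/twig nodes only)
        self.objects = objects    # stored objects (leaf nodes, level 0, only)

def lookup(root, key):
    node = root
    # Descend level by level; children[i] covers keys below keys[i].
    while node.children is not None:
        node = node.children[bisect.bisect_right(node.keys, key)]
    # Linear search inside the leaf resolves occasional key collisions.
    for k, obj in zip(node.keys, node.objects):
        if k == key:
            return obj
    return None

# A two-level example: a root node over two leaves.
leaves = [Node([1, 2], objects=["a", "b"]), Node([10, 15], objects=["c", "d"])]
root = Node([10], children=leaves)
```

Here `lookup(root, 15)` returns `"d"`; on flash, the same descent would presumably cost one node read per tree level.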
+
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
@@ -232,40 +282,19 @@
%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Superblock management} \label{ref_SectionSB}
-The superblock (SB) is the data structure which contains information about the
-whole JFFS3 file system and its layout. When the file system is being mounted,
-the superblock should be found and read.
+\emph{The JFFS3 superblock} is the data structure that describes the
+file system as a whole (i.e., the offset of the root node, the journal
+eraseblocks, etc). When the file system is being mounted, JFFS3 first finds and
+reads the superblock.
In case of traditional file systems the
-superblock usually resides at a fixed position of the disk and may be quickly
-found.
+superblock usually resides at a fixed position on the disk and may be found
+very quickly.
Conversely, due to the "out-of-place write" flash property we
-cannot guarantee the fixed position of the JFFS3 superblock. Things are getting
-even more complex because of the need to provide gentle flash wear levelling -- we
+cannot guarantee a fixed position of the JFFS3 superblock. Things get
+even more complex because of the need to provide good flash wear-levelling -- we
cannot just reserve several erasable blocks for the SB unless we've guaranteed
-that these SB~eraseblocks are not worn out before the other
-eraseblocks.
-
-It is important to note that we're using somewhat confusing terminology
-(see section~\ref{ref_SectDefinitions} for all the definitions in the document):
-
-\begin{itemize}
-\item \textbf{erasable block, eraseblock} -- the minimal erasable unit of the
-flash chip; eraseblock consists of sectors;
-
-\item \textbf{sector} -- the smallest writable unit of the \emph{flash chip},
-i.e. the NAND page in case of NAND; the
-minimal size of the sector is assumed to be 512KB;
-
-\item \textbf{block} -- the smallest writable unit of the \emph{file system};
-blocks are 4KB in size, like the size of the RAM page in most architectures;
-blocks may consist of multiple sectors.
-\end{itemize}
-
-The reason why we chose this terminology is to be consistent with the
-terminology used in case of ordinary file system systems and the Linux block I/O
-layer. Different flash chips use different terms and we decided to select more
-settled terminology.
+that these eraseblocks will not be worn out before the other eraseblocks.
Thus, we have the following two requirements that ought to be met:
\begin{itemize}
@@ -274,30 +303,42 @@
levelling.
\end{itemize}
-JFFS3 reserves the first two good erasable blocks of the JFFS3 partition for the
-superblock managing. We call these two eraseblocks \textbf{anchor eraseblocks}
-or \textbf{anchor area}.
-Note, the superblock takes only \emph{one sector} in JFFS3.
+The JFFS3 superblock consists of two parts:
+\begin{itemize}
+\item \emph{static superblock} - contains only static data which is never
+changed by JFFS3; the static superblock takes the first good eraseblock of the
+JFFS3 partition and may be changed only by external user-level tools;
+\item just \emph{superblock} (\emph{SB}) - contains dynamic data which is changed
+often and requires special methods to deal with it; the rest of this chapter is
+devoted to this dynamic part of the JFFS3 superblock; note that, unlike the static
+superblock, the superblock takes one sector, not one eraseblock.
+\end{itemize}
-Suppose we are storing the JFFS3 superblock in the anchor area.
+To maintain the superblock management scheme, JFFS3 reserves the first two good
+eraseblocks at the beginning of the flash partition (just after the
+static superblock). These two eraseblocks are called
+\emph{anchor eraseblocks}, or \emph{anchor area}.
+
+Suppose we are storing the superblock in the anchor area.
The superblock updates are written sequentially to adjacent
sectors 1,2,3,... of the anchor area.
When one of the anchor eraseblocks becomes full, the updates are
-written to the other anchor eraseblock and the first one is erased. Thus, the
+written to the second anchor eraseblock and the first one is erased. Thus,
unclean reboot tolerance is ensured. There may be many copies of the superblock
-in the anchor area, but only one copy is valid. Assigning version
+in the anchor area, but only the last copy is valid. Assigning version
numbers to the copies we may find the valid SB. Since the copies are written
sequentially, we may even use the binary search algorithm.
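The binary search over sequentially written superblock copies can be sketched as follows; `read_version` is a hypothetical helper (not a JFFS3 interface) returning the version stored in a given sector, or `None` for a still-erased sector:

```python
# Sketch of locating the valid (most recent) superblock copy by binary
# search. `read_version(i)` is a hypothetical helper returning the version
# number stored in sector i, or None if the sector is still erased.
# Copies are written sequentially, so the written sectors form a prefix.
def find_valid_sb(read_version, num_sectors):
    if read_version(0) is None:
        return None                      # nothing written yet
    lo, hi = 0, num_sectors              # invariant: sector lo written, >= hi erased
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if read_version(mid) is None:
            hi = mid                     # mid not written yet
        else:
            lo = mid                     # mid written; the valid copy is at or after it
    return lo                            # index of the last written sector

# Example: four superblock updates written into an 8-sector area.
area = [1, 2, 3, 4, None, None, None, None]
print(find_valid_sb(lambda i: area[i], len(area)))  # 3
```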
-It stands no reason that the above scheme is bad because the anchor area
-will be worn out much earlier then the rest of flash. To prevent this
-we introduce \textbf{SB~eraseblocks} where we consequently write our
+The above scheme alone is bad because the anchor area
+would be worn out much earlier than the rest of the flash. To prevent this
+we introduce \emph{L2~anchor eraseblocks} where we sequentially write the
superblock updates. Anchor
-eraseblocks will contain references to the SB~eraseblock.
-When the SB~eraseblock
-becomes full, JFFS3 picks another SB~eraseblock, writes the superblock update to the new
-SB~eraseblock, and updates the reference to the SB~eraseblock in the anchor area.
-Note, that the SB~eraseblocks fit the main wear-levelling scheme,
+eraseblocks (level 1) will contain references to the L2~anchor eraseblock.
+When the L2~anchor eraseblock
+becomes full, JFFS3 picks another L2~anchor eraseblock, writes the superblock
+update to the new L2~anchor eraseblock, and updates the reference to the
+L2~anchor eraseblock in the anchor area.
+Note, that the L2~anchor eraseblocks fit the main wear-levelling scheme,
just like any other eraseblocks.
%
@@ -307,86 +348,83 @@
\begin{center}
\includegraphics[width=159mm,height=120mm]{sb_mgmt}
\end{center}
-\caption{The superblock management scheme with one level of SB~eraseblocks.}
+\caption{The superblock management scheme with 2 levels of anchor eraseblocks.}
\label{ref_FigureSBMgmt}
\end{figure}
-Figure \ref{ref_FigureSBMgmt} illustrates the SB management scheme with one
-level of the SB~eraseblock.
+Figure \ref{ref_FigureSBMgmt} illustrates the SB management scheme with two
+levels of anchor eraseblocks.
\begin{enumerate}
\item[a).] Eraseblocks number 1 and 2 are reserved for the anchor area.
-The first sector of the first anchor eraseblock refers the SB~eraseblock.
-The first sector of the SB~eraseblock (number 5) contains the superblock.
+The first sector of the first anchor eraseblock refers to the L2~anchor eraseblock.
+The first sector of the L2~anchor eraseblock (number 5) contains the superblock.
\item[b).] The superblock was updated and was written to the second sector of
-the SB~eraseblock. The first sector of
-the SB~eraseblock becomes dirty and doesn't contain valid data any longer.
+the L2~anchor eraseblock. The first sector of
+the L2~anchor eraseblock becomes dirty and doesn't contain valid data any longer.
The anchor eraseblocks remain unchanged.
\item[c).] The superblock was updated several times and was written to the
-adjacent sectors of the SB~eraseblock. The last sector of the SB~eraseblock is
+adjacent sectors of the L2~anchor eraseblock. The last sector of the L2~anchor eraseblock is
occupied by the superblock, the other sectors contain dirt. The anchor eraseblocks
were not changed so far.
-\item[d).] The SB~eraseblock had no more free sectors. JFFS3 selected another
-SB~eraseblock, taking into account the wear-levelling. At that point the Garbage
-Collector may have been invoked if there were no free eraseblocks. The SB~eraseblock
-reference in the anchor block was updated, and at the moment the second sector of the anchor
-block is occupied by the new SB~eraseblock reference while the first sector contains
-dirt. The new SB~eraseblock is now at the flash eraseblock number 79. The
-old SB~eraseblock may now be erased and re-used for different purposes.
+\item[d).] The L2~anchor eraseblock has no more free sectors. JFFS3 selects another
+L2~anchor eraseblock, taking into account the wear-levelling. At this point the Garbage
+Collector may be invoked if there are no free eraseblocks. Then the L2~anchor eraseblock
+reference in the anchor eraseblock is updated, and the second sector of the anchor
+eraseblock is occupied by the new L2~anchor eraseblock reference while the first sector
+contains dirt. The new L2~anchor eraseblock is now at flash eraseblock number 79. The
+old L2~anchor eraseblock may now be erased and re-used for any purpose.
\item[e).] The superblock was updated many times and the position of the
-SB~eraseblock was changed many times. The current SB~eraseblock number is 3. The
+L2~anchor eraseblock was changed many times. The current L2~anchor eraseblock number is 3. The
first anchor eraseblock is full and its last sector refers to the current
-SB~eraseblock.
+L2~anchor eraseblock.
-\item[f).] The superblock was updated many times. JFFS3 started using the
-second anchor eraseblock, and the first anchor eraseblock was erased. The
+\item[f).] The superblock was updated many times. JFFS3 starts using the
+second anchor eraseblock, and the first anchor eraseblock is erased. The
reason why JFFS3 makes use of two anchor eraseblocks is to guarantee the
unclean reboot tolerance.
\end{enumerate}
-Obviously, with the SB~eraseblock approach the anchor area is erased $N$ times
-less then if there would be no SB~eraseblocks ($N$ is the number of sectors
+Obviously, with the L2~anchor eraseblock approach the anchor area is erased $N$ times
+less than it would be if there were no L2~anchor eraseblocks ($N$ is the number of sectors
per eraseblock).
-If the flash chip is larger then a certain size, the proposed scheme with one
-level SB~eraseblocks does not guarantee that the anchor eraseblocks are not worn out
+If the flash chip is larger than a certain size, the proposed scheme with 2
+levels of anchor eraseblocks does not guarantee that the anchor eraseblocks are not worn out
earlier then the other eraseblocks. But we may introduce the
-second level of SB~eraseblocks. Then the anchor eraseblock will refer the
-SB~eraseblock L2, which, in turn, will refer the SB~eraseblock where the
-superblock is stored. If the two-level scheme is not enough, the three-level
-scheme may be used and so
-forth. Thus, by means of adding more SB~eraseblock levels we may guarantee the
-appropriate anchor area erase rate. Note, that the SB~eraseblocks of all
-the levels obey the global JFFS3 wear-levelling rules, so we don't need to
-worry about the wearing of those blocks.
-
-Note, we number the SB eraseblock levels starting from the SB eraseblocks,
-referred from the anchor area, i.e., the anchor area refers the SB eraseblock
-level $n$, which refers the SB eraseblock level $n-1$ and so on. The superblock
-is found in the SB eraseblock -- last level of the chain.
+third level of anchor eraseblocks. Then the anchor eraseblock will refer to the
+L2~anchor eraseblock, which, in turn, will refer to the L3~anchor eraseblock where the
+superblock will be stored. If the three-level scheme is not enough, a four-level
+scheme may be used, and so
+forth. Thus, by means of adding more anchor eraseblock levels we may guarantee the
+appropriate anchor area erase rate. Note, that the anchor eraseblocks of any
+level obey the global JFFS3 wear-levelling rules, so we don't need to
+worry about the wearing of those eraseblocks.
-Let's calculate the number of SB~eraseblock levels assuming that:
+Let's calculate the number of anchor eraseblock levels assuming that:
\begin{itemize}
\item the life-cycle of one eraseblock is $L$, i.e., there are $L$ guaranteed
erases for each eraseblock (typically $\sim 10^5$ for NAND flashes);
-\item the total number of the flash eraseblocks is $M$;
+\item the total number of eraseblocks on the JFFS3 flash partition is $M$;
\item the number of sectors per eraseblock is $N$;
-\item the file system data is updated with the constant frequency $R$ sectors
+\item the file system data is updated with average rate $R$ sectors
per second;
-\item the anchor area is updated with the constant rate $R_A$ sectors per
+\item the anchor area is updated with constant average rate $R_A$ sectors per
second;
-\item the \textbf{data area} is updated with the constant rate $R_D$ sectors per second;
+\item the file system data area (i.e., the whole flash minus the superblock
+and the anchor area) is updated
+with constant average rate $R_D$ sectors per second;
\item the period of time of the total anchor area wear is $T_A$;
\item the period of time of the total data area wear is $T_D$;
-\item the number of used SB~eraseblock layers is $n$.
+\item the number of used anchor eraseblock layers is $n$.
\end{itemize}
-In our calculation we assume the worst case scenario when any file system data
-update requires the superblock update.
+In our calculation we will assume the worst-case scenario in which every file system
+data update requires a superblock update (synchronous operation mode).
Obviously, what we want is to ensure that the anchor area is worn out not
earlier than the data area, i.e., the following inequality holds:
@@ -408,7 +446,7 @@
\label{ref_Equation_TA_and_TD}
\end{equation}
-If $n = 0$, i.e., there is no SB~eraseblocks and the superblock is stored in
+If $n = 0$, i.e., there are no L2~anchor eraseblocks and the superblock is stored in
the anchor area, then taking into account (\ref{ref_Equation_TA_and_TD})
and that in this case $R_A = R_D = R$, we have
@@ -416,10 +454,10 @@
\frac{T_A}{T_D} = 2 \cdot \frac{1}{(M-2)}
$$
-If $n = 1$. i.e., the level one SB~eraseblocks are used, the anchor area will
-be written $N$ times less frequently as the data area will be written $2$ times
+If $n = 1$, i.e., L2~anchor eraseblocks are used, the anchor area will
+be written $N$ times less frequently, while the data area will be written $2$ times
more frequently, because each file system data update will require (a) the
-superblock update and (b) the SB reference update in the SB~eraseblock.
+superblock update and (b) the SB reference update in the anchor eraseblock.
Therefore, $R_A = R/N$ and $R_D = 2R$ and from (\ref{ref_Equation_TA_and_TD})
we have
@@ -427,12 +465,12 @@
\frac{T_A}{T_D} = 2 \cdot \frac{2N}{M-2}
$$
-When $n = 2$, i.e., two levels of SB~eraseblocks are used, the anchor area will
+When $n = 2$, i.e., three levels of anchor eraseblocks are used, the anchor area will
be written $N^2$ times less frequently, while the data area will be written
-$2+1/N$ times more frequently (one additional superblock update on each file
-system update and one L2 SB~eraseblock is updated once per N L1 SB~eraseblock
+$2+1/N$ times more frequently (one additional superblock update on each file
+system update and one L2~anchor eraseblock update per $N$ superblock
updates). Therefore, $R_A=R/N^2$ and $R_D = (2 + 1/N) \cdot R$ and from
-\ref{ref_Equation_TA_and_TD}
+(\ref{ref_Equation_TA_and_TD}) we have
$$
\frac{T_A}{T_D} = 2 \cdot \frac{2N^2+N}{M-2}
@@ -463,37 +501,40 @@
\label{ref_EquationSBIneq1}
\end{equation}
-Table \ref{ref_TableNANDLevels} describes the number of required SB eraseblock
-levels for different NAND flash sizes.
+Table \ref{ref_TableNANDLevels} describes the number of required anchor eraseblock
+levels for NAND flash of different sizes.
-\begin{center}
\begin{table}[h]
+\begin{center}
\begin{tabular}{llllll}
-Flash type & Flash size & Sector size & $M$ & $N$ & Levels needed ($n$)\\
+\textbf{Type} & \textbf{Size} & \textbf{EB size} & $\bf M$ & $\bf N$ &
+\textbf{Levels ($\bf n$)}\\
\hline
Toshiba TC58DVM92A1FT & 64MB & 16KB & 4096 & 32 & 2\\
Toshiba TH58NVG1S3AFT05 & 512MB & 128KB & 4096 & 64 & 2\\
ST Micro NAND08G-B & 1GB & 128KB & 8192 & 64 & 2\\
Samsung K9K1G08X0B & 2GB & 128KB & 16384 & 64 & 2\\
\end{tabular}
-\caption{The number of required SB eraseblock levels for different types of
+\caption{The number of required anchor eraseblock levels for different types of
existing NAND flashes.} \label{ref_TableNANDLevels}
-\end{table}
\end{center}
+\end{table}
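The $T_A/T_D$ ratios above follow one pattern: $R_A = R/N^n$ and $R_D = (2 + 1/N + \ldots + 1/N^{n-1})R$, so $T_A/T_D = 2\,(R_D/R_A)/(M-2)$. The following sketch, our generalization of the $n = 0, 1, 2$ cases worked out in the text, reproduces the "levels" column of the table:

```python
# Reproduce the "levels" column of the table from the T_A/T_D ratios
# derived in the text: T_A/T_D = 2*(R_D/R_A)/(M-2), with R_A = R/N^n
# and R_D = (2 + 1/N + ... + 1/N^(n-1))*R (generalized from n = 0, 1, 2).
def wear_ratio(n, N, M):
    """T_A/T_D for n anchor eraseblock levels beyond the anchor area itself."""
    if n == 0:
        rd_over_ra = 1.0                  # R_A = R_D = R
    else:
        rd_over_ra = (2 + sum(N ** -k for k in range(1, n))) * N ** n
    return 2 * rd_over_ra / (M - 2)

def levels_needed(N, M):
    """Smallest n for which the anchor area wears no faster than the data area."""
    n = 0
    while wear_ratio(n, N, M) < 1:
        n += 1
    return n

# (N, M) pairs from the table above -- every listed chip needs n = 2.
for N, M in [(32, 4096), (64, 4096), (64, 8192), (64, 16384)]:
    print(N, M, levels_needed(N, M))
```

By the same criterion, $N=64$ stops being satisfied by $n=2$ at roughly $M \approx 16500$ eraseblocks, i.e., around 2GB with 128KB eraseblocks, which is consistent with the table ending at 2GB chips.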
-Note, providing that $N=64$, three SB~eraseblock levels are enough to guarantee
-the acceptable anchor area wear leveling up to 128GB flash and four
-SB~eraseblock levels -- up to 8TB (the inequality \ref{ref_EquationSBIneq1}).
-
-To find the superblock during the mount JFFS3 finds the valid reference in the
-anchor eraseblocks, then finds the valid reference in the SB eraseblock level $n$,
-$n-1$, etc, and finally finds the valid superblock sector in the SB eraseblock.
-Since JFFS3 assigns versions to the sectors of the anchor eraseblocks and SB
-eraseblocks and the version is increased by one every time the sector is
-updated, the binary search algorithm may be used to find the valid sector.
-
-The valid SB eraseblock level $n$ reference in the anchor area may be found after
-$log_2(2N)+2$ steps, the reference to the level $n-1$ SB eraseblock -- after
+Note that, providing $N=64$, three anchor eraseblock levels are enough to guarantee
+acceptable anchor area wear-levelling for flashes up to 128GB, and four
+anchor eraseblock levels -- for flashes up to 8TB (see inequality (\ref{ref_EquationSBIneq1})).
+
+To find the superblock during the mount, JFFS3 finds the valid reference in the
+anchor eraseblocks, then follows the chain of valid references through the
+anchor eraseblock levels $2$,~$3$,~$\ldots$, and finally finds the valid
+superblock sector in the last-level anchor eraseblock.
+Since JFFS3 assigns versions to the sectors of the anchor eraseblocks
+and the versions are increased by one on every update,
+the binary search algorithm may be used to find the valid sector.
+
+The valid reference in the anchor area (which points to the L2~anchor eraseblock)
+may be found after
+$\log_2(2N)+2$ steps (one step involves one sector read operation),
+the valid reference in the L2~anchor eraseblock -- after
$\log_2(N)+2$ steps, and so on. Thus, to find the superblock, JFFS3 must read
\begin{equation}
@@ -504,26 +545,27 @@
Table \ref{ref_TableNANDTimes} contains the approximate SB search time for
different existing NAND flashes.
-\begin{center}
\begin{table}[h]
+\begin{center}
\begin{tabular}{lllllll}
-Flash type & Flash size & $N$ & $n$ & Sector read time & $S$ & SB find time\\
+\textbf{Type} & \textbf{Size} & $\bf N$ & $\bf n$ & \textbf{Sect. read} & $\bf
+S$ & \textbf{SB find}\\
\hline
Toshiba TC58DVM92A1FT & 64MB & 32 & 2 & $\sim$50$\mu$s & 22 & $\sim$1.1ms\\
ST Micro NAND08G-B & 1GB & 64 & 2 & $\sim$130$\mu$s & 25 & $\sim$3.3ms\\
Samsung K9K1G08X0B & 2GB & 64 & 2 & $\sim$70$\mu$s & 25 & $\sim$1.6ms\\
\end{tabular}
\caption{The superblock search time for different NAND flashes} \label{ref_TableNANDTimes}
-\end{table}
\end{center}
+\end{table}
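The per-level binary search costs add up; assuming the total is one anchor area search plus one search per anchor eraseblock level, i.e. $S = \log_2(2N) + 2 + n(\log_2 N + 2)$ (our reading of the step counts above), the $S$ column of the table is reproduced by:

```python
import math

def sb_search_steps(n, N):
    """Sector reads needed to locate the superblock: a binary search of the
    anchor area (2N sectors) plus one search per anchor eraseblock level."""
    return int(math.log2(2 * N) + 2 + n * (math.log2(N) + 2))

# Values from the table above; the search time is roughly steps times the
# sector read time, e.g. 22 * 50us ~= 1.1ms for the Toshiba chip.
print(sb_search_steps(2, 32))  # 22
print(sb_search_steps(2, 64))  # 25
```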
For larger flash chips which would utilize the $n = 3$ management scheme (no
such flashes exist at the moment) the SB search time would be about 4.3ms, provided
the flash characteristics are the same as those of the ST Micro chip (see table
\ref{ref_TableNANDTimes}).
-Note, the above time doesn't contain the ECC/CRC checking overhead as well as any
-other CPU overhead.
+Note, the calculated SB search time does not include the ECC/CRC checking
+overhead or any other CPU overhead.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%
@@ -532,44 +574,74 @@
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\section{Definitions}\label{ref_SectDefinitions}
\begin{enumerate}
-\item \textbf{Anchor erasable block, anchor area}
--- the first two \emph{good}
-erasable blocks of the JFFS3 partition which are reserved for the superblock management.
+\item \textbf{Anchor erasable blocks, anchor area}
+-- two \emph{good} erasable blocks at the beginning of the JFFS3 partition
+which are reserved for the superblock management.
+The JFFS3 superblock management scheme implies several levels of
+anchor eraseblocks (typically 2-4 levels). Anchor eraseblocks of level 2, 3, etc.
+have no fixed position on the flash.
+
+\item \textbf{Branch node} -- any node that is neither a leaf node, nor a twig
+node, nor the root node.
\item \textbf{Erasable block, eraseblock} -- the minimal erasable unit of the
flash chip.
-Although flash manuals usually call erasable blocks just "blocks" or
-"sectors", we prefer to be consistent with traditional Linux file system and
-block IO layer terminology and reserve the "block" and "sector" terms for
-other purposes.
-
-\item \textbf{Block} -- several adjancent sectors form a 4KiB block.
-Block size is equvalent to
-the size of the RAM page in most architectures.
-\item \textbf{Data area} -- all the JFFS3 partition excluding the anchor area.
+\item \textbf{Block} -- several adjacent sectors form a block.
+Block size is equivalent to the size of the RAM page of the machine
+(\texttt{PAGE\_SIZE} in the Linux kernel). The smallest block size is
+4KB.
-\item \textbf{Sector} -- the smallest writable unit of the \emph{flash chip},
-i.e. the NAND page in case of NAND. Multiple sectors compose a block. The
-minimal size of the sector is assumed to be 512KB.
+\item \textbf{Data area} -- the whole JFFS3 partition excluding those
+eraseblocks which are involved in the SB management
+(anchor eraseblocks, etc).
+
+\item \textbf{Dirty sector, dirty block} -- a sector or block containing
+invalid data; dirty space is reclaimed by the Garbage Collector.
+
+\item \textbf{Garbage Collector, GC} -- the JFFS3 subsystem which reclaims
+flash space by recycling dirty sectors and blocks.
+
+\item \textbf{Journal} --
+
+\item \textbf{Leaf node} -- any node from the leaf level of the tree (level 0).
+Leaf nodes contain only data and do not refer to other nodes.
-\item \textbf{SB, superblock} -- the special data structure which describes the whole
-JFFS3 filesystem. The superblock may reside anywhere within the JFFS3 partition
-(except of the anchor eraseblocks). When JFFS3 is being mounted, the
-superblock must be found. JFFS3 superblock takes one sector.
-
-\item \textbf{SB eraseblock} -- The SB eraseblock is the flash eraseblock that
-contains the JFFS3 superblock. The JFFS3 superblock management scheme implies
-several levels of SB eraseblocks - typically 2-4 levels. For example, in the
-3-level scheme, the anchor eraseblock refers the SB eraseblock level 3, which
-refers the SB eraseblock level 2, which refers the SB eraseblock containing the
-superblock.
+\item \textbf{Map} --
-\item \textbf{Node} -- the tree pile, i.e., the tree consists of nodes. The
+\item \textbf{Node} -- the basic element of the tree: the tree consists of nodes. The
root of the tree is the \textbf{root node}, the leafs of the tree are
-\textbf{leaf nodes}. Non-leaf nodes are called \textbf{branch nodes}.
-This terminology is very similar to the terminology used in
-Reiser4 [\ref{ref_Reiser4}].
+\textbf{leaf nodes}, the nodes of level 1 are \textbf{twig nodes}, and
+the nodes which are neither root, twig nor leaf are called \textbf{branch nodes}.
+
+\item \textbf{Pristine sector, pristine block} -- block or sector which
+contains valid data.
+
+\item \textbf{Sector} -- the smallest writable unit of the \emph{flash chip},
+i.e. the NAND page in case of NAND. Multiple sectors compose a block. The
+minimal size of the sector is assumed to be 512 bytes.
+
+\item \textbf{Space allocator} -- the JFFS3 subsystem which is responsible for
+allocation of eraseblocks, blocks and sectors. The wear-levelling algorithms
+are encapsulated here.
+
+\item \textbf{Static superblock} -- the first good erasable block of the JFFS3
+partition where the per-file system static data is stored. JFFS3 may only read
+it; it is created by external formatting tools.
+
+\item \textbf{Superblock, SB} -- the special structure which describes the
+whole JFFS3 filesystem. The dynamic part of the superblock may reside
+anywhere within the JFFS3 partition
+(except for the anchor area) while the static part is placed at a known,
+fixed position. The dynamic part contains the pointer to the root node, the
+journal eraseblocks, etc.
+
+\item \textbf{Tree} -- the main object the JFFS3 design revolves around. The
+JFFS3 tree is a $B^+$-tree where all the file system objects (files,
+directories, extended attributes, etc.) are stored.
+
+\item \textbf{Twig node} -- the nodes which reside one level above the leaf
+nodes (the levels of the tree are numbered from the leaf level to the root
+level).
\end{enumerate}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -581,6 +653,7 @@
\begin{enumerate}
\item \textbf{ECC} - Error Correction Code
\item \textbf{CRC} - Cyclic Redundancy Check
+\item \textbf{GC} - Garbage Collector
\item \textbf{JFFS2} - Journalling Flash File System version 2
\item \textbf{JFFS3} - Journalling Flash File System version 3
\item \textbf{MTD} - Memory Technology Devices
@@ -599,21 +672,24 @@
\begin{itemize}
\item \textbf{David Woodhouse} \texttt{<dwmw2 at infradead.org>} -- the author of JFFS2,
-answered a great deal of my questions about MTD, JFFS2, JFFS3 design approaches,
+answered a great deal of my questions about MTD, JFFS2 and JFFS3 design approaches,
English (e.g., "how to express this correctly in English"), etc.
\item \textbf{Joern Engel} \texttt{<joern at wohnheim.fh-wedel.de>} -- discussed a
-lot of JFFS3 design aspects with me. Many ideas described in this document were
-suggested by Joern.
+lot of JFFS3 design aspects with me. Some ideas described in this document were
+pointed out by Joern.
-\item \textbf{Thomas Gleixner} \texttt{<tglx at linutronix.de>} --
-proposed many interesting ideas which I'm exploiting in the JFFS3 design as
-well as answered many question, especially concerning flash hardware and
-low-level flash software.
-
-\item \textbf{Victor V. Vengerov} \texttt{<vvv at oktetlabs.ru>} -- My colleague
-from OKTET~Labs who spent much time discussing the JFFS3 design approaches with
-me and suggested many interesting ideas. He also reviewed my writings.
+\item \textbf{Nikita Danilov} \texttt{<nikita at clusterfs.com>} -- used to work
+at Namesys, where he implemented the ReiserFS and Reiser4 file systems.
+Nikita answered many of my questions about Reiser4 FS internals.
+
+\item \textbf{Thomas Gleixner} \texttt{<tglx at linutronix.de>} -- helped me with
+many MTD-related things, especially concerning flash hardware and low-level
+flash software. He also proposed some ideas which I make use of in the JFFS3 design.
+
+\item \textbf{Victor V. Vengerov} \texttt{<vvv at oktetlabs.ru>} -- my colleague
+from OKTET~Labs who spent a lot of time discussing the JFFS3 design approaches with
+me and suggested many interesting ideas. He also reviewed some of my writings.
\end{itemize}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
@@ -631,9 +707,8 @@
Samsung Flash memory products,\\
\url{http://www.samsung.com/Products/Semiconductor/Flash/index.htm}
-\item \raggedright \label{ref_Reiser4},
+\item \raggedright \label{ref_Reiser4}
Reiser4 File System, \url{http://www.namesys.com/}
-
\end{enumerate}
\end{document}
Index: sb_mgmt.eps
===================================================================
RCS file: /home/cvs/mtd/Documentation/jffs3/sb_mgmt.eps,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
--- sb_mgmt.eps 6 Jun 2005 12:17:02 -0000 1.1
+++ sb_mgmt.eps 1 Aug 2005 11:37:39 -0000 1.2
@@ -15,63 +15,63 @@
%0000000000000000000000000000000000000000000000000000
%000000000000000000000000000000000000000000000000007FFFFFFFFFFFFFF8000000000000
%00000000000000001FC000000000000000000000000000000000
-%000001FFFFFFFFFFFFFFC000000000000000000000000000007FFFFFFFFFFFFFF8000000000000
+%0001FFFFFFFFFFFFFF800000000000000000000000000000007FFFFFFFFFFFFFF8000000000000
%0000000000000000104000000000000000000000000000000000
-%00000181000408002040C000000000000000000000000000007040010204080038000000000000
+%000102040800202001800000000000000000000000000000007040010204080038000000000000
%0000000000000000104000000000000000000000000000000000
-%00000181000408002040C000000000000000000000000000007040010204080038000000000000
-%0000000000000000104000000000000C00080000000000000000
[...985 lines suppressed...]
+2300 3200 m 2175 3200 l 2175 2200 l 2425 2200 l 2425 3200 l 2300 3200 l
+pc
+2550 3200 m 2425 3200 l 2425 2200 l 2675 2200 l 2675 3200 l 2550 3200 l
+pc
+2800 3200 m 2675 3200 l 2675 2200 l 2925 2200 l 2925 3200 l 2800 3200 l
pc
-3350 3200 m 3225 3200 l 3225 2200 l 3475 2200 l 3475 3200 l 3350 3200 l
+3050 3200 m 2925 3200 l 2925 2200 l 3175 2200 l 3175 3200 l 3050 3200 l
pc
-3600 3200 m 3475 3200 l 3475 2200 l 3725 2200 l 3725 3200 l 3600 3200 l
+3300 3200 m 3175 3200 l 3175 2200 l 3425 2200 l 3425 3200 l 3300 3200 l
pc
-3850 3200 m 3725 3200 l 3725 2200 l 3975 2200 l 3975 3200 l 3850 3200 l
+3550 3200 m 3425 3200 l 3425 2200 l 3675 2200 l 3675 3200 l 3550 3200 l
pc
-4100 3200 m 3975 3200 l 3975 2200 l 4225 2200 l 4225 3200 l 4100 3200 l
+3800 3200 m 3675 3200 l 3675 2200 l 3925 2200 l 3925 3200 l 3800 3200 l
pc
8627 3147 m 8502 3147 l 8502 2147 l 8752 2147 l 8752 3147 l 8627 3147 l
pc
Index: sb_mgmt.pdf
===================================================================
RCS file: /home/cvs/mtd/Documentation/jffs3/sb_mgmt.pdf,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -r1.1 -r1.2
Binary files /tmp/cvsfFOo3h and /tmp/cvs9qsTTq differ