On the "safe filesystem" and write() topic
Vipin Malik
vipin at embeddedlinuxworks.com
Wed Jul 4 10:10:22 EDT 2001
Hi,
At 01:53 AM 7/4/2001 +0200, Bjorn Wesen wrote:
>I designed the JFFS specifications, log layout and GC method in the first
>place and me and Finn put a lot of thought into it while implementing so
>please consider some of these late night ramblings:
Definitely! Thoughts, discussions, suggestions most welcome and thank you
for reading my ramblings!
>The initial requirement was that a small partition of configuration files
>(the /etc directory to be more specific) should be able to reside in flash
>and be completely safe from inconvenient power-outs or crashes.
>
>It is my opinion (of course) that JFFS solves this in a manner as good as
>possible given the standard Linux VFS API. This means that when you
>rewrite a configuration file, you write the new one to another file and do
>a rename over the old once you're ready.
Agreed. Of course as long as the config files are small and relatively few and
not changing that often. Your example of config files in /etc fits the
bill perfectly.
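For anyone following along, here is a minimal sketch of the write-then-rename pattern described above. It is only an illustration under stated assumptions: the function name atomic_rewrite and the ".tmp" suffix are made up for this example, and the temporary file must live on the same filesystem as the target for the rename to be atomic.

```python
import os

def atomic_rewrite(path, data):
    """Replace the contents of `path` with `data` (bytes) via write-then-rename."""
    tmp_path = path + ".tmp"  # hypothetical naming; must be on the same filesystem
    fd = os.open(tmp_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while len(view) > 0:
            written = os.write(fd, view)  # write() may accept fewer bytes than asked
            view = view[written:]
        os.fsync(fd)  # push the new contents out before the rename
    finally:
        os.close(fd)
    # POSIX rename() atomically replaces the old name with the new file.
    os.rename(tmp_path, path)
```

After a power fail at any point, a reader should find either the complete old file or the complete new one under the original name, plus possibly a stale temporary file to clean up.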
> Technically JFFS is based on a
>log structure consisting of VFS operations, and this is the best you can
>do while not involving the application more than what standard VFS gives
>you. VFS operations are not "transactions" in the high-level sense though.
Agreed again.
>In our embedded products this is handled by a configuration handling
>daemon similar to linuxconf, which caches parameters and knows how to
>rewrite configuration files atomically (just like any other sane Unix
>program does it). There is no need for any transactional semantics for
>small configuration files.
This is surely the preferred way to do it for such files. As a matter of
fact it is most preferred for small config files. I think that I need to
explicitly mention it in one of my ramblings on my site ;)
> We sell a lot of these products and I certainly
>disagree with Vipin's comment on his website that it's impossible to use
>JFFS in embedded products :)
Wait a minute! Where did I say that in the context of config files? And if I
did, I need to go and correct it (so please send me an email).
I think I surely said it in the context of JFFS (not JFFS2) losing integrity
(including losing files at random) during power fail tests, and I stand behind
those results till proven otherwise. Have you guys tested the JFFS fs under
power fail? What version are you using and what were your results?
> Log-files are not usually kept in flash and
>if they are they don't need anything more advanced than normal rotation
>and if a crash occurs, it's no big deal if the last line gets cut off
>completely or in the middle...
Again agreed. Log files are, of course, of the "append" type, and a simple
scan of the log file on startup will enable one to detect and remove the
offending half-written last line.
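A rough sketch of such a startup scan, assuming newline-terminated log records (the helper name drop_partial_last_line is hypothetical, not something from JFFS or any library):

```python
def drop_partial_last_line(path):
    """Truncate `path` after its last newline; return the number of bytes removed."""
    with open(path, "rb+") as f:
        data = f.read()  # fine for small logs; a big log would scan back from the end
        keep = data.rfind(b"\n") + 1  # 0 when the file holds no complete line
        removed = len(data) - keep
        if removed:
            f.truncate(keep)
    return removed
```

Anything after the last newline is treated as a line that was cut off by the crash and is discarded.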
>It is difficult (if not impossible) in any consistent way to handle the
>case with random write()'s inside an already existing file. The filesystem
>needs to "roll back" to any pre-existing state but it then needs to
>know what the desired state would be. What we do now is make sure the
>filesystem itself is never corrupt even if a file was under writing.
JFFS2 does that (not getting corrupt) under random power fail. JFFS
attempts to do that, but there is a bug in the latest version in CVS that
causes files to disappear at random in power fail testing. This happened
anywhere between 600+ and 1300+ power fails into a test run. I've mentioned
this specifically in my "JFFS: A Practical guide" on my site.
It's quite possible that *I* introduced this bug myself when I was mucking
around with JFFS trying to fix other problems. But consider: when I started
testing JFFS, it would never last more than 10 power cycles without a failed
mount on power up, and it had other issues like leaking memory to the point
that the kernel panicked (again on mount after a power fail). When I left it
with my patches, I was getting at least 600+ (and once 1300+) async power
fails without any problem. Which version would you rather go with? With the
maturing of JFFS2, IMHO folks should be encouraged to migrate to JFFS2 if
possible (I am). Is there anything that JFFS gives you that you don't get
with JFFS2?
>The problems arise from the vague definition of what the desired state
>would be - is it the data before the last write(), and what happens if you
>receive a signal ?
Isn't it the same case as what happens when you get a power fail? (please
pardon my lack of understanding of signals in kernels. Can the execution
that was interrupted with a signal ever resume at the interrupted point?)
> Writes to mmap'ed pages can't use that mechanism, and
>you'll be stuck with using write()'s when you really probably want to use
>libc wrappers like fwrite and fprintf.
That's true, but it's a tradeoff: if the task wants reliable writes to the
fs, it must not use any lib calls. As a matter of fact, that's the last
thing you want to use anyway, as these wrappers buffer the program's writes,
defeating the purpose of the default O_SYNC behavior of the JFFS(2) fs.
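To make the point concrete, here is a small sketch of writing through a raw file descriptor with no user-space buffer in between. The function name write_record_sync is hypothetical, and O_SYNC is requested explicitly here only so the sketch behaves the same on filesystems that are not synchronous by default.

```python
import os

def write_record_sync(path, record):
    """Append `record` (bytes) using write() directly on an O_SYNC descriptor."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_APPEND | os.O_SYNC
    fd = os.open(path, flags, 0o644)
    try:
        view = memoryview(record)
        while len(view) > 0:
            n = os.write(fd, view)  # goes straight to the kernel, no stdio buffer
            view = view[n:]
    finally:
        os.close(fd)
```

With fwrite/fprintf-style wrappers the data can sit in the process's buffer until a flush, so a power fail can lose writes the program believes it has already made.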
>I agree that if you need a binary database which is big so that you cannot
>rewrite it when you update something, you'll need to rethink. Either just
>split the database in smaller files, or you'll need a transaction marker
>API down to the filesystem (an ioctl pair was suggested somewhere I
>think). I don't think trying to tweak write() would lead to anything
>generally useful though.
See, we agree on all the same points :)
The main issue here is not only a BIG database, but also one with a lot of
points in it that are being updated frequently. Each file has an overhead
(as well as a max # of files limit on the fs). How reasonable is it to put
5000 8-byte files on a 1MB JFFS(2) fs? This data would occupy <50KB
in a single (db) file, vs at least 5000*64 (file overhead) + 5000*8 = 360KB
as separate files, assuming that you can even fit 5000 files on your partition.
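The arithmetic, spelled out as a quick check. The 64-byte per-file overhead is the figure used in this email; the variable names are just for illustration.

```python
RECORDS = 5000
RECORD_SIZE = 8          # bytes of payload per data point
PER_FILE_OVERHEAD = 64   # bytes of per-file overhead, figure quoted in this email

single_file_bytes = RECORDS * RECORD_SIZE
separate_files_bytes = RECORDS * PER_FILE_OVERHEAD + RECORDS * RECORD_SIZE

print(single_file_bytes)     # 40000
print(separate_files_bytes)  # 360000
```

So the payload alone is 40,000 bytes, while storing each point as its own file costs at least 360,000 bytes, more than a third of a 1MB partition.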
>The kernel-level transactional extension would probably be quite difficult
>to get consistent also, because Linux VFS does not know about it yet (this
>is eventually changing with the integration of the general journalling
>layer I guess). I get a headache thinking about it, perhaps it's possible
>perhaps it's not; perhaps this code already exist in the other journalling
>filesystems, perhaps it does not.
I cannot speak intelligently about this so I'll keep my mouth shut :)
>With regards to Kyle's question below though, the answer is certainly that
>he can do as he says but use the rename() operation and keep them on a
>single partition. There is no need for anything more advanced..
For a lot of solutions, this is certainly true. OTOH, the current blocking
times of JFFS2 (I didn't do this test on JFFS, but no reason to be different
methinks) make putting any config or db directly on the fs unreasonable. (If
you've been following my jitter tests recently, JFFS2 can block for 10's of
seconds when it is getting quite full.)
>(All this assumes other more technical problems are solved of course like
>the nasty surprises we've had with some flashes getting bits halfway
>erased...)
This "flipping bits" syndrome (TM Vipin Malik :) is solved reliably for
JFFS2. JFFS2 has passed 15K+ power fails without any failures that I could
detect or was looking for. IMHO it cannot be solved reliably for JFFS
because JFFS does not handle (or know about) erase sectors. I've worked around
it by re-reading the same sector 4 times. See the big note above
scan_for_partially_erased_sectors() (or something like that) in jffs/intrep.c
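The idea behind the re-read can be sketched as follows. This is not the code in jffs/intrep.c, only an illustration of the technique: read_sector is a hypothetical callable standing in for the flash read, and a sector counts as suspect if repeated reads disagree.

```python
def sector_reads_stable(read_sector, sector, reads=4):
    """Return True if `reads` consecutive reads of `sector` all return the same bytes."""
    first = read_sector(sector)
    # A partially erased sector may return different bits from one read to the next.
    return all(read_sector(sector) == first for _ in range(reads - 1))
```

A sector that fails this check would be treated as partially erased rather than trusted as either valid data or clean flash.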
To a large extent, we (I) have allowed the thought of having
transactions in JFFS(2) to lapse. Maybe this is not such a bad thing after all,
and with each discussion I better appreciate the cons of having
transactions in the fs. Anyway, there is a new project being started to
develop an embedded db (or modify an existing one, mird) to
provide this transaction level processing for embedded systems on
JFFS(2). In addition to providing transactions it will also provide a
caching layer that will allow the transaction log to be put on *another*
non-volatile medium if such is available in your system. The big advantage
of this will be 0 latency, transaction protected, power fail safe writes
available to programs that use this interface. As a freebie it will also
provide for key/value type store/retrieve from a (small) hash database.
Read more about it at:
http://www.embeddedlinuxworks.com/articles/db_project.html
To sign up for the development mailing list, go to:
http://www.embeddedlinuxworks.com/cgi-bin/signup/signup-dev.cgi
Thanks for reading and your thoughts.
Regards,
Vipin
http://www.EmbeddedLinuxWorks.com
More information about the linux-mtd
mailing list