*BSD News Article 63714


Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!newshost.telstra.net!act.news.telstra.net!psgrain!newsfeed.internetmci.com!in2.uu.net!news.reference.com!cnn.nas.nasa.gov!gizmo.nas.nasa.gov!not-for-mail
From: truesdel@gizmo.nas.nasa.gov (Dave Truesdell)
Newsgroups: comp.unix.bsd.freebsd.misc,comp.os.linux.development.system
Subject: Re: The better (more suitable)Unix?? FreeBSD or Linux
Date: 14 Mar 1996 18:10:22 -0800
Organization: A InterNetNews test installation
Lines: 96
Message-ID: <4iajie$9fn@gizmo.nas.nasa.gov>
References: <4gejrb$ogj@floyd.sw.oz.au> <4gilab$97u@park.uvsc.edu> <4giqu8$aqk@park.uvsc.edu> <4gira2$a9d@park.uvsc.edu> <hpa.31321eee.I.use.Linux@freya.yggdrasil.com> <4h7t5i$qoh@park.uvsc.edu> <DnoqB4.2sy@pe1chl.ampr.org> <4hirl9$nr7@gizmo.nas.nasa.gov> <Dnu8FD.CK2@pe1chl.ampr.org>
NNTP-Posting-Host: gizmo.nas.nasa.gov
X-Newsreader: NN version 6.5.0 #61 (NOV)
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.freebsd.misc:15565 comp.os.linux.development.system:19516

rob@pe1chl.ampr.org (Rob Janssen) writes:
>In <4hirl9$nr7@gizmo.nas.nasa.gov> truesdel@gizmo.nas.nasa.gov (Dave Truesdell) writes:

>>The point you seem to want to ignore is, while data integrity is not
>>guaranteed, it only affects those files being written at the time of a
>>crash.  If you don't guarantee metadata integrity, you could lose *every*
>>file on an active filesystem.

>Please show us how that can happen, and how sync metadata is going to
>avoid it.  I think you are only spreading FUD.
>(or is there some inherent fragility in FFS that is not in the classic
>UNIX filesystems and ext2fs?)

The "classic" UNIX filesystem?  As opposed to those in V6 or V7?  What makes
you think the "classic" UNIX filesystem didn't have the same need for
metadata integrity?  The only difference between the "classic" days and
today is that today's systems tend to be much larger and stress a
filesystem's design and implementation to a greater extent.  And *that*
tends to exacerbate any weaknesses in either.

How would sync metadata avoid these problems?  First, as has been pointed out
by others in this thread, what avoids these problems is metadata integrity.
Sync metadata update is just one method of maintaining this.  Other
mechanisms, such as *ordered* async updates, would do as well.  Now, I
haven't read the code for ext2fs, so for all I know it could maintain
metadata integrity by ordering its asynchronous writes.  So, what does
metadata integrity mean for
your filesystem?  I'll give two examples, one from experience, the second a
simple thought experiment.
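
Before the examples, here's a toy of what "ordered" means in practice.
This is not FFS, ext2fs, or anyone's real update code; "toydisk" is just a
regular file standing in for the raw disk, and fsync() is used only to make
the ordering visible (a real ordered-async scheme would record dependencies
between dirty buffers instead of forcing each write out):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

#define INODE_OFF  0            /* offset of the "inode block" */
#define DIR_OFF    512          /* offset of the "directory block" */

int
main(void)
{
        char inode[512], dirent[512];
        int fd = open("toydisk", O_RDWR | O_CREAT, 0644);

        if (fd < 0) {
                perror("toydisk");
                return 1;
        }
        memset(inode, 0, sizeof inode);
        memset(dirent, 0, sizeof dirent);
        strcpy(inode, "inode 7: regular file, one block at blk 42");
        strcpy(dirent, "direntry: \"B\" -> inode 7");

        /* Step 1: the initialized inode must be safely on disk first... */
        pwrite(fd, inode, sizeof inode, INODE_OFF);
        fsync(fd);

        /* Step 2: ...only then may the name that refers to it go out.
         * A crash between the two steps leaves, at worst, an allocated
         * but unreferenced inode -- never a name pointing at garbage. */
        pwrite(fd, dirent, sizeof dirent, DIR_OFF);
        fsync(fd);

        close(fd);
        return 0;
}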

First case:  Restoring a large filesystem on a large machine.
Here's an example of one of those 8 hour restores I mentioned.  The setup, a
500GB disk array, mounted async; 1GB of memory (>500MB of it allocated to the
buffer cache); ~1.5 million i-nodes to restore; running the restore in single
user (no update daemon running).  If the restore had been running for several
hours, and a hardware glitch crashed the machine, what state do you think the
filesystem would be in?  In this situation, data blocks, which are written
only once, would age quickly and get flushed to disk as new data came in.  How
about indirect blocks?  They can be updated multiple times as a file grows, so
they don't age quite as fast.  Directory blocks?  They can get written
multiple times, as new files and directories are created, so they don't age
quite so fast either, and are less likely to get flushed to disk.  The same
is true for inode blocks.  So, what situation are you left with?  Unless
all the metadata gets written to disk, you may have most of your data safely
on disk, but if the metadata hasn't been flushed, you may not know which
i-nodes have been allocated; which data blocks have been allocated; which
data blocks belong to which i-nodes, etc.
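
If the aging argument sounds hand-wavy, here's a toy model of it.  This is
not any real kernel's buffer cache; the tiny sizes and the pure-LRU policy
are invented just to make the effect visible.  Each "file" created dirties
one fresh data block (written exactly once) and re-dirties the same
directory block and the same inode block:

#include <stdio.h>
#include <string.h>

#define CACHE_SLOTS 8                   /* tiny cache, overflows quickly */
#define NFILES      32

static char cache[CACHE_SLOTS][32];     /* LRU order: slot 0 is oldest */
static int  used;
static int  flushed_data, flushed_meta;

static void
dirty(const char *name)
{
        char tmp[32];
        int i;

        /* already cached?  move it to the most-recently-used end */
        for (i = 0; i < used; i++) {
                if (strcmp(cache[i], name) == 0) {
                        strcpy(tmp, cache[i]);
                        memmove(cache[i], cache[i + 1],
                            (used - i - 1) * sizeof cache[0]);
                        strcpy(cache[used - 1], tmp);
                        return;
                }
        }

        /* cache full: "flush" the oldest dirty block to disk */
        if (used == CACHE_SLOTS) {
                if (strncmp(cache[0], "data", 4) == 0)
                        flushed_data++;
                else
                        flushed_meta++;
                memmove(cache[0], cache[1], (used - 1) * sizeof cache[0]);
                used--;
        }
        strcpy(cache[used++], name);
}

int
main(void)
{
        char name[32];
        int i;

        for (i = 0; i < NFILES; i++) {
                sprintf(name, "data-%d", i);    /* fresh data block */
                dirty(name);
                dirty("dirblock");              /* re-dirtied every time */
                dirty("inodeblock");            /* re-dirtied every time */
        }
        printf("flushed to disk: %d data, %d metadata blocks\n",
            flushed_data, flushed_meta);
        return 0;
}

Run it and every block flushed is a data block; the directory and inode
blocks, being re-dirtied on every create, never reach the old end of the
list, so they're still only in memory when the "crash" comes.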

How would maintaining metadata integrity have changed things?  Just like above,
most of the data would have been flushed to disk, so no great difference there.
What would be different, is that the file structure itself would have been
maintained in a sensible state on disk, instead of a random patchwork of
inconsistent information.

While the average system running *BSD or Linux is several orders of magnitude
smaller, the situation is different only in degree, not in kind.  The large
buffer cache, and the lack of a running update daemon, didn't create the
problem; they only exaggerated it by allowing a larger number of
inconsistencies to accumulate.  Smaller caches and periodic syncs only
narrow the window of vulnerability; they don't close it.
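
For reference, the classic update daemon is (roughly, and from memory --
not any particular system's source) nothing more than this; anything
dirtied since the last pass, metadata included, lives only in the cache
until the next one:

#include <unistd.h>

int
main(void)
{
        for (;;) {
                sync();         /* schedule all dirty blocks for writing */
                sleep(30);      /* traditional 30-second interval */
        }
        /* NOTREACHED */
}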

BTW, just to see what would happen, I tried to run an fsck on the partial
filesystem.  After what seemed like several hundred screens of just about
every error that fsck could detect, it finally dumped core.

Here's a thought experiment.  Let's take a small filesystem, with only one
non-zero length file in it.  Call it file "A".  Delete file "A" and create a
second non-zero length file named "B".  Now, crash the system, without
completely syncing.  When you go back and examine that filesystem, what will
you find?  Will you find file "A" still in existence and intact?  Will you
find file "B" in existence and intact?  What would you find if one of "A"'s
blocks had been reused by "B"?  If the integrity of the metadata is not
maintained, you could find file "A" with a chunk of "B"'s data in it.  The
situation gets worse if the reused block is an indirect block.  How would the
system interpret data that overwrote an indirect block?
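
The same thought experiment as a toy simulation, with arrays standing in
for the disk and the cache.  The only policy modelled is the one above:
data writes go through to disk, metadata updates sit in the cache and are
lost in the crash:

#include <stdio.h>
#include <string.h>

#define NBLOCKS 8

static char disk_blocks[NBLOCKS][64];   /* data blocks: written through */
static int  disk_inode_A = 5;           /* on disk: A's data is in blk 5 */
static int  disk_inode_B = -1;          /* on disk: B was never recorded */

int
main(void)
{
        int cache_inode_A, cache_inode_B;

        /* file A exists; its single data block is block 5 */
        strcpy(disk_blocks[5], "contents of file A");

        /* delete A: in the cache, A's inode is freed... */
        cache_inode_A = -1;

        /* ...create B, reusing the just-freed block 5.  The data write
         * goes through to disk; the inode updates stay in the cache. */
        cache_inode_B = 5;
        strcpy(disk_blocks[5], "contents of file B");

        /* CRASH: cache_inode_A and cache_inode_B never reach the disk. */
        (void)cache_inode_A;
        (void)cache_inode_B;

        /* After reboot, the on-disk metadata still says A lives in
         * block 5 and has never heard of B at all. */
        if (disk_inode_A != -1)
                printf("\"A\" exists, contents: %s\n",
                    disk_blocks[disk_inode_A]);
        if (disk_inode_B == -1)
                printf("\"B\" does not exist\n");
        return 0;
}

It prints that "A" still exists and now contains B's data, and that "B"
doesn't exist at all -- exactly the kind of patchwork fsck is left to
sort out.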

>>If you ever had to manage systems where a restore takes 8 hours to run, even
>>when mounted async, you might care more about having a filesystem that
>>maintained metadata integrity.

>I have used and maintained UNIX systems for well over 12 years, I have
>had to come back in over the weekend to move 80MB filesystems or to wait hours
>just to load the base system, I have seen many interesting things happen
>after system crashes, but I *never* have seen a system or even heard of
>a system that lost all its files after a simple crash.

How many of those systems didn't attempt to maintain consistent metadata?
I've run V6 on a PDP-11/34 in a half meg of RAM, using a pair of RK05's for
a whopping 10MB for the filesystem.  I've written trillion byte files as part
of testing new modifications to the filesystem code.  I've tested filesystems
that claimed to be great improvements over the FFS, which I've been able to
trash (the filesystem could *NOT* be repaired) simply by writing two large
files simultaneously.  I've seen many people who think they've invented a
"better" filesystem, and I've seen how often they've been wrong.
-- 
T.T.F.N., Dave Truesdell	truesdel@nas.nasa.gov/postmaster@nas.nasa.gov
Wombat Wrestler/Software Packrat/Baby Wrangler/Newsmaster/Postmaster