*BSD News Article 62295

Newsgroups: comp.unix.bsd.freebsd.misc,comp.os.linux.development.system
Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!nntp.coast.net!howland.reston.ans.net!tank.news.pipex.net!pipex!peer-news.britain.eu.net!newsfeed.ed.ac.uk!dcs.ed.ac.uk!newshost!sct
From: "Stephen Tweedie" <sct@dcs.ed.ac.uk>
Subject: Re: The better (more suitable)Unix?? FreeBSD or Linux
In-Reply-To: "Jordan K. Hubbard"'s message of Sat, 10 Feb 1996 01:00:36 -0800
X-Nntp-Posting-Host: calvay.dcs.ed.ac.uk
Message-ID: <SCT.96Feb15162320@calvay.dcs.ed.ac.uk>
Sender: cnews@dcs.ed.ac.uk (UseNet News Admin)
Organization: University of Edinburgh Dept. of Computer Science, Scotland
References: <4er9hp$5ng@orb.direct.ca> <strenDM7Gr4.Cn2@netcom.com>
	<DMD8rr.oIB@isil.lloke.dna.fi> <4f9skh$2og@dyson.iquest.net>
	<4fg8fe$j9i@pell.pell.chi.il.us> <311C5EB4.2F1CF0FB@FreeBSD.org>
Date: Thu, 15 Feb 1996 16:23:20 GMT
Lines: 216
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.freebsd.misc:14458 comp.os.linux.development.system:18138

Hi,

This whole thread has been simmering for so long that I thought I'd
better put in my tuppence worth.

In article <311C5EB4.2F1CF0FB@FreeBSD.org>, "Jordan K. Hubbard"
<jkh@FreeBSD.org> writes:

> I don't think that the async-vs-sync metadata write issues are worth
> debating since the whole topic is truly too subjective for meaningful
> discussion.

Agreed, wholeheartedly.

Sync-metadata is bad for performance.  It is better for recovery after
crashes (and those who argue otherwise don't know what they are
talking about).  You pays your money and you makes your choice.

However, the reliability difference is going to be insignificant for
many people.  You still get deferred writes of file data even under
BSD/ffs, and for many people the difference between losing 15 seconds'
worth of data with correct directory names, and losing 15 seconds'
worth of data plus 2.5 seconds' worth of names, is not worth worrying
about.  Either your data matters, in which case it is backed up anyway
:), or it doesn't, in which case you may as well go for the performance
choice.  And with today's hardware reliability --- an order of
magnitude or more better than it was when ffs was first designed ---
the issue is, for many people, simply not very important.

Regarding Re: The better (more suitable)Unix?? FreeBSD or Linux; Terry
Lambert <terry@lambert.org> adds:

> "Jordan K. Hubbard" <jkh@FreeBSD.org> wrote:
> ] I don't think that the async-vs-sync metadata write issues are worth
> ] debating since the whole topic is truly too subjective for meaningful
> ] discussion.

> Bah Humbug.  See other posts in this thread.

> When anyone claims some operating system component is not subject
> to objective, mathematical analysis, they are mistaken.

> Objective, not subjective.

That's not what Jordan said.  Of course you can analyse the failure
modes of a given filesystem.  You can put bounds on the potential for
data loss or for silent corruption.  However, the question of whether
a given level of risk is acceptable IS subjective.
That's what the debate is about --- is the extra risk of async
metadata writes worth the performance gain?  That is a subjective
question, which is why users should be given the choice between the
two regimes.

Regarding Re: async or sync metadata [was: FreeBSD v. Linux];
mlelstv@comma.rhein.de (Michael van Elst) adds:

> In <4fjodc$o8j@venger.snds.com> grif@hill.ucr.edu (Michael Griffith) writes:

>> sync metadata with async data is not just slower,
>> it is LESS SAFE.

> I think this discussion is pretty ancient and the result was that
> "async data" doesn't mean that the metadata is updated before the
> data is committed but _after_.

It *can* mean this, but it usually doesn't.  One amusing anecdote
concerns a research project which ordered metadata writes after data
writes, for extra reliability.  A large database was run on this
filesystem, and was used intensively for a week before it crashed.
There was *always* at least one dirty block in the database file
during this period, and the deferred metadata writes meant that the
filesystem had therefore refused to write out ANY of the file's
top-level metadata all week. :)  The result after restore was a
perfectly consistent but empty file.

Normally, the aim of synchronous metadata writes is to ensure internal
consistency of the filesystem's metadata after crash recovery, not to
guarantee atomic writing of file data.

> The result is that the metadata always reflects a consistent state,
> whether the data is written async or not. With async data it doesn't
> necessarily reflect the last state though.

With async, it doesn't necessarily reflect ANY entirely consistent
state, and the fsck tool must deal with any inconsistencies as best it
can.

Regarding Re: The better (more suitable)Unix?? FreeBSD or Linux; Terry Lambert <terry@lambert.org> adds:

> hosokawa@mt.cs.keio.ac.jp (HOSOKAWA Tatsumi) wrote:

> Yes, async updates of metadata without another mechanism to
> satisfy the ordering requirements of deterministic recoverability
> is far more dangerous than sync.

> ] BTW, I've used BSD filesystem for years, I haven't see the dead file's
> ] garbage appeared as another file even after severe crashes (I'm a
> ] device driver writer of FreeBSD and I frequently experience unclean
> ] shutdowns).  I've thought that these premature files are truncated by
> ] fsck.  Is it wrong?

> No, you are right.  The block allocation map/bitmap is also updated
> synchronously.  The idea that a file could contain bogus data that
> belonged to another (sensitive) file is nothing more than a red
> herring (ie: something intended to introduce uncertainty).

No, it is a valid objection if you are doing only synchronous updates
of the block bitmaps.  If you are careful, you can easily ensure that
no files share blocks after a crash.  However, there is a genuine
danger that a new file will reuse a block *previously* occupied by
some other, now deleted, file.  You can only prevent this either by
zeroing out file contents on unlink, by deferring the write of the
indirect pointer until after the write of the new data block, or by
deferring the write of the new block allocation in the block bitmap
until after the data has been written.
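That hazard is easy to show with a toy crash simulation.  This Python
sketch (hypothetical model, not any real filesystem's code) writes the
bitmap and inode pointers synchronously but defers the data, then
"crashes" before the data flush:

```python
# Toy crash simulation: block bitmap and inode pointers are written
# synchronously, file data asynchronously (illustrative names only).

class Disk:
    def __init__(self):
        self.blocks = {}          # block number -> contents on disk
        self.inodes = {}          # filename -> list of block numbers
        self.pending_data = {}    # async data writes not yet flushed

    def sync_write_inode(self, name, blocks):
        self.inodes[name] = list(blocks)   # metadata: synchronous

    def async_write_data(self, block, data):
        self.pending_data[block] = data    # data: deferred

    def flush(self):
        self.blocks.update(self.pending_data)
        self.pending_data.clear()

    def crash(self):
        self.pending_data.clear()          # unflushed data is lost

disk = Disk()
# An old file writes sensitive data to block 7, is flushed, then deleted.
disk.sync_write_inode("secret.txt", [7])
disk.async_write_data(7, "top secret payroll data")
disk.flush()
del disk.inodes["secret.txt"]              # unlink: block freed (synced)

# A new file reuses block 7; its metadata hits disk before its data.
disk.sync_write_inode("notes.txt", [7])
disk.async_write_data(7, "harmless notes")
disk.crash()                               # power fails before the flush

# After recovery, notes.txt points at the old file's contents.
print(disk.blocks[7])                      # "top secret payroll data"
```

Any of the three orderings listed above breaks this chain: either
block 7 is zeroed at unlink, or the inode/bitmap writes wait until the
new data is actually on disk.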


Regarding Re: The better (more suitable)Unix?? FreeBSD or Linux; Terry
Lambert <terry@lambert.org> adds:

> Since the block allocation bitmap is updated *after* the data
> blocks are updated, a crash with written inode metadata for a
> file referring to previously used but not overwritten blocks
> will result in the fsck removing the block references.

Hmm.  Do you do this for all files?  I thought ffs only behaved this
way for O_SYNC files, even with synchronous metadata writing enabled.


In article <4fm0d7$ivs@park.uvsc.edu>, Terry Lambert
<terry@lambert.org> writes:

> BSD is potentially slower on stat operations because it obeys
> the POSIX mandate of "Shall be Updated".  There are situations
> where POSIX mandates "shall be marked for update".  Directory
> access times are not one of these, unless you cop out and claim
> that directories are not files and claim that they do not have
> to obey file time semantics because of this.

I've found that the biggest performance penalty when doing a series of
stat()s in a large directory is the cost of the directory lookups at
each step.  I'm not sure if BSD does anything fancy here, but ext2fs
has a directory name cache to help this type of operation.  The
readdir()s add their returned dirents to the cache, and if a stat()
follows quickly then the name to inode mapping can be done from the
cache without another linear scan of the directory.
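The idea can be sketched in a few lines of Python (structure and names
are illustrative only, not the actual kernel code): readdir() primes a
name-to-inode cache as a side effect, so the stat()s that typically
follow never pay for a linear directory scan.

```python
# Sketch of a directory name cache of the kind described for ext2fs
# (hypothetical structure, not the kernel's implementation).

class Directory:
    def __init__(self, entries):
        self.entries = entries            # on-"disk" list of (name, inode)
        self.name_cache = {}              # name -> inode, primed by readdir
        self.linear_scans = 0             # count the expensive lookups

    def readdir(self):
        for name, inode in self.entries:
            self.name_cache[name] = inode # side effect: prime the cache
            yield name

    def lookup(self, name):
        if name in self.name_cache:       # fast path: cache hit
            return self.name_cache[name]
        self.linear_scans += 1            # slow path: scan the directory
        for n, inode in self.entries:
            if n == name:
                return inode
        raise FileNotFoundError(name)

d = Directory([(f"file{i}", 1000 + i) for i in range(10000)])
# An "ls -l"-style workload: readdir, then look up every name.
for name in list(d.readdir()):
    d.lookup(name)                        # every lookup hits the cache
print(d.linear_scans)                     # 0
```

Without the cache, the same workload would cost one linear scan per
name --- quadratic in the directory size.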

In article <4fm2b1$ivs@park.uvsc.edu>, Terry Lambert
<terry@lambert.org> writes:

> Synchronous writes are actually unnecessary, as long as a delayed
> ordered write mechanism is employed to ensure idempotence.  They
> are just the easiest way to implement ordering guarantees.

> It is the ordering guarantees that are important, not the
> synchronicity or non-synchronicity of the underlying mechanism
> for making those guarantees.

Quite.  I'm hoping to have that implemented for ext2fs at some point.
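The mechanism Terry describes can be sketched as a write queue that
records dependencies between buffers and only lets a buffer out once
everything it depends on is on disk.  This is a minimal illustrative
Python model (all names invented), not soft updates or any real
implementation:

```python
# Minimal sketch of delayed ordered writes: ordering guarantees
# without synchronous I/O (hypothetical names throughout).

class OrderedWriteQueue:
    def __init__(self):
        self.buffers = {}        # buffer id -> payload awaiting write
        self.depends_on = {}     # buffer id -> ids that must go out first
        self.flushed = []        # order in which buffers reached "disk"

    def queue(self, buf_id, payload, depends_on=()):
        self.buffers[buf_id] = payload
        self.depends_on[buf_id] = set(depends_on)

    def flush_all(self):
        # Repeatedly flush any buffer whose dependencies are on disk.
        while self.buffers:
            ready = [b for b in self.buffers
                     if self.depends_on[b] <= set(self.flushed)]
            if not ready:
                raise RuntimeError("dependency cycle")
            for b in ready:
                self.flushed.append(b)
                del self.buffers[b]

q = OrderedWriteQueue()
# Allocating a block for a file: inode must not reach disk before the
# bitmap and data, and the bitmap must not precede the data.
q.queue("inode", "points at block 42", depends_on=["data42", "bitmap"])
q.queue("data42", "file contents")
q.queue("bitmap", "block 42 allocated", depends_on=["data42"])
q.flush_all()
print(q.flushed)   # ['data42', 'bitmap', 'inode']
```

All three writes stay asynchronous; only their relative order is
constrained, which is exactly the property that matters for
deterministic recovery.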

> A related paper, Eric H. Herrin II and Raphael A. Finkel's "The
> Viva File System" goes into some detail on what constitutes an
> idempotent vs. a non-idempotent operation, and where you must
> guarantee order atomicity -- as does the UCB "SPRITE" paper.

Indeed.  The VIVA paper formed the basis of much of ext2fs's
performance code, and is also one of the clearest expositions I've
seen on the mechanics of ordered writing.

> Note in that case that async I/O on non-metadata data will
> potentially cause it to be corrupt anyway -- just not in such a
> way as to cause the file system to be inconsistent, and therefore
> unrunnable.

I'm not sure what you are saying.  Are you conceding that ufs can
leave some data contents corrupt after recovery, or just before fsck?

> ]  And my experience running news on filesystems without
> ] synchronous metadata writes certainly hasn't shown any
> ] vulnerability, even when I've been running beta software like
> ] a software disk array that showed the distressing tendency to
> ] lock up and die when being driven hard.

> Most likely you haven't hit the window.  The disk syncing window
> on ext2fs is smaller than the UFS window (ie: it is synced more
> frequently in an attempt to foreshorten the window).  This reduces
> the probability in direct proportion to the MTBF of your power
> supply or other event that may cause a spontaneous reboot (or
> require a user-directed reset without a normal shutdown).

> This does not mean that the window is not there.

Granted.  But the window for corruption of file contents is the same
as it is on ufs, and e2fsck is extremely good at ensuring internal
consistency of the metadata after recovery (except that, obviously, it
just cannot guarantee that the recovered state matches any
serialisation of the metadata operations before recovery).

> As far as successful recovery following a soft failure: all file
> system recovery tools will, when run, result in a consistent file
> system structure.  The question is what is the probability of
> arriving at the "correct" consistent state given a large number
> of "potential" consistent states resulting from the permutations
> of predicted outcome for all potential outstanding metadata
> operations at the time of the crash.

What it comes down to is again a subjective question --- how much does
it matter that the consistent state you achieve after fsck matches one
particular state.  For many people, the important thing is that it IS
consistent after fsck, and it doesn't matter too much if 5 seconds'
worth of new files end up either lost or in /lost+found after a crash.

Cheers,
 Stephen.
---
Stephen Tweedie <sct@dcs.ed.ac.uk>
Department of Computer Science, Edinburgh University, Scotland.