*BSD News Article 61859


Return to BSD News archive

Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!news.cs.su.oz.au!metro!metro!munnari.OZ.AU!news.ecn.uoknor.edu!news.eng.convex.com!newshost.convex.com!bcm.tmc.edu!news.msfc.nasa.gov!newsfeed.internetmci.com!swrinde!sdd.hp.com!hamblin.math.byu.edu!park.uvsc.edu!usenet
From: Terry Lambert <terry@lambert.org>
Newsgroups: comp.unix.bsd.freebsd.misc,comp.os.linux.development.system
Subject: Re: The better (more suitable)Unix?? FreeBSD or Linux
Date: 20 Feb 1996 23:20:49 GMT
Organization: Utah Valley State College, Orem, Utah
Lines: 160
Message-ID: <4gdl0h$qnc@park.uvsc.edu>
References: <4er9hp$5ng@orb.direct.ca> <JRICHARD.96Feb9101113@paradise.zko.dec.com> <4fnd50$h1f@news.ox.ac.uk> <4frg0s$1jv@park.uvsc.edu> <4g9loc$si0@news.ox.ac.uk>
NNTP-Posting-Host: hecate.artisoft.com
Xref: euryale.cc.adfa.oz.au comp.unix.bsd.freebsd.misc:14123 comp.os.linux.development.system:17746

mbeattie@sable.ox.ac.uk (Malcolm Beattie) wrote:

[ ... anecdote: fsck placed crap in a file after a crash ... ]

] >Clearly, someone answered "no" to "Clear?" during the fsck after
] >the crash.
] 
] Not "clearly" at all and, in fact, wrong. Please stick to the technical
] explanations you excel at and stop with the aspersion casting.

The UFS storage and recovery algorithm in the sync case cannot
result in what you saw.

Either the recovery tool was incorrectly used, or the port was
incorrectly performed.

I went with the most likely scenario.

Can you tell me which revision of UFS, from what source, the OSF/1
UFS implementation was derived from?  There is a well-known async
operation that should be sync in the Net/2 UFS implementation;
it's possible that they used that code.

Since the location and the fix are well known, I find it extremely
unlikely that this is the explanation.


[ ... ]

] Since we're supposed to be talking filesystems here, I'll try
] to ask an intelligent question. Under Digital UNIX 3.2c, AdvFS
] is still funnelled to the master CPU on an SMP machine (fixed
] in 4.0, I believe). Is that because it's intrinsically hard
] to make a bitfile/extent-based filesystem SMP-safe or just
] because DEC are lazy?

This is an SMP granularity problem.  Having worked on a UFS
derived FS in several SMP kernels (UnixWare 2.x, Unisys 60xx
SVR4 ES/MP, and Solaris 2.3) and currently being involved in
kernel multithreading and SMP work on FreeBSD, I can still
only guess.

There are several possibilities.  DEC being lazy is the *least*
likely scenario.


[ Note: the following information is dated; the Solaris information
  was inferred from header files and debugging and may not be
  totally accurate ]


High grain parallelism is hard.  The issues are nearly identical
to those faced in kernel preemption for Real Time and Kernel
Multithreading.

The main issue is reentrancy.  If you use sync updates to order
your metadata, each pending sync update is an outstanding
synchronization block.

To deal with this, you can either divorce the ordering from the
file system requests, using Delayed Ordered Writes, like USL
did with UnixWare 2.x, or soft updates, as in Ganger/Patt.  Or
you can go to a graphical lock manager that can compute transitive
closure over the graph to detect when a request would cause a
deadlock to occur.
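The graph approach can be illustrated with a toy wait-for graph: before
letting a requester block on a lock, check whether the holder can already
(transitively) reach the requester, which would close a cycle.  This is
only a minimal sketch of the idea in C, not the Chorus code; all names
here are made up for illustration.

```c
#include <assert.h>

#define NTHREADS 8

/* wait-for graph: waits_for[a][b] != 0 means thread a is blocked on
 * a lock held by thread b.  (Hypothetical toy model.) */
static int waits_for[NTHREADS][NTHREADS];

/* Depth-first reachability: is 'to' reachable from 'from' by
 * following wait-for edges?  This computes the transitive closure
 * lazily, one query at a time. */
static int reachable(int from, int to, int *seen)
{
    if (from == to)
        return 1;
    seen[from] = 1;
    for (int i = 0; i < NTHREADS; i++)
        if (waits_for[from][i] && !seen[i] && reachable(i, to, seen))
            return 1;
    return 0;
}

/* Would letting 'waiter' block on a lock held by 'holder' close a
 * cycle?  If so, granting the wait would deadlock. */
int would_deadlock(int waiter, int holder)
{
    int seen[NTHREADS] = {0};
    return reachable(holder, waiter, seen);
}

/* Record a wait that was actually granted. */
void add_wait(int waiter, int holder)
{
    waits_for[waiter][holder] = 1;
}
```

A real lock manager would of course do this with per-lock queues rather
than a dense thread matrix, but the cycle test is the heart of it.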

As far as I know, the only OS to implement the graph solution
at this time is Chorus, a European-originated microkernel which
competes with MACH (its biggest claim to fame is that, unlike
MACH, it avoids most of the protection domain crossing in its
IPC, which introduces other problems).  Chorus is available to
educational institutions for a $1000 license fee, last time I
checked.  Since both USL and Novell are separately pursuing
Chorus-based technology, it's currently a hot thing for potential
employees at both places.

Without a divorce of the ordering from the FS proper, the file
system is either handled as a black box, with a single reentrancy
mutex (this is how non-MPized Solaris FS's must operate), or it
is handled using medium granularity with discrete locking of
"dangerous" routines causing more synchronization than "less
dangerous" routines.

Sequent's UFS locks (or used to lock) file system reentrancy
using the "black box" approach.  You can see this by running
multiple finds on the same tree and watching the processor
utilization; like DEC, they thread through a single processor.
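The "black box" model amounts to one mutex wrapping every entry point
into the file system, so only one CPU is ever inside the FS code at a
time.  A minimal sketch of that structure, assuming a hypothetical
vop_lookup() entry point (not Solaris or Sequent source):

```c
#include <assert.h>
#include <pthread.h>

/* One lock for the whole filesystem: the "black box" model.  Every
 * FS entry point takes it, so concurrent calls serialize and the FS
 * effectively runs on one processor at a time. */
static pthread_mutex_t fs_giant = PTHREAD_MUTEX_INITIALIZER;

static int lookups_done;             /* stand-in for real FS state */

/* hypothetical VOP_LOOKUP wrapper */
int vop_lookup(const char *name)
{
    (void)name;
    pthread_mutex_lock(&fs_giant);   /* serializes all callers */
    lookups_done++;                  /* the real work would go here */
    pthread_mutex_unlock(&fs_giant);
    return 0;
}
```

Two finds running concurrently both funnel through fs_giant, which is
exactly the single-processor threading behavior described above.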


Medium grain locking is what Unisys uses in the SVR4 ES/MP.
This is not as satisfying as high grain parallelism, but has
the very real advantage of exposing the internals to the
potential file system author.  The Unisys model is probably
the easiest to implement.  The locking is done in the vncalls.c
file in the kernel, and the reentrancy is on a per-VOP-call basis
with knowledge of the order dependencies in the underlying FS
implementations.  This makes it slightly less general, but most
easily supported.
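Medium grain locking can be sketched as the vnode layer choosing a lock
based on how "dangerous" the operation is: namespace-modifying calls
serialize on one mutex, plain data operations on another.  This is an
illustrative sketch of the idea only, not the Unisys ES/MP source; the
VOP names and counters are hypothetical.

```c
#include <assert.h>
#include <pthread.h>

/* "Dangerous" directory-modifying ops take namespace_lock; less
 * dangerous data ops take only data_lock, so a create and a read
 * no longer serialize against each other. */
static pthread_mutex_t namespace_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t data_lock      = PTHREAD_MUTEX_INITIALIZER;

static int files_created, bytes_read;

int vop_create(const char *name)       /* dangerous: touches names */
{
    (void)name;
    pthread_mutex_lock(&namespace_lock);
    files_created++;
    pthread_mutex_unlock(&namespace_lock);
    return 0;
}

int vop_read(int vnode, int nbytes)    /* less dangerous */
{
    (void)vnode;
    pthread_mutex_lock(&data_lock);
    bytes_read += nbytes;
    pthread_mutex_unlock(&data_lock);
    return nbytes;
}
```

The cost of this scheme is exactly what the text notes: the lock choice
encodes knowledge of each underlying FS's ordering dependencies.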


High grain parallelism is possible without divorce.  This is
the Solaris 2.3 approach.  Sun has been very bad about documenting
internals for file system authors, but there is some exposure
of the underlying model, both in the University of Washington
Usenix papers (ftp.sage.usenix.org) and in their /usr/include/sys
header files (the kernel multithreading is most enlightening in
t_lock.h, and vnode.h and fs/ufs_* also provide some tantalizing
clues).

The USL UFS implementation is high grain parallelism using a
divorce of the underlying I/O using delayed ordered writes.
This means that the VOPs can run to completion, and the I/O
ordering is the issue.  Unlike soft updates, the USL Delayed
Ordered Writes (Patent Pending) result in flat-graph lists
of I/O's which must be ordered.  The write clustering code
has a hard time picking "the optimal" approach using DOW, since
it can not reorder the ordered ops, only the unordered ones
(what are handled in a traditional UFS as "async").
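The flat-graph constraint can be modeled as a queue of buffers where each
ordered write names at most one predecessor and async writes name none:
the flusher may issue async buffers whenever it likes but must never
issue an ordered buffer before its predecessor.  A toy model of that
constraint (not the patented USL implementation):

```c
#include <assert.h>

/* Each pending buffer is either async (ordered_after < 0) or ordered
 * after exactly one other buffer -- a flat list, not a general DAG. */
struct buf {
    int ordered_after;   /* index of predecessor, or -1 if async */
    int issued;
};

/* Issue all buffers, never issuing an ordered write before the
 * write it depends on.  Returns the number of passes taken, which
 * grows with the depth of the ordering chains. */
int flush_queue(struct buf *q, int n, int *order_out)
{
    int issued = 0, passes = 0;
    while (issued < n) {
        passes++;
        for (int i = 0; i < n; i++) {
            int p = q[i].ordered_after;
            if (!q[i].issued && (p < 0 || q[p].issued)) {
                q[i].issued = 1;
                order_out[issued++] = i;
            }
        }
    }
    return passes;
}
```

The clustering code's problem is visible here: it can permute the async
buffers freely within a pass, but the ordered chain pins part of the
issue order no matter what layout would have been optimal.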

Still, a DOW-based UFS yields ~160% of the baseline performance
on a uniprocessor machine, even after the SMP synchronization is
taken into account.

Soft updates, on the other hand, are reported to yield "within
5% of memory speed".



Extent based file systems have their own problems, specifically
the use of a single allocation pointer.  It's possible to add a
processor domain abstraction and allow multiple extent pointers.
SMP VMS actually does this with the VMS file system.  This is
very similar to the VM technique of per processor page pools, as
used by Sequent and described by Vahalia in "UNIX Internals: The
New Frontiers" (ISBN 0-13-101908-2, from Prentice Hall).  Actually,
Vahalia prefers SLAB allocation, but there is no reason the two
techniques can't be combined.
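The per-processor-pointer idea is the same as per-processor page pools:
give each CPU its own region of the volume and its own allocation
pointer, so the common-case allocation touches no shared state.  A
minimal sketch under that assumption (the layout and numbers are
invented for illustration; this is not the VMS allocator):

```c
#include <assert.h>

#define NCPUS 4

/* Each CPU allocates extents from its own region of the volume,
 * so the global single allocation pointer disappears. */
static long alloc_ptr[NCPUS];
static long region_base[NCPUS];

void alloc_init(long volume_blocks)
{
    long per_cpu = volume_blocks / NCPUS;
    for (int c = 0; c < NCPUS; c++)
        region_base[c] = alloc_ptr[c] = c * per_cpu;
}

/* Allocate 'len' blocks from the calling CPU's region.  No lock is
 * needed in the common case because each CPU owns its pointer; a
 * real allocator would fall back to stealing from another region
 * (with locking) when its own runs dry. */
long alloc_extent(int cpu, long len)
{
    long start = alloc_ptr[cpu];
    alloc_ptr[cpu] += len;
    return start;
}
```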

I suspect DEC hasn't done this yet because it's intrinsically
hard.  Maybe their next release can go for soft updates, thus
leap-frogging DOW (and avoiding the patent issues).


Look for high grain SMP in FreeBSD as code and time permit.  The
current approach is to go from low to high grain parallelism
incrementally, using a "mutex push-down" approach to gradually
increase the parallelism.  The end intent is a graph solution
similar to that in Chorus, using a hierarchical lock management
mechanism.
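"Mutex push-down" can be pictured as starting with one lock at the
system-call boundary and then pushing locking down into each subsystem,
one at a time, so unrelated calls stop serializing.  A schematic sketch
of the strategy only; the subsystem names are hypothetical and this is
not actual FreeBSD code:

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t giant   = PTHREAD_MUTEX_INITIALIZER; /* step 0 */
static pthread_mutex_t vfs_mtx = PTHREAD_MUTEX_INITIALIZER; /* step 1 */
static pthread_mutex_t net_mtx = PTHREAD_MUTEX_INITIALIZER; /* step 1 */

static int vfs_calls, net_calls, legacy_calls;

/* Not-yet-converted paths still take the big lock... */
void syscall_legacy(void)
{
    pthread_mutex_lock(&giant);
    legacy_calls++;
    pthread_mutex_unlock(&giant);
}

/* ...while a converted VFS call holds only the VFS lock... */
void syscall_vfs(void)
{
    pthread_mutex_lock(&vfs_mtx);
    vfs_calls++;
    pthread_mutex_unlock(&vfs_mtx);
}

/* ...so a concurrent networking call no longer contends with it. */
void syscall_net(void)
{
    pthread_mutex_lock(&net_mtx);
    net_calls++;
    pthread_mutex_unlock(&net_mtx);
}
```

Each push-down step shrinks what the big lock covers, which is what
makes the low-to-high-grain transition incremental rather than a
rewrite.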

Feel free to point the DEC people at the Ganger/Patt paper as
well, if you think it will help.


					Regards,
                                        Terry Lambert
                                        terry@cs.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.