*BSD News Article 65561

Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!news.rmit.EDU.AU!news.unimelb.EDU.AU!munnari.OZ.AU!news.ecn.uoknor.edu!qns3.qns.com!imci4!newsfeed.internetmci.com!in1.uu.net!news.artisoft.com!usenet
From: Terry Lambert <terry@lambert.org>
Newsgroups: comp.os.linux.development.system,comp.unix.bsd.freebsd.misc
Subject: Re: Ideal filesystem
Date: 10 Apr 1996 07:43:09 GMT
Organization: Artisoft, Inc.
Lines: 243
Message-ID: <4kfoqd$dgs@coyote.Artisoft.COM>
References: <4hptj4$cf4@cville-srv.wam.umd.edu> <4jerrj$f12@park.uvsc.edu> <jlemonDp1GFM.H4I@netcom.com> <4jpjb6$77c@park.uvsc.edu> <jlemonDpEw1v.4Ez@netcom.com>
NNTP-Posting-Host: hecate.artisoft.com
Xref: euryale.cc.adfa.oz.au comp.os.linux.development.system:21081 comp.unix.bsd.freebsd.misc:17066

jlemon@netcom.com (Jonathan Lemon) wrote:

Finally, someone who understands the hash problem which is
introduced by weenieing-out on looking for the executable
fork!

] % chmod +attributed_directory_bits $home/bin/foo
] % chmod +executable_directory_bits $home/bin/foo
] % chmod +attributed_directory_bits $home/bin/ls
] % rehash
] % ls
] foo             ls
] % rm foo/a.out
] % foo
] % foo: executable not found.
] 
] Yes, this breaks 'drop-through' shell hashing, in that if I had 
] /usr/X11/bin/foo, it would not get run.  So what?  I don't think
] it's that big of a deal, and could probably be implemented either way.

It's a gratuitous change in behaviour.  It is therefore, by
definition, evil.  It can be avoided.  It should be avoided.

The problem exists because the a.out fork is separable from the
file foo.  This is an artifact of the file foo being a directory
and the contents of directories being separable -- either
intentionally, as in your example above, or unintentionally
as a result of a system crash, bad block, whatever.

The association between "foo" and its "a.out fork" is not
handled atomically, and operations on the fork are not
idempotent -- therefore one cannot make consistency guarantees
about it.
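
The failure in the quoted session can be sketched as a toy model.
This is only an illustration, not any real shell's code: the
names rehash, run, exec_bit and has_aout are hypothetical, and
a directory's "executable" attribute bit is assumed to be tracked
separately from the presence of its a.out fork.

```python
def rehash(path_dirs):
    """Build a shell-style hash table: the first PATH directory that
    claims a name wins.  Only the attribute bit is consulted."""
    table = {}
    for dirname, entries in path_dirs:
        for name, entry in entries.items():
            if entry["exec_bit"] and name not in table:
                table[name] = dirname
    return table


def run(name, table, path_dirs):
    """Run a hashed command; there is no drop-through to later PATH entries."""
    dirs = dict(path_dirs)
    dirname = table.get(name)
    if dirname is None:
        return f"{name}: command not found"
    if not dirs[dirname][name]["has_aout"]:
        # The bit said "executable", but the fork has been removed.
        return f"{name}: executable not found."
    return f"ran {dirname}/{name}"


path_dirs = [
    ("/home/bin", {"foo": {"exec_bit": True, "has_aout": True}}),
    ("/usr/X11/bin", {"foo": {"exec_bit": True, "has_aout": True}}),
]
table = rehash(path_dirs)
path_dirs[0][1]["foo"]["has_aout"] = False   # rm foo/a.out
result = run("foo", table, path_dirs)
```

In this model /usr/X11/bin/foo is intact but never runs, because
the bit and the fork were allowed to disagree.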

And all because we want to abuse the existing FS instead of
solving the problem below the user/kernel boundary.


] >A real index would be searchable in O(log2(n)+1) compares for
] >a total of n entries, whereas a directory used as an index
] >varies by implementation, and is never better than that, and
] >is more frequently simply O(n/2+1).
] 
] I wasn't claiming that it was an efficient index, only that it _was_ one.
] Besides, O(n/2) is for a pure sequential search - if you have a multilevel 
] directory, (eg: /u/j/jl/jlemon) you certainly get better than O(n/2).

I was only referring to the "terminal" component, which has been
converted to a directory, in the case that the a.out is "implied"
by an attribute bit -- an attribute bit not guaranteed to be
consistent with the a.out's existence.
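
To put numbers on the two compare counts quoted above, here is a
small sketch.  The function names are made up for illustration,
and the "real index" is assumed to behave like a binary search
over sorted names.

```python
def index_compares(n):
    """Worst-case compares for a sorted index: floor(log2(n)) + 1."""
    return n.bit_length()          # equals floor(log2(n)) + 1 for n >= 1


def directory_compares(n):
    """Average compares for a sequential directory scan: n/2 + 1."""
    return n // 2 + 1


def binary_probes(sorted_names, target):
    """Count the probes an ordinary binary search makes to find target."""
    lo, hi, probes = 0, len(sorted_names) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_names[mid] == target:
            break
        if sorted_names[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return probes


names = [f"{i:04d}" for i in range(1024)]
worst = max(binary_probes(names, t) for t in names)

print(index_compares(1024))      # 11
print(directory_compares(1024))  # 513
```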


] >By restricting the allowable entry manipulations to not include
] >exposure in the FS namespace, the implementation can prevent it
] >from ever being possible to dissociated a fork from the node
] >that contains it.
] 
] Uh.  In an ideal world, maybe.  But in an ideal world, lost+found would never
] be used, either.

You don't need an ideal world to kill lost+found.  You need to
use LFS instead of UFS on your BSD system.  Or you need to
use VxFS on your SVR4/UnixWare system.  Or JFS on your AIX
system.

Or you need to integrate the code from Appendix A of the Soft
Updates paper and use UFS (putting the block count consistency
fix in the mount code itself) and delete fsck entirely.


I have implemented an attributed file system before for Novell/USG
as part of the NetWare for UNIX 4.x product.  Once you know what
you need, it's not that hard... the hard part is making sure
that the changes you introduce are minimal.

All you need is parent inode and fork ID.  You throw away forks
when you throw away the last reference for the parent inode.
You use a name space escape (preferably POSIX, if you are
willing to modify the lookup code -- we didn't for licensing
reasons), and you are in.

Reassociation recovery is as simple as looking for "secondary
inodes" without parent references and reconnecting them using
the parent pointer reference in the secondary.
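
A minimal sketch of that recovery pass, under stated assumptions:
this is not the NetWare for UNIX code, the Inode layout and the
name reconnect_forks are hypothetical, and a "secondary inode" is
assumed to carry exactly a parent inode number and a fork ID.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Inode:
    ino: int
    # Primary inodes: fork ID -> secondary inode number.
    forks: Dict[str, int] = field(default_factory=dict)
    # Secondary inodes only: back pointer to the owning inode.
    parent: Optional[int] = None
    fork_id: Optional[str] = None


def reconnect_forks(inodes):
    """Reattach secondaries whose parent lost its forward reference.

    Secondaries whose parent no longer exists are thrown away, since
    the last reference to the parent is gone.
    """
    reconnected = []
    for node in list(inodes.values()):
        if node.parent is None:
            continue                      # a primary inode
        parent = inodes.get(node.parent)
        if parent is None:
            del inodes[node.ino]          # orphaned fork: discard
            continue
        if parent.forks.get(node.fork_id) != node.ino:
            parent.forks[node.fork_id] = node.ino
            reconnected.append((parent.ino, node.fork_id))
    return reconnected


inodes = {
    10: Inode(10),                                # lost its fork table
    11: Inode(11, parent=10, fork_id="icon"),
    12: Inode(12, parent=99, fork_id="rsrc"),     # parent 99 is gone
}
fixed = reconnect_forks(inodes)
```

Running the pass a second time changes nothing, which is the
idempotence the recovery wants.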

The interesting part of this (now sidetracked into directories
vs. forks) discussion is the ability to tie code into file
system events -- looking at file systems as directed graphs
of events in a hierarchy (one to which you could apply
ordering, commutation, association, and conflict resolution
rules in order to apply a technology like soft updates generally
instead of having to use FS specific code).
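
As a rough illustration of the "directed graph of events" idea
(and only that: this is not the soft updates algorithm), ordering
rules can be written as edges and a safe write order derived from
them.  The event names are invented, and graphlib needs Python
3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter


def safe_write_order(must_precede):
    """must_precede maps an event to the events that must reach disk
    before it.  Returns one valid ordering, or None if the rules
    conflict (form a cycle) and need explicit resolution."""
    try:
        return list(TopologicalSorter(must_precede).static_order())
    except CycleError:
        return None


# Create: the inode must be initialized before a name points at it.
create_rules = {
    "write_dir_entry": {"init_inode"},
    "init_inode": set(),
}

# Remove: the name must be cleared before the inode is reused.
remove_rules = {
    "free_inode": {"clear_dir_entry"},
    "clear_dir_entry": set(),
}

# Two rules that contradict each other.
conflicting = {"a": {"b"}, "b": {"a"}}

create_order = safe_write_order(create_rules)
remove_order = safe_write_order(remove_rules)
conflict = safe_write_order(conflicting)
```

The point of the sketch is that nothing in safe_write_order knows
which file system the events came from.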


] In another article of yours, I think that you mentioned something like
] "copying an EA file from one filesystem to another - either everything gets
] copied, or nothing.".
] 
] Please tell me what happens when we pull the power switch in the middle of
] the copy.  Does fsck just delete everything because it is an incomplete
] copy?  

The target is partially complete.

Obviously, this would require that the copy operated on a fork
by fork basis.  In reality, you want the copy to be an atomic
kernel operation, where possible.  Specifically, if I have two
filesystems mounted from a remote machine, and I copy from
one to the other, I want to send a message to the server to
do the copy instead of pulling the bits over the wire and pushing
them back.

As a logically atomic operation, the operation would roll forward
or back, depending on the complexity of available logging semantics
in the underlying FS.
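
A toy sketch of the two recovery directions for an interrupted
fork-by-fork copy.  The record names ("intent", "copied",
"commit") and the function recover_copy are assumptions made for
illustration; a real FS log looks nothing like a Python list.

```python
def recover_copy(target, log, source=None):
    """Finish or undo an interrupted multi-fork copy.

    log records: ("intent", [fork names]), ("copied", name), ("commit",)
    """
    if any(rec[0] == "commit" for rec in log):
        return "complete"

    intent = next((rec for rec in log if rec[0] == "intent"), None)
    if intent is not None and source is not None:
        # Intent was logged: redo every fork.  Recopying a fork that
        # already arrived is harmless, so the redo is idempotent.
        for name in intent[1]:
            target[name] = source[name]
        return "rolled forward"

    # Only completed events were logged: undo them.
    for rec in log:
        if rec[0] == "copied":
            target.pop(rec[1], None)
    return "rolled back"


source = {"data": b"x", "icon": b"y"}

t1 = {"data": b"x"}                       # power lost after one fork
r1 = recover_copy(t1, [("intent", ["data", "icon"]), ("copied", "data")],
                  source)

t2 = {"data": b"x"}                       # same crash, weaker log
r2 = recover_copy(t2, [("copied", "data")])
```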


] If the disk block containing the "root" of the EA is damaged, does that mean
] I automatically lose _all_ of the files within that EA?

You mean "all the EA's in the file".  The answer is "yes".

You might as well ask "if the inode for a file is damaged so that
I don't have any of my block pointers available, do I lose the
data in the blocks that were pointed to by the block pointers that
are no longer there?"

Effectively, you have damaged the file beyond algorithmic
recoverability in both cases.

On the other hand, if I lose a file to lost and found (say we
are stupid, and don't implement either log structuring or soft
updates or synchronous ordering of any kind), I do *not* get
a file for each of the attributes for the lost file -- I get
one file with all EA's, intact.

] >The failure mode in the directories is from the lack of a parent
] >pointer due to its exposure in the FS name space, specifically,
] >the way in which hard links are currently implemented.
] 
] This sounds like you want to have a double-linked nodes (eg, the node knows
] about it's parent), which is currently not possible, as you pointed out, since
] with hard links, a file may have more than one "parent".  Leaving this aside
] for the moment, wouldn't having double links increase the failure modes of 
] the filesystem?

No.  It would only increase the recovery modes.  Since using
ordered I/O *guarantees* deterministic recoverability, the
links will be consistently maintained.

The hard links work because you divorce the object pointed to by
the directory reference from the file.

That is, each directory entry points to one object, and each
object points to one file in the flat numeric namespace.  The
directory entry target objects are maintained as a forward
linked list (again, deterministic recoverability: only one link
may be severed at a time, so the ring is always recoverable).

Each "directory node" is a reference instance for the flat
name space object, and the flat name space object has a pointer
to one parent (the original, to start with, but maybe further
along the ring if the original is deleted).

Since it doesn't matter which ring element is pointed to by a
file, the file back-link is always recoverable.

Each directory node has a pointer to its parent node.

Thus each hard link describes a single vector to the root of
the file system from its location.

An in core open file instance must reference the directory node,
not the flat node, and the directory node in core references the
in core flat node -- the traditional vnode.  The flat node goes
in the system open file table, while the directory instance goes
in the per process open file table.

Thus from any open fd, you can recover the path used to open it,
assuming the path still exists.
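
The structure described above can be modelled in a few lines.
This is a sketch only: FlatNode, DirNode, link, path_of and
all_paths are hypothetical names, and in-memory references stand
in for on-disk pointers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class FlatNode:                       # file in the flat numeric namespace
    ino: int
    back: Optional["DirNode"] = None  # pointer to any one ring element


@dataclass(eq=False)
class DirNode:                        # target of exactly one directory entry
    name: str
    parent: Optional["DirNode"]       # node of the containing directory
    flat: FlatNode
    next: Optional["DirNode"] = None  # forward link in the hard-link ring


def link(name, parent, flat):
    """Add a hard link: one new directory node spliced into the ring."""
    node = DirNode(name, parent, flat)
    if flat.back is None:
        node.next = node
        flat.back = node
    else:
        node.next = flat.back.next
        flat.back.next = node
    return node


def path_of(node):
    """Follow parent pointers: a single vector back to the root."""
    parts = []
    while node.parent is not None:
        parts.append(node.name)
        node = node.parent
    return "/" + "/".join(reversed(parts))


def all_paths(flat):
    """Walk the ring once to list every name of the file."""
    paths, node = [], flat.back
    while node is not None:
        paths.append(path_of(node))
        node = node.next
        if node is flat.back:
            break
    return paths


root = DirNode("", None, FlatNode(2))
usr = link("usr", root, FlatNode(3))
f = FlatNode(100)
a = link("a", usr, f)                 # /usr/a
b = link("b", root, f)                # /b, a hard link to the same file
open_fd = a                           # an open instance holds the DirNode
```

Because the open instance holds the directory node rather than
the flat node, path_of(open_fd) gives back the path that was
actually used for the open.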

An unlinked file will be detectable as such.

One could easily add a "create entry for in core directory node"
operation to cause the entire process context to be checkpointable.


] >A fork is not a hard link, and if it is to not be limited by the
] >size of an on disk inode structure, it must not be associated
] >the same way inodes are associated with names.
] >
] >To think of it another way, DOS, OS/2, NetWare, and NTFS can not
] >lose file names because they are not directory entries, they are
] >attributes of the file.  UNIX can, because logically names are
] >not attributes.
] 
] Um, so I blow away the FAT table in DOS.  I have not lost the filename,
] I have lost the _file_, "because the filename is an attribute of the file".
] Thanks but no thanks. 

In the link architecture I'm suggesting, the file name is *not*
an attribute of a file.  It is a pointer to a directory node.

But since each directory node has only one directory entry
pointing to it, even in the hard link case, and each directory
node has a pointer to its parent, one can recover the name
information if one stored it in an attribute, to recover
the file.

So the file won't be lost, it will be replaced where it came from.

This is important, because we can not afford to have a fork
dissociated from the file that contains it (the main drawback
of the file-as-directory).

Obviously, if I destroy information, then information is
destroyed.  To complain about that is to complain about
tautology.  8-).

If you blow away the inode's block pointers in current UNIX
file systems (ie: clri/fsdb), you are just as screwed as
the DOS example.

The point to my use of the DOS example was the issue of
reverse mapping, and the *option* to place a recovery attribute
on a file (it being impossible to separate an attribute instance
from a file instance, except logically) and actually recover
the name information.

In a decent implementation, you'd roll the transaction forward
or back (depending on intent versus event logging) and be done
with it... the name recovery is just an example application
that would want the connection to be atomic and operations on
it to be idempotent.  It's not my *only* example.


                                        Regards,
                                        Terry Lambert
                                        terry@lambert.org
---
Any opinions in this posting are my own and not those of my present
or previous employers.