*BSD News Article 7126



Newsgroups: comp.unix.bsd
Path: sserve!manuel.anu.edu.au!munnari.oz.au!sgiblab!zaphod.mps.ohio-state.edu!wupost!cs.utexas.edu!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: Repeat of the question about VFS and VOP_SEEK()
Message-ID: <1992Oct27.181215.23644@fcom.cc.utah.edu>
Keywords: VOP_SEEK VOP_READ VOP_WRITE VOP_RDWR
Sender: news@fcom.cc.utah.edu
Organization: Weber State University  (Ogden, UT)
References: <BwLp9z.8J2@flatlin.ka.sub.org> <1992Oct25.121136.26473@fcom.cc.utah.edu> <1992Oct26.213408.21184@Veritas.COM>
Date: Tue, 27 Oct 92 18:12:15 GMT
Lines: 161

In article <1992Oct26.213408.21184@Veritas.COM> craig@Veritas.COM (Craig Harmer) writes:
>In article <1992Oct25.121136.26473@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
>}	An ASSERT() is used to ensure the behaviour conforms to the 
>}	agreed upon [in POSIX 1003.1-1988] vnode interface regarding
>}	the preservation of atomicity in reads and writes.  This
>}	necessarily disallows calls to ufs_rdwr(), since the ufs_ilock()
>}	there would then become recursive.
>}
>}Clearly, if we can agree that POSIX compliant behaviour is what mandates the
>}atomicity of reads and writes (the part I inserted and put in brackets), then
>}we can agree that POSIX behaviour mandated the split.
>
>i don't see how atomicity guarantees demand separate read/write
>interfaces.  imagine this code:
>
>ufs_rdwr(vp, uiop, type)
>	struct vnode *vp;
>	struct uio *uiop;
>	int type;
>{
>	ufs_ilock(VTOI(vp));
>
>	if (type == READ) {
>		ufs_read(vp, uiop);
>	} else {
>		ufs_write(vp, uiop);
>	}
>
>	ufs_iunlock(VTOI(vp));
>}

Apparently, with a loop-back VFS or some other mechanism supported by
POSIX semantics, this could go infinitely recursive.  I believe what
you were looking for instead of "ufs_ilock" and "ufs_iunlock" was
"vop_rwlock" followed by "vop_rwunlock", as documented in:

	UNIX(R) System V/386
	Release 4 Version 3
	Programmer's Guide:
	Writing File
	System Types

Atomicity is discussed in some minor detail in this document.
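To make the recursion concrete, here is a toy standalone model (not real
kernel code; the inode lock is just a non-recursive flag, and vop_read()
stands in for the fs-independent layer): the combined interface deadlocks
if it is ever re-entered on the same inode, while the split interface
hoists the lock into the caller, so the unlocked read routine can safely
be called from a layer that already holds it.

```c
#include <assert.h>

/* Toy model of the locking problem -- NOT actual kernel code.  The
 * inode lock is modeled as a simple non-recursive flag. */
struct inode { int locked; };

static int  ilock(struct inode *ip)   { if (ip->locked) return -1; ip->locked = 1; return 0; }
static void iunlock(struct inode *ip) { ip->locked = 0; }

/* Combined interface: the fs code takes the lock itself, so any
 * re-entry on the same inode (e.g. through a loop-back layer) finds
 * the lock already held and can never acquire it. */
int
ufs_rdwr(struct inode *ip)
{
	if (ilock(ip) < 0)
		return -1;		/* recursive acquisition: deadlock */
	/* ... transfer data ... */
	iunlock(ip);
	return 0;
}

/* Split interface: locking is hoisted into the caller (the
 * vop_rwlock/vop_rwunlock pair), and the read routine itself never
 * locks, so it may be called from a layer already holding the lock. */
int
ufs_read(struct inode *ip)
{
	(void)ip;			/* ... transfer data ... */
	return 0;
}

int
vop_read(struct inode *ip)
{
	int r;

	if (ilock(ip) < 0)		/* stands in for vop_rwlock() */
		return -1;
	r = ufs_read(ip);
	iunlock(ip);			/* stands in for vop_rwunlock() */
	return r;
}
```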

>assuming the inode lock is not released in ufs_read() or ufs_write()
>how is this not atomic with respect to other read and write requests?
>
>i don't see what POSIX has to do with the splitting of VOP_RDWR
>at the vnode interface layer.

Candidly, I'm only parroting the comments on this one.  I believe it has
to do with maintaining POSIX compliance in an environment where kernel
preemption is allowed.  I seriously doubt that it was "change for change's
sake", since the previous interface used "vop_rdwr".

>also, inode locks in SVR4.0 are recursive, at least for UFS and VxFS.

I think the point in the AT&T document was with regard to vnode locks,
not inode locks.

>}|> >Thus perhaps the best answer is that the interface is ill defined.  In
>}|> >the previous post referenced above, I referred to the illogicality of
>}|> >making the call, since a seek offset is an artifact of an open file
>}|> >descriptor, and is not an attribute of an inode or vnode in most of
>}|> >the current implementations.  I also pointed out a potentially valid use
>}|> >for passing the seek down:  predictive read ahead.  The problem here is
>}|> >that either the read, the seek, or the open would have to be attributed
>}|> >to flag the descriptor for predictive behaviour if this is to be a
>}|> >successful optimization.
>
>the seek offsets are passed down because the file system independent
>layer doesn't presume to know the range of valid seek offsets for a
>file system type.  this gives the file system specific code an
>opportunity to complain when the seek *system call* is made.
>lseek() can return an error if it needs to.

This is perhaps a valid contention, although I might argue it on the
definition of lseek() requiring a long argument.  I could definitely
see mounting, for instance, a file system that only supported 16 bits
for the file length, and used the other 16 bits for, as an example,
promiscuous selection of namespace for a multinamespace file system
(to support something like resource forks directly).  Doing this is
extremely questionable, since the semantics of the lookup mechanisms
(file name parameters to open() or creat() and returns from getdents())
aren't set up to handle such promiscuous naming.

Another potential use is on non-holey file systems (like DOS) to allocate
real space for what UFS would consider a "hole" in a file.
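For illustration, a per-filesystem seek check along these lines might
look like the following toy (hypothetical dosfs_seek() with a made-up
16-bit length limit -- not actual DOS file system code):

```c
#include <errno.h>

/* Hypothetical seek validation for a file system with a 16-bit file
 * length.  The fs-independent layer can't know this limit, which is
 * the argument for passing the seek down via VOP_SEEK(). */
#define DOSFS_MAXOFF	0xFFFFL		/* toy limit, purely illustrative */

int
dosfs_seek(long newoff)
{
	if (newoff < 0 || newoff > DOSFS_MAXOFF)
		return EINVAL;	/* lseek() hands this back to the caller */
	return 0;		/* offset acceptable for this fs type */
}
```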

These reasons combined are probably sufficient to warrant passing down
the lseek() with VOP_SEEK() --- so I withdraw my objections (although
this will cause seek operations to be slower by a dereference and a
function call per reference, for no benefit of any kind for all of the
current file systems supported by 386BSD).  By this same token, we should
provide a VOP_IGET() so as to allow future separation of the directory
entry management and inode management "layers" -- which aren't currently
layered at all.  We could see a great deal of benefit from that with
very little effort.


>}I think I can safely say the benefits of predictive read ahead are
>}questionable unless there is a cooperative mechanism which obviates the
>}need to use lseek() to communicate the read ahead.  I can see the designers
>}leaving it in there for some future "smarter NFS", but nothing in user
>}space currently requires nor could benefit from predictive read ahead
>}implemented this way.
>
>if you're talking about using lseek() to "request" a read ahead, that's
>silly.  lseek() already has a set of semantics associated with it, and
>adding new ones would confuse the issue.  invent a new system call or
>convince USL to add the asynchronous I/O systems calls originally planned
>for SVR4.0.  

That's why I said "the benefits of predictive read ahead are questionable
.... [ if you ] ... need to use lseek() to communicate the read ahead" --
or, even more plainly, "predictive read ahead using VOP_SEEK() to inform
the file system to do it is a dumb idea".

>finally, read-ahead (and write-behind) are useful for applications
>that don't perform any buffering of their own.  a common application
>behavior in Unix is to read an entire file sequentially, or to
>truncate a file, write it sequentially, and close it.  file systems
>that detect this behavior and modify their behavior appropriately
>can provide significant performance improvements.

This is like arguing against external pagers.  I think the prediction
heuristics belong in the application, not in the file system; who is
better to judge the future behaviour of the application?

A certain amount of buffering is already done -- reads in UFS are in terms
of one or more blocks; they are *never* in smaller increments.  The main
cost in getting the buffered data out is the copyout across the user/kernel
boundary, and the expensive part of this is the page mapping.  An application
that does a read() a character at a time is going to bottleneck in reading
the data, not in getting data from the disk to the kernel buffer.  The
largest benefit here is that which can be gained from user-space caching
and copying across the user/kernel boundary in page multiples (at best) or
cache buffer size multiples (at worst, if the cache buffer element size
is not some multiple of the page size).  This minimizes the block reads
to disk, and minimizes the page mapping which must be done to get the data
from kernel to user space.

I'll agree that application-directed read-ahead and write-behind are good
things, but unless you have pages mapped in user and kernel space at the
same time (like a shmctl'ed shared memory segment), I see little if any
benefit in kernel based predictive read-ahead for the examples you have
given.  This doesn't even address the overhead in "detecting the
behaviour" as a means of employing the heuristic.


					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
-- 
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------