*BSD News Article 7099

Newsgroups: comp.unix.bsd
Path: sserve!manuel.anu.edu.au!munnari.oz.au!spool.mu.edu!wupost!zaphod.mps.ohio-state.edu!darwin.sura.net!paladin.american.edu!news.univie.ac.at!hp4at!mcsun!Germany.EU.net!rrz.uni-koeln.de!unidui!flyer!flatlin!bad
From: bad@flatlin.ka.sub.org (Christoph Badura)
Subject: Re: Repeat of the question about VFS and VOP_SEEK()
Organization: Guru Systems/Funware Department
Date: Tue, 27 Oct 1992 02:24:55 GMT
Message-ID: <BwrDDK.8qM@flatlin.ka.sub.org>
References: <b3co03lsb3LE00@amdahl.uts.amdahl.com> <1992Oct20.193544.2360@fcom.cc.utah.edu> <BwFu1E.759@pix.com> <1992Oct21.201738.22999@fcom.cc.utah.edu> <BwLp9z.8J2@flatlin.ka.sub.org> <1992Oct25.121136.26473@fcom.cc.utah.edu>
Lines: 152

In <1992Oct25.121136.26473@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:

>In article <BwLp9z.8J2@flatlin.ka.sub.org>, bad@flatlin.ka.sub.org (Christoph Badura) writes:
>|> In <1992Oct21.201738.22999@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
>|> >A lot of the differences are evolutionary rates differing between systems,
>|> >and different choices being made (SVR4 separated vop_read and vop_write
>|> >out of the BSD vop_rdwr for POSIX compliance and to avoid a recursion
>|> >loop, for instance).
>|> 
>|> How could separating vop_rdwr into vop_read and vop_write help POSIX
>|> compliance?  I'd be very interested in an explanation that takes into
>|> account that the SVR4 ufs-vop_read and ufs-vop_write almost
>|> instantaneously call ufs_rwip.

>In the SVR4.4 kernel sources, in /usr/src/uts/i386/fs/ufs/ufs_vnops.c, in the
>function ufs_write(), it says (paraphrased for legal reasons):

>	An ASSERT() is used to insure the behaviour conforms to the 
>	agreed upon [in POSIX 1003.1-1988] vnode interface regarding
>	the preservation of atomicity in reads and writes.  This
>	necessarily disallows calls to ufs_rdwr(), since the ufs_ilock()
>	there would then become recursive.

>Clearly, if we can agree that POSIX compliant behaviour is what mandates the
>atomicity of reads and writes (the part I inserted and put in brackets), then
>we can agree that POSIX behaviour mandated the split.

The comment says (this is verbatim, typos are of course my fault):

	NOTE: this assertion is consistent with the agreed on
	vnode interface provisions for preserving atomicity of
	reads and writes, but it necessarily implies that the
	ufs_ilock() call in ufs_rdwr is recursive.

Immediately following is an ASSERT() call that ensures that the inode
is read/write locked.

First, there is no assertion used to ensure POSIX conformance or
atomicity.  The comment simply says that it is *safe* to assert the
inode being read/write locked.  It also doesn't disallow the calling
of ufs_rdwr().  [Personally, I suspect that ufs_rdwr() is used in the
SunOS vnode interface, while SVR4 introduced vop_read, vop_write, and
vnode locking and unlocking functions.]

Second, P1003.1 doesn't define a vnode interface, as it doesn't deal
with kernel implementation at all, but with kernel functionality.

Third, P1003.1 doesn't make any guarantees about the atomicity of
reads and writes, except in the case that one writes {PIPE_BUF} bytes
or fewer to a pipe.  [I couldn't find a copy of 1003.1 right
now so I had to resort to the SVID and the X/Open Portability guide.
Both claim complete adherence to 1003.1 and the latter even explicitly
marks extensions to 1003.1. So I'm quite confident I don't
misrepresent anything here, though the ice is thinner than I prefer.]
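
The pipe limit mentioned above can be queried at run time.  A minimal
sketch, assuming a POSIX system; the two helper names are made up for
illustration, while fpathconf() and _PC_PIPE_BUF are the standard
interface:

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L
#endif
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: the {PIPE_BUF} value for this descriptor,
 * or -1 if the system cannot report one. */
static long pipe_atomic_limit(int fd)
{
    assert(fd >= 0);
    return fpathconf(fd, _PC_PIPE_BUF);
}

/* Hypothetical helper: 1 if a write of n bytes to fd falls under the
 * pipe guarantee (n <= {PIPE_BUF}), 0 otherwise. */
static int write_is_atomic(int fd, size_t n)
{
    long limit = pipe_atomic_limit(fd);
    return limit > 0 && n <= (size_t)limit;
}
```

Anything larger than the reported limit may be interleaved with data
from other writers on the same pipe.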

Fourth, even your misrepresentation of the comment doesn't explain
*why* such a split helps POSIX conformance.

And in fact a quick look into uts/i386/fs/vnops.c would have revealed
the following sequence of events:

write()
{
	...
	rdwr(vnode, ..., FWRITE);
}

read()
{
	...
	rdwr(vnode, ..., FREAD);
}

rdwr(vp, ..., mode)
{
	VOP_LOCK(vp);			/* lock taken once, by the caller */
	if (mode & FWRITE)
		VOP_WRITE(vp, ...);	/* runs with the lock already held */
	if (mode & FREAD)
		VOP_READ(vp, ...);	/* likewise */
	VOP_UNLOCK(vp);
}

Which is exactly what I was saying and contradicts your claims.
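
The same pattern as a small user-space sketch.  All names here are
invented for illustration and none of it is taken from the SVR4
sources; it only shows why the per-file-system routine can assert that
the lock is held without taking it again:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for FREAD/FWRITE and a vnode. */
enum { LK_FREAD = 1, LK_FWRITE = 2 };

struct lk_vnode {
    int  locked;    /* 1 while the generic layer holds the lock */
    long nread;     /* bytes "read" so far */
    long nwritten;  /* bytes "written" so far */
};

static void lk_lock(struct lk_vnode *vp)
{
    assert(!vp->locked);    /* would fire if the lock were taken twice */
    vp->locked = 1;
}

static void lk_unlock(struct lk_vnode *vp)
{
    assert(vp->locked);
    vp->locked = 0;
}

/* File system routines: they rely on the caller's lock and only
 * assert that it is there. */
static void lk_fs_read(struct lk_vnode *vp, size_t n)
{
    assert(vp->locked);
    vp->nread += (long)n;
}

static void lk_fs_write(struct lk_vnode *vp, size_t n)
{
    assert(vp->locked);
    vp->nwritten += (long)n;
}

/* Generic layer: one lock/unlock pair around the dispatch. */
static void lk_rdwr(struct lk_vnode *vp, size_t n, int mode)
{
    lk_lock(vp);
    if (mode & LK_FWRITE)
        lk_fs_write(vp, n);
    if (mode & LK_FREAD)
        lk_fs_read(vp, n);
    lk_unlock(vp);
}
```

Whether the operation is one rdwr entry point or separate read and
write entry points makes no difference to where the lock is taken.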

>|> >Thus perhaps the best answer is that the interface is ill defined.  In
>|> >the previous post referenced above, I referred to the illogicality of
>|> >making the call, since a seek offset is an artifact of an open file
>|> >descriptor, and is not an attribute of an inode or vnode in most of
>|> >the current implementations.  I also pointed out a potentially valid use
>|> >for passing the seek down:  predictive read ahead.  The problem here is
>|> >that either the read, the seek, or the open would have to be attributed
>|> >to flag the descriptor for predictive behaviour if this is to be a
>|> >successful optimization.
>|> 
>|> Since all that is needed for predictive read ahead below the VFS layer
>|> is a) a vnode and b) the new seek offset, I can't follow your
>|> illogicality claims.

>2)	It is illogical to make a call to a lower layer when the abstraction
>	(a seek offset) is limited in scope to an upper layer (making reads
>	and writes relative to the previous read or write in the system call
>	layer).

I maintain that it is not illogical when the I/O is not sequential.

>3)	In practice, my suggested use (predictive read ahead) is implemented
>	by a modified system call layer eliminating dependence on the seek
>	offset, thus obviating the need to notify the file system itself of
>	such an animal.

Please explain how you plan to notify the lower layers about the new
point at which read ahead could start after a lseek() without passing
the new file offset to that layer.

Passing the new file offset to the lower layers after a lseek()
seems to me an obvious and potentially powerful optimisation.

Obviously, the higher layer could start the read ahead by itself, but
since the current VFS layers are synchronous with regard to the
application and the lower layer could possibly make smarter decisions
about the desirability of read ahead (think about remote file
systems), I don't think that is a viable approach.
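
A rough sketch of what passing the offset down could look like.  The
names (sk_lseek, vop_seek, ra_start) are hypothetical and not taken
from any existing kernel, and only an absolute seek is modelled:

```c
#include <assert.h>
#include <stddef.h>

struct sk_vnode;

/* Hypothetical operations vector with a seek notification. */
struct sk_vnodeops {
    void (*vop_seek)(struct sk_vnode *vp, long old_off, long new_off);
};

struct sk_vnode {
    const struct sk_vnodeops *ops;
    long ra_start;  /* where the file system would read ahead; -1 = nowhere */
};

/* The seek offset itself stays in the open file, not in the vnode. */
struct sk_file {
    struct sk_vnode *vp;
    long offset;
};

/* One possible lower-layer policy: restart read ahead at the new offset.
 * A remote file system could just as well decide to do nothing here. */
static void sk_fs_seek(struct sk_vnode *vp, long old_off, long new_off)
{
    (void)old_off;
    vp->ra_start = new_off;
}

static const struct sk_vnodeops sk_fs_ops = { sk_fs_seek };

/* Upper layer: update the offset, then tell the lower layer about it. */
static long sk_lseek(struct sk_file *fp, long new_off)
{
    long old_off;

    assert(fp != NULL && fp->vp != NULL);
    old_off = fp->offset;
    fp->offset = new_off;
    if (fp->vp->ops != NULL && fp->vp->ops->vop_seek != NULL)
        fp->vp->ops->vop_seek(fp->vp, old_off, new_off);
    return new_off;
}
```

The decision whether to act on the notification stays with the file
system, which is the point of passing it down.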

>4)	Predictive read ahead based on any mechanism *requires* some method
>	of promiscuously informing the file system that the file descriptor
>	in question will be used in such a way that predictive read ahead.

The UNIX kernel already implements predictive read ahead. If you want
more functionality, you must be able to flag an individual file
descriptor for (non)sequential access. In this case I propose the
introduction of a new system call fadvise(), analogous to madvise().
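
To make the proposal concrete, here is a sketch of the per-descriptor
state such a call might set.  Everything in it is hypothetical: no
fadvise() interface is being quoted, and the hint names merely mirror
madvise()'s MADV_NORMAL, MADV_SEQUENTIAL, and MADV_RANDOM:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical advice values, modelled on the madvise() ones. */
enum adv_hint { ADV_NORMAL, ADV_SEQUENTIAL, ADV_RANDOM };

/* Hypothetical open-file entry carrying the advice. */
struct adv_file {
    enum adv_hint hint;
};

/* Record the advice on the descriptor; returns 0, or -1 for a bad hint. */
static int adv_fadvise(struct adv_file *fp, int hint)
{
    assert(fp != NULL);
    if (hint != ADV_NORMAL && hint != ADV_SEQUENTIAL && hint != ADV_RANDOM)
        return -1;
    fp->hint = (enum adv_hint)hint;
    return 0;
}

/* Read-ahead decision: explicit advice wins, otherwise fall back to
 * whatever the kernel's own sequential-access guess says. */
static int adv_want_readahead(const struct adv_file *fp, int looks_sequential)
{
    switch (fp->hint) {
    case ADV_SEQUENTIAL:
        return 1;
    case ADV_RANDOM:
        return 0;
    default:
        return looks_sequential != 0;
    }
}
```

With ADV_NORMAL the behaviour is unchanged, so existing programs would
not notice the new call.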

For even better optimisation one would indeed have to attribute each
open(), read(), write(), and lseek() with a flag indicating whether
read ahead is desired or not. But this is comp.unix.bsd and not
comp.os.research.

>I think I can safely say the benefits of predictive read ahead are questionable
>unless there is a cooperative mechanism which obviates the need to use lseek()
>to communicate the read ahead.

Let me just say that UNIX successfully implements predictive read
ahead under the assumption that most applications exhibit strictly
sequential access patterns and that read ahead is disabled after a
call to lseek().
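
A simplified illustration of that heuristic, not code from any
particular kernel.  The structure and function names are made up, and
it is an assumption of the sketch that lseek() simply forgets the last
block read:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-file read-ahead state. */
struct ra_state {
    long last_blk;  /* last logical block read, -1 if none is remembered */
};

static void ra_init(struct ra_state *ra)
{
    assert(ra != NULL);
    ra->last_blk = -1;
}

/* lseek(): forget the history, so the next read does no read ahead. */
static void ra_seek(struct ra_state *ra)
{
    assert(ra != NULL);
    ra->last_blk = -1;
}

/* read() of logical block blk: returns the block to read ahead,
 * or -1 if the access does not look sequential. */
static long ra_on_read(struct ra_state *ra, long blk)
{
    long ahead = -1;

    assert(ra != NULL && blk >= 0);
    if (ra->last_blk >= 0 && blk == ra->last_blk + 1)
        ahead = blk + 1;    /* second sequential read in a row */
    ra->last_blk = blk;
    return ahead;
}
```

Reading blocks 0, 1, 2 triggers read ahead from the second read on;
after ra_seek() the first read at the new position triggers none, and
read ahead resumes once the access is sequential again.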

-- 
				Christoph Badura  ---  bad@flatlin.ka.sub.org

AIX is a better... is a better...  is a better... OpenSystem.
					IBM Rep at GUUG Symposium '92