*BSD News Article 34285


Xref: sserve comp.os.386bsd.questions:12321 comp.os.386bsd.misc:3184
Newsgroups: comp.os.386bsd.questions,comp.os.386bsd.misc
Path: sserve!newshost.anu.edu.au!harbinger.cc.monash.edu.au!msuinfo!agate!howland.reston.ans.net!swrinde!elroy.jpl.nasa.gov!decwrl!netcomsv!netcomsv!calcite!vjs
From: vjs@calcite.rhyolite.com (Vernon Schryver)
Subject: Re: NFS buffering  (was Whats wrong with Linux networking ???)
Message-ID: <CuH8uH.4Ev@calcite.rhyolite.com>
Organization: Rhyolite Software
Date: Sat, 13 Aug 1994 14:13:28 GMT
References: <32bflj$lig@cesdis1.gsfc.nasa.gov> <CuDJox.HE2@calcite.rhyolite.com> <32gk4d$ee@cesdis1.gsfc.nasa.gov>
Lines: 74

In article <32gk4d$ee@cesdis1.gsfc.nasa.gov> becker@cesdis.gsfc.nasa.gov (Donald Becker) writes:
>Vernon Schryver <vjs@calcite.rhyolite.com> wrote:

>>>The NFS protocol assures the client that when the write-RPC returns, the
>>>data block has been committed to persistent storage.  For common
>>>implementations that means the block has been physically queued for writing,
>>>not just put in the buffer cache. ...
>>
>>An NFS server that only queues the block for writing before responding
>>instead of waiting for the disk controller to say that the write has
>>been completed does not meet the NFS "stable storage" rules.  Such a
>
>Yes, Vernon, I deliberately used the word "queue" there.  (I was going to
>explain it, but felt it would detract from the main point of the article.)
>It's not the operating system buffer cache I'm referring to, but the disk
>controller queue.  Most modern disk controllers, both IDE and SCSI, actually
>just queue write requests and return immediately.  Sure, the vulnerability
>window is limited to tens of milliseconds, but I suspect most systems
>technically violate the "committed to stable storage" rule.   Not that I
>think this is particularly bad or dangerous...

Whether that is bad or dangerous is irrelevant.  Your suspicion is
completely wrong in the commercial world.  You cannot report LADDIS
numbers using such a write-caching disk controller.  That's a fact.
Well, you might cheat for a while, but you'll get busted.
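
To be concrete, the "stable storage" rule means a server's write path
has to look more or less like the sketch below.  (A sketch only; the
function name is invented and a real server does this in the kernel,
but the ordering is the whole point.)

    #include <sys/types.h>
    #include <unistd.h>

    /*
     * handle_write() is a made-up name for the routine that services one
     * WRITE RPC.  Data goes to the file, fsync() waits for it to reach
     * stable storage, and only then may the reply go back to the client.
     */
    int
    handle_write(int fd, const void *buf, size_t len, off_t off)
    {
        if (pwrite(fd, buf, len, off) != (ssize_t)len)
            return -1;          /* reply with an error instead */
        if (fsync(fd) != 0)
            return -1;          /* likewise */
        return 0;               /* now the WRITE reply may be sent */
    }

And of course fsync() only keeps that promise if the drive itself is
not acknowledging writes out of a volatile cache, which is the next
point.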

Separately, the commercial grade systems I know about emphatically do
not turn on the write caches in disks.  Doing so trashes "filesystem
hardening" without gaining any performance.  You don't spend lots of
time on your disk queueing algorithms while paying attention to filesystem
hardening only to throw up your hands and just hope the disk firmware
authors did their part, even if you yourself have not found many serious
firmware bugs in all vendors' drives.  (No, I personally haven't, but the
people I work with who write the disk drivers have unending lists of
firmware bugs in new and old drives.)
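
If you want to know what a particular drive is doing, one crude test is
to time synchronous single-block overwrites.  A drive that really waits
for the media costs you on the order of a rotation per write; averages
well under a millisecond mean the firmware is answering out of its
cache.  (Sketch only; the scratch file name is made up, and it should
live on the filesystem that sits on the drive in question.)

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/time.h>

    int
    main(void)
    {
        char blk[512];
        struct timeval t0, t1;
        int i, fd = open("scratch", O_RDWR | O_CREAT, 0644);

        if (fd < 0)
            return 1;
        memset(blk, 0, sizeof(blk));
        gettimeofday(&t0, NULL);
        for (i = 0; i < 100; i++) {
            pwrite(fd, blk, sizeof(blk), 0);    /* same block every time */
            fsync(fd);      /* should not return until it is on the platter */
        }
        gettimeofday(&t1, NULL);
        printf("%.2f ms per synchronous write\n",
            ((t1.tv_sec - t0.tv_sec) * 1e6 +
             (t1.tv_usec - t0.tv_usec)) / 100.0 / 1e3);
        close(fd);
        return 0;
    }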

Of course, what happens on Joe Hobbyist's homebrew PC with a no-name
motherboard is a different story.  LADDIS numbers are not exactly
relevant.  For that matter, "stable storage" is not always ... a concern.


>>>                          ...   You can get around this by writing a client
>>>implementation that allows multiple outstanding write requests for each
>>>writing thread, at the expense of write order inconsistency.

>A simple, common example: 'tail -f logfile', where "logfile" is written by a
>NFS client.  With multi-threaded writes it could show spurious zeroed blocks,
>while a single-threaded client would produce the expected results.

That is entirely false, in both premise and reasoning.

    1. if you do `tail -f logfile` on the client doing the writing,
	you cannot tell anything about the order in which blocks are
	written to the disk, regardless of whether the disk is local
	or NFS or RFS.

    2. if you do `tail -f logfile` on some other machine, then the 
	effects of NFS retransmissions can show temporarily zeroed blocks.
	Some biods (or other NFS daemons) will finish sooner than others.

    3. typical NFS client implementations, at least those influenced
	by both System V and BSD local filesystem designs, are not in
	the least careful about the order in which they write blocks
	from their caches.  The update or bdflush daemon simply looks
	for dirty blocks in the common buffer cache and causes biod to
	do the NFS transaction or does the NFS transaction itself.  Just
	as on the local disk.

(2) and (3) can and do cause "spurious zeroed blocks."  I've seen them,
but of course only in multiple-client situations, and not for slowly
growing files like log files.
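
For anyone who wants to see the mechanism, it is nothing more exotic
than a hole in the file: if the write for a later block lands before
the write for an earlier one, the not-yet-written range reads back as
zeros.  A toy local demonstration (file name and block size invented;
put a sleep() between the two writes and look at the file from another
window):

    #include <sys/types.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
        char blk[8192];
        int fd = open("logfile", O_RDWR | O_CREAT | O_TRUNC, 0644);

        if (fd < 0)
            return 1;
        memset(blk, 'x', sizeof(blk));
        pwrite(fd, blk, sizeof(blk), 8192); /* second block committed first */
        /* a reader here sees 8K of zeros at offset 0: the "zeroed block" */
        pwrite(fd, blk, sizeof(blk), 0);    /* first block fills the hole */
        close(fd);
        return 0;
    }

On NFS the same thing happens when one biod's write RPC, or a
retransmission, reaches the server ahead of another's.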


Vernon Schryver    vjs@rhyolite.com