*BSD News Article 17696

Newsgroups: comp.os.386bsd.bugs
Path: sserve!newshost.anu.edu.au!munnari.oz.au!news.Hawaii.Edu!ames!agate!howland.reston.ans.net!math.ohio-state.edu!caen!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: Nethack
Message-ID: <1993Jun29.181749.5833@fcom.cc.utah.edu>
Sender: news@fcom.cc.utah.edu
Organization: Weber State University, Ogden, UT
References: <20bfrm$le7@pdq.coe.montana.edu> <20cab6$b2d@binkley.cs.mcgill.ca> <C990xF.43n@sneaky.lonestar.org>
Date: Tue, 29 Jun 93 18:17:49 GMT
Lines: 78

In article <C990xF.43n@sneaky.lonestar.org> gordon@sneaky.lonestar.org (Gordon Burditt) writes:
>>	Whenever I tried to play it more than once or twice, it would
>>	die (on start up) complaining of some "init-prob error on 4 (215%)"
>>	or SOMETHING like that (my memory fails me).
>
>I get this also, sometimes.  The error suggests that the sum of the 
>probabilities in some table don't add to 100%.  In this case, since 
>the program worked once, it means the tables have been trashed.
>
>Some additional information:
>
>After a failure, on a quiet system, you get the same failure, over and
>over.  If you compare the installed executable vs. the one in the
>build directory, they are identical.  HOWEVER, if you copy the
>executable in the build directory over the installed one, the problem
>seems to go away.  For a while.
>
>After a failure, doing something time-consuming and disk-intensive,
>like building a kernel or grepping the news spool, the problem may
>go away.
>
>Conclusion, totally without proof:
>
>Modified data is getting cached somewhere, probably in the VM system.
>I suspect Nethack is modifying read-only storage somehow and the 
>modified version is getting re-used.  But I haven't found the place
>where it's doing it.

This shouldn't be true.  The data cache for a particular inode is in a
list off the vnode for that inode.  The in core memory structure for the
inode is a copy of the inode off the disk as a substructure of the in core
vnode -- there is no raw in core inode.

When the reference count on the vnode goes to 0, it is placed on the free
list.  Writeback of modified data takes place at this time.
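
The reference-count behaviour described above can be sketched in a few lines
of C.  This is a simplified model, not the real 386BSD code: every name in
it (sketch_vnode, sketch_vref, sketch_vrele, sketch_freelist) is
hypothetical, and the "writeback" is only a counter standing in for pushing
dirty buffers to disk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT the real 386BSD structures. */
struct sketch_inode {
    unsigned long ino;              /* inode number copied from disk */
};

struct sketch_vnode {
    int usecount;                   /* active references */
    bool dirty;                     /* cached data modified since last writeback */
    int writebacks;                 /* number of writebacks, for illustration */
    struct sketch_inode inode;      /* in-core inode lives inside the vnode */
    struct sketch_vnode *freenext;  /* link on the free list */
};

struct sketch_vnode *sketch_freelist = NULL;

/* Take a reference on the vnode. */
void sketch_vref(struct sketch_vnode *vp)
{
    vp->usecount++;
}

/* Drop a reference; on the last one, write back and move to the free list. */
void sketch_vrele(struct sketch_vnode *vp)
{
    assert(vp->usecount > 0);
    if (--vp->usecount > 0)
        return;                     /* still in use elsewhere */
    if (vp->dirty) {
        /* A real kernel would flush the dirty buffers to disk here. */
        vp->writebacks++;
        vp->dirty = false;
    }
    vp->freenext = sketch_freelist;
    sketch_freelist = vp;
}
```

In this model, modified data cannot outlive the last reference without
being written back, which is why a stale modified copy should not be
reachable through the vnode cache alone.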

Since the text pages are marked read only, writes must be occurring
that are not trapped.  This is not an unreasonable assumption, given that
the 386 does not trap every write to a read-only page -- a protection
fault is generated only for writes made in user mode, not for writes made
in supervisor mode (ring 0).

The symptoms you are seeing come from data pages being written to without
first being copied on write to swap.  This can result from one of four
situations:

1)	The data pages are being written back to the file.
2)	You are running multiple copies simultaneously so that data
	pages are being written only in core, but the core copy is
	shared.
3)	Some global data is assumed to be aggregate initialized (most
	likely to 0) by the compiler, yet this is not occurring (i.e., a
	compiler bug).
4)	Some stack variables are being used before they are initialized,
	and you are getting the same pages for your stack on consecutive
	runs.
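
For comparison with situation 1, here is a minimal user-level sketch of
what correct copy-on-write behaviour looks like on a POSIX system that
provides mmap() with MAP_PRIVATE (an assumption; this is not a test of
386BSD itself).  The function name and the temporary file path template
are made up for illustration.  A write through a private mapping should be
visible in the mapping but should never reach the file.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 1 if a write through a private mapping left the file unchanged,
 * 0 if the write leaked into the file, and -1 if the setup failed.
 */
int sketch_private_write_stays_private(void)
{
    char path[] = "/tmp/cow_sketch_XXXXXX";   /* hypothetical scratch file */
    const char original[] = "probability table";
    char onfile[sizeof original];
    char *p;
    ssize_t n;
    int fd, unchanged;

    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                             /* file vanishes on close */

    if (write(fd, original, sizeof original) != (ssize_t)sizeof original) {
        close(fd);
        return -1;
    }

    p = mmap(NULL, sizeof original, PROT_READ | PROT_WRITE,
             MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        close(fd);
        return -1;
    }

    p[0] = 'X';                               /* triggers the private copy */
    assert(p[0] == 'X');                      /* the mapping sees the write */

    /* Re-read the file itself; it should still hold the original bytes. */
    n = pread(fd, onfile, sizeof onfile, 0);
    unchanged = (n == (ssize_t)sizeof onfile) &&
                memcmp(onfile, original, sizeof original) == 0;

    munmap(p, sizeof original);
    close(fd);
    return unchanged;
}
```

A return value of 0 from a check like this would correspond to situation
1: modified data pages finding their way back to the file.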

The way this is handled in most protected mode OSes on the brain-damaged
Intel architecture is to ensure that copy-on-write data pages are marked
read only, and that copy-on-write is actually handled during the trap...
this implies a reverse (or indexed) lookup to determine if the trap is
occurring on a real read-only page or on a copy-on-write page that was
marked read only just to generate the trap.
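
The lookup described above might be sketched as follows.  This is an
illustration under assumptions, not the patchkit code: the map entry
layout and all sketch_* names are hypothetical.  The idea is that the
hardware sees both kinds of page as read only, and the fault handler
consults the address map to recover the logical protection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* What the fault handler decides to do (hypothetical names). */
enum sketch_fault_action {
    SKETCH_FAULT_COPY,     /* copy-on-write page: copy it, then allow write */
    SKETCH_FAULT_SIGNAL    /* genuinely read only or unmapped: signal */
};

/* One region of the address space; hardware marks all of them read only. */
struct sketch_map_entry {
    unsigned long start;   /* first address in the region */
    unsigned long end;     /* one past the last address */
    bool writable;         /* logical protection: may the process write? */
};

/* Find the map entry covering addr, or NULL if the address is unmapped. */
const struct sketch_map_entry *
sketch_lookup(const struct sketch_map_entry *map, int n, unsigned long addr)
{
    int i;

    assert(n >= 0);
    for (i = 0; i < n; i++)
        if (addr >= map[i].start && addr < map[i].end)
            return &map[i];
    return NULL;
}

/* Classify a write fault taken on a page the hardware sees as read only. */
enum sketch_fault_action
sketch_write_fault(const struct sketch_map_entry *map, int n,
                   unsigned long addr)
{
    const struct sketch_map_entry *e = sketch_lookup(map, n, addr);

    if (e == NULL || !e->writable)
        return SKETCH_FAULT_SIGNAL;   /* real read-only page (e.g. text) */
    return SKETCH_FAULT_COPY;         /* data page awaiting its private copy */
}
```

With a text region marked not writable and a data region marked writable,
a faulting write to the text region yields SKETCH_FAULT_SIGNAL and one to
the data region yields SKETCH_FAULT_COPY.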

A piece of this solution was in the first patchkit as part of the write
fault fix.

This is part of the general solution to the overall problem of using your
real program image as a swap store instead of using real swap.

Implementations, anyone?


					Terry Lambert
					terry@icarus.weber.edu
---
Any opinions in this posting are my own and not those of my present
or previous employers.