*BSD News Article 4951

Newsgroups: comp.unix.bsd
Path: sserve!manuel!munnari.oz.au!spool.mu.edu!wupost!udel!sbcs.sunysb.edu!sbcs!stark
From: stark@cs.sunysb.edu (Gene Stark)
Subject: Program dies with FP Exception
Message-ID: <STARK.92Sep13002650@sbstark.cs.sunysb.edu>
Sender: usenet@sbcs.sunysb.edu (Usenet poster)
Nntp-Posting-Host: sbstark
Organization: SUNY at Stony Brook Computer Science Dept.
Date: Sun, 13 Sep 1992 05:26:50 GMT
Lines: 68

Here's a tough one I've been trying to track down -- maybe somebody out there
who knows more can guess what is going on.

I am running 386BSD on a 486/33 system with 4MB RAM and a 210MB Conner IDE
drive.  A program I was working on dies with signal 8 (SIGFPE, floating point
exception) in a perfectly repeatable fashion.  It is not so easy to tell where
the exception actually comes from, though, because the signal seems to be
delivered to the process much later, as it returns from a "write" system call.
I haven't been able to write a small test program that reproduces the bug;
however, there seem to be several crucial elements involved:

	(1)  A call to "atof", which returns a double that is then
		stored in a temporary on the stack.  Removing the call
		removes the error.

	(2)  The actual magnitude of the number being converted by "atof".
		I found that the strings "1e10" and "1e12" cause the error,
		but "1e9", "1e6", and "0.0" do not.

	(3)  Some later "write" system calls.  The signal is actually
		delivered on the fourth call to write after the atof.
		What is happening in the interim is just C code without
		any other system calls.  I do not know what determines
		when the signal is finally delivered.
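The three ingredients above can be combined in a small harness such as the following C sketch. It is illustrative only: the struct layout, the function name `run_repro`, and the use of a pipe as the `write` target are assumptions, not details of the original program, and small test programs reportedly did not reproduce the fault. The `SIGFPE` handler reports how many `write` calls had completed when the signal arrived, which is one way to observe the delayed delivery.

```c
#include <assert.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-in for the structure behind lp->token.value.flot;
 * the real program's layout is not shown in the post. */
struct token {
    int kind;
    union {
        long ival;
        double flot;
    } value;
};

struct lexeme {
    struct token token;
};

/* How many write() calls have completed since the atof(). */
static volatile sig_atomic_t writes_done = 0;

/* Report how many writes finished before SIGFPE arrived, using only
 * async-signal-safe calls, then exit: returning from a handler for a
 * hardware-generated SIGFPE is undefined behavior. */
static void on_fpe(int sig)
{
    char msg[] = "SIGFPE after N writes\n";

    (void)sig;
    msg[13] = (char)('0' + writes_done % 10);  /* replace the 'N' */
    write(STDERR_FILENO, msg, sizeof msg - 1);
    _exit(1);
}

/* Convert `text` with atof(), store the result in a struct field, then
 * make `nwrites` one-byte write() calls into a pipe.  Returns the number
 * of writes that completed, or -1 if the pipe could not be created.
 * Compile without optimization, as in the original report, so that the
 * store is kept. */
int run_repro(const char *text, int nwrites)
{
    struct lexeme lex;
    struct lexeme *lp = &lex;
    int fds[2];
    int i;

    assert(nwrites >= 0);
    writes_done = 0;
    signal(SIGFPE, on_fpe);
    if (pipe(fds) != 0)
        return -1;

    lp->token.value.flot = atof(text);  /* ingredients (1) and (2) */

    for (i = 0; i < nwrites; i++) {     /* ingredient (3) */
        if (write(fds[1], "x", 1) != 1)
            break;
        writes_done = i + 1;
    }

    close(fds[0]);
    close(fds[1]);
    return writes_done;
}
```

Calling `run_repro("1e10", 4)` and `run_repro("1e9", 4)` mirrors the failing and passing inputs; on a system where no trap occurs, both simply return 4.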

After a lot of debugging, I boiled the problem down to this section of
source code:

	lp->token.value.flot = atof("1e10");

This compiles (no optimization) to the following:

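	pushl $LC10		# address of the literal "1e10"
	call _atof		# result is returned in %st(0)
	addl $-8,%esp		# make room for an 8-byte temporary
	fstpl (%esp)		# This instruction seems to be the culprit
	popl %eax		# low 32 bits of the double
	popl %edx		# high 32 bits of the double
	movl 8(%ebp),%ecx	# presumably lp, the function's first argument
	movl %eax,20(%ecx)	# low half of lp->token.value.flot
	movl %edx,24(%ecx)	# high half of lp->token.value.flot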
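At the C level, the instruction sequence above moves the double returned by `atof` (left in the x87 register `%st(0)` under the usual i386 calling convention) through an 8-byte stack temporary and copies it out as two 32-bit words. The sketch below models only that data movement, assuming a little-endian IEEE 754 machine such as the i386; the names `split_double`, `join_double`, and `struct dwords` are hypothetical, and the code does not attempt to reproduce the exception itself.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumes an 8-byte IEEE 754 double, as on the i386. */
static_assert(sizeof(double) == 2 * sizeof(uint32_t), "double must be 64 bits");

/* The two 32-bit halves, in the order the generated code pops them. */
struct dwords {
    uint32_t lo;  /* popped into %eax */
    uint32_t hi;  /* popped into %edx */
};

/* Mimic "fstpl (%esp); popl %eax; popl %edx": spill the double to an
 * 8-byte temporary and read it back as two words.  On a little-endian
 * machine the first word holds the low half of the bit pattern and the
 * second holds the sign, exponent, and top of the mantissa. */
struct dwords split_double(double d)
{
    uint32_t words[2];
    struct dwords r;

    memcpy(words, &d, sizeof words);  /* the stack temporary */
    r.lo = words[0];
    r.hi = words[1];
    return r;
}

/* Mimic the two final movl instructions: write the halves into an
 * 8-byte field (standing in for lp->token.value.flot) and read it back. */
double join_double(struct dwords w)
{
    uint32_t words[2];
    double d;

    words[0] = w.lo;  /* movl %eax,20(%ecx) */
    words[1] = w.hi;  /* movl %edx,24(%ecx) */
    memcpy(&d, words, sizeof d);
    return d;
}
```

For "1e10", `split_double` gives a low word of 0x20000000 (what ends up in `%eax`) and a high word of 0x4202A05F (`%edx`), and `join_double` reassembles the original value.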

Removing the "fstpl" instruction removes the error.  Placing the code:

	pushl $1
	call _sleep

immediately after the "fstpl" instruction also removes the error.

Taking the code out of context and putting it in a small test program
does not produce the error, so presumably there is some interaction with
the virtual memory state.

I also tried putting the instruction

	movl $0,(%esp)

just after the fstpl instruction, on the theory that maybe the fstpl
was causing a page fault with bad consequences, but this did not eliminate
the error.

So, are these enough clues that somebody who knows more than I do can
guess what the problem might be?  Any help appreciated.

						- Gene Stark