*BSD News Article 69520


Return to BSD News archive

Path: euryale.cc.adfa.oz.au!newshost.anu.edu.au!harbinger.cc.monash.edu.au!nntp.coast.net!swidir.switch.ch!swsbe6.switch.ch!news.belnet.be!news.rediris.es!acebo.sdi.uam.es!b12mc6.cnb.uam.es!user
From: jrvalverde@samba.cnb.uam.es (jr)
Newsgroups: comp.unix.bsd.freebsd.misc
Subject: Re: Signal 11
Date: Mon, 27 May 1996 19:02:45 +0100
Organization: Centro Nacional de Biotecnologia
Lines: 87
Message-ID: <jrvalverde-2705961902450001@b12mc6.cnb.uam.es>
References: <nD356D43A@longacre.demon.co.uk>
NNTP-Posting-Host: b12mc6.cnb.uam.es
X-Newsreader: Value-Added NewsWatcher 2.0b24.0+

In article <nD356D43A@longacre.demon.co.uk>, searle@longacre.demon.co.uk
(Michael Searle) wrote:

> Does processes exiting on signal 11 always mean bad hardware (probably
> memory or mainboard), or can they be caused by other things (like buggy
> executables)? I have had them occasionally, but mostly on new software I
> hadn't tried before. I have never had gcc failing (and I have done several
...

And many people answered... Lemme try too.

I have had the same problem with my brand-new Pentium-133 with 32 MB of
EDO RAM and plenty of swap space. I monitored memory consumption and almost
never touched swap before the signal 11. I could reproduce the behavior
under Windows 3.1 (with lots of difficulty, but that's a crappy system),
FreeBSD and Linux.

I guess the problem with Win 3.1 was that I could hardly push the hardware
as much as with the other OSes, so it failed less. Also, I hardly use
Win 3.1 at all.

The problem could be worked around by retrying the command and, when that
failed, by cleaning up memory (to flush the ghost(+) of the last run so it
can't be reused) - which you can do with a 'dd' copy from the hard disk
into memory or, as I also tried, with an ad-hoc program.

(+) Unix keeps the pages of recently run programs in memory in case you
run them again: they are already in-core and need not be re-read from
disk, giving a faster response. That cached copy is what I call the ghost
or old carcass (my term).
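A minimal sketch of that dd trick - stream more data through RAM than the
machine holds, so the kernel evicts the cached pages. The device and sizes
below are illustrative, not my exact command (I read from the hard disk;
/dev/zero is a safe stand-in here):

```shell
# Push ~64 MB through memory so the kernel evicts the "ghost" pages
# of previously run programs. On a 32 MB machine this flushes the
# whole page cache. Input device and count are example values.
dd if=/dev/zero of=/dev/null bs=64k count=1024
```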

If I kept running the system in spite of that, sooner or later the
filesystem or the kernel would go berserk. With the FS it meant i-nodes
that weren't correct in the kernel's eyes. With the kernel it meant I
couldn't run any more programs, or I got a memory error with details
about the corrupt virtual page.

All this was similar under Linux and FreeBSD, running different versions
of GCC, and also running different programs.

Heavy load would delay the crashes - I assume by flushing the ghosts and
making it more difficult for already-loaded programs to be reused. It was
more frequent when compiling the kernel, and while the CD-ROM was running.
It was less frequent in February (colder weather here) than in April
(milder temperatures).

Suggestions:

   - Bad RAM
   - Bad HD
   - Bad cache
   - Bad bus/motherboard
   - Bad CPU
   - Bad cooler -> overheating
   - Interference from other devices
   - Bad swap partition
   - Erroneous disk <-> memory transfers
   - All or any combination of the above
   - Others.

I have taken my machine in for repair (it's still under warranty) but I
seriously doubt they'll find anything wrong, since I'm sure they'll only
test it under DOS/Windows, which doesn't let you push the system nearly as
hard. (Yep! I just called them and they confirmed this: they haven't found
anything with their off-the-shelf test programs for Windows; they'll try
a script I left for UNIX next.)

   You should also have a look at this URL. It will tell you a lot (I
discovered it too late):

      http://www.bitwizard.nl/sig11/

   In short, it is most probably a hardware problem due to current faster
CPUs pushing the limits of borderline hardware.

   It can be difficult to detect, but the URL gives some help. The most
difficult part will probably be proving the problem to your vendor, since
it is hard to find a Windows program that stresses the hardware as
strongly as UNIX allows GCC to, and they surely only speak DOS/Windows...
It's even worse here, where they don't even know English, and I suspect
they won't understand the text at that URL (sigh).

   What I have given them is a script that runs 'make clean' and
recompiles the kernel repeatedly, comparing the output of each make with
the next to see if there are differences (which can only come from
errors). In my case 10 compiles would give 3-4 failures, but you might
need more.

                              jr