*BSD News Article 18679

Path: sserve!newshost.anu.edu.au!munnari.oz.au!news.Hawaii.Edu!ames!agate!howland.reston.ans.net!xlink.net!math.fu-berlin.de!fub!unlisys!max.IN-Berlin.DE!not-for-mail
From: berry@moritz.IN-Berlin.DE (Stefan Behrens)
Newsgroups: comp.os.386bsd.questions
Subject: Re: drive light on and locked -- w/ patch
Date: 21 Jul 1993 00:19:45 +0200
Organization: Private Site in Berlin, Germany
Lines: 309
Message-ID: <22hr2f$hm@moritz.in-berlin.de>
References: <22gd69$6j3@hrd769.brooks.af.mil>
NNTP-Posting-Host: moritz.in-berlin.de
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Summary: patch for 386bsd0.1 pk0.2.4 that detects and clears wd-ctlr lockups

In article <22gd69$6j3@hrd769.brooks.af.mil> burgess@hrd769.brooks.af.mil (Dave Burgess) writes:
>There was some discussion about three weeks ago about the system hang
>where the hard drive locks up with the drive light lit.  Someone,
>whose name I have regretfully forgotten, posted either a description of
>a change that (s)he had made to the system that reset the drive.

I cannot remember seeing a patch for it myself, nor can egrep find one.
But I have also had the problem with the locked IDE controller ever
since I installed 386bsd one year ago.
It's easy to recognize the lockup, and it's easy to recover from it.
But I still don't know why it happens: why the driver loses the
interrupt from the controller, or maybe why the controller doesn't
generate the expected interrupt.
I wrote some code to catch the problem. For every block read or write
request it checks whether the controller answers in time.
This is easy to do for IDE controllers and adds little overhead.


>In case I dreamed it (which has happened more than once), I propose a
>simple solution (sweets from the sweet :-).
>
>On a disk {read,select} start up an alarm that times out in three
>seconds.  On a successful operation from the disk, clear the alarm.
>When the alarm expires, that means that no disk activity has succeeded
>in the last three seconds, which would seem to me to be a good indicator
>that the drive/controller has seized up again.

This really is what my code does. But it isn't necessary to start a
timeout for every request: IDE controllers are so stupid that they
can only handle one request at a time. So my code does the following:

o in general -- for every attached drive a function is called every
  two seconds; it checks for a request in progress that the controller
  hasn't answered
o in detail:
  - for each drive a function is called periodically and a timeout
    counter is maintained
  - when a read/write request is started, the counter is set to two
  - when the expected interrupt comes in, the counter is reset to zero
  - in the timeout function, when the counter is > 0, it's decremented
    o if it's decremented the first time (so it's one then) nothing is
      done
    o if it's decremented the second time (and zero then) the controller
      has timed out...
    o ...when it has timed out:
      - status/debug info is printed
      - the failure is logged
      - the controller is reset in order to put it in a known state
      - the request is restarted in a `sector by sector' way, which
        means multi-sector transfers are split up
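
The counter scheme above can be tried outside the kernel. What follows
is only a minimal user-space sketch of the bookkeeping, not the driver
code from the patch: NUNITS, the enum and the watchdog_* function names
are made up for illustration, and the reset and restart steps are left
out.

```c
#include <assert.h>

#define NUNITS 2	/* hypothetical number of attached drives */

/* What one periodic check found for a drive (names are illustrative). */
enum wd_event { WD_IDLE, WD_WAITING, WD_TIMED_OUT };

/* 0 = not armed, 2 = freshly armed, 1 = one check has already passed */
static int counter[NUNITS];

/* Call when a read/write request is handed to the controller. */
void watchdog_arm(int unit)
{
	assert(unit >= 0 && unit < NUNITS);
	counter[unit] = 2;
}

/* Call from the interrupt path when the controller answers. */
void watchdog_disarm(int unit)
{
	assert(unit >= 0 && unit < NUNITS);
	counter[unit] = 0;
}

/* Call periodically (every two seconds in the post) for each drive. */
enum wd_event watchdog_tick(int unit)
{
	assert(unit >= 0 && unit < NUNITS);
	if (counter[unit] == 0)
		return WD_IDLE;		/* nothing outstanding */
	if (--counter[unit] == 0)
		return WD_TIMED_OUT;	/* second check without an answer */
	return WD_WAITING;		/* first check: one more chance */
}
```

Starting at two rather than one matters because the periodic check is
not synchronized with the request: the first check may come right after
the request was started. With a two second period a request is
therefore declared lost somewhere between two and four seconds after it
was armed.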


I used code similar to this in a 386bsd0.1 pk0.2.2 environment with the
Barsoom wd-driver (used in NetBSD too) and Bruce's intr/npx/com stuff
for months.
This new patch (and first public posting) is against 386bsd0.1 pk0.2.4.
It should be a two-line change to use it for NetBSD, but I didn't try it.
The code for the detection of the problem is very well tested and very
old. It won't hurt systems that don't show the problems. The code for
solving the problem and for restarting the request is newer.
I have two machines up which run that code: one with a single IDE
drive, and one with two IDE drives and some SCSI devices. The first and
newer computer never had problems with locked wd controllers, and the
code doesn't hurt there either. The second machine, which is a server,
used to fail very often because of this. Now it detects the lock,
resets the drive and restarts the request. The problem is finally
solved for me.


>Note:  I am using the current-sources from sun-lamp using sup.

NetBSD uses the Barsoom wd-driver. It should be easy to change the
following patch for this driver.


>The drive does not fail like this with the 0.8
>released kernel, or from the sources that came out with the original 0.8
>release.

It happens in ``cooperation'' with some other drivers, e.g. with the
com driver or with the we-ethernet driver (for me).


Maybe someone who knows more about IDE controllers can comment on this:
in the situation where the controller gets locked, the status is
- inb(wdc+wd_status) --> WDCS_READY|WDCS_SEEKCMPLT|WDCS_DRQ
  which means the drive is ready, seek completed and the data request
  bit is set
- inb(wdc+wd_error) --> is 0
- the request is a multi sector block read request
- the action that helps for me is to restart the request with
  `du->dk_skip = 0;' and `du->dk_flags |= DKFL_SINGLE;'
- without enforcing the redo in single sector steps the restart didn't
  succeed
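
For anyone who wants to check the status value by hand: the patch
prints status=58 (hex) in this situation. The small decoder below is
only a sketch. The bit values are my assumption of the usual AT disk
status register layout (they are not copied from the driver's header),
and the table and function names are invented for this example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumed AT status register layout, highest bit first. */
struct status_bit { unsigned mask; const char *name; };

static const struct status_bit status_bits[] = {
	{ 0x80, "BUSY" },
	{ 0x40, "READY" },
	{ 0x20, "WRTFLT" },
	{ 0x10, "SEEKCMPLT" },
	{ 0x08, "DRQ" },
	{ 0x04, "ECCCOR" },
	{ 0x02, "INDEX" },
	{ 0x01, "ERR" },
};

/* Write the names of all set bits into buf, separated by '|'. */
void wd_status_names(unsigned status, char *buf, size_t len)
{
	size_t used = 0, i;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof status_bits / sizeof status_bits[0]; i++) {
		int n;

		if (!(status & status_bits[i].mask))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
		    used ? "|" : "", status_bits[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;		/* out of room: stop here */
		used += (size_t)n;
	}
}
```

Under these assumptions 0x58 is 0x40 + 0x10 + 0x08, so the decoder
gives READY|SEEKCMPLT|DRQ, the same three bits as listed above: the
drive claims to have data waiting, but nobody is fetching it.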


OK, after all that boring talk, here's the patch against 386bsd0.1 pk0.2.4.
Try it: it won't hurt, and it should save you many system lockups!


*** /tmp/,RCSt1000483	Tue Jul 20 23:45:22 1993
--- wd.c	Tue Jul 20 23:33:01 1993
***************
*** 56,61 ****
--- 56,63 ----
   * 17 May 93	Rodney W. Grimes	Fixed all 1000000 to use WDCTIMEOUT,
   *					and increased to 1000000*10 for new
   *					intr-0.1 code.
+  * 15 Jul 93	Stefan Behrens		Added real timeout code to catch
+  * 					hanging controller
   */
  
  /* TODO:peel out buffer at low ipl, speed improvement */
***************
*** 150,155 ****
--- 152,174 ----
  int	wddebug;
  #endif
  
+ /*
+  * counter for lost int detection.
+  * Three values are used:
+  * 2 -- this is the initial value when the timeout is armed
+  * 1 -- timed out once, give it one more chance
+  * 0 -- timeout not armed, idle
+  * The per-drive-counter is an overkill here, for wd-controller a
+  * per-controller-counter would be enough. But this way it doesn't
+  * add new restrictions to the driver, and it's simple.
+  */
+ int	wdtimeout_counter[_NWD];
+ 
+ /*
+  * during recovery of timed out requests this counter is used
+  */
+ int	wdtimeout_retry[_NWD];
+ 
  struct	isa_driver wddriver = {
  	wdprobe, wdattach, "wd",
  };
***************
*** 160,165 ****
--- 179,185 ----
  int wdcontrol(struct buf *);
  int wdsetctlr(dev_t, struct disk *);
  int wdgetctlr(int, struct disk *);
+ int wdtimeout(caddr_t);
  
  /*
   * Probe for controller.
***************
*** 227,232 ****
--- 247,257 ----
  			du->dk_port = dvp->id_iobase;
  		}
  
+ 		wdtimeout_retry[unit] =
+ 		wdtimeout_counter[unit] = 0;	/* not armed yet */
+ 		wdtimeout((caddr_t) unit);		/* initially set timeout */
+ 
+ 
  		/* print out description of drive, suppressing multiple blanks*/
  		if(wdgetctlr(unit, du) == 0)  {
  			int i, blank;
***************
*** 402,408 ****
  			(bp->b_flags & B_READ) ? "read" : "write",
  			bp->b_bcount, blknum);
  	else
! 		printf(" %d)%x", du->dk_skip, inb(wdc+wd_altsts));
  #endif
  	addr = (int) bp->b_un.b_addr;
  	if (du->dk_skip == 0)
--- 427,433 ----
  			(bp->b_flags & B_READ) ? "read" : "write",
  			bp->b_bcount, blknum);
  	else
! 		printf(" %d)%x", du->dk_skip, inb(du->dk_port+wd_altsts));
  #endif
  	addr = (int) bp->b_un.b_addr;
  	if (du->dk_skip == 0)
***************
*** 537,543 ****
  	}
  
  	/* if this is a read operation, just go away until it's done.	*/
! 	if (bp->b_flags & B_READ) return;
  
  	/* ready to send data?	*/
  	timeout = 0;
--- 562,573 ----
  	}
  
  	/* if this is a read operation, just go away until it's done.	*/
! 	if (bp->b_flags & B_READ) {
! 		wdtimeout_counter[unit] = 2;	/* arm timeout counter */
! 		if (wdtimeout_retry[unit])
! 			printf("wd.c: retry block read\n");
! 		return;
! 	}
  
  	/* ready to send data?	*/
  	timeout = 0;
***************
*** 561,566 ****
--- 591,599 ----
  	outsw (wdc+wd_data, addr+du->dk_skip * DEV_BSIZE,
  		DEV_BSIZE/sizeof(short));
  	du->dk_bc -= DEV_BSIZE;
+ 	wdtimeout_counter[unit] = 2;	/* arm timeout counter */
+ 	if (wdtimeout_retry[unit])
+ 		printf("wd.c: retry block write\n");	/* never seen this */
  }
  
  /* Interrupt routine for the controller.  Acknowledge the interrupt, check for
***************
*** 588,593 ****
--- 621,630 ----
  	du = wddrives[wdunit(bp->b_dev)];
  	wdc = du->dk_port;
  
+ 	wdtimeout_counter[wdunit(bp->b_dev)] = 0;	/* unarm counter */
+ 	wdtimeout_retry[wdunit(bp->b_dev)] = 0; /* start from zero */
+ 
+ 
  #ifdef	WDDEBUG
  	printf("I ");
  #endif
***************
*** 1349,1352 ****
--- 1386,1463 ----
  	}
  	return(0);
  }
+ 
+ /*
+  * called periodically every two seconds for each attached drive.
+  * check if the drive didn't answer in time.
+  */
+ int
+ wdtimeout(caddr_t arg)
+ /* arg is # of unit */
+ {
+ 	int x = splbio();	/* XXX kept all the time */
+ 	register int unit = (int) arg;
+ 
+ 	if (wdtimeout_counter[unit]) { /* armed? */
+ 		if (--wdtimeout_counter[unit] == 0) { /* timed out? */
+ 			struct disk *du = wddrives[unit];
+ 			int wdc = du->dk_port;
+ 
+ 			wdtimeout_retry[unit]++;
+ 
+ 			/* log failure */
+ 			printf("wd.c: wd%d timed out, retry #%d\n",
+ 			       unit, wdtimeout_retry[unit]);
+ 
+ 			/* reset ctlr and redo request */
+ 			switch (wdtimeout_retry[unit]) {
+ 			case 1: /* for me one retry is enough :-) */
+ 			case 2:
+ 			case 3:
+ 				/*
+ 				 * print some status info
+ 				 *
+ 				 * The values I see for my system are:
+ 				 * status=58, error=0
+ 				 * That means status is:
+ 				 * WDCS_READY		Selected drive is ready
+ 				 * WDCS_SEEKCMPLT	Seek complete 
+ 				 * WDCS_DRQ		Data request bit.
+ 				 * Does anyone know a reason for the timeout?
+ 				 */
+ 				printf("wd.c: wd%d status %x, error %x\n",
+ 				       unit, inb(wdc+wd_status),
+ 				       inb(wdc+wd_error));
+ 				/*
+ 				 * reset the device, give it a known state
+ 				 */
+ 				outb(wdc+wd_ctlr, (WDCTL_RST|WDCTL_IDS));
+ 				DELAY(1000);
+ 				outb(wdc+wd_ctlr, WDCTL_IDS);
+ 				DELAY(1000);
+ 				(void) inb(wdc+wd_error);	/* XXX! */
+ 				outb(wdc+wd_ctlr, WDCTL_4BIT);
+ 				/*
+ 				 * we'll redo the xfer sector by sector.
+ 				 * This is the trick that helps here!
+ 				 */
+ 				du->dk_skip = 0;	/* start at #0 again */
+ 				du->dk_flags |= DKFL_SINGLE;	/* slow down */
+ 				break;
+ 			default: /* give up -- never happened for me */
+ 				panic("cannot solve problem with hanging wd ctrl");
+ 				break;
+ 			}
+ 
+ 			/* restart request */
+ 			wdstart();
+ 		}
+ 	}
+ 
+ 	/* plan next timeout */
+ 	timeout(wdtimeout, unit, 200);
+ 	splx(x);
+ 	return (0);
+ }
+ 
  #endif
-- 
Stefan (berry@max.IN-Berlin.DE)