*BSD News Article 8501


Xref: sserve comp.unix.ultrix:15351 comp.sys.dec:10554 comp.unix.bsd:8557
Path: sserve!manuel.anu.edu.au!munnari.oz.au!news.hawaii.edu!ames!saimiri.primate.wisc.edu!zaphod.mps.ohio-state.edu!wupost!uunet!mcsun!Germany.EU.net!rrz.uni-koeln.de!IKP.Uni-Koeln.DE!se
From: se@IKP.Uni-Koeln.DE (Stefan Esser)
Newsgroups: comp.unix.ultrix,comp.sys.dec,comp.unix.bsd
Subject: Better matching SCSI drive characteristics (with patch for 386BSD) (was Re: installation of a scsi disk)
Date: 2 Dec 1992 18:58:09 GMT
Organization: Institute for Mathematics, University of Cologne, Germany
Lines: 225
Distribution: world
Message-ID: <1fj101INNdeu@rs1.rrz.Uni-Koeln.DE>
References: <2021@nikhefh.nikhef.nl> <2022@nikhefh.nikhef.nl> <1992Nov20.215457.1595@nntpd2.cxo.dec.com> <1992Nov25.185108.23362@infodev.cam.ac.uk> <1992Nov27.233646.16972@nntpd2.cxo.dec.com>
NNTP-Posting-Host: snert.ikp.uni-koeln.de
Keywords: ultrix Seagate SCSI

In article <1992Nov27.233646.16972@nntpd2.cxo.dec.com>, alan@nabeth.enet.dec.com (Alan Rollow - Alan's Home for Wayward Tumbleweeds.) writes:
|> >It's high time the filing system STOPPED TRYING to understand geometry
|> >of SCSI discs, in my opinion.
|> 
|> Agreed.  And as it turns out, if you don't use rotational delay,
|> about the only bit of geometry information used by newfs(8) is
|> to setup the cylinder groups.  You could get really fancy and
|> partition along the zones and setup different entries for the
|> different zones (if you thought it mattered enough):
|> 
|> disk-foo-zone-1:mumble:whatever:\
|> 	:ns#zone-1-sectors:nt#tracks:nc#whatever-works:\
|> 
|> It is starting to look like disks are more and more often starting
|> to push the boundary conditions assumed by newfs(8).

It requires only a trivial change in alloccgblk() to add an
allocation scheme much better suited to SCSI disks!

(I didn't perform extensive tests to verify the improvement, but it's
a minor change and it prevents some unnecessary seeks and lost
revolutions due to incorrect assumptions about the drive geometry,
so it ought to be worth the effort.)

The relevant code is (taken from ufs_alloc.c from the 386BSD sources, 
but present in at least BSD4.2, BSD4.3 and several Ultrix releases):

        if (cg_blktot(cgp)[cylno] == 0)
                goto norot;
        if (fs->fs_cpc == 0) {
                /*
                 * block layout info is not available, so just have
                 * to take any block in this cylinder.
                 */
/***/           bpref = howmany(fs->fs_spc * cylno, NSPF(fs));
                goto norot;
        }

The only change required is to remove the marked line (/***/), which
resets 'bpref' to the start of the cylinder.
This doesn't have any negative effect, since that line is only executed
in case of a 'misconfiguration':

The condition 'fs->fs_cpc == 0' is true iff there are too many sectors
per track for the kernel's "rotational positions table". This is triggered
by setting the number of heads to '1' and the number of sectors per track
to the real number of sectors per cylinder (which on quite a few drives
isn't a multiple of the number of heads or the number of sectors per track!),
or to some 1000 sectors for ZBR drives.

The result is that the file system prefers allocation in ascending logical
block number, spiraling over the blocks of the cylinder. This differs
from the normal behaviour when the file system block succeeding the last one
allocated to a file is unavailable: the BSD FFS then allocates a free block
under another head if there seems to be one at a rotationally near position.
Given that most SCSI drives lie about their geometry, this doesn't work
well with them (even if they don't use ZBR).

With the above one-line patch (commenting out the marked line) and a
disktab entry specifying ONE head per cylinder, the allocation becomes
much better suited to SCSI drives.

Advantages:

Switching heads makes the SCSI drive's read-ahead caching useless.
It's often better to skip a block or two than to switch heads.

If the drive incorporates track skew to compensate for the head switch
time, then the rotational positions can't be computed the way the FFS
considers right. In this case the FFS loses, on average, half a
revolution of the disk when switching heads.

Even worse, if the drive has a certain number of alternate sectors per
cylinder, then the cylinder boundary isn't where the FFS expects it to be!
This means that the FFS optimizations, which try to keep the blocks of
a file within one cylinder, now tend to spread the file over two adjacent
cylinders.

E.g. my Fujitsu M2266 has 85 sectors/track and 15 heads. The cylinder thus
contains 1275 sectors, but 3 of them are alternates and only 1272 are
available to the file system. With 'ns#85:nt#15', the first FFS cylinder
extends 3 sectors into the second physical cylinder, the second FFS
cylinder 6, ...
At cyl. 212 the distance has grown to half a cylinder (212 * 3 = 636
sectors), so the first half of FFS cyl. 212 resides on drive cyl. 212 and
the second half on drive cyl. 213. This results in seeks between
cylinders 212 and 213 when the FFS code believes it is just switching heads.

When the (patched) FFS sees this drive as 'ns#1272:nt#1', it does the right 
thing! Blocks are allocated within the physical cylinder, head switching 
can happen in the drive, but the FFS doesn't need to know about that.

This applies to ZBR drives as well, since switching heads results in
a loss of half a revolution plus head switch time on them too.
When just allocating blocks in ascending order, the seek to the next
cylinder happens just once (the FFS can't be told where the cyl. boundary
really is), but that's not bad compared to a worst-case scenario of some
50 seeks that may result from the standard FFS allocation scheme on such
a drive.


I nearly forgot to explain why it's not enough to just mkfs a filesystem
with one head and secpertrk := secpercyl (i.e. heads * the real
sectors per track) specified.

The line

		bpref = howmany(fs->fs_spc * cylno, NSPF(fs));

sets the preferred block to the first block of the current cylinder.
The kernel arrives there only if the preferred block (usually the one
succeeding the last one allocated to that file) is unavailable.
If 'bpref' is not reset but is allowed to keep the value it had on
entry to 'norot', then the kernel allocates the next free block
after bpref (wrapping around to the start of the cylinder in case
there wasn't any free block).

'bpref' has already been verified to be a valid block number, and it has
been checked at the start of the alloccgblk() subroutine that there is
at least one free block in that cylinder.

So the patch can't do any harm: it doesn't change the layout policy
unless the file system has been created with 'ns' of at least a few
hundred (I didn't check the exact limit; >1000 works for me...).

! Without the patch applied, specifying eg. ns#1000 leads to a very 
! bad allocation scheme. The kernel will scan for a free block 
! from the beginning of the cylinder, resulting in severely fragmented 
! files and wasting lots of CPU cycles ...

The patch can be used on systems with a mix of SCSI and e.g. IPI drives.
The original BSD-FFS layout policy will be used for the IPI drives,
which have been 'newfs'ed with their physical geometry data as usual
(and not with nt#1).

There are other (more important) improvements that ought to be applied 
to the FFS to better work with SCSI drives (eg. > 8KB transfers), but 
since this one is that simple and doesn't have any negative impact, I'd 
like to see it incorporated into at least 386BSD.

#>>>> In case somebody wants to try it, here is a patch. 
#>>>> I had prepared it some time ago but never sent.
#>>>> It contains some comments to become part of the
#>>>> patched ufs_alloc.c.

#>>>> This has been tested on several DECstations running 
#>>>> Ultrix 4.1-4.2a, and never failed over the last 2 years ...
#>>>> (I had to apply a binary patch on these systems,
#>>>>  since I don't have access to Ultrix sources.)

#>>>> It won't change ANYTHING in the behaviour of your system,
#>>>> unless you create a new file system with a large number
#>>>> of sectors per track. This is best done by specifying
#>>>> ns#1000:nt#1 in the drive's definition in /etc/disktab
#>>>> or by specifying these values directly to mkfs.

I'd like to get some feedback, in case you try it ...

STefan



*** ufs_alloc.c~	Fri Aug 28 11:14:03 1992
--- ufs_alloc.c	Fri Aug 28 11:47:23 1992
***************
*** 715,719 ****
  		 * to take any block in this cylinder.
  		 */
! 		bpref = howmany(fs->fs_spc * cylno, NSPF(fs));
  		goto norot;
  	}
--- 715,766 ----
  		 * to take any block in this cylinder.
  		 */
! 		/****
! 		 * the standard BSD distributions probably
! 		 * back to the first 4.2 release do the following,
! 		 * but I don't see any advantage in doing so.
! 		 * If it is commented out, SCSI drives can be supported
! 		 * better, because it's possible to force a reasonable
! 		 * block layout for an SCSI drive without changing 
! 		 * the behaviour for other drives ...
! 		 */
! 	/*	bpref = howmany(fs->fs_spc * cylno, NSPF(fs));	*/
! 		/*
! 		 * The above line forced the norot code to scan
! 		 * for a free block starting at the <<beginning 
! 		 * of the current cylinder>>. I don't see any reason 
! 		 * for doing this, it should always be better to use 
! 		 * the <<current value of bpref>> as a starting point 
! 		 * (which is already guaranteed to be valid for this 
! 		 * purpose). By doing this, the BSD FFS block allocation 
! 		 * scheme which uses heuristics generally not applicable 
! 		 * to current SCSI drives, can be selectively switched 
! 		 * to a mode which doesn't make as many assumptions on
! 		 * the exact knowledge of the drive geometry and 
! 		 * which takes better advantage of the preread cache
! 		 * common on SCSI drives. To enable this mode, simply
! 		 * use a disktab entry with more sectors per track,
! 		 * than can be dealt with in the table used for finding 
! 		 * a rotational near block (which doesn't work on SCSI
! 		 * drives anyway). By declaring a track to contain some 
! 		 * 1000 sectors (I use the number of *data* sectors per 
! 		 * cylinder) the allocation now prefers a sector with 
! 		 * a slightly higher logical block number than the last 
! 		 * one used for the file. This increases the probability 
! 		 * of finding the data in the drive's preread cache.
! 		 * This is of much higher importance when using ZBR
! 		 * (zone bit recording) drives, where it's impossible
! 		 * to specify any geometry near that required by the 
! 		 * BSD FFS block allocation heuristics.
! 		 * Another problem with FFS disktab specifications not
! 		 * matching the actual geometry of the drive is, that 
! 		 * the FFS tries to allocate a sector within the same 
! 		 * *cylinder*, but in fact without knowing the borders
! 		 * of cylinders on most SCSI drives. This normally 
! 		 * leads to many unnecessary next-track seeks, since
! 		 * blocks get allocated spread over the second half of 
! 		 * one cylinder and the first one of the next because
! 		 * of wrong assumptions about the cylinder borders.
! 		 * 920828, Stefan Esser, <se@ikp.uni-koeln.de>
! 		 ****/
  		goto norot;
  	}

-- 
 Stefan Esser,  Institute of Nuclear Physics,  University of Cologne,  Germany
 se@IKP.Uni-Koeln.DE                                           [134.95.192.50]