*BSD News Article 7617



Path: sserve!manuel.anu.edu.au!munnari.oz.au!news.hawaii.edu!ames!agate!toe.CS.Berkeley.EDU!bostic
From: bostic@toe.CS.Berkeley.EDU (Keith Bostic)
Newsgroups: comp.unix.bsd
Subject: Re: Largest file size for 386BSD ?
Date: 9 Nov 1992 18:58:25 GMT
Organization: University of California, Berkeley
Lines: 128
Message-ID: <1dmcchINNt54@agate.berkeley.edu>
References: <1992Nov6.031757.20766@ntuix.ntu.ac.sg> <1992Nov6.173454.17896@fcom.cc.utah.edu>
NNTP-Posting-Host: toe.cs.berkeley.edu

There are four issues for file size in a UNIX-like system:

	1: the off_t type, the file "offset" measured in bytes
	2: the logical block type, measured in X block units
	3: the physical block type, measured in Y block units
	4: the number of data blocks you can reach through the meta-data blocks

The off_t is the value returned by lseek, and in all BSD systems with
the exception of 4.4BSD, it's a 32-bit signed quantity.  In 4.4BSD,
it's a 64-bit signed quantity.  (As a side-note, this change broke
every application on the system.  The two big issues were programs that
depended on fseek and lseek returning similar values, and programs that
explicitly cast lseek values to longs.)  The 32-bit off_t limit
means that files cannot grow to be more than 2G in size; the 64-bit
limit means that you don't have to worry about it 'cause the next three
limits are going to kick in.  So, the bottom line for this limit is
2^(bits in off_t - 1) - 1 bytes, because off_t is signed and a single
out-of-band value, -1, is reserved to denote an error.
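
To make those breakage cases concrete, here's a minimal sketch (my
illustration, not code from any of the systems above), assuming a
compiler with a 64-bit "long long" and the %lld printf format; the
scratch file name is made up:

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

int
main(void)
{
	off_t off;
	long pos;
	int fd;

	if ((fd = open("scratch", O_RDWR | O_CREAT, 0666)) == -1)
		return (1);

	/* Correct: keep lseek's return value in an off_t. */
	off = lseek(fd, (off_t)0, SEEK_END);

	/* Broken idiom #1: casting the result to a long silently
	 * truncates offsets past 2G once off_t is 64 bits. */
	pos = (long)lseek(fd, (off_t)0, SEEK_END);

	/* Broken idiom #2: assuming stdio covers the same range;
	 * fseek/ftell traffic in longs, lseek in off_t's. */
	printf("sizeof(off_t) = %d, sizeof(long) = %d\n",
	    (int)sizeof(off_t), (int)sizeof(long));
	printf("off = %lld, pos = %ld\n", (long long)off, pos);
	return (0);
}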

The second limit is the logical block type, which in a BSD system is a
daddr_t, a signed 32-bit quantity.  The logical block type limits the
number of logical blocks that a file may have.  The reason that this
has to be a signed quantity is that the "name space" for logical blocks
is split into two parts, the data blocks and the meta-data blocks.
Before 4.4BSD, the FFS used physical addresses for meta-data, so that
this division wasn't necessary.  However, this implies that you know
the disk address of a block at all times.  In a log-structured file
system, since you don't know the address until you actually write the
block (for lots of reasons), the "logical" name space has to be divided
up.  In the 4BSD LFS (and the 4.4BSD FFS and the Sprite LFS) the
logical name space is split by the top bit, i.e. "negative" block
numbers are meta-data blocks.  So, the bottom line for this limit is
2^31 logical blocks in a file.
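
A tiny sketch of the split (my own illustration, not the actual 4.4BSD
macros): with a signed block number, "is this a meta-data block" is
just a sign test.

#include <stdio.h>

typedef int lbn_t;		/* stand-in for a signed 32-bit daddr_t */

/* Hypothetical helper: "negative" logical block numbers name
 * meta-data (indirect) blocks, non-negative ones name data blocks. */
#define	BLK_IS_META(bn)	((bn) < 0)

int
main(void)
{
	lbn_t data = 7;		/* the file's 8th data block */
	lbn_t meta = -2;	/* one of the file's indirect blocks */

	printf("%d is a %s block\n", data,
	    BLK_IS_META(data) ? "meta-data" : "data");
	printf("%d is a %s block\n", meta,
	    BLK_IS_META(meta) ? "meta-data" : "data");
	return (0);
}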

The third limit is the physical block type.  In UNIX-like systems, the
physical block is also a daddr_t.  In the FFS, the unit is the fragment,
and the FFS addresses the disk in units of fragments, i.e. an 8K-block,
1K-fragment file system will address the disk in 1K units.  This limits
the size of the physical device.
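
As a back-of-the-envelope check (my arithmetic, not from the original
numbers): a signed 32-bit daddr_t can name 2^31 fragments, so the
device-size ceiling is just 2^31 times the fragment size.

#include <stdio.h>

int
main(void)
{
	/* 2^31 addressable fragments for a signed 32-bit daddr_t. */
	long long nfrag = 1LL << 31;
	long long fsize;

	for (fsize = 512; fsize <= 8192; fsize *= 2)
		printf("%4lld-byte fragments: ~%lldG device limit\n",
		    fsize, (nfrag * fsize) >> 30);
	return (0);
}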

The fourth limit is the number of data blocks that are accessible
through triple-indirect addressing.  In 4BSD there are 12 (NDADDR) direct
blocks and 3 (NIADDR) levels of indirection, for a total of:

	NDADDR +
	    NINDIR(blocksize) + NINDIR(blocksize)^2 + NINDIR(blocksize)^3

data blocks.
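
The tables below just plug numbers into that expression.  Here's a
short sketch (mine) that reproduces them: NINDIR(bsize) is the block
size divided by the size of a block pointer (4 bytes for a 32-bit
daddr_t, 8 for a 64-bit one), and the 2^31 ceiling from limit #2 only
applies in the 32-bit case.

#include <stdio.h>

#define	NDADDR	12		/* direct block pointers in the inode */
#define	NIADDR	3		/* levels of indirection */

int
main(void)
{
	long long bsize, nindir, nblocks, term, ptrsize;
	int level;

	for (ptrsize = 4; ptrsize <= 8; ptrsize += 4) {
		printf("%lld-byte block pointers:\n", ptrsize);
		for (bsize = 512; bsize <= 16384; bsize *= 2) {
			nindir = bsize / ptrsize;	/* NINDIR(bsize) */

			/* NDADDR + NINDIR + NINDIR^2 + NINDIR^3 */
			nblocks = NDADDR;
			term = 1;
			for (level = 0; level < NIADDR; level++) {
				term *= nindir;
				nblocks += term;
			}

			/* Limit #2: a signed 32-bit daddr_t caps the
			 * logical block name space at 2^31 blocks. */
			if (ptrsize == 4 && nblocks > (1LL << 31))
				nblocks = 1LL << 31;

			printf("  %5lld-byte blocks: %10lld data blocks, "
			    "max file ~%lld bytes\n",
			    bsize, nblocks, nblocks * bsize);
		}
	}
	return (0);
}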

Given 64-bit off_t's, and 32-bit daddr_t's, this all boils down to:

Block size	# of data blocks	Max file size	Limiting type
  .5K		   2113676		~  1G		4
 1K		  16843020		~ 16G		4
 2K		 134480396		~262G		4
 4K		1074791436		 ~ 4T		4
 8K		2147483648		 ~16T		2
16K		2147483648		 ~32T		2

Note 1:
	For 32-bit off_t's, the maximum file size is 2G, except for 512
	byte block file systems where it's 1G.  The limiting type for
	all of these is #1, except for 512 byte block file systems where
	it's #4.

Note 2:
	If we go to 64-bit daddr_t's, the branching factor goes DOWN,
	because you need 8-bytes in the indirect block for each physical
	block.  The table then becomes:

Block size	# of data blocks	Max file size	Limiting type
  .5K		    266316		~130M		4
 1K		   2113676		~  2G		4
 2K		  16843020		~ 32G		4
 4K		 134480396		~512G		4
 8K		1074791436		 ~ 8T		4
16K		8594130956		~128T		4
	

>In article <1992Nov6.031757.20766@ntuix.ntu.ac.sg> eoahmad@ntuix.ntu.ac.sg (Othman Ahmad) writes:

>>This will be an important issue because soon we'll have hundreds of gigabytes,
>>instead of magabytes soon.
>>	It took the jump from tens mega to hundreds in just 10 years.

There are two issues that you need to consider.  The first is the amount
of physical data that you actually have, which can probably be satisfied,
in 99.99 percent of the cases, by 2G, let alone 16T.  The latter figure is
probably fine given what we can physically store on both magnetic and
tertiary storage.  While it is true that big files are getting bigger (by
roughly an order of magnitude), most files are about the same size they
were ten years ago, i.e. 40% are under 1K and 80% are under 20K [SOSP '91,
Mary Baker, Measurements of a Distributed File System].  Even that order
of magnitude isn't all that interesting for this case, as most files simply
aren't larger than 16T.

The second issue is the addressability of the data.  Some applications
want to store large objects (measured in megabytes) in a huge sparse file.
These applications may have a 2G disk, but want files sized in terabytes.
There is no satisfactory answer on most current UNIX systems, but
64-bit daddr_t's would seem to make the situation better.
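
A sketch of what those applications want to do (my illustration,
assuming a 64-bit off_t; the path and offset are made up): seek a
terabyte into a file and write one byte, which makes the file's
addressable size huge even though almost no disk blocks are allocated.

#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>

int
main(void)
{
	off_t where;
	int fd;

	if ((fd = open("hugefile", O_RDWR | O_CREAT, 0666)) == -1)
		return (1);

	/* Seek 1T into the file; the holes in between are never
	 * written, so they take no disk space, but the offset has
	 * to fit in an off_t and within limit #4's block count. */
	where = (off_t)1024 * 1024 * 1024 * 1024;
	if (lseek(fd, where, SEEK_SET) == -1)
		return (1);
	(void)write(fd, "x", 1);
	return (0);
}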

In article <1992Nov6.173454.17896@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:

>Get around the problem:
>
>1)	Multiple partitions not exceeding the 4 Gig limit.
>2)	Larger terminal blocks.
>3)	Additional indirection levels.
>4)	Assumption of larger files = log-structure file systems (ala Sprite).

The interesting point for me is #4 -- although I'm not really sure what
you meant.  The advantages of LFS are two-fold.  First, the features
that theoretically would be available to applications, due to its
no-overwrite policy, are attractive, e.g. "unrm", versioning,
transactions.  Second, with multiple writers it has the potential for
improved performance.

It is becoming clearer, at least to me, that the LFS performance
advantages are not as obvious as they originally appeared, mostly
because of the strong effects of the cleaner.  I'm starting to agree
with Larry McVoy of [USENIX, January 1991, Extent-like Performance
from a UNIX File System] that FFS with read/write clustering is just
as fast as LFS in many circumstances, and faster in lots of large-file
applications where the disk is over, say, 80% utilized.

Keith Bostic