*BSD News Article 8057



Newsgroups: comp.unix.bsd
Path: sserve!manuel.anu.edu.au!munnari.oz.au!hp9000.csc.cuhk.hk!saimiri.primate.wisc.edu!zaphod.mps.ohio-state.edu!sdd.hp.com!swrinde!gatech!news.byu.edu!ux1!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: multibyte character representations and Unicode
Message-ID: <1992Nov23.193620.9513@fcom.cc.utah.edu>
Sender: news@fcom.cc.utah.edu
Organization: Weber State University  (Ogden, UT)
References: <721993836.11625@minster.york.ac.uk>
Date: Mon, 23 Nov 92 19:36:20 GMT
Lines: 73

In article <721993836.11625@minster.york.ac.uk> forsyth@minster.york.ac.uk writes:
>Terry Weber suggests that half one's disc space will vanish
>on adopting Unicode.  Not so: I draw your attention to Plan 9,
>which uses Unicode very successfully.  See the Plan 9 documentation
>on research.att.com (dist/plan9doc, I think).

If you are talking about truly using the Unicode standard, then you are
talking about using 16 bits for English characters instead of 8 bits.
While your disk space wouldn't "disappear", its capacity for English text
would be halved, unless you mixed storage of Unicode and ASCII on the same
disk.

Unicode contains a total of 34,348 characters.  This is 52% of the largest
number of characters representable in 16 bits (65536), and is also larger
than what can be represented with conditional multibyting (8th bit set on
first character indicating multibyte, otherwise 7 bit ASCII), which is
32,896 characters (128 + 128 * 256).
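
The arithmetic above can be checked with a short sketch.  The character
count is the figure quoted in this post, not an independently verified one,
and the variable names are only illustrative.

```python
# Figures quoted in the post; the names are illustrative only.
UNICODE_CHARS = 34_348          # character count as stated in the post
SIXTEEN_BIT_CODES = 2 ** 16     # every value a 16-bit code can take

# Conditional multibyting: 128 single-byte 7-bit ASCII codes, plus
# 128 lead bytes (8th bit set), each followed by any of 256 second bytes.
conditional_capacity = 128 + 128 * 256

percent_of_16_bit = round(100 * UNICODE_CHARS / SIXTEEN_BIT_CODES)
shortfall = UNICODE_CHARS - conditional_capacity

print(SIXTEEN_BIT_CODES)        # 65536
print(percent_of_16_bit)        # 52
print(conditional_capacity)     # 32896
print(shortfall)                # 1452 characters that would not fit
```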

It seems to me that Unicode representation on disk requires 2 bytes per
character (symbol).  Thus a document file in English that used to take
2K stored in ASCII would take 4K (2K symbols * 2 bytes per symbol) to
store in Unicode.
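
A minimal sketch of that doubling, assuming a fixed 16-bit-per-symbol
layout.  Python's "utf-16-be" codec is used here only as a stand-in for
such a layout (for plain English text it emits exactly two bytes per
character and no byte-order mark); the sample text is made up.

```python
# Hypothetical sample document: plain English text, 7-bit ASCII only.
text = "plain English text\n" * 100

ascii_size = len(text.encode("ascii"))      # 1 byte per symbol
wide_size = len(text.encode("utf-16-be"))   # 2 bytes per symbol here

print(ascii_size, wide_size, wide_size / ascii_size)
```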

>Eventually Plan 9 switched to a new encoding -- which apparently has now been
>proposed for use in ISO 10646 -- that lacks all the unfortunate features.
>The second and third bytes of the encoding do not look like ASCII characters.
>(All bytes of an encoded character have the 0x80 bit set.)
>The consequence is that even fewer programs are affected:
>most pass Unicode encodings straight through.

This isn't really Unicode unless it follows the Unicode encoding, and it
lacks the ability to provide a fixed-size-per-symbol storage mechanism, but
I agree that ISO 10646 is a real possibility, although it seems rather
English-centric.
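
To illustrate the fixed-size point, here is a rough sketch.  It assumes
that the encoding described in the quoted text is the one now known as
UTF-8; that identification is an assumption, not something stated in this
thread.  The sample characters are arbitrary.

```python
# Assumption: the Plan 9 encoding in the quoted text behaves like UTF-8.
samples = ["A", "\u00e9", "\u0416", "\u65e5"]   # Latin, accented, Cyrillic, Han

lengths = []
for ch in samples:
    data = ch.encode("utf-8")
    lengths.append(len(data))
    # In a multibyte sequence, every byte has the 0x80 bit set, so no
    # byte of it can be mistaken for a 7-bit ASCII character.
    high_bits_ok = len(data) == 1 or all(b & 0x80 for b in data)
    print(f"U+{ord(ch):04X}", len(data), data.hex(), high_bits_ok)

# Storage per symbol varies, so the Nth symbol is not at a fixed offset.
print(lengths)
```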

In X, to provide an 8x8 Unicode font, it takes 274784 bytes of storage for
the actual font glyphs, plus overhead; a 10x20 takes 1030440 (just under a
Meg, assuming the overhead is less than 18K).  Both could easily be done in
ROM.
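
The font figures can be reproduced as follows.  The 8 bytes per glyph for
8x8 is one byte per row; the 30 bytes per glyph for 10x20 is inferred by
dividing the quoted total by the character count, so the row packing
behind it is an assumption (200 bits packed tightly would be 25 bytes).

```python
UNICODE_CHARS = 34_348   # character count as quoted in this post

def font_bytes(bytes_per_glyph, glyphs=UNICODE_CHARS):
    """Raw glyph storage for a full font, ignoring per-font overhead."""
    return bytes_per_glyph * glyphs

small = font_bytes(8)     # 8x8: 8 rows of 1 byte each
large = font_bytes(30)    # 10x20: 30 bytes/glyph inferred from the total

# Room left for overhead before the 10x20 font exceeds one megabyte.
headroom = 2 ** 20 - large

print(small)      # 274784
print(large)      # 1030440
print(headroom)   # 18136, i.e. just under 18K
```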

Without multibyte encoding (i.e., straight 16-bit characters), the output is
straightforward using X.  The same is true for an "English-only" or other
("Cyrillic-only", etc.) font, since X fonts are allowed to be sparse;
thus the full Unicode font is only necessary for multinational use of the
same device... even then, the number of glyphs in a font need only be
enough to cover both sets.

Thus, in many cases, font-fill centric encoding (i.e., this is the font I
used, and these are the 8 bit representations of the Unicode characters
lexically within the font) is sufficient to provide 8 bit storage for all
but Kanji.  If the Japanese could limit themselves to Kana
(Katakana/Hiragana), then they could benefit from this storage technique
as well (this would also go a long way towards making them
compute-competitive and reduce the hoops one jumps through when using a
Kanji keyboard).
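
A minimal sketch of what I mean by font-fill encoding, under stated
assumptions: the function names and the "Cyrillic-only" font contents
below are hypothetical, and "lexically within the font" is taken to mean
the position of a character in the font's sorted list of code points.

```python
def build_font_fill_table(font_code_points):
    """Map each code point in a sparse font to its 8-bit index."""
    ordered = sorted(set(font_code_points))
    if len(ordered) > 256:
        raise ValueError("font has too many glyphs for 8-bit storage")
    to_byte = {cp: i for i, cp in enumerate(ordered)}
    return ordered, to_byte

def encode(text, to_byte):
    # One byte per symbol; a KeyError means the font lacks that glyph.
    return bytes(to_byte[ord(ch)] for ch in text)

def decode(data, ordered):
    return "".join(chr(ordered[b]) for b in data)

# Hypothetical font: printable ASCII plus the basic Cyrillic letters.
font = list(range(0x20, 0x7F)) + list(range(0x0410, 0x0450))
ordered, to_byte = build_font_fill_table(font)

message = "\u041f\u0440\u0438\u0432\u0435\u0442, \u043c\u0438\u0440"
stored = encode(message, to_byte)
print(len(message), len(stored), decode(stored, ordered) == message)
```

The table only has meaning together with the font that produced it, which
is why the font would have to be recorded alongside the file.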

>In particular, the `normal' file system names can hold Unicode
>characters without fuss.  There is certainly no need to switch to 16-bit
>representations for them, with all that that entails.

No argument here; however, I would say that picking a font-fill encoding
as a file storage attribute would be sufficient for this as well.


					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
-- 
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------