*BSD News Article 7831


Path: sserve!manuel.anu.edu.au!munnari.oz.au!spool.mu.edu!agate!doc.ic.ac.uk!uknet!yorkohm!minster!forsyth
From: forsyth@minster.york.ac.uk
Newsgroups: comp.unix.bsd
Subject: multibyte character representations and Unicode
Message-ID: <721993836.11625@minster.york.ac.uk>
Date: 17 Nov 92 09:50:36 GMT
Organization: Department of Computer Science, University of York, England
Lines: 42

Terry Weber suggests that half one's disc space will vanish
on adopting Unicode.  Not so: I draw your attention to Plan 9,
which uses Unicode very successfully.  See the Plan 9 documentation
on research.att.com (dist/plan9doc, I think).

Basically, there is a multibyte encoding for Unicode that works well.
Inside relatively FEW programs the multibyte encoding is converted
to an integer representation (the type `Rune') to simplify manipulation.
For instance, the text displayed in a text frame by sam or the window
manager is kept as Runes, but ONLY the text displayed.  Any hidden
text -- and text in disc files -- is kept in the multibyte encoding.
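As a rough illustration of that conversion step, here is a minimal sketch
in C. It assumes the multibyte encoding is the newer one described further
down (the scheme now generally known as UTF-8) and that a Rune is a 16-bit
value. The names rune_decode, runes_from_bytes and RUNE_BAD are made up
for this example; they are not the Plan 9 library interface.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short Rune;      /* assumption: 16-bit code values */
enum { RUNE_BAD = 0xFFFD };       /* hypothetical marker for malformed input */

/* Decode one character starting at s, with n >= 1 bytes available.
 * Stores the value in *r and returns the number of bytes consumed.
 * Sketch only: it does not reject overlong encodings. */
size_t rune_decode(Rune *r, const unsigned char *s, size_t n)
{
    unsigned c;

    assert(n >= 1);
    c = s[0];
    if (c < 0x80) {                               /* ASCII: one byte, itself */
        *r = (Rune)c;
        return 1;
    }
    if ((c & 0xE0) == 0xC0 && n >= 2 && (s[1] & 0xC0) == 0x80) {
        *r = (Rune)(((c & 0x1F) << 6) | (s[1] & 0x3F));
        return 2;
    }
    if ((c & 0xF0) == 0xE0 && n >= 3 &&
        (s[1] & 0xC0) == 0x80 && (s[2] & 0xC0) == 0x80) {
        *r = (Rune)(((c & 0x0F) << 12) | ((s[1] & 0x3F) << 6) | (s[2] & 0x3F));
        return 3;
    }
    *r = RUNE_BAD;                                /* malformed: skip one byte */
    return 1;
}

/* Convert a byte buffer to Runes, writing at most outmax of them.
 * Returns the number of Runes written. */
size_t runes_from_bytes(Rune *out, size_t outmax,
                        const unsigned char *s, size_t n)
{
    size_t i = 0, k = 0;

    while (i < n && k < outmax) {
        size_t used = rune_decode(&out[k], s + i, n - i);
        i += used;
        k++;
    }
    return k;
}
```

A program in the style described above would call something like
runes_from_bytes only on the text it is about to display or edit, and
leave everything else as bytes.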

Some care is required in specifying the multibyte encoding.
It seems that Plan 9 originally followed the encoding specified in
the Unicode standard, but that encoding has some messy consequences in
practice: not least that the 2nd and 3rd bytes can appear to be valid
ASCII.  (Why anyone would design an encoding that does this is beyond
me, since the problems are fairly obvious, but that's what Unicode did.)
Eventually Plan 9 switched to a new encoding -- which apparently has now been
proposed for use in ISO 10646 -- that lacks all the unfortunate features.
The second and third bytes of the encoding do not look like ASCII characters.
(Every byte of a multibyte sequence has the 0x80 bit set; ASCII characters
are still encoded as themselves, in a single byte.)
The consequence is that even fewer programs are affected:
most pass Unicode encodings straight through.
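To make the high-bit property concrete, here is a small sketch of an
encoder in C. It assumes the new encoding is the one now generally known
as UTF-8, limited to 16-bit values and at most three bytes. The names
rune_encode and high_bits_set are invented for the example and are not
taken from the Plan 9 source.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned short Rune;      /* assumption: 16-bit code values */

/* Encode r into buf, which must have room for 3 bytes.
 * Returns the number of bytes written (1, 2 or 3). */
size_t rune_encode(unsigned char *buf, Rune r)
{
    assert(buf != NULL);
    if (r < 0x80) {                       /* ASCII stays a single byte */
        buf[0] = (unsigned char)r;
        return 1;
    }
    if (r < 0x800) {                      /* 110xxxxx 10xxxxxx */
        buf[0] = (unsigned char)(0xC0 | (r >> 6));
        buf[1] = (unsigned char)(0x80 | (r & 0x3F));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx */
    buf[0] = (unsigned char)(0xE0 | (r >> 12));
    buf[1] = (unsigned char)(0x80 | ((r >> 6) & 0x3F));
    buf[2] = (unsigned char)(0x80 | (r & 0x3F));
    return 3;
}

/* Return 1 if every one of the n bytes has the 0x80 bit set, else 0. */
int high_bits_set(const unsigned char *buf, size_t n)
{
    size_t i;

    for (i = 0; i < n; i++)
        if ((buf[i] & 0x80) == 0)
            return 0;
    return 1;
}
```

Every byte the two multibyte branches write is ORed with 0x80 or more, so
high_bits_set holds for any value of 0x80 and above. That is why a program
that only looks for ASCII bytes (newline, space, '/') can pass the rest
through without knowing anything about Unicode.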

In particular, the `normal' file system names can hold Unicode
characters without fuss.  There is certainly no need to switch to 16-bit
representations for them, with all that that entails.
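For example, ordinary byte-oriented path handling keeps working. The
sketch below assumes file names are stored in the high-bit-set encoding
described above; base_name is a hypothetical helper written for this
illustration, not an existing library routine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer to the last component of path.
 * The search is purely byte-wise: no byte inside a multibyte character
 * can equal '/' (0x2F), because all such bytes have the 0x80 bit set. */
const char *base_name(const char *path)
{
    const char *slash;

    assert(path != NULL);
    slash = strrchr(path, '/');
    return slash != NULL ? slash + 1 : path;
}
```

With an encoding whose later bytes can look like ASCII, the same code
could split a name in the middle of a character whenever one of its bytes
happened to be 0x2F.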

Actually, on Plan 9 you cannot even run the window manager without using
Unicode: its name is `eight and a half' (i.e., 8 followed by a 1/2 symbol!),
entered as `8 ALT 1 2' (on my keyboard, anyhow).

You can find much of the Plan 9 Rune support in
the source for Pike's editor `sam', also on research.att.com
(dist/sam, I think).
(You also get a very decent editor, a library that gives you a sane
interface to X11, and a library for managing text on a bitmap display.)

Obviously programs can store Runes in disc files if that's really what
they need, or if their authors work for disc manufacturers, but it
isn't necessary.