*BSD News Article 8232



Newsgroups: comp.unix.bsd
Path: sserve!manuel.anu.edu.au!munnari.oz.au!news.hawaii.edu!ames!saimiri.primate.wisc.edu!zaphod.mps.ohio-state.edu!menudo.uh.edu!sugar!ficc!peter
From: peter@ferranti.com (peter da silva)
Subject: Re: multibyte character representations and Unicode
Message-ID: <id.QOCV.ZJ2@ferranti.com>
Organization: Xenix Support, FICC
References: <1992Nov23.193620.9513@fcom.cc.utah.edu> <id.C19V.DY@ferranti.com> <1992Nov25.224757.4769@fcom.cc.utah.edu>
Date: Sat, 28 Nov 1992 05:17:56 GMT
Lines: 55

In article <1992Nov25.224757.4769@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
> In article <id.C19V.DY@ferranti.com> peter@ferranti.com (peter da silva) writes:
> >In article <1992Nov23.193620.9513@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
> >> While your disk space wouldn't "disappear", it would be halved, unless you
> >> mixed storage of Unicode and ASCII on the same disk.

> >What, even for code?
> 
> For anything stored in the 16 bit character set (Unicode) instead of some
> 8 bit set (Extended ASCII, ISO Latin-1).  Yes, this includes code, since
> code is something that you will access with supposedly "Unicode aware"
> editors, just like all other text.

Um, I generally access code with linkers and loaders.

And the biggest individual objects on the disk are bitmaps.

> >> and it lacks
> >> the ability to provide a fixed size per symbol storage mechanism,

> >Why do you want one?

> So that I can declare an array of ch_type and be guaranteed that my
> array length is not dependent on the encoding -- otherwise I won't be able
> to guarantee a fixed number of characters for an input field in a language
> independent fashion.

Huh? Any reasonable language-independent I/O routine will have to
allocate its buffers dynamically anyway. Or, if you want to be lazy
with scratch buffers, just assume worst-case encoding.
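The lazy route is one line of arithmetic. A minimal C sketch, where MAX_SEQ is an assumed worst-case bytes-per-character for whatever encoding is in use, not a figure from any standard:

```c
#include <stddef.h>

/* Assumption for illustration: no character takes more than
 * MAX_SEQ bytes in the encoding at hand. */
#define MAX_SEQ 4

/* Bytes to reserve for a scratch field of n_chars characters:
 * over-allocate by the worst-case sequence length, plus a NUL.
 * The field still accepts exactly n_chars characters. */
size_t field_bytes(size_t n_chars)
{
    return n_chars * MAX_SEQ + 1;
}
```

The buffer is sized in bytes; the field length stays fixed in characters, independent of what the user types.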

> It's ridiculous to think that the number of characters
> you can input into a field based on a fixed storage declarator would vary
> based on what the characters entered were.  In particular, can you
> imagine a database asking for a surname where you could enter 80 characters
> for an English surname, but some lesser number for a surname containing an
> umlaut-u or cedilla?

Sure I can. You're talking to a man named "da Silva". Badly written software
is an existing problem, and forcing worst-case encoding on everything just
to avert an easily avoidable side-effect is poor design.

> This could easily happen for a fixed byte length field
> where "characters" (glyphs, actually) were encoded to take 1-3 bytes
> to produce.  This would be bad.

Allocate 320 bytes, and only allow 80 characters. Why is that worse than
allocating 320 bytes because that's how many bytes 80 characters take up?
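Limiting input by characters rather than bytes is cheap if the encoding marks continuation bytes. A sketch assuming a UTF-style scheme where continuation bytes have the form 10xxxxxx (this property is an assumption about the encoding, not a given):

```c
#include <stddef.h>

/* Count characters (not bytes) in a multibyte string, assuming
 * an encoding where continuation bytes look like 10xxxxxx: any
 * byte that is NOT a continuation starts a new character.
 * A counting sketch, not a validator. */
size_t mb_count(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; s++)
        if (((unsigned char)*s & 0xC0) != 0x80)
            n++;
    return n;
}
```

An input routine can then cap the field at 80 by mb_count() while the buffer itself is declared in bytes.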

As for "the solution", I really think you should investigate UTF.
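The point for the disk-space argument is that a UTF-style encoding leaves ASCII at one byte per character, so code and English text pay no penalty while 16-bit values grow only as needed. A sketch of the variable-length idea; the byte layouts follow the FSS-UTF proposal and are shown as an illustration, not a normative implementation:

```c
#include <stddef.h>

/* Encode one 16-bit character value into 1-3 bytes, UTF-style:
 * ASCII survives unchanged, larger values take more bytes.
 * Layouts as in the FSS-UTF proposal, for illustration only. */
size_t utf_encode(unsigned int c, unsigned char *out)
{
    if (c < 0x80) {                        /* ASCII: 1 byte, as-is */
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {                       /* 2 bytes */
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (c >> 12));         /* 3 bytes */
    out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (c & 0x3F));
    return 3;
}
```

So a disk full of source code stays exactly the size it is today; only text that actually uses the extended range grows.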
-- 
%Peter da Silva/77487-5012 USA/+1 713 274 5180/Have you hugged your wolf today?
/D{def}def/I{72 mul}D/L{lineto}D/C{curveto}D/F{0 562 moveto 180 576 324 648 396
736 C 432 736 L 482 670 518 634 612 612 C}D/G{setgray}D .75 G F 612 792 L 0 792
L fill 1 G 324 720 24 0 360 arc fill 0 G 3 setlinewidth F stroke showpage % 100