*BSD News Article 8103


Newsgroups: comp.unix.bsd
Path: sserve!manuel.anu.edu.au!munnari.oz.au!news.hawaii.edu!ames!saimiri.primate.wisc.edu!zaphod.mps.ohio-state.edu!sol.ctr.columbia.edu!eff!news.byu.edu!ux1!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: multibyte character representations and Unicode
Message-ID: <1992Nov25.224757.4769@fcom.cc.utah.edu>
Sender: news@fcom.cc.utah.edu
Organization: Weber State University  (Ogden, UT)
References: <721993836.11625@minster.york.ac.uk> <1992Nov23.193620.9513@fcom.cc.utah.edu> <id.C19V.DY@ferranti.com>
Date: Wed, 25 Nov 92 22:47:57 GMT
Lines: 61

In article <id.C19V.DY@ferranti.com> peter@ferranti.com (peter da silva) writes:
>In article <1992Nov23.193620.9513@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
>> While your disk space wouldn't "disappear", it would be halved, unless you
>> mixed storage of Unicode and ASCII on the same disk.
>
>What, even for code?

For anything stored in the 16 bit character set (Unicode) instead of some
8 bit set (Extended ASCII, ISO Latin-1).  Yes, this includes code, since
code is something you will access with supposedly "Unicode aware"
editors, just like all other text.

A 16 bit value less than 256 takes the same number of bits as a 16 bit value
greater than 255; the high bits are just zeroed.
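
To put numbers on it (a minimal sketch, assuming the usual 8 bit char and
a 16 bit unsigned short; nothing here is anything but illustration):

#include <stdio.h>

int
main(void)
{
    char            ascii_buf[80];      /*  80 bytes */
    unsigned short  unicode_buf[80];    /* 160 bytes */

    /*
     * 'A' is 0x41 as an 8 bit character and 0x0041 as a 16 bit
     * Unicode character; the high byte is zero, but it still
     * occupies storage.
     */
    printf("ascii: %lu bytes, unicode: %lu bytes\n",
        (unsigned long)sizeof(ascii_buf),
        (unsigned long)sizeof(unicode_buf));
    return (0);
}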

>> This isn't really Unicode unless it follows Unicode encoding,
>
>Which it does.

Which means that ch_type is unsigned short instead of unsigned char.
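
In other words, something like this (a sketch only; the UNICODE define and
the ch_type name are illustrative, not an existing header):

#ifdef UNICODE
typedef unsigned short  ch_type;    /* 16 bits per character */
#else
typedef unsigned char   ch_type;    /*  8 bits per character */
#endif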

>> and it lacks
>> the ability to provide a fixed size per symbol storage mechanism,
>
>Why do you want one?

So that I can declare an array of ch_type and be guaranteed that my
array length is not dependent on the encoding -- otherwise I won't be able
to guarantee a fixed number of characters for an input field in a language
independent fashion.  It's ridiculous to think that the number of characters
you can input into a field with a fixed storage declarator would vary
based on which characters were entered.  In particular, can you
imagine a database asking for a surname where you could enter 80 characters
for an English surname, but some lesser number for a surname containing an
umlaut-u or a cedilla?  This could easily happen for a fixed byte-length
field where "characters" (glyphs, actually) are encoded as 1-3 bytes each.
This would be bad.
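
For example (a rough sketch; NAME_CHARS and the variable names are made up):

typedef unsigned short  ch_type;    /* fixed 16 bits per character */

#define NAME_CHARS      80

ch_type surname[NAME_CHARS + 1];    /* always holds 80 characters */

/*
 * With a variable-length 8 bit encoding, the same declarator only
 * bounds the byte count; how many characters fit depends on what
 * the user typed:
 */
char    surname_mb[NAME_CHARS + 1]; /* 80 ASCII characters, but as
                                     * few as 26 if every glyph
                                     * takes 3 bytes */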

The alternative, like I said, would be to have a set of standard "code page"
mappings from each 256-character set to Unicode, and then store the files
in an 8-bit encoding based on those code pages.  Each file would be
attributed with a code page identifier, and everything would be converted
to Unicode on read or write, but stored encoded.  Directory entries
themselves would have to be Unicode, unless the file system was mkfs'ed
"nationalized" to a particular code page.

This would save the European types, at least, from having to deal with 16
bit storage losses.
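
Something along these lines (a sketch only; none of these names exist in
any current system, and the mapping tables would come from the standard
code page definitions):

#include <stdio.h>      /* BUFSIZ */
#include <unistd.h>     /* read() */

#define NUM_CODEPAGES   16          /* illustrative */

typedef unsigned short  unicode_t;

/*
 * One 256-entry table per code page, mapping stored 8 bit bytes to
 * Unicode values; filled in elsewhere.
 */
unicode_t   codepage_map[NUM_CODEPAGES][256];

/*
 * Read up to nchars characters from a file stored in an 8 bit encoding
 * and widen them to Unicode using the file's code page attribute.
 */
int
read_unicode(int fd, int codepage, unicode_t *buf, int nchars)
{
    unsigned char   raw[BUFSIZ];
    int             n, i;

    if (nchars > BUFSIZ)
        nchars = BUFSIZ;
    n = read(fd, raw, nchars);
    for (i = 0; i < n; i++)
        buf[i] = codepage_map[codepage][raw[i]];
    return (n);
}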


					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
-- 
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------