*BSD News Article 9437



Received: by minnie.vk1xwt.ampr.org with NNTP
	id AA5767 ; Fri, 01 Jan 93 01:55:14 EST
Path: sserve!manuel.anu.edu.au!munnari.oz.au!uunet!not-for-mail
From: avg@rodan.UU.NET (Vadim Antonov)
Newsgroups: comp.unix.bsd
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Date: 30 Dec 1992 17:48:04 -0500
Organization: UUNET Technologies Inc, Falls Church, VA
Lines: 97
Message-ID: <1ht8v4INNj7i@rodan.UU.NET>
References: <2564@titccy.cc.titech.ac.jp> <1992Dec30.010216.2550@nobeltech.se> <1992Dec30.061759.8690@fcom.cc.utah.edu>
NNTP-Posting-Host: rodan.uu.net
Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages

In article <1992Dec30.061759.8690@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
>The "ugly thing Unicode does with asiatic languages" is exactly what it
>does with all other languages:  There is a single lexical assignment for
>for each possible glyph.
>....
>ADMITTED DRAWBACKS IN UNICODE:
>
>The fact that lexical order is not maintained for all existing character
>sets (NOTE: NO CURRENT OR PROPOSED STANDARD SUPPORTS THIS IDEA!) means that
>a direct arithmatic translation is not possible for...

It means that:

1) "mechanistic" conversion between upper and lower case
   is impossible (as are case-insensitive comparisons)

   Example:     Latin    T -> t
		Cyrillic Т -> т  (its italic form resembles Latin "m")
		Greek    Τ -> τ

   One code shared by all three capitals could map to only one
   of these lowercase letters. This property alone renders
   Unicode useless for any business application.
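The argument can be sketched as a table lookup (a minimal sketch; the table and helper below are invented for illustration, assuming lowercasing is keyed on the character's code):

```python
# Minimal sketch: lowercasing is a lookup keyed on the character code.
# Distinct codes for the look-alike capitals allow distinct answers;
# one shared code could store only ONE of the three mappings.
CASE_TABLE = {
    0x0054: 't',  # Latin T
    0x0422: 'т',  # Cyrillic Т (italic lowercase resembles Latin "m")
    0x03A4: 'τ',  # Greek Τ
}

def to_lower(code: int) -> str:
    # Hypothetical helper: fall back to the character itself.
    return CASE_TABLE.get(code, chr(code))

assert to_lower(ord('T')) == 't'
assert to_lower(0x0422) == 'т'
assert to_lower(0x03A4) == 'τ'
```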

2) there is no trivial way to sort anything.
   Even an elementary sort program requires access to enormous
   collation tables covering every possible language.

   English: A B C D E ... T ...
   Russian: А .. В ... Е ... С Т ...   (same shapes, different order)
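   The collation problem can be sketched numerically (the positions
   below are the real ordinal positions of these letters in their
   alphabets, counting Ё in the Russian one; the tables themselves
   are illustrative):

```python
# Positions of look-alike capitals in their own alphabets:
# Latin 'C' is the 3rd letter; Russian 'С' is the 19th.
LATIN_POS   = {'A': 1, 'B': 2, 'C': 3, 'E': 5, 'T': 20}
RUSSIAN_POS = {'А': 1, 'В': 3, 'Е': 6, 'С': 19, 'Т': 20}

# In English, C sorts before E; in Russian, С sorts after Е.
assert LATIN_POS['C'] < LATIN_POS['E']
assert RUSSIAN_POS['С'] > RUSSIAN_POS['Е']
# One shared code for the glyph "C" could carry only one of these
# collation weights, so one of the two languages would sort wrongly.
```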

3) there is no reasonable way to do hyphenation.
   Since there is no way to tell the language from the text,
   there is no way to make any reasonable attempt at hyphenation.
   [OX -- which language is this word from?]

   Good-bye, wordprocessors and formatters?
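   The dependency can be sketched (the rule tables below are invented
   stand-ins for real per-language hyphenation patterns): nothing in
   the character codes selects which rules apply.

```python
# Invented stand-in rule sets; real hyphenators use per-language
# pattern files, but the lookup is still keyed on the LANGUAGE.
HYPHENATION_RULES = {
    'en': {'formatter': 'for-mat-ter'},
    'ru': {'перенос': 'пе-ре-нос'},
}

def hyphenate(word: str, lang: str) -> str:
    # The language tag, not the glyph codes, chooses the rule set;
    # without it the word is returned unhyphenated.
    return HYPHENATION_RULES.get(lang, {}).get(word, word)

assert hyphenate('formatter', 'en') == 'for-mat-ter'
assert hyphenate('перенос', 'ru') == 'пе-ре-нос'
```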

4) "similar glyphs" in Unicode are often SLIGHTLY different
   typographical glyphs -- everybody who has ever dealt with
   international publishing knows that a font is designed as a WHOLE
   and every letter is designed with all the others in mind -- i.e.
   the Cyrillic Х is NOT the same Х as the Latin X even if the fonts
   are variations of the same style. I wish you could see how ugly
   Russian texts printed on American desktop publishing systems with
   "a few characters added" are.

   In practice this means that Unicode is not a solution for
   typesetting.

Having unique glyphs works ONLY WITHIN a group of languages
based on variations of a single alphabet with non-conflicting
alphabetical ordering and sets of vowels. You can do that for
European languages. An attempt to do it across different groups
(like Cyrillic and Latin) is disastrous at best -- we already
tried it and finally came to encodings with two completely
separate alphabets.

I think there are not many such groups, though, and it is possible
to identify several "meta-alphabets". The meta-alphabets have no
defined rules for cross-sorting (unlike letters WITHIN one
meta-alphabet; you CAN sort English and German words together
and it will still make sense; sorting Russian and English together
is at best useless). This increases the number of codes, but not
as drastically as codifying languages: there are hundreds of
languages based on a dozen meta-alphabets.
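The within-group claim is easy to check (the word list is
illustrative): English and German share the Latin alphabetical
order, so a mixed, case-folded sort still reads sensibly.

```python
# English and German words share the Latin alphabetical order,
# so sorting them together case-insensitively gives a sensible list.
mixed = ['Zebra', 'apple', 'Apfel', 'berry']
assert sorted(mixed, key=str.lower) == ['Apfel', 'apple', 'berry', 'Zebra']
```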

>The fact that all character sets do not occur in their local lexical order
>means that a particular character can not be identified as to language by
>its ordinal value.  This is a small penalty to pay for the vast reduction
>in storage requirements between a 32-bit and a 16-bit character set that
>contains all required glyphs.

Not true. First of all, nothing forces a 32-bit representation
where only 10 bits are necessary.
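The arithmetic behind the ten-bit figure (letter counts are
approximate and illustrative): the full upper- and lower-case
repertoire of several European alphabets together stays well
under 2^10 = 1024 codes.

```python
import math

# Approximate letter counts: Latin 26, Russian 33, Greek 24,
# each in two cases -- 166 letters in all.
repertoire = 2 * (26 + 33 + 24)
assert repertoire < 2 ** 10                    # fits in 10 bits easily
assert math.ceil(math.log2(repertoire)) == 8   # in fact 8 bits suffice
```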

So, as you can see, Unicode is more of a problem than a solution.
The fundamental idea is simply wrong -- it is inadequate for
anything except Latin-based languages. No wonder we're
hearing that Unicode is US-centric.

Unfortunately, Unicode looks like a cool solution to people who
have never done any real localization work, and I fear that this
particular mistake will be promoted as a standard, handing
us a new round of headaches. It does not remove the necessity of
carrying out-of-band information (like "X-Language: english"), and
that makes it no better than the existing ISO 8-bit encodings
(if I know the language, I already know its alphabet --
all the extra bits are simply wasted; and programs handling Unicode
text still have to know the language, for the reasons stated before).

UNICODE IS A *BIG* MISTAKE.

(Don't get me wrong -- I'm for a universal encoding; it's
just that particular idea of unique glyphs that I strongly
oppose.)

--vadim