*BSD News Article 9467


Received: by minnie.vk1xwt.ampr.org with NNTP
	id AA5812 ; Fri, 01 Jan 93 01:56:39 EST
Newsgroups: comp.unix.bsd
Path: sserve!manuel.anu.edu.au!munnari.oz.au!spool.mu.edu!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Message-ID: <1993Jan1.094759.8021@fcom.cc.utah.edu>
Keywords: Han Kanji Katakana Hiragana ISO10646 Unicode Codepages
Sender: news@fcom.cc.utah.edu
Organization: Weber State University  (Ogden, UT)
References: <1992Dec30.010216.2550@nobeltech.se> <1992Dec30.061759.8690@fcom.cc.utah.edu> <1ht8v4INNj7i@rodan.UU.NET>
Date: Fri, 1 Jan 93 09:47:59 GMT
Lines: 281

In article <1ht8v4INNj7i@rodan.UU.NET> avg@rodan.UU.NET (Vadim Antonov) writes:
>In article <1992Dec30.061759.8690@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
>>The "ugly thing Unicode does with asiatic languages" is exactly what it
>>does with all other languages:  There is a single lexical assignment
>>for each possible glyph.
>>....
>>ADMITTED DRAWBACKS IN UNICODE:
>>
>>The fact that lexical order is not maintained for all existing character
>>sets (NOTE: NO CURRENT OR PROPOSED STANDARD SUPPORTS THIS IDEA!) means that
>>a direct arithmetic translation is not possible for...
>
>It means that:
>
>1) "mechanistic" conversion between upper and lower case
>   is impossible (as well as case-insensitive comparisons)
>
>   Example:     Latin  T -> t
>		Cyrillic T -> m
>		Greek T -> ?
>
>   This property alone renders Unicode useless for any business
>   applications.

This is simply untrue.  The fact that a subtractive/additive conversion
is impossible in *some* cases does not mean a *mechanistic* conversion
is also impossible.  In particular, a tabular conversion is an obvious
approach which has already been used with success, with a minimal
(multiply plus dereference) overhead per character.
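
To make the tabular approach concrete, here is a minimal sketch (mine,
not from any standard; the type name, function names and table contents
are illustrative only) of a two-level table: the high byte of the
16-bit value selects a per-page delta table, the low byte indexes it,
and the per-character cost is the multiply-plus-dereference mentioned
above.  Note that both the Latin-1 page and the basic Cyrillic page
happen to carry the same decimal 32 offset:

/*
 * Sketch of table-driven case conversion for 16-bit characters.
 * A real implementation would generate the delta tables from the
 * standard's case-mapping data; only two pages are filled in here.
 */
typedef unsigned short unichar;

static short latin1_delta[256];			/* page 0x00 */
static short cyrillic_delta[256];		/* page 0x04 */
static short *upper_delta[256];			/* indexed by high byte */

void
init_case_tables(void)
{
	int i;

	/* ASCII and Latin-1: upper case is lower case minus 32. */
	for (i = 'a'; i <= 'z'; i++)
		latin1_delta[i] = -32;
	for (i = 0xe0; i <= 0xfe; i++)		/* a-grave .. thorn */
		if (i != 0xf7)			/* division sign: no case */
			latin1_delta[i] = -32;

	/* Basic Cyrillic, U+0430..U+044F -> U+0410..U+042F: also 32. */
	for (i = 0x30; i <= 0x4f; i++)
		cyrillic_delta[i] = -32;

	upper_delta[0x00] = latin1_delta;
	upper_delta[0x04] = cyrillic_delta;
}

unichar
to_upper(unichar c)
{
	short *page = upper_delta[c >> 8];	/* one dereference */

	/* characters on unmapped pages convert to themselves */
	return (page == 0 ? c : (unichar)(c + page[c & 0xff]));
}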

The lexical ordering of the Latin-1 character set is not in question;
case conversion is done by an arithmetic offset of decimal 32.

The Cyrillic characters within the Unicode standard (U+0400 -> U+04FF)
are based on the ECMA registry under ISO-2375 for use with ISO-2022.
That registry contains several Cyrillic subsets; the most recent and
most widely accepted of these is ISO-8859-5, and Unicode uses the same
relative positions as ISO-8859-5.  Are you also averse to ISO-8859-5?

There are a number of Cyrillic letters not defined in ISO-8859-5 (both
historical and extended) which exist in the Unicode standard; it is
true that case conversion is not based on an offset of decimal 32 for
the extended characters not covered by the 8859-5 standard.  However,
the historic letters (such as those used in Ukrainian and Belorussian)
are dialectal in nature, and thus are regarded as a font change.
Bearing this in mind, case conversion can be done in the context of
the dialect table used for local representation of the characters for
device I/O, using the decimal 32 offset *through the lookup table*.  I
fail to follow your T -> m conversion argument; could you please
identify the letters in question with regard to their ordinal values
in ISO-8859-5?
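
For reference, the basic Cyrillic letters sit at a clean offset of
decimal 32 in both encodings: the capitals occupy 0xB0 -> 0xCF in
ISO-8859-5 and U+0410 -> U+042F in Unicode, with the corresponding
small letters 32 positions higher in each.  A one-line illustration
(the function name is mine):

/*
 * Upper-case the basic Cyrillic letters in ISO-8859-5 (small letters
 * 0xD0..0xEF, capitals 0xB0..0xCF).  The letters outside this basic
 * range need a table (or a different offset) -- which is exactly the
 * lookup-table case discussed above.
 */
unsigned char
iso8859_5_toupper(unsigned char c)
{
	return ((c >= 0xd0 && c <= 0xef) ? (unsigned char)(c - 32) : c);
}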


The argument for case conversion within Greek is equally flawed,
unless you are also taking issue with the ISO-8859-7 character set,
per ECMA registry ISO-2375 for use with ISO-2022.  Taking issue with
that particular standard would be difficult to support on your part,
as the ISO-8859-7 standard is based on the Greek national standard
ELOT-928 and also on ECMA-118, the origin of which is Greece.

Again, the historical forms are not in a lexically correct order for a
decimal 32 case conversion; however, these are also dialectal variants,
and the difficulties inherent in them are resolvable under the same
mechanisms as those discussed for Cyrillic.

As to business suitability, it is unlikely that one would use something
like polytonic Greek (i.e. classical and Byzantine ancient Greek
orthography) for a business application.


The main "disordering" of character sets is with regard to the Japanese
JIS standard.  The minutes of the 20 Apr 90 UNICODE meeting (as reported
by Ken Whistler, Metaphor Computer Systems) justify this as follows:

] Han Unification Issues:
] 
] The compromise WG2 position advocated Han unification, but it seemed
] to imply that the unified set would start off with codes in JIS order.
] There was some discussion of whether the compromise proposal really
] did or did not state (or imply) that.  Then the group reviewed the
] Japanese objections to a Han unification that does not incorporate
] JIS ordering.
] 
] The consensus was that a JIS-first ordering in a unified Han encoding
] is unacceptable for at least 3 reasons:
]         1. It is morally unacceptable to favor the Japanese standard
]                 this way in an international encoding, at the expense
]                 of the Chinese and Korean standards.
]         2. The proposal attempts to solve a technical problem (namely
]                 the actual work of unifying the characters) with a
]                 political solution.
]         3. Preservation of the JIS order, so as to attempt to
]                 encapsulate that as a default sort order, makes no
]                 sense outside of a JIS-oriented application.  The
]                 Han unification should present a more generally
]                 recognizable default sort order (i.e. one which
]                 can also be used by the Chinese and the Koreans,
]                 and which applies to the characters beyond JIS 1 & 2).
] 
] Examination of the cost/benefits of unified Han character encoding
] should lead to the following conclusions:  If an application is
] Japanese only, then simply use JIS.  If an application is truly
] multilingual, then a JIS-first encoding doesn't make particular
] sense.  Hence, the Unicode consensus is that an alternative and
] universal ordering principle should be applied to the unified
] Han set.  (The consensus is still that radical/stroke order, with
] or without level distinctions, is the right way to go.)

Of these, some argument can be made against only the final paragraph,
since it views internationalization as a tool for multinationalization
rather than localization.  I feel that a strong argument can be made
for internationalization as a means of providing fully data-driven
localizations of software.  As such, the argument of monolingual vs.
multilingual is not supported.  However, lexical sort order can be
enforced in the access rather than the storage mechanism, making this
a moot point.
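
To illustrate the last point, a minimal sketch (the table name and
layout are hypothetical) of a comparison routine that sorts through a
data-driven, per-locale weight table while the stored text stays in
plain Unicode order:

/*
 * Sort order enforced in the access mechanism, not the storage
 * mechanism.  coll_weight[] is loaded per locale at run time and maps
 * each 16-bit value to its collation position; entry 0 must be 0 and
 * real characters are assumed to get nonzero weights, so that shorter
 * strings sort first.  Strings are 0-terminated arrays of unichar,
 * and the stored text itself is never rewritten or re-encoded.
 */
typedef unsigned short unichar;

extern unsigned short coll_weight[65536];	/* per-locale, data driven */

int
coll_cmp(const unichar *a, const unichar *b)
{
	while (*a != 0 && coll_weight[*a] == coll_weight[*b]) {
		a++;
		b++;
	}
	return ((int)coll_weight[*a] - (int)coll_weight[*b]);
}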


>2) there is no trivial way to sort anything.
>   An elementary sort program will require access to enormous
>   tables for all possible languages.
>
>   English: A B C D E ... T ...
>   Russian: A .. B ... E ... C T ...

I believe this is addressed adequately in the ISO standards; however,
the lexical-order argument is one of the sticking points in the
Japanese acceptance of Unicode, and it is a valid argument in that
arena.  The fact of the matter is that Unicode is not an information
manipulation standard, but (for the purposes of its use in
internationalization) a storage and I/O standard.  Viewed this way,
the lexical ordering argument is not applicable.


>3) there is no reasonable way to do hyphenation.
>   Since there is no way to tell language from the text there
>   is no way to do any reasonable attempts to hyphenate.
>   [OX - which language this word is from]?
>
>   Good-bye wordprocessors and formatters?

By this, you are obviously not referring to ideographic languages, such
as Han, since hyphenation is meaningless for such languages.  Setting
aside the argument that if you don't know how to hyphenate in a
language, you have no business generating situations requiring
hyphenation, by virtue of the fact that you are basically illiterate
in that language... ;-).

Hyphenation as a process is language dependent and, in particular,
dependent on the rendering mechanism (rendering mechanisms are *not*
the subject under discussion; storage mechanisms *are*).  Bluntly
speaking, why does one need word processing software at all if this
type of thing is codified?  Hyphenation, like sorting, is manipulation
of the information in a native-language-specific way.

Find another standard to tell you how to write a word processor.


>4) "the similar gliphs" in Unicode are often SLIGHTLY different
>   typographical gliphs -- everybody who ever dealt with international
>   publishing knows that fonts are designed as a WHOLE and every
>   letter is designed with all others in mind -- i.e. X in Cyrillic
>   is NOT the same X as Latin even if the fonts are variations of
>   the same style. I'd wish you to see how ugly the Russian
>   texts prited on American desktop publishing systems with
>   "few characters added" are.
>
>   In reality it means that Unicode is not a solution for
>   typesetting.

No, you're right; neither is it a standard for the production of
pipefittings or the design of urban transportation systems.  Your
complaint is one of the representation of multilingual text using the
same characters (as a result of unification) in the same document.

>Having unique glyphs works ONLY WITHIN a group of languages
>which are based on variations of a single alphabet with
>non-conflicting alphabetical ordering and sets of
>vowels. You can do that for European languages.
>An attempt to do it for different groups (like Cyrillic and Latin)
>is disastrous at best -- we already tried it and finally came to
>the encodings with two absolutely separate alphabets.
>
>I think that there are not many such groups, though, and it is possible
>to identify several "meta-alphabets". The meta-alphabets have no
>defined rules for cross-sorting (unlike letters WITHIN one
>meta-alphabet; you CAN sort English and German words together
>and it still will make sense; sorting Russian and English together
>is at best useless). It increases the number of codes but not
>as drastically as codifying languages; there are hundreds of
>languages based on a dozen meta-alphabets.

Setting aside for the moment that you are worrying about the output
mechanism for such a document before worrying about the input mechanism
whereby such a document can be created: the Unicode 1.0 standard (in
section 2.1) clearly makes a distinction between "Plain" and "Fancy"
text:

] Plain and Fancy Text
]
] Plain text is a pure sequence of character codes; plain Unicode text
] is a sequence of Unicode character codes.  Fancy text is any text
] representation consisting of plain text plus added information such
] as font size, color, and so on.  For example, a multifont text as
] formatted by a desktop publishing system is fancy text.

Clearly, then, the applications you are describing are *not* Unicode
applications, but "fancy text" applications which could potentially
make use of Unicode for character storage.
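
One way to picture the distinction (the structures below are mine, not
anything defined by the standard): the plain text is nothing but the
code sequence, and the "fancy" additions ride alongside it, belonging
to the application rather than to the character encoding:

/*
 * Illustrative structures only.  "Plain text" is just the character
 * codes; "fancy text" is the same codes plus added information (font,
 * size, and so on) kept by the word processor or typesetter itself.
 */
typedef unsigned short unichar;

struct plain_text {
	unichar *codes;			/* the Unicode character codes */
	unsigned long len;
};

struct attr_run {
	unsigned long start, len;	/* span of characters covered */
	char facename[32];		/* typeface for this run */
	int points;			/* size */
	char lang[8];			/* e.g. "ru", "el" -- the
					 * application's business */
};

struct fancy_text {
	struct plain_text text;		/* the plain text, unchanged */
	struct attr_run *runs;		/* the added information */
	unsigned long nruns;
};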

This is, incidentally, the resolution of the Chinese/Japanese/Korean
unification arguments.

>>The fact that all character sets do not occur in their local lexical order
>>means that a particular character can not be identified as to language by
>>its ordinal value.  This is a small penalty to pay for the vast reduction
>>in storage requirements between a 32-bit and a 16-bit character set that
>>contains all required glyphs.
>
>Not true. First of all, nothing forces you to use a 32-bit
>representation where only 10 bits are necessary.

This would be Runic encoding, right?  I can post the Plan-9 and Metis
mechanisms for doing this, if you want.  Both are, in my opinion,
vastly inferior to other available mechanisms.  In particular, the
requirement of using up to 6 characters to represent a single 31-bit
value is particularly repulsive, especially for glyphs in excess of
hex 04000000 (where the 6-character encoding is mandatory).  Far
eastern users already pay the penalty of effectively half the disk
space per glyph for storage of texts using raw (16-bit) Unicode.
Admittedly, this has more to do with their use of pictographic rather
than phonetic writing, but asking them to sacrifice yet more disk
space for Western convenience is ludicrous.
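
For the curious, the length rule of that style of encoding looks
roughly like this (a sketch from my reading of the Plan-9 proposal, so
treat the exact thresholds as my recollection; note that the mandatory
6-byte case kicks in at hex 04000000, as above):

/*
 * Bytes needed to store one value under the variable-length byte
 * encoding.  Everything past 7 bits costs at least two bytes, the
 * whole 16-bit Unicode range costs up to three, and anything at or
 * above hex 04000000 costs the full six.
 */
int
enc_len(unsigned long v)			/* v is a 31-bit value */
{
	if (v < 0x80UL)       return (1);	/*  7 bits */
	if (v < 0x800UL)      return (2);	/* 11 bits */
	if (v < 0x10000UL)    return (3);	/* 16 bits */
	if (v < 0x200000UL)   return (4);	/* 21 bits */
	if (v < 0x4000000UL)  return (5);	/* 26 bits */
	return (6);				/* 31 bits */
}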

Since the 386BSD file system works on byte boundaries, I can't believe
you're suggesting direct 10-bit encoding of characters, right?


>So, as you see the Unicode is more a problem than a solution.
>The fundamental idea is simply wrong -- it is inadequate for
>anything except for Latin-based languages. No wonder we're
>hearing that Unicode is US-centric.
>
>Unfortunately Unicode looks like a cool solution for people who
>never did any real localization work and i fear that this
>particular mistake will be promoted as standard presenting
>us a new round of headache. It does not remove necessity to
>carry off-text information (like "X-Language: english") and
>it makes it not better than existing ISO 8-bit encodings
>(if i know the language i already know its alphabet --
>all extra bits are simply wasted; and programs handling Unicode
>text have to know the language for reasons stated before).

I don't see many multinational applications or standards coming out
of Zambia or elsewhere (to point out the fact that they have to come
from somewhere, and the US is as good a place as any).  The fact that
much of Unicode is based on ISO standards, and that ISO-10646
encompasses all of Unicode, means that there is more than US support
and input behind the standard.

>UNICODE IS A *BIG* MISTAKE.
>
>(Don't get me wrong -- i'm for the universal encoding; it's
>just that particular idea of unique glyphs that i strongly
>oppose).

I am willing to listen to arguments for any accepted or draft standards
you care to put forward.

Arguments *against* proposals are well and good, as long as the
criticism is accompanied by constructive suggestions.


					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
-- 
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------