*BSD News Article 9478


Return to BSD News archive

Received: by minnie.vk1xwt.ampr.org with NNTP
	id AA5827 ; Fri, 01 Jan 93 01:57:11 EST
Path: sserve!manuel.anu.edu.au!munnari.oz.au!uunet!not-for-mail
From: avg@rodan.UU.NET (Vadim Antonov)
Newsgroups: comp.unix.bsd
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Date: 1 Jan 1993 18:27:05 -0500
Organization: UUNET Technologies Inc, Falls Church, VA
Lines: 217
Message-ID: <1i2k09INN4hl@rodan.UU.NET>
References: <1992Dec30.061759.8690@fcom.cc.utah.edu> <1ht8v4INNj7i@rodan.UU.NET> <1993Jan1.094759.8021@fcom.cc.utah.edu>
NNTP-Posting-Host: rodan.uu.net
Keywords: Han Kanji Katakana Hiragana ISO10646 Unicode Codepages

In article <1993Jan1.094759.8021@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
>In article <1ht8v4INNj7i@rodan.UU.NET> avg@rodan.UU.NET (Vadim Antonov) writes:
>>In article <1992Dec30.061759.8690@fcom.cc.utah.edu> terry@cs.weber.edu (A Wizard of Earth C) writes:
>>1) "mechanistic" conversion between upper and lower case
>>   is impossible (as well as case-insensitive comparisons)
>>
>>   Example:     Latin  T -> t
>>		Cyrillic T -> m
>>		Greek T -> ?
>>
>>   This property alone renders Unicode useless for any business
>>   applications.
>
>This is simply untrue.  Because a subtractive/additive conversion is
>impossible in *some* cases does not mean a *mechanistic* conversion is
>also impossible.  In particular, a tabular conversion is an obvious
>approach which has already been used with success, with a minimal
>(multiply plus dereference) overhead.

You omitted one small "detail" -- you need to know the language of the word
a letter belongs to in order to make the conversion. Since Unicode does not
provide for specifying the language, it obviously has to be obtained from
the user or kept somewhere outside the text. In both cases, since our
program ALREADY knows the language from the environment, it also knows the
particular (small) alphabet -- so there is no need for multibyte encodings!
See how Unicode renders itself useless?
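For reference, the "tabular conversion" being debated works only because the
encoding keeps the three look-alike capitals at distinct code positions. A
minimal sketch in modern Python (the table is illustrative, covering just the
letters from the example above, not a full case-mapping table):

```python
# Table-driven lowercasing over distinct code positions for the
# Latin, Cyrillic and Greek capitals that share one glyph shape.
LOWER_TABLE = {
    "T": "t",            # Latin    U+0054 -> U+0074
    "\u0422": "\u0442",  # Cyrillic Т      -> т
    "\u03a4": "\u03c4",  # Greek    Τ      -> τ
}

def to_lower(text):
    """Lowercase via table lookup; unknown characters pass through."""
    return "".join(LOWER_TABLE.get(ch, ch) for ch in text)

print(to_lower("T\u0422\u03a4"))  # -> "t\u0442\u03c4"
```

The lookup needs no language information precisely because the scripts are
kept apart; unify the identical glyphs into one code and the table becomes
ambiguous, which is the point in dispute.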

I wonder why programmers aren't taught mathematical logic. I'm something of
an exception because i'm a mathematician by education and i tend to look
for holes in "logical" statements.

>The Cyrillic characters within the Unicode standard (U+0400 -> U+04FF)
>are based on the ECMA registry under ISO-2375 for use with ISO-2022.  It
>contains several Cyrillic subsets.  The most recent and most widely
>accepted of these is ISO-8859-5.  Unicode uses the same relative
>positions as in ISO-8859-5.  Are you also averse to ISO-8859-5?

ISO-8859-5 is ok, though it is a dead code. Nobody uses it in Russia,
mind you. The most widespread codes are KOI-8 (the de-facto Unix and
networking standard) and the so-called "alternative" code, which is
popular among MS-DOS users.

[lots of information about the dead code is omitted]
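The incompatibility of these encodings is easy to demonstrate. A small
sketch in modern Python (the codec names postdate this posting, of course)
shows the same Cyrillic capital А (U+0410) landing at three different byte
values:

```python
# The Cyrillic capital "А" (U+0410) has a different byte value in
# each of the three single-byte encodings discussed above.
for codec in ("koi8_r", "cp866", "iso8859_5"):
    byte = "\u0410".encode(codec)[0]
    print(f"{codec:10s} 0x{byte:02X}")
# koi8_r     0xE1
# cp866      0x80
# iso8859_5  0xB0
```

KOI-8's placement is deliberate: stripping the 8th bit of a Cyrillic letter
yields a phonetically similar Latin letter, which is why it survived on
7-bit transport links.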

>The main "disordering" of character sets is with regard to the Japanese
>JIS standard.   The minutes of the 20 Apr 90 UNICODE meeting (as reported
>by Ken Whistler, Metaphor Computer Systems) justify this as follows:

Unfortunately i'm not competent to discuss Japanese and Chinese.

>Of these, some argument can be made against only the final paragraph,
>since it views internationalization as a tool for multinationalization
>rather than localization.  I feel that a strong argument can be held
>out for internationalization as a means of providing fully data driven
>localizations of software.  As such, the argument of monolingual vs.
>multilingual is not supported.  However, lexical sort order can be
>enforced in the access rather than the storage mechanism, making this
>a null point.

Nay, you missed the same point again. You need information about the
language's case-conversion and sorting rules, and you can obtain it either
from the encoding (keeping user programs simple) or inside the user
programs themselves (forcing them to ask the user at every step or to keep
track of the language). Which would you choose for your program?
Besides, as i already argued, asking the user or keeping the information
off-text makes the whole enterprise useless.

>>2) there is no trivial way to sort anything.
>>   An elementary sort program will require access to enormous
>>   tables for all possible languages.
>>
>>   English: A B C D E ... T ...
>>   Russian: A .. B ... E ... C T ...
>
>I believe this is addressed adequately in the ISO standards; however,

Your belief is wrong: the ISO approach is not considered adequate by real users.

>the lexical order argument is one of the sticking points against the
>Japanese acceptance of Unicode, and is a valid argument in that arena.
>The fact of the matter is that Unicode is not an information manipulation
>standard, but (for the purposes of its use in internationalization) a
>storage and an I/O standard.  Viewed this way, the lexical ordering argument
>is nonapplicable.

It'd be a sticking point for the Slavic languages as well, you may be sure.
Knowing the ex-Soviet standard-making routine, i think the official
fishy-eyed representatives will silently vote in favor to get some more
time for roving around Western stores, and nobody will use it afterwards.
The "working" standards in Russia aren't made by committees.
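The sorting problem is concrete even inside a single language. In Unicode
the Russian letter ё sits at U+0451, after я at U+044F, so a raw
code-position sort misplaces every word containing it, and fixing that
requires exactly the per-language table being argued about. A sketch in
modern Python (words chosen purely for illustration):

```python
# Code-position sort vs. Russian alphabetical order: in the alphabet
# "ё" comes right after "е", but its code position (U+0451) is past
# "я" (U+044F), so a raw sort pushes every ё-word to the end.
words = ["ёж", "еда", "яма"]          # hedgehog, food, pit
print(sorted(words))                  # -> ['еда', 'яма', 'ёж']  (wrong)

# A per-language collation table restores alphabetical order:
RU_ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
RANK = {ch: i for i, ch in enumerate(RU_ALPHABET)}
print(sorted(words, key=lambda w: [RANK[c] for c in w]))
# -> ['еда', 'ёж', 'яма']  (correct)
```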

>>3) there is no reasonable way to do hyphenation.
>>   Since there is no way to tell language from the text there
>>   is no way to do any reasonable attempts to hyphenate.
>>   [OX - which language this word is from]?
>>
>>   Good-bye wordprocessors and formatters?
>
>By this, you are obviously not referring to ideographic languages, such as
>Han, since hyphenation is meaningless for such languages.  Setting aside
>the argument that if you don't know how to hyphenate in a language, you
>have no business generating situations requiring hyphenation by virtue
>of the fact that you are basically illiterate in that language... ;-).

The reason may be as simple as reformatting a spreadsheet containing
(among other things) addresses of companies in a language i don't
comprehend (though i can write it on an envelope).

>Hyphenation as a process is language dependent, and, in particular,
>dependent on the rendering mechanism (rendering mechanisms are *not*
>the subject under discussion; storage mechanisms *are*).  Bluntly
>speaking, why does one need word processing software at all if this
>type of thing is codified?  Hyphenation, like sorting, is manipulation
>of the information in a native language specific way.

Exactly. But there are a lot of "legal" ways to do hyphenation -- and
there are algorithms which do reasonably well knowing nothing about the
language except which letters are vowels. That is quite enough for
printing address labels. If i'm formatting a book i can specify the
language myself.
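Such a vowel-only algorithm fits in a few lines. This sketch (modern
Python, my own rough rule rather than any published algorithm) permits a
break after each vowel-consonant boundary, provided a vowel remains to
carry the rest of the word; the word is a transliterated example:

```python
# Language-blind hyphenation: the only linguistic knowledge is the
# vowel set.  A break is allowed after a vowel followed by a
# consonant, as long as another vowel remains to the right.
VOWELS = set("aeiouy" + "аеёиоуыэюя")

def is_vowel(ch):
    return ch.lower() in VOWELS

def hyphenate(word):
    parts, start = [], 0
    for i in range(1, len(word) - 1):
        if (is_vowel(word[i - 1]) and not is_vowel(word[i])
                and any(is_vowel(c) for c in word[i:])):
            parts.append(word[start:i])
            start = i
    parts.append(word[start:])
    return "-".join(parts)

print(hyphenate("molokozavod"))  # -> mo-lo-ko-za-vod
```

Crude, but good enough to break an address label without knowing which
language the word came from.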

>Find another standard to tell you how to write a word processor.

Is there any? :-)


>>4) "the similar glyphs" in Unicode are often SLIGHTLY different
>>   typographical glyphs -- everybody who ever dealt with international
>>   publishing knows that fonts are designed as a WHOLE and every
>>   letter is designed with all others in mind -- i.e. X in Cyrillic
>>   is NOT the same X as Latin even if the fonts are variations of
>>   the same style. I wish you could see how ugly the Russian
>>   texts printed on American desktop publishing systems with
>>   "few characters added" are.
>>
>>   In reality it means that Unicode is not a solution for
>>   typesetting.
>
>No, you're right; neither is it a standard for the production of
>pipefittings or the design of urban transportation systems. 

You somehow forget that an increasing number of texts get printed with
typographical quality, with everything which follows from that.
Ever seen a laser printer?

>Forgetting for the moment that we are worrying about the output mechanism
>for such a document before worrying about the input mechanism whereby such
>a document can be created, the Unicode 1.0 standard (in section 2.1)
>clearly makes a distinction between "Plain" and "Fancy" text:

I see no reason why we should treat regular-expression matching as a
"fancy" feature.

>Clearly, then, the applications you are describing are *not* Unicode
>applications, but "Fancy text" applications which could potentially
>make use of Unicode for character storage.

Don't you think that ANY text is going to be fancy, given that Unicode as
it stands does not provide adequate means for the trivial operations?

>This is, incidentally, the resolution of the Chinese/Japanese/Korean
>unification arguments.

By the same token, i could provide every text with a font file. That is
not a solution at all.

>This would be Runic encoding, right?

Exactly.

>I can post the Plan-9 and Metis
>mechanisms for doing this, if you want.

Thank you, i already expressed my opinion on Plan 9 UTF in
comp.os.research. I do not find it exciting either. There are much more
efficient runic encodings (my toy OS uses 7 bits per byte, with the 8th
bit as a continuation indicator).
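That continuation-bit scheme is essentially a base-128 variable-length
code, the same idea later known as VLQ/LEB128. A sketch of one possible
reading of it in modern Python -- a reconstruction, not the actual toy-OS
code:

```python
# 7 data bits per byte; bit 8 set means "more bytes follow".
# Least-significant group first, so ASCII stays a single byte.
def encode(codepoint):
    out = []
    while True:
        group, codepoint = codepoint & 0x7F, codepoint >> 7
        if codepoint:
            out.append(group | 0x80)   # continuation bit set
        else:
            out.append(group)          # final byte, bit 8 clear
            return bytes(out)

def decode(data):
    value = shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        shift += 7
    return value

print(encode(0x41).hex())   # 'A'          -> '41'   (one byte)
print(encode(0x410).hex())  # Cyrillic 'А' -> '9008' (two bytes)
```

Any code up to 0x7F costs one byte and any 16-bit code at most three,
which is the efficiency claim being made against the fixed-width
alternatives.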

>Since the 386BSD file system works on byte boundaries, I can't believe
>you're suggesting direct 10-bit encoding of characters, right?

The 10 bits were nothing more than an example.

>I don't see many multinational applications or standards coming out
>of Zambia or elsewhere (to point out the fact that they have to come
>from somewhere, and the US is as good as any place else).  The fact
>that much of Unicode is based on ISO standards, and ISO-10646 encompasses
>all of Unicode, means that there is more than US support and input on
>the standard.

Pretty soon it will be a dead standard because of the logical problems in
its design. Voting is an inadequate replacement for logic, you know.
I'd rather stick to a good standard from Zambia than to the brain-dead
creature of ISO, even if every petty bureaucrat voted for it.

>I am willing to listen to arguments for any accepted or draft standards
>you care to put forward.
>
>Arguments *against* proposals are well and good, as long as the constructive
>criticism is accompanied by constructive suggestions.

I expressed my point of view (and proposed some kind of solution) in
comp.std.internat, where this discussion belongs. I'd like you to see the
problem not as an exercise in wrestling consensus from an international
body but as a mathematical problem. From the logical point of view the
solution is simply incorrect, and no standards committee can vote away
that small fact. The fundamental assumption Unicode is based upon (i.e.
one glyph - one code) makes the whole construction illogical, and it,
unfortunately, cannot be mended without a serious redesign of the whole
thing.

Try to understand the argument about the redundancy of an encoding with
external restrictions that i made earlier in this letter. The Unicode
committee really got caught in a logical trap, and it's a pity so few
people realize that.

--vadim