*BSD News Article 9489

Received: by minnie.vk1xwt.ampr.org with NNTP
	id AA5840 ; Fri, 01 Jan 93 01:57:39 EST
Newsgroups: comp.unix.bsd
Path: sserve!manuel.anu.edu.au!munnari.oz.au!spool.mu.edu!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: INTERNATIONALIZATION: IN GENERAL
Message-ID: <1993Jan2.083734.22776@fcom.cc.utah.edu>
Keywords: Han Kanji Katakana Hiragana ISO10646 Unicode Codepages
Sender: news@fcom.cc.utah.edu
Organization: Weber State University  (Ogden, UT)
References: <1ht8v4INNj7i@rodan.UU.NET> <1993Jan1.094759.8021@fcom.cc.utah.edu> <1i2k09INN4hl@rodan.UU.NET>
Date: Sat, 2 Jan 93 08:37:34 GMT
Lines: 439

A discussion between Vadim Antonov (V) and myself (T):

V: 1) "mechanistic" conversion between upper and lower case
V:    is impossible (as well as case-insensitive comparisons)
V: 
V:    Example:     Latin  T -> t
V: 		Cyrillic T -> m
V: 		Greek T -> ?
V: 
V:    This property alone renders Unicode useless for any business
V:    applications.

T: This is simply untrue.  The fact that a subtractive/additive conversion
T: is impossible in *some* cases does not mean a *mechanistic* conversion is
T: also impossible.  In particular, a tabular conversion is an obvious
T: approach which has already been used with success, with a minimal
T: (multiply plus dereference) overhead.
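
To make "tabular conversion" concrete, here is a minimal sketch, assuming
a single-byte character set; the table contents below are illustrative,
not any standard's mapping.  The point is that a single table lookup
handles alphabets (such as Cyrillic) for which no additive offset works:

	#include <limits.h>

	static unsigned char lower_tab[UCHAR_MAX + 1];

	static void
	init_lower_tab(void)
	{
		int c;

		for (c = 0; c <= UCHAR_MAX; c++)
			lower_tab[c] = (unsigned char)c;	/* identity */
		for (c = 'A'; c <= 'Z'; c++)
			lower_tab[c] = (unsigned char)(c - 'A' + 'a');
		/*
		 * A localization would fill in its own letter pairs
		 * here (e.g. KOI-8 Cyrillic) as pure data.
		 */
	}

	static int
	to_lower(int c)
	{
		return (lower_tab[(unsigned char)c]);
	}

For a 16-bit code the "multiply plus dereference" is the same idea with a
two-level table indexed by the high and low bytes of the character.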

V: You omitted one small "detail" -- you need to know the language of the
V: word the letter belongs to in order to make a conversion. Since Unicode
V: does not provide for specifying the language, it is obvious that it
V: should be obtained from the user or kept somewhere off the text. In both
V: cases, as our program ALREADY knows the language from the environment,
V: it knows the particular (small) alphabet -- no need to use multibyte
V: encodings! See how Unicode renders itself useless?

Correct.  You need to know the language, because the information you are
storing is which glyph to display rather than the language and the glyph.

There are several problems with a unique ordinal value per glyph, where
a particular glyph is not unique within the set of glyphs.  In particular,
programs which process text as data (like C compilers) require the ability
to distinguish characters.  If one looks at the JIS standard, one sees that
it includes an English alphabet.  Without unification between this and the
ISO-Latin-1 font, for instance, a great deal of additional code is required
to allow the compiler to recognize characters (basically, to do its own
unification).  You can't tell whether the characters in the string "printf"
were input in a JIS or a Latin-1 font just by looking at them, but the
compiler can certainly tell that they are distinct.
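
As a hedged illustration of the "do its own unification" burden: in the
Unicode standard the fullwidth (JIS-style) Latin letters sit in a
compatibility block at U+FF01 through U+FF5E, a fixed offset of 0xFEE0
from ASCII 0x21 through 0x7E.  A compiler accepting both would have to
fold them itself, something like this sketch:

	typedef unsigned short ucode;	/* 16-bit storage unit */

	/*
	 * Fold a fullwidth Latin character onto its ASCII equivalent,
	 * so that "printf" is the same identifier no matter which
	 * font it was input in.
	 */
	static ucode
	fold_fullwidth(ucode c)
	{
		if (c >= 0xFF01 && c <= 0xFF5E)
			return ((ucode)(c - 0xFEE0));
		return (c);
	}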

In order to provide "natural" operations on words (such as hyphenation,
case conversion, and, in particular, abbreviation, all of which are
potentially desirable in our hypothetical program which also alphabetizes),
you also require information about the language.  Hyphenation and
abbreviation, in particular, require a detailed knowledge of sequences
of glyphs (ie: words).  This information will not be available regardless
of your choice of glyph encoding standard.

Other word processing operations (such as dictionary and thesaurus use
within the program) require knowledge of which language to use.

The idea of sort order should be (and is in Unicode) divorced from the
idea of information storage.  The fact that one will have text files,
data files, and text files which act as data files on the same machine
*requires* some type of promiscuous [out-of-band] method of determining
the format of the data within a file.  This method, whether it be
language tagging of the files in the inode, or language tagging of the
user during the login process, is imperative.  To do otherwise means
that your localization data coexists with system data rather than
system data being localized as well.

The operations you wish to perform are the province of applications running
on the system, not the system itself.  Regardless of whether this is done
by an application programmer (as a per application localization) or by the
creator of a library used by applications (as part of development system
localization), THE CODE BELONGS IN USER SPACE.


V: I wonder why programmers aren't taught mathematical logic. I'm somehow
V: an exception because i'm a mathematician by education and i am used to
V: looking for holes in "logical" statements.

Most American programmers are, if they attempt to get a degree at an
institute of higher learning in the US.  Most are also forcibly taught
how to bowl or shoot a bow and arrow as part of their graduation
requirements.

A point of contention: a logician is, by discipline, a philosopher, not a
mathematician.  Being the latter does not qualify one as the former.

The point that one must know what language a particular document is written
in before one can manipulate it was *not* omitted; it was taken as an *axiom*.

T: The Cyrillic characters within the Unicode standard (U+0400 -> U+04FF)
T: are based on the ECMA registry under ISO-2375 for use with ISO-2022.  It
T: contains several Cyrillic subsets.  The most recent and most widely
T: accepted of these is ISO-8859-5.  Unicode uses the same relative
T: positions as in ISO-8859-5.  Are you also averse to ISO-8859-5?

V: ISO-8859-5 is ok, though it is a dead code. Nobody uses it in Russia,
V: mind you. The most wide-spread codes are KOI-8 (de-facto Unix and
V: networking standard) and the so-called "alternative" code which is
V: popular among MS-DOS users.

With all due respect, ISO-8859-5 is the international standard to which
engineering outside of Russia is done for use in Russia.  Barring
another published standard for external use, this is probably what
Russian users are going to be stuck with for code originating outside
of Russia.  I suggest that if this concerns you, you should have the
"de facto standard" codified for use by external agencies.

One wonders why Russian nationals registered this supposed "non-standard"
with ECMA if the standard is not actually used in Russia.

T: Of these, some argument can be made against only the final paragraph,
T: since it views internationalization as a tool for multinationalization
T: rather than localization.  I feel that a strong argument can be held
T: out for internationalization as a means of providing fully data-driven
T: localizations of software.  As such, the argument of monolingual vs.
T: multilingual is not supported.  However, lexical sort order can be
T: enforced in the access rather than the storage mechanism, making this
T: a moot point.

V: Nay, you missed the same point again. You need information about
V: language case-conversion and sorting rules, and you can obtain it from
V: the encoding (making user programs simple) or from user programs
V: (forcing them to ask the user at every step or to keep track of the language).
V: What would you choose for your program?

The process of "asking the user" is near 0 cost regardless of whether the
implementation is some means of file attribution per language or some
method of user attribution (a la proc struct, password file, or environment).
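
A minimal sketch of the environment flavor of this, using only the
standard getenv() and setlocale() calls; the one-time lookup at startup
is the "near 0 cost":

	#include <locale.h>
	#include <stdio.h>
	#include <stdlib.h>

	int
	main(void)
	{
		const char *lang = getenv("LANG");	/* user attribution */

		if (lang == NULL)
			lang = "C";		/* default localization */
		setlocale(LC_ALL, "");		/* honor the environment */
		printf("localizing for: %s\n", lang);
		return (0);
	}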

It becomes more complicated if you are attempting a multinational document;
the point here is to enable localization with user-supplied data sets
rather than providing a tool for linguistic scholars or multilingual
word processors.  It is possible to do both of these things within the
confines of Unicode, penalizing only the authors of the applications.

V: Besides, as i already argued, asking or keeping off-text information
V: makes the whole enterprise useless.

This is perhaps true if the goal is multinationalization rather than
internationalization or localization.  Consider a document in Japanese,
Tamil, and Devanagari (Sanskrit).  How does one resolve the issues of
input mechanism for these languages?  JIS encoding does not cut it.
Basically, for a multinational document, there must be multiple instances
of input mechanisms, or a switch between input mechanisms during the
input process.  A switch between mechanisms is a sufficient indicator of a
switch between languages, since each input mechanism will be more or
less language specific in any case because of the N->M keyboard mapping
issues if nothing else.

I believe that multinational documents will be the exception, not the
rule.  I further believe that in the specific case of multinational
documents, the use of a particular in-band storage mechanism (such as
"Fancy Text" from the Unicode 1.0 standard) is not an unacceptable
penalty for exceptional use.

I believe the goal is *NOT* multinationalization, but internationalization.
In this context, internationalization refers not to the ability to provide
perfect access to all supported languages (by way of glyph preference), but
refers instead to an enabling technology to allow better operating system
support for localization.

Multinational use is out of the question until modifications are made to
the file system in terms of supporting multiple nation name spaces for the
files.

Localization in terms of multinationalization requires other considerations
not directly possible; in particular, the concept of "well known file system
objects" must be adjusted.  Consider, if you will, the fact that such a
localization of the existing UNIX infrastructure is currently impossible
in this framework.  I am thinking in particular of renaming the /etc
directory or the /etc/passwd file to localized non-English equivalents.
The idea of multinationalization falls under its own weight.  Consider a
system used by users of several languages (ie: a multinational environment).
Providing each user with their own language's view of the files requires a
minimum of the number of well known file names times the number of
languages (bearing in mind that translation may affect name length) for
directory information alone.  Now consider that each of these users will
want their names and passwords to be in their own language in a single
password file.

Multinationalization is possible, but of questionable utility and merit
in current computing systems.  We need only worry about providing the
mechanisms for concurrency of use for the translators.


Consider now the goal of data-driven localization (a single translation
for all system application programs, and switching of language environments
without application recompilation).

Does this goal require internationalization of applications?  The answer is
no.  The only thing it requires is internationalization of the underlying
system to allow data-driving of localization.  Applications themselves need
only be localized through their use of the underlying system.
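
A sketch of what "localized through their use of the underlying system"
can look like, using the X/Open message catalog calls (catopen() and
catgets() from <nl_types.h>); the catalog name "example" and the
set/message numbers are hypothetical:

	#include <nl_types.h>
	#include <stdio.h>

	int
	main(void)
	{
		nl_catd cat;

		/*
		 * The catalog file is located via NLSPATH and LANG;
		 * switching languages swaps a data file, with no
		 * recompilation of the application.
		 */
		cat = catopen("example", 0);
		fputs(catgets(cat, 1, 1, "hello, world\n"), stdout);
		if (cat != (nl_catd)-1)
			catclose(cat);
		return (0);
	}

If the catalog cannot be opened, catgets() returns the default string
given as its last argument, so the program degrades to its compiled-in
language.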

Rather than rewriting all applications which use text as data (cf: the C
compiler example), unification of the glyph sets makes more sense.

The only goal I am espousing here is enabling for localization.  For this
task, Unicode is far from useless.

T: I believe this is addressed adequately in the ISO standards; however,

V: Your belief is wrong, for it is not considered adequate by real users.

Then "real users" can supply a codified alternative in short order or lump it.

T: the lexical order argument is one of the sticking points against the
T: Japanese acceptance of Unicode, and is a valid argument in that arena.
T: The fact of the matter is that Unicode is not an information manipulation
T: standard, but (for the purposes of its use in internationalization) a
T: storage and an I/O standard.  Viewed this way, the lexical ordering
T: argument is inapplicable.

V: It'd be a sticking point for Slavic languages as well, you may be sure.
V: Knowing the ex-Soviet standard-making routine i think the official
V: fishy-eyed representatives will silently vote pro to get some more time
V: for raving in Western stores and nobody will use it after that.  The
V: "working" standards in Russia aren't made by committees.

Then this will have to change, or the Russian users will pay the price.
Those of us external to Russia are in no position to involve ourselves in
this process.  Any changes will have to originate in Russia.

I haven't seen you come right out and say that the Cyrillic lexical order
in the Unicode standard (characters U+0400->U+04FF) and in the ISO-8859-5
set is "wrong".  Neither have I seen an alternative lexical order (with an
accompanying rationale) put forth.

V: 3) there is no reasonable way to do hyphenation.
V:    Since there is no way to tell the language from the text there
V:    is no way to make any reasonable attempt to hyphenate.
V:    [OX - which language this word is from]?
V: 
V:    Good-bye wordprocessors and formatters?

T: By this, you are obviously not referring to ideographic languages, such as
T: Han, since hyphenation is meaningless for such languages.  Setting aside
T: the argument that if you don't know how to hyphenate in a language, you
T: have no business generating situations requiring hyphenation, by virtue
T: of the fact that you are basically illiterate in that language... ;-).

V: The reason may be as simple as reformatting a spreadsheet containing
V: (particularly) addresses of companies in a language i don't comprehend
V: (though i can write it on the envelope).

T: Hyphenation as a process is language dependent, and, in particular,
T: dependent on the rendering mechanism (rendering mechanisms are *not*
T: the subject under discussion; storage mechanisms *are*).  Bluntly
T: speaking, why does one need word processing software at all if this
T: type of thing is codified?  Hyphenation, like sorting, is manipulation
T: of the information in a native language specific way.

V: Exactly. But there are a lot of "legal" ways to do hyphenation -- and
V: there are algorithms which do reasonably well knowing nothing about
V: the language except which letters are vowels. It's quite enough
V: for printing address labels. If i'm formatting a book i can specify the
V: language myself.

Address information cannot be hyphenated, at least in US and other Western
mail of which I am personally aware.  This is a non-issue.  This is also
something that is not the responsibility of the operating system or the
storage mechanism therein... unless you are arguing that UFS knows to store
documents without hyphenation, and that the "cat" and "more" programs will
hyphenate for you.  If you are talking about ANY OTHER APPLICATION, THE
HYPHENATION IS THE APPLICATION'S RESPONSIBILITY.  PERIOD.  The fact that you
will have to maintain vowel/consonant tables on a per-language basis is
an obvious outcome of the processing of multinational information.  It makes
little difference to the application user how these tables are keyed.
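
A hedged sketch of the vowel-table approach, in the application where it
belongs; the break rule here (split after a vowel followed by a
consonant, keeping at least two characters on each side) is a naive
illustration, not a real hyphenation algorithm:

	#include <string.h>

	/*
	 * Return the index of a permissible break point in "word", or
	 * 0 if there is none.  The only language knowledge used is the
	 * per-language vowel table, passed in as data.
	 */
	static size_t
	break_point(const char *word, const char *vowels)
	{
		size_t i, len = strlen(word);

		for (i = 1; i + 2 < len; i++)
			if (strchr(vowels, word[i]) != NULL &&
			    strchr(vowels, word[i + 1]) == NULL)
				return (i + 1);
		return (0);
	}

Called as break_point("label", "aeiou") this returns 2 ("la-bel"); a
Cyrillic localization would pass a KOI-8 vowel string instead, keyed
however the system chooses.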

T: Find another standard to tell you how to write a word processor.

V: Is there any? :-)

No, there isn't; that was the point.  It is not the intent of the Unicode
standard to provide a means of performing the operations normally
associated with word processing.  That is the job of the word processor, and
is the reason people who write word processors are paid money by an employer
rather than starving to death.


V: 4) "the similar glyphs" in Unicode are often SLIGHTLY different
V:    typographical glyphs -- everybody who ever dealt with international
V:    publishing knows that fonts are designed as a WHOLE and every
V:    letter is designed with all others in mind -- i.e. X in Cyrillic
V:    is NOT the same X as Latin even if the fonts are variations of
V:    the same style. I'd wish you to see how ugly the Russian
V:    texts printed on American desktop publishing systems with
V:    "few characters added" are.
V: 
V:    In reality it means that Unicode is not a solution for
V:    typesetting.

T: No, you're right; neither is it a standard for the production of
T: pipefittings or the design of urban transportation systems. 

V: You somehow forgot that an increasing number of texts get printed with
V: typographical quality with all the stuff which follows.
V: Ever saw a laser printer?

Printing is simply another user-mode application program which can take
advantage of the language indicators (whether on the file or in a document)
for printing the prettiest, most lovely font of your choice.  Do you think
there are no font-selection mechanisms within PostScript for doing this?

Again, font *changes* only become a problem if one attempts to print a
*multinational* document.  Since we aren't interested in multinationalization,
it's unlikely that a Unicode font containing all Unicode glyphs will be
used for that purpose.

In all likelihood, use will be in a localized environment, *NOT* a
multinational one.  Since this is the case, it follows that the sum total of
the Unicode font implemented in the US will be the ISO Latin-1 set.
Similarly, if you are printing a Cyrillic document, you will be using a
Cyrillic font; the "X" character you are concerned about will be *localized*
to the Cyrillic "X", *NOT* the Latin "X".


V: I see no reason why we should treat regular expression matching
V: as a "fancy" feature.

Because globbing characters are language dependent.  The easiest example
of this is the distinction made between "localized" UNIX SVR4 for English
vs. Spanish.  The fact is, the character set used for Spanish replaces
several characters in the English set with other characters particular to
Spanish (DOS is the foremost example of this, with its reference to code
pages and the fact that DOS file names fall within a very narrow set of
characters).  The globbing ("regular expression pattern match") characters
DO change for any patterns more complicated than "*".
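
A small sketch of the point using the POSIX.2 fnmatch() globbing call
(declared in <fnmatch.h>); what a range such as [a-z] matches is defined
by the collation of the current locale, so the answer below is a property
of the localization, not of the pattern:

	#include <fnmatch.h>
	#include <locale.h>
	#include <stdio.h>

	int
	main(void)
	{
		setlocale(LC_ALL, "");	/* the user's localization */

		/* 0xF1 is n-tilde in ISO Latin-1 */
		if (fnmatch("[a-z]", "\xf1", 0) == 0)
			printf("n-tilde is inside [a-z] here\n");
		else
			printf("n-tilde is outside [a-z] here\n");
		return (0);
	}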

T: Clearly, then, the applications you are describing are *not* Unicode
T: applications, but "Fancy text" applications which could potentially
T: make use of Unicode for character storage.

V: Don't you think that ANY text is going to be fancy because Unicode
V: as it is does not provide adequate means for the trivial operations?

Perhaps any multinational text, yes; for normal text, processing will be
done using the localized form, not the Unicode form; therefore the issue
will never come up, unless the application requires embedded attributes
(like a desktop publishing package).  Since multinational processing is
the exception rather than the rule, let the multinational users pay the
price in terms of "Fancy text".

V: As well i can provide every text with a font file. It is not a solution
V: at all.

Quite right; but doing so would be redundant unless you were using an
output device, such as a CRT or a printer.  It is the responsibility of
the output device to present the data in a suitable format.  For the most
part, except for printing, which is difficult enough currently, this will
be done by using localized fonts containing only a part of the full
Unicode set (that part necessary for the localization language in use for
that session/user/file) and thus will be coherently defined within the
context of its localization.

Again, multinational software is not being addressed; however, were we to
address the issue, I suspect that it would, in all cases, be implementation
dependent upon the multinational application.

V: Thank you, i already expressed my opinion on Plan 9 UTF in comp.os.research.
V: I also do not think it's exciting. There are much more efficient runic
V: encodings (my toy OS uses 7 bits per byte and the 8th bit as a continuation
V: indicator).
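
For reference, a sketch of such a continuation-bit scheme as I read the
description (emitting the low-order seven bits first is my assumption;
his toy OS may pack them differently):

	/*
	 * Encode "code" at 7 data bits per byte, with the 8th bit set
	 * on every byte except the last byte of the character.  The
	 * return value -- the number of bytes written -- varies per
	 * character, which is exactly the trouble described below.
	 */
	static int
	encode7(unsigned long code, unsigned char *out)
	{
		int n = 0;

		do {
			out[n] = (unsigned char)(code & 0x7F);
			code >>= 7;
			if (code != 0)
				out[n] |= 0x80;	/* more bytes follow */
			n++;
		} while (code != 0);
		return (n);
	}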

I don't know how stridently I can express this: runic encoding destroys
information (such as file size = character count) and makes file system
processing of character substitutions totally unacceptable... consider the
case of substituting a character requiring 3 bytes to encode for one
that takes 1 byte (or 4 bytes) currently.  Say further that it is the
first character in a 2M file.  You are talking about either shifting the
contents of the entire file, or, MUCH WORSE, going to record-oriented files
for text.  If there is de facto attribution of text vs. other files (shifting
the data is unacceptable.  Period.), there is no reason to avoid making
that attribution as meaningful as possible.
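
The cost, sketched under the assumption of an in-memory image of the
file (a record-oriented store only moves the breakage around); the
buffer is assumed large enough for the growth:

	#include <string.h>

	/*
	 * Replace the oldlen-byte character at the front of a buffer
	 * holding "used" bytes with a replen-byte replacement.  Every
	 * byte behind the substitution point must move; for a 2M file
	 * the memmove() touches the whole file.  A fixed-width
	 * encoding would overwrite the character in place instead.
	 */
	static size_t
	subst_first(unsigned char *buf, size_t used, size_t oldlen,
	    const unsigned char *rep, size_t replen)
	{
		memmove(buf + replen, buf + oldlen, used - oldlen);
		memcpy(buf, rep, replen);
		return (used - oldlen + replen);	/* new length */
	}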

V: Pretty soon it will be a dead standard because of the logical problems
V: in the design. Voting is inadequate replacement for logic, you know.
V: I'd better stick to a good standard from Zambia than to the brain-dead
V: creature of ISO even if every petty bureaucrat voted for it.

I agree; however, the people involved were slightly more knowledgeable
about the subject than your average "petty bureaucrat".  And there has not
been a suggested alternative, only rantings of "not Unicode".

V: I expressed my point of view (and proposed some kind of solution) in
V: comp.std.internat, where the discussion should belong. I'd like you to
V: see the problem not as an exercise in wrestling consensus from an
V: international body but as a mathematical problem. From the logical
V: point of view the solution is simply incorrect and no standard committee
V: can vote out that small fact. The fundamental assumption Unicode is
V: based upon (i.e. one glyph - one code) makes the whole construction
V: illogical and it, unfortunately, cannot be mended without serious
V: redesign of the whole thing.

Wrong, wrong, wrong.

1)	We are not discussing the embodiment of a standard, but the
	applicability of existing standards to a particular problem.
	Basically, we could not care less about anything other than the
	existing or draft standards and their suitability to the task
	at hand, the international enabling of 386BSD.

2)	We are not interested in "arriving" at a new standard or defending
	existing or draft standards, except as regards their suitability
	to our goal of enabling.

3)	The proposal of new solutions (new standards) is neither useful
	nor interesting, in light of our need being "now" and the adoption
	of a new solution or standard being "at some future date".

4)	Barring a suggestion of a more suitable standard, I and others
	will begin coding to the Unicode standard.

5)	Since we are discussing adoption of a standard for enabling of
	localization of 386BSD, and are neither intent on a general defense
	of any existing standard, nor the proposal of changes to an
	existing standard or the embodiment of a new standard, this
	discussion does *NOT* belong in comp.std.internat, since the
	subscribers of comp.unix.bsd are infinitely more qualified to
	determine which existing or draft standard they wish to use
	without a discussion of multinationalization (something only
	potentially useful to a limited audience, and then only at some
	future date when multinational processing on 386BSD becomes a
	generally desirable feature).


V: Try to understand the argument about the redundancy of encoding with
V: externally provided restrictions which i used earlier in this letter.
V: The Unicode committee really got caught in a logical trap and it's a
V: pity few people realize that.

I *understand* the argument; I simply *disagree* with its applicability
to anything other than enabling multinationalization as opposed to
enabling localization, which is the goal.


					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
-- 
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------