*BSD News Article 9350


Received: by minnie.vk1xwt.ampr.org with NNTP
	id AA5616 ; Fri, 01 Jan 93 01:51:02 EST
Newsgroups: comp.unix.bsd
Path: sserve!manuel.anu.edu.au!munnari.oz.au!spool.mu.edu!agate!dog.ee.lbl.gov!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: Dumb Americans (was INTERNATIONALIZATION: JAPAN, FAR EAST)
Message-ID: <1992Dec28.062554.24144@fcom.cc.utah.edu>
Keywords: Han Kanji Katakana Hirugana ISO10646 Unicode Codepages
Sender: news@fcom.cc.utah.edu
Organization: University of Utah Computer Center
References: <id.M2XV.VTA@ferranti.com> <1992Dec18.043033.14254@midway.uchicago.edu> <1992Dec18.212323.26882@netcom.com> <1992Dec19.083137.4400@fcom.cc.utah.edu> <2564@titccy.cc.titech.ac.jp>
Date: Mon, 28 Dec 92 06:25:54 GMT
Lines: 164

In article <2564@titccy.cc.titech.ac.jp>, mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:
|> In article <1992Dec19.083137.4400@fcom.cc.utah.edu>
|> 	terry@cs.weber.edu (A Wizard of Earth C) writes:
|> 
|> >US engineers produce software for the available market; because of the
|> >input difficulties involved in 6000+ glyph sets of symbols, there has been
|> >a marked lack of standardization in Japanese hardware and software. This
|> >means that the market in Japan consists mostly of "niche" markets, rather
|> >than being a commodity market.
|> 
|> Do you know what Shift JIS is? It's a de facto standard for character
|> encoding established by Microsoft, NEC, ASCII, etc., and common in the
|> Japanese PC market.

I am aware of JIS; however, even you must agree that the Japanese hardware
and software markets have not reached the level of "commodity hardware"
found elsewhere in the world (i.e., the US and Europe).  There are multiple
conflicting platforms, and thus multiple conflicting code sets for
implementation.  If we had to pick one platform to support (I am loath to
do this, as it means support for other platforms may be ignored until
something incompatible has fossilized), it would probably be the NEC 98,
which is not even PC compatible.

I think other mechanisms, such as ATOK, Wnn, and KanjiHand, deserve to be
examined.  One method would be to adopt exactly the input mechanism of
"Ichi-Taro" (the most popular NEC 98 word processor).

|> Now, DOS/V from IBM strongly supports Shift JIS.
|> 
|> In the workstation market in Japan, some support Shift JIS, some
|> support EUC and some support both. Of course, many US companies
|> sell Japanized UNIX on their workstations.

I think this is precisely what we want to avoid -- localization.  The basic
difference, to my mind, is that localization involves the maintenance of
multiple code sets, whereas internationalization requires the maintenance
of multiple data sets, a much smaller job.

|> >This has changed somewhat with the Nintendo
|> >corporation's recent successes in Japan, where standardized hardware is
|> 
|> I'm sure you are just joking here.

Yes, this was intended to be a jab at localization of a system as opposed
to internationalization.  The sets of Nintendo games in the US and Japan
are largely non-intersecting sets of software... games sold in the US are
not sold in Japan and vice versa.  I feel that "localization" is the
"Nintendo" solution.  I also feel that we need to be striving for a level
of complexity well above that of a toy.

|> >Microsoft has adopted Unicode as a standard.  It will probably be the
|> >prevalent standard because of this -- the software world is too wrapped
|> >up in commodity (read "DOS") hardware for it to be otherwise.  Unicode
|> >has also done something that XPG4 has not: unified the Far Eastern and
|> >all other written character sets in a single font, with room for some
|> >expansion (to the full 16 bits) and a discussion of moving to a full
|> >32 bit mechanism.
|> 
|> Do you know that Japan voted AGAINST ISO10646/Unicode, because it's not
|> good for Japanese?
|> 
|> >So even if the Unicode standard ignores backward compatability
|> >with Japanese standards (and specific American and European standards),
|> >it better supports true internationalization.
|> 
|> The reason for disapproval is not backward compatibility.
|> 
|> The reason is that, with Unicode, we can't achieve internationalization.

This I don't understand.  The maximum translation table from one 16-bit
value to another is 16k.  This means two 16k tables for translation into
and out of Unicode for input/output devices, and one 16k table and one
512-byte table if a compact storage method is used to remove the normal 2X
storage penalty for 256-character languages, like most European languages.

I don't see why the storage mechanism in any way affects the validity of
the data -- and thus I don't understand *why* you say "with Unicode, we
can't achieve internationalization."
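
To make the table mechanism concrete, here is a minimal sketch in C of the
lookups I have in mind.  The table names, and the choice of ISO 8859-1 as
the local set (it happens to map straight onto the first Unicode page),
are my own inventions for illustration:

    #include <stdio.h>

    /*
     * Hypothetical translation tables.  uni_to_local is declared over
     * the full 16-bit code space here for simplicity; a real
     * implementation would compact it, as discussed above.
     */
    static unsigned char  uni_to_local[65536];  /* Unicode -> local  */
    static unsigned short local_to_uni[256];    /* local -> Unicode  */

    int
    main(void)
    {
        int c;

        /* ISO 8859-1 maps directly onto Unicode's first 256 points. */
        for (c = 0; c < 256; c++) {
            local_to_uni[c] = (unsigned short) c;
            uni_to_local[c] = (unsigned char) c;
        }

        /* Push one character through each table. */
        printf("local 0xE9 -> U+%04X\n", local_to_uni[0xE9]);
        printf("U+00E9 -> local 0x%02X\n", uni_to_local[0x00E9]);
        return 0;
    }

The translation is a straight array index either way; nothing about the
storage form of the data is touched.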

|> >XPG4, by adopting the JIS standard, appears to be
|> >ignoring Han (Chinese) and many other languages covered by the Unicode
|> >standard.
|> 
|> Unicode can not cover both Japanese and Chinese at the same time, because
|> the same code points are shared between similar characters in Japan
|> and in China.

I don't understand this, either.  This is like saying PC ASCII cannot cover
both the US and the UK because the American and English pound signs are not
the same, or that it can't cover German or Dutch because of the
seven-character difference needed to support those languages.
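
As a toy illustration of that analogy (the table below is invented, though
the UK national variant of ASCII really did put the pound sign at 0x23),
the same code point simply displays through whichever table is active:

    #include <stdio.h>

    /*
     * One code point, two display tables.  Under the US table 0x23
     * shows '#'; under the UK national variant it shows the pound
     * sign.  Interchange using the code point itself still works.
     */
    struct codepage {
        const char *name;
        const char *glyph_for_0x23;
    };

    static struct codepage pages[] = {
        { "US", "number sign (#)" },
        { "UK", "pound sign"      },
    };

    int
    main(void)
    {
        int i;

        for (i = 0; i < 2; i++)
            printf("code page %s: 0x23 displays as %s\n",
                pages[i].name, pages[i].glyph_for_0x23);
        return 0;
    }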

|> Of course, it is possible to LOCALIZE Unicode so that it produces
|> Japanese characters only or Chinese characters only. But don't we
|> need internationalization?

The point of an internationalization effort (as *opposed* to a localization
effort) is the coexistence of languages within the same processing
environment.  The point is not to produce something which is capable of
"only English" or "only French" or "only Japanese" at the flick of an
environment variable; the point is to produce something which is *data
driven* and localized by a change of data rather than by a change of code.
To do otherwise would require a separate code tree for each language, which
was the entire impetus for an internationalization effort in the first
place.
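
A trivial sketch of the data-driven idea, with a made-up catalog; nothing
here is from any existing package:

    #include <stdio.h>
    #include <string.h>

    /*
     * Toy message catalog: the code below never changes per language;
     * only this data (or the file it would be loaded from) does.
     */
    struct message {
        const char *lang;
        const char *greeting;
    };

    static struct message catalog[] = {
        { "english", "Hello"   },
        { "french",  "Bonjour" },
        { NULL,      NULL      }
    };

    int
    main(int argc, char *argv[])
    {
        const char *lang = (argc > 1) ? argv[1] : "english";
        struct message *m;

        for (m = catalog; m->lang != NULL; m++)
            if (strcmp(m->lang, lang) == 0) {
                printf("%s\n", m->greeting);
                return 0;
            }
        fprintf(stderr, "no catalog for %s\n", lang);
        return 1;
    }

Adding a language is a new data entry (or data file), not a new code tree.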

|> Or, how can I process a text containing both Japanese and Chinese?

Obviously, the input mechanisms will require localization for the set of
characters out of the Unicode set which will be used for a particular
language; there is no reason JIS input cannot be used to input Unicode
as well as any other font; your argument that the lexical order of the
target language affects the usability of a storage standard is invalid.
Sure, the translation mechanisms may be *easier* to code given localization
of lexical ordering, but that doesn't mean they *can't* be coded otherwise;
if it were easy, we'd do it in hardware.  ;-).
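
For instance, a Shift JIS front end could emit Unicode with nothing more
than a mapping table behind it; this sketch invents the table name and
stubs its contents, since only the framing logic is the point:

    #include <stdio.h>

    /*
     * Hypothetical Shift JIS front end feeding Unicode out the back.
     * sjis_to_uni would be generated from a published mapping table;
     * it is left zeroed here as a stub.
     */
    static unsigned short sjis_to_uni[65536];   /* sparse; stub only */

    static int
    is_sjis_lead(int c)
    {
        /* Shift JIS lead bytes fall in these two ranges. */
        return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xEF);
    }

    int
    main(void)
    {
        int c, c2;

        while ((c = getchar()) != EOF) {
            if (is_sjis_lead(c) && (c2 = getchar()) != EOF)
                printf("U+%04X\n", sjis_to_uni[(c << 8) | c2]);
            else
                printf("U+%04X\n", (unsigned short) c);  /* ASCII */
        }
        return 0;
    }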

|> >I think that Japanese
|> >users (and European and American users, if nothing is done about storage
|> >encoding to 8 bit sets) are going to have to live with the drawbacks of
|> >the standard for a very long time (the primary one being two 16K tables
|> >for input and output for each language representable in 8 bits, and two
|> >16k tables for runic mapping for languages, like Japanese, which don't
|> >fit on keyboards without postprocessing).
|> 
|> What? 16K? Do you think 16K is LARGE?
|> 
|> Then, you know nothing about how Japanese is input. We are happily using
|> several hundred kilobytes or even several megabytes of electronic
|> dictionaries, even on PCs.

No, I don't think 16k is large; however, the drawback is not in the size of
the tables, but in their use on every character coming in from an input
device or going out to an output device.  In addition, an optimization of
the file system to allow for "lexically compact storage" (my term) is
necessary to make Americans and Europeans accept the mechanism.  This
involves yet another set of localization-specific storage tables to
translate from an ISO or other local font to Unicode and back on attributed
file storage.  To do otherwise would require 16-bit storage of files, or
worse, runic encoding of any non-US-ASCII characters in a file.  This
either doubles the file size for all text files (something the West
_will_not_accept_), or "pollutes" the files (all files except those stored
in US ASCII have file sizes which no longer reflect true character counts).
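
A sketch of what I mean by attributed storage, with the per-file attribute
faked as a built-in ISO 8859-1 table (a real system would fetch it from
the file system):

    #include <stdio.h>

    /*
     * "Lexically compact storage": the file stays 8 bits wide on
     * disk, and a per-file attribute says which 256-entry table
     * widens it to Unicode on the way in.  The byte count on disk
     * still equals the character count.
     */
    static unsigned short to_uni[256];  /* the file's attribute table */

    int
    main(int argc, char *argv[])
    {
        FILE *fp;
        int c;
        long nchars = 0;

        if (argc != 2 || (fp = fopen(argv[1], "r")) == NULL) {
            fprintf(stderr, "usage: widen file\n");
            return 1;
        }
        for (c = 0; c < 256; c++)       /* fake an ISO 8859-1 attribute */
            to_uni[c] = (unsigned short) c;

        while ((c = getc(fp)) != EOF) {
            unsigned short u = to_uni[c];   /* 16 bits wide in memory */
            nchars++;
            (void) u;           /* a real program would process u */
        }
        fclose(fp);
        printf("%ld characters, %ld bytes on disk\n", nchars, nchars);
        return 0;
    }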

Admittedly, these mechanisms are adaptable to XPG4 (not widely available)
and XPG3 (which does not support Eastern languages), but the Microsoft
adoption of Unicode tells us that at least 90% of the market is committed
to Unicode, if not now, then in the near future.


I would like to hear any arguments anyone has regarding *why* Unicode is
"bad" and should not be adopted in the remaining 10% of the market (thus
ensuring incompatibility and a lack of interoperability which is guaranteed
to prevent penetration of the existing 90%).


					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------