*BSD News Article 9020

Newsgroups: comp.unix.bsd
Path: sserve!manuel.anu.edu.au!munnari.oz.au!spool.mu.edu!sdd.hp.com!cs.utexas.edu!hellgate.utah.edu!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: Re: INTERNATIONALIZATION: JAPAN, FAR EAST
Message-ID: <1992Dec16.221634.4879@fcom.cc.utah.edu>
Keywords: Han Kanji Katakana Hiragana ISO10646 Unicode Codepages
Sender: news@fcom.cc.utah.edu
Organization: Weber State University  (Ogden, UT)
References: <1992Dec14.185028.9757@fcom.cc.utah.edu> <1gksolINNmkg@frigate.doc.ic.ac.uk> <mathias.724467456@sune.stacken.kth.se>
Date: Wed, 16 Dec 92 22:16:34 GMT
Lines: 128

In article <mathias.724467456@sune.stacken.kth.se> mathias@stacken.kth.se (Mathias Bage) writes:
>In <1gksolINNmkg@frigate.doc.ic.ac.uk> kd@doc.ic.ac.uk (Kostis Dryllerakis) writes:
[ ... Re: INTERNATIONALIZATION ... ]
>>        Preliminary attempts have already been made (I personally work
>>under X-windows with greek ISO-standard characters without many
>>problems) but a coordinated effort for internationalisation is indeed
>>necessary. Note that the rest of the operating systems are currently
>>"externally touched" in order to support the greek language i.e.  by
>>hacking your way out.
>
>  Has anyone in this newsgroup ever heard of the Unicode/ISO10646
>(UCS) standard?  It exists today and has everything (almost), even
>though the Japanese don't like the sort order of the Kanji
>characters...  Look/ask in comp.internat.std for more info.  See also
>RFC 1345.

I mentioned Unicode as the proposed 386BSD target standard, with ISO
character set attribution on specific files *within* the file system
as a means of avoiding eating huge chunks of storage in languages
with existing 8-bit representations (i.e., the to/from translation would
be done in a file system layer, perhaps the VFS syscall layer, common
to all file systems).
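
To make the to/from translation concrete, here is a minimal sketch of
what such a layer might do for a file attributed as ISO 8859-1.  The
names (unichar as a 16-bit type, latin1_to_unicode, unicode_to_latin1)
are hypothetical, not existing 386BSD interfaces.  ISO 8859-1 is the
easy case because it occupies the first 256 Unicode code points; other
8-bit sets would need a 256-entry lookup table instead of a plain copy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 16-bit character type; the post calls it "unichar". */
typedef uint16_t unichar;

/*
 * Expand n bytes of ISO 8859-1 text into 16-bit Unicode.
 * ISO 8859-1 occupies the first 256 Unicode code points, so the
 * widening is a straight copy.
 */
void latin1_to_unicode(const unsigned char *src, unichar *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = (unichar)src[i];
}

/*
 * Narrow 16-bit Unicode back to ISO 8859-1 for storage.
 * Returns 0 on success, or -1 if a character does not fit in the
 * file's attributed 8-bit set (dst contents are then unspecified).
 */
int unicode_to_latin1(const unichar *src, unsigned char *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++) {
        if (src[i] > 0xFF)
            return -1;
        dst[i] = (unsigned char)src[i];
    }
    return 0;
}
```

The failure return matters: a write containing a character outside the
file's attributed set has to be rejected or the file re-attributed,
and which of the two is a policy decision this sketch does not make.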

I would be more likely to endorse Unicode than the 10646 draft standard
(which includes Unicode) simply because ISO-10646 *is* draft.

Unicode (from 5 of the 7 responses garnered so far) is pretty much
uniformly hated in Japan; the Japanese seem to prefer the JIS encoding
(a la kterm and jterm).  While this *is* embodied in an existing
standard (XPG4), it has the drawback of preventing a unified character
glyph space, such as that provided by Unicode.

I suspect this preference stems from the existing equipment, state
tables, and IBM VGA support for JIS more than any real prejudice
against the standard for technical reasons.

The unvarnished facts are:

1)	Microsoft NT is Unicode based.
2)	Unicode provides a ROMable X font (we'd have to build one;
	it's actually the fact of the non-overlapping glyph space
	that provides an advantage over JIS).
3)	Unicode provides a means of simultaneous storage of multilingual
	documents on the same system.
4)	Use of Unicode within the file system's directory service name
	space provides a means of internationalizing 386BSD itself.
5)	A "Unicode outline font" project is currently under way in
	China.
6)	Unicode allows for "localization ready" as opposed to simply
	"internationalizable" UNIX tools and utilities.
7)	Fixed field lengths are observed in utilities/programs regardless
	of the localized language (i.e., 80 English characters=80 Greek
	characters=80 Cyrillic characters=80 Kanji characters).  A runic
	implementation would cause field lengths to vary, perhaps radically.
8)	Support for nearly all written human languages, with a proposed
	expansion for a larger set.
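
Point 7 can be checked with a little arithmetic.  The sketch below
assumes that a "runic" implementation means a variable-width multibyte
encoding, and uses UTF-8 as the example of one; that choice, the
unichar type, and the function names are my assumptions for
illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t unichar;   /* hypothetical fixed-width 16-bit character */

/* Bytes needed for an n-character field in a fixed 16-bit encoding. */
size_t fixed_field_bytes(size_t nchars)
{
    return nchars * sizeof(unichar);
}

/* Bytes one 16-bit code point occupies when encoded as UTF-8. */
size_t utf8_char_bytes(unichar c)
{
    if (c < 0x80)
        return 1;
    if (c < 0x800)
        return 2;
    return 3;
}

/* Bytes needed for the same field in UTF-8: depends on the content. */
size_t utf8_field_bytes(const unichar *s, size_t nchars)
{
    assert(s != NULL);
    size_t total = 0;
    for (size_t i = 0; i < nchars; i++)
        total += utf8_char_bytes(s[i]);
    return total;
}
```

An 80-character field is always 160 bytes in the fixed 16-bit form.
In the variable-width form the same field takes 80 bytes for English,
160 for Greek, and 240 for Kanji, so a fixed byte-length field holds a
different number of characters depending on the language.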

The drawbacks are:

1)	Non-compliance with XPG4.
2)	Probable non-compliance with ISO-10646 (due to it being incomplete).
3)	Japanese engineers don't like it (probable reason: current JIS
	investment in man hours/money).
4)	"Connection rules" for languages (like Tamil and Arabic) do not
	translate readily into X display technology.
5)	A rewrite is necessary for most of the JIS input tables and
	semantics to give an identical key sequence/Kanji presentation
	for Japanese.

The arguments are:

1)	Non-compliance with XPG4 is not a problem, since it is impossible
	to comply with both XPG4 and ISO-10646.
2)	By utilizing the ISO-10646 draft, conflicts with the completed
	standard can be minimized.
3)	This is sticky.  If the reason Japanese engineers dislike Unicode
	is simply embedded technology (JIS/XPG4-JIS), then we don't have
	a problem... the technology used should not be apparent to the
	user in any case.  If the JIS technology is preferred over the
	Unicode technology because of engineering simplification for
	romaji/kana conversion to kanji, then the problem is a little
	more difficult, but is surmountable with ~16K of conversion
	vector tables (small overhead compared to the memory taken by a
	single font).  If the JIS ordering is preferred because it aids
	in stroke-count analysis for symbol recognition, *then* we have
	a problem.
4)	Connection rules for Tamil, for instance, cannot be resolved
	adequately using any of the existing character technologies for
	X; thus it is not at issue.
5)	A rewrite will be necessary for these tables regardless, even were
	we to choose XPG4-JIS encoding, if only because the encoding is
	going to vary when the character tables are offset to form a
	Unicode-like non-intersecting glyph set (necessary for "localization
	ready" as opposed to "internationalizable" OS and tools).
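
The "~16K of conversion vector tables" in argument 3 can be sketched
as a flat array indexed by JIS row and cell.  This is only an
illustration under stated assumptions: the names are hypothetical, the
table covers the full 94x94 JIS X 0208 grid, and only row 4 (hiragana)
is filled in here.  A real table would be generated from the complete
published mapping.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint16_t unichar;          /* hypothetical 16-bit character type */

#define JIS_ROWS  94
#define JIS_CELLS 94

/* One 16-bit entry per JIS X 0208 position; 0 means "not mapped". */
static unichar jis_to_uni[JIS_ROWS * JIS_CELLS];

/*
 * Fill in row 4 (hiragana) only, as an illustration.  JIS 0x2421..0x2473
 * correspond to U+3041..U+3093 in order.
 */
void jis_table_init(void)
{
    assert(sizeof(unichar) == 2);
    for (int cell = 1; cell <= 83; cell++)
        jis_to_uni[(4 - 1) * JIS_CELLS + (cell - 1)] =
            (unichar)(0x3041 + cell - 1);
}

/*
 * Convert a two-byte JIS X 0208 code (e.g. 0x2422) to Unicode.
 * Returns 0 if the code is out of range or not in the table.
 */
unichar jis_lookup(unsigned jis)
{
    unsigned row  = (jis >> 8)   - 0x20;   /* 1..94 when valid */
    unsigned cell = (jis & 0xFF) - 0x20;   /* 1..94 when valid */

    if (row < 1 || row > JIS_ROWS || cell < 1 || cell > JIS_CELLS)
        return 0;
    return jis_to_uni[(row - 1) * JIS_CELLS + (cell - 1)];
}

/* Size of the table in bytes. */
size_t jis_table_bytes(void)
{
    return sizeof jis_to_uni;
}
```

The full grid comes to 94 * 94 * 2 = 17,672 bytes, which is in line
with the rough estimate above; a table that skips the unassigned
positions would be smaller at the cost of a more complicated index.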

Definitions:

	localization ready:	Missing per-locale translation of text
				strings.  All work has been done to
				display drivers & environment to support
				drop in message databases in the local
				language.

	internationalizable:	Missing per-locale translation of text
				strings.  Missing OS/FS support for
				local language representation.  May
				run "localized" apps like jterm/kterm.


A significant advantage of a "localization ready" OS is the ability to
supply a "default" environment through a static default which is
overridden by examination of the "LOCALE" or other language specification
mechanism in the user's environment.  Thus all applications written on
the system are already "enabled" by virtue of their use of the C library;
this assumes use of "unichar" types, etc., within the applications.
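
A minimal sketch of the "static default modified by the environment"
idea follows.  The function names are hypothetical, and checking
"LOCALE" before "LANG" is my assumption about the lookup order, not a
description of any existing library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Static default used when the environment says nothing. */
static const char *const default_locale = "C";

/*
 * Pick a locale name from two candidate settings, falling back to
 * the static default.  NULL or empty values are treated as unset.
 */
const char *choose_locale(const char *locale_var, const char *lang_var)
{
    if (locale_var != NULL && locale_var[0] != '\0')
        return locale_var;
    if (lang_var != NULL && lang_var[0] != '\0')
        return lang_var;
    return default_locale;
}

/* Consult the user's environment: "LOCALE" first, then "LANG". */
const char *current_locale(void)
{
    return choose_locale(getenv("LOCALE"), getenv("LANG"));
}
```

If the C library makes this choice once at startup, an application
gets the user's language without any code of its own.  Standard C
already has a hook of this kind in setlocale(LC_ALL, ""), which asks
the library to select the environment's native locale.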


					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
-- 
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------