*BSD News Article 8019



Newsgroups: comp.unix.bsd
Path: sserve!manuel.anu.edu.au!munnari.oz.au!uunet!ferkel.ucsb.edu!taco!rock!stanford.edu!agate!spool.mu.edu!wupost!usc!sol.ctr.columbia.edu!eff!news.byu.edu!ux1!fcom.cc.utah.edu!cs.weber.edu!terry
From: terry@cs.weber.edu (A Wizard of Earth C)
Subject: [386bsd] INTERNATIONALIZATION (was can't deal with 8-bit input)
Message-ID: <1992Nov16.232035.6307@fcom.cc.utah.edu>
Summary: Going global
Sender: news@fcom.cc.utah.edu
Organization: Weber State University  (Ogden, UT)
References: <1992Nov16.081801.15019@kum.kaist.ac.kr>
Date: Mon, 16 Nov 92 23:20:35 GMT
Lines: 118

In article <1992Nov16.081801.15019@kum.kaist.ac.kr> jbkang@csking.kaist.ac.kr (Joongbin Kang) writes:
>  ...But another problem occured during using the 'hanterm', Korean version
>  of xterm. It can display Korean texts with MSB set (the same to most
>  oriental languages, such as kanji etc), but I couldn't input Korean text.
>  Hanterm itself provides Korean input automata, and it should work well
>  with X11R5. Another test shows that kernel seems to have trouble with
>  multibyte characters.
>    % cat
>    test
>    test (echoed to tty)
>    ^D
>    % cat
>    xxxx(entered korean characters -- it can be seen when typing)
>    (but no echo to tty!)
>    ^D (this DIDN'T work)
>    ^C
>    %
>  So, what's the problem? If I cannot use hangul in 386bsd, it loses
>  practicality...Help!

Most likely, the echo to your X term was broken when it sent you back
your characters... seriously!

The default cflags for a tty in 386BSD strip parity (the 8th bit) by
setting cs7 and setting even parity (-parodd, parenb), and setting the
iflag istrip.  The fact that you got the characters you typed back at
all is an indicator that istrip wasn't working on echoed characters.
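
A minimal sketch of undoing those defaults from a program, assuming a
POSIX termios interface is available.  The function names
make_8bit_clean() and set_tty_8bit() are made up for illustration; the
flags and the tcgetattr()/tcsetattr() calls are the standard ones.

```c
#include <assert.h>
#include <stddef.h>
#include <termios.h>

/* Clear the 7-bit/parity settings so that all 8 bits of each character
 * survive.  Only modifies the struct; the caller applies it. */
void make_8bit_clean(struct termios *t)
{
    assert(t != NULL);
    t->c_cflag &= ~(tcflag_t)(CSIZE | PARENB);  /* drop cs7 and parenb */
    t->c_cflag |= CS8;                          /* 8 data bits */
    t->c_iflag &= ~(tcflag_t)(ISTRIP | INPCK);  /* no bit-8 strip, no input parity check */
}

/* Apply the change to an open tty; returns 0 on success, -1 on error. */
int set_tty_8bit(int fd)
{
    struct termios t;

    if (tcgetattr(fd, &t) == -1)
        return -1;
    make_8bit_clean(&t);
    return tcsetattr(fd, TCSANOW, &t);
}
```

From the shell, "stty cs8 -parenb -istrip" should have roughly the same
effect on the current tty.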

An additional problem with ANSI terminal emulation, if not internationalized,
is the C1 control characters (0x80-0x9f, which include CSI), which are seen
as <ESC> + <char - 0x40> (ie 0x9b = 0x1b 0x5b ...or... <CSI> = <ESC>[).
Basically, you have to disable this functionality to get around the problem
with 0x80-0x9f range characters.  SCO "gets around" the problem by allowing
the output of the characters in this range with an escape sequence to pick
the character set in that range instead of the normal output (<ESC>[12m ?
It's been about 3 years since I wrote the SCO color console emulator for
TERM from Century Software).  Doing it this way gets you a PC character set
on your console, but it is hardly 8-bit clean.
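
To make the collision concrete, here is a small sketch of how an
emulator that honors 8-bit controls interprets bytes in that range.  In
ECMA-48 terms each C1 control is equivalent to ESC followed by the byte
minus 0x40, so CSI (0x9b) becomes ESC [ (0x1b 0x5b).  The function name
c1_to_7bit() is hypothetical and not taken from any particular emulator.

```c
#include <assert.h>
#include <stddef.h>

/* If c is a C1 control (0x80-0x9f), write its two-byte 7-bit
 * equivalent (ESC followed by c - 0x40) to out and return 2.
 * Otherwise copy c unchanged and return 1. */
size_t c1_to_7bit(unsigned char c, unsigned char out[2])
{
    assert(out != NULL);
    if (c >= 0x80 && c <= 0x9f) {
        out[0] = 0x1b;                       /* ESC */
        out[1] = (unsigned char)(c - 0x40);  /* 7-bit final byte */
        return 2;
    }
    out[0] = c;
    return 1;
}
```

Any 8-bit character set that puts printable characters at 0x80-0x9f
will trip over this mapping unless the emulator lets you switch it off.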

Hangul, like the Japanese Katakana and Hiragana, is representable in an
8 bit set, the lower 128 characters being ASCII.  Unless you are trying
for Unicode, you should not need multibyte -- even then, you only need 16 bits,
not the 32 bits Sun is currently using for their Internationalization.

Get the echoing working with cat by setting your terminal modes correctly,
and you should be able to type in 8 bit characters using the input
automata (or, even more clever, use a Korean keyboard with one of the magic
unused shifts (like alt or compose) to get the English characters for
programming -- with an X terminal, the means to do this are provided in
the xmodmap utility).

I don't understand "kernel seems to have trouble with multibyte characters".
If by this, you mean the file system (trying to use 16 bit characters in
a file name), you are correct: the file system doesn't understand this type
of representation, especially for characters in the 0-255 range, since they
will have leading NUL bytes and there is no provision in the kernel or
shells for byte-count prefixed multibyte strings with NULs in them.
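
A short sketch of why the leading zero bytes are fatal for anything
that treats a name as a C string.  It assumes a fixed-width 16-bit,
big-endian representation; the byte order and the names widen_be16()
and looks_empty_to_c() are choices made for the example only.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Encode an ASCII string as fixed-width 16-bit big-endian characters.
 * out must hold at least 2 * strlen(ascii) + 2 bytes.  Returns the
 * number of bytes written, not counting the 16-bit terminator. */
size_t widen_be16(const char *ascii, char *out)
{
    size_t n = 0;

    assert(ascii != NULL && out != NULL);
    for (; *ascii != '\0'; ascii++) {
        out[n++] = 0x00;    /* high byte: zero for anything in 0-255 */
        out[n++] = *ascii;  /* low byte */
    }
    out[n] = out[n + 1] = 0x00;
    return n;
}

/* What a NUL-terminated string routine sees in such a buffer. */
int looks_empty_to_c(const char *buf)
{
    return strlen(buf) == 0;
}
```

Widening "ls" this way gives four bytes, but the first one is zero, so
every routine that stops at the first NUL sees an empty name.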

If you mean that you could not use 8-bit characters in a file name, then
the fault lies in the input mechanism (your shell, if it isn't 8 bit clean,
or your tty modes if it is).

Other than file naming (directory entry manipulation services), a stream
of data is a stream of data, and the file system storing it doesn't care
if it's a stream of bytes to be treated as a double-byte character set,
or a stream of bytes to be treated as ASCII.

An intrinsic limiting factor in the use of Unicode or other multibyte
technology within all shell tools and libraries is the fact that by
doing so, you effectively halve your usable disk space (by doubling the
size of the data to be stored, even if it is vanilla ASCII).  I think
it will be a long time before we see large numbers of products coming
out of the US which have this limitation.

One way to internationalize without falling into this trap is to adopt the
ISO Latin-1 character set (usable by the majority of countries) as the
standard console character set.  The "codrv" program gives us a means of
providing an initial load of this onto existing video hardware.

An additional help would be the adoption of the BSD4.4 file system, in
particular, the Ficus layering (a la John Heidemann), which would allow
for the provision of a Unicode naming layer so that some file systems
*could* be multibyte in nature.  An additional layer for Unicode disk
access would probably be desirable, since this would allow canonical
representation of a text in its native language, allowing multilingual
use of the same file system without interference like one might get
with one user running Latin-1 and another running Cyrillic-1.  This would
allow people requiring more than 8 bits for their name space/text space
to halve the size of their disk (which they would have to do anyway)
without negatively impacting those of us who can live in 8 bits.

As to the other internationalization issue, which is internationalization
of text strings in error messages and utilities, I think that we need to
adopt the XPG3 standards for string identification, with Unicode
storage of the strings (PC code page representation is for the birds).
This will buy us usable error messages with little or no penalty, since
the locale database can be loaded on a per-site basis (ie: I don't need
to load English or Spanish if I'm German).
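
For reference, the XPG3 message catalog interface is catopen() and
catgets() from <nl_types.h>.  A minimal sketch of a lookup with a
built-in fallback follows; the helper name lookup_msg() is made up, and
the sketch does not address how the strings are stored inside the
catalog (Unicode or otherwise).

```c
#include <assert.h>
#include <nl_types.h>
#include <stddef.h>

/* Look up message msg_id in set set_id of an open catalog.  Falls back
 * to the built-in text when no catalog could be opened for this site. */
const char *lookup_msg(nl_catd cat, int set_id, int msg_id,
                       const char *fallback)
{
    assert(fallback != NULL);
    if (cat == (nl_catd)-1)     /* catopen() failed: nothing loaded */
        return fallback;
    return catgets(cat, set_id, msg_id, fallback);
}
```

A utility would call catopen() once at startup with its own catalog
name, then pass the result to lookup_msg() for each message; a site
that has not installed a catalog for its locale simply gets the
built-in strings.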

The limiting factor on this (in the PC market anyway) is running the
display hardware in "text" mode.  Given a reliable mechanism for the
identification of video hardware (maybe a non-protected mode install
or portion of the boot?), the built in limitation of the IBM PC
character set will become less of a consideration, at least for the
8-bit character sets which can be fully downloaded to VGA cards.


					Terry Lambert
					terry@icarus.weber.edu
					terry_lambert@novell.com
---
Any opinions in this posting are my own and not those of my present
or previous employers.
-- 
-------------------------------------------------------------------------------
                                        "I have an 8 user poetic license" - me
 Get the 386bsd FAQ from agate.berkeley.edu:/pub/386BSD/386bsd-0.1/unofficial
-------------------------------------------------------------------------------