Re: Japanese and Unicode


NIIBE Yutaka <gniibe@etl.go.jp> writes:
> Some Japanese think that the distinction between a Chinese
> character in Chinese script and the same character in Japanese
> script is important, as is the distinction between Greek Alpha and "A".

A reasonable point.  But note that Greek Alpha is used quite a bit
in other languages (for math, physics, etc.).  And it is a question
of resources (CJK takes up much more code space than all the
European scripts together), and of how visually similar the characters
are.  For example, the Greek capital letter Rho (the Greek 'R') looks
like a Latin P, while the capital Pi (the Greek 'P') looks completely
different.  If people used the same encoding for Greek as for Latin,
and then got the wrong font, the results could be quite misleading.
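
To make that concrete, here is a small illustration in Scheme (the
code points are from the Unicode charts; the variable names are just
made up for the example):

    ;; Unicode gives look-alike letters distinct code points, so the
    ;; distinction survives even when the font is wrong:
    (define latin-P   #x0050)  ; LATIN CAPITAL LETTER P
    (define greek-Rho #x03A1)  ; GREEK CAPITAL LETTER RHO (drawn like "P")
    (define greek-Pi  #x03A0)  ; GREEK CAPITAL LETTER PI (drawn quite differently)
    (eqv? latin-P greek-Rho)   ; => #f, distinct characters, similar glyphs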

> However, (currently) I'm not sure that Unicode is the solution for
> multilingual text handling.

It is not.  That is a much more complex problem.  However, using
Unicode does (I think) make that problem a bit easier than
alternatives (such as Mule).  (Mule does not support multi-lingual
text.  What it supports is multi-encoding text, which is much
less useful.)

What we are discussing is how to represent characters and
strings in Guile (and possibly by extension, in Emacs).

> I think that we should think again about the properties of the ASCII
> character set and its text handling, and we should retain its good
> properties in enhanced coded character set(s) as far as possible.

The good properties of ASCII include a simple universal encoding,
with an obvious mapping between characters and integers.
Another good property is that it is fixed-width.  These properties
are very desirable for Scheme (and also Emacs).  Unicode
has these properties.
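
In Scheme, in fact, the mapping is directly available (these are
standard R4RS procedures, nothing Guile-specific):

    (char->integer #\A)     ; => 65, the ASCII code for A
    (integer->char 97)      ; => #\a
    ;; Fixed width is what makes O(1) string indexing possible:
    (string-ref "Guile" 2)  ; => #\i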

There is no reasonable alternative.

The only real issues to discuss are:
a) how to support future extensions to ISO 10646 that go beyond 16 bits, and
b) whether to use a fixed-width or variable-width encoding, or a mix.

I think the need for characters beyond 16 bits will be so rare
that applications that need them can use the Unicode "surrogate
mechanism".  The problem with that is that a single character
is encoded using two 16-bit Unicode characters.  I think that
is acceptable: high-quality text processing has to deal with
the fact that a single logical (or display) character may be composed
of multiple Unicode characters anyway (because of accents,
ligatures, combining characters, etc.).
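
For what it's worth, the surrogate computation itself is trivial.
A sketch in Scheme, using Guile's logand/ash bit operations (the
code point #x12345 is just a made-up example):

    ;; Split a code point above #xFFFF into a UTF-16 surrogate pair.
    (define (code-point->surrogates c)
      (let ((c* (- c #x10000)))               ; 20 bits left after the offset
        (list (+ #xD800 (ash c* -10))         ; high surrogate: top 10 bits
              (+ #xDC00 (logand c* #x3FF))))) ; low surrogate: bottom 10 bits

    (code-point->surrogates #x12345)  ; => (55304 57157), i.e. (#xD808 #xDF45)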

In that case, it makes most sense to me to use 16-bit fixed-width
characters for strings.  Emacs buffers should use UTF-8
(variable-width), at least as long as Emacs keeps the existing
buffer implementation.
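
To show what the variable-width case involves, here is a sketch of a
UTF-8 encoder for 16-bit characters, again using Guile's bit
operations (characters beyond 16 bits would need a fourth byte, which
I leave out since we are discussing 16-bit strings):

    ;; Encode one 16-bit character as a list of UTF-8 bytes (1 to 3 of them).
    (define (utf8-encode c)
      (cond ((< c #x80)     ; up to 7 bits: 1 byte, plain ASCII
             (list c))
            ((< c #x800)    ; up to 11 bits: 2 bytes
             (list (logior #xC0 (ash c -6))
                   (logior #x80 (logand c #x3F))))
            (else           ; up to 16 bits: 3 bytes
             (list (logior #xE0 (ash c -12))
                   (logior #x80 (logand (ash c -6) #x3F))
                   (logior #x80 (logand c #x3F))))))

    (utf8-encode #x3B1)  ; Greek small alpha => (206 177), i.e. (#xCE #xB1)

Note the nice property that plain ASCII text is unchanged under
UTF-8, which is part of why it suits the existing Emacs buffer code.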

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner