This is the mail archive of the guile@cygnus.com mailing list for the guile project.



Re: Japanese and Unicode


Per Bothner writes:
 > However, using
 > Unicode does (I think) make that problem a bit easier than
 > alternatives (such as Mule).  (Mule does not support multi-lingual
 > text.  What it supports is multi-encoding text, which is much
 > less useful.)

Umm... From this discussion, I understand (somewhat) that a multi-byte
representation of characters is problematic, and that the resources
needed may be much greater if we adopt UCS-4.

But I believe that text with multiple character sets (or multiple
scripts) is the starting point for multi-lingual text, and if it is
possible at little additional cost, it is better to support it in
plain text.

Text with multiple character sets has good properties.  Assuming that
a character set specifies a script, no script information is lost when
translating text in a single character set into text in multiple
character sets (whereas translation from JIS X 0208 to Unicode does
lose this), and multiple scripts co-exist naturally in plain text.

I think this is a design issue in text handling.

      +-------------------------+
      | Structured Text         |  ^
      +-------------------------+  | Higher layer
      | Plain Text              |  |              | Lower layer
      +-------------------------+                 |
      | Coded Character set(s)  |                 v
      +-------------------------+

I understand your point that we shouldn't put too many things in the
lower layer.  I also agree that language information belongs in a
higher layer (for example, in an SGML tag).

But I think the distinction between scripts should be made in the
coded character set.  More specifically, I think the CJK unified
ideographs are quite useless: they do not correspond to any existing
script, only an artificial one.  If people use the same encoding for
the Japanese and Chinese scripts (as Unicode does) and get the wrong
font, the results can be quite misleading.

 > The good properties of ASCII include a simple universal encoding,
 > with an obvious mapping between characters and integers.
 > Another good property is that it is fixed-width.  These properties
 > are very desirable for Scheme (and also Emacs).  Unicode
 > has these properties.

I agree about those properties of ASCII, but I'm not sure that Unicode
has them.
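A sketch of the doubt, in Python: once characters beyond the 16-bit range exist, a 16-bit Unicode encoding must fall back on variable-width surrogate pairs, so the simple fixed-width mapping between characters and integers no longer holds (the code point U+2000B is used here only as an example of a character outside the 16-bit range):

```python
# A BMP character fits in a single 16-bit code unit...
assert len("A".encode("utf-16-le")) == 2
assert len("\u76f4".encode("utf-16-le")) == 2    # 直, a CJK ideograph

# ...but a code point beyond U+FFFF needs a surrogate pair (two
# 16-bit units), so 16-bit Unicode text is not fixed-width after all.
assert ord("\U0002000b") == 0x2000B              # outside 16-bit range
assert len("\U0002000b".encode("utf-16-le")) == 4
```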

 > There is no reasonable alternative.

Umm... If that is not possible (for some reason, such as resource
concerns), I think it is better to define strings as "ASCII text only"
for now, and leave everything else as an application problem.

To me, a 16-bit character set is another arbitrary limit with no good
reason behind it.  When we didn't have enough resources, we had to
live with such (artificial) limits.  For example, on old typewriters
we used the lowercase letter "l" to represent the digit "1", but we
don't do that anymore.  If we have enough computing resources (I'm not
sure we do), we East Asians certainly need a bigger character set.

 > The only real issues to discuss are:
 > a) how to support future extensions to ISO 10646 that go beyond 16 bits, and
 > b) whether to use a fixed-width or variable-width encoding, or a mix.

Yes, I understand those points.  My question here is: why not 32 bits?
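A sketch of what 32 bits would buy, again in Python (the `char_at` helper is a hypothetical illustration, not an existing API): in a UCS-4/UTF-32 representation every character occupies exactly four bytes regardless of script, so finding the n-th character is simple arithmetic, at the cost of more memory.

```python
import struct

text = "a\u76f4\U0002000b"  # ASCII, a BMP CJK ideograph, a beyond-BMP one

# In UTF-32 every character is exactly 4 bytes, regardless of script.
raw = text.encode("utf-32-le")
assert len(raw) == 4 * len(text)

def char_at(buf, n):
    """Return the n-th character of a UTF-32-LE buffer in O(1)."""
    (cp,) = struct.unpack_from("<I", buf, 4 * n)
    return chr(cp)

assert char_at(raw, 1) == "\u76f4"
assert char_at(raw, 2) == "\U0002000b"
```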

Anyway, your explanation makes the problem clear.  Thanks for
enlightening me.
--