This is the mail archive of the guile@cygnus.com mailing list for the guile project.
NIIBE Yutaka <gniibe@etl.go.jp> writes:

> Some Japanese think that distinction of Chinese
> character in Chinese script and the one in Japanese script is
> important, as well as Alpha in Greek and "A".

A reasonable point. But note that Greek Alpha is used quite a bit in other languages (for math, physics, etc.). And it is a question of resources (CJK takes up much more code space than all the European scripts together), and of how similar the characters are visually. For example, a Greek capital letter Rho ('R') looks like a Latin P, while a Pi ('P') looks completely different. If people used the same encoding for Greek as for Latin, and then got the wrong font, the results could be quite misleading.

> However, (currently) I'm not sure that Unicode is the solution for
> multilingual text handling.

It is not. That is a much more complex problem. However, using Unicode does (I think) make that problem a bit easier than alternatives (such as Mule). (Mule does not support multilingual text. What it supports is multi-encoding text, which is much less useful.) What we are discussing is how to represent characters and strings in Guile (and possibly, by extension, in Emacs).

> I think that we should think again about the properties of ASCII
> character and its text handling, and we should retain the good
> properties of it in enhanced coded character set(s) as possible.

The good properties of ASCII include a simple universal encoding, with an obvious mapping between characters and integers. Another good property is that it is fixed-width. These properties are very desirable for Scheme (and also Emacs). Unicode has these properties. There is no reasonable alternative. The only real issues to discuss are:

a) how to support future extensions to ISO 10646 that go beyond 16 bits; and

b) whether to use a fixed-width or variable-width encoding, or a mix.
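The "obvious mapping between characters and integers" mentioned above can be sketched briefly. This is Python rather than Guile Scheme, purely for illustration; the point is that Unicode extends ASCII's character/integer mapping uniformly, and that visually similar characters in different scripts keep distinct code points.

```python
# ASCII: 'A' is simply the integer 65, and the mapping is invertible.
assert ord('A') == 65
assert chr(65) == 'A'

# Unicode extends the same mapping: Greek capital Alpha is code point
# U+0391, a different character from Latin 'A' (U+0041) even though
# many fonts draw the two identically.
assert ord('\u0391') == 0x0391   # GREEK CAPITAL LETTER ALPHA
assert '\u0391' != 'A'

# Likewise Greek capital Rho (U+03A1) looks like Latin 'P' (U+0050)
# but is a distinct code point, so a wrong font changes only the
# appearance of the text, never its identity.
assert ord('\u03A1') != ord('P')
```

With a shared Greek/Latin encoding, by contrast, the last distinction would live only in the font, which is exactly the misleading situation described above.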
I think the need for characters beyond 16 bits will be so rare that applications that need them can use the Unicode "surrogate mechanism". The problem with that is that a single character is encoded using two 16-bit Unicode characters. I think that is acceptable: high-quality text processing has to deal with the fact that a single logical (or display) character may be composed out of multiple Unicode characters anyway (because of accents, ligatures, combining characters, etc.).

In that case, it makes most sense to me to use 16-bit fixed-width characters for strings. Emacs buffers should use UTF-8 (variable-width), at least as long as Emacs keeps the existing buffer implementation.

--Per Bothner
Cygnus Solutions bothner@cygnus.com http://www.cygnus.com/~bothner
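[The surrogate mechanism and the fixed-width/variable-width contrast above can be sketched as follows. Python is used for illustration only; the example code point U+1D11E (MUSICAL SYMBOL G CLEF) is an arbitrary choice of a character beyond 16 bits, not one discussed in the message.]

```python
def to_surrogate_pair(cp):
    """Split a code point above U+FFFF into two 16-bit code units."""
    assert cp > 0xFFFF
    offset = cp - 0x10000            # 20 bits of payload
    high = 0xD800 + (offset >> 10)   # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits -> low surrogate
    return high, low

# One logical character becomes two 16-bit units in a fixed-width
# 16-bit string representation.
assert to_surrogate_pair(0x1D11E) == (0xD834, 0xDD1E)

# UTF-8, by contrast, is variable-width at the byte level: ASCII stays
# one byte, while other characters take two to four bytes.
assert len('A'.encode('utf-8')) == 1           # ASCII: 1 byte
assert len('\u0391'.encode('utf-8')) == 2      # Greek Alpha: 2 bytes
assert len('\U0001D11E'.encode('utf-8')) == 4  # beyond 16 bits: 4 bytes
```

The trade-off is the one stated above: 16-bit fixed-width strings keep indexing simple at the cost of surrogate pairs for rare characters, while UTF-8 keeps ASCII compact at the cost of variable-width access.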