This is the mail archive of the guile@cygnus.com mailing list for the guile project.
From: Per Bothner <bothner@cygnus.com>
Subject: Re: Japanese and Unicode
Date: Tue, 21 Oct 1997 12:16:39 -0700

> > However, (currently) I'm not sure that Unicode is the solution for
> > multilingual text handling.
>
> It is not. That is a much more complex problem. However, using
> Unicode does (I think) make that problem a bit easier than
> alternatives (such as Mule).

Unicode was never meant to provide everything you need for multilingual
text handling. Instead you have to tag the text somehow, to provide the
kind of information which cannot reasonably be encoded in Unicode (or in
character sets in general). Take SGML (with DocBook) as an example.
There you can write

  just tell him, <QUOTE><FOREIGNPHRASE LANG="fa_af">Besyar nawaqt
  nawaqt nist</FOREIGNPHRASE>.</QUOTE>

(This is cut&pasted from the DocBook manual.) The tags provide the
missing information, and Unicode was designed with this in mind right
from the beginning. And now it becomes clear how a rendering engine can
get enough information to choose the correct glyph for a Unicode code
point.

> I think the need for characters beyond 16 bits will be so rare
> that applications that need them can use the Unicode "surrogate
> mechanism".

I consider this an error based on the very same thinking which led to
ASCII. You must not regard the Unicode 2.0 standard as the end of the
line. There are a lot of languages which are not yet representable.
Among them are several Asian languages which might again occupy a lot
of room (I don't know this for sure). For ASCII, people in the US
decided what was necessary, with the well-known problems. Now the First
World has decided that all of Unicode is enough (which includes their
languages and a few others they are interested in). Countries without
much influence will once again be left standing at the door.

> The problem with that is that a single character is encoded using
> two 16-bit Unicode characters.
> I think that is acceptable -
> high-quality text processing has to deal with the fact that a single
> logical (or display) character may be composed out of multiple
> Unicode characters anyway (because of accents, ligatures, combining
> characters, etc).

Well, applications certainly could handle this. But you must see that
if surrogates are possible in the "wide strings", these are no longer
really wide strings, and all the string handling functions must be
changed to handle surrogates. This is certainly slower than handling
UCS4 right from the beginning and never getting into this trouble.
(Plus, handling of 16-bit values is on many platforms slower than
reading 32-bit values.)

> Emacs buffers should use UTF-8 (variable-width), at least as long as
> Emacs keeps the existing buffer implementation.

This is fine. Guile should certainly provide functionality for
multibyte strings. There is also the possibility to use Reuter's
compression method, but I don't know enough about this currently.

-- Uli
---------------.      drepper at gnu.org   ,-.  Rubensstrasse 5
Ulrich Drepper  \    ,--------------------'  \  76149 Karlsruhe/Germany
Cygnus Solutions `--' drepper at cygnus.com   `------------------------