This is the mail archive of the guile@cygnus.com mailing list for the guile project.
Jim Blandy <jimb@red-bean.com> writes:

> [...] I agree that multi-byte encoding as well as the MULE encoding
> are the wrong way to go.
>
> Thus, my current inclinations:
> - Use 16-bit characters in strings throughout.
> - Prescribe the use of Unicode throughout.
> - Provide functions to convert between Unicode character strings and
>   all other widely-used formats: UTF-8, UTF-7, Latin-1, and the JIS
>   variants, as well as anything else people would like to contribute.
> - Provide a separate "byte array" type, for applications which
>   genuinely want this.

A few comments:

- The Unicode consortium wants everybody to think that UCS2 is the
  right way.  But it is a pain in the same way a multibyte encoding
  is.  It was obvious right from the beginning that 16 bits are not
  enough.  It is the ASCII situation all over again: Americans thought
  7 bits were enough, and now the users of alphabetic scripts want to
  tell us that 16 bits are enough.  As a last resort, Unicode 2.0 now
  contains an extension method (surrogate pairs) which effectively
  turns UCS2 into a multibyte encoding itself.  The answer can only be
  UCS4.  It is no surprise that all reasonable i18n developers (this
  excludes those at IBM) use a 32-bit type for wchar_t.

- This may sound like a big waste of space, but used correctly it
  isn't.  Strings are normally not meant to contain whole text books;
  they are rather short, so there is not much redundancy.  If you need
  to store large texts you can still fall back on a multibyte
  encoding, perhaps offering several of them so that the most
  efficient one can be chosen.

- This is closely related to your conversion functions.  François
  Pinard (and in part myself) is currently extending GNU recode to
  work as a library.  The result will be the new recode program, and I
  will also use it in GNU libc to implement iconv() and the wide
  character I/O streams.  Since you have the same problem, you are the
  next client.  The functions in recode include some to convert from
  UCS4 to, say, UTF-7 or perhaps KOI-8.
The former is efficient if mainly characters from the Latin alphabets
are used (they are encoded first).  Special encodings like KOI-8 can
be used if the text is known to contain only characters which can
naturally be represented in that charset.  By offering the user an
interface to the recode library for converting UCS4 strings to
multibyte strings in one of the provided encodings, you don't have to
fear the memory consumption of UCS4.  The recode library will also
work on systems which do not support the wide character I/O streams
from ISO C Amendment 1.  The port implementation will have to be able
to print UCS4 strings in the currently wanted external representation.

-- Uli

---------------.      drepper at gnu.org   ,-.   Rubensstrasse 5
Ulrich Drepper   \    ,-------------------'   \  76149 Karlsruhe/Germany
Cygnus Solutions  `--' drepper at cygnus.com   `------------------------