Re: Japanese and Unicode


> There are a lot of languages which are not yet
> representable.  Among them are several Asian languages which might
> again occupy a lot of room (I don't know this for sure).

Almost half of the "General Scripts" area is still unassigned,
so there is quite a bit of room for (non-ideographic) scripts
(at least 3000 characters).

> Well, applications certainly could handle this.  But you must see that
> if surrogates are possible in the "wide strings" these are not anymore
> wide strings and all the string handling functions must be changed to
> handle surrogates.

Not necessarily.  String handling functions can treat surrogates as
ordinary uninterpreted characters, just as they already treat
combining accents as uninterpreted characters.
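
To make that concrete, here is a minimal sketch in Scheme, assuming
wide strings of 16-bit code units; the predicate names are invented
for illustration.  The surrogate ranges are trivial to classify, yet
the string primitives never need to consult them:

    ;; Minimal sketch, assuming 16-bit code units; these predicate
    ;; names are invented for illustration.
    (define (high-surrogate? cu) (<= #xD800 cu #xDBFF))
    (define (low-surrogate? cu)  (<= #xDC00 cu #xDFFF))
    ;; string-length, string-ref, substring, etc. keep operating on
    ;; 16-bit units; to them a surrogate is just another character.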

That is, I'm proposing that string handling routines should treat
neither surrogates nor accents specially, but handle both as plain
characters.  This includes the standard string-handling functions
of Scheme, Lisp, Java, and C/C++.  They should leave the job to
higher-level software, which may need to handle ligatures, accents,
language, hyphenation, surrogates, font substitution, etc.
(Also, software that translates from Unicode to some other encoding
cannot in general look at individual characters in isolation; it too
may have to consider multiple Unicode characters as a unit.)
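
As an illustration of what that higher layer might look like, here
is a hypothetical sketch in Scheme.  The name for-each-code-point is
invented, and char->integer is assumed to yield the 16-bit unit of a
wide-string character; well-formed surrogate pairs are combined into
full code points, and everything else passes through untouched:

    ;; Hypothetical sketch: call PROC once per Unicode code point in
    ;; STR, a string of 16-bit code units.
    (define (for-each-code-point proc str)
      (define (unit k) (char->integer (string-ref str k)))
      (let loop ((i 0))
        (if (< i (string-length str))
            (let ((hi (unit i)))
              (if (and (<= #xD800 hi #xDBFF)              ; high surrogate
                       (< (+ i 1) (string-length str))
                       (<= #xDC00 (unit (+ i 1)) #xDFFF)) ; followed by low
                  (begin
                    (proc (+ #x10000                      ; combine the pair
                             (* (- hi #xD800) #x400)
                             (- (unit (+ i 1)) #xDC00)))
                    (loop (+ i 2)))
                  (begin
                    (proc hi)                             ; plain character
                    (loop (+ i 1))))))))

Note that only this layer knows that two units can form one code
point; string-length beneath it still counts 16-bit units, which is
exactly the division of labor proposed above.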

> (Plus handling of 16bit values is on many platforms slower than
> reading 32bit values.)

But usually not.  Consider that memory tends to be the bottleneck in
modern processors, and most string handling is sequential.  Anything
that doubles your memory footprint will hurt your data cache and
paging system: a buffer of a million characters occupies 2MB as
16-bit units but 4MB as 32-bit units, so only half as much of it
fits in any given cache.

	--Per Bothner
Cygnus Solutions     bothner@cygnus.com     http://www.cygnus.com/~bothner