This is the mail archive of the guile@sourceware.cygnus.com mailing list for the Guile project.


Re: Multibyte encoding, scm_mb_supported_charset_p

To: Jim Blandy <jimb AT red-bean dot com>
Subject: Re: Multibyte encoding, scm_mb_supported_charset_p
From: Per Bothner <per AT bothner dot com>
Date: 13 Sep 1999 14:43:29 -0700
Cc: guile AT sourceware.cygnus dot com
References: <199909131255.OAA01480@forcix.roof.lan> <m37lluu2n7.fsf@savonarola.red-bean.com>

An idea to consider for when you switch to UTF8:  Use the modified
UTF8 that Java .class files use, where '\u0000' is represented
by the two bytes { 0xC0, 0x80 }.

Quoting http://java.sun.com/docs/books/vmspec/html/ClassFile.doc.html#7963:

        The null character ('\u0000') and characters in the range '\u0080'
        to '\u07FF' are represented by a pair of bytes x and y: 

        x:  1 1 0 bits 6-10   y: 1 0 bits 0-5 

Since no character then encodes to a 0x00 byte, you can still
unambiguously find the end of a string by searching for 0x00, which
keeps compatibility with C, while still allowing internal NULs.
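
To make that concrete, here is a minimal sketch of such an encoder for
16-bit characters, following the byte layout quoted above.  The names
mutf8_encode_unit and mutf8_encode are made up for illustration; they
are not part of Guile or of any Java library.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one 16-bit character into buf, which
   must have room for 3 bytes.  Returns the number of bytes written. */
static size_t mutf8_encode_unit(uint16_t c, unsigned char *buf)
{
    assert(buf != NULL);
    if (c != 0 && c <= 0x7F) {          /* plain ASCII, except NUL */
        buf[0] = (unsigned char)c;
        return 1;
    }
    if (c <= 0x7FF) {                   /* NUL and 0x0080..0x07FF */
        buf[0] = (unsigned char)(0xC0 | (c >> 6));
        buf[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    buf[0] = (unsigned char)(0xE0 | (c >> 12));   /* 0x0800..0xFFFF */
    buf[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    buf[2] = (unsigned char)(0x80 | (c & 0x3F));
    return 3;
}

/* Hypothetical helper: encode n characters and append a terminating
   0x00.  out must have room for 3 * n + 1 bytes.  Returns the encoded
   length, not counting the terminator. */
static size_t mutf8_encode(const uint16_t *src, size_t n, unsigned char *out)
{
    size_t len = 0;
    assert(src != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        len += mutf8_encode_unit(src[i], out + len);
    out[len] = 0x00;    /* the only 0x00 byte in the whole buffer */
    return len;
}
```

An internal NUL comes out as 0xC0 0x80, so a C-style scan for 0x00
stops only at the real terminator.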

> Why Guile Does Not Use a Fixed-Width Encoding
> =============================================

While I agree using a UTF8 multibyte encoding is a reasonable
choice, I think the following point is a red herring:

>    However, there are no fixed-width encodings which include the
> characters we wish to include, and also fit in a reasonable amount of
> space.  Despite the Unicode standard's claims to the contrary, Unicode
> is not really a fixed-width encoding.  Unicode uses surrogate pairs to
> represent characters outside the 16-bit range; a surrogate pair must be
> treated as a single character, but occupies two 16-bit spaces.

My take on it is that once you get to really sophisticated/obscure
applications where you might actually use the surrogate pairs,
then there isn't really much you can actually *do* with characters
treated in isolation.  Unicode already defines a ton of combining
characters, accents, etc.  Trying to look at each "character" of
such a combination in isolation is seldom useful.  The only difference
in practice between a surrogate pair and a compound character is that
you can point to some graphical representation of the individual components
of the compound character;  however, the actual image you see when
the characters are combined may be very different from the picture
of the components in the Unicode book!

So surrogates are a red herring:  For intelligent processing, you
can't look at the 16-bit "characters" in isolation anyway.  For
simple operations (copying, simple searching) surrogates cause
no more problems than other characters.

This all boils down to the same argument as that for using multibyte
representations:  Using Unicode with surrogates is basically a
multi-short encoding.  And that boils down to:  You almost always
work with strings *sequentially* - there are few or no good applications
for actually randomly indexing into a string.  The problem is that
random indexing is the only interface we have in Scheme.  What we
instead need is a "character iterator" or "character mapper" interface.
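
As a rough sketch of what I mean by a character iterator, here is one
in C that walks a string of 16-bit units sequentially and hands back
whole characters, joining surrogate pairs on the way.  The char_iter
type and its functions are hypothetical names for illustration only,
not an existing Guile interface, and returning an unpaired surrogate
unchanged is just one possible policy.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical iterator over a string of 16-bit units. */
typedef struct {
    const uint16_t *units;  /* the underlying 16-bit string */
    size_t len;             /* number of 16-bit units */
    size_t pos;             /* index of the next unit to read */
} char_iter;

static void char_iter_init(char_iter *it, const uint16_t *units, size_t len)
{
    assert(it != NULL && units != NULL);
    it->units = units;
    it->len = len;
    it->pos = 0;
}

/* Store the next character in *cp and return 1, or return 0 at the
   end of the string.  A high surrogate followed by a low surrogate is
   combined into one character; an unpaired surrogate is passed
   through unchanged (an assumption of this sketch). */
static int char_iter_next(char_iter *it, uint32_t *cp)
{
    assert(it != NULL && cp != NULL);
    if (it->pos >= it->len)
        return 0;
    uint16_t hi = it->units[it->pos++];
    if (hi >= 0xD800 && hi <= 0xDBFF && it->pos < it->len) {
        uint16_t lo = it->units[it->pos];
        if (lo >= 0xDC00 && lo <= 0xDFFF) {
            it->pos++;
            *cp = 0x10000 + (((uint32_t)(hi - 0xD800) << 10)
                             | (uint32_t)(lo - 0xDC00));
            return 1;
        }
    }
    *cp = hi;
    return 1;
}
```

The caller never sees an index, so the same interface would work
unchanged over a multibyte representation.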
-- 
	--Per Bothner
bothner@pacbell.net  per@bothner.com   http://www.bothner.com/~per/

Follow-Ups:
- Re: Multibyte encoding, scm_mb_supported_charset_p
  - From: Jim Blandy

References:
- Multibyte encoding, scm_mb_supported_charset_p
  - From: forcer
- Re: Multibyte encoding, scm_mb_supported_charset_p
  - From: Jim Blandy
