This is the mail archive of the guile@sourceware.cygnus.com mailing list for the Guile project.



Re: Multibyte encoding, scm_mb_supported_charset_p



> Just a hint to anyone who's implementing the multibyte encoding:
> There should be functions
> 
>       SCM scm_mb_supported_charset_p(SCM charset);
>       int scm_mb_c_supported_charset_p(char *charset);
> 
> which return a boolean indicating whether the charset is supported by Guile.

Do you mean "charset" as in the arguments to scm_mb_iconv_open?
Sure --- that's a good idea.

Or do you mean "charset" as in the leading bytes described in
mbemacs.h?  Those charsets aren't a Scheme-level concept, and I don't
really want to introduce them outside of the mb modules, because the
concept goes away when we switch to UTF-8.

Handa-san needed some representation for text coming from a wide
variety of source character sets.  So he organized his universal
encoding around the existing character sets.  But it's still just an
implementation detail.

If you want functions like "char-greek?" or "char-chinese?", that's
cool.  Or if you want functions like "(char-encodable? CHAR CHARSET)",
that's cool too.  But the leading bytes are exactly the sort of detail
of the Emacs-Mule encoding that I *don't* want to spread around the
rest of Guile.

> Also, ``text ports'' (don't know about the correct terminology)

There is no correct terminology yet.  :)

> should store a "charset" to/from which they'll convert when
> writing/reading.

Yes, indeed.

> (The default should be ISO-8859-1 for backwards compatibility,
> probably to be replaced by UTF-8 rsn)

The operating system has various mysterious ways of indicating what
encoding it wants you to use --- environment variables with names like
LC_MUMBLEFROTZ and such.  I don't know the full list.  Tcl has a
bunch of logic for figuring it out, which I plan to imitate.
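
On POSIX systems, one way to ask is setlocale plus nl_langinfo.  This
is only a sketch of that route, not Tcl's logic and not what Guile
does; system_codeset and its FALLBACK argument are names I made up for
illustration:

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper: return the name of the character encoding
   selected by the environment (LC_ALL, LC_CTYPE, LANG), or FALLBACK
   if the locale cannot be set or reports no codeset.
   Note: this changes the process-wide LC_CTYPE locale.  */
const char *
system_codeset (const char *fallback)
{
  assert (fallback != NULL);

  /* An empty locale name means "use the environment's settings".  */
  if (setlocale (LC_CTYPE, "") == NULL)
    return fallback;

  const char *codeset = nl_langinfo (CODESET);
  return (codeset[0] != '\0') ? codeset : fallback;
}
```

Calling system_codeset ("ISO-8859-1") would give the backwards-compatible
default suggested above whenever the environment says nothing useful.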

> Additionally the documentation should state things clearly:
> - Does SCM_*LENGTH return the number of bytes or number of chars
>   in the string?
> - If the latter, how does one get the number of bytes in a
>   string?

I think these will be renamed, because this is confusing.  We'll
provide a macro to get the length in bytes ("size"), and a function to
get the length in characters ("length").
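
To make the size/length distinction concrete, here is a small sketch
that assumes the text is well-formed, NUL-terminated UTF-8 (the
encoding we plan to switch to, not the current Emacs-Mule one).  The
names utf8_size and utf8_length are hypothetical:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* "Size": the number of bytes in the string.  */
size_t
utf8_size (const char *s)
{
  assert (s != NULL);
  return strlen (s);
}

/* "Length": the number of characters.  In UTF-8 every character
   starts with exactly one byte that is not a continuation byte
   (10xxxxxx), so counting those bytes counts characters.  */
size_t
utf8_length (const char *s)
{
  assert (s != NULL);
  size_t n = 0;
  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xC0) != 0x80)
      n++;
  return n;
}
```

For the six bytes "h\xc3\xa9" "llo" (an "h", a two-byte e-acute, and
"llo"), the size is 6 and the length is 5.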

> As a last suggestion, an addition to the gh_ interface:
> 
>   char *gh_mb_scm2newstr(SCM str, int *lenp, char *charset);

Yes, we need a bunch of those kinds of things.  Something like:

	char *gh_mb_scm2str_system (SCM str, int *lenp)

which just gives you the text of STR in whatever the system's encoding
is.
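
The conversion step underneath such a function might look roughly like
this.  It is a sketch only: it assumes the internal text is UTF-8,
works on a plain byte buffer instead of an SCM string, and uses POSIX
iconv directly.  utf8_to_encoding is a hypothetical name:

```c
#include <assert.h>
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert LEN bytes of UTF-8 text at SRC into
   the encoding named TO.  Returns a malloc'd buffer and stores its
   byte count in *LENP, or returns NULL on any failure.  */
char *
utf8_to_encoding (const char *src, size_t len, const char *to, size_t *lenp)
{
  assert (src != NULL && to != NULL && lenp != NULL);

  iconv_t cd = iconv_open (to, "UTF-8");
  if (cd == (iconv_t) -1)
    return NULL;

  /* Generous guess at the output size; a real version would grow
     the buffer and retry when iconv reports E2BIG.  */
  size_t cap = len * 4 + 16;
  char *out = malloc (cap);
  if (out == NULL)
    {
      iconv_close (cd);
      return NULL;
    }

  char *inp = (char *) src;     /* iconv's prototype wants char **.  */
  char *outp = out;
  size_t inleft = len, outleft = cap;

  /* Convert the text, then flush any pending shift state.  */
  if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1
      || iconv (cd, NULL, NULL, &outp, &outleft) == (size_t) -1)
    {
      free (out);
      iconv_close (cd);
      return NULL;
    }

  iconv_close (cd);
  *lenp = cap - outleft;
  return out;
}
```

A gh_mb_scm2str_system along these lines would pass the system's
encoding name as TO; the caller frees the result.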

> While reading through mbapi.texi, I noticed the following
> promise:
> 
>   ``Using @code{scm_mb_index_cached} or
>     @code{scm_mb_index_cached_func}, you can scan a string from
>     left to right in time proportional to the length of the
>     string.''
> 
> Why only left to right? It should be equally fast to scan right
> to left. (I suppose this is how the scheme side string-index is
> implemented)

It is.  I couldn't figure out how to say it in the manual without
getting really confusing.  I do mention it down below.
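
To illustrate why the direction doesn't matter, here is a sketch of the
caching idea.  It is not the actual scm_mb_index_cached code: it
assumes well-formed UTF-8 rather than the Emacs-Mule encoding, and the
struct and function names are hypothetical.  The cache remembers where
the last lookup landed, so stepping to a neighbouring character costs
a few bytes of work whether you move left or right:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache: the character index and byte offset of the
   most recently located character.  Start it at {0, 0}.  */
struct index_cache
{
  size_t char_index;
  size_t byte_offset;
};

static int
is_continuation (unsigned char b)
{
  return (b & 0xC0) == 0x80;
}

/* Return the byte offset of character number TARGET in the SIZE-byte
   string STR, walking from the cached position in either direction.
   If TARGET is past the end, the result is SIZE.  */
size_t
index_cached (const char *str, size_t size,
              struct index_cache *cache, size_t target)
{
  assert (str != NULL && cache != NULL && cache->byte_offset <= size);
  size_t ci = cache->char_index;
  size_t bo = cache->byte_offset;

  while (ci < target && bo < size)
    {
      /* Forward: skip the lead byte and its continuation bytes.  */
      bo++;
      while (bo < size && is_continuation ((unsigned char) str[bo]))
        bo++;
      ci++;
    }
  while (ci > target)
    {
      /* Backward: step back to the previous lead byte.  */
      bo--;
      while (bo > 0 && is_continuation ((unsigned char) str[bo]))
        bo--;
      ci--;
    }

  cache->char_index = ci;
  cache->byte_offset = bo;
  return bo;
}
```

Asking for indices 0, 1, 2, ... or n, n-1, n-2, ... in order touches
each byte a bounded number of times, so either scan is linear in the
size of the string.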

> As a final question, when will multibyte characters and
> first-class environments be available (like, ``probably in a
> week/month/year'')

This is the really hard question for me.  I don't really have any idea.

I want to keep working on this while the issues are fresh in my mind.
It would suck to go and spend a month working on it, and then have it
languish.  So I'm hoping to work on it regularly.
