This is the mail archive of the
guile@sourceware.cygnus.com
mailing list for the Guile project.
Re: Multibyte encoding, scm_mb_supported_charset_p
- To: forcer <forcer AT mindless -dot- com>
- Subject: Re: Multibyte encoding, scm_mb_supported_charset_p
- From: Jim Blandy <jimb AT red-bean -dot- com>
- Date: 13 Sep 1999 14:13:13 -0500
- Cc: guile AT sourceware.cygnus -dot- com
- References: <199909131255.OAA01480@forcix.roof.lan>
> Just a hint to anyone who's implementing the multibyte encoding:
> There should be functions
>
> SCM scm_mb_supported_charset_p(SCM charset);
> int scm_mb_c_supported_charset_p(char *charset);
>
> which return a boolean indicating whether the charset is supported by Guile.
Do you mean "charset" as in the arguments to scm_mb_iconv_open?
Sure --- that's a good idea.
Or do you mean "charset" as in the leading bytes described in
mbemacs.h? Those charsets aren't a Scheme-level concept. And I don't
really want to introduce them outside of the mb modules, because when we
switch to UTF-8, the concept is going away.
Handa-san needed some representation for text coming from a wide
variety of source character sets. So he organized his universal
encoding around the existing character sets. But it's still just an
implementation detail.
If you want functions like "char-greek?" or "char-chinese?", that's
cool. Or if you want functions like "(char-encodable? CHAR CHARSET)",
that's cool too. But the leading bytes are exactly the sort of detail
of the Emacs-Mule encoding that I *don't* want to spread around the
rest of Guile.
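For the first reading, where a charset is a name of the kind passed to
scm_mb_iconv_open, a C-level predicate might look roughly like the sketch
below. It assumes the check can be delegated to the system's iconv_open,
which is only a guess about how scm_mb_iconv_open would work. The name
my_supported_charset_p and the choice of UTF-8 as the target encoding are
hypothetical.

```c
#include <iconv.h>

/* Hypothetical helper: return 1 if the system iconv can convert from
   CHARSET to UTF-8, and 0 otherwise.  Using UTF-8 as the target is an
   assumption made for this sketch. */
int
my_supported_charset_p (const char *charset)
{
  iconv_t cd = iconv_open ("UTF-8", charset);

  if (cd == (iconv_t) -1)
    return 0;                   /* unknown or unsupported name */
  iconv_close (cd);
  return 1;
}
```

A Scheme-level wrapper would only need to extract the name from its
argument and turn the result into a boolean.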
> Also, ``text ports'' (don't know about the correct terminology)
There is no correct terminology yet. :)
> should store a "charset" to/from which they'll convert when
> writing/reading.
Yes, indeed.
> (The default should be ISO-8859-1 for backwards compatibility,
> probably to be replaced by UTF-8 rsn)
The operating system has various mysterious ways of indicating what
encoding it wants you to use --- environment variables with names like
LC_MUMBLEFROTZ and such. I don't know what-all it is. Tcl has a
bunch of logic for figuring it out, which I plan to imitate.
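As a rough illustration of the simplest case, on POSIX systems the C
library can report the encoding selected by the locale environment
variables. This is a minimal sketch, not what Tcl does (its logic is more
elaborate). The name my_system_charset is hypothetical, and the
ISO-8859-1 fallback is just the default proposed in the quoted text.

```c
#include <langinfo.h>
#include <locale.h>

/* Hypothetical helper: ask the C library which encoding the user's
   locale (LANG, LC_ALL, LC_CTYPE, ...) selects.  Falls back to
   ISO-8859-1 if nothing is reported. */
const char *
my_system_charset (void)
{
  const char *codeset;

  /* Adopt the character-type locale described by the environment. */
  setlocale (LC_CTYPE, "");
  codeset = nl_langinfo (CODESET);
  if (codeset == NULL || codeset[0] == '\0')
    return "ISO-8859-1";
  return codeset;
}
```

The returned name depends on the platform and on the user's settings, so
callers should treat it as an opaque charset name.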
> Additionally the documentation should state things clearly:
> - Does SCM_*LENGTH return the number of bytes or number of chars
> in the string?
> - If the latter, how does one get the number of bytes in a
> string?
I think these will be renamed, because this is confusing. We'll
provide a macro to get the length in bytes ("size"), and a function to
get the length in characters ("length").
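To make the distinction concrete, here is a small sketch that assumes the
text is valid, NUL-terminated UTF-8 (the current Emacs-Mule encoding
would need a different test for leading bytes). Both names are
hypothetical, and both are written as functions here for simplicity.

```c
#include <stddef.h>
#include <string.h>

/* Size: number of bytes, excluding the terminating NUL. */
size_t
my_str_size (const char *s)
{
  return strlen (s);
}

/* Length: number of characters, assuming S is valid UTF-8.  Every
   character has exactly one byte that is not a continuation byte
   (continuation bytes have the form 10xxxxxx). */
size_t
my_str_length (const char *s)
{
  size_t n = 0;

  for (; *s != '\0'; s++)
    if (((unsigned char) *s & 0xC0) != 0x80)
      n++;
  return n;
}
```

For a string containing "h", e-acute, "llo", the size is 6 bytes and the
length is 5 characters, because e-acute takes two bytes in UTF-8.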
> As a last suggestion, an addition to the gh_ interface:
>
> char *gh_mb_scm2newstr(SCM str, int *lenp, char *charset);
Yes, we need a bunch of those kinds of things. Something like:
char *gh_mb_scm2str_system (SCM str, int *lenp);
which just gives you the text of STR in whatever the system's encoding
is.
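The conversion step inside such a function could be sketched with the
system iconv, as below. This assumes the internal text is UTF-8, which is
not the case yet, and it leaves out the Guile side entirely. The name
my_utf8_to_charset, its signature, and the output buffer sizing are all
hypothetical.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert LEN bytes of UTF-8 text in SRC to
   CHARSET.  Returns a malloc'd, NUL-terminated buffer and stores its
   byte length in *LENP, or returns NULL on failure. */
char *
my_utf8_to_charset (const char *src, size_t len, const char *charset,
                    size_t *lenp)
{
  iconv_t cd = iconv_open (charset, "UTF-8");
  size_t cap = 4 * len + 16;    /* generous guess at the output size */
  size_t in_left = len, out_left = cap;
  char *out, *in_ptr, *out_ptr;

  if (cd == (iconv_t) -1)
    return NULL;
  out = malloc (cap + 1);
  if (out == NULL)
    {
      iconv_close (cd);
      return NULL;
    }
  in_ptr = (char *) src;        /* iconv does not modify the input */
  out_ptr = out;
  /* Convert the text, then flush any pending shift sequence. */
  if (iconv (cd, &in_ptr, &in_left, &out_ptr, &out_left) == (size_t) -1
      || iconv (cd, NULL, NULL, &out_ptr, &out_left) == (size_t) -1)
    {
      free (out);
      iconv_close (cd);
      return NULL;
    }
  iconv_close (cd);
  *lenp = cap - out_left;
  out[*lenp] = '\0';
  return out;
}
```

Here CHARSET would be whatever name the system's encoding turns out to
have. The caller frees the result, in the same spirit as gh_scm2newstr.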
> While reading through mbapi.texi, I noticed the following
> promise:
>
> ``Using @code{scm_mb_index_cached} or
> @code{scm_mb_index_cached_func}, you can scan a string from
> left to right in time proportional to the length of the
> string.''
>
> Why only left to right? It should be equally fast to scan right
> to left. (I suppose this is how the scheme side string-index is
> implemented)
It is. I couldn't figure out how to say it in the manual without
getting really confusing. I do mention it down below.
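To illustrate why the direction doesn't matter, here is a sketch of the
caching idea. It is not the actual scm_mb_index_cached code: it assumes
UTF-8 rather than the Emacs-Mule encoding, and the struct and function
names are hypothetical. The cache remembers the byte offset of the last
character looked up, so stepping to a neighboring character in either
direction only touches the bytes of one character.

```c
#include <stddef.h>

/* Hypothetical cache: the byte offset of one recently used character. */
struct my_index_cache
{
  size_t char_index;    /* character index of the cached position */
  size_t byte_offset;   /* byte offset of that character's first byte */
};

static int
my_is_continuation (char c)
{
  return ((unsigned char) c & 0xC0) == 0x80;
}

/* Return the byte offset of character I in the UTF-8 string S, walking
   from the cached position.  I must not exceed the number of
   characters in S. */
size_t
my_index_cached (const char *s, struct my_index_cache *cache, size_t i)
{
  size_t ci = cache->char_index;
  size_t off = cache->byte_offset;

  while (ci < i)                /* step forward one character */
    {
      off++;
      while (my_is_continuation (s[off]))
        off++;
      ci++;
    }
  while (ci > i)                /* step backward one character */
    {
      off--;
      while (my_is_continuation (s[off]))
        off--;
      ci--;
    }
  cache->char_index = ci;
  cache->byte_offset = off;
  return off;
}
```

Starting from a cache of {0, 0}, asking for indices 0, 1, 2, ... or for
n, n-1, n-2, ... costs time proportional to the string's size in total.
Only random access pays for the distance from the cached position.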
> As a final question, when will multibyte characters and
> first-class environments be available (like, ``probably in a
> week/month/year'')?
This is the really hard question for me. I don't really have any idea.
I want to keep working on this while the issues are fresh in my mind.
It would suck to go and spend a month working on it, and then have it
languish. So I'm hoping to work on it regularly.