This is the mail archive of the cygwin-developers mailing list for the Cygwin project.

Re: representing charsets

From: Corinna Vinschen <corinna-cygwin at cygwin dot com>
To: cygwin-developers at cygwin dot com
Date: Wed, 31 Mar 2010 10:34:53 +0200
Subject: Re: representing charsets
References: <416096c61003300449u737a0c8x3155217e8e16aa1e@mail.gmail.com> <20100330144658.GA18364@calimero.vinschen.de> <w2m416096c61003302253t932c5756u9c96351869052804@mail.gmail.com>
Reply-to: cygwin-developers at cygwin dot com

On Mar 31 06:53, Andy Koppe wrote:
> Corinna Vinschen:
> > Andy Koppe:
> >> 3) Represent charsets as enum constants (or #defines) rather than
> >> strings throughout, with the singlebyte charsets ordered in such a way
> >> that they correspond to their order in the conversion tables, along
> >> these lines:
> >>
> >> enum {
> >>   CS_UTF8 = 0,
> >>
> >>   /* ISO singlebyte codepages */
> >>   CS_ISO8859_1 = 1,
> >>   CS_ISO8859_2 = 2,
> >>   ...
> >>   CS_ISO8859_11 = 11,
> >>   /* ISO-8859-12 doesn't exist */
> >>   CS_ISO8859_13 = 12,
> >>   ...
> >>   CS_ISO8859_16 = 15,
> >>
> >>   /* Windows singlebyte codepages */
> >>   CS_CP437 = 100,
> >>   CS_CP720 = 101,
> >>   CS_CP737 = 102,
> >>   ...
> >>
> >>   /* Multibyte codepages */
> >>   CS_SJIS = 200,
> >>   CS_GBK = 201,
> >>   ...
> >> }
> >
> > But what is that good for?  Which advantage do you have?
> 
> - No need to pass around both charset name and the charset table index.
> - The __cp_index and __iso8859_index functions can be junked.
> __cp_mbtowc/wctomb obtain the index with (cs_id - CS_CP437). Similar
> for ISO.
> - Only one list of valid codepages (since the one in __cp_index can go).
> - Get rid of the hack where the likes of KOI8-R or PT154 are
> internally represented as "CPxxx" names, some of which don't actually
> correspond to Windows codepages.
> - All those strcpy() calls in setlocale become simple assignments,
> e.g. charset_id = CS_EUCJP instead of strcpy(charset, "EUCJP"). Not
> relevant performance-wise, but in terms of space (for embedded
> targets).
> - Similarly, charset comparisons become simple integer comparisons
> instead of strcmps.

Hmm, ok.
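
To make the index-by-subtraction point concrete, here is a minimal sketch,
not code from newlib.  The enum values follow the layout proposed above,
but demo_cp_tables, demo_cp_index and demo_cp_table are hypothetical names
made up for illustration, and the string array only stands in for the real
conversion tables.

```c
#include <assert.h>
#include <stddef.h>

/* Charset ids following the proposed layout; only a few entries shown. */
enum charset_id {
  CS_UTF8 = 0,
  CS_ISO8859_1 = 1,
  CS_ISO8859_2 = 2,
  CS_CP437 = 100,
  CS_CP720 = 101,
  CS_CP737 = 102,
  CS_SJIS = 200
};

/* Hypothetical stand-in for the conversion tables, kept in the same
   order as the CS_CPxxx constants. */
static const char *const demo_cp_tables[] = {
  "CP437 table", "CP720 table", "CP737 table"
};

#define DEMO_CP_COUNT (sizeof demo_cp_tables / sizeof demo_cp_tables[0])

/* Return the table index for a Windows singlebyte codepage id,
   or -1 if the id is outside that range. */
static int
demo_cp_index (int cs_id)
{
  int idx = cs_id - CS_CP437;

  if (idx < 0 || (size_t) idx >= DEMO_CP_COUNT)
    return -1;
  return idx;
}

/* Fetch the table for an id the caller already knows to be valid. */
static const char *
demo_cp_table (int cs_id)
{
  int idx = demo_cp_index (cs_id);

  assert (idx >= 0);
  return demo_cp_tables[idx];
}
```

With this layout a charset comparison is just cs_id == CS_CP437, and the
lookup needs no name parsing.  The price is that the enum order and the
table order have to be kept in sync by hand.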

> > If you
> > only keep the number, where do you get the charset name from?
> 
> A new function, e.g. 'void __get_charset_name(int cs_id, char *buf)',
> where a buffer of size ENCODING_LEN+1 needs to be passed in.
> nl_langinfo(CODESET) would simply call that instead of doing its own
> strcmp-heavy parsing of internal names to turn them back into official
> names.
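
A rough sketch of what such a name lookup could look like, assuming a small
id-to-name table.  demo_get_charset_name, demo_names and DEMO_ENCODING_LEN
are hypothetical names for illustration only (the value 31 is an assumption
as well), and the table lists just a few charsets.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed maximum name length; the caller passes a buffer of
   DEMO_ENCODING_LEN + 1 bytes. */
#define DEMO_ENCODING_LEN 31

enum { CS_UTF8 = 0, CS_ISO8859_1 = 1, CS_CP437 = 100, CS_SJIS = 200 };

/* Hypothetical mapping from charset id to the name to report. */
static const struct {
  int id;
  const char *name;
} demo_names[] = {
  { CS_UTF8,      "UTF-8" },
  { CS_ISO8859_1, "ISO-8859-1" },
  { CS_CP437,     "CP437" },
  { CS_SJIS,      "SJIS" }
};

/* Copy the name for cs_id into buf; buf is left empty for unknown ids. */
static void
demo_get_charset_name (int cs_id, char *buf)
{
  size_t i;

  assert (buf != NULL);
  buf[0] = '\0';
  for (i = 0; i < sizeof demo_names / sizeof demo_names[0]; i++)
    if (demo_names[i].id == cs_id)
      {
        strncpy (buf, demo_names[i].name, DEMO_ENCODING_LEN);
        buf[DEMO_ENCODING_LEN] = '\0';  /* strncpy may not terminate */
        return;
      }
}
```

A caller would declare char buf[DEMO_ENCODING_LEN + 1] and pass it in.  A
linear scan is enough for a sketch; a real version could index an array
directly if the ids were dense.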

Actually, the codesets for all LC_FOO categories are supposed to be stored
in the LC_FOO data structure soon.  So the call to __get_charset_name
should be performed in the __FOO_load_locale functions.

Before you start, I'd like to apply my patch from
http://sourceware.org/ml/newlib/2010/msg00221.html first.  This
already contains a change to nl_langinfo, which just fetches the
charset from the locale info.  At least at this point, your patch and
mine would clash.  With my patch, you only have to change the
__FOO_load_locale functions, and no longer nl_langinfo.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
