This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.



Re: codeset problems in wprintf and wcsftime


On Feb 20 16:31, Andy Koppe wrote:
> Corinna Vinschen:
> > while working on finalizing locale support for Cygwin it suddenly
> > occurred to me that we have a problem in wprintf and wcsftime.
> >
> > Let's assume a funny combination of localization variables in the user's
> > environment:
> >
> >   LANG=de_DE.utf8
> >   LC_TIME=ja_JP.eucjp
> >   LC_NUMERIC=en_US.iso88591
> >
> > Yes, it's pretty unlikely, but nevertheless possible and valid.
> >
> > So, at setlocale time we read and store the localized strings in the
> > codeset specified by the localization variable:
> >
> >  - __locale_charset()             returns UTF-8
> >  - __get_current_time_locale()    returns data stored in EUC-JP
> >  - __get_current_numeric_locale() returns data stored in ISO-8859-1
> >  - localeconv()                   returns with decimal_point and
> >                                   thousands_sep stored in ISO-8859-1,
> >                                   and all other strings from the
> >                                   LC_MONETARY category in UTF-8.
> >  - nl_langinfo()                  CODESET is UTF-8,
> >                                   strings from the LC_TIME category are
> >                                   returned in EUC-JP,
> >                                   strings from LC_MESSAGES are returned
> >                                   in UTF-8
> >                                   RADIXCHAR and THOUSEP are returned in
> >                                   ISO-8859-1.
> >
> > This is no problem at all as long as you call the multibyte variations
> > printf and strftime: the user gets what she asked for, and who are we
> > to ask the user for the reason behind this choice?
> 
> Have you verified that the user does indeed get a mix of charsets when
> doing this on glibc?

Look at the output of the locale(1) tool:

  $ export LANG=de_DE.utf8
  $ export LC_TIME=ja_JP.eucjp
  $ export LC_NUMERIC=en_US.iso88591
  $ locale -k LC_CTYPE LC_TIME LC_NUMERIC | egrep 'codeset|charmap'
  charmap="UTF-8"
  time-codeset="EUC-JP"
  numeric-codeset="ISO-8859-1"
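
The per-category split follows directly from the codeset part of each
variable.  As a rough illustration only, here is a small helper that
pulls that part out of a locale name, assuming the usual
language[_territory][.codeset][@modifier] layout.  The function name
locale_name_codeset is made up for this example and is not part of
newlib or glibc, and it does no normalization (it returns "utf8", not
"UTF-8"):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy the codeset part of a locale name such as
   "ja_JP.eucjp" into buf.  Returns the number of characters copied,
   or 0 if the name carries no codeset part. */
static size_t
locale_name_codeset (const char *name, char *buf, size_t buflen)
{
  const char *dot, *end;
  size_t len;

  if (buflen == 0)
    return 0;
  buf[0] = '\0';

  /* The codeset starts after the first '.' ... */
  dot = strchr (name, '.');
  if (dot == NULL)
    return 0;
  dot++;

  /* ... and ends at an optional '@modifier' or at the end of the name. */
  end = strchr (dot, '@');
  len = end ? (size_t) (end - dot) : strlen (dot);

  /* Truncate rather than overflow the caller's buffer. */
  if (len >= buflen)
    len = buflen - 1;
  memcpy (buf, dot, len);
  buf[len] = '\0';
  return len;
}
```

Called with the three values above, it yields "utf8", "eucjp" and
"iso88591" respectively, one codeset per category.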

> I'm asking because another alternative to the solutions you outlined
> might be to store those strings as wchar versions only, to be used
> directly in wprintf and converted to the LC_CTYPE character set when
> needed in printf. That way, the user would always get readable output.

The multibyte variations are still much more often used than the
widechar functions.  I would prefer not to move the conversion burden
into these more often used functions.

> > - Store the charset not only for LC_CTYPE, but for each localization
> >   category, and provide a function to request the charset.
> >   This also requires storing the associated multibyte to widechar
> >   conversion functions, obviously, and calling the correct functions
> >   from wprintf and wcsftime.
> >
> > - Redefine the locale data structs so that they contain multibyte and
> >   widechar representations of all strings.  Use the multibyte strings
> >   in the multibyte functions, the widechar strings in the widechar
> >   functions.
> >
> > Personally I'd prefer the second approach.
> 
> Agreed. Sounds like less overhead.
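
To make the second approach a bit more concrete, here is a minimal
sketch of a numeric locale record that carries both representations.
The wide strings are filled in once, when the data is loaded, using a
converter that matches the category's own charset.  All names here
(to_wide_fn, latin1_to_wide, struct numeric_strings,
numeric_strings_init) are invented for illustration and are not the
actual newlib structures or functions.  ISO-8859-1 is used for the
converter only because it maps bytes to code points one to one:

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical converter type: turns a string in one category's
   charset into wide characters, writing at most dstlen elements
   including the terminator. */
typedef size_t (*to_wide_fn) (wchar_t *dst, const char *src, size_t dstlen);

/* ISO-8859-1: every byte is the code point of the same value. */
static size_t
latin1_to_wide (wchar_t *dst, const char *src, size_t dstlen)
{
  size_t i = 0;

  if (dstlen == 0)
    return 0;
  while (src[i] != '\0' && i + 1 < dstlen)
    {
      dst[i] = (wchar_t) (unsigned char) src[i];
      i++;
    }
  dst[i] = L'\0';
  return i;
}

/* Hypothetical LC_NUMERIC record holding both representations. */
struct numeric_strings
{
  const char *decimal_point;     /* in the LC_NUMERIC charset */
  const char *thousands_sep;     /* in the LC_NUMERIC charset */
  wchar_t wdecimal_point[8];     /* charset-independent */
  wchar_t wthousands_sep[8];     /* charset-independent */
};

/* Convert once at load time, so neither printf nor wprintf has to
   convert anything when formatting. */
static void
numeric_strings_init (struct numeric_strings *ns, const char *dp,
                      const char *ts, to_wide_fn conv)
{
  ns->decimal_point = dp;
  ns->thousands_sep = ts;
  conv (ns->wdecimal_point, dp, 8);
  conv (ns->wthousands_sep, ts, 8);
}
```

With something like this, the multibyte functions would keep reading
decimal_point and thousands_sep as before, while the widechar
functions would read the w* members and never need to know which
charset LC_NUMERIC was loaded in.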


Corinna

-- 
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat

