This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: Bug in libiconv?


Hi Corinna and Chuck,

Please CC the bug-gnu-libiconv mailing list when discussing possible
bugs in GNU libiconv.


Replying to <http://www.cygwin.com/ml/cygwin/2011-01/msg00292.html>:

> the application tests to convert a UTF-8 to WCHAR_T string in four
>   combinations of the current locale, in this order:
> 
>   - iconv_open "C",       iconv "C"
>   - iconv_open "C",       iconv "C.UTF-8"
>   - iconv_open "C.UTF-8", iconv "C"
>   - iconv_open "C.UTF-8", iconv "C.UTF-8"
> 
> Here's what happens in Linux:
> 
>   $ gcc -g -o ic ic.c
>   $ ./ic
>   in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
>   in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
>   in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
>   in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
> 
> Here's what happens on Cygwin:
> 
>   $ gcc -g -o ic ic.c -liconv
>   $ ./ic
>   iconv: 138 <Invalid or incomplete multibyte or wide character>
>   in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
>   iconv: 138 <Invalid or incomplete multibyte or wide character>
>   in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
>   iconv: 138 <Invalid or incomplete multibyte or wide character>
>   in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
>   in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 480

On glibc systems, the encoding "WCHAR_T" is equivalent to "UCS-4" with
machine-dependent endianness and alignment. In particular, it is independent
of the locale. That explains the first set of results.

In libiconv, on systems which don't define __STDC_ISO_10646__, the encoding
"WCHAR_T" is equivalent to wchar_t[], that is, dependent on the locale.
Changing the locale encoding after allocating an iconv_t from or to "WCHAR_T"
yields undefined behaviour. That explains the second set of results.
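To illustrate the ordering constraint described above, here is a minimal
sketch in which the locale is selected before the iconv_t is allocated and is
not changed again until the descriptor has been closed. The helper name
utf8_to_wchar and its parameters are hypothetical, chosen only for this
example; it is not code from libiconv or from the test program under
discussion.

```c
#include <iconv.h>
#include <locale.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: converts inlen bytes of UTF-8 to wchar_t[].
   Returns the number of wchar_t units written, or (size_t) -1 on failure. */
size_t utf8_to_wchar(const char *locale_name, const char *in, size_t inlen,
                     wchar_t *out, size_t outcount)
{
    /* Select the locale first, so that a locale-dependent "WCHAR_T"
       is resolved against the locale that is active during iconv(). */
    if (setlocale(LC_ALL, locale_name) == NULL)
        return (size_t) -1;

    iconv_t cd = iconv_open("WCHAR_T", "UTF-8");
    if (cd == (iconv_t) -1)
        return (size_t) -1;

    char *inptr = (char *) in;
    size_t inleft = inlen;
    char *outptr = (char *) out;
    size_t outleft = outcount * sizeof(wchar_t);

    /* No setlocale() call happens between iconv_open() and iconv_close(). */
    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);
    if (rc == (size_t) -1)
        return (size_t) -1;

    return outcount - outleft / sizeof(wchar_t);
}
```

A caller would pass the locale name it intends to use for the whole
conversion, for example utf8_to_wchar("C", text, strlen(text), buf, 32).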


Replying to <http://www.cygwin.com/ml/cygwin/2011-01/msg00299.html>:

> I defined __STDC_ISO_10646__ for Cygwin 1.7.8 yesterday.

What is the Cygwin wchar_t[] encoding? Is it UTF-16, like on Win32? The
documentation is silent about it. I had expected to find some word about it
in <http://cygwin.com/cygwin-api/compatibility.html#std-susv4>
or <http://cygwin.com/cygwin-api/std-notes.html>.

In any case, sizeof (wchar_t) == 2. I don't think defining __STDC_ISO_10646__
is compliant with ISO C 99 in this situation. ISO C 99 section 6.10.8.(2) says:

  __STDC_ISO_10646__
          An integer constant of the form yyyymmL (for example,
          199712L), intended to indicate that values of type wchar_t are the
          coded representations of the characters defined by ISO/IEC 10646,
          along with all amendments and technical corrigenda as of the
          specified year and month.

But when characters outside the basic plane, such as
U+12345 (CUNEIFORM SIGN URU TIMES KI), are encoded by 2 consecutive wchar_t
values, values of type wchar_t don't correspond to ISO/IEC 10646 characters.
(Or maybe I'm underestimating what "coded representations" means...?)
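To make the surrogate-pair point concrete, here is a small sketch of the
standard UTF-16 encoding rule. The function name utf16_encode is hypothetical
and uint16_t merely stands in for a 16-bit wchar_t; it does not show how
Cygwin or libiconv implement the conversion.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encodes one code point as UTF-16.
   Returns the number of 16-bit units written (1 or 2), or 0 if cp is
   not a valid Unicode scalar value. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    /* Reject values beyond U+10FFFF and lone surrogate code points. */
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;

    /* Basic plane: one unit holds the character directly. */
    if (cp < 0x10000) {
        out[0] = (uint16_t) cp;
        return 1;
    }

    /* Supplementary planes: split the 20-bit offset into two units. */
    cp -= 0x10000;
    out[0] = (uint16_t) (0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}
```

For U+12345 this yields the two units 0xD808 and 0xDF45, neither of which is
by itself the coded representation of an ISO/IEC 10646 character.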


Replying to <http://www.cygwin.com/ml/cygwin/2011-01/msg00357.html>:

>   #if __STDC_ISO_10646__ || ((defined _WIN32 || defined __WIN32__) && !defined __CYGWIN__)
> This should be
> ...
>   #if __STDC_ISO_10646__ || defined _WIN32 || defined __WIN32__ || defined __CYGWIN__

That makes sense if Cygwin guarantees that from now on and in the future,
the wchar_t encoding will always be UTF-16. Is this the case?


Replying to <http://www.cygwin.com/ml/cygwin/2011-01/msg00299.html>:

> Why on earth is libiconv on Cygwin using Windows functions in some
> places?

So that I could reuse essentially the same code on Cygwin as on native Win32.

Charles has submitted a patch on this topic to bug-gnulib; I will handle it.

> the old cygwin_conv_to_posix_path function as well.

Is cygwin_conv_to_posix_path deprecated? Does it introduce limitations of
some kind?

> The usage of a fixed table instead of the charset.alias file in
> libcharset/lib/localcharset.c, function get_charset_aliases() is
> not good, not good at all.

The alternative is to have this table stored in a file charset.alias;
but then every package that includes the module 'localcharset' from
gnulib (that is, libiconv, gettext, coreutils, and many others) will
want to modify this file during "make install". And this causes a lot of
headaches for packaging systems. Therefore, on platforms which have
widely used packaging systems (Linux, MacOS X, Cygwin), it's better to
avoid the need for this file. Additionally, on Win32 systems relocatability
is a must, and the code to compute the location of charset.alias from
the location of libiconv.dll would be overkill.


Replying to <http://www.cygwin.com/ml/cygwin/2011-01/msg00303.html>:

> It looks like there's been some bitrot with respect
> to some of the "&& !CYGWIN" guards on WIN32.  Both libiconv and gettext,
> IIRC, jump thru hoops to ensure that [_]*WIN32 is defined for both
> "regular" win32 and for cygwin...which means defined(CYGWIN) guards are
> necessary.

The reason for these "&& !defined __CYGWIN__" clauses is that - at least
in Cygwin 1.5.x - gcc has an option that will define _WIN32 or __WIN32__.
So _WIN32 || __WIN32__ may evaluate to true on Cygwin, or it may
evaluate to false on Cygwin, depending on how the compiler is invoked.
Since I don't want libiconv or gettext to be compiled in two possible
ways on Cygwin, I add "&& !defined __CYGWIN__".

Neither libiconv nor gettext defines or undefines _WIN32 or __WIN32__.
But they are prepared for either setting.
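As an illustration of the guard pattern described above, the following sketch
folds the test into a single macro so that the outcome on Cygwin does not
depend on whether the compiler defines _WIN32 or __WIN32__. The names
PLATFORM_NATIVE_WIN32 and is_native_win32 are hypothetical and do not appear
in libiconv or gettext.

```c
/* PLATFORM_NATIVE_WIN32 is a hypothetical name used only for this sketch. */
#if (defined _WIN32 || defined __WIN32__) && !defined __CYGWIN__
# define PLATFORM_NATIVE_WIN32 1
#else
# define PLATFORM_NATIVE_WIN32 0
#endif

/* Returns 1 only for native Windows builds. On Cygwin the result is 0
   whether or not the compiler happens to define _WIN32 or __WIN32__. */
int is_native_win32(void)
{
    return PLATFORM_NATIVE_WIN32;
}
```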


Replying to <http://www.cygwin.com/ml/cygwin/2011-01/msg00332.html>:

> there ARE still bugs in libiconv on Cygwin -- specifically:
>  - Even though iconv_open has been opened explicitly with "UTF-8" as
>    input string, the conversion still depends on the current application
>    codeset.  That doesn't make sense.

If the other argument to iconv_open is "CHAR" or "WCHAR_T", hence locale
dependent, and you change the locale in between, the result is undefined
behaviour.
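For contrast, here is a sketch of a conversion in which neither argument to
iconv_open is locale dependent, so a locale change in between should not
affect it. "UTF-32LE" is only an example of an explicitly named target
encoding, and the helper name utf8_to_utf32le is hypothetical; whether a
fixed encoding is an acceptable substitute for "WCHAR_T" depends on the
application.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: converts inlen bytes of UTF-8 to UTF-32LE.
   Returns the number of bytes written, or (size_t) -1 on failure. */
size_t utf8_to_utf32le(const char *in, size_t inlen, char *out, size_t outsize)
{
    /* Both encodings are named explicitly; neither is "CHAR" or "WCHAR_T". */
    iconv_t cd = iconv_open("UTF-32LE", "UTF-8");
    if (cd == (iconv_t) -1)
        return (size_t) -1;

    char *inptr = (char *) in;
    size_t inleft = inlen;
    char *outptr = out;
    size_t outleft = outsize;

    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);
    if (rc == (size_t) -1)
        return (size_t) -1;

    return outsize - outleft;
}
```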

>  - 'iconv_close ((iconv_t) -1);' crashes the application with a SEGV.

It's not a bug. From POSIX:2008
<http://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv_open.html>
you can infer that (iconv_t) -1 is not a "conversion descriptor". It's the
error return value of iconv_open(), nothing more. From
<http://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv_close.html>
you can see that the argument of iconv_close() has to be a conversion
descriptor. From the ERRORS section in the same page you can see that
iconv_close() is not required to catch a faulty argument. Note the word
"may", not "shall".
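A caller that wants to be safe regardless of the implementation can check for
the error value before closing. The wrapper below is a minimal sketch; the
name close_if_open is hypothetical and not part of any iconv API.

```c
#include <iconv.h>

/* Hypothetical wrapper: calls iconv_close() only on a real descriptor.
   Returns 0 if there was nothing to close, else iconv_close()'s result. */
int close_if_open(iconv_t cd)
{
    /* (iconv_t) -1 is what iconv_open() returns on failure; it is not a
       conversion descriptor and must not be passed to iconv_close(). */
    if (cd == (iconv_t) -1)
        return 0;
    return iconv_close(cd);
}
```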


Bruno


