This is the mail archive of the cygwin mailing list for the Cygwin project.


Bug in libiconv?


Hi Chuck,
hi everyone else,


In a twisted turn of events, I'm trying to get the orphaned catgets
package to work correctly on Cygwin 1.7.  As you might know, the package
is derived from the glibc package.  Apart from other portability issues
of this *very* glibc-centric piece of code, I found a problem that
appears to point to two bugs in Cygwin's libiconv2.

For some reason, the iconv conversion seems to be overly dependent on
the usage of setlocale, and the value returned in the fifth (last)
parameter appears to be incorrect if the output codeset is "WCHAR_T".

Here's a simple testcase:

==== SNIP ====
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
#include <iconv.h>
#include <locale.h>
#include <wchar.h>

iconv_t
open_iconv ()
{
  iconv_t cd_towcp = iconv_open ("WCHAR_T", "UTF-8");
  if (cd_towcp == (iconv_t) -1)
    {
      fprintf (stderr, "iconv_open: %d <%s>\n", errno, strerror (errno));
      exit (1);
    }
  return cd_towcp;
}

void
run_iconv (iconv_t cd_towcp, char *input)
{
  wchar_t out[256];

  char *inbuf = input;
  size_t inbytesleft = strlen (inbuf);
  char *outbuf = (char *) out;
  size_t outbytesleft = sizeof (out);
  size_t ret = iconv (cd_towcp, &inbuf, &inbytesleft, &outbuf, &outbytesleft);
  if (ret == (size_t) -1)
    fprintf (stderr, "iconv: %d <%s>\n", errno, strerror (errno));
  printf ("in = <%s>, inbuf = <%s>, inbytesleft = %zu, outbytesleft = %zu\n",
	  input, inbuf, inbytesleft, outbytesleft);
}

int
main ()
{
  iconv_t cd_towcp;
  char *finnish = "Liian pitk\303\244 sana";  // Umlaut-a
  
  setlocale (LC_ALL, "C");
  cd_towcp = open_iconv ();
  setlocale (LC_ALL, "C");
  run_iconv (cd_towcp, finnish);
  setlocale (LC_ALL, "C.UTF-8");
  run_iconv (cd_towcp, finnish);
  iconv_close (cd_towcp);
  
  setlocale (LC_ALL, "C.UTF-8");
  cd_towcp = open_iconv ();
  setlocale (LC_ALL, "C");
  run_iconv (cd_towcp, finnish);
  setlocale (LC_ALL, "C.UTF-8");
  run_iconv (cd_towcp, finnish);
  iconv_close (cd_towcp);

  return 0;
}
==== SNAP ====

Here are the important details:

- The input string is a fixed Finnish UTF-8 phrase containing a
  single non-ASCII character.

- The testcase always calls setlocale before calling iconv_open(),
  then calls setlocale again before calling iconv().

- So the application tries to convert a UTF-8 string to WCHAR_T in four
  combinations of the current locale, in this order:

  - iconv_open "C",       iconv "C"
  - iconv_open "C",       iconv "C.UTF-8"
  - iconv_open "C.UTF-8", iconv "C"
  - iconv_open "C.UTF-8", iconv "C.UTF-8"

Here's what happens on Linux:

  $ gcc -g -o ic ic.c
  $ ./ic
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 960

Here's what happens on Cygwin:

  $ gcc -g -o ic ic.c -liconv
  $ ./ic
  iconv: 138 <Invalid or incomplete multibyte or wide character>
  in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
  iconv: 138 <Invalid or incomplete multibyte or wide character>
  in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
  iconv: 138 <Invalid or incomplete multibyte or wide character>
  in = <Liian pitkä sana>, inbuf = <ä sana>, inbytesleft = 7, outbytesleft = 492
  in = <Liian pitkä sana>, inbuf = <>, inbytesleft = 0, outbytesleft = 480

So, AFAICS, there are two problems:

  - Even though iconv_open has been called explicitly with "UTF-8" as
    the input codeset, the conversion still depends on the current
    application codeset.  That doesn't make sense.

  - Even though the last parameter to iconv is defined in bytes, the
    value of outbytesleft after the conversion is the number of remaining
    wchar_t's, not the number of remaining bytes.  That's contrary to what
    POSIX defines, see
    http://pubs.opengroup.org/onlinepubs/9699919799/functions/iconv.html
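
Since the second point hinges on bytes versus wide characters, it may
help to work out what a byte count would look like for a given
sizeof (wchar_t).  The following is only an arithmetic sketch, not part
of the testcase.  utf8_char_count and expected_outbytesleft are
hypothetical helper names, and the sketch assumes valid UTF-8 input and
exactly one wchar_t per character (which should hold for this string,
since it has no characters outside the BMP):

```c
#include <assert.h>
#include <stddef.h>

/* Count the characters in a UTF-8 string by skipping continuation
   bytes (bit pattern 10xxxxxx).  Assumes the input is valid UTF-8.  */
size_t
utf8_char_count (const char *s)
{
  size_t n = 0;
  for (; *s != '\0'; ++s)
    if (((unsigned char) *s & 0xC0) != 0x80)
      ++n;
  return n;
}

/* Bytes that should remain in an output buffer of out_size bytes if
   each of nchars input characters became exactly one wide character
   of wchar_size bytes.  This is an assumption of the sketch, not a
   statement about what any iconv implementation does.  */
size_t
expected_outbytesleft (size_t out_size, size_t nchars, size_t wchar_size)
{
  assert (nchars * wchar_size <= out_size);
  return out_size - nchars * wchar_size;
}
```

The test string has 16 characters, and out is declared as wchar_t[256].
With a 4-byte wchar_t, expected_outbytesleft (1024, 16, 4) gives 960;
with a 2-byte wchar_t, expected_outbytesleft (512, 16, 2) gives 480.  So
the Cygwin figures are worth comparing against sizeof (wchar_t) on that
platform before concluding which unit is being reported.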

Is this analysis correct?  Is there by any chance a newer version of
libiconv2 which does not have these problems?


Thanks,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat
