This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: Bug in libiconv?


Hi Bruno,

On Feb  2 19:58, Bruno Haible wrote:
> [resent to the cygwin list; please add bug-gnu-libiconv to your replies]

Done.

> Hi Corinna,
> 
> Thanks for your reply <http://cygwin.com/ml/cygwin/2011-01/msg00410.html>
> 
> > > Please CC the bug-gnu-libiconv mailing list when discussing possible
> > > bugs in GNU libiconv.
> >
> > Ok
> 
> Thanks for giving it a try. But although you CCed bug-gnu-libiconv, your message
> did not reach the list (but Charles' one and Eric's one did). I guess this is
> because the cygwin.com mail server refuses to deliver to corinna-cygwin,
> therefore the spam detection at gnu.org recognized your sending address as a
> spammer's one. This makes it hard for me to detect that you replied to me,
> since I'm not reading the cygwin mailing list on a regular basis.

Uh, too bad.  Sorry about that.  I changed to my Red Hat email address
for this discussion.

> > I've put a lot of effort in 2009 and early 2010 to make the wchar_t
> > representation in Cygwin and newlib as much Unicode 5.2 compatible as
> > possible.  Even the wcrtomb and mbrtowc functions in newlib are capable
> > of dealing with UTF-16 surrogates.
> 
> I appreciate your effort on internationalization of Cygwin. You went as
> far as you could get with the given choice of wchar_t. It's just a fact
> that the <wctype.h> functions and wcwidth() cannot work right when wchar_t[]
> is UTF-16. And these functions are the only reasons why gnulib and coreutils
> code uses wide characters strings at all.

Well, as for the wctype functions you see how easy it is to convert
to wint_t and use that as input.  As for wcwidth, you're right.  However,
in Cygwin/newlib there's the wcswidth function which actually converts the
input string to wint_t type characters including surrogate handling and
then calls an internal __wcwidth function which works on wint_t types.
So there is a way to handle this stuff by just using standard functions,
and it isn't even overly complicated.
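
To illustrate the surrogate handling mentioned above, here is a minimal
sketch, not the newlib implementation.  The function name utf16_decode is
made up for this example.  It combines a UTF-16 surrogate pair into a single
code point, which could then be passed on as a wint_t on a platform where
wint_t is wide enough to hold it:

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from a UTF-16 sequence of n code units.
   Stores the code point in *cp and returns the number of units consumed
   (1 or 2), or 0 if the input is empty.  An unpaired surrogate is passed
   through unchanged as a single unit.  (Hypothetical helper.) */
static size_t utf16_decode(const uint16_t *s, size_t n, uint32_t *cp)
{
    if (n == 0)
        return 0;
    if (s[0] >= 0xD800 && s[0] <= 0xDBFF      /* high surrogate */
        && n >= 2
        && s[1] >= 0xDC00 && s[1] <= 0xDFFF)  /* low surrogate */
    {
        *cp = 0x10000 + (((uint32_t)(s[0] - 0xD800) << 10)
                         | (uint32_t)(s[1] - 0xDC00));
        return 2;
    }
    *cp = s[0];
    return 1;
}
```

A caller would loop over the string, advancing by the returned count, and
hand each decoded value to the classification or width function.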

> I'm not criticizing the Cygwin choice. Even if Cygwin had chosen to define
> 'wchar_t' to a 32-bit type, the same problem would have remained for mingw
> programs running in UTF-8 or GB18030 locales. (I understand that such
> locales exist in Windows 7.)

Right.  However, GB18030 is not supported by Cygwin.

> > ...
> > I *don't* understand that you do the same for Win32.  Old
> > Windows versions are using the basic UCS-2 character plane, but newer
> > versions, at least since Windows XP are using UTF-16.
> 
> Thank you for this remark. I have corrected this in libiconv, and also
> added support for Cygwin >= 1.7 at the same place.

Thanks!

> > > > the application tests to convert a UTF-8 to WCHAR_T string in four
> > > >   combinations of the current locale, in this order:
> > > > 
> > > >   - iconv_open "C",       iconv "C"
> > > >   - iconv_open "C",       iconv "C.UTF-8"
> > > >   - iconv_open "C.UTF-8", iconv "C"
> > > >   - iconv_open "C.UTF-8", iconv "C.UTF-8"
> > ...
> > My testcase is a result of trying
> > to build a real-life application, gencat from glibc.  For some reason
> > gencat thinks it has to set the locale back to "C" in a hardcoded manner.
> > 
> > This works fine for glibc systems, but the invisible and, IMHO,
> > intransparent behaviour of libiconv on other systems makes it pretty
> > hard to understand the behaviour of an application when porting it.
> 
> I don't see this as a particular "intransparent behaviour of libiconv".
> When taking code that was tested only in a single environment (glibc in this
> case), you always have to make some effort to make it portable.

Oh, I meant my gencat experience just as an example.  IMHO this behaviour
is intransparent, no matter what you're trying to port, and where from
you're taking it.

I mean, if you're trying to call iconv for a conversion from some
codeset A to a codeset B, which are both explicitly mentioned when
calling iconv_open, then it is intransparent behaviour that the
conversion fails because you called setlocale with a codeset C.  There
is no apparent connection between the two actions.  The conversion from
A to B could be required for a file operation, while C is the CLI or GUI
charset.  Do you see what I mean?
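
A small sketch of the expectation described above: both codesets are named
in the iconv_open call, so the result should not depend on what setlocale
was called with.  The helper names are invented for this example, and
"UTF-16LE" is used as an explicit target purely for illustration instead of
"WCHAR_T":

```c
#include <iconv.h>
#include <locale.h>
#include <stddef.h>

/* Convert inlen bytes of UTF-8 to UTF-16LE.  Returns the number of bytes
   written to out, or (size_t)-1 on failure.  (Hypothetical helper.) */
static size_t utf8_to_utf16le(const char *in, size_t inlen,
                              char *out, size_t outsize)
{
    iconv_t cd = iconv_open("UTF-16LE", "UTF-8");  /* to, from */
    char *inp = (char *)in;   /* iconv takes char **, but only reads input */
    char *outp = out;
    size_t inleft = inlen, outleft = outsize;
    size_t rc;

    if (cd == (iconv_t)-1)
        return (size_t)-1;
    rc = iconv(cd, &inp, &inleft, &outp, &outleft);
    iconv_close(cd);
    return rc == (size_t)-1 ? (size_t)-1 : outsize - outleft;
}

/* Same conversion, but after switching the process locale first.  The
   locale is the "codeset C" of the argument above and should be
   irrelevant to the outcome. */
static size_t convert_under_locale(const char *locale, const char *in,
                                   size_t inlen, char *out, size_t outsize)
{
    setlocale(LC_ALL, locale);  /* return value deliberately ignored */
    return utf8_to_utf16le(in, inlen, out, outsize);
}
```

With an iconv that behaves as argued here, calling convert_under_locale
with "C" and with a UTF-8 locale yields the same bytes.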

> > > Is cygwin_conv_to_posix_path deprecated? Does it introduce limitations of
> > > some kind?
> >
> > Like the underlying Windows functions, Cygwin 1.7 now supports paths of
> > up to 32K chars.  The old cygwin_conv_to_posix_path function and its
> > friends are written with the Windows ANSI API in mind, so they only
> > support paths of up to MAX_PATH == 260 chars.
> 
> Thanks for explaining. I'll try to avoid this function.

There should be no reason to call cygwin_conv_path functions, unless you
have a direct interaction with native Win32 functions.  So you can most
easily avoid using them at all by using the relocation technique from
Linux, utilizing /proc/self/maps, which in turn drops the requirement for
the DllMain function.
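
As a rough sketch of the /proc/self/maps idea (this is not the actual
relocation code of any package; the function name is made up, and the line
layout "start-end perms offset dev inode path" is assumed to be the Linux
one), the following looks up which file backs the mapping that contains a
given address:

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Find the file backing the mapping that contains addr by scanning
   /proc/self/maps.  Returns 0 and copies the path into out on success,
   -1 if the file cannot be read or the mapping has no path.
   (Hypothetical helper.) */
static int module_path_for_address(const void *addr, char *out,
                                   size_t outsize)
{
    FILE *fp = fopen("/proc/self/maps", "r");
    char line[4096];
    int found = -1;

    if (fp == NULL)
        return -1;
    while (found != 0 && fgets(line, sizeof line, fp) != NULL) {
        unsigned long long start, end;
        int path_at = 0;

        /* %n records where the path column starts, if the line has one. */
        if (sscanf(line, "%llx-%llx %*s %*s %*s %*s %n",
                   &start, &end, &path_at) < 2)
            continue;
        if ((uintptr_t)addr < start || (uintptr_t)addr >= end)
            continue;
        if (path_at <= 0 || line[path_at] == '\0')
            break;                          /* anonymous mapping */
        line[strcspn(line, "\n")] = '\0';   /* strip trailing newline */
        snprintf(out, outsize, "%s", line + path_at);
        found = 0;
    }
    fclose(fp);
    return found;
}
```

Passing the address of a function in the library or program gives the
installed location of that module, from which a relocated prefix could be
derived.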

> > > > The usage of a fixed table instead of the charset.alias file in
> > > > libcharset/lib/localcharset.c, function get_charset_aliases() is
> > > > not good, not good at all.
> > > 
> > > The alternative is to have this table stored in a file charset.alias;
> > > but then every package that includes the module 'localcharset' from
> > > gnulib (that is, libiconv, gettext, coreutils, and many others) will
> > > want to modify this file during "make install". And this causes a lot of
> > > headaches to packaging systems. Therefore, on platforms which have
> > > widely used packaging systems (Linux, MacOS X, Cygwin), it's better to
> > > avoid the need for this file.
> > 
> > Now I'm puzzled.  If that's the case, why does libiconv request the
> > charset.alias file on *any* other system than DARWIN7, VMS, and Windows?
> > Especially on Linux?
> 
> I "optimized" only the MacOS X, VMS, and Windows OSes. It would have been
> more work to optimize all versions of Solaris, FreeBSD, AIX, etc. in the
> same way.
> 
> charset.alias is requested on Linux, even though it normally does not exist,
> so that packagers and users have a chance to modify the behaviour.

Please keep this choice available to Cygwin users, too.  The file will be
empty by default there as well.  The supported codesets are documented in
http://cygwin.com/cygwin-ug-net/setup-locale.html#setup-locale-charsetlist
If some weird alias is required, the user can add it to charset.alias.
That's the optimal solution.
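
To make the charset.alias idea concrete, here is a minimal sketch of such a
lookup.  It is not the localcharset.c code.  It assumes the usual layout of
one "alias canonical-name" pair per line with '#' starting a comment; the
function name and the sample entries used with it are hypothetical:

```c
#include <stdio.h>
#include <string.h>

/* Resolve codeset through alias-table text.  Each non-comment line holds
   "<alias> <canonical>".  Returns 1 and copies the canonical name into out
   if an alias matches; otherwise copies codeset unchanged and returns 0.
   (Hypothetical helper.) */
static int resolve_charset(const char *table, const char *codeset,
                           char *out, size_t outsize)
{
    const char *p = table;

    while (*p != '\0') {
        char line[160], alias[64], canonical[64];
        size_t len = strcspn(p, "\n");

        if (len < sizeof line) {
            memcpy(line, p, len);
            line[len] = '\0';
            /* Skip blank lines, comments, and lines with only one field. */
            if (sscanf(line, "%63s %63s", alias, canonical) == 2
                && alias[0] != '#'
                && strcmp(alias, codeset) == 0) {
                snprintf(out, outsize, "%s", canonical);
                return 1;
            }
        }
        p += len;
        if (*p == '\n')
            p++;
    }
    snprintf(out, outsize, "%s", codeset);
    return 0;
}
```

With an empty table every name passes through unchanged, which matches the
default described above; a user who needs an unusual alias adds one line.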

> Even if Cygwin/newlib handles Windows codepage aliases in all places where
> it matters for Cygwin, there are still places where it matters for gnulib,
> coreutils, gettext.

Since gnulib, coreutils and gettext are ported to Cygwin anyway, the
ported versions should live happily in the Cygwin world.  They get what
the system defines, and the system is Cygwin, not Windows.  Everything
else can be added to charset.alias, if required.

> > > Neither libiconv nor gettext defines or undefines _WIN32 or __WIN32__.
> > > But they are prepared to either setting.
> >
> > Isn't that just covering a PEBKAC?  I mean, there's no good reason to
> > define -mwin32 on the command line and the libiconv configure certainly
> > doesn't add it.  Whoever squeezed a -mwin32 onto the GCC command line,
> > or even defined -D__WIN32__ manually, deserves the result.
> 
> But such a user will then write a mail to a mailing list, and it will take
> time for me (or someone else) to investigate and answer it. By writing
>   #if (defined _WIN32 || defined __WIN32__) && !defined __CYGWIN__
> I avoid this potential problem.

Ok.  However, the other variation

   #if defined _WIN32 || defined __WIN32__ || defined __CYGWIN__

should only be used in very rare circumstances.  Usually it just means
that some unnecessary Windowism is used on Cygwin, and that there's
probably a POSIXy equivalent.  If not, kick us here on the list and
we can discuss it.
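
For illustration, a sketch of how the first form of the guard keeps Cygwin
on the POSIX branch.  NATIVE_WINDOWS and the two functions are names
invented for this example, not taken from libiconv or gettext:

```c
#include <string.h>

/* True only for native Windows builds; Cygwin is deliberately excluded
   even if _WIN32 or __WIN32__ happens to be defined (e.g. via -mwin32). */
#if (defined _WIN32 || defined __WIN32__) && !defined __CYGWIN__
# define NATIVE_WINDOWS 1
#else
# define NATIVE_WINDOWS 0
#endif

/* Characters accepted as directory separators on this platform. */
static const char *directory_separators(void)
{
#if NATIVE_WINDOWS
    return "\\/";
#else
    return "/";   /* POSIX systems, including Cygwin */
#endif
}

/* Returns nonzero if c separates path components. */
static int is_dir_separator(char c)
{
    return c != '\0' && strchr(directory_separators(), c) != NULL;
}
```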

> Thanks again for your reply and for the hint to the bug in libiconv's code.

You're welcome and thanks for this fruitful discussion.  I'm glad if we
can find a well-working compromise for some of the problems, especially
in the unfortunate UTF-16 case.


Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat


