This is the mail archive of the cygwin mailing list for the Cygwin project.
Re: Bug in collation functions?
- From: Ken Brown <kbrown at cornell dot edu>
- To: cygwin at cygwin dot com
- Date: Thu, 29 Oct 2015 14:42:44 -0400
- Subject: Re: Bug in collation functions?
- References: <563148AF dot 1000502 at cornell dot edu> <5631996D dot 7040908 at redhat dot com> <20151029075050 dot GE5319 at calimero dot vinschen dot de> <20151029083057 dot GH5319 at calimero dot vinschen dot de> <56321815 dot 7000203 at cornell dot edu> <20151029153516 dot GJ5319 at calimero dot vinschen dot de> <56323F2E dot 4030807 at cornell dot edu> <56324598 dot 9060604 at cornell dot edu> <56324E82 dot 7000402 at redhat dot com>
On 10/29/2015 12:51 PM, Eric Blake wrote:
On 10/29/2015 10:13 AM, Ken Brown wrote:
Never mind. My test case was flawed, because it didn't check for the
possibility that wcscoll might return 0. Here's a revised definition of
the "compare" function:
void
compare (const wchar_t *a, const wchar_t *b, const char *loc)
{
  setlocale (LC_COLLATE, loc);
  int res = wcscoll (a, b);
  char c = res < 0 ? '<' : res > 0 ? '>' : '=';
  printf ("\"%ls\" %c \"%ls\" in %s locale\n", a, c, b, loc);
}
With this change (and the use of NORM_IGNORESYMBOLS) the test returns
the following on Cygwin:
$ ./wcscoll_test
"11" > "1.1" in POSIX locale
"11" = "1.1" in en_US.UTF-8 locale
"11" > "1 2" in POSIX locale
"11" < "1 2" in en_US.UTF-8 locale
It still differs from Linux, but it's good enough to make the emacs test
pass. Moreover, this behavior actually seems more reasonable to me than
the Linux behavior. After all, if you're ignoring punctuation, how can
you decide which of "11" or "1.1" comes first?
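To illustrate why ignoring symbols produces the "=" result above, here is a minimal sketch of a comparison that skips punctuation and white space and compares the remaining characters by code point. The names skip_symbols and compare_ignoring_symbols are hypothetical, and this is not how NORM_IGNORESYMBOLS is implemented on Windows; it only mimics the visible effect for these test strings.

```c
#include <assert.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: advance past characters treated as "symbols"
   in this sketch (punctuation and white space).  */
static const wchar_t *
skip_symbols (const wchar_t *s)
{
  while (*s != L'\0'
         && (iswpunct ((wint_t) *s) || iswspace ((wint_t) *s)))
    s++;
  return s;
}

/* Hypothetical helper: compare a and b by code point while ignoring
   symbols.  Returns a negative, zero, or positive value.  */
int
compare_ignoring_symbols (const wchar_t *a, const wchar_t *b)
{
  assert (a != NULL && b != NULL);
  for (;;)
    {
      a = skip_symbols (a);
      b = skip_symbols (b);
      /* Stop at the first difference or at the end of both strings.  */
      if (*a != *b || *a == L'\0')
        return (*a > *b) - (*a < *b);
      a++;
      b++;
    }
}
```

With this sketch, "11" and "1.1" compare equal because the "." is skipped, while "11" sorts before "1 2" because the strings reduce to "11" and "12".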
Careful. POSIX is proposing some wording that says that normal locales
should always implement a fallback of last resort (and that locales that
do not do so should have a special name including '@', to make it
obvious). It is not standardized yet, but worth thinking about.
http://austingroupbugs.net/view.php?id=938
http://austingroupbugs.net/view.php?id=963
The intent of that wording is that if ignoring punctuation could cause
two strings to otherwise compare equal, the fallback of a total ordering
on all characters means that the final result of strcoll() will not be 0
unless the two strings are identical.
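The idea of a fallback of last resort can be sketched in user code as follows. The name wcscoll_total is hypothetical, and using wcscmp as the tie-breaker is an assumption made for illustration; the POSIX proposals and glibc may define the fallback ordering differently.

```c
#include <assert.h>
#include <wchar.h>

/* Hypothetical helper, not part of Cygwin or glibc: collate with the
   current locale first, then break ties by code point so that only
   identical strings compare equal.  */
int
wcscoll_total (const wchar_t *a, const wchar_t *b)
{
  assert (a != NULL && b != NULL);
  int res = wcscoll (a, b);
  if (res == 0)
    res = wcscmp (a, b);	/* Assumed fallback of last resort.  */
  return res;
}
```

With such a wrapper, a locale in which wcscoll reports "11" = "1.1" would still yield a nonzero result for that pair, while wcscoll_total (s, s) remains 0.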
In that case, I think Cygwin should start by using NORM_IGNORESYMBOLS in
non-POSIX locales, with the goal of eventually moving toward emulating
glibc. I don't know what fallback glibc uses or how hard it would be to
implement this on Cygwin.
Here's a tangentially related issue, also motivated by a failing emacs
test: Should setlocale return null to indicate an error if it's given an
invalid locale name? This happens on Linux but not on Cygwin, as the
following modified test case shows:
$ cat wcscoll_test.c
#include <wchar.h>
#include <stdio.h>
#include <locale.h>

/* Compare a and b with wcscoll in locale loc and print the result,
   or report that the locale could not be set.  */
void
compare (const wchar_t *a, const wchar_t *b, const char *loc)
{
  if (! setlocale (LC_COLLATE, loc))
    printf ("Unable to set locale to %s\n", loc);
  else
    {
      int res = wcscoll (a, b);
      char c = res < 0 ? '<' : res > 0 ? '>' : '=';
      printf ("\"%ls\" %c \"%ls\" in %s locale\n", a, c, b, loc);
    }
}

int
main (void)
{
  compare (L"11", L"1.1", "POSIX");
  compare (L"11", L"1.1", "en_US.UTF-8");
  compare (L"11", L"1 2", "POSIX");
  compare (L"11", L"1 2", "en_US.UTF-8");
  /* Intentionally invalid locale name.  */
  compare (L"11", L"1 2", "en_DE.UTF-8");
  return 0;
}
On Cygwin (with NORM_IGNORESYMBOLS), the output is
"11" > "1.1" in POSIX locale
"11" = "1.1" in en_US.UTF-8 locale
"11" > "1 2" in POSIX locale
"11" < "1 2" in en_US.UTF-8 locale
"11" < "1 2" in en_DE.UTF-8 locale
but on Linux it is
"11" > "1.1" in POSIX locale
"11" < "1.1" in en_US.UTF-8 locale
"11" > "1 2" in POSIX locale
"11" < "1 2" in en_US.UTF-8 locale
Unable to set locale to en_DE.UTF-8
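For code that has to run on both systems, one defensive pattern is to check the return value of setlocale and restore the previous locale afterwards. The following is a minimal sketch; the name wcscoll_in_locale and its return convention are hypothetical. Note that it can only detect invalid names on systems where setlocale actually returns null for them, which per the output above is not currently the case on Cygwin.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: collate a and b under locale loc, then restore
   the previous LC_COLLATE setting.  Stores the wcscoll result in *res
   and returns 0 on success; returns -1 if the locale cannot be saved
   or set.  */
int
wcscoll_in_locale (const wchar_t *a, const wchar_t *b,
                   const char *loc, int *res)
{
  assert (a != NULL && b != NULL && loc != NULL && res != NULL);

  /* Query the current locale.  Copy the string, because a later
     setlocale call may overwrite it.  */
  const char *cur = setlocale (LC_COLLATE, NULL);
  char *saved = NULL;
  if (cur != NULL)
    {
      saved = malloc (strlen (cur) + 1);
      if (saved == NULL)
        return -1;
      strcpy (saved, cur);
    }

  if (setlocale (LC_COLLATE, loc) == NULL)
    {
      /* The locale name was rejected; nothing has changed.  */
      free (saved);
      return -1;
    }

  *res = wcscoll (a, b);

  if (saved != NULL)
    setlocale (LC_COLLATE, saved);
  free (saved);
  return 0;
}
```

A caller would treat a return value of -1 the same way the test case treats a null return from setlocale, for example by printing "Unable to set locale" and skipping the comparison.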
Ken