This is the mail archive of the cygwin mailing list for the Cygwin project.
Re: Bug in collation functions?
- From: Ken Brown <kbrown at cornell dot edu>
- To: cygwin at cygwin dot com
- Date: Thu, 29 Oct 2015 18:21:28 -0400
- Subject: Re: Bug in collation functions?
- Authentication-results: sourceware.org; auth=none
- References: <563148AF dot 1000502 at cornell dot edu> <5631996D dot 7040908 at redhat dot com> <20151029075050 dot GE5319 at calimero dot vinschen dot de> <20151029083057 dot GH5319 at calimero dot vinschen dot de> <56321815 dot 7000203 at cornell dot edu> <20151029153516 dot GJ5319 at calimero dot vinschen dot de> <56323F2E dot 4030807 at cornell dot edu> <56324598 dot 9060604 at cornell dot edu> <56324E82 dot 7000402 at redhat dot com> <563268A4 dot 6000005 at cornell dot edu> <56329462 dot 2090206 at cornell dot edu>
On 10/29/2015 5:49 PM, Ken Brown wrote:
On 10/29/2015 2:42 PM, Ken Brown wrote:
On 10/29/2015 12:51 PM, Eric Blake wrote:
On 10/29/2015 10:13 AM, Ken Brown wrote:
Never mind. My test case was flawed, because it didn't check for the
possibility that wcscoll might return 0. Here's a revised definition of
the "compare" function:
void
compare (const wchar_t *a, const wchar_t *b, const char *loc)
{
  setlocale (LC_COLLATE, loc);
  int res = wcscoll (a, b);
  char c = res < 0 ? '<' : res > 0 ? '>' : '=';
  printf ("\"%ls\" %c \"%ls\" in %s locale\n", a, c, b, loc);
}
With this change (and the use of NORM_IGNORESYMBOLS) the test returns
the following on Cygwin:
$ ./wcscoll_test
"11" > "1.1" in POSIX locale
"11" = "1.1" in en_US.UTF-8 locale
"11" > "1 2" in POSIX locale
"11" < "1 2" in en_US.UTF-8 locale
It still differs from Linux, but it's good enough to make the emacs test
pass. Moreover, this behavior actually seems more reasonable to me than
the Linux behavior. After all, if you're ignoring punctuation, how can
you decide which of "11" or "1.1" comes first?
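For anyone who wants to check the three-way result without parsing printed
output, here is a minimal sketch of the same idea as the revised "compare"
function. The name collate_sign is hypothetical and is not part of the
original test case; it also assumes that a locale which cannot be set
should be reported rather than silently ignored.

```c
#include <assert.h>
#include <locale.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper: return '<', '=' or '>' for a versus b under the
   collation rules of locale loc, or '?' if that locale is unavailable.  */
char
collate_sign (const wchar_t *a, const wchar_t *b, const char *loc)
{
  assert (a != NULL && b != NULL && loc != NULL);
  if (setlocale (LC_COLLATE, loc) == NULL)
    return '?';
  int res = wcscoll (a, b);
  /* Treat 0 as its own case; the earlier test overlooked it.  */
  return res < 0 ? '<' : res > 0 ? '>' : '=';
}
```

Calling collate_sign (L"11", L"1.1", "en_US.UTF-8") is where platforms may
disagree; the "C" locale is the portable baseline.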
Careful. POSIX is proposing wording that says that normal locales
should always implement a fallback of last resort (and that locales that
do not do so should have a special name including '@', to make it
obvious). It is not standardized yet, but worth thinking about.
http://austingroupbugs.net/view.php?id=938
http://austingroupbugs.net/view.php?id=963
The intent of that wording is that if ignoring punctuation could cause
two strings to otherwise compare equal, the fallback of a total ordering
on all characters means that the final result of strcoll() will not be 0
unless the two strings are identical.
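A rough sketch of that idea, under loud assumptions: the function names are
hypothetical, the "ignore punctuation" pass is a simple stand-in that skips
punctuation and whitespace, and the code point comparison via wcscmp is only
one possible total ordering. This is not what glibc or the proposed POSIX
wording specifies in detail.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Assumption: "symbols" means punctuation and whitespace.  */
static int
is_ignored (wchar_t c)
{
  return iswpunct ((wint_t) c) || iswspace ((wint_t) c);
}

/* Hypothetical first pass: compare while skipping ignored characters.  */
int
cmp_ignoring_symbols (const wchar_t *a, const wchar_t *b)
{
  assert (a != NULL && b != NULL);
  for (;;)
    {
      while (*a && is_ignored (*a))
        a++;
      while (*b && is_ignored (*b))
        b++;
      if (*a != *b || *a == L'\0')
        return (*a > *b) - (*a < *b);
      a++;
      b++;
    }
}

/* Fallback of last resort: a total ordering on all characters, so the
   result is 0 only when the two strings are identical.  */
int
cmp_total (const wchar_t *a, const wchar_t *b)
{
  int res = cmp_ignoring_symbols (a, b);
  if (res == 0)
    res = wcscmp (a, b);
  return res;
}
```

With this sketch, "11" and "1.1" tie in the first pass but no longer compare
equal once the fallback runs.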
In that case, I think Cygwin should start by using NORM_IGNORESYMBOLS in
non-POSIX locales, with the goal of eventually moving toward emulating
glibc. I don't know what fallback glibc uses or how hard it would be to
implement this on Cygwin.
I withdraw this suggestion. I took a look at the glibc code, and I
don't see any reasonable way for Cygwin to emulate it precisely. On the
other hand, I have an idea for a simple fallback. I'll play with it a
little and then submit a patch.
The fallback I had in mind is to sort the shorter string first if they
have different lengths and otherwise to revert to wcscmp. Using this, both
Cygwin and Linux give the following comparisons:
"11" > "1.1" in POSIX locale
"11" < "1.1" in en_US.UTF-8 locale
"11" > "1 2" in POSIX locale
"11" < "1 2" in en_US.UTF-8 locale
"1 1" < "1.1" in POSIX locale
"1 1" < "1.1" in en_US.UTF-8 locale
If this seems reasonable, I'll test it more extensively and then submit
a patch.
Ken
P.S. In case others want to test this in different locales, here's the
patch so far, just for wcscoll:
diff --git a/winsup/cygwin/nlsfuncs.cc b/winsup/cygwin/nlsfuncs.cc
index f7031f9..c33aa24 100644
--- a/winsup/cygwin/nlsfuncs.cc
+++ b/winsup/cygwin/nlsfuncs.cc
@@ -1156,10 +1156,15 @@ wcscoll (const wchar_t *__restrict ws1, const wchar_t *__restrict ws2)
if (!collate_lcid)
return wcscmp (ws1, ws2);
- ret = CompareStringW (collate_lcid, 0, ws1, -1, ws2, -1);
+ ret = CompareStringW (collate_lcid, NORM_IGNORESYMBOLS, ws1, -1, ws2, -1);
if (!ret)
set_errno (EINVAL);
- return ret - CSTR_EQUAL;
+ ret -= CSTR_EQUAL;
+ if (!ret)
+ ret = wcslen (ws1) - wcslen (ws2);
+ if (!ret)
+ ret = wcscmp (ws1, ws2);
+ return ret;
}
extern "C" int
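For testing the tie-breaking logic of the patch above away from Cygwin, the
fallback can be isolated in a small portable function. This is only a
sketch: apply_fallback is a hypothetical name, the primary result is passed
in by the caller instead of coming from CompareStringW, and the lengths are
compared explicitly rather than subtracted, to avoid converting an unsigned
size_t difference to int.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper, not Cygwin code: refine a primary collation
   result so that 0 is returned only for identical strings.  */
int
apply_fallback (int primary, const wchar_t *ws1, const wchar_t *ws2)
{
  assert (ws1 != NULL && ws2 != NULL);
  if (primary != 0)
    return primary;
  /* First tie-breaker: the shorter string sorts first.  */
  size_t len1 = wcslen (ws1), len2 = wcslen (ws2);
  if (len1 != len2)
    return len1 < len2 ? -1 : 1;
  /* Second tie-breaker: plain code point order.  */
  return wcscmp (ws1, ws2);
}
```

When the primary comparison ties (as it does for "11" versus "1.1" and for
"1 1" versus "1.1" once symbols are ignored), the first pair is decided by
length and the second by wcscmp.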