This is the mail archive of the cygwin mailing list for the Cygwin project.



Re: Bug in collation functions?


On 10/29/2015 5:49 PM, Ken Brown wrote:
On 10/29/2015 2:42 PM, Ken Brown wrote:
On 10/29/2015 12:51 PM, Eric Blake wrote:
On 10/29/2015 10:13 AM, Ken Brown wrote:

Never mind.  My test case was flawed, because it didn't check for the
possibility that wcscoll might return 0.  Here's a revised definition of
the "compare" function:

void
compare (const wchar_t *a, const wchar_t *b, const char *loc)
{
   setlocale (LC_COLLATE, loc);
   int res = wcscoll (a, b);
   char c = res < 0 ? '<' : res > 0 ? '>' : '=';
   printf ("\"%ls\" %c \"%ls\" in %s locale\n", a, c, b, loc);
}

With this change (and the use of NORM_IGNORESYMBOLS) the test returns
the following on Cygwin:

$ ./wcscoll_test
"11" > "1.1" in POSIX locale
"11" = "1.1" in en_US.UTF-8 locale
"11" > "1 2" in POSIX locale
"11" < "1 2" in en_US.UTF-8 locale

It still differs from Linux, but it's good enough to make the emacs test
pass.  Moreover, this behavior actually seems more reasonable to me than
the Linux behavior.  After all, if you're ignoring punctuation, how can
you decide which of "11" or "1.1" comes first?

Careful.  POSIX is proposing some wording that says that normal locales
should always implement a fallback of last resort (and that locales that
do not do so should have a special name including '@', to make it
obvious).  It is not standardized yet, but worth thinking about.

http://austingroupbugs.net/view.php?id=938
http://austingroupbugs.net/view.php?id=963

The intent of that wording is that if ignoring punctuation could cause
two strings to otherwise compare equal, the fallback of a total ordering
on all characters means that the final result of strcoll() will not be 0
unless the two strings are identical.
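
To make the intent concrete, the observable effect of such a last-resort fallback can be sketched in portable C as follows.  The wrapper name strcoll_total is hypothetical, and the proposed wording places the fallback inside the locale definition itself rather than in a caller-side wrapper, so this only illustrates the resulting behavior.

```c
#include <string.h>

/* Hypothetical wrapper: compare with the locale's collation rules
   first; if they consider the strings equal, break the tie with a
   byte-wise comparison.  The result is 0 only for identical strings.  */
static int
strcoll_total (const char *a, const char *b)
{
  int res = strcoll (a, b);
  if (res == 0)
    res = strcmp (a, b);
  return res;
}
```

With a wrapper like this, "11" and "1.1" can never compare equal, even in a locale whose collation ignores the period.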

In that case, I think Cygwin should start by using NORM_IGNORESYMBOLS in
non-POSIX locales, with the goal of eventually moving toward emulating
glibc.  I don't know what fallback glibc uses or how hard it would be to
implement this on Cygwin.

I withdraw this suggestion.  I took a look at the glibc code, and I
don't see any reasonable way for Cygwin to emulate it precisely.  On the
other hand, I have an idea for a simple fallback.  I'll play with it a
little and then submit a patch.

The fallback I had in mind is to sort the shorter string first if the two strings have different lengths, and otherwise to fall back on wcscmp. Using this, both Cygwin and Linux give the following comparisons:

"11" > "1.1" in POSIX locale
"11" < "1.1" in en_US.UTF-8 locale
"11" > "1 2" in POSIX locale
"11" < "1 2" in en_US.UTF-8 locale
"1 1" < "1.1" in POSIX locale
"1 1" < "1.1" in en_US.UTF-8 locale

If this seems reasonable, I'll test it more extensively and then submit a patch.
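
Written out as a standalone function, the tie-breaking order described above looks roughly like this.  It is a sketch that wraps the standard wcscoll rather than CompareStringW, and the name wcscoll_fallback is made up for illustration; lengths are compared explicitly instead of subtracted, to avoid mixing size_t and int.

```c
#include <wchar.h>

/* Hypothetical illustration of the proposed fallback order:
   1. the locale's collation (wcscoll),
   2. on a tie, the shorter string sorts first,
   3. on a further tie, plain code-point order (wcscmp).  */
static int
wcscoll_fallback (const wchar_t *a, const wchar_t *b)
{
  int res = wcscoll (a, b);
  if (res == 0)
    {
      size_t la = wcslen (a), lb = wcslen (b);
      if (la != lb)
        res = la < lb ? -1 : 1;
      else
        res = wcscmp (a, b);
    }
  return res;
}
```

Steps 2 and 3 only come into play when the locale's collation reports a tie, which is the case for strings that differ only in ignored symbols.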

Ken

P.S. In case others want to test this in different locales, here's the patch so far, just for wcscoll:

diff --git a/winsup/cygwin/nlsfuncs.cc b/winsup/cygwin/nlsfuncs.cc
index f7031f9..c33aa24 100644
--- a/winsup/cygwin/nlsfuncs.cc
+++ b/winsup/cygwin/nlsfuncs.cc
@@ -1156,10 +1156,15 @@ wcscoll (const wchar_t *__restrict ws1, const wchar_t *__restrict ws2)

   if (!collate_lcid)
     return wcscmp (ws1, ws2);
-  ret = CompareStringW (collate_lcid, 0, ws1, -1, ws2, -1);
+  ret = CompareStringW (collate_lcid, NORM_IGNORESYMBOLS, ws1, -1, ws2, -1);
   if (!ret)
     set_errno (EINVAL);
-  return ret - CSTR_EQUAL;
+  ret -= CSTR_EQUAL;
+  if (!ret)
+    ret = wcslen (ws1) - wcslen (ws2);
+  if (!ret)
+    ret = wcscmp (ws1, ws2);
+  return ret;
 }

 extern "C" int


--
Problem reports:       http://cygwin.com/problems.html
FAQ:                   http://cygwin.com/faq/
Documentation:         http://cygwin.com/docs.html
Unsubscribe info:      http://cygwin.com/ml/#unsubscribe-simple

