This is the mail archive of the libc-help@sourceware.org mailing list for the glibc project.



Re: [PATCH] wcscoll/strcoll needs to check if the char is out of the locale collation


On (12/03/2009 03:43 AM), Carlos O'Donell-san wrote:
On Wed, Dec 2, 2009 at 12:32 PM, Takao Fujiwara <tfujiwar@redhat.com> wrote:
You need to define correctness in terms of a standard or existing
practice in some other C library. What do C libraries on other
operating systems do?

I'm not sure I understand the question. As I quoted in the original mail, the manpage explains the behavior, and the manpage is also available at http://opengroup.org. Unfortunately I could not find a specification for collation between characters of different languages, but I have tested some characters on UNIX and Windows, and the collation result looks good to me everywhere except GNU libc.

On UNIX, some of the non-UTF-8 collation is not clear to me, but the UTF-8 collation
is good in the attached test case at least. So errno might be a good
solution for non-UTF-8, since glibc/localedata/locales/ja_JP is used not only for
ja_JP.UTF-8 but also for ja_JP.eucJP and ja_JP.SJIS.

This is better. You should list which other operating systems you tested on and found the results of the sort to be satisfactory.

Solaris wcscoll() on ja_JP.UTF-8:
U+00CD > (4336) U+65E5
U+00CD < (-47699) U+D55C
U+0041 < (-2926) U+65E5
U+65E5 < (-52035) U+D55C

U+0041 < U+65E5 < U+00CD < U+D55C

Windows ja:
U+00CD < (-1) U+65E5
U+0041 < (-1) U+D55C
U+00CD > (1) U+0041
U+0041 < (-1) U+65E5
U+D55C > (1) U+65E5
U+0041 < (-1) U+00CD

U+0041 < U+00CD < U+65E5 < U+D55C

java.text.Collator:
U+00CD > (1) U+0041
U+65E5 > (1) U+00CD
U+D55C > (1) U+65E5

U+0041 < U+00CD < U+65E5 < U+D55C



I'm attaching the test program (a.c) to this mail to explain the problem.
It compares Korean characters and ASCII characters.
If you run the test program under ja_JP.UTF-8, wcscoll() returns "Korean
chars < ASCII chars".
But I would expect "Korean chars > ASCII chars" under ja_JP.UTF-8.
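
For reference, a minimal sketch of such a test follows (the actual attached a.c may differ in detail; it assumes the ja_JP.UTF-8 locale is installed):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

/* Print the wcscoll() ordering of the first characters of a and b.  */
static void compare (const wchar_t *a, const wchar_t *b)
{
  int r = wcscoll (a, b);
  printf ("U+%04X %c (%d) U+%04X\n",
          (unsigned) a[0], r < 0 ? '<' : (r > 0 ? '>' : '='), r,
          (unsigned) b[0]);
}

int main (void)
{
  if (setlocale (LC_ALL, "ja_JP.UTF-8") == NULL)
    {
      fprintf (stderr, "ja_JP.UTF-8 locale not available\n");
      return 1;
    }
  compare (L"\uD55C", L"\u0041");  /* Korean syllable vs. ASCII 'A' */
  compare (L"\uD55C", L"\u65E5");  /* Korean syllable vs. Japanese ideograph */
  compare (L"\u0041", L"\u65E5");  /* ASCII 'A' vs. Japanese ideograph */
  return 0;
}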

You need to explain why you expect this, is it done this way on another system?

My main reason is that if the result does not match this expectation, the sorted strings look ugly in a GUI (e.g. the GDM language name list dialog). The current behavior is that some non-Latin characters are collated before Latin characters, all Japanese characters are collated after Latin characters, and other characters are collated after all the Japanese characters. I think the sort needs both a common collation and a locale-specific collation, like UCA or ICU.

This is a good explanation of the problem. It should come first in your post.


My test case works this way on UNIX and MS-Windows.

Which UNIX?

Solaris




Then I would think the returned value is not defined by the ja_JP collation,
and I thought setting errno would be good if the character is not defined in
the collation.
E.g. glibc/localedata/locales/ja_JP LC_COLLATE doesn't include U+D55C, so
I think the ja_JP.UTF-8 collation table doesn't contain all UTF-8 characters.

I don't understand this paragraph, perhaps you could expand the explanation and take the reader through the logic?

If you have a look at glibc/localedata/locales/ja_JP, the Japanese LC_COLLATE is defined in the lines between order_start forward and order_end. Almost all Japanese character collations are defined in the ja_JP file, but no Korean character collations are found there, e.g. U+D55C. wcscoll() uses __collidx_table_lookup() for the collation data, and __collidx_table_lookup() returns 0 for U+D55C; I think the reason is that the code point is not defined in the ja_JP file. This means I think the current ja_JP collation doesn't include all UTF-8 characters under ja_JP.UTF-8.

You should find out whether or not the ja_JP collation includes all UTF-8 chars. That way you can be certain.

I would think the better approach is that if __collidx_table_lookup() returns 0, libc has a fallback function to collate all UTF-8 characters.
Currently I have no specific good idea for how to do that.




Regarding WCSCOLL(3P):

----------------------------
RETURN VALUE
       On error, wcscoll() shall set errno, but no return value is
       reserved to indicate an error.

ERRORS
       The wcscoll() function may fail if:

       EINVAL The ws1 or ws2 arguments contain wide-character codes outside
              the domain of the collating sequence.
----------------------------

You don't need to quote the manpage, simply describe the expected return value as defined in the appropriate standard.

However, this might be the only relevant definition I could find in POSIX. I tested the behavior by passing the actual code points to wcscoll().

Then you need to explain that "This is the definition that describes the error behaviour for wcscoll."


The attachment glibc-xx-errno-strcoll.diff sets EINVAL if the value is
out of the table.
My understanding is that __collidx_table_lookup() in libc.so checks whether the
character is defined in the collation table, so my suggestion is to set errno
if the character is not defined in the table.
If wcscoll/strcoll/wcsxfrm/strxfrm set errno, I could enhance
g_utf8_collate(_key) later.
E.g. if wcscoll() returns an undefined value with errno set, wcscmp() could be
called afterwards.
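
As a sketch of that application-side idea, assuming the patched wcscoll() reports EINVAL for characters outside the collating sequence (unpatched glibc does not set it):

#include <errno.h>
#include <wchar.h>

/* Compare with the locale collation when possible; fall back to raw
   code-point order when the characters are outside the collating
   sequence (reported via EINVAL by the patched wcscoll()).  */
static int collate_with_fallback (const wchar_t *a, const wchar_t *b)
{
  errno = 0;                  /* no return value is reserved for errors,
                                 so clear errno before the call */
  int r = wcscoll (a, b);
  if (errno == EINVAL)
    return wcscmp (a, b);     /* outside the collating sequence */
  return r;
}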

Have you tested this patch by running the glibc testsuite?

I'm not familiar with the glibc testsuite... so I just tested some code points with wcscoll(). If any links are available, I'd try it.

Did you build glibc with your patch?


Once you build glibc with your patch you can run "make -k check >&
result.log" to execute the testsuite. Then you must verify that your
patch has not introduced regressions in the testsuite.

Thanks for the pointer. I ran the test yesterday and the results seem OK.



However, somebody might say the ja_JP collation table should contain all UTF-8
characters, but actually the ja_JP file does not.

Who might say this and why?

Actually I'm not sure whether anybody would really say this. However, the first patch sets errno if __collidx_table_lookup() returns 0, and my understanding is that the return value of __collidx_table_lookup() depends on the glibc/localedata/locales/ja_JP file for the ja_JP.UTF-8 locale.

You need to determine if this is a bug in the ja_JP collation table. You must determine if it should have all UTF-8 characters.

I think we don't have to update the ja_JP file itself.
Since other platforms don't set errno for UTF-8 locales, I think the first patch (glibc-xx-errno-strcoll.diff) is probably not needed for UTF-8.
I'd like to think about a fallback algorithm to fix the problem instead of static collation data for all UTF-8 characters.


After I applied the second patch (glibc-xx-set-undefined.diff), the glibc collation order and the Solaris one are almost the same.
However, today I noticed one problem.


The current collation:
U+D55C < U+0041 < U+65E5 < U+00CD

After glibc-xx-set-undefined.diff is applied:
U+0041 < U+65E5 < U+00CD < U+D55C

This result is ASCII < Japanese < composed Latin < Korean.
The Windows and Java results look better than this when the sorted strings are shown in an application GUI.
So it might not be enough to just collate the undefined characters after the defined characters in glibc/localedata/locales/*.


Thanks,
fujiwara



If a character is not defined in glibc/localedata/locales/ja_JP,
__collidx_table_lookup() returns 0 in libc.so.
If we could use __collseq_table_lookup() instead, it would return the
max value for an undefined character and I could resolve this problem.
But I think we need to use __collidx_table_lookup() for wcscoll(), since
the size of the locale collation is unclear.

But the problem is that when we receive 0, U+0000 is actually defined in
glibc/localedata/locales/ja_JP LC_COLLATE, and the result is that the
undefined characters are always collated before the defined characters in
wcscoll().

E.g. if a is an ASCII character, b is a Japanese character and c is a Korean
character, the collation would be c < a < b under ja_JP.UTF-8, since U+0000 is
defined in the ja_JP file.

But if you look at the ja_JP file, it also defines "UNDEFINED" in
LC_COLLATE.
An UNDEFINED character should be collated last.
But the word "UNDEFINED" seems to be used by the localedef program only.
When we run wcscoll(), we don't know which index of weight[] is the
UNDEFINED value.

This is not a coherent description of the solution, internal details are not important right now, what is important is explaining clearly the two solutions.

OK, if the first patch glibc-xx-errno-strcoll.diff is acceptable, I could implement the following scenario:

If wcscoll() returns a value with no errno set under a ja_JP.* locale, I can expect
the value to be defined in glibc/localedata/locales/ja_JP, and the returned value
can be used by software applications with the locale collation.
If wcscoll() returns a value with errno set under ja_JP.*, I can expect the value
is not defined in the ja_JP file, the collation is not available with the
locale collation, and then another function can be called, e.g. wcscmp(),
without the locale collation.
We still need to call wcscoll() first. E.g. if the composed 'a' + '`' (U+00E0) and ASCII 'b'
are compared, we'd like to get 'a' + '`' < 'b' on some locales.
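
Just to illustrate that last point with a standalone sketch (en_US.UTF-8 below is only an example of a Latin-based locale): plain wcscmp() orders U+00E0 after 'b' by code point, while the locale collation gives the expected order:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main (void)
{
  if (setlocale (LC_ALL, "en_US.UTF-8") == NULL)  /* any Latin-based locale */
    return 1;
  const wchar_t *a = L"\u00E0";   /* 'a' with grave accent */
  const wchar_t *b = L"b";
  /* By raw code point, 0x00E0 > 0x0062, so wcscmp() reports a > b ...  */
  printf ("wcscmp:  %d\n", wcscmp (a, b));
  /* ... but the locale collation orders the accented 'a' before 'b'.  */
  printf ("wcscoll: %d\n", wcscoll (a, b));
  return 0;
}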

Try to make your description more generic e.g. "with no errno on a locale". Your code must be applicable to all locales.

I think the errno solution is probably also good for non-UTF-8 charsets whose
character range is smaller than UTF-8's. E.g. ISO8859-1 has a very small code range, but
wcscoll() takes wchar_t* as its argument. The WCHAR_MAX value is higher than the
end of the ISO8859-1 range, so many wchar_t values are not used in ISO8859-1.

OK



Then I'm attaching another solution (glibc-xx-set-undefined.diff).

So my solution is: if wcscoll() receives 0 from findidx(), wcscoll() uses
USTRING_MAX instead of weight[].

If I look at the zh_CN file, U+0000 is not defined there. The undefined characters
are still always collated before the defined characters in wcscoll() because the
following line affects the result in wcscoll():

result = seq1len == 0 ? -1 : 1;

seq1len is 0, but the string is not actually shorter than the other one in this case;
it is simply not defined in the locale collation.

I modified this part in glibc-xx-set-undefined.diff.

It's probably good for wcscoll() to follow the 'UNDEFINED' keyword in
the locale collation file, and I think 'UNDEFINED' should be put at the
end of LC_COLLATE.

You need to expand on why you think this is needed.

If the second patch glibc-xx-set-undefined.diff is acceptable, I can simply use the returned value of wcscoll() in software applications, and the GUI appearance looks better than the current behavior because the patched libc sorts ASCII < Japanese < other languages' characters. I think that makes it easy for GUI users to find a string among the sorted strings.

OK.


Cheers,
Carlos.


