This is the mail archive of the glibc-bugs@sources.redhat.com mailing list for the glibc project.



[Bug localedata/374] The rules in LC_COLLATE are random and sometimes clearly wrong


------- Additional Comments From pablo at mandrakesoft dot com  2004-09-30 21:07 -------
I think some LC_COLLATE definitions are indeed wrong; it looks like they were
never rewritten or updated to take advantage of the newer (glibc > 2.2) possibilities.

When you look at ar_SA, the LC_COLLATE is defined with lines like:

order_start             forward; forward
<U0020> <U0020>
...
<U0030> <U0030>
<U0031> <U0031>
<U0032> <U0032>
....
<U0041> <U0041>;<U0041>
<U0061> <U0041>;<U0061>
...

If you compare with iso14651_t1 (used, and possibly extended, by most other
locales), you see things like this instead:
<U0020> IGNORE;IGNORE;IGNORE;<U0020> # 32 <SP>
...
<U0030> <0>;<BAS>;<MIN>;IGNORE # 171 0
<U0031> <1>;<BAS>;<MIN>;IGNORE # 172 1
<U0032> <2>;<BAS>;<MIN>;IGNORE # 173 2
...
<U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a
...
<U0041> <a>;<BAS>;<CAP>;IGNORE # 319 A
...

While ar_SA gives only one, or in some cases two, weights for each element,
the more modern LC_COLLATE definitions have 4.
You can also see that while in ar_SA the space (<U0020>) is treated the same
as the digits, in the more modern LC_COLLATE definition it is not; in fact the
space is defined as sorting neutral (IGNORE at the first three levels).
The Latin letters carry information telling whether they are uppercase or
lowercase in the modern LC_COLLATE (<MIN> vs. <CAP>); that information is
missing from the definition in ar_SA.
da_DK is a bit stranger: it uses a modern LC_COLLATE definition, but
redefines everything itself (instead of including iso14651_t1 and only
redefining what differs). Spaces and blanks have a first-level sorting weight,
which seems very strange to me. Even if the Danish language does sort spaces in
such a peculiar way, it is still strange to sort the space (0020) and the
non-breaking space (00A0) differently; semantically they are the same thing,
and the difference is only typographical.

While the sorting of letters is correct (at least for the letters used by a
given language; ar_SA, for example, happily ignores any Latin letter outside of
ASCII: whereas ar_EG sorts "agrave" together with "a", ar_SA puts
"agrave" after the last Arabic letter), the handling of punctuation and
other special symbols should be reviewed, imho.
Also, all locales should include iso14651_t1 so that there is an acceptable
sort order for alphabetic symbols outside the alphabet of the given
locale. In a UTF-8 world you will likely see such things; I get, for example,
mail from people whose names contain cacute, ccaron, lstroke, eogonek, etc.
In my language none of those exist, but I expect them to be sorted with
"c", "c", "l", "e" respectively, and not after "z".
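
As a rough illustration of that expectation (and not of what iso14651_t1 actually does), the following Python sketch groups accented Latin letters with their base letter by stripping combining marks after canonical decomposition. The names EXTRA_BASE, base_letters and fallback_key are made up for this example. Note that lstroke has no canonical decomposition, so it needs an explicit entry; a real collation table lists such letters itself.

```python
import unicodedata

# Letters without a canonical decomposition need an explicit mapping.
# This table is an assumption for the example and is far from complete.
EXTRA_BASE = {
    "\u0142": "l",  # LATIN SMALL LETTER L WITH STROKE
    "\u0141": "L",  # LATIN CAPITAL LETTER L WITH STROKE
}


def base_letters(text):
    """Return `text` with accents removed, e.g. cacute -> c."""
    out = []
    for ch in text:
        ch = EXTRA_BASE.get(ch, ch)
        decomposed = unicodedata.normalize("NFD", ch)
        # Drop combining marks (acute, caron, ogonek, ...).
        out.append("".join(
            c for c in decomposed if not unicodedata.combining(c)
        ))
    return "".join(out)


def fallback_key(text):
    """Sort by base letters first; the original string breaks ties."""
    return (base_letters(text).lower(), text)


if __name__ == "__main__":
    names = ["z", "\u0107", "d", "c", "\u0142", "m"]
    print(sorted(names))                    # code point order
    print(sorted(names, key=fallback_key))  # accented letters next to base
```

Sorting by code point alone puts cacute and lstroke after "z", which is the behaviour complained about above; the fallback key places them next to "c" and "l" instead.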

-- 


http://sources.redhat.com/bugzilla/show_bug.cgi?id=374

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

