This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug locale/18978] New: The collation symbol “UNDEFINED” does not work as specified in the standard
- From: "maiku.fabian at gmail dot com" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sourceware dot org
- Date: Thu, 17 Sep 2015 11:56:07 +0000
- Subject: [Bug locale/18978] New: The collation symbol “UNDEFINED” does not work as specified in the standard
- Auto-submitted: auto-generated
https://sourceware.org/bugzilla/show_bug.cgi?id=18978
Bug ID: 18978
Summary: The collation symbol “UNDEFINED” does not work as
specified in the standard
Product: glibc
Version: 2.22
Status: NEW
Severity: normal
Priority: P2
Component: locale
Assignee: unassigned at sourceware dot org
Reporter: maiku.fabian at gmail dot com
Target Milestone: ---
http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html
says:
opengroup> Collation Order
opengroup>
opengroup> [...]
opengroup>
opengroup> The symbol UNDEFINED shall be interpreted as including all
opengroup> coded character set values not specified explicitly or via
opengroup> the ellipsis symbol. Such characters shall be inserted in
opengroup> the character collation order at the point indicated by the
opengroup> symbol, and in ascending order according to their coded
opengroup> character set values. If no UNDEFINED symbol is specified,
opengroup> and the current coded character set contains characters not
opengroup> specified in this section, the utility shall issue a
opengroup> warning message and place such characters at the end of the
opengroup> character collation order.
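The rule quoted above can be sketched in a few lines of Python. This is only an illustration of one reading of the standard, not glibc's implementation: the name `expand_order` and the list-based representation are hypothetical, ellipsis handling is left out, and the warning the standard requires when no UNDEFINED symbol is present is omitted.

```python
UNDEFINED = "UNDEFINED"  # marker standing in for the collation symbol


def expand_order(order, charset):
    """Return the full collation order as a list of characters.

    order   -- explicitly listed characters, optionally containing
               the UNDEFINED marker
    charset -- every character of the coded character set
    """
    listed = {c for c in order if c != UNDEFINED}
    # Characters not listed explicitly, in ascending code point order.
    unlisted = sorted(c for c in set(charset) if c not in listed)
    if UNDEFINED not in order:
        # No UNDEFINED symbol: unlisted characters go at the end
        # (the standard also asks for a warning, omitted here).
        return list(order) + unlisted
    result = []
    for entry in order:
        if entry == UNDEFINED:
            result.extend(unlisted)  # insert at the marked position
        else:
            result.append(entry)
    return result


print(expand_order(["c", UNDEFINED, "a"], "abcde"))
# → ['c', 'b', 'd', 'e', 'a']
```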
Unfortunately, it does not work like that in glibc.
For example:
The Japanese locale source file /usr/share/i18n/locales/ja_JP
has this in the LC_COLLATE section:
mfabian@ari:/usr/share/i18n/locales
$ grep -A 8 ^LC_COLLATE ja_JP
LC_COLLATE
order_start forward
%
% C0
%
<U0000>
<U0001>
<U0002>
<U0003>
mfabian@ari:/usr/share/i18n/locales
$ grep -B 8 '^END LC_COLLATE' ja_JP
<U9F97>
<U9F9E>
<U9FA1>
<U9FA2>
<U9FA3>
<U9FA5>
UNDEFINED
order_end
END LC_COLLATE
mfabian@ari:/usr/share/i18n/locales
$
I.e. it includes the “UNDEFINED” collation symbol at the end.
Now if I choose a character which is *not* specified in
the LC_COLLATE section, neither explicitly nor via the ellipsis,
for example:
⅞ U+215E VULGAR FRACTION SEVEN EIGHTHS
and check how it sorts, I find:
mfabian@ari:~/testdir
$ LANG=ja_JP.UTF-8 ls
⅞ A B C D O U Z a b c d o u z Ã Ã Ä Ä Ã Ä Ã Ã Ã
mfabian@ari:~/testdir
$
I.e. it sorts at the beginning, not at the end (the other non-ASCII
characters in that sort example *are* explicitly specified
in the sort order, that’s why they appear after “z”, which is where
the locale source places them).
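To rule out `ls` itself as the cause, the same check can be made through the C library's collation functions. The following is a minimal sketch using Python's standard `locale` module; the helper name `sort_in_locale` is hypothetical, and the result depends on which locales are installed on the system, so no expected output is shown.

```python
import locale


def sort_in_locale(names, locale_name):
    """Sort names with the LC_COLLATE rules of locale_name.

    Returns None if the locale is not installed on this system.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strxfrm transforms each string so that plain comparison
        # follows the collation order of the active locale.
        return sorted(names, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


# U+215E is not listed in the LC_COLLATE section of ja_JP.
names = ["\u215e", "A", "z"]
print(sort_in_locale(names, "ja_JP.UTF-8"))
```

If U+215E comes first in the printed list, the behaviour matches what `ls` shows above.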
To test this further, I created my own variant of
/usr/share/i18n/locales/POSIX
by removing the entry for “B” (<U0042>) and changing the LC_COLLATE section as follows:
LC_COLLATE
# This is the POSIX Locale definition for the LC_COLLATE category.
# The order is the same as in the ASCII code set.
order_start forward
<U0000>
<U0001>
normal stuff here
modified part follows:
<U0040> <- @
<U0044> <- D (moved here to make sure I am really using my modified
locale)
<U0041> <- A
<U0043> <- C
UNDEFINED <- B is *not* specified any more! Therefore it should go here!
<U0045> <- E
<U0046> <- F
more normal stuff here
<U007E>
<U007F>
order_end
#
END LC_COLLATE
And when testing this (I installed this modified POSIX locale
using localedef under the name "POSIXMIKE"):
mfabian@ari:~/testdir
$ LANG=POSIXMIKE ls
B ?? ?? ?? ?? ?? ?? ?? ?? ?? ??? D A C O U Z a b c d o u z
mfabian@ari:~/testdir
$
So the now unspecified “B” is sorted at the beginning and *not*
after “C” where the “UNDEFINED” collation symbol is.
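For comparison, the order one would expect for this experiment under the reading of the standard quoted earlier can be computed with a small sort key. This is a sketch only: `ORDER` contains just the modified lines shown in the excerpt, `expected_key` is a hypothetical helper, and names are assumed to be single characters.

```python
# Entries from the modified part of the LC_COLLATE section;
# "UNDEFINED" marks where unlisted characters are expected to sort.
ORDER = ["@", "D", "A", "C", "UNDEFINED", "E", "F"]


def expected_key(name):
    """Sort key for a single-character name."""
    if name != "UNDEFINED" and name in ORDER:
        # Explicitly listed: use the position in the order.
        return (ORDER.index(name), 0)
    # Unlisted: take the slot of UNDEFINED, then ascending code point.
    return (ORDER.index("UNDEFINED"), ord(name))


print(sorted(["A", "B", "C", "D"], key=expected_key))
# → ['D', 'A', 'C', 'B']
```

The `ls` output above instead starts with B, which is the discrepancy this report is about.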