This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug locale/18978] New: The collation symbol “UNDEFINED” does not work as specified in the standard


https://sourceware.org/bugzilla/show_bug.cgi?id=18978

            Bug ID: 18978
           Summary: The collation symbol “UNDEFINED” does not work as
                    specified in the standard
           Product: glibc
           Version: 2.22
            Status: NEW
          Severity: normal
          Priority: P2
         Component: locale
          Assignee: unassigned at sourceware dot org
          Reporter: maiku.fabian at gmail dot com
  Target Milestone: ---

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap07.html

says: 

opengroup> Collation Order
opengroup> 
opengroup> [...]
opengroup> 
opengroup> The symbol UNDEFINED shall be interpreted as including all
opengroup> coded character set values not specified explicitly or via
opengroup> the ellipsis symbol. Such characters shall be inserted in
opengroup> the character collation order at the point indicated by the
opengroup> symbol, and in ascending order according to their coded
opengroup> character set values. If no UNDEFINED symbol is specified,
opengroup> and the current coded character set contains characters not
opengroup> specified in this section, the utility shall issue a
opengroup> warning message and place such characters at the end of the
opengroup> character collation order.

Unfortunately, it does not work like that in glibc.

For example:

The Japanese locale source file /usr/share/i18n/locales/ja_JP
has this in the LC_COLLATE section:

    mfabian@ari:/usr/share/i18n/locales
    $ grep -A 8 ^LC_COLLATE ja_JP
    LC_COLLATE
    order_start forward
    %
    % C0
    %
    <U0000>
    <U0001>
    <U0002>
    <U0003>
    mfabian@ari:/usr/share/i18n/locales
    $ grep -B 8 '^END LC_COLLATE' ja_JP
    <U9F97>
    <U9F9E>
    <U9FA1>
    <U9FA2>
    <U9FA3>
    <U9FA5>
    UNDEFINED
    order_end
    END LC_COLLATE
    mfabian@ari:/usr/share/i18n/locales
    $

I.e. it includes the “UNDEFINED” collation symbol at the end.

Now if I choose a character which is *not* specified in
the LC_COLLATE section, neither explicitly nor via the ellipsis,
for example:

    ⅞ U+215E VULGAR FRACTION SEVEN EIGHTHS

and check how it sorts, I find:

    mfabian@ari:~/testdir
    $ LANG=ja_JP.UTF-8 ls
    ⅞ A  B  C  D  O  U  Z  a  b  c  d  o  u  z  Ã  Ã  Ä  Ä  Ã  Ä  Ã  Ã  Ã
    mfabian@ari:~/testdir
    $

I.e. it sorts at the beginning, not at the end (the other non-ASCII
characters in that sort example *are* explicitly specified
in the sort order, that’s why they appear after “z”, which is how
it is specified).
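
A check like the one above can also be scripted instead of being read off
the ls output. The following is a minimal sketch in Python 3, assuming that
ls orders names with the locale's collation rules in the same way that
strxfrm does. The helper name sort_in_locale is hypothetical and not part of
glibc; it returns None when the requested locale is not installed.

```python
import locale


def sort_in_locale(names, locale_name):
    """Sort names by the LC_COLLATE rules of locale_name.

    Returns None if the locale is not available on this system.
    """
    # Remember the current collation locale so it can be restored.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strxfrm maps each string to a key that compares like strcoll.
        return sorted(names, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    # U+215E is the character that ja_JP does not list explicitly.
    names = ["\u215e", "A", "B", "Z", "a", "z"]
    print(sort_in_locale(names, "ja_JP.UTF-8"))
```

If the behaviour matches this report, U+215E should come first in the
printed list rather than last.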

To test this further, I created my own variant of

/usr/share/i18n/locales/POSIX

by removing the entry for “B” from the LC_COLLATE section and adding
UNDEFINED after “C”:

LC_COLLATE
# This is the POSIX Locale definition for the LC_COLLATE category.
# The order is the same as in the ASCII code set.
order_start forward
<U0000>
<U0001>

normal stuff here

modified part follows:

<U0040>         <- @
<U0044>         <- D (moved here to make sure I am really using my modified locale)
<U0041>         <- A
<U0043>         <- C 
UNDEFINED       <- B is *not* specified any more! Therefore it should go here!
<U0045>         <- E
<U0046>         <- F

more normal stuff here

<U007E>
<U007F>
order_end
#
END LC_COLLATE

And when testing this (I installed this modified POSIX locale
using localedef under the name "POSIXMIKE"):

    mfabian@ari:~/testdir
    $ LANG=POSIXMIKE ls
    B  ??  ??  ??  ??  ??  ??  ??  ??  ??  ???  D  A  C  O  U  Z  a  b  c  d  o  u  z
    mfabian@ari:~/testdir
    $

So the now unspecified “B” is sorted at the beginning and *not*
after “C” where the “UNDEFINED” collation symbol is.
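
The expectation behind this test can be written down as a small model of
the rule quoted from the standard. This is only a sketch of my reading of
that paragraph, not of how localedef works internally: the function name
expected_order is hypothetical, the ellipsis symbol is not modelled, and
the warning required when UNDEFINED is missing is left out.

```python
UNDEFINED = "UNDEFINED"


def expected_order(entries, charset):
    """Collation order the quoted POSIX text appears to require.

    entries: the order as listed in LC_COLLATE, single characters plus
             at most one UNDEFINED marker.
    charset: every character of the coded character set.
    """
    explicit = [e for e in entries if e != UNDEFINED]
    # Unlisted characters, ascending by coded character set value.
    unspecified = sorted(set(charset) - set(explicit))
    if UNDEFINED not in entries:
        # No UNDEFINED symbol: unlisted characters go to the end.
        return explicit + unspecified
    at = entries.index(UNDEFINED)
    # Unlisted characters are inserted where UNDEFINED stands.
    return entries[:at] + unspecified + entries[at + 1:]


# The modified part of the POSIXMIKE order, restricted to "@" through "F".
modified = ["@", "D", "A", "C", UNDEFINED, "E", "F"]
print(expected_order(modified, "@ABCDEF"))
# -> ['@', 'D', 'A', 'C', 'B', 'E', 'F']
```

Under this reading, “B” belongs between “C” and “E”, which is not where
the ls output above puts it.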
