This is the mail archive of the
glibc-bugs@sourceware.org
mailing list for the glibc project.
[Bug manual/12045] regex range semantics outside of POSIX should be documented
- From: "bonzini at gnu dot org" <sourceware-bugzilla at sourceware dot org>
- To: glibc-bugs at sources dot redhat dot com
- Date: 24 Sep 2010 12:35:03 -0000
- Subject: [Bug manual/12045] regex range semantics outside of POSIX should be documented
- References: <20100921152445.12045.eblake@redhat.com>
- Reply-to: sourceware-bugzilla at sourceware dot org
------- Additional Comments From bonzini at gnu dot org 2010-09-24 12:35 -------
It turns out that regex range semantics for glibc are "CEO" (collation
element order). They _are_ consistent; it's the locale definition files that
are not consistent.
I created a file with the 52 uppercase and lowercase letters and ran
"sed -n /[A-Z]/p" on it. The results I get are either these 26 letters:
  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
or these 51:
  AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZ
Here are the "51" locales:
ar_SA cs_CZ hr_HR hsb_DE is_IS km_KH lo_LA lt_LT lv_LV or_IN pl_PL sk_SK
sl_SI th_TH tr_CY tr_TR
These return 51 for both $l and $l.utf8. Every other locale returns 26 for both
unibyte and multibyte variants.
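
The experiment above can be repeated with a small script. This is only a sketch: it assumes one letter per line in the input, a "sed" binary on the PATH, and that the locale named is actually installed (an uninstalled locale silently falls back to "C" behaviour). The helper name matched_letters is made up for illustration, and results on a current system may differ from the figures reported here.

```python
import os
import string
import subprocess


def matched_letters(locale_name):
    """Run sed -n '/[A-Z]/p' over the 52 ASCII letters under one locale."""
    letters = string.ascii_uppercase + string.ascii_lowercase
    # Assumed input layout: one letter per line.
    text = "".join(ch + "\n" for ch in letters)
    # LC_ALL overrides LANG and the individual LC_* variables in the child.
    env = dict(os.environ, LC_ALL=locale_name)
    result = subprocess.run(
        ["sed", "-n", "/[A-Z]/p"],
        input=text,
        capture_output=True,
        text=True,
        env=env,
        check=True,
    )
    # Each matching line holds a single letter.
    return result.stdout.split()


if __name__ == "__main__":
    # Locale names are examples; use whatever `locale -a` lists.
    for name in ("C", "cs_CZ.utf8"):
        found = matched_letters(name)
        print(name, len(found), "".join(found))
```

Looping the call over every name printed by "locale -a" gives the 26-versus-51 split per locale.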
Locales using glibc's localedata/locales/iso14651_t1_common template return 26.
This template defines the collation like this:
<U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a start lowercase
<U00AA> <a>;<PCL>;<EMI>;IGNORE # 199 ª
<U00E1> <a>;<ACA>;<MIN>;IGNORE # 200 á
...
<U007A> <z>;<BAS>;<MIN>;IGNORE # 507 z
...
<U00FE> <th>;<BAS>;<MIN>;IGNORE # 516 þ end lowercase
<U0041> <a>;<BAS>;<CAP>;IGNORE # 517 A start uppercase
<U00C1> <a>;<ACA>;<CAP>;IGNORE # 518 Á
...
<U005A> <z>;<BAS>;<CAP>;IGNORE # 813 Z
...
<U00DE> <th>;<BAS>;<CAP>;IGNORE # 824 Þ end uppercase
(There's no end to surprises: [a-z] comes _before_ [A-Z], which is why [A-z]
fails but [a-Z] works).
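
To see why the template order gives those results, here is a toy model in Python. It is not glibc's code: it assumes a range [X-Y] matches every character whose position in the collation sequence lies between those of X and Y, and it keeps only the ASCII letters from the template (accented letters are left out). The names TEMPLATE_ORDER and range_matches are invented for this sketch.

```python
import string

# Simplified stand-in for the iso14651_t1_common order quoted above:
# the whole lowercase block sorts before the whole uppercase block.
TEMPLATE_ORDER = string.ascii_lowercase + string.ascii_uppercase
SEQ = {ch: pos for pos, ch in enumerate(TEMPLATE_ORDER)}


def range_matches(start, end):
    """Return the letters whose sequence position lies in [start, end]."""
    lo, hi = SEQ[start], SEQ[end]
    if lo > hi:
        # Start point sorts after end point: the range is invalid.
        raise ValueError(f"invalid range [{start}-{end}]")
    return "".join(ch for ch in TEMPLATE_ORDER if lo <= SEQ[ch] <= hi)


print(len(range_matches("A", "Z")))  # 26: the uppercase block only
print(len(range_matches("a", "Z")))  # 52: from first lowercase to last uppercase
try:
    range_matches("A", "z")
except ValueError as err:
    print(err)  # invalid range [A-z]
```

In this model [A-z] is rejected because "A" sorts after "z", while [a-Z] spans both blocks.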
Instead, the "special" locales above use a different sequence; for example, in cs_CZ:
<U0041> <U0041>;<NONE>;<CAPITAL>;<U0041> # A
<U0061> <U0041>;<NONE>;<SMALL>;<U0041> # a
<U00AA> <U0041>;<NONE>;<U00AA>;<U0041> # ª
<U00C1> <U0041>;<ACUTE>;<CAPITAL>;<U0041> # Á
<U00E1> <U0041>;<ACUTE>;<SMALL>;<U0041> # á
...
<U005A> <U005A>;<NONE>;<CAPITAL>;<U005A> # Z
<U007A> <U005A>;<NONE>;<SMALL>;<U005A> # z
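
The same toy model explains the "51" result. Again this is a sketch and not glibc's implementation: it assumes the range covers every position between its endpoints, and it reduces the cs_CZ sequence to the ASCII letters in the interleaved order A a B b ... Z z. The names INTERLEAVED and letters_in_range are invented for illustration.

```python
import string

# Simplified stand-in for the cs_CZ order quoted above: each uppercase
# letter is immediately followed by its lowercase form.
INTERLEAVED = "".join(
    upper + lower
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
)


def letters_in_range(start, end, order=INTERLEAVED):
    """Return the slice of the sequence from start to end, inclusive."""
    lo, hi = order.index(start), order.index(end)
    return order[lo:hi + 1] if lo <= hi else ""


matched = letters_in_range("A", "Z")
print(len(matched), matched)
# 51 AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZ
```

Only "z" is excluded, because it is the one letter that sorts after "Z" in this sequence.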
So, it looks like __collseq_table_lookup is what the POSIX rationale document
calls "CEO". I'll open a bug on the inconsistencies caused by using CEO. In
the meantime, this bug remains open for the documentation part.
--
           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|regex                       |manual
            Summary|regex range semantics       |regex range semantics
                   |outside of POSIX should be  |outside of POSIX should be
                   |documented and consistent   |documented
http://sourceware.org/bugzilla/show_bug.cgi?id=12045