This is the mail archive of the glibc-bugs@sourceware.org mailing list for the glibc project.



[Bug manual/12045] regex range semantics outside of POSIX should be documented


------- Additional Comments From bonzini at gnu dot org  2010-09-24 12:35 -------
It turns out that glibc's regex range semantics are "CEO".  The semantics
_are_ consistent; it is the locale definition files that are not.

I created a file with the 52 uppercase and lowercase letters and did a "sed -n
/[A-Z]/p" on this file.  The results I get are either

this      26   A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
or this   51   AaBbCcDdEeFfGgHhIiJjKkLlMmNnOoPpQqRrSsTtUuVvWwXxYyZ

Here are the "51" locales:

ar_SA cs_CZ hr_HR hsb_DE is_IS km_KH lo_LA lt_LT lv_LV or_IN pl_PL sk_SK
sl_SI th_TH tr_CY tr_TR

These return 51 for both $l and $l.utf8.  Every other locale returns 26 for both
unibyte and multibyte variants.
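
The experiment above can be approximated with a short script.  This is only a
sketch: it assumes sed is on the PATH and that the input file held one letter
per line in the order A, a, B, b, ... (the original file layout is not shown),
and the helper name matched_letters is made up for illustration.  Results
depend on the glibc version and on which locales are installed; a program whose
setlocale() call fails stays in the C locale, so a count of 26 can also mean
the requested locale is missing.

```python
import os
import shutil
import string
import subprocess


def matched_letters(locale_name):
    """Return the letters that `sed -n '/[A-Z]/p'` prints under locale_name."""
    if shutil.which("sed") is None:
        raise RuntimeError("sed not found on PATH")
    # Assumed input layout: A, a, B, b, ..., Z, z, one letter per line.
    pairs = zip(string.ascii_uppercase, string.ascii_lowercase)
    text = "".join(upper + "\n" + lower + "\n" for upper, lower in pairs)
    # LC_ALL overrides every other locale variable for the child process.
    env = dict(os.environ, LC_ALL=locale_name)
    result = subprocess.run(
        ["sed", "-n", "/[A-Z]/p"],
        input=text, env=env, capture_output=True, text=True, check=True,
    )
    return result.stdout.split()


if __name__ == "__main__":
    # Locale names are examples; substitute whatever `locale -a` lists.
    for name in ("C", "en_US.utf8", "cs_CZ.utf8"):
        hits = matched_letters(name)
        print(f"{name:12} {len(hits):3} {''.join(hits)}")
```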

Locales using glibc's localedata/locales/iso14651_t1_common template return 26.
This template defines the collation like this:

  <U0061> <a>;<BAS>;<MIN>;IGNORE # 198 a    start lowercase
  <U00AA> <a>;<PCL>;<EMI>;IGNORE # 199 ª
  <U00E1> <a>;<ACA>;<MIN>;IGNORE # 200 á
  ...
  <U007A> <z>;<BAS>;<MIN>;IGNORE # 507 z
  ...
  <U00FE> <th>;<BAS>;<MIN>;IGNORE # 516 þ     end lowercase
  <U0041> <a>;<BAS>;<CAP>;IGNORE # 517 A    start uppercase
  <U00C1> <a>;<ACA>;<CAP>;IGNORE # 518 Á
  ...
  <U005A> <z>;<BAS>;<CAP>;IGNORE # 813 Z
  ...
  <U00DE> <th>;<BAS>;<CAP>;IGNORE # 824 Þ    end uppercase

(There's no end to surprises: [a-z] comes _before_ [A-Z], which is why [A-z]
fails but [a-Z] works).
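
The effect of this ordering on ranges can be modelled in a few lines.  The
sketch below is not glibc code: it assumes a range matches every character
whose position in the sequence lies between the two endpoints, reduces the
template to the 52 ASCII letters (lowercase block first, then uppercase), and
models a "failing" range as a ValueError.  The names range_members and
TEMPLATE_ORDER are invented for the illustration.

```python
import string


def range_members(sequence, first, last):
    """Characters of `sequence` lying between `first` and `last`, inclusive.

    Raises ValueError when `last` precedes `first` in the sequence, which is
    how this sketch models a range expression that fails.
    """
    start, end = sequence.index(first), sequence.index(last)
    if end < start:
        raise ValueError(f"[{first}-{last}]: range end precedes range start")
    return sequence[start:end + 1]


# Simplified stand-in for iso14651_t1_common: the lowercase block, then the
# uppercase block.  Accented letters and other scripts are omitted.
TEMPLATE_ORDER = string.ascii_lowercase + string.ascii_uppercase

print(len(range_members(TEMPLATE_ORDER, "A", "Z")))  # uppercase block only
print(len(range_members(TEMPLATE_ORDER, "a", "Z")))  # spans both blocks
try:
    range_members(TEMPLATE_ORDER, "A", "z")
except ValueError as err:
    print("fails:", err)
```

In this model [A-Z] covers only the 26 uppercase letters, [a-Z] covers all 52,
and [A-z] is rejected because z sorts before A.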

Instead, the "special" locales above use a different sequence, for example in cs_CZ:

  <U0041> <U0041>;<NONE>;<CAPITAL>;<U0041>    # A
  <U0061> <U0041>;<NONE>;<SMALL>;<U0041>    # a
  <U00AA> <U0041>;<NONE>;<U00AA>;<U0041>    # ª
  <U00C1> <U0041>;<ACUTE>;<CAPITAL>;<U0041>    # Á
  <U00E1> <U0041>;<ACUTE>;<SMALL>;<U0041>    # á
  ...
  <U005A> <U005A>;<NONE>;<CAPITAL>;<U005A>    # Z
  <U007A> <U005A>;<NONE>;<SMALL>;<U005A>    # z
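
To see why such a sequence yields 51 matches, here is a small model of it.
It is a sketch under stated assumptions, not glibc's implementation: the
sequence is cut down to the 52 ASCII letters with each uppercase letter
directly followed by its lowercase form, and a range is taken to match every
character positioned between its endpoints.  INTERLEAVED_ORDER and in_range
are illustrative names.

```python
import string

# Simplified stand-in for the cs_CZ sequence: A, a, B, b, ..., Z, z.
# Accented letters are omitted.
INTERLEAVED_ORDER = "".join(
    upper + lower
    for upper, lower in zip(string.ascii_uppercase, string.ascii_lowercase)
)


def in_range(sequence, first, last, ch):
    """True if `ch` sits between `first` and `last` in `sequence`, inclusive."""
    position = sequence.index(ch)
    return sequence.index(first) <= position <= sequence.index(last)


matched = [ch for ch in INTERLEAVED_ORDER
           if in_range(INTERLEAVED_ORDER, "A", "Z", ch)]
missing = sorted(set(string.ascii_letters) - set(matched))

print(len(matched), "".join(matched))
print("not matched:", missing)
```

Only z falls outside the range, because it is the single letter that sorts
after Z in the interleaved sequence.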

So, it looks like __collseq_table_lookup implements what the POSIX rationale
document calls "CEO".  I'll open a separate bug on the inconsistencies caused
by using CEO.  In the meantime, this bug remains open for the documentation part.


-- 
           What    |Removed                     |Added
----------------------------------------------------------------------------
          Component|regex                       |manual
            Summary|regex range semantics       |regex range semantics
                   |outside of POSIX should be  |outside of POSIX should be
                   |documented and consistent   |documented


http://sourceware.org/bugzilla/show_bug.cgi?id=12045

------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.

