This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.
[Bug localedata/14010] New: Serious omissions in alphabetic character class
- From: "bugdal at aerifal dot cx" <sourceware-bugzilla at sourceware dot org>
- To: libc-locales at sources dot redhat dot com
- Date: Mon, 23 Apr 2012 01:37:32 +0000
- Subject: [Bug localedata/14010] New: Serious omissions in alphabetic character class
- Auto-submitted: auto-generated
http://sourceware.org/bugzilla/show_bug.cgi?id=14010
Bug #: 14010
Summary: Serious omissions in alphabetic character class
Product: glibc
Version: unspecified
Status: NEW
Severity: normal
Priority: P2
Component: localedata
AssignedTo: unassigned@sourceware.org
ReportedBy: bugdal@aerifal.cx
CC: libc-locales@sources.redhat.com
Classification: Unclassified
The localedata generation code defines is_alpha based on Unicode categories L*,
plus Nl, Nd, and a moderate number of special cases, most of which exist to fix
Thai language support (where is_alpha would otherwise return false for letters
in category Mn). However, Thai is not the only language affected; any language
that uses non-spacing letters is broken by glibc's deficient is_alpha
definition. As a particular example, all of the Tibetan subjoined letters are
considered non-alphabetic (and thus punctuation) by glibc.
Unicode addresses this issue by defining the Other_Alphabetic property in
PropList.txt and the Alphabetic derived property in DerivedCoreProperties.txt,
the latter of which consists of Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic. This
subsumes all special-case hacks for Thai in glibc's gen-unicode-ctype.c and
fixes the issue (at least approximately) for all other languages/scripts at the
same time.
glibc's localedata should adopt the definition of Alphabetic from Unicode's
DerivedCoreProperties.txt (and still add Nd and the special cases from So).
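As a rough illustration of the proposed rule, the following Python sketch computes Alphabetic as Lu+Ll+Lt+Lm+Lo+Nl plus Other_Alphabetic, then adds Nd on top. It is not glibc's generator code: the names PROPLIST_EXCERPT, parse_ranges, is_alphabetic and proposed_is_alpha are hypothetical, the two data lines are a hand-written excerpt in PropList.txt syntax rather than the real file, and the special cases from So are left out. General categories come from Python's unicodedata module, so they reflect whatever Unicode version the interpreter ships with.

```python
import unicodedata

# Hand-written excerpt in PropList.txt syntax (an assumption for this sketch;
# a real generator would read the complete file from the Unicode database).
PROPLIST_EXCERPT = """\
0E31          ; Other_Alphabetic # Mn  THAI CHARACTER MAI HAN-AKAT
0F90..0F97    ; Other_Alphabetic # Mn  TIBETAN SUBJOINED LETTER KA..JA
"""

# General categories that make up Alphabetic, before Other_Alphabetic is added.
ALPHA_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}


def parse_ranges(text, prop):
    """Return (first, last) code point pairs for the lines that assign `prop`."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        first, _, last = fields[0].partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    return ranges


def is_alphabetic(cp, other_alpha):
    """Alphabetic = Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic."""
    if unicodedata.category(chr(cp)) in ALPHA_CATEGORIES:
        return True
    return any(first <= cp <= last for first, last in other_alpha)


def proposed_is_alpha(cp, other_alpha):
    """The proposed rule: Alphabetic plus Nd (So special cases omitted here)."""
    return is_alphabetic(cp, other_alpha) or unicodedata.category(chr(cp)) == "Nd"


OTHER_ALPHA = parse_ranges(PROPLIST_EXCERPT, "Other_Alphabetic")
```

With this excerpt, U+0F90 (TIBETAN SUBJOINED LETTER KA, category Mn) is accepted only because of its Other_Alphabetic entry; passing an empty range list in its place reproduces the category-only behaviour that rejects it.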