This is the mail archive of the libc-locales@sourceware.org mailing list for the GNU libc locales project.



Re: Character classifications and language-dependence


Hi,

Keld Jørn Simonsen <keld@dkuug.dk> writes:

> The reasoning behind considering a-circumflex and the like a letter,
> also in languages not normally using it, is that in general readers will
> recognize it as a letter, and somewhat know how to pronounce it etc.
> Thus in Denmark â is used for example in names of French wines, like
> "Château de Bonfils" and this may occur regularly, e.g., in newspaper
> advertisements, or on menus in restaurants. It is thus good to know
> that â can be part of a word, and thus it should be in class alpha of
> this locale. The same would be valid for possibly all other locales
> of the world.

This is a good point.  More generally, readers of variants of the Latin
alphabet will recognize accented Latin letters as letters.

OTOH, "i18n" also includes letters from other alphabets, like Greek and
Cyrillic, and it is unclear whether all those alphabets (and variants
thereof) can be considered "mutually recognizable" by their readers.

"Recognizability" of a letter is probably very subjective.  For
instance, the accented letters found in Castellano, Italian, and French
certainly look familiar to readers of any of these languages.  However,
accented Latin letters found in Central and Eastern European languages
(e.g., `e' with ogonek, as in Polish; more generally, Latin letters not
part of Latin-1) certainly look very "unusual" to readers of French,
Castellano, Italian, etc.

> I don't know if there is any work on some locales to change this, 
> but I would recommend against it. However, one could think of creating
> new classes for specific purposes. What would your use be?

Actually, I don't have any specific use case in mind.  Since the UCD
already allows the construction of a list of "all existing letters",
regardless of the language or script they "belong" to, my feeling was
that, conversely, locales could provide more language-specific
knowledge.
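
To illustrate what I mean by the UCD-based list, here is a minimal
sketch in Python, using the standard `unicodedata' module.  The helper
names `is_letter' and `in_latin1' are my own, made up for this example.
It treats a character as a letter when its UCD general category starts
with "L", which is an assumption about what "letter" should mean here
and not necessarily how glibc derives its `alpha' class.

```python
import unicodedata

def is_letter(ch: str) -> bool:
    """True if the UCD general category of ch is a letter category (L*)."""
    return unicodedata.category(ch).startswith("L")

def in_latin1(ch: str) -> bool:
    """True if ch can be encoded in ISO-8859-1 (Latin-1)."""
    try:
        ch.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

# a-circumflex, e-ogonek, Greek alpha, Cyrillic de, a digit, a hyphen
samples = ["\u00e2", "\u0119", "\u03b1", "\u0434", "1", "-"]
for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"letter={is_letter(ch)} latin1={in_latin1(ch)}")
```

Note that this classification says nothing about any particular
language: a-circumflex, e-ogonek, Greek alpha and Cyrillic de all come
out as letters, and only the Latin-1 check tells them apart.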

Initially, I was just wondering whether this broad and (to some extent)
language-independent character classification is glibc-specific, or
whether it is following some standard or recommendation.

Thanks,
Ludovic.

