This is the mail archive of the
libc-alpha@sources.redhat.com
mailing list for the glibc project.
Re: New GB18030 gconv module for glibc (from ThizLinux Laboratory)
- From: "Markus Scherer" <markus dot scherer at us dot ibm dot com>
- To: drepper at redhat dot com (Ulrich Drepper)
- Cc: Anthony Fok <anthony at thizlinux dot com>, drepper at myware dot mynet, fai at thizlinux dot com, Bruno Haible <haible at ilog dot fr>, kevin at thizlinux dot com, libc-alpha at sources dot redhat dot com, sunnygu at thizgroup dot com, suzhe at gnuchina dot org, Yu Shao <yshao at redhat dot com>
- Date: Thu, 17 Jan 2002 20:48:36 -0800
- Subject: Re: New GB18030 gconv module for glibc (from ThizLinux Laboratory)
I am sorry for writing another certainly offensive email, but:
It is not unusual to transport, process, and map unassigned codes in
various charsets including Unicode.
There is nothing "incorrect input" about U+33ff; the fact that it is not
mentioned in UnicodeData.txt only means that it has no assigned character
and that it has only default properties. A converter for UTF-8, SCSU, or
GB 18030 must convert it. For UTFs, this is even a Unicode conformance
requirement.
In fact, U+33ff (like all other unassigned and non-character code points)
does have a Unicode general category value: "Cn". Such code points are not
listed in UnicodeData.txt because the definition of the Unicode character
database says that they are omitted.
You can see this in the description of the general categories in
http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.html :
For general category Cn it says "Other, Not Assigned (no characters in the
file have this property)". This means that everything that is not in
UnicodeData.txt has Cn.
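
To illustrate the point about Cn, here is a minimal sketch in Python. It
uses the standard unicodedata module, which is not what glibc uses. The
helper names general_category and has_character_name are made up for this
example. It queries U+FFFF, a non-character, because the answer for U+33ff
depends on which Unicode version the interpreter bundles.

```python
import unicodedata


def general_category(code_point: int) -> str:
    """Return the general category reported by the bundled Unicode database."""
    return unicodedata.category(chr(code_point))


def has_character_name(code_point: int) -> bool:
    """True if the code point has a name, i.e. an entry with an assigned character."""
    return unicodedata.name(chr(code_point), None) is not None


if __name__ == "__main__":
    # U+FFFF is a non-character: no entry in UnicodeData.txt, category Cn.
    print(general_category(0xFFFF), has_character_name(0xFFFF))
    # U+0041 is LATIN CAPITAL LETTER A: listed, category Lu.
    print(general_category(0x0041), has_character_name(0x0041))
```

A code point without an entry still yields a well-defined category, so a
lookup like this never needs to treat it as an error.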
Another Unicode conformance requirement says that you must pass through
(and not throw away or corrupt) code points that you don't know anything
about if you purport to not modify the contents of the text.
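
As a rough sketch of what "pass through" means in practice, the following
Python snippet round-trips U+33ff through a few encodings. Python's
built-in codecs are only a stand-in here, and nothing is assumed about how
the glibc gconv module is implemented. The helper name round_trips is
invented for this example.

```python
def round_trips(code_point: int, encoding: str) -> bool:
    """Encode one code point and decode it again; True if nothing was lost."""
    text = chr(code_point)
    return text.encode(encoding).decode(encoding) == text


if __name__ == "__main__":
    code_point = 0x33FF  # the code point under discussion in this thread
    for encoding in ("utf-8", "utf-16", "gb18030"):
        # The converter does not consult UnicodeData.txt; it maps the code
        # point by its numeric value alone.
        print(encoding, round_trips(code_point, encoding))
```

The conversion depends only on the numeric value of the code point, not on
whether a character has been assigned to it.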
Non-Unicode examples of handling unassigned code points include most East
Asian charsets with their areas for user-defined characters, private use,
"reserved" codes, etc. These are frequently mapped, e.g. in GBK<->Unicode
tables, to and from parts of the Unicode private-use areas.
My understanding is that a GB 18030 converter that does not handle
unassigned-but-legal codes will not pass certification (but I am also not
an expert in certification).
markus
Markus Scherer IBM GCoC-Unicode/ICU San José, CA
markus.scherer@us.ibm.com (also for SameTime)
Ulrich Drepper <drepper@redhat.com>
Sent by: drepper@myware.mynet
01/17/2002 06:03 PM
Please respond to drepper
To: Anthony Fok <anthony@thizlinux.com>
cc: Yu Shao <yshao@redhat.com>, libc-alpha@sources.redhat.com,
kevin@thizlinux.com, fai@thizlinux.com, sunnygu@thizgroup.com,
suzhe@gnuchina.org, Markus Scherer/Cupertino/IBM@IBMUS, Bruno Haible
<haible@ilog.fr>
Subject: Re: New GB18030 gconv module for glibc (from ThizLinux Laboratory)
Anthony Fok <anthony@thizlinux.com> writes:
> No, this is wrong. As I have said, in the Unicode standard, U+33FF
> is "legal" but "unassigned". If gb18030.c says it is "illegal", it is
> glibc's bug.
No. This is an incorrect input. Period. There is no discussion
about it. I've already said that no character which is not in the
current UnicodeData list must be converted.
--
---------------. ,-. 1325 Chesapeake Terrace
Ulrich Drepper \ ,-------------------' \ Sunnyvale, CA 94089 USA
Red Hat `--' drepper at redhat.com `------------------------