This is the mail archive of the
libc-alpha@sources.redhat.com
mailing list for the glibc project.
charset documentation patches
- To: libc-alpha at sources dot redhat dot com
- Subject: charset documentation patches
- From: Markus Kuhn <Markus dot Kuhn at cl dot cam dot ac dot uk>
- Date: Sat, 30 Sep 2000 20:17:25 +0100
I just had a quick look at charset.texi and noticed quite a number of
minor mistakes that I fixed. Mostly UCS4 -> UCS-4 (that's how it is
written in all the standards), second amendment -> Amendment 1 (there
was no second amendment to ISO C90), etc. I also slightly modernized the
description of the relationship between Unicode and ISO 10646.
Patch attached.
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>
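As an illustration of the "2-6 non-ASCII bytes" wording in the patch
below, here is a minimal sketch of the original 31-bit UTF-8 scheme.
The name ucs4_to_utf8 is a hypothetical helper, not a glibc function.
The sketch assumes the caller supplies a buffer of at least six bytes.
It does not reject surrogate values or apply the later restriction of
UTF-8 to code points up to 0x10ffff (four bytes).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of glibc: encode one UCS-4 value with
   the original 31-bit UTF-8 scheme.  Stores 1 to 6 bytes in OUT and
   returns the number of bytes, or 0 if C does not fit in 31 bits.  */
size_t
ucs4_to_utf8 (uint32_t c, unsigned char out[6])
{
  size_t n;

  if (c > 0x7fffffffu)
    return 0;                   /* outside the 31-bit code space */

  if (c < 0x80)
    {
      out[0] = (unsigned char) c;   /* ASCII is encoded as itself */
      return 1;
    }

  /* Pick the sequence length from the magnitude of the value.  */
  if (c < 0x800)
    n = 2;
  else if (c < 0x10000)
    n = 3;
  else if (c < 0x200000)
    n = 4;
  else if (c < 0x4000000)
    n = 5;
  else
    n = 6;

  /* Continuation bytes have the form 10xxxxxx, filled from the end.  */
  for (size_t i = n - 1; i > 0; i--)
    {
      out[i] = (unsigned char) (0x80 | (c & 0x3f));
      c >>= 6;
    }

  /* Lead byte: N one-bits, a zero bit, then the remaining high bits.  */
  out[0] = (unsigned char) ((0xff00u >> n) | c);
  return n;
}
```

For example, ucs4_to_utf8 (0x20ac, buf) stores the bytes e2 82 ac and
returns 3; every byte of a multi-byte sequence has its top bit set,
which is what keeps the encoding ASCII compatible.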
Index: charset.texi
===================================================================
RCS file: /cvs/glibc/libc/manual/charset.texi,v
retrieving revision 1.23
diff -u -r1.23 charset.texi
--- charset.texi 2000/09/27 00:44:57 1.23
+++ charset.texi 2000/09/30 19:05:40
@@ -15,7 +15,7 @@
grappled with non-Roman character sets, where not all the characters
that make up a language's character set can be represented by @math{2^8}
choices. This chapter shows the functionality which was added to the C
-library to correctly support multiple character sets.
+library to support multiple character sets.
@menu
* Extended Char Intro:: Introduction to Extended Characters.
@@ -46,13 +46,13 @@
representations include files lying in a directory that are going to be
read and parsed.
-Traditionally there was no difference between the two representations.
-It was equally comfortable and useful to use the same one-byte
+Traditionally there has been no difference between the two representations.
+It was equally comfortable and useful to use the same single-byte
representation internally and externally. This changes with more and
larger character sets.
One of the problems to overcome with the internal representation is
-handling text which is externally encoded using different character
+handling text that is externally encoded using different character
sets. Assume a program which reads two texts and compares them using
some metric. The comparison can be usefully done only if the texts are
internally kept in a common format.
@@ -69,14 +69,28 @@
As shown in some other part of this manual,
@c !!! Ahem, wide char string functions are not yet covered -- drepper
there exists a completely new family of functions which can handle texts
-of this kind in memory. The most commonly used character set for such
-internal wide character representations are Unicode and @w{ISO 10646}.
-The former is a subset of the latter and used when wide characters are
-chosen to by 2 bytes (@math{= 16} bits) wide. The standard names of the
-@cindex UCS2
-@cindex UCS4
-encodings used in these cases are UCS2 (@math{= 16} bits) and UCS4
-(@math{= 32} bits).
+of this kind in memory. The most commonly used character sets for such
+internal wide character representations are Unicode and @w{ISO 10646}
+(also known as UCS for Universal Character Set). Unicode was originally
+planned as a 16-bit character set, whereas @w{ISO 10646} was designed to
+be a 31-bit large code space. The two standards are practically identical.
+They have the same character repertoire and code table, but Unicode specifies
+added semantics. At the moment, only characters in the first @code{0x10000}
+code positions (the so-called Basic Multilingual Plane, BMP) have been
+assigned, but the assignment of more specialized characters outside this
+16-bit space is already in progress. A number of encodings have been
+defined for Unicode and @w{ISO 10646} characters:
+@cindex UCS-2
+@cindex UCS-4
+@cindex UTF-8
+@cindex UTF-16
+UCS-2 is a 16-bit word that can only represent characters
+from the BMP, UCS-4 is a 32-bit word that can represent any Unicode
+and @w{ISO 10646} character, UTF-8 is an ASCII compatible encoding where
+ASCII characters are represented by ASCII bytes and non-ASCII characters
+by sequences of 2-6 non-ASCII bytes, and finally UTF-16 is an extension
+of UCS-2 in which pairs of certain UCS-2 words can be used to encode
+non-BMP characters up to @code{0x10ffff}.
To represent wide characters the @code{char} type is not suitable. For
this reason the @w{ISO C} standard introduces a new type which is
@@ -93,18 +107,18 @@
The @w{ISO C90} standard, where this type was introduced, does not say
anything specific about the representation. It only requires that this
-type is capable to store all elements of the basic character set.
+type is capable of storing all elements of the basic character set.
Therefore it would be legitimate to define @code{wchar_t} as
@code{char}. This might make sense for embedded systems.
But for GNU systems this type is always 32 bits wide. It is therefore
-capable to represent all UCS4 value therefore covering all of @w{ISO
-10646}. Some Unix systems define @code{wchar_t} as a 16 bit type and
+capable of representing all UCS-4 values and therefore covering all of
+@w{ISO 10646}. Some Unix systems define @code{wchar_t} as a 16-bit type and
thereby follow Unicode very strictly. This is perfectly fine with the
standard but it also means that to represent all characters from Unicode
-and @w{ISO 10646} one has to use surrogate character which is in fact a
-multi-wide-character encoding. But this contradicts the purpose of the
-@code{wchar_t} type.
+and @w{ISO 10646} one has to use UTF-16 surrogate characters which is in
+fact a multi-wide-character encoding. But this contradicts the purpose
+of the @code{wchar_t} type.
@end deftp
@comment wchar.h
@@ -119,8 +133,8 @@
@code{int} due to the parameter promotion.
@pindex wchar.h
-This type is defined in @file{wchar.h} and got introduced in the second
-amendment to @w{ISO C90}.
+This type is defined in @file{wchar.h} and got introduced in
+@w{Amendment 1} to @w{ISO C90}.
@end deftp
As there are for the @code{char} data type there also exist macros
@@ -133,7 +147,7 @@
The macro @code{WCHAR_MIN} evaluates to the minimum value representable
by an object of type @code{wint_t}.
-This macro got introduced in the second amendment to @w{ISO C90}.
+This macro got introduced in @w{Amendment 1} to @w{ISO C90}.
@end deftypevr
@comment wchar.h
@@ -142,7 +156,7 @@
The macro @code{WCHAR_MIN} evaluates to the maximum value representable
by an object of type @code{wint_t}.
-This macro got introduced in the second amendment to @w{ISO C90}.
+This macro got introduced in @w{Amendment 1} to @w{ISO C90}.
@end deftypevr
Another special wide character value is the equivalent to @code{EOF}.
@@ -180,7 +194,7 @@
@end smallexample
@pindex wchar.h
-This macro was introduced in the second amendment to @w{ISO C90} and is
+This macro was introduced in @w{Amendment 1} to @w{ISO C90} and is
defined in @file{wchar.h}.
@end deftypevr
@@ -198,7 +212,7 @@
@cindex multibyte character
@cindex EBCDIC
For all the above reasons, an external encoding which is different
-from the internal encoding is often used if the latter is UCS2 or UCS4.
+from the internal encoding is often used if the latter is UCS-2 or UCS-4.
The external encoding is byte-based and can be chosen appropriately for
the environment and for the texts to be handled. There exist a variety
of different character sets which can be used for this external
@@ -215,7 +229,7 @@
@itemize @bullet
@item
-The simplest character sets are one-byte character sets. There can be
+The simplest character sets are single-byte character sets. There can be
only up to 256 characters (for @w{8 bit} character sets) which is not
sufficient to cover all languages but might be sufficient to handle a
specific text. Another reason to choose this is because of constraints
@@ -240,7 +254,7 @@
sequence of a character one can interpret a text correctly. Examples of
character sets using this policy are the various EUC character sets
(used by Sun's operations systems, EUC-JP, EUC-KR, EUC-TW, and EUC-CN)
-or SJIS (Shift JIS, a Japanese encoding).
+or SJIS (Shift-JIS, a Japanese encoding).
But there are also character sets using a state which is valid for more
than one character and has to be changed by another byte sequence.
@@ -257,23 +271,23 @@
acute'' character. To get the acute accent character on its on one has
to write @code{0xc2 0x20} (the non-spacing acute followed by a space).
-This type of characters sets is quite frequently used in embedded
-systems such as video text.
+This type of character set is used in some embedded systems such as
+teletex.
@item
@cindex UTF-8
-Instead of converting the Unicode or @w{ISO 10646} text used internally
+Instead of converting the Unicode or @w{ISO 10646} text used internally,
it is often also sufficient to simply use an encoding different than
-UCS2/UCS4. The Unicode and @w{ISO 10646} standards even specify such an
+UCS-2/UCS-4. The Unicode and @w{ISO 10646} standards even specify such an
encoding: UTF-8. This encoding is able to represent all of @w{ISO
-10464} 31 bits in a byte string of length one to seven.
+10646} 31 bits in a byte string of length one to six.
@cindex UTF-7
There were a few other attempts to encode @w{ISO 10646} such as UTF-7
but UTF-8 is today the only encoding which should be used. In fact,
-UTF-8 will hopefully soon be the only external which has to be
+UTF-8 will hopefully soon be the only external encoding that has to be
supported. It proves to be universally usable and the only disadvantage
-is that it favor Roman languages very much by making the byte string
+is that it favors Roman languages by making the byte string
representation of other scripts (Cyrillic, Greek, Asian scripts) longer
than necessary if using a specific character set for these scripts.
Methods like the Unicode compression scheme can alleviate these
@@ -324,7 +338,7 @@
The second family of functions got introduced in the early Unix standards
(XPG2) and is still part of the latest and greatest Unix standard:
@w{Unix 98}. It is also the most powerful and useful set of functions.
-But we will start with the functions defined in the second amendment to
+But we will start with the functions defined in @w{Amendment 1} to
@w{ISO C90}.
@node Restartable multibyte conversion
@@ -377,7 +391,7 @@
by the functions we are about to describe. Each locale uses its own
character set (given as an argument to @code{localedef}) and this is the
one assumed as the external multibyte encoding. The wide character
-character set always is UCS4, at least on GNU systems.
+character set always is UCS-4, at least on GNU systems.
A characteristic of each multibyte character set is the maximum number
of bytes which can be necessary to represent one character. This
@@ -456,8 +470,8 @@
function to another.
@pindex wchar.h
-This type is defined in @file{wchar.h}. It got introduced in the second
-amendment to @w{ISO C90}.
+This type is defined in @file{wchar.h}. It got introduced in
+@w{Amendment 1} to @w{ISO C90}.
@end deftp
To use objects of this type the programmer has to define such objects
@@ -495,7 +509,7 @@
it is zero.
@pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
is declared in @file{wchar.h}.
@end deftypefun
@@ -559,7 +573,7 @@
any static state.
@pindex wchar.h
-This function was introduced in the second amendment of @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
is declared in @file{wchar.h}.
@end deftypefun
@@ -608,7 +622,7 @@
@code{EOF}.
@pindex wchar.h
-This function was introduced in the second amendment of @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
is declared in @file{wchar.h}.
@end deftypefun
@@ -655,7 +669,7 @@
@code{(size_t) -1}. The conversion state is afterwards undefined.
@pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
is declared in @file{wchar.h}.
@end deftypefun
@@ -733,7 +747,7 @@
object local to @code{mbrlen} is used.
@pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C90} and
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and
is declared in @file{wchar.h}.
@end deftypefun
@@ -839,7 +853,7 @@
available, otherwise buffer overruns can occur.
@pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C} and is
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
declared in @file{wchar.h}.
@end deftypefun
@@ -977,7 +991,7 @@
following the last converted multibyte character.
@pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C} and is
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
declared in @file{wchar.h}.
@end deftypefun
@@ -1058,7 +1072,7 @@
converted.
@pindex wchar.h
-This function was introduced in the second amendment to @w{ISO C} and is
+This function was introduced in @w{Amendment 1} to @w{ISO C90} and is
declared in @file{wchar.h}.
@end deftypefun
@@ -1231,8 +1245,8 @@
@node Non-reentrant Conversion
@section Non-reentrant Conversion Function
-The functions described in the last chapter are defined in the second
-amendment to @w{ISO C90}. But the original @w{ISO C90} standard also
+The functions described in the last chapter are defined in
+@w{Amendment 1} to @w{ISO C90}. But the original @w{ISO C90} standard also
contained functions for character set conversion. The reason that they
are not described in the first place is that they are almost entirely
useless.
@@ -1369,8 +1383,8 @@
For convenience reasons the @w{ISO C90} standard defines also functions
to convert entire strings instead of single characters. These functions
-suffer from the same problems as their reentrant counterparts from the
-second amendment to @w{ISO C90}; see @ref{Converting Strings}.
+suffer from the same problems as their reentrant counterparts from
+@w{Amendment 1} to @w{ISO C90}; see @ref{Converting Strings}.
@comment stdlib.h
@comment ISO
@@ -1513,7 +1527,7 @@
specified by the functions. The multibyte encoding used is specified by
the currently selected locale for the @code{LC_CTYPE} category. The
wide character set is fixed by the implementation (in the case of GNU C
-library it always is UCS4 encoded @w{ISO 10646}.
+library it always is UCS-4 encoded @w{ISO 10646}.
This has of course several problems when it comes to general character
conversion:
@@ -1806,12 +1820,12 @@
int result = 0;
iconv_t cd;
- cd = iconv_open ("UCS4", charset);
+ cd = iconv_open ("UCS-4", charset);
if (cd == (iconv_t) -1)
@{
/* @r{Something went wrong.} */
if (errno == EINVAL)
- error (0, 0, "conversion from `%s' to `UCS4' no available",
+ error (0, 0, "conversion from '%s' to 'UCS-4' not available",
charset);
else
perror ("iconv_open");
@@ -2024,7 +2038,7 @@
Unfortunately, the answer is: there is no general solution. On some
systems guessing might help. On those systems most character sets can
-convert to and from UTF8 encoded @w{ISO 10646} or Unicode text.
+convert to and from UTF-8 encoded @w{ISO 10646} or Unicode text.
Beside this only some very system-specific methods can help. Since the
conversion functions come from loadable modules and these modules must
be stored somewhere in the filesystem, one @emph{could} try to find them
@@ -2082,7 +2096,7 @@
@cindex triangulation
This is achieved by providing for each character set a conversion from
-and to UCS4 encoded @w{ISO 10646}. Using @w{ISO 10646} as an
+and to UCS-4 encoded @w{ISO 10646}. Using @w{ISO 10646} as an
intermediate representation it is possible to @dfn{triangulate}, i.e.,
converting with an intermediate representation.
@@ -2210,15 +2224,15 @@
@code{INTERNAL} mentioned. From the discussion above and the chosen
name it should have become clear that this is the name for the
representation used in the intermediate step of the triangulation. We
-have said that this is UCS4 but actually it is not quite right. The
-UCS4 specification also includes the specification of the byte ordering
-used. Since a UCS4 value consists of four bytes a stored value is
+have said that this is UCS-4 but actually it is not quite right. The
+UCS-4 specification also includes the specification of the byte ordering
+used. Since a UCS-4 value consists of four bytes a stored value is
effected by byte ordering. The internal representation is @emph{not}
-the same as UCS4 in case the byte ordering of the processor (or at least
-the running process) is not the same as the one required for UCS4. This
+the same as UCS-4 in case the byte ordering of the processor (or at least
+the running process) is not the same as the one required for UCS-4. This
is done for performance reasons as one does not want to perform
unnecessary byte-swapping operations if one is not interested in actually
-seeing the result in UCS4. To avoid trouble with endianess the internal
+seeing the result in UCS-4. To avoid trouble with endianness the internal
representation consistently is named @code{INTERNAL} even on big-endian
systems where the representations are identical.
@@ -2570,7 +2584,7 @@
character can consist of one to four bytes. Therefore the
@code{MIN_NEEDED_FROM} and @code{MAX_NEEDED_FROM} macros are defined
this way. The output is always the @code{INTERNAL} character set (aka
-UCS4) and therefore each character consists of exactly four bytes. For
+UCS-4) and therefore each character consists of exactly four bytes. For
the conversion from @code{INTERNAL} to ISO-2022-JP we have to take into
account that escape sequences might be necessary to switch the character
sets. Therefore the @code{__max_needed_to} element for this direction
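The distinction between UCS-4 and the INTERNAL representation drawn in
the charset.texi hunks above can be sketched in a few lines of C.  This
is only an illustration under the assumption that UCS-4 stores the most
significant byte first; ucs4_store_be and host_matches_ucs4 are
hypothetical names, not glibc interfaces.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store C in UCS-4 byte order (most significant
   byte first), independent of the byte order of the host.  */
void
ucs4_store_be (uint32_t c, unsigned char out[4])
{
  out[0] = (unsigned char) (c >> 24);
  out[1] = (unsigned char) (c >> 16);
  out[2] = (unsigned char) (c >> 8);
  out[3] = (unsigned char) c;
}

/* Hypothetical helper: return 1 if the in-memory layout of C on this
   host is byte-for-byte identical to its UCS-4 layout, 0 otherwise.  */
int
host_matches_ucs4 (uint32_t c)
{
  unsigned char be[4];
  unsigned char host[4];

  ucs4_store_be (c, be);
  memcpy (host, &c, sizeof host);   /* raw host representation */
  return memcmp (be, host, sizeof be) == 0;
}
```

On a big-endian host host_matches_ucs4 returns 1 for every value, so
the host layout and UCS-4 coincide.  On a little-endian host it
returns 0 for a value such as 0x20ac, which is the case where a
byte swap would be needed to produce real UCS-4 output.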
Index: ctype.texi
===================================================================
RCS file: /cvs/glibc/libc/manual/ctype.texi,v
retrieving revision 1.23
diff -u -r1.23 ctype.texi
--- ctype.texi 2000/05/21 21:21:56 1.23
+++ ctype.texi 2000/09/30 19:12:41
@@ -265,8 +265,8 @@
@node Classification of Wide Characters, Using Wide Char Classes, Case Conversion, Character Handling
@section Character class determination for wide characters
-The second amendment to @w{ISO C89} defines functions to classify wide
-characters. Although the original @w{ISO C89} standard already defined
+@w{Amendment 1} to @w{ISO C90} defines functions to classify wide
+characters. Although the original @w{ISO C90} standard already defined
the type @code{wchar_t}, no functions operating on them were defined.
The general design of the classification functions for wide characters