This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.
Re: transliteration and wc*tomb
- To: libc-alpha at sources dot redhat dot com
- Subject: Re: transliteration and wc*tomb
- From: Markus Kuhn <Markus dot Kuhn at cl dot cam dot ac dot uk>
- Date: Sun, 01 Oct 2000 14:24:59 +0100
Ulrich Drepper wrote on 2000-09-25 16:41 UTC:
> Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk> writes:
>
> > The standard (§7.20) only says that
> >
> > "MB_CUR_MAX [..]
> > expands to a positive integer expression with type
> > size_t that is the maximum number of bytes in a multibyte
> > character for the extended character set specified by the
> > current locale (category LC_CTYPE), which is never greater
> > than MB_LEN_MAX."
> >
> > If a locale contains a transliteration L"ü" -> "ue", then for this
> > locale the implementation will have to make sure that MB_CUR_MAX >=
> > strlen("ue").
>
> First of all, this is everything but practical. Many programs assume
> MB_CUR_MAX to be an attribute of the charset which is used to trigger
> certain handling. This all would break.
Can you give a practical example? I don't understand this point. And why
would this "certain handling" be appropriate for Shift-JIS or UTF-8
output but not for transliterated output? In every respect that I can
think of, the handling required for the two cases is equivalent. What
exactly would break?
Transliteration is just another external multi-byte encoding, nothing
else, except that it is a characteristic of transliteration that it only
affects output, not input. Everything the standards say on multi-byte
encodings also applies to transliteration.
> Second, I deliberately
> didn't enable the transliteration for the wc*tomb* functions since in
> the contexts they are used they have to be exact. There is even an
> error number defined for the case an invalid character is found. This
> is different from the stream handling where this is kept
> implementation defined.
No!!! What standard are you reading? Mine clearly says in §7.19.3:
[#11] The wide character input functions read multibyte
characters from the stream and convert them to wide
characters as if they were read by successive calls to the
fgetwc function. Each conversion occurs as if by a call to
the mbrtowc function, with the conversion state described by
the stream's own mbstate_t object. The byte input functions
read characters from the stream as if by successive calls to
the fgetc function.
[#12] The wide character output functions convert wide
characters to multibyte characters and write them to the
stream as if they were written by successive calls to the
fputwc function. Each conversion occurs as if by a call to
the wcrtomb function, with the conversion state described by
the stream's own mbstate_t object. The byte output
functions write characters to the stream as if by successive
calls to the fputc function.
[#13] In some cases, some of the byte input/output functions
also perform conversions between multibyte characters and
wide characters. These conversions also occur as if by
calls to the mbrtowc and wcrtomb functions.
[#14] An encoding error occurs if the character sequence
presented to the underlying mbrtowc function does not form a
valid (generalized) multibyte character, or if the code
value passed to the underlying wcrtomb does not correspond
to a valid (generalized) multibyte character. The wide
^^^^^^^^
character input/output functions and the byte input/output
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
functions store the value of the macro EILSEQ in errno if
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
and only if an encoding error occurs.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Every single letter of the standard forbids exactly the strange
distinction between the i/o functions and wc*tomb* that you try to draw
here. An error code is defined for every i/o function just as it is for
the string conversion functions. The standard is, in my eyes, highly
reasonable and does exactly what I would have expected it to do. Please
read the relevant parts that I quoted again carefully. A final ASCII
version (which is far, far easier to keyword-search and quote in online
discussions than the ISO paper version) is at
http://www.cl.cam.ac.uk/~mgk25/volatile/ISO-C-FDIS.1999-04.txt
> Also, iconv() also does not enable transliteration by default
> for the same reason.
I won't comment on iconv() at the moment, because it is not an ISO C
API, but I suspect that there is also no good reason why iconv should
behave radically differently from the behaviour required for wide i/o
and the ISO C conversion functions.
> It is a bad idea to do this since it is greatly reducing portability
> and the ability to develop programs on Linux for other architectures.
Again, I'm afraid I completely fail to understand what exactly you are
referring to here. Even if there is such a problem (which I cannot see
right now), glibc is probably still the first major C library to
implement transliteration. Others who follow will look at it as an
example, which makes it all the more important to get it simple,
functional, elegant, and above all in strict conformance with the
requirements of ISO C. My suggestions and criticisms are all aimed at
just that.
I like the transliteration mechanism and I'm very glad that it found its
way into glibc 2.2. I just want to make sure that transliteration is
treated correctly like any other multi-byte encoding variant and that
its implementation does not violate the ISO C standard. It currently
very clearly does.
> Stdio is used for output which might also lead to some
> incompatibilities but the standard says so and the most problematic
> situations
I have already quoted the holy ISO text that says stdio shall behave
exactly as if using wc*tomb(), and I think the standard makes a very
sensible requirement here, so there is little I can add other than to
repeat myself.
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>