This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.
Re: transliteration and wc*tomb
- To: libc-alpha at sources dot redhat dot com
- Subject: Re: transliteration and wc*tomb
- From: Markus Kuhn <Markus dot Kuhn at cl dot cam dot ac dot uk>
- Date: Sun, 01 Oct 2000 14:24:59 +0100
Ulrich Drepper wrote on 2000-09-25 16:41 UTC:
> Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk> writes:
>
> > The standard (§7.20) only says that
> >
> > "MB_CUR_MAX [..]
> > expands to a positive integer expression with type
> > size_t that is the maximum number of bytes in a multibyte
> > character for the extended character set specified by the
> > current locale (category LC_CTYPE), which is never greater
> > than MB_LEN_MAX."
> >
> > If a locale contains a transliteration L"ü" -> "ue", then for this
> > locale the implementation will have to make sure that MB_CUR_MAX >=
> > strlen("ue").
>
> First of all, this is everything but practical. Many programs assume
> MB_CUR_MAX to be an attribute of the charset which is used to trigger
> certain handling. This all would break.
Can you give a practical example? I don't understand this point. And why
would this "certain handling" be appropriate for Shift-JIS or UTF-8
output but not for transliterated output? In every respect that I can
think of, the handling required for the two cases is equivalent. What
exactly would break?
Transliteration is just another external multi-byte encoding, nothing
else, except that it is a characteristic of transliteration that it only
affects output, not input. Everything the standards say on multi-byte
encodings also applies to transliteration.
> Second, I deliberately
> didn't enable the transliteration for the wc*tomb* functions since in
> the contexts they are used they have to be exact. There is even an
> error number defined for the case an invalid character is found. This
> is different from the stream handling where this is kept
> implementation defined.
No!!! What standard are you reading? Mine clearly says in §7.19.3:
[#11] The wide character input functions read multibyte
characters from the stream and convert them to wide
characters as if they were read by successive calls to the
fgetwc function. Each conversion occurs as if by a call to
the mbrtowc function, with the conversion state described by
the stream's own mbstate_t object. The byte input functions
read characters from the stream as if by successive calls to
the fgetc function.
[#12] The wide character output functions convert wide
characters to multibyte characters and write them to the
stream as if they were written by successive calls to the
fputwc function. Each conversion occurs as if by a call to
the wcrtomb function, with the conversion state described by
the stream's own mbstate_t object. The byte output
functions write characters to the stream as if by successive
calls to the fputc function.
[#13] In some cases, some of the byte input/output functions
also perform conversions between multibyte characters and
wide characters. These conversions also occur as if by
calls to the mbrtowc and wcrtomb functions.
[#14] An encoding error occurs if the character sequence
presented to the underlying mbrtowc function does not form a
valid (generalized) multibyte character, or if the code
value passed to the underlying wcrtomb does not correspond
to a valid (generalized) multibyte character. The wide
^^^^^^^^
character input/output functions and the byte input/output
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
functions store the value of the macro EILSEQ in errno if
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
and only if an encoding error occurs.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Every single letter of the standard forbids exactly the strange
distinction between the i/o functions and wc*tomb* that you try to draw
here. An error code is defined for every i/o function just as it is for
the string conversion functions. The standard is, in my eyes, highly
reasonable and does exactly what I would have expected it to do. Please
read the relevant parts that I quoted again carefully. A final ASCII
version (which is far, far easier to keyword-search and quote in online
discussions than the ISO paper version) is at
http://www.cl.cam.ac.uk/~mgk25/volatile/ISO-C-FDIS.1999-04.txt
> Also, iconv() also does not enable transliteration by default
> for the same reason.
I won't comment on iconv() at the moment, because it is not an ISO C
API, but I suspect that there is also no good reason why iconv should
behave radically differently from the behaviour required for wide i/o
and the ISO C conversion functions.
> It is a bad idea to do this since it is greatly reducing portability
> and the ability to develop programs on Linux for other architectures.
Again, I'm afraid I completely fail to understand what exactly you are
referring to here. Even if there is such a problem (which I cannot see
right now), glibc is probably still the first major C library to
implement transliteration. Others who follow will look at it as an
example, which makes it all the more important to get it simple,
functional, elegant, and above all in strict conformance with the
requirements of ISO C. My suggestions and criticisms are all aimed at
just that.
I like the transliteration mechanism and I'm very glad that it found its
way into glibc 2.2. I just want to make sure that transliteration is
treated correctly like any other multi-byte encoding variant and that
its implementation does not violate the ISO C standard. It currently
very clearly does.
> Stdio is used for output which might also lead to some
> incompatibilities but the standard says so and the most problematic
> situations
I have already quoted the holy ISO text that says stdio shall behave
exactly as if using wc*tomb(), and I think the standard makes a very
sensible requirement here, so there is little I can add other than to
repeat myself.
Markus
--
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>