This is the mail archive of the cygwin mailing list for the Cygwin project.


Re: Cygwin 1.7.1 sprintf() with format string having 8th bit set


2010/1/4 Thomas Wolff:
> My assumption has been that *printf should be byte-transparent unless where
> it uses explicit wide character arguments.

What's that assumption based on?


> After all, legacy applications that do not care about locales at all may
> legitimately assume this since a C char [] is a byte sequence;

Erm, the meaning of a byte sequence is up to each function.


> this is not affected by the legacy casual usage of the word "character"
> referring to a char value which does not automatically imply "wide
> character".

There is no casual usage of "byte" and "character" in the POSIX
standard. See http://www.opengroup.org/onlinepubs/000095399/basedefs/xbd_chap03.html.

In particular:

3.84 Byte: An individually addressable unit of data storage that is
exactly an octet, used to store a character or a portion of a
character; see also Character. A byte is composed of a contiguous
sequence of 8 bits. The least significant bit is called the
"low-order" bit; the most significant is called the "high-order" bit.

3.87 Character: A sequence of one or more bytes representing a single
graphic symbol or control code.

3.92 Character String: A contiguous sequence of characters terminated
by and including the first null byte.

3.367 String: A contiguous sequence of bytes terminated by and
including the first null byte.

(And yep, a lot of confusion would go away if the 'char' type was
called 'byte' instead, but of course that's out of the question.)


> In that thread, someone had originally confused char * with wchar [] - the
> issue resolves cleanly if these are properly distinguished.
>
> Comments on the EILSEQ clause from that thread:
>>
>> > It's talking about "characters" rather than "bytes" there, which I
>> > think does leave the behaviour for invalid bytes undefined,

That sentence had nothing to do with EILSEQ. Here it is in its original context:

"I couldn't find specific text about invalid bytes in the POSIX printf
spec, but it does say the following: "The format is a character
string, beginning and ending in its initial shift state, if any. The
format is composed of zero or more directives: ordinary characters,
which are simply copied to the output stream, and conversion
specifications, each of which shall result in the fetching of zero or
more arguments."

It's talking about "characters" rather than "bytes" there, which I
think does leave the behaviour for invalid bytes undefined, so
newlib's printf implementation is in its rights to just stop
processing the string at one of those."

To emphasise this again, the printf spec explicitly says that "the
format is a *character* string".


> I don't think there is such a thing like an invalid multibyte character in a
> char [] unless it is being interpreted with a multi-byte function, that's
> what e.g. the mb* functions are for.

Well, you're wrong. See the definition of 'character'.


> In a legacy application, especially in an sprintf which may not even be
> intended for printing, there is no intent to apply a multi-byte
> interpretation. This is over-imposing semantics on a basic C type.

No, it's necessary for printf to work correctly with all character
sets. For example, in a stateful encoding such as ISO-2022-JP, a byte
of a double-byte character can be the same as the ASCII code for '%'
(in SJIS the trail bytes start at 0x40, so there the classic collision
is with '\' rather than '%'). Hence, if printf blindly copied bytes
until encountering a '%', it would not be possible to print such
characters.


> So I do not agree that printf is right here, and if it were, the third line
> in the example would have had to fail as well, actually.

Including invalid bytes in the format string is undefined behaviour.
Anything can happen. And what likely happened is that the compiler
replaced the third sprintf call with strcpy (which is specified on
strings rather than character strings).

The real discussion to be had here is whether "C" should continue to
mean UTF-8 or return to ASCII for the sake of Linux compatibility. See
http://cygwin.com/ml/cygwin-developers/2009-12/msg00112.html for that.

Andy
