This is the mail archive of the cygwin mailing list for the Cygwin project.
Re: Non-canonical mode input via tcsetattr(), under mintty console
- From: Andy Koppe <andy dot koppe at gmail dot com>
- To: cygwin at cygwin dot com
- Date: Wed, 3 Mar 2010 16:36:37 +0000
- Subject: Re: Non-canonical mode input via tcsetattr(), under mintty console
- References: <513288.14252.qm@web19014.mail.hk2.yahoo.com> <4B8A6069.4030008@towo.net>
Thomas Wolff:
> Dave Lee wrote:
>>
>> Hi all,
>>
>> I was testing a program that uses non-canonical mode input via
>> tcsetattr().
>>
>> ...
>> Specifically, I entered the Chinese character "例" (which means "rule"
>> or "example"). It occupies 3 bytes in UTF-8 representation: E4, BE, 8B.
>>
>> On standard console, the read() call returned THREE bytes (n == 3), and
>> (not surprisingly) E4, BE and 8B were returned to buf[].
>>
>> On mintty console, the read() call returned ONE byte (n == 1), and only
>> E4 was returned to buf[]. I could grab the other two bytes if I did
>> additional calls to read().
>>
> This is absolutely in line with the specified interface of read(), whether
> or not you apply some tcsetattr settings, and whether or not there is a
> difference between cygwin console and mintty. It is a traditional
> byte-oriented function and has no knowledge or handling of character
> encoding, and there is no guarantee that a multi-byte character comes in one
> piece.
Exactly.
> (Even if mintty were changed to try to feed them in one piece, there
> would still be no guarantee that you receive them in one piece.)
As it happens, mintty already sends each multibyte character in a single
write(), but the pseudo terminal device driver is indeed entitled to
pick it apart anyway: VMIN=1 and VTIME=0 means "give me at least one
byte, as soon as you have it". It's also possible that multiple
characters are delivered at once.
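For reference, the terminal setup under discussion can be sketched roughly as
follows. This is only a minimal illustration, not Dave's actual program: the
helper names configure_noncanonical() and enter_noncanonical() are made up
here, and only ICANON, VMIN and VTIME are touched.

```c
#include <assert.h>
#include <stddef.h>
#include <termios.h>
#include <unistd.h>

/* Hypothetical helper: edit a termios structure in memory so that
 * read() returns as soon as at least one byte is available. */
void configure_noncanonical(struct termios *t)
{
    assert(t != NULL);
    t->c_lflag &= ~(tcflag_t)ICANON;  /* leave line-at-a-time mode */
    t->c_cc[VMIN] = 1;                /* block until at least 1 byte */
    t->c_cc[VTIME] = 0;               /* no inter-byte timer */
}

/* Hypothetical helper: apply the settings to a terminal fd and hand
 * back the previous settings in *saved so they can be restored later.
 * Returns 0 on success, -1 on error (for example if fd is not a tty). */
int enter_noncanonical(int fd, struct termios *saved)
{
    struct termios t;

    if (tcgetattr(fd, saved) == -1)
        return -1;
    t = *saved;
    configure_noncanonical(&t);
    return tcsetattr(fd, TCSANOW, &t);
}
```

With these settings a read(fd, buf, sizeof buf) may return anything from one
byte up to the buffer size, so the caller has to be prepared for a character
that is split across calls as well as for several characters in one call.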
> You have four options (two each whether you want UTF-8 or Unicode words in
> your program):
> [...]
> * Read bytes and transform with one of the mbtowc (multi-byte to
> wide-character) functions
> [...]
I'd go with that, because that way you can support not only UTF-8, but
all the charsets supported by the OS.
> (provided you want characters as Unicode words,
> not UTF-8 sequences in your program).
In that case, one can just ignore the widechar output and only use the
length info returned by mb(r)towc.
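A rough sketch of that approach, feeding bytes to mbrtowc() one at a time and
using only its return value: the struct mb_assembler type and its two
functions are names invented for this illustration, and the program is
assumed to have called setlocale(LC_CTYPE, "") at startup so that the user's
charset is in effect.

```c
#include <assert.h>
#include <limits.h>   /* MB_LEN_MAX */
#include <locale.h>   /* setlocale(), to be called once by the program */
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper type: collects the bytes of one character,
 * however many read() calls they arrive in. */
struct mb_assembler {
    mbstate_t state;         /* conversion state kept between calls */
    char bytes[MB_LEN_MAX];  /* bytes of the current character */
    size_t len;              /* number of bytes collected so far */
};

void mb_assembler_init(struct mb_assembler *a)
{
    assert(a != NULL);
    memset(a, 0, sizeof *a);
}

/* Feed one byte as returned by read().
 * Returns the length of a completed character (its bytes are then in
 * a->bytes), 0 if more bytes are needed, or -1 on an invalid sequence. */
int mb_assembler_feed(struct mb_assembler *a, unsigned char byte)
{
    char c = (char)byte;
    size_t r;
    int n;

    assert(a != NULL);
    if (a->len >= sizeof a->bytes) {  /* should not happen; start over */
        mb_assembler_init(a);
        return -1;
    }
    a->bytes[a->len++] = c;

    /* NULL as first argument: the wide character itself is discarded. */
    r = mbrtowc(NULL, &c, 1, &a->state);
    if (r == (size_t)-2)
        return 0;                     /* incomplete, wait for more */
    if (r == (size_t)-1) {
        mb_assembler_init(a);         /* state is undefined after an error */
        return -1;
    }
    n = (int)a->len;                  /* character complete */
    a->len = 0;
    return n;
}
```

In a UTF-8 locale, feeding E4, BE and 8B in turn should yield 0, 0 and then 3,
regardless of whether the three bytes came from one read() or from three.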
Andy