This is the mail archive of the cygwin-developers mailing list for the Cygwin project.


Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

From: Andy Koppe <andy dot koppe at gmail dot com>
To: cygwin-developers at cygwin dot com
Date: Sun, 27 Sep 2009 11:22:21 +0100
Subject: Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
References: <416096c60909262332j37d13eb4k400a7ca6c488872e@mail.gmail.com> <20090927091331.GB30851@calimero.vinschen.de>

2009/9/27 Corinna Vinschen:
> On Sep 27 07:32, Andy Koppe wrote:
>> > The __utf8_wctomb function could just create the corresponding
>> > UCS-2 values if no first half has been encountered before. The
>> > __utf8_mbtowc function could simply allow these UCS-2 values again.
>> >
>> > That works (I just tested it) and is a small change, but is it really
>> > desirable to allow UCS-2 values in UTF-8 strings?
>> [...]
>> The pragmatic approach is tempting though, and we do have reasonable
>> grounds for it given the 16-bit wchar_t. But I think it would need to
>> work for both low and high surrogates.
>>
>> Regarding the latter, __utf8_wctomb() currently writes the first byte
>> of a four-byte sequence when it sees a high surrogate, which of course
>> it can't take back if the following codepoint isn't a low surrogate.
>> This is a problem even if lone high surrogates aren't going to be
>> supported, because that byte on its own is invalid UTF-8.
>>
>> Reading the POSIX spec, however, wctomb() is allowed to write nothing,
>> return zero, and leave the entire high surrogate to be dealt with on
>> the next call. It just says "wctomb() shall [...] return the number of
>> bytes that constitute the character corresponding to the value of
>> wchar", and unlike with mbtowc(), a return value of zero is not
>> defined to have special meaning.
>>
>> There's also room to deal with a lone high surrogate at string end:
>> "If wchar is 0, a null byte shall be stored, preceded by any shift
>> sequence needed to restore the initial shift state, and wctomb() shall
>> be left in the initial shift state."
>
> It never occurred to me that wcrtomb could return 0 and the calling
> functions like wcsnrtombs would simply proceed. I'll have a look
> to change __utf8_wctomb accordingly.
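
To make the deferral idea concrete, here is a minimal sketch of a converter that writes nothing and returns 0 when it sees a high surrogate, then resolves it on the next call. This is not the newlib code: sketch_state, put_bmp and sketch_wctomb are hypothetical names, and the choice to flush an unpaired high surrogate as its own three-byte sequence is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state: a pending high surrogate, or 0 if none. */
typedef struct { uint16_t pending_high; } sketch_state;

/* Encode a value below 0x10000 as 1-3 UTF-8 bytes. Surrogates are encoded
   as-is, which is the "lone surrogate" case under discussion. */
static size_t put_bmp(unsigned char *out, uint32_t c)
{
    if (c < 0x80) {
        out[0] = (unsigned char)c;
        return 1;
    }
    if (c < 0x800) {
        out[0] = (unsigned char)(0xC0 | (c >> 6));
        out[1] = (unsigned char)(0x80 | (c & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (c >> 12));
    out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (c & 0x3F));
    return 3;
}

/* Hypothetical stand-in for wctomb with a 16-bit wchar_t. The buffer needs
   room for 6 bytes: a flushed lone high surrogate (3) plus a three-byte
   character (3). */
static size_t sketch_wctomb(unsigned char *out, uint16_t wc, sketch_state *st)
{
    size_t n = 0;

    assert(out != NULL && st != NULL);

    if (st->pending_high) {
        uint16_t hi = st->pending_high;
        st->pending_high = 0;
        if (wc >= 0xDC00 && wc <= 0xDFFF) {
            /* Valid pair: emit the whole four-byte sequence in one go. */
            uint32_t cp = 0x10000
                + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(wc - 0xDC00));
            out[0] = (unsigned char)(0xF0 | (cp >> 18));
            out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[3] = (unsigned char)(0x80 | (cp & 0x3F));
            return 4;
        }
        /* Assumed policy: flush the unpaired high surrogate as three bytes,
           then continue with the current character. */
        n = put_bmp(out, hi);
    }

    if (wc >= 0xD800 && wc <= 0xDBFF) {
        /* High surrogate: remember it and write nothing for it yet. */
        st->pending_high = wc;
        return n;
    }

    /* Everything else, including 0 at string end, is written directly. */
    return n + put_bmp(out + n, wc);
}
```

With this shape, passing 0 after a dangling high surrogate flushes the surrogate, stores the null byte, and leaves the state initial, which is the string-end case quoted from POSIX above.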

Two further thoughts on allowing lone surrogates:
- __mb_cur_max for UTF-8 would need to go up to 6 to allow for a lone
high surrogate followed by a three-byte char.
- Due to the DCxx scheme, the three-byte UTF-8 encoding of DCxx would
roundtrip to a single-byte xx. Changing the code to something other
than DCxx wouldn't help.
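
The second point can be illustrated with a small sketch. It assumes the DCxx scheme means: an undecodable input byte xx (0x80..0xFF) becomes the 16-bit value 0xDC00 + xx, and that value is written back as the single byte xx. decode3_lenient and encode_dcxx are hypothetical helpers, not the actual Cygwin or newlib functions.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lenient decoder for one three-byte UTF-8 sequence that
   does not reject surrogate values. */
static uint16_t decode3_lenient(const unsigned char *s)
{
    return (uint16_t)(((s[0] & 0x0F) << 12)
                      | ((s[1] & 0x3F) << 6)
                      | (s[2] & 0x3F));
}

/* Hypothetical encoder applying the ASSUMED DCxx rule: values 0xDC80..0xDCFF
   stand for a raw byte and are written back as that single byte. */
static size_t encode_dcxx(unsigned char *out, uint16_t wc)
{
    if (wc >= 0xDC80 && wc <= 0xDCFF) {
        out[0] = (unsigned char)(wc & 0xFF);
        return 1;
    }
    if (wc < 0x80) {
        out[0] = (unsigned char)wc;
        return 1;
    }
    if (wc < 0x800) {
        out[0] = (unsigned char)(0xC0 | (wc >> 6));
        out[1] = (unsigned char)(0x80 | (wc & 0x3F));
        return 2;
    }
    out[0] = (unsigned char)(0xE0 | (wc >> 12));
    out[1] = (unsigned char)(0x80 | ((wc >> 6) & 0x3F));
    out[2] = (unsigned char)(0x80 | (wc & 0x3F));
    return 3;
}
```

Under these assumptions, the input bytes ED B2 80 decode to 0xDC80, and encoding 0xDC80 yields the single byte 80, so three input bytes come back as one. Picking a different escape range only moves the collision to whatever values that range covers.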

Andy
