This is the mail archive of the cygwin-developers mailing list for the Cygwin project.



Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)


2009/9/27 Corinna Vinschen:
>> > It never occurred to me that wcrtomb could return 0 and the calling
>> > functions like wcsnrtombs would simply proceed.  I'll have a look
>> > to change __utf8_wctomb accordingly.
>>
>> Two further thoughts on allowing lone surrogates:
>> - __mb_cur_max for UTF-8 would need to go up to 6 to allow for a lone
>> high surrogate followed by a three-byte char.
>
> In newlib (and thus Cygwin) __mb_cur_max is already 6 for UTF-8.

I see.
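
To spell out the arithmetic behind the limit of 6, here is a minimal Python
sketch. It is not newlib code: it uses Python's "surrogatepass" error handler
as a stand-in for a UTF-8 converter that lets lone surrogates through, and the
helper names utf8_len and max_pending_bytes are made up for illustration. The
assumption is that a conversion call which held back a high surrogate may have
to emit it together with the following character when that character turns out
not to be a low surrogate.

```python
def utf8_len(ch: str) -> int:
    """Byte length of one code point in UTF-8, letting lone surrogates through."""
    return len(ch.encode("utf-8", errors="surrogatepass"))


HIGH_SURROGATE = "\ud800"  # a lone high surrogate, U+D800
BMP_CHAR = "\u20ac"        # U+20AC EURO SIGN, an ordinary three-byte character


def max_pending_bytes() -> int:
    """Bytes needed when a held-back high surrogate and the next
    three-byte character are written out by a single call."""
    return utf8_len(HIGH_SURROGATE) + utf8_len(BMP_CHAR)


print(max_pending_bytes())  # prints 6
```

Each of the two code points takes three bytes, so a single call can produce up
to six bytes.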


>> - Due to the DCxx scheme, the three-byte UTF-8 encoding of DCxx would
>> roundtrip to a single-byte xx. Changing the code to something else
>> than DCxx wouldn't help.
>
> I don't understand this one.  That's not what I observe after I have
> changed the __utf8_wctomb and __utf8_mbtowc functions accordingly.
> A single byte 0x80 gets encoded to U+DC80.  The round trip results
> in \xed\xb2\x80.

Ah, I'd assumed that U+DCxx in filenames would continue to map to xx
(and vice versa). Either way, this would mean that filenames aren't
transparent: the name can change between open() and readdir().
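
The two behaviours can be illustrated with a short Python sketch. This is only
an analogy, not the Cygwin implementation: Python's "surrogateescape" handler
stands in for the DCxx scheme (invalid byte xx <-> U+DCxx) and "surrogatepass"
stands in for a converter that writes lone surrogates as three-byte sequences.
The function names are hypothetical.

```python
def roundtrip_passthrough(name: bytes) -> bytes:
    """Invalid bytes become U+DCxx on the way in; lone surrogates are
    written as three-byte sequences on the way out."""
    wide = name.decode("utf-8", errors="surrogateescape")
    return wide.encode("utf-8", errors="surrogatepass")


def roundtrip_escape(name: bytes) -> bytes:
    """Three-byte surrogate sequences are accepted on the way in; U+DCxx
    is mapped back to the single byte xx on the way out."""
    wide = name.decode("utf-8", errors="surrogatepass")
    return wide.encode("utf-8", errors="surrogateescape")


# Observed behaviour from the quoted reply: 0x80 -> U+DC80 -> ED B2 80.
print(roundtrip_passthrough(b"\x80"))      # b'\xed\xb2\x80'

# Assumed alternative: ED B2 80 -> U+DC80 -> 0x80.
print(roundtrip_escape(b"\xed\xb2\x80"))   # b'\x80'
```

In both cases the byte string that comes back differs from the one that went
in, which is the open()/readdir() mismatch described above.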

... pondering ...

Therefore I think that lone surrogates shouldn't be allowed after all,
because Unix filename transparency is more important than being able
to access Windows filenames with invalid UTF-16 (which can't have been
created within Cygwin).

Andy

