This is the mail archive of the
cygwin-developers
mailing list for the Cygwin project.
Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)
2009/9/27 Corinna Vinschen:
>> > What about this: The private use area U+f0xx is already used for ASCII
>> > chars invalid in Windows filenames. The same range can be used for
>> > invalid chars > 0x80. This could happen unconditionally.
>>
>> That's a great idea, allowing both lone surrogate support and Unix
>> filename transparency.
>>
>> [time passes]
>>
>> Nope, can't think of anything wrong with it. :)
>
> Did we get it? Did we actually get it?
Not quite. :(
If the Unix filename contains the UTF-8 representation of U+F0xx, that
will now roundtrip to just the xx byte. U+F000 is particularly
problematic, as that roundtrips to a null byte.
Solution: if f_mbtowc comes back with a U+F0xx, scratch that, and
instead turn each of the original bytes into a U+F0xx, i.e.:
\xEF\x80\x80 -> U+F0EF U+F080 U+F080
One for later?
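
A minimal sketch of that rule, written in Python rather than in the Cygwin C++ sources, and modelling the wide string as a plain list of code points. The names decode_one, to_wide and to_bytes are hypothetical and are not the real f_mbtowc interface; the handling of ASCII chars invalid in Windows filenames and of lone surrogates is left out.

```python
PUA_BASE = 0xF000  # start of the U+F0xx range discussed above


def in_escape_range(cp):
    """True if cp lies in U+F000..U+F0FF."""
    return PUA_BASE <= cp <= PUA_BASE + 0xFF


def decode_one(raw, i):
    """Return (code_point, length) for a valid UTF-8 sequence at raw[i:], else None."""
    for n in range(1, 5):
        try:
            return ord(raw[i:i + n].decode("utf-8")), n
        except UnicodeDecodeError:
            pass  # too short or invalid; try a longer chunk
    return None


def to_wide(raw):
    """Unix filename bytes -> list of code points, escaping bytes as U+F0xx."""
    out = []
    i = 0
    while i < len(raw):
        hit = decode_one(raw, i)
        if hit is None:
            # Invalid byte: map it to U+F0xx.
            out.append(PUA_BASE | raw[i])
            i += 1
            continue
        cp, n = hit
        if in_escape_range(cp):
            # Decoder came back with a U+F0xx: scratch that and
            # escape each of the original bytes instead.
            out.extend(PUA_BASE | b for b in raw[i:i + n])
        else:
            out.append(cp)
        i += n
    return out


def to_bytes(wide):
    """List of code points -> Unix filename bytes, undoing the U+F0xx escapes."""
    out = bytearray()
    for cp in wide:
        if in_escape_range(cp):
            out.append(cp & 0xFF)
        else:
            out += chr(cp).encode("utf-8")
    return bytes(out)


wide = to_wide(b"\xef\x80\x80")
print([hex(cp) for cp in wide])  # ['0xf0ef', '0xf080', '0xf080']
```

With the extra check in to_wide, the bytes \xEF\x80\x80 come back unchanged from to_bytes. Without it, the single code point U+F000 would be written out as a null byte.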
> I have a local implementation for the entire thing,
>
> - Ctrl-X instead of Ctrl-N
> - invalid \xXX bytes -> U+ffXX
> - Allow CESU-8 sequences for lone surrogate halves
> - Change documentation accordingly.
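
For reference, a lone surrogate half in CESU-8 style is just the ordinary three-byte UTF-8 bit pattern applied to the surrogate code point, which a strict UTF-8 decoder rejects. A small Python illustration, using the standard "surrogatepass" error handler as a stand-in; this is not taken from the patch, and strict_rejects is a hypothetical helper:

```python
# A lone high surrogate, as could appear in a Windows (UTF-16) filename.
high = "\ud800"

# Three-byte encoding of the lone surrogate half.
encoded = high.encode("utf-8", "surrogatepass")


def strict_rejects(data):
    """True if strict UTF-8 decoding refuses the byte sequence."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# Decoding with the same handler restores the lone surrogate.
restored = encoded.decode("utf-8", "surrogatepass")
print(encoded, strict_rejects(encoded), restored == high)
```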
Wow, that was quick!
> If you want to play with it, the entire patch is here (missing a ChangeLog
> for now):
>
>   http://cygwin.de/hopefully-last-big-cygwin-locale-patch.diff
Compile problem:
cc1plus: warnings being treated as errors
../../.././winsup/cygwin/syscalls.cc: In function ‘char* setlocale(int, const char*)’:
../../.././winsup/cygwin/syscalls.cc:4186: error: ‘w_cwd’ may be used uninitialized in this function
../../.././winsup/cygwin/syscalls.cc:4186: error: ‘w_path’ may be used uninitialized in this function
Looks like a false alarm though, and a pair of "=0"s made it compile.
Andy