Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)

Andy Koppe andy.koppe@gmail.com
Mon Sep 28 11:48:00 GMT 2009


2009/9/28 Corinna Vinschen:
> Thanks for the patch, but that won't work.  The problem is that ptr can
> validly be a NULL pointer if sys_cp_mbstowcs is called only to check
> for the length of the result.  With the above, you'll get crashes.

D'oh.

> In a case like this, you have to check the input string, along these
> lines:
>
>  if (((bytes = f_mbtowc ()) < 0)
>      || (bytes == 3 && pmbs[0] == 0xef && (pmbs[1] & 0xf4) == 0x80))
>    [...]

Makes sense.

Oh, and I thought of one more thing that won't round-trip correctly
from Unix to Windows and back: a high surrogate directly followed by a
low surrogate, each encoded as its own 3-byte UTF-8 sequence. On the
way back they'll combine into a single non-BMP codepoint represented
by a 4-byte sequence. That's very unlikely to happen by chance, though.
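
To make that concrete, here is a rough standalone sketch, not the actual
Cygwin code. The helper names (decode_3byte, utf16_to_utf8, roundtrip) are
made up for illustration, and the decoder is assumed to be lenient, i.e. it
lets surrogate codepoints through instead of rejecting them:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical lenient decoder: reads one 3-byte UTF-8 sequence and
   returns the 16-bit value, without rejecting surrogates. */
static uint16_t decode_3byte(const unsigned char *s)
{
    return (uint16_t)(((s[0] & 0x0f) << 12)
                      | ((s[1] & 0x3f) << 6)
                      | (s[2] & 0x3f));
}

/* Hypothetical encoder: UTF-16 units to UTF-8.  A high surrogate directly
   followed by a low surrogate is combined into one non-BMP codepoint;
   a lone surrogate is written out as a 3-byte sequence. */
static size_t utf16_to_utf8(const uint16_t *in, size_t n, unsigned char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        if (cp >= 0xd800 && cp <= 0xdbff && i + 1 < n
            && in[i + 1] >= 0xdc00 && in[i + 1] <= 0xdfff) {
            cp = 0x10000 + ((cp - 0xd800) << 10) + (in[i + 1] - 0xdc00);
            i++;  /* the low surrogate has been consumed as well */
        }
        if (cp < 0x80) {
            out[o++] = (unsigned char)cp;
        } else if (cp < 0x800) {
            out[o++] = (unsigned char)(0xc0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3f));
        } else if (cp < 0x10000) {
            out[o++] = (unsigned char)(0xe0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3f));
        } else {
            out[o++] = (unsigned char)(0xf0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3f));
        }
    }
    return o;
}

/* Unix -> Windows -> Unix: a run of 3-byte sequences in, UTF-8 out.
   in_len must be a multiple of 3 and at most 24 bytes; out needs room
   for in_len bytes.  Returns the number of bytes written. */
static size_t roundtrip(const unsigned char *in, size_t in_len,
                        unsigned char *out)
{
    uint16_t units[8];
    size_t n = in_len / 3;

    assert(in_len % 3 == 0 && n <= 8);
    for (size_t i = 0; i < n; i++)
        units[i] = decode_3byte(in + 3 * i);
    return utf16_to_utf8(units, n, out);
}
```

Feeding it the six bytes ED A0 80 ED B0 80 (U+D800 followed by U+DC00)
gives back the four bytes F0 90 80 80 (U+10000), so the name has changed.
A lone ED A0 80 comes back unchanged.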

I'll give the DLL with your patches a spin tonight.

Andy



More information about the Cygwin-developers mailing list