This is the mail archive of the cygwin-developers mailing list for the Cygwin project.



Re: Lone surrogates in UTF-8? (was: Re: Console codepage setting via chcp?)


On Sep 28 07:23, Andy Koppe wrote:
> 2009/9/28 Andy Koppe:
> > If the Unix filename contains the UTF-8 representation of U+F0xx, that
> > will now roundtrip to just the xx byte. U+F000 is particularly
> > problematic, as that roundtrips to a null byte.
> >
> > Solution: if f_mbtowc comes back with a U+F0xx, scratch that, and
> > instead turn each of the original bytes into a U+F0xx, i.e.:
> >
> > \xEF\x80\x80 -> U+F0EF U+F080 U+F080
> >
> > One for later?
> 
> Actually, I think there's a very simple way to implement this: just
> treat a U+F0xx result the same as an encoding error. For example:
> 
> --- strfuncs.cc.bak     2009-09-28 06:05:53.866000000 +0100
> +++ strfuncs.cc 2009-09-28 07:08:36.909000000 +0100
> @@ -602,9 +602,10 @@ sys_cp_mbstowcs (mbtowc_p f_mbtowc, cons
>                 *ptr = 0x18;
>             }
>         }
> -      else if ((bytes = f_mbtowc (_REENT, ptr, (const char *) pmbs, nms,
> -                                 charset, &ps)) < 0
> -              && *pmbs >= 0x80)
> +      else if (((bytes = f_mbtowc (_REENT, ptr, (const char *) pmbs, nms,
> +                                 charset, &ps)) < 0
> +               && *pmbs >= 0x80)
> +              || (*ptr & 0xff00) == 0xf000)
>         {
>           /* The technique is based on a discussion here:
>              http://www.mail-archive.com/linux-utf8@nl.linux.org/msg00080.html
> @@ -615,7 +616,7 @@ sys_cp_mbstowcs (mbtowc_p f_mbtowc, cons
>              to store them in a symmetric way. */
>           bytes = 1;
>           if (dst)
> -           *ptr = L'\xf080' | *pmbs;
> +           *ptr = L'\xf000' | *pmbs;
>           memset (&ps, 0, sizeof ps);
>         }

Thanks for the patch, but that won't work.  The problem is that ptr can
validly be a NULL pointer if sys_cp_mbstowcs is called only to check
for the length of the result.  With the above, the new *ptr test
dereferences that NULL pointer and you'll get crashes.
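
To illustrate the length-only calling convention, here is a minimal sketch.
It is not the Cygwin code: bytes_to_f0xx is a hypothetical helper that maps
every input byte to U+F0xx, and it writes to dst only when dst is non-NULL,
so a caller can pass NULL to learn the required length first.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Hypothetical helper, not part of Cygwin: map each of the nms input
   bytes to the wide character U+F0xx.  dst may be NULL, in which case
   nothing is stored and only the resulting length is returned.  */
static size_t
bytes_to_f0xx (wchar_t *dst, const unsigned char *src, size_t nms)
{
  size_t count = 0;

  assert (src != NULL);
  for (size_t i = 0; i < nms; i++)
    {
      /* Never touch dst in the length-only case.  */
      if (dst)
        dst[count] = (wchar_t) (0xf000 | src[i]);
      count++;
    }
  return count;
}
```

A caller would typically call it once with dst == NULL, allocate that many
wide characters, and call it again with the real buffer.  Any test that
reads the output, such as *ptr in the patch, has to be guarded the same way
or be replaced by a test on the input.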

In a case like this, you have to check the input string, along these
lines:

  if ((bytes = f_mbtowc ()) < 0
      || (bytes == 3 && pmbs[0] == 0xef && (pmbs[1] & 0xfc) == 0x80))
    [...]
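
As a standalone sketch of such an input-side test, assuming the charset is
UTF-8: U+F000..U+F0FF encode as the lead byte 0xEF, a second byte in
0x80..0x83, and a third byte in 0x80..0xBF, so the second byte can be tested
with the mask 0xfc.  The name is_utf8_f0xx is made up for this example and
does not exist in Cygwin.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of Cygwin: return true if the first
   three bytes at pmbs are the UTF-8 encoding of U+F000..U+F0FF.
   Only the input is examined, so no output pointer is needed.  */
static bool
is_utf8_f0xx (const unsigned char *pmbs, size_t nms)
{
  assert (pmbs != NULL);
  return nms >= 3
         && pmbs[0] == 0xef               /* lead byte for U+F000..U+FFFF */
         && (pmbs[1] & 0xfc) == 0x80      /* second byte 0x80..0x83 */
         && (pmbs[2] & 0xc0) == 0x80;     /* any continuation byte */
}
```

With this test, \xEF\x80\x80 (U+F000) and \xEF\x83\xBF (U+F0FF) are caught,
while \xEF\x84\x80 (U+F100) and \xEF\x88\x80 (U+F200) are left alone.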

> Btw, is the '*pmbs >= 0x80' check necessary there? ASCII bytes should
> pass unharmed through all encodings (well, at the start of a mbchar
> anyway), and if they didn't, we'd probably still want to encode them
> as U+F0xx.

You're right.  That's a check we can safely omit.


Thanks,
Corinna

-- 
Corinna Vinschen                  Please, send mails regarding Cygwin to
Cygwin Project Co-Leader          cygwin AT cygwin DOT com
Red Hat

