This is the mail archive of the cygwin-developers mailing list for the Cygwin project.



Re: charset changes


Thomas Wolff:
> I do handle GB18030 in mined; if you point me to where in newlib this is
> handled, I may try a patch.

You'd need to add support for it to the loadlocale function in
newlib/libc/locale/locale.c, which means implementing a couple of
functions: __gb18030_wctomb and __gb18030_mbtowc. Cygwin's
implementations of the other CJK charsets are in
winsup/cygwin/strfuncs.cc.
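
To give a rough idea of the wctomb direction, here is a minimal sketch
covering only the supplementary planes, where GB18030's four-byte
encoding is a plain linear mapping starting at 90 30 81 30 for U+10000.
The helper name and signature are made up for illustration and are not
newlib's actual conversion interface; the BMP part of the charset needs
mapping tables and is left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not newlib's interface: encode a
   supplementary-plane code point (U+10000..U+10FFFF) as a four-byte
   GB18030 sequence.  Returns 4 on success, or 0 if cp is outside
   that range.  */
static size_t
gb18030_encode_supplementary (uint32_t cp, unsigned char out[4])
{
  if (cp < 0x10000 || cp > 0x10FFFF)
    return 0;

  /* Linear index into the four-byte space that starts at 90 30 81 30.
     Bytes 2 and 4 have 10 possible values (0x30..0x39), byte 3 has
     126 (0x81..0xFE).  */
  uint32_t idx = cp - 0x10000;
  out[3] = (unsigned char) (0x30 + idx % 10);
  idx /= 10;
  out[2] = (unsigned char) (0x81 + idx % 126);
  idx /= 126;
  out[1] = (unsigned char) (0x30 + idx % 10);
  idx /= 10;
  out[0] = (unsigned char) (0x90 + idx);
  return 4;
}
```

A real __gb18030_wctomb would first have to combine a surrogate pair
coming from two 16-bit wchars before it could apply this arithmetic.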

>> four-byte GB18030
>> sequences may map both to BMP and non-BMP Unicode codepoints. With
>> Cygwin's wchar being 16-bit, this means that two wchars may have to be
>> returned for one GB18030 sequence. Yet mbrtowc can only return one
>> wchar, and unlike with UTF-8, there's no way to tell before the last
>> byte whether two wchars are needed. I don't see a way to address that
>> without bending the mbrtowc spec.
>>
> This sounds tricky; my encoding support in mined is not based on the wchar
> functions, so it may not be straightforward to
> map it into newlib, but I'll see. On the other hand, other systems do handle
> it, too, so there is an open source solution...

Other systems usually have a 32-bit wchar, though. I can see three
ways to tackle the issue, but none of them is entirely satisfactory.
When encountering a 4-byte sequence in __gb18030_mbtowc that maps to a
non-BMP char (and hence a UTF-16 surrogate pair):
1. Just report an invalid sequence. BMP-only support would probably
still cover most practical needs.
2. Write the high surrogate and report that one byte fewer than
actually seen has been consumed. On the next mbtowc call, ignore the
input, write the low surrogate, and report that 1 byte has been
consumed. Unfortunately this scheme breaks down if the user feeds in
the bytes one-by-one, as Corinna previously found when handling UTF-8
like this.
3. Write the high surrogate and report the actual number of bytes
consumed. On the next call, write the low surrogate, and return 0 to
indicate that no bytes have been consumed. Trouble is, a return value
of 0 from mbrtowc is supposed to indicate that a null character has
been found. While uses within Cygwin could be changed to recognise
string end by instead looking at the character actually written, this
would lead to truncated strings in applications.
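
For illustration, option 3 could look roughly like the sketch below.
It is only a sketch under several assumptions: the function and state
struct are hypothetical and do not match newlib's real signatures,
uint16_t stands in for Cygwin's 16-bit wchar, only four-byte
supplementary-plane sequences are handled, and incomplete input is
simply reported as an error rather than distinguished the way mbrtowc
would.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical conversion state: the low surrogate still owed to the
   caller, or 0 if nothing is pending.  */
struct gb_state
{
  uint16_t pending_low;
};

/* Sketch of option 3.  Returns the number of bytes consumed, 0 when it
   hands out a pending low surrogate, or (size_t) -1 for anything this
   sketch does not handle.  */
static size_t
gb18030_sketch_mbtowc (uint16_t *pwc, const unsigned char *s, size_t n,
                       struct gb_state *st)
{
  if (st->pending_low)
    {
      /* Second call: deliver the low surrogate, consume no input.
         This 0 is indistinguishable from the "null character found"
         return value, which is the problem described above.  */
      *pwc = st->pending_low;
      st->pending_low = 0;
      return 0;
    }

  if (n < 4
      || s[0] < 0x90 || s[0] > 0xE3
      || s[1] < 0x30 || s[1] > 0x39
      || s[2] < 0x81 || s[2] > 0xFE
      || s[3] < 0x30 || s[3] > 0x39)
    return (size_t) -1;

  /* Linear offset from U+10000; 90 30 81 30 maps to offset 0.  */
  uint32_t idx = (((uint32_t) (s[0] - 0x90) * 10 + (s[1] - 0x30)) * 126
                  + (s[2] - 0x81)) * 10 + (s[3] - 0x30);
  if (idx > 0xFFFFF)
    return (size_t) -1;         /* beyond U+10FFFF */

  /* First call: write the high surrogate, remember the low one, and
     report all four bytes as consumed.  */
  *pwc = (uint16_t) (0xD800 + (idx >> 10));
  st->pending_low = (uint16_t) (0xDC00 + (idx & 0x3FF));
  return 4;
}
```

A caller that treats a 0 return as end of string would stop after the
high surrogate, which is exactly the truncation risk for applications.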

Andy

