This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.


Re: [PATCH] CJK ambiguous width for non-Unicode charsets


On Nov  9 22:06, Andy Koppe wrote:
> The attached small patch affects character widths as reported by
> wcwidth(). It addresses an obscure issue.
> 
> The CJK ambiguous width category contains characters that are one
> character cell wide in some contexts and two cells in others. That
> category doesn't actually contain CJK characters as such, but things
> like the Greek and Cyrillic alphabets, accented Latin characters, and
> also line drawing characters. These are usually one cell wide, but in
> CJK legacy encodings such as SJIS or GBK, they were encoded as two
> bytes, and the usual practice was to have the display width correspond
> to the number of bytes. Accordingly, CJK terminal fonts usually have
> double-width glyphs for the affected characters. See also
> http://unicode.org/reports/tr11/#Ambiguous.
> 
> Newlib currently decides which width to use based on the selected
> LC_CTYPE locale, i.e. it will use double width for "zh", "ja", and
> "ko" locales, and single width for everything else, independent of the
> selected character set. The attached patch changes this so that single
> width will always be used for single-byte encodings such as the
> ISO-8859 ones, and that double width will always be used for the CJK
> legacy encodings. For UTF-8, the decision will still be made based on
> the locale. The @cjknarrow modifier can still be used to force single
> width, independent of locale and encoding.
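
The rule described in the paragraph above can be sketched in C as follows. This is only an illustration, not the code from the patch: the enum charset_kind, its constants, and the function ambiguous_width() are hypothetical names, and the charset classification is assumed to have been done by the caller. The language check assumes the ISO 639 prefixes "zh", "ja", and "ko".

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical charset classes for illustration only; newlib's
   locale.c does not necessarily use these names. */
enum charset_kind {
    CHARSET_SINGLE_BYTE,  /* e.g. the ISO-8859 family */
    CHARSET_CJK_LEGACY,   /* e.g. SJIS, GBK */
    CHARSET_UTF8
};

/* Return the cell width (1 or 2) to use for characters in the
   CJK ambiguous width category.
   lang      - language part of the LC_CTYPE locale, e.g. "ja_JP" or "C"
   kind      - class of the selected charset (assumed known by the caller)
   cjknarrow - true if the @cjknarrow modifier was given */
int ambiguous_width(const char *lang, enum charset_kind kind, bool cjknarrow)
{
    /* The modifier forces single width regardless of locale and encoding. */
    if (cjknarrow)
        return 1;

    switch (kind) {
    case CHARSET_SINGLE_BYTE:
        return 1;  /* always narrow */
    case CHARSET_CJK_LEGACY:
        return 2;  /* always wide, matching the two-byte encoding */
    case CHARSET_UTF8:
        break;     /* decided by the language below */
    }

    /* UTF-8: wide only for Chinese, Japanese, and Korean locales. */
    if (strncmp(lang, "zh", 2) == 0 ||
        strncmp(lang, "ja", 2) == 0 ||
        strncmp(lang, "ko", 2) == 0)
        return 2;
    return 1;
}
```

Under these assumptions, C.GBK maps to ambiguous_width("C", CHARSET_CJK_LEGACY, false), which yields 2, while de_DE.UTF-8 maps to ambiguous_width("de_DE", CHARSET_UTF8, false), which yields 1.
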
> 
> The point of this is to fit in with the historical use of those legacy
> encodings, since the ambiguity only arose once the different charsets
> were combined into Unicode. I doubt anyone is using nonsensical
> locale/encoding combinations such as de_DE.GBK or ja_JP.ISO-8859-1, so
> this is primarily about the likes of C.GBK and C.SJIS. Those are
> currently ambiguous-narrow, but vim for example treats them as
> ambiguous-wide, which makes for "interesting" effects when editing
> files containing affected characters. The patch here fixes that.
> 
> Tested in Cygwin. I assume this will need to wait for Corinna's return.
> 
> 	* libc/locale/locale.c: Fix ambiguous width to one for single-byte
> 	charsets and two for non-Unicode multibyte charsets.

This appears to make a lot of sense.  Would you mind enhancing your
patch slightly to also fix the description in the locale.c
documentation?  There's a related paragraph starting with "This
implementation also supports a single modifier, <<"cjknarrow">>..."


Thanks,
Corinna

-- 
Corinna Vinschen
Cygwin Project Co-Leader
Red Hat
