This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.



[PATCH] CJK ambiguous width for non-Unicode charsets


The attached small patch affects character widths as reported by
wcwidth(). It addresses an obscure issue.

The CJK ambiguous width category contains characters that are one
character cell wide in some contexts and two cells in others. That
category doesn't actually contain CJK characters as such, but things
like the Greek and Cyrillic alphabets, accented Latin characters, and
also line drawing characters. These are usually one cell wide, but in
CJK legacy encodings such as SJIS or GBK, they were encoded as two
bytes, and the usual practice was to have the display width correspond
to the number of bytes. Accordingly, CJK terminal fonts usually have
double-width glyphs for the affected characters. See also
http://unicode.org/reports/tr11/#Ambiguous.

Newlib currently decides which width to use based on the selected
LC_CTYPE locale, i.e. it will use double width for "zh", "ja", and
"ko" locales, and single width for everything else, independent of the
selected character set. The attached patch changes this so that single
width will always be used for single-byte encodings such as the
ISO-8859 ones, and that double width will always be used for the CJK
legacy encodings. For UTF-8, the decision will still be made based on
the locale. The @cjknarrow modifier can still be used to force single
width, independent of locale and encoding.
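
To make the proposed rules concrete, here is a minimal sketch of the
decision in C. It is not the code from the attached patch: the function
name ambiguous_width(), the helper name_in_list(), and the short list
of legacy charset names are assumptions made up for illustration, and
the real charset names and checks in libc/locale/locale.c may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: is `name` present in a NULL-terminated list? */
static int name_in_list(const char *name, const char *const *list)
{
    for (; *list != NULL; list++)
        if (strcmp(name, *list) == 0)
            return 1;
    return 0;
}

/*
 * Hypothetical sketch of the rule described above. Returns the number
 * of cells (1 or 2) for a CJK ambiguous-width character.
 *   lang      - language part of the locale, e.g. "ja_JP" or "C"
 *   charset   - charset part of the locale, e.g. "UTF-8" or "GBK"
 *   cjknarrow - non-zero if the @cjknarrow modifier was given
 */
int ambiguous_width(const char *lang, const char *charset, int cjknarrow)
{
    /* Illustrative subset only; the names are assumptions. */
    static const char *const cjk_legacy[] = {
        "SJIS", "GBK", "EUCJP", "EUCKR", "BIG5", NULL
    };
    static const char *const cjk_langs[] = { "zh", "ja", "ko", NULL };

    assert(lang != NULL && charset != NULL);

    /* @cjknarrow forces single width regardless of anything else. */
    if (cjknarrow)
        return 1;

    /* UTF-8: decide by the first two letters of the language. */
    if (strcmp(charset, "UTF-8") == 0) {
        char prefix[3] = { lang[0], lang[0] ? lang[1] : '\0', '\0' };
        return name_in_list(prefix, cjk_langs) ? 2 : 1;
    }

    /* CJK legacy multibyte encodings: always double width. */
    if (name_in_list(charset, cjk_legacy))
        return 2;

    /* Everything else is treated as single-byte here: single width. */
    return 1;
}
```

Under these assumptions, ambiguous_width("C", "GBK", 0) gives 2,
ambiguous_width("ja_JP", "ISO-8859-1", 0) gives 1, and
ambiguous_width("de_DE", "UTF-8", 0) gives 1, while passing a non-zero
cjknarrow always gives 1.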

The point of this is to fit in with the historical use of those legacy
encodings, since the ambiguity only arose once the different charsets
were combined into Unicode. I doubt anyone is using nonsensical
locale/encoding combinations such as de_DE.GBK or ja_JP.ISO-8859-1, so
this is primarily about the likes of C.GBK and C.SJIS. Those are
currently ambiguous-narrow, but vim for example treats them as
ambiguous-wide, which makes for "interesting" effects when editing
files containing affected characters. The patch here fixes that.
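
For anyone who wants to check what a given locale reports, a small
probe along the following lines may help. It is only a sketch: the
function name cell_width_in_locale() and the sentinel
WIDTH_LOCALE_UNAVAILABLE are made up for illustration, and whether a
locale name such as "C.GBK" is accepted at all depends on the platform.

```c
#define _XOPEN_SOURCE 700   /* request the wcwidth() declaration */
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical sentinel; wcwidth() itself never returns less than -1. */
#define WIDTH_LOCALE_UNAVAILABLE (-2)

/*
 * Report wcwidth(wc) as seen under the LC_CTYPE locale `locale_name`,
 * then restore the previous LC_CTYPE setting.
 */
int cell_width_in_locale(const char *locale_name, wchar_t wc)
{
    char saved[256];
    const char *current;
    int width;

    assert(locale_name != NULL);

    /* Copy the current setting: the returned string may be reused. */
    current = setlocale(LC_CTYPE, NULL);
    strncpy(saved, current != NULL ? current : "C", sizeof saved - 1);
    saved[sizeof saved - 1] = '\0';

    /* On failure the locale is left unchanged, so nothing to restore. */
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return WIDTH_LOCALE_UNAVAILABLE;

    width = wcwidth(wc);
    setlocale(LC_CTYPE, saved);
    return width;
}
```

Calling cell_width_in_locale("C.GBK", (wchar_t)0x03B1) queries the
width of the Greek small letter alpha, one of the ambiguous-width
characters; the result depends on the C library and on whether the
patch is applied.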

Tested in Cygwin. I assume this will need to wait for Corinna's return.

	* libc/locale/locale.c: Fix ambiguous width to one for single-byte
	charsets and two for non-Unicode multibyte charsets.

Regards,
Andy

Attachment: ambiwidth.patch
Description: Binary data

