This is the mail archive of the newlib@sourceware.org mailing list for the newlib project.


Re: KOI8 character sets

From: Jeff Johnston <jjohnstn at redhat dot com>
To: newlib at sourceware dot org
Cc: Andy Koppe <andy dot koppe at gmail dot com>
Date: Mon, 24 Aug 2009 18:14:55 -0400
Subject: Re: KOI8 character sets
References: <416096c60908220920n2b241394y66ac8dda9ce6f5a9@mail.gmail.com> <20090824200002.GA4969@calimero.vinschen.de>

Andy's patch checked in with Corinna's documentation change as well. I changed the documentation content to just state <<EUCJP>> and not <<EUCJP>>/<<eucJP>> and later on I used EUCJP and eucJP as an example of case-insensitivity rather than UTF-8. This made it easier to apply the doc patch and it made it clear that eucJP was still valid.

-- Jeff J.

Corinna Vinschen wrote:

On Aug 22 17:20, Andy Koppe wrote:
The attached patch adds support for the KOI8-R and KOI8-U character
sets. These are the de-facto standard character sets on Unix machines
and the Net in Russia, Ukraine, and other ex-Soviet states.
(ISO-8859-5, designed for all Cyrillic scripts, apparently never found
much acceptance.)
Under Windows they are known as codepages 20866 and 21866. Since they
are single-byte encodings with printable characters in the C1 range
from 0x80 to 0x9F, it seems best to handle them like DOS/Windows
codepages. The conversion tables were adapted from the iconv ones.
Tested on Cygwin 1.7.
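
To illustrate what a single-byte conversion table of this kind looks like, here is a minimal sketch. It is not the code from the patch: the names koi8r_sample and koi8r_sample_to_unicode are hypothetical, and the table covers only the eight KOI8-R bytes 0xC0 to 0xC7, where a real table would cover the whole range 0x80 to 0xFF.

```c
#include <stdint.h>

/* Hypothetical partial table (illustration only): Unicode code points
   for the KOI8-R bytes 0xC0..0xC7.  A complete table would have one
   entry for each of the 128 bytes from 0x80 to 0xFF. */
static const uint16_t koi8r_sample[8] = {
  0x044E, /* 0xC0: CYRILLIC SMALL LETTER YU  */
  0x0430, /* 0xC1: CYRILLIC SMALL LETTER A   */
  0x0431, /* 0xC2: CYRILLIC SMALL LETTER BE  */
  0x0446, /* 0xC3: CYRILLIC SMALL LETTER TSE */
  0x0434, /* 0xC4: CYRILLIC SMALL LETTER DE  */
  0x0435, /* 0xC5: CYRILLIC SMALL LETTER IE  */
  0x0444, /* 0xC6: CYRILLIC SMALL LETTER EF  */
  0x0433, /* 0xC7: CYRILLIC SMALL LETTER GHE */
};

/* Convert one KOI8-R byte to a Unicode code point.  ASCII bytes map to
   themselves; bytes outside the sample range yield U+FFFD because this
   sketch does not carry the full table. */
uint32_t koi8r_sample_to_unicode(unsigned char b)
{
  if (b < 0x80)
    return b;
  if (b >= 0xC0 && b <= 0xC7)
    return koi8r_sample[b - 0xC0];
  return 0xFFFD;
}
```

Because every byte maps to exactly one code point, the lookup needs no state, which is why these charsets can be treated like any other single-byte codepage.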

ChangeLog:
2009-08-22  Corinna Vinschen  <corinna@vinschen.de>
        * libc/locale/locale.c (loadlocale): Map "KOI8-R" and "KOI8-U" to
        CP20866 and CP21866.
2009-08-22  Andy Koppe  <andy.koppe@gmail.com>
        * libc/stdlib/sb_charsets.c (__cp_conv): Add KOI8-R (Russian, CP20866)
        and KOI8-U (Ukrainian, CP21866) to Windows codepage conversion tables.
        * libc/ctype/ctype_cp.h (__ctype_cp): Likewise for ctype tables.

The documentation in libc/locale/locale.c should note the KOI8 charsets
as well:
Index: libc/locale/locale.c
===================================================================
RCS file: /cvs/src/src/newlib/libc/locale/locale.c,v
retrieving revision 1.23
diff -u -p -r1.23 locale.c
--- libc/locale/locale.c 21 Aug 2009 20:56:13 -0000 1.23
+++ libc/locale/locale.c 24 Aug 2009 19:57:42 -0000
@@ -54,20 +54,21 @@ the form
 <<"language">> is a two character string per ISO 639. <<"TERRITORY">> is a
 country code per ISO 3166. For <<"charset">> and <<"modifier">> see below.

-Additionally to the POSIX specifier, five extensions are supported for
+Additionally to the POSIX specifier, seven extensions are supported for
 backward compatibility with older implementations using newlib:
-<<"C-UTF-8">>, <<"C-JIS">>, <<"C-EUCJP">>/<<"C-eucJP">>, <<"C-SJIS">>,
-<<"C-ISO-8859-x">> with 1 <= x <= 15, or <<"C-CPxxx">> with xxx in [437,
-720, 737, 775, 850, 852, 855, 857, 858, 862, 866, 874, 1125, 1250, 1251,
-1252, 1253, 1254, 1255, 1256, 1257, 1258].
+<<"C-UTF-8">>, <<"C-JIS">>, <<"C-eucJP">>, <<"C-SJIS">>, <<C-KOI8-R>>,
+<<C-KOI8-U>>, <<"C-ISO-8859-x">> with 1 <= x <= 15, or <<"C-CPxxx">> with
+xxx in [437, 720, 737, 775, 850, 852, 855, 857, 858, 862, 866, 874, 1125,
+1250, 1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258].

 Even when using POSIX locale strings, the only charsets allowed are
-<<"UTF-8">>, <<"JIS">>, <<"EUCJP">>/<<"eucJP">>, <<"SJIS">>, <<"ISO-8859-x">>
-with 1 <= x <= 15, or <<"CPxxx">> with xxx in [437, 720, 737, 775, 850,
-852, 855, 857, 858, 862, 866, 874, 1125, 1250, 1251, 1252, 1253, 1254,
-1255, 1256, 1257, 1258]. Charsets are case insensitive. For instance,
-<<"UTF-8">> and <<"utf-8">> are equivalent. <<"UTF-8">> can also be
-written without dash, as in <<"UTF8">> or <<"utf8">>.
+<<"UTF-8">>, <<"JIS">>, <<"eucJP">>, <<"SJIS">>, <<KOI8-R>>, <<KOI8-U>>,
+<<"ISO-8859-x">> with 1 <= x <= 15, or <<"CPxxx">> with xxx in
+[437, 720, 737, 775, 850, 852, 855, 857, 858, 862, 866, 874, 1125, 1250,
+1251, 1252, 1253, 1254, 1255, 1256, 1257, 1258].
+Charsets are case insensitive. For instance, <<"UTF-8">> and <<"utf-8">>
+are equivalent. <<"UTF-8">> can also be written without dash, as in
+<<"UTF8">> or <<"utf8">>.

 (<<"">> is also accepted; if given, the settings are read from the
 corresponding LC_* environment variables and $LANG according to POSIX rules.
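
The name-to-codepage mapping and the case-insensitive matching described here could be sketched roughly as follows. This is an illustration only, not the actual loadlocale code: the function name koi8_charset_to_codepage is hypothetical, and it assumes the POSIX strcasecmp from <strings.h> is available.

```c
#include <stddef.h>
#include <strings.h>  /* strcasecmp (POSIX) */

/* Hypothetical helper (illustration only): return the Windows codepage
   number for a KOI8 charset name, compared case-insensitively, or -1
   if the name is not one of the two KOI8 charsets. */
int koi8_charset_to_codepage(const char *charset)
{
  if (charset == NULL)
    return -1;
  if (strcasecmp(charset, "KOI8-R") == 0)
    return 20866;   /* Russian */
  if (strcasecmp(charset, "KOI8-U") == 0)
    return 21866;   /* Ukrainian */
  return -1;
}
```

With this sketch, "KOI8-R" and "koi8-r" both yield 20866, matching the rule that charset names are case insensitive.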

Corinna

References:
- KOI8 character sets
  - From: Andy Koppe
- Re: KOI8 character sets
  - From: Corinna Vinschen
