This is the mail archive of the libc-alpha@sources.redhat.com mailing list for the glibc project.
EUC-KR and the Won sign

To: libc-alpha at sources dot redhat dot com
Subject: EUC-KR and the Won sign
From: Bruno Haible <haible at ilog dot fr>
Date: Sun, 15 Oct 2000 13:58:41 +0200 (CEST)

Hi Ulrich,

Your latest EUC-KR modification unfortunately violates an ISO C 99 requirement.
Section 7.17.(2) of that standard says (about wchar_t):
   the null character shall have the code value zero and each member of
   the basic character set shall have a code value equal to its value when
   used as the lone character in an integer character constant.
The basic character set is enumerated in section 5.2.1.(3) and contains
the backslash character. Since backslash has the wide character code 0x005C,
it must therefore also have the code 0x5C in every locale. In other words, all
encodings listed in the SUPPORTED file must be ASCII compatible (at least
for the graphic characters; the control characters, '$' and '@' are excluded
from this requirement).
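
To illustrate the constraint (this is not part of the patch), here is a
minimal sketch that checks it at run time for the current locale. The
helper name count_basic_charset_mismatches is hypothetical; only btowc()
is a standard function.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>

/* Members of the basic character set per ISO C 99 5.2.1.(3).  The
   terminating null character of the string is checked as well.  */
static const char basic_charset[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
  "abcdefghijklmnopqrstuvwxyz"
  "0123456789"
  "!\"#%&'()*+,-./:;<=>?[\\]^_{|}~"
  " \t\v\f";

/* Hypothetical helper (not a glibc function): count the members of the
   basic character set whose wide character code in the current locale
   differs from their 'char' code.  */
size_t
count_basic_charset_mismatches (void)
{
  size_t mismatches = 0;

  /* 26 + 26 letters, 10 digits, 29 graphic characters, space and three
     control characters, plus the terminating null character.  */
  assert (sizeof basic_charset == 26 + 26 + 10 + 29 + 4 + 1);

  for (size_t i = 0; i < sizeof basic_charset; ++i)
    {
      unsigned char c = (unsigned char) basic_charset[i];

      /* btowc converts a single byte to its wide character code.  */
      if (btowc (c) != (wint_t) c)
	++mismatches;
    }

  return mismatches;
}
```

Without a prior setlocale() call this examines the "C" locale; calling
setlocale (LC_ALL, "") first would examine the user's locale instead. A
result other than zero indicates a locale encoding that violates the
requirement quoted above.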

Here is a patch that adds a check for this constraint and reverts the
EUC-KR charmap and converter so that they pass the check.
The Unicode half-width WON sign (U20A9) is now mapped to the EUC-KR
full-width WON sign (\xA3\xDC) because that makes more sense than mapping
it to backslash (and on Unix we don't use backslash as a
directory/filename separator). Recall
that Jungshik Shin wrote on 2000-09-25 (talking about JOHAB, but I
suspect this also holds for EUC-KR):
     "To represent WON SIGN, people usually use FULL-WIDTH WON SIGN"

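For illustration only, the mapping rule described above can be sketched as
a standalone function. The name sketch_ucs4_to_euckr is hypothetical, and
the sketch covers only the single-byte range and the two WON signs; the
real converter uses ucs4_to_ksc5601() for the full KS C 5601 table.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of glibc): encode CH into OUT and return
   the number of bytes written, or 0 if this sketch has no mapping.  */
size_t
sketch_ucs4_to_euckr (uint32_t ch, unsigned char *out)
{
  assert (out != NULL);

  if (ch <= 0x9f)
    {
      /* Single-byte range is passed through unchanged, so U005C stays
	 \x5c.  */
      out[0] = (unsigned char) ch;
      return 1;
    }

  /* Map the half-width WON sign to the full-width WON sign.  */
  if (ch == 0x20a9)
    ch = 0xffe6;

  if (ch == 0xffe6)
    {
      /* KS C 5601 code 0x23 0x5c with the high bits set gives \xA3\xDC.  */
      out[0] = 0x23 | 0x80;
      out[1] = 0x5c | 0x80;
      return 2;
    }

  /* Everything else would need the KS C 5601 table lookup.  */
  return 0;
}
```

With this rule, U20A9 and UFFE6 both encode to \xA3\xDC, and decoding
\xA3\xDC gives UFFE6, which is why the U20A9 mapping is listed as
irreversible.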

ChangeLog:
2000-10-14  Bruno Haible  <haible@clisp.cons.org>

	* locale/programs/charmap.c (charmap_read): Verify ASCII
	compatibility of charmap.

	* iconvdata/euc-kr.c (euckr_from_ucs4): Remove function, integrate
	at call site.
	(BODY for FROM_LOOP): Change 0x5c mapping back to U005C.
	(BODY for TO_LOOP): Integrate euckr_from_ucs4. Map U20A9 to \xA3\xDC.
	* iconvdata/testdata/EUC-KR..UTF8: Adjust to this change.
	* iconvdata/EUC-KR.irreversible: Likewise.

localedata/ChangeLog:
2000-10-14  Bruno Haible  <haible@clisp.cons.org>

	* charmaps/EUC-KR: Change \x5c mapping back to U005C.
	* locales/ko_KR: Change currency_symbol back to UFFE6.

*** glibc-20001010/locale/programs/charmap.c.bak	Mon Oct  2 16:09:24 2000
--- glibc-20001010/locale/programs/charmap.c	Sun Oct 15 12:05:50 2000
***************
*** 192,197 ****
--- 192,254 ----
  	       DEFAULT_CHARMAP);
      }
  
+   /* Test of ASCII compatibility of locale encoding.
+ 
+      Verify that the encoding to be used in a locale is ASCII compatible,
+      at least for the graphic characters, excluding the control characters,
+      '$' and '@'.  This constraint comes from an ISO C 99 restriction.
+ 
+      ISO C 99 section 7.17.(2) (about wchar_t):
+        the null character shall have the code value zero and each member of
+        the basic character set shall have a code value equal to its value
+        when used as the lone character in an integer character constant.
+      ISO C 99 section 5.2.1.(3):
+        Both the basic source and basic execution character sets shall have
+        the following members: the 26 uppercase letters of the Latin alphabet
+             A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
+        the 26 lowercase letters of the Latin alphabet
+             a b c d e f g h i j k l m n o p q r s t u v w x y z
+        the 10 decimal digits
+             0 1 2 3 4 5 6 7 8 9
+        the following 29 graphic characters
+             ! " # % & ' ( ) * + , - . / : ; < = > ? [ \ ] ^ _ { | } ~
+        the space character, and control characters representing horizontal
+        tab, vertical tab, and form feed.
+ 
+      Therefore, for all members of the "basic character set", the 'char' code
+      must have the same value as the 'wchar_t' code, which in glibc is the
+      same as the Unicode code, which for all of the enumerated characters
+      is identical to the ASCII code. */
+   if (result != NULL)
+     {
+       static const char basic_charset[] =
+ 	{
+ 	  'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
+ 	  'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z',
+ 	  'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm',
+ 	  'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z',
+ 	  '0', '1', '2', '3', '4', '5', '6', '7', '8', '9',
+ 	  '!', '"', '#', '%', '&', '\'', '(', ')', '*', '+', ',', '-',
+ 	  '.', '/', ':', ';', '<', '=', '>', '?', '[', '\\', ']', '^',
+ 	  '_', '{', '|', '}', '~', ' ', '\t', '\v', '\f', '\0'
+ 	};
+       int failed = 0;
+       const char *p = basic_charset;
+ 
+       do
+ 	{
+ 	  struct charseq * seq = charmap_find_symbol (result, p, 1);
+ 
+ 	  if (seq == NULL || seq->ucs4 != *p)
+ 	    failed = 1;
+ 	}
+       while (*p++ != '\0');
+ 
+       if (failed)
+ 	error (0, 0, _("character map `%s' is not ASCII compatible"),
+ 	       result->code_set_name);
+     }
+ 
    return result;
  }
  
*** glibc-20001010/iconvdata/euc-kr.c.bak	Sat Oct 14 19:38:28 2000
--- glibc-20001010/iconvdata/euc-kr.c	Sat Oct 14 19:47:58 2000
***************
*** 24,59 ****
  #include <ksc5601.h>
  
  
- static inline void
- euckr_from_ucs4 (uint32_t ch, unsigned char *cp)
- {
-   if (ch > 0x9f)
-     {
-       if (__builtin_expect (ch, 0) == 0x20a9)
- 	{
- 	  /* Half-width Korean Currency WON sign.  */
- 	  cp[0] = '\\';
- 	  cp[1] = '\0';
- 	}
-       else if (__builtin_expect (ucs4_to_ksc5601 (ch, cp, 2), 0)
- 	  != __UNKNOWN_10646_CHAR)
- 	{
- 	  cp[0] |= 0x80;
- 	  cp[1] |= 0x80;
- 	}
-       else
- 	cp[0] = '\0';
-     }
-   else
-     {
-       /* There is no mapping for U005c but we nevertheless map it to
- 	 \x5c.  */
-       cp[0] = (unsigned char) ch;
-       cp[1] = '\0';
-     }
- }
- 
- 
  /* Definitions used in the body of the `gconv' function.  */
  #define CHARSET_NAME		"EUC-KR//"
  #define FROM_LOOP		from_euc_kr
--- 24,29 ----
***************
*** 75,88 ****
      uint32_t ch = *inptr;						      \
  									      \
      if (ch <= 0x9f)							      \
!       {									      \
! 	/* Plain ASCII with one exception.  */				      \
! 	if (ch == 0x5c)							      \
! 	  /* Half-width Korean Currency WON sign.  */			      \
! 	  ch = 0x20a9;							      \
! 	++inptr;							      \
!       }									      \
!     /* 0xfe(->0x7e : row 94) and 0xc9(->0x59 : row 41) are		      \
         user-defined areas.  */						      \
      else if (__builtin_expect (ch, 0xa1) == 0xa0			      \
  	     || __builtin_expect (ch, 0xa1) > 0xfe			      \
--- 45,53 ----
      uint32_t ch = *inptr;						      \
  									      \
      if (ch <= 0x9f)							      \
!       /* Plain ASCII .  */						      \
!       ++inptr;								      \
!     /* 0xfe(->0x7e : row 94) and 0xc9(->0x49 : row 41) are		      \
         user-defined areas.  */						      \
      else if (__builtin_expect (ch, 0xa1) == 0xa0			      \
  	     || __builtin_expect (ch, 0xa1) > 0xfe			      \
***************
*** 143,171 ****
      uint32_t ch = get32 (inptr);					      \
      unsigned char cp[2];						      \
  									      \
!     /* Decomposing Hangul syllables not available in KS C 5601 into	      \
!        Jamos should be considered either here or in euckr_from_ucs4() */      \
!     euckr_from_ucs4 (ch, cp) ;						      \
! 									      \
!     if (__builtin_expect (cp[0], '\1') == '\0' && ch != 0)		      \
        {									      \
! 	/* Illegal character.  */					      \
! 	STANDARD_ERR_HANDLER (4);					      \
!       }									      \
  									      \
!     *outptr++ = cp[0];							      \
!     /* Now test for a possible second byte and write this if possible.  */    \
!     if (cp[1] != '\0')							      \
!       {									      \
! 	if (__builtin_expect (outptr >= outend, 0))			      \
  	  {								      \
! 	    /* The result does not fit into the buffer.  */		      \
! 	    --outptr;							      \
  	    result = __GCONV_FULL_OUTPUT;				      \
  	    break;							      \
  	  }								      \
! 	*outptr++ = cp[1];						      \
        }									      \
  									      \
      inptr += 4;								      \
    }
--- 108,142 ----
      uint32_t ch = get32 (inptr);					      \
      unsigned char cp[2];						      \
  									      \
!     if (ch > 0x9f)							      \
        {									      \
! 	/* Map half-width Korean Currency WON sign			      \
! 	   to full-width Korean Currency WON sign.  */			      \
! 	if (__builtin_expect (ch == 0x20a9, 0))				      \
! 	  ch = 0xffe6;							      \
! 									      \
! 	/* Decomposing Hangul syllables not available in KS C 5601 into	      \
! 	   Jamos should be considered here.  */				      \
! 	if (__builtin_expect						      \
! 	    (ucs4_to_ksc5601 (ch, cp, 2) == __UNKNOWN_10646_CHAR, 0))	      \
! 	  {								      \
! 	    /* Illegal character.  */					      \
! 	    STANDARD_ERR_HANDLER (4);					      \
! 	  }								      \
  									      \
! 	if (__builtin_expect (outptr + 1 >= outend, 0))			      \
  	  {								      \
! 	    /* We have not enough room.  */				      \
  	    result = __GCONV_FULL_OUTPUT;				      \
  	    break;							      \
  	  }								      \
! 									      \
! 	*outptr++ = cp[0] | 0x80;					      \
! 	*outptr++ = cp[1] | 0x80;					      \
        }									      \
+     else								      \
+       /* Plain ASCII.  */						      \
+       *outptr++ = (unsigned char) ch;					      \
  									      \
      inptr += 4;								      \
    }
*** glibc-20001010/iconvdata/testdata/EUC-KR..UTF8.bak	Sat Oct 14 19:38:28 2000
--- glibc-20001010/iconvdata/testdata/EUC-KR..UTF8	Sat Oct 14 19:47:58 2000
***************
*** 1,7 ****
     ! " # $ % & ' ( ) * + , - . /
   0 1 2 3 4 5 6 7 8 9 : ; < = > ?
   @ A B C D E F G H I J K L M N O
!  P Q R S T U V W X Y Z [ ₩ ] ^ _
   ` a b c d e f g h i j k l m n o
   p q r s t u v w x y z { | } ~
  
--- 1,7 ----
     ! " # $ % & ' ( ) * + , - . /
   0 1 2 3 4 5 6 7 8 9 : ; < = > ?
   @ A B C D E F G H I J K L M N O
!  P Q R S T U V W X Y Z [ \ ] ^ _
   ` a b c d e f g h i j k l m n o
   p q r s t u v w x y z { | } ~
  
*** glibc-20001010/iconvdata/EUC-KR.irreversible.bak	Sat Oct 14 19:38:28 2000
--- glibc-20001010/iconvdata/EUC-KR.irreversible	Sat Oct 14 19:47:58 2000
***************
*** 1 ****
! 0x5C	0x005C
--- 1 ----
! 0xA3DC	0x20A9
*** glibc-20001010/localedata/charmaps/EUC-KR.bak	Sat Oct 14 19:38:28 2000
--- glibc-20001010/localedata/charmaps/EUC-KR	Sat Oct 14 19:47:58 2000
***************
*** 100,106 ****
  <U0059>     /x59         LATIN CAPITAL LETTER Y
  <U005A>     /x5a         LATIN CAPITAL LETTER Z
  <U005B>     /x5b         LEFT SQUARE BRACKET
! <U20A9>     /x5c         WON SIGN
  <U005D>     /x5d         RIGHT SQUARE BRACKET
  <U005E>     /x5e         CIRCUMFLEX ACCENT
  <U005F>     /x5f         LOW LINE
--- 100,106 ----
  <U0059>     /x59         LATIN CAPITAL LETTER Y
  <U005A>     /x5a         LATIN CAPITAL LETTER Z
  <U005B>     /x5b         LEFT SQUARE BRACKET
! <U005C>     /x5c         REVERSE SOLIDUS
  <U005D>     /x5d         RIGHT SQUARE BRACKET
  <U005E>     /x5e         CIRCUMFLEX ACCENT
  <U005F>     /x5f         LOW LINE
*** glibc-20001010/localedata/locales/ko_KR.bak	Sat Oct 14 19:38:28 2000
--- glibc-20001010/localedata/locales/ko_KR	Sat Oct 14 19:47:58 2000
***************
*** 11120,11126 ****
  LC_MONETARY
  
  int_curr_symbol		"<U004B><U0052><U0057><U0020>"
! currency_symbol		"<U20A9>"
  mon_decimal_point	"<U002E>"
  mon_thousands_sep	"<U002C>"
  mon_grouping		3;3
--- 11120,11126 ----
  LC_MONETARY
  
  int_curr_symbol		"<U004B><U0052><U0057><U0020>"
! currency_symbol		"<UFFE6>"
  mon_decimal_point	"<U002E>"
  mon_thousands_sep	"<U002C>"
  mon_grouping		3;3
Follow-Ups:
- Re: EUC-KR and the Won sign
  - From: Ulrich Drepper