This is the mail archive of the libc-alpha@sourceware.org mailing list for the glibc project.



Re: [PATCH] Alias for ISO-10646-UCS-2 charset


On Wed, Dec 05, 2012 at 09:42:04PM -0700, Jeff Law wrote:
> 
> Certain embedded devices use the ISO-10646-UCS-2 charset; it is
> currently not possible for glibc's iconv to translate messages from
> those devices.
> 
> The ISO-10646-UCS-2 charset is an older character set that was
> superseded by UTF-16 of the Unicode standard in July 1996.
> 
> UCS-2 and UTF-16 are identical for purposes of data exchange.  Both
> are 16 bit formats and have exactly the same code unit
> representation.
> 
> UCS-2 does not support supplementary characters and doesn't
> interpret pairs of surrogate code points as characters.
> 
> Given they are identical for data exchange, the easiest way to
> support this charset is to create an alias.

UCS-2 is not the same as UTF-16. When processing UCS-2, code units in
the surrogate range must be rejected as invalid code units.
Interpreting them in pairs as UTF-16 would break the property of
fixed-width character encoding and would allow invalid UCS-2 to
validate, possibly allowing corrupt transmissions.
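
To make the decoding rule concrete, here is a minimal sketch of a strict UCS-2 check, written in Python for brevity. The function name `first_invalid_ucs2_unit` is hypothetical, and this is not how glibc's iconv is implemented; it only illustrates that any code unit in the surrogate range 0xD800..0xDFFF is rejected on its own, never paired.

```python
def first_invalid_ucs2_unit(units):
    """Return the index of the first code unit that is invalid in UCS-2,
    or None if every unit is valid.

    `units` is a sequence of 16-bit code units given as integers.
    Hypothetical helper for illustration only.
    """
    for index, unit in enumerate(units):
        if not 0 <= unit <= 0xFFFF:
            raise ValueError("not a 16-bit code unit: %r" % (unit,))
        # Surrogates are never valid in UCS-2, paired or not.
        if 0xD800 <= unit <= 0xDFFF:
            return index
    return None


# "A" followed by the UTF-16 surrogate pair for U+1F600.
utf16_units = [0x0041, 0xD83D, 0xDE00]
# "A" followed by U+20AC, both in the Basic Multilingual Plane.
bmp_units = [0x0041, 0x20AC]

print(first_invalid_ucs2_unit(utf16_units))
print(first_invalid_ucs2_unit(bmp_units))
```

A UTF-16 decoder would accept the first sequence as two characters; a UCS-2 decoder has to stop at the first surrogate.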

What's worse, for conversions in the other direction (to UCS-2),
characters that cannot be represented in UCS-2 would wrongly be
converted to pairs of surrogates. In a program that tries "oldest"
encodings first with the goal of being conservative in what you
transmit, this will lead to an incorrect conclusion that the data can
be encoded as UCS-2, and will result in malformed data being received
by the recipient (surrogates are not legal in UCS-2).
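
The same point for the encoding direction, again as a hedged Python sketch rather than anything taken from glibc. `encode_ucs2_be` is a hypothetical helper that assumes big-endian output with no byte order mark; it fails on characters outside the Basic Multilingual Plane instead of emitting surrogates, which is what lets a "try the oldest encoding first" program fall through to the next candidate.

```python
import struct


def encode_ucs2_be(text):
    """Encode `text` as big-endian UCS-2, or raise ValueError.

    Hypothetical helper for illustration only. Code points above
    U+FFFF and lone surrogates are rejected, never converted.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError("U+%04X is not representable in UCS-2" % cp)
        out += struct.pack(">H", cp)
    return bytes(out)


def pick_encoding(text):
    """Try UCS-2 first, then fall back to UTF-16 (both big-endian)."""
    try:
        return "UCS-2", encode_ucs2_be(text)
    except ValueError:
        return "UTF-16", text.encode("utf-16-be")


print(pick_encoding("A\u20ac"))
print(pick_encoding("A\U0001F600"))
```

If UCS-2 were merely an alias for UTF-16, the second call would report success as UCS-2 and hand the recipient the surrogate pair D83D DE00.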

Isn't UCS-2 already supported anyway, just without the ISO-10646
prefix on the name?

Rich

