This is the mail archive of the xsl-list@mulberrytech.com mailing list .


RE: encoding woes: ISO-8859-1 vs. UTF-8

From: Xiaocun Xu <xiaocunxu at yahoo dot com>
To: xsl-list at lists dot mulberrytech dot com
Date: Wed, 24 Jul 2002 07:23:13 -0700 (PDT)
Subject: RE: [xsl] encoding woes: ISO-8859-1 vs. UTF-8
Reply-to: xsl-list at lists dot mulberrytech dot com

--- Tony Graham <Tony.Graham@sun.com> wrote:
> Michael Kay wrote at 24 Jul 2002 09:05:31 +0100:
>  > > > ISO-8859-1 can only encode the characters in the
>  > > > range 0-255.
>  > > 
>  > > That's what I thought as well.  How did Saxon convert
>  > > those two control chars into the proper encoding for
>  > > &#8220; and &#8221; even though the input XML was
>  > > marked as encoded in ISO-8859-1?  I was fully expecting
>  > > the import to fail, but somehow it was successful.
>  > 
>  > I have no idea. This isn't done by Saxon, it's done by
>  > the XML parser. If you were using the default parser
>  > (AElfred), I think that it actually accepts bytes
>  > x80-x9F with encoding="iso-8859-1", converting them
>  > into characters x80-x9F.
> 
> Windows code pages, e.g. CP 1252, typically encode #x201C,
> LEFT DOUBLE QUOTATION MARK, and #x201D, RIGHT DOUBLE
> QUOTATION MARK, as 0x93 and 0x94, respectively.
> 
> The Windows 2000 "Character Map" utility, for example, shows
> the characters with those byte values for their encoding when
> the "Character set" is "Windows: Western" or "Windows: Central
> Europe", etc.
> 
> #x201C and #x201D aren't part of ISO 8859-1, so when the
> encoding really is ISO 8859-1 and not CP 1252 (or similar),
> then the only way to represent #x201C and #x201D is as numeric
> character references: &#x201C; (or &#8220;) and &#x201D; (or
> &#8221;).
> 
> It appears that AElfred is accommodating the extras in the
> Windows code page even when the input is labelled ISO-8859-1.
> Since it used to be said (and may still be true) that some
> Microsoft software labelled CP 1252 text as ISO 8859-1
> (although I thought that Outlook was the main culprit) and
> since "real" ISO 8859-1 isn't going to use the byte values for
> the CP 1252 extras (until we get NEL, that is), then it's
> forgiving of AElfred to accept the extras.  It's just that
> this "principle of least surprise" action surprised several of
> us.
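
A minimal sketch of the distinction described above, using Python's built-in codecs rather than AElfred itself. It decodes the same two bytes once as CP 1252 and once as strict ISO 8859-1; the helper name code_points is made up for illustration.

```python
# Bytes 0x93 and 0x94, as a Windows tool might write curly quotes.
raw = bytes([0x93, 0x94])

# Read as CP 1252, they are the left and right double quotation marks.
as_cp1252 = raw.decode("cp1252")

# Read as true ISO 8859-1, each byte maps to the code point of the same
# value, which falls in the C1 control range (U+0080 to U+009F).
as_latin1 = raw.decode("iso-8859-1")


def code_points(text):
    """Return the code points of text as 'U+XXXX' strings (illustrative helper)."""
    return ["U+%04X" % ord(ch) for ch in text]


print(code_points(as_cp1252))  # ['U+201C', 'U+201D']
print(code_points(as_latin1))  # ['U+0093', 'U+0094']

# Strict ISO 8859-1 cannot encode U+201C, so a serializer has to fall back
# to a numeric character reference.
fallback = "\u201c".encode("iso-8859-1", errors="xmlcharrefreplace")
print(fallback)  # b'&#8220;'
```

A parser that treats the label "ISO-8859-1" leniently would behave like the first decode; a strict one would behave like the second.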

Thanks for the explanation; that makes a lot of sense.
It sounds like the entire MS Office suite is a culprit,
if not more.  If this is only allowed by AElfred, I
guess I really will have to resolve this problem when I
upgrade to Saxon 7.x and Xerces-J 2.

>  > > Good point.  For export output, I changed encoding to
>  > > UTF-8, that seems to have resolved the problem, now
>  > > export is successful.  Open the exported CSV in Hex
>  > > editor, those two chars are shown as Hex 93/94,
>  > > respectively.
>  > > 
>  > Now I really am puzzled.
> 
> I'm puzzled too. #x201C is not 0x93 in UTF-8.
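
To make the puzzle concrete, here is a small check of what UTF-8 actually produces for the two quotation marks. It is only a sketch in Python, not what Saxon does internally; is_valid_utf8 is a hypothetical helper name.

```python
left = "\u201c"   # LEFT DOUBLE QUOTATION MARK
right = "\u201d"  # RIGHT DOUBLE QUOTATION MARK

# In UTF-8 each of these characters takes three bytes, not one.
utf8_left = left.encode("utf-8")
utf8_right = right.encode("utf-8")
print(utf8_left.hex(), utf8_right.hex())  # e2809c e2809d


def is_valid_utf8(data):
    """Return True if data decodes cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A lone 0x93 or 0x94 is a continuation byte with no lead byte, so a file
# containing them as single bytes is not well-formed UTF-8.
print(is_valid_utf8(b'"\x93quoted\x94"'))  # False
```

So a file that shows single bytes 93/94 in a hex editor is most likely CP 1252 (or similar), whatever its encoding label says.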

Very strange indeed.  I checked the hex values stored
in SQL Server after import, and both chars are stored as
0x22, the plain quotation mark in ISO-8859-1.  How were
they turned into 0x93 and 0x94 on export?  Even though I
marked the exported proprietary XML as UTF-8,
Saxon/AElfred had no problem processing it.

To use UTF-8 consistently, I guess I need to run
native2ascii on the imported Excel CSV before I start
the XSLT transformation.  But what happens on export?
When I open the CSV in a hex editor, it uses one byte
per char, so how could the export generate a CSV
containing the &#8220; and &#8221; chars?
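
One possible alternative to native2ascii (which writes \uXXXX escapes that an XML parser does not interpret) is to re-encode the file explicitly. The sketch below assumes the Excel CSV really is CP 1252, which has not been verified here; the function names and sample data are hypothetical.

```python
def cp1252_to_utf8(data: bytes) -> bytes:
    """Re-encode CP 1252 bytes as UTF-8 (import direction).

    Python's cp1252 codec raises UnicodeDecodeError on the few byte values
    CP 1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D).
    """
    return data.decode("cp1252").encode("utf-8")


def utf8_to_cp1252(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as CP 1252 (export direction).

    Raises UnicodeEncodeError for characters CP 1252 cannot represent.
    """
    return data.decode("utf-8").encode("cp1252")


# Hypothetical one-row CSV with curly quotes stored as single bytes.
excel_csv = b'id,comment\r\n1,\x93quoted\x94\r\n'

for_xslt = cp1252_to_utf8(excel_csv)  # quotes become three bytes each
exported = utf8_to_cp1252(for_xslt)   # back to one byte per char

print(for_xslt)
print(exported == excel_csv)
```

Under that assumption, a one-byte-per-char export can still carry both quotation marks, but only as CP 1252 bytes 0x93 and 0x94, so the file should be labelled accordingly rather than as UTF-8 or ISO-8859-1.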

Thanks,
Xiaocun


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

References:
- RE: encoding woes: ISO-8859-1 vs. UTF-8
  - From: Tony Graham
