This is the mail archive of the xsl-list@mulberrytech.com mailing list.
RE: encoding woes: ISO-8859-1 vs. UTF-8
- From: Xiaocun Xu <xiaocunxu at yahoo dot com>
- To: xsl-list at lists dot mulberrytech dot com
- Date: Wed, 24 Jul 2002 07:23:13 -0700 (PDT)
- Subject: RE: [xsl] encoding woes: ISO-8859-1 vs. UTF-8
- Reply-to: xsl-list at lists dot mulberrytech dot com
--- Tony Graham <Tony.Graham@sun.com> wrote:
> Michael Kay wrote at 24 Jul 2002 09:05:31 +0100:
> > > > ISO-8859-1 can only encode the characters in the
> > > > range 0-255.
> > >
> > > That's what I thought as well. How did Saxon
> > > convert those two control chars into the proper
> > > encoding for “ and ” even though the input
> > > XML was marked as encoded in ISO-8859-1? I was fully
> > > expecting the import would fail, but somehow it
> > > was successful.
> >
> > I have no idea. This isn't done by Saxon, it's done by the XML parser.
> > If you were using the default parser (AElfred), I think that it actually
> > accepts bytes x80-x9F with encoding="iso-8859-1", converting them into
> > characters x80-x9F.
>
> Windows code pages, e.g. CP 1252, typically encode #x201C, LEFT DOUBLE
> QUOTATION MARK, and #x201D, RIGHT DOUBLE QUOTATION MARK, as 0x93 and
> 0x94, respectively.
>
> The Windows 2000 "Character Map" utility, for example, shows the
> characters with those byte values for their encoding when the
> "Character set" is "Windows: Western" or "Windows: Central Europe",
> etc.
>
> #x201C and #x201D aren't part of ISO 8859-1, so when the encoding
> really is ISO 8859-1 and not CP 1252 (or similar), then the only way
> to represent #x201C and #x201D is as numeric character references:
> &#x201C; (or &#8220;) and &#x201D; (or &#8221;).
>
> It appears that AElfred is accommodating the extras in the Windows
> code page even when the input is labelled ISO-8859-1. Since it used
> to be said (and may still be true) that some Microsoft software
> labelled CP 1252 text as ISO 8859-1 (although I thought that Outlook
> was the main culprit) and since "real" ISO 8859-1 isn't going to use
> the byte values for the CP 1252 extras (until we get NEL, that is),
> then it's forgiving of AElfred to accept the extras. It's just that
> this "principle of least surprise" action surprised several of us.
Thanks for the explanation; that makes a lot of sense. It
sounds like the entire MS Office suite is a culprit, if
not more. If this is only allowed by AElfred, I guess I
really have to resolve this problem when I
upgrade to Saxon 7.x and Xerces-J 2.
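
To make the difference concrete, here is a minimal sketch in Python 3. It is an illustration only: it uses Python's standard `cp1252` and `iso-8859-1` codecs and does not reproduce what AElfred does internally. The sample bytes in `raw` are made up. It decodes 0x93 and 0x94 both ways and shows that strict ISO 8859-1 output has to fall back to numeric character references.

```python
# Bytes 0x93 and 0x94 as a Windows application might write them (made-up sample).
raw = b"\x93quoted\x94"

# Decoded as CP 1252, they become the curly quotation marks U+201C and U+201D.
as_cp1252 = raw.decode("cp1252")

# Decoded as real ISO 8859-1, the same bytes map to the C1 controls U+0093 and U+0094.
as_latin1 = raw.decode("iso-8859-1")

print([hex(ord(c)) for c in (as_cp1252[0], as_cp1252[-1])])  # ['0x201c', '0x201d']
print([hex(ord(c)) for c in (as_latin1[0], as_latin1[-1])])  # ['0x93', '0x94']

# ISO 8859-1 has no code point for U+201C, so a strict encode fails.
try:
    "\u201c".encode("iso-8859-1")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# The fallback is a numeric character reference (Python emits the decimal form).
as_ncr = "\u201cquoted\u201d".encode("iso-8859-1", errors="xmlcharrefreplace")
print(as_ncr)  # b'&#8220;quoted&#8221;'
```

A parser that is strict about ISO-8859-1 would behave like the second decode, which is presumably what I will run into after the upgrade.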
> > > Good point. For export output, I changed encoding to
> > > UTF-8, that seems to have resolved the problem, now
> > > export is successful. Open the exported CSV in Hex
> > > editor, those two chars are shown as Hex 93/94,
> > > respectively.
> > >
> > Now I really am puzzled.
>
> I'm puzzled too. #x201C is not 0x93 in UTF-8.
Very strange indeed. I checked the hex values stored
in SQL Server after import; both chars are stored as
, the quotation mark in ISO-8859-1. How did it
transpose these characters to ] and ^ on export?
Even though I marked the exported proprietary XML as UTF-8,
Saxon/AElfred had no problem processing it.
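
For reference, a small Python 3 sketch (standard codecs only, variable names are illustrative) of what the two quotation marks look like in real UTF-8, and why a lone 0x93 byte cannot be UTF-8:

```python
left, right = "\u201c", "\u201d"

# In UTF-8 each curly quote takes three bytes.
utf8_left = left.encode("utf-8")    # b'\xe2\x80\x9c'
utf8_right = right.encode("utf-8")  # b'\xe2\x80\x9d'

# In CP 1252 the left quote is the single byte 0x93.
cp1252_left = left.encode("cp1252")  # b'\x93'

# 0x93 on its own is a UTF-8 continuation byte, so it is not valid UTF-8.
try:
    b"\x93".decode("utf-8")
    lone_byte_is_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_is_valid_utf8 = False

print(utf8_left.hex(), utf8_right.hex(), cp1252_left.hex(), lone_byte_is_valid_utf8)
```

So a file that shows 93/94 in a hex editor is consistent with CP 1252 output, whatever the encoding label says.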
To use UTF-8 consistently, I guess I need to run native2ascii on
the Excel CSV before I start the XSLT transformation on import.
But what happens on export? When I open the CSV in a hex editor
it uses one byte per char, so how could the export generate a
CSV with
“ and ” chars?
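
One way the pre-processing step could look, sketched in Python 3 rather than with native2ascii. This assumes the Excel CSV really is CP 1252; the helper names `recode_cp1252_to_utf8` and `recode_utf8_to_cp1252` and the sample data are hypothetical.

```python
def recode_cp1252_to_utf8(data: bytes) -> bytes:
    """Decode bytes as CP 1252 and re-encode them as UTF-8.

    Raises UnicodeDecodeError for the few byte values CP 1252 leaves undefined.
    """
    return data.decode("cp1252").encode("utf-8")


def recode_utf8_to_cp1252(data: bytes) -> bytes:
    """Reverse direction, e.g. for a CSV that Excel should open on Windows."""
    return data.decode("utf-8").encode("cp1252")


# Made-up CSV content with the two quotation marks stored as 0x93 and 0x94.
csv_from_excel = b"id,comment\r\n1,\x93hello\x94\r\n"

utf8_csv = recode_cp1252_to_utf8(csv_from_excel)
round_trip = recode_utf8_to_cp1252(utf8_csv)

print(utf8_csv)
print(round_trip == csv_from_excel)
```

In the UTF-8 version each quotation mark takes three bytes, so a one-byte-per-char export that still shows the curly quotes would be consistent with the file having been written in CP 1252 and not in UTF-8.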
Thanks,
Xiaocun
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list