This is the mail archive of the xsl-list@mulberrytech.com mailing list.



RE: output encoding="iso-8859-1"


Thanks very much for the detailed answer Mike, this is all starting to make
sense.

There were a couple of points that make this stuff easier to understand.
Forgive me if this is stating the obvious, but it took me a while to
synthesize this... the info is scattered all over the place.

1) The "&#nnn;" notation in XML and HTML files is a numeric character
reference: nnn is the decimal value of the character in the Unicode
character set (the "&#xhhh;" form gives the same value in hex).  This is
entirely independent of the encoding that the document declares.  If the
declaration says ISO-8859-1, these character references still refer to
Unicode character values.

2) The encoding declaration is supposed to state the actual byte encoding
of the doc.  That's all.

3) It is non-trivial to manage content with extended characters across a
number of different applications and operating systems... Clearly, in my
case, strange stuff happened to the bytes during the "cut and paste"
process, and I am also not sure whether the apps I was using to view the
content were able to make sense of the UTF-8 multibyte characters anyway.
Rather than assume this will work, you really need to check each
application individually.
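
To make points 1 and 2 concrete, here is a small Python sketch (Python is my
choice for illustration, it is not part of the original thread). It assumes
the standard library's ElementTree parser and shows that the bytes of a
character depend on the encoding, while the character reference does not.

```python
import xml.etree.ElementTree as ET

# One character, two byte encodings: the encoding declaration describes these bytes.
a_acute = "\u00e1"  # LATIN SMALL LETTER A WITH ACUTE
latin1_bytes = a_acute.encode("iso-8859-1")  # a single byte
utf8_bytes = a_acute.encode("utf-8")         # two bytes

# The numeric character reference is the same either way: the decimal code point.
reference = "&#{};".format(ord(a_acute))

# A document declared as iso-8859-1 can still reference a character that lies
# far outside that encoding's 256-character repertoire.
doc = b'<?xml version="1.0" encoding="iso-8859-1"?><start>\xe1 &#6192;</start>'
text = ET.fromstring(doc).text

print(latin1_bytes, utf8_bytes, reference)
print([hex(ord(ch)) for ch in text])
```

The raw byte is decoded using the declared encoding, while the reference is
resolved against Unicode regardless of it.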

My conclusion for now is that the safest way to manage content with
international characters is to use the character references as discussed in
#1.  Unfortunately, this won't result in a WYSIWYG editing system, but
that's a small price to pay for increased portability of the content across
all kinds of editors and OSes.
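
As a rough illustration of that approach (a sketch in Python, using only the
standard codecs and html modules; the sample string is just the two
characters from this thread), content written with character references is
plain ASCII, so it reads back the same under several common encodings:

```python
import html

original = "\u00e1 \u00b0"  # a-acute, space, degree sign

# Replace every non-ASCII character with a decimal character reference.
portable = original.encode("ascii", "xmlcharrefreplace")

# The result is pure ASCII, so these encodings all decode it identically.
decoded = {
    enc: portable.decode(enc)
    for enc in ("ascii", "iso-8859-1", "utf-8", "cp1252")
}

# Resolving the references recovers the original characters.
restored = html.unescape(decoded["utf-8"])

print(portable)
print(restored == original)
```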

I'm sure I'll hear if some of this isn't accurate,

Thanks,
-Dan

-----Original Message-----
From: Mike Brown [mailto:mike@skew.org]
Sent: Monday, June 04, 2001 10:04 PM
To: xsl-list@lists.mulberrytech.com
Subject: Re: [xsl] output encoding="iso-8859-1"


Daniel Florian wrote:
> <?xml version="1.0" encoding="utf-8"?>
> <?xml-stylesheet type="text/xsl" href="Untitled2.xsl"?>
> <start>
> á °
> </start>

Everyone else's answers weren't to my satisfaction, so I'm jumping in on 
this one even though it's a few days old.

Your email was iso-8859-1 encoded. In other words, "á" (Latin small letter
a with acute) is byte 0xE1 and "°" (degree sign) is byte 0xB0. I'm
guessing that your original file is iso-8859-1 encoded, too.

Your XML is misdeclaring its encoding. It is an error to say it is utf-8
encoded when it is actually iso-8859-1. The bytes 0xE1 0x20 0xB0 are an
invalid UTF-8 sequence, so this shouldn't even be parseable XML, but
apparently your parser doesn't care.
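
For comparison, here is how a strict decoder treats the same input. This is
only a sketch in Python with the standard library's expat-based ElementTree,
not the parser used in the original report; the document is rebuilt from the
sample above on the assumption that the file was saved as iso-8859-1.

```python
import xml.etree.ElementTree as ET

# The three bytes an iso-8859-1 editor would save for a-acute, space, degree sign.
raw = "\u00e1 \u00b0".encode("iso-8859-1")

# A strict UTF-8 decoder refuses the sequence.
try:
    raw.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False

# So does a conforming parser when the document claims to be utf-8.
doc = b'<?xml version="1.0" encoding="utf-8"?><start>' + raw + b"</start>"
try:
    ET.fromstring(doc)
    parser_accepted = True
except ET.ParseError:
    parser_accepted = False

# With the declaration corrected, the same bytes parse to the intended text.
fixed = doc.replace(b'encoding="utf-8"', b'encoding="iso-8859-1"')
text = ET.fromstring(fixed).text

print(strict_utf8_ok, parser_accepted, text)
```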

&#6192; = &#x1830; which is equivalent to the bytes 0xE1 0xA0 0xB0 in 
utf-8. I'd say your parser is being very liberal with its interpretation
of the bytes.
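
One guess at what such a liberal parser might be doing (purely an
assumption, the function below is hypothetical and not taken from any real
parser): it sees a three-byte lead byte and combines the next two bytes
without checking that they are valid continuation bytes.

```python
def lenient_three_byte(b0, b1, b2):
    """Combine three bytes as a UTF-8 three-byte sequence WITHOUT
    validating that b1 and b2 are continuation bytes (10xxxxxx)."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)


# a-acute, space, degree sign as saved by an iso-8859-1 editor.
raw = "\u00e1 \u00b0".encode("iso-8859-1")

# Hypothetical lenient decoding of those three bytes.
code_point = lenient_three_byte(*raw)

# The well-formed UTF-8 encoding of the resulting character, for comparison.
well_formed = chr(code_point).encode("utf-8")

print(code_point, hex(code_point), well_formed)
```

Under that assumption the space contributes the same low six bits as a
proper continuation byte would, which is how three unrelated bytes can
collapse into one character.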

> What character reference is the &#6192?  This is supposed to be ISO-8859-1
> isn't it?

The 7 characters "&" "#" "6" "1" "9" "2" ";" are encoded in the output 
as their 7 respective iso-8859-1 bytes, as per your xsl:output 
instruction, yes. What "&#6192;" means, however, in the context of an XML 
or HTML document, is the single character known as MONGOLIAN LETTER SA.
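
A quick way to check both halves of that statement, sketched in Python with
the standard unicodedata module (the variable names are just for
illustration):

```python
import unicodedata

reference = "&#6192;"

# The reference itself is seven ASCII characters, hence seven iso-8859-1 bytes.
encoded = reference.encode("iso-8859-1")

# What the reference denotes is a single Unicode character.
code_point = int(reference[2:-1])
name = unicodedata.name(chr(code_point))

print(len(encoded), hex(code_point), name)
```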

>  Then how come I can't seem to find the character code for 6192

Maybe because you weren't looking at The Unicode Standard at unicode.org,
or the Letter Database at http://www.eki.ee/letter/, or at the standard
that is referenced by both the XML and HTML specs: ISO/IEC 10646-1.

> And also, what happened to the 2 distinct characters from the
> source xml?

Your 3 characters (including the space in between them) became 3 bytes in
the encoding supported by the editor that made the file. When read back in
by an XML parser under the assumption that utf-8 was the encoding used,
and taking into account the fact that your parser is apparently very
forgiving of the illegal byte sequence, the 3 bytes together imply 1
abstract character -- that Mongolian character that you probably won't
find in any font. When this character is copied to the result tree in your
XSL transformation, it retains its identity as a single character. When
the result tree is serialized as iso-8859-1 bytes in HTML syntax, the only
way to represent this character is with a character reference: "&#6192;"
or "&#x1830;".
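
The serialization step can be imitated in Python as a rough analogy (this is
Python's codec error handler, not an XSLT serializer): characters that
iso-8859-1 can hold are written as bytes, and anything else falls back to a
character reference.

```python
# One character iso-8859-1 can represent, a space, and one it cannot.
result_text = "\u00e1 \u1830"

# Unrepresentable characters are replaced by decimal character references.
serialized = result_text.encode("iso-8859-1", "xmlcharrefreplace")

# The equivalent hexadecimal form of the same reference.
hex_reference = "&#x{:X};".format(0x1830)

print(serialized, hex_reference)
```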


   - Mike
____________________________________________________________________________
mike j. brown, software engineer at  |  xml/xslt: http://skew.org/xml/
webb.net in denver, colorado, USA    |  personal: http://hyperreal.org/~mike/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list
