This is the mail archive of the xsl-list@mulberrytech.com mailing list .



Re: output encoding="iso-8859-1"


Daniel Florian wrote:
> <?xml version="1.0" encoding="utf-8"?>
> <?xml-stylesheet type="text/xsl" href="Untitled2.xsl"?>
> <start>
> á °
> </start>

Everyone else's answers weren't to my satisfaction, so I'm jumping in on 
this one even though it's a few days old.

Your email was iso-8859-1 encoded. In other words, "á" (Latin small letter
a with acute) is byte 0xE1 and "°" (degree sign) is byte 0xB0. I'm
guessing that your original file is iso-8859-1 encoded, too.

Your XML is misdeclaring its encoding. It is an error to say it is utf-8
encoded when it is actually iso-8859-1. The bytes 0xE1 0x20 0xB0 are an
invalid UTF-8 sequence: 0xE1 begins a 3-byte sequence, but the 0x20 that
follows is not a continuation byte (those run from 0x80 to 0xBF). The file
shouldn't even be parseable XML, but apparently your parser doesn't care.
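
To make the mismatch concrete, here is a small Python sketch (Python is my choice for illustration; it is not part of the original setup). It assumes the file holds exactly the three bytes 0xE1 0x20 0xB0 and shows that they decode cleanly as iso-8859-1 but are rejected by a strict utf-8 decoder.

```python
# The three bytes as they sit in the file: "á", space, "°" in iso-8859-1.
raw = bytes([0xE1, 0x20, 0xB0])

# Read as iso-8859-1, they are the three characters from the source document.
as_latin1 = raw.decode("iso-8859-1")

# Read as strict utf-8, decoding fails: 0xE1 opens a three-byte sequence,
# but 0x20 is not a continuation byte (those are 0x80 through 0xBF).
try:
    raw.decode("utf-8")
    strict_utf8_ok = True
except UnicodeDecodeError:
    strict_utf8_ok = False

print(repr(as_latin1), strict_utf8_ok)
```

A conforming XML parser should behave like the strict decoder here and refuse the document.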

&#6192; = &#x1830;, which is equivalent to the bytes 0xE1 0xA0 0xB0 in
utf-8. My guess is that your parser read the 0x20 as if it were 0xA0 (the
two share the same low six bits), which is a very liberal interpretation
of the bytes.
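
The arithmetic can be checked with a short Python sketch. The function name lenient_three_byte_decode is hypothetical, and the assumption that the parser simply masks each trailing byte without validating it is only my guess at what it does internally.

```python
def lenient_three_byte_decode(b0, b1, b2):
    """Combine three bytes as a utf-8 three-byte sequence WITHOUT
    checking that b1 and b2 are real continuation bytes (0x80-0xBF)."""
    return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F)

# The bytes actually in the file: 0x20 is a plain space, not a continuation byte.
sloppy = lenient_three_byte_decode(0xE1, 0x20, 0xB0)

# The well-formed utf-8 encoding of that same code point.
proper = chr(sloppy).encode("utf-8")

print(sloppy, hex(sloppy), proper.hex())
```

Under that assumption the space contributes the same six bits as 0xA0 would, so the sloppy reading lands on code point 6192.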

> What character reference is the &#6192?  This is supposed to be ISO-8859-1
> isn't it?

The 7 characters "&" "#" "6" "1" "9" "2" ";" are encoded in the output 
as their 7 respective iso-8859-1 bytes, as per your xsl:output 
instruction, yes. What "&#6192;" means, however, in the context of an XML 
or HTML document, is the single character known as MONGOLIAN LETTER SA.

>  Then how come I can't seem to find the character code for 6192

Maybe because you weren't looking at The Unicode Standard at unicode.org,
or the Letter Database at http://www.eki.ee/letter/, or at the standard
that is referenced by both the XML and HTML specs: ISO/IEC 10646-1.
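
If you have Python at hand, its standard unicodedata module (built from the Unicode Character Database) gives a quick way to look the number up. This is only a convenience sketch, not a substitute for the references above.

```python
import unicodedata

code_point = 6192  # the number from the character reference &#6192;

# Look up the official Unicode character name for this code point.
name = unicodedata.name(chr(code_point))

print(f"U+{code_point:04X}", name)
```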

> And also, what happened to the 2 distinct characters from the
> source xml?

Your 3 characters (including the space in between them) became 3 bytes in
the encoding supported by the editor that made the file. When read back in
by an XML parser under the assumption that utf-8 was the encoding used,
and taking into account the fact that your parser is apparently very
forgiving of the illegal byte sequence, the 3 bytes together imply 1
abstract character -- that Mongolian character that you probably won't
find in any font. When this character is copied to the result tree in your
XSL transformation, it retains its identity as a single character. When
the result tree is serialized as iso-8859-1 bytes in HTML syntax, the
character has no byte of its own in that encoding, so it is impossible to
represent it as anything other than "&#6192;" or "&#x1830;".
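
The serialization step can be imitated in Python as a rough sketch. The "xmlcharrefreplace" error handler is Python's own mechanism and merely stands in for what an XSLT serializer does when a character does not fit the output encoding.

```python
mongolian_sa = chr(0x1830)

# iso-8859-1 only covers U+0000 through U+00FF, so this character cannot be
# written as a byte; the handler substitutes a decimal character reference.
serialized = mongolian_sa.encode("iso-8859-1", errors="xmlcharrefreplace")

# The equivalent hexadecimal form of the same reference.
hex_ref = f"&#x{ord(mongolian_sa):X};"

# By contrast, the characters you intended fit in iso-8859-1 directly.
intended = "\u00e1 \u00b0".encode("iso-8859-1")

print(serialized, hex_ref, intended)
```

The 7 bytes in serialized are the 7 characters of the reference, which is exactly what showed up in your output.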


   - Mike
_____________________________________________________________________________
mike j. brown, software engineer at  |  xml/xslt: http://skew.org/xml/
webb.net in denver, colorado, USA    |  personal: http://hyperreal.org/~mike/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

