This is the mail archive of the xsl-list@mulberrytech.com mailing list.
Re: xml invalid characters
- From: Mike Brown <mike at skew dot org>
- To: xsl-list at lists dot mulberrytech dot com
- Date: Fri, 22 Mar 2002 16:08:11 -0700 (MST)
- Subject: Re: [xsl] xml invalid characters
- Reply-to: xsl-list at lists dot mulberrytech dot com
stevenson wrote:
> How can I avoid these problem. The data is from the database, and the
> character crashing it is £
You probably have an encoding problem. I assume that you're having trouble
with the British currency symbol for a Pound? At least, that's what it looks
like on my screen.
Quick lesson:
The POUND SIGN is character number A3 (hex) in Unicode. "U+00A3" is how you
can write it unambiguously in prose.
Encoding provides a way of representing that A3 as bytes.
  iso-8859-1: A3
  utf-8:      C2 A3
  utf-16:     00 A3 (big endian)
              A3 00 (little endian)
utf-8 and utf-16 can represent any Unicode character, but other encodings are
more limited; single-byte encodings such as iso-8859-1 can represent at most
256 characters.
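If you want to check byte sequences like these yourself, here is a minimal
sketch in Python (chosen only for illustration; nothing in the original
question involves Python). The helper name hex_bytes is made up for this
example, and utf-16-be / utf-16-le are Python's codec names for the two byte
orders without a byte order mark.

```python
# U+00A3 POUND SIGN encoded with several standard codecs.
POUND = "\u00a3"

def hex_bytes(text, encoding):
    """Return the encoded bytes of `text` as space-separated uppercase hex."""
    return " ".join(f"{b:02X}" for b in text.encode(encoding))

for name in ("iso-8859-1", "utf-8", "utf-16-be", "utf-16-le"):
    print(f"{name:>10}: {hex_bytes(POUND, name)}")

# A single-byte encoding cannot hold every character: the EURO SIGN
# (U+20AC) is not in iso-8859-1, so encoding it raises an error.
try:
    "\u20ac".encode("iso-8859-1")
    euro_fits_latin1 = True
except UnicodeEncodeError:
    euro_fits_latin1 = False
```

The big-endian form puts the high-order byte (00) first; the little-endian
form puts the low-order byte (A3) first.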
If a character cannot be represented in a particular encoding, you write it as
a sequence of characters that can be represented in any encoding (spaces added
for clarity):
& # x A 3 ; or & # 1 6 3 ;
For example, us-ascii does not have POUND SIGN (this may be the source of your
problem; it's hard to say, without knowing all the stages of processing of
your data, and the role Cold Fusion plays in it). So you'd have to use this
escaped format.
& # x A 3 ;
us-ascii: 26 23 78 41 33 3B
And this escaped format (a "character reference") also works just as well in
other encodings:
iso-8859-1: 26 23 78 41 33 3B
utf-8: 26 23 78 41 33 3B
utf-16: 00 26 00 23 00 78 00 41 00 33 00 3B (big endian)
utf-16: 26 00 23 00 78 00 41 00 33 00 3B 00 (little endian)
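The character-reference idea can be sketched in Python as well (again an
illustration only, not something from the original question). The
"xmlcharrefreplace" error handler is a standard Python codec option that
emits the decimal form of the reference; the hexadecimal form is built by
hand here.

```python
POUND = "\u00a3"

# us-ascii cannot represent U+00A3 directly.
try:
    POUND.encode("us-ascii")
    direct_ok = True
except UnicodeEncodeError:
    direct_ok = False

# The "xmlcharrefreplace" error handler substitutes a decimal
# character reference for anything the target encoding lacks.
decimal_ref = POUND.encode("us-ascii", errors="xmlcharrefreplace")

# The hexadecimal form, built by hand from the code point.
hex_ref = f"&#x{ord(POUND):X};"

# The reference itself is plain ASCII, so its bytes are identical in
# us-ascii, iso-8859-1 and utf-8.
same_bytes = {hex_ref.encode(e) for e in ("us-ascii", "iso-8859-1", "utf-8")}

print(direct_ok, decimal_ref, hex_ref, same_bytes)
```

Because every character in the reference is in the ASCII range, only the
2-byte-per-character utf-16 forms differ from the others.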
Now check your XML document. When you look at the document in a text editor,
it might say
<?xml version="1.0" encoding="utf-8"?>
                    ^^^^^^^^^^^^^^^^
This encoding declaration is an assertion made by the document as to how its
bytes map to Unicode characters. It is just a hint for the XML parser to use
when reading the document; it is not secret code that causes anything about
the document's *actual* encoding to change.
If this declaration is missing, UTF-8 or UTF-16 is assumed (UTF-8 unless the
document begins with the bytes FF FE or FE FF, the UTF-16 byte order mark).
It is your responsibility to ensure that the encoding declaration is an
accurate reflection of the document's *actual* encoding.
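To see what an inaccurate declaration does in practice, here is a small
sketch using the XML parser in Python's standard library (an assumption made
for illustration; your own parser may report the error differently). The
<price> element and the variable names are invented for the example.

```python
import xml.etree.ElementTree as ET

# The same element content, <price>£</price>, with the pound sign
# stored as the single byte A3 (that is, iso-8859-1 bytes).
body = b"<price>\xa3</price>"

honest = b'<?xml version="1.0" encoding="iso-8859-1"?>' + body
wrong = b'<?xml version="1.0" encoding="utf-8"?>' + body

# Accurate declaration: the parser maps A3 to U+00A3.
parsed_text = ET.fromstring(honest).text

# Inaccurate declaration: a lone A3 byte is not valid UTF-8, so the
# parser rejects the document.
try:
    ET.fromstring(wrong)
    wrong_parsed = True
except ET.ParseError:
    wrong_parsed = False

# No declaration at all: the parser assumes UTF-8, so the bytes
# must really be UTF-8 (C2 A3 for the pound sign).
undeclared_text = ET.fromstring(b"<price>\xc2\xa3</price>").text
```

The bytes of the element are identical in the first two documents; only the
claim made about them differs, and that claim alone decides whether the
parse succeeds.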
As you can guess, this is where most people run into problems. They are
passing "text" around in their software without paying attention to whether
and how it has been encoded. So, in order to diagnose encoding-related
problems, you must trace the processes that your data passes through, and
determine how it is encoded/decoded at each step.
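As a hypothetical illustration of that kind of tracing (the three stages
below are invented; they are not a description of your actual setup), this
Python sketch dumps the bytes at each step of a pipeline in which one
component misreads UTF-8 data as iso-8859-1:

```python
POUND = "\u00a3"

def trace(label, data):
    """Print a stage label and the raw bytes in hex."""
    print(f"{label:<28} {data.hex(' ').upper()}")

# Stage 1: one component writes the text out as UTF-8.
stage1 = POUND.encode("utf-8")
trace("written as utf-8:", stage1)

# Stage 2: the next component wrongly reads those bytes as iso-8859-1,
# so it sees two characters (U+00C2 and U+00A3) instead of one.
misread = stage1.decode("iso-8859-1")

# Stage 3: it re-encodes what it thinks it read as UTF-8 again.
stage3 = misread.encode("utf-8")
trace("re-encoded after misreading:", stage3)
```

Looking at the hex at each stage, rather than at how the text is displayed,
shows exactly where one character became two.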
Also, you didn't say what your problem has to do with XSLT. This is the
xsl-list. If you have general xml processing questions, ask them on xml-dev.
If you're using XSLT, then you usually only need to be concerned about two
things:
- the source and stylesheet XML documents must have accurate encoding
declarations
- the output encoding, as controlled by <xsl:output encoding="..."/>
should be what you wanted (there is a FAQ regarding invoking MSXML
from scripts, where the output becomes UTF-16, depending on how
you capture it)
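The effect of choosing an output encoding can be sketched without an XSLT
processor. The example below is not XSLT and does not use xsl:output; it
uses the serializer in Python's standard library as a stand-in, on the
assumption that the principle is the same: the serializer, not the data,
decides which bytes are written, and falls back to a character reference
when the chosen encoding lacks the character. The <price> element is
invented for the example.

```python
import xml.etree.ElementTree as ET

price = ET.Element("price")
price.text = "\u00a35"  # POUND SIGN followed by the digit 5

# Serialize the same in-memory element three times, changing only the
# requested output encoding.
ascii_out = ET.tostring(price, encoding="us-ascii")
latin1_out = ET.tostring(price, encoding="iso-8859-1")
utf8_out = ET.tostring(price, encoding="utf-8")

for out in (ascii_out, latin1_out, utf8_out):
    print(out)
```

In the us-ascii output the pound sign appears as the reference &#163;, in
iso-8859-1 as the byte A3, and in utf-8 as the bytes C2 A3.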
Good luck.
- Mike
____________________________________________________________________________
mike j. brown | xml/xslt: http://skew.org/xml/
denver/boulder, colorado, usa | resume: http://skew.org/~mike/resume/
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list