This is the mail archive of the xsl-list@mulberrytech.com mailing list.
Re: How to read the encoding of an XML document
So, James,
The bottom line is that what you want to do isn't readily possible, mainly
because, in order to define a standard, XML has to limit the encodings
that processors are required to support. Whether a given parser can read a
given encoding, or whether an XSLT processor can write one out, is up to
the processor. The only thing the XML standard stipulates is that a parser
be able to read the standard Unicode encodings, UTF-8 and UTF-16.
One way to work around the problem would be to carry the encoding you want
as a parameter. (For this purpose you could preprocess the file to look in
the XML declaration and get the encoding pseudo-attribute.) Unfortunately,
since you can't parameterize the output encoding in the stylesheet either
(the encoding attribute of xsl:output is fixed when the stylesheet is
written), you won't be able to rely on the processor's own serializer, but
will have to work around the back end as well. Maybe someone on the list
could suggest how: for example, by having the processor construct a DOM and
then running the DOM tree through your own serializer that would do the
transcoding.
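
For what it's worth, here is a rough sketch of that workaround in Python
rather than against any particular XSLT processor. It assumes the document
is in an ASCII-compatible encoding (so the declaration can be read as raw
bytes), uses the standard library's ElementTree as a stand-in for the DOM,
and leaves the transformation itself as a placeholder. The names
declared_encoding and reserialize are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Matches the encoding pseudo-attribute in an XML declaration.
# Assumption: the file is in an ASCII-compatible encoding, so the
# declaration itself can be matched as bytes (this will not see UTF-16).
_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(raw: bytes, default: str = "UTF-8") -> str:
    """Hypothetical helper: return the encoding named in the XML declaration."""
    if raw.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark
        raw = raw[3:]
    match = _DECL.match(raw[:200])
    return match.group(1).decode("ascii") if match else default


def reserialize(raw: bytes) -> bytes:
    """Hypothetical helper: parse, (transform), and write back in the input's encoding."""
    encoding = declared_encoding(raw)
    root = ET.fromstring(raw)  # the parser decodes to abstract characters
    # ... the real transformation of the tree would happen here ...
    # xml_declaration= on tostring() requires Python 3.8 or later.
    return ET.tostring(root, encoding=encoding, xml_declaration=True)
```

Characters that the target encoding can't represent come out as numeric
character references, which is about the best any serializer can do.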
But this is a pretty steep requirement: in effect you're saying "whatever
character encoding you want to give me, that's okay", but processors aren't
going to like that even in the best of all possible worlds.
Cheers,
Wendell
At 11:53 AM 10/25/01, David wrote:
> > When you say Unicode, does that equate to UTF-8, UTF-16, UTF-32 or
> > something else?
>No. Unicode is essentially an abstract collection of characters, numbered
>0 to x10FFFF (most of which slots are empty). An XML notation of &#333;
>refers to that abstract character number 333 (ō).
>
>However, to store Unicode strings in files (and other places) you need
>some encoding that maps bytes in the file to these characters. The UTF-x
>encodings are some of these (all UTF encodings have the property that
>they can encode the whole Unicode range). Other encodings such as ASCII
>or Latin-1 are similar, but can't encode the whole range of characters.
>
> > Or does the answer depend upon the XML parser you are
> > using, which in my case is MSXML3.0?
>
>No. Internally the parser obviously has to use some encoding to store
>things (often this is UTF-16, and it is in the case of MSXML). In some
>programming APIs you need to know this, as you get handed the string,
>but in XSLT you never need to know what happens internally.
>Your XSLT stylesheet is an XML document, so it goes through the same
>process.
>
>Character data in the stylesheet is mapped to abstract Unicode
>characters (using the encoding specified in the stylesheet)
>and the same happens for the source document. It is these abstract
>characters that are compared. So by then you don't need to know (and
>can't find out) what encoding the original files used.
>
>So your source might be in Latin-2 and your stylesheet might be in
>Latin-1, but by the time they have both been parsed everything is in
>abstract Unicode characters, and it is these that are compared
>in any XSLT query. (In fact MSXML3 uses UTF-16, but this is an internal
>detail that has no effect on the stylesheet.)
>
>David
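
David's distinction between abstract characters and their encodings can be
illustrated with a few lines of Python. This is only a sketch using
Python's built-in codecs, not anything specific to MSXML or XSLT, and the
function names are invented for the example.

```python
def encodings_of(code_point: int) -> dict:
    """Hypothetical helper: show one abstract character under several encodings."""
    ch = chr(code_point)  # the abstract character, independent of any file
    result = {
        "utf-8": ch.encode("utf-8").hex(),
        "utf-16-be": ch.encode("utf-16-be").hex(),
    }
    try:
        result["latin-1"] = ch.encode("latin-1").hex()
    except UnicodeEncodeError:
        # Latin-1 covers only code points 0 to 255
        result["latin-1"] = None
    return result


def same_after_decoding(text: str, encoding_a: str, encoding_b: str) -> bool:
    """Store the same text under two encodings, decode both, and compare."""
    bytes_a = text.encode(encoding_a)
    bytes_b = text.encode(encoding_b)
    return bytes_a.decode(encoding_a) == bytes_b.decode(encoding_b)


# Character number 333 (written &#333; in XML) is two bytes either way,
# but different bytes, and Latin-1 has no slot for it at all.
print(encodings_of(333))

# The bytes on disk differ, yet the decoded characters compare equal.
print(same_after_decoding("caf\u00e9", "latin-1", "utf-16"))
```

Once both inputs are decoded, the comparison sees only the characters,
which is why the stylesheet cannot tell what encoding a file started in.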
======================================================================
Wendell Piez mailto:wapiez@mulberrytech.com
Mulberry Technologies, Inc. http://www.mulberrytech.com
17 West Jefferson Street Direct Phone: 301/315-9635
Suite 207 Phone: 301/315-9631
Rockville, MD 20850 Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list