This is the mail archive of the xsl-list@mulberrytech.com mailing list .



Re: How to read the encoding of an XML document


So, James,

The bottom line is that what you want to do isn't readily possible, mainly 
because in order to define a standard, XML has to limit the encodings that 
processors are required to support. Whether a given parser can parse a 
given encoding, or whether an XSLT processor can write one out, is up to 
the processor. The only thing the XML standard stipulates is that a parser 
be able to read the standard Unicode encodings, UTF-8 and UTF-16.

One way to work around the problem would be to carry the encoding you want 
as a parameter. (For this purpose you could preprocess the file to look in 
the XML declaration and get the encoding pseudo-attribute.) Unfortunately, 
since you can't parameterize the output encoding in the stylesheet either 
(the encoding attribute of xsl:output is fixed when the stylesheet is 
written), you won't be able to rely on the processor's own serializer, but 
will have to work around the back end as well. Maybe someone on the list 
could suggest how: for example, by having the processor construct a DOM and 
then running the DOM tree through your own serializer that would do the 
transcoding.
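
As a rough illustration of both halves of that workaround, here is a 
minimal Python sketch. It is not tied to MSXML or any XSLT processor: the 
function names declared_encoding and reserialize are made up for this 
example, the UTF-8 fallback is an assumption for documents with no 
declaration, and the declaration sniffing only works for ASCII-compatible 
encodings (not UTF-16 or EBCDIC).

```python
import re
import xml.etree.ElementTree as ET

# Matches an XML declaration and captures the encoding pseudo-attribute.
# Only valid for ASCII-compatible encodings (an assumption of this sketch).
_DECL = re.compile(
    rb'''^<\?xml\s+version\s*=\s*["'][^"']*["']'''
    rb'''\s+encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']'''
)


def declared_encoding(data: bytes, default: str = "UTF-8") -> str:
    """Return the encoding named in the XML declaration, else `default`."""
    if data.startswith(b"\xef\xbb\xbf"):  # skip a UTF-8 byte order mark
        data = data[3:]
    match = _DECL.match(data)
    return match.group(1).decode("ascii") if match else default


def reserialize(data: bytes, target_encoding: str) -> bytes:
    """Parse the document into a tree, then write it in target_encoding.

    Characters the target encoding cannot represent are written as
    numeric character references by ElementTree's serializer.
    """
    root = ET.fromstring(data)
    return ET.tostring(root, encoding=target_encoding)


if __name__ == "__main__":
    source = b'<?xml version="1.0" encoding="ISO-8859-1"?><a>caf\xe9</a>'
    wanted = declared_encoding(source)   # carried along as the "parameter"
    print(wanted)
    print(reserialize(source, wanted))
```

In a real pipeline the tree passed to the serializer would be the 
transformation result rather than the reparsed source; the point is only 
that the encoding is read up front and applied by a serializer you control.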

But this is a pretty steep requirement: in effect you're saying "whatever 
character encoding you want to give me, that's okay", but processors aren't 
going to like that even in the best of all possible worlds.

Cheers,
Wendell

At 11:53 AM 10/25/01, David wrote:
> > When you say Unicode, does that equate to UTF-8, UTF-16, UTF-32 or
> > something else?
>No. Unicode is essentially an abstract collection of characters, numbered
>0 to x10FFFF (most of which slots are empty). An XML notation of &#333;
>refers to that abstract character, number 333.
>
>However, to store Unicode strings in files (and other places) you need
>some encoding that maps bytes in the file to these characters. UTF-x are
>some of those encodings (all UTF encodings have the property that they can
>encode the whole Unicode range). Other encodings such as ASCII or Latin-1
>are similar, but can't encode the whole range of characters.
>
> > Or does the answer depend upon the XML parser you are
> > using, which in my case is MSXML3.0?
>
>No. Internally the parser obviously has to use some encoding to store
>things (often this is UTF-16, and it is in the case of MSXML). In some
>programming APIs you need to know this, as you get handed the string,
>but in XSLT you never need to know what happens internally.
>Your XSLT stylesheet is an XML document, so it goes through the same
>process.
>
>Character data in the stylesheet is mapped to abstract unicode
>characters (using the encoding specified in the stylesheet)
>and the same happens for the source document. It is these abstract
>characters that are compared. So by then you don't need to know (and
>can't find out) what encoding the original files contained.
>
>So your source might be in latin-2 and your stylesheet might be in
>latin-1 but by the time they have both been parsed everything is in
>abstract unicode characters and it is these that are compared
>in any XSLT query. (In fact MSXML3 uses UTF-16, but this is an internal
>detail that has no effect on the stylesheet.)
>
>David
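
To make David's distinction concrete, here is a small Python sketch. It 
uses Python's standard xml.etree.ElementTree parser rather than MSXML, and 
the variable names and the helper can_encode are invented for the example. 
It shows one abstract character, number 333, taking different byte forms 
in different encodings, and two documents in different encodings yielding 
the same abstract characters once parsed.

```python
import xml.etree.ElementTree as ET

# A numeric character reference names an abstract character by number.
o_macron = ET.fromstring("<a>&#333;</a>").text
code_point = ord(o_macron)

# The same abstract character has different byte forms per encoding.
utf8_bytes = o_macron.encode("utf-8")
utf16_bytes = o_macron.encode("utf-16-be")


def can_encode(text: str, encoding: str) -> bool:
    """Report whether `encoding` can represent every character in `text`."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# One document stores U+0142 as the single Latin-2 byte 0xB3; the other
# is declared Latin-1, which lacks that character, so it has to use a
# character reference instead.
latin2_doc = b'<?xml version="1.0" encoding="ISO-8859-2"?><w>\xb3</w>'
latin1_doc = b'<?xml version="1.0" encoding="ISO-8859-1"?><w>&#322;</w>'

# After parsing, only abstract characters remain, and they compare equal.
same_after_parsing = (
    ET.fromstring(latin2_doc).text == ET.fromstring(latin1_doc).text
)

if __name__ == "__main__":
    print(code_point, utf8_bytes, utf16_bytes)
    print(can_encode(o_macron, "latin-1"))
    print(same_after_parsing)
```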


======================================================================
Wendell Piez                            mailto:wapiez@mulberrytech.com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
   Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

