This is the mail archive of the xsl-list@mulberrytech.com mailing list.



Re: Special Characters in URLs


Eriksson Magnus wrote:
> Yes, the URIs are interpreted by the Web Server/Web browser but I need them
> to be generated correctly by the XSLT processor -- to comply with the
> HTTP-standard (e.g. no white space in URLs). Is there a way to achieve this?

Re: the encoding:

The encoding of the document as a whole has no bearing on the %-style
escaping of characters in a URI. For example, if your stylesheet has

   <xsl:output method="html" encoding="iso-8859-1"/>
   and
   <a href="http://skew.org/printenv?greeting={greeting}">click</a>

and your XML has:

   <greeting>&#161;Hola!</greeting>

then your output should end up looking like:

   <a href="http://skew.org/printenv?greeting=%C2%A1Hola!">click</a>

You may have thought that the last 6 characters of that URI reference
would be bytes like:

    ¡  H  o  l  a  !
    A1 48 6F 6C 61 21  <-- iso-8859-1 bytes

because if you just did <xsl:value-of select="greeting"/> that is 
precisely what you would get.

It changes when the XSL processor emits it in an href attribute because
of this clause in the XSLT spec: "The html output method should escape
non-ASCII characters in URI attribute values using the method
recommended in Section B.2.1 of the HTML 4.0 Recommendation". That
section says to use UTF-8 as the basis for the %-escaping of the URI.
The UTF-8 encoding of the inverted exclamation mark (U+00A1) is the two
bytes C2 A1, so you likely get this in the output:

    %  C  2  %  A  1  H  o  l  a  !
    25 43 32 25 41 31 48 6F 6C 61 21  <-- iso-8859-1 bytes, still

See, you *did* get iso-8859-1 output like you asked for. The UTF-8-ness is
actually at a higher level of abstraction.
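
To make the two layers concrete, here is a minimal sketch in Python
(Python is my choice for illustration only, and the variable names are
made up; nothing here is what an XSLT processor literally runs). The
%-escaping is computed from the UTF-8 bytes of the character, and the
resulting all-ASCII string is then serialized in iso-8859-1.

```python
from urllib.parse import quote

greeting = "\u00a1Hola!"  # inverted exclamation mark followed by "Hola!"

# Layer 1: %-escape based on the UTF-8 bytes (C2 A1) of U+00A1.
# "!" is listed as safe so that quote() leaves it unescaped.
escaped = quote(greeting, safe="!", encoding="utf-8")

# Layer 2: serialize the (now pure ASCII) string in the output encoding.
output_bytes = escaped.encode("iso-8859-1")

print(escaped)
print(output_bytes.hex(" ").upper())
```

Because every character left after layer 1 is ASCII, the choice of
output encoding in layer 2 does not change the bytes of the URI.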

Note that this escaping happens *only* for non-ASCII characters
(U-00000080 and higher). So it does not affect those ASCII characters that
are reserved or disallowed in a URI, like " ", among others.
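
As a rough sketch of that rule in Python (the function name and the
"Hola amigo" input are made up for illustration, and a real processor
may differ in details), only characters at U+0080 and above get
escaped, so a space comes through unchanged:

```python
def escape_non_ascii(uri: str) -> str:
    """%-escape only characters at U+0080 and above, using their UTF-8 bytes."""
    parts = []
    for ch in uri:
        if ord(ch) < 0x80:
            # ASCII, including the space, passes through untouched.
            parts.append(ch)
        else:
            parts.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(parts)


result = escape_non_ascii("http://skew.org/printenv?greeting=\u00a1Hola amigo!")
print(result)
```

The space between "Hola" and "amigo" is still there in the result, which
is why it has to be dealt with separately.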

Even if the XSLT processor failed to do the UTF-8 based escaping of
non-ASCII characters, the HTML user agents are supposed to do it when
interpreting the URI reference anyway.

Of course, your problem is on the server end. Chances are you are coding
against an API that expects iso-8859-1 as the basis for the URL escaping.
That is a perfectly reasonable assumption, especially since browsers tend
to send URL-encoded form data with the escaping based on the actual
encoding of the document containing the form (or rather, the encoding
that the browser assumes the containing document is using, which the
user can override).

If you make the containing document utf-8 instead of iso-8859-1, you can
assume that all the escaping is UTF-8 based. You can then take the
misinterpreted-as-iso-8859-1 strings you get from the form data API,
convert them back to iso-8859-1 bytes, and read those bytes back into a
string using a utf-8 interpretation.
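
Here is a minimal sketch of that round trip in Python. The unquote()
call only stands in for whatever form data API you are using, which I
am assuming decodes on an iso-8859-1 basis; the variable names are
hypothetical.

```python
from urllib.parse import unquote

# What an API that assumes iso-8859-1 hands back for a value that was
# actually escaped on a UTF-8 basis: two characters instead of one.
misread = unquote("%C2%A1Hola!", encoding="iso-8859-1")

# Undo the wrong assumption: recover the original bytes, then read
# those bytes as UTF-8.
repaired = misread.encode("iso-8859-1").decode("utf-8")

print(repr(misread))
print(repr(repaired))
```

This only works if the escaping really was UTF-8 based; if the bytes are
not valid UTF-8, the decode() step raises UnicodeDecodeError.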

Your other option is to avoid putting the raw non-ASCII characters in the
URI refs in the first place. If you absolutely must have %A1 for inverted
exclamation mark, then the only way to ensure this is to make your
stylesheet put %A1 in the result tree. You can do this using an extension 
function (ideal) or with a clever recursive template.
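
For what such an extension function would have to compute, here is a
sketch in Python. How you register an extension function depends
entirely on your XSLT processor, so that part is left out, and the name
latin1_escape is hypothetical.

```python
from urllib.parse import quote


def latin1_escape(value: str) -> str:
    """%-escape a string based on its iso-8859-1 bytes (hypothetical helper)."""
    # safe="" escapes every reserved character as well. errors="strict"
    # (the default, spelled out here) raises UnicodeEncodeError for any
    # character that iso-8859-1 cannot represent.
    return quote(value, safe="", encoding="iso-8859-1", errors="strict")


print(latin1_escape("\u00a1Hola!"))
```

Since the returned string is pure ASCII, the html output method has
nothing left to escape on a UTF-8 basis.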

Re: escaping of ASCII characters like " " (space), you must also control
this in your stylesheet. If you want "+" or "%20" (the latter is
preferable), then have your stylesheet explicitly put that in the result
tree.
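
The two spellings of an escaped space, sketched in Python with a made-up
value, just to show what the stylesheet would need to emit:

```python
from urllib.parse import quote, quote_plus

text = "hello world"  # made-up value for illustration

with_percent = quote(text, safe="")  # space becomes %20
with_plus = quote_plus(text)         # space becomes +

print(with_percent)
print(with_plus)
```

The "+" form is the convention for URL-encoded form data, while "%20"
is valid anywhere in a URI.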

See also: http://skew.org/xml/misc/URI-i18n/

Hope this helps.

   - Mike
_____________________________________________________________________________
mike j. brown, software engineer at  |  xml/xslt: http://skew.org/xml/
webb.net in denver, colorado, USA    |  personal: http://hyperreal.org/~mike/

 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

