This is the mail archive of the kawa@sourceware.org mailing list for the Kawa project.



Re: Escaping of non-ASCII characters in XML


On 07/23/2012 04:57 AM, ÐÐÐÑÑÐÐ wrote:
Hello!

   I believe the current functions for creating XML in Kawa are
practically unusable for languages with a non-Latin script.

   E.g. <p>Перевірка</p> is automatically escaped to
<p>&#x41F;&#x435;&#x440;&#x435;&#x432;&#x456;&#x440;&#x43A;&#x430;</p>.
All non-ASCII characters are escaped.
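
To make the reported behaviour concrete, here is a minimal Java sketch of this kind of escaping. NonAsciiEscaper and escapeNonAscii are hypothetical names used only for illustration; this is not Kawa's actual XMLPrinter code, and it deliberately leaves markup characters such as < and & alone.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of Kawa.
final class NonAsciiEscaper {
    /** Replaces every code point above U+007F with a hexadecimal character reference. */
    static String escapeNonAscii(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // Iterate by code point so a surrogate pair yields a single reference.
        text.codePoints().forEach(cp -> {
            if (cp > 0x7F) {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                   .append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // The Ukrainian word from the example, written with Unicode escapes
        // so that this source file stays pure ASCII.
        String word = "\u041F\u0435\u0440\u0435\u0432\u0456\u0440\u043A\u0430";
        System.out.println(escapeNonAscii("<p>" + word + "</p>"));
        // prints <p>&#x41F;&#x435;&#x440;&#x435;&#x432;&#x456;&#x440;&#x43A;&#x430;</p>
    }
}
```

Each Cyrillic letter takes two bytes in UTF-8 but seven or eight ASCII characters as a reference, which is where the extra bulk comes from.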

This shouldn't really matter in principle. Humans normally wouldn't be looking at computer-generated XML/HTML. However, it does make the output bulkier, and it makes "View Source" (or the equivalent) uglier, so it's certainly not ideal.

   Does anyone really need this kind of escaping? Kawa's internal HTTP server
escapes strings after this anyway, so in this case it's a mere duplication.

(The server escaping is also not quite adequate for Ukrainian and Russian,
but this is a different issue.)

Can you remind me where Kawa's internal HTTP server does the string-escaping?


   Is it possible to add "xp.escapeNonAscii = false;" somewhere in the
gnu.kawa.xml.KNode:toString function (gnu\kawa\xml\KNode.java, after line 32).
[I believe this should turn the escaping off, but I don't have JDK at hand to
check.] xp.escapeNonAscii shouldn't affect control characters (these are
encoded anyway), only characters outside ASCII.

It might make sense, but I'm a little uncomfortable with the idea that
toString output is different from printing to a file. What if you then print
the toString result to an ASCII or Latin-1-only file or terminal?


Of course you have the same problem printing strings in general.

   If this escaping is desirable for some reason (though I can't think of any),
is it possible to add some variable like *xml-escape-string* to turn this
escaping off?

The escaping has the big advantage that the output is portable, regardless of
the target environment's character encoding.


Now if most of the world is using Unicode, perhaps it isn't as much of an issue
as it used to be. But I'm guessing ISO/IEC 8859-5 might still be fairly
common in your part of the world - and then what happens if I write out (say) à?


W3C in http://www.w3.org/TR/xslt-xquery-serialization-30/#HTML_CHARDATA says:

  "Entity references and character references SHOULD be used only where
  the character is not present in the selected encoding"

A problem is that getting at the encoding and then figuring out if a character
is present are both non-trivial.
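
As a rough illustration of the second half (testing whether a character is present in an encoding), the JDK's java.nio.charset.CharsetEncoder.canEncode can do the check once the target Charset is known. The sketch below is only that: CharsetAwareEscaper and escapeUnencodable are hypothetical names, nothing here is part of Kawa or XMLPrinter, and it does not address how the printer would learn the output encoding in the first place.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper for illustration; not part of Kawa or XMLPrinter.
final class CharsetAwareEscaper {
    /** Keeps code points the target charset can encode; escapes the rest. */
    static String escapeUnencodable(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            int len = Character.charCount(cp);
            // Pass the whole code point so surrogate pairs are tested together.
            CharSequence unit = text.subSequence(i, i + len);
            if (encoder.canEncode(unit)) {
                out.append(unit);
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase(Locale.ROOT))
                   .append(';');
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // U+043F (Cyrillic pe) followed by U+00E0 (a with grave accent).
        String sample = "\u043F\u00E0";
        // Neither character exists in US-ASCII, so both are escaped.
        System.out.println(escapeUnencodable(sample, StandardCharsets.US_ASCII));
        // ISO-8859-1 contains U+00E0 but not U+043F, so only the latter is escaped.
        System.out.println(escapeUnencodable(sample, StandardCharsets.ISO_8859_1));
    }
}
```

On a JVM that ships the ISO-8859-5 charset, passing Charset.forName("ISO-8859-5") should give the opposite result for the same sample: the Cyrillic letter is kept and the accented Latin letter is escaped.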


You would probably prefer Cyrillic letters to be non-escaped, but you might want
à to be escaped. (And you might prefer this even if you or your server runs
in a Unicode locale, since your clients might not.) So ideally you'd like to
use a charset (http://www.gnu.org/software/kawa/Character-sets.html) to control
which characters are escaped. But there is a layering problem: XMLPrinter should
not depend on the Kawa-Scheme language, but charsets are implemented in
pure Scheme. (The solution to that may be to move the actual data type
and the core primitive methods to gnu/kawa/util.)
--
--Per Bothner
per@bothner.com http://per.bothner.com/


