This is the mail archive of the kawa@sourceware.org mailing list for the Kawa project.


Re: Escaping of non-ASCII characters in XML

From: Per Bothner <per at bothner dot com>
To: Дмитрий <dmymd at yandex dot ru>
Cc: kawa at sources dot redhat dot com
Date: Mon, 23 Jul 2012 13:21:02 -0700
Subject: Re: Escaping of non-ASCII characters in XML
References: <108331343044663@web20g.yandex.ru>

On 07/23/2012 04:57 AM, Дмитрий wrote:

Hello!

   I believe the current XML functions for creating XML and found XML in Kawa
practically unusable for languages with a non-Latin script.

   E.g. <p>Перевірка</p> is automatically escaped to
<p>&#x41F;&#x435;&#x440;&#x435;&#x432;&#x456;&#x440;&#x43A;&#x430;</p>.
All non-ASCII characters are escaped.


This shouldn't really matter in principle.  Humans normally wouldn't be
looking at computer-generated XML/HTML. However, it does make the output
bulkier, and it makes "View Source" (or the equivalent) uglier, so it's
certainly not ideal.

   Does anyone really need this kind of escaping? Kawa's internal HTTP server
escapes strings after this anyway, so in this case it's a mere duplication.

(The server escaping is also not quite adequate for Ukrainian and Russian,
but this is a different issue.)

Can you remind me where Kawa's internal HTTP server does the string-escaping?

   Is it possible to add "xp.escapeNonAscii = false;" somewhere in the
gnu.kawa.xml.KNode:toString function (gnu\kawa\xml\KNode.java, after line 32)?
[I believe this should turn the escaping off, but I don't have JDK at hand to
check.] xp.escapeNonAscii shouldn't affect control characters (these are
encoded anyway), only characters outside ASCII.

It might make sense, but I'm a little uncomfortable with the idea that toString output is different from printing to a file. What if you then print the toString result to an ASCII-only or Latin-1-only file or terminal?

Of course you have the same problem printing strings in general.

   If this escaping is desirable for some reason (though I can't think of any),
is it possible to add some variable like *xml-escape-string* to turn this
escaping off?

It has the big advantage that the output is portable, regardless of the target environment's character encoding.

Now if most of the world is using Unicode, perhaps it isn't as much of an issue as it used to be. But I'm guessing ISO/IEC 8859-5 might still be fairly common in your part of the world - and then what happens if I write out (say) Ã ?

W3C in http://www.w3.org/TR/xslt-xquery-serialization-30/#HTML_CHARDATA says:

  "Entity references and character references SHOULD be used only where
  the character is not present in the selected encoding"

A problem is that getting at the encoding and then figuring out if a character is present are both non-trivial.
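
To make the "present in the selected encoding" test concrete, here is a minimal Java sketch. It is not Kawa's XMLPrinter code: the class and method names are hypothetical, and it assumes the target encoding is already known and handed in as a java.nio.charset.Charset, which sidesteps the harder problem of finding out what the encoding is. It relies on the standard CharsetEncoder.canEncode method and writes a character reference only when the encoder cannot represent the character. The ISO-8859-5 lookup in main assumes that charset is installed in the running JVM.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of Kawa.
public class EncodingAwareEscaper {

    // Escapes XML markup characters unconditionally, and any other
    // character only if the target charset cannot encode it.
    public static String escape(String text, Charset target) {
        CharsetEncoder encoder = target.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            // Work on whole code points so surrogate pairs stay together.
            int cp = text.codePointAt(i);
            int len = Character.charCount(cp);
            if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (cp == '&') {
                out.append("&amp;");
            } else if (encoder.canEncode(text.subSequence(i, i + len))) {
                out.appendCodePoint(cp);
            } else {
                // Not representable in the target encoding: hex character reference.
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "Перевірка" followed by a Latin letter that ISO-8859-5 lacks.
        String s = "\u041F\u0435\u0440\u0435\u0432\u0456\u0440\u043A\u0430 \u00E0";
        System.out.println(escape(s, StandardCharsets.US_ASCII));
        // Assumption: ISO-8859-5 is available in this JVM.
        System.out.println(escape(s, Charset.forName("ISO-8859-5")));
        System.out.println(escape(s, StandardCharsets.UTF_8));
    }
}
```

Note that a java.nio Charset here means a character encoding, which is a different thing from the Kawa char-set type mentioned below; the sketch only covers the encoding-driven half of the decision.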

You would probably prefer Cyrillic letters to be left unescaped, but you might want Ã to be escaped. (And you might prefer this even if you or your server runs in a Unicode locale, since your clients might not.) So ideally you'd like to use a charset (http://www.gnu.org/software/kawa/Character-sets.html) to control which characters are escaped. But there is a layering problem: XMLPrinter should not depend on the Kawa-Scheme language, but charsets are implemented in pure Scheme. (The solution to that may be to move the actual data type and the core primitive methods to gnu/kawa/util.)
--
	--Per Bothner
per@bothner.com   http://per.bothner.com/

References:
- Escaping of non-ASCII characters in XML
  - From: Дмитрий
