This is the mail archive of the
kawa@sourceware.org
mailing list for the Kawa project.
Re: Escaping of non-ASCII characters in XML
- From: Per Bothner <per at bothner dot com>
- To: Дмитрий <dmymd at yandex dot ru>
- Cc: kawa at sources dot redhat dot com
- Date: Mon, 23 Jul 2012 13:21:02 -0700
- Subject: Re: Escaping of non-ASCII characters in XML
- References: <108331343044663@web20g.yandex.ru>
On 07/23/2012 04:57 AM, Дмитрий wrote:
Hello!
I believe the current functions for creating XML in Kawa are
practically unusable for languages with a non-Latin script.
E.g. <p>Перевірка</p> is automatically escaped to
<p>&#x41F;&#x435;&#x440;&#x435;&#x432;&#x456;&#x440;&#x43A;&#x430;</p>.
All non-ASCII characters are escaped.
This shouldn't really matter in principle. Humans normally wouldn't be
looking at computer-generated XML/HTML. However, it does make the output
bulkier, and it makes "View Source" (or the equivalent) uglier, so it's
certainly not ideal.
Does anyone really need this kind of escaping? Kawa's internal HTTP server
escapes strings after this anyway, so in this case it's a mere duplication.
(The server escaping is also not quite adequate for Ukrainian and Russian,
but this is a different issue.)
Can you remind me where Kawa's internal HTTP server does the
string-escaping?
Is it possible to add "xp.escapeNonAscii = false;" somewhere in the
gnu.kawa.xml.KNode:toString function (gnu\kawa\xml\KNode.java, after line 32)?
[I believe this should turn the escaping off, but I don't have JDK at hand to
check.] xp.escapeNonAscii shouldn't affect control characters (these are
encoded anyway), only characters outside ASCII.
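As a rough illustration of what such a flag would control, here is a minimal Java sketch. It is not Kawa's XMLPrinter code: the class EscapeSketch, its escape method, and the use of hexadecimal character references are assumptions made for the example. Only the field name escapeNonAscii comes from the message above.

```java
// Hypothetical sketch; NOT the actual Kawa XMLPrinter implementation.
final class EscapeSketch {
    // Named after the flag in the message; true means "escape everything outside ASCII".
    boolean escapeNonAscii = true;

    // Escapes text content (not markup) for inclusion in an XML document.
    String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == '<') {
                out.append("&lt;");
            } else if (cp == '>') {
                out.append("&gt;");
            } else if (cp == '&') {
                out.append("&amp;");
            } else if ((cp < 0x20 && cp != '\t' && cp != '\n' && cp != '\r') || cp == 0x7F) {
                // Control characters are written as references regardless of the flag.
                out.append(String.format("&#x%X;", cp));
            } else if (cp > 0x7F && escapeNonAscii) {
                // Only this branch is switched off by escapeNonAscii = false.
                out.append(String.format("&#x%X;", cp));
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

With the flag left at true, a Cyrillic letter comes out as a numeric reference; with it set to false, the letter passes through unchanged while control characters are still encoded.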
It might make sense, but I'm a little uncomfortable with the idea that
toString output is different from printing to a file. What if you then
print out the toString result to an ASCII-only or Latin-1-only file or
terminal?
Of course you have the same problem printing strings in general.
If this escaping is desirable for some reason (though I can't think of any),
is it possible to add some variable like *xml-escape-string* to turn this
escaping off?
Escaping has the big advantage that the output is portable, regardless of
the target environment's character encoding.
Now if most of the world is using Unicode, perhaps it isn't as much of
an issue as it used to be. But I'm guessing ISO/IEC 8859-5 might still be
fairly common in your part of the world - and then what happens if I write
out (say) à?
W3C in http://www.w3.org/TR/xslt-xquery-serialization-30/#HTML_CHARDATA
says:
"Entity references and character references SHOULD be used only where
the character is not present in the selected encoding"
A problem is that getting at the encoding and then figuring out whether a
character is present in it are both non-trivial.
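For the second half of that problem, the JDK does offer a direct test: CharsetEncoder.canEncode. The sketch below uses it to escape only the characters that a chosen output encoding cannot represent, in the spirit of the W3C wording. The class and method names are made up for illustration, markup characters such as < and & are deliberately left out to keep it short, and how XMLPrinter would learn the output encoding is left open.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper; not part of Kawa.
final class EncodingAwareEscaper {
    private final CharsetEncoder encoder;

    EncodingAwareEscaper(String charsetName) {
        // Throws if the runtime does not support the named charset.
        this.encoder = Charset.forName(charsetName).newEncoder();
    }

    // Writes a character reference only where the target encoding lacks the character.
    String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            int len = Character.charCount(cp);
            String ch = text.substring(i, i + len);
            i += len;
            if (encoder.canEncode(ch)) {
                out.append(ch);
            } else {
                out.append(String.format("&#x%X;", cp));
            }
        }
        return out.toString();
    }
}
```

For example, an instance built for "ISO-8859-5" would leave Cyrillic letters alone but turn à into a reference, while one built for "UTF-8" would escape nothing.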
You would probably prefer Cyrillic letters to be non-escaped, but you might
want à to be escaped. (And you might prefer this even if you or your server
runs in a Unicode locale, since your clients might not.) So ideally you'd
like to use a charset (http://www.gnu.org/software/kawa/Character-sets.html)
to control which characters are escaped. But there is a layering problem:
XMLPrinter should not depend on the Kawa-Scheme language, but charsets are
implemented in pure Scheme. (The solution to that may be to move the actual
data type and the core primitive methods to gnu/kawa/util.)
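One way to picture that layering is for the printer to depend only on a small membership test, which a charset object could later satisfy. In the sketch below, java.util.function.IntPredicate stands in for whatever core charset type might end up in gnu/kawa/util. That stand-in and the class name PredicateEscaper are assumptions for the example, not existing Kawa API.

```java
import java.util.function.IntPredicate;

// Hypothetical sketch of a printer-side escaper that knows nothing about Scheme.
final class PredicateEscaper {
    // Answers "may this code point be written unescaped?"; a charset could supply it.
    private final IntPredicate writeUnescaped;

    PredicateEscaper(IntPredicate writeUnescaped) {
        this.writeUnescaped = writeUnescaped;
    }

    // Escapes every code point the predicate rejects (markup handling omitted).
    String escape(String text) {
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            if (writeUnescaped.test(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(String.format("&#x%X;", cp));
            }
        });
        return out.toString();
    }
}
```

A caller wanting ASCII plus the Cyrillic block unescaped could pass cp -> cp < 0x80 || (cp >= 0x400 && cp <= 0x4FF), which keeps Cyrillic letters readable and still escapes à.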
--
--Per Bothner
per@bothner.com http://per.bothner.com/