This is the mail archive of the xsl-list@mulberrytech.com mailing list.
Re: XML apparently cannot be used for general text markup: whitespace gripe
- From: Wendell Piez <wapiez at mulberrytech dot com>
- To: xsl-list at lists dot mulberrytech dot com
- Date: Tue, 19 Mar 2002 11:57:16 -0500
- Subject: Re: [xsl] XML apparently cannot be used for general text markup: whitespace gripe
- Reply-to: xsl-list at lists dot mulberrytech dot com
Hi all,
At 09:06 AM 3/19/2002, Chad wrote:
> I've noticed a lot of xml-derived web pages out there have screwed up
>whitespace (words crammed together or an incorrect space before ending
>punctuation).
Or spurious whitespace within words, or ... and not only xml-derived pages
but many kinds of pages that apparently come out of automated production
systems. (Don't blame XML: this issue predates it.)
> My conclusion is that blocks straight text (such as paragraphs) cannot be
>further marked up with XML without screwing up spacing.
I'd like to answer this at (even) more length, but time constraints prevent
it. Still, it's an important issue.
As I see it there are really only two ways you can go with this problem.
Place responsibility for correct whitespace usage at the point of
production of the XML, clearly distinguishing where whitespace is
significant (must be preserved as given) and where it's not. Or place
responsibility at the point of processing, e.g. have the stylesheet
designer plan for munging.
The first approach is, I believe, preferable where possible, and is in
keeping with the whitespace-handling mechanisms provided in XML (such as
they are). The placement of whitespace correctly is recognized as an
authorial and editorial concern, much like correct spelling or grammar. In
this scenario, you would just never have to deal with the input:
<par>
Is his name really <first>John</first> <last>Doe</last>?
</par>
but instead, would have
<par>Is his name really <first>John</first> <last>Doe</last>?</par>
It would be an editorial responsibility to make sure that content of your
<par> elements would follow the rule here, that as far as whitespace is
concerned, WYSIWYG -- so garbage in, garbage out. This is a purist
approach, taking the line that if it's data, it's data, and that it's
really too much to expect any lightweight text processor to have heuristics
intelligent enough to know that, e.g., the initial whitespace appearing
after the start tag but before the word "Is", doesn't count, but *one* of
the spaces between the <first> and <last> elements does.
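To illustrate this first approach, here is a minimal sketch of a stylesheet that does no whitespace munging at all and simply trusts the source. The html/body/p mapping is an assumption made for the example; only the element names par, first and last come from the thread.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="html"/>

  <!-- Wrap the whole result in an html/body shell (assumed layout). -->
  <xsl:template match="/">
    <html><body><xsl:apply-templates/></body></html>
  </xsl:template>

  <!-- Each par becomes a p. No template touches text nodes, so the
       built-in rule copies them through exactly as given, including
       the single space between first and last. -->
  <xsl:template match="par">
    <p><xsl:apply-templates/></p>
  </xsl:template>
</xsl:transform>
```

Given the one-line <par> above, the paragraph text should come out as "Is his name really John Doe?". Note that there is no xsl:strip-space here: in XSLT 1.0 all source whitespace is preserved unless you ask otherwise, so xsl:preserve-space elements="*" only restates the default. If the space between the two names still disappears, a likely suspect is a parser configured to drop whitespace-only text nodes before the stylesheet ever sees them.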
On the other hand, this approach is not always possible -- not least
because in a system where whitespace cleanup was mandated editorially, you
might be the person asked to write a routine to fix the inevitable
whitespace problems before handing the data to the production staff, and
you might want to have automated or semi-automated ways to do this. (Not
all authors can be trusted; and what's worse, some XML systems introduce
whitespace "for you", taking control away.)
So how to write the XSLT to do the cleanup? As I said, I can't specify it
here in detail, but a general approach would be:
1. Normalize space on all text nodes
   (i.e. remove leading and trailing, collapse all internal whitespace to
   a single space character)
2. Use heuristics to add single #32 characters back in to pad where there
   should really be whitespace. Heuristics would include:
   2a. which elements are concerned
       e.g. add it back after whitespace normalization here:
         <first>John</first> <last>Doe</last>
       but not here:
         H<sub>2</sub>O
   2b. which neighboring characters are around
       e.g. don't add whitespace back before punctuation characters, as in
         <last>Doe</last>?</par>
3. Serialize and/or post-process using tools that will not introduce or
   remove whitespace, particularly not inside elements that contain any
   non-whitespace #PCDATA (or better, inside any element of any type that
   contains #PCDATA anywhere).
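Steps 1 and 2 might be sketched in XSLT 1.0 roughly as follows. This is an illustration under stated assumptions, not a worked-out solution: the list of "word-level" elements (first, last) and the list of punctuation characters are both invented for the example and would have to come from knowledge of your own document type and language.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
  <xsl:output method="xml" indent="no"/>

  <!-- Heuristic 2b: characters that never get a space put in front of
       them. Assumed list; language dependent. -->
  <xsl:variable name="no-space-before" select="'.,;:!?)'"/>

  <!-- Identity rule: copy anything not handled below. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Heuristic 2a: word-level elements (assumed to be first and last).
       Pad in front when non-whitespace content precedes the element
       inside its parent. sub is not listed, so H<sub>2</sub>O is
       left unpadded. -->
  <xsl:template match="par//first | par//last">
    <xsl:if test="preceding-sibling::node()[self::* or self::text()]
                                           [normalize-space(.) != '']">
      <xsl:text> </xsl:text>
    </xsl:if>
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Step 1: normalize every text node inside a par. Then pad text
       that directly follows a word-level element, unless it starts
       with one of the no-space-before characters (2b). -->
  <xsl:template match="par//text()">
    <xsl:variable name="text" select="normalize-space(.)"/>
    <xsl:if test="$text != ''
                  and preceding-sibling::node()[1][self::first or self::last]
                  and not(contains($no-space-before, substring($text, 1, 1)))">
      <xsl:text> </xsl:text>
    </xsl:if>
    <xsl:value-of select="$text"/>
  </xsl:template>
</xsl:transform>
```

The sketch shows its own limits quickly. Any inline element missing from the word-level list loses the space in front of it, and a word-level element placed right after an opening parenthesis gains a space it should not have. Step 3 is left to the serializer, hence indent="no".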
Naturally it would be nice if these heuristics could be generalized to the
point where there could be a standard way of handling whitespace, e.g. in
browsers; but I think you can see that 2b. is a very tall order (language
dependent and not always consistent within a language) and 2a. is
impossible in the general case without either (a) some kind of support from
a schema or specification, to distinguish e.g. between "word-level" and
"character-level" markup, or (b) extending xml:space with some monstrous
semantics and using it all over the place.
At root, I think we see this problem as an expression of the Worlds in
Collision represented by XML: on the data side, people are used to throwing
in whitespace wherever, just to make the source code readable (which in
principle is a good thing); whereas on the document side, white space has
to be regarded as part of our source data since we simply have no way of
knowing when it's not. In other words, whitespace is both, or either, data
content, or "just markup" -- as it always has been.
But I'd be interested in what others have to say about this.
Cheers,
Wendell
Chad continued:
> For example, can anyone get this simple document into HTML without either
>removing required spaces or adding inappropriate spaces?
>
> <?xml version="1.0"?>
> <book>
> <par>
> Is his name really <first>John</first> <last>Doe</last>?
> </par>
> </book>
>
> Either you will end up with:
> "Is his name really JohnDoe?"
> which is wrong, or:
> "Is his name really John Doe ?"
> which is also wrong.
>
> Of course, this is a very simple example. In real-life situations bad
>whitespace causes really nasty problems. Of course, I'm pretty new to XSL
>so maybe I just can't read the directions. Here's my XSL example:
>
> <?xml version="1.0" encoding="utf-8"?>
> <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
> <xsl:output method="html"/>
> <xsl:preserve-space elements="*"/>
> <xsl:template match="/">
> <html><xsl:apply-templates/></html>
> </xsl:template>
> </xsl:transform>
>
> Does anyone know of a work-around for this common problem?
======================================================================
Wendell Piez                            mailto:wapiez@mulberrytech.com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD 20850                                  Fax: 301/315-8285
----------------------------------------------------------------------
Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================
XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list