This is the mail archive of the xsl-list@mulberrytech.com mailing list .



Re: XML apparently cannot be used for general text markup: whitespace gripe


Hi all,

At 09:06 AM 3/19/2002, Chad wrote:
>  I've noticed a lot of xml-derived web pages out there have screwed up
>whitespace (words crammed together or an incorrect space before ending
>punctuation).

Or spurious whitespace within words, or ... and not only xml-derived pages 
but many kinds of pages that apparently come out of automated production 
systems. (Don't blame XML: this issue predates it.)

>  My conclusion is that blocks straight text (such as paragraphs) cannot be
>further marked up with XML without screwing up spacing.

I'd like to answer this at (even) more length, but time constraints prevent 
it. Still, it's an important issue.

As I see it, there are really only two ways to go with this problem. 
Either place responsibility for correct whitespace usage at the point of 
production of the XML, clearly distinguishing where whitespace is 
significant (must be preserved as given) and where it's not; or place 
responsibility at the point of processing, e.g. have the stylesheet 
designer plan for munging.

The first approach is, I believe, preferable where possible, and is in 
keeping with the whitespace-handling mechanisms provided in XML (such as 
they are). Correct placement of whitespace is recognized as an 
authorial and editorial concern, much like correct spelling or grammar. In 
this scenario, you would just never have to deal with the input:

     <par>
       Is his name really <first>John</first>      <last>Doe</last>?
     </par>

but instead, would have

<par>Is his name really <first>John</first> <last>Doe</last>?</par>

It would be an editorial responsibility to make sure that content of your 
<par> elements would follow the rule here, that as far as whitespace is 
concerned, WYSIWYG --  so garbage in, garbage out. This is a purist 
approach, taking the line that if it's data, it's data, and that it's 
really too much to expect any lightweight text processor to have heuristics 
intelligent enough to know that, e.g., the initial whitespace appearing 
after the start tag but before the word "Is", doesn't count, but *one* of 
the spaces between the <first> and <last> elements does.

On the other hand, this approach is not always possible -- not least 
because in a system where whitespace cleanup was mandated editorially, you 
might be the person asked to write a routine to fix the inevitable 
whitespace problems before handing the data to the production staff, and 
you might want to have automated or semi-automated ways to do this. (Not 
all authors can be trusted; and what's worse, some XML systems introduce 
whitespace "for you", taking control away.)

So how to write the XSLT to do the cleanup? As I said, I can't specify it 
here in detail, but a general approach would be:

1. Normalize space on all text nodes
    (i.e. remove leading and trailing, collapse all internal whitespace to
     a single space character)
2. Use heuristics to add single #32 characters back in to pad where there
    should really be whitespace. Heuristics would include:
    2a. which elements are concerned
       e.g. add it back after whitespace normalization here:
              <first>John</first>    <last>Doe</last>
            but not here:
              H<sub>2</sub>O
    2b. which neighboring characters are around
       e.g. don't add whitespace back before punctuation characters, as in
              <last>Doe</last>?</par>
3. Serialize and/or post-process using tools that will not introduce or
    remove whitespace, particularly not inside elements that contain any
    non-whitespace #PCDATA (or better, inside any element of any type that
    contains #PCDATA anywhere).
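
As a rough illustration of steps 1 and 2 only, here is a minimal XSLT 1.0 
sketch, not a worked-out solution. It makes several assumptions that are 
mine, not part of any standard: the element names (book, par) are taken 
from Chad's example; the list of punctuation characters is hypothetical and 
language-dependent; and instead of a per-element list for 2a, it treats 
whitespace at the edge of a text node as a signal to pad only when that 
text node has a sibling on that side. Output is plain text to keep the 
sketch short.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Assumption: book has element-only content, so whitespace
       directly inside it is indentation and can be stripped. -->
  <xsl:strip-space elements="book"/>

  <!-- One output line per paragraph. -->
  <xsl:template match="par">
    <xsl:apply-templates/>
    <xsl:text>&#10;</xsl:text>
  </xsl:template>

  <xsl:template match="text()">
    <!-- Step 1: trim both ends, collapse internal whitespace. -->
    <xsl:variable name="norm" select="normalize-space(.)"/>
    <!-- True when the raw text starts or ends with whitespace. -->
    <xsl:variable name="lead"
        select="normalize-space(substring(., 1, 1)) = ''"/>
    <xsl:variable name="trail"
        select="normalize-space(substring(., string-length(.), 1)) = ''"/>
    <!-- Heuristic 2b: hypothetical list of characters that should
         not be preceded by a space. Only consulted when the
         normalized text is non-empty. -->
    <xsl:variable name="punct"
        select="contains('.,;:!?)', substring($norm, 1, 1))"/>
    <xsl:choose>
      <!-- Whitespace-only node: keep a single space only when it
           sits between two sibling nodes. -->
      <xsl:when test="$norm = ''">
        <xsl:if test="preceding-sibling::node() and following-sibling::node()">
          <xsl:text> </xsl:text>
        </xsl:if>
      </xsl:when>
      <xsl:otherwise>
        <!-- Step 2: pad with one space where the source had
             whitespace next to a sibling node. -->
        <xsl:if test="$lead and preceding-sibling::node() and not($punct)">
          <xsl:text> </xsl:text>
        </xsl:if>
        <xsl:value-of select="$norm"/>
        <xsl:if test="$trail and following-sibling::node()">
          <xsl:text> </xsl:text>
        </xsl:if>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

</xsl:stylesheet>
```

Run against Chad's sample document, this should give "Is his name really 
John Doe?", and markup such as H<sub>2</sub>O stays closed up because there 
is no whitespace in the source to pad from. A known limitation of the 
sketch: whitespace that an author has put just inside an inline element, as 
in <first>John </first><last>Doe</last>, is dropped, so a real routine 
would still need the element-aware heuristics of 2a.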

Naturally it would be nice if these heuristics could be generalized to the 
point where there could be a standard way of handling whitespace, e.g. in 
browsers; but I think you can see that 2b. is a very tall order (language 
dependent and not always consistent within a language) and 2a. is 
impossible in the general case without either (a) some kind of support from 
a schema or specification, to distinguish e.g. between "word-level" and 
"character-level" markup, or (b) extending xml:space with some monstrous 
semantics and using it all over the place.

At root, I think we see this problem as an expression of the Worlds in 
Collision represented by XML: on the data side, people are used to throwing 
in whitespace wherever, just to make the source code readable (which in 
principle is a good thing); whereas on the document side, white space has 
to be regarded as part of our source data since we simply have no way of 
knowing when it's not. In other words, whitespace is either data content 
or "just markup", or both -- as it always has been.

But I'd be interested in what others have to say about this.

Cheers,
Wendell

Chad continued:
>  For example, can anyone get this simple document into HTML without either
>removing required spaces or adding inappropriate spaces?
>
>   <?xml version="1.0"?>
>   <book>
>      <par>
>       Is his name really <first>John</first>      <last>Doe</last>?
>     </par>
>   </book>
>
>  Either you will end up with:
>     "Is his name really JohnDoe?"
>   which is wrong, or:
>     "Is his name really John Doe ?"
>   which is also wrong.
>
>  Of course, this is a very simple example. In real-life situations bad
>whitespace causes really nasty problems.  Of course, I'm pretty new to XSL
>so maybe I just can't read the directions. Here's my XSL example:
>
>  <?xml version="1.0" encoding="utf-8"?>
>  <xsl:transform xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
>                 version="1.0">
>   <xsl:output method="html"/>
>   <xsl:preserve-space elements="*"/>
>     <xsl:template match="/">
>       <html><xsl:apply-templates/></html>
>     </xsl:template>
>  </xsl:transform>
>
>  Does anyone know of a work-around for this common problem?


======================================================================
Wendell Piez                            mailto:wapiez@mulberrytech.com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
   Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

