This is the mail archive of the xsl-list@mulberrytech.com mailing list.



RE: Truncating output of a node


Hi Jim,

At 10:35 AM 4/19/01, Mike wrote:
> > I am trying to output the first n sentences of a node. I have tried
> > using for-each with a conditional to stop output but have had no luck.
> >
> > Given the following XML fragment, what is the best way to output only
> > the first n sentences? Note that the node has both text and child nodes.
> >
>Write a recursive template that takes the text and n as parameters; in this
>template, if n>0, output the first sentence (using substring-before), then
>make a recursive call on the same template, passing the remaining text
>(using substring-after) and n-1 as the parameters.

This will work assuming you have identified some dependable way to delimit 
sentences in your data. You might assume that the presence of a character 
"." will indicate the end of a sentence. This is fine ... but what about 
sentences that happen to contain the string "...", or that end with a 
question mark? (Or what about sentences that appear with other kinds of 
punctuation?!)

Identifying what counts as a "sentence" is a genuinely difficult problem
in text processing, and not an easily tractable one. That is why
applications that process text sentence by sentence are much easier to
build if you have markup embedded that tells you what's a sentence, and
what's not. Your problem would be fairly trivial in XSLT if your input
were something like:

<summary>
  <s>It is best to start a new <span class="highlight">message</span> for
  a new thread.</s>
  <s>Do not start a new thread by replying to an unrelated
  <span class="highlight">message</span> and just changing the subject
  line, since the header of your <span class="highlight">message</span>
  will contain references to the previous
  <span class="highlight">message</span> and your new
  <span class="highlight">message</span> will appear in the archive as one
  of the replies to the original
  <span class="highlight">message</span>.</s>
</summary>
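
With input marked up like that, picking the first n sentences comes down to
a positional test. Here is a minimal sketch in XSLT 1.0, untested against
your real data. The <summary> and <s> names come from the example above;
the stylesheet parameter "n" is my own invention for illustration.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <!-- Hypothetical parameter: number of sentences to keep -->
  <xsl:param name="n" select="1"/>

  <xsl:template match="summary">
    <xsl:copy>
      <!-- Copy the first n <s> children whole, embedded <span>s included -->
      <xsl:copy-of select="s[position() &lt;= $n]"/>
    </xsl:copy>
  </xsl:template>

</xsl:stylesheet>
```

Because each sentence is copied as a unit, the embedded <span> elements
come along for free and nothing has to guess where a sentence ends.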

If you don't have the option of changing the way your input is structured, 
Mike's solution of processing text content recursively is the only option 
-- and might be "good enough for government work" (as is sometimes said). 
But the presence of element nodes in mixed content (such as your embedded 
<span> elements) makes this much harder, unless you can just throw them 
away. In theory I suppose it could be done, but the code is going to be 
pretty ugly, especially if you allow for the possibility that a "sentence" 
could end *inside* one of the <span> children....

Any intrepid XSLT coders want to tackle that?
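
For the simpler case, where you can throw the <span> elements away, the
recursive approach Mike describes might look something like the sketch
below. It is only a sketch, under some loud assumptions: the text lives in
a <summary> element, every "." ends a sentence (so "...", "?" and "!" will
all trip it up, as noted above), and the names "first-sentences", "text",
"count" and "n" are mine, not anything from your stylesheet.

```xslt
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="text"/>

  <!-- Hypothetical parameter: number of sentences to keep -->
  <xsl:param name="n" select="2"/>

  <xsl:template match="summary">
    <xsl:call-template name="first-sentences">
      <!-- Taking the string value flattens the mixed content:
           the markup of any child elements is discarded here -->
      <xsl:with-param name="text" select="normalize-space(.)"/>
      <xsl:with-param name="count" select="$n"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Emit up to $count sentences from $text, assuming "." ends each one -->
  <xsl:template name="first-sentences">
    <xsl:param name="text"/>
    <xsl:param name="count"/>
    <xsl:if test="$count &gt; 0 and string-length($text) &gt; 0">
      <xsl:choose>
        <xsl:when test="contains($text, '.')">
          <!-- Output the first sentence, restoring its delimiter -->
          <xsl:value-of select="concat(substring-before($text, '.'), '.')"/>
          <!-- Recurse on the remainder with one fewer sentence to go -->
          <xsl:call-template name="first-sentences">
            <xsl:with-param name="text"
                            select="substring-after($text, '.')"/>
            <xsl:with-param name="count" select="$count - 1"/>
          </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
          <!-- No delimiter left: treat what remains as the last sentence -->
          <xsl:value-of select="$text"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>
```

Note that this does nothing for the hard case: once the string value is
taken, the <span> elements are gone, so a sentence ending inside one is
truncated as plain text like any other.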

Good luck,
Wendell

======================================================================
Wendell Piez                            mailto:wapiez@mulberrytech.com
Mulberry Technologies, Inc.                http://www.mulberrytech.com
17 West Jefferson Street                    Direct Phone: 301/315-9635
Suite 207                                          Phone: 301/315-9631
Rockville, MD  20850                                 Fax: 301/315-8285
----------------------------------------------------------------------
   Mulberry Technologies: A Consultancy Specializing in SGML and XML
======================================================================


 XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list

