
Re: [docbook-apps] Dynamic web serving of large Docbook


[Sorry, I accidentally hit Send before I was finished in the previous
message]

Hi Frans,
I've been dealing with this issue of modular doc processing for some time
during my publishing tools career.  I agree that you should not have to
divide up your source files just to facilitate modular processing.  Without
more details on your requirements I can't give you a complete solution, but
I can give you my thoughts on the matter.

It is quite possible to select content for processing without having to load
an entire document. Your server could construct a skeleton document that has
a single XInclude that references the content you want to process.  The
DocBook XSL stylesheets will handle most of the hierarchical elements as a
document root element, although you should check the 'root.elements'
variable in fo/docbook.xsl to make sure all the ones you want to process
will generate a page sequence in FO output.
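
For example, the skeleton could be as minimal as this (a rough sketch; the
filename book.xml and the id ch.config are made up):

  <?xml version="1.0"?>
  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
              href="book.xml"
              xpointer="element(ch.config)"/>

After XInclude processing the selected chapter becomes the document root,
and the result can be fed straight to the stylesheets.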

One difficulty you will encounter is simulating the chunking behavior of
chunk.xsl.  When you process a chapter as chunks, it will generate one chunk
for the chapter content before the first section, and then chunk sections
according to the parameter settings.  I'm not sure an XInclude xpointer can
select just the first part of a chapter.

Of course, when you select a chapter and process it by itself, it will
always be numbered 1.  You can get around that problem by generating an
olink database for your whole document, which will include the number
information for all numbered elements.  Your stylesheet customization would
have to change the templates in label.markup mode to look up the number in
the olink database instead of counting chapters in the document.  The same
would apply to number labels on figures, tables, sections, etc.  In that way
you are simulating the context of the selected content within the document.
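
Here is a rough, untested sketch of that kind of customization for
chapters.  It assumes the target data is embedded in the database document
named by the target.database.document parameter, and that each div entry
there carries the element, targetptr, and number attributes, as current
target data documents do:

  <xsl:template match="chapter" mode="label.markup">
    <!-- Look up this chapter's number in the olink target database
         instead of counting preceding chapters in the document. -->
    <xsl:variable name="targets"
                  select="document($target.database.document)"/>
    <xsl:value-of select="$targets//div[@element = 'chapter']
                                   [@targetptr = current()/@id]/@number"/>
  </xsl:template>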

You could also use the sequencing in the olink database to compute the Next,
Previous, and Up navigational links for a chunk.
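
Again only a sketch (untested, same database assumptions as above, and
$chunk.id is an invented parameter naming the chunk being served), the idea
being that document order in the database stands in for document order in
the book:

  <xsl:variable name="self"
                select="$targets//div[@targetptr = $chunk.id]"/>
  <!-- Simplified: a real version would also have to descend into
       child sections when choosing the next chunk. -->
  <xsl:variable name="prev" select="$self/preceding::div[1]"/>
  <xsl:variable name="next" select="$self/following::div[1]"/>
  <xsl:variable name="up"   select="$self/parent::div"/>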

For cross references, you could replace your link and xref elements with
olinks as Mike suggested.  That has other advantages, such as making it
possible to modularize your source files as you see fit.  You can unit test
your content with full validation and cross references resolved if you use
olinks.  If you don't use olinks, then you would have to change the
templates for xref and link to simulate the behavior of olinks.  That's
because they currently rely on finding the referenced id within the
currently loaded document.
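
In the source, the swap looks like this (the targetdoc and targetptr values
here are invented):

  <olink targetdoc="userguide" targetptr="ch.config"/>

in place of <xref linkend="ch.config"/>.  The stylesheets resolve the olink
through the target database, so the referenced content does not have to be
in the currently loaded document.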

But even if you use olinks, cross references will still present dependency
issues.  A cross reference that generates content will always need to have
the latest version of that content.  If it is in some other file, then you
have two choices:  open the other file and parse it enough to get the
information you need, or get the information from the olink database.
Opening the file will always get you the very latest information, but at a
significant processing cost if you do very many of them.  Using the olink
database means you are getting the information as of the last update of the
database.

If you are using a source control system, you might be able to set up a
checkin script so that whenever a source file is checked in, the olink
database is automatically updated.  If you make only checked-in content
available to the server, then the two will always be in sync.  Currently the
template that generates the database does the whole document, but you could
maybe customize that behavior to only update the data for a single source
file module to speed up the process.
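
With the stock stylesheets, regenerating the database can be a single step
in such a hook.  A sketch (the paths are invented, but collect.xref.targets
and targets.filename are the real parameters):

  #!/bin/sh
  # Hypothetical post-commit hook: rebuild the olink target data.
  # "only" means: collect target data without generating any output.
  xsltproc --xinclude \
      --stringparam collect.xref.targets "only" \
      --stringparam targets.filename "/var/olinkdb/book.db" \
      /usr/share/xml/docbook-xsl/html/docbook.xsl book.xml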

But one limitation you may run into, even if you do all of the above, is
the large size of the DocBook XSL stylesheets.  They take quite a bit of
time just to load, so that may be the limiting factor in response time.
Even if you have a two-word document, the response may be too slow. You
might do some testing in this area.
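
A quick way to get a feel for the fixed overhead (xsltproc shown here, but
any processor will do; the stylesheet path is invented):

  $ cat tiny.xml
  <?xml version="1.0"?>
  <article><para>Hello.</para></article>
  $ time xsltproc --noout /usr/share/xml/docbook-xsl/html/docbook.xsl tiny.xml

If the elapsed time stays large even for a document this small, stylesheet
loading is the bottleneck rather than the transformation itself.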

The stylesheets are big because they are versatile.  They try to be
everything for everyone, so there is a lot of code for conditional
processing of variables, and for handling customization entry points.  Just
look at all the templates for titlepages, for example.  Once you have
settled on a design, though, the stylesheets could be *much* smaller and
load more quickly.  But I don't know of a way to "compile" the stylesheets
into a fixed compact form, sort of like a flattened DTD.  That would be a
very interesting project.  8^)

Another possibility is to create a server module that loads the stylesheet
and keeps it in memory, ready to handle requests as they come in.

I hope some of this helps.

Bob Stayton
Sagehill Enterprises
DocBook Consulting
bobs@sagehill.net


----- Original Message ----- 
From: "Bob Stayton" <bobs@sagehill.net>
To: "Frans Englich" <frans.englich@telia.com>;
<docbook-apps@lists.oasis-open.org>
Cc: "Michael Smith" <smith@xml-doc.org>
Sent: Monday, October 18, 2004 10:35 AM
Subject: Re: [docbook-apps] Dynamic web serving of large Docbook


> Hi Frans,
> I've been dealing with this issue of modular doc processing for some time
> during my publishing tools career.  I agree that you should not have to
> divide up your source files just to facilitate modular processing.  Without
> more details on your requirements I can't give you a complete solution, but
> I can give you my thoughts on the matter.
>
> It is quite possible to select content for processing without having to
> load an entire document. Your server could construct a skeleton document
> that has a single XInclude that references the content you want to
> process.  The DocBook XSL stylesheets will handle most of the hierarchical
> elements as a document root element, although you should check the
> 'root.elements' variable in fo/docbook.xsl to make sure all the ones you
> want to process will generate a page sequence in FO output.
>
> One difficulty you will encounter is simulating the chunking behavior of
> chunk.xsl.  When you process a chapter as chunks, it will generate one
> chunk for the chapter content before the first section, and then chunk
> sections according to the parameter settings.  I'm not sure an XInclude
> xpointer can select just the first part of a chapter.
>
> Of course, when you select a chapter and process it by itself, it will
> always be numbered 1.  You can get around that problem by generating an
> olink database for your whole document, which will include the number
> information for all numbered elements.  Your stylesheet customization
> would have to change the templates in label.markup mode to look up the
> number in the olink database instead of counting chapters in the
> document.  The same would apply to number labels on figures, tables,
> sections, etc.  In that way you are simulating the context of the
> selected content within the document.
>
> You could also use the sequencing in the olink database to compute the
> Next, Previous, and Up navigational links for a chunk.
>
> For cross refere
>
>
>
>
> Bob Stayton
> Sagehill Enterprises
> DocBook Consulting
> bobs@sagehill.net
>
>
> ----- Original Message ----- 
> From: "Frans Englich" <frans.englich@telia.com>
> To: <docbook-apps@lists.oasis-open.org>
> Cc: "Michael Smith" <smith@xml-doc.org>
> Sent: Saturday, October 16, 2004 6:11 PM
> Subject: Re: [docbook-apps] Dynamic web serving of large Docbook
>
>
> >
> > Michael, thanks for your extensive replies. I have been looking into
> > this relatively extensively, and it sure is tricky. Docbook is a very
> > attractive format to have beneath, and being able to swiftly use it in
> > large web projects would make it even more powerful. I think this
> > applies to many people, so a clean, thorough solution which is pushed
> > upstream (into a CMS or the stylesheets) would benefit many people.
> >
> > It should be noted that financing and proprietary solutions are not
> > options for me, for several reasons; one is that this is for an open
> > source project. Also, sorry about the late reply :|
> >
> > On Wednesday 13 October 2004 13:29, Michael Smith wrote:
> > > Frans,
> > >
> > > Reading through your message a little more...
> > >
> > > [...]
> > >
> > > > The perfect solution, AFAICT, would be a dynamic, cached generation.
> > > > When a certain section is requested, only that part is transformed,
> > > > and cached for future deliveries. It sounds nice, and sounds like it
> > > > would be fast.
> > > >
> > > > I looked at Cocoon (cocoon.apache.org) to help me with this, and
> > > > it does many things well; it caches XSLT sheets, the source files,
> > > > and even CIncludes (basically the same as XIncludes).
> > > >
> > > > However, AFAICT, Docbook doesn't make it easy:
> > > >
> > > > * If one section is to be transformed, the sheets must parse /all/
> > > > sources, in order to resolve references and so forth. There's no
> > > > way to work around this, right?
> > >
> > > It seems like your main requirement as far as HTML output is to be
> > > able to preserve stable cross-references among your rendered
> > > pages. And you would like to be able to dynamically regenerate
> > > just a certain HTML page without regenerating every HTML page that
> > > it needs to cross-reference.
> > >
> > > And, if I understand you right, your requirement for PDF output is
> > > to be able to generate a PDF file with the same content as each
> > > HTML chunk, without regenerating the whole set/book it belongs to.
> > > (At least that's what I take the mention of "chunked PDF" in your
> > > original message to mean.)
> >
> > Yes, correct interpretation.
> >
> > >
> > > (But -- this is just an incidental question -- in the case of the
> > > PDF chunks, you're not able to preserve cross-references between
> > > individual PDF files, right? There's no easy way to do that. Not
> > > that I know of at least.)
> >
> > Nope, the PDF would simply contain the content of the viewed page,
> > without any web specifics such as navigation; it would be used for
> > printing. Example (upper right corner):
> > http://xml.apache.org/
> >
> > >
> > > If the above is all an accurate description of your requirements,
> > > then I think a partial solution is
> > >
> > >   - set up the relationship between your source files and HTML
> > >     output such that the DocBook XML source for your parts is
> > >     stored as separate physical files that correspond one-to-one
> > >     with the HTML files in your chunked output
> > >
> > >   - use olinks for cross-references (instead of using xref or link)
> > >
> > >       http://www.sagehill.net/docbookxsl/Olinking.html
> > >
> > > If you were to do those two things, then maybe:
> > >
> > >  1. You could do an initial "transform everything" step of your
> > >     set/book file, with the individual XML files brought together
> > >     using XInclude or entities; that would generate your TOC &
> > >     index and one big PDF file for the whole set/book
> > >
> > >  2. You would then need to generate a target data file for each
> > >     of your individual XML files, using a unique filename value for
> > >     the targets.filename parameter for each one, and then
> > >     regenerate the HTML page for each individual XML file, and
> > >     also the corresponding PDF output file.
> > >
> > >  3. After doing that initial setup once, then each time an
> > >     individual part is requested (HTML page or individual PDF
> > >     file), you could regenerate just that from its corresponding
> > >     XML source file.
> > >
> > >     The cross-references in your HTML output will then be
> > >     preserved (as long as the relationship between files hasn't
> > >     changed and you use the target.database.document and
> > >     current.docid parameters when calling your XSLT engine).
> > >
> > > I _think_ that all would work. But Bob Stayton would know best.
> > > (He's the one who developed the olink implementation in the
> > > DocBook XSL stylesheets.)
> > >
> > > A limitation of it all is that, if a writer adds a new section to
> > > a document, you're still going to need to re-generate the whole
> > > set/book to get that new section to show up in the master TOC.
> > > Same thing if a writer adds an index marker, in order to get that
> > > marker to show up in the index.
> > >
> > > But one way to deal with that is, you could just do step 3 above
> > > on-demand, and have steps 1 and 2 re-run, via a cron job or
> > > equivalent, at some regular interval -- once a day or once an hour
> > > or at whatever the minimum interval is that you figure would be
> > > appropriate given how often writers are likely to add new sections
> > > or index markers.
> > >
> > > And during that interval, of course there would be some
> > > possibility of an end user not being aware of a certain newly
> > > added section because the TOC hasn't been regenerated yet, and
> > > similarly, not finding anything about that section in the index
> > > because it hasn't been regenerated yet.
> > >
> > > > * Cocoon specific: It cannot cache "a part" of a transformation,
> > > > which means the point above can't be worked around. Right? This
> > > > would otherwise mean the transformation of all non-changed sources
> > > > would be cached.
> > >
> > > Caching is something that you could do with or without Cocoon, and
> > > something that's entirely separate from the transformation phase. You
> > > wouldn't necessarily need Cocoon or anything Cocoon-like if you
> > > used the solution above (and if it would actually work as I
> > > think). And using Cocoon just to handle caching would probably be
> > > overkill. I think there are probably some lighter-weight ways to
> > > handle caching.
> > >
> > > Anyway, I think the solution I described would be some work to set
> > > up -- but you could hire some outside expertise to help you do
> > > that (Bob Stayton comes to mind for some reason...).
> >
> >
> > I looked at the solution of using an olink database, but perhaps I
> > discarded it too quickly. Perhaps I'm setting the threshold too high
> > (I am..), but I find it hackish; it isn't transparent, and most of all
> > it disturbs the creation of content: one can't use standard Docbook,
> > and authors have to bother with technical problems. It's messy.
> >
> > One thing to remember is that the source document doesn't have to be
> > split in proportion to the pieces that are rendered; it only has to be
> > kept in pieces small enough that performance is acceptable (it's a
> > small detail, but from an editing perspective it can be practical to
> > work with a document larger than what is to be viewed), /assuming/ the
> > CMS (or whatever content generation mechanism is used) can map the
> > generated output to a certain part of the source file (like XInclude).
> >
> > To recapitulate, the problem is the initial transformation of the
> > requested content -- that the XSLs must traverse "all" the sources --
> > and that performance hit is the same regardless of whether it's PDF or
> > HTML, and of whether the requested content is small. Once it's
> > generated all is cool, since it's cached for later deliveries. That's
> > the key problem -- everything depends on it.
> >
> > Here are some possible solutions:
> >
> >
> > 1. The olink way you described. It works, but it's complex,
> > restrictive, and intrusive on content creation.
> >
> > 2. True static content (cronned). Not intrusive on content creation,
> > but it's perhaps too simple (too dumb), and it can actually become a
> > performance issue too; generating PDFs for each section -- that's a
> > lot of megabytes to write to disk each time the cron job runs.
> >
> > 3. To actually go for the long transformation which we try to avoid;
> > that is, all the sources are transformed for each requested section.
> > First of all, this long transformation happens only for the first
> > request -- the first user -- and then the result is cached. How long
> > does it take then? Cocoon caches the includes and the files, so when
> > the cache becomes invalidated only one source file is reloaded (the
> > one which has changed) while all the others and the Docbook XSLs
> > (they're huge) are kept in memory (as DOM, I presume) -- perhaps
> > that's enough to reduce that first transformation to reasonable
> > speeds. I'm only speculating; no doubt it's the transformation that
> > takes the longest time (perhaps someone knows if I'm unrealistic, but
> > otherwise real testing gives the definite answer). If this worked, it
> > would be the best solution.
> >
> > These approaches can also be combined; the HTML output could be
> > static (cron), while PDFs are dynamic. In this way the performance
> > trouble of 2) is gone (writing tons of PDF files), and perhaps the
> > delay is OK for PDF. From my shallow reading about Forrest, I have
> > understood it's good at combining dynamic serving with static
> > generation; perhaps it can be a way to pull it all together under one
> > technical framework.
> >
> >
> > ***
> >
> > Another trouble with flexible website integration, or at least
> > something which requires action, is navigation. As I see it, Docbook
> > is tricky on that front -- the XSLs are quite focused on static
> > content generation, the chunked output for example. Since dynamic
> > generation basically takes a node and transforms it with docbook.xsl,
> > navigation must be hand written, for example if one wants the TOC as
> > a sidebar that changes depending on what is viewed (flexible
> > integration). I bet this is relatively easy to do, considering how
> > the XSLs are written, and it could be good to have in a generic way
> > somewhere (Forrest, the Docbook XSLs, perhaps..).
> >
> >
> >
> > Yes, speculations. When I write something, have actual numbers or a
> > proof of concept, or know what I'm actually talking about, I will
> > definitely share it on this list.
> >
> > Hm.. That's as far as I see.
> >
> >
> > Cheers,
> >
> > Frans
> >
> >
> >
>


