[Zope-xml] DocBook processing HowTo (long)

Tue Sep 10 12:28:44 EDT 2002

Hi all

I'm writing up what we're trying to accomplish here. I thought I'd post
it to the list, thinking that perhaps other people might find it
useful.

Actually, the whole thing is really a plea for correction and pointers!

I start out talking about DocBook generally and commandline processing
to get HTML and PDF, but then I get to Zope and XMLTransform. The
document ends a bit bluntly there, because I haven't been able to make
XMLTransform do anything useful with DocBook:
<xsl:include href="included.xsl" /> makes my Zope catatonic.

Here you go:

Publishing framework

     We get Word documents from BIS. We want to get them into DocBook for
     purposes of storage and publication.

      - DocBook is good for storage because it can easily be indexed, and
        earlier versions can be processed and brought up to date with the
        appropriate versions of the stylesheets.

      - DocBook is good for publication because it can be transformed
        flexibly. The distribution comes with XSL stylesheets to produce
        various flavours of HTML, and XSL-FO (Formatting Objects). FO is
        meant to be used for generating print output.

     I'll discuss these publishing steps:

      - Converting Word docs to DocBook (autoconversion, then editorial
        markup of the converted DocBook)

      - Processing DocBook from the commandline

      - Processing DocBook within Zope

Converting Word docs to DocBook

   Autoconversion step

     We receive Word doc files for publication. Currently, BIS do not use
     any stylesheet or consistency guidelines, so the Word docs cannot be
     autoconverted very successfully. The cleanest route that I currently
     know is via the wvware suite::

         $ wvHtml "Input Word document.doc" wvHtml_output.html && \
           tidy -indent -clean -asxml wvHtml_output.html > tidy_output.html

     The 'tidy' options: '-clean' turns this (baarf)::

         <div name="Normal" align="left"
         style=" padding: 0.00mm 0.00mm 0.00mm 0.00mm; ">
           <p style="text-indent: 0.00mm; text-align: left; line-height:
                     4.166667mm; color: Black; background-color: White;
                     ">
           <font color="DarkBlue"><b>Local Government
           Revenue</b></font></p>
         </div>

     into this (phew)::

         <div name="Normal" class="c2">
           <p class="c1">Local Government Revenue</p>
         </div>

     with corresponding CSS classes in a 'style' block in the 'head'::

         div.c2 {padding: 0.00mm 0.00mm 0.00mm 0.00mm; text-align: left}
         p.c1 {background-color: White; color: Black; font-weight: bold;
             line-height: 4.166667mm; text-align: left; text-indent:
             0.00mm}

     The '-asxml' option ensures that there is no whitespace between
     attribute names and values, and that singleton tags are closed, as
     this 'diff' illustrates::

         - <!--Section Begins--><br>
         + <!--Section Begins--><br />
         (...)
         - <td bgcolor="White" width="100.00%" rowspan="1" colspan=
         - "2">
         + <td bgcolor="White" width="100.00%" rowspan="1"
         + colspan="2">
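     A quick way to confirm that the '-asxml' output is usable as XSL
     input is to feed it to any XML parser. Here is a minimal sketch
     using Python's standard library (not part of the toolchain
     described here). The two sample strings are modelled on the diff
     above and wrapped in a 'div' so that each has a single root. Real
     tidy output may carry a DOCTYPE and named entities that this parser
     will not resolve, so treat it as a rough check only.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Fragments modelled on the diff above, wrapped in a single root element.
before = "<div><!--Section Begins--><br></div>"    # unclosed <br>
after = "<div><!--Section Begins--><br /></div>"   # closed singleton tag

if __name__ == "__main__":
    print("before:", is_well_formed(before))
    print("after:", is_well_formed(after))
```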

     Therefore, the output of tidy ('tidy_output.html' above) can be
     transformed by an XSL stylesheet, which should be able to do the
     following (in increasing order of difficulty, from my current
     viewpoint):

         - simply replace HTML tags with the corresponding DocBook tags::

             <xsl:template match="li">
               <listitem>
                 <xsl:apply-templates/>
               </listitem>
             </xsl:template>

         - Remove empty elements (block elements with no PCDATA in their
           children), such as::

             <div name="Normal" class="c2">
               <p class="c7"></p>
             </div>

         - Insert required attributes, derived from HTML, such as the
           number of columns in a table ('cols').
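     To make the last two items concrete, here is a rough sketch of the
     logic in Python's standard library rather than in XSL (a swapped-in
     technique, just to pin down what the stylesheet would have to
     compute). The names 'BLOCK_TAGS', 'prune_empty_blocks' and
     'count_cols' are made up for illustration. It assumes namespace-free
     markup and no nested tables, which real tidy output may not satisfy.

```python
import xml.etree.ElementTree as ET

# Assumption: which elements count as "block" elements for pruning.
BLOCK_TAGS = {"div", "p"}


def has_text(elem):
    """True if elem or any descendant carries non-whitespace character data."""
    return any(chunk.strip() for chunk in elem.itertext())


def prune_empty_blocks(parent):
    """Recursively remove block children of parent that contain no text.

    Limitation: text trailing a removed element (its 'tail') is dropped too.
    """
    for child in list(parent):
        if child.tag in BLOCK_TAGS and not has_text(child):
            parent.remove(child)
        else:
            prune_empty_blocks(child)


def count_cols(table):
    """Width of the widest row, honouring 'colspan' (assumes no nested tables)."""
    widths = [
        sum(int(cell.get("colspan", "1")) for cell in row
            if cell.tag in ("td", "th"))
        for row in table.iter("tr")
    ]
    return max(widths, default=0)


sample = ET.fromstring(
    "<body>"
    "<div name='Normal' class='c2'><p class='c7'></p></div>"
    "<div name='Normal' class='c2'><p class='c1'>Local Government Revenue</p></div>"
    "<table><tr><td colspan='2'>a</td></tr>"
    "<tr><td>b</td><td>c</td></tr></table>"
    "</body>"
)
prune_empty_blocks(sample)
remaining_divs = len(sample.findall("div"))
cols = count_cols(sample.find("table"))

if __name__ == "__main__":
    print("divs left:", remaining_divs)
    print("cols:", cols)
```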

     If we have few documents to convert, we can stop autoprocessing
     after the 'tidy' step. If we have many over a long period, it will
     become worthwhile to interpose stylesheets for more massaging.

     However it's done, the output of this transformation is the input
     for the next step: editorial markup.

   Editorial markup

     There is more information in the Word document than survives the
     conversion. This information is either coded only visually and
     contextually in the Word doc (a personal name on the title page,
     usually right-aligned, can be assumed to be the author), or it is
     inconsistently marked up (headers are sometimes 'hN' elements, but
     sometimes only plain paragraphs in large text). This information
     needs a human to recognize it and mark it up. Here are the kinds of
     semantic markup that we'll want to do:

         - Document metadata: title, author, publication date, revision
           history, contact information, abstract, ...

         - Inline metadata: Table titles and figure captions, quotations
           (to distinguish them from emphasis or "irony" quotes), the
           titles of sources, citations of references, foreign phrases,
           acronyms, crossreferences (eg. furnishing of 'xreflabel') ...

         - Document parts: most have bibliographies, some have glossaries
           or lists of acronyms.

           Bibliographies can become very complex. If we mark them up
           lightly, we don't stand to get that much back out of them. If
           we mark them up carefully, we could have a sitewide citations
           database. This would have advantages such as: consistency
           across documents; editors/visitors would be able to see which
           sources/authors are cited most often by BIS; if we maintain a
           central bibliography, all documents that cite a source benefit
           if it is updated (this might be particularly useful in the
           case of online sources with URLs that change). See
           "Refdb":http://refdb.sourceforge.net/

           A sitewide glossary of acronyms/institutions would also be
           nice to have.

     Editing environment: the idea is that we outsource the markup, so
     Vim isn't necessarily the most obvious choice of editor. For those
     who live in it, Vim offers two aids:

      - a DocBook mode, which alerts you to non-DocBook tags or
        attributes through syntax highlighting. However, Vim's syntax
        file isn't up to date, so it misrecognizes eg. 'articleinfo',
        introduced in DocBook v.4 (we're currently at 4.2).

      - 'matchit.vim', which matches start and end tags. It doesn't
        realize that 'docbk' is also an 'xml' mode, though, so you have
        to go to XML mode or hack 'matchit.vim'.

     I haven't found a nice free XML-aware editor yet. Here are some
     lists:

      - Dave Pawson's: http://www.dpawson.co.uk/docbook/tools.html#d6e237

      - Gary Lawrence Murphy's: http://teledyn.com/help/XML/?Editors

     Perhaps "jaxe":http://jaxe.sourceforge.net/Jaxe_en.html

Processing DocBook from the commandline

     For commandline XML processing, I have looked at 4Suite, PyXML,
     libxml2 and libxslt.

     **libxml2 and libxslt** are the XML-processing components of the
     Gnome project, coded in C by Daniel Veillard. They seem to be quite
     solid and fast. They are meant to be used firstly as libraries in
     the Gnome environment, but they come with some commandline tools,
     and Python bindings are available.

   HTML

     To process a DocBook XML file to HTML::

         $ xsltproc /.../docbook/xsl-stylesheets-1.52.2/html/docbook.xsl \
           /.../docbook/paper-input.xml > /.../docbook/paper-output.html

     This outputs an HTML file. To use this in Zope, you'd want to strip
     off the 'html' and 'body' tags, and the 'head' section.
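     The stripping step can be sketched crudely in Python. This is only
     an illustration: 'body_fragment' is a made-up helper, and it uses a
     regular expression rather than a parser, because the stylesheet's
     HTML output is not guaranteed to be well-formed XML. It assumes a
     single 'body' element in the page.

```python
import re

# Everything between the opening <body ...> tag and the last </body>.
BODY_RE = re.compile(r"<body\b[^>]*>(.*)</body\s*>", re.IGNORECASE | re.DOTALL)


def body_fragment(html_text):
    """Return the markup inside 'body', or the input unchanged if none is found."""
    match = BODY_RE.search(html_text)
    if match is None:
        return html_text
    return match.group(1).strip()


# Hypothetical miniature of a generated page.
page = """<html><head><title>Paper</title></head>
<body bgcolor="white"><div class="article"><p>Hello</p></div></body></html>"""

fragment = body_fragment(page)

if __name__ == "__main__":
    print(fragment)
```

     The resulting fragment could then be dropped into whatever page
     template wraps it on the Zope side.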

   PDF

     To process a DocBook XML file to PDF, the main options seem to be
     'passivetex' and the Apache project's 'fop'. Both are a bit of a
     PITA. 'passivetex' is more mature, and as it's written in TeX, the
     output document is beautifully typeset. However, it's very sparsely
     documented. At the moment, you have to
     "patch the release":http://sourceforge.net/tracker/index.php?func=detail&aid=593600&group_id=21935&atid=373747
     to make it work with recent DocBook XSL. Personally, I've never had
     the time to make TeX do anything it doesn't do by default :(

     'fop' is written in Java, and the output PDF isn't really gorgeous.
     I'm a Java illiterate, so I don't know the Right Way to install
     this. If anyone can advise, please do. I got it to work as follows:
     first get a '.fo' file from 'xsltproc', and then run 'fop' on
     that::

         $ xsltproc /.../docbook/xsl-stylesheets-1.52.2/fo/docbook.xsl \
           /.../docbook/paper-input.xml > /.../docbook/paper-output.fo
         $ cd /.../fop-0.20.4
         $ export JAVA_HOME=/opt/blackdown-jdk-1.3.1/
         $ ./build.sh
         $ ./fop.sh /.../docbook/paper-output.fo /.../docbook/paper-output.pdf
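     If this needs to be repeated for many documents, the two steps could
     be wrapped in a small script. A sketch, assuming that 'fop' has
     already been built, that JAVA_HOME is set, and that the script runs
     from the fop directory. The function names and the 'fop_script'
     parameter are made up for illustration. It uses xsltproc's '-o'
     option instead of shell redirection.

```python
import subprocess


def pdf_commands(fo_stylesheet, source_xml, fo_path, pdf_path,
                 fop_script="./fop.sh"):
    """Return the two argv lists: DocBook -> FO, then FO -> PDF."""
    return [
        ["xsltproc", "-o", fo_path, fo_stylesheet, source_xml],
        [fop_script, fo_path, pdf_path],
    ]


def build_pdf(fo_stylesheet, source_xml, fo_path, pdf_path,
              fop_script="./fop.sh"):
    """Run both steps in order; raises CalledProcessError on the first failure."""
    for argv in pdf_commands(fo_stylesheet, source_xml, fo_path, pdf_path,
                             fop_script):
        subprocess.run(argv, check=True)


# Only builds the command lines here; nothing is executed.
commands = pdf_commands("fo/docbook.xsl", "paper-input.xml",
                        "paper-output.fo", "paper-output.pdf")

if __name__ == "__main__":
    for argv in commands:
        print(" ".join(argv))
```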

Processing DocBook within Zope

   Prerequisites for Python

     Setup is not hassle-free. AFAICT, you need to install either Gnome's
     libxml2 and libxslt, or PyXML and 4Suite 0.11, or 4Suite 0.12 alone
     (1). I think it's fine to install all of them, but I'd like
     reassurance on this point.

     (1) From the 4Suite distro::

         Prior to 4Suite-0.12.0 release, PyXML was required to use
         4Suite. 4Suite now no longer requires PyXML, as long as expat
         and pyexpat are installed.

         If you need DTD validation, or some advanced DOM processing
         tools, you should install PyXML 0.7 or higher

     The latest 4Suite release is 0.11.1, which works with PyXML 0.6.6
     *exactly*. Lots has changed in PyXML since 0.6.6 (eg. namespace
     processing). To use PyXML 0.7.1 with 4Suite, you need to get a CVS
     snapshot of the 4Suite 0.12 branch. See
     http://4suite.org/docs/4SuiteCVS.xml for instructions.

     Simplest by far seems to be to just install a recent snapshot, if
     you want 4Suite.

    libxml2 and libxslt

     I'm on Gentoo Linux, so 'emerge libxml2 libxslt' installs the
     libraries for the Python reported by 'python -V', eg. Python 2.2.1.
     Zope doesn't run on this Python yet. Check the INSTANCE_HOME
     'start' files to ascertain which Zope runs which Python, and run
     'python2.1 -V' to see exactly which version that is, eg.::

         jean@blommie jean $ python2.1 -V
         Python 2.1.1

     The libxslt Python bindings need to be installed for *this* Python.
     The Zope 2.5.1 binary distro ships with Python 2.1.3. It doesn't
     look like libxml2/libxslt or 4Suite have hard dependencies on the
     Python minor version number, but I wouldn't swear to that.

     To install the libxslt bindings for other Python versions, I got
     hold of 'libxml2-python-2.4.24.tar.gz' and::

           $ tar xzvf libxml2-python-2.4.24.tar.gz
           $ cd libxml2-python-2.4.24/
           $ python2.1 setup.py install

     Anyway, by now we should have fulfilled XML Transform's
     prerequisites.

   XML Transform for Zope

     The most flexible and robust XSL tool for Zope seems to be Ariel
     Partners' XML Transform. It differs from ParsedXML in that it
     does not focus on exposing the DOM of XML objects to scripts etc.
     inside Zope. ParsedXML would have been compatible with XMLTransform
     if calling a ParsedXML object returned the raw XML representation of
     the object. It's a small change; maybe it's happened by now.

     In contrast, XMLTransform pairs up an XML source and XSL
     stylesheets::

         XMLSource -> XMLTransform <- XSLStylesheet
                            |
                            V
                     Output (can be HTML, XML, XSL, PDF, ...)

     The source document can be DTMLDocument, DTMLMethod, File,
     ExternalFile, ... anything that returns a valid string of XML when
     called (this means that XMLTransforms can be chained, the output of
     one forming the input of the other). Similarly, the XSL stylesheet
     can be anything that returns XSL. However, the XMLTransform can be
     configured to parse the stylesheet as either DTML or ZPT -- making
     the stylesheet dynamic -- before applying it to the XML.
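     The "anything that returns a string when called" idea, and why it
     makes chaining possible, can be shown with a toy model. This is not
     XMLTransform's actual code or API: 'TransformSketch' and
     'toy_processor' are invented for illustration, and the "stylesheet"
     here is just a list of 'old=new' text replacements standing in for
     a real XSL processor.

```python
class TransformSketch:
    """Pairs a source and a stylesheet; calling it returns the processed string."""

    def __init__(self, source, stylesheet, processor):
        self.source = source          # callable returning XML text
        self.stylesheet = stylesheet  # callable returning stylesheet text
        self.processor = processor    # function(xml_text, stylesheet_text)

    def __call__(self):
        return self.processor(self.source(), self.stylesheet())


def toy_processor(xml_text, rules):
    """Stand-in for an XSLT engine: apply 'old=new' replacements, one per line."""
    for line in rules.splitlines():
        old, new = line.split("=", 1)
        xml_text = xml_text.replace(old, new)
    return xml_text


def html_source():
    return "<li>item</li>"


# First transform: HTML-ish tag to DocBook-ish tag.
to_docbook = TransformSketch(html_source, lambda: "li>=listitem>", toy_processor)
# Second transform takes the first one as its source: that is the chaining.
shout = TransformSketch(to_docbook, lambda: "item<=ITEM<", toy_processor)

chained_output = shout()

if __name__ == "__main__":
    print(chained_output)
```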

     In principle, XMLTransform abstracts the XSL processor used.
     Currently, it can make use of libxslt or 4Suite, with the usual
     caveats. URIs with namespaces weren't available for libxslt, as
     libxslt doesn't export the bindings to Python.

     Local relative path resolution wasn't working last I looked, but
     Craeg Strong fixed it. I haven't been able to unbreak my
     installation since then, though, so I haven't tasted the fruits of
     his labour.

-- 
Jean Jordaan
Upfront Systems                         http://www.upfrontsystems.co.za