Converting OpenOffice.org documents to XHTML 1.0 strict with Writer2LaTeX: a quickguide (second version)

About this document

This is the second version of my quickguide, documenting XHTML conversion with the 2003-12-11 development release of Writer2LaTeX 0.3.1. Users of earlier development releases are encouraged to upgrade, as this release includes a fair number of fixes.

Update 2004-01-22: This quickguide is slightly outdated; for the latest bleeding-edge release, check Henrik Just's Writer2LaTeX page and follow its instructions. I am planning to release an update of this page soon.

Intro

OpenOffice.org 1.1 has a lot of excellent features and lets you save files in a range of formats, from *.sxw through *.doc and *.rtf to *.pdf. However, generating decent HTML documents is not possible with the default settings of the OpenOffice.org suite.

Let's say you want to convert the file example.sxw into an HTML document; this is what you get when saving as *.html. At first sight the document looks good, but when you inspect the source code, you may start frowning: a half-baked doctype, font tags, abundant spans, blockquotes without p tags inside, etc. In some cases, even proprietary tags such as sdfield are inserted into your HTML page's code. A big no-no if you want standards compliance.

Besides saving, you can also export an SXW document to XHTML with OpenOffice.org 1.1's export filter function. Problem solved, you say? Not quite: check the source of this document to see why. The resulting code is full of errors, the footnotes have disappeared, no doctype is defined and the SXW format's namespaces are still there; no, this is not what we want either.

So, do we have to wait for OpenOffice.org 2.0 for decent (X)HTML conversion?

Hmm, no. A while ago, I stumbled upon Henrik Just's Writer2LaTeX utility. Although its name suggests that Writer2LaTeX is 'just' there for converting SXW documents to LaTeX, it also does a great job as an SXW2XHTML tool. Unfortunately, both the documentation and the setup can be pretty confusing. Hence the idea for this quickguide. In the following paragraphs, I will explain how to set up and use Writer2LaTeX as an XHTML export filter in OpenOffice.org 1.1. However, its functionality goes further than that: besides XHTML output, there's also support for MathML and LaTeX export, and Writer2LaTeX can be used as a command line utility or from another Java application. If you're interested in those features, please have a look at Henrik Just's user manual. (Just to clear things up: I am not involved in the development of this tool. All credits go to OpenOffice.org's Henrik Just.)

Installation in 8 steps

  1. You need to have OpenOffice.org 1.1 or StarOffice 7 with:
    • Java support enabled. If this is not the case, you can run jvmsetup in the <OOo_installation_directory>\program\ folder. Needless to say, you need to have a JRE installed for that.
    • Mobile Device Filters installed during the installation process (the filters are not installed by default). If not, run setup again and install them.
  2. Quit all OpenOffice.org 1.1 windows and close the quickstarter in the task tray.
  3. Download the Writer2LaTeX version 0.3 zip archive, the 2003-12-11 development release of writer2latex.jar and the 2003-12-11 development release of xmergefix.jar.
  4. Unzip the zip archive to a temporary folder and:
    • replace the 0.3 version of writer2latex.jar with the 2003-12-11 development release of writer2latex.jar.
    • replace the 0.3 version of xmergefix.jar with the 2003-12-11 development release of xmergefix.jar.
  5. Go to <OOo_installation_directory>\program\classes\ and copy writer2latex.jar, xmergefix.jar and writer2latex.xml from the temporary folder into this directory.
  6. In that same directory, rename xmerge.jar to oldxmerge.jar and xmergefix.jar to xmerge.jar.
  7. Then have a look in <OOo_installation_directory>\share\registry\data\org\openoffice\Office\ and open TypeDetection.xcu in a text editor.
    • Under <node oor:name="Types">, add this piece of code:
        <node oor:name="writer_latex_File" oor:op="replace">
         <prop oor:name="UIName">
          <value xml:lang="en-US">LaTeX 2e</value>
         </prop>
         <prop oor:name="Data">
          <value>0,,,,tex,20002,</value>
         </prop>
        </node>
      
        <node oor:name="writer_bibtex_File" oor:op="replace">
         <prop oor:name="UIName">
          <value xml:lang="en-US">BibTeX Data File</value>
         </prop>
         <prop oor:name="Data">
          <value>0,,,,bib,20002,</value>
         </prop>
        </node>
      
        <node oor:name="writer_xhtml10_File" oor:op="replace">
         <prop oor:name="UIName">
          <value xml:lang="en-US">XHTML 1.0 strict</value>
         </prop>
         <prop oor:name="Data">
          <value>0,,,,html,20002,</value>
         </prop>
        </node>
      
        <node oor:name="writer_xhtml_mathml_File" oor:op="replace">
         <prop oor:name="UIName">
          <value xml:lang="en-US">XHTML 1.1 + MathML 2.0</value>
         </prop>
         <prop oor:name="Data">
          <value>0,,,,xhtml,20002,</value>
         </prop>
        </node>
      
        <node oor:name="writer_xhtml_mathml_xsl_File" oor:op="replace">
         <prop oor:name="UIName">
          <value xml:lang="en-US">XHTML 1.1 + MathML 2.0 (xsl)</value>
         </prop>
         <prop oor:name="Data">
          <value>0,,,,xml,20002,</value>
         </prop>
        </node>
    • And under <node oor:name="Filters">, add this piece of code:
        <node oor:name="Latex File" oor:op="replace">
         <prop oor:name="UIName">
            <value xml:lang="en-US">LaTeX 2e</value>
         </prop>
         <prop oor:name="Data">
            <value>0,writer_latex_File,com.sun.star.text.TextDocument,com.sun.star.comp.Writer.XmlFilterAdaptor,524354,com.sun.star.documentconversion.XMergeBridge;classes/writer2latex.jar;com.sun.star.comp.Writer.XMLImporter;com.sun.star.comp.Writer.XMLExporter;staroffice/sxw;application/x-latex,0,,</value>
         </prop>
         <prop oor:name="Installed" oor:type="xs:boolean">
            <value>true</value>
         </prop>
        </node>
      
        <node oor:name="BibTeX Data File" oor:op="replace">
         <prop oor:name="UIName">
            <value xml:lang="en-US">BibTeX Data File</value>
         </prop>
         <prop oor:name="Data">
            <value>0,writer_bibtex_File,com.sun.star.text.TextDocument,com.sun.star.comp.Writer.XmlFilterAdaptor,524354,com.sun.star.documentconversion.XMergeBridge;classes/writer2latex.jar;com.sun.star.comp.Writer.XMLImporter;com.sun.star.comp.Writer.XMLExporter;staroffice/sxw;application/x-bibtex,0,,</value>
         </prop>
         <prop oor:name="Installed" oor:type="xs:boolean">
            <value>true</value>
         </prop>
        </node>
      
        <node oor:name="XHTML 1.0 strict File" oor:op="replace">
         <prop oor:name="UIName">
            <value xml:lang="en-US">XHTML 1.0 strict</value>
         </prop>
         <prop oor:name="Data">
            <value>0,writer_xhtml10_File,com.sun.star.text.TextDocument,com.sun.star.comp.Writer.XmlFilterAdaptor,524354,com.sun.star.documentconversion.XMergeBridge;classes/writer2latex.jar;com.sun.star.comp.Writer.XMLImporter;com.sun.star.comp.Writer.XMLExporter;staroffice/sxw;text/html,0,,</value>
         </prop>
         <prop oor:name="Installed" oor:type="xs:boolean">
            <value>true</value>
         </prop>
        </node>
      
        <node oor:name="XHTML 1.1 plus MathML 2.0 File" oor:op="replace">
         <prop oor:name="UIName">
            <value xml:lang="en-US">XHTML 1.1 + MathML 2.0</value>
         </prop>
         <prop oor:name="Data">
            <value>0,writer_xhtml_mathml_File,com.sun.star.text.TextDocument,com.sun.star.comp.Writer.XmlFilterAdaptor,524354,com.sun.star.documentconversion.XMergeBridge;classes/writer2latex.jar;com.sun.star.comp.Writer.XMLImporter;com.sun.star.comp.Writer.XMLExporter;staroffice/sxw;application/xhtml+xml,0,,</value>
         </prop>
         <prop oor:name="Installed" oor:type="xs:boolean">
            <value>true</value>
         </prop>
        </node>
       
        <node oor:name="XHTML 1.1 plus MathML 2.0 (xsl) File" oor:op="replace">
         <prop oor:name="UIName">
            <value xml:lang="en-US">XHTML 1.1 + MathML 2.0 (xsl)</value>
         </prop>
         <prop oor:name="Data">
            <value>0,writer_xhtml_mathml_xsl_File,com.sun.star.text.TextDocument,com.sun.star.comp.Writer.XmlFilterAdaptor,524354,com.sun.star.documentconversion.XMergeBridge;classes/writer2latex.jar;com.sun.star.comp.Writer.XMLImporter;com.sun.star.comp.Writer.XMLExporter;staroffice/sxw;application/xml,0,,</value>
         </prop>
         <prop oor:name="Installed" oor:type="xs:boolean">
            <value>true</value>
         </prop>
        </node>
    Two notes:
    • If you are upgrading from an earlier Writer2LaTeX installation (e.g. the one described in my first quickguide), don't simply add the pieces of code above, but also delete the lines you added the previous time. If you are unsure which ones to delete and forgot to make a backup of the original TypeDetection.xcu, the previous (and deprecated) version of my quickguide can help you out.
    • According to Henrik Just's explanation, it's also possible (in case you have a single-user installation of OpenOffice.org 1.1) to extract the TypeDetection.xcu file you find in the Writer2LaTeX version 0.3 zip archive to the <OOo_installation_directory>\user\registry\data\org\openoffice\Office\ folder, but as that didn't work for me, I cannot recommend it.
  8. That's it! Fire up OpenOffice.org 1.1 and you'll see that in the File/Export menu, you can choose between XHTML 1.0 strict (.html), XHTML 1.1 + MathML 2.0 (.xhtml), and XHTML 1.1 + MathML 2.0 (xsl) (.xml).
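
For step 7, placement matters: the new nodes go inside the existing Types and Filters nodes, next to the entries that are already there. The fragment below is only a schematic sketch of the edited TypeDetection.xcu. The comments stand in for content that is elided here, and the file's root element and its other nodes are left out, so this is not a complete file.

```xml
<node oor:name="Types">
  <!-- ... type nodes that were already in the file, left untouched ... -->

  <!-- paste the five writer_*_File type nodes from step 7 here -->
  <node oor:name="writer_xhtml10_File" oor:op="replace">
    <prop oor:name="UIName">
      <value xml:lang="en-US">XHTML 1.0 strict</value>
    </prop>
    <prop oor:name="Data">
      <value>0,,,,html,20002,</value>
    </prop>
  </node>
</node>

<node oor:name="Filters">
  <!-- ... filter nodes that were already in the file, left untouched ... -->

  <!-- paste the five filter nodes from step 7 here -->
</node>
```

Each filter's Data value refers to its type by name (for instance writer_xhtml10_File), so the names in the two sections have to match exactly.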

Playing around with the config files

Back to our example.sxw file. We use the newly installed filters and export example.sxw as example3.html by means of the XHTML 1.0 strict (.html) filter option. What do we get in our output folder? The example3.html file we asked for, plus a file called example3-config.xml.

First, the HTML file. The source of example3.html looks pretty good and validates as XHTML 1.0 strict! However, certain text strings I defined as paragraph or as character elements (that is, I gave them semantic values with OpenOffice.org 1.1's Stylist) in the SXW document are now converted to p and span elements with classes. That is, Emphasis (or in OOo's raw XML format <text:span text:style-name="Emphasis">) is translated as <span class="Emphasis"> (accompanied by a corresponding CSS style rule), Strong Emphasis becomes <span class="StrongEmphasis">, etc. This is not bad, but it can be done better: Emphasis should become <em>, Strong Emphasis <strong> and so on.
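
To make the difference concrete, here is a small illustration with a made-up sentence. The first paragraph shows the kind of markup the default export produces, the second what we would like to get instead. Only the element and class names come from the description above; real output may carry additional attributes.

```xml
<!-- Default export: character styles become spans with classes -->
<p>This is <span class="Emphasis">important</span> and this is <span class="StrongEmphasis">urgent</span>.</p>

<!-- What we want: structural elements -->
<p>This is <em>important</em> and this is <strong>urgent</strong>.</p>
```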

This is where the example3-config.xml file comes in. If you open the file in a text editor, you see a bunch of options. The last four are XHTML related. They allow you to define a custom stylesheet, to ignore all style information (see below), to scale all layout values and to split the output document into several parts (command line only, though). Besides this, you have the possibility to map OpenOffice.org 1.1 styles to XHTML tags by means of the <xhtml-style-map /> tag. To make optimal use of this possibility, we insert the following piece of code under the last <option /> tag:

  <xhtml-style-map name="Preformatted Text" class="paragraph" element="pre" css="(none)" />
  <xhtml-style-map name="Emphasis" class="text" element="em" css="(none)" />
  <xhtml-style-map name="Strong Emphasis" class="text" element="strong" css="(none)" />
  <xhtml-style-map name="Citation" class="text" element="cite" css="(none)" />
  <xhtml-style-map name="Definition" class="text" element="dfn" css="(none)" />
  <xhtml-style-map name="Source Text" class="text" element="code" css="(none)" />
  <xhtml-style-map name="Quotations" class="paragraph"
    block-element="blockquote" block-css="(none)" element="p" css="(none)" />

Then, we save the file as example4-config.xml in the output folder and use it as config file for the next document we will export: example4.html.
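
For readers who want to write their own mappings, here are two of the lines above again with comments. The reading of the attributes is inferred from the examples in this guide and from the resulting output, not taken from the user manual, so treat it as an assumption and check Henrik Just's documentation for the details.

```xml
<!-- name: the OpenOffice.org style to map;
     class="text": it is a character style;
     element: the XHTML tag that replaces the span -->
<xhtml-style-map name="Emphasis" class="text" element="em" css="(none)" />

<!-- class="paragraph": it is a paragraph style;
     block-element wraps the element, so each quotation paragraph
     is exported as a p nested inside a blockquote -->
<xhtml-style-map name="Quotations" class="paragraph"
  block-element="blockquote" block-css="(none)" element="p" css="(none)" />
```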

OK. Back to the example.sxw document: by means of the XHTML 1.0 strict (.html) filter we export the document as example4.html. And there you are: an XHTML 1.0 strict output document that uses structural tags where possible, and a combination of p and span elements with classes where only style information is available (e.g. text strings that were formatted as italic instead of being given the Emphasis style). Also note that Quotations are nicely converted to blockquotes with nested p elements.

If you don't like the style information in the XHTML document's head section, you can set the xhtml_ignore_styles option to true. By means of an example5-config.xml file, we can then generate example5.html, which is an XHTML document without style information. Note, however, that the document's classes are preserved, which gives you several hooks for your own stylesheet rules (to be imported by means of the xhtml_custom_stylesheet option in the *-config.xml file).
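
As a rough sketch, the two options mentioned above might look like this in a *-config.xml file. The name/value attribute form is an assumption, so compare it with the <option /> lines in your own generated config file before copying it. mystyles.css is a hypothetical file name.

```xml
<!-- Assumed syntax: check it against the generated example3-config.xml -->

<!-- Leave the style information out of the document's head section -->
<option name="xhtml_ignore_styles" value="true" />

<!-- Hypothetical stylesheet that styles the preserved classes -->
<option name="xhtml_custom_stylesheet" value="mystyles.css" />
```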

As an aside, in case you're not yet convinced of this filter's possibilities: I tried it with a 90-page document and the conversion was a breeze. Not one validation error. Need I say more?

Any questions, oddities, additions, ideas? Feel free to leave a comment over at WebGraphics!

Been here before? Then you've probably seen the now deprecated first version of this quickguide.

Andreas Bovens 2003.