A BRIEF INTRODUCTION TO TEX4HT

KAPIL HARI PARANJAPE

1. What do we have here?

What follows is a brief introduction to the TeX4ht system designed and currently maintained by Eitan M. Gurari. The source for this document is in the file tex4ht_doc.tex and can be processed using the command htlatex tex4ht_doc.tex as explained below. It is hoped that such processing will prove instructive as well.

2. Executive summary

TeX4ht is a system to convert TeX input into hypertext documents of different kinds. TeX4ht operates on input that is “standard” TeX or LaTeX (but please check the last section for some differences). This input is processed by tex in the usual way, except that certain additional macros are loaded which create hooks in the output that can be used to produce the hypertext. The output is then post-processed by the program tex4ht, which produces the hypertext. Auxiliary files such as .css files and image files are produced by the program t4ht.

Usage is simplified by the Perl script mk4ht, which can be called directly to combine the above operations transparently. For example, the source of this document can be processed using

mk4ht htlatex tex4ht_doc.tex

This will produce tex4ht_doc.html and some supplementary files, which together form the HTML version of this documentation. Similarly,

mk4ht xhmlatex tex4ht_doc.tex

will produce the XML version with MathML and

mk4ht mzlatex tex4ht_doc.tex

will produce MathML that uses fonts which are rendered well by the “Gecko” engine of Mozilla. Further commands of this kind are

mk4ht oolatex tex4ht_doc.tex

for a format that can be read by OpenOffice,

mk4ht dblatex tex4ht_doc.tex

for DocBook and

mk4ht teilatex tex4ht_doc.tex

for TEI format XML output. The broad structure of the mk4ht command-line is

mk4ht #1 #2 #3 #4 #5

The first argument is the type of conversion required. Using mk4ht without arguments lists the conversions available. The second argument is the name of the file that is to be processed. The third, fourth and fifth arguments are optional and are described in some detail below.
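As a rough illustration of how the five positional arguments line up, the following sketch only prints an mk4ht command line instead of running it. The function name show_mk4ht is hypothetical (it is not part of TeX4ht), and the option values are simply those used in the examples elsewhere in this document.

```shell
#!/bin/sh
# Hypothetical helper (not part of TeX4ht): print, but do not run,
# an mk4ht command line built from the five positional arguments.
show_mk4ht() {
    conversion=$1      # #1: type of conversion, e.g. htlatex
    source_file=$2     # #2: file to be processed
    style_opts=${3-}   # #3: options for the TeX/LaTeX run (optional)
    tex4ht_opts=${4-}  # #4: options for tex4ht (optional)
    t4ht_opts=${5-}    # #5: options for t4ht (optional)
    printf 'mk4ht %s %s "%s" "%s" "%s"\n' \
        "$conversion" "$source_file" \
        "$style_opts" "$tex4ht_opts" "$t4ht_opts"
}

show_mk4ht htlatex tex4ht_doc.tex "xhtml,mozilla" "-cmozhtf" "-p"
```

Because the arguments are positional, an option meant for t4ht (#5) has to come after slots #3 and #4; the sketch therefore always prints all three optional slots, even when they are empty.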

The rest of this document introduces the system in a little more detail. See [1] and [2] for authoritative information. Section 3 examines the options for modifying the way in which TeX processes the source; specifically, these can be thought of as options for the macros in tex4ht.sty. Section 4 deals with the post-processing that converts TeX’s output into hypertext. Section 5 shows how one can change the way the system generates the supplementary files, such as images and style-sheets, for the hypertext output. Section 6 documents some differences between TeX4ht and TeX.

This document assumes that the reader has some familiarity with the TeX and LaTeX systems; see [3] and [4] for more information.

3. Options for Styles

Options for TeX and LaTeX processing can be added as the first optional argument (#3 above) to the mk4ht command. For example, the command

mk4ht xhmlatex tex4ht_doc.tex

is in fact similar to the command

mk4ht htlatex tex4ht_doc.tex "xhtml,mathml"

Similarly,

mk4ht oolatex tex4ht_doc.tex

is in fact similar to the command

mk4ht htlatex tex4ht_doc.tex "xhtml,ooffice"

In most cases this list of options begins with html or xhtml. Additional options available can be found by searching for the string --- Note --- at the start of a line in the resulting log file. For example

mk4ht htlatex tex4ht_doc.tex
grep -A 1 '^--- Note ---' tex4ht_doc.log

will list all the available options for html conversion.

When this list of options does not start with html or xhtml then the system looks for a file with the name given by the first option and the .cfg extension. The simplest use of this feature is as follows. Create a file called bgimage.cfg containing the lines

\Preamble{html}  
\begin{document}  
\Css{BODY { background-image : url(background.png); }}  
\EndPreamble

After this

mk4ht htlatex tex4ht_doc.tex "bgimage"

will add an additional line to tex4ht_doc.css incorporating the image background.png. See the main documentation [1] for more details on creating configuration files.
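The steps above can be sketched as a small shell script. This is only an illustration: it writes the configuration file shown above into the current directory and, as an added precaution that is not part of the original instructions, runs the conversion only when mk4ht and the source file are actually present.

```shell
#!/bin/sh
# Write the configuration file shown above into the current directory.
# The quoted 'EOF' keeps the shell from interpreting backslashes or braces.
cat > bgimage.cfg <<'EOF'
\Preamble{html}
\begin{document}
\Css{BODY { background-image : url(background.png); }}
\EndPreamble
EOF

# Run the conversion only if mk4ht is installed and the source exists.
if command -v mk4ht >/dev/null 2>&1 && [ -f tex4ht_doc.tex ]; then
    mk4ht htlatex tex4ht_doc.tex "bgimage"
fi
```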

4. Post processing

The optional arguments #4 and #5 are options for the tex4ht and t4ht commands respectively. Both these commands make use of the configuration file tex4ht.env (which may be overridden by .tex4ht in the current directory or the user’s home directory). This configuration file is called the “environment file” in the main documentation [1] in order to avoid confusing it with the configuration file described in the previous section.

The program tex4ht has to look for “font descriptions” that describe how various non-standard glyphs are to be “rendered” in hypertext. The TeX4ht system provides a number of possibilities like using Unicode or fonts suited to the Gecko engine of the Mozilla browser and so on. So the command

mk4ht mzlatex tex4ht_doc.tex

is almost equivalent to

mk4ht htlatex tex4ht_doc.tex "xhtml,mozilla" "-cmozhtf"

The -c<tagname> option for tex4ht picks up the tagged section from the tex4ht.env environment file. Any other command-line option of tex4ht can also be used as part of #4, which is just a space-separated list of options for this command.

5. Creating Supplementary Files

The final step of conversion is the creation of supplementary files, such as image files for formulae and equations like

$$ \frac{x^n - 1}{x - 1} = \sum_{i=0}^{n-1} x^i $$

which is the rendering of the LaTeX input string

\[ \frac{x^n-1}{x-1} = \sum_{i=0}^{n-1} x^i \]

In most cases such TeX constructions can only be rendered as images. The tex4ht program creates a series of instructions for the t4ht program in a .lg file. The latter carries out these instructions by making use of external programs like dvipng or convert to create these images. The most useful option in the argument list #5 is -p, which prevents images from being generated. Another useful option is -cvalidate, which causes the final output to be validated using an external validation program such as xmllint. All the options in the argument list #5 are passed on to t4ht.

6. Some differences between TeX4ht and TeX

We document some differences between the systems. For more up-to-date information please see the author’s documentation [1].

6.1. Regarding filenames. In short, do not use special characters in your filenames; ideally stick with filenames which are composed of standard ASCII alphanumerics wherever possible. Some explanations follow.

TeX nowadays accepts files with names that contain all manner of characters, and so it is natural to imagine that TeX4ht will do so too. However, one has to be concerned with the filenames used in output as well as those used for input. Since these names will appear in URLs within the hypertext, using special characters will cause hyperlinks to break. Thus TeX4ht does not currently behave well if special characters are used in input file names.
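One way to apply this advice before running a conversion is to check the file name first. The sketch below is a hypothetical helper, not part of TeX4ht; besides ASCII letters and digits it also accepts '.', '_' and '-', which is an assumption made here so that names such as tex4ht_doc.tex pass.

```shell
#!/bin/sh
# Use the C locale so that the character ranges below mean plain ASCII.
LC_ALL=C
export LC_ALL

# Hypothetical check (not part of TeX4ht): succeed only if the name is
# non-empty and consists of ASCII letters, digits, '.', '_' and '-'.
is_safe_name() {
    case $1 in
        '' | *[!A-Za-z0-9._-]*) return 1 ;;
        *) return 0 ;;
    esac
}

for name in tex4ht_doc.tex "my notes.tex" "a&b.tex"; do
    if is_safe_name "$name"; then
        echo "ok:    $name"
    else
        echo "avoid: $name"
    fi
done
```

Of the three sample names, only tex4ht_doc.tex passes; the other two contain a space and an ampersand respectively.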

6.2. Extra braces required. In short, when in doubt, enclose subscripts and superscripts in braces if they are longer than a single character.

In this respect the syntax of the TeX language that is accepted by TeX4ht is stricter than that accepted by TeX and LaTeX.

References

[1]     http://www.cse.ohio-state.edu/~gurari/mn.html The authoritative documentation maintained by Eitan M. Gurari.

[2]     http://www.cse.ohio-state.edu/~gurari Eitan M. Gurari’s web page that discusses related projects.

[3]     http://www.tug.org/ The TeX Users Group’s primary web site.

[4]     http://www.latex-project.org/ The LaTeX project’s primary web site.