evan_tech

11:15 pm, 20 Jun 03

journal export to pdf

Summary: It’s possible, and really not that hard.


So XSL (the S is for “stylesheet”) appears to have two major subparts.
http://www.w3.org/TR/xsl/ says:

     This specification defines the features and syntax for the
     Extensible Stylesheet Language (XSL), a language for expressing
     stylesheets. It consists of two parts:
       1. a language for transforming XML documents, and
       2. an XML vocabulary for specifying formatting semantics.

In practice, those are known as
1. XSLT, T = transformations, for transforming one XML document into
   another.  This is the one I was looking at in the past.
2. FO, Formatting Objects, which is an HTML-like language for
   representing printed material.  This includes ideas like the
   concept of a “page” (like different layouts for even and odd
   pages) and higher-quality typesetting features like kerning.

I’ve been playing with the latter.  You can combine the two to take an
XML export of your journal, XSLT it into a FO document, and then render
the FO to a PDF.  (Really, an S2 style should create the FO document.)
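To give a feel for what the intermediate FO document contains, here is a rough sketch that builds a minimal one with Python's standard library instead of XSLT.  The function name entries_to_fo, the master name "page", and the US Letter page size are my own assumptions for illustration; a real stylesheet would carry far more layout detail.

```python
import xml.etree.ElementTree as ET

FO_NS = "http://www.w3.org/1999/XSL/Format"
ET.register_namespace("fo", FO_NS)  # serialize with the usual fo: prefix


def fo(tag):
    """Return the namespace-qualified name for an FO element."""
    return "{%s}%s" % (FO_NS, tag)


def entries_to_fo(entries):
    """Build a minimal FO document from (subject, body) pairs.

    Hypothetical helper: the page size and margins are illustrative only.
    """
    root = ET.Element(fo("root"))

    # Page geometry: a single page master (the name "page" is arbitrary).
    masters = ET.SubElement(root, fo("layout-master-set"))
    page = ET.SubElement(masters, fo("simple-page-master"), {
        "master-name": "page",
        "page-height": "11in",
        "page-width": "8.5in",
        "margin-top": "1in",
        "margin-bottom": "1in",
        "margin-left": "1in",
        "margin-right": "1in",
    })
    ET.SubElement(page, fo("region-body"))

    # Content: one page sequence that flows blocks into the body region.
    seq = ET.SubElement(root, fo("page-sequence"),
                        {"master-reference": "page"})
    flow = ET.SubElement(seq, fo("flow"), {"flow-name": "xsl-region-body"})
    for subject, body in entries:
        heading = ET.SubElement(flow, fo("block"), {
            "font-weight": "bold",
            "space-before": "12pt",
        })
        heading.text = subject
        para = ET.SubElement(flow, fo("block"))
        para.text = body

    return ET.tostring(root, encoding="unicode")
```

Calling entries_to_fo([("Subject", "Body text.")]) returns the FO document as a string, which could be written to a .fo file and handed to a renderer.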

There are a few systems for rendering FO:
 - some commercial ones (including the top few links whenever I tried
   Googling for information on this on Linux).
 - Apache FOP, a Java renderer from the Apache project.
 - http://xmlroff.sourceforge.net/
   which uses Pango(!) + PDFlib or Gnome-Print for rendering.
 - http://www.tei-c.org.uk/Software/passivetex/
   which uses TeX for rendering.

The last one is in Debian.

It’d be neat to be able to hook this up to some web page export, but
even rendering two pages of output takes a few seconds on my computer.
Some sort of delayed response (submit, then come back for the results)
would work.

Creating good styles is itself the same problem we have with pushing
S2, but it’d be easy enough to copy the style found in a printed
journal; I was looking at the free preview of The Diary of Anne Frank on
Amazon, for example, and I think I could copy that in a few hours if I
knew FO better.  (Are there any copyright issues here?)

The other hard part brings us back to Pie (or whatever): the journal
content also needs to be transformed into FO.  For example, here’s a
trivial transformation that maps <b> tags into the <fo:inline> tag
(equivalent of HTML’s “span”):

<xsl:template match="b">
  <fo:inline font-weight="bold">
    <xsl:apply-templates/>
  </fo:inline>
</xsl:template>

(As you can see here and elsewhere, FO feels a lot like CSS in its
design.)

However, that would only work if journal content is available as part of the
XML tree.  It isn’t for us, because journals aren’t guaranteed to be
well-formed XML.  It would seem to me that we could just run the HTML
cleaner on the entries before they’re generated, and then we’re good to
go.  (Alternately, we can try to generate FO directly from the malformed
entries, but FO is written in XML so we still need a well-formed
hierarchy.)
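As a rough illustration of what that cleaning step has to do, here is a sketch using Python's html.parser.  It is not the actual HTML cleaner, just my assumption of the minimum needed: close tags that were left open, drop stray closing tags, self-close void elements, and escape text.  It doesn't validate tag or attribute names, so it isn't a complete solution.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

VOID = {"br", "hr", "img"}  # elements that never take a closing tag


class Cleaner(HTMLParser):
    """Re-emit tag soup as a well-nested XML fragment (sketch only)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of cleaned output
        self.open_tags = []  # stack of currently open elements

    def _emit_start(self, tag, attrs, self_closing):
        parts = [tag]
        seen = set()
        for name, value in attrs:
            if name in seen:  # duplicate attributes are not legal XML
                continue
            seen.add(name)
            parts.append("%s=%s" % (name, quoteattr(value or "")))
        self.out.append("<%s%s>" % (" ".join(parts),
                                    "/" if self_closing else ""))

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            self._emit_start(tag, attrs, True)
        else:
            self._emit_start(tag, attrs, False)
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        self._emit_start(tag, attrs, True)

    def handle_endtag(self, tag):
        if tag in VOID or tag not in self.open_tags:
            return  # stray closing tag: drop it
        # Close everything opened since the matching start tag.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append("</%s>" % top)
            if top == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))

    def close_all(self):
        # Close anything still open at the end of the entry.
        while self.open_tags:
            self.out.append("</%s>" % self.open_tags.pop())


def clean(html):
    """Return a well-nested XML fragment for one entry body."""
    cleaner = Cleaner()
    cleaner.feed(html)
    cleaner.close()
    cleaner.close_all()
    return "".join(cleaner.out)
```

The result is a fragment, so it still needs to be wrapped in a single parent element before templates like the <b> one above can match against it.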


Attached are a test PDF and the XSL used to generate it from a LogJam
exported month.  I snipped out most of the entries for testing, and it
doesn’t look so good because I don’t know much about FO.  There are also
still some HTML-isms in the XSL, but there’s enough there to get the
gist of it.

To build it yourself:
  apt-get install xsltproc passivetex xmltex
  xsltproc monthpdf.xsl logjam-xml-file.xml > output.fo
  pdfxmltex output.fo