XQuery 1.0 and XSLT 2.0: news and how to use them in Debian

(long post about the second generation of the XSL family and how to use the related languages in Debian)


The 2nd generation of the XSL family

At the beginning of 2007, W3C released a family of interrelated specifications: XPath 2.0, XSLT 2.0, and XQuery 1.0. The 3 specifications are based on the very same underlying data model, which (finally!) supports typed values and exploits them to some extent, for example permitting some form of static type checking in XPath and XQuery. Types usually come from XML Schema, with both built-in and user-defined types supported.

A brief overview of the 3 specifications and what they change in the state of the art follows.

XPath 2.0

XPath 2.0 basically consists of XPath 1.0 + 3 new macro features. The first one is a revamp of the language features that deal with types, in order to better integrate them with the new typed data model. XPath now has operators to check whether a value belongs to a given type (so yes, type information is kept at run time), and to cast it from one type to another. The basic new type constructor is the sequence, which replaces the old node-set and solves many annoying issues, such as the impossibility of having a node ordering other than document order.
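
To make the type operators and sequences concrete, here is a minimal sketch written as a standalone XQuery (every XPath 2.0 expression is also a valid XQuery one). The literal values are made up for illustration. The xs prefix is predeclared in XQuery, but it has to be bound explicitly if you embed these expressions in an XSLT 2.0 stylesheet.

```xquery
(: four independent expressions, glued into one sequence by commas :)
(
  "42" cast as xs:integer,                    (: cast a string to an integer: 42 :)
  "4x2" castable as xs:integer,               (: test before casting: false :)
  (1, "two", 3.0)[. instance of xs:integer],  (: run-time type check: keeps only 1 :)
  reverse((1, 2, 3))                          (: sequences are ordered: 3, 2, 1 :)
)
```

The result is a flat sequence of six items, since sequences never nest.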

The second feature is the improvement of the standard library of functions, which was ridiculously small in XPath 1.0; it is much better now. Additionally, the 2nd generation languages (XSLT 2.0 and XQuery) now let you define functions which are then visible to inner (XPath) expressions, pushing the possibilities of plain old XPath even further. Functions can now also declare types in their signatures (for their arguments and return value), and untyped arguments are automatically cast to them upon invocation. Hence, even if you are not using an implementation which assigns types to your XML trees, once you "enter" the typed world by calling a typed function (almost all standard library functions are decently typed) you will be able to stay there, avoiding annoying casts everywhere.
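
A minimal sketch of a typed user-defined function in XQuery follows. The function local:double and the inline <item> element are hypothetical names invented for the example. It assumes a processor without static typing (such as SaxonB); a static type checker may additionally ask you to guarantee that exactly one qty element is selected.

```xquery
(: both the argument and the result are declared as xs:integer :)
declare function local:double($n as xs:integer) as xs:integer {
  2 * $n
};

(: an untyped XML snippet: no schema, hence no type annotations :)
let $item := <item><qty>21</qty></item>
(: the qty element is atomized and its untyped value is cast to xs:integer :)
return local:double($item/qty)   (: 42 :)
```

Note that no explicit cast appears at the call site: the declared signature is what pulls the untyped value into the typed world.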

Finally, XPath 2.0 has turned into a powerful purely functional language and is now powered by constructs like conditionals, for-each loops, existential/universal quantifiers, and existentially-quantified comparison operators for sequences. Here is a complex expression to whet your appetite (or scare you away ...); comments come as (: smiley faces :)

for $book in /bookshelf//books
return
  if ((every $author in $book/authors/author
       satisfies $author/nativeLang eq "it_IT")
      and $book/lang eq "it_IT")
  then $book
  else ()

(: think about the trouble of writing this in XSLT/XPath 1.0 ... :)

XSLT 2.0

XSLT 2.0 is what I would call a "bug fix release" of XSLT 1.0 + the routine reworking of the language to deal with typed values, which is not noticeably different from what has been done for XPath 2.0. The fixed "bugs" are several, starting from the annoying issue of result tree fragments. They are basically tree snippets that in XSLT 1.0 you were able to generate for future use. Unfortunately they were not thaaat reusable, given that you were not even able to navigate them with XPath operators! Now the specification is much clearer and distinguishes final result trees from ordinary variables, which can now contain sequences of (navigable) tree nodes.

Another important "bug" fixed is the new ability to output multiple documents with a single XSLT stylesheet: it marks the end of stupid extra post-processing steps added in a pipeline after an XSLT processor.

Other minor "bugs" fixed include limited backtracking capabilities among imported templates, regular expression support directly in the language, and grouping constructs along the lines of SQL's GROUP BY (but much more powerful). Here is a template snippet exploiting the latter feature:

<xsl:for-each-group select="*" group-starting-with="h1">
  <div>
    <xsl:apply-templates select="current-group()" />
  </div>
</xsl:for-each-group>

XQuery

XQuery is the end of the chains imposed by XML-based syntaxes. Why the heck one has to use an XML syntax (as in the above snippet) only because she is manipulating XML trees is one of the mysteries of XML technologies that has always been floating around in my head.

XQuery is (supposedly) the SQL equivalent for databases of XML documents, but is actually much more than that. I depict it in my head as the XML manipulation language with a syntax I can finally stand. Technically it is XPath 2.0 (say 80% of the whole language) + some extra ingredients (say 20%); so remember that every XPath 2.0 expression is also an XQuery expression.

The main extra ingredient is the so called FLWOR expression (to be read: "flower expression", which in addition to the "smiley faces" used for comments gives a "back to 1968"-flavour to the language .... erm FLWOR/flower/flavour, no pun intended). A FLWOR expression is very similar to SQL's SELECT-FROM-WHERE: it lets you generate a tuple stream by iterating on sequences (F: for clauses), bind expression values to names (L: let clauses), filter out tuples which do not satisfy a required condition (W: where clause), order the surviving tuples (O: order by clause), and finally return a sequence built using the residual tuple stream (R: return clause).
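
The five clauses can be sketched in a single self-contained query. The <shelf> data below is invented for the example and simply stands in for a real document you would load with doc(); the prices are cast explicitly because the inline snippet carries no schema types.

```xquery
(: hypothetical inline data, standing in for a real document :)
let $shelf :=
  <shelf>
    <book><title>Alpha</title><price>50</price></book>
    <book><title>Beta</title><price>30</price></book>
    <book><title>Gamma</title><price>35</price></book>
  </shelf>
for $b in $shelf/book                (: F: iterate over the books :)
let $p := xs:decimal($b/price)       (: L: bind the price to a name :)
where $p lt 40                       (: W: drop the expensive ones :)
order by $p descending               (: O: most expensive survivor first :)
return string($b/title)              (: R: "Gamma", then "Beta" :)
```

Each pass of the for clause produces one tuple ($b, $p); the where and order by clauses work on those tuples, and the return clause is evaluated once per surviving tuple.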

The other interesting extra ingredient is the ability to build the XML snippets you want to manipulate. Within XQuery you do that using plain XML syntax (the only place where a sane-minded programmer actually wants to see it!) which also supports a classical interpolation mechanism to embed expressions which will be evaluated inside XML snippets, and also the other way around. A canonical XQuery example is:

for $t in doc("books.xml")//title,
    $e in doc("reviews.xml")//entry
where $t = $e/title
return <review>{ $t, $e/remarks }</review>

(: braces denote the escaping context where XQuery expressions will
   be evaluated inside snippets; plain XML syntax is used for the
   other way around :)

But remember: XQuery for XML is much more than SQL for RDBMS. Thanks to the implicit templating mechanism implemented by interpolation, and thanks to several language features fostering modularity (user-defined functions, library modules, XPath 2.0 standard library, ...), you can basically do with it any kind of XML manipulation you can imagine.
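
As a rough illustration of how templating and user-defined functions combine, here is a small sketch that turns a book list into an HTML-like table. The names local:row, <shelf> and the sample data are all made up for the example; in a larger code base the function would more likely live in a library module.

```xquery
(: hypothetical helper: renders one book element as a table row :)
declare function local:row($b as element(book)) as element(tr) {
  <tr>
    <td>{ string($b/title) }</td>
    <td>{ string($b/@year) }</td>
  </tr>
};

(: made-up input data, inlined to keep the query self-contained :)
let $shelf :=
  <shelf>
    <book year="2007"><title>XQuery</title></book>
    <book year="1999"><title>XPath</title></book>
  </shelf>
return
  <table>{
    for $b in $shelf/book
    order by xs:integer($b/@year)
    return local:row($b)
  }</table>
```

The output is a single table element whose rows are sorted by year, oldest first. The XML syntax acts as the template, and the braces mark the holes filled in by ordinary expressions.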

I don't think I will ever again write a single line of XML output in DOM or XSLT myself ...


Cool, how can I use it in Debian?

... if only this stuff were decently supported in the open source world.

Last time I checked, the author of most parts of the GNOME toolchain for dealing with XML (libxml2, libxslt, ...) had no intention of implementing XSLT 2.0, let alone XQuery. This comes as no surprise: the whole GNOME XML toolkit is written in C, and XSLT 2.0 / XQuery have reached a level of complexity and formal specification which usually entails a higher level approach. So on the GNOME side we are stuck.

The other open source implementations of XSLT 2.0 / XQuery I'm aware of are Saxon and Galax.

Saxon is an XSLT 2.0 and XQuery implementation written in Java, which is unfortunate per se. Additionally, it is also unfortunate that it is only partially open source. Indeed, Saxon is split into SaxonB (for "basic"), which is open source under the Mozilla Public License, and SaxonSA, which is commercial. While SaxonSA is a fully conformant, XML Schema-aware processor, with support for static typing, SaxonB is a basic-conformant processor with no type-aware features and actually much less optimized than SaxonSA. This is annoying. (Much more annoying is the fact that SaxonSA's author is the only editor of the XSLT 2.0 specification and that instead of his email address the specification includes a URL pointing to the website selling SaxonSA ...)

Galax is an open source (IBM CPL / Lucent license) OCaml implementation of XQuery. It is not fully conformant to the specification (though it gets quite close), but it is type-aware and implements static typing.

The only XQuery / XSLT 2.0 implementation available in Debian at the time of writing is SaxonB. The binary package is libsaxonb-java; kudos to Michael Koch and the Debian Java Maintainers for having packaged it (and for having put up with some annoying pings of mine :-) Debian bug #408842).

To execute XQuery code you just have to aptitude install libsaxonb-java, prepare a query.xq file containing your query, and then execute something like:

CLASSPATH=/usr/share/java/saxonb.jar \
java net.sf.saxon.Query query.xq

Note that the information in README.Debian still refers to old Saxon versions; see Debian bug #465894, which proposes a more up to date README.Debian.

Similarly, to perform an XSLT 2.0 transformation you have to do something like:

CLASSPATH=/usr/share/java/saxonb.jar \
java net.sf.saxon.Transform -ext:off -s:input.xml -xsl:style.xsl -o:output.xml

Do not remove the -ext:off flag when processing untrusted stylesheets! See Debian bug #465885 for the reason.

I've written some handier (1-liner) shell script helpers which remove the need to invoke java manually. They are attached to Debian bug #465894 and I've proposed their addition to the saxonb package. Using them, the above invocations become:

saxonb-xquery query.xq

saxonb-xslt -ext:off -s:input.xml -xsl:style.xsl -o:output.xml

Galax in Debian ... well, I've been planning to package it for a long time (the ITP was filed some months ago: Debian bug #447984) and the authors sent me a newer version than what is available online, to gather feedback before the long overdue final release. Unfortunately I've been lagging behind in finishing the packaging (which is tricky because binding libraries are needed for several different languages: OCaml is native, but there is also Java for example). Hopefully this post will give me some renewed motivation for finishing the work ...


References

  • Debian package w3-recs, ships the whole list of W3C Recommendations for offline consultation; it includes all the specifications we have discussed in this post
  • Saxon: home page of Saxon, an XSLT 2.0 / XQuery processor written in Java
  • Galax: home page of Galax, an XQuery processor written in OCaml
  • Debian package libsaxonb-java, ships SaxonB as a Debian package

Acknowledgements

Thanks to godog for his helpful comments.


Update: the helpers I've proposed have been accepted into the official package, and I've also made their manpages available. The pending changes to README.Debian have been accepted as well. Kudos again to Michael Koch for his quick feedback on my patches!