Adopting w3-recs (i.e. experiences in package generation)

A week or so ago I decided to adopt w3-recs, a documentation-only package that aims to ship the W3C recommendations in Debian.

The previous maintainer (Frankie) stepped down from w3-recs maintenance because the packaging work was quite boring (adding recommendations piecewise as soon as they came out, updating them when needed, keeping doc-base entries in sync, ...).

I didn't want to follow that path and, probably for the first time in my life, the (not so) semantic web has come to the rescue. Starting from the TR automation page, I found an always up-to-date index of all Technical Reports published by the W3C, enriched with some information about the status of each TR (draft, candidate recommendation, recommendation, ...). The index is in RDF/XML format, and hence machine processable.
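To give an idea of what "machine processable" buys here, this is a minimal Python sketch of filtering such an index by status. The embedded sample document, its `tr:` namespace URI, the `REC`/`WD` element names and the example.org URLs are all assumptions made up for illustration; the real index may be structured differently.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"
# Hypothetical vocabulary: one element name per TR status.
TR_NS = "http://example.org/tr-status#"

# Made-up sample standing in for the real index (assumed layout).
SAMPLE_INDEX = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:tr="http://example.org/tr-status#">
  <tr:REC rdf:about="http://example.org/TR/spec-a/">
    <dc:title>Spec A</dc:title>
  </tr:REC>
  <tr:WD rdf:about="http://example.org/TR/spec-b/">
    <dc:title>Spec B (draft)</dc:title>
  </tr:WD>
  <tr:REC rdf:about="http://example.org/TR/spec-c/">
    <dc:title>Spec C</dc:title>
  </tr:REC>
</rdf:RDF>
"""


def list_reports(rdf_text, status="REC"):
    """Return (url, title) pairs for every report with the given status."""
    root = ET.fromstring(rdf_text)
    reports = []
    # Only direct children of rdf:RDF whose element name matches the status.
    for node in root.findall(f"{{{TR_NS}}}{status}"):
        url = node.get(f"{{{RDF_NS}}}about")
        title = node.findtext(f"{{{DC_NS}}}title", default="").strip()
        reports.append((url, title))
    return reports


if __name__ == "__main__":
    for url, title in list_reports(SAMPLE_INDEX):
        print(f"{title}: {url}")
```

With an index of this shape, selecting only the recommendations comes down to a single element-name match, which is what makes a list of URLs to download cheap to produce.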

The final result is that the w3-recs package is now mostly generated automatically via debian/rules targets. The get-orig-source target first downloads the RDF index of the technical reports and then downloads all of them via wget (it took me some time to find the appropriate flags; I ended up with an award-winning wget command line containing 12 dash switches!). Then, at build time, several derived files are generated from the very same RDF index via XSLT: doc-base entries for all the shipped documents (the current grand total is 119 recommendations), an HTML index summarizing the available documents, and other debian/ files needed later on in the package build process.
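The package does this step with XSLT; purely as an illustration, here is a Python sketch of what generating one doc-base entry from a (URL, title) pair could look like. The `w3-recs-` document id prefix, the install path under /usr/share/doc/w3-recs/html and the `Section` value are assumptions, not what the package actually uses, and the Abstract field is left out on purpose.

```python
import re


def document_id(url):
    """Derive a doc-base document id from the last path component of a URL."""
    slug = url.rstrip("/").rsplit("/", 1)[-1].lower()
    # Keep the id conservative: lowercase letters, digits, '.', '+' and '-'.
    slug = re.sub(r"[^a-z0-9.+-]+", "-", slug).strip("-")
    return f"w3-recs-{slug}"  # hypothetical prefix


def doc_base_entry(url, title, doc_root="/usr/share/doc/w3-recs/html"):
    """Render a doc-base control file for one HTML document.

    doc_root is an assumed install location, not the package's real layout.
    """
    directory = url.rstrip("/").rsplit("/", 1)[-1]
    lines = [
        f"Document: {document_id(url)}",
        f"Title: {title}",
        "Author: W3C",
        "Section: Web Development",  # assumed section name
        "",  # a blank line separates the header from the format stanza
        "Format: HTML",
        f"Index: {doc_root}/{directory}/index.html",
        f"Files: {doc_root}/{directory}/*",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(doc_base_entry("http://example.org/TR/Spec_A/", "Spec A"), end="")
```

Each entry depends only on data already present in the index, so adding a newly published recommendation needs no hand editing.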

The next steps for w3-recs should be automatically splitting the package into sub-packages classified according to the corresponding W3C activity, and enriching the doc-base entries with abstracts. Unfortunately, neither of the two kinds of information is provided yet in the RDF index (not so semantic web, after all).

Infrastructure deficiencies encountered in the process

  • wget is not that good at mirroring: I've been bitten by a lot of undocumented behaviours and bugs (e.g.: undocumented ordering requirements among command line flags, fuzzy timestamping, bugs in a posteriori URL rewriting when a destination base dir is specified, ...). I've found no better alternative that can simply mimic what Firefox does when you save a "complete" (set of) web page(s) to disk.

  • dhelp is just completely broken. It is not able to deal with some 100 doc-base entries: it shows only some of them (about 10, do not dare hope for more) and even trims the title lines. Luckily doc-central is much better. At times I find myself thinking that I'm the only believer in Debian's doc-base ...

  • Oh, and while we are at it: I'm still convinced that having the doc-base hierarchy bound to the menu policy is just plain dumb.