Hacking the PTS using SOAP

I've finally found some time, thanks to the Extremadura QA + ftpmaster + i18n meeting, to release the first draft of the SOAP interface to the PTS.

You probably already got the idea, which is quite simple after all. The PTS, as always, gathers information about (source) packages from various sources and melds it into web pages. With a SOAP interface you simply gain the ability to access that information from your own programs.

A proof of concept is overdue:

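    $ cat ./test.py
    #!/usr/bin/env python
    import SOAPpy
    # alpha endpoint of the PTS SOAP interface
    url = 'http://packages.qa.debian.org/cgi-bin/soap-alpha.cgi'
    ws = SOAPpy.SOAPProxy(url)
    # version of the source package currently in unstable
    print ws.versions(source="ocaml")['unstable']
    # name of the uploader at index 1 of the returned list
    print ws.uploaders(source="ocaml")[1]['name']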

    $ ./test.py 
    Stefano Zacchiroli

Everything is still alpha, but it already works. Some links which you might find useful:

Please let me know if and how you are using the SOAP interface; it will help with future development.

How it works

Just a few comments on how it works. You might remember that a while ago I made all PTS pages XHTML-valid. Well, on top of that I've implemented something along the lines of microformats, which just make clever use of ingredients already available in XHTML, like classes and unique identifiers.

With that in place, a "reshuffling" of the information already available on the web pages (which are now kinda "semantically" tagged) can be obtained by evaluating a handful of XPath expressions on the (no longer) final XHTML pages. That's precisely what the CGI implementing the SOAP API is currently doing. This way one can avoid implementing two different access paths to the information collected by the PTS: one for rendering, and one for SOAP (no, reusing the rendering one for SOAP was not an option, given that it was originally written in XSLT).
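
To make the idea concrete, here is a minimal sketch of pulling structured data out of a tagged XHTML page with XPath. It is not the actual CGI code: the page fragment, the `latest_version` id, the `uploader` and `name` classes, and the `extract` function are all made up for illustration, and it uses the limited XPath subset of Python's standard `xml.etree.ElementTree` rather than whatever XPath engine the PTS really uses.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Hypothetical page fragment: the ids and class names are invented
# for this example and are not the ones used by the real PTS pages.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <span id="latest_version">1.2.3-1</span>
    <ul id="uploaders">
      <li class="uploader"><span class="name">Jane Doe</span></li>
      <li class="uploader"><span class="name">John Roe</span></li>
    </ul>
  </body>
</html>"""


def extract(page):
    """Evaluate a couple of XPath expressions on an XHTML page."""
    root = ET.fromstring(page)
    ns = {"h": XHTML_NS}  # XHTML elements live in a namespace
    # A unique identifier selects a single piece of information...
    version = root.find(".//h:span[@id='latest_version']", ns).text
    # ...while a class selects every element playing the same role.
    names = [
        elem.text
        for elem in root.findall(
            ".//h:li[@class='uploader']/h:span[@class='name']", ns
        )
    ]
    return {"version": version, "uploaders": names}


print(extract(PAGE))
```

The same page serves both browsers and programs: the rendering stays untouched, and the "API" is just a set of queries over it.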

The only annoyance I've encountered is that XPath is completely unaware of the "CSS-like" semantics of XHTML classes, according to which the class attribute is a space-separated list of class names, to be interpreted as a set. That means that to check whether an element belongs to a given class you need to fiddle with substring matches on the class attribute (which is quite crappy).
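
For the curious, a widely used XPath 1.0 idiom for this pads the normalized class attribute with spaces and looks for the space-delimited name. The sketch below mimics both the naive and the padded test in plain Python so the difference is easy to see; the helper names and the `uploader` class are hypothetical, and I'm not claiming this is the exact expression the CGI uses.

```python
def class_xpath(name):
    """Build an XPath 1.0 predicate testing membership in a class."""
    return (
        "contains(concat(' ', normalize-space(@class), ' '), ' %s ')" % name
    )


def naive_match(class_attr, name):
    """Mimic contains(@class, 'name'): a plain substring test."""
    return name in class_attr


def padded_match(class_attr, name):
    """Mimic the padded predicate built by class_xpath()."""
    # normalize-space() trims the ends and collapses whitespace runs
    normalized = " ".join(class_attr.split())
    return (" %s " % name) in (" %s " % normalized)


print(class_xpath("uploader"))
# The naive test wrongly accepts a different class sharing a prefix
print(naive_match("uploader-list", "uploader"))
print(padded_match("uploader-list", "uploader"))
# The padded test copes with several classes and stray whitespace
print(padded_match("  name   uploader ", "uploader"))
```

It works, but compare it with a CSS selector like `.uploader` and you'll see why I call it crappy.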

I think the PTS output needs very few changes in order to produce valid RDFa, i.e. RDF inside XHTML, which would make the PTS a reference for Semantic Web metadata about Debian packages.

I've summarized a few pointers here: http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=585740

If only I had the time to hack such changes...

What do you think?

Comment by berger_o [www-public.it-sudparis.eu] Sun 13 Jun 2010 03:59:56 PM CEST

Yes, that would be sensible. Also, as you observed in the bug report, it wouldn't be hard to do (which is unsurprising, as the current implementation relies precisely on microformats, but back then RDFa was not completely drafted yet).

However, I don't expect to be able to work on that myself anytime soon :-), patches welcome!

Comment by zack Mon 14 Jun 2010 09:56:47 AM CEST