all your ctag (and checksum) are belong to us

A few months after the initial announcement, here is some news about the sources.d.n service. I'm late in blogging this, but most of it was implemented by Matthieu Caneill and myself during DebConf13, which was a great DebConf, totally exceeding my expectations (and they were already fairly high!).

First, you might have noticed some user-visible changes:

  • there is now an advanced search page, which complements the already existing regex code search with the possibility of searching source files by their sha256, or the ctags defined therein

  • on the same topic, when browsing through a package and using regex search, you'll now search within that package by default, which lets you focus your searches more easily than before. (You can easily override this by editing the search box and removing the package: predicate.)

  • for the data geeks (or the wannabe host), there are now disk usage stats (note that they don't include the database size, though, see below for that)

  • the website also got a significant facelift, as part of which we have moved the detailed explanations of what the service is about out of your way. You now immediately get to the various browsing options.
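
To use the sha256 search you need the hex digest of the file you are looking for. Here is a minimal sketch, in Python, of how one might compute it locally; it should give the same digest that `sha256sum` prints. The function name `sha256_of_file` is purely illustrative and not part of the service.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks, so large source files need not fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" (end of file)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The resulting 64-character hex string is what you would paste into the sha256 field of the advanced search page.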

On the other hand, under the hood:

  • to implement ctags and sha256 searches we needed a serious DBMS, so we switched from SQLite to PostgreSQL.

    Again, for the data geeks: storing ctags/sha256 for all of sources.d.n content with decent indexes takes about 37 GB, for about 160 million rows in the ctags table and 20 million rows in the checksums one. (Filenames are currently duplicated between the two tables, so the DB disk size can probably be reduced somewhat.)

  • together with the switch to a serious DBMS, the update logic has been completely rewritten in Python (from Bash...), and should now be entirely transactional.

  • ... and given it was going to be Python anyhow, better to enjoy what it has to offer, no? So there is now a plugin mechanism that makes it easier to add extra data extractors, triggering them at each package update. Currently there are plugins for sha256sum, ctags, and sloccount (even though the latter is not yet exposed via the web interface). An added benefit is that if you want to deploy debsources elsewhere, you can easily disable the most time-consuming extractors: running ctags and sha256sum on the fabulous 3 chromium/libreoffice/linux is not for the faint of disks...

  • we now receive push updates from the Debian mirror network, so that you'll get updates on sources.d.n as soon as a package hits Debian mirrors (+ processing time, which is about 15-20 minutes on the average update run). Many thanks to Simon Paillard and Adam Lackorzynski for their help in setting this up.

  • thanks to a suggestion by kugel we have adopted Geany's conventions for filetype detection, and we now take into account both file extensions and shebang lines (when available)
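
To give an idea of what the plugin mechanism buys you, here is a minimal sketch in Python. It is not the actual debsources code: the names (`register`, `run_extractors`, the `enabled` set) are hypothetical, the package is faked as an in-memory dict, and `linecount` is only a toy stand-in for sloccount. The point is just that extractors are registered under a name, triggered for each updated package, and can be switched off by name on a deployment that cannot afford them.

```python
import hashlib

# Hypothetical registry: extractor name -> callable taking a package's files
EXTRACTORS = {}


def register(name):
    """Decorator that records an extractor function under `name`."""
    def decorator(func):
        EXTRACTORS[name] = func
        return func
    return decorator


@register("sha256sum")
def checksums(files):
    # `files` maps path -> bytes, standing in for an unpacked source tree
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


@register("linecount")
def linecount(files):
    # Toy stand-in for sloccount: number of lines per file
    return {path: len(data.splitlines()) for path, data in files.items()}


def run_extractors(files, enabled):
    """Run every enabled (and registered) extractor on one updated package."""
    return {
        name: EXTRACTORS[name](files)
        for name in sorted(enabled)
        if name in EXTRACTORS
    }


# A deployment that disables the expensive checksum extractor
pkg = {"hello.c": b"int main(void)\n{\n  return 0;\n}\n"}
results = run_extractors(pkg, enabled={"linecount"})
```

With only `linecount` enabled, `results` contains no `sha256sum` entry at all, which is the kind of saving you want when the package being updated is chromium, libreoffice, or linux.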

As usual, your bug reports (and patches!) are more than welcome; just check BUGS before reporting to avoid duplicates.
That's all!