Debian: watch your stats!

Over the past few weeks, Matthieu Caneill and I have worked quite a bit on Debsources. Now that we have deployed most of the new features on http://sources.debian.net, it's time for another "What's new with Debsources?" blog post. Here is what's new:

  • Debsources now knows about Debian suites, i.e. which package is in which "release" (stable, testing, unstable, ...). This knowledge is already useful for some of the other features below and will be used more in the future.

  • since last summer Debsources has been running sloccount on all unpacked source packages, together with ctags and du, but the resulting information wasn't exposed on the Web. This is now fixed. Each package now has an infobox (example) which shows: disk usage, archive area, suites, and sloccount with per-language breakdown. The new infobox also subsumes the old puny list of package links.

    You can easily embed the infobox in other webapps if you need to (example). Check the URL scheme doc for more info.

  • Debsources now gathers and plots accurate Debian source statistics, both overall and per-suite, both as current snapshots and as historical trends.

    (Yeah, I know, the charts are not particularly good looking ATM, but that's easy to change without impacting the rest. So if you're a matplotlib artist and willing to help, please step forward!)

  • many changes have also been going on at the plumbing layer to make the service less resource hungry and more maintainable, in view of a migration to the official Debian infrastructure, which I have started discussing with DSA in the meantime. Some highlights:

    • Debsources now has a rather comprehensive test suite, built using Nose. Most notably, we do test full update runs down to source unpacking (of a small subset of a Debian mirror), DB injection, and plugin execution --- which is quite neat.

    • the updater is now much faster (about 2x) and, in pathological cases, might require 10x less memory than before. Memory usage now peaks at around 300MB, even when injecting ctags for large packages such as linux, chromium, and libreoffice.

    • the DB schema went through several refactoring cycles, and now uses a separate file table to index all known source file paths. In the past, path information was duplicated across the checksums and ctags tables, which not only wasted DB space but also made the presence of file information conditional on enabling at least one of the two corresponding plugins. This is now fixed, and migrating the full DB has been quite "fun". Unfortunately, we've also added quite a few large-ish indexes, so the overall DB size has not changed significantly (currently at ~50GB), but at least queries are much faster :-)

      The next step on this front will be the addition of path-based searches, using the excellent Postgres trigram indexes.
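To make the file table refactoring above more concrete, here is a minimal sketch of the idea. It uses Python's built-in sqlite3 module rather than Postgres, and all table names, column names, and helper functions (add_file, build_demo_db, list_paths) are made up for illustration; this is not the actual Debsources schema. The point is only that each path is stored once, that the checksums and ctags rows refer to it by id, and that a file stays indexed even when no plugin has produced data for it.

```python
import sqlite3

# Hypothetical, simplified schema: names are illustrative assumptions,
# not the real Debsources tables (which live in Postgres).
SCHEMA = """
CREATE TABLE files (
    id      INTEGER PRIMARY KEY,
    package TEXT NOT NULL,
    path    TEXT NOT NULL,
    UNIQUE (package, path)
);
CREATE TABLE checksums (
    file_id INTEGER NOT NULL REFERENCES files(id),
    sha256  TEXT NOT NULL
);
CREATE TABLE ctags (
    file_id INTEGER NOT NULL REFERENCES files(id),
    tag     TEXT NOT NULL,
    line    INTEGER NOT NULL
);
"""


def add_file(conn, package, path):
    """Record a source file path once and return its id."""
    conn.execute(
        "INSERT OR IGNORE INTO files (package, path) VALUES (?, ?)",
        (package, path),
    )
    row = conn.execute(
        "SELECT id FROM files WHERE package = ? AND path = ?",
        (package, path),
    ).fetchone()
    return row[0]


def build_demo_db():
    """Create an in-memory DB with two files of a made-up package."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # A file known to both plugins: its path is still stored only once.
    main_c = add_file(conn, "hello", "src/main.c")
    conn.execute("INSERT INTO checksums VALUES (?, ?)", (main_c, "ab12"))
    conn.execute("INSERT INTO ctags VALUES (?, ?, ?)", (main_c, "main", 10))
    # A file with no plugin output at all is indexed anyway.
    add_file(conn, "hello", "README")
    return conn


def list_paths(conn, package):
    """Return all known paths of a package, whatever plugins are enabled."""
    rows = conn.execute(
        "SELECT path FROM files WHERE package = ? ORDER BY path", (package,)
    )
    return [r[0] for r in rows]
```

Calling list_paths(build_demo_db(), "hello") lists both files, even though README has neither a checksum nor any ctags attached.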

Want more? Sure, we'll be happy to deliver! But it'll happen faster if you help. Speaking of which: we've got Debsources into the new contributors game (see announcement) and we're looking forward to mentoring new contributors.