moar, and moar, and moar debsources stats

A while ago I announced the availability of several stats about Debian source code on http://sources.debian.net. Since then the statistical basis of those stats has grown a lot, and it now includes all historical Debian releases, from hamm (July 1998) onward. This makes it possible to appreciate macro-level evolution trends in Free Software, over a period of more than 15 years, through the eyes of a distro that sits at the nice intersection of the oldest, largest, and most reputed distros.

To get there I've added support for sticky suites to the plumbing layer of debsources, and then injected historical releases from http://archive.debian.org. The injection process took about a week (with no parallelism whatsoever, pretty slow disks, and sha256 checksums, ctags, and sloccount computed on all source files) and has been an "interesting" experience.
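For a rough idea of what the checksum part of that per-file work looks like, here is a minimal Python sketch. It is not the actual debsources code: the function names `sha256_of_file` and `checksum_tree` are made up for illustration, and the sketch assumes the source packages have already been unpacked into a plain directory tree. Like the injection described above, it is purely sequential.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks so that large files do not need to fit
    in memory all at once.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_tree(root):
    """Yield (relative_path, sha256) for every regular file under `root`.

    Symlinks and special files are skipped; directories and files are
    visited in sorted order so that the output is reproducible.
    """
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort steers os.walk's traversal order
        for name in sorted(filenames):
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue
            yield os.path.relpath(full, root), sha256_of_file(full)
```

Calling `dict(checksum_tree("/path/to/unpacked/sources"))` (the path is a placeholder) gives a mapping from relative file names to checksums. Doing this one file at a time over a whole archive is exactly the kind of thing that takes days on slow disks.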

When you go back decades in technology time, bit rot is just around the corner, and I've found my share while injecting archive.d.o into sources.d.n. In both cases the respective maintainers (Guillem and Ganneff, kudos) have been positive about and helpful in improving the situation, despite the low impact of the bugs I've found on the average user. That's quite important for the long-term preservation of digital information in general, and for the perennity of access to Free Software in the specific case of Debian.

While we are at it, I'm now maintaining a list of bugs affecting sources.d.n but belonging to other packages, in case you fancy helping out but are not a Python hacker. Interestingly enough, quite a few of those bugs are related to the fact that tools debsources uses (e.g. ctags, sloccount) are also starting to show their age.

You might wonder why buzz, rex, and bo are still missing from sources.d.n. That's in fact for similar reasons. Before hamm, Debian didn't have complete archive coverage in terms of Sources indexes and .dsc files. Given that debsources relies on both to extract source packages, it first needs to grow an additional abstraction layer that can cope with their absence. It's SMOP, and planned.

And now let's have fun with ctags bombs.

Yours truly,
Stefano “Indiana” Zacchiroli
(credits: KiBi, #debian-ftp)

I couldn't find a link to the actual stats. But from the Debian Wiki I gather that they are here: http://sources.debian.net/stats/

Comment by stevenc Sun 06 Apr 2014 03:08:11 PM CEST

Ah, thanks, I did indeed forget to add a direct link to the stats, and only linked to the previous post on the subject.

Fixed now, thanks again!

Comment by zack Sun 06 Apr 2014 06:38:59 PM CEST
seems there's a bug in the graphs for 5 and 20 years...
Comment by Jean-Pierre Sun 06 Apr 2014 07:12:42 PM CEST

I don't see any bug.
But it's hard to know for sure as you didn't specify which bug you're seeing...

Comment by zack Sun 06 Apr 2014 08:54:07 PM CEST
for eg 'disk usage', on the 20 years graph the x axis legend spans aug2013-apr2014. Its shape is basically the same as the 1 and 5 years graphs, only with fewer sample points. Maybe there's something I'm really misunderstanding here?
Comment by Jean-Pierre Mon 07 Apr 2014 05:24:18 PM CEST

Right. So, it's not a bug in the data, but arguably a bug in how it is presented --- we can definitely do better on that front.

First of all, the 20-years data graphs are not meant to cover the historical evolution of Debian releases. Those data are currently available only at per-release pages.

The 20-years data are rather meant to cover the historical evolution of the sources.d.n dataset. We have only about 1 year of history, as sources.d.n didn't exist earlier on. That could be made clearer by having longer x-axes, going back 20 years; but the data would be invariably 0 for the years before 2013, so I'm not really sure what we will really gain by doing that.

Comment by zack Wed 09 Apr 2014 01:09:55 PM CEST