At FOSDEM, John Sullivan delivered an interesting talk titled "Is copyleft being framed?" to examine claims about the decline of GPL-d software. (Slides are available.) The crux of the talk is the analysis he performed on the Debian archive to discover how much of the software we distribute is covered by the GPL, LGPL, or AGPL ("GPL-d" for short in the remainder).

John's talk steps into an interesting and long-running debate (a recent summary of which is available in this ITWire article). The most interesting part is the discrepancy between John's results and Blackduck's; the latter are often used to argue that the popularity of the GPL is declining. That might be the case. Or not. The more analyses we do to find out, the better.

The underlying assumption of John's work is that Debian is a representative sample of the Free Software out there, which I think is a reasonable assumption. I find the analysis presented in the talk completely satisfactory from a purely scientific point of view. The same cannot be said about Blackduck's results: both their methods and data are secret, making it impossible to reproduce their experiments. Highly unscientific.

Still, John's results are surprising: as much as 87 percent of Lenny's packages and 93 percent of Squeeze's are GPL-d. That seems a lot. Puzzled by that, John discussed the issue with me before his talk, in search of pitfalls in his methods or data. Finding none, I pointed him to the almighty DktrKranz for some extra review, who found nothing either. To stay on the safe side, even during his talk John called for independent reviews of his results. What could be wrong?

The tool used to gather the data is license-count from the debian-policy package. Input data are the debian/copyright files of all Debian source packages. If license-count is not buggy, our debian/copyright files might be. One thing that occurred to me only a few days ago is the habit of declaring a different license for the Debian packaging (the files under debian/) than for the software being packaged. That's a bad habit, because it might cause unwanted license mixtures via patches that live under debian/, but I've seen several occurrences of it in the Debian archive. To name and (self-)shame: I've also been guilty of it in the past, when I was young™.

Is that enough to skew the results and overestimate GPL-d software? I don't think so, I hope not, but ultimately… I don't know. It'd be nice to rule out the possibility entirely. So if anyone is willing to do some sampling of affected debian/copyright files and propose patches for license-count to exclude those "false positives", please shout. (As a bonus point: that would also help to make sounder decisions in the typical use case of license-count, i.e. deciding when a license should be added to /usr/share/common-licenses.)

Other independent reviews of the results are equally welcome.

Note: the above, as well as John's analysis, would be a trivial exercise if DEP-5 were already widely deployed in the Debian archive.
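
To give an idea of why DEP-5 would make this easy, here is a minimal sketch in Python. It is not part of license-count and not how license-count works; the function name upstream_licenses and the sample file are made up for illustration. It assumes a machine-readable debian/copyright file, reads only the first line of each field, and treats a stanza as "packaging only" when every pattern on the first line of its Files field starts with debian/.

```python
import re


def upstream_licenses(copyright_text):
    """Return the license short names of all Files stanzas that are not
    limited to debian/ (hypothetical helper, simplified DEP-5 parsing)."""
    licenses = set()
    # Stanzas are separated by blank lines.
    for stanza in re.split(r"\n\s*\n", copyright_text.strip()):
        fields = {}
        for line in stanza.splitlines():
            # Skip continuation lines: only the first line of a field is used.
            if line[:1].isspace() or ":" not in line:
                continue
            name, _, value = line.partition(":")
            fields[name.strip().lower()] = value.strip()
        if "files" not in fields or "license" not in fields:
            continue  # header stanza or stand-alone License stanza
        patterns = fields["files"].split()
        if patterns and all(p.startswith("debian/") for p in patterns):
            continue  # license declared for the packaging only
        licenses.add(fields["license"])
    return licenses


# Made-up example: Expat upstream code with GPL-2+ packaging.
SAMPLE = """Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: example

Files: *
Copyright: 2011 Upstream Author
License: Expat

Files: debian/*
Copyright: 2012 Some Packager
License: GPL-2+
"""

print(upstream_licenses(SAMPLE))
```

On the sample above, only the upstream license is reported, so a package like this one would no longer show up as GPL-d merely because of its packaging. A real patch would of course need to cope with free-form debian/copyright files, which is exactly the hard part today.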


Update: added link to John's slides.
Update 19/02/2012: Russ Allbery, author of license-count, posted a much more likely cause of data skew in John's analysis: double counting among the different types of copyleft licenses.
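
To illustrate the general effect of double counting (this is a toy sketch in Python with made-up package data, not a reconstruction of what license-count or John's analysis actually do): if a package mentions more than one license of the GPL family and the per-license counts are simply added up, that package is counted several times. The names GPL_FAMILY and family_share are hypothetical.

```python
GPL_FAMILY = {"GPL", "LGPL", "AGPL"}


def family_share(package_licenses):
    """Return (summed, distinct) shares of GPL-family packages.

    summed:   per-license counts added together, so a package mentioning
              both GPL and LGPL is counted twice
    distinct: each package counted at most once
    """
    total = len(package_licenses)
    summed = sum(len(lics & GPL_FAMILY) for lics in package_licenses.values())
    distinct = sum(1 for lics in package_licenses.values() if lics & GPL_FAMILY)
    return summed / total, distinct / total


# Made-up data: four packages and the licenses their copyright files mention.
packages = {
    "a": {"GPL", "LGPL"},
    "b": {"GPL"},
    "c": {"BSD"},
    "d": {"GPL", "LGPL", "AGPL"},
}

summed, distinct = family_share(packages)
print(summed, distinct)
```

With this toy data the summed figure exceeds 100 percent of the packages, while only three packages out of four actually carry a GPL-family license. Counting each package once is the safer way to compute a share.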