At FOSDEM, John Sullivan delivered an interesting talk titled "Is copyleft being framed?", which examines claims about the decline of GPL-d software. (Slides are available.) The crux of the talk is the analysis he performed on the Debian archive to measure how much of the software we distribute is covered by the GPL, LGPL, or AGPL ("GPL-d" for short in the remainder).
John's talk steps into an interesting and long-running debate (a recent summary of which is available in this ITWire article). The most interesting part is the discrepancy between John's results and Blackduck's; the latter are often used to argue that the popularity of the GPL is declining. That might be the case. Or not. The more analyses we do to find out, the better.
The underlying assumption of John's work is that Debian is a representative sample of the Free Software out there, which I think is reasonable. I find the analysis presented in the talk completely satisfactory from a purely scientific point of view. The same cannot be said of Blackduck's results: both their methods and data are secret, making it impossible to reproduce their experiments. Highly unscientific.
Still, John's results are surprising: as much as 87 percent of Lenny's packages and 93 percent of Squeeze's are GPL-d. That seems like a lot. Puzzled by that, John discussed the issue with me before his talk, in search of pitfalls in his methods or data. Finding none, I pointed him to the almighty DktrKranz for some extra review, who found nothing either. To stay on the safe side, John called for independent reviews of his results even during his talk. What could be wrong?
The tool used to gather the data is license-count from the debian-policy package. Input data are the debian/copyright files of all Debian source packages. If license-count is not buggy, our debian/copyright files might be. One thing that occurred to me only a few days ago is the habit of declaring a different license for the Debian packaging (the files under debian/) than for the software being packaged. That's a bad habit, because it might cause unwanted license mixtures via patches that live under debian/, but I've seen several occurrences of it in the Debian archive. To name and (self-)shame: I've also been guilty of it in the past, when I was young™.
Is that enough to skew the results and overestimate GPL-d software? I don't think so, I hope not, but ultimately… I don't know. It'd be nice to rule out the possibility entirely. So if anyone is willing to do some sampling of affected debian/copyright files and propose patches for license-count to exclude those "false positives", please shout. (As a bonus, that would also help to make sounder decisions in the typical use case of license-count, i.e. deciding when a license should be added to /usr/share/common-licenses.)
Other independent reviews of the results are equally welcome.
Note: the above, as well as John's analysis, would be a trivial exercise if DEP-5 were already widely deployed in the Debian archive.
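To give an idea of why machine-readable copyright files would make this easy, here is a minimal sketch in Python. It assumes a debian/copyright file written in the DEP-5 style (blank-line separated stanzas with Files: and License: fields), and it is not how license-count works. The function names, the skip_packaging parameter, and the sample file are all made up for illustration.

```python
import re

def parse_stanzas(text):
    """Split DEP-5 style text into a list of {field: value} stanzas."""
    stanzas = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields, current = {}, None
        for line in block.splitlines():
            if line.startswith((" ", "\t")) and current:
                # Continuation line: belongs to the previous field.
                fields[current] += "\n" + line.strip()
            elif ":" in line:
                name, _, value = line.partition(":")
                current = name.strip()
                fields[current] = value.strip()
        if fields:
            stanzas.append(fields)
    return stanzas

def declared_licenses(text, skip_packaging=True):
    """Return the license short names declared in Files stanzas.

    With skip_packaging=True (hypothetical option), stanzas whose
    patterns all live under debian/ are ignored, so the license of
    the packaging is not attributed to the packaged software.
    """
    licenses = set()
    for stanza in parse_stanzas(text):
        if "Files" not in stanza or "License" not in stanza:
            continue
        patterns = stanza["Files"].split()
        if skip_packaging and patterns and all(
            p.startswith("debian/") for p in patterns
        ):
            continue
        # The short name is the first line of the License field.
        short_name = (stanza["License"].splitlines() or [""])[0].strip()
        if short_name:
            licenses.add(short_name)
    return licenses

# Made-up sample: Expat upstream, GPL-2+ packaging.
SAMPLE = """\
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: example

Files: *
Copyright: 2011 Upstream Author
License: Expat

Files: debian/*
Copyright: 2012 Some Packager
License: GPL-2+
 This packaging is free software ...
"""

print(declared_licenses(SAMPLE))
print(declared_licenses(SAMPLE, skip_packaging=False))
```

On the sample, the first call reports only Expat, while the second also reports GPL-2+, which is the kind of "false positive" discussed above. Free-form debian/copyright files offer no such structure, which is why sampling by hand is needed today.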
Update: added link to John's slides.
Update 19/02/2012: Russ Allbery, author of license-count, posted a way more likely cause of data skew in John's analysis: double counting among the different types of copyleft licenses.
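As a generic illustration of how double counting can inflate a percentage (this is not a reconstruction of Russ's explanation or of John's actual method), the Python sketch below compares two ways of computing a copyleft share. The package names and license sets are entirely made up.

```python
from collections import Counter

# Hypothetical data: package -> license families found in its copyright file.
PACKAGES = {
    "alpha": {"GPL", "LGPL"},
    "beta": {"GPL"},
    "gamma": {"BSD"},
    "delta": {"GPL", "LGPL", "AGPL"},
}
COPYLEFT = {"GPL", "LGPL", "AGPL"}

def summed_share(packages):
    """Add up per-license package counts: counts a package once per license."""
    per_license = Counter(
        lic for licenses in packages.values() for lic in licenses
    )
    return sum(per_license[lic] for lic in COPYLEFT) / len(packages)

def distinct_share(packages):
    """Count each package at most once, however many copyleft licenses it has."""
    hits = sum(1 for licenses in packages.values() if licenses & COPYLEFT)
    return hits / len(packages)

print(summed_share(PACKAGES))    # 1.5
print(distinct_share(PACKAGES))  # 0.75
```

With these made-up numbers, summing the GPL, LGPL, and AGPL tallies gives a "share" above 100%, while counting distinct packages gives 75%. Any package mentioning more than one copyleft license pushes the two figures apart.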