GPL-d Debian software skew (?)
http://upsilon.cc/~zack/blog/posts/2012/02/gpl_d_debian_software_skew/
Published 2012-02-18T16:33:05Z, updated 2012-02-19T21:22:06Z
<p>At <a href="http://fosdem.org/2012/">FOSDEM</a>, John Sullivan
delivered an interesting talk titled <a href=
"http://fosdem.org/2012/schedule/event/is_copyleft_being_framed">Is
copyleft being framed?</a> that examines claims about the decline
of GPL-d software. (<a href=
"http://info9.net/wiki/fosdem/LegalIssuesDevRoom/Speakers/sullivan_slides.pdf">Slides</a>
are available.) The crux of the talk is the analysis he performed
on the Debian archive to measure how much of the software we
distribute is covered by the GPL, LGPL, or AGPL ("GPL-d" for short
in the remainder).</p>
<p>John's talk steps into an interesting and long-running debate (a
recent summary of which is available in this <a href=
"http://www.itwire.com/business-it-news/open-source/52838-gpl-use-in-debian-on-the-rise-study">
ITWire article</a>). The most interesting part is the discrepancy
between John's results and <a href=
"http://www.blackducksoftware.com/">Blackduck</a>'s, which are
often used to <a href=
"http://blogs.the451group.com/opensource/2011/12/15/on-the-continuing-decline-of-the-gpl/">
argue that the popularity of the GPL is declining</a>. That
might be the case. Or not. The more analyses we do to find it out,
the better.</p>
<p>The underlying assumption of John's work is that Debian is a
representative sample of the Free Software out there, which I think
is a reasonable assumption. I find the analysis presented in the
talk completely satisfactory from a purely scientific point of
view. The same cannot be said about Blackduck's results: both their
methods and data are secret, making it impossible to reproduce
their experiments. Highly <em>un</em>scientific.</p>
<p>Still, John's results are surprising: as much as 87 percent of
Lenny's packages and 93 percent of Squeeze's are GPL-d. That seems
<em>a lot</em>. Puzzled by that, John discussed the issue with me
before his talk, in search of pitfalls in his methods or
data. Finding none, I pointed him to the almighty <a href=
"http://dktrkranz.wordpress.com/">DktrKranz</a> for some extra
review, and he found nothing either. To stay on the safe side, John
called for independent reviews of his results even during his talk.
<strong>What could be wrong?</strong></p>
<p>The tool used to gather the data is <a href=
"http://anonscm.debian.org/gitweb/?p=dbnpolicy/policy.git;a=blob;f=tools/license-count;hb=HEAD">
license-count</a> from the <code>debian-policy</code> package.
Input data are the <code>debian/copyright</code> files of all
Debian source packages. If <code>license-count</code> is not
buggy, our <code>debian/copyright</code> files might be. One thing
that occurred to me only a few days ago is the <strong>habit of
declaring a different license for Debian packaging</strong> (the
files under <code>debian/</code>) than for the software being packaged
itself. That's a bad habit—because it might cause unwanted license
mixtures via patches that live under <code>debian/</code>—but I've
seen several occurrences of it in the Debian archive. To name and
(self-)shame: I've also been guilty of it in the past, <em>when I
was young™</em>.</p>
<p><strong>Is that reason enough to skew results and overestimate
GPL-d software?</strong> I don't think so, I hope not, but
ultimately… I don't know. It'd be nice to rule out the possibility
entirely. So if anyone is willing to do some sampling of affected
<code>debian/copyright</code> files and propose patches for
<code>license-count</code> to exclude those "false positives",
please shout. (As a bonus point: that would also help to make
sounder decisions for the typical use case of
<code>license-count</code>, i.e. deciding when a license should be
added to <code>/usr/share/common-licenses</code>.)</p>
<p>Other independent reviews of the results are equally
welcome.</p>
<p>Note: the above, as well as John's analysis, would be a trivial
exercise if <a href="http://dep.debian.net/deps/dep5/">DEP-5</a>
were already widely deployed in the Debian archive.</p>
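<p>To give an idea of how simple the sampling could become with
DEP-5, here is a minimal Python sketch. It is not a patch for
<code>license-count</code>: the function names, the sample file, and
the rule "a stanza whose <code>Files</code> patterns all start with
<code>debian/</code> is packaging-only" are assumptions made for
illustration, and the parser only handles the simple field layout
shown here.</p>

```python
import re

# Made-up DEP-5 style file: upstream is Expat, the packaging is GPL-2+.
SAMPLE = """\
Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: example

Files: *
Copyright: 2010 Upstream Author
License: Expat

Files: debian/*
Copyright: 2011 Some Packager
License: GPL-2+
"""


def parse_stanzas(text):
    """Split the file into stanzas, each a dict mapping field name to value."""
    stanzas = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields, current = {}, None
        for line in block.splitlines():
            if line[:1] in (" ", "\t") and current:
                # Continuation line: append to the field being read.
                fields[current] += "\n" + line.strip()
            elif ":" in line:
                current, _, value = line.partition(":")
                fields[current] = value.strip()
        if fields:
            stanzas.append(fields)
    return stanzas


def upstream_licenses(text, skip_packaging=True):
    """Collect short license names from Files stanzas.

    With skip_packaging=True, stanzas covering only debian/ are ignored
    (assumed heuristic for spotting packaging-only licenses).
    """
    licenses = set()
    for stanza in parse_stanzas(text):
        if "Files" not in stanza or "License" not in stanza:
            continue  # header stanza or stand-alone License stanza
        patterns = stanza["Files"].split()
        packaging_only = bool(patterns) and all(
            p.startswith("debian/") for p in patterns
        )
        if skip_packaging and packaging_only:
            continue
        # The first line of the License field is the short name.
        licenses.add(stanza["License"].splitlines()[0].strip())
    return licenses
```

<p>Comparing <code>upstream_licenses(text)</code> with
<code>upstream_licenses(text, skip_packaging=False)</code> over a
sample of packages would show how often the packaging license alone
makes a package look GPL-d.</p>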
<hr />
<p><strong>Update</strong>: added link to John's slides<br />
<strong>Update 19/02/2012</strong>: Russ Allbery, author of
<code>license-count</code>, <a href=
"http://www.eyrie.org/~eagle/journal/2012-02/002.html">posted</a> a
far more likely cause of data skew in John's analysis: double
counting among the different types of copyleft licenses.</p>
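<p>A toy illustration of how such double counting can inflate a
total, in Python. The package names and license sets are made up,
and this is only my reading of the effect, not a reproduction of
John's numbers or of what <code>license-count</code> does
internally.</p>

```python
# Hypothetical per-package license sets, for illustration only.
PACKAGES = {
    "pkg-a": {"GPL", "LGPL"},
    "pkg-b": {"GPL"},
    "pkg-c": {"BSD"},
    "pkg-d": {"LGPL", "AGPL"},
}
COPYLEFT = {"GPL", "LGPL", "AGPL"}


def summed_share(packages):
    """Add up per-license counts: a package with two GPL-d licenses counts twice."""
    hits = sum(len(licenses & COPYLEFT) for licenses in packages.values())
    return hits / len(packages)


def distinct_share(packages):
    """Count each package at most once, however many GPL-d licenses it lists."""
    hits = sum(1 for licenses in packages.values() if licenses & COPYLEFT)
    return hits / len(packages)
```

<p>On this made-up data the summed figure exceeds 100 percent of the
packages, while the distinct count stays at three out of four.</p>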