<p>zack's home page: <a href="http://upsilon.cc/~zack/blog/posts/2013/09/sources.debian.net_-_advanced_search_and_other_news/">sources.debian.net - advanced search and other news</a> (2013-09-17)</p>
<h1>all your ctag (and checksum) are belong to us</h1>
<p>A few months after the <a href=
"http://upsilon.cc/~zack/blog/posts/2013/07/introducing_sources.debian.net/">initial
announcement</a>, here is some news about the <a href=
"http://sources.debian.net">sources.d.n</a> service. I'm late in
blogging about this, but most of it was implemented by Matthieu
Caneill and me during <a href=
"http://debconf13.debconf.org/">DebConf13</a>, which was a great
DebConf, totally exceeding my expectations (and they were already
fairly high!).</p>
<p>First, you might have noticed some <em>user-visible
changes</em>:</p>
<ul>
<li>
<p>there is now an <a href=
"http://sources.debian.net/advancedsearch/"><strong>advanced
search</strong> page</a>, which complements the already existing
<a href="http://codesearch.debian.net/">regex code search</a> with
the possibility of searching source files by their
<strong>sha256</strong>, or the <strong>ctags</strong> defined
therein</p>
</li>
<li>
<p>on the same topic, when browsing through a package and using
regex search, you'll now search by default within <em>that</em>
package, which lets you focus your searches more easily than before.
(You can easily override this by editing the search box and
removing the <code>package:</code> predicate.)</p>
</li>
<li>
<p>for the data geeks (or the wannabe host), there are now <a href=
"http://sources.debian.net/about/stats/"><strong>disk usage
stats</strong></a> (note that they don't include the database size;
see below for that)</p>
</li>
<li>
<p>the website also got a significant <strong>facelift</strong>, as
part of which we have moved the detailed explanations of what the
service is about out of your way. You now immediately get to the
various browsing options.</p>
</li>
</ul>
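<p>To try the sha256 search on a file you have locally, you first need
its checksum. The following is a minimal sketch using only Python's
standard <code>hashlib</code>; the function name
<code>sha256_of_file</code> is made up for this example and is not part
of debsources. The resulting hex digest is the kind of value you would
paste into the advanced search form.</p>

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of the file at `path`.

    The file is read in chunks so that large source files do not
    need to fit in memory. (Hypothetical helper, for illustration.)
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

<p>For a single file, the <code>sha256sum</code> command line tool
gives the same digest.</p>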
<p>On the other hand, <em>under the hood</em>:</p>
<ul>
<li>
<p>to implement ctags and sha256 searches we needed a serious DBMS,
so we switched from SQLite to <strong>PostgreSQL</strong>.</p>
<p>Again, for the data geeks: storing ctags/sha256 for all of
sources.d.n content with decent indexes takes about 37 GB, for
about 160 million rows in the ctags table and 20 million rows in
the checksums one. (Filenames are currently duplicated between the
two tables, so the DB disk size can probably be reduced
somewhat.)</p>
</li>
<li>
<p>together with the switch to a serious DBMS, the update logic
has been completely rewritten in Python (from Bash...), and should
now be entirely transactional.</p>
</li>
<li>
<p>... and given it was going to be Python anyhow, better to enjoy
what it has to offer, no? So there is now a <strong>plugin
mechanism</strong> that makes it easier to add extra data
extractors, triggering them at each package update. Currently there
are plugins for sha256sum, ctags, and sloccount (even though the
latter is not yet exposed via the web interface). An added benefit
of this is that if you want to deploy debsources elsewhere, you can
easily disable the most time-consuming extractors: running ctags
<em>and</em> sha256sum on the fabulous three (chromium, libreoffice,
linux) is not for the faint of disks...</p>
</li>
<li>
<p>we now receive <strong>push updates</strong> from the Debian
mirror network, so that you'll get updates on sources.d.n as soon
as a package hits Debian mirrors (+ processing time, which is about
15-20 minutes on the average update run). Many thanks to Simon
Paillard and Adam Lackorzynski for their help in setting this
up.</p>
</li>
<li>
<p>thanks to a <a href=
"https://lwn.net/Articles/557371/">suggestion by kugel</a> we have
adopted <a href="http://www.geany.org/">Geany</a>'s conventions for
filetype detection, and we now take into account both file
extensions and shebang lines (when available)</p>
</li>
</ul>
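<p>To give an idea of how a plugin mechanism of this kind can work,
here is a small sketch. It is <em>not</em> the actual debsources code:
the registry <code>EXTRACTORS</code>, the <code>extractor</code>
decorator, <code>on_package_update</code>, and the two lookup tables are
all hypothetical names invented for the example, and the tables are
tiny stand-ins rather than Geany's real filetype data. It registers two
extractors (a sha256 one and a filetype one that tries the file
extension first and falls back to the shebang line, an ordering assumed
here), runs them on each file of an updated package, and lets a
deployment switch off the expensive ones.</p>

```python
import hashlib
import os

# Hypothetical registry: plugin name -> callable(path) -> extracted value.
EXTRACTORS = {}


def extractor(name):
    """Decorator that registers a function as a data extractor plugin."""
    def register(func):
        EXTRACTORS[name] = func
        return func
    return register


@extractor("sha256")
def sha256_extractor(path):
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


# Illustrative tables only; real filetype conventions are much larger.
EXTENSIONS = {".py": "Python", ".sh": "Sh", ".c": "C"}
INTERPRETERS = {"python": "Python", "python3": "Python",
                "sh": "Sh", "bash": "Sh"}


@extractor("filetype")
def filetype_extractor(path):
    # 1. Try the file extension (assumed to take precedence here).
    ext = os.path.splitext(path)[1].lower()
    if ext in EXTENSIONS:
        return EXTENSIONS[ext]
    # 2. Fall back to the shebang line, when there is one.
    with open(path, "rb") as f:
        first = f.readline().decode("utf-8", errors="replace").strip()
    if first.startswith("#!"):
        words = first[2:].split()
        if words:
            interp = os.path.basename(words[0])
            # Handle "#!/usr/bin/env python3" style shebangs.
            if interp == "env" and len(words) > 1:
                interp = os.path.basename(words[1])
            return INTERPRETERS.get(interp)
    return None


def on_package_update(paths, disabled=()):
    """Run every enabled extractor on every file of an updated package."""
    results = {}
    for path in paths:
        results[path] = {
            name: func(path)
            for name, func in EXTRACTORS.items()
            if name not in disabled
        }
    return results
```

<p>With this shape, a deployment short on disk or CPU could call
<code>on_package_update(paths, disabled=("sha256",))</code> and still
get the cheaper data, and adding a new extractor is a matter of
writing one more decorated function.</p>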
<p>As usual, your bug reports (and patches!) are more than
welcome; just check <a href=
"http://anonscm.debian.org/gitweb/?p=qa/debsources.git;a=blob;f=BUGS;hb=refs/heads/bugs">
BUGS</a> before reporting to avoid duplicates.<br />
That's all!</p>