documents Here is a list of my academic papers, classified by type of publication and in reverse chronological order:

international, peer-reviewed journal articles
editorials
book chapters
international, peer-reviewed conference proceedings
international, peer-reviewed workshop proceedings
national, peer-reviewed journal articles
national, peer-reviewed conference and workshop procedings
technical reports
dissertations
miscellanea

You might also be interested in my author profiles on DBLP and Google Scholar.

international, peer-reviewed journal articles

[.pdf] [.bib] doi> Antonio Boffa, Roberto Di Cosmo, Paolo Ferragina, Andrea Guerra, Giovanni Manzini, Giorgio Vinciguerra, Stefano Zacchiroli. On the Compressibility of Large-scale Source Code Datasets. To appear in Journal of Systems and Software, volume 227, pp. 112429. ISSN 0164-1212, Elsevier, 2025. Abstract...

Abstract: Storing ultra-large amounts of unstructured data (often called objects or blobs) is a fundamental task for several object-based storage engines, data warehouses, data-lake systems, and key–value stores. These systems cannot currently leverage similarities between objects, which could be vital in improving their space and time performance. An important use case in which we can expect the objects to be highly similar is the storage of large-scale versioned source code datasets, such as the Software Heritage Archive (Di Cosmo and Zacchiroli, 2017). This use case is particularly interesting given the extraordinary size (1.5 PiB), the variegated nature, and the high repetitiveness of the at-issue corpus. In this paper we discuss and experiment with content- and context-based compression techniques for source-code collections that tailor known and novel tools to this setting in combination with state-of-the-art general-purpose compressors and the information coming from the Software Heritage Graph. We experiment with our compressors over a random sample of the entire corpus, and four large samples of source code files written in different popular languages: C/C++, Java, JavaScript, and Python. We also consider two scenarios of usage for our compressors, called Backup and File-Access scenario, where the latter adds to the former the support for single file retrieval. As a net result, our experiments show (i) how much “compressible” each language is, (ii) which content- or context-based techniques compress better and are faster to (de)compress by possibly supporting individual file access, and (iii) the ultimate compressed size that, according to our estimate, our best solution could achieve in storing all the source code written in these languages and available in the Software Heritage Archive: namely, in 3 TiB (down from their original 78 TiB total size, with an average compression ratio of 4%).
[.pdf] [.bib] doi> Andrea Gurioli, Maurizio Gabbrielli, Stefano Zacchiroli. Stylometry for real-world expert coders: a zero-shot approach. In PeerJ Computer Science, 10:e2429. ISSN 2167-8359, PeerJ. 2024. Abstract...

Abstract: Code stylometry is the application of stylometry techniques to determine the authorship of software source code snippets. It is used in the industry to address use cases like plagiarism detection, code audits, and code review assignments. Most works in the code stylometry literature use machine learning techniques and (1) rely on datasets coming from in vitro coding competition for training, and (2) only attempt to recognize authors present in the training dataset (in-distribution authors). In this work we give a fresh look at code stylometry and challenge both these assumptions: (1) we recognize expert authors who contribute to real-world open-source projects, and (2) we show how to accurately recognize authors not present in the training set (out-distribution authors). We assemble a novel open dataset of code snippets for code stylometry tasks consisting of 114,400 code snippets, authored by 104 authors having contributed 1,100 snippets each. We develop a K-nearest neighbors algorithm (k-NN) classifier for the code stylometry task and train it on the dataset. Our system achieves a top accuracy of 69% among five randomly selected in-distribution authors, thus improving state of the art by more than 20%. We also show that when moving from in-distribution to out-distribution authors, the classification performances of the k-NN classifier remain the same, achieving a top accuracy of 71% among five randomly-selected out-distribution authors.
[.pdf] [.bib] doi> Annalí Casanueva, Davide Rossi, Stefano Zacchiroli, Théo Zimmermann. The Impact of the COVID-19 Pandemic on Women's Contribution to Public Code. To appear in Empirical Software Engineering, volume 30, article number: 25. ISSN 1382-3256, Springer. 2025. Abstract...

Abstract: Despite its promise of openness and inclusiveness, the development of free and open source software (FOSS) remains significantly unbalanced in terms of gender representation among contributors. To assist open source project maintainers and communities in addressing this imbalance, it is crucial to understand the causes of this inequality. In this study, we aim to establish how the COVID-19 pandemic has influenced the ability of women to contribute to public code. To do so, we use the Software Heritage archive, which holds the largest dataset of commits to public code, and the difference in differences (DID) methodology from econometrics that enables the derivation of causality from historical data. Our findings show that the COVID-19 pandemic has disproportionately impacted women’s ability to contribute to the development of public code, relatively to men. Further, our observations of specific contributor subgroups indicate that COVID-19 particularly affected women hobbyists, identified using contribution patterns and email address domains.
[.pdf] [.bib] doi> Stefano Balla, Maurizio Gabbrielli, Stefano Zacchiroli. Code stylometry vs formatting and minification. In PeerJ Computer Science, 10:e2142. ISSN 2167-8359, PeerJ. 2024. Abstract...

Abstract: The automatic identification of code authors based on their programming styles—known as authorship attribution or code stylometry—has become possible in recent years thanks to improvements in machine learning-based techniques for author recognition. Once feasible at scale, code stylometry can be used for well-intended or malevolent activities, including: identifying the most expert coworker on a piece of code (if authorship information goes missing); fingerprinting open source developers to pitch them unsolicited job offers; de-anonymizing developers of illegal software to pursue them. Depending on their respective goals, stakeholders have an interest in making code stylometry either more or less effective. To inform these decisions we investigate how the accuracy of code stylometry is impacted by two common software development activities: code formatting and code minification. We perform code stylometry on Python code from the Google Code Jam dataset (59 authors) using a code2vec-based author classifier on concrete syntax tree (CST) representations of input source files. We conduct the experiment using both CSTs and ASTs (abstract syntax trees). We compare the respective classification accuracies on: (1) the original dataset, (2) the dataset formatted with Black, and (3) the dataset minified with Python Minifier. Our results show that: (1) CST-based stylometry performs better than AST-based (51.00%→68%), (2) code formatting makes a significant dent (15%) in code stylometry accuracy (68%→53%), with minification subtracting a further 3% (68%→50%). While the accuracy reduction is significant for both code formatting and minification, neither is enough to make developers non-recognizable via code stylometry.
[.pdf] [.bib] doi> Laure Muselli, Mathieu O'Neil, Fred Pailler, Stefano Zacchiroli. Subverting or preserving the institution: Competing IT firm and foundation discourses about open source. To appear in New Media and Society. Early access. ISSN 1461-4448, SAGE Publishing. 2024. Abstract...

Abstract: The data economy depends on digital infrastructure produced in self-managed projects and communities. To understand how information technology (IT) firms communicate to a volunteer workforce, we examine IT firm and foundation employee discourses about open source. We posit that organizations employ rhetorical strategies to advocate for or resist changing the meaning of this institution. Our analysis of discourses collected at three open source professional conferences in 2019 is complemented by computational methods, which generate semantic clusters from presentation summaries. In terms of defining digital infrastructure, business models, and the firm-community relationship, we find a clear division between the discourses of large firm and consortia foundation employees, on one hand, and small firm and non-profit foundation employees, on the other. These divisions reflect these entities’ roles in the data economy and levels of concern about predatory “Big Tech” practices, which transform common goods to be shared into proprietary assets to be sold.
[.pdf] [.bib] doi> Jesus M. Gonzalez-Barahona, Sergio Montes-Leon, Gregorio Robles, Stefano Zacchiroli. The Software Heritage License Dataset (2022 Edition). In Empirical Software Engineering. volume 28, issue 6, article number: 147. ISSN 1382-3256, Springer. 2023. Abstract...

Abstract: Context: When software is released publicly, it is common to include with it either the full text of the license or licenses under which it is published, or a detailed reference to them. Therefore public licenses, including FOSS (free, open source software) licenses, are usually publicly available in source code repositories. Objective: To compile a dataset containing as many documents as possible that contain the text of software licenses, or references to the license terms. Once compiled, characterize the dataset so that it can be used for further research, or practical purposes related to license analysis. Method: Retrieve from Software Heritage—the largest publicly available archive of FOSS source code—all versions of all files whose names are commonly used to convey licensing terms. All retrieved documents will be characterized in various ways, using automated and manual analyses. Results: The dataset consists of 6.9 million unique license files. Additional metadata about shipped license files is also provided, making the dataset ready to use in various contexts, including: file length measures, MIME type, SPDX license (detected using ScanCode), and oldest appearance. The results of a manual analysis of 8102 documents is also included, providing a ground truth for further analysis. The dataset is released as open data as an archive file containing all deduplicated license files, plus several portable CSV files with metadata, referencing files via cryptographic checksums. Conclusions: Thanks to the extensive coverage of Software Heritage, the dataset presented in this paper covers a very large fraction of all software licenses for public code. We have assembled a large body of software licenses, characterized it quantitatively and qualitatively, and validated that it is mostly composed of licensing information and includes almost all known license texts. The dataset can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. It can also be used in practice to improve tools detecting licenses in source code.
[.pdf] [.bib] doi> Yiming Sun, Daniel M. Germán, Stefano Zacchiroli. Using the Uniqueness of Global Identifiers to Determine the Provenance of Python Software Source Code. In Empirical Software Engineering, volume 28, issue 5, article number: 107. ISSN 1382-3256, Springer. 2023. Abstract...

Abstract: We consider the problem of identifying the provenance of free/open source software (FOSS) and specifically the need of identifying where reused source code has been copied from. We propose a lightweight approach to solve the problem based on software identifiers—such as the names of variables, classes, and functions chosen by programmers. The proposed approach is able to efficiently narrow down to a small set of candidate origin products, to be further analyzed with more expensive techniques to make a final provenance determination. By analyzing the PyPI (Python Packaging Index) open source ecosystem we find that globally defined identifiers are very distinct. Across PyPI’s 244 K packages we found 11.2 M different global identifiers (classes and method/function names—with only 0.6% of identifiers shared among the two types of entities); 76% of identifiers were used only in one package, and 93% in at most 3. Randomly selecting 3 non-frequent global identifiers from an input product is enough to narrow down its origins to a maximum of 3 products within 89% of the cases. We validate the proposed approach by mapping Debian source packages implemented in Python to the corresponding PyPI packages; this approach uses at most five trials, where each trial uses three randomly chosen global identifiers from a randomly chosen python file of the subject software package, then ranks results using a popularity index and requires to inspect only the top result. In our experiments, this method is effective at finding the true origin of a project with a recall of 0.9 and precision of 0.77.
[.pdf] [.bib] doi> Kevin Wellenzohn, Michael H. Böhlen, Sven Helmer, Antoine Pietri, Stefano Zacchiroli. Robust and Scalable Content-and-Structure Indexing. In The VLDB Journal, volume 32, pp. 689-715. ISSN 1066-8888, Springer. 2023. Abstract...

Abstract: Frequent queries on semi-structured hierarchical data are Content-and-Structure (CAS) queries that filter data items based on their location in the hierarchical structure and their value for some attribute. We propose the Robust and Scalable Content-and-Structure (RSCAS) index to efficiently answer CAS queries on big semi-structured data. To get an index that is robust against queries with varying selectivities we introduce a novel dynamic interleaving that merges the path and value dimensions of composite keys in a balanced manner. We store interleaved keys in our trie-based RSCAS index, which efficiently supports a wide range of CAS queries, including queries with wildcards and descendant axes. We implement RSCAS as a log-structured merge (LSM) tree to scale it to data-intensive applications with a high insertion rate. We illustrate RSCAS's robustness and scalability by indexing data from the Software Heritage (SWH) archive, which is the world's largest, publicly-available source code archive.
[.pdf] [.bib] doi> Mathieu O'Neil, Laure Muselli, Xiaolan Cai, Stefano Zacchiroli. Co-producing industrial public goods on GitHub: Selective firm cooperation, volunteer-employee labour and participation inequality. In New Media and Society, volume 26, issue 5, pp. 2556-2592. ISSN 1461-4448, SAGE Publishing. 2024. Abstract...

Abstract: The global economy's digital infrastructure is based on free and open source software. To analyse how firms indirectly collaborate via employee contributions to developer-run projects, we propose a formal definition of "industrial public goods"—inter-firm cooperation, volunteer and paid labour overlap, and participation inequality. We verify its empirical robustness by collecting networks of commits made by firm employees to active GitHub software repositories. Despite paid workers making more contributions, volunteers play a significant role. We find which firms contribute most, which projects benefit from firm investments, and identify distinct "contribution territories" since the two central firms never co-contribute to top-20 repositories. We highlight the challenge posed by "Big Tech" to the non-rival status of industrial public goods, thanks to cloud-based systems which resist sharing, and suggest there may be "contribution deserts" neglected by large information technology firms, despite their importance for the open source ecosystem's sustainability and diversity.
[.pdf] [.bib] doi> Chris Lamb, Stefano Zacchiroli. Reproducible Builds: Increasing the Integrity of Software Supply Chains. In IEEE Software, volume 39, issue 2, pp. 62-70. ISSN 0740-7459, IEEE Computer Society. 2022. Award: IEEE Software best paper award (for year 2022). Abstract...

Abstract: Although it is possible to increase confidence in Free and Open Source Software (FOSS) by reviewing its source code, trusting code is not the same as trusting its executable counterparts. These are typically built and distributed by third-party vendors, with severe security consequences if their supply chains are compromised. In this paper, we present reproducible builds, an approach that can determine whether generated binaries correspond with their original source code. We first define the problem, and then provide insight into the challenges of making real-world software build in a "reproducible" manner-this is, when every build generates bit-for-bit identical results. Through the experience of the Reproducible Builds project making the Debian Linux distribution reproducible, we also describe the affinity between reproducibility and quality assurance (QA).
[.pdf] [.bib] doi> Francesca Del Bonifro, Maurizio Gabbrielli, Antonio Lategano, Stefano Zacchiroli. Image-based many-language programming language identification. In PeerJ Computer Science, 7:e631. ISSN 2167-8359, PeerJ. 2021. Abstract...

Abstract: Programming language identification (PLI) is a common need in automatic program comprehension as well as a prerequisite for deeper forms of code understanding. Image-based approaches to PLI have recently emerged and are appealing due to their applicability to code screenshots and programming video tutorials. However, they remain limited to the recognition of a small amount of programming languages (up to 10 languages in the literature). We show that it is possible to perform image-based PLI on a large number of programming languages (up to 149 in our experiments) with high (92%) precision and recall, using convolutional neural networks (CNNs) and transfer learning, starting from readily-available pretrained CNNs. Results were obtained on a large real-world dataset of 300,000 code snippets extracted from popular GitHub repositories. By scrambling specific character classes and comparing identification performances we also show that the characters that contribute the most to the visual recognizability of programming languages are symbols (e.g., punctuation, mathematical operators and parentheses), followed by alphabetic characters, with digits and indentation having a negligible impact.
[.pdf] [.bib] doi> Stefano Zacchiroli. Gender Differences in Public Code Contributions: a 50-year Perspective. In IEEE Software, volume 38, issue 2, pp. 45-50. ISSN 0740-7459, IEEE Computer Society. 2021. Abstract...

Abstract: Gender imbalance in information technology in general, and Free/Open Source Software specifically, is a well-known problem in the field. Still, little is known yet about the large-scale extent and long-term trends that underpin the phenomenon. We contribute to fill this gap by conducting a longitudinal study of the population of contributors to publicly available software source code. We analyze 1.6 billion commits corresponding to the development history of 120 million projects, contributed by 33 million distinct authors over a period of 50 years. We classify author names by gender and study their evolution over time. We show that, while the amount of commits by female authors remains low overall, there is evidence of a stable long-term increase in their proportion over all contributions, providing hope of a more gender-balanced future for collaborative software development.
[.pdf] [.bib] doi> Simon Phipps, Stefano Zacchiroli. Continuous Open Source License Compliance. In IEEE Computer, volume 53, number 12, pp. 115-119. ISSN 0018-9162, IEEE Computer Society. 2020. Abstract...

Abstract: In this article we consider the role of policy and process in open source usage and propose in-workflow automation as the best path to promoting compliance.
[.pdf] [.bib] doi> Guillaume Rousseau, Roberto Di Cosmo, Stefano Zacchiroli. Software Provenance Tracking at the Scale of Public Source Code. In Empirical Software Engineering, volume 25, issue 4, pp. 2930-2959. ISSN 1382-3256, Springer. 2020. Abstract...

Abstract: We study the possibilities to track provenance of software source code artifacts within the largest publicly accessible corpus of publicly available source code, the Software Heritage archive, with over 4 billions unique source code files and 1 billion commits capturing their development histories across 50 million software projects. We perform a systematic and generic estimate of the replication factor across the different layers of this corpus, analysing how much the same artifacts (e.g., SLOC, files or commits) appear in different contexts (e.g., files, commits or source code repositories). We observe a combinatorial explosion in the number of identical source code files across different commits. To discuss the implication of these findings, we benchmark different data models for capturing software provenance information at this scale, and we identify a viable solution, based on the properties of isochrone subgraphs, that is deployable on commodity hardware, is incremental and appears to be maintainable for the foreseeable future. Using these properties, we quantify, at a scale never achieved previously, the growth rate of original, i.e. never-seen-before, source code files and commits, and find it to be exponential over a period of more than 40 years.
[.pdf] [.bib] doi> Mathieu O'Neil, Laure Muselli, Mahin Raissi, Stefano Zacchiroli. "Open source has won and lost the war": Legitimising commercial-communal hybridisation in a FOSS project. In New Media and Society. ISSN 1461-4448, 2020. SAGE Publications 2020. Abstract...

Abstract: Information technology (IT) firms are paying developers in Free and Open Source Software (FOSS) projects, leading to the emergence of hybrid forms of work. In order to understand how the firm-project hybridisation process occurs, we present the results of an online survey of participants in the Debian project, as well as interviews with Debian Developers. We find that the intermingling of the commercial logic of the firm and the communal logic of the project requires rhetorical legitimation. We analyse the discourses used to legitimise firm-project cooperation as well as the organisational mechanisms which facilitate this cooperation. A first phase of legitimation, based on firm adoption of open licenses and developer self-fulfilment, aims to erase the commercial/communal divide. A second more recent phase seeks to professionalise work relations inside the project and, in doing so, challenges the social order which restricts participation in FOSS. Ultimately, hybridisation raises the question of the fair distribution of the profits firms derive from FOSS.
[.pdf] [.bib] doi> Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. Referencing Source Code Artifacts: a Separate Concern in Software Citation. In Computing in Science and Engineering, volume 22, issue 2, pp. 33-43. ISSN 1521-9615, IEEE. March 2020. Abstract...

Abstract: Among the entities involved in software citation, software source code requires special attention, due to the role it plays in ensuring scientific reproducibility. To reference source code we need identifiers that are not only unique and persistent, but also support integrity checking intrinsically. Suitable iden- tifiers must guarantee that denoted objects will always stay the same, without relying on external third parties and administrative processes. We analyze the role of identifiers for digital objects (IDOs), whose properties are different from, and complementary to, those of the various digital identifiers of objects (DIOs) that are today popular building blocks of software and data citation toolchains. We argue that both kinds of identifiers are needed and detail the syntax, semantics, and practical implementation of the persistent identifiers (PIDs) adopted by the Software Heritage project to reference billions of software source code artifacts such as source code files, directories, and commits.
[.pdf] [.bib] doi> Gabriele D'Angelo, Angelo Di Iorio, Stefano Zacchiroli. Spacetime Characterization of Real-Time Collaborative Editing. In Proceedings of the ACM on Human-Computer Interaction, volume 2, issue CSCW, Article No. 41. ISSN 2573-0142, ACM, November 2018. Abstract...

Abstract: Real-Time Collaborative Editing (RTCE) is a popular way of instrumenting cooperative work on documents, in particular on the Web. Little is known in the literature yet about RTCE usage patterns in the real world. In this paper we study how a popular RTCE editor (Etherpad) is used in the wild, digging into the edit histories of a large collection of documents (about 14 000 pads), retrieved from one of the most popular public instances of the platform, hosted by the Wikimedia Foundation. The pad analysis is supported by a novel conceptual model that allows to label edit operations as "collaborative" or not depending on their distance---in edit position (space), edit time, or spacetime (both)---from edits made by other authors. The model is applied to classify all edits from the pad corpus. Classification results are further used to characterize the collaboration behavior of pad authors. Findings show that: 1) about half of the pads have a single author and hence witnessed no collaboration; 2) collaboration on common document parts happens often, but it happens asynchronously with authors taking turns in editing; and 3) simultaneous editing of common document parts happens very rarely. These findings help in revisiting early RTCE design decisions (e.g., the granularity of conflict management in RTCE protocols) and give insights on how to address novel needs (e.g., end-to-end encryption and offline editing).
[.pdf] [.bib] doi> Jean-François Abramatic, Roberto Di Cosmo, Stefano Zacchiroli. Building the Universal Archive of Source Code. In Communications of the ACM, October 2018, volume 61, number 10, pp. 29-31. ISSN 0001-0782, ACM. Abstract...

Abstract:
[.pdf] [.bib] Mathieu O'Neil, Mahin Raissi, Molly de Blanc, Stefano Zacchiroli. Preliminary Report on the Influence of Capital in an Ethical-Modular Project: Quantitative data from the 2016 Debian Survey. In Journal of Peer Production, issue 10. ISSN 2213-5316, 2017.
[.pdf] [.bib] doi> Matthieu Caneill, Daniel M. Germán, Stefano Zacchiroli. The Debsources Dataset: Two Decades of Free and Open Source Software. In Empirical Software Engineering, volume 22, pp. 1405-1437, June, 2017. ISSN 1382-3256, Springer. Abstract...

Abstract: We present the Debsources Dataset: source code and related metadata spanning two decades of Free and Open Source Software (FOSS) history, seen through the lens of the Debian distribution. The dataset spans more than 3 billion lines of source code as well as metadata about them such as: size metrics (lines of code, disk usage), developer-defined symbols (ctags), file-level checksums (SHA1, SHA256, TLSH), file media types (MIME), release information (which version of which package containing which source code files has been released when), and license information (GPL, BSD, etc). The Debsources Dataset comes as a set of tarballs containing deduplicated unique source code files organized by their SHA1 checksums (the source code), plus a portable PostgreSQL database dump (the metadata). A case study is run to show how the Debsources Dataset can be used to easily and efficiently instrument very long-term analyses of the evolution of Debian from various angles (size, granularity, licensing, etc.), getting a grasp of major FOSS trends of the past two decades. The Debsources Dataset is Open Data, released under the terms of the CC BY-SA 4.0 license, and available for download from Zenodo with DOI reference 10.5281/zenodo.61089.
[.pdf] [.bib] doi> Roberto Di Cosmo, Jacopo Mauro, Stefano Zacchiroli, Gianluigi Zavattaro. Aeolus: a Component Model for the Cloud. In Information and Computation, volume 239, pp. 100-121. 2014. ISSN 0890-5401, Elsevier. Abstract...

Abstract: We introduce the Aeolus component model, which is specifically designed to capture realistic scenarii arising when configuring and deploying distributed applications in the so-called cloud environments, where interconnected components can be deployed on clusters of heterogeneous virtual machines, which can be in turn created, destroyed, and connected on-the-fly. The full Aeolus model is able to describe several component characteristics such as dependencies, conflicts, non-functional requirements (replication requests and load limits), as well as the fact that component interfaces to the world might vary depending on the internal component state. When the number of components needed to build an application grows, it becomes important to be able to automate activities such as deployment and reconfiguration. This correspond, at the level of the model, to the ability to decide whether a desired target system configuration is reachable, which we call the achievability problem, and producing a path to reach it. In this work we show that the achievability problem is undecidable for the full Aeolus model, a strong limiting result for automated configuration in the cloud. We also show that the problem becomes decidable, but Ackermann-hard, as soon as one drops non-functional requirements. Finally, we provide a polynomial time algorithm for the further restriction of the model where support for inter-component conflicts is also removed.
[.pdf] [.bib] doi> Pietro Abate, Roberto Di Cosmo, Ralf Treinen, Stefano Zacchiroli. Learning from the Future of Component Repositories. In Science of Computer Programming, volume 90, part B, pp. 93-115. ISSN 0167-6423, Elsevier, 2014. Abstract...

Abstract: An important aspect of the quality assurance of large component repositories is to ensure the logical coherence of component metadata, and to this end one needs to identify incoherences as early as possible. Some relevant classes of problems can be formulated in term of properties of the future repositories into which the current repository may evolve. However, checking such properties on all possible future repositories requires a way to construct a finite representation of the infinite set of all potential futures. A class of properties for which this can be done is presented in this work. We illustrate the practical usefulness of the approach with two quality assurance applications: (i) establishing the amount of "forced upgrades" induced by introducing new versions of existing components in a repository, and (ii) identifying outdated components that are currently not installable and need to be upgraded in order to become installable again. For both applications we provide experience reports obtained on the Debian free software distribution.
[.pdf] [.bib] doi> Pietro Abate, Roberto Di Cosmo, Ralf Treinen, Stefano Zacchiroli. A Modular Package Manager Architecture. In Information and Software Technology, volume 55, issue 2, pp. 459-474. ISSN 0950-5849, Elsevier, February 2013. Abstract...

Abstract: The success of modern software distributions in the Free and Open Source world can be explained, among other factors, by the availability of a large collection of software packages and the possibility to easily install and remove those components using state of the art package managers. However, package managers are often built using a monolithic architecture and hard-wired and ad-hoc dependency solvers implementing some customized heuristics. In this paper we propose a modular architecture relying on precise interface formalisms that allows the system administrator to choose from a variety of dependency solvers and backends. We argue that this is the path that leads to the next generation of package managers that will deliver better results, offer more expressive preference languages, and be easily adaptable to new platforms. We have built a working prototype, called MPM, following the design advocated in this paper, and we show how it largely outperforms a variety of state of the art package managers.
[.pdf] [.bib] doi> Pietro Abate, Roberto Di Cosmo, Ralf Treinen, Stefano Zacchiroli. Dependency Solving: a Separate Concern in Component Evolution Management. In Journal of Systems and Software, volume 85, issue 10, pp. 2228-2240. ISSN 0164-1212, Elsevier, October 2012. Abstract...

Abstract: Maintenance of component-based software platforms often has to face rapid evolution of software components. Component dependencies, conflicts, and package managers with dependency solving capabilities are the key ingredients of prevalent software maintenance technologies that have been proposed to keep software installations synchronized with evolving component repositories. We review state-of-the-art package managers and their ability to keep up with evolution at the current growth rate of popular component-based platforms, and conclude that their dependency solving abilities are not up to the task. We show that the complexity of the underlying upgrade planning problem is NP-complete even for seemingly simple component models, and argue that the principal source of complexity lies in multiple available versions of components. We then discuss the need of expressive languages for user preferences, which makes the problem even more challenging. We propose to establish dependency solving as a separate concern from other upgrade aspects, and present CUDF as a formalism to describe upgrade scenarios. By analyzing the result of an international dependency solving competition, we provide evidence that the proposed approach is viable.
[.pdf] [.bib] doi> Angelo Di Iorio, Francesco Draicchio, Fabio Vitali, Stefano Zacchiroli. Constrained Wiki: The WikiWay to Validating Content. In Advances in Human-Computer Interaction, volume 2012, article ID 893575, pp. 1-19. ISSN 1687-5893, Hindawi, 2012 Abstract...

Abstract: The "WikiWay" is the open editing philosophy of wikis meant to foster open collaboration and continuous improvement of their content. Just like other online communities, wikis often introduce and enforce conventions, constraints, and rules for their content, but do so in a considerably softer way, expecting authors to deliver content that satisfies the conventions and the constraints, or, failing that, having volunteers of the community, the WikiGnomes, fix others' content accordingly. Constrained wikis is our generic framework for wikis to implement validators of community-specific constraints and conventions that preserve the WikiWay and their open collaboration features. To this end, specific requirements need to be observed by validators and a specific software architecture can be used for their implementation, that is, as independent functions (implemented as internal modules or external services) used in a nonintrusive way. Two separate proof-of-concept validators have been implemented for MediaWiki and MoinMoin, respectively, providing an annotated view functions, that is, presenting content authors with violation warnings, rather than preventing them from saving a noncompliant text.
[.pdf] [.bib] doi> Roberto Di Cosmo, Davide Di Ruscio, Patrizio Pelliccione, Alfonso Pierantonio, Stefano Zacchiroli. Supporting Software Evolution in Component-Based FOSS Systems. In Science of Computer Programming, volume 76, issue 12, pp. 1144-1160. ISSN 0167-6423, Elsevier, 2011. Abstract...

Abstract: FOSS (Free and Open Source Software) systems present interesting challenges in system evolution. On one hand, most FOSS systems are based on very fine-grained units of software deployment, called packages, which promote system evolution; on the other hand, FOSS systems are among the largest software systems known and require sophisticated static and dynamic conditions to be verified, in order to successfully deploy upgrades on user machines. The slightest error in one of these conditions can turn a routine upgrade into a system administrator nightmare. In this paper we introduce a model-based approach to support the upgrade of FOSS systems. The approach promotes the simulation of upgrades to predict failures before affecting the real system. Both fine-grained static aspects (e.g. configuration incoherences) and dynamic aspects (e.g. the execution of configuration scripts) are taken into account, improving over the state of the art of upgrade planners. The effectiveness of the approach is validated by instantiating the approach to widely-used FOSS distributions.
[.pdf] [.bib] doi> Paolo Marinelli, Fabio Vitali, Stefano Zacchiroli. Towards the unification of formats for overlapping markup. In New Review of Hypermedia and Multimedia, volume 14, issue 1, January 2008, pp. 57-94. Taylor and Francis, ISSN 1361-4568. Abstract...

Abstract: Overlapping markup refers to the issue of how to represent data structures more expressive than trees, for example direct acyclic graphs, using markup (meta-)languages which have been designed with trees in mind, for example XML. In this paper we observe that the state of the art in overlapping markup is far from being the widespread and consistent stack of standards and technologies readily available for XML and develop a roadmap for closing the gap. In particular we present in the paper the design and implementation of what we believe to be the first needed step, namely: a syntactic conversion framework among the plethora of overlapping markup serialization formats. The algorithms needed to perform the various conversions are presented in pseudo-code, they are meant to be used as blueprints for researchers and practitioners which need to write batch translation programs from one format to the other.
[.pdf] [.bib] doi> Claudio Sacerdoti Coen, Stefano Zacchiroli. Spurious Disambiguation Errors and How to Get Rid of Them. In Mathematics in Computer Science, volume 2, number 2, pp. 355-378, December 2008. Springer Birkhäuser, ISSN 1661-8270. Abstract...

Abstract: The disambiguation approach to the input of formulae enables users of mathematical assistants to type correct formulae in a terse syntax close to the usual ambiguous mathematical notation. When it comes to incorrect formulae however, far too many typing errors are generated; among them we want to present only errors related to the formula interpretation meant by the user, hiding errors related to other interpretations. We study disambiguation errors and how to classify them into the spurious and genuine error classes. To this end we give a general presentation of the classes of disambiguation algorithms and efficient disambiguation algorithms. We also quantitatively assess the quality of the presented error classification criteria benchmarking them in the setting of a formal development of constructive algebra.
[.pdf] [.bib] doi> Andrea Asperti, Claudio Sacerdoti Coen, Enrico Tassi, Stefano Zacchiroli. User Interaction with the Matita Proof Assistant. In Journal of Automated Reasoning, volume 39, number 2. Springer Netherlands, ISSN 0168-7433, pp. 109-139, 2007. Abstract...

Abstract: Matita is a new, document-centric, tactic-based interactive theorem prover. This paper focuses on some of the distinctive features of the user interaction with Matita, mostly characterized by the organization of the library as a searchable knowledge base, the emphasis on a high-quality notational rendering, and the complex interplay between syntax, presentation, and semantics.

editorials

[.pdf] [.bib] doi> Federico Balaguer, Roberto Di Cosmo, Alejandra Garrido, Fabio Kon, Gregorio Robles, Stefano Zacchiroli. Open Source Systems: Towards Robust Practices. 13th IFIP WG 2.13 International Conference, OSS 2017, Buenos Aires, Argentina, May 22-23, 2017, Proceedings. IFIP Advances in Information and Communication Technology 496, Springer 2017, ISBN 978-3-319-57734-0.
[.pdf] [.bib] Mathieu O'Neil, Stefano Zacchiroli. Making Lovework: Editorial Notes for the JoPP issue on Peer Production and Work. In Journal of Peer Production, issue 10. ISSN 2213-5316, 2017.
[.pdf] [.bib] Angelo Di Iorio, Davide Rossi, Stefano Zacchiroli. Editorial. In Journal of Web Engineering, volume 14, number 1-2, pp. 1-2. ISSN 1540-9589, Rinton Press, March 2015.
[.pdf] [.bib] doi> Angelo Di Iorio, Davide Rossi, Stefano Zacchiroli. Web Technologies: Selected and extended papers from WT ACM SAC 2012. In Science of Computer Programming, volume 94, part 1, pp. 1-2. ISSN 0167-6423, Elsevier, 2014.
[.pdf] [.bib] doi> Angelo Di Iorio, Davide Rossi, Stefano Zacchiroli. Editorial. In Software: Practice and Experience, volume 43, issue 12, pp. 1393-1394. ISSN 1097-024X, Wiley, 2013.

book chapters

[.pdf] [.bib] doi> Roberto Di Cosmo, Stefano Zacchiroli. The Software Heritage Open Science Ecosystem. Chapter 2 of Software Ecosystems: Tooling and Analytics. Pages 33-61. Tom Mens, Coen De Roover, Anthony Cleve Ed., Springer, 2023. Abstract...

Abstract: Software Heritage is the largest public archive of software source code and associated development history, as captured by modern version control systems. As of February 2023 it has archived more than 12 billion unique source code files and 2 billion commits, coming from more than 180 million collaborative development projects. In this chapter we describe the Software Heritage ecosystem, focusing on research and open science use cases. On the one hand Software Heritage supports empirical research on software by materialising in a single Merkle direct acyclic graph the development history of public code. This giant graph of source code artifacts (files, directories, and commits) can be used—and has been used—to study repository forks, open source contributors, vulnerability propagation, software provenance tracking, source code indexing, and more. On the other hand Software Heritage ensures availability and guarantees integrity of the source code of software artifacts used in any field that relies on software to conduct experiments, contributing to making research reproducible. The source code used in scientific experiments can be archived—e.g., via integration with open access repositories—referenced using persistent identifiers that allow downstream integrity checks, and linked to/from other scholarly digital artifacts.
[.pdf] [.bib] doi> Angelo Di Iorio, Fabio Vitali, Stefano Zacchiroli. Wiki Semantics via Wiki Templating. Chapter 34 of Handbook of research on Web 2.0, 3.0 and x.0: technologies, business and social applications. San Murugesan Ed., pp. 329-348, IGI Global, 2010, ISBN 978-1605663845. Abstract...

Abstract: A foreseeable incarnation of Web 3.0 could inherit machine understandability from the Semantic Web, and collaborative editing from Web 2.0 applications. We review the research and development trends which are getting today Web nearer to such an incarnation. We present semantic wikis, microformats, and the so-called "lowercase semantic web": they are the main approaches at closing the technological gap between content authors and Semantic Web technologies. We discuss a too often neglected aspect of the associated technologies, namely how much they adhere to the wiki philosophy of open editing: is there an intrinsic incompatibility between semantic rich content and unconstrained editing? We argue that the answer to this question can be "no", provided that a few yet relevant shortcomings of current Web technologies will be fixed soon.

international, peer-reviewed conference proceedings

[.pdf] [.bib] Stefano Balla, Thomas Degueule, Romain Robbes, Jean-Rémy Falleri, Stefano Zacchiroli. Automatic Classification of Software Repositories: a Systematic Mapping Study. To appear in proceedings of EASE 2025: International Conference on Evaluation and Assessment in Software Engineering, 17-20 June 2025, Istanbul, Turkey. 2025. Abstract...

Abstract: The rapid growth of software repositories on development platforms such as GitHub, as well as archives like Software Heritage, prompts the need for better repository classification. Machine learning is increasingly used to automate this classification, but there are no secondary studies analyzing this research landscape. We present a systematic mapping study of 43 primary sources published between 2002 and 2023, where we examine the goals, inputs, outputs, training, and evaluation processes involved in automatic repository classification. Our findings reveal a growing interest in automatic classification, particularly to enhance the discoverability and recommendation of relevant repositories. Other applications, such as classification for mining studies, were surprisingly underrepresented. We also observe that a lack of standardized datasets, classification tasks, and evaluation metrics makes it difficult to compare the performance of different techniques.
[.pdf] [.bib] doi> Dimitri Kokkonis, Michaël Marcozzi, Emilien Decoux, Stefano Zacchiroli. ROSA: Finding Backdoors with Fuzzing. To appear in proceedings of ICSE 2025: 47th International Conference on Software Engineering, 27 Avril - 3 May 2025, Ottawa, Canada. 2025. Award: ICSE 2025 Best Artifact Award. Abstract...

Abstract: A code-level backdoor is a hidden access, programmed and concealed within the code of a program. For instance, hard-coded credentials planted in the code of an FTP server would enable maliciously logging into all the deployed instances of this server. Confirmed software supply-chain attacks have led to the injection of backdoors into popular open-source projects, and backdoors have been discovered in various router firmware. Manual code auditing for backdoors is challenging and existing semi-automated approaches can handle only a limited amount of programs and backdoors, while requiring manual reverse-engineering of the audited (binary) program. Graybox fuzzing (automated semi-randomized testing) has grown in popularity due to its success in discovering vulnerabilities and hence stands as a strong candidate for improved backdoor detection. However, current fuzzing knowledge does not offer any means to detect the triggering of a backdoor at runtime. In this work we introduce ROSA, a novel approach (and tool) which combines a state-of-the-art fuzzer (AFL++) with a new metamorphic test oracle, capable of detecting runtime backdoor triggers. To facilitate the evaluation of ROSA, we have created ROSARUM, the first openly available benchmark for assessing the detection of various backdoors in diverse programs. Experimental evaluation shows that ROSA has a level of robustness, speed and automation similar to classical fuzzing. Compared to existing detection tools, it can handle a diversity of backdoors and programs and it does not rely on manually reverse-engineering the fuzzed binary code.
[.pdf] [.bib] Julien Malka, Théo Zimmermann, Stefano Zacchiroli. Does Functional Package Management Enable Reproducible Builds at Scale? Yes.. To appear in proceedings of Mining Software Repositories 2025 (MSR 2025), 28-29 April 2025, Ottawa, Canada. 2025. Award: ACM SIGSOFT Distinguished Paper Award. Abstract...

Abstract: Reproducible Builds (R-B) guarantee that rebuilding a software package from source leads to bitwise identical artifacts. R-B is a promising approach to increase the integrity of the software supply chain, when installing open source software built by third parties. Unfortunately, despite success stories like high build reproducibility levels in Debian packages, uncertainty remains among field experts on the scalability of R-B to very large package repositories. In this work, we perform the first large-scale study of bitwise reproducibility in the context of the Nix functional package manager, rebuilding 709816 packages from historical snapshots of the nixpkgs repository, the largest cross-ecosystem open source software distribution, sampled in the period 2017-2023. We obtain very high bitwise reproducibility rates, between 69 and 91% with an upward trend, and even higher rebuildability rates, over 99%. We investigate unreproducibility causes, showing that about 14% of failures are due to embedded build dates. We release a novel dataset with all build statuses, logs, as well as full "diffoscopes": recursive diffs of where unreproducible build artifacts differ.
[.pdf] [.bib] Luís Soeiro, Thomas Robert, Stefano Zacchiroli. Wild SBOMs: a Large-scale Dataset of Software Bills of Materials from Public Code. To appear in proceedings of Mining Software Repositories 2025 (MSR 2025), 28-29 April 2025, Ottawa, Canada. 2025. Abstract...

Abstract: Developers gain productivity by reusing readily available Free and Open Source Software (FOSS) components. Such practices also bring some difficulties, such as managing licensing, components and related security. One approach to handle those difficulties is to use Software Bill of Materials (SBOMs). While there have been studies on the readiness of practitioners to embrace SBOMs and on the SBOM tools ecosystem, a large scale study on SBOM practices based on SBOM files produced in the wild is still lacking. A starting point for such a study is a large dataset of SBOM files found in the wild. We introduce such a dataset, consisting of over 78 thousand unique SBOM files, deduplicated from those found in over 94 million repositories. We include metadata that contains the standard and format used, quality score generated by the tool sbomqs, number of revisions, filenames and provenance information. Finally, we give suggestions and examples of research that could bring new insights on assessing and improving SBOM real practices.
[.pdf] [.bib] Andrea Gurioli, Maurizio Gabbrielli, Stefano Zacchiroli. Is This You, LLM? Recognizing AI-written Programs with Multilingual Code Stylometry. To appear in proceedings of SANER 2025: The IEEE International Conference on Software Analysis, Evolution and Reengineering, 4-7 March 2025, Montréal, Quebec, Canada. IEEE 2025. Abstract...

Abstract: With the increasing popularity of LLM-based code completers, like GitHub Copilot, the interest in automatically detecting AI-generated code is also increasing---in particular in contexts where the use of LLMs to program is forbidden by policy due to security, intellectual property, or ethical concerns. We introduce a novel technique for AI code stylometry, i.e., the ability to distinguish code generated by LLMs from code written by humans, based on a transformer-based encoder classifier. Differently from previous work, our classifier is capable of detecting AI-written code across 10 different programming languages with a single machine learning model, maintaining high average accuracy across all languages (84.1% ± 3.8%). Together with the classifier we also release H-AIRosettaMP, a novel open dataset for AI code stylometry tasks, consisting of 121247 code snippets in 10 popular programming languages, labeled as either human-written or AI-generated. The experimental pipeline (dataset, training code, resulting models) is the first fully reproducible one for the AI code stylometry task. Most notably our experiments rely only on open LLMs, rather than on proprietary/closed ones like ChatGPT.
[.pdf] [.bib] doi> Ludovic Courtès, Timothy Sample, Simon Tournier, Stefano Zacchiroli. Source Code Archiving to the Rescue of Reproducible Deployment. In proceedings of 2024 ACM Conference on Reproducibility and Replicability, June 18-20, 2024, Rennes, France. 10 pages. ACM 2024. Abstract...

Abstract: The ability to verify research results and to experiment with methodologies are core tenets of science. As research results are increasingly the outcome of computational processes, software plays a central role. GNU Guix is a software deployment tool that supports reproducible software deployment, making it a foundation for computational research workflows. To achieve reproducibility, we must first ensure the source code of software packages Guix deploys remains available. We describe our work connecting Guix with Software Heritage, the universal source code archive, making Guix the first free soft- ware distribution and tool backed by a stable archive. Our contribution is twofold: we explain the rationale and present the design and implementation we came up with; second, we report on the archival coverage for package source code with data collected over five years and discuss remaining challenges.
[.pdf] [.bib] doi> Tommaso Fontana, Sebastiano Vigna, Stefano Zacchiroli. WebGraph: The Next Generation (Is in Rust). In Companion Proceedings of the ACM Web Conference 2024 (WWW '24 Companion), May 2024, Singapore, pp. 686-689. ACM 2024. Abstract...

Abstract: We report the results of a yearlong effort to port the WebGraph framework from Java to Rust. For two decades WebGraph has been instrumental in the analysis and distribution of large graphs for the research community of TheWebConf, but the intrinsic limitations of the Java Virtual Machine had become a bottleneck for very large use cases, such as the Software Heritage Merkle graph with its half a trillion arcs. As part of this clean-slate implementation of WebGraph in Rust, we developed a few ancillary projects bringing to the Rust ecosystem some missing features of independent interest, such as easy, consistent and zero-cost memory mapping of data structures. WebGraph in Rust offers impressive performance improvements over the previous implementation, enabling open-source graph analytics on very large datasets on top of a modern system programming language.
[.pdf] [.bib] doi> Julien Malka, Théo Zimmermann, Stefano Zacchiroli. Reproducibility of Build Environments through Space and Time. In proceedings of 46th International Conference on Software Engineering (ICSE 2024) - New Ideas and Emerging Results (NIER) Track, April 2024, Lisbon, Portugal, pp. 97-101. ACM 2024. Abstract...

Abstract: Modern software engineering builds up on the composability of software components, that rely on more and more direct and transitive dependencies to build their functionalities. This principle of reusability however makes it harder to reproduce projects' build environments, even though reproducibility of build environments is essential for collaboration, maintenance and component lifetime. In this work, we argue that functional package managers provide the tooling to make build environments reproducible in space and time, and we produce a preliminary evaluation to justify this claim. Using historical data, we show that we are able to reproduce build environments of about 7 million Nix packages, and to rebuild 99.94% of the 14 thousand packages from a 6-year-old Nixpkgs revision.
[.pdf] [.bib] doi> Romain Lefeuvre, Jessie Galasso, Benoit Combemale, Houari Sahraoui, Stefano Zacchiroli. Fingerprinting and Building Large Reproducible Datasets. In proceedings of 2023 ACM Conference on Reproducibility and Replicability, June 27-29, 2023, Santa Cruz, California, USA. Pages 27-36. ACM 2023. Abstract...

Abstract: Obtaining a relevant dataset is central to conducting empirical studies in software engineering. However, in the context of mining software repositories, the lack of appropriate tooling for large scale mining tasks hinders the creation of new datasets. Moreover, limitations related to data sources that change over time (e.g., code bases) and the lack of documentation of extraction processes make it difficult to reproduce datasets over time. This threatens the quality and reproducibility of empirical studies. In this paper, we propose a tool-supported approach facilitating the creation of large tailored datasets while ensuring their reproducibility. We leveraged all the sources feeding the Software Heritage append-only archive which are accessible through a unified programming interface to outline a reproducible and generic extraction process. We propose a way to define a unique fingerprint to characterize a dataset which, when provided to the extraction process, ensures that the same dataset will be extracted. We demonstrate the feasibility of our approach by implementing a prototype. We show how it can help reduce the limitations researchers face when creating or reproducing datasets.
[.pdf] [.bib] doi> Daniele Serafini, Stefano Zacchiroli. Efficient Prior Publication Identification for Open Source Code. In proceedings of OpenSym '22: Proceedings of the 18th International Symposium on Open Collaboration, September 6-10, 2022, Madrid, Spain. Pages 1-8. ACM 2022. Abstract...

Abstract: Free/Open Source Software (FOSS) enables large-scale reuse of preexisting software components. The main drawback is increased complexity in software supply chain management. A common approach to tame such complexity is automated open source compliance, which consists in automating the verification of adherence to various open source management best practices about license obligation fulfillment, vulnerability tracking, software composition analysis, and nearby concerns. We consider the problem of auditing a source code base to determine which of its parts have been published before, which is an important building block of automated open source compliance toolchains. Indeed, if source code allegedly developed in house is recognized as having been previously published elsewhere, alerts should be raised to investigate where it comes from and whether this entails that additional obligations shall be fulfilled before product shipment. We propose an efficient approach for prior publication identification that relies on a knowledge base of known source code artifacts linked together in a global Merkle direct acyclic graph and a dedicated discovery protocol. We introduce swh-scanner, a source code scanner that realizes the proposed approach in practice using as knowledge base Software Heritage, the largest public archive of source code artifacts. We validate experimentally the proposed approach, showing its efficiency in both abstract (number of queries) and concrete terms (wall-clock time), performing benchmarks on 16'845 real-world public code bases of various sizes, from small to very large.
[.pdf] [.bib] doi> Zeinab Abou Khalil, Stefano Zacchiroli. Software Artifact Mining in Software Engineering Conferences: A Meta-Analysis. In proceedings of ESEM '22: Proceedings of the 16th ACM / IEEE International Symposium on Empirical Software Engineering and Measurement, September 19-23, 2022, Helsinki, Finland. Pages 227-237. ACM 2022. Abstract...

Abstract: Background: Software development results in the production of various types of artifacts: source code, version control system metadata, bug reports, mailing list conversations, test data, etc. Empirical software engineering (ESE) has thrived mining those artifacts to uncover the inner workings of software development and improve its practices. But which artifacts are studied in the field is a moving target, which we study empirically in this paper. Aims: We quantitatively characterize the most frequently mined and co-mined software artifacts in ESE research and the research purposes they support. Method: We conduct a meta-analysis of artifact mining studies published in 11 top conferences in ESE, for a total of 9621 papers. We use natural language processing (NLP) techniques to characterize the types of software artifacts that are most often mined and their evolution over a 16-year period (2004-2020). We analyze the combinations of artifact types that are most often mined together, as well as the relationship between study purposes and mined artifacts. Results: We find that: (1) mining happens in the vast majority of analyzed papers, (2) source code and test data are the most mined artifacts, (3) there is an increasing interest in mining novel artifacts, together with source code, (4) researchers are most interested in the evaluation of software systems and use all possible empirical signals to support that goal.
[.pdf] [.bib] doi> Stefano Zacchiroli. A Large-scale Dataset of (Open Source) License Text Variants. In proceedings of The 2022 Mining Software Repositories Conference (MSR 2022), 23-24 May 2022, Pittsburgh, PA, USA. Pages 757-761, ACM 2022. Award: Data and Tool Showcase Award. Abstract...

Abstract: We introduce a large-scale dataset of the complete texts of free/open source software (FOSS) license variants. To assemble it we have collected from the Software Heritage archive—the largest publicly available archive of FOSS source code with accompanying development history—all versions of files whose names are commonly used to convey licensing terms to software users and developers. The dataset consists of 6.5 million unique license files that can be used to conduct empirical studies on open source licensing, training of automated license classifiers, natural language processing (NLP) analyses of legal texts, as well as historical and phylogenetic studies on FOSS licensing. Additional metadata about shipped license files are also provided, making the dataset ready to use in various contexts; they include: file length measures, detected MIME type, detected SPDX license (using ScanCode), example origin (e.g., GitHub repository), oldest public commit in which the license appeared. The dataset is released as open data as an archive file containing all deduplicated license files, plus several portable CSV files for metadata, referencing files via cryptographic checksums.
[.pdf] [.bib] doi> Davide Rossi, Stefano Zacchiroli. Geographic Diversity in Public Code Contributions: An Exploratory Large-Scale Study Over 50 Years. In proceedings of The 2022 Mining Software Repositories Conference (MSR 2022), 23-24 May 2022, Pittsburgh, PA, USA. Pages 80-85, ACM 2022. Abstract...

Abstract: We conduct an exploratory, large-scale, longitudinal study of 50 years of commits to publicly available version control system repositories, in order to characterize the geographic diversity of contributors to public code and its evolution over time. We analyze in total 2.2 billion commits collected by Software Heritage from 160 million projects and authored by 43 million authors during the 1971–2021 time period. We geolocate developers to 12 world regions derived from the United Nation geoscheme, using as signals email top-level domains, author names compared with names distributions around the world, and UTC offsets mined from commit metadata. We find evidence of the early dominance of North America in open source software, later joined by Europe. After that period, the geographic diversity in public code has been constantly increasing. We also identify relevant historical shifts related to the UNIX wars, the increase of coding literacy in Central and South Asia, and broader phenomena like colonialism and people movement across countries (immigration/emigration).
[.pdf] [.bib] doi> Zeinab Abou Khalil, Stefano Zacchiroli. The General Index of Software Engineering Papers. In proceedings of The 2022 Mining Software Repositories Conference (MSR 2022), 23-24 May 2022, Pittsburgh, PA, USA. Pages 98-102, ACM 2022. Abstract...

Abstract: We introduce the General Index of Software Engineering Papers, a dataset of fulltext-indexed papers from the most prominent scientific venues in the field of Software Engineering. The dataset includes both complete bibliographic information and indexed n-grams (sequence of contiguous words after removal of stopwords and non-words, for a total of 577 276 382 unique n-grams in this release) with length 1 to 5 for 44 581 papers retrieved from 34 venues over the 1971–2020 period. The dataset serves use cases in the field of meta-research, allowing to introspect the output of software engineering research even when access to papers or scholarly search engines is not possible (e.g., due to contractual reasons). The dataset also contributes to making such analyses reproducible and independently verifiable, as opposed to what happens when they are conducted using 3rd-party and non-open scholarly indexing services. The dataset is available as a portable Postgres database dump and released as open data.
[.pdf] [.bib] doi> Davide Rossi, Stefano Zacchiroli. Worldwide Gender Differences in Public Code Contributions (and How They Have Been Affected by the COVID-19 Pandemic). In proceedings of 44th International Conference on Software Engineering (ICSE 2022) - Software Engineering in Society (SEIS) Track, May 2022, Pittsburgh, PA, USA. Pages 172-183, ACM 2022. Abstract...

Abstract: Gender imbalance is a well-known phenomenon observed throughout sciences which is particularly severe in software development and Free/Open Source Software communities. Little is know yet about the geography of this phenomenon in particular when considering large scales for both its time and space dimensions. We contribute to fill this gap with a longitudinal study of the population of contributors to publicly available software source code. We analyze the development history of 160 million software projects for a total of 2.2 billion commits contributed by 43 million distinct authors over a period of 50 years. We classify author names by gender using name frequencies and author geographical locations using heuristics based on email addresses and time zones. We study the evolution over time of contributions to public code by gender and by world region. For the world overall, we confirm previous findings about the low but steadily increasing ratio of contributions by female authors. When breaking down by world regions we find that the long-term growth of female participation is a world-wide phenomenon. We also observe a decrease in the ratio of female participation during the COVID-19 pandemic, suggesting that women’s ability to contribute to public code has been more hindered than that of men.
[.pdf] [.bib] doi> Thibault Allançon, Antoine Pietri, Stefano Zacchiroli. The Software Heritage Filesystem (SwhFS): Integrating Source Code Archival with Development. In proceedings of 43rd International Conference on Software Engineering (ICSE 2021) - Demonstrations Track, May 2021, Madrid, Spain. IEEE 2021. Pages 45-48. Abstract...

Abstract: We introduce the Software Heritage filesystem (SwhFS), a user-space filesystem that integrates large-scale open source software archival with development workflows. SwhFS provides a POSIX filesystem view of Software Heritage, the largest public archive of software source code and version control system (VCS) development history. Using SwhFS, developers can quickly “checkout” any of the 2 billion commits archived by Software Heritage, even after they disappear from their previous known location and without incurring the performance cost of repository cloning. SwhFS works across unrelated repositories and different VCS technologies. Other source code artifacts archived by Software Heritage—individual source code files and trees, releases, and branches—can also be accessed using common programming tools and custom scripts, as if they were locally available. A screencast of SwhFS is available online at dx.doi.org/10.5281/zenodo.4531411.
[.pdf] [.bib] doi> Francesca Del Bonifro, Maurizio Gabbrielli, Stefano Zacchiroli. Content-Based Textual File Type Detection at Scale. In proceedings of ICMLC 2021: The 13th International Conference on Machine Learning and Computing, February 2021, Shenzhen, China. Pages 485-492. ACM 2021. Abstract...

Abstract: Programming language detection is a common need in the analysis of large source code bases. It is supported by a number of existing tools that rely on several features, and most notably file extensions, to determine file types. We consider the problem of accurately detecting the type of files commonly found in software code bases, based solely on textual file content. Doing so is helpful to classify source code that lack file extensions (e.g., code snippets posted on the Web or executable scripts), to avoid misclassifying source code that has been recorded with wrong or uncommon file extensions, and also shed some light on the intrinsic recognizability of source code files. We propose a simple model that (a) use a language-agnostic word tokenizer for textual files, (b) group tokens in 1-/2-grams, (c) build feature vectors based on N-gram frequencies, and (d) use a simple fully connected neural network as classifier. As training set we use textual files extracted from GitHub repositories with at least 1000 stars, using existing file extensions as ground truth. Despite its simplicity the proposed model outperforms state-of-the-art tools and approaches in the literature both in terms of accuracy (≈ 90% in our experiments) and recognized classes (more than 130 file types).
[.pdf] [.bib] doi> Mathieu O'Neil, Laure Muselli, Stefano Zacchiroli, Xiaolan Cai, Fred Pailler. Firm discourses and digital infrastructure projects. In AoIR Selected Papers of Internet Research 2020, Research from the Annual Conference of the Association of Internet Researchers. 2020. Abstract...

Abstract: Free and open source software (a.k.a. FOSS) is now fully integrated into commercial ecosystems. IT firms invest in FOSS in order (a) to share with other firms development costs; (b) to help attract prospective employees in a competitive job market where hiring skilled IT professionals is challenging and (c) to shape the governance and technical orientation of projects: firm employees participating in leading in FOSS projects may help IT firms create digital infrastructure more suited to the firmware they develop atop this infrastructure. How does the world of FOSS volunteers connect to the world of commercial ecosystems? Are firms developing policies in relation to open source communities, requesting projects conform to certain technical or behavioral standards, for example? To what extent are these strategies successful? To answer, we present a qualitative analysis of firm discourses collected during three open source conferences. We then analyze the email discussion lists of Linux and Firefox and search for the occurrence of key firm discourse terms in order to ascertain in what way these discourses are being used by FOSS developers. Our in-depth analysis of firm discourses and exploratory analysis of project discussions around these terms show that the FOSS world encompasses a diversity of industrial outlooks. They also highlight the evolution of the role of foundations: whilst foundations used to protect projects from firm interference, some have now wholly been placed in the service of firm efforts to standardize project work, particularly around the key issue of security.
[.pdf] [.bib] doi> Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli. Determining the Intrinsic Structure of Public Software Development History. In proceedings of MSR 2020: The 17th International Conference on Mining Software Repositories, May 2020, Seoul, South Korea. Co-located with ICSE 2020. Pages 602-605, IEEE 2020. Abstract...

Abstract: Background: Collaborative software development has produced a wealth of version control system (VCS) data that can now be analyzed in full. Little is known about the intrinsic structure of the entire corpus of publicly available VCS as an interconnected graph. Understanding its structure is needed to determine the best approach to analyze it in full and to avoid methodological pitfalls when doing so. Objective: We intend to determine the most salient network topology properties of public software development history as captured by VCS. We will explore: degree distributions, determining whether they are scale-free or not; distribution of connect component sizes; distribution of shortest path lengths. Method: We will use Software Heritage---which is the largest corpus of public VCS data---compress it using webgraph compression techniques, and analyze it in-memory using classic graph algorithms. Analyses will be performed both on the full graph and on relevant subgraphs. Limitations: The study is exploratory in nature; as such no hypotheses on the findings is stated at this time. Chosen graph algorithms are expected to scale to the corpus size, but it will need to be confirmed experimentally. External validity will depend on how representative Software Heritage is of the software commons.
[.pdf] [.bib] doi> Antoine Pietri, Guillaume Rousseau, Stefano Zacchiroli. Forking Without Clicking: on How to Identify Software Repository Forks. In proceedings of MSR 2020: The 17th International Conference on Mining Software Repositories, May 2020, Seoul, South Korea. Co-located with ICSE 2020. Pages 277-287, IEEE 2020. Abstract...

Abstract: The notion of software "fork" has been shifting over time from the (negative) phenomenon of community disagreements that result in the creation of separate development lines and ultimately software products, to the (positive) practice of using distributed version control system (VCS) repositories to collaboratively improve a single product without stepping on each others toes. In both cases the VCS repositories participating in a fork share parts of a common development history. Studies of software forks generally rely on hosting platform metadata, such as GitHub, as the source of truth for what constitutes a fork. These “forge forks” however can only identify as forks repositories that have been created on the platform, e.g., by clicking a "fork" button on the platform user interface. The increased diversity in code hosting platforms (e.g., GitLab) and the habits of significant development communities (e.g., the Linux kernel, which is not primarily hosted on any single platform) call into question the reliability of trusting code hosting platforms to identify forks. Doing so might introduce selection and methodological biases in empirical studies. In this article we explore various definitions of "software forks", trying to capture forking workflows that exist in the real world. We quantify the differences in how many repositories would be identified as forks on GitHub according to the various definitions, confirming that a significant number could be overlooked by only considering forge forks. We study the structure and size of fork networks, observing how they are affected by the proposed definitions and discuss the potential impact on empirical research.
[.pdf] [.bib] doi> Paolo Boldi, Antoine Pietri, Sebastiano Vigna, Stefano Zacchiroli. Ultra-Large-Scale Repository Analysis via Graph Compression. In proceedings of SANER 2020: The 27th IEEE International Conference on Software Analysis, Evolution and Reengineering, February 18-21, 2020, London, Ontario, Canada, pp. 184-194. IEEE 2020. Abstract...

Abstract: We consider the problem of mining the development history—as captured by modern version control systems—of ultra-large-scale software archives (e.g., tens of millions software repositories corresponding). We show that graph compression techniques can be applied to the problem, dramatically reducing the hardware resources needed to mine similarly-sized corpus. As a concrete use case we compress the full Software Heritage archive, consisting of 5 billion unique source code files and 1 billion unique commits, harvested from more than 80 million software projects—encompassing a full mirror of GitHub. The resulting compressed graph fits in less than 100 GB of RAM, corresponding to a hardware cost of less than 300 U.S. dollars. We show that the compressed in-memory representation of the full corpus can be accessed with excellent performances, with edge lookup times close to memory random access. As a sample exploitation experiment we show that the compressed graph can be used to conduct clone detection at this scale, benefiting from main memory access speed.
[.pdf] [.bib] doi> Pietro Abate, Roberto Di Cosmo, Georgios Gousios, Stefano Zacchiroli. Dependency Solving Is Still Hard, but We Are Getting Better at It. In proceedings of SANER 2020: The 27th IEEE International Conference on Software Analysis, Evolution and Reengineering, February 18-21, 2020, London, Ontario, Canada, pp. 547-551. IEEE 2020. Abstract...

Abstract: Dependency solving is a hard (NP-complete) problem in all non-trivial component models due to either mutually incompatible versions of the same packages or explicitly declared package conflicts. As such, software upgrade planning needs to rely on highly specialized dependency solvers, lest falling into pitfalls such as incompleteness—a combination of package versions that satisfy dependency constraints does exist, but the package manager is unable to find it. In this paper we look back at proposals from dependency solving research dating back a few years. Specifically, we review the idea of treating dependency solving as a separate concern in package manager implementations, relying on generic dependency solvers based on tried and tested techniques such as SAT solving, PBO, MILP, etc. By conducting a census of dependency solving capabilities in state-of-the-art package managers we conclude that some proposals are starting to take off (e.g., SAT-based dependency solving) while—with few exceptions—others have not (e.g., outsourcing dependency solving to reusable components). We reflect on why that has been the case and look at novel challenges for dependency solving that have emerged since.
[.pdf] [.bib] doi> Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Large-scale Analysis of Public Software Development History. In proceedings of MSR 2020: The 17th International Conference on Mining Software Repositories, May 2020, Seoul, South Korea. Co-located with ICSE 2020. Pages 1-5. IEEE 2020. Abstract...

Abstract: Software Heritage is the largest existing public archive of software source code and accompanying development history. It spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects. These software artifacts were retrieved from major collaborative development platforms (e.g., GitHub, GitLab) and package repositories (e.g., PyPI, Debian, NPM), and stored in a uniform representation linking together source code files, directories, commits, and full snapshots of version control systems (VCS) repositories as observed by Software Heritage during periodic crawls. This dataset is unique in terms of accessibility and scale, and allows to explore a number of research questions on the long tail of public software development, instead of solely focusing on "most starred" repositories as it often happens.
[.pdf] [.bib] doi> Antoine Pietri, Diomidis Spinellis, Stefano Zacchiroli. The Software Heritage Graph Dataset: Public software development under one roof. In proceedings of MSR 2019: The 16th International Conference on Mining Software Repositories, May 2019, Montreal, Canada. Co-located with ICSE 2019. Pages 138-142, IEEE 2019. Abstract...

Abstract: Software Heritage is the largest existing public archive of software source code and accompanying development history: it currently spans more than five billion unique source code files and one billion unique commits, coming from more than 80 million software projects. This paper introduces the Software Heritage graph dataset: a fully-deduplicated Merkle DAG representation of the Software Heritage archive. The dataset links together file content identifiers, source code directories, Version Control System (VCS) commits tracking evolution over time, up to the full states of VCS repositories as observed by Software Heritage during periodic crawls. The dataset's contents come from major development forges (including GitHub and GitLab), FOSS distributions (e.g., Debian), and language-specific package managers (e.g., PyPI). Crawling information is also included, providing timestamps about when and where all archived source code artifacts have been observed in the wild. The Software Heritage graph dataset is available in multiple formats, including downloadable CSV dumps and Apache Parquet files for local use, as well as a public instance on Amazon Athena interactive query service for ready-to-use powerful analytical processing. Source code file contents are cross-referenced at the graph leaves, and can be retrieved through individual requests using the Software Heritage archive API.
[.pdf] [.bib] doi> Roberto Di Cosmo, Morane Gruenpeter, Stefano Zacchiroli. Identifiers for Digital Objects: the Case of Software Source Code Preservation. In proceedings of iPRES 2018: 15th International Conference on Digital Preservation, Boston, MA, USA, 24-27 September 2018, 9 pages. Abstract...

Abstract: In the very broad scope addressed by digital preservation initiatives, a special place belongs to the scientific and technical artifacts that we need to properly archive to enable scientific reproducibility. For these artifacts we need identifiers that are not only unique and persistent, but also support integrity in an intrinsic way. They must provide strong guarantees that the object denoted by a given identifier will always be the same, without relying on third parties and external administrative processes. In this article, we report on our quest for this identifiers for digital objects (IDOs), whose properties are different from, and complementary to, those of the various digital identifiers of objects (DIOs) that are in widespread use today. We argue that both kinds of identifiers are needed and present the framework for intrinsic persistent identifiers that we have adopted in Software Heritage for preserving billions of software artifacts.
[.pdf] [.bib] Roberto Di Cosmo, Stefano Zacchiroli. Software Heritage: Why and How to Preserve Software Source Code. In Proceedings of iPRES 2017: 14th International Conference on Digital Preservation, Kyoto, Japan, 25-29 September 2017, 10 pages. Abstract...

Abstract: Software is now a key component present in all aspects of our society. Its preservation has attracted growing attention over the past years within the digital preservation community. We claim that source code—the only representation of software that contains human readable knowledge—is a precious digital object that needs special handling: it must be a first class citizen in the preservation landscape and we need to take action immediately, given the in- creasingly more frequent incidents that result in permanent losses of source code collections. In this paper we present Software Heritage, an ambitious initiative to collect, preserve, and share the entire corpus of publicly accessible software source code. We discuss the archival goals of the project, its use cases and role as a participant in the broader digital preservation ecosystem, and detail its key design decisions. We also report on the project road map and the current status of the Software Heritage archive that, as of early 2017, has collected more than 3 billion unique source code files and 700 million commits coming from more than 50 million software development projects.
[.pdf] [.bib] doi> Roberto Di Cosmo, Antoine Eiche, Jacopo Mauro, Stefano Zacchiroli, Gianluigi Zavattaro, Jakub Zwolakowski. Automatic Deployment of Services in the Cloud with Aeolus Blender. In proceedings of ICSOC 2015: 13th International Conference on Service Oriented Computing, November 16-19, 2015, Goa, India. ISBN 978-3-662-48615-3, pp. 397-411, Springer-Verlag 2015. Abstract...

Abstract: We present Aeolus Blender (Blender in the following), a software product for the automatic deployment and configuration of complex service-based, distributed software systems in the "cloud". By relying on a configuration optimiser and a deployment planner, Blender fully automates the deployment of real-life applications on OpenStack cloud deployments, by exploiting a knowledge base of software services provided by the Mandriva Armonic tool suite. The final deployment is guaranteed to satisfy not only user requirements and relevant software dependencies, but also to be optimal with respect to the number of used virtual machines.
[.pdf] [.bib] doi> Roberto Di Cosmo, Michael Lienhardt, Jacopo Mauro, Stefano Zacchiroli, Gianluigi Zavattaro, Jakub Zwolakowski. Automatic Application Deployment in the Cloud: from Practice to Theory and Back. In proceedings of CONCUR 2015: 26th International Conference on Concurrency Theory, September 1-4, 2015, Madrid, Spain. Leibniz International Proceedings in Informatics (LIPIcs) 42, pp. 1-16, ISBN 978-3-939897-91-0, Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik 2015. Abstract...

Abstract: The problem of deploying a complex software application has been formally investigated in previous work by means of the abstract component model named Aeolus. As the problem turned out to be undecidable, simplified versions of the model were investigated in which decidability was restored by introducing limitations on the ways components are described. In this paper, we take an opposite approach, and investigate the possibility to address a relaxed version of the deployment problem without limiting the expressiveness of the component model. We identify three problems to be solved in sequence: (i) the verification of the existence of a final configuration in which all the constraints imposed by the single components are satisfied, (ii) the generation of a concrete configuration satisfying such constraints, and (iii) the synthesis of a plan to reach such a configuration possibly going through intermediary configurations that violate the non-functional constraints.
[.pdf] [.bib] doi> Stefano Zacchiroli. The Debsources Dataset: Two Decades of Debian Source Code Metadata. In proceedings of MSR 2015: The 12th Working Conference on Mining Software Repositories, May 16-17, 2015, Florence, Italy. Co-located with ICSE 2015. ISBN ISBN 978-0-7695-5594-2, pp. 466-469, IEEE 2015. Abstract...

Abstract: We present the Debsources Dataset: distribution metadata and source code metrics spanning two decades of Free and Open Source Software (FOSS) history, seen through the lens of the Debian distribution. Debsources is a software platform used to gather, search, and publish on the Web the full source code of the Debian operating system, as well as measures about it. A notable public instance of Debsources is available at http://sources.debian.net; it includes both current and historical releases of Debian. Plugins to compute popular source code metrics (lines of code, defined symbols, disk usage) and other derived data (e.g., checksums) have been written, integrated, and run on all the source code available on sources.debian.net. The Debsources Dataset is a PostgreSQL database dump of sources.debian.net metadata, as of February 10th, 2015. The dataset contains both Debian-specific metadata—e.g., which software packages are available in which release, which source code file belong to which package, release dates, etc.—and source code information gathered by running Debsources plugins. The Debsources Dataset offer a very long-term historical view of the macro-level evolution and constitution of FOSS through the lens of popular, representative FOSS projects of their times.
[.pdf] [.bib] doi> Pietro Abate, Roberto Di Cosmo, Louis Gesbert, Fabrice Le Fessant, Ralf Treinen, Stefano Zacchiroli. Mining Component Repositories for Installability Issues. In proceedings of MSR 2015: The 12th Working Conference on Mining Software Repositories, May 16-17, 2015, Florence, Italy. Co-located with ICSE 2015. ISBN ISBN 978-0-7695-5594-2, pp. 24-33, IEEE 2015. Abstract...

Abstract: Component repositories play an increasingly relevant role in software life-cycle management, from software distribution to end-user, to deployment and upgrade management. Software components shipped via such repositories are equipped with rich metadata that describe their relationship (e.g., dependencies and conflicts) with other components. In this practice paper we show how to use a tool, distcheck, that uses component metadata to identify all the components in a repository that cannot be installed (e.g., due to unsatisfiable dependencies), provides detailed information to help developers understanding the cause of the problem, and fix it in the repository. We report about detailed analyses of several repositories: the Debian distribution, the OPAM package collection, and Drupal modules. In each case, distcheck is able to efficiently identify not installable components and provide valuable explanations of the issues. Our experience provides solid ground for generalizing the use of distcheck to other component repositories.
[.pdf] [.bib] doi> Roberto Di Cosmo, Michael Lienhardt, Ralf Treinen, Stefano Zacchiroli, Jakub Zwolakowski, Antoine Eiche, Alexis Agahi. Automated Synthesis and Deployment of Cloud Applications. In proceedings of ASE 2014: 29th IEEE/ACM International Conference on Automated Software Engineering, September 15-19, 2014, Vasteras, Sweden. ISBN 978-1-4503-3013-8, pp. 211-222, ACM 2014. Abstract...

Abstract: Complex networked applications are assembled by connecting software components distributed across multiple machines. Building and deploying such systems is a challenging problem which requires a significant amount of expertise: the system architect must ensure that all component dependencies are satisfied, avoid conflicting components, and add the right amount of component replicas to account for quality of service and fault-tolerance. In a cloud environment, one also needs to minimize the virtual resources provisioned upfront, to reduce the cost of operation. Once the full architecture is designed, it is necessary to correctly orchestrate the deployment phase, to ensure all components are started and connected in the right order. We present a toolchain that automates the assembly and deployment of such complex distributed applications. Given as input a high-level specification of the desired system, the set of available components together with their requirements, and the maximal amount of virtual resources to be committed, it synthesizes the full architecture of the system, placing components in an optimal manner using the minimal number of available machines, and automatically deploys the complete system in a cloud environment.
[.pdf] [.bib] doi> Matthieu Caneill, Stefano Zacchiroli. Debsources: Live and Historical Views on Macro-Level Software Evolution. In proceedings of ESEM 2014: 8th International Symposium on Empirical Software Engineering and Measurement, September 18-19, 2014, Torino, Italy. ISBN 978-1-4503-2774-9, ACM 2014. Abstract...

Abstract: Context. Software evolution has been an active field of research in recent years, but studies on macro-level software evolution---i.e., on the evolution of large software collections over many years---are scarce, despite the increasing popularity of intermediate vendors as a way to deliver software to final users. Goal. We want to ease the study of both day-by-day and long-term Free and Open Source Software (FOSS) evolution trends at the macro-level, focusing on the Debian distribution as a proxy of relevant FOSS projects. Method. We have built Debsources, a software platform to gather, search, and publish on the Web all the source code of Debian and measures about it. We have set up a public Debsources instance at http://sources.debian.net, integrated it into the Debian infrastructure to receive live updates of new package releases, and written plugins to compute popular source code metrics. We have injected all current and historical Debian releases into it. Results. The obtained dataset and Web portal provide both long term-views over the past 20 years of FOSS evolution and live insights on what is happening at sub-day granularity. By writing simple plugins (~100 lines of Python each) and adding them to our Debsources instance we have been able to easily replicate and extend past empirical analyses on metrics as diverse as lines of code, number of packages, and rate of change---and make them perennial. We have obtained slightly different results than our reference study, but confirmed the general trends and updated them in light of 7 extra years of evolution history. Conclusions. Debsources is a flexible platform to monitor large FOSS collections over long periods of time. Its main instance and dataset are valuable resources for scholars interested in macro-level software evolution.
[.pdf] [.bib] doi> Michel Catan, Roberto Di Cosmo, Antoine Eiche, Tudor A. Lascu, Michael Lienhardt, Jacopo Mauro, Ralf Treinen, Stefano Zacchiroli, Gianluigi Zavattaro, Jakub Zwolakowski. Aeolus: Mastering the Complexity of Cloud Application Deployment. In proceedings of ESOCC 2013: Service-Oriented and Cloud Computing, 2nd European Conference, Málaga, Spain, September 11-13, 2013. LNCS 8135, pp. 1-3, Springer-Verlag, 2013. Abstract...

Abstract: Cloud computing offers the possibility to build sophisticated software systems on virtualized infrastructures at a fraction of the cost necessary just few years ago, but deploying/maintaining/reconfiguring such software systems is a serious challenge. The main objective of the Aeolus project, an initiative funded by ANR (the French "Agence Nationale de la Recherche"), is to tackle the scientific problems that need to be solved in order to ease the problem of efficient and cost-effective deployment and administration of the complex distributed architectures which are at the heart of cloud applications.
[.pdf] [.bib] doi> Roberto Di Cosmo, Ralf Treinen, Stefano Zacchiroli. Formal Aspects of Free and Open Source Software Components. In proceedings of FMCO 2012: HATS International School on Formal Models for Components and Objects, Bertinoro, Italy, 24-28 September 2012. LNCS 7866, pp. 216-239, Springer-Verlag, 2013. Abstract...

Abstract: Free and Open Source Software (FOSS) distributions are popular solutions to deploy and maintain software on server, desktop, and mobile computing equipment. The typical deployment method in the FOSS setting relies on software distributions as vendors, packages as independently deployable components, and package managers as upgrade tools. We review research results from the past decade that apply formal methods to the study of inter-component relationships in the FOSS context. We discuss how those results are being used to attack both issues faced by users, such as dealing with upgrade failures on target machines, and issues important to distributions such as quality assurance processes for repositories containing tens of thousands, rapidly evolving software packages.
[.pdf] [.bib] doi> Roberto Di Cosmo, Jacopo Mauro, Stefano Zacchiroli, Gianluigi Zavattaro. Component Reconfiguration in the Presence of Conflicts. In proceedings of ICALP 2013: 40th International Colloquium on Automata, Languages and Programming, Riga, Latvia, 8-12 July, 2013. LNCS 7966, pp. 187-198, Springer-Verlag, 2013. Abstract...

Abstract: Components are traditionally modeled as black-boxes equipped with interfaces that indicate provided/required ports and, often, also conflicts with other components that cannot coexist with them. In modern tools for automatic system management, components become grey-boxes that show relevant internal states and the possible actions that can be acted on the components to change such state during the deployment and reconfiguration phases. However, state-of-the-art tools in this field do not support a systematic management of conflicts. In this paper we investigate the impact of conflicts by precisely characterizing the increment of complexity on the reconfiguration problem.
[.pdf] [.bib] doi> Cyrille Valentin Artho, Kuniyasu Suzaki, Roberto Di Cosmo, Ralf Treinen, Stefano Zacchiroli. Why do software packages conflict?. In proceedings of MSR 2012: 9th IEEE Working Conference on Mining Software Repositories, co-located with ICSE 2012, IEEE, ISBN 978-1-4673-1760-3, pp. 141-150. 2-3 June 2012, Zurich, Switzerland. Abstract...

Abstract: Determining whether two or more packages cannot be installed together is an important issue in the quality assurance process of package-based distributions. Unfortunately, the sheer number of different configurations to test makes this task particularly challenging, and hundreds of such incompatibilities go undetected by the normal testing and distribution process until they are later reported by a user as bugs that we call "conflict defects". We performed an extensive case study of conflict defects extracted from the bug tracking systems of Debian and Red Hat. According to our results, conflict defects can be grouped into five main categories. We show that with more detailed package meta-data, about 30% of all conflict defects could be prevented relatively easily, while another 30% could be found by targeted testing of packages that share common resources or characteristics. These results allow us to make precise suggestions on how to prevent and detect conflict defects in the future.
[.pdf] [.bib] doi> Roberto Di Cosmo, Stefano Zacchiroli, Gianluigi Zavattaro. Towards a Formal Component Model for the Cloud. In proceedings of SEFM 2012: 10th International Conference on Software Engineering and Formal Methods, Thessaloniki, Greece, 1-5 October, 2012. LNCS 7504, ISBN 978-3-642-33825-0, pp. 156-171, Springer-Verlag, 2012. Abstract...

Abstract: We consider the problem of deploying and (re)configuring resources in a "cloud" setting, where interconnected software components and services can be deployed on clusters of heterogeneous (virtual) machines that can be created and connected on-the-fly. We introduce the Aeolus component model to capture similar scenarii from realistic cloud deployments, and instrument automated planning of day-to-day activities such as software upgrade planning, service deployment, elastic scaling, etc. We formalize the model and characterize the feasibility and complexity of configuration achievability in Aeolus.
[.pdf] [.bib] doi> Pietro Abate, Roberto Di Cosmo, Ralf Treinen, Stefano Zacchiroli. Learning from the Future of Component Repositories. In proceedings of CBSE 2012: 15th International ACM SIGSOFT Symposium on Component Based Software Engineering, Bertinoro, Italy, June 26-28, 2012. ISBN 978-1-4503-1345-2, pp. 51-60, ACM 2012. Award: ACM SIGSOFT Distinguished Paper Award. Abstract...

Abstract: An important aspect of the quality assurance of large component repositories is the logical coherence of component metadata. We argue that it is possible to identify certain classes of such problems by checking relevant properties of the possible future repositories into which the current repository may evolve. In order to make a complete analysis of all possible futures effective however, one needs a way to construct a finite set of representatives of this infinite set of potential futures. We define a class of properties for which this can be done. We illustrate the practical usefulness of the approach with two quality assurance applications: (i) establishing the amount of "forced upgrades" induced by introducing new versions of existing components in a repository, and (ii) identifying outdated components that need to be upgraded in order to ever be installable in the future. For both applications we provide experience reports obtained on the Debian distribution.
[.pdf] [.bib] doi> Pietro Abate, Roberto Di Cosmo, Ralf Treinen, Stefano Zacchiroli. MPM: a modular package manager. In proceedings of CBSE 2011: 14th International ACM SIGSOFT Symposium on Component Based Software Engineering, Boulder, Colorado, USA, 21-23 June, 2011. ISBN 978-1-4503-0723-9, pp. 179-188, ACM 2011. Award: ACM SIGSOFT Distinguished Paper Award. Abstract...

Abstract: Software distributions in the FOSS world rely on so-called package managers for the installation and removal of packages on target machines. State-of-the-art package managers are monolithic in architecture, and each of them is hard-wired to an ad-hoc dependency solver implementing a customized heuristics. In this paper we propose a modular architecture allowing for pluggable dependency solvers and backends. We argue that this is the path that leads to the next generation of package managers that will deliver better results, accept more expressive input languages, and can be easily adaptable to new platforms. We present a working prototype, called MPM, which has been implemented following the design advocated in this paper.
[.pdf] [.bib] doi> Roberto Di Cosmo, Stefano Zacchiroli. Feature Diagrams as Package Dependencies. In proceedings of SPLC 2010: 14th International Software Product Line Conference, Jeju Island, South Korea, 13-17 September 2010. LNCS 6287, ISBN 978-3-642-15578-9, pp. 476-480, Springer-Verlag, 2010. Abstract...

Abstract: FOSS (Free and Open Source Software) distributions use dependencies and package managers to maintain huge collections of packages and their installations; recent research have led to efficient and complete configuration tools and techniques, based on state of the art solvers, that are being adopted in industry. We show how to encode a significant subset of Free Feature Diagrams as interdependent packages, enabling to reuse package tools and research results into software product lines.
[.pdf] [.bib] doi> Lucas Nussbaum, Stefano Zacchiroli. The Ultimate Debian Database: Consolidating Bazaar Metadata for Quality Assurance and Data Mining. In proceedings of MSR 2010: 7th IEEE Working Conference on Mining Software Repositories, co-located with ICSE 2010, IEEE, ISBN 978-1-4244-6802-7, pp. 52-61. 2-3 May 2010, Cape Town, South Africa. Abstract...

Abstract: FLOSS distributions like RedHat and Ubuntu require a lot more complex infrastructures than most other FLOSS projects. In the case of community-driven distributions like Debian, the development of such an infrastructure is often not very organized, leading to new data sources being added in an impromptu manner while hackers set up new services that gain acceptance in the community. Mixing and matching data is then harder than should be, albeit being badly needed for Quality Assurance and data mining. Massive refactoring and integration is not a viable solution either, due to the constraints imposed by the bazaar development model. This paper presents the Ultimate Debian Database (UDD), which is the countermeasure adopted by the Debian project to the above "data hell". UDD gathers data from various data sources into a single, central SQL database, turning Quality Assurance needs that could not be easily implemented before into simple SQL queries. The paper also discusses the customs that have contributed to the data hell, the lessons learnt while designing UDD, and its applications and potentialities for data mining on FLOSS distributions.
[.pdf] [.bib] doi> Gabriele D'Angelo, Fabio Vitali, Stefano Zacchiroli. Content Cloaking: Preserving Privacy with Google Docs and other Web Applications. In proceedings of ACM SAC 2010: 25th Annual ACM Symposium on Applied Computing, ISBN 978-1-60558-639-7, pp. 826-830. 22-26 March 2010, Sierre, Switzerland. Abstract...

Abstract: Web office suites such as Google Docs offer unparalleled collaboration experiences in terms of low software requirements, ease of use, data ubiquity, and availability. When the data holder (Google, Microsoft, etc.) is not perceived as trusted though, those benefits are considered at stake with important privacy requirements. Content cloaking is a lightweight, cryptographic, client-side solution to protect content from data holders while using web office suites and other "Web 2.0", AJAX-based, collaborative applications.
[.pdf] [.bib] doi> Pietro Abate, Jaap Boender, Roberto Di Cosmo, Stefano Zacchiroli. Strong Dependencies between Software Components. In proceedings of ESEM 2009: 3rd International Symposium on Empirical Software Engineering and Measurement, ISBN 978-1-4244-4842-5, pp. 89-99. October 15-16, 2009 - Lake Buena Vista, Florida, USA. Abstract...

Abstract: Component-based systems often describe context requirements in terms of explicit inter-component dependencies. Studying large instances of such systems, such as free and open source software (FOSS) distributions, in terms of declared dependencies between packages is appealing. It is however also misleading when the language to express dependencies is as expressive as boolean formulae, which is often the case. In such settings, a more appropriate notion of component dependency exists: strong dependency. This paper introduces such notion as a first step towards modeling semantic, rather then syntactic, inter-component relationships. Furthermore, a notion of component sensitivity is derived from strong dependencies, with applications to quality assurance and to the evaluation of upgrade risks. An empirical study of strong dependencies and sensitivity is presented, in the context of one of the largest, freely available, component-based system.
[.pdf] [.bib] doi> Antonio Cicchetti, Davide Di Ruscio, Patrizio Pelliccione, Alfonso Pierantonio, Stefano Zacchiroli. A Model Driven Approach to Upgrade Package-Based Software Systems. In proceedings of ENASE 2009: 4th international conference on Evaluation of Novel Aspects to Software Engineering; held in conjunction with ICEIS 2009. 6-10 May 2009, Milan, Italy. CCIS volume 69, ISBN 978-3-642-14818-7, pp. 262-276, Springer-Verlag, 2010. Abstract...

Abstract: Complex software systems are more and more based on the abstraction of package, brought to popularity by Free and Open Source Software (FOSS) distributions. While helpful as an encapsulation layer, packages do not solve all problems of deployment, and more generally of management, of large software collections. In particular upgrades, which often affect several packages at once due to inter-package dependencies, often fail and do not hold good transactional properties. This paper shows how to apply model driven techniques to describe and manage software upgrades of FOSS distributions. It is discussed how to model static and dynamic aspects of package upgrades, the latter being the most challenging aspect to deal with, in order to be able to predict common causes of upgrade failures and undo residual effects of failed or undesired upgrades.
[.pdf] [.bib] doi> Angelo Di Iorio, Davide Rossi, Fabio Vitali, Stefano Zacchiroli. Where are your Manners? Sharing Best Community Practices in the Web 2.0. In proceedings of ACM SAC 2009: the 24th Annual ACM Symposium on Applied Computing. ISBN 978-1-60558-166-8, pp. 681-687, ACM. Abstract...

Abstract: The Web 2.0 fosters the creation of communities by offering users a wide array of social software tools. But, while the success of these tools is based on their ability to support different interaction patterns among users by imposing as less limitations as possible, the communities they support are not free of rules (just think about the posting rules in a community forum or the editing rules in a thematic wiki). In this paper we propose a framework for the sharing of best community practices in the form of a (potentially rule-based) annotation layer that can be integrated with existing Web 2.0 community tools (with specific focus on wikis). This solution is characterized by minimal intrusiveness and plays nicely within the open spirit of the Web 2.0 by proving users with behavioral hints rather than by enforcing the strict adherence to a set of rules.
[.pdf] [.bib] doi> Angelo Di Iorio, Fabio Vitali, Stefano Zacchiroli. Wiki Content Templating. In Proceedings of WWW 2008: 17th International World Wide Web Conference. April 21-25, 2008 Beijing, China. ACM ISBN 978-1-60558-085-2/08/04, pp. 615-624. Abstract...

Abstract: Wiki content templating enables reuse of content structures among wiki pages. In this paper we present a thorough study of this widespread feature, showing how its two state of the art models (functional and creational templating) are sub-optimal. We then propose a third, better, model called lightly constrained (LC) templating and show its implementation in the Moin wiki engine. We also show how LC templating implementations are the appropriate technologies to push forward semantically rich web pages on the lines of (lowercase) semantic web and microformats.
[.pdf] [.bib] doi> Claudio Sacerdoti Coen, Stefano Zacchiroli. Spurious Disambiguation Error Detection. In Proceedings of MKM 2007: The 6th International Conference on Mathematical Knowledge Management. Hagenberg, Austria -- 27-30 June 2007. LNAI 4573, Springer Berlin / Heidelberg, ISBN 978-3-540-73083-5, pp. 381-392, 2007. Abstract...

Abstract: The disambiguation approach to the input of formulae enables the user to type correct formulae in a terse syntax close to the usual ambiguous mathematical notation. When it comes to incorrect formulae we want to present only errors related to the interpretation meant by the user, hiding errors related to other interpretations (spurious errors). We propose a heuristic to recognize spurious errors, which has been integrated with the disambiguation algorithm of [1].
[.pdf] [.bib] doi> Andrea Asperti, Claudio Sacerdoti Coen, Enrico Tassi, Stefano Zacchiroli. Crafting a Proof Assistant. In Proceedings of Types 2006: Types for Proofs and Programs. Nottingham, UK -- April 18-21, 2006. LNCS 4502, Springer Berlin / Heidelberg, ISBN 978-3-540-74463-4, pp. 18-32, 2007. Abstract...

Abstract: Proof assistants are complex applications whose development has never been properly systematized or documented. This work is a contribution in this direction, based on our experience with the development of Matita: a new interactive theorem prover based, as Coq, on the Calculus of Inductive Constructions (CIC). In particular, we analyze its architecture focusing on the dependencies of its components, how they implement the main functionalities, and their degree of reusability. The work is a first attempt to provide a ground for a more direct comparison between different systems and to highlight the common functionalities, not only in view of reusability but also to encourage a more systematic comparison of different softwares and architectural solutions.
[.pdf] [.bib] doi> Luca Padovani, Stefano Zacchiroli. From Notation to Semantics: There and Back Again. In Proceedings of MKM 2006: The 5th International Conference on Mathematical Knowledge Management. Wokingham, UK -- August 11-12, 2006. LNAI 4108, Springer Berlin / Heidelberg, ISBN 978-3-540-37104-5, pp. 194-207, 2006. Abstract...

Abstract: Mathematical notation is a structured, open, and ambiguous language. In order to support mathematical notation in MKM applications one must necessarily take into account presentational as well as semantic aspects. The former are required to create a familiar, comfortable, and usable interface to interact with. The latter are necessary in order to process the information meaningfully. In this paper we investigate a framework for dealing with mathematical notation in a meaningful, extensible way, and we show an effective instantiation of its architecture to the field of interactive theorem proving. The framework builds upon well-known concepts and widely-used technologies and it can be easily adopted by other MKM applications.
[.pdf] [.bib] doi> Andrea Asperti, Ferruccio Guidi, Claudio Sacerdoti Coen, Enrico Tassi, Stefano Zacchiroli. A Content Based Mathematical Search Engine: Whelp. In Proceedings of TYPES 2004: Types for Proofs and Programs. Paris, France -- December 15-18, 2004. LNCS 3839, Springer Berlin / Heidelberg, ISBN 3-540-31428-8, pp. 17-32, 2006. Abstract...

Abstract: The prototype of a content based search engine for mathematical knowledge supporting a small set of queries requiring matching and/or typing operations is described. The prototype, called Whelp, exploits a metadata approach for indexing the information that looks far more flexible than traditional indexing techniques for structured expressions like substitution, discrimination, or context trees. The prototype has been instantiated to the standard library of the Coq proof assistant extended with many user contributions.
[.pdf] [.bib] doi> Luca Padovani, Claudio Sacerdoti Coen, Stefano Zacchiroli. A Generative Approach to the Implementation of Language Bindings for the Document Object Model. In Proceedings of GPCE'04 3rd International Conference on Generative Programming and Component Engineering. Vancouver, Canada -- October 24-28, 2004 LNCS 3286, Springer Berlin / Heidelberg, ISBN 3-540-23580-9, pp. 469-487, 2004. Abstract...

Abstract: The availability of a C implementation for the Document Object Model (DOM) gives the interesting opportunity of generating bindings for different programming languages automatically. Because of the DOM bias towards Java-like languages, a C implementation that fakes objects, inheritance, polymorphism, exceptions and uses reference-counting introduces a gap between the API specification and its actual implementation that the bindings should try to close. In this paper we overview the generative approach in this particular context and apply it for the generation of C++ and OCaml bindings.
[.pdf] [.bib] Claudio Sacerdoti Coen, Stefano Zacchiroli. Efficient Ambiguous Parsing of Mathematical Formulae. In Proceedings of MKM 2004: 3rd International Conference on Mathematical Knowledge Management. September 19-21, 2004 Bialowieza - Poland. LNCS 3119, Springer Berlin / Heidelberg, ISBN 3-540-23029-7, pp. 347-362, 2004. Abstract...

Abstract: Mathematical notation has the characteristic of being ambiguous: operators can be overloaded and information that can be deduced is often omitted. Mathematicians are used to this ambiguity and can easily disambiguate a formula making use of the context and of their ability to find the right interpretation. Software applications that have to deal with formulae usually avoid these issues by fixing an unambiguous input notation. This solution is annoying for mathematicians because of the resulting tricky syntaxes and becomes a show stopper to the simultaneous adoption of tools characterized by different input languages. In this paper we present an efficient algorithm suitable for ambiguous parsing of mathematical formulae. The only requirement of the algorithm is the existence of a validity predicate over abstract syntax trees of incomplete formulae with placeholders. This requirement can be easily fulfilled in the applicative area of interactive proof assistants, and in several other areas of Mathematical Knowledge Management.

international, peer-reviewed workshop proceedings

[.pdf] [.bib] doi> Luís Soeiro, Thomas Robert, Stefano Zacchiroli. Assessing the Threat Level of Software Supply Chains with the Log Model. In proceedings of 2023 IEEE International Conference on Big Data (BigData) - 6th Annual Workshop on Cyber Threat Intelligence and Hunting (CyberHunt 2023), Sorrento, Italy, pp. 3079-3088. IEEE, 2023. Abstract...

Abstract: The use of free and open source software (FOSS) components in all software systems is estimated to be above 90%. With such high usage and because of the heterogeneity of FOSS tools, repositories, developers and ecosystem, the level of complexity of managing software development has also increased. This has amplified both the attack surface for malicious actors and the difficulty of making sure that the software products are free from threats. The rise of security incidents involving high profile attacks is evidence that there is still much to be done to safeguard software products and the FOSS supply chain.Software Composition Analysis (SCA) tools and the study of attack trees help with improving security. However, they still lack the ability to comprehensively address how interactions within the software supply chain may impact security.This work presents a novel approach of assessing threat levels in FOSS supply chains with the log model. This model provides information capture and threat propagation analysis that not only account for security risks that may be caused by attacks and the usage of vulnerable software, but also how they interact with the other elements to affect the threat level for any element in the model.
[.pdf] [.bib] Pietro Abate, Roberto Di Cosmo, Louis Gesbert, Fabrice Le Fessant, Stefano Zacchiroli. Using Preferences to Tame your Package Manager. In proceedings of OCaml 2014: The OCaml Users and Developers Workshop, September 5, 2014, Gothenburg, Sweden. Co-located with ICFP 2014. 2014. Abstract...

Abstract: Determining whether some components can be installed on a system is a complex problem: not only it is NP-complete in the worst case, but there can also be exponentially many solutions to it. Ordinary package managers use ad-hoc heuristics to solve this installation problem and choose a particular solution, making extremely difficult to change or sidestep these heuristics when the result is not the one we expect. When software repositories become complex enough, one gets vastly superior results by delegating dependency handling to a specialised solver, and use optimisation functions (or preferences) to control the class of solutions that are found. The opam package manager relies on the CUDF pivot format, which allows OCaml users that have a CUDF-compliant solver on their machine to reap the benefits of preferences-based dependency resolution. Thanks to the solver farm provided by Irill, these benefits are now extended to the OCaml community at large. In this talk we will present the preferences language and explain how to use it.
[.pdf] [.bib] Cyrille Valentin Artho, Roberto Di Cosmo, Kuniyasu Suzaki, Stefano Zacchiroli. Sources of Inter-package Conflicts in Debian. In proceedings of LoCoCo 2011 International Workshop on Logics for Component Configuration, affiliated with CP 2011 Abstract...

Abstract: Inter-package conflicts require the presence of two or more packages in a particular configuration, and thus tend to be harder to detect and localize than conventional (intra-package) defects. Hundreds of such inter-package conflicts go undetected by the normal testing and distribution process until they are later reported by a user. The reason for this is that current meta-data is not fine-grained and accurate enough to cover all common types of conflicts. A case study of inter-package conflicts in Debian has shown that with more detailed package meta-data, at least one third of all package conflicts could be prevented relatively easily, while another one third could be found by targeted testing of packages that share common resources or characteristics. This paper reports the case study and proposes ideas to detect inter-package conflicts in the future.
[.pdf] [.bib] doi> Ralf Treinen, Stefano Zacchiroli. Expressing Advanced User preferences in Component Installation. In proceedings of IWOCE 2009: International Workshop on Open Component Ecosystem, affiliated with ESEC/FSE 2009. Foundations of Software Engineering, ISBN 978-1-60558-677-9, pp. 31-40, ACM 2009. Abstract...

Abstract: State of the art component-based software collections, such as FOSS distributions, are made of up to dozens of thousands components, with complex inter-dependencies and conflicts. Given a particular installation of such a system, each request to alter the set of installed components has potentially (too) many satisfying answers. We present an architecture that allows to express advanced user preferences about package selection in FOSS distributions. The architecture is composed by a distribution-independent format for describing available and installed packages called CUDF (Common Upgradeability Description Format), and a foundational language called MooML to specify optimization criteria. We present the syntax and semantics of CUDF and MooML, and discuss the partial evaluation mechanism of MooML which allows to gain efficiency in package dependency solvers.
[.pdf] [.bib] doi> Davide Di Ruscio, Patrizio Pelliccione, Alfonso Pierantonio, Stefano Zacchiroli. Towards maintainer script modernization in FOSS distributions. In proceedings of IWOCE 2009: International Workshop on Open Component Ecosystem, affiliated with ESEC/FSE 2009. Foundations of Software Engineering, ISBN 978-1-60558-677-9, pp. 11-20, ACM 2009. Abstract...

Abstract: Free and Open Source Software (FOSS) distributions are complex software systems, made of thousands packages that evolve rapidly, independently, and without centralized coordination. During packages upgrades, corner case failures can be encountered and are hard to deal with, especially when they are due to misbehaving maintainer scripts: executable code snippets used to finalize package configuration. In this paper we report a software modernization experience, the process of representing existing legacy systems in terms of models, applied to FOSS distributions. We present a process to define meta-models that enable dealing with upgrade failures and help rolling back from them, taking into account maintainer scripts. The process has been applied to widely used FOSS distributions and we report about such experiences.
[.pdf] [.bib] doi> Roberto Di Cosmo, Paulo Trezentos, Stefano Zacchiroli. Package Upgrades in FOSS Distributions: Details and Challenges. In proceedings of HotSWUp'08: Hot Topics in Software Upgrades. October 20, 2008, Nashville, Tennessee, USA. ACM ISBN 978-1-60558-304-4. Abstract...

Abstract: The upgrade problems faced by Free and Open Source Software distributions have characteristics not easily found elsewhere. We describe the structure of packages and their role in the upgrade process. We show that state of the art package managers have shortcomings inhibiting their ability to cope with frequent upgrade failures. We survey current countermeasures to such failures, argue that they are not satisfactory, and sketch alternative solutions.
[.pdf] [.bib] Paolo Marinelli, Fabio Vitali, Stefano Zacchiroli. Streaming Validation of Schemata: the Lazy Typing Discipline. In Proceedings of Extreme Markup Languages 2007: The Markup Theory and Practice Conference. August 7-10, 2007 Montreal, Canada. Abstract...

Abstract: Assertions, identity constraints, and conditional type assignments are (planned) features of XML Schema which rely on XPath evaluation to various ends. The allowed XPath subset exploitable in those features is trimmed down for streamability concerns partly understandable (the apparent wish to avoid buffering to determine the evaluation of an expression) and partly artificial. In this paper we dissect the XPath language in subsets with varying streamability characteristics. We also identify the larger subset which is compatible with the typing discipline we believe underlies some of the choices currently present in the XML Schema specifications. We describe such a discipline as imposing that the type of an element has to be decided when its start tag is encountered and its validity has to be when its end tag is. We also propose an alternative lazy typing discipline where both type assignment and validity assessment are fired as soon as they are available in a best effort manner. We believe our discipline is more flexible and delegate to schema authors the choice of where to place in the trade-off between using larger XPath subsets and increasing buffering requirements or expeditiousness of typing information availability.
[.pdf] [.bib] Paolo Marinelli, Stefano Zacchiroli. Co-Constraint Validation in a Streaming Context. In Proceedings of XML 2006: The world's oldest and biggest XML conference. Award: Winner of the XML Scholarship 2006 as best student paper. Boston, MA -- December 5-7, 2006. Abstract...

Abstract: In many use cases applications are bound to be run consuming only a limited amount of memory. When they need to validate large XML documents, they have to adopt streaming validation, which does not rely on an in-memory representation of the whole input document. In order to validate an XML document, different kinds of constraints need to be verified. Co-constraints, which relate the content of elements to the presence and values of other attributes or elements, are one such kind of constraints. In this paper we propose an approach to the problem of validating in a streaming fashion an XML document against a schema also specifying co-constraints. We describe how the streaming evaluation of co-constraints influences the output of the validation process. Our proposal makes use of the validation language SchemaPath, a light extension to XML Schema, adding conditional type assignment for the support of co-constraints. The paper is based on the description of our streaming SchemaPath validator.
[.pdf] [.bib] doi> Claudio Sacerdoti Coen, Enrico Tassi, Stefano Zacchiroli. Tinycals: Step by Step Tacticals. In Proceedings of UITP 2006: User Interfaces for Theorem Provers. Seattle, WA -- August 21, 2006. ENTCS (Elsevier, ISSN 1571-0661), volume 174, issue 2, pp. 125-142. May 2007. Abstract...

Abstract: Most of the state-of-the-art proof assistants are based on procedural proof languages, scripts, and rely on LCF tacticals as the primary tool for tactics composition. In this paper we discuss how these ingredients do not interact well with user interfaces based on the same interaction paradigm of Proof General (the de facto standard in this field), identifying in the coarse-grainedness of tactical evaluation the key problem. We propose Tinycals as an alternative to a subset of LCF tacticals, showing that the user does not experience the same problem if tacticals are evaluated in a more fine-grained manner. We present the formal operational semantics of tinycals as well as their implementation in the Matita proof assistant.
[.pdf] [.bib] doi> Angelo Di Iorio, Stefano Zacchiroli. Constrained Wiki: an Oxymoron?. In Proceedings of WikiSym 2006: the 2006 International Symposium on Wikis. Odense, Denmark -- August 21-23, 2006. ACM, 2006, ISBN 1-59593-413-8, pp. 89-98. Abstract...

Abstract: In this paper we propose a new wiki concept -- light constraints -- designed to encode community best practices and domain-specific requirements, and to assist in their application. While the idea of constraining user editing of wiki content seems to inherently contradict "The Wiki Way", it is well-known that communities of users involved in wiki sites have the habit of establishing best authoring practices. For domain-specific wiki systems which process wiki content, it is often useful to enforce some well-formedness conditions on specific page contents. This paper describes a general framework to think about the interaction of wiki system with constraints, and presents a generic architecture which can be easily incorporated into existing wiki systems to exploit the capabilities enabled by light constraints.
[.pdf] [.bib] Andrea Asperti, Stefano Zacchiroli. Searching Mathematics on the Web: State of the Art and Future Developments. In Proceedings of New Developments in Electronic Publishing AMS/SMM Special Session, Houston, May 2004 ECM4 Satellite Conference, Stockholm, June 2004 pp. 9-18. FIZ Karlsruhe, ISBN 3-88127-107-4. Abstract...

Abstract: A huge amount of mathematical knowledge is nowadays available on the World Wide Web. Many different solutions and technologies for searching that knowledge have been developed as well. We present the state of the art of searching mathematics on the Web, giving some insight on future developments in this area.
[.pdf] [.bib] Claudio Sacerdoti Coen, Stefano Zacchiroli. Brokers and Web-Services for Automatic Deduction: a Case Study. In Proceedings of Calculemus 2003: 11th Symposium on the Integration of Symbolic Computation and Mechanized Reasoning. Roma, Italy -- September 10-12, 2003, Aracne Editrice. ISBN 88-7999-545-6, pp. 43-57, 2003. Abstract...

Abstract: We present a planning broker and several Web-Services for automatic deduction. Each Web-Service implements one of the tactics usually available in interactive proof-assistants. When the broker is submitted a proof status (an incomplete proof tree and a focus on an open goal) it dispatches the proof to the Web-Services, collects the successful results, and send them back to the client as hints as soon as they are available. In our experience this architecture turns out to be helpful both for experienced users (who can take benefit of distributing heavy computations) and beginners (who can learn from it).

national, peer-reviewed journal articles

[.pdf] [.bib] Mehdi Dogguy, Stéphane Glondu, Sylvain Le Gall, Stefano Zacchiroli. Enforcing Type-Safe Linking using Inter-Package Relationships. In Studia Informatica Universalis, volume 9, issue 1, pp. 129-157. ISSN 1625-7545, Hermann 2011. Abstract...

Abstract: Strongly-typed languages rely on link-time checks to ensure that type safety is not violated at the borders of compilation units. Such checks entail very fine-grained dependencies among compilation units, which are at odds with the implicit assumption of backward compatibility that is relied upon by common library packaging techniques adopted by FOSS (Free and Open Source Software) package-based distributions. As a consequence, package managers are often unable to prevent users to install a set of libraries which cannot be linked together. We discuss how to guarantee link-time compatibility using inter-package relationships; in doing so, we take into account real-life maintainability problems such as support for automatic package rebuild and manageability of ABI (Application Binary Interface) strings by humans. We present the dh_ocaml implementation of the proposed solution, which is currently in use in the Debian distribution to safely deploy more than 300 OCaml-related packages.

national, peer-reviewed conference and workshop procedings

[.pdf] [.bib] Antoine Pietri, Stefano Zacchiroli. Towards Universal Software Evolution Analysis. In proceedings of BENEVOL 2018: The 17th Belgium-Netherlands Software Evolution Workshop, Delft, Netherlands, December 2018. CEUR Workshop Proceedings (CEUR-WS) vol. 2361 pp. 6-10, ISSN 1613-0073. Abstract...

Abstract: Software evolution studies have mostly focused on individual software products, generally developed as Free/Open Source Software (FOSS) projects, and more sparingly on software collections like component and package ecosystems. We argue in this paper that the next step in this organic scale expansion is universal software evolution analysis, i.e., the study of software evolution at the scale of the whole body of publicly available software. We consider the case of Software Heritage, the largest existing archive of publicly available software source code artifacts (more than 5 B unique files archived and 1 B commits, coming from more than 80 M software projects). We propose research requirements that would allow to leverage the Software Heritage archive to study universal software evolution. We discuss the challenges that need to be overcome to address such requirements and outline a research roadmap to do so.
[.pdf] [.bib] Mehdi Dogguy, Stéphane Glondu, Sylvain Le Gall, Stefano Zacchiroli. Enforcing Type-Safe Linking using Inter-Package Relationships. In proceedings of JFLA 2010: 21st Journée Francophones des Langages Applicatifs, pp. 29-54. 30/01-02/02/2010 - La Ciotat, France. Abstract...

Abstract: Strongly-typed languages rely on link-time checks to ensure that type safety is not violated at the borders of compilation units. Such checks entail very fine-grained dependencies among compilation units, which are at odds with the implicit assumption of backward compatibility that is relied upon by common library packaging techniques adopted by FOSS (Free and Open Source Software) package-based distributions. As a consequence, package managers are often unable to prevent users to install a set of libraries which cannot be linked together. We discuss how to guarantee link-time compatibility using inter-package relationships; in doing so, we take into account real-life maintainability problems such as support for automatic package rebuild and manageability of ABI (Application Binary Interface) strings by humans. We present the dh_ocaml implementation of the proposed solution, which is currently in use in the Debian distribution to safely deploy more than 300 OCaml-related packages.

technical reports

[.pdf] [.bib] doi> Kevin Wellenzohn, Michael H. Böhlen, Sven Helmer, Antoine Pietri, Stefano Zacchiroli. Robust and Scalable Content-and-Structure Indexing (Extended Version). arXiv technical report 2209.05126, 2022. Abstract...

Abstract: Frequent queries on semi-structured hierarchical data are Content-and-Structure (CAS) queries that filter data items based on their location in the hierarchical structure and their value for some attribute. We propose the Robust and Scalable Content-and-Structure (RSCAS) index to efficiently answer CAS queries on big semi-structured data. To get an index that is robust against queries with varying selectivities we introduce a novel dynamic interleaving that merges the path and value dimensions of composite keys in a balanced manner. We store interleaved keys in our trie-based RSCAS index, which efficiently supports a wide range of CAS queries, including queries with wildcards and descendant axes. We implement RSCAS as a log-structured merge (LSM) tree to scale it to data-intensive applications with a high insertion rate. We illustrate RSCAS's robustness and scalability by indexing data from the Software Heritage (SWH) archive, which is the world's largest, publicly-available source code archive.
[.pdf] [.bib] Guillaume Rousseau, Roberto Di Cosmo, Stefano Zacchiroli. Growth and Duplication of Public Source Code over Time: Provenance Tracking at Scale. Inria technical report, 2019. Abstract...

Abstract: We study the evolution of the largest known corpus of publicly available source code, i.e., the Software Heritage archive (4B unique source code files, 1B commits capturing their development histories across 50M software projects). On such corpus we quantify the growth rate of original, never-seen-before source code files and commits. We find the growth rates to be exponential over a period of more than 40 years. We then estimate the multiplication factor, i.e., how much the same artifacts (e.g., files or commits) appear in different contexts (e.g., commits or source code distribution places). We observe a combinatorial explosion in the multiplication of identical source code files across different commits. We discuss the implication of these findings for the problem of tracking the provenance of source code artifacts (e.g., where and when a given source code file or commit has been observed in the wild) for the entire body of publicly available source code. To that end we benchmark different data models for capturing software provenance information at this scale and growth rate. We identify a viable solution that is deployable on commodity hardware and appears to be maintainable for the foreseeable future.
[.pdf] [.bib] Roberto Di Cosmo, Antoine Eiche, Jacopo Mauro, Gianluigi Zavattaro, Stefano Zacchiroli, Jakub Zwolakowski. Automatic Deployment of Software Components in the Cloud with the Aeolus Blender. Inria technical report 2015. Abstract...

Abstract: Cloud computing allows to build sophisticated software sys-tems on virtualized infrastructures at a fraction of the cost that was necessary just a few years ago. The deployment of such complex systems, though, is still a serious issue due to the need of deploying a large number of packages and services, their elaborated interdependencies, and the need to define the (ideally optimal) allocation of software compo-nents onto available computing resources. In this paper we present the Aeolus Blender (Blender in the following), a toolchain that automates the assembly and deployment of complex component-based software systems in the "cloud". By relying on a configuration optimizer and a deployment planner, Blender fully automates the deploy-ment of real-life cloud applications on OpenStack infrastruc-tures, by exploiting a knowledge base of software compo-nents defined in the Mandriva Armonic tool-suite. The final deployment is guaranteed to satisfy not only user require-ments and software dependencies, but also to be optimal with respect to the number of used virtual machines.
[.pdf] [.bib] Roberto Di Cosmo, Michael Lienhardt, Ralf Treinen, Stefano Zacchiroli, Jakub Zwolakowski. Optimal Provisioning in the Cloud. Aeolus project technical report, 7 Juin 2013. Abstract...

Abstract: Complex distributed systems are classically assembled by deploying several existing software components to multiple servers. Building such systems is a challenging problem that requires a significant amount of problem solving as one must i) ensure that all inter-component dependencies are satisfied; ii) ensure that no conflicting components are deployed on the same machine; and iii) take into account replication and distribution to account for quality of service, or possible failure of some services. We propose a tool, Zephyrus, that automates to a great extent assembling complex distributed systems. Given i) a high level specification of the desired system architecture, ii) the set of available components and their requirements) and iii) the current state of the system, Zephyrus is able to generate a formal representation of the desired system, to place the components in an optimal manner on the available machines, and to interconnect them as needed.
[.pdf] [.bib] Roberto Di Cosmo, Jacopo Mauro, Stefano Zacchiroli, Gianluigi Zavattaro. Component reconfiguration in the presence of conflicts. Aeolus project technical report, 22 Avril 2013. Abstract...

Abstract: Components are traditionally modeled as black-boxes equipped with interfaces that indicate provided/required ports and, often, also conflicts with other components that cannot coexist with them. In modern tools for automatic system management, components become grey-boxes that show relevant internal states and the possible actions that can be acted on the components to change such state during the deployment and reconfiguration phases. However, state-of-the-art tools in this field do not support a systematic management of conflicts. In this paper we investigate the impact of conflicts by precisely characterizing the increment of complexity on the reconfiguration problem.
[.pdf] [.bib] Ralf Treinen, Stefano Zacchiroli. Common Upgradeability Description Format (CUDF) 2.0. Mancoosi project technical report 3, 24 November 2009. Abstract...

Abstract: The solver competition which will be organized by Mancoosi relies on the standardized format for describing package upgrade scenarios. This document describes the Common Upgradeability Description Format (CUDF), the document format used to encode upgrade scenarios, abstracting over distribution-specific details. Solvers taking part in the competition will be fed with input in CUDF format. The format is not specific to Mancoosi and is meant to be generally useful to describe upgrade scenarios when abstraction over distribution-specific details is desired.
[.pdf] [.bib] Pietro Abate, Jaap Boender, Roberto Di Cosmo, Stefano Zacchiroli. Strong Dependencies between Software Components. Mancoosi project technical report 2, 22 May 2009. Abstract...

Abstract: Component-based systems often describe context requirements in terms of explicit inter-component dependencies. Studying large instances of such systems, such as free and open source software (FOSS) distributions, in terms of declared dependencies between packages is appealing. It is however also misleading when the language to express dependencies is as expressive as boolean formulae, which is often the case. In such settings, a more appropriate notion of component dependency exists: strong dependency. This paper introduces such notion as a first step towards modeling semantic, rather then syntactic, inter-component relationships. Furthermore, a notion of component sensitivity is derived from strong dependencies, with applications to quality assurance and to the evaluation of upgrade risks. An empirical study of strong dependencies and sensitivity is presented, in the context of one of the largest, freely available, component-based system.
[.pdf] [.bib] Davide Di Ruscio, Patrizio Pelliccione, Alfonso Pierantonio, Stefano Zacchiroli. Metamodel for Describing System Structure and State. Mancoosi project deliverable, D2.1, work package 2. January 2009. Abstract...

Abstract: Today's software systems are very complex modular entities, made up of many interacting components that must be deployed and coexist in the same context. Modern operating systems provide the basic infrastructure for deploying and handling all the components that are used as the basic blocks for building more complex systems even though a generic and comprehensive support is far from being provided. In fact, in Free and Open Source Software (FOSS) systems, components evolve independently from each other and because of the huge amount of available components and their different project origins, it is not easy to manage the life cycle of a distribution. Users are in fact allowed to choose and install a wide variety of alternatives whose consistency cannot be checked a priori to their full extent. It is possible to easily make the system unusable by installing or removing some packages that "break" the consistency of what is installed in the system itself. This document proposes a model-driven approach to simulate system upgrades in advance and to detect predictable upgrade failures, possibly by notifying the user before the system is affected. The approach relies on an abstract representation of the systems and packages which are given in terms of models that are expressive enough to isolate inconsistent configurations (e.g., situations in which installed components rely on the presence of disappeared sub-components) that are currently not expressible as inter-package relationships.
[.pdf] [.bib] Ralf Treinen, Stefano Zacchiroli. Description of the CUDF Format. Mancoosi project deliverable, D5.1, work package 5. November 2008. Abstract...

Abstract: This document contains several related specifications, taken together they describe the document formats related to the solver competition which will be organized by Mancoosi. In particular, this document describes: DUDF (Distribution Upgradeability Description Format), the document format to be used to submit upgrade problem instances from user machines to a (distribution-specific) database of upgrade problems; CUDF (Common Upgradeability Description Format), the document format used to encode upgrade problems, abstracting over distribution-specific details. Solvers taking part in the competition will be fed with input in CUDF format.
[.pdf] [.bib] Luca Padovani, Stefano Zacchiroli. Stream Processing of XML Documents Made Easy with LALR(1) Parser Generators. Technical report UBLCS-2007-23, September 2007, Department of Computer Science, University of Bologna. Abstract...

Abstract: Because of their fully annotated structure, XML documents are normally believed to require a straightforward parsing phase. However, the standard APIs for accessing their content (the Document Object Model and the Simple API for XML) provide a programming interface that is very low-level and is thus inadequate for the recognition of any structure that is not isomorphic to its XML encoding. Even when the document undergoes validation, its unmarshalling into application-specific data using these APIs requires poorly maintainable, tedious-to-write, and possibly inefficient code. We describe a technique for the simultaneous parsing, validation, and unmarshalling of XML documents that combines a stream-oriented XML parser with a LALR(1) parser in order to guarantee efficient stream processing, expressive validation capabilities, and the possibility to associate user-provided actions with specific patterns occurring in the source documents.
[.pdf] [.bib] Angelo Di Iorio, Fabio Vitali, Stefano Zacchiroli. Templating Wiki Content for Fun and Profit. Technical report UBLCS-2007-21, August 2007, Department of Computer Science, University of Bologna. Abstract...

Abstract: Content templating enables reuse of content structures between wiki pages. Such a feature is implemented in several mainstream wiki engines. Systematic study of its conceptual models and comparison of the available implementations are unfortunately missing in the wiki literature. In this paper we aim to fill this gap first analyzing template-related user needs, and then reviewing existing approaches at content templating. Our investigation shows that two models emerge, functional and creational templating, and that both have weakness failing to properly fit in "The Wiki Way". As a solution, we propose the adoption of creational templates enriched with light constraints, showing that such a solution has a low implementative footprint in state-of-the-art wiki engines, and that it has a synergy with semantic wikis.

dissertations

[.pdf] [.bib] Stefano Zacchiroli. Large-scale Modeling, Analysis, and Preservation of Free and Open Source Software. HDR (Habilitation à diriger des recherches) dissertation, defended publicly on 27 November 2017, at Université Paris Diderot, France, before a jury composed of: Ahmed Bouajjani, Carlo Ghezzi, Jesus M. Gonzalez-Barahona, Roberto Di Cosmo, Jean-Bernard Stefani, Diomidis Spinellis, Andreas Zeller.
[.pdf] [.bib] Stefano Zacchiroli. User Interaction Widgets for Interactive Theorem Proving. Ph.D. dissertation, Technical report UBLCS-2007-10, March 2007, Department of Computer Science, University of Bologna (advisor: Andrea Asperti; refereed by: Christoph Benzmueller, Marino Miculan). Abstract...

Abstract: Matita (that means pencil in Italian) is a new interactive theorem prover under development at the University of Bologna. When compared with state-of-the-art proof assistants, Matita presents both traditional and innovative aspects. The underlying calculus of the system, namely the Calculus of (Co)Inductive Constructions (CIC for short), is well-known and is used as the basis of another mainstream proof assistant, Coq, with which Matita is to some extent compatible. In the same spirit of several other systems, proof authoring is conducted by the user as a goal directed proof search, using a script for storing textual commands for the system. In the tradition of LCF, the proof language of Matita is procedural and relies on tactic and tacticals to proceed toward proof completion. The interaction paradigm offered to the user is based on the script management technique at the basis of the popularity of the Proof General generic interface for interactive theorem provers: while editing a script the user can move forth the execution point to deliver commands to the system, or back to retract (or "undo") past commands. Matita has been developed from scratch in the past 8 years by several members of the Helm research group, this thesis author is one of such members. Matita is now a full-fledged proof assistant with a library of about 1.000 concepts. Several innovative solutions spun-off from this development effort. This thesis is about the design and implementation of some of those solutions, in particular those relevant for the topic of user interaction with theorem provers, and of which this thesis author was a major contributor. Joint work with other members of the research group is pointed out where needed. The main topics discussed in this thesis are briefly summarized below. Disambiguation. Most activities connected with interactive proving require the user to input mathematical formulae. Being mathematical notation ambiguous, parsing formulae typeset as mathematicians like to write down on paper is a challenging task; a challenge neglected by several theorem provers which usually prefer to fix an unambiguous input syntax. Exploiting features of the underlying calculus, Matita offers an efficient disambiguation engine which permit to type formulae in the familiar mathematical notation. Step-by-step tacticals. Tacticals are higher-order constructs used in proof scripts to combine tactics together. With tacticals scripts can be made shorter, readable, and more resilient to changes. Unfortunately they are de facto incompatible with state-of-the-art user interfaces based on script management. Such interfaces indeed do not permit to position the execution point inside complex tacticals, thus introducing a trade-off between the usefulness of structuring scripts and a tedious big step execution behavior during script replaying. In Matita we break this trade-off with tinycals: an alternative to a subset of LCF tacticals which can be evaluated in a more fine-grained manner. Extensible yet meaningful notation. Proof assistant users often face the need of creating new mathematical notation in order to ease the use of new concepts. The framework used in Matita for dealing with extensible notation both accounts for high quality bidimensional rendering of formulae (with the expressivity of MathML-Presentation) and provides meaningful notation, where presentational fragments are kept synchronized with semantic representation of terms. Using our approach interoperability with other systems can be achieved at the content level, and direct manipulation of formulae acting on their rendered forms is possible too. Publish/subscribe hints. Automation plays an important role in interactive proving as users like to delegate tedious proving sub-tasks to decision procedures or external reasoners. Exploiting the Web-friendliness of Matita we experimented with a broker and a network of web services (called tutors) which can try independently to complete open sub-goals of a proof, currently being authored in Matita. The user receives hints from the tutors on how to complete sub-goals and can interactively or automatically apply them to the current proof. Another innovative aspect of Matita, only marginally touched by this thesis, is the embedded content-based search engine Whelp which is exploited to various ends, from automatic theorem proving to avoiding duplicate work for the user. We also discuss the (potential) reusability in other systems of the widgets presented in this thesis and how we envisage the evolution of user interfaces for interactive theorem provers in the Web 2.0 era.
[.pdf] [.bib] Stefano Zacchiroli. Web services per il supporto alla dimostrazione interattiva (Web services for interactive theorem proving). Master thesis (Italian only), March 2003, Department of Computer Science, University of Bologna (advisor: Andrea Asperti; refereed by: Nadia Busi).

miscellanea

[.pdf] [.bib] Valentin Lorentz, Roberto Di Cosmo, Stefano Zacchiroli. The Popular Content Filenames Dataset: Deriving Most Likely Filenames from the Software Heritage Archive. Data paper. July 2023. hal-04171177 Abstract...

Abstract: The Popular Content Filenames Dataset provides for each unique file content present in the Software Heritage Graph dataset its most popular filename. For the 2022-04-25 version, it contains over 12 billion entries and weights 413 gigabytes. This dataset allows to easily select subsets of the file contents from the Software Heritage archive based on file name patterns, facilitating reseach tasks in areas like data compression and machine learning.
[.pdf] [.bib] Laure Muselli, Mathieu O'Neil, Fred Pailler, Stefano Zacchiroli. Le pillage de la communauté des logiciels libres. Le Monde diplomatique 814, January 2022, pp. 20-21. Abstract...

Abstract: Pendant que l’utopie numérique rêvée trente ans plus tôt enfantait un supermarché à partir de 1990, un groupe d’irréductibles maintenait envers et contre tout un projet fidèle aux origines : le logiciel libre. Coopté, récupéré et trahi par les mastodontes de l’industrie, le voici fragilisé.
[.pdf] [.bib] Ralf Treinen, Stefano Zacchiroli. Solving package dependencies: from EDOS to Mancoosi. In proceedings of DebConf8: 9th annual conference of the Debian project developers. August 10-16, 2008, Mar del Plata, Argentina. Abstract...

Abstract: Mancoosi (Managing the Complexity of the Open Source Infrastructure) is an ongoing research project funded by the European Union for addressing some of the challenges related to the "upgrade problem" of interdependent software components of which Debian packages are prototypical examples. Mancoosi is the natural continuation of the EDOS project which has already contributed tools for distribution-wide quality assurance in Debian and other GNU/Linux distributions. The consortium behind the project consists of several European public and private research institutions as well as some commercial GNU/Linux distributions from Europe and South America. Debian is represented by a small group of Debian Developers who are working in the ranks of the involved universities to drive and integrate back achievements into Debian. This paper presents relevant results from EDOS in dependency management and gives an overview of the Mancoosi project and its objectives, with a particular focus on the prospective benefits for Debian.