Télécom Paris, ACES seminar
2022-03-18
Stefano Zacchiroli
stefano.zacchiroli@telecom-paris.fr
https://upsilon.cc/zack
@zacchiro
Supply chain: the set of activities required by an organization to deliver goods or services to consumers.
Software supply chain: the set of software components and software services required to deliver an IT product or service to users.
Key artifact for audits: (S)BOM = (Software) Bill of Materials
A software supply chain attack is a particular kind of cyber-attack that aims at injecting malicious code into an otherwise legitimate software product.
left-pad
(2016)
function leftpad (str, len, ch) {
  str = String(str);
  var i = -1;
  if (!ch && ch !== 0) ch = ' ';
  len = len - str.length;
  while (++i < len) { str = ch + str; }
  return str;
}
Injection → Create New Package → Typosquatting
Create a new package, containing malicious code, whose name is similar (e.g., Levenshtein distance <= 2) to that of an existing popular package
Upload it to a distribution platform (e.g., PyPI)
Wait for users to mistype (e.g., pip install python-sqlite)
Related attack vector: Use After Free
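The name-similarity criterion mentioned for typosquatting can be sketched in Python as follows. This is only an illustration of the Levenshtein-distance check: the function names and the list of popular package names are made up for the example, and real registries may rely on different heuristics.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


def possible_typosquats(candidate: str, popular: list[str], max_distance: int = 2) -> list[str]:
    """Return the popular names that `candidate` is suspiciously close to."""
    return [
        name for name in popular
        if name != candidate and levenshtein(candidate, name) <= max_distance
    ]


# Hypothetical list of popular names, for illustration only.
POPULAR = ["requests", "numpy", "django"]
print(possible_typosquats("reqeusts", POPULAR))
```

A registry-side check of this kind could flag a newly uploaded name for review. A name that exactly matches a popular one is excluded, since that is not a typo.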
Injection → Infect Existing Package → Inject into Source → Commit (as maintainer) → Social Engineering to become Maintainer
Might require early investment to accrue enough “street credibility” to win over maintenance at the right moment. For popular packages with low bus factor it could be worth it.
Injection of Malicious Code → Infect Existing Package → Inject during the Build → Compromise Build System
Often, the code run by users is written, but not built, by maintainers
Rather, it is built by 3rd-party vendors
It becomes attractive to break into vendor build systems and compromise binaries “downstream”: anybody who looks merely at the source code will not notice
Related attack vectors: Inject into [Package] Repository System (≠ VCS)
https://reproducible-builds.org/
[Lamb22]: Chris Lamb, Stefano Zacchiroli. Reproducible Builds: Increasing the Integrity of Software Supply Chains. IEEE Softw. 39(2): 62-70 (2022).
“You can’t trust code that you did not totally create yourself. […] No amount of source-level verification or scrutiny will protect you from using untrusted code.”
— Ken Thompson, Reflections on Trusting Trust, Turing Lecture 1984
40 years later nobody “totally creates” code they run
Reuse of open source software (FOSS) is everywhere in IT
Also, the FOSS we run is often not built by its developers
Precondition/hypothesis: we can “reproducibly build” all relevant (FOSS) products, i.e.:
The build process of a software product is reproducible if, after designating a specific version of its source code and all of its build dependencies, every build produces bit-for-bit identical artifacts, no matter the environment in which the build is performed. — [Lamb22]
(we’ll verify later how realistic this is)
Let’s try a large-scale experiment: making all Debian packages build reproducibly from source
Goals:
After controlling for source code, build deps., and toolchain, two main classes of issues arise in practice:
Uncontrolled build inputs: when toolchains allow the build process to be affected by the surrounding environment.
Build non-determinism that gets encoded in final built artifacts.
Let’s see a bestiary of real-world examples…
void usage() {
  fprintf(stderr,
          "foo-utils version 3.141 (built %s)\n",
          __DATE__);
}
The __DATE__ C preprocessor macro “expands to a string constant that describes the date on which the preprocessor is being run.”
Fix: SOURCE_DATE_EPOCH envvar (standardized by r-b) to enable controlling for this
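A build tool can honour SOURCE_DATE_EPOCH along these lines. This is a minimal Python sketch, assuming the variable holds a Unix timestamp in seconds. The function name build_date is hypothetical, and the fallback to the current time is one possible policy, not a requirement.

```python
import os
import time
from datetime import datetime, timezone


def build_date(environ=os.environ) -> str:
    """Return the date to embed in artifacts, honouring SOURCE_DATE_EPOCH."""
    # SOURCE_DATE_EPOCH holds a Unix timestamp (seconds since the epoch, UTC).
    # Falling back to the current time keeps ad-hoc builds working, but makes
    # them non-reproducible.
    timestamp = int(environ.get("SOURCE_DATE_EPOCH", time.time()))
    # Format in UTC so the result does not depend on the builder's time zone.
    return datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%Y-%m-%d")


print(build_date({"SOURCE_DATE_EPOCH": "1647561600"}))
```

Passing the environment as a parameter keeps the function easy to test. In a real build script it would simply read os.environ.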
fprintf (stderr,
         "DEBUG: boop (%s:%d)\n",
         __FILE__, __LINE__);
The __FILE__ C preprocessor macro “expands to the name of the current input file”. This results in non-reproducibility when the program is built from different directories, e.g., /home/lamby/tmp vs. /home/zack/tmp.
Fix: the -ffile-prefix-map option (and the related -fdebug-prefix-map) to support embedding relative (rather than absolute) paths
NAME
readdir - read a directory
SYNOPSIS
#include <dirent.h>
struct dirent *readdir(DIR *dirp);
[…] The order in which filenames are read by successive calls to
readdir() depends on the filesystem implementation; it is unlikely
that the names will be sorted in any fashion. […]
Fix: sort() the directory entries before use
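The same concern applies to build scripts in other languages. A minimal Python sketch, in which the function name and the file names are hypothetical: os.listdir() makes no ordering promise either, so the entries are sorted before their order can leak into a build output.

```python
import os
import tempfile


def list_build_inputs(directory: str) -> list[str]:
    """List directory entries in a stable, filesystem-independent order."""
    # os.listdir() returns entries in arbitrary order,
    # so sort explicitly before the order can leak into any build output.
    return sorted(os.listdir(directory))


# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.c", "a.c", "c.c"):
        open(os.path.join(tmp, name), "w").close()
    print(list_build_inputs(tmp))
```

Python compares strings by code point, so the resulting order does not depend on the builder's locale.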
Even when all environment inputs are controlled for, many builds remain non-deterministic, for instance due to randomness in unexpected places.
my %h = ( a => 1, b => 2, c => 3);
foreach my $k (keys %h) {
  print "$k\n";
}
Perl’s hash type does not define an ordering of its keys, so a call to sort should be inserted before keys %h to make it deterministic.
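The pitfall is not specific to Perl. In Python, the iteration order of a set of strings can change from one run to the next, because string hashes are randomized per process by default. A minimal sketch of the same fix, with a hypothetical function name:

```python
def render_symbol_list(symbols: set[str]) -> str:
    """Serialize a set of names with one name per line, in sorted order."""
    # Iterating over the set directly could yield a different order on each
    # run, because Python randomizes string hashes per process by default.
    return "".join(f"{name}\n" for name in sorted(symbols))


print(render_symbol_list({"b", "a", "c"}), end="")
```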
--- a/direntry.c
+++ b/direntry.c
@@ -24,6 +24,7 @@
void initializeDirentry(
direntry_t *entry, Stream_t *Dir) {
+ memset(entry, 0, sizeof(direntry_t));
entry->entry = -1;
entry->Dir = Dir;
Fix: ensure that the direntry_t struct does not contain uninitialized memory.
How do you find build reproducibility issues, at scale?
mass-rebuild all packages…
…building each of them twice…
…in two build environments configured to differ as much as possible
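The comparison step of such a double build can be sketched in Python as follows. The sketch assumes each build leaves its artifacts in a flat output directory. The function names and file names are hypothetical, and this is not the tooling used by the Reproducible Builds project.

```python
import hashlib
import os
import tempfile


def sha256_of(path: str) -> str:
    """Return the hex SHA-256 digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as stream:
        for chunk in iter(lambda: stream.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unreproducible_artifacts(first_build: str, second_build: str) -> list[str]:
    """Names of artifacts that are missing from one build or differ bit-for-bit."""
    names = set(os.listdir(first_build)) | set(os.listdir(second_build))
    differing = []
    for name in sorted(names):
        a = os.path.join(first_build, name)
        b = os.path.join(second_build, name)
        if not (os.path.isfile(a) and os.path.isfile(b)) or sha256_of(a) != sha256_of(b):
            differing.append(name)
    return differing


# Demonstration: one artifact is stable, the other embeds a build date.
with tempfile.TemporaryDirectory() as one, tempfile.TemporaryDirectory() as two:
    for directory, stamp in ((one, b"built 2022-03-17"), (two, b"built 2022-03-18")):
        with open(os.path.join(directory, "stable.bin"), "wb") as f:
            f.write(b"same bytes")
        with open(os.path.join(directory, "stamped.bin"), "wb") as f:
            f.write(stamp)
    print(unreproducible_artifacts(one, two))
```

An empty result means the two builds agree bit for bit. Any name in the list is a candidate for closer inspection.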
According to our definition of a reproducible build, legitimate build inputs should be controlled for and replicated identically in the second build
To that end, the .buildinfo file format has been standardized to capture this information
.buildinfo: Example
Source: black
Version: 20.8b1-1
Checksums-Sha1:
9915459ae7a1a5c3efb984d7e5472f7976e996b1 2584 black_20.8b1-1.dsc
14bfd3011b795f85edbc8cc4dc034a91cfaa9bcd 111096 black_20.8b1-1_all.deb
69c3d4ae7115c51e7b00befe8b4afd5963601d66 285684 python-black-doc_20.8b1-1_all.deb
Checksums-Sha256: [...]
Build-Architecture: amd64
Installed-Build-Depends: autoconf (= 2.69-11.1), automake (= 1:1.16.2-4), […], gcc (= 4:10.2.0-1), […], python3 (= 3.8.2-3), […], xz-utils (= 5.2.4-1+b1), zlib1g (= 1:1.2.11.dfsg-2)
An example .buildinfo file, recording both the environment and results of building Debian’s black package. (See full version.)
.buildinfo files also contain the cryptographic checksums of final build artifacts, acting as build attestations:
“I, Alice, given source X, build dependencies Y_1,…,Y_n and toolchain Z, have conducted a build run obtaining a set of artifacts with checksums K_1,…,K_m.”
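Checking a rebuilt artifact against such an attestation can be sketched in Python as follows. The sketch assumes the body of the Checksums-Sha256 field uses the same "digest size filename" layout as the Checksums-Sha1 lines in the example. The function names and the artifact are hypothetical, and this is not a full .buildinfo parser.

```python
import hashlib


def parse_checksums(field_body: str) -> dict[str, tuple[str, int]]:
    """Parse '<hex digest> <size> <file name>' lines into {name: (digest, size)}."""
    entries = {}
    for line in field_body.splitlines():
        if not line.strip():
            continue
        digest, size, name = line.split()
        entries[name] = (digest, int(size))
    return entries


def matches_attestation(name: str, content: bytes, attested: dict[str, tuple[str, int]]) -> bool:
    """Check a locally rebuilt artifact against the attested digest and size."""
    if name not in attested:
        return False
    digest, size = attested[name]
    return len(content) == size and hashlib.sha256(content).hexdigest() == digest


# Hypothetical artifact, for illustration only.
artifact = b"pretend this is a .deb"
field = f" {hashlib.sha256(artifact).hexdigest()} {len(artifact)} example_1.0-1_all.deb\n"
attested = parse_checksums(field)
print(matches_attestation("example_1.0-1_all.deb", artifact, attested))
print(matches_attestation("example_1.0-1_all.deb", artifact + b"!", attested))
```

The first check succeeds and the second fails, since a single extra byte changes both the size and the digest.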
.buildinfo: Usage
systematic R-B testing ⇒ systematic build testing, catching any FTBFS (fails to build from source) bug
some software will only FTBFS in the extreme R-B build environment; fixing it will make the software more robust in general
R-B testing can detect user-level breakages by serendipity
For example, an artifact that refers to /tmp/build/foo/usage.html instead of /usr/share/doc/foo/usage.html
{
'cgibin' => '/usr/lib/cgi-bin/gbrowse',
'conf' => '/etc/gbrowse',
'databases' => '/var/lib/gbrowse/databases',
'htdocs' => '/usr/share/gbrowse/htdocs',
'OpenIDConsumerSecret' => '639098210478536',
'tmp' => '/var/cache/gbrowse'
},
An example ConfigData.pm. As it was created at build time, all users shared the same OpenIDConsumerSecret. (See: Debian bug #833885.)
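One way to avoid this class of bug is to generate secrets on the user's machine, at install or first-run time, rather than during the build. The following Python sketch illustrates the idea only. It is not the fix applied for the Debian bug, and the function name and file location are hypothetical.

```python
import os
import secrets
import tempfile


def load_or_create_secret(path: str) -> str:
    """Return the secret stored at `path`, generating it on first use."""
    try:
        with open(path, encoding="ascii") as stream:
            return stream.read().strip()
    except FileNotFoundError:
        pass
    secret = secrets.token_hex(16)
    # Create the file readable and writable by its owner only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w", encoding="ascii") as stream:
        stream.write(secret + "\n")
    return secret


# Demonstration: the secret is stable per installation, distinct across installations.
with tempfile.TemporaryDirectory() as tmp:
    first = load_or_create_secret(os.path.join(tmp, "install-a.secret"))
    again = load_or_create_secret(os.path.join(tmp, "install-a.secret"))
    other = load_or_create_secret(os.path.join(tmp, "install-b.secret"))
    print(first == again, first != other)
```

Because nothing secret is produced during the build, the built artifact stays the same for everyone, while each installation still gets its own value.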
https://reproducible-builds.org/
Debian reached 95% reproducible packages, can we go all the way?
How to make signed build artifacts reproducible (without distributing signing keys)?
How do end users verify build artifacts before installation?
How little trusted code is acceptable?