# Does Functional Package Management Enable Reproducible Builds at Scale? Yes.
**Authors**: Julien Malka, Stefano Zacchiroli, Théo Zimmermann
> LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France (julien.malka@telecom-paris.fr)
> LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France (stefano.zacchiroli@telecom-paris.fr)
> LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France (theo.zimmermann@telecom-paris.fr)
Abstract
Reproducible Builds (R-B) guarantee that rebuilding a software package from source leads to bitwise identical artifacts. R-B is a promising approach to increasing the integrity of the software supply chain when installing open source software built by third parties. Unfortunately, despite success stories like the high build reproducibility levels of Debian packages, uncertainty remains among field experts on the scalability of R-B to very large package repositories.
In this work, we perform the first large-scale study of bitwise reproducibility, in the context of the Nix functional package manager, rebuilding $709\,816$ packages from historical snapshots of the nixpkgs repository, the largest cross-ecosystem open source software distribution, sampled in the period 2017–2023.
We obtain very high bitwise reproducibility rates, between 69 and 91% with an upward trend, and even higher rebuildability rates, over 99%. We investigate unreproducibility causes, showing that about 15% of failures are due to embedded build dates. We release a novel dataset with all build statuses, logs, as well as full “diffoscopes”: recursive diffs of where unreproducible build artifacts differ.
Index Terms: reproducible builds, functional package management, software supply chain, reproducibility, security
I Introduction
Free and open source software (FOSS) is a great asset to build trust in a computing system, because one can audit the source code of installed components to determine whether their security is up to one's standards. However, trusting the source code of the components making up a system is not enough to trust the system itself: before a program can be run on a user machine, it is typically built (a term that we will also use in this work for interpreted programs, where it is the runtime environment that has to be built) to obtain an executable artifact, and then distributed onto the target system, involving a set of processes and actors generally referred to as the software supply chain. In recent years, large-scale attacks like SolarWinds [alkhadra_solar_2021] or the xz backdoor [noauthor_nvd_nodate] have specifically targeted the software supply chain, underscoring the importance of measures to increase its security and also triggering policy responses in the European Union and the United States of America [noauthor_cyber_2022, house_executive_2021]. The particular effectiveness of these attacks is due to the difficulty of analyzing binary artifacts to understand how they might act on the system, hence increasing the need for tooling that provides traceability from executable binaries to their source code.
Reproducible builds (R-B), the property of being able to obtain the same, bitwise identical, artifacts from two independent builds of a software component, is recognized as a promising way to increase trust in the distribution phase of binary artifacts [lamb_reproducible_2022]. Indeed, if a software component is bitwise reproducible, a user may require several independent parties to reach a consensus on the result of a compilation before downloading the built artifacts from one of them, effectively distributing the trust in these artifacts between those parties. For an attacker wanting to compromise the supply chain of that component, it is no longer sufficient to compromise only one of the involved parties. Build reproducibility is, however, not easy to obtain in general, due to non-determinism in build processes, documented both by practitioners and researchers [lamb_reproducible_2022, bajaj_unreproducible_2023]. The Reproducible Builds project [noauthor_reproducible_2023] has since 2015 worked to increase bitwise reproducibility throughout FOSS ecosystems, coming up with fixes for compilers and other toolchain components and working closely with upstream projects to integrate them.
Unfortunately, a recent study [fourne_its_2023] which interviewed 24 R-B experts still concluded that there is “a perceived impracticality of fully reproducible builds due to workload, missing organizational buy-in, unhelpful communication with upstream projects, or the goal being perceived as only theoretically achievable” and that “much of the industry believes [R-B] is out of reach”. While there exist successful examples of package sets with high reproducibility levels like Debian, which consistently achieves a reproducibility rate of more than 95% [noauthor_overview_nodate], this good performance should be put in perspective with the strict quality policies applied in Debian and the relatively limited size of its package set. Uncertainty remains among field experts about the scalability of this approach to larger software distributions.
Nixpkgs is the largest cross-ecosystem FOSS distribution, totaling as of October 2024 about $100\,000$ packages, based on the Repology rankings (https://repology.org, accessed Oct. 2024). It includes components from a large variety of software ecosystems, making it an interesting target to study bitwise reproducibility at scale. Nixpkgs is built upon Nix, the seminal implementation of the functional package management (FPM) model [dolstra_purely_2006]. It is generally believed that the FPM model is effective at obtaining R-B: FPM packages are pure functions (in the mathematical sense) from build- and run-time dependencies to build artifacts, described as “build recipes” that can be executed locally by the package manager. Components are built in a sandboxed environment, disallowing access to unspecified dependencies, even if they are present on the system. Previous work has highlighted that this model allows reproducing build environments both in space and time [malka_reproducibility_2024], a necessary property for build reproducibility. Additionally, nixpkgs' predefined build processes implement best practices to ensure build reproducibility, like setting the SOURCE_DATE_EPOCH environment variable [noauthor_source_date_epoch_nodate] or automatically verifying that the build path does not appear in the built artifacts [noauthor_build_nodate]. Despite the potential for insightful distribution-wide build reproducibility metrics, nixpkgs limits its monitoring to the narrow set of packages included in the minimal and gnome-based ISO images [noauthor_nixos_nodate], where a reproducibility rate higher than 95% is consistently reported.
Contributions
In this work, we perform the first-ever large-scale empirical study of bitwise reproducibility of FOSS going back in time, rebuilding historical packages from evenly spaced snapshots of the nixpkgs package repository, taken every 4.1 months from 2017 to 2023. With this experiment, we answer the following research questions:
- RQ1: What is the evolution of bitwise reproducible packages in nixpkgs between 2017 and 2023? How does the reproducibility rate evolve over time? Are unreproducible packages eventually fixed? Do reproducible packages remain reproducible?
- RQ2: What are the unreproducible packages? Are they concentrated in specific ecosystems? Are critical packages more likely to be reproducible?
- RQ3: Why are packages unreproducible? Is large-scale identification of common causes possible?
- RQ4: How are unreproducibilities fixed? Are they fixed by specific patches or as part of larger package updates? Are the fixes intentional or accidental?
Besides, we use our experiment to replicate and extend previous results [malka_reproducibility_2024], leading to an additional research question:
- RQ0: Does Nix allow rebuilding past packages reliably (even if not bitwise reproducibly)?
Results
Thanks to this large-scale experiment, we are able to establish for the first time that bitwise reproducibility is achievable at scale, with reproducibility rates ranging from 69% to 91% over the period 2017–2023, despite a continuous increase in the number of packages in nixpkgs. We highlight the wide variability in reproducibility rates across ecosystems packaged in nixpkgs, and show the significant impact that some core packages can have on the overall reproducibility rate of an ecosystem.
We estimate the prevalence of some common causes of non-reproducibility at a large scale for the first time, showing that about 15% of failures are due to embedded build dates.
As part of this work, we introduce a novel dataset containing build logs and metadata of over $709\,000$ package builds, and more than $114\,000$ occurrences of non-reproducibility with full artifacts including “diffoscopes”, i.e., recursive diffs of where unreproducible build artifacts differ. Ample room for further research is left open by the dataset, including exploiting the build logs, or applying more complex heuristics or qualitative research to the diffoscopes.
Paper structure
Section II presents the related work. Section III gives some background that is required to understand the experiment, whose methodology is then presented in Section IV. Some descriptive statistics about the dataset are presented in Section V, and the results to our RQs in Section VI. We discuss them in Section VII, and the threats to validity in Section VIII, concluding in Section IX.
II Related work
II-A Reproducible builds (R-B)
R-B are a relatively recent concept, which has been picked up and developed mostly by practitioners from Linux distributions and by upstream maintainers. The Reproducible Builds project [noauthor_reproducible_2023] has been the main actor in the area. The project has produced a definition of R-B, best practices to achieve them, and tools to monitor the reproducibility of software distributions and to debug unreproducibilities (diffoscope).
Besides, R-B have piqued the interest of the academic community, with a growing number of papers on the topic.
R-B for the security of the software supply chain
R-B are often seen as a way to increase the security of the software supply chain [lamb_reproducible_2022]. Torres-Arias et al. provide a framework to enforce the integrity of the software supply chain for which they demonstrate an application to enforce R-B [torres-arias_-toto_2019]. Our paper does not contribute directly to this line of research but, by demonstrating the feasibility of R-B at scale, it strengthens the case of the approach.
Techniques for build reproducibility
In a series of articles [ren_automated_2018, ren_root_2020, ren_automated_2022], Ren et al. devised a methodology to automate the localization of sources of non-reproducibility in build processes and to automatically fix them, using a database of common patches that are then automatically adapted and applied.
An alternative technique to achieve build reproducibility is proposed by Navarro et al. [navarro_leija_reproducible_2020]. They propose âreproducible containersâ that are built in a way that makes the build process fully deterministic, at the expense of performance.
FPMs such as Nix [dolstra_purely_2006] and Guix [courtes_functional_2013] are also presented as a way to achieve R-B. Malka et al. [malka_reproducibility_2024] showed that Nix allows reproducing past build environments reliably, as well as rebuilding old packages with high confidence, but they did not address the question of bitwise reproducibility, which we do in this work.
Relaxing the bitwise reproducibility criterion
Because of the difficulty (real or perceived) of achieving bitwise reproducibility, some authors have proposed relaxing the criterion to a more practical one. For instance, accountable builds [poll_analyzing_2021] aim to distinguish between differences that can be explained (accountable differences) and those that cannot (unaccountable differences). Our work highlights that bitwise reproducibility is achievable at scale in practice, and thus that relaxing the reproducibility criterion may not be necessary after all.
Empirical studies of R-B
Some other recent academic works have empirically studied R-B in the wild. Two papers from 2023 [butler_business_2023, fourne_its_2023] looked into business adoption of R-B and perceived effectiveness through interviews.
Bajaj et al. [bajaj_unreproducible_2023] mined historical results from R-B tracking in Debian to investigate causes, fix time, and other properties of unreproducibility in the distribution. Our work is similar, but instead of relying on historical R-B tracking, we actually rebuild packages and compare them bitwise to historical build results. When we report on packages being reproducible, it means they have stood the test of time. It also allows us to provide more detailed information on the causes of unreproducibility, in particular by generating diffoscopes and saving them for future research as part of our dataset; whereas diffoscopes from the Debian R-B tracking are not preserved in the long term. Finally, Bajaj et al. used issue tracker data from the R-B project to identify the most common causes of non-reproducibility, possibly introducing a sampling bias since only root causes that were identified by Debian developers are counted in their statistics. In our work, we try to avoid this bias by performing a large-scale automatic analysis of diffoscopes to automatically identify the prevalence of a selection of causes of non-reproducibility. While we present heuristics comparable to some of the causes identified in Bajaj et al.'s taxonomy, we derive them from empirical data rather than relying on pre-labeled data from the Debian issue tracker.
The only other work that performed an experimental study of R-B investigated the impact of configuration options [randrianaina_options_2024]. Unlike them, we rebuild historical versions of packages in their default configuration. Combining the historical snapshot approach of our work with their approach of varying configuration options could be interesting future work.
II-B Linux distributions and package ecosystems
Besides R-B, our work also relates to the literature on Linux distributions and package ecosystems. Since the nixpkgs repository is the largest cross-ecosystem software distribution, we are able to compare properties of packages across ecosystems. Several previous works have compared package ecosystems (e.g., [decan_empirical_2019]). For an overview of recent research on package ecosystems, see Mens and Decan [mens_overview_2024].
More specifically, nixpkgs is the basis of the NixOS Linux distribution. Linux distributions have a long history of being studied by the research community. Recently, Legay et al. [legay_quantitative_2021] measured the package freshness in Linux distributions. While this is not the topic of this work, our dataset could be used, e.g., to study how frequently packages are updated in nixpkgs.
III Background
We provide in this section some background knowledge about Nix and R-B, which is required to understand the details of our experiments.
III-A The FPM model and the Nix store
The FPM model applies the idea of functional programming to package management. Nix packages are viewed as pure functions from their inputs (source code, dependencies, build scripts) to their outputs (binaries, documentation, etc.). Any change to the inputs should produce a different package version. Nix allows multiple versions of the same package to be built and coexist on the same system. To that end, Nix stores build outputs in input-addressed directories (using a hashing function of the inputs) in the Nix store, usually located in the /nix/store directory on disk. Figure 1 shows an example of a Nix packaging expression (a Nix recipe) for the htop package.
```nix
{ stdenv, fetchFromGitHub, ncurses, autoreconfHook }:

stdenv.mkDerivation rec {
  pname = "htop";
  version = "3.2.1";

  src = fetchFromGitHub {
    owner = "htop-dev";
    repo = "htop";
    rev = version;
    sha256 = "sha256-MwtsvdPHcUdegsYj9NGyded5XJQxXri1IM1j4gef1Xk=";
  };

  nativeBuildInputs = [ autoreconfHook ];
  buildInputs = [ ncurses ];
}
```
Figure 1: Example Nix expression for the htop package.
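The input-addressing principle can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Nix's actual hashing scheme (real Nix hashes a serialized derivation and uses a truncated base-32 digest), but the key property is the same: any change to any input yields a different store path.

```python
import hashlib

STORE = "/nix/store"

def store_path(pname, version, inputs):
    """Derive a simplified input-addressed store path.

    `inputs` maps input names (sources, dependencies, build scripts)
    to their identities. Any change to any input yields a different
    hash, hence a different path, letting multiple variants of the
    same package coexist in the store.
    """
    canonical = repr(sorted(inputs.items())).encode()
    digest = hashlib.sha256(canonical).hexdigest()[:32]
    return f"{STORE}/{digest}-{pname}-{version}"
```

For example, changing only the source hash of htop yields a fresh store directory while leaving the previously built one untouched.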
III-B Nix evaluation-build pipeline
Building binary outputs from a Nix package recipe is a two-step process. First, Nix evaluates the expression and transforms it into a derivation, an intermediary representation in the Nix store containing all the necessary information to run the build process. In particular, the derivation contains ahead of time the (input-addressed) output path, that is the exact location in the Nix store where the build artifacts will be stored if that derivation were to be built.
Then, given the derivation file as input, the nix-build command performs the build, creating the pre-computed output path in the Nix store upon completion.
In the same fashion as other Linux distributions, Nix packages may produce multiple outputs (a main output with binaries, one with documentation, etc.). Each output has its own directory in the Nix store, and building the derivation from source systematically produces all its outputs.
III-C Path substitution and binary caches
As an alternative to building from source, Nix offers the option to download prebuilt artifacts from third-party binary caches, which are databases populated with build outputs generated by Nix. Binary caches are indexed by output paths, making it possible for Nix to check for the presence of a precompiled package in a configured cache after the evaluation phase. https://cache.nixos.org is the official cache of the Nix community, and most Nix installations come configured to use it as a trusted cache.
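The lookup scheme can be illustrated with a small helper that derives, from a store path, the URL of the metadata file (`.narinfo`) that caches following the cache.nixos.org layout serve; the helper name is ours:

```python
def narinfo_url(store_path, cache="https://cache.nixos.org"):
    """Return the URL of the .narinfo metadata file for a store path.

    Binary caches following the cache.nixos.org layout are keyed by
    the 32-character hash prefix of the store path's basename.
    """
    basename = store_path.rsplit("/", 1)[-1]
    hash_part = basename.split("-", 1)[0]
    return f"{cache}/{hash_part}.narinfo"
```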
III-D The nixpkgs continuous integration
Hydra [dolstra_nix_nodate] is the continuous integration (CI) platform of the nixpkgs project. At regular intervals, it fetches the latest version of nixpkgs' git master branch and evaluates the pkgs/top-level/release.nix file embedded in the repository. This evaluation yields a list of derivations (or jobs) that are then built by Hydra: one derivation for each of the ${\approx}100\,000$ packages contained in nixpkgs nowadays. Upon success of a predefined subset of these jobs, the revision is deemed valid and all the built artifacts are uploaded to the official binary cache to be available to end users.
III-E Testing bitwise reproducibility with Nix
Nix ships minimal tooling to test the reproducibility of a given derivation, in the form of a --check flag passed to the nix-build command. To check for bitwise reproducibility, Nix needs a reference artifact, which it tries to acquire from one of the configured caches, failing if it cannot. Nix then acquires the build environment of the derivation under consideration, builds the derivation, and compares each of its outputs against the reference. The --keep-failed flag can be used to instruct Nix to keep the unreproducible outputs locally for further processing. For our experiment, we alter the behavior of nix-build --check to prevent it from failing early as soon as one unreproducible output is detected.
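Conceptually, the comparison performed by nix-build --check amounts to a recursive, bit-for-bit diff of two output trees. A minimal Python sketch of that idea (Nix itself compares serialized archives rather than walking the file system like this):

```python
import hashlib
from pathlib import Path

def tree_digest(root):
    """Map each regular file under `root` to the SHA-256 of its bytes."""
    base = Path(root)
    return {
        str(p.relative_to(base)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(base.rglob("*"))
        if p.is_file()
    }

def bitwise_identical(reference_output, rebuilt_output):
    """True iff the two build outputs match file-by-file, bit-for-bit."""
    return tree_digest(reference_output) == tree_digest(rebuilt_output)
```

A single differing byte anywhere in any output file is enough to classify the build as unreproducible.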
III-F Diffoscope
Diffoscope [noauthor_diffoscope_nodate] is a tool developed and maintained by the Reproducible Builds project that aims to simplify the analysis of differences between software artifacts. It is able to recursively unpack binary archives and automatically use ecosystem-specific diffing tools to allow for a better understanding of what makes two software artifacts different. It generates HTML or JSON artifacts (also called “diffoscopes”) that can be either interpreted by humans or automatically processed.
IV Methodology
(Figure: a three-stage pipeline. Historical CI builds: sampled nixpkgs revisions are evaluated by Hydra and built on Hydra builders, with successful artifacts uploaded to the nixpkgs binary cache. Our rebuilds: the same revisions are evaluated with nix-eval-jobs, the evaluation closure is computed and packages absent from the cache are filtered out, and the remaining packages are built on our Buildbot infrastructure. Our analysis: rebuilt artifacts are compared with the historical ones; identical artifacts are marked reproducible, while differing artifacts are marked buildable but not reproducible, in which case diffoscopes are generated and heuristics categorize them, and, if the package is fixed later on, a bisection locates the fixing commit, which is manually analyzed to infer intent.)
Figure 2: Description of our build and analysis pipeline.
Our build and analysis pipeline is summarized in Figure 2.
IV-A Reproducibility experiments
IV-A 1 Revision sampling
We start from the 200 nixpkgs revisions selected by Malka et al. [malka_reproducibility_2024] in the period July 2017–April 2023. Since building revisions, as opposed to just evaluating them, is very computationally intensive, it was not feasible to build all 200 revisions. It was also difficult to correctly estimate how many revisions we could build, due to the ever-growing number of packages in each revision. We hence applied dichotomic sampling: we first built the most recent revision, then the oldest one, then the one in the middle of them, and so on, always picking the one in the middle of the largest remaining time interval. After building 17 revisions, we obtained a regularly spaced sample set of nixpkgs revisions, with one sampled revision every 4.1 months. On average, each revision corresponds to building more than 41 thousand packages (see Figure 3 for details, discussed later).
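The sampling procedure can be sketched as follows; this is an illustrative reimplementation, and the function and variable names are ours:

```python
def dichotomic_sample(revisions, k):
    """Select k items from a chronologically ordered list: the newest
    first, then the oldest, then repeatedly the midpoint of the widest
    gap between already-selected indices, until k items are chosen."""
    if k >= len(revisions):
        return list(revisions)
    chosen = [len(revisions) - 1, 0]
    while len(chosen) < k:
        s = sorted(chosen)
        # widest remaining interval between consecutive chosen indices
        lo, hi = max(zip(s, s[1:]), key=lambda gap: gap[1] - gap[0])
        chosen.append((lo + hi) // 2)
    return [revisions[i] for i in sorted(chosen)]
```

Applied to 200 evenly spaced revisions with k = 17, this yields a roughly regular grid that always includes both the oldest and the newest revision, and a useful prefix even if the experiment has to stop early.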
To perform our builds, we used a distributed infrastructure based on Buildbot [noauthor_buildbot_nodate], a Python CI pipeline framework. Our infrastructure has two types of machines: a coordinator and multiple builders. The coordinator is in charge of distributing the workload and storing data that must be persisted, while the builders are stateless and perform workloads sent by the coordinator. During the course of the experiment (from June to October 2024), the set of builders we used was composed of shared bare-metal machines running various versions of Ubuntu and Fedora, and our coordinator was a virtual machine running Ubuntu. Note that to perform our bitwise reproducibility checks, we compare against the historical results coming from Hydra, whose builders, at the time, ran older versions of Nix than the ones we used on our builders. Further details on the operating systems, Nix versions and kernel versions that we used on builders can be found in the replication package.
IV-A 2 Evaluation and preprocessing
For each revision considered, the coordinator first evaluates pkgs/top-level/release.nix (containing the list of jobs built by Hydra for this revision) using nix-eval-jobs, a standalone and parallel Nix evaluator similar to the one used by Hydra. The outcome of this operation is a list of derivations. The release.nix file is human-crafted and does not contain all the dependencies of the listed packages, even though they are built by Hydra along the way. Since we are interested in testing the reproducibility of the entire package graph built by Hydra, we post-process the list of jobs obtained after the evaluation phase to include all intermediary derivations, by walking through the dependency graph of each derivation. During this post-processing, we also check that the derivation outputs are present in the official Nix binary cache. This is required to compare our build outputs with historical results for bitwise reproducibility. Derivations missing from the cache can indicate that they historically failed to build, although there can be other reasons for their absence.
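The closure computation is a plain graph traversal. A sketch, where `deps` stands for the direct build-time dependencies extracted from each derivation (names are illustrative):

```python
def evaluation_closure(jobs, deps):
    """Expand a list of top-level jobs to the set of all derivations
    reachable through build-time dependencies.

    `deps` maps a derivation to its direct dependencies; a simple
    depth-first traversal with a visited set handles shared and
    deeply nested dependencies without revisiting them.
    """
    closure = set()
    stack = list(jobs)
    while stack:
        drv = stack.pop()
        if drv in closure:
            continue
        closure.add(drv)
        stack.extend(deps.get(drv, ()))
    return closure
```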
IV-A 3 Building
To build a job $A$ and test the reproducibility of its build outputs, the builder uses the nix-build --check command, as described in Section III-E. This means that we always assume that $A$'s build environment is buildable, and we always fetch it from the cache. This allows all the derivations in the sample set to be built and checked independently and in parallel, irrespective of where they are located in the package dependency graph. Note that the source code of the packages to build is part of the build environment. Relying on the NixOS cache hence avoids issues such as source code disappearing from its original upstream distribution place. Investigating how much of NixOS can be rebuilt without relying on the cache is an interesting research question, recently explored for Guix [courtes-2024-guix-archiving], but out of scope for this paper.
After each build, we classify the derivation as either building reproducibly, building but not reproducibly, or not building. We save the build metadata and the logs, and, when available, we download and store the historical build logs from the nixpkgs binary cache. Finally, for every unreproducible output path, we store both the historical artifacts and our locally built ones, for comparison purposes.
IV-B Ecosystem identification and package tracking
To answer RQ1, RQ2 and RQ4, we need to be able to discriminate packages by provenance ecosystem, and track them over time to follow their evolution. To categorize packages by ecosystem, we rely on the first component of the package name when it has several components (for example, a package named haskellPackages.network is sorted into the Haskell ecosystem). Sometimes, there are several co-existing versions of an ecosystem in a given nixpkgs revision (for example, python37Packages and python38Packages being present in the same revision), and sometimes the name of the ecosystem is modified between successive nixpkgs revisions. Therefore, a deduplication step is necessary. The first and last authors performed this step manually by inspecting the 144 ecosystems from the 17 nixpkgs revisions considered, ordered alphabetically, and deciding which ones to merge independently, then checking the consistency of their results, discussing the few differences (missed merges, or false positives) and reaching a consensus. For instance, the following ecosystems were merged into a single one: php56Packages, php70Packages, …, php82Packages, phpExtensions, phpPackages and phpPackages-unit.
To deduplicate packages appearing in several versions of the same ecosystem, we order by version (favoring the most recent one) and consider any package set without a version number as having a higher priority (since it is the default one in the considered revision, as chosen by the nixpkgs maintainers).
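The two steps can be sketched as follows; the digit-stripping merge is only a rough proxy for the manual merging described above, and the helper names are ours:

```python
import re

def ecosystem(attr):
    """Ecosystem prefix of a package attribute name, with version
    digits stripped so that python37Packages and python38Packages
    collapse into the same bucket; None for top-level packages."""
    parts = attr.split(".")
    if len(parts) < 2:
        return None
    return re.sub(r"\d+", "", parts[0])

def version_priority(attr):
    """Sort key for deduplication: unversioned package sets (the
    revision's default) come first, then higher versions."""
    m = re.search(r"\d+", attr.split(".")[0])
    return (0, 0) if m is None else (1, -int(m.group()))
```

With these helpers, `sorted(candidates, key=version_priority)[0]` selects the representative build of a package that appears in several versioned sets.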
IV-C Comparison with the minimal ISO image
As part of RQ2, we investigate the difference in reproducibility rate between critical packages whose reproducibility is monitored and the rest of the package set. We are also interested in knowing whether observing the reproducibility health of this subset of packages provides sufficient information on the state of the rest of the project. The minimal and gnome-based ISO images are considered critical subsets of packages and benefit from community-maintained reproducibility monitoring. We study the minimal ISO image because it contains a limited number of core packages. We evaluate the Nix expression associated with the image, compute its runtime closure (the set of packages included in the image), and match it with the packages of our dataset to infer their reproducibility statuses.
IV-D Analyzing causes of unreproducibility using diffoscopes
Analyzing causes of unreproducibility is a tricky debugging activity, usually carried out by practitioners (in particular, by Linux distribution maintainers and members of the Reproducible Builds project). Some automatic fault localization methods have been proposed [ren_root_2020], but they rely on instrumenting the build, while we have to run the Nix builds unchanged to avoid introducing biases.
For each unreproducible output, we run diffoscope with a 5-minute timeout, yielding a dataset of $86\,476$ diffoscopes. We then investigate whether we can use our large dataset of diffoscopes for automatic detection of causes of non-reproducibility. The diffoscope tool was mainly designed to help human debugging, but it also supports producing a JSON output, which can then be machine processed.
We wish to explore heuristics that can be applied at the line level, so we recurse through diffoscope structures until leaf nodes, which are diffs in unified diff format. We randomly draw one added line from $10\,000$ diffoscopes, sort them by similarity to ease visual inspection, and manually inspect them to derive relevant heuristics. We then run these heuristics on the full diffoscope dataset to determine the proportion of packages impacted by each cause (multiple causes can apply to the same package). The first and last authors then evaluate the precision of each of these heuristics by manually counting false positives in samples of matched lines for each heuristic.
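For illustration, two heuristics in the spirit of those we derived, each a regular expression matched against added diff lines; the patterns shown here are simplified examples, not our exact production rules:

```python
import re

# Each entry maps a candidate non-reproducibility cause to a pattern
# matched against added lines of a unified diff (illustrative only).
HEURISTICS = {
    "embedded build date": re.compile(
        r"\b(19|20)\d{2}-(0\d|1[0-2])-([0-2]\d|3[01])\b"             # ISO dates
        r"|\b(Mon|Tue|Wed|Thu|Fri|Sat|Sun),? +"
        r"(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\b"       # ctime-style
    ),
    "embedded build path": re.compile(r"/build/|/tmp/nix-build-"),
}

def classify_line(added_line):
    """Causes whose pattern matches one added line of a unified diff."""
    return {cause for cause, rx in HEURISTICS.items() if rx.search(added_line)}
```

Running such classifiers over every added line of every diffoscope, and aggregating per package, yields the prevalence estimates reported in our results; the manually measured precision of each heuristic bounds the error of these estimates.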
IV-E Automatic identification of reproducibility fixes
To investigate fixes of unreproducibility, for each unreproducible package that later becomes reproducible, we run an automatic bisection process to find the first commit that fixes its reproducibility. By looking at the corresponding pull request on GitHub, we check whether the maintainers provide information on why the change fixes a reproducibility issue, or link to a corresponding bug report. In particular, we are interested in checking how often maintainers are aware that their commit is a reproducibility fix (as opposed to a routine package update that happens to embed a reproducibility fix crafted by the upstream maintainers).
We start from the set of packages from all revisions and look specifically at packages that change status from unreproducible to reproducible in two successive revisions. These are our candidate packages (with their "old" and "new" commits) for the bisection process.
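The candidate-selection step amounts to scanning each package's status history for a transition between successive revisions, as in this sketch (the status labels and data layout are illustrative):

```python
def fix_candidates(history):
    """Given, for each package, its build statuses in revision order as
    (revision, status) pairs, return the ("old", "new") revision pairs
    where the package flips from unreproducible to reproducible between
    two successive revisions."""
    candidates = {}
    for package, statuses in history.items():
        for (old_rev, old_st), (new_rev, new_st) in zip(statuses, statuses[1:]):
            if old_st == "unreproducible" and new_st == "reproducible":
                candidates.setdefault(package, []).append((old_rev, new_rev))
    return candidates
```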
Since we perform the bisection process on a different runner and at a different time than the dataset creation, we may fail to reproduce the status (reproducible or unreproducible) of some builds. Therefore, before starting the bisection, we verify that we obtain consistent results on the "old" and the "new" revision. Then, for those that behave as expected, we start an automatic git bisect process.
The script used for the automatic git bisect checks the reproducibility status of the build to mark the selected commit as "old" or "new". Commits that fail to build or are not available in the cache are marked as "skipped". We use git bisect with the --first-parent flag because intermediate pull request commits are typically not in cache.
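`git bisect run` drives the bisection from the driver script's exit code: 0 marks the commit with the "old" behavior, codes 1-127 (except 125) mark the "new" behavior, and the special value 125 asks git to skip the commit. The status-to-exit-code mapping of such a driver might look like this (the status labels are illustrative):

```python
import sys

def bisect_exit_code(status):
    """Map a build's reproducibility status to the exit codes that
    `git bisect run` understands:
      0   -> commit shows the "old" behavior (still unreproducible)
      1   -> commit shows the "new" behavior (now reproducible)
      125 -> commit cannot be tested: ask git bisect to skip it"""
    if status in ("build-failed", "not-in-cache"):
        return 125
    return 0 if status == "unreproducible" else 1

if __name__ == "__main__":
    sys.exit(bisect_exit_code(sys.argv[1]))
```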
For the qualitative analysis, we first group packages by fixing commit, as seeing all packages fixed by a given commit gives valuable information that might help to understand the reproducibility failure being fixed. We then randomly sample fixes and open their commit page on GitHub, locating the corresponding pull request. We manually inspect the pull request (description, commit log, code changes) to first confirm that the bisect phase successfully identified the commit that fixed the reproducibility failure. It may be the case that, after careful inspection, the change looks unrelated to the package being fixed (the bisection process can give incoherent results in case of a flaky reproducibility issue, or because the package changed status several times between two data points), in which case we discard it. Once we have confirmed that the identified commit is correct, we check whether the commit authors indicate that they are fixing a reproducibility issue and whether the commit is a package update or another kind of change. We analyze 100 randomly sampled reproducibility fixes and report our findings. Additionally, we perform the same analysis on the 15 commits that fix the most packages (from $3052$ down to 27 packages fixed) to find potential differences in contributor behavior for those larger-scale fixes.
V Dataset
The main result of running the pipeline of Figure 2 is a large-scale dataset of historical package rebuilds, including (re)build information, bitwise reproducibility status and, in case of non-reproducibility, generated diffoscopes. In this paper, we use the dataset to answer our stated research questions, but many other research questions could be addressed using the dataset, including more in-depth analysis of non-reproducibility causes. We make the dataset available to the research and technical community to foster further exploration on the topic. In the remainder of this section, we provide some descriptive statistics of the dataset.
Figure 3: Evolution of the number of packages in each nixpkgs revision (as defined by release.nix), in their evaluation closure and after deduplicating ecosystem copies.
The dataset spans $709\,816$ package builds coming from 17 nixpkgs revisions, built over a total of $14\,296$ hours. Of those builds, $548\,390$ come directly from the release.nix file and can be tracked by name. They correspond to $58\,103$ unique packages appearing over the span of sampled revisions. As can be seen in Figure 3, the number of packages listed to be built by Hydra increased from $13\,527$ in 2017 to $53\,432$ in 2023. Ecosystems can be present in multiple versions. On average, deduplicating packages present in multiple ecosystem copies decreases their number by 21%, while adding the evaluation closure of the release.nix file increases the number of jobs by 29%.
Figure 4: Evolution of the size of the nine most popular software ecosystems in nixpkgs, the packages whose ecosystem is undetermined (base), and the packages from other ecosystems (only packages listed in release.nix).
Figure 4 shows the evolution of the sizes of the top 9 ecosystems in nixpkgs, plus the base namespace. 61.7% of packages belong to an ecosystem, while the rest live in the nixpkgs base namespace. The three largest ecosystems are Haskell, Python, and Perl, which together account for 42.4% of the packages.
Figure 5: Number of packages introduced by every revision of the dataset and their survival in the package set over time.
Figure 5 outlines the number of packages introduced in the dataset by each revision, and their survival over time. In particular, as of July 2017 the package set contained $13\,114$ elements, $8929$ of which were still present in April 2023.
VI Results
We present our experimental results below, organized by research question. Their discussion is provided later, in Section VII.
VI-A RQ0: Does Nix allow rebuilding past packages reliably (even if not bitwise reproducibly)?
This research question aims to reproduce the results of Malka et al. [malka_reproducibility_2024] as a starting baseline. That earlier work only built one nixpkgs revision, the oldest in their dataset; in our case, we rebuilt that revision alongside 16 others, evenly spaced over time to study trends. Figure 6 shows the proportion of packages between 2017 and 2023 that we successfully rebuilt (not necessarily in a bitwise reproducible manner, merely "successfully built" for this RQ).
This proportion varies between 99.68% and 99.95%, confirming previously reported findings: the reproducibility of Nix build environments allows for very high rebuildability rates over time. Note that this is not an exact replication of [malka_reproducibility_2024], because we also included packages not explicitly listed in release.nix but present in the dependency graph, whereas they did not.
Figure 6: Proportion of rebuildable packages over time.
VI-B RQ1: What is the evolution of bitwise reproducible packages in nixpkgs between 2017 and 2023?
Apart from a significant regression in 2020, we obtain bitwise reproducibility levels between 69% and 91% (see Figure 7). The trends in Figure 8 show that the absolute number of bitwise reproducible packages has consistently gone up and followed the fast growth of the package set. The only exception is the data point for June 2020, where the number of reproducible packages dropped even though the total number of packages grew. We study and explain this reproducibility regression in Section VI-C below.
Figure 7: Proportion of reproducible, rebuildable (but unreproducible) and non-rebuildable packages over time.
Figure 8: Absolute numbers of reproducible, rebuildable (but unreproducible) and non-rebuildable packages over time.
Figure 9 shows, for each revision, the cumulative number of unreproducibilities introduced by that revision that get fixed over time. The large slope between the first two points of each plot indicates that most of the unreproducibilities introduced in a revision are fixed by the next revision: on average 62% of them (even rising to 85% if we account for packages fixed after the 2020 reproducibility regression).
<details>
The chart illustrates the impact of specific commits on package reproducibility. The large increase in fixed packages around 2020 suggests that these commits introduced significant reproducibility issues that required substantial effort to resolve. The subsequent decline in fixed packages indicates that the issues were largely addressed over time. The commits with consistently low numbers of fixed packages may represent changes that had minimal impact on reproducibility or were already well-managed. The data suggests a concentrated effort to address reproducibility problems in the 2020 timeframe, followed by a period of stabilization. The varying magnitudes of the spikes indicate that some commits were more problematic than others. The fact that most lines converge towards zero by 2023 suggests that the initial wave of reproducibility issues has been largely resolved, but ongoing maintenance is still required (as evidenced by the small, non-zero values for some commits).
</details>
Figure 9: Evolution of the cumulative number of fixes to non-reproducibilities, separated by the revision in which the non-reproducibilities were introduced.
Figure 10 depicts the average evolution of the package set between two consecutive revisions, excluding the June 2020 revision (which corresponds to the observed bitwise reproducibility regression). In particular, only 0.60% of the packages transition from the reproducible to the buildable but not reproducible status between two revisions, while 2.07% of the buildable but not reproducible packages become reproducible. The average growth rate of the package set is 1.07 and, on average, 80.02% of the new packages are reproducible.
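The average flows above can be illustrated with a small two-state model. Only the rates (0.60% regressions, 2.07% fixes, a 1.07 growth factor, 80.02% of new packages reproducible) come from our measurements; the package counts below are hypothetical, and the model ignores the failed and non-existent states for simplicity.

```python
# Sketch: apply the average per-revision transition rates reported above to
# hypothetical package counts. Only the rates come from the study; the counts
# are made up, and the failed/non-existent states are ignored.
def next_revision(reproducible: float, buildable_unrepro: float) -> tuple[float, float]:
    """One revision step using the average flows reported in the paper."""
    repro_to_buildable = 0.0060 * reproducible        # 0.60% regress
    buildable_to_repro = 0.0207 * buildable_unrepro   # 2.07% get fixed
    total = reproducible + buildable_unrepro
    new_packages = total * (1.07 - 1.0)               # 1.07 growth factor per revision
    new_repro = 0.8002 * new_packages                 # 80.02% of new packages reproducible
    r = reproducible - repro_to_buildable + buildable_to_repro + new_repro
    b = buildable_unrepro - buildable_to_repro + repro_to_buildable + (new_packages - new_repro)
    return r, b

# Hypothetical starting point: 50k reproducible, 10k buildable-but-unreproducible.
r, b = next_revision(50_000.0, 10_000.0)
```

Under these rates the reproducible pool grows at every step: fixes and new reproducible packages together outweigh the small 0.60% regression flow.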
<details>
<summary>extracted/6418685/paper-images/sankey-average.png Details</summary>

Sankey diagram: average flow of packages between two consecutive revisions, with the states Reproducible, Non-Existent, Failed, and Buildable on both sides and flow widths proportional to the number of packages transitioning between states; the widest flows are the self-transitions (packages keeping their status).
</details>
Figure 10: Sankey graph of the average flow of packages between two revisions, excluding the revision from June 2020, considered as an outlier.
VI-C RQ2: What are the unreproducible packages?
We observe large disparities, both in trends and in reproducibility rates, across package ecosystems (see Figure 11). Among the top three most popular ecosystems, Perl has consistently maintained a proportion of reproducible packages above 98%, while Haskell's reproducibility rate stagnated around the 60% mark, even decreasing by more than 7 percentage points over the course of our study. A fix to a long-standing reproducibility issue was introduced in GHC 9.12.1 (the new -fobject-determinism flag), but as of January 2025, nixpkgs does not make use of this flag to improve Haskell build reproducibility. The Python ecosystem, on the contrary, shows a positive evolution over time, with reproducibility rates as low as 27.64% in December 2017 reaching 98.28% in April 2023. As can be seen in Figure 11, the Python ecosystem did, however, experience a major dip in reproducibility in June 2020, dropping to 6.01% (a fall of almost 90 percentage points!).
<details>
<summary>extracted/6418685/paper-images/reproducibility-ecosystems.png Details</summary>

Line chart: proportion of bitwise reproducible packages (y-axis, 0% to 100%) over time (x-axis, 2018–2023), with one line each for the base, Haskell, Python, and Perl package sets of nixpkgs.
</details>
Figure 11: Proportion of reproducible packages belonging to the three most popular ecosystems and the base namespace of nixpkgs.
As can be seen in Figure 12, this dip in the proportion of reproducible Python packages is explained by a large number of packages transitioning from the reproducible to the buildable but not reproducible state, indicating a regression in bitwise reproducibility at that time. To identify the root cause of this regression, we ran a git bisection on the nixpkgs repository between June and October 2020 to find the commit fixing the unreproducibility issue, and determined that the root cause was an update to the pip executable that changed the behavior of byte-compilation. Details can be found at https://github.com/pypa/pip/issues/7808.
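The bisection reduces to a binary search over the revision history. The sketch below uses a hypothetical `builds_reproducibly` predicate in place of the real rebuild-and-compare check, which in practice is the expensive step of each probe (the package is rebuilt twice and the outputs compared bitwise).

```python
# Sketch of the git-bisect logic used to locate a commit fixing a
# reproducibility regression. The revision list and the predicate are
# hypothetical stand-ins for the real nixpkgs history and rebuild check.
def find_fixing_revision(revisions, builds_reproducibly):
    """Return the first revision at which the package is reproducible again,
    assuming revisions are ordered and the property flips exactly once."""
    lo, hi = 0, len(revisions) - 1
    assert not builds_reproducibly(revisions[lo])  # broken at the start
    assert builds_reproducibly(revisions[hi])      # fixed at the end
    while lo + 1 < hi:
        mid = (lo + hi) // 2
        if builds_reproducibly(revisions[mid]):
            hi = mid
        else:
            lo = mid
    return revisions[hi]

# Toy run: revisions 0..99, with the (hypothetical) fix landing at revision 73.
fix = find_fixing_revision(list(range(100)), lambda r: r >= 73)
```

Each probe halves the candidate range, so even a window of several thousand commits needs only a dozen or so rebuilds.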
Figure 13 shows the difference in reproducibility rates between the NixOS minimal ISO image, a package set for which community-based reproducibility monitoring exists and whose packages are considered critical, and the whole package set. The minimal ISO image has a very high proportion of reproducible packages over the considered period, with rates consistently above 95% starting from May 2019. Its reproducibility rates, however, do not follow the evolution of the overall package set, so observing the reproducibility of the minimal ISO image gives no indication of the reproducibility of the package set as a whole.
<details>
<summary>extracted/6418685/paper-images/reproducibility-absolute-python.png Details</summary>

Stacked area chart: absolute number of Python packages (y-axis, 0 to about 6000) over time (x-axis, 2018–2023), split into reproducible, buildable (but unreproducible), and failed builds.
</details>
Figure 12: Evolution of the absolute number of reproducible, rebuildable (but unreproducible) and non-rebuildable packages from the Python ecosystem.
<details>
<summary>extracted/6418685/paper-images/minimal-iso-reproducibility.png Details</summary>

Line chart: proportion of bitwise reproducible packages (y-axis, 50% to 100%) over time (x-axis, 2018–2023) for the whole distribution and for the minimal ISO image, the latter consistently higher and more stable.
</details>
Figure 13: Evolution of the proportion of reproducible packages belonging to the minimal ISO image and in the entire package set.
VI-D RQ3: Why are packages unreproducible?
We identified four heuristics that are both effective (they give a large number of matches in our diffoscope dataset) and relevant (they correspond to a software engineering practice that can be changed) to automatically determine why packages are not bitwise reproducible:
- Embedded dates: presence of build dates in various places in the built artifacts (version file, log file, file names, etc.). To detect dates embedded in build outputs, we look specifically for the year 2024 in added lines, since all our rebuilds ran during 2024 while none of the original historical builds did.
- uname output: some unreproducible builds embed build information such as the machine that the build was run on. While the real host name is hidden by the Nix sandbox, other pieces of information (such as the Linux kernel version or OS version, reported by uname) are still available.
- Environment variables: some builds embed some or all available environment variables in their build outputs. This typically causes unreproducibility because Nix sets an environment variable containing the number of cores available for building, which varies from machine to machine.
- Build ID: some ecosystems (Go for example) embed a unique (but not deterministic) build ID into the artifacts.
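These heuristics can be sketched as pattern matching over the added lines of a diffoscope output. The regular expressions below are illustrative approximations of the four causes listed above, not the exact patterns used in our analysis:

```python
import re

# Illustrative patterns, one per heuristic described above. These regexes are
# simplified assumptions, not the exact ones used in the study.
HEURISTICS = {
    "embedded-date": re.compile(r"\b2024\b"),            # rebuilds ran in 2024 only
    "uname-output": re.compile(r"Linux \S+ \d+\.\d+"),   # kernel version strings
    "environment-variable": re.compile(r"\bNIX_BUILD_CORES=\d+"),
    "build-id": re.compile(r"\bBuild ID: [0-9a-f]{8,}"),
}

def classify(added_lines):
    """Return the set of heuristics matching any added line of a diff."""
    return {name for line in added_lines
                 for name, pat in HEURISTICS.items() if pat.search(line)}

causes = classify([
    "Built on 2024-03-15 12:00:00",
    "Linux builder 6.1.0 #1 SMP",
])
```

Restricting the search to added lines is what makes the date heuristic precise: a "2024" appearing in both build outputs would not show up in the diff at all.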
Despite a well-known recommendation by the Reproducible Builds project to avoid embedding build dates in build artifacts to achieve bitwise reproducibility, we find that $12\,831$ of our $86\,317$ packages with diffoscopes contain a date, i.e., 14.8% of them. Even in the most recent nixpkgs revision, we find $470$ instances of this unreproducibility cause, showing that embedding dates remains a current practice. Additionally, embedded environment variables and build IDs each account for 2.2% of our unreproducible packages. Finally, we find uname outputs in $1097$ of our unreproducible builds (1.3%), $721$ of which also include a date as part of the uname output. Figure 14 shows the evolution of the number of packages matched by each heuristic over time. Altogether, 19.7% of the packages for which we generated a diffoscope have an unreproducibility that can be explained by at least one of our heuristics.
We evaluate the precision of each heuristic by manually verifying $500$ matched lines for every heuristic and counting false positives. For the uname, build ID, and environment variables heuristics, we report a precision of 100%. For the date heuristic, we obtain a precision of 97.8% (11 false positives).
<details>
<summary>extracted/6418685/paper-images/evolution-heuristics.png Details</summary>

Line chart: number of packages matched by each heuristic (embedded dates, embedded uname output, embedded environment variable, embedded build ID) over time (x-axis, 2018–2023).
</details>
Figure 14: Evolution of the number of packages for which we generated diffoscopes that are matched by each of our heuristics, over time.
VI-E RQ4: How are unreproducibilities fixed?
Our analysis of 100 randomly sampled reproducibility fixes showed that, in most cases, authors are not aware that they are fixing a reproducibility issue, or do not mention it: in 93 instances, we found no trace of reproducibility being mentioned. In fact, in 76 cases we found that the reproducibility fix was just a routine package update, and it is very likely that the package becoming reproducible had more to do with changes in the upstream software than with a conscious action by the package maintainer. Other fixes included internal nixpkgs changes that had a reproducibility impact as a side effect of the main reason behind the change.
Our study of the 15 most impactful fixes suggests that those are more often made with the intent of fixing reproducibility: in more than half of them (8 out of 15), the author mentioned and documented the reproducibility issue being fixed, and 9 of them were changes internal to nixpkgs. Some large-scale reproducibility fixes were nonetheless package updates, such as toolchain upgrades.
VII Discussion
This work brings valuable insights to the ongoing discussions about software supply chain security, reproducible builds, and functional package management.
False myth: reproducible builds (R-B) do not scale
One strongly held belief about R-B is that they work well in limited contexts, either selected "critical" applications (e.g., cryptocurrencies or anonymity technology) or strictly curated distributions, but either do not scale or are not worth the effort in larger settings. Our results can be read as a counterexample to this belief: in nixpkgs, the largest general-purpose repository with more than 70k packages (in April 2023), more than 90.5% of packages are nowadays bitwise reproducible from source. R-B do scale to general-purpose, regularly used software.
Recommendation: invest in infrastructure for binary caches and build attestations
To reap the security benefits of the R-B vision [lamb_reproducible_2022] in practice, users need multiple components: (1) signed attestations by each independent builder, publishing the obtained checksums at the end of a build; (2) caches of built artifacts, so that users can avoid rebuilding from source; (3) verification technology in package managers to check that built artifacts in (2) match the consensus checksums in (1). With few exceptions, the infrastructure to support all this does not exist yet. Now that the "R-B do not scale" excuse is out of the picture, we call for investments to develop this infrastructure, preferably in a distributed architectural style to make it more secure, robust, and sustainable.
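Component (3) amounts to checking a locally computed checksum against the checksums attested by independent builders. The sketch below assumes a simple quorum rule and hypothetical data shapes, since no specific attestation format is prescribed here:

```python
import hashlib
from collections import Counter

# Sketch of artifact verification against builder attestations. The quorum
# rule, function name, and data shapes are illustrative assumptions; a real
# system would also verify the builders' signatures on each attestation.
def verify_artifact(artifact_bytes: bytes, attested_digests: list[str],
                    quorum: int) -> bool:
    """Accept the artifact only if at least `quorum` independent builders
    attested the same SHA-256 digest, and it matches what we computed."""
    local = hashlib.sha256(artifact_bytes).hexdigest()
    digest, votes = Counter(attested_digests).most_common(1)[0]
    return digest == local and votes >= quorum

# Toy example: three builders agree, one disagrees; a quorum of 2 suffices.
blob = b"some built artifact"
good = hashlib.sha256(blob).hexdigest()
ok = verify_artifact(blob, [good, good, good, "deadbeef"], quorum=2)
```

The quorum makes the scheme robust to a minority of compromised or buggy builders, which is the main security argument for running several independent rebuilders rather than one.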
False myth: Nix implies bitwise reproducibility
In contradiction with the previous myth, there is also a belief that Nix (or functional package management more generally) implies bitwise reproducibility. Our results disprove this belief: about 9.4% of nixpkgs packages are not bitwise reproducible. This is not surprising, because R-B violations happen for multiple reasons, only some of which are mitigated by FPM per se.
Recommendation: use FPM for bitwise reproducibility and rebuildability
Still, Nix appears to perform really well at bitwise reproducibility "out of the box", and even more so at rebuildability (above 99%). Based on this, we recommend using Nix, or FPM more generally, for all use cases that require either mere rebuildability or full bitwise reproducibility. At worst, they provide a very good starting point.
This work has investigated the causes of the lingering non-reproducibility in nixpkgs, but not those of reproducibility; it would be interesting to know why Nix performs so well, possibly for adoption in different contexts. It is possible that bitwise reproducibility is an emergent property of FPM, that it comes from technical measures applied during builds such as sandboxing, or that Nix is simply benefiting (more than other distributions?) from the decade-long efforts of the R-B project in upstream toolchains. Exploring all this is left as future work.
QA monitoring for bitwise reproducibility
The significantly (and stably) higher bitwise reproducibility performance of the NixOS minimal ISO image (see Figure 13) suggests that quality assurance (QA) monitoring for build reproducibility is an effective way to increase it. The fact that Debian uses a similar approach with good results [lamb_reproducible_2022] is compatible with this interpretation. A deeper analysis of the relationship between QA monitoring and reproducibility rates is out of scope for this work, but it is quite likely that extending reproducibility monitoring to all packages would result in an easy win for nixpkgs (assuming bitwise reproducibility is seen as a project goal).
VIII Threats to validity
VIII-A Construct validity
We rebuilt Nix packages from revisions in the period July 2017 to April 2023, spanning about 6 years. That is arguably enough to observe long-term trends, which also appear stable in our analysis, except for one temporary regression, which we analyzed and explained. Still, we cannot exclude significant differences in reproducibility trends before or after the studied period.
Due to the high computation cost, build time, and environmental impact of rebuilding general-purpose software from source, we sampled nixpkgs revisions to rebuild uniformly (every 4.1 months) within the studied period, totaling $14\,296$ hours of build time. We cannot exclude trend anomalies between sampled revisions either, but that seems unlikely given the stability of the observed long-term trends. More importantly, this means that we are less likely to catch short-lived reproducibility regressions, which would be introduced and fixed within the same interval between two sampled revisions. This can be improved upon by sampling and rebuilding more nixpkgs revisions, complementing this work.
When rebuilding a given package, we relied on the Nix binary cache for all its transitive dependencies. As we have rebuilt all packages from each sampled nixpkgs revision, our package coverage is complete. However, this way we have not measured the impact of unreproducible packages on their transitive reverse dependencies, i.e., how many additional packages would become unreproducible if systematically built from scratch, including all their dependencies. Our experiment matches real-world use cases and is hence adequate to answer our RQs, but measuring that impact would still be interesting.
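The measurement suggested above could be sketched as a graph traversal: starting from the set of unreproducible packages, walk the reverse-dependency edges to find every package "tainted" by an unreproducible (transitive) dependency. The following is a minimal illustration under assumed data structures (a plain package-to-dependencies mapping), not the tooling used in this work.

```python
from collections import deque

def tainted_packages(deps: dict[str, list[str]],
                     unreproducible: set[str]) -> set[str]:
    """deps maps each package to its direct dependencies; return every
    package that is unreproducible itself or transitively depends on an
    unreproducible package."""
    # Build reverse edges: dependency -> packages that depend on it.
    rdeps: dict[str, list[str]] = {}
    for pkg, ds in deps.items():
        for d in ds:
            rdeps.setdefault(d, []).append(pkg)
    # BFS from the unreproducible set along reverse-dependency edges.
    tainted = set(unreproducible)
    queue = deque(unreproducible)
    while queue:
        cur = queue.popleft()
        for parent in rdeps.get(cur, []):
            if parent not in tainted:
                tainted.add(parent)
                queue.append(parent)
    return tainted
```

For instance, with `deps = {"app": ["libfoo"], "libfoo": ["zlib"], "tool": ["zlib"], "zlib": []}` and `{"zlib"}` unreproducible, all four packages end up tainted, showing how a single unreproducible leaf can affect a large fraction of a repository.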
Also, during rebuilds we have not attempted to re-download online assets, but relied on the Nix cache. Hence, we have not measured the impact of lacking digital preservation practices on reproducibility and rebuildability. It would be interesting and possible to measure such impact by, e.g., first trying to download assets from their original hosting places and, for source code assets, falling back to archives such as Software Heritage [dicosmo_2017_swh] in case of failure.
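The fallback strategy just described could be sketched as follows. This is a hypothetical illustration: the function names are ours, the Software Heritage content-by-hash endpoint shape is an assumption to be checked against the archive's API documentation, and a real implementation would need to handle archives of whole directories, not just single files.

```python
import hashlib
import urllib.request

# Assumed shape of Software Heritage's content-by-hash raw endpoint.
SWH_CONTENT_URL = "https://archive.softwareheritage.org/api/1/content/sha256:{}/raw/"

def candidate_urls(original_url: str, sha256_hex: str) -> list[str]:
    """Download locations in preference order: the original hosting
    place first, then the Software Heritage archive as a
    digital-preservation fallback."""
    return [original_url, SWH_CONTENT_URL.format(sha256_hex)]

def fetch_with_fallback(original_url: str, sha256_hex: str) -> bytes:
    """Try each candidate URL in order; accept the content only if its
    SHA-256 matches the expected hash recorded by the package manager."""
    last_error: Exception | None = None
    for url in candidate_urls(original_url, sha256_hex):
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                data = resp.read()
        except OSError as exc:
            last_error = exc
            continue
        if hashlib.sha256(data).hexdigest() == sha256_hex:
            return data
    raise RuntimeError(f"all sources failed or hash mismatched: {last_error}")
```

Verifying the hash before accepting either source is essential here: it makes the two download locations interchangeable, since the content is pinned by its digest rather than by its URL.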
VIII-B External validity
We have rebuilt packages from nixpkgs, the largest cross-ecosystem FOSS distribution, using the Nix functional package manager (FPM). Other FPMs with significant user bases exist, such as Guix [courtes_functional_2013]. Given the similarity in design choices, we have no reason to believe that Guix would perform any differently from Nix in terms of build reproducibility, but we have not empirically verified this; doing so would be useful complementary work.
Neither did we verify historical build reproducibility in classic FOSS distributions (e.g., Debian, Fedora), as they fall outside the FPM scope. Doing so would still be interesting for comparison purposes. It would be more challenging, though, because outside the FPM context it is significantly harder to recreate exact build environments from years ago.
IX Conclusion
In this work we conducted the first large-scale experiment on bitwise reproducibility in the context of the Nix functional package manager, rebuilding $709\,816$ packages coming from 17 revisions of the nixpkgs software repository, sampled every 4.1 months from 2017 to 2023. Our findings show that bitwise reproducibility in nixpkgs is very high and has followed an upward trend, from 69% in 2017 to 91% in 2023. The mere ability to rebuild packages (whether bitwise reproducibly or not) is even higher, stable at around 99.8%.
We have highlighted disparities in reproducibility across the ecosystems that coexist in nixpkgs, as well as between packages for which bitwise reproducibility is actively monitored and the others. We have developed heuristics to understand common (un)reproducibility causes, finding that 15% of unreproducible packages embedded the date during the build process. Finally, we studied reproducibility fixes and found that only a minority of changes inducing a reproducibility fix were made intentionally; the rest appear to be incidental.
Data availability statement
The dataset produced as part of this work is archived on and available from Zenodo [replication_metadata, replication_diffoscopes]. A full replication package containing the code used to run the experiment described in this paper is archived on and available from Software Heritage [replication-code:1.0].
Acknowledgments
This work would not have been possible without the availability of historical data from the official Nix package cache at https://cache.nixos.org with its long-standing "no expiry" policy. In the context of ongoing discussions about pruning the cache, we strongly emphasize its usefulness for performing empirical research on package reproducibility, like this work. We also thank the NixOS Foundation for guaranteeing that the revisions we needed for this research would not be garbage-collected.
\printbibliography