2501.15919
# Does Functional Package Management Enable Reproducible Builds at Scale? Yes.
**Authors**: Julien Malka, Stefano Zacchiroli, Théo Zimmermann
> LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France (julien.malka@telecom-paris.fr)
> LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France (stefano.zacchiroli@telecom-paris.fr)
> LTCI, Télécom Paris, Institut Polytechnique de Paris, Palaiseau, France (theo.zimmermann@telecom-paris.fr)
Abstract
Reproducible Builds (R-B) guarantee that rebuilding a software package from source leads to bitwise identical artifacts. R-B are a promising approach to increasing the integrity of the software supply chain when installing open source software built by third parties. Unfortunately, despite success stories like high build reproducibility levels in Debian packages, uncertainty remains among field experts on the scalability of R-B to very large package repositories.
In this work, we perform the first large-scale study of bitwise reproducibility, in the context of the Nix functional package manager, rebuilding $709\,816$ packages from historical snapshots of the nixpkgs repository, the largest cross-ecosystem open source software distribution, sampled in the period 2017–2023.
We obtain very high bitwise reproducibility rates, between 69 and 91% with an upward trend, and even higher rebuildability rates, over 99%. We investigate unreproducibility causes, showing that about 15% of failures are due to embedded build dates. We release a novel dataset with all build statuses and logs, as well as full "diffoscopes": recursive diffs of where unreproducible build artifacts differ.
Index Terms: reproducible builds, functional package management, software supply chain, reproducibility, security
I Introduction
Free and open source software (FOSS) is a great asset to build trust in a computing system, because one can audit the source code of installed components to determine whether their security is up to one's standards. However, trusting the source code of the components making up a system is not enough to trust the system itself: before a program can be run on a user machine, it is typically built (a term that we also use in this work for interpreted programs, where it is the runtime environment that has to be built) to obtain an executable artifact, and then distributed onto the target system, involving a set of processes and actors generally referred to as the software supply chain. In recent years, large-scale attacks like SolarWinds [alkhadra_solar_2021] or the xz backdoor [noauthor_nvd_nodate] have specifically targeted the software supply chain, underlining the importance of measures to increase its security and triggering policy responses in the European Union and the United States of America [noauthor_cyber_2022, house_executive_2021]. The particular effectiveness of these attacks stems from the difficulty of analyzing binary artifacts to understand how they might act on the system, hence the need for tooling that provides traceability from executable binaries back to their source code.
Reproducible builds (R-B), the property of being able to obtain the same, bitwise identical, artifacts from two independent builds of a software component, are recognized as a promising way to increase trust in the distribution phase of binary artifacts [lamb_reproducible_2022]. Indeed, if a software component is bitwise reproducible, a user may require several independent parties to reach a consensus on the result of a compilation before downloading the built artifacts from one of them, effectively distributing the trust in these artifacts among those parties. For an attacker wanting to compromise the supply chain of that component, it is then no longer sufficient to compromise a single involved party. Build reproducibility is, however, not easy to obtain in general, due to non-determinism in build processes, documented both by practitioners and researchers [lamb_reproducible_2022, bajaj_unreproducible_2023]. The Reproducible Builds [noauthor_reproducible_2023] project has worked since 2015 to increase bitwise reproducibility throughout the FOSS ecosystems, coming up with fixes for compilers and other toolchain components and working closely with upstream projects to integrate them.
Unfortunately, a recent study [fourne_its_2023] which interviewed 24 R-B experts still concluded that there is "a perceived impracticality of fully reproducible builds due to workload, missing organizational buy-in, unhelpful communication with upstream projects, or the goal being perceived as only theoretically achievable" and that "much of the industry believes [R-B] is out of reach". While there are some successful examples of package sets with high reproducibility levels, like Debian, which consistently achieves a reproducibility rate of more than 95% [noauthor_overview_nodate], this good performance should be put in perspective with the strict quality policies applied in Debian and the relatively limited size of its package set. Uncertainty remains among field experts about the scalability of this approach to larger software distributions.
Nixpkgs is the largest cross-ecosystem FOSS distribution, totaling as of October 2024 about $100\,000$ packages, based on the Repology rankings (https://repology.org, accessed Oct. 2024). It includes components from a large variety of software ecosystems, making it an interesting target to study bitwise reproducibility at scale. Nixpkgs is built upon Nix, the seminal implementation of the functional package management (FPM) model [dolstra_purely_2006]. It is generally believed that the FPM model is effective for obtaining R-B: FPM packages are pure functions (in the mathematical sense) from build- and run-time dependencies to build artifacts, described as "build recipes" that can be executed locally by the package manager. Components are built in a sandboxed environment, disallowing access to unspecified dependencies, even if they are present on the system. Previous work has highlighted that this model allows reproducing build environments both in space and time [malka_reproducibility_2024], a necessary property for build reproducibility. Additionally, nixpkgs' predefined build processes implement best practices to ensure build reproducibility, like setting the SOURCE_DATE_EPOCH environment variable [noauthor_source_date_epoch_nodate] or automatically verifying that the build path does not appear in the built artifacts [noauthor_build_nodate]. Despite the potential for insightful distribution-wide build reproducibility metrics, nixpkgs limits its monitoring to the narrow set of packages included in the minimal and gnome-based ISO images [noauthor_nixos_nodate], where a reproducibility rate higher than 95% is consistently reported.
Contributions
In this work, we perform the first-ever large-scale empirical study of bitwise reproducibility of FOSS going back in time, rebuilding historical packages from evenly spaced snapshots of the nixpkgs package repository, taken every 4.1 months from 2017 to 2023. With this experiment, we answer the following research questions:
- RQ1: What is the evolution of bitwise reproducible packages in nixpkgs between 2017 and 2023? How does the reproducibility rate evolve over time? Are unreproducible packages eventually fixed? Do reproducible packages remain reproducible?
- RQ2: What are the unreproducible packages? Are they concentrated in specific ecosystems? Are critical packages more likely to be reproducible?
- RQ3: Why are packages unreproducible? Is large-scale identification of common causes possible?
- RQ4: How are unreproducibilities fixed? Are they fixed by specific patches or as part of larger package updates? Are the fixes intentional or accidental?
In addition, we use our experiment to replicate and extend previous results [malka_reproducibility_2024], leading to an additional research question:
- RQ0: Does Nix allow rebuilding past packages reliably (even if not bitwise reproducibly)?
Results
Thanks to this large-scale experiment, we are able to establish for the first time that bitwise reproducibility is achievable at scale, with reproducibility rates ranging from 69% to 91% over the period 2017–2023, despite a continuous increase in the number of packages in nixpkgs. We highlight the wide variability in reproducibility rates across ecosystems packaged in nixpkgs, and show the significant impact that some core packages can have on the overall reproducibility rate of an ecosystem.
We estimate the prevalence of some common causes of non-reproducibility at a large scale for the first time, showing that about 15% of failures are due to embedded build dates.
As part of this work, we introduce a novel dataset containing build logs and metadata of over $709\,000$ package builds, and more than $114\,000$ occurrences of non-reproducibility with full artifacts including "diffoscopes", i.e., recursive diffs of where unreproducible build artifacts differ. Ample room for further research is left open by the dataset, including exploiting the build logs, or applying more complex heuristics or qualitative research to the diffoscopes.
Paper structure
Section II presents the related work. Section III gives some background that is required to understand the experiment, whose methodology is then presented in Section IV. Some descriptive statistics about the dataset are presented in Section V, and the results to our RQs in Section VI. We discuss them in Section VII, and the threats to validity in Section VIII, concluding in Section IX.
II Related work
II-A Reproducible builds (R-B)
R-B are a relatively recent concept, which has been picked up and developed mostly by practitioners from Linux distributions and upstream maintainers. The Reproducible Builds project [noauthor_reproducible_2023] has been the main actor in the area. The project has produced a definition of R-B, best practices to achieve them, and tools to monitor the reproducibility of software distributions and to debug unreproducibilities (diffoscope).
R-B have also piqued the interest of the academic community, with a growing number of papers on the topic.
R-B for the security of the software supply chain
R-B are often seen as a way to increase the security of the software supply chain [lamb_reproducible_2022]. Torres-Arias et al. provide a framework to enforce the integrity of the software supply chain for which they demonstrate an application to enforce R-B [torres-arias_-toto_2019]. Our paper does not contribute directly to this line of research but, by demonstrating the feasibility of R-B at scale, it strengthens the case of the approach.
Techniques for build reproducibility
In a series of articles [ren_automated_2018, ren_root_2020, ren_automated_2022], Ren et al. devised a methodology to automate the localization of sources of non-reproducibility in build processes and to automatically fix them, using a database of common patches that are then automatically adapted and applied.
An alternative technique to achieve build reproducibility is proposed by Navarro et al. [navarro_leija_reproducible_2020]. They propose âreproducible containersâ that are built in a way that makes the build process fully deterministic, at the expense of performance.
FPMs such as Nix [dolstra_purely_2006] and Guix [courtes_functional_2013] are also presented as a way to achieve R-B. Malka et al. [malka_reproducibility_2024] showed that Nix allows reproducing past build environments reliably, as well as rebuilding old packages with high confidence, but they do not address the question of bitwise reproducibility, which we do with this work.
Relaxing the bitwise reproducibility criterion
Because of the difficulty (real or perceived) of achieving bitwise reproducibility, some authors have proposed relaxing the criterion to a more practical one. For instance, accountable builds [poll_analyzing_2021] aim to distinguish between differences that can be explained (accountable differences) or not (unaccountable differences). Our work highlights that bitwise reproducibility is achievable at scale in practice, and thus that relaxing the reproducibility criterion may not be necessary after all.
Empirical studies of R-B
Some other recent academic works have empirically studied R-B in the wild. Two papers from 2023 [butler_business_2023, fourne_its_2023] looked into business adoption of R-B and perceived effectiveness through interviews.
Bajaj et al. [bajaj_unreproducible_2023] mined historical results from R-B tracking in Debian to investigate causes, fix time, and other properties of unreproducibility in the distribution. Our work is similar, but instead of relying on historical R-B tracking, we actually rebuild packages and compare them bitwise to historical build results. When we report on packages being reproducible, it means they have stood the test of time. It also allows us to provide more detailed information on the causes of unreproducibility, in particular by generating diffoscopes and saving them for future research as part of our dataset; whereas diffoscopes from the Debian R-B tracking are not preserved in the long term. Finally, Bajaj et al. used issue tracker data from the R-B project to identify the most common causes of non-reproducibility, possibly introducing a sampling bias since only root causes that were identified by Debian developers are counted in their statistics. In our work, we try to avoid this bias by performing a large-scale automatic analysis of diffoscopes to automatically identify the prevalence of a selection of causes of non-reproducibility. While we present heuristics comparable to some of the causes identified in Bajaj et al.'s taxonomy, we derive them from empirical data rather than relying on pre-labeled data from the Debian issue tracker.
The only other work that performed an experimental study of R-B investigated the impact of configuration options [randrianaina_options_2024]. Contrary to them, we rebuild historical versions of packages in their default configuration. Combining the historical snapshot approach of our work with their approach of varying the configuration options could be an interesting future work.
II-B Linux distributions and package ecosystems
Besides R-B, our work also relates to the literature on Linux distributions and package ecosystems. The nixpkgs repository being the largest cross-ecosystem software distribution, we are able to compare properties of packages across ecosystems. Several previous works have compared package ecosystems (e.g., [decan_empirical_2019]). For an overview of recent research on package ecosystems, see Mens and Decan [mens_overview_2024].
More specifically, nixpkgs is the basis of the NixOS Linux distribution. Linux distributions have a long history of being studied by the research community. Recently, Legay et al. [legay_quantitative_2021] measured the package freshness in Linux distributions. While this is not the topic of this work, our dataset could be used, e.g., to study how frequently packages are updated in nixpkgs.
III Background
We provide in this section some background knowledge about Nix and R-B, which is required to understand the details of our experiments.
III-A The FPM model and the Nix store
The FPM model applies the idea of functional programming to package management. Nix packages are viewed as pure functions from their inputs (source code, dependencies, build scripts) to their outputs (binaries, documentation, etc.). Any change to the inputs should produce a different package version. Nix allows multiple versions of the same package to be built and coexist on the same system. To that end, Nix stores build outputs in input-addressed directories (using a hashing function of the inputs) in the Nix store, usually located in the /nix/store directory on disk. Figure 1 shows an example of a Nix packaging expression (a Nix recipe) for the htop package.
```nix
{ stdenv, fetchFromGitHub, ncurses, autoreconfHook }:

stdenv.mkDerivation rec {
  pname = "htop";
  version = "3.2.1";
  src = fetchFromGitHub {
    owner = "htop-dev";
    repo = "htop";
    rev = version;
    sha256 = "sha256-MwtsvdPHcUdegsYj9NGyded5XJQxXri1IM1j4gef1Xk=";
  };
  nativeBuildInputs = [ autoreconfHook ];
  buildInputs = [ ncurses ];
}
```
Figure 1: Example Nix expression for the htop package.
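The input-addressing principle can be illustrated with a toy Python sketch (an illustration only: Nix's real scheme hashes a serialized derivation, and the `store_path` function below is hypothetical, not part of Nix):

```python
import hashlib

def store_path(name: str, inputs: dict) -> str:
    """Toy model of input-addressing: the output directory is a pure
    function of the package name and all of its inputs, so changing
    any input yields a different store path (hypothetical scheme,
    not Nix's actual derivation hashing)."""
    digest = hashlib.sha256(
        repr(sorted(inputs.items())).encode()
    ).hexdigest()[:32]
    return f"/nix/store/{digest}-{name}"

# Same inputs always map to the same path; any input change moves the output.
p1 = store_path("htop-3.2.1", {"src": "rev-3.2.1", "ncurses": "6.3"})
p2 = store_path("htop-3.2.1", {"src": "rev-3.2.1", "ncurses": "6.4"})
```

This is what lets multiple versions of the same package coexist in the store: each distinct set of inputs occupies its own directory.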
III-B Nix evaluation-build pipeline
Building binary outputs from a Nix package recipe is a two-step process. First, Nix evaluates the expression and transforms it into a derivation, an intermediary representation in the Nix store containing all the necessary information to run the build process. In particular, the derivation contains ahead of time the (input-addressed) output path, that is the exact location in the Nix store where the build artifacts will be stored if that derivation were to be built.
Then, given the derivation file as input, the nix-build command performs the build, creating the pre-computed output path in the Nix store upon completion.
In the same fashion as other Linux distributions, Nix packages may produce multiple outputs (a main output with binaries, one with documentation, etc.). Each output has its own directory in the Nix store, and building the derivation from source systematically produces all its outputs.
III-C Path substitution and binary caches
As an alternative to building from source, Nix offers the option to download prebuilt artifacts from third-party binary caches, which are databases populated with build outputs generated by Nix. Binary caches are indexed by output paths, making it possible for Nix to check for the presence of a precompiled package in a configured cache after the evaluation phase. https://cache.nixos.org is the official cache for the Nix community, and most Nix installations come configured to use it as a trusted cache.
III-D The nixpkgs continuous integration
Hydra [dolstra_nix_nodate] is the continuous integration (CI) platform of the nixpkgs project. At regular intervals, it fetches the latest version of nixpkgs' git master branch and evaluates the pkgs/top-level/release.nix file embedded in the repository. This evaluation yields a list of derivations (or jobs) that are then built by Hydra: one derivation for each of the approximately $100\,000$ packages contained in nixpkgs nowadays. Upon success of a predefined subset of these jobs, the revision is deemed valid and all the built artifacts are uploaded to the official binary cache to be available to end users.
III-E Testing bitwise reproducibility with Nix
Nix ships minimal tooling to test the reproducibility of a given derivation, in the form of a --check flag passed to the nix-build command. To check for bitwise reproducibility, Nix needs a reference, which it will try to acquire from one of the configured caches, failing if it cannot. Nix then acquires the build environment of the derivation under consideration, builds the derivation, and compares each of the outputs of the derivation against the local version. The --keep-failed flag can be used to instruct Nix to keep the unreproducible outputs locally for further processing. For our experiment, we alter the behavior of nix-build --check to prevent it from failing early as soon as one unreproducible output is detected.
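The comparison that nix-build --check performs natively can be sketched in Python (a minimal re-implementation for illustration; the function name is ours):

```python
from pathlib import Path

def bitwise_identical(dir_a: str, dir_b: str) -> bool:
    """Sketch of a bitwise reproducibility check: two build outputs
    match only if they contain the same set of files with
    byte-for-byte identical contents."""
    a, b = Path(dir_a), Path(dir_b)
    rel_a = sorted(p.relative_to(a) for p in a.rglob("*") if p.is_file())
    rel_b = sorted(p.relative_to(b) for p in b.rglob("*") if p.is_file())
    if rel_a != rel_b:  # differing file sets: trivially not reproducible
        return False
    return all((a / r).read_bytes() == (b / r).read_bytes() for r in rel_a)
```

A single differing byte, such as an embedded timestamp, is enough to classify the whole output as unreproducible.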
III-F Diffoscope
Diffoscope [noauthor_diffoscope_nodate] is a tool developed and maintained by the Reproducible Builds project that aims to simplify the analysis of differences between software artifacts. It is able to recursively unpack binary archives and automatically use ecosystem-specific diffing tools to allow for better understanding of what makes two software artifacts different. It generates HTML or JSON artifacts, also called diffoscopes, that can be either interpreted by humans or automatically processed.
IV Methodology
Figure 2: Description of our build and analysis pipeline.
Our build and analysis pipeline is summarized in Figure 2.
IV-A Reproducibility experiments
IV-A 1 Revision sampling
We start from the 200 nixpkgs revisions selected by Malka et al. [malka_reproducibility_2024] in the period July 2017 to April 2023. Since building revisions, as opposed to just evaluating them, is very computationally intensive, it was not feasible to build all 200 revisions. It was also difficult to estimate in advance how many revisions we could build, due to the ever-growing number of packages in each revision. We hence applied dichotomic sampling: we first build the most recent revision, then the oldest one, then the one in the middle, and so on, always picking the revision in the middle of the largest remaining time interval. After building 17 revisions, we obtain a regularly spaced sample of nixpkgs revisions, with one sampled revision every 4.1 months. On average, each revision corresponds to building more than 41 thousand packages (see Figure 3 for details, discussed later).
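The dichotomic sampling step can be sketched as follows (a minimal re-implementation over evenly spaced candidate indices, so time distance reduces to index distance; the function name and this model are ours, not from the replication package):

```python
def dichotomic_sample(n: int, k: int) -> list[int]:
    """Pick k indices out of n candidates: the most recent, then the
    oldest, then repeatedly the midpoint of the widest remaining gap,
    yielding an approximately evenly spaced sample that can be grown
    incrementally until the build budget runs out."""
    chosen = [n - 1, 0]
    while len(chosen) < k:
        s = sorted(chosen)
        # widest interval between two already-chosen indices
        _, lo, hi = max((s[i + 1] - s[i], s[i], s[i + 1])
                        for i in range(len(s) - 1))
        chosen.append((lo + hi) // 2)
    return sorted(chosen)
```

The key property is that the sample stays roughly evenly spaced at every step, so the experiment can be stopped at any point with usable coverage.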
To perform our builds, we used a distributed infrastructure based on Buildbot [noauthor_buildbot_nodate], a Python CI pipeline framework. Our infrastructure has two types of machines: a coordinator and multiple builders. The coordinator is in charge of distributing the workload and storing data that must be persisted, while the builders are stateless and perform workloads sent by the coordinator. During the course of the experiment (from June to October 2024), the set of builders was composed of shared bare-metal machines running various versions of Ubuntu and Fedora, and our coordinator was a virtual machine running Ubuntu. Note that for our bitwise reproducibility checks we compare against the historical results coming from Hydra, whose builders, at the time, ran older versions of Nix than the ones we used. Further details on the operating systems, Nix versions and kernel versions used on the builders can be found in the replication package.
IV-A 2 Evaluation and preprocessing
For each revision considered, the coordinator first evaluates pkgs/top-level/release.nix (containing the list of jobs built by Hydra for this revision) using nix-eval-jobs, a standalone and parallel Nix evaluator similar to the one used on Hydra. The outcome of this operation is a list of derivations. The release.nix file is human crafted and does not contain all the dependencies of the listed packages, even though they are built by Hydra along the way. Since we are interested in testing the reproducibility of the entire package graph built by Hydra, we post-process the list of jobs obtained after the evaluation phase to include all intermediary derivations by walking through the dependency graph of each derivation. During this post-processing, we also check that the derivation outputs are present in the official Nix binary cache. This is required to compare our build outputs with historical results for bitwise reproducibility. Derivations missing from the cache can indicate that they historically failed to build, although there can be other reasons for their absence.
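The closure computation in this post-processing step can be sketched as a standard graph traversal (an illustration only; `deps_of` is a hypothetical stand-in for reading a derivation's input derivations):

```python
def build_closure(jobs, deps_of):
    """Sketch of the post-processing step: extend the evaluated job
    list with every intermediary derivation reachable through the
    dependency graph, via an iterative depth-first traversal."""
    seen, stack = set(), list(jobs)
    while stack:
        drv = stack.pop()
        if drv in seen:
            continue  # already visited through another dependency path
        seen.add(drv)
        stack.extend(deps_of(drv))
    return seen
```

Deduplication through the `seen` set matters here: core derivations like compilers are reached through many paths but must be built and checked only once.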
IV-A 3 Building
To build a job $A$ and test the reproducibility of its build outputs, the builder uses the nix-build --check command, as described in Section III-E. This means that we always assume that $A$'s build environment is buildable and always fetch it from the cache. This allows all the derivations in the sample set to be built and checked independently and in parallel, irrespective of where they are located in the package dependency graph. Note that the source code of packages to build is part of the build environment. Relying on the NixOS cache hence avoids issues such as source code disappearing from its original upstream distribution place. Investigating how much of NixOS can be rebuilt without relying on the cache is an interesting research question, recently explored for Guix [courtes-2024-guix-archiving], but out of scope for this paper.
After each build, we classify the derivation as either building reproducibly, building but not reproducibly or not building. We save the build metadata and the logs, and when available we download and store the historical build logs from the nixpkgs binary cache. Finally, for every unreproducible output path, we store both the historical artifacts and our locally built ones, for comparison purposes.
IV-B Ecosystem identification and package tracking
To answer RQ1, RQ2 and RQ4, we need to be able to discriminate packages by provenance ecosystem, and track them over time to follow their evolution. To categorize packages by ecosystem, we rely on the first component of the package name when it has several components (for example a package named haskellPackages.network is sorted into the Haskell ecosystem). Sometimes, there are several co-existing versions of an ecosystem in a given nixpkgs revision (for example python37Packages and python38Packages being present in the same revision), and sometimes the name of the ecosystem is modified between successive nixpkgs revisions. Therefore, some deduplication step is necessary. The first and last authors performed this step manually by inspecting the 144 ecosystems from the 17 nixpkgs revisions considered, ordered alphabetically, and deciding which ones to merge independently, then checking the consistency of their results, discussing the few differences (missed merges, or false positives) and reaching a consensus. For instance, the following ecosystems were merged into a single one: php56Packages, php70Packages, ..., php82Packages, phpExtensions, phpPackages and phpPackages-unit.
To deduplicate packages appearing in several versions of the same ecosystem, we order by version (favoring the most recent one) and consider any package set without a version number as having a higher priority (since it is the default one in the considered revision, as chosen by the nixpkgs maintainers).
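The automatic part of this categorization can be sketched as follows (the function names are ours; stripping version digits only covers cases like python37Packages vs. python38Packages, while the harder merges described above were done manually in the paper):

```python
import re

def ecosystem_of(pkg: str):
    """Attribute a package to an ecosystem via the first dotted name
    component (e.g. haskellPackages.network -> haskellPackages);
    top-level packages have no ecosystem."""
    return pkg.split(".", 1)[0] if "." in pkg else None

def merge_versions(eco: str) -> str:
    """First-pass merge of versioned ecosystem variants by stripping
    version digits, a heuristic that does not catch renames such as
    phpExtensions vs. phpPackages."""
    return re.sub(r"\d+", "", eco)
```

This illustrates why a manual pass was still needed: name-based heuristics handle version suffixes but not ecosystem renames across revisions.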
IV-C Comparison with the minimal ISO image
As part of RQ2, we investigate the difference in reproducibility rates between critical packages whose reproducibility is monitored and the rest of the package set. We are also interested in knowing whether observing the reproducibility health of this subset of packages provides sufficient information on the state of the rest of the project. The minimal and gnome-based ISO images are considered critical subsets of packages and benefit from community-maintained reproducibility monitoring. We study the minimal ISO image because it contains a limited number of core packages. We evaluate the Nix expression associated with the image, compute its runtime closure (the set of packages included in the image) and match it with the packages of our dataset to infer their reproducibility statuses.
IV-D Analyzing causes of unreproducibility using diffoscopes
Analyzing causes of unreproducibility is a tricky debugging activity, usually carried out by practitioners (in particular, by Linux distribution maintainers and members of the Reproducible Builds project). Some automatic fault localization methods have been proposed [ren_root_2020], but they rely on instrumenting the build, while we have to run the Nix builds unchanged to avoid introducing biases.
For each unreproducible output, we run diffoscope with a 5-minute timeout, yielding a dataset of $86\,476$ diffoscopes. We then investigate whether we can use our large dataset of diffoscopes for automatic detection of causes of non-reproducibility. The diffoscope tool was mainly designed to help human debugging, but it also supports producing a JSON output, which can then be machine processed.
We wish to explore heuristics that can be applied at the line level, so we recurse through diffoscope structures down to leaf nodes, which are diffs in unified diff format. We randomly draw one added line from $10\,000$ diffoscopes, sort them by similarity to ease visual inspection, and manually inspect them to derive relevant heuristics. We then run these heuristics on the full diffoscope dataset to determine the proportion of packages impacted by each cause (multiple causes can apply to the same package). The first and last authors then evaluate the precision of each of these heuristics by manually counting false positives in samples of matched lines for each heuristic.
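The recursion to leaf diffs, together with one example heuristic for embedded build dates, can be sketched as follows (a sketch under the assumption that the JSON tree carries leaf diffs under a `unified_diff` key and children under `details`, as we understand diffoscope's JSON output; the exact schema may vary across versions, and the date regex is one hypothetical heuristic, not the paper's full set):

```python
import re

# Hypothetical heuristic: ISO-like dates appearing on lines added in the rebuild.
DATE_RE = re.compile(r"\b(?:19|20)\d{2}-\d{2}-\d{2}\b")

def added_lines(node: dict):
    """Recurse through a diffoscope-style JSON tree down to leaf
    unified diffs, yielding the content of '+' lines (skipping the
    '+++' file header)."""
    for line in (node.get("unified_diff") or "").splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            yield line[1:]
    for child in node.get("details", []):
        yield from added_lines(child)

def embeds_build_date(diffoscope: dict) -> bool:
    """Flag a diffoscope as matching the embedded-build-date heuristic."""
    return any(DATE_RE.search(line) for line in added_lines(diffoscope))
```

Running such predicates over the full dataset gives per-cause prevalence counts, which is why each heuristic's precision still has to be estimated manually on samples.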
IV-E Automatic identification of reproducibility fixes
To investigate fixes to unreproducibilities, for each unreproducible package that becomes reproducible, we run an automatic bisection process to find the first commit that fixes the reproducibility. By looking into the corresponding pull request on GitHub, we check if the maintainers provide information on why this fixes a reproducibility issue, or link to a corresponding bug report. In particular, we are interested in checking how often the maintainers are aware that their commit is a reproducibility fix (as opposed to a routine package update that embeds a reproducibility fix crafted by the upstream maintainers).
We start from the set of packages from all revisions, and we look specifically at packages that change status from unreproducible to reproducible in two successive revisions. These are our candidate packages (and “old” and “new” commits) for the bisection process.
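The candidate selection can be sketched as follows (the data shapes and status labels are illustrative, not our actual pipeline code):

```python
# Select bisection candidates: packages whose status flips from
# unreproducible to reproducible between two successive revisions.
revisions = ["r1", "r2", "r3"]
status = {  # status[revision][package], toy data
    "r1": {"foo": "unreproducible", "bar": "reproducible"},
    "r2": {"foo": "unreproducible", "bar": "reproducible"},
    "r3": {"foo": "reproducible", "bar": "reproducible"},
}

candidates = [
    (pkg, old, new)  # package plus its "old" and "new" revisions
    for old, new in zip(revisions, revisions[1:])
    for pkg in status[old]
    if status[old][pkg] == "unreproducible"
    and status[new].get(pkg) == "reproducible"
]
print(candidates)  # [('foo', 'r2', 'r3')]
```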
Since we perform the bisection process on a different runner and at a different time compared to the dataset creation, it can happen that we cannot reproduce the status (reproducible or unreproducible) of some builds. Therefore, before starting the bisection, we verify that we obtain consistent results on the “old” and the “new” revision. Then, for those which behave as expected, we start an automatic git bisect process.
The script used for the automatic git bisect checks the reproducibility status of the build to mark the selected commit as “old” or “new”. Commits that fail to build or are not available in cache are marked as “skipped”. We use git bisect with the --first-parent flag because intermediate pull request commits are typically not in cache.
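As a toy illustration of this bisection logic (not our actual runner: a marker file in a throwaway repository stands in for the reproducibility check, and Git ≥ 2.29 is assumed for --first-parent; a real runner would additionally exit 125 to skip commits missing from the cache):

```python
import os
import subprocess
import tempfile

def git(*args, cwd):
    # Capture stdout and stderr together, as git bisect reports on both.
    return subprocess.run(
        ["git", *args], cwd=cwd, check=True, text=True,
        stdout=subprocess.PIPE, stderr=subprocess.STDOUT,
    ).stdout

repo = tempfile.mkdtemp()
git("init", "-q", cwd=repo)
git("config", "user.email", "ci@example.org", cwd=repo)
git("config", "user.name", "ci", cwd=repo)
for i in range(1, 6):  # commits 1-3 are "unreproducible", commit 4 fixes it
    with open(os.path.join(repo, "status"), "w") as f:
        f.write("unreproducible" if i < 4 else "fixed")
    git("add", "status", cwd=repo)
    git("commit", "-qm", f"commit {i}", cwd=repo)

git("bisect", "start", "--first-parent", cwd=repo)
git("bisect", "old", "HEAD~4", cwd=repo)  # oldest commit: known unreproducible
git("bisect", "new", "HEAD", cwd=repo)    # newest commit: known reproducible
# The run command exits 0 ("old") while unreproducible, 1 ("new") once fixed.
out = git("bisect", "run", "grep", "-q", "unreproducible", "status", cwd=repo)
print("commit 4" in out)
```

The final bisect report names the first “new” commit, here the one with subject "commit 4".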
For the qualitative analysis, we first group packages by fixing commit, as seeing all packages fixed by a given commit gives valuable information that might help to understand the reproducibility failure being fixed. We then randomly sample fixes and open their commit page on GitHub, locating the corresponding pull request. We manually inspect the pull request (description, commit log, code changes) to first confirm that the bisect phase successfully identified the commit that fixed the reproducibility failure. It may be the case that, after careful inspection, the change looks unrelated to the package being fixed (the bisection process can give incoherent results in case of a flaky reproducibility issue, or because the package changed status several times between two data points), in which case we discard it. Once we have confirmed that the identified commit is correct, we check whether the commit authors indicate that they are fixing a reproducibility issue, and whether the commit is a package update or another kind of change. We analyze 100 randomly sampled reproducibility fixes and report our findings. Additionally, we perform the same analysis on the 15 commits that fix the most packages (from $3052$ down to 27 packages fixed) to find potential differences in the behavior of contributors for those larger-scale fixes.
V Dataset
The main result of running the pipeline of Figure 2 is a large-scale dataset of historical package rebuilds, including (re)build information, bitwise reproducibility status and, in case of non-reproducibility, generated diffoscopes. In this paper, we use the dataset to answer our stated research questions, but many other research questions could be addressed using the dataset, including more in-depth analysis of non-reproducibility causes. We make the dataset available to the research and technical community to foster further exploration on the topic. In the remainder of this section, we provide some descriptive statistics of the dataset.
(figure: extracted/6418685/paper-images/package-evolution.png; area chart of the number of packages vs. date, 2018-2023; series: deduplicated evaluation, evaluation of release.nix, evaluation closure of release.nix)
Figure 3: Evolution of the number of packages in each nixpkgs revision (as defined by release.nix), in their evaluation closure and after deduplicating ecosystem copies.
The dataset spans $709\,816$ package builds coming from 17 nixpkgs revisions, built over a total of $14\,296$ hours. Of those builds, $548\,390$ come directly from the release.nix file and can be tracked by name. They correspond to $58\,103$ unique packages that appear over the span of sampled revisions. As can be seen in Figure 3, the number of packages listed to be built by Hydra increased from $13\,527$ in 2017 to $53\,432$ in 2023. Ecosystems can be present in multiple versions. On average, deduplicating packages present in multiple ecosystem copies decreases their number by 21%, while adding the evaluation closure of the release.nix file increases the number of jobs by 29%.
(figure: extracted/6418685/paper-images/ecosystems.png; stacked area chart of the number of jobs vs. date, 2018-2023, broken down by ecosystem)
Figure 4: Evolution of the size of the nine most popular software ecosystems in nixpkgs, the packages whose ecosystem is undetermined (base), and the packages from other ecosystems (only packages listed in release.nix).
Figure 4 shows the evolution of the top 9 ecosystem sizes in nixpkgs, plus the base namespace. 61.7% of packages belong to an ecosystem, while the rest live in the nixpkgs base namespace. The three largest ecosystems are Haskell, Python and Perl, which together account for 42.4% of the packages.
(figure: extracted/6418685/paper-images/strates.png; area chart of the number of jobs vs. date, 2018-2023, stratified by the revision that introduced each package)
Figure 5: Number of packages introduced by every revision of the dataset and their survival in the package set over time.
Figure 5 outlines the number of packages introduced in the dataset by each revision, and their survival over time. In particular, as of July 2017 the package set contained $13\,114$ elements, $8929$ of which were still present in April 2023.
VI Results
We present our experimental results below, organized by research question. Their discussion is provided later, in Section VII.
VI-A RQ0: Does Nix allow rebuilding past packages reliably (even if not bitwise reproducibly)?
This research question aims to reproduce the results from Malka et al. [malka_reproducibility_2024], as a starting baseline. That earlier work only built one nixpkgs revision, the most ancient in their dataset; in our case, we rebuilt that revision alongside 16 others, evenly spaced over time to study trends. Figure 6 shows the proportion of packages between 2017 and 2023 that we successfully rebuilt (not necessarily in a bitwise reproducible manner, merely “successfully built” for this RQ).
This proportion varies between 99.68% and 99.95%, confirming previously reported findings: the reproducibility of Nix build environments allows for very high rebuildability rates over time. Note that this is not an exact replication of [malka_reproducibility_2024], because we also included packages not explicitly listed in release.nix but present in the dependency graph, whereas they did not.
(figure: extracted/6418685/paper-images/rebuildability.png; line chart of the proportion of successful rebuilds vs. date, 2018-2023, ranging between roughly 99.68% and 99.95%)
Figure 6: Proportion of rebuildable packages over time.
VI-B RQ1: What is the evolution of bitwise reproducible packages in nixpkgs between 2017 and 2023?
Apart from a significant regression in 2020, we obtain bitwise reproducibility levels between 69% and 91% (see Figure 7). The trends in Figure 8 show that the absolute number of bitwise reproducible packages has consistently gone up and followed the fast growth of the package set. The only exception is the data point for June 2020, where the number of reproducible packages dropped even though the total number of packages grew. We study and explain this reproducibility regression in Section VI-C below.
(figure: extracted/6418685/paper-images/reproducibility-overall-proportion.png; area chart of the proportion of packages by status (reproducible, buildable, failed) vs. date, 2018-2023)
Figure 7: Proportion of reproducible, rebuildable (but unreproducible) and non-rebuildable packages over time.
(figure: extracted/6418685/paper-images/reproducibility-overall-absolute.png; area chart of the absolute number of packages by status (reproducible, buildable, failed) vs. date, 2018-2023)
Figure 8: Absolute numbers of reproducible, rebuildable (but unreproducible) and non-rebuildable packages over time.
Figure 9 shows, for each revision, the cumulative number of unreproducibilities introduced by that revision getting fixed over time. The large slope between the first two points of each plot indicates that most of the unreproducibilities introduced in a revision are fixed in the next one: on average 62% of them (even rising to 85% if we account for packages fixed after the 2020 reproducibility regression).
(figure: extracted/6418685/paper-images/fixes-per-revision.png; line chart of the cumulative number of fixed packages vs. date, 2018-2023, one line per revision that introduced the unreproducibilities)
Figure 9: Evolution of the cumulative number of fixes to non-reproducibilities, separated by the revision in which the non-reproducibilities were introduced.
Excluding the June 2020 revision (corresponding to the observed bitwise reproducibility regression), Figure 10 depicts the average evolution of the package set between two consecutive revisions. In particular, only 0.60% of the packages transition from reproducible to buildable but not reproducible status between two revisions, while 2.07% of the buildable but not reproducible packages become reproducible. The average growth rate of the package set is 1.07 and, on average, 80.02% of the new packages are reproducible.
(figure: extracted/6418685/paper-images/sankey-average.png; Sankey diagram of the average flow of packages between the statuses reproducible, buildable, failed and non-existent across consecutive revisions)
Figure 10: Sankey graph of the average flow of packages between two revisions, excluding the revision from June 2020, considered as an outlier.
VI-C RQ2: What are the unreproducible packages?
We observe large disparities both in trends and in reproducibility rates across package ecosystems (see Figure 11). Among the top three most popular ecosystems, Perl has consistently maintained a proportion of reproducible packages above 98%, while Haskell's reproducibility rate stagnated around the 60% mark, even decreasing by more than 7 percentage points over the course of our study. A fix to a long-standing reproducibility issue was introduced in GHC 9.12.1 (the new -fobject-determinism flag); as of January 2025, nixpkgs does not use this flag to improve Haskell build reproducibility. The Python ecosystem, on the contrary, shows a positive evolution over time, with reproducibility rates as low as 27.64% in December 2017 reaching 98.28% in April 2023. As can be seen in Figure 11, the Python ecosystem nonetheless experienced a major dip in reproducibility in June 2020, dropping to 6.01% (a fall of almost 90 percentage points!).
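Per-ecosystem rates like the ones above reduce to a simple aggregation over per-package build records. The sketch below is illustrative only: the record shape and status labels are assumptions, not the paper's actual schema.

```python
from collections import Counter

def reproducibility_by_ecosystem(records):
    """Proportion of bitwise reproducible packages per ecosystem.

    `records` is an iterable of (ecosystem, status) pairs, where status is
    one of "reproducible", "buildable" (but unreproducible), or "failed"
    -- names chosen for illustration, not taken from the study's dataset.
    """
    total = Counter()
    reproducible = Counter()
    for ecosystem, status in records:
        total[ecosystem] += 1
        if status == "reproducible":
            reproducible[ecosystem] += 1
    return {eco: reproducible[eco] / total[eco] for eco in total}

# Tiny hypothetical sample, mirroring the Perl-vs-Haskell contrast.
records = [
    ("perl", "reproducible"), ("perl", "reproducible"),
    ("haskell", "reproducible"), ("haskell", "buildable"),
]
rates = reproducibility_by_ecosystem(records)
```

Running the same aggregation once per sampled nixpkgs revision yields the time series plotted in Figure 11.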
<details>
<summary>extracted/6418685/paper-images/reproducibility-ecosystems.png Details</summary>

Line chart of the proportion of bitwise reproducible packages over time (2018 to 2023) for the base, haskell, python, and perl package sets. Perl stays near 99% throughout; base rises from roughly 81% to about 90%; Haskell hovers in the 50 to 60% range with a slight downward trend; Python starts around 28%, rises sharply, dips to single digits around the June 2020 regression, then recovers to about 98%.
</details>
Figure 11: Proportion of reproducible packages belonging to the three most popular ecosystems and the base namespace of nixpkgs.
As can be seen in Figure 12, this dip in the proportion of reproducible Python packages can be explained by a large number of packages transitioning from the reproducible to the buildable-but-not-reproducible state, indicating a regression in bitwise reproducibility at that time. To identify the root cause of that regression, we performed a git bisection on the nixpkgs repository between June and October 2020 to locate the commit fixing the unreproducibility issue, and determined that the root cause was an update to the pip executable that changed the behavior of byte-compilation. Details can be found at https://github.com/pypa/pip/issues/7808.
Figure 13 shows the difference in reproducibility rates between the NixOS minimal ISO image (a package set for which community-based reproducibility monitoring exists, made of packages that are considered critical) and the whole package set. The minimal ISO image has a very high proportion of reproducible packages in the considered period, with rates consistently higher than 95% starting from May 2019. Its reproducibility rates however do not follow the evolution of the overall package set, such that observing the reproducibility of the minimal ISO image gives no clue about the reproducibility of the package set as a whole.
<details>
<summary>extracted/6418685/paper-images/reproducibility-absolute-python.png Details</summary>

Stacked area chart of the absolute number of Python packages by status (reproducible, buildable but unreproducible, failed) from 2018 to 2023. The total package count grows steadily over the period; the reproducible count collapses around the June 2020 regression before recovering and growing strongly in the following years.
</details>
Figure 12: Evolution of the absolute number of reproducible, rebuildable (but unreproducible) and non-rebuildable packages from the Python ecosystem.
<details>
<summary>extracted/6418685/paper-images/minimal-iso-reproducibility.png Details</summary>

Line chart comparing the proportion of bitwise reproducible packages in the whole distribution against the minimal ISO image (2018 to 2023). The minimal ISO line is consistently higher, above 95% from May 2019 onward, while the whole-distribution line sits lower, is more volatile, and does not track the evolution of the ISO line.
</details>
Figure 13: Evolution of the proportion of reproducible packages belonging to the minimal ISO image and in the entire package set.
VI-D RQ3: Why are packages unreproducible?
We identified four heuristics that are both effective (they give a large number of matches in our diffoscope dataset) and relevant (they correspond to a software engineering practice that can be changed) to automatically determine why packages are not bitwise reproducible:
- Embedded dates: presence of build dates in various places in the built artifacts (version file, log file, file names, etc.). To detect dates embedded in build outputs, we look specifically for the year 2024 in added lines, given that we ran all our builds during 2024, but none of the historical builds ran during this year.
- uname output: some unreproducible builds embed build information such as the machine that the build was run on. While the real host name is hidden by the Nix sandbox, other pieces of information (such as the Linux kernel version or OS version, reported by uname) are still available.
- Environment variables: some builds embed some or all available environment variables in their build outputs. This typically causes unreproducibility because an environment variable containing the number of cores available to build on the machine is set by Nix.
- Build ID: some ecosystems (Go for example) embed a unique (but not deterministic) build ID into the artifacts.
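Operationally, the four heuristics above amount to pattern matching over the added lines of each diffoscope. A minimal sketch follows; the regular expressions and labels are illustrative stand-ins, not the exact patterns used in the study.

```python
import re

# Illustrative patterns only; the study's actual heuristics may differ.
HEURISTICS = {
    "embedded-date": re.compile(r"\b2024\b"),  # all rebuilds ran in 2024
    "uname-output": re.compile(r"Linux .* \d+\.\d+\.\d+"),
    "environment-variable": re.compile(r"\bNIX_BUILD_CORES=\d+"),
    "build-id": re.compile(r"\bbuild[ -]?id[=: ]", re.IGNORECASE),
}

def classify(added_lines):
    """Return the set of heuristic labels matching any added diff line."""
    return {name for name, pattern in HEURISTICS.items()
            for line in added_lines if pattern.search(line)}

# Hypothetical added lines from a diffoscope of an unreproducible build.
causes = classify([
    "Built on 2024-03-18 by builder",
    "NIX_BUILD_CORES=16",
])
```

A package counts toward a heuristic as soon as any added line of its diffoscope matches, which is why a single package can be attributed to several causes at once (as with the uname outputs that also embed a date).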
Despite a well-known recommendation by the Reproducible Builds project to avoid embedding build dates in build artifacts to achieve bitwise reproducibility, we find that $12\,831$ of our $86\,317$ packages with diffoscopes contain a date, accounting for 14.8% of them. Even in the most recent nixpkgs revision, we find $470$ instances of this unreproducibility cause, showing that embedding dates is still a current practice. Additionally, we find that embedded environment variables and build IDs each account for 2.2% of our unreproducible packages. Finally, we find uname outputs in $1097$ of our unreproducible builds (1.3%), $721$ of which also include a date as part of the uname output. Figure 14 shows the evolution of the number of packages matched by each heuristic over time. Altogether, 19.7% of the packages for which we have generated a diffoscope have an unreproducibility cause that can be explained by at least one of our heuristics.
We evaluate the precision of each heuristic by manually verifying $500$ matched lines for every heuristic and counting false positives. For the uname, build ID, and environment variables heuristics, we report a precision of 100%. For the date heuristic, we obtain a precision of 97.8% (11 false positives).
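For instance, the date heuristic's precision follows directly from the sampled counts reported above (the helper function is purely illustrative):

```python
def precision(sampled, false_positives):
    """Precision of a heuristic from a manual audit of sampled matches."""
    true_positives = sampled - false_positives
    return true_positives / sampled

date_precision = precision(500, 11)   # 489 true positives out of 500
uname_precision = precision(500, 0)   # no false positives observed
```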
<details>
<summary>extracted/6418685/paper-images/evolution-heuristics.png Details</summary>

Line chart of the number of diffoscope-matched packages per heuristic over time (2017 to 2023): embedded dates (consistently the highest, in the hundreds of packages), embedded build ID, embedded environment variable (with a pronounced spike around 2021 and 2022), and embedded uname output (consistently the lowest).
</details>
Figure 14: Evolution of the number of packages for which we generated diffoscopes that are matched by each of our heuristics, over time.
VI-E RQ4: How are unreproducibilities fixed?
Our analysis of 100 randomly sampled reproducibility fixes showed that, in most cases, authors are not aware that they are fixing a reproducibility issue, or do not mention it: in 93 instances, we found no trace of reproducibility being mentioned. In fact, in 76 cases the reproducibility fix was just a routine package update, and it is very likely that the package becoming reproducible had more to do with changes in the upstream software than with a conscious action from the package maintainer. Other fixes were internal nixpkgs changes whose reproducibility impact was a side effect of the main reason behind the change.
Our study of the 15 most impactful fixes suggests that these are more often made with the intent of fixing reproducibility: in more than half of them (8 out of 15), the author mentioned and documented the reproducibility issue being fixed, and 9 of them were changes internal to nixpkgs. Some large-scale reproducibility fixes were nonetheless plain package updates, such as toolchain updates.
VII Discussion
This work brings valuable insights to the ongoing discussions about software supply chain security, reproducible builds, and functional package management.
False myth: reproducible builds (R-B) do not scale
One strongly held belief about R-B is that they work well in limited contexts (either selected "critical" applications, e.g., cryptocurrencies or anonymity technology, or strictly curated distributions) but either do not scale or are not worth the effort in larger settings. Our results can be interpreted as a counterexample to this belief: in nixpkgs, the largest general-purpose repository with more than 70k packages (in April 2023), more than 90.5% of packages are nowadays bitwise reproducible from source. R-B do scale to general-purpose, regularly used software.
Recommendation: invest in infrastructure for binary caches and build attestations
To reap the security benefits of the R-B vision [lamb_reproducible_2022] in practice, users need multiple components: (1) signed attestations by each independent builder, publishing the obtained checksums at the end of a build; (2) caches of built artifacts, so that users can avoid rebuilding from source; (3) verification technology in package managers to verify that built artifacts in (2) match the consensus checksums in (1). With few exceptions, the infrastructure to support all this does not exist yet. Now that the "R-B do not scale" excuse is out of the picture, we call for investments to develop this infrastructure, preferably in a distributed architectural style to make it more secure, robust, and sustainable.
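The verification step (3) can be reduced to checking that enough independent builders attest to the same checksum before a cached artifact is trusted. Below is a minimal sketch under assumed data shapes; builder names, checksum strings, and the k-of-n quorum policy are all illustrative, and signature verification of each attestation is assumed to have happened beforehand.

```python
from collections import Counter

def consensus_checksum(attestations, quorum):
    """Return the checksum attested by at least `quorum` independent
    builders, or None if no checksum reaches consensus."""
    if not attestations:
        return None
    counts = Counter(attestations.values())
    checksum, votes = counts.most_common(1)[0]
    return checksum if votes >= quorum else None

# Hypothetical attestations: builder name -> published output checksum.
attestations = {
    "builder-a": "sha256:abc...",
    "builder-b": "sha256:abc...",
    "builder-c": "sha256:def...",  # disagreeing (possibly compromised)
}
trusted = consensus_checksum(attestations, quorum=2)
```

A package manager would only install a cached artifact whose own checksum equals the consensus value; if no quorum is reached, it would fall back to building from source.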
False myth: Nix implies bitwise reproducibility
In contradiction with the previous myth, there is also a belief that Nix (or functional package management more generally) implies bitwise reproducibility. Our results disprove this belief: about 9.4% of nixpkgs packages are not bitwise reproducible. This is not surprising, because R-B violations happen for multiple reasons, only some of which are mitigated by FPM per se.
Recommendation: use FPM for bitwise reproducibility and rebuildability
Still, Nix appears to perform very well at bitwise reproducibility "out of the box", and even better at rebuildability (above 99%). Based on this, we recommend using Nix, or FPM more generally, for all use cases that require either mere rebuildability or full bitwise reproducibility. At worst, they provide a very good starting point.
This work has investigated the causes of the lingering non-reproducibility in nixpkgs, but not those of reproducibility; it would be interesting to know why Nix performs so well, possibly for adoption in different contexts. It is possible that bitwise reproducibility is an emergent property of FPM, or that it comes from technical build-time measures such as sandboxing, or that Nix simply benefits (more than other distributions?) from the decade-long efforts of the R-B project in upstream toolchains. Exploring all this is left as future work.
QA monitoring for bitwise reproducibility
The significantly (and stably) higher bitwise reproducibility performance of the NixOS minimal ISO image (see Figure 13) suggests that quality assurance (QA) monitoring for build reproducibility is an effective way to increase it. The fact that Debian uses a similar approach with good results [lamb_reproducible_2022] is consistent with this interpretation. A deeper analysis of the relationship between QA monitoring and reproducibility rates is out of scope for this work, but it is quite likely that extending reproducibility monitoring to all packages would be an easy win for nixpkgs (assuming bitwise reproducibility is seen as a project goal).
VIII Threats to validity
VIII-A Construct validity
We rebuilt Nix packages from revisions in the period July 2017 to April 2023, about 6 years. That is arguably enough to observe long-term trends, which also appear stable in our analysis, except for one temporary regression, which we analyzed and explained. Still, we cannot exclude significant differences in reproducibility trends before or after the studied period.
Due to the high computation cost, build time, and environmental impact of rebuilding general-purpose software from source, we sampled nixpkgs revisions to rebuild uniformly (every 4.1 months) within the studied period, totaling $14\,296$ hours of build time. We cannot exclude trend anomalies between sampled revisions either, but that seems unlikely given the stability of the observed long-term trends. More importantly, this means that we are less likely to catch short-lived reproducibility regressions, which would be introduced and fixed within the same period between two sampled revisions. This can be improved upon by sampling and rebuilding more nixpkgs revisions, complementing this work.
When rebuilding a given package, we relied on the Nix binary cache for all its transitive dependencies. As we have rebuilt all packages from every sampled nixpkgs revision, our package coverage is complete. But this way we have not measured the impact of unreproducible packages on transitive reverse dependencies, i.e., how many additional packages become unreproducible if systematically built from scratch, including all their dependencies. Our experiment matches real-world use cases and is hence adequate to answer our RQs, but it would still be interesting to know.
Also, during rebuilds we have not attempted to re-download online assets, but relied on the Nix cache. Hence, we have not measured the impact of lacking digital preservation practices on reproducibility and rebuildability. It would be interesting and possible to measure such impact by, e.g., first trying to download assets from their original hosting places and, for source code assets, falling back to archives such as Software Heritage [dicosmo_2017_swh] in case of failure.
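That fallback strategy amounts to trying an ordered list of sources and keeping the first that succeeds. A sketch with stub fetchers follows; the source names and failure mode are purely illustrative.

```python
def fetch_with_fallback(fetchers):
    """Try each (name, fetch) pair in order, e.g. the original upstream
    URL first, then an archive such as Software Heritage; return
    (source_name, data) for the first success, or None if all fail."""
    for name, fetch in fetchers:
        try:
            return name, fetch()
        except OSError:
            continue
    return None

def upstream():    # stub: upstream hosting has disappeared
    raise OSError("404 Not Found")

def archive():     # stub: the archived copy is still available
    return b"source tarball bytes"

result = fetch_with_fallback([("upstream", upstream), ("archive", archive)])
```

Recording which source each asset came from would directly quantify how much rebuildability depends on archives rather than on original hosting places.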
VIII-B External validity
We have rebuilt packages from nixpkgs, the largest cross-ecosystem FOSS distribution, using the Nix functional package manager (FPM). Other FPMs with significant user bases exist, such as Guix [courtes_functional_2013]. Given the similarity in design choices, we do not have reasons to believe that Guix would perform any differently in terms of build reproducibility than Nix, but we have not empirically verified it; doing so would be useful complementary work.
Neither did we verify historical build reproducibility in classic FOSS distributions (e.g., Debian, Fedora), as they fall outside the FPM scope. Doing so would still be interesting for comparison purposes. It would be more challenging, though, because outside the FPM context it is significantly harder to recreate exact build environments from years ago.
IX Conclusion
In this work we conducted the first large-scale experiment on bitwise reproducibility in the context of the Nix functional package manager, rebuilding $709\,816$ packages coming from 17 revisions of the nixpkgs software repository, sampled every 4.1 months from 2017 to 2023. Our findings show that bitwise reproducibility in nixpkgs is very high and has shown an upward trend, from 69% in 2017 to 91% in 2023. The mere ability to rebuild packages (whether bitwise reproducibly or not) is even higher, stably around 99.8%.
We have highlighted disparities in reproducibility across the ecosystems that coexist in nixpkgs, as well as between packages for which bitwise reproducibility is actively monitored and the others. We have developed heuristics to understand common (un)reproducibility causes, finding that 15% of unreproducible packages embed a date during the build process. Finally, we studied reproducibility fixes and found that only a minority of changes inducing a reproducibility fix were made intentionally; the rest appear to be incidental.
Data availability statement
The dataset produced as part of this work is archived on and available from Zenodo [replication_metadata, replication_diffoscopes]. A full replication package containing the code used to run the experiment described in this paper is archived on and available from Software Heritage [replication-code:1.0].
Acknowledgments
This work would not have been possible without the availability of historical data from the official Nix package cache at https://cache.nixos.org with its long-standing "no expiry" policy. In the context of ongoing discussions to prune the cache, we strongly emphasize its usefulness for performing empirical research on package reproducibility, like this work. We also thank the NixOS Foundation for guaranteeing us that the revisions that we needed for this research would not be garbage-collected.
\printbibliography