# LeanAgent: Lifelong Learning for Formal Theorem Proving
**Authors**:
- Chaowei Xiao, Anima Anandkumar (California Institute of Technology, Stanford University, University of Wisconsin-Madison)
> Equal contribution.
Abstract
Large Language Models (LLMs) have been successful in mathematical reasoning tasks such as formal theorem proving when integrated with interactive proof assistants like Lean. Existing approaches involve training or fine-tuning an LLM on a specific dataset to perform well on particular domains, such as undergraduate-level mathematics. These methods struggle with generalizability to advanced mathematics. A fundamental limitation is that these approaches operate on static domains, failing to capture how mathematicians often work across multiple domains and projects simultaneously or cyclically. We present LeanAgent, a novel lifelong learning framework for formal theorem proving that continuously generalizes to and improves on ever-expanding mathematical knowledge without forgetting previously learned knowledge. LeanAgent introduces several key innovations, including a curriculum learning strategy that optimizes the learning trajectory in terms of mathematical difficulty, a dynamic database for efficient management of evolving mathematical knowledge, and progressive training to balance stability and plasticity. LeanAgent successfully generates formal proofs for 155 theorems across 23 diverse Lean repositories where formal proofs were previously missing, many from advanced mathematics. It performs significantly better than the static LLM baseline, proving challenging theorems in domains like abstract algebra and algebraic topology while showcasing a clear progression of learning from basic concepts to advanced topics. In addition, we analyze LeanAgent’s superior performance on key lifelong learning metrics. LeanAgent achieves exceptional scores in stability and backward transfer, where learning new tasks improves performance on previously learned tasks. This emphasizes LeanAgent’s continuous generalizability and improvement, explaining its superior theorem-proving performance.
1 Introduction
Mathematics can be expressed in informal and formal languages. Informal mathematics utilizes natural language and intuitive reasoning, whereas formal mathematics employs symbolic logic to construct machine-verifiable proofs (Kevin Buzzard, 2019). State-of-the-art large language models (LLMs), such as o1 (OpenAI, 2024) and Claude (Claude Team, 2024), produce incorrect informal proofs (Zhou et al., 2024). This highlights the importance of formal mathematics in ensuring proof correctness and reliability. Interactive theorem provers (ITPs), such as Lean (De Moura et al., 2015), have emerged as tools for formalizing and verifying mathematical proofs. However, constructing formal proofs using ITPs is complex and time-consuming; it requires extremely detailed proof steps and involves working with extensive mathematical libraries.
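To make this concrete, here is a toy Lean 4 theorem and its machine-checked proof (an illustrative example, not drawn from any particular repository in this paper):

```lean
-- A toy Lean 4 theorem: rearranging a sum of natural numbers.
-- Each tactic transforms the proof state until no goals remain,
-- and Lean's kernel verifies the resulting proof term.
theorem add_shuffle (a b c : Nat) : a + b + c = b + (a + c) := by
  rw [Nat.add_assoc]       -- goal becomes a + (b + c) = b + (a + c)
  rw [Nat.add_left_comm]   -- rewrites the left side to b + (a + c)
```

Even for a statement this simple, the prover must know which library lemmas (`Nat.add_assoc`, `Nat.add_left_comm`) to apply and in what order, which is why proof construction at scale is labor-intensive.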
Recent research has explored using LLMs to generate proof steps or complete proofs. For example, LeanDojo (Yang et al., 2023) introduced the first open-source framework to spur such research. Existing approaches typically involve training or fine-tuning LLMs on a specific dataset (Jiang et al., 2022). However, data scarcity in formal theorem proving (Polu et al., 2022) hinders the generalizability of these approaches (Liu et al., 2023). For example, ReProver, the retrieval-augmented LLM from the LeanDojo family, uses a retriever fine-tuned on Lean’s math library, mathlib4 (mathlib4 Community, 2024). Although mathlib4 contains over 100,000 formalized mathematical theorems and definitions, it covers primarily up to undergraduate mathematics. Consequently, ReProver performs poorly on more challenging mathematics, such as Terence Tao’s formalization of the Polynomial Freiman-Ruzsa (PFR) Conjecture (Tao et al., 2024).
Figure 1: LeanAgent overview. LeanAgent searches for Lean repositories and uses LeanDojo to extract theorems and proofs. It uses curriculum learning, computing theorem complexity as $e^S$ ($S$ = number of proof steps) and calculating the 33rd and 67th complexity percentiles across all theorems to sort repositories by easy theorem count. LeanAgent adds the curriculum to its dynamic database. As a retrieval-based framework, LeanAgent generates a dataset (premise corpus and a collection of theorems and proofs) for each repository in the curriculum and progressively trains its retriever. Progressive training happens over one epoch to prevent forgetting old knowledge. Then, LeanAgent uses the updated retriever in a search-based method to generate formal proofs for theorems where formal proofs were previously missing, known as sorry theorems. It adds new proofs to the database.
The dynamic nature of mathematical research exacerbates this generalizability issue. Mathematicians often formalize across multiple domains and projects simultaneously or cyclically. For example, Terence Tao has worked on various projects in parallel, including formalizations of the PFR Conjecture, symmetric mean of real numbers, classical Newton inequality, and asymptotic analysis (Tao, ; Tao, 2024a; Tao et al., 2024; Tao, 2024b). Patrick Massot has been formalizing Scholze’s condensed mathematics and the Perfectoid Spaces project (Community, 2024a; b). These examples highlight a critical gap in current theorem-proving AI approaches: the lack of a system that can adapt and improve across multiple, diverse mathematical domains over time, given limited Lean data availability.
Connection to Lifelong Learning. Crucially, how mathematicians formalize is relevant to lifelong learning, i.e. learning multiple tasks without forgetting (Wang et al., 2024b). A significant challenge is catastrophic forgetting: when adaptation to new distributions leads to a loss of understanding of old ones (Jiang et al., 2024). The core challenge is balancing plasticity (the ability to learn and adapt) with stability (the ability to retain existing knowledge) (Wang et al., 2024b). Increasing plasticity to learn new tasks efficiently can lead to overwriting previously learned information. However, enhancing stability to preserve old knowledge may impair the model’s ability to acquire new skills (van de Ven et al., 2024). Achieving the right balance is key to continuous generalizability in theorem proving.
LeanAgent. We present LeanAgent, a novel lifelong learning framework for theorem proving. As shown in Figure 1, LeanAgent’s workflow consists of (1) deriving complexity measures of theorems to compute a curriculum for learning, (2) progressive training to learn while balancing stability and plasticity, and (3) searching for proofs of sorry theorems by leveraging a best-first tree search, all while using a dynamic database to manage its evolving mathematical knowledge. LeanAgent works with any LLM; we implement it with retrieval for improved generalizability (Yang et al., 2023).
We employ a simple progressive training method to avoid catastrophic forgetting. Progressive training allows LeanAgent to continuously adapt to new mathematical knowledge while preserving previously learned information. This process involves incrementally training the retriever on newly generated datasets from each repository in the curriculum. Starting with a pre-trained retriever (e.g., ReProver’s retriever based on ByT5 (Xue et al., 2022)), LeanAgent trains on each new dataset for one additional epoch. Restricting progressive training to one epoch helps balance stability and plasticity. Crucially, this training is repeated for each dataset generated from the database, gradually expanding LeanAgent’s knowledge base. This approach increases the space of possible proof states (where a state consists of a theorem’s hypotheses and current proof progress) while adding new premises to the premise embeddings. More sophisticated lifelong learning methods like Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017), which uses the Fisher Information Matrix to constrain important weights for previous tasks, result in excessive plasticity. The uncontrolled plasticity is due to the inability of these methods to adapt parameter importance as theorem complexity increases. This forces rapid changes in parameters crucial for learning advanced concepts. Such methods fail to adapt to the evolving complexity of mathematical theorems, making them unsuitable for lifelong learning in theorem proving.
Extensive experiments across 23 diverse Lean repositories demonstrate LeanAgent’s advancements in lifelong learning for theorem proving. LeanAgent successfully generates formal proofs for 155 theorems across these 23 repositories where formal proofs were previously missing, known as sorry theorems, many from advanced mathematics. For example, it proves challenging sorry theorems in abstract algebra and algebraic topology related to Coxeter systems and the Hairy Ball Theorem (Coxeter, 2024; Hairy Ball Theorem, 2024). LeanAgent also proves 7 theorems using exploits found within Lean’s type system. We find that LeanAgent demonstrates progressive learning in theorem proving, initially proving basic sorry theorems and significantly advancing to more complex ones. It significantly outperforms the static ReProver baseline in terms of proving new sorry theorems. We have issued pull requests to the respective repositories with the newly proven sorry theorems. Some of these proofs utilized unintended constructs within the repositories’ implementations, which are currently being addressed through appropriate fixes.
In theorem proving, we find that stability, without losing too much plasticity, is crucial for continuous generalizability to new repositories. Backward transfer (BWT), where learning new tasks improves performance on previously learned tasks, is essential in theorem proving (Wang et al., 2024b). Mathematicians require a lifelong learning framework for theorem proving that is both continuously generalizable and continuously improving. We conduct an extensive ablation study using six lifelong metrics carefully proposed or selected from the literature. LeanAgent’s simple components of curriculum learning and progressive training improve stability and BWT scores substantially, emphasizing its continuous generalizability and improvement and explaining its superior sorry theorem proving performance.
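As a reference point, backward transfer and final stability are commonly computed from a task-performance matrix. The sketch below uses the standard formulation from the lifelong learning literature (Lopez-Paz & Ranzato, 2017); the paper's exact six metrics are defined in its experiments, so treat this only as an illustration of the two quantities named above:

```python
def backward_transfer(R):
    """Backward transfer (BWT) from a performance matrix.

    R[i][j] = test performance on task j after finishing training on task i.
    Positive BWT means learning later tasks improved earlier ones.
    """
    n = len(R)
    return sum(R[n - 1][j] - R[j][j] for j in range(n - 1)) / (n - 1)

def stability(R):
    """Average final performance over all previously learned tasks."""
    n = len(R)
    return sum(R[n - 1][j] for j in range(n - 1)) / (n - 1)
```

For example, if performance on task 0 drops from 0.9 to 0.85 while performance on task 1 rises from 0.7 to 0.75 by the end of training, the gains and losses cancel and BWT is 0.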
2 Preliminaries
Neural Theorem Proving. The current state-of-the-art of learning-based provers employs Transformer-based (Vaswani et al., 2017) LLMs that process expressions as plain text strings (Azerbayev et al., 2024; Xin et al., 2024b; Shao et al., 2024). In addition, researchers have explored complementary aspects like proof search algorithms (Lample et al., ; Wang et al., 2023). Moreover, other works break the theorem-proving process into smaller proving tasks (Song et al., 2024; Wang et al., 2024a; Lin et al., 2024).
Premise Selection. A critical challenge in theorem proving is the effective selection of relevant premises (Irving et al., 2016; Tworkowski et al., ). However, many existing approaches treat premise selection as an isolated problem (Wang & Deng, 2020; Piotrowski et al., 2023) or use selected premises only as input to symbolic provers (Alama et al., 2014; Mikuła et al., 2024).
Retrieval-Augmented LLMs. While retrieval-augmented language models have been extensively studied in areas like code generation (Lu et al., 2022; Zhou et al., 2023), their application to formal theorem proving is relatively new. However, relevant architectures have been researched in natural language processing (NLP) (Lu et al., 2024; Borgeaud et al., 2022; Thakur et al., 2024).
Lifelong Learning. Lifelong learning addresses catastrophic forgetting in sequential task learning (Chen et al., 2024). Approaches include regularization methods (Kirkpatrick et al., 2017), memory-based techniques (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019; Shin et al., 2017), and knowledge distillation (Li & Hoiem, 2017; Kim et al., 2023). Other strategies involve dynamic architecture adjustment (Mendez & Eaton, 2021) and recent work on gradient manipulation and selective re-initialization (Chen et al., 2024; Dohare et al., 2024). We justify not using these strategies in Appendix A.6.
Curriculum Learning in Theorem Proving. Prior work created a synthetic inequality generator to produce a curriculum of statements of increasing difficulty (Polu et al., 2022). For reinforcement learning, an existing work used the length of proofs to help determine rewards (Zombori et al., 2019).
3 Methodology
A useful lifelong learning strategy for theorem proving requires (a) a repository order strategy and (b) a learning strategy. We solve (a) with curriculum learning to utilize the structure of Lean proofs and (b) with progressive training to balance stability and plasticity. LeanAgent consists of four main components: curriculum learning, dynamic database management, progressive training of the retriever, and sorry theorem proving. Further methodology details are in Appendix A.1 and a discussion of why curriculum learning works in theorem proving is available in Appendix A.6.
3.1 Curriculum Learning
LeanAgent uses curriculum learning to learn on increasingly complex mathematical repositories. This process optimizes LeanAgent’s learning trajectory, allowing it to build upon foundational knowledge before tackling more advanced concepts.
First, we automatically search for and clone Lean repositories from GitHub. We use LeanDojo for each repository to extract fine-grained information about their theorems, proofs, and dependencies. Then, we calculate the complexity of each theorem as $e^S$, where $S$ is the number of proof steps; sorry theorems, which have no proofs, are assigned infinite complexity. We use exponential scaling to reflect the combinatorial explosion of possible proof paths as proof length increases. Further justification for this complexity metric is in Appendix A.6.
We compute the 33rd and 67th percentiles of complexity across all theorems in all repositories. Using these percentiles, we categorize non-sorry theorems into three groups: easy (complexity below the 33rd percentile), medium (complexity between the 33rd and 67th percentiles), and hard (complexity above the 67th percentile). We then sort repositories by the number of easy theorems they contain. This sorting forms the basis of our curriculum, with LeanAgent starting on the repositories containing the most easy theorems.
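The curriculum construction above can be sketched as follows (a simplified illustration; the repository inputs and the exact percentile handling are placeholders, not LeanAgent's implementation):

```python
import math

def build_curriculum(repos):
    """Order repositories for curriculum learning (simplified sketch).

    `repos` maps a repository name to a list of per-theorem proof-step
    counts, with None marking a sorry theorem (no proof available).
    """
    # Complexity e^S per theorem; sorry theorems get infinite complexity.
    complexities = {
        name: [math.inf if s is None else math.exp(s) for s in steps]
        for name, steps in repos.items()
    }
    # 33rd percentile over all non-sorry theorems across all repositories
    # (a crude nearest-rank estimate for illustration).
    finite = sorted(c for cs in complexities.values() for c in cs if c != math.inf)
    p33 = finite[int(0.33 * (len(finite) - 1))]
    # Count "easy" theorems (at or below the 33rd-percentile cutoff) per repo.
    easy = {
        name: sum(c <= p33 for c in cs if c != math.inf)
        for name, cs in complexities.items()
    }
    # Train on repositories with the most easy theorems first.
    return sorted(repos, key=lambda name: easy[name], reverse=True)
```

The same percentiles would also yield the medium and hard buckets; only the easy count matters for the repository ordering.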
3.2 Dynamic Database Management
Then, we add the sorted repositories to LeanAgent’s custom dynamic database using the data LeanAgent extracted. This way, we can keep track of and interact with the knowledge that LeanAgent is aware of and the proofs it has produced. We also include the complexity of each theorem computed in the previous step into the dynamic database, allowing for efficient reuse of repositories in a future curriculum. Details of the database’s contents and features can be found in Appendix A.1.
For each repository in the curriculum, LeanAgent uses the dynamic database to generate a dataset by following the same procedure used to make LeanDojo Benchmark 4 (details in Appendix A.1). This dataset includes a collection of theorems and their proofs. Each step of these proofs contains detailed annotations, such as how the step changes the state of the proof. A state consists of a theorem’s hypotheses and the current progress in proving the theorem. As such, this pairing of theorems and proofs demonstrates how to use specific tactics (functions) and premises in sequence to prove a theorem. In addition, the dataset includes a premise corpus, serving as a library of facts and definitions.
3.3 Progressive Training of the Retriever
LeanAgent then progressively trains its retriever on the newly generated dataset. This strategy allows LeanAgent to continuously adapt to new mathematical knowledge from the premises in new datasets while preserving previously learned information, crucial for lifelong learning in theorem proving. Progressive training achieves this by incrementally incorporating new knowledge from each repository.
Although LeanAgent works with any LLM, we provide a specific implementation here. We start with ReProver’s retriever, a fine-tuned version of the ByT5 encoder (Xue et al., 2022), leveraging its general pre-trained knowledge from mathlib4. We train LeanAgent on the new dataset for an additional epoch. This limited exposure helps prevent overfitting to the new data while allowing LeanAgent to learn essential new information. LeanAgent’s retriever, and therefore the embeddings it generates, are continuously updated during progressive training. Thus, at the end of the current progressive training run, we precompute embeddings for all premises in the corpus generated by LeanAgent’s current state to ensure that we properly evaluate LeanAgent’s validation performance. To understand how LeanAgent balances stability and plasticity, we save the model iteration with the highest validation recall for the top ten retrieved premises (R@10). This is a raw plasticity value: it can be used to compute other metrics that describe LeanAgent’s ability to adapt to and handle new types of mathematics in the latest repository (details in Sec. 4). Then, we compute the average test R@10 over all previous datasets the model has progressively trained on, a raw stability value.
As mentioned previously, we repeat this procedure for each dataset we generate from the database, hence the progressive nature of this training. Progressive training adds new premises to the premise embeddings and increases the space of possible proof states. This allows LeanAgent to explore more diverse paths to prove theorems, discovering new proofs that it couldn’t produce with its original knowledge base.
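The progressive training loop described above can be sketched as follows. Here `retriever.fit` and `eval_recall_at_10` are hypothetical stand-ins for the actual training and evaluation code, not real APIs:

```python
def progressive_training(retriever, datasets, eval_recall_at_10):
    """Train one epoch per dataset, tracking plasticity and stability.

    `retriever` starts from a pre-trained checkpoint (e.g. ReProver's);
    `eval_recall_at_10(retriever, dataset)` returns premise-retrieval
    R@10 on a dataset's held-out split.
    """
    seen = []
    history = []
    for dataset in datasets:                  # one repository at a time
        retriever.fit(dataset, epochs=1)      # single epoch limits forgetting
        # Raw plasticity: performance on the just-learned repository.
        plasticity = eval_recall_at_10(retriever, dataset)
        # Raw stability: average R@10 over all previously seen datasets.
        stability = (
            sum(eval_recall_at_10(retriever, d) for d in seen) / len(seen)
            if seen else None
        )
        seen.append(dataset)
        history.append((plasticity, stability))
    return history
```

The single-epoch `fit` call is the mechanism the text describes for balancing stability and plasticity; everything else is bookkeeping for the metrics in Section 4.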
3.4 sorry Theorem Proving
For each sorry theorem, LeanAgent generates a proof with a best-first tree search by generating tactic candidates at each step, in line with prior work (Yang et al., 2023). Using the embeddings from the entire corpus of premises we previously collected, LeanAgent retrieves relevant premises from the premise corpus based on their similarity to the current proof state, represented as a context embedding. Then, it filters the results using a corpus dependency graph to ensure that we only consider premises accessible from the current file. We add these retrieved premises to the current state and generate tactic candidates using beam search. Then, we run each tactic candidate through Lean to obtain potential next states. Each successful tactic application adds a new edge to the proof search tree. We choose the tactic with the maximum cumulative log probability of the tactics leading to it. If the search reaches a dead-end, we backtrack and explore alternative paths. We repeat the above steps until the search finds a proof, exhausts all possibilities, or reaches the time limit of 10 minutes.
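This search can be sketched with a priority queue keyed on cumulative log probability. The sketch below omits the time limit and the actual Lean interaction; `expand` is a hypothetical callback standing in for tactic generation plus Lean execution:

```python
import heapq

def best_first_search(init_state, expand, is_proved, max_steps=1000):
    """Best-first proof search (simplified sketch).

    `expand(state)` yields (log_prob, tactic, next_state) triples for each
    tactic candidate that Lean applies successfully. States are scored by
    the cumulative log probability of the tactics leading to them; always
    expanding the highest-scoring open state subsumes backtracking from
    dead ends.
    """
    # Heap entries: (negated score, tiebreak counter, state, tactic trace).
    frontier = [(0.0, 0, init_state, [])]
    counter = 1
    for _ in range(max_steps):
        if not frontier:
            return None                       # search space exhausted
        neg_score, _, state, tactics = heapq.heappop(frontier)
        if is_proved(state):
            return tactics                    # found a complete proof
        for log_prob, tactic, next_state in expand(state):
            heapq.heappush(
                frontier,
                (neg_score - log_prob, counter, next_state, tactics + [tactic]),
            )
            counter += 1
    return None                               # step budget exceeded
```

Because log probabilities are negative, subtracting them grows the negated score, so the min-heap always surfaces the state whose tactic sequence is most probable so far.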
If LeanAgent finds a proof, it adds it to the dynamic database. The newly added premises from this proof will be included in a future premise corpus involving the current repository. Moreover, LeanAgent can learn from the new proof during progressive training in the future, aiding further improvements.
4 Experiments
4.1 Experimental Setup
Table 1: Selected repository descriptions
| Repository | Description |
| --- | --- |
| PFR | Polynomial Freiman-Ruzsa Conjecture |
| Hairy Ball Theorem | Algebraic topology result |
| Coxeter | Coxeter groups |
| Mathematics in Lean Source | Lean files for the textbook |
| Formal Book | Proofs from THE BOOK |
| MiniF2F | Math olympiad-style problem solving |
| SciLean | Scientific computing |
| Carleson | Carleson’s Theorem |
| Lean4 PDL | Propositional Dynamic Logic |
We devote this section to describing two types of experiments: (1) sorry Theorem Proving: We compare sorry theorem proving performance between LeanAgent and ReProver. We also examine how the set of sorry theorems LeanAgent proves grows over the course of lifelong learning and at its end. This shows LeanAgent’s continuous generalizability and improvement. (2) Lifelong Learning Analysis: We conduct an ablation study with six lifelong learning metrics to explain LeanAgent’s superiority in sorry theorem proving. Moreover, these results explain LeanAgent’s superior handling of the stability-plasticity tradeoff. Please see Appendix A.2 for experiment implementation details. We release LeanAgent at https://github.com/lean-dojo/LeanAgent.
Repositories. We evaluate our approach on a diverse set of 23 Lean repositories to assess its generalizability across different mathematical domains (Skřivan, 2024; Kontorovich, 2024; Avigad, 2024; Tao et al., 2024; Renshaw, 2024; Fermat’s Last Theorem, 2024; DeepMind, 2024; Carneiro, 2024; Wieser, 2024; Mizuno, 2024; Murphy, 2024; Formal Logic, 2024; Con-nf, 2024; Gadgil, 2024; Yang, 2024; Zeta 3 Irrational, 2024; Firsching, Moritz, 2024; Monnerjahn, 2024; van Doorn, 2024; Dillies, 2024; Hairy Ball Theorem, 2024; Coxeter, 2024; Gattinger, 2024). Details of key repositories are in Table 1. Further details of these repositories, including commits and how we chose the initial curriculum and the sub-curriculum (described in Sec. 4.2), are in Appendix A.3.
4.2 sorry Theorem Proving
We compare the number of sorry theorems LeanAgent can prove, both during and after lifelong learning, to the ReProver baseline. We use ReProver as the baseline because we use its retriever as LeanAgent’s initial retriever in our experiments.
Figure 2: Case studies of LeanAgent’s new proofs. LeanAgent shows an ability to work with these repositories, often retrieving the necessary premises (highlighted). For example, LeanAgent proves a sorry theorem from PFR, condRho_of_translate, by simply expanding definitions, showing its proving ability on the PFR repository. In addition, LeanAgent proved re_float from SciLean during lifelong learning, while ReProver could not. Moreover, its MiniF2F proofs demonstrate its ability to generate relatively longer and more complex proofs for complex mathematics. LeanAgent’s proofs of invmap_of_eq and wedderburn demonstrate its theorem-proving capabilities with abstract algebra premises.
Table 2: Accuracy in proving sorry theorems across repositories. Accuracy is calculated as (proven theorems / total sorry theorems). “LA” denotes LeanAgent and “ReProver+” denotes the setting where we update ReProver on all 23 repositories at once. “During” shows accuracy during lifelong learning, “Add. After” shows additional accuracy after lifelong learning, and “Total” shows the combined accuracy. “MIL” stands for Mathematics in Lean Source and “Hairy Ball” refers to the Hairy Ball Theorem repository. Repositories with no sorry theorems or no proven ones are not shown. The best accuracy for each repository is in bold. As noted previously, we progressively train on MiniF2F after the initial curriculum to demonstrate the use case of formalizing in a new repository after learning a curriculum. As such, we don’t evaluate LeanAgent after lifelong learning on MiniF2F.
| Repository | # sorry theorems | LA Total | LA During | LA Add. After | ReProver | ReProver+ |
| --- | --- | --- | --- | --- | --- | --- |
| MIL | 29 | **72.4** | 48.3 | 24.1 | 48.3 | 55.2 |
| MiniF2F | 406 | **24.4** | 24.4 | - | 20.9 | 20.9 |
| Formal Book | 29 | **10.3** | 6.9 | 3.4 | 6.9 | **10.3** |
| SciLean | 294 | **9.2** | 7.5 | 1.7 | 8.2 | 8.5 |
| Hairy Ball | 14 | **7.1** | 0.0 | 7.1 | 0.0 | **7.1** |
| Coxeter | 15 | **6.7** | 6.7 | 0.0 | 0.0 | **6.7** |
| Carleson | 24 | **4.2** | 4.2 | 0.0 | **4.2** | **4.2** |
| Lean4 PDL | 30 | **3.3** | 3.3 | 0.0 | **3.3** | **3.3** |
| PFR | 37 | **2.7** | 2.7 | 0.0 | 0.0 | 0.0 |
In addition, a use case of LeanAgent is proving in a new repository after learning a curriculum; we progressively train on MiniF2F to demonstrate this. Note that we choose the Lean4 version of the MiniF2F repository (Yang, 2024) and disregard its separation into validation and test splits (reasoning in Appendix A.5). LeanAgent’s success rate on the Lean4 version of the MiniF2F test set is also in Appendix A.5. Moreover, a mathematician could use LeanAgent for (1) an initial curriculum $A$, and later (2) a sub-curriculum $B$. LeanAgent can then help the mathematician prove in the repositories in curriculum $A+B$. To demonstrate this scenario, we continue LeanAgent on a sub-curriculum $B$ of 8 repositories.
Results are in Table 2, with case studies in Figure 2. Appendix A.5 provides a more thorough discussion, including an ablation study, and contains the complete theorems and proofs relevant to this section.
LeanAgent demonstrates continuous generalizability and improvement in theorem-proving capabilities across multiple repositories. LeanAgent’s proofs are a superset of the sorry theorems proved by ReProver in most cases. Moreover, to isolate the effect of curriculum learning, we compare LeanAgent against ReProver+, the ReProver model updated on all 23 repositories at once, and notice that LeanAgent outperforms it on several repositories, emphasizing the importance of curriculum learning. Overall, LeanAgent progresses from basic concepts (arithmetic, simple algebra) to advanced topics (abstract algebra, topology).
PFR. LeanAgent can prove a sorry theorem from this repository, while ReProver cannot. It also generalizes to a different commit (not included in progressive training), uncovering 7 system exploits. LeanAgent proves two theorems with just the rfl tactic, one of which ReProver cannot, and proves 5 sorry theorems with a $0=1$ placeholder theorem statement.
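To illustrate the kind of definitional proof involved, the following is a toy Lean 4 sketch, not the actual PFR theorem: when a goal holds by unfolding definitions, the single tactic `rfl` closes it. The `translateRho` definition and theorem name here are hypothetical, chosen only to echo the shape of condRho_of_translate.

```lean
-- Toy illustration (not the actual PFR statement): a goal that holds
-- definitionally is closed by `rfl` after unfolding the definition.
def translateRho (f : Nat → Nat) (c : Nat) : Nat → Nat :=
  fun x => f (x + c)

theorem translateRho_apply (f : Nat → Nat) (c x : Nat) :
    translateRho f c x = f (x + c) := by
  rfl
```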
SciLean. During lifelong learning, LeanAgent proves theorems related to fundamental algebraic structures, linear and affine maps, and measure theory basics. By the end of lifelong learning, it proves concepts in advanced function spaces, sophisticated bijections, and abstract algebraic structures.
Mathematics in Lean Source. During lifelong learning, LeanAgent proves theorems about basic algebraic structures and fundamental arithmetic properties. By the end of lifelong learning, it proves more complex theorems involving quantifier manipulation, set theory, and relations.
MiniF2F. ReProver demonstrates proficiency in basic arithmetic, elementary algebra, and simple calculus. However, by the end of lifelong learning, LeanAgent handles theorems with advanced number theory, sophisticated algebra, complex calculus and analysis, abstract algebra, and complex induction.
Sub-curriculum. In the Formal Book repository, LeanAgent progresses from proving basic real analysis and number theory theorems to more advanced abstract algebra, exemplified by its proof of Wedderburn’s Little Theorem. For the Coxeter repository, LeanAgent proves a complex lemma about Coxeter systems, showcasing its increased understanding of group theory. In the Hairy Ball Theorem repository, LeanAgent proves a key step of the theorem, demonstrating improved performance in algebraic topology. Only LeanAgent can prove these theorems, demonstrating that it has much more advanced theorem-proving capabilities than ReProver.
4.3 Lifelong Learning Analysis
To our knowledge, no other lifelong learning frameworks for theorem proving exist in the literature. As such, we conduct an ablation study with six lifelong learning metrics to showcase LeanAgent’s superior handling of the stability-plasticity tradeoff. These results help explain LeanAgent’s superior sorry theorem-proving performance. We compute these metrics for the original curriculum of 14 repositories.
Specifically, the ablation study consists of seven additional setups constructed from combinations of learning and dataset options. The learning options are progressive training with or without EWC. The dataset options combine a construction strategy and an ordering. Construction options are Single Repository, where each dataset consists of only the newly added repository, or Merge All, where each dataset consists of all previous repositories plus the new one. Ordering options are popularity order, which ranks repositories by GitHub star count, or curriculum order. Appendix A.3 shows these orders and additional repository details.
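The two dataset-construction strategies can be sketched as follows. This is an illustrative reading of the setup descriptions, not LeanAgent’s implementation; the function names and repository list are hypothetical.

```python
# Illustrative sketch of the two dataset-construction options.

def single_repository(previous, new_repo):
    """Single Repository: train only on the newly added repository."""
    return [new_repo]

def merge_all(previous, new_repo):
    """Merge All: train on all previously seen repositories plus the new one."""
    return previous + [new_repo]

# The ordering option decides the sequence in which repositories arrive,
# e.g. by GitHub star count (popularity) or by difficulty (curriculum).
popularity_order = ["mathlib4", "SciLean", "PFR"]  # most-starred first
seen = []
for repo in popularity_order:
    dataset = merge_all(seen, repo)  # or single_repository(seen, repo)
    seen.append(repo)
```

Under Merge All, each successive dataset strictly grows, which is why the metrics in that setting measure cumulative knowledge refinement rather than isolated task performance.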
Metrics. We use six lifelong learning metrics: Windowed-Forgetting 5 (WF5), Forgetting Measure (FM), Catastrophic Forgetting Resilience (CFR), Expanded Backward Transfer (EBWT), Windowed-Plasticity 5 (WP5), and Incremental Plasticity (IP). A description of these metrics is in Table 3 (De Lange et al., 2023; Wang et al., 2024b; Díaz-Rodríguez et al., 2018). Our reasoning for considering these metrics is detailed in Appendix A.4.
Table 3: Description of lifelong learning metrics.
| Metric | Description | Better | Origin |
| --- | --- | --- | --- |
| WF5 | Measures forgetting over a 5-task window | Lower | Existing |
| FM | Average performance drop on old tasks | Lower | Existing |
| CFR | Ratio of min to max average test R@10 | Higher | Proposed |
| EBWT | Average improvement on old tasks after learning new ones | Higher | Existing |
| WP5 | Max average test R@10 increase over a 5-task window | Higher | Existing |
| IP | Rate of validation R@10 change per task | Higher | Proposed |
We describe why we introduce two new metrics to address specific aspects of lifelong learning in theorem proving:
- Catastrophic Forgetting Resilience (CFR). This metric captures LeanAgent’s ability to maintain performance on its weakest task relative to its best performance, crucial in the presence of diverse mathematical domains.
- Incremental Plasticity (IP). IP provides a more granular view of plasticity than aggregate measures and is sensitive to the order of tasks, particularly relevant in lifelong learning for theorem proving.
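The metrics above can be sketched numerically. The exact formulas are in Appendix A.4 (not shown here), so the CFR and IP implementations below are plausible readings of the Table 3 descriptions, and FM and backward transfer follow standard continual-learning formulations; `R[i, j]` denotes test R@10 on task `j` after training on task `i`.

```python
import numpy as np

def forgetting_measure(R):
    """FM: average drop from each old task's best score to its final score."""
    T = R.shape[0]
    final = R[T - 1, : T - 1]
    best = R[: T - 1, : T - 1].max(axis=0)
    return float((best - final).mean())

def backward_transfer(R):
    """EBWT-style score: average change on old tasks after learning new
    ones (positive means later learning improved earlier tasks)."""
    T = R.shape[0]
    return float(np.mean([R[T - 1, j] - R[j, j] for j in range(T - 1)]))

def cfr(avg_test_r10):
    """CFR: ratio of the minimum to the maximum average test R@10."""
    s = np.asarray(avg_test_r10, dtype=float)
    return float(s.min() / s.max())

def incremental_plasticity(val_r10):
    """IP: rate of validation R@10 change per task, estimated here as the
    slope of a least-squares line fit over the task sequence."""
    s = np.asarray(val_r10, dtype=float)
    slope, _ = np.polyfit(np.arange(len(s)), s, 1)
    return float(slope)
```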
In addition, these metrics in the Merge All strategy measure cumulative knowledge refinement rather than isolated task performance (details in Appendix A.4). Due to these interpretational differences, we analyze Single Repository and Merge All setups separately. We consider an improvement of at least 3% to be significant.
Table 4: Comparison of lifelong learning metrics across setups. The best scores for each metric are in bold.
| Metric | LeanAgent | Setup 1 | Setup 2 | Setup 3 | Setup 4 | Setup 5 | Setup 6 | Setup 7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WF5 ($\downarrow$) | **0.18** | 7.60 | 7.17 | 0.73 | 15.83 | 2.23 | 13.34 | 5.82 |
| FM ($\downarrow$) | **0.85** | 6.53 | 4.04 | 2.11 | 10.50 | 4.06 | 11.44 | 3.80 |
| CFR ($\uparrow$) | 0.88 | 0.87 | 0.88 | 0.85 | 0.76 | **0.94** | 0.75 | 0.90 |
| EBWT ($\uparrow$) | **1.21** | 0.51 | 1.04 | 0.76 | -0.20 | 0.73 | -1.34 | -0.39 |
| WP5 ($\uparrow$) | 2.47 | 0.89 | 1.47 | **3.42** | 0.00 | 0.09 | 0.00 | 0.11 |
| IP ($\uparrow$) | 1.02 | 0.36 | 0.26 | **1.06** | -1.50 | -0.64 | -1.71 | -0.89 |
| Single Repository | Merge All |
| --- | --- |
| Setup 1: No EWC, Popularity Order | Setup 4: No EWC, Popularity Order |
| Setup 2: EWC, Popularity Order | Setup 5: No EWC, Curriculum Learning |
| Setup 3: EWC, Curriculum Learning | Setup 6: EWC, Popularity Order |
| | Setup 7: EWC, Curriculum Learning |
Single Repository Analysis. We first analyze the Single Repository results from Table 4. LeanAgent demonstrates superior stability across multiple metrics. Its WF5 is 75.34% lower than the next best setup’s (Setup 3), suggesting it maintains performance over a window more effectively. Its FM score is 59.97% lower than Setup 3’s, showcasing its resilience against catastrophic forgetting. Furthermore, LeanAgent, Setup 1, and Setup 2 demonstrate high and consistent resilience against catastrophic forgetting, with CFR values above 0.87 and minimal (±0.01) differences. This underscores LeanAgent’s ability to continuously generalize over time. In addition, LeanAgent’s EBWT is 16.25% higher than the next best setup’s (Setup 2), indicating its ability to continuously improve over time.
In contrast, Setup 3 exhibits characteristics of higher plasticity. It shows a 38.26% higher WP5 over LeanAgent, indicating a greater ability to rapidly adapt to new tasks in a window. This is complemented by its 3.98% higher IP over LeanAgent, suggesting a more pronounced improvement on new tasks over time. However, these plasticity gains come at a significant cost: Setup 3 suffers from more severe catastrophic forgetting, as evidenced by its significantly worse stability metrics compared to LeanAgent. This excessive plasticity in Setup 3 stems from EWC’s inability to adapt parameter importance as theorem complexity increases. EWC preserves parameters important for simpler theorems, which may not be crucial for more complex ones. Consequently, these preserved parameters resist change while other parameters change rapidly for complex theorems. This forces the model to become more plastic overall, relying heavily on non-preserved parameters for new, complex theorems.
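For reference, the EWC regularizer (Kirkpatrick et al., 2017) anchors parameters deemed important for earlier tasks, which is exactly the mechanism that resists change as theorem complexity grows. A minimal sketch (the function names and the choice of a diagonal Fisher approximation are illustrative):

```python
import numpy as np

# Minimal sketch of the EWC penalty: parameters with high Fisher
# importance from earlier tasks are pulled back toward their old values.

def ewc_penalty(theta, theta_star, fisher, lam=1.0):
    """(lam / 2) * sum_i F_i * (theta_i - theta*_i)^2"""
    theta, theta_star, fisher = map(np.asarray, (theta, theta_star, fisher))
    return 0.5 * lam * float(np.sum(fisher * (theta - theta_star) ** 2))

def total_loss(new_task_loss, theta, theta_star, fisher, lam=1.0):
    """Loss on the new task plus the anchoring penalty on old-task parameters."""
    return new_task_loss + ewc_penalty(theta, theta_star, fisher, lam)
```

When the Fisher weights computed on simpler theorems no longer reflect which parameters matter for complex ones, the penalty freezes the wrong parameters and pushes all adaptation onto the unpenalized rest, producing the excess plasticity observed in Setup 3.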
LeanAgent’s favorable stability and EBWT scores make it the most suitable for lifelong learning in the Single Repository setting.
Merge All Analysis. Next, we analyze the Merge All setups from Table 4. Setup 5’s WF5 metric is 61.68% lower than the next best setup (Setup 7), suggesting Setup 5 balances and retains knowledge across an expanding dataset most effectively. Furthermore, Setup 5’s CFR score is 3.77% higher than that of Setup 7, again demonstrating high and consistent resilience in the face of an expanding, potentially more complex dataset. However, Setup 7 has a 6.44% lower FM score than Setup 5’s, showcasing its ability to maintain performance on earlier data points. Moreover, Setup 5 is the only setup with a positive EBWT, indicating that learning new tasks improves performance on the entire historical dataset. The other setups have a negative EBWT, indicating performance degradation on earlier tasks after learning new ones.
Only Setups 5 and 7 have a non-zero WP5, suggesting the ability to adapt to the growing complexity of the combined dataset. The zero values for Setups 4 and 6 indicate that popularity order struggles to show improvement when dealing with merged data. However, although Setup 5 has the highest IP score with a 27.75% improvement over Setup 7, all 4 setups have negative IP values. This indicates a decrease in validation R@10 over time, suggesting that the Merge All strategy struggles to maintain performance.
Setup 5’s favorable stability and EBWT scores suggest it is the best at balancing the retention of earlier knowledge with the adaptation to new data in a combined dataset. However, its negative IP value indicates a fundamental issue with its approach.
Comparative Analysis and Insights. Although the metrics have different interpretations in the Single Repository and Merge All settings, we can still draw some meaningful comparisons by focusing on overall trends and relative performance. We must consider that the negative IP values in Merge All setups indicate a significant issue. This drawback outweighs the potential benefits seen in other metrics like WP5, as it indicates a fundamental inability to maintain and improve performance in a continuously growing dataset. In contrast, LeanAgent demonstrates a positive IP, indicating its ability to incorporate new knowledge. This, combined with its superior stability and EBWT metrics relative to other Single Repository methods, suggests that LeanAgent is better suited than Setup 5 for continuous generalizability and improvement.
Consistency with sorry Theorem Proving Performance. This lifelong learning analysis is consistent with LeanAgent’s sorry theorem proving performance. LeanAgent’s superior stability metrics (WF5, FM, and CFR) explain its ability to maintain performance across diverse mathematical domains, as evidenced by its success in proving theorems from various repositories like SciLean, Mathematics in Lean Source, and PFR. Its high EBWT score aligns with its progression from basic concepts to advanced topics in theorem proving. While LeanAgent shows slightly lower plasticity (WP5 and IP) compared to some setups, this trade-off results in better overall performance, as reflected in its ability to prove a superset of sorry theorems compared to ReProver in most cases. This analysis demonstrates LeanAgent’s overall superiority in lifelong learning for theorem proving.
5 Conclusion
We have presented LeanAgent, a lifelong learning framework for theorem proving that achieves continuous generalizability and improvement across diverse mathematical domains. Key components include a curriculum learning strategy, a progressive training approach, and a custom dynamic database infrastructure. LeanAgent successfully generates formal proofs for 155 theorems where formal proofs were previously missing and uncovers 7 exploits across 23 Lean repositories, many involving challenging mathematics. This highlights its potential to assist in formalizing complex proofs across multiple domains and in identifying system exploits. For example, LeanAgent successfully proves challenging theorems in abstract algebra and algebraic topology. It outperforms the ReProver baseline in proving new sorry theorems, progressively learning from basic to complex mathematical concepts. Moreover, LeanAgent achieves strong scores on forgetting and backward-transfer metrics, explaining its continuous generalizability and improvement.
Future work could explore integration with Lean Copilot, providing real-time assistance with a mathematician’s repositories. In addition, a limitation of LeanAgent is its inability to prove certain theorems due to a lack of data on specific topics, such as odeSolve.arg_x₀.semiAdjoint_rule in SciLean, which concerns ODEs. To address this, future work could use reinforcement learning for synthetic data generation during curriculum construction. Moreover, future work could pair LeanAgent with additional math LLMs and search strategies.
Acknowledgments
Adarsh Kumarappan is supported by the Summer Undergraduate Research Fellowships (SURF) program at Caltech. Anima Anandkumar is supported by the Bren named chair professorship, Schmidt AI 2050 senior fellowship, and ONR (MURI grant N00014-18-12624). We thank Terence Tao for detailed discussions and feedback that significantly improved this paper. We thank Zulip chat members for engaging in clarifying conversations that were incorporated into the paper.
References
- Alama et al. (2014) Jesse Alama, Tom Heskes, Daniel Kühlwein, Evgeni Tsivtsivadze, and Josef Urban. Premise Selection for Mathematics by Corpus Analysis and Kernel Methods. Journal of Automated Reasoning, 52(2):191–213, February 2014. ISSN 0168-7433, 1573-0670. doi: 10.1007/s10817-013-9286-5.
- Arana & Stafford (2023) Andrew Arana and Will Stafford. On the difficulty of discovering mathematical proofs. Synthese, 202(2):1–29, 2023. doi: 10.1007/s11229-023-04184-5.
- Avigad (2024) Jeremy Avigad. Avigad/mathematics_in_lean_source, August 2024.
- Azerbayev et al. (2024) Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An Open Language Model For Mathematics, March 2024.
- Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48, Montreal Quebec Canada, June 2009. ACM. ISBN 978-1-60558-516-1. doi: 10.1145/1553374.1553380.
- Borgeaud et al. (2022) Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. Improving language models by retrieving from trillions of tokens, February 2022.
- Carneiro (2024) Mario Carneiro. Digama0/lean4lean, September 2024.
- Chang et al. (2021) Ernie Chang, Hui-Syuan Yeh, and Vera Demberg. Does the Order of Training Samples Matter? Improving Neural Data-to-Text Generation with Curriculum Learning. In Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty (eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 727–733, Online, April 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-main.61.
- Chaudhry et al. (2019) Arslan Chaudhry, Marc’Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient Lifelong Learning with A-GEM, January 2019.
- Chen et al. (2023) Jiefeng Chen, Timothy Nguyen, Dilan Gorur, and Arslan Chaudhry. Is forgetting less a good inductive bias for forward transfer?, March 2023.
- Chen et al. (2024) Yupeng Chen, Senmiao Wang, Zhihang Lin, Zeyu Qin, Yushun Zhang, Tian Ding, and Ruoyu Sun. MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning. 2024. doi: 10.48550/ARXIV.2407.20999.
- Cirik et al. (2016) Volkan Cirik, Eduard Hovy, and Louis-Philippe Morency. Visualizing and Understanding Curriculum Learning for Long Short-Term Memory Networks, November 2016.
- Claude Team (2024) Claude Team. Introducing claude 3.5 sonnet, June 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet.
- Community (2024a) Lean Community. Leanprover-community/lean-liquid. leanprover-community, September 2024a.
- Community (2024b) Lean Community. Leanprover-community/lean-perfectoid-spaces. leanprover-community, August 2024b.
- Con-nf (2024) Con-nf. Leanprover-community/con-nf: A formal consistency proof of Quine’s set theory New Foundations. https://github.com/leanprover-community/con-nf/tree/main, 2024.
- Coxeter (2024) Coxeter. NUS-Math-Formalization/coxeter at 96af8aee7943ca8685ed1b00cc83a559ea389a97. https://github.com/NUS-Math-Formalization/coxeter/tree/96af8aee7943ca8685ed1b00cc83a559ea389a97, 2024.
- De Lange et al. (2023) Matthias De Lange, Gido van de Ven, and Tinne Tuytelaars. Continual evaluation for lifelong learning: Identifying the stability gap, March 2023.
- De Moura et al. (2015) Leonardo De Moura, Soonho Kong, Jeremy Avigad, Floris Van Doorn, and Jakob Von Raumer. The Lean Theorem Prover (System Description). In Amy P. Felty and Aart Middeldorp (eds.), Automated Deduction - CADE-25, volume 9195, pp. 378–388, Cham, 2015. Springer International Publishing. ISBN 978-3-319-21400-9 978-3-319-21401-6. doi: 10.1007/978-3-319-21401-6_26.
- DeepMind (2024) Google DeepMind. Google-deepmind/debate. Google DeepMind, August 2024.
- Díaz-Rodríguez et al. (2018) Natalia Díaz-Rodríguez, Vincenzo Lomonaco, David Filliat, and Davide Maltoni. Don’t forget, there is more than forgetting: New metrics for Continual Learning, October 2018.
- Dillies (2024) Yaël Dillies. YaelDillies/LeanAPAP, September 2024.
- Dohare et al. (2024) Shibhansh Dohare, J. Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A. Rupam Mahmood, and Richard S. Sutton. Loss of plasticity in deep continual learning. Nature, 632(8026):768–774, August 2024. ISSN 0028-0836, 1476-4687. doi: 10.1038/s41586-024-07711-7.
- Fermat’s Last Theorem (2024) Fermat’s Last Theorem. ImperialCollegeLondon/FLT. Imperial College London, September 2024.
- Firsching, Moritz (2024) Firsching, Moritz. Mo271/FormalBook: Formalizing “Proofs from THE BOOK”, 2024.
- Formal Logic (2024) Formalized Formal Logic. FormalizedFormalLogic/Foundation. FormalizedFormalLogic, September 2024.
- Gadgil (2024) Siddhartha Gadgil. Siddhartha-gadgil/Saturn: Experiments with SAT solvers with proofs in Lean 4. https://github.com/siddhartha-gadgil/Saturn, 2024.
- Gattinger (2024) Malvin Gattinger. M4lvin/lean4-pdl, September 2024.
- Hairy Ball Theorem (2024) Hairy Ball Theorem. Corent1234/hairy-ball-theorem-lean at a778826d19c8a7ddf1d26beeea628c45450612e6. https://github.com/corent1234/hairy-ball-theorem-lean/tree/a778826d19c8a7ddf1d26beeea628c45450612e6, 2024.
- Irving et al. (2016) Geoffrey Irving, Christian Szegedy, Alexander A Alemi, Niklas Een, Francois Chollet, and Josef Urban. DeepMath - Deep Sequence Models for Premise Selection. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
- Jiang et al. (2022) Albert Q. Jiang, Wenda Li, Szymon Tworkowski, Konrad Czechowski, Tomasz Odrzygóźdź, Piotr Miłoś, Yuhuai Wu, and Mateja Jamnik. Thor: Wielding Hammers to Integrate Language Models and Automated Theorem Provers. 2022. doi: 10.48550/ARXIV.2205.10893.
- Jiang et al. (2024) Gangwei Jiang, Caigao Jiang, Zhaoyi Li, Siqiao Xue, Jun Zhou, Linqi Song, Defu Lian, and Ying Wei. Interpretable Catastrophic Forgetting of Large Language Model Fine-tuning via Instruction Vector. 2024. doi: 10.48550/ARXIV.2406.12227.
- Kevin Buzzard (2019) Kevin Buzzard. The Future of Mathematics? Professor Kevin Buzzard - 30 May 2019, June 2019.
- Kim et al. (2023) Seungyeon Kim, Ankit Singh Rawat, Manzil Zaheer, Sadeep Jayasumana, Veeranjaneyulu Sadhanala, Wittawat Jitkrittum, Aditya Krishna Menon, Rob Fergus, and Sanjiv Kumar. EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval. 2023. doi: 10.48550/ARXIV.2301.12005.
- Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, March 2017. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1611835114.
- Kontorovich (2024) Alex Kontorovich. AlexKontorovich/PrimeNumberTheoremAnd, August 2024.
- Lample et al. (2022) Guillaume Lample, Marie-Anne Lachaux, Thibaut Lavril, Xavier Martinet, Amaury Hayat, Gabriel Ebner, Aurélien Rodriguez, and Timothée Lacroix. HyperTree Proof Search for Neural Theorem Proving. In Advances in Neural Information Processing Systems, volume 35, 2022.
- Li et al. (2024) Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Connor Holmes, Cheng Li, and Yuxiong He. DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):18490–18498, March 2024. ISSN 2374-3468, 2159-5399. doi: 10.1609/aaai.v38i16.29810.
- Li & Hoiem (2017) Zhizhong Li and Derek Hoiem. Learning without Forgetting, February 2017.
- Lin et al. (2024) Haohan Lin, Zhiqing Sun, Yiming Yang, and Sean Welleck. Lean-STaR: Learning to Interleave Thinking and Proving, August 2024.
- Liu et al. (2023) Chengwu Liu, Jianhao Shen, Huajian Xin, Zhengying Liu, Ye Yuan, Haiming Wang, Wei Ju, Chuanyang Zheng, Yichun Yin, Lin Li, Ming Zhang, and Qun Liu. FIMO: A Challenge Formal Dataset for Automated Theorem Proving, December 2023.
- Lopez-Paz & Ranzato (2017) David Lopez-Paz and Marc’ Aurelio Ranzato. Gradient Episodic Memory for Continual Learning. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- Loshchilov & Hutter (2017) I. Loshchilov and F. Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, November 2017.
- Lu et al. (2024) Minghai Lu, Benjamin Delaware, and Tianyi Zhang. Proof Automation with Large Language Models, September 2024.
- Lu et al. (2022) Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, and Alexey Svyatkovskiy. ReACC: A Retrieval-Augmented Code Completion Framework. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6227–6240, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.431.
- mathlib4 Community (2024) The mathlib4 Community. Leanprover-community/mathlib4. leanprover-community, September 2024.
- Mendez & Eaton (2021) Jorge A Mendez and Eric Eaton. Lifelong Learning of Compositional Structures. 2021.
- Mikuła et al. (2024) Maciej Mikuła, Szymon Tworkowski, Szymon Antoniak, Bartosz Piotrowski, Albert Qiaochu Jiang, Jin Peng Zhou, Christian Szegedy, Łukasz Kuciński, Piotr Miłoś, and Yuhuai Wu. Magnushammer: A Transformer-Based Approach to Premise Selection, March 2024.
- Mizuno (2024) Yuma Mizuno. Yuma-mizuno/lean-math-workshop. https://github.com/yuma-mizuno/lean-math-workshop, 2024.
- Monnerjahn (2024) Ludwig Monnerjahn. Louis-Le-Grand/Formalisation-of-constructable-numbers, September 2024.
- Murphy (2024) Logan Murphy. Loganrjmurphy/LeanEuclid, September 2024.
- OpenAI (2024) OpenAI. OpenAI o1 System Card, September 2024.
- Piotrowski et al. (2023) Bartosz Piotrowski, Ramon Fernández Mir, and Edward Ayers. Machine-Learned Premise Selection for Lean. In Automated Reasoning with Analytic Tableaux and Related Methods: 32nd International Conference, TABLEAUX 2023, Prague, Czech Republic, September 18–21, 2023, Proceedings, pp. 175–186, Berlin, Heidelberg, September 2023. Springer-Verlag. ISBN 978-3-031-43512-6. doi: 10.1007/978-3-031-43513-3_10.
- Polu et al. (2022) Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and I. Sutskever. Formal Mathematics Statement Curriculum Learning. ArXiv, February 2022.
- Renshaw (2024) David Renshaw. Dwrensha/compfiles: Catalog Of Math Problems Formalized In Lean. https://github.com/dwrensha/compfiles, September 2024.
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, April 2024.
- Shin et al. (2017) Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual Learning with Deep Generative Replay, December 2017.
- Skřivan (2024) Tomáš Skřivan. Lecopivo/SciLean: Scientific computing in Lean 4. https://github.com/lecopivo/SciLean/tree/master, September 2024.
- Song et al. (2024) Peiyang Song, Kaiyu Yang, and Anima Anandkumar. Towards large language models as copilots for theorem proving in lean, 2024. URL https://arxiv.org/abs/2404.12534.
- Spitkovsky et al. (2009) Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. Baby Steps: How “Less is More” in Unsupervised Dependency Parsing, 2009.
- Subramanian et al. (2017) Sandeep Subramanian, Sai Rajeswar, Francis Dutil, Chris Pal, and Aaron Courville. Adversarial Generation of Natural Language. In Phil Blunsom, Antoine Bordes, Kyunghyun Cho, Shay Cohen, Chris Dyer, Edward Grefenstette, Karl Moritz Hermann, Laura Rimell, Jason Weston, and Scott Yih (eds.), Proceedings of the 2nd Workshop on Representation Learning for NLP, pp. 241–251, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-2629.
- (62) Terence Tao. Teorth/asymptotic.
- Tao (2024a) Terence Tao. Teorth/newton, June 2024a.
- Tao (2024b) Terence Tao. Teorth/symmetric_project, July 2024b.
- Tao et al. (2024) Terence Tao, Pietro Monticone, Lorenzo Luccioli, and Rémy Degenne. Teorth/pfr, August 2024.
- Thakur et al. (2024) Amitayush Thakur, George Tsoukalas, Yeming Wen, Jimmy Xin, and Swarat Chaudhuri. An In-Context Learning Agent for Formal Theorem-Proving, August 2024.
- Tworkowski et al. (2022) Szymon Tworkowski, Maciej Mikula, Tomasz Odrzygozdz, Konrad Czechowski, Szymon Antoniak, Albert Q Jiang, Christian Szegedy, Lukasz Kucinski, Piotr Milos, and Yuhuai Wu. Formal Premise Selection With Language Models, 2022.
- van de Ven et al. (2024) Gido M. van de Ven, Nicholas Soures, and Dhireesha Kudithipudi. Continual Learning and Catastrophic Forgetting, March 2024.
- van Doorn (2024) Floris van Doorn. Fpvandoorn/carleson, September 2024.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Ł ukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- Wang et al. (2023) Haiming Wang, Ye Yuan, Zhengying Liu, Jianhao Shen, Yichun Yin, Jing Xiong, Enze Xie, Han Shi, Yujun Li, Lin Li, Jian Yin, Zhenguo Li, and Xiaodan Liang. DT-Solver: Automated Theorem Proving with Dynamic-Tree Sampling Guided by Proof-level Value Function. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12632–12646, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.706.
- Wang et al. (2024a) Haiming Wang, Huajian Xin, Zhengying Liu, Wenda Li, Yinya Huang, Jianqiao Lu, Zhicheng Yang, Jing Tang, Jian Yin, Zhenguo Li, and Xiaodan Liang. Proving Theorems Recursively, May 2024a.
- Wang et al. (2024b) Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A Comprehensive Survey of Continual Learning: Theory, Method and Application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, August 2024b. ISSN 0162-8828, 2160-9292, 1939-3539. doi: 10.1109/TPAMI.2024.3367329.
- Wang & Deng (2020) Mingzhe Wang and Jia Deng. Learning to Prove Theorems by Learning to Generate Theorems. ArXiv, February 2020.
- Wang et al. (2024c) Ruida Wang, Jipeng Zhang, Yizhen Jia, Rui Pan, Shizhe Diao, Renjie Pi, and Tong Zhang. TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts, October 2024c.
- Wieser (2024) Eric Wieser. Eric-wieser/lean-matrix-cookbook: The matrix cookbook, proved in the Lean theorem prover. https://github.com/eric-wieser/lean-matrix-cookbook, 2024.
- Xin et al. (2024a) Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data, May 2024a.
- Xin et al. (2024b) Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, and Chong Ruan. DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search, August 2024b.
- Xue et al. (2022) Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. Transactions of the Association for Computational Linguistics, 10:291–306, 2022. doi: 10.1162/tacl_a_00461.
- Yang (2024) Kaiyu Yang. Yangky11/miniF2F-lean4. https://github.com/yangky11/miniF2F-lean4/tree/main, 2024.
- Yang et al. (2023) Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. 2023. doi: 10.48550/ARXIV.2306.15626.
- Zaremba & Sutskever (2015) Wojciech Zaremba and Ilya Sutskever. Learning to Execute, February 2015.
- Zeta 3 Irrational (2024) Zeta 3 Irrational. Ahhwuhu/zeta_3_irrational at 3d68ddd90434a398c9a72f30d50c57f15a0118c7. https://github.com/ahhwuhu/zeta_3_irrational/tree/3d68ddd90434a398c9a72f30d50c57f15a0118c7, 2024.
- Zhao et al. (2024) Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection, 2024. URL http://arxiv.org/abs/2403.03507.
- Zhou et al. (2024) Jin Peng Zhou, Charles Staats, Wenda Li, Christian Szegedy, Kilian Q. Weinberger, and Yuhuai Wu. Don’t Trust: Verify – Grounding LLM Quantitative Reasoning with Autoformalization, March 2024.
- Zhou et al. (2023) Shuyan Zhou, Uri Alon, Frank F. Xu, Zhiruo Wang, Zhengbao Jiang, and Graham Neubig. DocPrompting: Generating Code by Retrieving the Docs, February 2023.
- Zombori et al. (2019) Zsolt Zombori, Adrian Csiszarik, Henryk Michalewski, Cezary Kaliszyk, and Josef Urban. Curriculum Learning and Theorem Proving. 2019.
Appendix A Appendix
A.1 Further Methodology Details
Repository Scanning and Data Extraction. We use the GitHub API to query for Lean repositories based on sorting parameters (e.g., by repository stars or most recently updated repositories). We maintain a list of known repositories to avoid; the list can be updated to allow LeanAgent to re-analyze the same repository on a new commit or Lean version.
We clone each identified repository locally using the Git version control system. To ensure compatibility with our theorem-proving pipeline, we check the Lean version required by each repository and compare it with the supported versions of our system. If the required version is incompatible, we skip the repository and move on to the next one. Otherwise, LeanAgent switches its Lean version to match the repository’s version. This version checking is performed by parsing the repository’s configuration files and extracting the specified Lean version.
Dynamic Database Management. This database contains many key features that are useful in our setting. For example, it can add new repositories, update existing ones, and generate merged datasets from multiple repositories with customizable splitting strategies. In addition, it can query specific theorems or premises across repositories, track the progress of proof attempts (including the proof status of sorry theorems), and analyze the structure and content of Lean proofs, including tactic sequences and proof states.
The database keeps track of various details: Repository metadata; theorems categorized as already proven, sorry theorems that are proven, or sorry theorems that are unproven; premise files with their imports and individual premises; traced files for tracking which files have been processed; detailed theorem information, including file path, start/end positions, and full statements; and traced tactics with annotated versions, including the proof state before and after application.
If we encounter duplicate theorems between repositories while merging repositories, we use the theorem from the repository most recently added to the database. We deduplicate premise files and traced files by choosing the first one encountered while merging the repositories. We also generate metadata containing details of all the repositories used to generate the dataset and statistics regarding the theorems, premise files, and traced files in the dataset, such as the total number of theorems.
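The deduplication rules above can be sketched as follows; the dictionary-based repository layout (`theorems` and `premise_files` keys) is a hypothetical simplification of the database schema:

```python
def merge_theorems(repos):
    """Merge theorems from repositories listed oldest-added first: later
    (more recently added) repositories overwrite duplicates."""
    merged = {}
    for repo in repos:
        for name, theorem in repo["theorems"].items():
            merged[name] = theorem
    return merged

def merge_premise_files(repos):
    """Deduplicate premise files by keeping the first one encountered."""
    merged = {}
    for repo in repos:
        for path, premise_file in repo["premise_files"].items():
            merged.setdefault(path, premise_file)
    return merged
```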
We provide the user with many options to generate a dataset. To generate the set of theorems and proofs, the default option is to simply use the theorems, proofs, premise files, and traced files from the current curriculum repository in the database. Specifically, we use the random split from LeanDojo to create training, validation, and testing sets. We refrain from using the novel split from LeanDojo, as we would like LeanAgent to learn as much as possible from a repository to perform well on its hardest theorems. The data in the splits include details about the proofs of theorems, including the URL and commit of the source repository, the file path of the theorem, the full name of the theorem, the theorem statement, the start and end positions in the source file, and a list of traced tactics with annotations. The validation and test splits each contain 2% of the total theorems and proofs, following the methodology from LeanDojo. Moreover, the database uses a topological sort over the traced files in the repository to generate the premise corpus. This corpus is a JSON Lines file, where each line is a JSON object consisting of a path to a Lean source file, the file’s imports, and the file’s premise statements and definitions.
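The topological sort over traced files can be sketched as below; this is a minimal illustration using Kahn's algorithm over a hypothetical import map, not LeanAgent's actual implementation:

```python
from collections import deque

def topological_order(imports):
    """Order files so that every file appears after all files it imports.

    `imports` maps each file path to the list of files it imports."""
    indegree = {f: len(deps) for f, deps in imports.items()}
    dependents = {f: [] for f in imports}
    for f, deps in imports.items():
        for d in deps:
            dependents[d].append(f)
    # Start from files with no imports, processed in a stable order.
    queue = deque(sorted(f for f, d in indegree.items() if d == 0))
    order = []
    while queue:
        f = queue.popleft()
        order.append(f)
        for g in dependents[f]:
            indegree[g] -= 1
            if indegree[g] == 0:
                queue.append(g)
    return order
```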
Progressive Training of the Retriever. We describe some additional steps for progressive training. To precompute the embeddings, we use a single forward pass with batch processing to serialize and tokenize premises from the entire corpus. Then, we use the retriever’s encoder to process the batches and generate embeddings.
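The batched precomputation can be sketched as follows, with a stand-in `encode` function in place of the retriever's encoder and a hypothetical premise record layout:

```python
def batched(seq, batch_size):
    """Yield consecutive batches of a sequence."""
    for i in range(0, len(seq), batch_size):
        yield seq[i:i + batch_size]

def precompute_embeddings(premises, encode, batch_size=32):
    """Single pass over the corpus: serialize the premises, encode them in
    batches, and cache one embedding per premise name."""
    embeddings = {}
    for batch in batched(premises, batch_size):
        vectors = encode([p["statement"] for p in batch])
        for premise, vector in zip(batch, vectors):
            embeddings[premise["name"]] = vector
    return embeddings
```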
sorry Theorem Proving. We start by processing the premise corpus to use it more efficiently during premise retrieval. This involves initializing a directed dependency graph to represent each file path in the corpus, adding files as nodes and imports as edges, and creating a transitive closure of this graph. We also track all premises encountered during this process, building a comprehensive knowledge base.
Crucially, we limit retrieval to a subset of all available premises to improve the effectiveness of the retrieved results. Specifically, we choose the top 25% of accessible and relevant premises, following ReProver’s method.
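A minimal sketch of the transitive closure over the import graph and the top-25% premise filter described above; the `scores` relevance map and the helper names are hypothetical:

```python
def transitive_imports(graph):
    """Transitive closure of a file-import DAG: map each file to the set of
    all files reachable through its imports."""
    closure = {}

    def visit(f):
        if f not in closure:
            reachable = set()
            for g in graph.get(f, []):
                reachable.add(g)
                reachable |= visit(g)
            closure[f] = reachable
        return closure[f]

    for f in graph:
        visit(f)
    return closure

def accessible_premises(premises, scores, fraction=0.25):
    """Keep only the top `fraction` of accessible premises by relevance."""
    ranked = sorted(premises, key=lambda p: scores[p], reverse=True)
    return ranked[:max(1, int(len(ranked) * fraction))]
```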
Proof Integration and Pull Request Generation. We integrate the generated proofs into the original Lean files and create pull requests to propose the changes to the repository owners. This aids the development of these repositories and functions as more training data for future research.
To achieve this, in a temporary Git branch, we iterate over the Lean files and locate the sorry keywords corresponding to the generated proofs. We then replace these sorry keywords with the actual proof text, working from the bottom of each file upward to preserve the position of theorems. After integrating the proofs, we commit our changes, push them, and create a pull request for the repository on GitHub.
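The bottom-up replacement can be sketched as follows, using character offsets for illustration; processing offsets from the end of the file first keeps the earlier offsets valid:

```python
def fill_sorries(source: str, proofs: list[tuple[int, str]]) -> str:
    """Replace `sorry` keywords at the given character offsets with the
    generated proof text, working from the bottom of the file upward."""
    for offset, proof in sorted(proofs, reverse=True):
        assert source[offset:offset + len("sorry")] == "sorry"
        source = source[:offset] + proof + source[offset + len("sorry"):]
    return source
```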
A.2 Experiment Implementation Details
We use ReProver’s retriever trained on the random split from LeanDojo. We use four NVIDIA A100 GPUs with 80GB of memory each for progressive training. LeanAgent uses a distributed architecture leveraging PyTorch Lightning and Ray for parallel processing. We use bfloat16 mixed precision and optimize with AdamW (Loshchilov & Hutter, 2017) with an effective batch size of 16 (achieved through a batch size of 4 with gradient accumulation over 4 steps). In the first 1,000 steps, the learning rate warms up linearly from 0 to the maximum value of $10^{-3}$ . Then it decays to 0 using a cosine schedule. In addition, we apply gradient clipping with a value of 1.0. Just as ReProver does during training, we sample 3 negative premises per example, including 1 in-file negative premise. The maximum sequence length for the retriever is set to 1024 tokens. The maximum sequence length for the generator is set to 512 tokens for input and 128 tokens for output.
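The learning-rate schedule described above (linear warmup from 0 to $10^{-3}$ over the first 1,000 steps, then cosine decay to 0) can be sketched as below; `total_steps` is a hypothetical value for illustration:

```python
import math

def learning_rate(step, max_lr=1e-3, warmup_steps=1000, total_steps=10000):
    """Linear warmup to max_lr over warmup_steps, then cosine decay to 0."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1 + math.cos(math.pi * progress))
```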
The prover uses a best-first search strategy with no limit on the maximum number of expansions of the search tree. It generates 64 tactic candidates and retrieves 100 premises for each proof state. LeanAgent uses ReProver’s tactic generator for the experiments. We generate tactics with a beam search of size 5. We use 4 CPU workers, one per GPU. Due to the wide variety of repositories and experimental setups that we tested, the runtime varied widely; for example, the experiments in Table 10 took from 4 to 9 days to complete.
Furthermore, we do not compare LeanAgent with any existing LLM-based prover besides ReProver because LeanAgent is a framework, not a model. As mentioned previously, it can be used with any LLM. As such, a comparison would be impractical for reasons including differences in data, pre-training, and fine-tuning. We only compare with ReProver because we use ReProver’s retriever as the starting one in LeanAgent, allowing for a more faithful comparison.
Moreover, we do not compare with Aesop because it is not an ML model; we aim to improve upon ML research for theorem proving, such as ReProver. Aesop is also not a framework, whereas ReProver serves as the starting point of the LeanAgent framework, which is why we compare LeanAgent to ReProver. Rather than comparing against existing tools, we aim to understand how lifelong learning can work in theorem proving.
Furthermore, although LeanAgent can work with other LLMs such as Llemma (Azerbayev et al., 2024) and DeepSeek-Prover (Xin et al., 2024a), using these LLMs in our work would require architectural modifications that go beyond the scope of our current work. For example, the 7B model DeepSeek-Prover as well as the 7B and 34B Llemma models are not retrieval-based. As such, rather than progressively training a retriever, we would progressively train the entire model. This may be feasible with methods such as Gradient Low-Rank Projection (Zhao et al., 2024), but it would lead to fundamentally different usage than we currently demonstrate. Specifically, rather than using a best-first tree search approach as we do with ReProver’s retriever and tactic generator, we may instead need to generate the entire proof at once. This setup differs substantially from our current evaluation framework, so the results would not be directly comparable.
In addition, we would like to note that because LeanAgent does not claim to contribute a new search algorithm, it can be used with other search strategies such as Hypertree Proof Search (Lample et al., 2022). However, the source code for Hypertree Proof Search was only recently released on GitHub, explaining why we did not use it thus far.
Moreover, the objective function for Elastic Weight Consolidation (EWC) is given by:
$$
L(\theta) = L_{B}(\theta) + \frac{\lambda}{2} \sum_{i} F_{i} (\theta_{i} - \theta_{A,i})^{2}
$$
where $L_{B}(\theta)$ is the loss for the current task B, $i$ indexes the parameters, $\theta_{A,i}$ are the parameters from the previous task A, $F_{i}$ is the corresponding diagonal entry of the Fisher information matrix, and $\lambda$ is a hyperparameter that controls the strength of the EWC penalty. For the setups that use EWC, we performed a grid search over $\lambda$ values in {0.01, 0.1, 1, 10, 100}. For each value, we ran Setup 2 on separate testing repositories. We found 0.1 to yield the best overall stability and plasticity scores.
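A minimal sketch of the EWC objective with parameters as plain lists of floats (a real implementation would operate on model tensors):

```python
def ewc_loss(task_loss, params, old_params, fisher, lam=0.1):
    """L(theta) = L_B(theta) + (lambda / 2) * sum_i F_i * (theta_i - theta_{A,i})^2.

    `params` are the current parameters, `old_params` those from the
    previous task, and `fisher` the diagonal Fisher information entries."""
    penalty = sum(f * (p - q) ** 2
                  for f, p, q in zip(fisher, params, old_params))
    return task_loss + 0.5 * lam * penalty
```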
A.3 Repository Details
Table 5: Additional repository descriptions
Table 6: Repository commits. Formalization of Const. Numbers denotes Formalization of Constructable Numbers.
| Repository | Commit |
| --- | --- |
| PFR | fa398a5b853c7e94e3294c45e50c6aee013a2687 |
| Hairy Ball Theorem | a778826d19c8a7ddf1d26beeea628c45450612e6 |
| Coxeter | 96af8aee7943ca8685ed1b00cc83a559ea389a97 |
| Mathematics in Lean Source | 5297e0fb051367c48c0a084411853a576389ecf5 |
| Formal Book | 6fbe8c2985008c0bfb30050750a71b90388ad3a3 |
| MiniF2F | 9e445f5435407f014b88b44a98436d50dd7abd00 |
| SciLean | 22d53b2f4e3db2a172e71da6eb9c916e62655744 |
| Carleson | bec7808b907190882fa1fa54ce749af297c6cf37 |
| Lean4 PDL | c7f649fe3c4891cf1a01c120e82ebc5f6199856e |
| Prime Number Theorem And | 29baddd685660b5fedd7bd67f9916ae24253d566 |
| Compfiles | f99bf6f2928d47dd1a445b414b3a723c2665f091 |
| FLT | b208a302cdcbfadce33d8165f0b054bfa17e2147 |
| Debate | 7fb39251b705797ee54e08c96177fabd29a5b5a3 |
| Lean4Lean | 05b1f4a68c5facea96a5ee51c6a56fef21276e0f |
| Matrix Cookbook | f15a149d321ac99ff9b9c024b58e7882f564669f |
| Math Workshop | 5acd4b933d47fd6c1032798a6046c1baf261445d |
| LeanEuclid | f1912c3090eb82820575758efc31e40b9db86bb8 |
| Foundation | d5fe5d057a90a0703a745cdc318a1b6621490c21 |
| Con-nf | 00bdc85ba7d486a9e544a0806a1018dd06fa3856 |
| Saturn | 3811a9dd46cdfd5fa0c0c1896720c28d2ec4a42a |
| Zeta 3 Irrational | 914712200e463cfc97fe37e929d518dd58806a38 |
| Formalization of Const. Numbers | 01ef1f22a04f2ba8081c5fb29413f515a0e52878 |
| LeanAPAP | 951c660a8d7ba8e39f906fdf657674a984effa8b |
Table 7: Repository orders (initial curriculum). Note that Popularity Order is by star count on August 20, 2024.
| Rank | Popularity Order | Curriculum Order |
| --- | --- | --- |
| 1 | Compfiles | SciLean |
| 2 | Mathematics in Lean Source | FLT |
| 3 | Prime Number Theorem And | PFR |
| 4 | Math Workshop | Prime Number Theorem And |
| 5 | FLT | Compfiles |
| 6 | PFR | Debate |
| 7 | SciLean | Mathematics in Lean Source |
| 8 | Debate | Lean4Lean |
| 9 | Matrix Cookbook | Matrix Cookbook |
| 10 | Con-nf | Math Workshop |
| 11 | Foundation | LeanEuclid |
| 12 | Saturn | Foundation |
| 13 | LeanEuclid | Con-nf |
| 14 | Lean4Lean | Saturn |
Table 8: Curriculum order (sub-curriculum)
| Rank | Repository |
| --- | --- |
| 1 | Zeta 3 Irrational |
| 2 | Formal Book |
| 3 | Formalization of Constructable Numbers |
| 4 | Carleson |
| 5 | LeanAPAP |
| 6 | Hairy Ball Theorem |
| 7 | Coxeter |
| 8 | Lean4 PDL |
Table 9: Repository theorem and premise counts
| Repository | Theorems | Premises |
| --- | --- | --- |
| PFR | 74306 | 109855 |
| Hairy Ball Theorem | 73026 | 131217 |
| Coxeter | 71273 | 127608 |
| Mathematics in Lean Source | 78886 | 117699 |
| Formal Book | 74654 | 112458 |
| MiniF2F | 71313 | 127202 |
| SciLean | 72244 | 129711 |
| Carleson | 73851 | 109334 |
| Lean4 PDL | 20599 | 46400 |
| Prime Number Theorem And | 79147 | 115751 |
| Compfiles | 121391 | 178108 |
| FLT | 75082 | 114830 |
| Debate | 68853 | 103684 |
| Lean4Lean | 2559 | 22689 |
| Matrix Cookbook | 67585 | 102294 |
| Math Workshop | 76942 | 115458 |
| LeanEuclid | 15423 | 40555 |
| Foundation | 25047 | 57964 |
| Con-nf | 29489 | 64177 |
| Saturn | 10982 | 34497 |
| Zeta 3 Irrational | 120174 | 176332 |
| Formalization of Const. Numbers | 74050 | 109645 |
| LeanAPAP | 71090 | 109477 |
Repository Statistics and Information. Additional repository descriptions are in Table 5. The commits we used for experiments are in Table 6. Moreover, the repository orders are detailed in Table 7 and Table 8. Furthermore, the total number of theorems and premises per repository (including dependencies) are in Table 9.
Repository Selection Process. Many repositories have issues such as incompatibilities with LeanDojo, unsupported Lean versions, and build failures. As such, our process for choosing the 14 repositories in the first curriculum was simply using LeanAgent to extract information from the most popular repositories on GitHub. We disregard incompatible and inapplicable ones, such as those with no theorems. We performed this process on August 20, 2024. We performed a similar process for the eight repositories in the second curriculum with two differences: (1) We checked that the number of sorry theorems visible from GitHub was at least 10. This narrowed down the available repositories significantly. (2) We included some more recently updated repositories to provide some variety in the age of the repositories in our curriculum. We performed this process on September 14, 2024. However, many of the repositories that passed this test had fewer than 10 sorry theorems when processed by LeanDojo; this is mainly due to the functionalities of LeanDojo.
A.4 Lifelong Learning Metric Details
Prior work has noted that lifelong learning methods generally lack standard evaluation metrics (De Lange et al., 2023; Díaz-Rodríguez et al., 2018). As such, our selection primarily focused on metrics that emphasized a change over time, aligning with our problem setup. In addition, we removed metrics that were redundant. For example, prior work suggests that evaluating lifelong learning frameworks only after each task, rather than over time, leads to substantial forgetting (De Lange et al., 2023). As such, we adopt WF and WP in our analysis of LeanAgent. We use a window size of 5 for WF and WP as this represents a relatively medium-term understanding, given that we have 14 repositories. This would provide a balanced interpretation of forgetting and plasticity. Furthermore, we use the EBWT metric, in line with previous work, to evaluate LeanAgent throughout its lifetime rather than simply at the end (Díaz-Rodríguez et al., 2018). Moreover, we chose not to include the Forward Transfer metric as prior work has shown that a lower FM leads to better forward transfer (Chen et al., 2023). As such, we only check FM. We also chose not to include lifelong learning metrics for overall performance, such as Time Weighted Cumulative Performance (TWCP), Area Under the Learning Curve (AULC), and Average Accuracy (AA), as these would lead to redundancy in our analysis. Specifically, the metrics we chose were all computed using validation R@10 and the average test R@10, which are already measures of LeanAgent’s performance.
We provide some additional details on the metrics we used. Windowed-Forgetting 5 (WF5) quantifies model stability by measuring the maximum performance decrease in average test R@10 over a sliding window of 5 evaluations. Following prior work, we define WF for a given window size and then average it over all evaluation tasks to provide a single measure of stability. Moreover, Catastrophic Forgetting Resilience (CFR) is a key indicator of the stability-plasticity trade-off. Furthermore, the Forgetting Measure (FM) measures the negative influence that learning a task has on the test R@10 of all old tasks. It is the average forgetting of all old tasks, where forgetting of a task is the difference between its highest and current performance. Furthermore, BWT measures the positive influence of learning a new task on the test R@10 of old tasks. EBWT improves upon this metric by considering the average of the BWT computed after each task. Windowed-Plasticity 5 (WP5) measures the ability to learn new information by quantifying the maximum average test R@10 increase over a sliding window of 5 evaluations. Incremental Plasticity (IP) tracks changes in validation R@10 for each task over time.
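The FM and BWT computations can be sketched from a performance matrix `R`, where `R[i][j]` is the test R@10 on task `j` after training on task `i`. This is an illustrative reading of the standard definitions, not LeanAgent's evaluation code:

```python
def forgetting_measure(R):
    """FM: average over old tasks of the gap between each task's highest
    past performance and its performance after the final task."""
    T = len(R)
    return sum(max(R[i][j] for i in range(T - 1)) - R[T - 1][j]
               for j in range(T - 1)) / (T - 1)

def backward_transfer(R):
    """BWT: average change on old tasks between when each was learned and
    the end of training (positive means new tasks helped old ones)."""
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)
```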
However, it is important to note that our lifelong learning metrics have different interpretations in the Merge All dataset construction strategy, which differs from the traditional task-incremental setup. To our knowledge, an interpretation of these metrics in this setting has not been thoroughly conducted. As such, we propose that metrics should be interpreted with an understanding that they may reflect an adaptation to gradual shifts in data distribution rather than abrupt task changes. Specifically, WF5 may reflect not just forgetting old tasks but also the ability to balance and retain knowledge across an expanding dataset. WP5 could indicate how well the model adapts to the growing complexity of the combined dataset rather than purely learning new, isolated tasks. FM, in this context, may represent the ability to maintain performance on earlier data points as the dataset grows. EBWT might reflect the capacity to leverage newly added data to improve performance on the entire historical dataset. CFR becomes a measure of stability in the face of an expanding, potentially more complex dataset. IP may represent how quickly the model adapts to the evolving nature of the combined dataset rather than discrete new tasks. These metrics in the Merge All case measure the ability to accumulate and refine knowledge over time rather than strictly measuring performance on isolated tasks.
It is worth analyzing the effect of EWC. Our results in Sec. 4.3 suggest that the effect of EWC is not uniform across different task-ordering strategies. In curriculum-based ordering, EWC seems to improve plasticity (WP5 and IP) at the cost of stability and continuous improvement (WF5, FM, EBWT, and CFR). An exception is Setup 7, which improves WP5 and FM. This suggests that the Merge All strategy creates a more nuanced balance between stability and plasticity. EWC generally improves stability and plasticity metrics, except IP, in the popularity order for the Single Repository strategy. This may be because this ordering is less optimized for learning, and EWC helps to mitigate some of its shortcomings. However, when used with a more effective curriculum-based ordering, EWC interferes with the carefully structured learning process, leading to mixed results. Moreover, in the Merge All scenario, EWC offers benefits only for the curriculum learning setups, suggesting that its effectiveness might be limited in more complex, merged datasets. This can be explained by the fact that the Merge All strategy is a memory-based approach to lifelong learning. As such, using both EWC and the Merge All strategy may lead to excessive stability. This analysis further explains why LeanAgent’s setup is superior.
A.5 Further sorry Theorem Proving Discussion
We provide some additional discussion on the results in Table 2.
First, we note that comparing LeanAgent to ReProver+, the fine-tuning baseline, is not a direct comparison. Specifically, we note how mathematicians often formalize across multiple domains and projects simultaneously or cyclically. We use this as motivation for connecting mathematical knowledge between domains. Moreover, a key use case is formalizing new repositories without retraining on the entire dataset each time. When a mathematician creates some new repositories and adds them to a curriculum, they can simply progressively train LeanAgent on the new repositories while maintaining performance on previous ones, as shown in our experiments with both the initial curriculum and sub-curriculum. This is much more practical than retraining on the entire dataset, which would be expensive in terms of compute and time, especially as the number of repositories grows.
Moreover, data scarcity is a major problem in theorem proving. As such, having enough high-quality data for effective pre-training on all repositories may not be feasible. Training on all data, an approach similar to existing work, could prevent the model from generalizing to new repositories. However, LeanAgent does not have this restraint as it continuously generalizes to and improves on ever-expanding mathematical knowledge without forgetting previously learned knowledge.
In addition, the results in Table 4 show that our lifelong learning setup leads to effective backward transfer. This is a strong advantage that pre-training does not provide and is also a reason why LeanAgent demonstrates progressive learning, starting with basic concepts and advancing to more complex ones. Also, although not a direct comparison to the pre-training approach, the “Merge All” strategy indicates decreased performance over time. This dataset strategy can be interpreted as being closer to pre-training than the “Single Repository” strategy, suggesting lower than desired pre-training performance when training on all data.
Second, we note that LeanAgent improves over the direct fine-tuning baseline on various repositories, including MiniF2F, SciLean, and PFR. We find that these repositories contain a range of mathematical concepts that require progressively more advanced reasoning capabilities. This highlights the effectiveness of our lifelong learning approach in continuously generalizing to and improving on expanding mathematical knowledge without catastrophic forgetting.
Table 10: Accuracy comparison across setups. Accuracy is calculated as (proven theorems / total sorry theorems). MIL denotes the Mathematics in Lean Source repository, LA denotes LeanAgent, and SX denotes Setup X (e.g., S1 is Setup 1). The best accuracy for each repository is in bold.
| Repository | sorry Theorems | LA | S1 | S2 | S3 | S4 | S5 | S6 | S7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MIL | 29 | **72.4** | 55.2 | 41.4 | 55.2 | 55.2 | 48.3 | 58.6 | 48.3 |
| SciLean | 294 | **9.2** | 8.5 | 7.5 | 6.8 | 8.5 | 8.5 | 8.8 | 7.8 |
| PFR | 37 | **2.7** | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
In addition to comparing LeanAgent and ReProver, we conduct an ablation study between LeanAgent and the seven variants discussed in Sec. 4.3 regarding the original curriculum. The detailed sorry theorem proving comparison, which focuses on some of the repositories compared in Sec. 4.2, is in Table 10. Note that when using the Merge All strategy, only sorry theorems from the new repository are proven during each iteration of lifelong learning. We devote the rest of this section to the detailed comparison of sorry theorems that these setups can prove.
Mathematics in Lean Source. We notice a progression in LeanAgent’s proving ability in the Mathematics in Lean Source repository. During lifelong learning, LeanAgent demonstrates a grasp of fundamental algebraic structures and basic mathematical operations:
a) Group and Ring Theory: LeanAgent proves theorems about basic algebraic structures. For instance, MyGroup.mul_right_inv shows that multiplying an element by its inverse yields the identity and MyRing.add_right_cancel demonstrates the cancellation property in ring addition.

Figure: `theorem mul_right_inv (a : G) : a * a⁻¹ = 1 := by simp` and `theorem add_right_cancel {a b c : R} (h : a + b = c + b) : a = c := by simpa using h`.
b) Elementary Number Theory: LeanAgent handles fundamental arithmetic properties, including MyRing.zero_mul, which proves that zero multiplied by any number is zero, and MyRing.neg_neg, which shows that the negative of a negative number is the original number.

```lean
theorem zero_mul (a : R) : 0 * a = 0 := by
  rw [MulZeroClass.zero_mul]

theorem neg_neg (a : R) : - -a = a := by
  simp
```
c) Order Theory: LeanAgent grasps order theory, as evidenced by absorb1, which proves that the infimum of x and the supremum of x and y is always equal to x, and absorb2, which demonstrates that the supremum of x and the infimum of x and y is always equal to x.
<details>
<summary>extracted/6256159/carbon20.png Details</summary>

### Visual Description
\n
## Screenshot: Mathematical Theorem Display
### Overview
The image is a screenshot displaying two mathematical theorems, likely from a formal proof assistant or interactive theorem prover environment. The theorems are presented as code-like statements, and there are three colored circles positioned at the top-left corner of the dark-themed window.
### Components/Axes
There are no axes or traditional chart components. The key elements are:
* **Colored Circles:** Three circles positioned horizontally at the top-left. From left to right, they are red, orange, and green. Their purpose is unclear without further context.
* **Theorem 1:** `theorem absorb1 : x ∩ (x ∪ y) = x := by simp`
* **Theorem 2:** `theorem absorb2 : x ∪ x ∩ y = x := by simp`
### Detailed Analysis or Content Details
The screenshot displays two theorems related to set theory, specifically absorption laws. Let's break down each theorem:
* **Theorem absorb1:**
* `theorem absorb1 :` This declares a theorem named "absorb1".
* `x ∩ (x ∪ y) = x :=` This is the statement of the theorem. It states that the intersection of x and the union of x and y is equal to x. The symbol `∩` represents intersection, and `∪` represents union.
* `by simp` This indicates that the theorem was proven using a simplification tactic (likely within the proof assistant).
* **Theorem absorb2:**
* `theorem absorb2 :` This declares a theorem named "absorb2".
* `x ∪ x ∩ y = x :=` This is the statement of the theorem. It states that the union of x and the intersection of x and y is equal to x.
* `by simp` This indicates that the theorem was proven using a simplification tactic.
### Key Observations
* Both theorems are absorption laws in set theory.
* Both theorems were proven using the `simp` tactic, suggesting they are relatively straightforward to prove.
* The environment appears to be a formal proof assistant, likely Lean, Coq, or similar.
* The colored circles at the top-left may indicate status or configuration options within the environment, but their exact meaning is unknown.
### Interpretation
The screenshot demonstrates the use of a formal proof assistant to verify mathematical theorems. The theorems themselves are fundamental results in set theory, illustrating the absorption property. The `simp` tactic suggests that the proofs rely on basic simplification rules. The screenshot provides a glimpse into the workflow of formal mathematical reasoning, where theorems are stated, proven, and verified using computational tools. The colored circles could be indicators of the current state of the proof environment, such as compilation status or debugging information. Without more context, their precise function remains unclear.
</details>
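The prose above describes absorb1 and absorb2 in terms of infima and suprema, while the screenshot renders the operators as ∩ and ∪. In Mathlib's general lattice setting, the same absorption laws read roughly as follows (a sketch; the lemma names `inf_sup_self` and `sup_inf_self` exist in recent Mathlib, but this is not necessarily the repository's exact code):

```lean
-- Absorption laws in an arbitrary lattice, where ⊓ is the infimum
-- and ⊔ is the supremum. Mathlib proves these as `inf_sup_self`
-- and `sup_inf_self`; set intersection ∩ and union ∪ are special cases.
example {α : Type*} [Lattice α] (x y : α) : x ⊓ (x ⊔ y) = x :=
  inf_sup_self

example {α : Type*} [Lattice α] (x y : α) : x ⊔ x ⊓ y = x :=
  sup_inf_self
```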
d) Rudimentary Real Analysis: LeanAgent demonstrates an early capability to handle properties related to real numbers and absolute values, as shown by C03S05.MyAbs.abs_add, which proves the triangle inequality for real numbers.
<details>
<summary>extracted/6256159/carbon21.png Details</summary>

### Visual Description
## Screenshot: Theorem Statement in a Code Editor
### Overview
The image is a screenshot of a code editor window displaying a theorem statement related to absolute values. The editor appears to be a macOS application, indicated by the three colored circles in the top-left corner. The theorem is presented in a formal, mathematical notation likely used in a proof assistant or formal verification tool.
### Components/Axes
The screenshot contains the following elements:
* **macOS Window Controls:** Three colored circles (red, yellow, green) in the top-left corner, representing the close, minimize, and maximize (zoom) buttons.
* **Code Editor Area:** A dark-themed rectangular area containing the theorem statement.
* **Theorem Statement:** The core textual content of the image.
### Detailed Analysis or Content Details
The theorem statement is as follows:
```
theorem abs_add (x y : ℝ) : |x + y| ≤ |x| + |y| := by
  apply abs_add_le
```
Let's break down the statement:
* `theorem abs_add (x y : ℝ)`: This declares a theorem named `abs_add` that takes two arguments, `x` and `y`, both of type `ℝ` (representing the set of real numbers).
* `: |x + y| ≤ |x| + |y|`: This is the statement of the theorem itself. It asserts that the absolute value of the sum of `x` and `y` is less than or equal to the sum of their absolute values (the triangle inequality).
* `:= by`: This indicates the start of the proof.
* `apply abs_add_le`: This suggests that the theorem is being proven by applying a previously defined lemma or theorem named `abs_add_le`.
### Key Observations
The statement is a standard mathematical theorem (the triangle inequality for real numbers). The syntax suggests it is written for a proof assistant like Lean, Coq, or Isabelle; the `:= by` and `apply` constructs are common in these systems.
### Interpretation
The image demonstrates a formalization of a mathematical theorem within a proof assistant environment. The theorem states the triangle inequality, a fundamental concept in mathematics. The `apply abs_add_le` suggests a modular approach to proof construction, where existing lemmas are reused to prove new theorems. This is a common practice in formal verification, ensuring the correctness and reliability of mathematical proofs and software systems. The screenshot provides a glimpse into the process of formalizing mathematical knowledge and reasoning within a computer-assisted environment.
</details>
Of the 14 sorry theorems from Mathematics in Lean Source proven during the lifelong learning process, 10 come from the exercise file on proving identities about algebraic structures. This indicates that LeanAgent starts its grasp of mathematical concepts from the basics.
Crucially, by the end of the lifelong learning process, LeanAgent exhibits significant growth in its mathematical reasoning abilities:
a) Quantifier Manipulation: LeanAgent exhibits more advanced logical reasoning by managing multiple quantifiers and implications, as evidenced by C03S01.my_lemma3, which proves a complex statement involving bounds and absolute values with multiple quantifiers and conditions, and C03S05.MyAbs.abs_lt, which establishes that the absolute value of x being less than y is equivalent to $-y<x\land x<y$ .
<details>
<summary>extracted/6256159/carbon23.png Details</summary>

### Visual Description
## Screenshot: Code Snippet - Mathematical Theorem Proofs
### Overview
The image is a screenshot of a code editor displaying two mathematical theorem proofs, likely written in a formal proof assistant language (possibly Coq or similar). The background is dark, and the text is light-colored, typical of many code editors. There are three colored circles at the top-left corner, likely status indicators.
### Components/Axes
There are no axes or traditional chart components. The key elements are:
* **Colored Circles:** Three circles at the top-left: Red, Yellow, Green. Their meaning is not explicitly stated but likely indicate status (e.g., compilation status, error status).
* **Theorem 1:** `theorem my_lemma3 : ∀ {x y ε : ℝ}, 0 < ε → ε ≤ 1 → |x| < ε → |y| < ε → |x * y| < ε := by apply C03S01.my_lemma`
* **Theorem 2:** `theorem abs_lt : |x| < y ↔ -y < x ∧ x < y := by cases x; exact abs_lt`
### Detailed Analysis or Content Details
**Theorem 1: `my_lemma3`**
* **Theorem Statement:** `∀ {x y ε : ℝ}, 0 < ε → ε ≤ 1 → |x| < ε → |y| < ε → |x * y| < ε`
* This states: For all real numbers x, y, and ε, if ε is greater than 0 and at most 1, and the absolute values of x and y are both less than ε, then the absolute value of their product (x * y) is also less than ε.
* **Proof Step:** `apply C03S01.my_lemma`
* This indicates that the theorem is being proven by applying a previously defined lemma named `my_lemma` located in the module `C03S01`.
**Theorem 2: `abs_lt`**
* **Theorem Statement:** `|x| < y ↔ -y < x ∧ x < y`
* This states: The absolute value of x is less than y if and only if x is greater than -y and x is less than y. This is a basic characterization of absolute values.
* **Proof Step:** `cases x`
* This indicates that the proof proceeds by considering different cases based on the value of x.
* **Proof Step:** `exact abs_lt`
* This indicates that the current case is directly solved by the lemma or definition `abs_lt`.
### Key Observations
* The theorems involve real numbers (ℝ) and epsilon-delta arguments, common in mathematical analysis.
* The proofs are concise, relying on applying existing lemmas or definitions.
* The code uses a formal syntax, suggesting a proof assistant environment.
### Interpretation
The image demonstrates a snippet of formal mathematical reasoning. The theorems and their proofs are likely part of a larger mathematical development within a proof assistant. The use of lemmas and case analysis are standard techniques in formal verification. The `my_lemma3` theorem likely relates to the continuity of multiplication, while `abs_lt` defines the relationship between absolute values and inequalities. The colored circles at the top-left likely indicate the status of the code (e.g., compilation success, errors). The overall context suggests a rigorous mathematical environment where proofs are constructed and verified with high precision.
</details>
b) Set Theory and Relations: LeanAgent handles abstract set-theoretic concepts, as shown by C03S01.Subset.trans, which proves that subset relations are transitive.
<details>
<summary>extracted/6256159/carbon24.png Details</summary>

### Visual Description
## Screenshot: Code Snippet - Theorem Definition
### Overview
The image is a screenshot of a dark-themed terminal window displaying a code snippet defining a theorem related to set theory. The window has three colored circles at the top-left corner, likely representing window control buttons.
### Components/Axes
The screenshot consists of:
* **Terminal Window:** A rectangular area with a dark background.
* **Colored Circles:** Three circles positioned at the top-left corner of the window. From left to right, they are red, yellow/orange, and green.
* **Code Text:** A block of text in a monospace font, representing the code snippet.
### Detailed Analysis or Content Details
The code snippet defines a theorem named `Subset.trans`. The code is as follows:
```
theorem Subset.trans : r ⊆ s → s ⊆ t → r ⊆ t := by
  exact Set.Subset.trans
```
Breaking down the code:
* `theorem Subset.trans :` declares a theorem named `Subset.trans`.
* `r ⊆ s → s ⊆ t → r ⊆ t` is the statement of the theorem. It states that if `r` is a subset of `s` and `s` is a subset of `t`, then `r` is a subset of `t`. The symbol `⊆` represents the subset relation, and `→` represents logical implication.
* `:= by` indicates the start of the proof.
* `exact Set.Subset.trans` completes the proof by referencing another theorem, `Set.Subset.trans`, which likely proves the transitivity of the subset relation.
### Key Observations
The code snippet is a concise and formal definition of the transitivity property of subsets. It leverages existing theorems to provide a proof. The use of unicode symbols for subset and implication enhances readability for those familiar with mathematical notation.
### Interpretation
This code snippet demonstrates a formal approach to mathematical reasoning using a proof assistant or theorem prover. The `exact` keyword suggests that the proof is straightforward and relies on a previously established result. The theorem itself is a fundamental concept in set theory, stating that subset relationships are transitive. This type of code is commonly found in formal verification systems, where mathematical proofs are used to ensure the correctness of software or hardware. The screenshot likely comes from a system like Lean, Coq, or Isabelle.
</details>
By contrast, only 2 of the 7 sorry theorems LeanAgent now proves from Mathematics in Lean Source come from the exercise file on proving identities about algebraic structures. This suggests that lifelong learning allowed LeanAgent to develop a stronger ability to work with premises in more complicated proofs.
We gain some insights from comparing the performance of LeanAgent over time on Mathematics in Lean Source to other setups. For example, ReProver can handle harder theorems out of the box, such as C03S01.my_lemma3, but proves fewer theorems overall, suggesting that it starts with a broader knowledge base but loses performance from a lack of adaptability. Furthermore, Setup 5 proves the same sorry theorems as LeanAgent does during lifelong learning, suggesting that pure curriculum learning, without EWC or the Merge All strategy, emphasizes grasping easier concepts earlier. This mirrors the insights gained from the lifelong learning analysis. Setups 3 and 7 (curriculum with EWC and/or Merge All) demonstrate some knowledge plasticity, proving harder theorems during lifelong learning, such as C03S01.Subset.trans in Setup 3. However, this comes at the cost of proving basic theorems, showing catastrophic forgetting: Setups 3 and 7 could not prove the trivial theorems MyGroup.mul_right_inv and MyRing.zero_mul, respectively, during lifelong learning, whereas LeanAgent could. This again aligns with the lifelong learning scores from our previous analysis.
By the end of lifelong learning, the sorry theorems that LeanAgent proves from Mathematics in Lean Source are a superset of those that the other setups prove. This shows that LeanAgent’s lifelong learning setup provides continuously improving capabilities to reason about more advanced premises and proofs than the other setups. For example, apart from Setup 6, LeanAgent is the only system that can prove C03S05.MyAbs.abs_lt. It achieved this by using the available premises, such as abs_lt, whose statement is similar to that of C03S05.MyAbs.abs_lt.
An interesting case study can be found in the dichotomy between the theorems C03S05.MyAbs.neg_le_abs_self and C03S05.MyAbs.le_abs_self. LeanAgent can prove C03S05.MyAbs.neg_le_abs_self by referencing C03S05.MyAbs.le_abs_self, which is still unproven at that point:
<details>
<summary>extracted/6256159/carbon27.png Details</summary>

### Visual Description
## Screenshot: Code Snippet - Theorem Proof
### Overview
The image is a screenshot displaying a code snippet, likely from a formal verification or theorem proving environment. The code defines a theorem and its proof using a `simpa` tactic. The snippet is presented within a rectangular block with a dark background and light-colored text. Three colored dots are positioned to the left of the block.
### Components/Axes
There are no axes or traditional chart components. The key elements are:
* **Colored Dots:** Three dots, colored red, orange, and green, positioned horizontally to the left of the code block. Their purpose is unclear without further context.
* **Code Block:** Contains the theorem statement and proof.
* **Text:** The code itself, consisting of keywords, identifiers, and mathematical notation.
### Detailed Analysis or Content Details
The code snippet can be transcribed as follows:
```
theorem neg_le_abs_self (x : ℝ) : -x ≤ |x| := by
  simpa using C03S05.MyAbs.le_abs_self (-x)
```
Breaking down the code:
* `theorem neg_le_abs_self (x : ℝ) : -x ≤ |x| := by`: This line declares a theorem named `neg_le_abs_self`. It takes a real number `x` (denoted by `x : ℝ`) as input and states that `-x` is less than or equal to the absolute value of `x` (denoted by `-x ≤ |x|`). The `:= by` indicates the start of the proof.
* `simpa using C03S05.MyAbs.le_abs_self (-x)`: This line applies the `simpa` tactic (likely a simplification tactic) using a lemma or theorem named `C03S05.MyAbs.le_abs_self` to the expression `-x`. The `simpa` tactic attempts to simplify the goal using the provided lemma.
### Key Observations
* The code uses mathematical notation for real numbers (`ℝ`) and absolute value (`|x|`).
* The `simpa` tactic suggests the use of a pre-defined lemma or theorem for simplification.
* The theorem itself is a fundamental property of absolute values.
### Interpretation
The code snippet demonstrates a formal proof of the property that the negative of a real number is less than or equal to its absolute value. The use of the `simpa` tactic indicates a reliance on automated or semi-automated proof techniques. The `C03S05.MyAbs.le_abs_self` lemma likely encapsulates a previously proven result related to absolute values, allowing for a concise proof of the current theorem. The colored dots to the left of the code block are likely status indicators or visual cues within the development environment, but their specific meaning is not discernible from the image alone. They could represent compilation status, test results, or other relevant information.
</details>
At the end of lifelong learning, LeanAgent can prove C03S05.MyAbs.le_abs_self:
<details>
<summary>extracted/6256159/carbon28.png Details</summary>

### Visual Description
## Screenshot: Lean Theorem Proof
### Overview
The image is a screenshot of a Lean proof environment. It displays a theorem statement and the commands used to prove it. The background is a dark teal color, and there are three colored circles in the top-left corner, likely indicating the status of the environment.
### Components/Axes
There are no axes or charts in this image. The key components are:
* **Colored Circles (Top-Left):** Red, Yellow, and Green. These likely represent build status or other environment indicators.
* **Theorem Statement:** `theorem le_abs_self (x : ℝ) : x ≤ |x| := by`
* **Proof Commands:**
* `rw [le_abs]`
* `simp`
### Detailed Analysis or Content Details
The theorem statement defines a theorem named `le_abs_self`. It states that for any real number `x` (denoted by `x : ℝ`), `x` is less than or equal to its absolute value (`|x|`). The `:= by` indicates the start of the proof.
The proof consists of two commands:
1. `rw [le_abs]`: This command rewrites the goal using the lemma or theorem `le_abs`. It's likely `le_abs` is a previously defined theorem relating to the absolute value and less than or equal to.
2. `simp`: This command simplifies the goal using a set of simplification rules.
### Key Observations
The theorem is a fundamental property of absolute values. The proof is concise, using only two commands, suggesting that the `le_abs` lemma is well-chosen and the `simp` command is effective in completing the proof.
### Interpretation
This screenshot demonstrates a simple yet important theorem in real analysis being proven within the Lean proof assistant. Lean is an interactive theorem prover, meaning that the user guides the proof process step-by-step. The screenshot shows how a mathematical statement can be formalized and verified using a computer. The use of `rw` and `simp` are common tactics in Lean proofs, representing rewriting and simplification respectively. The colored circles likely indicate the status of the proof environment (e.g., red for errors, yellow for warnings, green for success). The brevity of the proof suggests the power of having well-defined lemmas and simplification rules within the Lean environment.
</details>
It achieves this through its use of the le_abs premise, which characterizes when an element is less than or equal to the absolute value of another. This suggests that LeanAgent uses existing knowledge where possible before attempting to re-derive why known facts hold.
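For reference, Mathlib's `le_abs` lemma, specialized here to the reals, reads roughly as follows (the library states it under more general typeclass assumptions, which may differ between Mathlib versions):

```lean
-- `le_abs`: a value is bounded above by |b| iff it is bounded
-- above by b or by -b. Rewriting the goal `x ≤ |x|` with this
-- lemma leaves `x ≤ x ∨ x ≤ -x`, which `simp` can discharge.
example (a b : ℝ) : a ≤ |b| ↔ a ≤ b ∨ a ≤ -b :=
  le_abs
```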
SciLean. We examine the sorry theorems from SciLean that LeanAgent proved to gain further key insights about its performance. During the lifelong learning process, LeanAgent demonstrated a stronger understanding than ReProver across a wide range of mathematical concepts from SciLean. These theorems primarily focus on:
a) Fundamental Algebraic Structures: LeanAgent proves basic algebraic operations and properties, such as SciLean.scalar_div_one, which proves that dividing any number by one yields the same number, SciLean.scalar_min_zero_one, which demonstrates the minimum value between 0 and 1 is 0, and Function.invFun.id_rule, which proves that the inverse of the identity function is the identity function itself.
<details>
<summary>extracted/6256159/carbon29.png Details</summary>

### Visual Description
## Screenshot: Code Snippet - Mathematical Theorem Proofs
### Overview
The image is a screenshot of a dark-themed code editor displaying three mathematical theorem proofs, likely written in a formal proof assistant language (possibly Lean). The code consists of theorem declarations, followed by proof steps using tactics like `simp`, `rw`, and `apply`.
### Components/Axes
There are no axes or traditional chart components. The screenshot primarily contains text-based code. The top of the image features three colored circles (red, orange, green) which appear to be UI elements, possibly status indicators.
### Detailed Analysis or Content Details
The code snippet contains the following theorems and proof steps:
1. **Theorem:** `scalar_div_one (x : R) : x / 1 = x := by`
* Proof step: `simp`
2. **Theorem:** `scalar_min_zero_one : min (0 : R) (1 : R) = 0 := by`
* Proof step: `rw [min_comm]`
* Proof step: `simp`
3. **Theorem:** `id_rule : invFun (fun (x : X) => x) = fun x => x := by`
* Proof step: `apply Function.invFun_comp`
* Proof step: `exact Function.injective_id`
The code uses type annotations (e.g., `x : R`, `x : X`) and mathematical operators (e.g., `/`, `min`, `=`). The `:= by` syntax indicates the start of a proof block. The tactics used (`simp`, `rw`, `apply`, `exact`) are standard commands in proof assistants to simplify expressions, rewrite using equalities, apply lemmas, and complete proofs, respectively.
### Key Observations
The code demonstrates a series of simple mathematical identities being formally proven. The theorems involve scalar division by one, the minimum of zero and one, and the identity function. The proofs are concise, utilizing basic simplification and rewriting tactics.
### Interpretation
The screenshot illustrates the process of formalizing mathematical reasoning using a proof assistant. The code represents a rigorous and machine-checkable verification of mathematical statements. The use of tactics allows for a step-by-step construction of the proof, ensuring its correctness. The theorems themselves are fundamental mathematical properties, suggesting the code might be part of a larger library or tutorial on mathematical foundations. The dark theme and code editor interface indicate a development environment geared towards formal verification and mathematical software.
</details>
b) Linear and Affine Maps: LeanAgent handles basic properties of linear and affine maps effectively, recognizing their structure in IsLinearMap.isLinearMap_apply, which proves the linearity of function applications, and IsAffineMap.IsAffineMap_apply, which demonstrates the affine property of function applications.
<details>
<summary>extracted/6256159/carbon30.png Details</summary>

### Visual Description
## Screenshot: Code Snippet - Mathematical Theorem Definitions
### Overview
The image is a screenshot of a code editor displaying mathematical theorem definitions, likely written in a formal proof language. The code appears to define theorems related to linear and affine maps. There are three colored circles at the top-left corner of the image, likely indicating the status of the editor or project.
### Components/Axes
There are no axes or charts in this image. The visible components are:
* **Colored Circles:** Three circles positioned at the top-left corner. From left to right, they are red, orange, and green.
* **Code Block 1:** Defines the `isLinearMap_apply` theorem.
* **Code Block 2:** Defines the `IsAffineMap_apply` theorem.
### Detailed Analysis or Content Details
The code blocks contain the following text:
**Code Block 1:**
```
theorem isLinearMap_apply (i : ι) : IsLinearMap R (fun f : ((i : ι) → E i) => f i) := by
  constructor
  all_goals aesop
```
**Code Block 2:**
```
theorem IsAffineMap_apply (i : ι) : IsAffineMap R (fun f : ((i : ι) → E i) => f i) := by
  constructor
  constructor
  simp
  simp
```
The code uses the following symbols and keywords:
* `theorem`: Indicates a theorem definition.
* `isLinearMap_apply`, `IsAffineMap_apply`: Names of the theorems.
* `(i : ι)`: A variable `i` of type `ι`.
* `: IsLinearMap R (fun f : ((i : ι) → E i) => f i)`: Specifies the statement of the theorem.
* `:= by`: Indicates the start of the theorem's proof.
* `constructor`: A tactic used in proof assistants to apply constructors.
* `all_goals aesop`: A tactic that applies the `aesop` automated theorem prover to all goals.
* `simp`: A tactic that simplifies the current goal using the available simplifications.
* `R`, `E`, `ι`: Symbols likely representing mathematical types or sets.
* `fun f : ((i : ι) → E i) => f i`: A function definition (applying `f` at the index `i`).
### Key Observations
* Both theorems define properties related to applying functions to elements of a type `ι`.
* The `isLinearMap_apply` theorem uses `all_goals aesop` for proof automation, while `IsAffineMap_apply` uses multiple `constructor` and `simp` tactics.
* The code is well-formatted and uses consistent indentation.
### Interpretation
The code snippet demonstrates the formal definition of theorems related to linear and affine maps within a proof assistant environment (likely Lean). The theorems state that applying a function `f` of a specific type results in a linear or affine map. The use of tactics like `constructor`, `aesop`, and `simp` indicates a process of automated and manual proof construction. The code suggests a mathematical context involving vector spaces or similar structures, where linear and affine transformations are fundamental concepts. The different proof strategies employed for each theorem might reflect the varying complexity of their proofs. The colored circles at the top-left likely represent the status of the project or editor, with red potentially indicating an error, orange a warning, and green a successful build or compilation.
</details>
c) Measure Theory Basics: LeanAgent starts grasping measure theory concepts, exemplified by SciLean.ite_pull_measureOf, which handles conditional measure selection between two measures based on a proposition, SciLean.Measure.prod_volume, which proves that the product of two volume measures is the volume measure itself, and SciLean.ite_pull_ennreal_toReal, which proves that conditionally pulling out an extended non-negative real and converting it to a real is equivalent to converting the individual components first.
<details>
<summary>extracted/6256159/carbon40.png Details</summary>

### Visual Description
## Code Snippet: Mathematical Theorems
### Overview
The image displays a code snippet, likely from a formal proof assistant system (such as Lean or Coq), containing three mathematical theorems. The code is presented in a monospaced font against a dark background. The theorems involve concepts from measure theory and real analysis.
### Components/Axes
There are no axes or traditional chart components. The content consists entirely of text-based code. The code is structured into three distinct theorem definitions, each separated by a blank line. Each theorem definition includes:
* `theorem` keyword
* Theorem name (e.g., `ite_pull_measureOf`)
* Type signature with arguments enclosed in square brackets (e.g., `{X} [MeasurableSpace X] (c : Prop) [Decidable c]`)
* Theorem body, defining the relationship between variables and functions.
* `:= by` followed by a proof tactic (e.g., `split_ifs <;> rfl`)
### Detailed Analysis or Content Details
**Theorem 1: `ite_pull_measureOf`**
```
theorem ite_pull_measureOf {X} [MeasurableSpace X] (c : Prop) [Decidable c]
    (μ v : Measure X) (A : Set X) :
    (if c then μ else v) A
    =
    (if c then μ A else v A) := by
  split_ifs <;> rfl
```
* **Arguments:**
* `{X}`: A type variable representing a set.
* `[MeasurableSpace X]`: A typeclass instance indicating that `X` is a measurable space.
* `(c : Prop)`: A proposition `c`.
* `[Decidable c]`: A typeclass instance indicating that `c` is decidable (i.e., we can determine whether it is true or false).
* `(μ v : Measure X)`: Two measures, `μ` and `v`, defined on the measurable space `X`.
* `(A : Set X)`: A measurable set `A` within the measurable space `X`.
* **Statement:** The theorem states that applying the conditional measure `(if c then μ else v)` to the set `A` is equivalent to applying the conditional measure to `A` individually for each measure `μ` and `v`.
* **Proof:** The proof uses the `split_ifs` tactic to split the goal into cases based on the truth value of `c`, and then uses the `rfl` (reflexivity) tactic to show that the two sides are equal in each case.
**Theorem 2: `Measure.prod_volume`**
```
theorem Measure.prod_volume {X Y} [MeasureSpace X] [MeasureSpace Y] :
    (Measure.prod (volume : Measure X) (volume : Measure Y)) = volume := by
  rfl
```
* **Arguments:**
* `{X Y}`: Type variables representing two sets.
* `[MeasureSpace X]`: A typeclass instance indicating that `X` is a measure space.
* `[MeasureSpace Y]`: A typeclass instance indicating that `Y` is a measure space.
* **Statement:** The theorem states that the product measure of the volumes of two measure spaces `X` and `Y` is equal to the volume itself.
* **Proof:** The proof uses the `rfl` tactic, indicating that the two sides are definitionally equal.
**Theorem 3: `ite_pull_ennreal_toReal`**
```
theorem ite_pull_ennreal_toReal (c : Prop) [Decidable c] (x y : ENNReal) :
    (if c then x else y).toReal
    =
    (if c then x.toReal else y.toReal) := by
  split_ifs <;> rfl
```
* **Arguments:**
* `(c : Prop)`: A proposition `c`.
* `[Decidable c]`: A typeclass instance indicating that `c` is decidable.
* `(x y : ENNReal)`: Two extended non-negative real numbers, `x` and `y`.
* **Statement:** The theorem states that converting the conditional extended non-negative real number `(if c then x else y)` to a real number is equivalent to converting each extended non-negative real number `x` and `y` individually and then applying the conditional.
* **Proof:** The proof uses the `split_ifs` tactic to split the goal into cases based on the truth value of `c`, and then uses the `rfl` tactic to show that the two sides are equal in each case.
### Key Observations
* The theorems all involve conditional statements (`if...then...else`) and measure-theoretic concepts.
* The proofs are very concise, relying heavily on tactics like `split_ifs` and `rfl`, which suggest that the underlying definitions are well-suited for automated reasoning.
* The use of typeclasses (`[MeasurableSpace X]`, `[Decidable c]`) indicates a strong emphasis on type safety and generality.
### Interpretation
The code snippet demonstrates a formalization of mathematical concepts within a proof assistant system. The theorems likely serve as building blocks for more complex proofs in measure theory and real analysis. The use of conditional statements and typeclasses allows for precise and general statements about mathematical objects. The concise proofs suggest that the system is capable of automating significant portions of the reasoning process. The theorems relate to the properties of measures, conditional measures, and conversions between extended real numbers and real numbers. The overall purpose is to establish rigorous mathematical truths in a machine-checkable format.
</details>
d) Floating-Point Operations: LeanAgent demonstrates an early grasp of floating-point representations and their correspondence to real numbers, shown by SciLean.re_float, proving that a floating-point number’s real-like part is itself.
<details>
<summary>extracted/6256159/carbon33.png Details</summary>

### Visual Description
## Screenshot: Code Snippet
### Overview
The image is a screenshot of a terminal or code editor window displaying a line of code, likely related to a formal verification or theorem proving system. The window has three colored circles in the top-left corner, acting as window controls.
### Components/Axes
The screenshot contains the following elements:
* **Window Controls:** Three colored circles (red, yellow, green) positioned in the top-left corner.
* **Code:** A single line of code is displayed in a monospaced font.
### Detailed Analysis or Content Details
The code snippet is as follows:
```
theorem re_float (a : Float) : RClike.re a = a := by
  exact RClike.re_eq_self_of_le le_rfl
```
Breaking down the code:
* `theorem re_float (a : Float) :` - This declares a theorem named `re_float` that takes a single argument `a` of type `Float`.
* `RClike.re a = a :=` - This is the statement of the theorem, asserting that applying the function `RClike.re` to `a` returns `a` unchanged; that is, `RClike.re` acts as the identity on `Float` values.
* `by exact RClike.re_eq_self_of_le le_rfl` - This provides the proof of the theorem. `exact` indicates that the theorem is proven directly by a previously defined lemma or fact. `RClike.re_eq_self_of_le` is likely a lemma stating that if `RClike.re a` is less than or equal to `a`, then `RClike.re a` equals `a`. `le_rfl` is likely a reflexivity lemma stating that any value is less than or equal to itself.
### Key Observations
The code snippet appears to be part of a formal verification system, possibly related to real numbers (`Float`) and a module or library named `RClike`. The theorem asserts a property about a function `RClike.re`, and the proof uses existing lemmas to establish the result.
### Interpretation
The code snippet demonstrates a simple theorem proving exercise. The theorem `re_float` likely aims to establish that a function `RClike.re` is an identity function for floating-point numbers within the context of the `RClike` library. The proof leverages existing lemmas (`RClike.re_eq_self_of_le` and `le_rfl`) to directly demonstrate the theorem's validity. This is a common pattern in formal verification, where complex properties are proven by breaking them down into smaller, previously established facts. The use of `Float` suggests the code is dealing with real number computations, and the `RClike` module might provide specific properties or operations for these numbers.
</details>
The proofs during this phase are characteristically concise, often using basic tactics like simp, rfl, or aesop that do not use premises. This suggests that LeanAgent recognizes these theorems are straightforward enough to prove without the complex retrieval of premises.
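For illustration, generic examples (not the paper's theorems) of goals that such premise-free tactics close directly in Lean 4, with `simp` relying only on its default lemma set:

```
-- Generic illustrations: goals closed without retrieving any premises.
example : 2 + 2 = 4 := by rfl
example (n : Nat) : n + 0 = n := by simp
```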
Crucially, by the end of the lifelong learning process, LeanAgent exhibits significant growth in its mathematical reasoning abilities on SciLean, just as it did with Mathematics in Lean Source:
a) Advanced Function Spaces: LeanAgent understands concepts in advanced function spaces, such as SciLean.ContCDiffMapFD_eta, which demonstrates the eta reduction property for continuously differentiable maps over finite dimensions.
<details>
<summary>extracted/6256159/carbon34.png Details</summary>

### Visual Description
## Screenshot: Mathematical Theorem Statement
### Overview
The image is a screenshot of a dark-themed text editor or terminal window displaying a mathematical theorem statement, likely from a formal verification or proof assistant system. The statement is written in a symbolic notation, and appears to be related to functional differentiation. There are three colored dots in the top-left corner.
### Components/Axes
There are no axes or traditional chart components. The visible elements are:
* **Colored Dots:** Three dots in the top-left corner, colored red, orange, and green (from left to right). Their purpose is unclear without further context.
* **Theorem Statement:** A single line of text containing a theorem name and its formal definition.
* **Commands:** Two lines of text below the theorem statement, "simp only [DFunLike.ext_iff]" and "aesop", which appear to be commands used within the system to simplify or prove the theorem.
### Detailed Analysis or Content Details
The theorem statement is:
`theorem ContCDiffMapFD_eta (f : X →FD[K,n] Y) : (fun x ↦FD[K,n] f x) = f := by`
The commands are:
`simp only [DFunLike.ext_iff]`
`aesop`
### Key Observations
The theorem name is `ContCDiffMapFD_eta`. The theorem involves a function `f` mapping from `X` to `Y` via the `→FD[K,n]` arrow. The statement asserts the equality of two expressions involving `f`. The commands suggest a simplification step using `DFunLike.ext_iff` and a subsequent proof attempt using `aesop`.
### Interpretation
The image represents a step in a formal mathematical proof or verification process. The theorem likely deals with continuous differentiation of a function `f` within a specific framework denoted by `FD[K,n]`. The `simp only` command indicates an attempt to simplify the theorem using a specific equality criterion (`DFunLike.ext_iff`). The `aesop` command suggests the use of an automated theorem prover to attempt to complete the proof. The colored dots in the top-left corner are likely status indicators or visual cues within the environment, but their exact meaning is unknown without additional context. The overall context suggests a highly technical and formal environment for mathematical reasoning.
</details>
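Eta reduction itself is definitional in Lean 4, so for ordinary functions the analogous statement (a generic illustration, not the paper's theorem) closes by rfl:

```
-- Eta: a function is definitionally equal to its eta-expansion.
example (f : Nat → Nat) : (fun x => f x) = f := rfl
```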
b) Sophisticated Bijections: LeanAgent grows in its ability to work with product spaces and bijections, proving theorems such as Function.Bijective.Prod.mk.arg_fstsnd.Bijective_rule_simple’, which proves the bijectivity of a function that swaps the elements of a product space, and Function.Bijective.Equiv.invFun.arg_a0.Bijective_rule, which proves that composing a bijection with the inverse of an equivalence remains bijective. Crucially, while these theorems might seem simple, they demonstrate LeanAgent’s capacity for abstract algebraic reasoning.
<details>
<summary>extracted/6256159/carbon39.png Details</summary>

### Visual Description
## Screenshot: Code Snippet - Theorem Definitions
### Overview
The image is a screenshot of a code editor displaying two theorem definitions, likely from a formal proof assistant system like Lean. The code appears to be related to mathematical logic and function properties (bijectivity). The background is dark, and the text is light-colored for readability. There are three colored dots in the top-left corner (red, orange, green).
### Components/Axes
There are no axes or traditional chart components. The screenshot consists entirely of text-based code. The key elements are:
* **Theorem Names:** `Prod.mk.arg_fstsnd.Bijective_rule_simple'` and `Equiv.invFun.arg_a0.Bijective_rule`
* **Type Signatures:** These define the input and output types of the functions involved. They use notation like `X × Y`, `(xy.2, xy.1)`, `Y ≃ Z`, `X → Z`.
* **Proof Commands:** Commands like `:= by`, `constructor <;>`, `intro h`, `all_goals aesop`, `convert hf`, `simp [hf]`, `exact f.symm.bijective.comp hf` are used to construct the formal proof.
### Detailed Analysis or Content Details
**Theorem 1: `Prod.mk.arg_fstsnd.Bijective_rule_simple'`**
* **Statement:** `Bijective (fun xy : X × Y => (xy.2, xy.1))`
* **Proof:**
* `:= by` - Indicates the start of the proof.
* `constructor <;>` - Likely uses a constructor to build the proof.
* `intro h` - Introduces a hypothesis `h`.
* `all_goals aesop` - Applies the `aesop` tactic to all proof goals.
**Theorem 2: `Equiv.invFun.arg_a0.Bijective_rule`**
* **Statement:** `Bijective (fun x => f.invFun (g x))`
* **Context:** `(f : Y ≃ Z) (g : X → Z) (hf : Bijective g)`
* **Proof:**
* `:= by` - Indicates the start of the proof.
* `convert hf` - Attempts to convert the current goal to match the hypothesis `hf`.
* `simp [hf]` - Simplifies the goal using the hypothesis `hf`.
* `exact f.symm.bijective.comp hf` - Completes the proof using a composition of functions and the hypothesis `hf`.
### Key Observations
* The code uses a formal, symbolic notation common in proof assistants.
* The theorems relate to the bijectivity (one-to-one correspondence) of functions.
* The proofs are concise and rely on automated tactics (`aesop`, `simp`) and strategic conversions.
* The colored dots in the top-left corner do not appear to be part of the code itself and are likely indicators of the editor's state or configuration.
### Interpretation
The screenshot demonstrates a snippet of formal mathematical reasoning. The theorems define and prove properties of functions, specifically their bijectivity. The use of a proof assistant allows for rigorous verification of these properties. The code is not directly interpretable without understanding the underlying mathematical concepts and the specific syntax of the proof assistant being used. The `aesop` tactic suggests an attempt to automate parts of the proof process, while the `convert` and `simp` commands indicate manual steps to transform the goal into a solvable form. The `exact` command signifies the completion of the proof by matching the current goal to a known result. The presence of type signatures and hypotheses highlights the importance of precise definitions and assumptions in formal proofs.
</details>
c) Abstract Algebraic Structures: LeanAgent proves further abstract algebraic properties, including SciLean.CDifferentiable.id_rule, which proves that the identity function is continuously differentiable.
<details>
<summary>extracted/6256159/carbon36.png Details</summary>

### Visual Description
## Screenshot: Code Snippet - Theorem Definition
### Overview
The image is a screenshot displaying a code snippet, likely from a formal verification or proof assistant system. The code defines a theorem related to differentiability. The snippet is presented within a dark-themed interface with a rounded rectangular background.
### Components/Axes
The image contains the following elements:
* **Colored Dots (Top-Left):** Three dots are present in the top-left corner: red, orange, and green. Their purpose is unclear without further context, but they may indicate status or selection.
* **Code Block:** A block of text representing the theorem definition.
* **Background:** A dark gray/black rounded rectangle.
### Detailed Analysis or Content Details
The code snippet reads as follows:
```
theorem CDifferentiable.id_rule : CDifferentiable K (fun x : X => x) := by
  intro x
  unfold SciLean.CDifferentiableAt
  tauto
```
Breaking down the code:
* `theorem CDifferentiable.id_rule :` - This declares a theorem named `id_rule` within the `CDifferentiable` namespace.
* `CDifferentiable K (fun x : X => x) :=` - This specifies the theorem's statement. It asserts that the function `fun x : X => x` (the identity function) is differentiable with respect to `K`.
* `by` - This keyword introduces the proof of the theorem.
* `intro x` - This introduces a variable `x` into the proof context.
* `unfold SciLean.CDifferentiableAt` - This unfolds the definition of `CDifferentiableAt` (likely a predicate defining differentiability). The `SciLean.` prefix indicates this definition is part of the `SciLean` library.
* `tauto` - This command attempts to prove the theorem by showing that it is a tautology (always true).
### Key Observations
The code snippet defines a fundamental property of differentiable functions: the identity function is differentiable. The use of `tauto` suggests a relatively straightforward proof. The `SciLean` namespace indicates the code is likely part of a larger formalization effort.
### Interpretation
This code snippet demonstrates a formal statement and proof of a basic mathematical property within a formal verification system. The theorem states that the identity function is differentiable. The proof relies on unfolding the definition of differentiability and then applying a tautology check. This suggests the system is capable of automated or semi-automated theorem proving. The use of a formal system allows for rigorous verification of mathematical properties, which is crucial in areas like software and hardware verification. The `SciLean` namespace suggests this is part of a larger project or library focused on differentiable programming or related concepts.
</details>
d) Data Structures in Mathematics: LeanAgent navigates proofs involving array types in a mathematical context, proving theorems such as SciLean.ArrayType.ext, which proves that two arrays are equal if their elements are equal at all indices.
<details>
<summary>extracted/6256159/carbon37.png Details</summary>

### Visual Description
## Screenshot: Code Snippet
### Overview
The image is a screenshot of a code editor displaying a theorem definition, likely from a formal verification or proof assistant system. The background is a dark grey/blue. There are three colored dots in the top-left corner: red, yellow, and green.
### Components/Axes
There are no axes or traditional chart components. The key elements are:
* **Colored Dots:** Red, Yellow, Green (top-left corner) - their purpose is unclear without further context.
* **Code Block:** A block of text representing a theorem definition.
### Detailed Analysis or Content Details
The code block contains the following text:
```
theorem ext (x y : Cont) : (∀ i, x[i] = y[i]) -> x = y := by
  intro h
  apply SciLean.ArrayType.get_injective
  simp only [h]
```
Breaking down the code:
* `theorem ext (x y : Cont) :` - Defines a theorem named "ext" that takes two arguments, `x` and `y`, both of type `Cont`.
* `(∀ i, x[i] = y[i])` - The hypothesis of the theorem: for all indices `i`, the elements at index `i` in `x` and `y` are equal.
* `-> x = y` - The conclusion of the theorem: `x` is equal to `y`.
* `:= by` - Indicates the start of the proof.
* `intro h` - Introduces the hypothesis `(∀ i, x[i] = y[i])` as `h`.
* `apply SciLean.ArrayType.get_injective` - Applies a lemma named `get_injective` from the `SciLean.ArrayType` module, which likely states that the element-access function on arrays is injective. Applying it reduces the goal `x = y` to showing that the two arrays' access functions agree.
* `simp only [h]` - Simplifies the expression using the hypothesis `h`.
### Key Observations
The code defines an extensionality theorem for a type `Cont` that appears to represent arrays or sequences. The theorem states that if two arrays have the same elements at every index, then they are equal. The proof uses the `get_injective` lemma, which suggests that the equality of elements at each index implies the equality of the arrays themselves.
### Interpretation
This code snippet demonstrates a fundamental concept in mathematics and computer science: extensionality. Extensionality states that two objects are equal if and only if they have the same properties. In this case, the property is the value of the element at each index. The use of a formal proof assistant (indicated by the `:= by` syntax) suggests that this theorem is part of a larger, formally verified system. The `SciLean.ArrayType` module indicates that the theorem is specifically tailored to arrays or array-like structures. The colored dots in the top-left corner are likely status indicators or visual cues within the code editor, but their specific meaning is not apparent from the image alone.
</details>
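The theorem is an array analogue of function extensionality; for plain functions, a minimal Lean sketch (illustrative, not the paper's code) looks like this:

```
-- Extensionality for functions: pointwise equality implies equality.
example (x y : Nat → Nat) (h : ∀ i, x i = y i) : x = y := funext h
```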
LeanAgent proved this theorem about a traditionally computer-science-oriented data structure through a mathematical lens, which some other setups could not do.
The proofs at this stage are more sophisticated, involving multiple steps and combining various mathematical concepts. This indicates a deeper ability to connect different areas of mathematics.
The progression from basic algebraic structures to advanced function spaces and data structures (like array types) shows that LeanAgent is bridging the gap between pure mathematical concepts and their applications in computational mathematics. Furthermore, the progression from basic integral manipulations to advanced function spaces indicates that LeanAgent is improving its premise selection over time. It learns to identify and apply more sophisticated mathematical structures and theorems as premises.
By the end of lifelong learning, the sorry theorems that LeanAgent proves from SciLean are almost entirely a superset of those that the other setups prove. This corroborates our previous assertion from our analysis of the sorry theorems from Mathematics in Lean Source that LeanAgent’s lifelong learning setup provides it with continuously improving capabilities to reason about premises and proofs that outperform other setups.
Crucially, LeanAgent could prove re_float during lifelong learning while no other setup could. This indicates that LeanAgent’s more measured and stable approach allowed it to grasp floating-point representations and their relation to the reals, whereas the other, less stable setups deprioritized such concepts. This may also suggest that continuous improvement allowed LeanAgent to absorb new and unique concepts from new repositories.
We gain interesting insights by comparing LeanAgent’s performance over time on SciLean with that of other setups. For example, the fact that ReProver could not prove the trivial theorem SciLean.ite_pull_ennreal_toReal while LeanAgent could, even during lifelong learning, suggests that this baseline struggles with foundational concepts. LeanAgent’s better grasp of these concepts stems from its increased stability and its improvement from learning new tasks, as shown in the lifelong learning metric analysis in Sec. 4.3.
Furthermore, an intriguing observation is that Setup 3 could prove Function.Bijective.Prod.mk.arg_fstsnd.Bijective_rule_simple’ during lifelong learning, whereas LeanAgent could only prove this theorem at the end of lifelong learning. This suggests that Setup 3 is more plastic during lifelong learning, while LeanAgent remains more stable, corroborating the analysis from the lifelong learning metrics. However, this plasticity again comes at the cost of grasping basic theorems. For example, Setup 3 cannot prove the trivial measure theory theorem SciLean.Measure.prod_volume, which LeanAgent could prove during lifelong learning, again suggesting that Setup 3 favors plasticity over stability.
An interesting case is that LeanAgent can prove SciLean.norm₂_scalar during lifelong learning but not SciLean.norm2_scalar. Conversely, Setup 2 proved SciLean.norm2_scalar but not SciLean.norm₂_scalar.
<details>
<summary>extracted/6256159/carbon64.png Details</summary>

### Visual Description
## Screenshot: Code Snippet - Mathematical Theorem Proof
### Overview
The image is a screenshot of a terminal window displaying a code snippet, likely from a formal verification or proof assistant system (possibly Lean). The code appears to define and prove a theorem related to the norm of a scalar multiplied by a real number. The code is primarily in English, with some mathematical symbols and keywords specific to the system.
### Components/Axes
The screenshot contains the following elements:
* **Terminal Window:** The overall container for the code.
* **Code Blocks:** Two distinct blocks of code representing the theorem statement and its proof.
* **Text:** The code itself, consisting of keywords, mathematical notation, and identifiers.
* **Indicator Lights:** Three colored lights (red, yellow, green) in the top-left corner of the window, likely indicating the status of the system.
### Detailed Analysis or Content Details
The code consists of two blocks. Let's transcribe them precisely:
**Block 1:**
```
theorem norm2_scalar {R} [RealScalar R] (x : R) :
    ||x||2[R] = Scalar.abs x := by
  rw [SciLean.scalar_norm]
```
**Block 2:**
```
theorem norm2_scalar {R} [RealScalar R] (x : R) :
    ||x||2[R] = x^2 := by
  symm
  simp [sq]
  congr
  simp
```
**Explanation of the code:**
* `theorem norm2_scalar {R} [RealScalar R] (x : R) :` declares a theorem named `norm2_scalar`. `{R}` indicates a type parameter `R`. `[RealScalar R]` is a type class constraint, meaning `R` must be a type that supports real scalar operations. `(x : R)` declares a variable `x` of type `R`.
* `||x||2[R] = Scalar.abs x := by` is the theorem statement. It states that the 2-norm of `x` (denoted `||x||2[R]`) is equal to the absolute value of `x` (denoted `Scalar.abs x`). `:= by` introduces the proof.
* `rw [SciLean.scalar_norm]` is a rewrite rule, applying the definition or a previously proven lemma named `SciLean.scalar_norm`.
* `||x||2[R] = x^2 := by` is the theorem statement. It states that the 2-norm of `x` (denoted `||x||2[R]`) is equal to x squared (denoted `x^2`). `:= by` introduces the proof.
* `symm` is a tactic that symmetrizes the goal (swaps the left and right sides of an equation).
* `simp [sq]` is a simplification tactic, applying simplifications based on the `sq` lemma (likely representing the squaring operation).
* `congr` is a congruence tactic, which allows applying a simplification to both sides of an equation.
* `simp` is a general simplification tactic.
### Key Observations
The code demonstrates a proof attempt for a theorem relating the 2-norm of a real number to its absolute value and its square. The first block uses a rewrite rule `SciLean.scalar_norm`, while the second block uses symmetry and simplification tactics. The two blocks appear to be different attempts or stages in the proof.
### Interpretation
The code snippet suggests a formal verification or mathematical proof process. The theorem being proven is a fundamental property of norms in real analysis. The use of tactics like `rw`, `symm`, `simp`, and `congr` indicates a structured, rule-based approach to proof construction. The presence of type classes (`RealScalar`) suggests a strong emphasis on type safety and correctness. The two blocks likely represent different approaches to proving the same theorem, or potentially a refinement of the proof strategy. The indicator lights at the top-left corner suggest the system is running and potentially evaluating the proof. The fact that the code is displayed in a terminal window suggests it's being interacted with directly by a user or as part of an automated process.
</details>
Setup 2 uses the sq premise from mathlib4, which states that the square of an element is the same as multiplying that element by itself, while LeanAgent used the SciLean.scalar_norm premise from SciLean, which states that the 2-norm of a real scalar is equal to the absolute value of that scalar. This suggests that LeanAgent prefers to use new premises if possible rather than simplistic pre-trained ones. This also makes sense since Setup 2 uses popularity order, which generally provides poor stability and plasticity.
PFR. Crucially, LeanAgent is the only setup to prove a sorry theorem from PFR. We can prove condRho_of_translate, which states a form of a translation invariance property of randomness measures.
<details>
<summary>extracted/6256159/carbon44.png Details</summary>

### Visual Description
## Screenshot: Code Snippet
### Overview
The image is a screenshot of a dark-themed code editor displaying a line of code, likely from a formal verification or mathematical software environment. There are three colored dots in the top-left corner.
### Components/Axes
There are no axes or traditional chart components. The visible elements are:
* **Code:** A single line of code.
* **Colored Dots:** Three dots positioned in the top-left corner: Red, Orange, and Green (from left to right).
### Detailed Analysis or Content Details
The code snippet reads:
```
lemma condRho_of_translate {Ω S : Type*} [MeasureSpace Ω] (X : Ω → G) (Y : Ω → S)
    (A : Finset G) (s : G) : condRho (fun ω ↦ X ω + s) Y A = condRho X Y A := by
  simp only [condRho, rho_of_translate]
```
The code appears to define a lemma named `condRho_of_translate`. It involves types `Type`, `MeasureSpace`, `Finset`, and functions `condRho`, `rho_of_translate`. The code uses symbolic notation common in formal mathematics and computer science.
### Key Observations
* The code is syntactically correct, suggesting it's from a well-defined programming or formal language.
* The presence of `lemma` indicates this is likely part of a proof or theorem within a formal system.
* The use of `simp only` suggests a simplification step in a proof.
### Interpretation
The code snippet defines a lemma relating the conditional rho functional (`condRho`) of a function before and after a translation. Specifically, it states that `condRho` applied to `fun ω ↦ X ω + s`, with respect to `Y` and `A`, equals `condRho` applied to `X` with respect to `Y` and `A`; that is, translating `X` by a constant `s` leaves the value unchanged. The `simp only` command indicates that the proof of this lemma relies on simplifying the expression using the definitions of `condRho` and `rho_of_translate`.
The colored dots in the top-left corner are likely status indicators or visual cues within the code editor. Without further context, their exact meaning is unclear, but they could represent:
* Compilation status (Red: Error, Orange: Warning, Green: Success)
* Debugging state
* Version control status
</details>
LeanAgent’s proof shows that the result is straightforward after expanding the definition of condRho and considering how rho, the randomness measure, behaves under translation (as captured by the rho_of_translate lemma). The fact that LeanAgent could trivially identify such a short proof using existing premises, while the maintainers of the PFR repository had not, suggests the power of our approach. By learning the foundations of the PFR lemmas with its improved stability, LeanAgent was able to grasp the basic definitions involved.
However, LeanAgent and other setups could not prove any PFR theorems after lifelong learning. This suggests that LeanAgent requires more training time or data to further strengthen its knowledge in this new area of mathematics.
Alternate PFR Commit. We also analyze whether LeanAgent can generalize to a different commit of a repository in the curriculum. We choose commit 861715b9bf9482d2442760169cb2a3ff54091f75 because PFR/RhoFunctional.lean, the file from which we proved condRho_of_translate in the newer commit, did not exist in the old commit. This made the sorry theorems in the old commit more distinct. LeanAgent proved two theorems, including multiDist_copy, about the equality of distributions when copying random variables across different measure spaces, which ReProver could not prove. LeanAgent achieved this with just the tactic rfl. However, both statements concern multiDist, whose definition is itself a sorry placeholder. Since any two instances of sorry are definitionally equal, making rfl succeed, LeanAgent exploits this technicality to prove the two theorems.
<details>
<summary>extracted/6256159/carbon45.png Details</summary>

### Visual Description
## Code Block: Lemma Definitions
### Overview
The image displays a code block containing two lemma definitions written in a formal language, likely a proof assistant like Coq or Lean. The code defines mathematical concepts related to measure spaces and distributions. The code is presented in a monospaced font on a dark background.
### Components/Axes
There are no axes or charts in this image. The components are purely textual. The code consists of:
* Lemma names: `multiDist_copy` and `multiDist_of_perm`
* Variable declarations with type annotations (e.g., `{m:ℕ}`, `(i : Fin m)`)
* Hypotheses (e.g., `hΩ`, `hΩ'`)
* Goal statements (e.g., `D[X ; hΩ] = D[X' ; hΩ']`)
* Tactics (e.g., `rfl`)
* Mathematical symbols (e.g., `∀`, `∈`, `MeasureSpace`, `IdentDistrib`, `volume`)
There are three colored dots at the top-left of the image: red, orange, and green. These do not appear to be related to the code itself.
### Detailed Analysis or Content Details
**Lemma `multiDist_copy`:**
```
lemma multiDist_copy {m : ℕ} {Ω : Fin m → Type} {Ω' : Fin m → Type}
    (hΩ : (i : Fin m) → MeasureSpace (Ω i))
    (hΩ' : (i : Fin m) → MeasureSpace (Ω' i))
    (X : (i : Fin m) → (Ω i) → G)
    (X' : (i : Fin m) → (Ω' i) → G)
    (hident : ∀ i, IdentDistrib (X i) (X' i) (hΩ i).volume (hΩ' i).volume) :
    D[X ; hΩ] = D[X' ; hΩ'] := by
  rfl
```
* `{m:ℕ}`: An implicit natural number `m`.
* `{Ω : Fin m → Type}`: A function mapping elements of `Fin m` to types.
* `{Ω' : Fin m → Type}`: A function mapping elements of `Fin m` to types.
* `hΩ : (i : Fin m) → MeasureSpace (Ω i)`: A hypothesis stating that for each `i` in `Fin m`, `Ω i` is a measure space.
* `hΩ' : (i : Fin m) → MeasureSpace (Ω' i)`: A hypothesis stating that for each `i` in `Fin m`, `Ω' i` is a measure space.
* `X : (i : Fin m) → (Ω i) → G`: A function mapping each `i` in `Fin m` and an element of `Ω i` to an element of type `G`.
* `X' : (i : Fin m) → (Ω' i) → G`: A function mapping each `i` in `Fin m` and an element of `Ω' i` to an element of type `G`.
* `hident: ∀ i, IdentDistrib (X i) (X' i) (hΩ i).volume (hΩ' i).volume`: A hypothesis stating that for all `i`, `X i` and `X' i` are identically distributed with respect to the volumes of `hΩ i` and `hΩ' i`.
* `D[X ; hΩ] = D[X' ; hΩ']`: The goal statement, asserting that the distribution of `X` with respect to `hΩ` is equal to the distribution of `X'` with respect to `hΩ'`.
* `rfl`: A tactic indicating that the goal is proven by reflexivity (i.e., the two sides are syntactically equal).
**Lemma `multiDist_of_perm`:**
```
lemma multiDist_of_perm {m : ℕ} {Ω : Fin m → Type}
    (hΩ : (i : Fin m) → MeasureSpace (Ω i))
    (X : (i : Fin m) → (Ω i) → G) (φ : Equiv.Perm (Fin m)) :
    D[X ; hΩ] = D[fun i → X (φ i); fun i → hΩ (φ i)] := by
  rfl
```
* `{m:ℕ}`: An implicit natural number `m`.
* `{Ω : Fin m → Type}`: A function mapping elements of `Fin m` to types.
* `hΩ : (i : Fin m) → MeasureSpace (Ω i)`: A hypothesis stating that for each `i` in `Fin m`, `Ω i` is a measure space.
* `X : (i : Fin m) → (Ω i) → G`: A function mapping each `i` in `Fin m` and an element of `Ω i` to an element of type `G`.
* `φ : Equiv.Perm (Fin m)`: A permutation of the indices `Fin m`.
* `D[X ; hΩ] = D[fun i → X (φ i); fun i → hΩ (φ i)]`: The goal statement, asserting that the distribution of `X` with respect to `hΩ` is equal to the distribution of `X` composed with `φ` with respect to `hΩ` composed with `φ`.
* `rfl`: A tactic indicating that the goal is proven by reflexivity.
### Key Observations
The code defines lemmas related to the preservation of distributions under certain transformations. `multiDist_copy` deals with changing the underlying measure spaces while maintaining identical distributions, and `multiDist_of_perm` deals with permuting the indices. The use of the `rfl` tactic suggests that these lemmas are based on definitional equality.
### Interpretation
These lemmas are likely part of a larger theory of probability and measure theory within a formal proof system. They establish fundamental properties of distributions, allowing for reasoning about their behavior under changes in measure spaces or index permutations. The `IdentDistrib` predicate in `multiDist_copy` is crucial, ensuring that the distributions are truly equivalent before asserting equality. The `Equiv.Perm` type in `multiDist_of_perm` indicates that the permutation is an equivalence, preserving the structure of the underlying space. These lemmas are building blocks for more complex proofs involving distributions and measure spaces. The code is highly abstract and requires a strong background in mathematical logic and measure theory to fully understand.
</details>
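The rfl technicality can be reproduced in a minimal sketch (hypothetical names, not from PFR): a definition whose body is sorry elaborates to a placeholder term that ignores its arguments, so any two applications of it are definitionally equal.

```
-- Hypothetical sketch: `dStub a` and `dStub b` both unfold to the same
-- `sorry` placeholder, so `rfl` closes the goal even though a and b differ.
def dStub (x : Nat) : Nat := sorry

example (a b : Nat) : dStub a = dStub b := by rfl
```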
This commit also provides another interesting example. The PFR maintainers used 0 = 1 as a placeholder statement for some sorry theorems, five of which are multiTau_min_exists, multiTau_min_sum_le, sub_multiDistance_le, sub_condMultiDistance_le, and sub_condMultiDistance_le’. Interestingly, LeanAgent finds unintended constructs and proves these theorems with this proof:
<details>
<summary>extracted/6256159/carbon46.png Details</summary>

### Visual Description
## Screenshot: Code Snippet
### Overview
The image is a screenshot of a code editor displaying a lemma definition, likely from a formal verification or theorem proving system. The code is presented against a dark background with a light-colored text. There are three colored circles at the top-left corner of the image.
### Components/Axes
The image does not contain axes or charts. It consists of:
* **Colored Circles:** Three circles positioned at the top-left corner. From left to right, they are red, orange, and green.
* **Code Block:** A block of text representing code.
### Detailed Analysis or Content Details
The code block contains the following text:
```
lemma multiTau_min_exists : 0 = 1 := by
  nontriviality
  simp
  apply @zero_ne_one N_
  exact multdist_eq_zero
```
* `lemma multiTau_min_exists : 0 = 1 := by`: This line defines a lemma named `multiTau_min_exists` and asserts that `0 = 1`. The `:= by` indicates the start of the proof.
* `nontriviality`: This is a tactic, likely used to ensure the proof is not trivial.
* `simp`: This is a simplification tactic, attempting to simplify the goal.
* `apply @zero_ne_one N_`: This line applies a lemma or theorem named `zero_ne_one` to the current goal, parameterized by `N_`.
* `exact multdist_eq_zero`: This line completes the proof by showing that the current goal is equivalent to `multdist_eq_zero`.
### Key Observations
The code attempts to prove a contradiction (0 = 1). The tactics used suggest a formal proof environment where automated simplification and lemma application are employed. The presence of `nontriviality` suggests the system is designed to avoid trivial proofs.
### Interpretation
The code snippet demonstrates a proof attempt within a formal system. The goal is to prove the statement `0 = 1`, which is inherently false in standard mathematical systems. The tactics used (`nontriviality`, `simp`, `apply`, `exact`) are common in interactive theorem provers like Lean, Coq, or Isabelle. The code likely represents a test case or an example demonstrating the system's capabilities, or a deliberate attempt to find a flaw in the system's reasoning. The colored circles at the top-left might be indicators of the status of the code (e.g., compilation status, testing status). Without further context, it's difficult to determine the exact purpose of this code snippet.
</details>
It combines the known fact that $0 \neq 1$ with the placeholder theorem multdist_eq_zero, which states $0 = 1$. As such, it proves False, from which it can derive anything, including $0 = 1$.
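The mechanism can be illustrated with a self-contained sketch of our own (not the PFR code; assumes Mathlib): a hypothesis asserting $0 = 1$ contradicts zero_ne_one, and from the resulting contradiction any goal follows.

```lean
-- A hypothesis stating 0 = 1 is refuted by `zero_ne_one`; `absurd` turns the
-- contradiction into a proof of an arbitrary proposition P (ex falso quodlibet).
example (P : Prop) (h : (0 : ℕ) = 1) : P :=
  absurd h zero_ne_one
```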
MiniF2F. This repository consists of a validation and test split. Prior work evaluated performance via the number of sorry theorems proved and a Pass@k metric on the test split (Yang et al., 2023). However, given that LeanAgent is a framework, not a model, quantitatively comparing it with existing methods using a Pass@k metric is misleading. In line with our existing setups, we do not treat MiniF2F as a benchmark. Instead, we disregard its existing splits and compare LeanAgent's proven sorry theorems with those of ReProver.
LeanAgent can prove 99 theorems, while ReProver can only prove 85. As mentioned in Sec. 4.2, we append MiniF2F to the initial curriculum to demonstrate the use case of formalizing a new repository in parallel with the ones in the starting curriculum. As such, interesting observations can be made by comparing ReProver (starting point) with LeanAgent (ending point). The types of sorry theorems LeanAgent can prove demonstrate its increasing understanding relative to ReProver in complex mathematical concepts due to lifelong learning.
ReProver initially demonstrated proficiency in a range of foundational mathematical areas on MiniF2F:
a) Basic Arithmetic and Number Theory: ReProver could handle simple arithmetic and modular arithmetic problems, such as mathd_numbertheory_254, a theorem about modular arithmetic and basic addition; mathd_numbertheory_342, a theorem about basic divisibility; and mathd_algebra_304, a theorem about simple exponentiation. These proofs rely only on the norm_num tactic, which evaluates arithmetic expressions. This suggests less sophisticated mathematical proving ability at the start of lifelong learning.
<details>
<summary>Proofs of mathd_numbertheory_254, mathd_numbertheory_342, and mathd_algebra_304</summary>

```lean
theorem mathd_numbertheory_254 : (239 + 174 + 83) % 10 = 6 := by norm_num

theorem mathd_numbertheory_342 : 54 % 6 = 0 := by norm_num

theorem mathd_algebra_304 : 91 ^ 2 = 8281 := by norm_num
```
</details>
b) Elementary Algebra: ReProver could solve basic algebraic equations and perform straightforward manipulations, such as mathd_algebra_141, which proves a statement about quadratic expressions, mathd_algebra_329, which shows a grasp of systems of linear equations, and mathd_algebra_547, which proves basic algebraic manipulation with roots.
<details>
<summary>Proofs of mathd_algebra_141, mathd_algebra_329, and mathd_algebra_547</summary>

```lean
theorem mathd_algebra_141 (a b : ℝ) (h₁ : a * b = 180) (h₂ : 2 * (a + b) = 54) :
    a ^ 2 + b ^ 2 = 369 := by
  nlinarith

theorem mathd_algebra_329 (x y : ℝ) (h₀ : 3 * y = x) (h₁ : 2 * x + 5 * y = 11) :
    x + y = 4 := by
  linarith

theorem mathd_algebra_547 (x y : ℝ) (h₀ : x = 5) (h₁ : y = 2) :
    Real.sqrt (x ^ 3 - 2 ^ y) = 11 := by
  simp [h₀, h₁, sq]
  rw [Real.sqrt_eq_iff_sq_eq] <;> norm_num
```
</details>
c) Basic Calculus and Analysis: ReProver showed early capabilities in dealing with logarithms and exponentials, including mathd_algebra_484, a theorem involving dividing logarithmic expressions.
<details>
<summary>Proof of mathd_algebra_484</summary>

```lean
theorem mathd_algebra_484 : Real.log 27 / Real.log 3 = 3 := by
  field_simp
  rw [← Real.log_rpow]
  all_goals norm_num
```
</details>
Notably, the proofs at this stage were characteristically concise, often using basic tactics like norm_num, linarith, and field_simp. This suggests that ReProver recognized these theorems as straightforward enough to prove without complex retrieval of premises, similar to its behavior with previous repositories.
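For reference, the behavior of these tactics can be sketched with minimal goals of our own (assuming Mathlib; the first two restate statements shown above):

```lean
-- norm_num evaluates concrete arithmetic expressions.
example : (239 + 174 + 83) % 10 = 6 := by norm_num

-- linarith closes goals that follow from linear arithmetic over the hypotheses.
example (x y : ℝ) (h₀ : 3 * y = x) (h₁ : 2 * x + 5 * y = 11) : x + y = 4 := by
  linarith

-- field_simp clears denominators, using nonzeroness facts from the context.
example (a b : ℝ) (hb : b ≠ 0) : a / b * b = a := by field_simp
```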
However, by the end of the lifelong learning process, LeanAgent exhibited significant growth in its mathematical reasoning abilities on MiniF2F:
a) Advanced Number Theory: LeanAgent showed a more advanced grasp of number theory, proving theorems like mathd_numbertheory_293, a theorem about divisibility of a compound expression under a bound constraint, and mathd_numbertheory_233, a theorem dealing with modular arithmetic in $\text{ZMod}(11^{2})$.
<details>
<summary>Proofs of mathd_numbertheory_293 and mathd_numbertheory_233</summary>

```lean
theorem mathd_numbertheory_293 (n : ℕ) (h₀ : n ≤ 9)
    (h₁ : 11 ∣ 20 * 100 + 10 * n + 7) : n = 5 := by
  omega

theorem mathd_numbertheory_233 (b : ZMod (11 ^ 2)) (h₀ : b = 24⁻¹) :
    b = 116 := by
  exact h₀
```
</details>
b) Sophisticated Algebra: LeanAgent showed a better grasp of more complex algebraic manipulations. Theorems include mathd_algebra_148, which involves function definitions and solving for unknown coefficients, and amc12a_2016_p3, which evaluates a floor-based function at specific rational arguments.
<details>
<summary>Proofs of mathd_algebra_148 and amc12a_2016_p3</summary>

```lean
theorem mathd_algebra_148 (c : ℝ) (f : ℝ → ℝ)
    (h₀ : ∀ x, f x = c * x ^ 3 - 9 * x + 3) (h₁ : f 2 = 9) : c = 3 := by
  linarith [h₀ 2]

theorem amc12a_2016_p3 (f : ℝ → ℝ → ℝ)
    (h₀ : ∀ x, ∀ y, y ≠ 0 → f x y = x - y * Int.floor (x / y)) :
    f (3 / 8) (-2 / 5) = -(1 / 40) := by
  norm_num [h₀]
  field_simp
  norm_cast
```
</details>
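The key step in mathd_algebra_148 is instantiating the universally quantified hypothesis at $x = 2$ and passing the resulting linear fact to linarith. The same pattern on a simpler, hypothetical lemma of our own (assumes Mathlib):

```lean
-- `h₀ 2` specializes the hypothesis to x = 2; `linarith` then solves the
-- resulting linear system { f 2 = c * 2 + 1, f 2 = 7 } for c.
example (c : ℝ) (f : ℝ → ℝ) (h₀ : ∀ x, f x = c * x + 1) (h₁ : f 2 = 7) :
    c = 3 := by
  linarith [h₀ 2]
```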
c) Advanced Calculus and Analysis: LeanAgent demonstrated improved capabilities in handling more complex analytical problems, including mathd_algebra_270, a theorem involving function composition and rational expressions.
<details>
<summary>Proof of mathd_algebra_270</summary>

```lean
theorem mathd_algebra_270 (f : ℝ → ℝ)
    (h₀ : ∀ x, x ≠ -2 → f x = 1 / (x + 2)) :
    f (f 1) = 3 / 7 := by
  set_option tactic.skipAssignedInstances false in norm_num [h₀]
```
</details>
d) Complex Induction: LeanAgent became adept at more advanced induction proofs. An example is induction_12dvd4expnp1p20, a theorem about divisibility that requires an induction proof.
<details>
<summary>Proof of induction_12dvd4expnp1p20</summary>

```lean
theorem induction_12dvd4expnp1p20 (n : ℕ) : 12 ∣ 4 ^ (n + 1) + 20 := by
  norm_num
  induction' n with hn
  simp
  omega
```
</details>
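The structural pattern of such induction proofs can be illustrated with a small example of our own (the function double below is hypothetical, defined only for this sketch; assumes a recent Lean 4 toolchain with Mathlib):

```lean
-- A toy recursive function, used only to illustrate the induction pattern.
def double : ℕ → ℕ
  | 0 => 0
  | n + 1 => double n + 2

-- Structural induction on n: a base case (zero) and an inductive step (succ)
-- that uses the induction hypothesis ih.
example (n : ℕ) : double n = n + n := by
  induction n with
  | zero => rfl
  | succ n ih =>
    simp only [double, ih]
    omega
```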
e) Complex Quantifiers and Inequalities: LeanAgent increased its ability to prove more complex logical statements, such as amc12a_2002_p6, a theorem involving multiple existential quantifiers and inequalities.
<details>
<summary>Proof of amc12a_2002_p6</summary>

```lean
theorem amc12a_2002_p6 (n : ℕ) (h₀ : 0 < n) :
    ∃ m, m > n ∧ ∃ p, m * p ≤ m + p := by
  lift n to ℕ+ using h₀
  cases' n with n
  exact ⟨_, lt_add_of_pos_right _ zero_lt_one, 1, by simp⟩
```
</details>
The proofs at this later stage are more sophisticated, usually involving multiple steps and combining various mathematical concepts, indicating a better ability to connect different areas of mathematics and mirroring the progression observed in Mathematics in Lean Source and SciLean. For example, as shown above, LeanAgent provides a one-line proof of the relatively advanced theorem mathd_numbertheory_233: the hypothesis directly proves the goal. This suggests that LeanAgent grasps modular arithmetic and recognizes when a given hypothesis suffices to close the goal without additional steps.
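The one-line proof works because 116 is indeed the inverse of 24 modulo $11^2 = 121$; the underlying arithmetic can be checked with a small example of our own (assumes Mathlib's norm_num):

```lean
-- 24 * 116 = 2784 = 23 * 121 + 1, so 24 * 116 ≡ 1 (mod 121).
example : (24 * 116) % 11 ^ 2 = 1 := by norm_num
```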
Furthermore, as shown above, LeanAgent uses four tactics to prove the induction_12dvd4expnp1p20 theorem. This demonstrates its ability to handle more complex number theory proofs and use advanced tactics. This again shows that LeanAgent can recognize when it does not require complex premise retrieval.
LeanAgent demonstrates a similar grasp in the theorem amc12a_2002_p6. Notably, it combines the simple premises lt_add_of_pos_right, which states that an element is less than itself plus a positive element, and zero_lt_one, which states that 0 is less than 1, with more advanced tactics like lift, cases', and exact with complex term construction. This demonstrates its ability to reuse foundational concepts in more complex proofs of abstract mathematical concepts, showing its stability.
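For reference, the two premises compose as follows (a minimal example of our own, assuming Mathlib):

```lean
-- `zero_lt_one : (0 : ℕ) < 1` feeds `lt_add_of_pos_right`, yielding a < a + 1.
example (a : ℕ) : a < a + 1 :=
  lt_add_of_pos_right a zero_lt_one
```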
Importantly, LeanAgent’s performance on MiniF2F showcases its ability to adapt and improve across different mathematical domains. We see this in the progression from ReProver’s basic arithmetic and algebra to LeanAgent’s more advanced number theory, calculus, and abstract algebra. This aligns with the observations from Mathematics in Lean Source and SciLean, further supporting the effectiveness of LeanAgent’s lifelong learning approach in theorem proving across various mathematical repositories.
Furthermore, early proofs from ReProver dealt with concrete numbers and simple equations, while later proofs from LeanAgent involved more abstract concepts like equivalence relations and function properties. LeanAgent gained the capability to handle more complex number theory problems involving divisibility under constraints. Moreover, LeanAgent shifted from solving basic linear and quadratic equations to analyzing functions and their compositions. Also, early proofs often used norm_num for straightforward computations, whereas later proofs employed more varied tactics and premises, suggesting a more sophisticated approach to proof construction. All of this suggests that while existing methods like ReProver may be better suited to simpler computational problems, LeanAgent is superior on the complex, analytical problems that characterize advanced mathematics. This also corroborates LeanAgent's performance on the repositories mentioned previously.
In addition, we evaluate LeanAgent on the Lean4 version of the MiniF2F test set using the same experimental setup from prior experiments, taking care not to progressively train on Test.lean and only proving sorry theorems from it. The results are as follows:
- LeanAgent: Pass@1 of 38.1% (93/244 theorems)
- ReProver (our run on Lean4): Pass@1 of 34.0% (83/244 theorems)
ReProver was only tested on the Lean3 test set in the LeanDojo paper, so we ran ReProver on the Lean4 test set for a fairer comparison. TheoremLlama (Wang et al., 2024c) reports a 33.61% cumulative accuracy on the Lean4 test set, but this makes a direct comparison with the Pass@1 rates of LeanAgent and ReProver infeasible. Moreover, DeepSeek-Prover-V1.5 (Xin et al., 2024b) reports a state-of-the-art result of 63.5% on the Lean4 test set.
However, LeanAgent is a lifelong learning framework, not a model. The performance of LeanAgent is dependent on the retriever used as the starting point - ReProver’s retriever in our case. As such, the metrics for other methods besides ReProver cannot be directly compared with LeanAgent. Such a comparison would be impractical for reasons including differences in data, pretraining, and fine-tuning. Again, we only compare with ReProver because we use ReProver’s retriever as the starting one in LeanAgent, allowing for a more faithful comparison. We use ReProver because past work has shown that retrieval leads to improved generalizability.
Moreover, we design LeanAgent to continuously generalize to and improve on ever-expanding mathematical knowledge without forgetting previously learned knowledge. Our goal is not to beat the MiniF2F benchmark; instead, we aim to perform well in proving sorry theorems across a range of diverse repositories. The other approaches mentioned focus on different objectives and don’t address the lifelong learning aspect that is central to our work.
Formal Book. We first examine the sorry theorems from Formal Book that LeanAgent proved during lifelong learning. These theorems centered around:
a) Real Analysis and Inequalities: LeanAgent demonstrates a better understanding relative to ReProver in real number properties and can handle basic inequality reasoning, proving book.irrational.lem_aux_ii, which involves real analysis and inequalities.
LeanAgent’s proof of lem_aux_ii:

```lean
lemma lem_aux_ii (n : ℕ) (x : ℝ) (h_1 : 0 < x) (h_2 : x < 0) :
    (0 < f_aux n x) ∧ (f_aux n x < (1 : ℝ) / n.factorial) := by
  constructor <;> linarith
```
b) Number Theory: LeanAgent shows capabilities in fundamental number theory concepts, proving book.quadratic_reciprocity.quadratic_reciprocity_2, a key result in quadratic reciprocity.
LeanAgent’s proof of quadratic_reciprocity_2:

```lean
theorem quadratic_reciprocity_2 (p q : ℕ) (hp : p ≠ 2) (hq : q ≠ 2)
    [Fact (Nat.Prime p)] [Fact (Nat.Prime q)] :
    (legendre_sym p q) * (legendre_sym q p) = -1 ^ ((p - 1) / 2 * (q - 1) / 2) := by
  exact book.quadratic_reciprocity.quadratic_reciprocity_1 p q hp hq
```
Notably, the proof of the first theorem uses no premises, and the proof of the second uses a simple statement of quadratic reciprocity. However, by the end of the lifelong learning process, LeanAgent exhibits growth in its proving abilities in this repository:
a) Advanced Abstract Algebra: LeanAgent advances to proving wedderburn (Wedderburn’s Little Theorem), a crucial result in abstract algebra stating that every finite division ring is a field.
LeanAgent’s proof of wedderburn:

```lean
theorem wedderburn (h : Fintype R) : IsField R := by
  apply Field.toIsField
```
LeanAgent’s proof of the wedderburn theorem demonstrates its ability to handle algebraic structures. By using the Field.toIsField premise, LeanAgent shows that it has grasped what makes a ring a field, which requires an understanding of ring theory and field properties.
Coxeter. LeanAgent could not prove sorry theorems from the Coxeter repository during lifelong learning. However, by the end of lifelong learning, LeanAgent demonstrates a growing understanding of more complex algebraic structures, proving the lemma invmap.of_eq about Coxeter systems, again showing the ability to work with advanced concepts in group theory and abstract algebra. This corroborates the handling of abstract algebra necessary to prove the wedderburn theorem.
LeanAgent’s proof of invmap.of_eq:

```lean
lemma invmap.of_eq {S : Set G} [CoxeterSystem G S] {s : S} : invmap S s = s := by
  simp [CoxeterSystem.Presentation.invmap]
  unfold CoxeterSystem.toMatrix
  apply CoxeterSystem.monoidLift.of
```
LeanAgent’s proof of invmap.of_eq involves unfolding definitions and applying specific properties of Coxeter systems. This demonstrates LeanAgent’s growing understanding, relative to ReProver, of the abstract algebra specific to the new repository it has learned from.
Hairy Ball Theorem. Moreover, LeanAgent could not prove sorry theorems from the Hairy Ball Theorem repository during lifelong learning. However, by the end of lifelong learning, LeanAgent again demonstrates a stronger understanding than ReProver of algebraic topology. It proves HairyBallDiff, a key step in the Hairy Ball Theorem, demonstrating a grasp of vector spaces, norms, and their connections to topological concepts.
Crucially, only LeanAgent could prove invmap.of_eq, wedderburn, and HairyBallDiff, demonstrating that it has developed much more advanced theorem-proving capabilities than other setups. These proofs show that LeanAgent can work with highly abstract concepts and apply them to specific mathematical objects.
A.6 Curriculum Learning Analysis
In this section, we aim to answer the following questions: (1) Why does LeanAgent use $e^{S}$ (where $S$ is the number of proof steps) as its complexity metric? (2) Why does curriculum learning work in theorem proving? (3) Why does LeanAgent use curriculum learning instead of other lifelong learning methods?
Complexity Measure. There is no universal measure of proof complexity in Lean or other formal systems. One approach, the length-based measure, involves examining the proof length (number of steps or lines) and the size of the proof term in a formal system. While these can indicate verification complexity, they may not fully capture the complexity of discovering a proof (Arana & Stafford, 2023). Moreover, within the NLP literature, many works have related input length to complexity (Zaremba & Sutskever, 2015; Cirik et al., 2016; Spitkovsky et al.; Subramanian et al., 2017; Chang et al., 2021). Starting with shorter sequences and gradually increasing length improves model quality (Li et al., 2024).
These works demonstrate the gains from basing the complexity measure on input length. As such, we consider the equivalent of length in theorem proving to be the number of proof steps. However, we consider a linear scaling of length naive for theorem proving, as it does not account for the combinatorial explosion of possible proof paths as the length of the proof increases. As such, we choose an exponential scaling. Notably, a key strength of this choice is that it is easy to compute and requires no additional hyperparameters to tune.
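As a concrete illustration, the curriculum ordering induced by this metric can be sketched as follows (a minimal sketch; the function names and data layout are hypothetical, not LeanAgent’s actual implementation):

```python
import math

def complexity(num_proof_steps: int) -> float:
    """Exponential complexity metric e^S, where S is the number of proof steps."""
    return math.exp(num_proof_steps)

def curriculum_order(theorems: dict[str, int]) -> list[str]:
    """Sort theorems (name -> proof-step count) from easiest to hardest."""
    return sorted(theorems, key=lambda name: complexity(theorems[name]))

# Relative to a 1-step proof, a 3-step proof scores e^2 (about 7.4x) higher,
# reflecting the combinatorial growth of the proof search space.
example = {"hard_lemma": 12, "easy_lemma": 1, "medium_lemma": 4}
print(curriculum_order(example))  # easiest first
```

Because $e^{S}$ is monotone in $S$, the ordering is the same as sorting by raw proof length; the exponential matters when complexity scores are aggregated or thresholded rather than merely ranked.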
We now discuss alternative complexity metrics and why we chose not to use them in LeanAgent. One option is $e^{B}$, where $B$ represents the number of different proof paths that could be explored at each step. Formally, $B$ is defined as the average number of child nodes over the non-leaf nodes of the proof tree, sometimes called the branching factor. We refrain from using this metric because computing it becomes expensive for complex proofs. Another option is to measure the complexity of the theorem statement itself, for example by the number of unique symbols, the depth of nested expressions, or the number of quantifiers. However, developing a reliable statement-complexity metric that works across various mathematical domains is challenging, and since LeanAgent focuses on improving proof generation, a metric tied directly to the proof process (number of steps) aligns better with this goal. Dependency-based complexity, which orders theorems by their dependency structure within the mathematical library, was not used either: a theorem might depend on many simple results yet be easy to prove, or depend on a few results yet be very challenging. A topic-based curriculum would also be unsuitable because LeanAgent aims to be a general-purpose framework; a topic-based approach might bias it towards certain mathematical domains and does not account for theorems spanning multiple concepts. Finally, a premise-based complexity measure would require analyzing premise occurrence frequencies across repositories, which could be computationally expensive, especially for large repositories like SciLean.
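For concreteness, the branching factor $B$ of an already-materialized proof tree could be computed as below (a hypothetical sketch; the dictionary representation of the tree is an assumption, not LeanAgent code). The expense lies in materializing the tree at all, since the number of explored nodes can grow exponentially with proof depth:

```python
import math

def branching_factor(tree: dict[str, list[str]]) -> float:
    """Average number of children over all non-leaf nodes of a proof tree.

    `tree` maps each node to its list of children; leaves map to [].
    """
    non_leaf_child_counts = [len(c) for c in tree.values() if c]
    if not non_leaf_child_counts:
        return 0.0  # a single-node tree has no non-leaf nodes
    return sum(non_leaf_child_counts) / len(non_leaf_child_counts)

# root has 2 children, node "a" has 3; the four leaves are excluded:
tree = {"root": ["a", "b"], "a": ["c", "d", "e"], "b": [], "c": [], "d": [], "e": []}
B = branching_factor(tree)  # (2 + 3) / 2 = 2.5
print(math.exp(B))          # the e^B complexity score
```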
Curriculum Learning. Prior work suggests that curriculum learning guides the learner towards better local minima in non-convex optimization problems (Bengio et al., 2009). Crucially, theorem proving involves navigating a highly non-convex optimization landscape, especially for complex mathematical statements. The space of possible proofs is vast and complex, with many potential dead ends and suboptimal solutions to parts of a proof. This makes theorem proving an ideal candidate for benefitting from curriculum learning. Moreover, curriculum learning has been shown to have an effect similar to unsupervised pre-training, acting as a form of regularization (Bengio et al., 2009). For theorem proving, curriculum learning allows LeanAgent to build a foundational knowledge base of mathematics before attempting more complex theorems, naturally leading to continuous improvement. This avoids suboptimal proof strategies early in training. Furthermore, this leads to more robust and generalizable proof techniques that work across a broader range of theorems, explaining LeanAgent’s sorry theorem proving performance. Moreover, the ability to act as a regularizer means curriculum learning prevents the model from overfitting to specific types of proofs or mathematical domains. This allows for continuous generalizability, explaining LeanAgent’s superior lifelong learning metric scores.
Moreover, other works from the literature show that curriculum learning biases models towards building constructive internal representations (Cirik et al., 2016). Specifically, it allows the model to use the knowledge from earlier steps in later predictions. In theorem proving, this allows LeanAgent to learn basic proof skills, which then become building blocks for more complex proofs. This corroborates our analysis of LeanAgent’s progression of proof complexity. Moreover, multiple past works agree that curriculum learning provides larger gains when training data is limited (Zaremba & Sutskever, 2015; Cirik et al., 2016; Spitkovsky et al.), as is the case in formal theorem proving, where the number of formalized theorems and proofs is limited. These past works also state that curriculum learning leads to better generalization, supporting the observations of LeanAgent’s superior lifelong learning metrics.
This context supports LeanAgent’s sorry theorem proving performance and superiority on lifelong learning metrics.
Comparison with Other Lifelong Learning Methods. Many other lifelong learning methods exist, such as those mentioned in Sec. 2. However, we chose curriculum learning for the following reasons. Regularization methods, like EWC, slow down learning on important parameters; however, which parameters are important can change as theorem complexity increases. This helps explain the lower sorry theorem proving performance of EWC methods and their scores on the lifelong learning metrics. Memory-based techniques store examples from previous tasks to prevent forgetting, but this can greatly affect their balance of stability and plasticity, as seen in the lifelong learning metrics of the Merge All setups, including the negative IP values. Knowledge distillation requires a separate teacher model, whereas curriculum learning is more efficient, providing the path to knowledge accumulation in a single model. Since LeanAgent is a framework, not a model, we refrain from using dynamic architecture adjustment to keep LeanAgent general across LLM architectures. Finally, recent work selectively updates parameters with the largest momentum magnitudes and uses selective reinitialization to maintain plasticity; however, these methods often focus on balancing performance across distinct tasks. In theorem proving, where tasks form a spectrum of increasing complexity, curriculum learning provides a more structured approach to knowledge accumulation.
A.7 Theorem Proving Performance Score (TPPS) Analysis
In this work, we directly compared the number of proven theorems with ReProver. However, the field of theorem proving, especially in a lifelong learning context, currently lacks standardized performance metrics. To address this concern, this section discusses a possible alternative metric while acknowledging its limitations.
Theorem proving difficulty is inherently non-linear. For example, LeanAgent’s significantly improved performance over the baseline across multiple repositories allows it to prove progressively harder theorems. Furthermore, sorry theorems lack ground truth proofs, so proving one is valuable. To address these nuances, one could propose the Theorem Proving Performance Score (TPPS) to emphasize newly proven sorry theorems. Specifically, $\text{LeanAgent TPPS} = (\text{\# ReProver Theorems Proved}) + (\text{\# New Theorems Proved} \times X) + 1$, where $X$ represents the importance of proving a new theorem, and $\text{ReProver TPPS} = (\text{\# ReProver Theorems Proved}) + 1$. Then, $\text{Improvement Factor} = (\text{LeanAgent TPPS}) / (\text{ReProver TPPS})$.
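Under these definitions, the computation itself is straightforward; the following sketch uses a static $X = 10$ purely for illustration (the counts are invented, not results from the paper):

```python
def tpps(reprover_proved: int, new_proved: int = 0, x: float = 10.0) -> float:
    """Theorem Proving Performance Score.

    The +1 keeps the ratio well-defined when no theorems are proved.
    """
    return reprover_proved + new_proved * x + 1

def improvement_factor(reprover_proved: int, new_proved: int, x: float = 10.0) -> float:
    return tpps(reprover_proved, new_proved, x) / tpps(reprover_proved)

# e.g. both systems prove 20 theorems, and LeanAgent additionally proves
# 5 previously-missing sorry theorems with importance weight X = 10:
print(improvement_factor(20, 5))  # (20 + 5*10 + 1) / (20 + 1), about 3.38
```

The example makes the sensitivity to $X$ visible: halving $X$ to 5 drops the factor to roughly 2.19, which is exactly the inflation/deflation concern raised below.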
The core idea behind TPPS is to assign a higher reward for newly proven theorems, aligning with lifelong learning objectives. However, it is important to recognize its preliminary nature and potential shortcomings. First, choosing the parameter $X$ is challenging. One approach is to choose a static value, such as $X=10$ , to standardize comparisons between LeanAgent and ReProver across diverse repositories. However, this may lead to inflated or deflated metrics on such repositories. Alternatively, $X$ could be chosen adaptively based on the difficulty of a repository, but this may similarly result in unrealistic metric scores. Overall, the TPPS metric focuses on quantity and faces challenges in ensuring comparisons across diverse repositories.
Moreover, the TPPS metric can be susceptible to artifacts that artificially inflate performance. For instance, several theorems in the PFR repository were proven on a technicality due to placeholder statements of $0 = 1$, which could then be used to prove other sorry theorems through the principle of ex falso quodlibet. Examples such as this are due to weaknesses in the current state of repositories rather than a fundamental shortcoming of our ML approach, underscoring the need for more robust and well-tested benchmarks.
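To illustrate how such an artifact arises, consider the following hypothetical Lean 4 sketch (not an actual PFR excerpt): once a repository contains a placeholder lemma stating $0 = 1$, any goal can be closed from the resulting contradiction.

```lean
-- Hypothetical placeholder left unproven in a repository:
lemma placeholder : (0 : ℕ) = 1 := sorry

-- Any statement now follows by ex falso quodlibet,
-- since `placeholder` contradicts the decidable fact 0 ≠ 1:
theorem any_sorry_theorem (P : Prop) : P :=
  absurd placeholder (by decide)
```

Such a "proof" is accepted by the kernel (modulo the sorry) yet carries no mathematical content, which is why raw theorem counts, and hence TPPS, can be inflated by it.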
Furthermore, TPPS does not account for the difficulty of the theorems being proved. While we initially considered using proof length as a proxy for difficulty (e.g., $e^{S}$ where $S$ is the number of proof steps), we found this approach problematic during proving evaluation. LeanAgent sometimes generates shorter proofs for difficult theorems, making proof length an unreliable indicator of theorem difficulty.
These issues expose broader challenges in evaluating theorem-proving systems. The field lacks standardized benchmarks, necessitating carefully curated sets of theorems with varying difficulties across different mathematical domains. Developing reliable metrics to assess theorem difficulty beyond simple measures like proof length or statement complexity is crucial. Future benchmarks should prevent the exploitation of technicalities. Moreover, metrics should consider the mathematical significance of proven theorems, not just their quantity. Finally, ensuring that performance metrics are meaningful and comparable across different mathematical repositories is essential for consistent evaluation.
In light of these considerations, we acknowledge that TPPS, while a step towards quantifying theorem-proving performance in a lifelong learning setting, has many limitations. Future work should focus on developing more sophisticated and robust evaluation frameworks that address these challenges. For example, we plan to investigate more sophisticated measures that reward not only newly proved theorems but also difficult ones. By highlighting these issues, we hope to contribute to the ongoing discussion on how best to evaluate and compare theorem-proving systems, particularly in the context of lifelong learning. The development of standardized, reliable metrics will be crucial for measuring progress and guiding future research in this field.