arXiv:2410.06209
# LeanAgent: Lifelong Learning for Formal Theorem Proving
**Authors**:
- Chaowei Xiao, Anima Anandkumar (California Institute of Technology, Stanford University, University of Wisconsin–Madison)
> Equal contribution.
Abstract
Large Language Models (LLMs) have been successful in mathematical reasoning tasks such as formal theorem proving when integrated with interactive proof assistants like Lean. Existing approaches involve training or fine-tuning an LLM on a specific dataset to perform well on particular domains, such as undergraduate-level mathematics. These methods struggle with generalizability to advanced mathematics. A fundamental limitation is that these approaches operate on static domains, failing to capture how mathematicians often work across multiple domains and projects simultaneously or cyclically. We present LeanAgent, a novel lifelong learning framework for formal theorem proving that continuously generalizes to and improves on ever-expanding mathematical knowledge without forgetting previously learned knowledge. LeanAgent introduces several key innovations, including a curriculum learning strategy that optimizes the learning trajectory in terms of mathematical difficulty, a dynamic database for efficient management of evolving mathematical knowledge, and progressive training to balance stability and plasticity. LeanAgent successfully generates formal proofs for 155 theorems across 23 diverse Lean repositories where formal proofs were previously missing, many from advanced mathematics. It performs significantly better than the static LLM baseline, proving challenging theorems in domains like abstract algebra and algebraic topology while showcasing a clear progression of learning from basic concepts to advanced topics. In addition, we analyze LeanAgent's superior performance on key lifelong learning metrics. LeanAgent achieves exceptional scores in stability and backward transfer, where learning new tasks improves performance on previously learned tasks. This emphasizes LeanAgent's continuous generalizability and improvement, explaining its superior theorem-proving performance.
1 Introduction
Mathematics can be expressed in informal and formal languages. Informal mathematics utilizes natural language and intuitive reasoning, whereas formal mathematics employs symbolic logic to construct machine-verifiable proofs (Kevin Buzzard, 2019). State-of-the-art large language models (LLMs), such as o1 (OpenAI, 2024) and Claude (Claude Team, 2024), produce incorrect informal proofs (Zhou et al., 2024). This highlights the importance of formal mathematics in ensuring proof correctness and reliability. Interactive theorem provers (ITPs), such as Lean (De Moura et al., 2015), have emerged as tools for formalizing and verifying mathematical proofs. However, constructing formal proofs using ITPs is complex and time-consuming; it requires extremely detailed proof steps and involves working with extensive mathematical libraries.
Recent research has explored using LLMs to generate proof steps or complete proofs. For example, LeanDojo (Yang et al., 2023) introduced the first open-source framework to spur such research. Existing approaches typically involve training or fine-tuning LLMs on a specific dataset (Jiang et al., 2022). However, data scarcity in formal theorem proving (Polu et al., 2022) hinders the generalizability of these approaches (Liu et al., 2023). For example, ReProver, the retrieval-augmented LLM from the LeanDojo family, uses a retriever fine-tuned on Lean's math library, mathlib4 (mathlib4 Community, 2024). Although mathlib4 contains over 100,000 formalized mathematical theorems and definitions, it covers primarily up to undergraduate mathematics. Consequently, ReProver performs poorly on more challenging mathematics, such as Terence Tao's formalization of the Polynomial Freiman-Ruzsa (PFR) Conjecture (Tao et al., 2024).
<details>
<summary>Figure 1 image description (Final2.png)</summary>

### Visual Description
## Diagram: LeanAgent Workflow
### Overview
The image is a diagram illustrating the workflow of LeanAgent, a system for automated theorem proving. The diagram is divided into several stages, including data extraction from GitHub repositories, curriculum learning, progressive training, and theorem proving. The workflow involves iterative processes and feedback loops.
### Components/Axes
* **GitHub Repositories:** Represents the source of theorem and proof data.
* **Extract Data Per Repository:** Process of extracting theorem and proof data from GitHub repositories.
* **LeanAgent:** The overall system for automated theorem proving.
* **Dynamic Database:** A database used to store and retrieve theorems and proofs.
* **Curriculum Learning:** A stage involving learning curricula.
* **Repeat For Each Repository:** Indicates an iterative process.
* **Premise Corpus Theorems + Proofs:** A collection of premises, theorems, and proofs.
* **Progressive Training:** A training stage.
* **sorry Theorem Proving:** The final stage of theorem proving.
* **Complexity vs. # Proof Steps:** A graph showing the relationship between theorem complexity and the number of proof steps. The y-axis is labeled "Complexity" and the x-axis is labeled "# Proof Steps". The graph shows a curve that increases rapidly.
* **Theorem Complexities:** Indicates the complexity of theorems. Examples given are "complexity: 2.7" and "complexity: 7.4".
* **Easy Medium Hard:** Labels indicating the difficulty levels in curriculum learning. These are associated with a bell curve.
* **Descending # Easy Theorems:** Indicates a descending order of easy theorems. The blocks are colored green, yellow, and red.
* **Latest Retriever:** The most recent retriever model.
* **Limited Exposure:** A stage with limited exposure, indicated as "1 epoch".
* **Balanced Training:** A training stage with balanced stability and plasticity.
* **Stability Plasticity:** Labels indicating the balance between stability and plasticity.
* **New Retriever:** The updated retriever model.
* **sorry Theorems:** Theorems that need to be proven.
* **Retrieved Premises w/ Updated Retriever:** Premises retrieved using an updated retriever. Examples given are "theorem" and "lemma".
* **Generated Tactics:** Tactics generated for theorem proving. Example given is "apply premise".
* **Best-First Tree Search:** A search algorithm used for theorem proving.
* **New Proofs:** The resulting proofs.
### Detailed Analysis
* **Data Extraction:** The process starts with extracting data from GitHub repositories. The extracted data includes theorems and proofs.
* **LeanAgent Core:** The extracted data is fed into the LeanAgent system.
* **Curriculum Learning Loop:** The system uses curriculum learning, which involves a dynamic database. The process is repeated for each repository.
* **Complexity Graph:** The graph shows that as the number of proof steps increases, the complexity of the theorem also increases.
* **Curriculum Stages:** The curriculum learning stage involves easy, medium, and hard difficulty levels. The number of easy theorems decreases.
* **Progressive Training Stages:** The progressive training stage involves a latest retriever, limited exposure (1 epoch), and balanced training.
* **Theorem Proving Stages:** The theorem proving stage involves retrieved premises, generated tactics, and a best-first tree search.
### Key Observations
* The diagram illustrates a cyclical workflow, with feedback loops between different stages.
* The curriculum learning stage involves a progression from easy to hard difficulty levels.
* The progressive training stage aims to balance stability and plasticity.
* The theorem proving stage uses a best-first tree search algorithm.
### Interpretation
The diagram provides a high-level overview of the LeanAgent workflow for automated theorem proving. The system leverages data from GitHub repositories, employs curriculum learning and progressive training techniques, and uses a best-first tree search algorithm for theorem proving. The iterative nature of the workflow suggests a continuous learning and improvement process. The balance between stability and plasticity in the progressive training stage is crucial for adapting to new data while maintaining existing knowledge. The complexity graph indicates that more complex theorems require more proof steps.
</details>
Figure 1: LeanAgent overview. LeanAgent searches for Lean repositories and uses LeanDojo to extract theorems and proofs. It uses curriculum learning, computing theorem complexity as $e^{S}$ ( $S$ = proof steps) and calculating the 33rd and 67th complexity percentiles across all theorems to sort repositories by easy theorem count. LeanAgent adds the curriculum to its dynamic database. As a retrieval-based framework, LeanAgent generates a dataset (premise corpus and a collection of theorems and proofs) for each repository in the curriculum and progressively trains its retriever. Progressive training happens over one epoch to prevent forgetting old knowledge. Then, LeanAgent uses the updated retriever in a search-based method to generate formal proofs for theorems where formal proofs were previously missing, known as sorry theorems. It adds new proofs to the database.
The dynamic nature of mathematical research exacerbates this generalizability issue. Mathematicians often formalize across multiple domains and projects simultaneously or cyclically. For example, Terence Tao has worked on various projects in parallel, including formalizations of the PFR Conjecture, the symmetric mean of real numbers, the classical Newton inequality, and asymptotic analysis (Tao; Tao, 2024a; Tao et al., 2024; Tao, 2024b). Patrick Massot has been formalizing Scholze's condensed mathematics and the Perfectoid Spaces project (Community, 2024a; b). These examples highlight a critical gap in current theorem-proving AI approaches: the lack of a system that can adapt and improve across multiple, diverse mathematical domains over time, given limited Lean data availability.
Connection to Lifelong Learning. Crucially, how mathematicians formalize is relevant to lifelong learning, i.e., learning multiple tasks without forgetting (Wang et al., 2024b). A significant challenge is catastrophic forgetting: when adaptation to new distributions leads to a loss of understanding of old ones (Jiang et al., 2024). The core challenge is balancing plasticity (the ability to learn and adapt) with stability (the ability to retain existing knowledge) (Wang et al., 2024b). Increasing plasticity to learn new tasks efficiently can lead to overwriting previously learned information. However, enhancing stability to preserve old knowledge may impair the model's ability to acquire new skills (van de Ven et al., 2024). Achieving the right balance is key to continuous generalizability in theorem proving.
LeanAgent. We present LeanAgent, a novel lifelong learning framework for theorem proving. As shown in Figure 1, LeanAgent's workflow consists of (1) deriving complexity measures of theorems to compute a curriculum for learning, (2) progressive training to learn while balancing stability and plasticity, and (3) searching for proofs of sorry theorems by leveraging a best-first tree search, all while using a dynamic database to manage its evolving mathematical knowledge. LeanAgent works with any LLM; we implement it with retrieval for improved generalizability (Yang et al., 2023).
We employ a simple progressive training method to avoid catastrophic forgetting. Progressive training allows LeanAgent to continuously adapt to new mathematical knowledge while preserving previously learned information. This process involves incrementally training the retriever on newly generated datasets from each repository in the curriculum. Starting with a pre-trained retriever (e.g., ReProver's retriever based on ByT5 (Xue et al., 2022)), LeanAgent trains on each new dataset for one additional epoch. Restricting progressive training to one epoch helps balance stability and plasticity. Crucially, this training is repeated for each dataset generated from the database, gradually expanding LeanAgent's knowledge base. This approach increases the space of possible proof states (where a state consists of a theorem's hypotheses and current proof progress) while adding new premises to the premise embeddings. More sophisticated lifelong learning methods, like Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017), which uses the Fisher Information Matrix to constrain weights important for previous tasks, result in excessive plasticity. This uncontrolled plasticity arises because such methods cannot adapt parameter importance as theorem complexity increases, forcing rapid changes in the parameters crucial for learning advanced concepts. These methods fail to adapt to the evolving complexity of mathematical theorems, making them unsuitable for lifelong learning in theorem proving.
Extensive experiments across 23 diverse Lean repositories demonstrate LeanAgent's advancements in lifelong learning for theorem proving. LeanAgent successfully generates formal proofs for 155 theorems across these 23 repositories where formal proofs were previously missing, known as sorry theorems, many from advanced mathematics. For example, it proves challenging sorry theorems in abstract algebra and algebraic topology related to Coxeter systems and the Hairy Ball Theorem (Coxeter, 2024; Hairy Ball Theorem, 2024). LeanAgent also proves 7 theorems using exploits found within Lean's type system. We find that LeanAgent demonstrates progressive learning in theorem proving, initially proving basic sorry theorems and significantly advancing to more complex ones. It significantly outperforms the static ReProver baseline in terms of proving new sorry theorems. We have issued pull requests to the respective repositories with the newly proven sorry theorems. Some of these proofs utilized unintended constructs within the repositories' implementations, which are currently being addressed through appropriate fixes.
In theorem proving, we find that stability, without losing too much plasticity, is crucial for continuous generalizability to new repositories. Backward transfer (BWT), where learning new tasks improves performance on previously learned tasks, is essential in theorem proving (Wang et al., 2024b). Mathematicians require a lifelong learning framework for theorem proving that is both continuously generalizable and continuously improving. We conduct an extensive ablation study using six lifelong learning metrics carefully proposed or selected from the literature. LeanAgent's simple components of curriculum learning and progressive training improve stability and BWT scores substantially, emphasizing its continuous generalizability and improvement and explaining its superior sorry theorem proving performance.
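A common formulation of BWT from the lifelong learning literature (e.g., Lopez-Paz & Ranzato, 2017; the paper's exact metric may differ) writes it in terms of $R_{i,j}$, the test performance on task $j$ after training through task $i$, with $T$ tasks in total:

```latex
% Backward transfer over T tasks, where R_{i,j} is the test score on
% task j after training through task i (one standard definition).
\mathrm{BWT} \;=\; \frac{1}{T-1} \sum_{i=1}^{T-1} \bigl( R_{T,i} - R_{i,i} \bigr)
```

A positive BWT means that training on later repositories improved performance on earlier ones, which is exactly the "continuously improving" property described above.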
2 Preliminaries
Neural Theorem Proving. The current state-of-the-art of learning-based provers employs Transformer-based (Vaswani et al., 2017) LLMs that process expressions as plain text strings (Azerbayev et al., 2024; Xin et al., 2024b; Shao et al., 2024). In addition, researchers have explored complementary aspects like proof search algorithms (Lample et al.; Wang et al., 2023). Moreover, other works break the theorem-proving process into smaller proving tasks (Song et al., 2024; Wang et al., 2024a; Lin et al., 2024).
Premise Selection. A critical challenge in theorem proving is the effective selection of relevant premises (Irving et al., 2016; Tworkowski et al.). However, many existing approaches treat premise selection as an isolated problem (Wang & Deng, 2020; Piotrowski et al., 2023) or use selected premises only as input to symbolic provers (Alama et al., 2014; Mikuła et al., 2024).
Retrieval-Augmented LLMs. While retrieval-augmented language models have been extensively studied in areas like code generation (Lu et al., 2022; Zhou et al., 2023), their application to formal theorem proving is relatively new. However, relevant architectures have been researched in natural language processing (NLP) (Lu et al., 2024; Borgeaud et al., 2022; Thakur et al., 2024).
Lifelong Learning. Lifelong learning addresses catastrophic forgetting in sequential task learning (Chen et al., 2024). Approaches include regularization methods (Kirkpatrick et al., 2017), memory-based techniques (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2019; Shin et al., 2017), and knowledge distillation (Li & Hoiem, 2017; Kim et al., 2023). Other strategies involve dynamic architecture adjustment (Mendez & Eaton, 2021) and recent work on gradient manipulation and selective re-initialization (Chen et al., 2024; Dohare et al., 2024). We justify not using these strategies in Appendix A.6.
Curriculum Learning in Theorem Proving. Prior work created a synthetic inequality generator to produce a curriculum of statements of increasing difficulty (Polu et al., 2022). For reinforcement learning, an existing work used the length of proofs to help determine rewards (Zombori et al., 2019).
3 Methodology
A useful lifelong learning strategy for theorem proving requires (a) a repository order strategy and (b) a learning strategy. We solve (a) with curriculum learning to utilize the structure of Lean proofs and (b) with progressive training to balance stability and plasticity. LeanAgent consists of four main components: curriculum learning, dynamic database management, progressive training of the retriever, and sorry theorem proving. Further methodology details are in Appendix A.1 and a discussion of why curriculum learning works in theorem proving is available in Appendix A.6.
3.1 Curriculum Learning
LeanAgent uses curriculum learning to learn on increasingly complex mathematical repositories. This process optimizes LeanAgent's learning trajectory, allowing it to build upon foundational knowledge before tackling more advanced concepts.
First, we automatically search for and clone Lean repositories from GitHub. We use LeanDojo for each repository to extract fine-grained information about their theorems, proofs, and dependencies. Then, we calculate the complexity of each theorem using $e^{S}$ , where $S$ represents the number of proof steps. However, sorry theorems, which have no proofs, are assigned infinite complexity. We use an exponential scaling to address the combinatorial explosion of possible proof paths as the length of the proof increases. Further justification for considering this complexity metric is in Appendix A.6.
We compute the 33rd and 67th percentiles of complexity across all theorems in all repositories. Using these percentiles, we categorize non-sorry theorems into three groups: easy (theorems with complexity below the 33rd percentile), medium (theorems with complexity between the 33rd and 67th percentiles), and hard (theorems with complexity above the 67th percentile). We then sort repositories by the number of easy theorems they contain. This sorting forms the basis of our curriculum, with LeanAgent starting on repositories with the highest number of easy theorems.
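The curriculum step can be sketched in a few lines. This is a hypothetical illustration, not LeanAgent's actual code: each repository is assumed to be a dict mapping theorem names to proof-step counts (`None` marks a sorry theorem), and percentile handling is approximate.

```python
import math

def complexity(steps):
    # Complexity is e^S; sorry theorems (no proof) get infinite complexity.
    return math.inf if steps is None else math.exp(steps)

def build_curriculum(repos):
    # Pool complexities of all proved theorems across every repository.
    pooled = sorted(
        complexity(s) for r in repos.values() for s in r.values() if s is not None
    )
    p33 = pooled[int(0.33 * (len(pooled) - 1))]
    p67 = pooled[int(0.67 * (len(pooled) - 1))]  # would separate medium from hard

    def easy_count(repo):
        # "Easy" = complexity at or below the 33rd percentile.
        return sum(1 for s in repo.values() if s is not None and complexity(s) <= p33)

    # Start with the repositories containing the most easy theorems.
    return sorted(repos, key=lambda name: easy_count(repos[name]), reverse=True)
```

Note that the exponential is monotone in $S$, so the bucketing could equally be done on step counts; the $e^S$ values (e.g., $e^1 \approx 2.7$) match the complexities shown in Figure 1.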
3.2 Dynamic Database Management
Then, we add the sorted repositories to LeanAgent's custom dynamic database using the data LeanAgent extracted. This way, we can keep track of and interact with the knowledge that LeanAgent is aware of and the proofs it has produced. We also include the complexity of each theorem computed in the previous step in the dynamic database, allowing for efficient reuse of repositories in a future curriculum. Details of the database's contents and features can be found in Appendix A.1.
For each repository in the curriculum, LeanAgent uses the dynamic database to generate a dataset by following the same procedure used to make LeanDojo Benchmark 4 (details in Appendix A.1). This dataset includes a collection of theorems and their proofs. Each step of these proofs contains detailed annotations, such as how the step changes the state of the proof. A state consists of a theorem's hypotheses and the current progress in proving the theorem. As such, this pairing of theorems and proofs demonstrates how to use specific tactics (functions) and premises in sequence to prove a theorem. In addition, the dataset includes a premise corpus, serving as a library of facts and definitions.
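As an illustration of what the dynamic database tracks, an entry might look like the following sketch. Field names and structure are hypothetical, not the actual implementation (see Appendix A.1 for that):

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class TheoremEntry:
    full_name: str
    file_path: str
    complexity: float          # e^S, or float("inf") for sorry theorems
    proof: list[str] | None    # tactic steps; None while the theorem is a sorry

@dataclass
class RepoEntry:
    url: str
    commit: str
    theorems: dict[str, TheoremEntry] = field(default_factory=dict)

    def record_proof(self, name: str, steps: list[str]) -> None:
        # When sorry-theorem proving succeeds, store the proof so future
        # premise corpora and progressive-training runs can use it.
        self.theorems[name].proof = steps
```

Storing per-theorem complexity alongside the proof is what lets a future curriculum reuse a repository without re-deriving the percentile buckets.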
3.3 Progressive Training of the Retriever
LeanAgent then progressively trains its retriever on the newly generated dataset. This strategy allows LeanAgent to continuously adapt to new mathematical knowledge from the premises in new datasets while preserving previously learned information, crucial for lifelong learning in theorem proving. Progressive training achieves this by incrementally incorporating new knowledge from each repository.
Although LeanAgent works with any LLM, we provide a specific implementation here. We start with ReProver's retriever, a fine-tuned version of the ByT5 encoder (Xue et al., 2022), leveraging its general pre-trained knowledge from mathlib4. We train LeanAgent on the new dataset for an additional epoch. This limited exposure helps prevent overfitting to the new data while allowing LeanAgent to learn essential new information. LeanAgent's retriever, and therefore the embeddings it generates, are continuously updated during progressive training. Thus, at the end of the current progressive training run, we precompute embeddings for all premises in the corpus generated by LeanAgent's current state to ensure that we properly evaluate LeanAgent's validation performance. To understand how LeanAgent balances stability and plasticity, we save the model iteration with the highest validation recall for the top ten retrieved premises (R@10). This is a raw plasticity value: it can be used to compute other metrics that describe LeanAgent's ability to adapt to and handle new types of mathematics in the latest repository (details in Sec. 4). Then, we compute the average test R@10 over all previous datasets the model has progressively trained on, a raw stability value.
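The two raw quantities above can be sketched as simple helpers. These are hypothetical illustrations, not LeanAgent's code: `retrieved` is assumed to be a ranked list of premise identifiers, and `ground_truth` the annotated premises for a proof state.

```python
def recall_at_10(retrieved, ground_truth):
    # R@10: fraction of ground-truth premises among the top ten retrieved.
    if not ground_truth:
        return 0.0
    hits = sum(1 for premise in ground_truth if premise in retrieved[:10])
    return hits / len(ground_truth)

def average_r10(per_dataset_scores):
    # Averaging test R@10 over all previously seen datasets gives the
    # raw stability value; validation R@10 on the newest dataset is the
    # raw plasticity value used for checkpoint selection.
    return sum(per_dataset_scores) / len(per_dataset_scores)
```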
As mentioned previously, we repeat this procedure for each dataset we generate from the database, hence the progressive nature of this training. Progressive training adds new premises to the premise embeddings and increases the space of possible proof states. This allows LeanAgent to explore more diverse paths to prove theorems, discovering new proofs that it couldn't produce with its original knowledge base.
3.4 sorry Theorem Proving
For each sorry theorem, LeanAgent generates a proof with a best-first tree search by generating tactic candidates at each step, in line with prior work (Yang et al., 2023). Using the embeddings from the entire corpus of premises we previously collected, LeanAgent retrieves relevant premises from the premise corpus based on their similarity to the current proof state, represented as a context embedding. Then, it filters the results using a corpus dependency graph to ensure that we only consider premises accessible from the current file. We add these retrieved premises to the current state and generate tactic candidates using beam search. Then, we run each tactic candidate through Lean to obtain potential next states. Each successful tactic application adds a new edge to the proof search tree. We choose the tactic with the maximum cumulative log probability of the tactics leading to it. If the search reaches a dead-end, we backtrack and explore alternative paths. We repeat the above steps until the search finds a proof, exhausts all possibilities, or reaches the time limit of 10 minutes.
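The search loop can be sketched with a priority queue keyed on cumulative tactic log-probability. All names here are illustrative: `generate` stands in for premise retrieval, tactic generation via beam search, and tactic execution in Lean, returning `(tactic, log_prob, next_state)` triples with a sentinel state `"QED"` on success.

```python
import heapq

def best_first_search(init_state, generate, max_expansions=1000):
    # heapq is a min-heap, so we push negated cumulative log-probabilities
    # and always expand the state with the best score so far.
    frontier = [(0.0, init_state, [])]
    seen = {init_state}
    while frontier and max_expansions > 0:
        neg_score, state, path = heapq.heappop(frontier)
        max_expansions -= 1
        for tactic, logp, nxt in generate(state):
            if nxt == "QED":
                return path + [tactic]          # proof found
            if nxt not in seen:                 # each new edge extends the tree
                seen.add(nxt)
                heapq.heappush(frontier, (neg_score - logp, nxt, path + [tactic]))
    return None  # exhausted, or expansion budget reached (stands in for the 10-minute limit)
```

Because the frontier retains all unexpanded states, "backtracking" falls out naturally: when a branch dead-ends, the next pop simply resumes from the best remaining alternative.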
If LeanAgent finds a proof, it adds it to the dynamic database. The newly added premises from this proof will be included in a future premise corpus involving the current repository. Moreover, LeanAgent can learn from the new proof during progressive training in the future, aiding further improvements.
4 Experiments
4.1 Experimental Setup
Table 1: Selected repository descriptions
| Repository | Description |
| --- | --- |
| PFR | Polynomial Freiman-Ruzsa Conjecture |
| Hairy Ball Theorem | Algebraic topology result |
| Coxeter | Coxeter groups |
| Mathematics in Lean Source | Lean files for the textbook |
| Formal Book | Proofs from THE BOOK |
| MiniF2F | Math olympiad-style problem solving |
| SciLean | Scientific computing |
| Carleson | Carleson's Theorem |
| Lean4 PDL | Propositional Dynamic Logic |
We devote this section to describing two types of experiments: (1) sorry Theorem Proving: We compare sorry theorem proving performance between LeanAgent and ReProver. We also examine how the set of sorry theorems LeanAgent proves grows over the course of lifelong learning, which shows LeanAgent's continuous generalizability and improvement. (2) Lifelong Learning Analysis: We conduct an ablation study with six lifelong learning metrics to explain LeanAgent's superiority in sorry theorem proving. Moreover, these results explain LeanAgent's superior handling of the stability-plasticity tradeoff. Please see Appendix A.2 for experiment implementation details. We release LeanAgent at https://github.com/lean-dojo/LeanAgent.
Repositories. We evaluate our approach on a diverse set of 23 Lean repositories to assess its generalizability across different mathematical domains (Skřivan, 2024; Kontorovich, 2024; Avigad, 2024; Tao et al., 2024; Renshaw, 2024; Fermat's Last Theorem, 2024; DeepMind, 2024; Carneiro, 2024; Wieser, 2024; Mizuno, 2024; Murphy, 2024; Formal Logic, 2024; Con-nf, 2024; Gadgil, 2024; Yang, 2024; Zeta 3 Irrational, 2024; Firsching, Moritz, 2024; Monnerjahn, 2024; van Doorn, 2024; Dillies, 2024; Hairy Ball Theorem, 2024; Coxeter, 2024; Gattinger, 2024). Details of key repositories are in Table 1. Further details of these repositories, including commits and how we chose the initial curriculum and the sub-curriculum (described in Sec. 4.2), are in Appendix A.3.
4.2 sorry Theorem Proving
We compare the number of sorry theorems LeanAgent can prove, both during and after lifelong learning, to the ReProver baseline. We use ReProver as the baseline because we use its retriever as LeanAgentâs initial retriever in our experiments.
<details>
<summary>Figure 2 image description (newproof4.png)</summary>

### Visual Description
## Code Snippets: Proof Assistant Examples
### Overview
The image presents a collection of code snippets, likely from different proof assistants or formal verification systems. Each snippet demonstrates a specific lemma or theorem, along with the proof strategy used to establish its validity. The snippets are grouped under headings indicating the project or library they belong to.
### Components/Axes
The image is divided into five sections, each representing a different project or library:
1. **PFR** (top-left)
2. **SciLean** (middle-left)
3. **Coxeter** (bottom-left)
4. **MiniF2F** (top-right)
5. **Formal Book** (bottom-right)
Each section contains:
* A lemma or theorem declaration, including its name and type signature.
* A proof script, indicating the steps taken to prove the lemma or theorem.
* Highlighted terms, likely indicating key definitions or tactics used in the proof.
### Detailed Analysis or Content Details
**1. PFR (top-left)**
* **Lemma:** `condRho_of_translate`
* **Declaration:** `lemma condRho_of_translate {Ω S : Type*} [MeasureSpace Ω] (X : Ω → G) (Y : Ω → S) (A : Finset G) (s : G) : condRho (fun w ↦ X w + s) Y A = condRho X Y A := by`
* **Proof:** `simp only [condRho, rho_of_translate]`
* **Interpretation:** This lemma likely relates to conditional expectations or Radon-Nikodym derivatives in measure theory. The proof simplifies the expression using the definitions of `condRho` and `rho_of_translate`.
**2. SciLean (middle-left)**
* **Theorem:** `re_float`
* **Declaration:** `theorem re_float (a : Float) : RCLike.re a = a := by`
* **Proof:** `exact RCLike.re_eq_self_of_le le_rfl`
* **Interpretation:** This theorem states that the real part of a floating-point number `a` is equal to `a` itself. The proof uses the `RCLike.re_eq_self_of_le le_rfl` lemma, which likely establishes this property based on the ordering of real numbers.
**3. Coxeter (bottom-left)**
* **Lemma:** `invmap.of_eq`
* **Declaration:** `lemma invmap.of_eq {S:Set G} [CoxeterSystem G S] {s :S} : invmap S s = s := by`
* **Proof:**
* `simp [CoxeterSystem.Presentation.invmap]`
* `unfold CoxeterSystem.toMatrix`
* `apply CoxeterSystem.monoidLift.mapLift.of`
* **Interpretation:** This lemma states that the inverse map of an element `s` in a Coxeter system `S` is equal to `s` itself. The proof involves simplifying using the definition of `invmap`, unfolding the definition of `toMatrix`, and applying a lemma related to monoid lifting.
**4. MiniF2F (top-right)**
* **Theorem:** `induction_12dvd4expnp1p20`
* **Declaration:** `theorem induction_12dvd4expnp1p20 (n : ℕ) : 12 ∣ 4^(n+1) + 20 := by`
* **Proof:**
* `norm_num`
* `induction' n with n hn`
* `simp`
* `omega`
* **Interpretation:** This theorem states that 12 divides 4^(n+1) + 20 for all natural numbers `n`. The proof uses induction, simplification, and the `omega` tactic, which likely performs arithmetic reasoning.
* **Theorem:** `amc12a_2002_p6`
* **Declaration:** `theorem amc12a_2002_p6 (n : ℕ) (h₀ : 0 < n) : ∃ m, (m > n ∧ ∃ p, m * p ≤ m + p) := by`
* **Proof:**
* `lift n to ℕ+ using h₀`
* `cases' n with n`
* `exact ⟨_, lt_add_of_pos_right zero_lt_one, 1, by simp⟩`
* **Interpretation:** This theorem states that for any natural number `n` greater than 0, there exists an `m` such that `m` is greater than `n` and there exists a `p` such that `m * p` is less than or equal to `m + p`. The proof lifts `n` to the positive natural numbers, performs a case split, and uses a lemma related to the addition of positive numbers.
**5. Formal Book (bottom-right)**
* **Theorem:** `wedderburn`
* **Declaration:** `theorem wedderburn (h: Fintype R): IsField R := by`
* **Proof:** `apply Field.toIsField`
* **Interpretation:** This theorem states that if `R` is a finite type, then `R` is a field. The proof applies the `Field.toIsField` lemma, which likely establishes this property based on the finiteness of `R`.
### Key Observations
* The code snippets demonstrate a variety of proof techniques, including simplification, induction, case splitting, and application of existing lemmas.
* The highlighted terms indicate key definitions or tactics used in the proofs.
* The snippets cover a range of mathematical topics, including measure theory, real analysis, Coxeter systems, number theory, and field theory.
### Interpretation
The image provides a glimpse into the world of formal verification and proof assistants. It showcases how mathematical theorems and lemmas can be expressed and proven using code. The different projects and libraries represented in the image likely aim to formalize and verify various areas of mathematics. The use of proof assistants allows for rigorous and machine-checkable proofs, ensuring the correctness of mathematical results.
</details>
Figure 2: Case studies of LeanAgent's new proofs. LeanAgent shows an ability to work with these repositories, often retrieving the necessary premises (highlighted). For example, LeanAgent proves a sorry theorem from PFR, condRho_of_translate, by simply expanding definitions, showing its proving ability on the PFR repository. In addition, LeanAgent could prove re_float from SciLean during lifelong learning, while ReProver could not. Moreover, its MiniF2F proofs demonstrate its ability to generate relatively longer and more complex proofs for complex mathematics. LeanAgent's proofs of invmap.of_eq and wedderburn demonstrate its theorem-proving capabilities with abstract algebra premises.
Table 2: Accuracy in proving sorry theorems across repositories. Accuracy is calculated as (proven theorems / total sorry theorems). "LA" denotes LeanAgent and "ReProver+" denotes the setting where we update ReProver on all 23 repositories at once. "During" shows accuracy during lifelong learning, "Add. After" shows additional accuracy after lifelong learning, and "Total" shows the combined accuracy. "MIL" stands for Mathematics in Lean Source and "Hairy Ball" refers to the Hairy Ball Theorem repository. Repositories with no sorry theorems or no proven ones are not shown. The best accuracy for each repository is in bold. As noted previously, we progressively train on MiniF2F after the initial curriculum to demonstrate the use case of formalizing in a new repository after learning a curriculum. As such, we don't evaluate LeanAgent after lifelong learning on MiniF2F.
| Repository | sorry Theorems | LA Total | LA During | LA Add. After | ReProver | ReProver+ |
| --- | --- | --- | --- | --- | --- | --- |
| MIL | 29 | **72.4** | 48.3 | 24.1 | 48.3 | 55.2 |
| MiniF2F | 406 | **24.4** | 24.4 | - | 20.9 | 20.9 |
| Formal Book | 29 | **10.3** | 6.9 | 3.4 | 6.9 | **10.3** |
| SciLean | 294 | **9.2** | 7.5 | 1.7 | 8.2 | 8.5 |
| Hairy Ball | 14 | **7.1** | 0.0 | 7.1 | 0.0 | **7.1** |
| Coxeter | 15 | **6.7** | 6.7 | 0.0 | 0.0 | **6.7** |
| Carleson | 24 | **4.2** | 4.2 | 0.0 | **4.2** | **4.2** |
| Lean4 PDL | 30 | **3.3** | 3.3 | 0.0 | **3.3** | **3.3** |
| PFR | 37 | **2.7** | 2.7 | 0.0 | 0.0 | 0.0 |
In addition, a use case of LeanAgent is proving in a new repository after learning a curriculum; we progressively train on MiniF2F to demonstrate this. Note that we choose the Lean4 version of the MiniF2F repository (Yang, 2024) and disregard its separation into validation and test splits (reasoning in Appendix A.5). LeanAgent's success rate on the Lean4 version of the MiniF2F test set is also in Appendix A.5. Moreover, a mathematician could use LeanAgent for (1) an initial curriculum $A$ , and later (2) a sub-curriculum $B$ . LeanAgent can then help the mathematician prove in the repositories in curriculum $A+B$ . To demonstrate this scenario, we continue LeanAgent on a sub-curriculum $B$ of 8 repositories.
Results are in Table 2, with case studies in Figure 2. Appendix A.5 provides a more thorough discussion, including an ablation study, and contains the complete theorems and proofs relevant to this section.
LeanAgent demonstrates continuous generalizability and improvement in theorem-proving capabilities across multiple repositories. LeanAgent's proofs are a superset of the sorry theorems proved by ReProver in most cases. Moreover, to isolate the effect of curriculum learning, we compare LeanAgent against ReProver+, the ReProver model updated on all 23 repositories at once, and notice that LeanAgent outperforms it on several repositories, emphasizing the importance of curriculum learning. Overall, LeanAgent progresses from basic concepts (arithmetic, simple algebra) to advanced topics (abstract algebra, topology).
PFR. LeanAgent can prove a sorry theorem from this repository, while ReProver cannot. It also generalizes to a different commit (not included in progressive training), uncovering 7 system exploits. LeanAgent proves two theorems with just the rfl tactic, one of which ReProver cannot, and proves 5 sorry theorems with a $0=1$ placeholder theorem statement.
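To illustrate this proof style, consider a goal that holds by definitional unfolding; such a goal closes with `rfl` alone. The definitions below are hypothetical stand-ins for illustration, not actual PFR code:

```lean
-- Hypothetical definition for illustration only.
def translate (f : Int → Int) (c : Int) : Int → Int :=
  fun x => f (x + c)

-- The goal reduces to `f (x + c) = f (x + c)` by unfolding `translate`,
-- so `rfl` closes it with no further reasoning.
theorem translate_apply (f : Int → Int) (c x : Int) :
    translate f c x = f (x + c) := rfl
```

Proofs of this kind require the prover to recognize that the statement is definitionally trivial, which is exactly what expanding definitions accomplishes.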
SciLean. During lifelong learning, LeanAgent proves theorems related to fundamental algebraic structures, linear and affine maps, and measure theory basics. By the end of lifelong learning, it proves concepts in advanced function spaces, sophisticated bijections, and abstract algebraic structures.
Mathematics in Lean Source. During lifelong learning, LeanAgent proves theorems about basic algebraic structures and fundamental arithmetic properties. By the end of lifelong learning, it proves more complex theorems involving quantifier manipulation, set theory, and relations.
MiniF2F. ReProver demonstrates proficiency in basic arithmetic, elementary algebra, and simple calculus. However, by the end of lifelong learning, LeanAgent handles theorems with advanced number theory, sophisticated algebra, complex calculus and analysis, abstract algebra, and complex induction.
Sub-curriculum. In the Formal Book repository, LeanAgent progresses from proving basic real analysis and number theory theorems to more advanced abstract algebra, exemplified by its proof of Wedderburn's Little Theorem. For the Coxeter repository, LeanAgent proves a complex lemma about Coxeter systems, showcasing its increased understanding of group theory. In the Hairy Ball Theorem repository, LeanAgent proves a key step of the theorem, demonstrating improved performance in algebraic topology. Only LeanAgent can prove these theorems, demonstrating that it has much more advanced theorem-proving capabilities than ReProver.
4.3 Lifelong Learning Analysis
To our knowledge, no other lifelong learning frameworks for theorem proving exist in the literature. As such, we conduct an ablation study with six lifelong learning metrics to showcase LeanAgent's superior handling of the stability-plasticity tradeoff. These results help explain LeanAgent's superiority in sorry theorem proving performance. We compute these metrics for the original curriculum of 14 repositories.
Specifically, the ablation study consists of seven additional setups constructed from a combination of learning and dataset options. Options for learning setups are progressive training with or without EWC. Dataset setups involve a dataset order and construction. Options for dataset orders involve Single Repository or Merge All, where each dataset consists of all previous repositories and the new one. Given the most popular repositories on GitHub by star count, options for dataset construction include popularity order or curriculum order. Appendix A.3 shows these orders and additional repository details.
Metrics. We use six lifelong learning metrics: Windowed-Forgetting 5 (WF5), Forgetting Measure (FM), Catastrophic Forgetting Resilience (CFR), Expanded Backward Transfer (EBWT), Windowed-Plasticity 5 (WP5), and Incremental Plasticity (IP). A description of these metrics is in Table 3 (De Lange et al., 2023; Wang et al., 2024b; Díaz-Rodríguez et al., 2018). Our reasoning for considering these metrics is detailed in Appendix A.4.
Table 3: Description of lifelong learning metrics.
| Metric | Description | Better | Origin |
| --- | --- | --- | --- |
| WF5 | Measures forgetting over a 5-task window | Lower | Existing |
| FM | Average performance drop on old tasks | Lower | Existing |
| CFR | Ratio of min to max average test R@10 | Higher | Proposed |
| EBWT | Average improvement on old tasks after learning new ones | Higher | Existing |
| WP5 | Max average test R@10 increase over a 5-task window | Higher | Existing |
| IP | Rate of validation R@10 change per task | Higher | Proposed |
We describe why we introduce two new metrics to address specific aspects of lifelong learning in theorem proving:
- Catastrophic Forgetting Resilience (CFR). This metric captures LeanAgent's ability to maintain performance on its weakest task relative to its best performance, crucial in the presence of diverse mathematical domains.
- Incremental Plasticity (IP). IP provides a more granular view of plasticity than aggregate measures and is sensitive to the order of tasks, particularly relevant in lifelong learning for theorem proving.
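Both proposed metrics reduce to simple arithmetic over recorded R@10 scores. The sketch below is illustrative (the function names and example scores are ours, not the paper's implementation):

```python
# Sketch of the two proposed metrics over recorded R@10 scores.
# Names and example values are illustrative assumptions.

def cfr(avg_test_r10_per_task):
    """Catastrophic Forgetting Resilience: ratio of the minimum to the
    maximum average test R@10 observed across tasks (higher is better)."""
    return min(avg_test_r10_per_task) / max(avg_test_r10_per_task)

def ip(val_r10_per_task):
    """Incremental Plasticity: average rate of change of validation R@10
    per task, i.e. the mean of successive differences (higher is better).
    Sensitive to task order, unlike an aggregate plasticity measure."""
    diffs = [b - a for a, b in zip(val_r10_per_task, val_r10_per_task[1:])]
    return sum(diffs) / len(diffs)

# Example with made-up scores after each of four tasks:
scores = [60.0, 62.5, 61.0, 64.0]
print(round(cfr(scores), 3))  # 60.0 / 64.0 -> 0.938
print(round(ip(scores), 3))   # mean successive difference -> 1.333
```

Because IP averages signed per-task changes, a negative value directly signals that validation R@10 is trending downward over the curriculum, which is how we interpret the Merge All results below.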
In addition, these metrics in the Merge All strategy measure cumulative knowledge refinement rather than isolated task performance (details in Appendix A.4). Due to these interpretational differences, we analyze Single Repository and Merge All setups separately. We consider an improvement of at least 3% to be significant.
Table 4: Comparison of lifelong learning metrics across setups. The best scores for each metric are in bold.
| Metric | LeanAgent | Setup 1 | Setup 2 | Setup 3 | Setup 4 | Setup 5 | Setup 6 | Setup 7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| WF5 ($\downarrow$) | **0.18** | 7.60 | 7.17 | 0.73 | 15.83 | 2.23 | 13.34 | 5.82 |
| FM ($\downarrow$) | **0.85** | 6.53 | 4.04 | 2.11 | 10.50 | 4.06 | 11.44 | 3.80 |
| CFR ($\uparrow$) | 0.88 | 0.87 | 0.88 | 0.85 | 0.76 | **0.94** | 0.75 | 0.90 |
| EBWT ($\uparrow$) | **1.21** | 0.51 | 1.04 | 0.76 | -0.20 | 0.73 | -1.34 | -0.39 |
| WP5 ($\uparrow$) | 2.47 | 0.89 | 1.47 | **3.42** | 0.00 | 0.09 | 0.00 | 0.11 |
| IP ($\uparrow$) | 1.02 | 0.36 | 0.26 | **1.06** | -1.50 | -0.64 | -1.71 | -0.89 |

Single Repository setups: Setup 1 (No EWC, Popularity Order), Setup 2 (EWC, Popularity Order), Setup 3 (EWC, Curriculum Learning). Merge All setups: Setup 4 (No EWC, Popularity Order), Setup 5 (No EWC, Curriculum Learning), Setup 6 (EWC, Popularity Order), Setup 7 (EWC, Curriculum Learning).
Single Repository Analysis. We first analyze the Single Repository results from Table 4. LeanAgent demonstrates superior stability across multiple metrics. The WF5 metric is 75.34% lower for LeanAgent than the next best setup, suggesting it maintains performance over a window more effectively. Its FM score is 59.97% lower than Setup 3's, showcasing its resilience against catastrophic forgetting. Furthermore, LeanAgent, Setup 1, and Setup 2 demonstrate high and consistent resilience against catastrophic forgetting, with CFR values above 0.87 and minimal (±0.01) differences. This underscores LeanAgent's ability to continuously generalize over time. In addition, LeanAgent has a 16.25% higher EBWT, indicating its ability to continuously improve over time.
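For concreteness, these relative differences are plain percentage reductions computed from the Table 4 entries; for WF5, for instance (a sketch using the rounded table values):

$$\frac{0.73 - 0.18}{0.73} \approx 75.3\%,$$

which matches the quoted 75.34% up to rounding of the displayed scores.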
In contrast, Setup 3 exhibits characteristics of higher plasticity. It shows a 38.26% higher WP5 over LeanAgent, indicating a greater ability to rapidly adapt to new tasks in a window. This is complemented by its 3.98% higher IP over LeanAgent, suggesting a more pronounced improvement on new tasks over time. However, these plasticity gains come at a significant cost: Setup 3 suffers from more severe catastrophic forgetting, as evidenced by its significantly worse stability metrics compared to LeanAgent. This excessive plasticity in Setup 3 stems from EWC's inability to adapt parameter importance as theorem complexity increases. EWC preserves parameters important for simpler theorems, which may not be crucial for more complex ones. Consequently, these preserved parameters resist change while other parameters change rapidly for complex theorems. This forces the model to become more plastic overall, relying heavily on non-preserved parameters for new, complex theorems.
LeanAgent's favorable stability and EBWT scores make it the most suitable for lifelong learning in the Single Repository setting.
Merge All Analysis. Next, we analyze the Merge All setups from Table 4. Setup 5's WF5 metric is 61.68% lower than the next best setup (Setup 7), suggesting Setup 5 balances and retains knowledge across an expanding dataset most effectively. Furthermore, Setup 5's CFR score is 3.77% higher than that of Setup 7, again demonstrating high and consistent resilience in the face of an expanding, potentially more complex dataset. However, Setup 7 has a 6.44% lower FM score than Setup 5's, showcasing its ability to maintain performance on earlier data points. Moreover, Setup 5 is the only setup with a positive EBWT, indicating that learning new tasks improves performance on the entire historical dataset. The other setups have a negative EBWT, indicating performance degradation on earlier tasks after learning new ones.
Only Setups 5 and 7 have a non-zero WP5, suggesting the ability to adapt to the growing complexity of the combined dataset. The zero values for Setups 4 and 6 indicate that popularity order struggles to show improvement when dealing with merged data. However, although Setup 5 has the highest IP score with a 27.75% improvement over Setup 7, all 4 setups have negative IP values. This indicates a decrease in validation R@10 over time, suggesting that the Merge All strategy struggles to maintain performance.
Setup 5's favorable stability and EBWT scores suggest it is the best at balancing the retention of earlier knowledge with the adaptation to new data in a combined dataset. However, its negative IP value indicates a fundamental issue with its approach.
Comparative Analysis and Insights. Although the metrics have different interpretations in the Single Repository and Merge All settings, we can still draw some meaningful comparisons by focusing on overall trends and relative performance. We must consider that the negative IP values in Merge All setups indicate a significant issue. This drawback outweighs the potential benefits seen in other metrics like WP5, as it indicates a fundamental inability to maintain and improve performance in a continuously growing dataset. In contrast, LeanAgent demonstrates a positive IP, indicating its ability to incorporate new knowledge. This, combined with its superior stability and EBWT metrics relative to other Single Repository methods, suggests that LeanAgent is better suited than Setup 5 for continuous generalizability and improvement.
Consistency with sorry Theorem Proving Performance. This lifelong learning analysis is consistent with LeanAgent's sorry theorem proving performance. LeanAgent's superior stability metrics (WF5, FM, and CFR) explain its ability to maintain performance across diverse mathematical domains, as evidenced by its success in proving theorems from various repositories like SciLean, Mathematics in Lean Source, and PFR. Its high EBWT score aligns with its progression from basic concepts to advanced topics in theorem proving. While LeanAgent shows slightly lower plasticity (WP5 and IP) compared to some setups, this trade-off results in better overall performance, as reflected in its ability to prove a superset of sorry theorems compared to ReProver in most cases. This analysis demonstrates LeanAgent's overall superiority in lifelong learning for theorem proving.
5 Conclusion
We have presented LeanAgent, a lifelong learning framework for theorem proving that achieves continuous generalizability and improvement across diverse mathematical domains. Key components include a curriculum learning strategy, a progressive training approach, and custom dynamic database infrastructure. LeanAgent successfully generates formal proofs for 155 theorems where formal proofs were previously missing and uncovers 7 exploits across 23 Lean repositories, including from challenging mathematics. This highlights its potential to assist in formalizing complex proofs across multiple domains and identifying system exploits. For example, LeanAgent successfully proves challenging theorems in abstract algebra and algebraic topology. It outperforms the ReProver baseline in proving new sorry theorems, progressively learning from basic to complex mathematical concepts. Moreover, LeanAgent achieves strong scores on forgetting measures and backward transfer, explaining its continuous generalizability and improvement.
Future work could explore integration with Lean Copilot, providing real-time assistance with a mathematician's repositories. In addition, a limitation of LeanAgent is its inability to prove certain theorems due to a lack of data on specific topics, such as odeSolve.arg_x 0.semiAdjoint_rule in SciLean about ODEs. To solve this problem, future work could use reinforcement learning for synthetic data generation during curriculum construction. Moreover, future work could use LeanAgent with additional math LLMs and search strategies.
Acknowledgments
Adarsh Kumarappan is supported by the Summer Undergraduate Research Fellowships (SURF) program at Caltech. Anima Anandkumar is supported by the Bren named chair professorship, Schmidt AI 2050 senior fellowship, and ONR (MURI grant N00014-18-12624). We thank Terence Tao for detailed discussions and feedback that significantly improved this paper. We thank Zulip chat members for engaging in clarifying conversations that were incorporated into the paper.
References
- Alama et al. (2014) Jesse Alama, Tom Heskes, Daniel Kühlwein, Evgeni Tsivtsivadze, and Josef Urban. Premise Selection for Mathematics by Corpus Analysis and Kernel Methods. Journal of Automated Reasoning, 52(2):191–213, February 2014. ISSN 0168-7433, 1573-0670. doi: 10.1007/s10817-013-9286-5.
- Arana & Stafford (2023) Andrew Arana and Will Stafford. On the difficulty of discovering mathematical proofs. Synthese, 202(2):1–29, 2023. doi: 10.1007/s11229-023-04184-5.
- Avigad (2024) Jeremy Avigad. Avigad/mathematics_in_lean_source, August 2024.
- Azerbayev et al. (2024) Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An Open Language Model For Mathematics, March 2024.
- Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning, pp. 41–48, Montreal, Quebec, Canada, June 2009. ACM. ISBN 978-1-60558-516-1. doi: 10.1145/1553374.1553380.
- Borgeaud et al. (2022) Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George van den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, Diego de Las Casas, Aurelia Guy, Jacob Menick, Roman Ring, Tom Hennigan, Saffron Huang, Loren Maggiore, Chris Jones, Albin Cassirer, Andy Brock, Michela Paganini, Geoffrey Irving, Oriol Vinyals, Simon Osindero, Karen Simonyan, Jack W. Rae, Erich Elsen, and Laurent Sifre. Improving language models by retrieving from trillions of tokens, February 2022.
- Carneiro (2024) Mario Carneiro. Digama0/lean4lean, September 2024.
- Chang et al. (2021) Ernie Chang, Hui-Syuan Yeh, and Vera Demberg. Does the Order of Training Samples Matter? Improving Neural Data-to-Text Generation with Curriculum Learning. In Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty (eds.), Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pp. 727–733, Online, April 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.eacl-main.61.
- Chaudhry et al. (2019) Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient Lifelong Learning with A-GEM, January 2019.
- Chen et al. (2023) Jiefeng Chen, Timothy Nguyen, Dilan Gorur, and Arslan Chaudhry. Is forgetting less a good inductive bias for forward transfer?, March 2023.
- Chen et al. (2024) Yupeng Chen, Senmiao Wang, Zhihang Lin, Zeyu Qin, Yushun Zhang, Tian Ding, and Ruoyu Sun. MoFO: Momentum-Filtered Optimizer for Mitigating Forgetting in LLM Fine-Tuning. 2024. doi: 10.48550/ARXIV.2407.20999.
- Cirik et al. (2016) Volkan Cirik, Eduard Hovy, and Louis-Philippe Morency. Visualizing and Understanding Curriculum Learning for Long Short-Term Memory Networks, November 2016.
- Claude Team (2024) Claude Team. Introducing claude 3.5 sonnet, June 2024. URL https://www.anthropic.com/news/claude-3-5-sonnet.
- Community (2024a) Lean Community. Leanprover-community/lean-liquid. leanprover-community, September 2024a.
- Community (2024b) Lean Community. Leanprover-community/lean-perfectoid-spaces. leanprover-community, August 2024b.
- Con-nf (2024) Con-nf. Leanprover-community/con-nf: A formal consistency proof of Quine's set theory New Foundations. https://github.com/leanprover-community/con-nf/tree/main, 2024.
- Coxeter (2024) Coxeter. NUS-Math-Formalization/coxeter at 96af8aee7943ca8685ed1b00cc83a559ea389a97. https://github.com/NUS-Math-Formalization/coxeter/tree/96af8aee7943ca8685ed1b00cc83a559ea389a97, 2024.
- De Lange et al. (2023) Matthias De Lange, Gido van de Ven, and Tinne Tuytelaars. Continual evaluation for lifelong learning: Identifying the stability gap, March 2023.
- De Moura et al. (2015) Leonardo De Moura, Soonho Kong, Jeremy Avigad, Floris Van Doorn, and Jakob Von Raumer. The Lean Theorem Prover (System Description). In Amy P. Felty and Aart Middeldorp (eds.), Automated Deduction - CADE-25, volume 9195, pp. 378–388, Cham, 2015. Springer International Publishing. ISBN 978-3-319-21400-9 978-3-319-21401-6. doi: 10.1007/978-3-319-21401-6_26.
- DeepMind (2024) Google DeepMind. Google-deepmind/debate. Google DeepMind, August 2024.
- Díaz-Rodríguez et al. (2018) Natalia Díaz-Rodríguez, Vincenzo Lomonaco, David Filliat, and Davide Maltoni. Don't forget, there is more than forgetting: New metrics for Continual Learning, October 2018.
- Dillies (2024) Yaël Dillies. YaelDillies/LeanAPAP, September 2024.
- Dohare et al. (2024) Shibhansh Dohare, J. Fernando Hernandez-Garcia, Qingfeng Lan, Parash Rahman, A. Rupam Mahmood, and Richard S. Sutton. Loss of plasticity in deep continual learning. Nature, 632(8026):768–774, August 2024. ISSN 0028-0836, 1476-4687. doi: 10.1038/s41586-024-07711-7.
- Fermat's Last Theorem (2024) Fermat's Last Theorem. ImperialCollegeLondon/FLT. Imperial College London, September 2024.
- Firsching, Moritz (2024) Firsching, Moritz. Mo271/FormalBook: Formalizing "Proofs from THE BOOK", 2024.
- Formal Logic (2024) Formalized Formal Logic. FormalizedFormalLogic/Foundation. FormalizedFormalLogic, September 2024.
- Gadgil (2024) Siddhartha Gadgil. Siddhartha-gadgil/Saturn: Experiments with SAT solvers with proofs in Lean 4. https://github.com/siddhartha-gadgil/Saturn, 2024.
- Gattinger (2024) Malvin Gattinger. M4lvin/lean4-pdl, September 2024.
- Hairy Ball Theorem (2024) Hairy Ball Theorem. Corent1234/hairy-ball-theorem-lean at a778826d19c8a7ddf1d26beeea628c45450612e6. https://github.com/corent1234/hairy-ball-theorem-lean/tree/a778826d19c8a7ddf1d26beeea628c45450612e6, 2024.
- Irving et al. (2016) Geoffrey Irving, Christian Szegedy, Alexander A Alemi, Niklas Een, Francois Chollet, and Josef Urban. DeepMath - Deep Sequence Models for Premise Selection. In Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc., 2016.
- Jiang et al. (2022) Albert Q. Jiang, Wenda Li, Szymon Tworkowski, Konrad Czechowski, Tomasz Odrzygóźdź, Piotr Miłoś, Yuhuai Wu, and Mateja Jamnik. Thor: Wielding Hammers to Integrate Language Models and Automated Theorem Provers. 2022. doi: 10.48550/ARXIV.2205.10893.
- Jiang et al. (2024) Gangwei Jiang, Caigao Jiang, Zhaoyi Li, Siqiao Xue, Jun Zhou, Linqi Song, Defu Lian, and Ying Wei. Interpretable Catastrophic Forgetting of Large Language Model Fine-tuning via Instruction Vector. 2024. doi: 10.48550/ARXIV.2406.12227.
- Kevin Buzzard (2019) Kevin Buzzard. The Future of Mathematics? Professor Kevin Buzzard - 30 May 2019, June 2019.
- Kim et al. (2023) Seungyeon Kim, Ankit Singh Rawat, Manzil Zaheer, Sadeep Jayasumana, Veeranjaneyulu Sadhanala, Wittawat Jitkrittum, Aditya Krishna Menon, Rob Fergus, and Sanjiv Kumar. EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval. 2023. doi: 10.48550/ARXIV.2301.12005.
- Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, March 2017. ISSN 0027-8424, 1091-6490. doi: 10.1073/pnas.1611835114.
- Kontorovich (2024) Alex Kontorovich. AlexKontorovich/PrimeNumberTheoremAnd, August 2024.
- Guillaume Lample, Marie-Anne Lachaux, Thibaut Lavril, Xavier Martinet, Amaury Hayat, Gabriel Ebner, Aurélien Rodriguez, and Timothée Lacroix. HyperTree Proof Search for Neural Theorem Proving.
- Li et al. (2024) Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Connor Holmes, Cheng Li, and Yuxiong He. DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and Training Efficiency via Efficient Data Sampling and Routing. Proceedings of the AAAI Conference on Artificial Intelligence, 38(16):18490–18498, March 2024. ISSN 2374-3468, 2159-5399. doi: 10.1609/aaai.v38i16.29810.
- Li & Hoiem (2017) Zhizhong Li and Derek Hoiem. Learning without Forgetting, February 2017.
- Lin et al. (2024) Haohan Lin, Zhiqing Sun, Yiming Yang, and Sean Welleck. Lean-STaR: Learning to Interleave Thinking and Proving, August 2024.
- Liu et al. (2023) Chengwu Liu, Jianhao Shen, Huajian Xin, Zhengying Liu, Ye Yuan, Haiming Wang, Wei Ju, Chuanyang Zheng, Yichun Yin, Lin Li, Ming Zhang, and Qun Liu. FIMO: A Challenge Formal Dataset for Automated Theorem Proving, December 2023.
- Lopez-Paz & Ranzato (2017) David Lopez-Paz and Marc'Aurelio Ranzato. Gradient Episodic Memory for Continual Learning. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- Loshchilov & Hutter (2017) I. Loshchilov and F. Hutter. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, November 2017.
- Lu et al. (2024) Minghai Lu, Benjamin Delaware, and Tianyi Zhang. Proof Automation with Large Language Models, September 2024.
- Lu et al. (2022) Shuai Lu, Nan Duan, Hojae Han, Daya Guo, Seung-won Hwang, and Alexey Svyatkovskiy. ReACC: A Retrieval-Augmented Code Completion Framework. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6227–6240, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.431.
- mathlib4 Community (2024) The mathlib4 Community. Leanprover-community/mathlib4. leanprover-community, September 2024.
- Mendez & Eaton (2021) Jorge A Mendez and Eric Eaton. Lifelong Learning of Compositional Structures. 2021.
- Mikuła et al. (2024) Maciej Mikuła, Szymon Tworkowski, Szymon Antoniak, Bartosz Piotrowski, Albert Qiaochu Jiang, Jin Peng Zhou, Christian Szegedy, Łukasz Kuciński, Piotr Miłoś, and Yuhuai Wu. Magnushammer: A Transformer-Based Approach to Premise Selection, March 2024.
- Mizuno (2024) Yuma Mizuno. Yuma-mizuno/lean-math-workshop. https://github.com/yuma-mizuno/lean-math-workshop, 2024.
- Monnerjahn (2024) Ludwig Monnerjahn. Louis-Le-Grand/Formalisation-of-constructable-numbers, September 2024.
- Murphy (2024) Logan Murphy. Loganrjmurphy/LeanEuclid, September 2024.
- OpenAI (2024) OpenAI. OpenAI o1 System Card, September 2024.
- Piotrowski et al. (2023) Bartosz Piotrowski, Ramon Fernández Mir, and Edward Ayers. Machine-Learned Premise Selection for Lean. In Automated Reasoning with Analytic Tableaux and Related Methods: 32nd International Conference, TABLEAUX 2023, Prague, Czech Republic, September 18–21, 2023, Proceedings, pp. 175–186, Berlin, Heidelberg, September 2023. Springer-Verlag. ISBN 978-3-031-43512-6. doi: 10.1007/978-3-031-43513-3_10.
- Polu et al. (2022) Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and I. Sutskever. Formal Mathematics Statement Curriculum Learning. ArXiv, February 2022.
- Renshaw (2024) David Renshaw. Dwrensha/compfiles: Catalog Of Math Problems Formalized In Lean. https://github.com/dwrensha/compfiles, September 2024.
- Shao et al. (2024) Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, April 2024.
- Shin et al. (2017) Hanul Shin, Jung Kwon Lee, Jaehong Kim, and Jiwon Kim. Continual Learning with Deep Generative Replay, December 2017.
- Skřivan (2024) Tomáš Skřivan. Lecopivo/SciLean: Scientific computing in Lean 4. https://github.com/lecopivo/SciLean/tree/master, September 2024.
- Song et al. (2024) Peiyang Song, Kaiyu Yang, and Anima Anandkumar. Towards large language models as copilots for theorem proving in lean, 2024. URL https://arxiv.org/abs/2404.12534.
- Valentin I. Spitkovsky, Hiyan Alshawi, and Daniel Jurafsky. Baby Steps: How "Less is More" in Unsupervised Dependency Parsing.
- Subramanian et al. (2017) Sandeep Subramanian, Sai Rajeswar, Francis Dutil, Chris Pal, and Aaron Courville. Adversarial Generation of Natural Language. In Phil Blunsom, Antoine Bordes, Kyunghyun Cho, Shay Cohen, Chris Dyer, Edward Grefenstette, Karl Moritz Hermann, Laura Rimell, Jason Weston, and Scott Yih (eds.), Proceedings of the 2nd Workshop on Representation Learning for NLP, pp. 241â251, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi: 10.18653/v1/W17-2629.
- Terence Tao. Teorth/asymptotic.
- Tao (2024a) Terence Tao. Teorth/newton, June 2024a.
- Tao (2024b) Terence Tao. Teorth/symmetric_project, July 2024b.
- Tao et al. (2024) Terence Tao, Pietro Monticone, Lorenzo Luccioli, and Rémy Degenne. Teorth/pfr, August 2024.
- Thakur et al. (2024) Amitayush Thakur, George Tsoukalas, Yeming Wen, Jimmy Xin, and Swarat Chaudhuri. An In-Context Learning Agent for Formal Theorem-Proving, August 2024.
- Szymon Tworkowski, Maciej Mikuła, Tomasz Odrzygóźdź, Konrad Czechowski, Szymon Antoniak, Albert Q. Jiang, Christian Szegedy, Łukasz Kuciński, Piotr Miłoś, and Yuhuai Wu. Formal Premise Selection With Language Models.
- van de Ven et al. (2024) Gido M. van de Ven, Nicholas Soures, and Dhireesha Kudithipudi. Continual Learning and Catastrophic Forgetting, March 2024.
- van Doorn (2024) Floris van Doorn. Fpvandoorn/carleson, September 2024.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is All you Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- Wang et al. (2023) Haiming Wang, Ye Yuan, Zhengying Liu, Jianhao Shen, Yichun Yin, Jing Xiong, Enze Xie, Han Shi, Yujun Li, Lin Li, Jian Yin, Zhenguo Li, and Xiaodan Liang. DT-Solver: Automated Theorem Proving with Dynamic-Tree Sampling Guided by Proof-level Value Function. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 12632–12646, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.706.
- Wang et al. (2024a) Haiming Wang, Huajian Xin, Zhengying Liu, Wenda Li, Yinya Huang, Jianqiao Lu, Zhicheng Yang, Jing Tang, Jian Yin, Zhenguo Li, and Xiaodan Liang. Proving Theorems Recursively, May 2024a.
- Wang et al. (2024b) Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A Comprehensive Survey of Continual Learning: Theory, Method and Application. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(8):5362–5383, August 2024b. ISSN 0162-8828, 2160-9292, 1939-3539. doi: 10.1109/TPAMI.2024.3367329.
- Wang & Deng (2020) Mingzhe Wang and Jia Deng. Learning to Prove Theorems by Learning to Generate Theorems. ArXiv, February 2020.
- Wang et al. (2024c) Ruida Wang, Jipeng Zhang, Yizhen Jia, Rui Pan, Shizhe Diao, Renjie Pi, and Tong Zhang. TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts, October 2024c.
- Wieser (2024) Eric Wieser. Eric-wieser/lean-matrix-cookbook: The matrix cookbook, proved in the Lean theorem prover. https://github.com/eric-wieser/lean-matrix-cookbook, 2024.
- Xin et al. (2024a) Huajian Xin, Daya Guo, Zhihong Shao, Zhizhou Ren, Qihao Zhu, Bo Liu, Chong Ruan, Wenda Li, and Xiaodan Liang. DeepSeek-Prover: Advancing Theorem Proving in LLMs through Large-Scale Synthetic Data, May 2024a.
- Xin et al. (2024b) Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, and Chong Ruan. DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search, August 2024b.
- Xue et al. (2022) Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. ByT5: Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. Transactions of the Association for Computational Linguistics, 10:291–306, 2022. doi: 10.1162/tacl_a_00461.
- Yang (2024) Kaiyu Yang. Yangky11/miniF2F-lean4. https://github.com/yangky11/miniF2F-lean4/tree/main, 2024.
- Yang et al. (2023) Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. 2023. doi: 10.48550/ARXIV.2306.15626.
- Zaremba & Sutskever (2015) Wojciech Zaremba and Ilya Sutskever. Learning to Execute, February 2015.
- Zeta 3 Irrational (2024) Zeta 3 Irrational. Ahhwuhu/zeta_3_irrational at 3d68ddd90434a398c9a72f30d50c57f15a0118c7. https://github.com/ahhwuhu/zeta_3_irrational/tree/3d68ddd90434a398c9a72f30d50c57f15a0118c7, 2024.
- Jiawei Zhao, Zhenyu Zhang, Beidi Chen, Zhangyang Wang, Anima Anandkumar, and Yuandong Tian. GaLore: Memory-Efficient LLM Training by Gradient Low-Rank Projection. URL http://arxiv.org/abs/2403.03507.
- Zhou et al. (2024) Jin Peng Zhou, Charles Staats, Wenda Li, Christian Szegedy, Kilian Q. Weinberger, and Yuhuai Wu. Don't Trust: Verify – Grounding LLM Quantitative Reasoning with Autoformalization, March 2024.
- Zhou et al. (2023) Shuyan Zhou, Uri Alon, Frank F. Xu, Zhiruo Wang, Zhengbao Jiang, and Graham Neubig. DocPrompting: Generating Code by Retrieving the Docs, February 2023.
- Zombori et al. (2019) Zsolt Zombori, Adrian Csiszarik, Henryk Michalewski, Cezary Kaliszyk, and Josef Urban. Curriculum Learning and Theorem Proving. 2019.
Appendix A Appendix
A.1 Further Methodology Details
Repository Scanning and Data Extraction. We use the GitHub API to query for Lean repositories based on sorting parameters (e.g., by repository stars or most recently updated repositories). We maintain a list of known repositories to avoid; the list can be updated to allow LeanAgent to re-analyze the same repository on a new commit or Lean version.
We clone each identified repository locally using the Git version control system. To ensure compatibility with our theorem-proving pipeline, we check the Lean version required by each repository and compare it with the supported versions of our system. If the required version is incompatible, we skip the repository and move on to the next one. Otherwise, LeanAgent switches its Lean version to match the repository's version. This version checking is performed by parsing the repository's configuration files and extracting the specified Lean version.
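As a sketch of this version check, assuming the repository pins its toolchain in the standard `lean-toolchain` file (the helper names are ours, not LeanAgent's actual API):

```python
import re


def required_lean_version(toolchain_line: str) -> str:
    """Extract the Lean version pinned in a repository's `lean-toolchain` file.

    Repositories typically pin a toolchain such as `leanprover/lean4:v4.7.0`;
    we only need the version suffix to compare against the versions our
    pipeline supports.
    """
    match = re.search(r"lean4:v?(\d+\.\d+\.\d+(?:-rc\d+)?)", toolchain_line.strip())
    if match is None:
        raise ValueError(f"unrecognized toolchain: {toolchain_line!r}")
    return match.group(1)


def is_compatible(version: str, supported: set[str]) -> bool:
    """Skip the repository unless its pinned version is one we support."""
    return version in supported
```

For example, `required_lean_version("leanprover/lean4:v4.7.0")` yields `"4.7.0"`, which is then checked against the supported set before cloning proceeds.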
Dynamic Database Management. This database contains many key features that are useful in our setting. For example, it can add new repositories, update existing ones, and generate merged datasets from multiple repositories with customizable splitting strategies. In addition, it can query specific theorems or premises across repositories, track the progress of proof attempts (including the proof status of sorry theorems), and analyze the structure and content of Lean proofs, including tactic sequences and proof states.
The database keeps track of various details: Repository metadata; theorems categorized as already proven, sorry theorems that are proven, or sorry theorems that are unproven; premise files with their imports and individual premises; traced files for tracking which files have been processed; detailed theorem information, including file path, start/end positions, and full statements; and traced tactics with annotated versions, including the proof state before and after application.
If we encounter duplicate theorems between repositories while merging repositories, we use the theorem from the repository most recently added to the database. We deduplicate premise files and traced files by choosing the first one encountered while merging the repositories. We also generate metadata containing details of all the repositories used to generate the dataset and statistics regarding the theorems, premise files, and traced files in the dataset, such as the total number of theorems.
We provide the user with many options to generate a dataset. To generate the set of theorems and proofs, the default option is to simply use the theorems, proofs, premise files, and traced files from the current curriculum repository in the database. Specifically, we use the random split from LeanDojo to create training, validation, and testing sets. We refrain from using the novel split from LeanDojo, as we would like LeanAgent to learn as much as possible from a repository to perform well on its hardest theorems. The data in the splits include details about each theorem and its proof: the URL and commit of the source repository, the file path of the theorem, the full name of the theorem, the theorem statement, the start and end positions in the source file, and a list of traced tactics with annotations. The validation and test splits each contain 2% of the total theorems and proofs, following the methodology from LeanDojo. Moreover, the database uses a topological sort over the traced files in the repository to generate the premise corpus. This corpus is a JSON Lines file, where each line is a JSON object consisting of a path to a Lean source file, the file's imports, and the file's premise statements and definitions.
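The corpus-generation step can be sketched as follows, with `imports` mapping each traced file to the files it imports; the field names follow the description above, but the real records are richer:

```python
import json
from graphlib import TopologicalSorter


def build_corpus(imports: dict[str, list[str]],
                 premises: dict[str, list[str]]) -> str:
    """Emit a JSON Lines premise corpus in dependency order.

    `imports` maps each Lean file path to the files it imports, and
    `premises` maps each path to its premise statements. Both names are
    illustrative stand-ins for the database's records.
    """
    # TopologicalSorter treats the mapped values as predecessors, so every
    # file is emitted after the files it imports.
    order = TopologicalSorter(imports).static_order()
    lines = []
    for path in order:
        lines.append(json.dumps({
            "path": path,
            "imports": imports.get(path, []),
            "premises": premises.get(path, []),
        }))
    return "\n".join(lines)
```

Each output line is one JSON object per file, so downstream consumers can stream the corpus without loading it whole.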
Progressive Training of the Retriever. We describe some additional steps for progressive training. To precompute the embeddings, we use a single forward pass with batch processing to serialize and tokenize premises from the entire corpus. Then, we use the retriever's encoder to process the batches and generate embeddings.
sorry Theorem Proving. We start by processing the premise corpus to use it more efficiently during premise retrieval. This involves initializing a directed dependency graph to represent each file path in the corpus, adding files as nodes and imports as edges, and creating a transitive closure of this graph. We also track all premises encountered during this process, building a comprehensive knowledge base.
Crucially, we limit retrieval to a subset of all available premises to improve the quality of the results. Specifically, we choose the top 25% of accessible and relevant premises, following ReProver's method.
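A sketch of this premise filtering, using plain dictionaries in place of the dependency graph; the ReProver scoring itself is abstracted into precomputed scores, and the helper names are ours:

```python
def accessible_premises(imports: dict[str, list[str]],
                        premises: dict[str, list[str]],
                        target: str) -> list[str]:
    """Premises reachable from `target` through its transitive imports.

    The real pipeline builds a directed dependency graph and takes its
    transitive closure; this DFS reproduces that closure for one file.
    """
    seen: set[str] = set()
    stack = [target]
    while stack:
        path = stack.pop()
        if path in seen:
            continue
        seen.add(path)
        stack.extend(imports.get(path, []))
    return [p for path in sorted(seen) for p in premises.get(path, [])]


def top_quarter(scored: list[tuple[str, float]]) -> list[str]:
    """Keep only the top 25% of premises by relevance score."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    keep = max(1, len(ranked) // 4)
    return [premise for premise, _ in ranked[:keep]]
```

Restricting the candidate pool this way keeps retrieval focused on premises that are both importable from the current file and likely to be relevant.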
Proof Integration and Pull Request Generation. We integrate the generated proofs into the original Lean files and create pull requests to propose the changes to the repository owners. This aids the development of these repositories and functions as more training data for future research.
To achieve this, in a temporary Git branch, we iterate over the Lean files and locate the sorry keywords corresponding to the generated proofs. We then replace these sorry keywords with the actual proof text, working from the bottom of each file upward to preserve the position of theorems. After integrating the proofs, we commit our changes, push them, and create a pull request for the repository on GitHub.
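A minimal sketch of the bottom-up replacement, with `sorry` locations simplified to character offsets (the real pipeline uses the start/end positions recorded in the database):

```python
def fill_sorries(source: str, proofs: dict[int, str]) -> str:
    """Replace each `sorry` keyword with its generated proof, bottom-up.

    `proofs` maps the character offset of a `sorry` to its replacement
    text. Iterating from the largest offset downward keeps the remaining
    offsets valid as the text before them never shifts.
    """
    for offset in sorted(proofs, reverse=True):
        assert source[offset:offset + 5] == "sorry", "offset must point at a sorry"
        source = source[:offset] + proofs[offset] + source[offset + 5:]
    return source
```

Replacing top-down instead would invalidate every later offset as soon as one proof differed in length from the five characters of `sorry`, which is why the bottom-up order matters.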
A.2 Experiment Implementation Details
We use ReProver's retriever trained on the random split from LeanDojo. We use four NVIDIA A100 GPUs with 80GB of memory each for progressive training. LeanAgent uses a distributed architecture leveraging PyTorch Lightning and Ray for parallel processing. We use bfloat16 mixed precision and optimize with AdamW (Loshchilov & Hutter, 2017) with an effective batch size of 16 (achieved through a batch size of 4 with gradient accumulation over 4 steps). In the first 1,000 steps, the learning rate warms up linearly from 0 to the maximum value of $10^{-3}$ . Then it decays to 0 using a cosine schedule. In addition, we apply gradient clipping with a value of 1.0. Just as ReProver does during training, we sample 3 negative premises per example, including 1 in-file negative premise. The maximum sequence length for the retriever is set to 1024 tokens. The maximum sequence length for the generator is set to 512 tokens for input and 128 tokens for output.
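The learning-rate schedule can be sketched as follows; `total_steps` is an assumed horizon, since the real value depends on the size of the current training split:

```python
import math


def learning_rate(step: int, max_lr: float = 1e-3,
                  warmup_steps: int = 1000, total_steps: int = 10_000) -> float:
    """Linear warmup from 0 to `max_lr` over the first 1,000 steps,
    then cosine decay back to 0 over the remaining steps."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

At step 0 the rate is 0, at step 1,000 it peaks at $10^{-3}$, and it reaches 0 again at the final step.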
The prover uses a best-first search strategy with no limit on the maximum number of expansions of the search tree. It generates 64 tactic candidates and retrieves 100 premises for each proof state. LeanAgent uses ReProver's tactic generator for the experiments. We generate tactics with a beam search of size 5. We use 4 CPU workers, 1 per GPU. Due to the wide variety of repositories and experimental setups that we tested, the time per experiment varied widely. For example, the experiments in Table 10 took from 4 to 9 days to complete.
Furthermore, we do not compare LeanAgent with any existing LLM-based prover besides ReProver because LeanAgent is a framework, not a model. As mentioned previously, it can be used with any LLM. As such, a comparison would be impractical for reasons including differences in data, pre-training, and fine-tuning. We only compare with ReProver because we use ReProver's retriever as the starting one in LeanAgent, allowing for a more faithful comparison.
Moreover, we do not compare with Aesop because it is not an ML model; we aim to improve upon ML research for theorem proving, such as ReProver. In addition, Aesop is not a framework, whereas ReProver serves as the starting point of the LeanAgent framework. Rather than benchmarking against existing tools, we aim to understand how lifelong learning can work in theorem proving.
Furthermore, although LeanAgent can work with other LLMs such as Llemma (Azerbayev et al., 2024) and DeepSeek-Prover (Xin et al., 2024a), using these LLMs in our work would require architectural modifications that go beyond the scope of our current work. For example, the 7B DeepSeek-Prover model as well as the 7B and 34B Llemma models are not retrieval-based. As such, rather than progressively training a retriever, we would progressively train the entire model. This may be feasible with methods such as Gradient Low-Rank Projection (Zhao et al., 2024), but it would lead to fundamentally different usage than we currently demonstrate. Specifically, rather than using a best-first tree search as we do with ReProver's retriever and tactic generator, we may instead need to generate the entire proof at once. This setup is too dissimilar from our current evaluation framework for the results to be directly comparable.
In addition, we would like to note that because LeanAgent does not claim to contribute a new search algorithm, it can be used with other search strategies such as Hypertree Proof Search (Lample et al., 2022). However, the source code for Hypertree Proof Search was only recently released on GitHub, which is why we have not used it thus far.
Moreover, the objective function for Elastic Weight Consolidation (EWC) is given by:
$$
L(\theta) = L_{B}(\theta) + \frac{\lambda}{2} \sum_{i} F_{i} \left(\theta_{i} - \theta_{A,i}\right)^{2}
$$
where $L_{B}(\theta)$ is the loss for the current task B, $i$ indexes the parameters, $\theta_{A,i}$ are the parameters from the previous task A, $F_{i}$ is the $i$-th diagonal entry of the Fisher information matrix, and $\lambda$ is a hyperparameter that controls the strength of the EWC penalty. For the setups that use EWC, we performed a grid search over $\lambda$ values in {0.01, 0.1, 1, 10, 100}. For each value, we ran Setup 2 on separate testing repositories. We found 0.1 to yield the best overall stability and plasticity scores.
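A minimal sketch of this objective, assuming flat per-parameter lists and the diagonal Fisher approximation (the helper name is ours):

```python
def ewc_loss(task_loss: float, params: list[float], old_params: list[float],
             fisher: list[float], lam: float = 0.1) -> float:
    """EWC objective: L(theta) = L_B(theta) + (lam/2) * sum_i F_i * (theta_i - theta_A_i)**2.

    `lam` defaults to 0.1, the value selected by the grid search
    described above; the lists are illustrative flat parameters.
    """
    penalty = sum(f * (p - q) ** 2
                  for f, p, q in zip(fisher, params, old_params))
    return task_loss + 0.5 * lam * penalty
```

The quadratic penalty anchors parameters with high Fisher information near their previous-task values, trading some plasticity for stability.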
A.3 Repository Details
Table 5: Additional repository descriptions
Table 6: Repository commits. Formalization of Const. Numbers denotes Formalization of Constructable Numbers.
| Repository | Commit |
| --- | --- |
| PFR | fa398a5b853c7e94e3294c45e50c6aee013a2687 |
| Hairy Ball Theorem | a778826d19c8a7ddf1d26beeea628c45450612e6 |
| Coxeter | 96af8aee7943ca8685ed1b00cc83a559ea389a97 |
| Mathematics in Lean Source | 5297e0fb051367c48c0a084411853a576389ecf5 |
| Formal Book | 6fbe8c2985008c0bfb30050750a71b90388ad3a3 |
| MiniF2F | 9e445f5435407f014b88b44a98436d50dd7abd00 |
| SciLean | 22d53b2f4e3db2a172e71da6eb9c916e62655744 |
| Carleson | bec7808b907190882fa1fa54ce749af297c6cf37 |
| Lean4 PDL | c7f649fe3c4891cf1a01c120e82ebc5f6199856e |
| Prime Number Theorem And | 29baddd685660b5fedd7bd67f9916ae24253d566 |
| Compfiles | f99bf6f2928d47dd1a445b414b3a723c2665f091 |
| FLT | b208a302cdcbfadce33d8165f0b054bfa17e2147 |
| Debate | 7fb39251b705797ee54e08c96177fabd29a5b5a3 |
| Lean4Lean | 05b1f4a68c5facea96a5ee51c6a56fef21276e0f |
| Matrix Cookbook | f15a149d321ac99ff9b9c024b58e7882f564669f |
| Math Workshop | 5acd4b933d47fd6c1032798a6046c1baf261445d |
| LeanEuclid | f1912c3090eb82820575758efc31e40b9db86bb8 |
| Foundation | d5fe5d057a90a0703a745cdc318a1b6621490c21 |
| Con-nf | 00bdc85ba7d486a9e544a0806a1018dd06fa3856 |
| Saturn | 3811a9dd46cdfd5fa0c0c1896720c28d2ec4a42a |
| Zeta 3 Irrational | 914712200e463cfc97fe37e929d518dd58806a38 |
| Formalization of Const. Numbers | 01ef1f22a04f2ba8081c5fb29413f515a0e52878 |
| LeanAPAP | 951c660a8d7ba8e39f906fdf657674a984effa8b |
Table 7: Repository orders (initial curriculum). Note that Popularity Order is by star count on August 20, 2024.
| Rank | Curriculum Order | Popularity Order |
| --- | --- | --- |
| 1 | Compfiles | SciLean |
| 2 | Mathematics in Lean Source | FLT |
| 3 | Prime Number Theorem And | PFR |
| 4 | Math Workshop | Prime Number Theorem And |
| 5 | FLT | Compfiles |
| 6 | PFR | Debate |
| 7 | SciLean | Mathematics in Lean Source |
| 8 | Debate | Lean4Lean |
| 9 | Matrix Cookbook | Matrix Cookbook |
| 10 | Con-nf | Math Workshop |
| 11 | Foundation | LeanEuclid |
| 12 | Saturn | Foundation |
| 13 | LeanEuclid | Con-nf |
| 14 | Lean4Lean | Saturn |
Table 8: Curriculum order (sub-curriculum)
| Rank | Repository |
| --- | --- |
| 1 | Zeta 3 Irrational |
| 2 | Formal Book |
| 3 | Formalization of Constructable Numbers |
| 4 | Carleson |
| 5 | LeanAPAP |
| 6 | Hairy Ball Theorem |
| 7 | Coxeter |
| 8 | Lean4 PDL |
Table 9: Repository theorem and premise counts
| Repository | Theorems | Premises |
| --- | --- | --- |
| PFR | 74306 | 109855 |
| Hairy Ball Theorem | 73026 | 131217 |
| Coxeter | 71273 | 127608 |
| Mathematics in Lean Source | 78886 | 117699 |
| Formal Book | 74654 | 112458 |
| MiniF2F | 71313 | 127202 |
| SciLean | 72244 | 129711 |
| Carleson | 73851 | 109334 |
| Lean4 PDL | 20599 | 46400 |
| Prime Number Theorem And | 79147 | 115751 |
| Compfiles | 121391 | 178108 |
| FLT | 75082 | 114830 |
| Debate | 68853 | 103684 |
| Lean4Lean | 2559 | 22689 |
| Matrix Cookbook | 67585 | 102294 |
| Math Workshop | 76942 | 115458 |
| LeanEuclid | 15423 | 40555 |
| Foundation | 25047 | 57964 |
| Con-nf | 29489 | 64177 |
| Saturn | 10982 | 34497 |
| Zeta 3 Irrational | 120174 | 176332 |
| Formalization of Const. Numbers | 74050 | 109645 |
| LeanAPAP | 71090 | 109477 |
Repository Statistics and Information. Additional repository descriptions are in Table 5. The commits we used for experiments are in Table 6. Moreover, the repository orders are detailed in Table 7 and Table 8. Furthermore, the total number of theorems and premises per repository (including dependencies) are in Table 9.
Repository Selection Process. Many repositories have issues such as incompatibilities with LeanDojo, unsupported Lean versions, and build failures. As such, our process for choosing the 14 repositories in the first curriculum was simply using LeanAgent to extract information from the most popular repositories on GitHub. We disregard incompatible and inapplicable ones, such as those with no theorems. We performed this process on August 20, 2024. We performed a similar process for the eight repositories in the second curriculum with two differences: (1) We checked that the number of sorry theorems visible from GitHub was at least 10. This narrowed down the available repositories significantly. (2) We included some more recently updated repositories to provide some variety in the age of the repositories in our curriculum. We performed this process on September 14, 2024. However, many of the repositories that passed this test had fewer than 10 sorry theorems when processed by LeanDojo; this is mainly due to the functionalities of LeanDojo.
A.4 Lifelong Learning Metric Details
Prior work has noted that lifelong learning methods generally lack standard evaluation metrics (De Lange et al., 2023; Díaz-Rodríguez et al., 2018). As such, our selection primarily focused on metrics that emphasized a change over time, aligning with our problem setup. In addition, we removed metrics that were redundant. For example, prior work suggests that evaluating lifelong learning frameworks only after each task, rather than over time, leads to substantial forgetting (De Lange et al., 2023). As such, we adopt WF and WP in our analysis of LeanAgent. We use a window size of 5 for WF and WP as this represents a relatively medium-term understanding, given that we have 14 repositories. This would provide a balanced interpretation of forgetting and plasticity. Furthermore, we use the EBWT metric, in line with previous work, to evaluate LeanAgent throughout its lifetime rather than simply at the end (Díaz-Rodríguez et al., 2018). Moreover, we chose not to include the Forward Transfer metric as prior work has shown that a lower FM leads to better forward transfer (Chen et al., 2023). As such, we only check FM. We also chose not to include lifelong learning metrics for overall performance, such as Time Weighted Cumulative Performance (TWCP), Area Under the Learning Curve (AULC), and Average Accuracy (AA), as these would lead to redundancy in our analysis. Specifically, the metrics we chose were all computed using validation R@10 and the average test R@10, which are already measures of LeanAgent's performance.
We provide some additional details on the metrics we used. Windowed-Forgetting 5 (WF5) quantifies model stability by measuring the maximum performance decrease in average test R@10 over a sliding window of 5 evaluations. Following prior work, we define WF for a given window size and then average it over all evaluation tasks to provide a single measure of stability. Moreover, Catastrophic Forgetting Resilience (CFR) is a key indicator of the stability-plasticity trade-off. Furthermore, the Forgetting Measure (FM) measures the negative influence that learning a task has on the test R@10 of all old tasks. It is the average forgetting of all old tasks, where forgetting of a task is the difference between its highest and current performance. Furthermore, BWT measures the positive influence of learning a new task on the test R@10 of old tasks. EBWT improves upon this metric by considering the average of the BWT computed after each task. Windowed-Plasticity 5 (WP5) measures the ability to learn new information by quantifying the maximum average test R@10 increase over a sliding window of 5 evaluations. Incremental Plasticity (IP) tracks changes in validation R@10 for each task over time.
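As a concrete reading of FM and BWT, the following sketch computes both from a matrix `R` of test R@10 scores, where `R[i][j]` is performance on task `j` after training on task `i`; the matrix values in the test are illustrative, not LeanAgent's:

```python
def forgetting_measure(R: list[list[float]]) -> float:
    """FM: average over old tasks of (best-ever minus final) performance.

    A positive value means earlier tasks degraded; zero or negative
    means no forgetting.
    """
    T = len(R) - 1  # index of the final task
    return sum(max(R[i][j] for i in range(j, T)) - R[T][j]
               for j in range(T)) / T


def backward_transfer(R: list[list[float]]) -> float:
    """BWT: average change on old tasks after learning the final task;
    positive values mean new tasks improved earlier ones."""
    T = len(R) - 1
    return sum(R[T][j] - R[j][j] for j in range(T)) / T
```

EBWT then averages BWT computed after each task rather than only at the end, and the windowed metrics (WF5, WP5) apply the same differences over sliding windows of 5 evaluations.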
However, it is important to note that our lifelong learning metrics have different interpretations in the Merge All dataset construction strategy, which differs from the traditional task-incremental setup. To our knowledge, an interpretation of these metrics in this setting has not been thoroughly conducted. As such, we propose that metrics should be interpreted with an understanding that they may reflect an adaptation to gradual shifts in data distribution rather than abrupt task changes. Specifically, WF5 may reflect not just forgetting old tasks but also the ability to balance and retain knowledge across an expanding dataset. WP5 could indicate how well the model adapts to the growing complexity of the combined dataset rather than purely learning new, isolated tasks. FM, in this context, may represent the ability to maintain performance on earlier data points as the dataset grows. EBWT might reflect the capacity to leverage newly added data to improve performance on the entire historical dataset. CFR becomes a measure of stability in the face of an expanding, potentially more complex dataset. IP may represent how quickly the model adapts to the evolving nature of the combined dataset rather than discrete new tasks. These metrics in the Merge All case measure the ability to accumulate and refine knowledge over time rather than strictly measuring performance on isolated tasks.
It is worth analyzing the effect of EWC. Our results in Sec. 4.3 suggest that the effect of EWC is not uniform across different task-ordering strategies. In curriculum-based ordering, EWC seems to improve plasticity (WP5 and IP) at the cost of stability and continuous improvement (WF5, FM, EBWT, and CFR). An exception is Setup 7, which improves WP5 and FM. This suggests that the Merge All strategy creates a more nuanced balance between stability and plasticity. EWC generally improves stability and plasticity metrics, except IP, in the popularity order for the Single Repository strategy. This may be because this ordering is less optimized for learning, and EWC helps to mitigate some of its shortcomings. However, when used with a more effective curriculum-based ordering, EWC interferes with the carefully structured learning process, leading to mixed results. Moreover, in the Merge All scenario, EWC offers benefits only for the curriculum learning setups, suggesting that its effectiveness might be limited in more complex, merged datasets. This can be explained by the fact that the Merge All strategy is a memory-based approach to lifelong learning. As such, using both EWC and the Merge All strategy may lead to excessive stability. This analysis further explains why LeanAgent's setup is superior.
A.5 Further sorry Theorem Proving Discussion
We provide some additional discussion on the results in Table 2.
First, we note that comparing LeanAgent to ReProver+, the fine-tuning baseline, is not a direct comparison. Specifically, we note how mathematicians often formalize across multiple domains and projects simultaneously or cyclically. We use this as motivation for connecting mathematical knowledge between domains. Moreover, a key use case is formalizing new repositories without retraining on the entire dataset each time. When a mathematician creates some new repositories and adds them to a curriculum, they can simply progressively train LeanAgent on the new repositories while maintaining performance on previous ones, as shown in our experiments with both the initial curriculum and sub-curriculum. This is much more practical than retraining on the entire dataset, which would be expensive in terms of compute and time, especially as the number of repositories grows.
Moreover, data scarcity is a major problem in theorem proving. As such, having enough high-quality data for effective pre-training on all repositories may not be feasible. Training on all data, an approach similar to existing work, could prevent the model from generalizing to new repositories. However, LeanAgent does not have this restraint as it continuously generalizes to and improves on ever-expanding mathematical knowledge without forgetting previously learned knowledge.
In addition, the results in Table 4 show that our lifelong learning setup leads to effective backward transfer. This is a strong advantage that pre-training does not provide and is also a reason why LeanAgent demonstrates progressive learning, starting with basic concepts and advancing to more complex ones. Also, although not a direct comparison to the pre-training approach, the "Merge All" strategy indicates decreased performance over time. This dataset strategy can be interpreted as being closer to pre-training than the "Single Repository" strategy, suggesting lower than desired pre-training performance when training on all data.
Second, we note that LeanAgent improves over the direct fine-tuning baseline on various repositories, including MiniF2F, SciLean, and PFR. We find that these repositories contain a range of mathematical concepts that require progressively more advanced reasoning capabilities. This highlights the effectiveness of our lifelong learning approach in continuously generalizing to and improving on expanding mathematical knowledge without catastrophic forgetting.
Table 10: Accuracy comparison across setups. Accuracy is calculated as (proven theorems / total sorry theorems). MIL denotes the Mathematics in Lean Source repository, LA denotes LeanAgent, and SX denotes Setup X (e.g., S1 is Setup 1). The best accuracy for each repository is in bold.
| Repository | sorry Theorems | LA | S1 | S2 | S3 | S4 | S5 | S6 | S7 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MIL | 29 | **72.4** | 55.2 | 41.4 | 55.2 | 55.2 | 48.3 | 58.6 | 48.3 |
| SciLean | 294 | **9.2** | 8.5 | 7.5 | 6.8 | 8.5 | 8.5 | 8.8 | 7.8 |
| PFR | 37 | **2.7** | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
In addition to comparing LeanAgent and ReProver, we conduct an ablation study between LeanAgent and the seven variants discussed in Sec. 4.3 regarding the original curriculum. The detailed sorry theorem proving comparison, which focuses on some of the repositories compared in Sec. 4.2, is in Table 10. Note that when using the Merge All strategy, only sorry theorems from the new repository are proven during each iteration of lifelong learning. We devote the rest of this section to the detailed comparison of sorry theorems that these setups can prove.
Mathematics in Lean Source. We notice a progression in LeanAgent's proving ability in the Mathematics in Lean Source repository. During lifelong learning, LeanAgent demonstrates a grasp of fundamental algebraic structures and basic mathematical operations:
a) Group and Ring Theory: LeanAgent proves theorems about basic algebraic structures. For instance, MyGroup.mul_right_inv shows that multiplying an element by its inverse yields the identity and MyRing.add_right_cancel demonstrates the cancellation property in ring addition.
![](extracted/6256159/carbon17.png)

*Screenshot: Lean proofs of `mul_right_inv` (closed by `simp`) and `add_right_cancel` (closed by `simpa using h`).*
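Transcribed from the screenshot, the generated proofs are (the enclosing `MyGroup`/`MyRing` namespaces implied by the theorem names in the text are omitted):

```lean
theorem mul_right_inv (a : G) : a * a⁻¹ = 1 := by
  simp

theorem add_right_cancel {a b c : R} (h : a + b = c + b) : a = c := by
  simpa using h
```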
b) Elementary Number Theory: LeanAgent handles fundamental arithmetic properties, including MyRing.zero_mul, which proves that zero multiplied by any number is zero, and MyRing.neg_neg, which shows that the negative of a negative number is the original number.
![](extracted/6256159/carbon19.png)

*Screenshot: Lean proofs of `zero_mul` (closed by `rw [MulZeroClass.zero_mul]`) and `neg_neg` (closed by `simp`).*
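Transcribed from the screenshot, the generated proofs are (the enclosing `MyRing` namespace is omitted):

```lean
theorem zero_mul (a : R) : 0 * a = 0 := by
  rw [MulZeroClass.zero_mul]

theorem neg_neg (a : R) : - -a = a := by
  simp
```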
c) Order Theory: LeanAgent grasps order theory, as evidenced by absorb1, which proves that the infimum of x and the supremum of x and y is always equal to x, and absorb2, which demonstrates that the supremum of x and the infimum of x and y is always equal to x.
![](extracted/6256159/carbon20.png)

*Screenshot: Lean proofs of the lattice absorption laws `absorb1` and `absorb2`, each closed by `simp`.*
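Transcribed from the screenshot, the generated proofs are; the lattice symbols (infimum `⊓`, supremum `⊔`) are reconstructed from the description in the text, as they do not render cleanly in the screenshot:

```lean
theorem absorb1 : x ⊓ (x ⊔ y) = x := by
  simp

theorem absorb2 : x ⊔ x ⊓ y = x := by
  simp
```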
d) Rudimentary Real Analysis: LeanAgent demonstrates an early capability to handle properties related to real numbers and absolute values, as shown by C03S05.MyAbs.abs_add, which proves the triangle inequality for real numbers.
<details>
<summary>extracted/6256159/carbon21.png Details</summary>

### Visual Description
## Screenshot: Theorem Definition
### Overview
The image is a screenshot of a code snippet defining a theorem named `abs_add`. The theorem states that the absolute value of the sum of two real numbers x and y is less than or equal to the sum of their absolute values. The code snippet also includes a command to apply a lemma named `abs_add_le`.
### Components/Axes
The image contains the following text:
- `theorem abs_add (x y : ℝ) : |x + y| ≤ |x| + |y| := by`
- `apply abs_add_le`
The top-left corner of the window contains three circles: red, yellow, and green.
### Detailed Analysis
The code defines a theorem named `abs_add`.
- The theorem takes two real numbers, `x` and `y`, as input. The type of `x` and `y` is `ℝ`, which represents the set of real numbers.
- The theorem states that the absolute value of the sum of `x` and `y` is less than or equal to the sum of the absolute values of `x` and `y`. This is represented by the expression `|x + y| ≤ |x| + |y|`.
- The `:= by` part indicates that the theorem is being defined by a proof.
- The `apply abs_add_le` command applies a lemma named `abs_add_le` to prove the theorem.
### Key Observations
The code snippet defines a fundamental theorem related to absolute values and real numbers. The theorem is proven by applying a lemma.
### Interpretation
The code snippet demonstrates the definition and proof of a mathematical theorem within a formal system. The theorem `abs_add` is a well-known result in real analysis. The use of `apply abs_add_le` suggests that the proof relies on a previously established lemma, `abs_add_le`, which likely provides a more specific or foundational result related to absolute values.
</details>
Of the sorry theorems that LeanAgent proves from Mathematics in Lean Source during the lifelong learning process, 10/14 come from the exercise file for proving identities about algebraic structures. This indicates that LeanAgent builds its grasp of mathematical concepts starting from the basics.
Crucially, by the end of the lifelong learning process, LeanAgent exhibits significant growth in its mathematical reasoning abilities:
a) Quantifier Manipulation: LeanAgent exhibits more advanced logical reasoning by managing multiple quantifiers and implications, as evidenced by C03S01.my_lemma3, which proves a complex statement involving bounds and absolute values with multiple quantifiers and conditions, and C03S05.MyAbs.abs_lt, which establishes that the absolute value of x being less than y is equivalent to $-y<x\land x<y$ .
<details>
<summary>extracted/6256159/carbon23.png Details</summary>

### Visual Description
## Code Snippet: Theorem Definitions
### Overview
The image displays a code snippet defining two theorems, `my_lemma3` and `abs_lt`, likely within a formal verification or proof assistant environment. The code uses mathematical notation and specific commands (e.g., `apply`, `cases`, `exact`) typical of such systems.
### Components/Axes
* **Header:** The top-left of the window has three circles: red, yellow, and green.
* **Theorem Definitions:** The main content area contains the code defining the theorems.
* **Keywords:** The code uses keywords like `theorem`, `apply`, `cases`, and `exact`.
* **Mathematical Notation:** The code uses symbols like `∀` (for all), `ε` (epsilon), `ℝ` (real numbers), `<`, `↔` (if and only if), `∧` (and), and `|x|` (absolute value of x).
### Detailed Analysis
**Theorem `my_lemma3`:**
* **Definition:** `theorem my_lemma3 :`
* **Quantification:** `∀ {x y ε : ℝ},` (For all x, y, and epsilon, where x, y, and epsilon are real numbers)
* **Conditions:** `0 < ε → ε ≤ 1 → |x| < ε → |y| < ε → |x * y| < ε := by` (If 0 < epsilon, epsilon ≤ 1, the absolute value of x is less than epsilon, and the absolute value of y is less than epsilon, then the absolute value of x times y is less than epsilon. The `:= by` indicates the start of the proof strategy.)
* **Application:** `apply C03S01.my_lemma` (Apply the lemma `C03S01.my_lemma`)
**Theorem `abs_lt`:**
* **Definition:** `theorem abs_lt :`
* **Equivalence:** `|x| < y ↔ -y < x ∧ x < y := by` (The absolute value of x is less than y if and only if -y is less than x and x is less than y. The `:= by` indicates the start of the proof strategy.)
* **Case Analysis:** `cases x` (Perform case analysis on x)
* **Exact Proof:** `exact abs_lt` (The proof is exactly `abs_lt`)
### Key Observations
* The code defines two mathematical theorems.
* `my_lemma3` relates the absolute values of real numbers and their product to a small positive number epsilon.
* `abs_lt` defines the equivalence between the absolute value inequality and a conjunction of two inequalities.
* The code uses specific commands (`apply`, `cases`, `exact`) to construct the proofs.
### Interpretation
The code snippet demonstrates the formal definition and proof of mathematical theorems within a proof assistant. `my_lemma3` likely establishes a property related to the behavior of absolute values and inequalities, while `abs_lt` provides a fundamental equivalence for absolute value inequalities. The use of `apply`, `cases`, and `exact` suggests a step-by-step approach to constructing the proofs, leveraging existing lemmas and proof strategies. The code is written in a language designed for formal verification, where mathematical statements are rigorously defined and proven.
</details>
b) Set Theory and Relations: LeanAgent handles abstract set-theoretic concepts, as shown by C03S01.Subset.trans, which proves that subset relations are transitive.
<details>
<summary>extracted/6256159/carbon24.png Details</summary>

### Visual Description
## Code Snippet: Subset Transitivity Theorem
### Overview
The image is a screenshot of a code snippet defining and proving the transitivity of the subset relation. It shows the theorem statement and the proof using the `exact` tactic. The code appears to be written in a proof assistant language, likely Coq or Lean.
### Components/Axes
* **Window Controls**: The top-left corner of the window displays three colored circles: red, yellow, and green, representing close, minimize, and maximize/fullscreen, respectively.
* **Code Text**: The main content is the code snippet, which includes the theorem definition and its proof.
### Detailed Analysis
The code snippet consists of the following lines:
* `theorem Subset.trans : r ⊆ s → s ⊆ t → r ⊆ t := by`
* `exact Set.Subset.trans`
**Line 1 Breakdown:**
* `theorem Subset.trans`: Declares a theorem named `Subset.trans`.
* `: r ⊆ s → s ⊆ t → r ⊆ t`: Defines the theorem's statement. It asserts that if `r` is a subset of `s` and `s` is a subset of `t`, then `r` is a subset of `t`. The symbol `⊆` represents the subset relation, and `→` represents implication.
* `:= by`: Introduces the proof of the theorem.
**Line 2 Breakdown:**
* `exact Set.Subset.trans`: Provides the proof of the theorem using the `exact` tactic. It directly applies the existing theorem `Set.Subset.trans` to prove the current theorem.
### Key Observations
* The code defines a theorem about the transitivity of the subset relation.
* The proof is concise and relies on an existing theorem.
* The notation `r ⊆ s` indicates that `r` is a subset of `s`.
### Interpretation
The code snippet demonstrates how to define and prove a theorem about the transitivity of the subset relation in a proof assistant. The theorem states a fundamental property of sets, and the proof shows how to leverage existing knowledge to establish new results. The use of `exact` suggests that the proof assistant can automatically verify the correctness of the proof.
</details>
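The delegation-style proof described above can be sketched as a self-contained Lean 4 snippet, assuming `r`, `s`, and `t` are Mathlib `Set`s (the name is primed here only to avoid clashing with Mathlib's own lemma):

```lean
import Mathlib

variable {α : Type*} {r s t : Set α}

-- the statement `r ⊆ s → s ⊆ t → r ⊆ t` is exactly the curried type of
-- Mathlib's `Set.Subset.trans`, so `exact` closes the goal in one step
theorem Subset.trans' : r ⊆ s → s ⊆ t → r ⊆ t := by
  exact Set.Subset.trans
```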
Now, only 2/7 sorry theorems from Mathematics in Lean Source are from the exercise file for proving identities about algebraic structures. This suggests that lifelong learning enabled LeanAgent to move beyond the basics, developing a stronger ability to work with premises for more complicated proofs.
We gain some insights from comparing the performance of LeanAgent over time on Mathematics in Lean Source to other setups. For example, the fact that ReProver can handle harder theorems out of the box, such as C03S01.my_lemma3, but proves fewer theorems overall suggests that it starts with a broader knowledge base but loses performance from a lack of adaptability. Furthermore, Setup 5 proves the same sorry theorems as LeanAgent does during lifelong learning. This suggests that pure curriculum learning, without EWC or the Merge All strategy, emphasizes grasping easier concepts earlier, mirroring the insights gained from the lifelong learning analysis. Setups 3 and 7 (curriculum with EWC and/or Merge All) demonstrate some knowledge plasticity, proving harder theorems during lifelong learning, such as C03S01.Subset.trans in Setup 3. However, this comes at the cost of proving basic theorems, showing catastrophic forgetting: Setups 3 and 7 could not prove the trivial theorems MyGroup.mul_right_inv and MyRing.zero_mul, respectively, during lifelong learning, whereas LeanAgent could. This again aligns with the lifelong learning scores from our previous analysis.
By the end of lifelong learning, the sorry theorems that LeanAgent proves from Mathematics in Lean Source are a superset of those that the other setups prove. This shows that LeanAgent's lifelong learning setup provides continuously improving capabilities to reason about more advanced premises and proofs than other setups. For example, LeanAgent is the only system, other than Setup 6, that can prove theorem C03S05.MyAbs.abs_lt. LeanAgent achieved this by using the available premises, such as abs_lt, whose statement is similar to that of C03S05.MyAbs.abs_lt.
An interesting case study can be found in the dichotomy between the theorems C03S05.MyAbs.neg_le_abs_self and C03S05.MyAbs.le_abs_self. LeanAgent can prove C03S05.MyAbs.neg_le_abs_self by referencing C03S05.MyAbs.le_abs_self, which is still unproven at that point:
<details>
<summary>extracted/6256159/carbon27.png Details</summary>

### Visual Description
## Code Snippet: Theorem Definition
### Overview
The image shows a code snippet defining a theorem related to the absolute value of a real number. The code appears to be written in a formal verification language, likely Lean.
### Components/Axes
The image is a screenshot of a code editor or terminal window. The window has standard macOS-style traffic light buttons (red, yellow, green) in the top-left corner. The code is displayed in a monospaced font with syntax highlighting.
### Detailed Analysis
The code snippet defines a theorem named `neg_le_abs_self`.
The theorem states that for any real number `x` (denoted as `x : ℝ`), the negation of `x` (`-x`) is less than or equal to the absolute value of `x` (`|x|`). This is expressed as `-x ≤ |x|`.
The theorem is proven using the `by` keyword, followed by the proof strategy. In this case, the proof uses the `simpa` tactic, which simplifies the goal using the given lemma. The lemma used is `C03S05.MyAbs.le_abs_self (-x)`. This suggests that there is a library or module named `C03S05.MyAbs` that contains a lemma named `le_abs_self`, which is applied to `-x`.
### Key Observations
The code snippet demonstrates a formal definition and proof of a basic mathematical theorem. The use of tactics like `simpa` and lemmas from a library indicates a formal verification environment.
### Interpretation
The code snippet shows the formalization of a mathematical concept within a proof assistant. The theorem `neg_le_abs_self` is a fundamental property of real numbers and absolute values. The proof relies on a pre-existing lemma `le_abs_self`, suggesting a modular approach to building mathematical proofs. The use of `simpa` indicates an attempt to simplify the proof process by automating the application of the lemma.
</details>
At the end of lifelong learning, LeanAgent can prove C03S05.MyAbs.le_abs_self:
<details>
<summary>extracted/6256159/carbon28.png Details</summary>

### Visual Description
## Code Snippet: Theorem Definition
### Overview
The image displays a code snippet defining a theorem related to the absolute value of a real number. The code appears to be written in a formal verification language, likely Lean.
### Components/Axes
* **Top-Left:** Three colored circles (red, yellow, green), likely for window control (close, minimize, maximize).
* **Main Content:** Code defining a theorem.
### Detailed Analysis
The code snippet contains the following lines:
1. `theorem le_abs_self (x : ℝ) : x ≤ |x| := by`
2. `rw [le_abs]`
3. `simp`
### Key Observations
* The theorem `le_abs_self` states that for any real number `x` (denoted as `x : ℝ`), `x` is less than or equal to its absolute value (`x ≤ |x|`).
* The proof of the theorem is initiated with `:= by`.
* The proof uses the rewrite rule `rw [le_abs]` and simplification `simp`.
### Interpretation
The code snippet defines and proves a basic theorem in real analysis. The theorem `le_abs_self` is a fundamental property of absolute values. The proof uses rewrite rules and simplification, which are common techniques in formal verification. The code suggests a formal approach to mathematical reasoning and theorem proving.
</details>
It achieves this through its use of the le_abs premise, which provides conditions for when an element is less than or equal to the absolute value of another. This suggests that LeanAgent begins by using existing knowledge where possible before attempting to re-derive why existing facts hold.
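The dependency between the two theorems can be sketched in Lean 4 as follows (a minimal rendering of the proofs described above, assuming real-valued `x`; names are primed to avoid clashing with Mathlib's own `le_abs_self`):

```lean
import Mathlib

-- proved late in the lifelong learning process, via the `le_abs` premise:
-- `le_abs` rewrites the goal to `x ≤ x ∨ x ≤ -x`, which `simp` closes
theorem le_abs_self' (x : ℝ) : x ≤ |x| := by
  rw [le_abs]
  simp

-- proved earlier, by instantiating the (then still unproven) lemma at `-x`:
-- `simpa` rewrites `|-x|` to `|x|` via `abs_neg` and closes the goal
theorem neg_le_abs_self' (x : ℝ) : -x ≤ |x| := by
  simpa using le_abs_self' (-x)
```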
SciLean. We examine the sorry theorems from SciLean that LeanAgent proved to gain some further key insights about its performance. During the lifelong learning process, LeanAgent demonstrated a stronger understanding than ReProver of a wide range of mathematical concepts from SciLean. These theorems primarily focus on:
a) Fundamental Algebraic Structures: LeanAgent proves basic algebraic operations and properties, such as SciLean.scalar_div_one, which proves that dividing any number by one yields the same number, SciLean.scalar_min_zero_one, which demonstrates the minimum value between 0 and 1 is 0, and Function.invFun.id_rule, which proves that the inverse of the identity function is the identity function itself.
<details>
<summary>extracted/6256159/carbon29.png Details</summary>

### Visual Description
## Code Snippet: Theorem Definitions
### Overview
The image displays a code snippet, likely from a formal verification system or proof assistant, defining and proving mathematical theorems. The code defines three theorems: `scalar_div_one`, `scalar_min_zero_one`, and `id_rule`. Each theorem definition includes the theorem's statement and a proof strategy.
### Components/Axes
The image contains the following elements:
- **Window Decoration:** Red, yellow, and green circles in the top-left corner, resembling window controls.
- **Code Block:** A block of text containing theorem definitions and proof steps.
### Detailed Analysis
The code block contains the following theorem definitions:
1. **`theorem scalar_div_one (x : R) : x / 1 = x := by`**
* This theorem states that for any real number `x` (denoted by `x : R`), dividing `x` by 1 results in `x`.
* The proof strategy is `simp`, which likely refers to simplification.
2. **`theorem scalar_min_zero_one : min (0 : R) (1 : R) = 0 := by`**
* This theorem states that the minimum of 0 and 1 (both real numbers) is 0.
* The proof strategy involves rewriting using `min_comm` (likely commutativity of the minimum function) and then simplifying.
* `rw [min_comm]`
* `simp`
3. **`theorem id_rule : invFun (fun (x : X) => x) = fun x => x := by`**
* This theorem defines the inverse function of the identity function. It states that the inverse function of a function that maps `x` to `x` is the function that maps `x` to `x`.
* The proof strategy involves applying `Function.invFun_comp` and then using `Function.injective_id`.
* `apply Function.invFun_comp`
* `exact Function.injective_id`
### Key Observations
- The code uses a formal syntax for defining theorems and proofs.
- The theorems cover basic mathematical properties related to division, minimum, and inverse functions.
- The proof strategies are concise, relying on simplification and pre-defined lemmas.
### Interpretation
The code snippet demonstrates the use of a formal system to define and prove mathematical theorems. The theorems themselves are fundamental properties, and their proofs illustrate the system's capabilities for automated reasoning and simplification. The use of tactics like `simp`, `rw`, `apply`, and `exact` suggests an interactive proof environment where the user guides the system towards a complete proof. The code is likely part of a larger formalization effort, aiming to create a machine-verifiable library of mathematical knowledge.
</details>
b) Linear and Affine Maps: LeanAgent handles basic properties of linear and affine maps effectively, recognizing their structure in IsLinearMap.isLinearMap_apply, which proves the linearity of function applications, and IsAffineMap.IsAffineMap_apply, which demonstrates the affine property of function applications.
<details>
<summary>extracted/6256159/carbon30.png Details</summary>

### Visual Description
## Code Snippet: Theorem Definitions
### Overview
The image shows a code snippet defining two theorems: `isLinearMap_apply` and `IsAffineMap_apply`. The code appears to be written in a formal verification language, likely Coq or a similar system. The snippet defines the theorems and provides tactics for proving them.
### Components/Axes
* **Theorem Definitions:** Two theorems are defined: `isLinearMap_apply` and `IsAffineMap_apply`.
* **Input Parameters:** Both theorems take an input parameter `(i : ι)`.
* **Return Type:** Both theorems return a proposition stating whether the evaluation function `fun f : (i : ι) → E i => f i` is a linear map or an affine map, respectively. Here `f` is a dependent function of type `(i : ι) → E i`, and `f i` is its value at index `i`.
* **Proof Tactics:** The `by` keyword indicates the start of the proof tactics used to prove the theorems.
### Detailed Analysis
* **Theorem 1: `isLinearMap_apply`**
* Definition: `theorem isLinearMap_apply (i : ι) : IsLinearMap R (fun f : (i : ι) → E i => f i) := by`
* Proof Tactics:
* `constructor`
* `all_goals aesop`
* **Theorem 2: `IsAffineMap_apply`**
* Definition: `theorem IsAffineMap_apply (i : ι) : IsAffineMap R (fun f : (i : ι) → E i => f i) := by`
* Proof Tactics:
* `constructor`
* `constructor`
* `simp`
* `simp`
### Key Observations
* Both theorems define properties related to linear and affine maps.
* The proof tactics used for the two theorems differ. `isLinearMap_apply` uses `all_goals aesop`, while `IsAffineMap_apply` uses `constructor` and `simp` tactics.
* The input parameter `(i : ι)` and the function `f` are similar in both theorem definitions.
### Interpretation
The code snippet demonstrates the formal definition and proof of two mathematical theorems related to linear and affine maps. The use of specific proof tactics suggests different approaches are needed to prove each theorem. The `aesop` tactic used in the first theorem is likely a more general-purpose tactic, while `constructor` and `simp` are more specific tactics used in the second theorem. The code is likely part of a larger formal verification project, where mathematical properties are rigorously defined and proven using automated or semi-automated tools. The type `(i : ι) → E i` likely represents a dependent function space, with `f i` the value of such a function at index `i`. The `R` likely represents a ring or field over which the linear and affine maps are defined.
</details>
c) Measure Theory Basics: LeanAgent starts grasping measure theory concepts, exemplified by SciLean.ite_pull_measureOf, which handles conditional measure selection between two measures based on a proposition, SciLean.Measure.prod_volume, which proves that the product of two volume measures is the volume measure itself, and SciLean.ite_pull_ennreal_toReal, which proves that conditionally pulling out an extended non-negative real and converting it to a real is equivalent to converting the individual components first.
<details>
<summary>extracted/6256159/carbon40.png Details</summary>

### Visual Description
## Code Snippet: Theorem Definitions
### Overview
The image displays a code snippet, likely from a formal verification system like Lean or Coq, defining three theorems related to measure theory and real numbers. The code uses a functional programming style with dependent types.
### Components/Axes
* **Header:** Contains three colored circles (red, yellow, green) in the top-left corner, likely for window management.
* **Main Content:** The code snippet itself, consisting of three theorem definitions.
### Detailed Analysis
**Theorem 1: `ite_pull_measureOf`**
* **Name:** `ite_pull_measureOf`
* **Type Parameters:** `X` (presumably a type representing a set)
* **Context:**
* `[MeasurableSpace X]` (X is a measurable space)
* `(c : Prop)` (c is a proposition)
* `[Decidable c]` (c is decidable)
* `(μ ν : Measure X)` (μ and ν are measures on X)
* `(A : Set X)` (A is a set in X)
* **Statement:** `(if c then μ else ν) A = (if c then μ A else ν A) := by split_ifs <;> rfl`
* This theorem states that applying the conditional measure (if c then μ else ν) to a set A is equivalent to applying the measures μ and ν to A conditionally and then combining the results.
* `:= by split_ifs <;> rfl` indicates that the proof is done by splitting cases based on the condition `c` and then using reflexivity (`rfl`).
**Theorem 2: `Measure.prod_volume`**
* **Name:** `Measure.prod_volume`
* **Type Parameters:** `X Y`
* **Context:**
* `[MeasureSpace X]` (X is a measure space)
* `[MeasureSpace Y]` (Y is a measure space)
* **Statement:** `(Measure.prod (volume : Measure X) (volume : Measure Y)) = volume := by rfl`
* This theorem states that the product of the volumes (measures) on measure spaces X and Y is equal to the volume.
* `:= by rfl` indicates that the proof is done by reflexivity.
**Theorem 3: `ite_pull_ennreal_toReal`**
* **Name:** `ite_pull_ennreal_toReal`
* **Context:**
* `(c : Prop)` (c is a proposition)
* `[Decidable c]` (c is decidable)
* `(x y : ENNReal)` (x and y are extended non-negative real numbers)
* **Statement:** `(if c then x else y).toReal = (if c then x.toReal else y.toReal) := by split_ifs <;> rfl`
* This theorem states that converting a conditional extended non-negative real number (if c then x else y) to a real number is equivalent to converting x and y to real numbers conditionally and then combining the results.
* `:= by split_ifs <;> rfl` indicates that the proof is done by splitting cases based on the condition `c` and then using reflexivity.
### Key Observations
* All three theorems involve conditional expressions (`if c then ... else ...`).
* All three theorems are proven by splitting cases based on the condition `c` and then using reflexivity.
* The theorems relate to measure theory and real numbers, suggesting a focus on formal verification in these areas.
### Interpretation
The code snippet demonstrates the use of a formal verification system to define and prove theorems in measure theory and real analysis. The theorems are relatively simple, but they illustrate the basic principles of conditional reasoning and the use of reflexivity for proving equalities. The use of dependent types allows for precise specification of the types of objects involved, which is crucial for formal verification. The theorems likely form part of a larger library of verified mathematical results.
</details>
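The shared proof pattern of all three theorems (case-split on the decidable condition, then reflexivity in each branch) can be illustrated with the last of them; a minimal self-contained sketch:

```lean
import Mathlib

-- pulling `.toReal` inside an `if`: `split_ifs` produces one goal per
-- branch of the condition `c`, and each branch is a syntactic identity
-- closed by `rfl`
theorem ite_pull_ennreal_toReal' (c : Prop) [Decidable c] (x y : ENNReal) :
    (if c then x else y).toReal = (if c then x.toReal else y.toReal) := by
  split_ifs <;> rfl
```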
d) Floating-Point Operations: LeanAgent demonstrates an early grasp of floating-point representations and their correspondence to real numbers, shown by SciLean.re_float, proving that a floating-point number's real-like part is itself.
<details>
<summary>extracted/6256159/carbon33.png Details</summary>

### Visual Description
## Code Snippet: Theorem Definition
### Overview
The image displays a code snippet defining a theorem named `re_float`. The code appears to be written in a formal language, likely a proof assistant like Lean or Coq. The theorem states a property related to real numbers (Floats) and their representation within a specific context (RCLike).
### Components/Axes
* **Window Decorations:** The code is presented within a dark-themed window with standard macOS-style close (red), minimize (yellow), and maximize (green) buttons in the top-left corner.
* **Code:** The code consists of two lines:
* `theorem re_float (a : Float) : RCLike.re a = a := by`
* `exact RCLike.re_eq_self_of_le le_rfl`
### Detailed Analysis
The code snippet defines a theorem named `re_float`.
* The theorem takes a variable `a` of type `Float` as input.
* The theorem states that `RCLike.re a = a`. This suggests that `RCLike.re` is a function or property that extracts the real part of a `Float` within the `RCLike` context, and the theorem asserts that this real part is equal to the original `Float` value.
* The `:= by` indicates the start of the proof.
* The `exact RCLike.re_eq_self_of_le le_rfl` line provides the proof. It uses a function or lemma named `RCLike.re_eq_self_of_le` and applies it to `le_rfl`. This suggests that the proof relies on the reflexivity of the less-than-or-equal-to relation (`le_rfl`) and a property (`RCLike.re_eq_self_of_le`) that connects the real part extraction with the less-than-or-equal-to relation.
### Key Observations
* The code uses a formal syntax common in proof assistants.
* The theorem relates a `Float` value to its real part within a specific context (`RCLike`).
* The proof relies on reflexivity and a specific property related to the real part extraction.
### Interpretation
The code snippet defines and proves a theorem stating that extracting the real part of a `Float` value within the `RCLike` context results in the original `Float` value. This theorem likely serves as a fundamental property within a larger formalization of real number theory or numerical computation. The proof demonstrates how this property can be derived from more basic axioms or lemmas. The `RCLike` context likely provides a specific representation or interpretation of real numbers within the formal system.
</details>
The proofs during this phase are characteristically concise, often using basic tactics like simp, rfl, or aesop that do not use premises. This suggests that LeanAgent recognizes these theorems are straightforward enough to prove without the complex retrieval of premises.
Crucially, by the end of the lifelong learning process, LeanAgent exhibits significant growth in its mathematical reasoning abilities on SciLean, just as it did with Mathematics in Lean Source:
a) Advanced Function Spaces: LeanAgent understands concepts in advanced function spaces, such as SciLean.ContCDiffMapFD_eta, which demonstrates the eta reduction property for continuously differentiable maps over finite dimensions.
<details>
<summary>extracted/6256159/carbon34.png Details</summary>

### Visual Description
## Code Snippet: Theorem Definition
### Overview
The image shows a code snippet, likely from a formal verification system like Isabelle or Coq. It defines a theorem named `ContCDiffMapFD_eta` and provides a proof using simplification and the `aesop` tactic.
### Components/Axes
* **Header:** The top-left of the code block contains three colored circles: red, yellow, and green, resembling macOS-style window controls.
* **Code Body:** The main part of the image contains the code itself, formatted with different colors to highlight keywords and variables.
### Detailed Analysis
The code snippet contains the following:
* `theorem ContCDiffMapFD_eta (f : X -->FD[K,n] Y) : (fun x -->FD[K,n] f x) = f := by`
* This line declares a theorem named `ContCDiffMapFD_eta`.
* It states that `f` is a function from `X` to `FD[K,n] Y`. The arrow `-->` likely represents a function type constructor.
* The theorem asserts that applying `f` to `x` and then constructing a function from `x` to that result is equivalent to `f` itself.
* The `:= by` indicates the start of the proof.
* `simp only [DFunLike.ext_iff]`
* This line uses the `simp` tactic (likely a simplification tactic) with the `only` modifier.
* It specifies that the simplification should only use the `DFunLike.ext_iff` rule, which is likely an extensionality rule for function-like structures.
* `aesop`
* This line invokes the `aesop` tactic, which is an automated proof search tactic.
### Key Observations
* The code uses specific notation (`-->`, `FD[K,n]`) that is likely defined within the context of the formal verification system.
* The proof is relatively short, suggesting that the theorem is either straightforward or relies on powerful automated tactics.
### Interpretation
The code snippet defines a theorem related to continuous differentiable maps (indicated by `ContCDiffMapFD`). The theorem likely expresses a property of function extensionality in the context of these maps. The proof uses simplification with a specific extensionality rule and then relies on the `aesop` tactic to complete the proof. This suggests that the theorem is provable using standard techniques within the formal verification system.
</details>
b) Sophisticated Bijections: LeanAgent grows in its ability to work with product spaces and bijections, proving theorems such as Function.Bijective.Prod.mk.arg_fstsnd.Bijective_rule_simple', which proves the bijectivity of a function that swaps elements in a product space, and Function.Bijective.Equiv.invFun.arg_a0.Bijective_rule, which proves that the composition of a bijection and its inverse remains bijective. Crucially, these theorems might seem simple, but they demonstrate LeanAgent's capability to handle abstract algebraic thinking.
<details>
<summary>extracted/6256159/carbon39.png Details</summary>

### Visual Description
## Code Snippet: Theorem Definitions
### Overview
The image displays a code snippet, likely from a formal verification system or proof assistant, defining two theorems related to bijective functions. The code uses a specific syntax for defining theorems, their types, and proof strategies.
### Components/Axes
* **Window Decoration:** The code is presented within a window-like frame, with three circles in the top-left corner (red, yellow, green), mimicking window controls.
* **Code:** The main content is the code itself, consisting of theorem definitions and proof tactics.
### Detailed Analysis
**Theorem 1:**
* **Name:** `Prod.mk.arg_fstsnd.Bijective_rule_simple'`
* **Type:** `Bijective (fun xy : X × Y => (xy.2, xy.1))`
* This states that the function which takes a pair `xy` from the product type `X × Y` and returns the pair `(xy.2, xy.1)` (swapping the elements) is bijective.
* **Proof:**
* `:= by` indicates the start of the proof.
* `constructor <;> intro h` suggests using a constructor tactic followed by introducing a hypothesis `h`.
* `all_goals aesop` applies the `aesop` tactic to all remaining goals.
**Theorem 2:**
* **Name:** `Equiv.invFun.arg_a0.Bijective_rule (f : Y ≃ Z) (g : X → Z) (hf : Bijective g)`
* This theorem takes three arguments:
* `f : Y ≃ Z`: An equivalence (bijection) between types `Y` and `Z`.
* `g : X → Z`: A function from type `X` to type `Z`.
* `hf : Bijective g`: A hypothesis stating that the function `g` is bijective.
* **Type:** `Bijective (fun x => f.invFun (g x))`
* This states that the function which takes an `x` of type `X` and returns `f.invFun (g x)` is bijective. Here, `f.invFun` is the inverse function of `f`, and `g x` is the application of `g` to `x`.
* **Proof:**
* `:= by` indicates the start of the proof.
* `convert hf` attempts to convert the goal to match the hypothesis `hf`.
* `simp [hf]` simplifies the goal using the hypothesis `hf`.
* `exact f.symm.bijective.comp hf` completes the proof by applying a composition of functions.
### Key Observations
* The code defines two theorems related to bijective functions.
* The first theorem demonstrates that swapping the elements of a pair results in a bijective function.
* The second theorem demonstrates that the composition of a function with the inverse of an equivalence (bijection) results in a bijective function.
* The code uses specific tactics (`constructor`, `intro`, `aesop`, `convert`, `simp`, `exact`) to prove the theorems.
### Interpretation
The code snippet showcases formal theorem definitions and their proofs within a formal verification system. The theorems relate to properties of bijective functions, specifically concerning swapping elements in a pair and composing functions with inverses. The use of tactics indicates an interactive proof process, where the user guides the system to construct a formal proof. The theorems likely form part of a larger library of mathematical results within the system.
</details>
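Based on the description above, the two theorems can be transcribed approximately as follows. This is a reconstruction from the visual description, not verified against the SciLean source; the variables `X`, `Y`, `Z` and the `Bijective`/`Equiv` notation are assumed to be in scope from the surrounding file:

```lean
theorem Prod.mk.arg_fstsnd.Bijective_rule_simple' :
    Bijective (fun xy : X × Y => (xy.2, xy.1)) := by
  constructor <;> intro h
  all_goals aesop

theorem Equiv.invFun.arg_a0.Bijective_rule (f : Y ≃ Z) (g : X → Z)
    (hf : Bijective g) : Bijective (fun x => f.invFun (g x)) := by
  convert hf
  simp [hf]
  exact f.symm.bijective.comp hf
```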
c) Abstract Algebraic Structures: LeanAgent proves further abstract algebraic properties, including SciLean.CDifferentiable.id_rule, which states that the identity function is continuously differentiable.
<details>
<summary>extracted/6256159/carbon36.png Details</summary>

### Visual Description
## Code Snippet: CDifferentiable Identity Rule
### Overview
The image shows a code snippet, likely from a formal verification system like Lean, defining and proving a theorem related to the differentiability of the identity function. The code defines a theorem named `CDifferentiable.id_rule` which states that the identity function `fun x : X => x` is differentiable. The proof uses tactics like `intro`, `unfold`, and `tauto`.
### Components/Axes
* **Window Controls:** The top-left corner of the window contains three circles: red, yellow, and green, representing close, minimize, and maximize window controls respectively.
* **Code:** The main content is the code snippet itself.
### Detailed Analysis
The code snippet consists of the following lines:
1. `theorem CDifferentiable.id_rule : CDifferentiable K (fun x : X => x) := by`
* This line declares a theorem named `CDifferentiable.id_rule`.
* It states that the identity function `(fun x : X => x)` is `CDifferentiable K`.
* The `:= by` indicates the start of the proof.
2. `intro x`
* This line introduces a variable `x` into the context.
3. `unfold SciLean.CDifferentiableAt`
* This line unfolds the definition of `SciLean.CDifferentiableAt`.
4. `tauto`
* This line uses the `tauto` tactic, which attempts to automatically prove the goal using tautological reasoning.
### Key Observations
* The code snippet is written in a formal language, likely Lean.
* The theorem `CDifferentiable.id_rule` is a statement about the differentiability of the identity function.
* The proof is short and uses standard tactics.
### Interpretation
The code snippet demonstrates a formal proof that the identity function is differentiable. This is a fundamental result in calculus and analysis. The use of tactics like `intro`, `unfold`, and `tauto` suggests that the proof is relatively straightforward and relies on the definitions of differentiability and tautological reasoning. The code is likely part of a larger library or framework for formalizing mathematical concepts and proofs.
</details>
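Following the line-by-line description above, the snippet can be consolidated into a single listing (a reconstruction; the scalar field `K` and the type `X` are assumed to be declared in the surrounding SciLean context):

```lean
theorem CDifferentiable.id_rule : CDifferentiable K (fun x : X => x) := by
  intro x
  unfold SciLean.CDifferentiableAt
  tauto
```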
d) Data Structures in Mathematics: LeanAgent navigates proofs involving array types in a mathematical context, proving theorems such as SciLean.ArrayType.ext, which states that two arrays are equal if their elements are equal at all indices.
<details>
<summary>extracted/6256159/carbon37.png Details</summary>

### Visual Description
## Code Snippet: Theorem Definition
### Overview
The image shows a code snippet defining a theorem named "ext" within a programming environment, likely a proof assistant or theorem prover. The code uses tactics like "intro", "apply", and "simp" to prove the theorem.
### Components/Axes
* **Code Block:** The main area containing the code.
* **Syntax Highlighting:** Different keywords and variables are highlighted with different colors.
* **Top-Left Indicators:** Three circles (red, yellow, green) are present in the top-left corner, resembling window control buttons.
### Detailed Analysis
The code snippet contains the following lines:
1. `theorem ext (x y : Cont) : (∀ i, x[i] = y[i]) → x = y := by`
* This line defines a theorem named "ext".
* It takes two arguments, `x` and `y`, both of type `Cont`.
* The theorem states: "For all `i`, if `x[i]` equals `y[i]`, then `x` equals `y`."
* The `:= by` indicates the start of the proof.
2. `intro h`
* This line introduces a hypothesis `h`.
3. `apply SciLean.ArrayType.get_injective`
* This line applies the `get_injective` property from `SciLean.ArrayType`.
4. `simp only [h]`
* This line simplifies the goal using the hypothesis `h`.
### Key Observations
* The code snippet appears to be written in a language designed for formal verification or theorem proving.
* The theorem "ext" likely relates to the extensionality of arrays or some similar data structure.
* The `SciLean` namespace suggests the code is part of a larger library or framework.
### Interpretation
The code snippet defines and proves a theorem related to the equality of two objects (`x` and `y`) based on the equality of their elements at every index `i`. The theorem leverages the `get_injective` property of `SciLean.ArrayType`, indicating that the elements of the array uniquely determine the array itself. The proof uses tactics like `intro` and `simp` to manipulate the goal and hypotheses, ultimately proving the theorem. This type of code is commonly found in formal verification and theorem proving environments, where rigorous mathematical proofs are constructed and verified by a computer.
</details>
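Assembled from the description above, the proof reads approximately as follows (a reconstruction; the container type `Cont` and the `x[i]` index notation are assumed to come from SciLean's `ArrayType` setup):

```lean
theorem ext (x y : Cont) : (∀ i, x[i] = y[i]) → x = y := by
  intro h
  apply SciLean.ArrayType.get_injective
  simp only [h]
```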
LeanAgent proved this theorem about a traditionally computer-science-oriented data structure through a mathematical lens, which several other setups could not do.
The proofs at this stage are more sophisticated, involving multiple steps and combining various mathematical concepts. This indicates a deeper ability to connect different areas of mathematics.
The progression from basic algebraic structures to advanced function spaces and data structures (like array types) shows that LeanAgent is bridging the gap between pure mathematical concepts and their applications in computational mathematics. Furthermore, the progression from basic integral manipulations to advanced function spaces indicates that LeanAgent is improving its premise selection over time. It learns to identify and apply more sophisticated mathematical structures and theorems as premises.
By the end of lifelong learning, the sorry theorems that LeanAgent proves from SciLean form almost a strict superset of those proved by the other setups. This corroborates our earlier finding from the analysis of the sorry theorems in Mathematics in Lean Source: LeanAgent's lifelong learning setup gives it continuously improving capabilities to reason about premises and proofs that outperform the other setups.
Crucially, LeanAgent could prove re_float during lifelong learning while no other setup could. This indicates that LeanAgent's more measured and stable approach allowed it to grasp floating-point representations and their relation to the reals, whereas the other setups, with their reduced stability, prioritized this less. It may also suggest that continuous improvement allowed LeanAgent to process new and unique concepts from new repositories.
We gain some interesting insights from comparing the performance of LeanAgent over time on SciLean to other setups. For example, the fact that ReProver could not prove the trivial theorem SciLean.ite_pull_ennreal_toReal while LeanAgent could, even during lifelong learning, suggests that this baseline cannot handle foundational concepts. LeanAgent has a better grasp of these concepts from lifelong learning. This is due to the increased stability of LeanAgent and improvement from learning a new task, as shown in the lifelong learning metric analysis in Sec. 4.3.
Furthermore, an intriguing observation is that Setup 3 could prove Function.Bijective.Prod.mk.arg_fstsnd.Bijective_rule_simple' during lifelong learning. However, as mentioned above, LeanAgent could only prove this theorem at the end of lifelong learning. This suggests that Setup 3 is more plastic during lifelong learning, while LeanAgent remains more stable. This corroborates the analysis from the lifelong learning metrics. However, this again comes at the cost of grasping basic theorems. For example, Setup 3 cannot prove the trivial measure theory theorem SciLean.Measure.prod_volume, whereas LeanAgent could during lifelong learning, again suggesting that it favors plasticity over stability.
An interesting case is that LeanAgent can prove SciLean.norm₂_scalar during lifelong learning but not SciLean.norm2_scalar. Conversely, Setup 2 proved SciLean.norm2_scalar but not SciLean.norm₂_scalar.
<details>
<summary>extracted/6256159/carbon64.png Details</summary>

### Visual Description
## Code Snippet: Norm2 Scalar Theorems
### Overview
The image presents a code snippet containing two theorem definitions related to the norm2_scalar property, likely within a formal verification or mathematical library environment. The code defines theorems for real scalars and their relationship to the absolute value and square of a variable x.
### Components/Axes
The code snippet includes the following components:
* **Theorem Definitions:** Two theorems named `norm2_scalar`.
* **Type Constraints:** `{R} [RealScalar R] (x : R)` indicating type constraints on the real scalar `R` and variable `x`.
* **Equations:** Mathematical equations defining the norm2_scalar property.
* **Tactics:** `rw`, `symm`, `simp`, `congr` which are tactics used in interactive theorem proving.
* **Library References:** `[SciLean.scalar_norm]` refers to a library function.
### Detailed Analysis
The code snippet contains the following theorems and expressions:
1. **First Theorem:**
* `theorem norm2_scalar {R} [RealScalar R] (x : R):`
* `‖x‖₂[R] = Scalar.abs x := by`
* `rw [SciLean.scalar_norm]`
This theorem states that the norm2 of `x` with respect to the real scalar `R` is equal to the absolute value of `x`. The `rw` tactic rewrites the goal using the definition from `SciLean.scalar_norm`.
2. **Second Theorem:**
* `theorem norm2_scalar {R} [RealScalar R] (x : R):`
* `‖x‖₂²[R] = x^2 := by`
* `symm`
* `simp [sq]`
* `congr`
* `simp`
This theorem states that the square of the norm2 of `x` with respect to the real scalar `R` is equal to `x` squared. The tactics `symm`, `simp [sq]`, `congr`, and `simp` are used to prove the theorem.
### Key Observations
* Both theorems define a relationship between the norm2 of a real scalar and either the absolute value or the square of a variable.
* The first theorem uses a rewrite tactic with a library reference, while the second theorem uses a series of simplification and congruence tactics.
* The code appears to be written in a language used for formal verification, possibly Lean.
### Interpretation
The code snippet demonstrates the definition of two theorems related to the norm2_scalar property. These theorems are likely part of a larger mathematical library or formal verification project. The first theorem establishes a relationship between the norm2 and the absolute value, while the second theorem establishes a relationship between the square of the norm2 and the square of the variable. The tactics used in the proofs suggest an interactive theorem proving environment.
</details>
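Putting the description together, the two theorems read approximately as follows. This is a reconstruction from the garbled transcription: the `‖·‖₂[R]`/`‖·‖₂²[R]` notation and the `RealScalar` class are taken from the description and may differ from SciLean's actual syntax:

```lean
theorem norm₂_scalar {R} [RealScalar R] (x : R) :
    ‖x‖₂[R] = Scalar.abs x := by
  rw [SciLean.scalar_norm]

theorem norm2_scalar {R} [RealScalar R] (x : R) :
    ‖x‖₂²[R] = x ^ 2 := by
  symm
  simp [sq]
  congr
  simp
```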
Setup 2 uses the sq premise from mathlib4, which states that the square of an element is the same as multiplying that element by itself, while LeanAgent used the SciLean.scalar_norm premise from SciLean, which states that the 2-norm of a real scalar is equal to the absolute value of that scalar. This suggests that LeanAgent prefers to use new premises if possible rather than simplistic pre-trained ones. This also makes sense since Setup 2 uses popularity order, which generally provides poor stability and plasticity.
PFR. Crucially, LeanAgent is the only setup to prove a sorry theorem from PFR. We can prove condRho_of_translate, which states a form of a translation invariance property of randomness measures.
<details>
<summary>extracted/6256159/carbon44.png Details</summary>

### Visual Description
## Code Snippet: Lemma Definition
### Overview
The image shows a code snippet from a proof assistant, most likely Lean. It defines a lemma named `condRho_of_translate`. The code specifies types, measure spaces, functions, and a simplification rule.
### Components/Axes
* **Keywords**: `lemma`, `Type*`, `MeasureSpace`, `Finset`, `fun`, `condRho`, `simp`, `only`, `by`
* **Variables**: `Ω`, `S`, `X`, `Y`, `A`, `s`, `ω`
* **Operators**: `→`, `↦`, `+`, `:=`
### Detailed Analysis
The code snippet can be transcribed as follows:
```
lemma condRho_of_translate {Ω S : Type*} [MeasureSpace Ω] (X : Ω → G) (Y : Ω → S)
    (A : Finset G) (s : G) : condRho (fun ω ↦ X ω + s) Y A = condRho X Y A := by
  simp only [condRho, rho_of_translate]
```
Breaking down the code:
* `lemma condRho_of_translate`: Declares a lemma named `condRho_of_translate`.
* `{Ω S : Type*}`: Introduces type variables `Ω` and `S`, both of type `Type*`.
* `[MeasureSpace Ω]`: Assumes that `Ω` is a measure space.
* `(X : Ω → G)`: Declares a function `X` from `Ω` to `G`. Note: `G` is not explicitly declared here, but is likely a group.
* `(Y : Ω → S)`: Declares a function `Y` from `Ω` to `S`.
* `(A : Finset G)`: Declares `A` as a finite set (Finset) of elements from `G`.
* `(s:G)`: Declares `s` as an element of `G`.
* `: condRho (fun ω ↦ X ω + s) Y A = condRho X Y A := by`: States the property to be proven. It involves a function `condRho` applied to different arguments. The left side applies `condRho` to a function that maps `ω` to `X ω + s`, along with `Y` and `A`. The right side applies `condRho` to `X`, `Y`, and `A`. The `:= by` indicates that the proof follows.
* `simp only [condRho, rho_of_translate]`: Indicates that the proof is done by simplification, using the definitions of `condRho` and `rho_of_translate`.
### Key Observations
* The code uses mathematical notation within a programming context.
* The lemma relates `condRho` applied to a translated function with `condRho` applied to the original function.
* The proof is a simple simplification.
### Interpretation
The lemma `condRho_of_translate` likely expresses a property about the invariance of some conditional density or measure (`condRho`) under translation. The translation is represented by adding `s` to the function `X Ï`. The simplification step suggests that the definitions of `condRho` and `rho_of_translate` are sufficient to prove this invariance. The code is written in a style common to interactive theorem provers, where the system checks the validity of the proof steps.
</details>
LeanAgent suggests that the proof is straightforward after expanding definitions of condRho and considering how rho, the randomness measure, behaves under translation (as captured by the rho_of_translate lemma). The fact that LeanAgent could trivially identify such a short proof using existing premises while the maintainers of the PFR repository did not suggests the power of our approach. This suggests that by learning the foundations of the PFR lemmas using its improved stability, LeanAgent was able to grasp some basic definitions.
However, LeanAgent and other setups could not prove any PFR theorems after lifelong learning. This suggests that LeanAgent requires more training time or data to further strengthen its knowledge in this new area of mathematics.
Alternate PFR Commit. We also analyze whether LeanAgent can generalize to a different commit of a repository in the curriculum. We choose commit 861715b9bf9482d2442760169cb2a3ff54091f75 because PFR/RhoFunctional.lean, the file from which we proved condRho_of_translate in the newer commit, did not exist in the old commit, making the old commit's sorry theorems more distinct. LeanAgent proved two theorems, including multiDist_copy, about the equality of distributions when copying random variables across different measure spaces, which ReProver could not prove. LeanAgent achieved this with just the tactic rfl. However, these statements concern multiDist, whose definition is itself left as sorry. Since any two instances of sorry are definitionally equal, making rfl succeed, LeanAgent exploits this technicality to prove the two theorems.
<details>
<summary>extracted/6256159/carbon45.png Details</summary>

### Visual Description
## Code Snippet: Lemma Definitions
### Overview
The image presents a code snippet containing two lemma definitions: `multiDist_copy` and `multiDist_of_perm`. These lemmas appear to be related to measure theory and probability, potentially within a formal verification context. The code is written in a language that supports dependent types and mathematical notation.
### Components/Axes
* **Lemma Definitions:** Two lemmas are defined: `multiDist_copy` and `multiDist_of_perm`.
* **Type Parameters:** Both lemmas use type parameters enclosed in curly braces `{}`.
* **Hypotheses:** Each lemma has hypotheses denoted by parentheses `()`.
* **Goal:** Each lemma has a goal, which is an equality denoted by `=`.
* **Proof:** Each lemma concludes with `:= by rfl`, indicating a proof by reflexivity.
### Detailed Analysis
**Lemma 1: `multiDist_copy`**
* **Name:** `multiDist_copy`
* **Type Parameters:**
* `m : ℕ`: `m` is a natural number.
* `Ω : Fin m → Type*`: `Ω` is a function from `Fin m` (the finite set of size `m`) to `Type*` (a type universe).
* `Ω' : Fin m → Type*`: `Ω'` is a function from `Fin m` to `Type*`.
* **Hypotheses:**
* `hΩ : (i : Fin m) → MeasureSpace (Ω i)`: `hΩ` states that for each `i` in `Fin m`, `Ω i` is a `MeasureSpace`.
* `hΩ' : (i : Fin m) → MeasureSpace (Ω' i)`: `hΩ'` states that for each `i` in `Fin m`, `Ω' i` is a `MeasureSpace`.
* `X' : (i : Fin m) → (Ω' i) → G`: `X'` takes an index `i` in `Fin m` and an element of `Ω' i` and returns an element of type `G`.
* `X : (i : Fin m) → (Ω i) → G`: `X` takes an index `i` in `Fin m` and an element of `Ω i` and returns an element of type `G`.
* `hident : ∀ i, IdentDistrib (X i) (X' i) (hΩ i).volume (hΩ' i).volume`: `hident` states that for all `i`, `X i` and `X' i` are identically distributed with respect to the volumes of `hΩ i` and `hΩ' i`.
* **Goal:** `D[X; hΩ] = D[X'; hΩ']`
* **Proof:** `:= by rfl`
**Lemma 2: `multiDist_of_perm`**
* **Name:** `multiDist_of_perm`
* **Type Parameters:**
* `m : ℕ`: `m` is a natural number.
* `Ω : Fin m → Type*`: `Ω` is a function from `Fin m` to `Type*`.
* **Hypotheses:**
* `hΩ : (i : Fin m) → MeasureSpace (Ω i)`: `hΩ` states that for each `i` in `Fin m`, `Ω i` is a `MeasureSpace`.
* `X : (i : Fin m) → (Ω i) → G`: `X` takes an index `i` in `Fin m` and an element of `Ω i` and returns an element of type `G`.
* `σ : Equiv.Perm (Fin m)`: `σ` is a permutation of `Fin m`.
* **Goal:** `D[X; hΩ] = D[fun i ↦ X (σ i); fun i ↦ hΩ (σ i)]`
* **Proof:** `:= by rfl`
### Key Observations
* Both lemmas involve measure spaces and functions defined over finite sets.
* The `multiDist_copy` lemma relates two distributions `D` when the underlying measure spaces are different but identically distributed.
* The `multiDist_of_perm` lemma relates a distribution `D` to the distribution obtained by permuting the index `i` with `σ`.
* The proofs are trivial (reflexivity), suggesting that the core logic is embedded in the definitions of `D`, `MeasureSpace`, `IdentDistrib`, and `Equiv.Perm`.
### Interpretation
The code snippet defines two lemmas related to distributions and measure spaces. The `multiDist_copy` lemma likely states that if two sets of random variables are identically distributed, then their distributions are equal. The `multiDist_of_perm` lemma likely states that the distribution is invariant under permutation of the indices. These lemmas are likely used in a formal verification setting to prove properties about probabilistic programs or systems. The use of `rfl` suggests that these lemmas are definitional equalities or that the proof is handled by a tactic that automatically simplifies the goal to reflexivity.
</details>
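The rfl technicality described above can be illustrated with a minimal, self-contained Lean sketch (the stub name is hypothetical, not from PFR). A definition whose body is `sorry` elaborates to a single `sorryAx` term that does not depend on its arguments, so any two applications of it are definitionally equal:

```lean
-- Hypothetical stub mirroring how `multiDist` was left
-- unimplemented in the old PFR commit.
noncomputable def multiDistStub (n : Nat) : Nat := sorry

-- Both sides unfold to the same `sorryAx` term, so `rfl` closes
-- the goal even though the arguments differ.
example : multiDistStub 1 = multiDistStub 2 := rfl
```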
This commit also provides another interesting example. The PFR maintainers used 0 = 1 as a placeholder for some sorry theorems, five of which are multiTau_min_exists, multiTau_min_sum_le, sub_multiDistance_le, sub_condMultiDistance_le, and sub_condMultiDistance_le'. Interestingly, LeanAgent finds unintended constructs and proves these theorems with this proof:
<details>
<summary>extracted/6256159/carbon46.png Details</summary>

### Visual Description
## Code Snippet: Lemma Definition
### Overview
The image shows a code snippet, likely from a formal verification system or proof assistant. It defines a lemma named `multiTau_min_exists` and provides a proof using tactics like `nontriviality`, `simp`, `apply`, and `exact`. The code appears to be written in a language like Lean or Coq.
### Components/Axes
* **Header:** The top-left of the window contains three circles: red, yellow, and green. These are likely window control buttons (close, minimize, maximize).
* **Code Block:** The main area contains the code snippet.
* `lemma multiTau_min_exists : 0 = 1 := by`
* `nontriviality`
* `simp`
* `apply @zero_ne_one ℕ _`
* `exact multidist_eq_zero`
### Detailed Analysis
The code snippet defines a lemma named `multiTau_min_exists`. The lemma states that `0 = 1`. The proof proceeds as follows:
1. `nontriviality`: This tactic likely uses the assumption that the underlying structure is non-trivial, meaning that 0 and 1 are distinct.
2. `simp`: This tactic likely simplifies the goal using known equalities and definitions.
3. `apply @zero_ne_one ℕ _`: This tactic applies the lemma `zero_ne_one` (stating that 0 is not equal to 1) to the goal. The `@` symbol makes implicit arguments explicit; `ℕ` is the natural-number type, and the underscore `_` is a placeholder for an argument the system can infer.
4. `exact multidist_eq_zero`: This tactic proves the goal by directly applying a theorem or lemma named `multidist_eq_zero`.
### Key Observations
* The lemma `multiTau_min_exists` claims that 0 = 1, which is a contradiction in standard mathematics.
* The proof attempts to derive this contradiction using tactics related to non-triviality and the distinctness of 0 and 1.
* The tactics `apply @zero_ne_one N _` and `exact multidist_eq_zero` are crucial steps in the proof.
### Interpretation
The code snippet likely demonstrates a proof by contradiction or a situation where the underlying assumptions are inconsistent. The lemma `multiTau_min_exists` is intentionally set up to be false, and the proof shows how this falsehood can be derived from the given assumptions and tactics. This could be used to demonstrate the limitations of a formal system or to highlight the importance of consistent axioms. The presence of `zero_ne_one` suggests that the system is aware of the standard mathematical axiom that zero and one are distinct, but the other tactics are used to circumvent this axiom and derive the contradiction.
</details>
It combines the known fact that $0 \neq 1$ with the placeholder theorem multidist_eq_zero, which states $0 = 1$. As such, it proves False, allowing it to derive anything, including $0 = 1$.
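The exploit can be sketched in a few lines of self-contained Lean (the axiom stands in for PFR's placeholder lemma; the names are illustrative, and `zero_ne_one` is the standard Mathlib lemma):

```lean
-- Stand-in for the placeholder "theorem" that literally states 0 = 1.
axiom multidist_eq_zero : (0 : ℕ) = 1

-- Combining it with `zero_ne_one` yields `False`, from which any
-- proposition follows (ex falso quodlibet).
example (P : Prop) : P := absurd multidist_eq_zero zero_ne_one
```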
MiniF2F. This repository consists of a validation and test split. Prior work evaluated performance as the number of sorry theorems proved and a Pass@k metric on the test split (Yang et al., 2023). However, given that LeanAgent is a framework, not a model, quantitatively comparing it with existing methods using a Pass@k metric is misleading. In line with our existing setups, we do not treat MiniF2F as a benchmark. Instead, we disregard its existing splits and compare LeanAgent's proven sorry theorems with those of ReProver.
LeanAgent can prove 99 theorems, while ReProver can only prove 85. As mentioned in Sec. 4.2, we append MiniF2F to the initial curriculum to demonstrate the use case of formalizing a new repository in parallel with the ones in the starting curriculum. As such, interesting observations can be made by comparing ReProver (starting point) with LeanAgent (ending point). The types of sorry theorems LeanAgent can prove demonstrate its increasing understanding relative to ReProver in complex mathematical concepts due to lifelong learning.
ReProver initially demonstrated proficiency in a range of foundational mathematical areas on MiniF2F:
a) Basic Arithmetic and Number Theory: ReProver could handle simple arithmetic and modular arithmetic problems, such as mathd_numbertheory_254, a theorem about modular arithmetic and basic addition, mathd_numbertheory_342, a theorem about basic divisibility, and mathd_algebra_304, a theorem about simple exponentiation. These proofs rely only on the norm_num tactic, which evaluates arithmetic expressions, suggesting a less sophisticated mathematical proving ability at the start of lifelong learning.
<details>
<summary>extracted/6256159/carbon47.png Details</summary>

### Visual Description
## Code Snippet: Theorem Definitions
### Overview
The image shows a code snippet, likely from a formal verification system or proof assistant, defining three mathematical theorems. The theorems are related to number theory and algebra, and they are proven using the `norm_num` tactic.
### Components/Axes
The code snippet is displayed within a dark-themed window with standard macOS-style window controls (red, yellow, green circles) in the top-left corner. The text is formatted in a monospaced font.
### Detailed Analysis
The code defines three theorems:
1. **Theorem `mathd_numbertheory_254`**:
* Statement: `(239 + 174 + 83) % 10 = 6`
* Proof: `by norm_num`
* Explanation: This theorem states that the remainder of the sum (239 + 174 + 83) divided by 10 is equal to 6.
2. **Theorem `mathd_numbertheory_342`**:
* Statement: `54 % 6 = 0`
* Proof: `by norm_num`
* Explanation: This theorem states that the remainder of 54 divided by 6 is equal to 0.
3. **Theorem `mathd_algebra_304`**:
* Statement: `91^2 = 8281`
* Proof: `by norm_num`
* Explanation: This theorem states that 91 squared is equal to 8281.
The `norm_num` tactic is used to automatically prove these theorems. This tactic likely simplifies the expressions and performs the necessary calculations to verify the equality.
### Key Observations
* All three theorems are proven using the same tactic: `norm_num`.
* The theorems cover basic arithmetic and algebraic operations.
* The naming convention for the theorems suggests they are part of a larger mathematical database or library.
### Interpretation
The code snippet demonstrates the use of a formal verification system to define and prove mathematical theorems. The `norm_num` tactic simplifies the proof process by automating the necessary calculations. The theorems themselves are relatively simple, but they illustrate the basic principles of formal verification. The theorems are named in a way that suggests they are part of a larger collection, possibly used for automated reasoning or mathematical exploration.
</details>
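Consolidating the statements and proofs listed in the description, the snippet reads approximately as follows (a transcription from the visual description, with MiniF2F's theorem names kept as-is):

```lean
theorem mathd_numbertheory_254 : (239 + 174 + 83) % 10 = 6 := by norm_num

theorem mathd_numbertheory_342 : 54 % 6 = 0 := by norm_num

theorem mathd_algebra_304 : 91 ^ 2 = 8281 := by norm_num
```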
b) Elementary Algebra: ReProver could solve basic algebraic equations and perform straightforward manipulations, such as mathd_algebra_141, which proves a statement about quadratic expressions, mathd_algebra_329, which shows a grasp of systems of linear equations, and mathd_algebra_547, which proves basic algebraic manipulation with roots.
<details>
<summary>extracted/6256159/carbon48.png Details</summary>

### Visual Description
## Code Snippet: Mathematical Theorems
### Overview
The image presents a code snippet containing three mathematical theorems, labeled `mathd_algebra_141`, `mathd_algebra_329`, and `mathd_algebra_547`. Each theorem defines variables, provides hypotheses, and states a conclusion, followed by a proof strategy.
### Components/Axes
The code snippet is structured as follows:
- **Theorem Declaration:** `theorem mathd_algebra_XXX` (where XXX is a number).
- **Variable Declaration:** `(a b : R)` or `(x y : R)`, indicating variables `a` and `b` or `x` and `y` are real numbers.
- **Hypotheses:** `(h0: ...)` and `(h1: ...)` and `(h2: ...)` define assumptions.
- **Conclusion:** An equation to be proven.
- **Proof Strategy:** `:= by ...` indicates the method used to prove the conclusion.
### Detailed Analysis
**Theorem mathd_algebra_141:**
- Variables: `a`, `b` are real numbers.
- Hypotheses:
- `h1: (a * b) = 180`
- `h2: 2 * (a + b) = 54`
- Conclusion: `(a^2 + b^2) = 369`
- Proof Strategy: `nlinarith`
**Theorem mathd_algebra_329:**
- Variables: `x`, `y` are real numbers.
- Hypotheses:
- `h0: 3 * y = x`
- `h1: 2 * x + 5 * y = 11`
- Conclusion: `x + y = 4`
- Proof Strategy: `linarith`
**Theorem mathd_algebra_547:**
- Variables: `x`, `y` are real numbers.
- Hypotheses:
- `h0: x = 5`
- `h1: y = 2`
- Conclusion: `Real.sqrt (x^3 - 2^y) = 11`
- Proof Strategy:
- `simp [h0, h1, sq]`
- `rw [Real.sqrt_eq_iff_sq_eq] <;> norm_num`
### Key Observations
- Each theorem involves real number variables and algebraic equations.
- The proof strategies vary, including `nlinarith`, `linarith`, `simp`, `rw`, and `norm_num`.
- The theorems appear to be designed for automated theorem proving, given the specific syntax and proof strategies.
### Interpretation
The code snippet demonstrates the formalization of mathematical theorems in a language suitable for automated reasoning. The theorems are relatively simple algebraic problems, likely used as examples or test cases for a theorem prover. The use of tactics like `nlinarith` and `linarith` suggests the system is capable of automatically solving linear and non-linear arithmetic problems. The `simp` and `rw` tactics indicate simplification and rewriting capabilities, while `norm_num` likely normalizes numerical expressions. The overall structure suggests a system designed for verifying mathematical proofs.
</details>
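Assembled from the hypotheses, conclusions, and proof strategies listed above, the three theorems read approximately as follows (a reconstruction from the description; variable types are real numbers as stated):

```lean
theorem mathd_algebra_141 (a b : ℝ) (h1 : a * b = 180)
    (h2 : 2 * (a + b) = 54) : a ^ 2 + b ^ 2 = 369 := by
  nlinarith

theorem mathd_algebra_329 (x y : ℝ) (h0 : 3 * y = x)
    (h1 : 2 * x + 5 * y = 11) : x + y = 4 := by
  linarith

theorem mathd_algebra_547 (x y : ℝ) (h0 : x = 5) (h1 : y = 2) :
    Real.sqrt (x ^ 3 - 2 ^ y) = 11 := by
  simp [h0, h1, sq]
  rw [Real.sqrt_eq_iff_sq_eq] <;> norm_num
```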
c) Basic Calculus and Analysis: ReProver showed early capabilities in dealing with logarithms and exponentials, including mathd_algebra_484, a theorem involving dividing logarithmic expressions.
<details>
<summary>extracted/6256159/carbon49.png Details</summary>

### Visual Description
## Code Snippet: Theorem Proving
### Overview
The image shows a code snippet, likely from a theorem prover or interactive proof assistant. It defines a theorem related to real number logarithms and provides the steps to prove it. The code is displayed within a dark-themed window with standard close/minimize/maximize buttons.
### Components/Axes
* **Window Controls:** Red, yellow, and green circles in the top-left corner, representing close, minimize, and maximize buttons respectively.
* **Code Block:**
* `theorem mathd_algebra_484 : Real.log 27 / Real.log 3 = 3 := by`
* `field_simp`
* `rw [← Real.log_rpow]`
* `all_goals norm_num`
### Detailed Analysis
The code snippet defines a theorem named `mathd_algebra_484`. The theorem states that `Real.log 27 / Real.log 3` is equal to 3. The proof is provided using the `by` keyword, followed by a sequence of tactics:
1. `field_simp`: Simplifies the expression using field axioms.
2. `rw [← Real.log_rpow]`: Rewrites the expression using the `Real.log_rpow` theorem, applying it in reverse (indicated by the left arrow `←`). This theorem likely relates the logarithm of a power to the product of the exponent and the logarithm of the base.
3. `all_goals norm_num`: Normalizes numerical expressions in all remaining goals.
### Key Observations
* The code uses a domain-specific language (DSL) for theorem proving.
* The proof relies on simplification, rewriting, and normalization tactics.
* The `Real.log_rpow` theorem is a key step in the proof.
### Interpretation
The code snippet demonstrates a formal proof of a simple logarithmic identity. It showcases the use of automated tactics within a theorem prover to simplify and solve mathematical problems. The theorem `Real.log_rpow` is likely used to transform `Real.log 27` into `Real.log (3^3)`, which then simplifies to `3 * Real.log 3`, allowing for cancellation with the denominator. The `norm_num` tactic likely handles the final numerical evaluation.
</details>
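The identity in mathd_algebra_484 can also be checked by hand, since $27 = 3^3$:

$$\frac{\log 27}{\log 3} = \frac{\log 3^3}{\log 3} = \frac{3 \log 3}{\log 3} = 3,$$

which is exactly the cancellation that rewriting with `Real.log_rpow` enables.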
Notably, the proofs at this stage were characteristically concise, often using basic tactics like norm_num, linarith, and field_simp. This suggests that ReProver recognized these theorems as straightforward enough to prove without complex retrieval of premises, similar to its behavior with previous repositories.
However, by the end of the lifelong learning process, LeanAgent exhibited significant growth in its mathematical reasoning abilities on MiniF2F:
a) Advanced Number Theory: LeanAgent showed a more advanced grasp of number theory, proving theorems like mathd_numbertheory_293, a divisibility theorem involving an intricate expression, and mathd_numbertheory_233, a theorem dealing with modular arithmetic in $\text{ZMod}(11^{2})$ .
<details>
<summary>extracted/6256159/carbon50.png Details</summary>

### Visual Description
## Code Snippet: Number Theory Theorems
### Overview
The image presents a code snippet, likely from a formal verification system or proof assistant, defining two theorems related to number theory. The code uses a specific syntax for defining theorems, assumptions, and proofs.
### Components/Axes
The image contains the following elements:
* **Window Decorations:** Standard window controls (red, yellow, green circles) in the top-left corner.
* **Theorem Definitions:** Two theorem definitions, `mathd_numbertheory_293` and `mathd_numbertheory_233`.
* **Variables and Types:** Declarations of variables with their corresponding types (e.g., `n : N`, `b : ZMod (11^2)`).
* **Assumptions:** Hypotheses or preconditions for the theorems (e.g., `h0 : n ≤ 9`, `h1 : 11|20 * 100 + 10 * n + 7`, `h0 : b = 24⁻¹`).
* **Statements:** Assertions to be proven (e.g., `n = 5`, `b = 116`).
* **Proof Strategies:** Tactics used to prove the statements (e.g., `:= by omega`, `:= by exact h0`).
### Detailed Analysis
**Theorem `mathd_numbertheory_293`:**
* **Name:** `theorem mathd_numbertheory_293`
* **Variable:** `n` of type `N` (likely natural numbers).
* **Assumption `h0`:** `n ≤ 9` (n is less than or equal to 9).
* **Assumption `h1`:** `11|20 * 100 + 10 * n + 7` (11 divides 20 * 100 + 10 * n + 7).
* **Statement:** `n = 5`
* **Proof:** `:= by omega` (likely an automated proof tactic).
**Theorem `mathd_numbertheory_233`:**
* **Name:** `theorem mathd_numbertheory_233`
* **Variable:** `b` of type `ZMod (11^2)` (integers modulo 11 squared, i.e., modulo 121).
* **Assumption `h0`:** `b = 24⁻¹` (b is the modular inverse of 24 modulo 121).
* **Statement:** `b = 116`
* **Proof:** `:= by exact h0` (the statement follows directly from assumption h0).
### Key Observations
* The code uses a formal syntax, suggesting it's part of a proof assistant or formal verification tool.
* The theorems involve basic number theory concepts like divisibility, modular arithmetic, and modular inverses.
* The proofs are concise, relying on automated tactics or direct application of assumptions.
### Interpretation
The code snippet demonstrates the formalization of number theory theorems within a proof assistant. The theorems are precisely stated with explicit assumptions and types. The proofs, while brief, indicate the logical steps required to establish the truth of the statements. The use of tactics like `omega` suggests the system can automatically handle certain types of proofs. The second theorem shows a direct application of an assumption to prove the statement.
</details>
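Both number-theory facts described above can be sanity-checked numerically outside Lean; a small Python sketch (numbers taken directly from the two theorem statements):

```python
# mathd_numbertheory_293: the only n <= 9 with 11 dividing 20*100 + 10*n + 7.
solutions = [n for n in range(10) if (20 * 100 + 10 * n + 7) % 11 == 0]
assert solutions == [5]

# mathd_numbertheory_233: the inverse of 24 modulo 11^2 = 121 is 116.
inv = pow(24, -1, 121)  # modular inverse (three-argument pow, Python 3.8+)
assert inv == 116 and (24 * inv) % 121 == 1
print(solutions, inv)
```

These checks confirm why `omega` and `exact h0` suffice: the goals reduce to concrete arithmetic over small moduli.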
b) Sophisticated Algebra: LeanAgent showed a better grasp of more complex algebraic manipulations. Theorems include mathd_algebra_148, which involves function definitions and solving for unknown coefficients, and amc12a_2016_p3, which involves evaluating a floor-based function at specific values.
<details>
<summary>extracted/6256159/carbon61.png Details</summary>

### Visual Description
## Code Snippet: Mathematical Theorems
### Overview
The image displays a code snippet, likely from a formal verification system or a programming language designed for mathematical proofs. It presents two theorems, `mathd_algebra_148` and `amc12a_2016_p3`, along with their definitions and proofs.
### Components/Axes
The code snippet is structured as follows:
* **Theorem Declarations:** Each theorem is declared with the keyword `theorem` followed by its name.
* **Variable Declarations:** Variables and their types are declared within parentheses, e.g., `(c : R)` indicates that `c` is a real number.
* **Function Definitions:** Functions are defined with their domain and codomain, e.g., `(f : R → R)` indicates that `f` is a function from real numbers to real numbers.
* **Hypotheses:** Hypotheses are introduced with labels like `h0` and `h1`, followed by a colon and the logical statement they represent.
* **Goal:** The statement to be proven is presented before the `:= by` keyword.
* **Proof:** The proof is initiated by `:= by` and followed by proof tactics like `linarith`, `norm_num`, `field_simp`, and `norm_cast`.
### Detailed Analysis
**Theorem: `mathd_algebra_148`**
* **Variables:**
* `c`: A real number (R).
* `f`: A function from real numbers to real numbers (R → R).
* **Hypotheses:**
* `h0`: For all x, `f x = c * x^3 - 9 * x + 3`.
* `h1`: `f 2 = 9`.
* **Goal:** `c = 3`.
* **Proof:** The proof uses the `linarith` tactic with hypotheses `h0` and `2`.
**Theorem: `amc12a_2016_p3`**
* **Function:** `f`: A function from R × R → R
* **Hypotheses:**
* `h0`: For all x, for all y, if `y ≠ 0`, then `f x y = x - y * Int.floor (x / y)`.
* **Goal:** `f (3 / 8) (-(2 / 5)) = -(1 / 40)`.
* **Proof:** The proof uses the `norm_num`, `field_simp`, and `norm_cast` tactics with hypothesis `h0`.
### Key Observations
* The code uses a formal language to define and prove mathematical theorems.
* The theorems involve real numbers, functions, and logical quantifiers.
* The proofs rely on automated tactics to simplify expressions and derive the desired conclusions.
### Interpretation
The code snippet demonstrates the use of formal methods in mathematics. It shows how theorems can be precisely defined and rigorously proven using a combination of logical statements and automated proof tactics. The `mathd_algebra_148` theorem establishes a specific value for the coefficient `c` in a cubic function, given a function value at `x = 2`. The `amc12a_2016_p3` theorem evaluates a function `f` defined using the floor function. The use of tactics like `linarith`, `norm_num`, `field_simp`, and `norm_cast` suggests that the system is capable of performing algebraic manipulations and simplifying expressions automatically. This approach is valuable for ensuring the correctness of mathematical results and for automating the process of mathematical discovery.
</details>
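Both algebraic facts can be verified with exact rational arithmetic; a small Python sketch of the computations the tactics carry out:

```python
from fractions import Fraction
import math

# mathd_algebra_148: f(x) = c*x^3 - 9*x + 3 with f(2) = 9.
# f(2) = 8*c - 18 + 3 = 8*c - 15, so 8*c - 15 = 9 forces c = 3.
c = Fraction(9 + 15, 8)
assert c == 3

# amc12a_2016_p3: f(x, y) = x - y * floor(x / y) at x = 3/8, y = -2/5.
x, y = Fraction(3, 8), Fraction(-2, 5)
value = x - y * math.floor(x / y)  # floor((3/8) / (-2/5)) = floor(-15/16) = -1
assert value == Fraction(-1, 40)
print(c, value)
```

The second computation shows why `norm_cast` is needed in the Lean proof: the floor introduces an integer that must be coerced back into the rationals.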
c) Advanced Calculus and Analysis: LeanAgent demonstrated improved capabilities in handling more complex analytical problems, including mathd_algebra_270, a theorem involving function composition and rational expressions.
<details>
<summary>extracted/6256159/carbon52.png Details</summary>

### Visual Description
## Code Snippet: Theorem Definition
### Overview
The image shows a code snippet defining a theorem related to algebra. It defines a function `f` from real numbers to real numbers and provides a condition on its behavior. It then states that `f(f(1)) = 3/7` and uses a tactic to prove this statement.
### Components/Axes
* **Header:** The top of the code snippet includes three colored circles (red, yellow, green), likely for window control.
* **Theorem Definition:** The core of the snippet defines a theorem named `mathd_algebra_270`.
* **Function Definition:** Defines a function `f` that maps real numbers to real numbers (`f: R -> R`).
* **Hypothesis:** States a hypothesis `h0` that for all `x` not equal to -2, `f(x) = 1 / (x + 2)`.
* **Assertion:** Asserts that `f(f(1)) = 3/7`.
* **Tactic:** Uses a tactic `set_option tactic.skipAssignedInstances false in norm_num [h0]` to prove the assertion.
### Detailed Analysis
The code snippet contains the following information:
* `theorem mathd_algebra_270`
* `(f : R -> R)`
* `(h0 : ∀ x, x ≠ -2 -> f x = 1 / (x + 2)) :`
* `f (f 1) = 3/7 := by`
* `set_option tactic.skipAssignedInstances false in norm_num [h0]`
### Key Observations
* The theorem defines a function `f` with a specific property.
* The hypothesis `h0` defines the function's behavior for all `x` except -2.
* The assertion `f(f(1)) = 3/7` is a specific claim about the function's value.
* The tactic is used to automatically prove the assertion, likely using the hypothesis `h0`.
### Interpretation
The code snippet defines a theorem and attempts to prove it using automated tactics. The theorem states that if a function `f` maps real numbers to real numbers and satisfies the condition `f(x) = 1 / (x + 2)` for all `x` not equal to -2, then `f(f(1))` must equal `3/7`. The tactic `norm_num` likely simplifies the expression `f(f(1))` using the hypothesis `h0` to arrive at the value `3/7`. The `set_option` command likely configures the tactic to skip assigned instances during the normalization process.
</details>
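The function-composition claim is easy to confirm with exact rational arithmetic; a brief Python check of the value the `norm_num` call establishes:

```python
from fractions import Fraction

# mathd_algebra_270: f(x) = 1 / (x + 2) for x != -2.
def f(x):
    return 1 / (x + Fraction(2))

assert f(Fraction(1)) == Fraction(1, 3)       # inner application: f(1) = 1/3
assert f(f(Fraction(1))) == Fraction(3, 7)    # f(1/3) = 1 / (1/3 + 2) = 3/7
print(f(f(Fraction(1))))
```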
d) Complex Induction: LeanAgent became adept at more advanced induction proofs. An example is induction_12dvd4expnp1p20, a theorem about divisibility that requires an induction proof.
<details>
<summary>extracted/6256159/carbon54.png Details</summary>

### Visual Description
## Code Snippet: Theorem Definition and Proof
### Overview
The image shows a code snippet, likely from a formal verification system or proof assistant, defining a theorem and outlining its proof strategy. The code defines a theorem named "induction_12dvd4expnp1p20" which states that for any natural number 'n', 12 divides 4^(n+1) + 20. The proof uses induction and simplification techniques.
### Components/Axes
* **Theorem Definition:** `theorem induction_12dvd4expnp1p20 (n : N) : 12 | 4^(n+1) + 20 := by`
* `theorem`: Keyword indicating the start of a theorem definition.
* `induction_12dvd4expnp1p20`: Name of the theorem.
* `(n : N)`: Declares 'n' as a variable of type 'N', where 'N' likely represents the set of natural numbers.
* `: 12 | 4^(n+1) + 20`: States that 12 divides 4^(n+1) + 20. The symbol '|' likely represents "divides".
* `:= by`: Indicates the start of the proof.
* **Proof Steps:**
* `norm_num`: Normalizes numerical expressions.
* `induction' n with n hn`: Applies induction on the variable 'n'. 'hn' likely represents the inductive hypothesis.
* `simp`: Simplifies the expression.
* `omega`: A tactic for solving arithmetic goals.
### Detailed Analysis
The code snippet defines a theorem and provides a proof strategy. The theorem states that for any natural number 'n', the expression 4^(n+1) + 20 is divisible by 12. The proof strategy involves normalizing numerical expressions, applying induction, simplifying the expression, and using an omega tactic to solve arithmetic goals.
### Key Observations
The code uses a formal language for defining and proving theorems. The proof strategy is concise and relies on automated tactics like `norm_num`, `simp`, and `omega`.
### Interpretation
The code snippet demonstrates a formal approach to proving mathematical theorems. The use of induction and simplification techniques is common in mathematical proofs. The automated tactics suggest that the system is capable of performing complex reasoning steps automatically. The theorem itself relates to divisibility properties of exponential expressions.
</details>
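The divisibility claim and the algebraic identity that drives the inductive step can be spot-checked numerically:

```python
# induction_12dvd4expnp1p20: 12 | 4^(n+1) + 20 for every natural n.
# Inductive step identity: 4^(n+2) + 20 = 4 * (4^(n+1) + 20) - 60,
# and 12 | 60, so divisibility propagates from n to n + 1.
for n in range(50):
    assert (4 ** (n + 1) + 20) % 12 == 0
    assert 4 ** (n + 2) + 20 == 4 * (4 ** (n + 1) + 20) - 60
print("ok")
```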
e) Complex Quantifiers and Inequalities: LeanAgent increased its ability to prove more complex logical statements, such as amc12a_2002_p6, a theorem involving multiple existential quantifiers and inequalities.
<details>
<summary>extracted/6256159/carbon55.png Details</summary>

### Visual Description
## Code Snippet: Theorem Definition
### Overview
The image shows a code snippet defining a theorem named `amc12a_2002_p6`. The theorem takes a natural number `n` and a proof `h0` that `0 < n` as input. It asserts that there exists an `m` with `m > n` and a `p` such that `m * p <= m + p`. The proof is constructed using tactics like `lift`, `cases'`, and `exact`.
### Components/Axes
* **Theorem Name:** `amc12a_2002_p6`
* **Input Parameters:**
* `n : N` (n is a natural number)
* `h0 : 0 < n` (h0 is a proof that 0 is less than n)
* **Goal:** `∃ m, (m > n ∧ ∃ p, m * p ≤ m + p)` (There exists an m such that m > n and there exists a p such that m * p <= m + p)
* **Proof Strategy:**
* `lift n to N+ using h0` (Lift n to the positive natural numbers using the proof h0)
* `cases' n with n` (Perform case analysis on n)
* `exact (_, lt_add_of_pos_right _ zero_lt_one, 1, by simp)` (Provide the exact proof term)
### Detailed Analysis
The code snippet defines a theorem and provides a proof using a tactic-based approach. The theorem states that for any natural number `n` greater than 0, there exist numbers `m` and `p` satisfying certain conditions. The proof involves lifting `n` to the positive natural numbers, performing case analysis, and providing an exact proof term.
The proof term `(_, lt_add_of_pos_right _ zero_lt_one, 1, by simp)` suggests that the proof relies on the lemma `lt_add_of_pos_right` and the fact that `0 < 1`. The `by simp` tactic likely simplifies the goal to complete the proof.
### Key Observations
* The theorem is named `amc12a_2002_p6`, which suggests it might be related to a problem from the AMC 12A competition in 2002.
* The proof uses tactics like `lift`, `cases'`, and `exact`, which are common in interactive theorem provers.
* The proof term involves lemmas related to inequalities and simplification.
### Interpretation
The code snippet demonstrates a formal proof of a mathematical theorem. The theorem states that for any natural number `n` greater than 0, there exist numbers `m` and `p` such that `m > n` and `m * p <= m + p`. The proof is constructed using a combination of tactics and lemmas, showcasing a typical workflow in interactive theorem proving. The specific lemmas used in the proof term suggest that the proof relies on basic properties of inequalities and natural numbers. The theorem and its proof might be related to a problem from a mathematical competition, indicating its potential relevance to problem-solving in mathematics.
</details>
The proofs at this later stage are more sophisticated, usually involving multiple steps and combining various mathematical concepts, indicating a better ability to connect different areas of mathematics and mirroring the progression observed in Mathematics in Lean Source and SciLean. For example, as shown above, LeanAgent provides a one-line proof of the relatively advanced theorem mathd_numbertheory_233, in which the hypothesis directly establishes the goal. This suggests that LeanAgent grasps modular arithmetic and recognizes when a given hypothesis is sufficient to prove the goal without additional steps.
Furthermore, as shown above, LeanAgent uses four tactics to prove the induction_12dvd4expnp1p20 theorem. This demonstrates its ability to handle more complex number theory proofs and use advanced tactics. This again shows that LeanAgent can recognize when it does not require complex premise retrieval.
LeanAgent demonstrates a similar grasp of the theorem amc12a_2002_p6. Notably, it combines the simple premises lt_add_of_pos_right, which states that an element is less than itself plus a positive element, and zero_lt_one, which states that 0 is less than 1, with more advanced tactics like lift, cases', and exact with complex term construction. This demonstrates its ability to reuse foundational concepts in more complex proofs of abstract mathematical concepts, showing its stability.
Importantly, LeanAgent's performance on MiniF2F showcases its ability to adapt and improve across different mathematical domains. We see this in the progression from ReProver's basic arithmetic and algebra to LeanAgent's more advanced number theory, calculus, and abstract algebra. This aligns with the observations from Mathematics in Lean Source and SciLean, further supporting the effectiveness of LeanAgent's lifelong learning approach in theorem proving across various mathematical repositories.
Furthermore, early proofs from ReProver dealt with concrete numbers and simple equations. Later proofs from LeanAgent involved more abstract concepts like equivalence relations and function properties. LeanAgent gained the capability to handle more complex number theory problems involving divisibility under constraints. Moreover, LeanAgent shifted from solving basic linear and quadratic equations to analyzing functions and their compositions. Also, early proofs often used norm_num for straightforward computations. Later proofs employed more varied tactics and premises, suggesting a more sophisticated approach to proof construction. This all crucially suggests that while existing methods, like ReProver, may be more tailored to simpler computation problems, LeanAgent is superior on complex and analytical problems. These are precisely the types of problems present in advanced mathematics. This also corroborates LeanAgentâs performance on the repositories mentioned previously.
In addition, we evaluate LeanAgent on the Lean4 version of the MiniF2F test set using the same experimental setup from prior experiments, taking care not to progressively train on Test.lean and only proving sorry theorems from it. The results are as follows:
LeanAgent: Pass@1 of 38.1% (93 / 244 theorems)
ReProver (our run on Lean4): Pass@1 of 34.0% (83 / 244 theorems)
ReProver was only tested on the Lean3 test set in the LeanDojo paper, so we ran ReProver on the Lean4 test set for a fairer comparison. TheoremLlama (Wang et al., 2024c) reports a 33.61% cumulative accuracy on the Lean4 test set, but this makes a direct comparison with the Pass@1 rates of LeanAgent and ReProver infeasible. Moreover, DeepSeek-Prover-V1.5 (Xin et al., 2024b) reports a state-of-the-art result of 63.5% on the Lean4 test set.
However, LeanAgent is a lifelong learning framework, not a model. The performance of LeanAgent depends on the retriever used as the starting point: ReProver's retriever in our case. As such, the metrics for other methods besides ReProver cannot be directly compared with LeanAgent. Such a comparison would be impractical for reasons including differences in data, pretraining, and fine-tuning. Again, we only compare with ReProver because we use ReProver's retriever as LeanAgent's starting point, allowing for a more faithful comparison. We use ReProver because past work has shown that retrieval leads to improved generalizability.
Moreover, we design LeanAgent to continuously generalize to and improve on ever-expanding mathematical knowledge without forgetting previously learned knowledge. Our goal is not to beat the MiniF2F benchmark; instead, we aim to perform well in proving sorry theorems across a range of diverse repositories. The other approaches mentioned focus on different objectives and don't address the lifelong learning aspect that is central to our work.
Formal Book. We first examine the sorry theorems from Formal Book that LeanAgent proved during lifelong learning. These theorems centered around:
a) Real Analysis and Inequalities: LeanAgent demonstrates a better understanding of real number properties than ReProver and can handle basic inequality reasoning, proving book.irrational.lem_aux_ii, which involves real analysis and inequalities.
<details>
<summary>extracted/6256159/carbon56.png Details</summary>

### Visual Description
## Code Snippet: Lemma Definition
### Overview
The image is a screenshot of a code snippet, likely from a formal verification system like Coq or Lean. It defines a lemma named `lem_aux_ii` with several input parameters and a logical statement to be proven. The proof strategy is indicated as using the `constructor` tactic followed by the `linarith` tactic.
### Components/Axes
The code snippet includes the following elements:
* **Lemma Declaration:** `lemma lem_aux_ii`
* **Input Parameters:**
* `n : N` (n is a natural number)
* `x : R` (x is a real number)
* `h_1 : 0 < x` (hypothesis 1: 0 is less than x)
* `h_2 : x < 0` (hypothesis 2: x is less than 0)
* **Logical Statement:** `(0 < f_aux n x) ∧ (f_aux n x < (1 : R) / n.factorial)`
* `f_aux n x` represents some function of `n` and `x`.
* `∧` represents the logical "and" operator.
* `(1 : R)` indicates that 1 is being treated as a real number.
* `n.factorial` represents the factorial of `n`.
* **Proof Strategy:** `:= by constructor <;> linarith`
* `:= by` indicates the start of the proof.
* `constructor` is a tactic that attempts to prove the statement by applying the appropriate constructor.
* `<;>` is a tactic combinator.
* `linarith` is a tactic that attempts to prove the statement using linear arithmetic.
### Detailed Analysis
The lemma `lem_aux_ii` takes a natural number `n` and a real number `x` as input, along with two hypotheses: `0 < x` and `x < 0`. The lemma states that under these conditions, `0 < f_aux n x` and `f_aux n x < (1 : R) / n.factorial` are both true. The proof strategy involves first using the `constructor` tactic and then the `linarith` tactic.
### Key Observations
The hypotheses `0 < x` and `x < 0` are contradictory. This suggests that the lemma is either intended to be proven under contradictory assumptions (which would make the conclusion trivially true), or there is an error in the hypotheses.
### Interpretation
The code snippet defines a lemma that appears to be attempting to prove a statement about a function `f_aux` under certain conditions. However, the contradictory hypotheses `0 < x` and `x < 0` raise questions about the validity or purpose of the lemma. It is possible that the lemma is intended to demonstrate a proof technique under contradictory assumptions, or that there is an error in the hypotheses that needs to be corrected. The proof strategy suggests an attempt to construct a proof and then use linear arithmetic to complete it.
</details>
b) Number Theory: LeanAgent shows capabilities in fundamental number theory concepts, proving book.quadratic_reciprocity.quadratic_reciprocity_2, a key result in quadratic reciprocity.
<details>
<summary>extracted/6256159/carbon57.png Details</summary>

### Visual Description
## Code Snippet: Quadratic Reciprocity Theorem
### Overview
The image is a screenshot of a code snippet, likely from a formal verification system or proof assistant. It defines and proves the quadratic reciprocity theorem.
### Components/Axes
The code snippet includes the following elements:
- **Theorem Declaration:** `theorem quadratic_reciprocity_2`
- **Variables:** `p`, `q` of type `N` (likely natural numbers)
- **Hypotheses:** `hp : p ≠ 2`, `hq : q ≠ 2` (p and q are not equal to 2)
- **Facts:** `Fact (Nat.Prime p)`, `Fact (Nat.Prime q)` (p and q are prime numbers)
- **Equation:** `(legendre_sym p q) * (legendre_sym q p) = -1^((p-1) / 2 * (q - 1) / 2 )`
- **Proof Strategy:** `:= by exact book.quadratic_reciprocity.quadratic_reciprocity_1 p q hp hq`
### Detailed Analysis
The code defines the quadratic reciprocity theorem, which relates the Legendre symbols `(legendre_sym p q)` and `(legendre_sym q p)`. The theorem states that the product of these symbols is equal to -1 raised to the power of `((p-1) / 2 * (q - 1) / 2 )`. The proof is provided by referencing a pre-existing proof in the `book.quadratic_reciprocity` library.
### Key Observations
- The code uses a formal language for defining mathematical theorems and proofs.
- The theorem is stated in terms of Legendre symbols and prime numbers.
- The proof relies on an existing, verified proof.
### Interpretation
The code snippet demonstrates the formalization of a mathematical theorem (quadratic reciprocity) within a proof assistant or formal verification system. This allows for rigorous verification of mathematical statements and ensures their correctness. The use of existing libraries and proofs promotes code reuse and simplifies the verification process. The theorem relates the Legendre symbols, which are used to determine whether a number is a quadratic residue modulo a prime.
</details>
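The reciprocity sign relation stated above can be spot-checked numerically via Euler's criterion; a sketch in ordinary Python, independent of the Lean formalization:

```python
def legendre(a, p):
    """Legendre symbol (a/p) for an odd prime p, via Euler's criterion:
    a^((p-1)/2) mod p, mapped into {-1, 0, 1}."""
    r = pow(a, (p - 1) // 2, p)
    return -1 if r == p - 1 else r

# Check (p/q) * (q/p) = (-1)^(((p-1)/2) * ((q-1)/2)) for a few odd prime pairs.
for p, q in [(3, 5), (3, 7), (5, 7), (7, 11), (11, 13)]:
    lhs = legendre(p, q) * legendre(q, p)
    rhs = (-1) ** ((p - 1) // 2 * ((q - 1) // 2))
    assert lhs == rhs
print("ok")
```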
Notably, the proof of the first theorem uses no premises, and the proof of the second uses a simple statement of quadratic reciprocity. However, by the end of the lifelong learning process, LeanAgent exhibits growth in its proving abilities in this repository:
a) Advanced Abstract Algebra: LeanAgent advances to proving wedderburn (Wedderburn's Little Theorem), a crucial result in abstract algebra stating that every finite division ring is a field.
<details>
<summary>extracted/6256159/carbon58.png Details</summary>

### Visual Description
## Screenshot: Code Snippet
### Overview
The image is a screenshot of a code snippet, likely from a programming environment or text editor. The code defines a theorem related to fields in mathematics. The snippet is presented in a dark-themed window with standard operating system controls.
### Components/Axes
* **Window Controls:** The window has three circles in the top-left corner: red, yellow, and green.
* **Code:** The code consists of the following line:
`theorem wedderburn (h: Fintype R): IsField R := by apply Field.toIsField`
### Detailed Analysis
The code snippet contains the following elements:
* `theorem wedderburn`: Declares a theorem named "wedderburn."
* `(h: Fintype R)`: Defines a hypothesis named "h" stating that "R" is a finite type (Fintype). The variable `R` is colored red.
* `: IsField R`: Specifies that "R" is a field. The variable `R` is colored red.
* `:= by`: Indicates that the proof will follow.
* `apply Field.toIsField`: Applies a function or tactic named "Field.toIsField" to prove the theorem.
### Key Observations
The code snippet appears to be written in a language used for formal mathematical proofs, possibly a variant of ML or a similar language. The use of "theorem," "Fintype," and "IsField" suggests a focus on mathematical concepts. The "by apply" syntax indicates a tactic-based proof style.
### Interpretation
The code snippet defines and proves Wedderburn's little theorem, which states that every finite division ring is a field. The code leverages existing definitions and tactics (like `Field.toIsField`) to construct the proof. The snippet is likely part of a larger formalization of mathematics in a proof assistant system.
</details>
LeanAgent's proof of the wedderburn theorem demonstrates its ability to handle algebraic structures. By using the Field.toIsField premise, LeanAgent shows that it has grasped what makes a ring a field, which requires an understanding of ring theory and field properties.
Coxeter. LeanAgent could not prove sorry theorems from the Coxeter repository during lifelong learning. However, by the end of lifelong learning, LeanAgent demonstrates a growing understanding of more complex algebraic structures, proving the lemma invmap.of_eq about Coxeter systems, again showing the ability to work with advanced concepts in group theory and abstract algebra. This corroborates the handling of abstract algebra necessary to prove the wedderburn theorem.
<details>
<summary>extracted/6256159/carbon59.png Details</summary>

### Visual Description
## Code Snippet: Lean Theorem Prover Code
### Overview
The image shows a code snippet, likely from the Lean theorem prover, defining a lemma related to the inverse map of a Coxeter system. The code includes the lemma definition, simplification, unfolding, and application steps.
### Components/Axes
* **Code Block:** The main component is a block of code with a dark background and syntax highlighting.
* **Window Controls:** Three circles (red, yellow, green) are located at the top-left, resembling window controls.
### Detailed Analysis
The code snippet contains the following lines:
1. `lemma invmap.of_eq {S:Set G} [CoxeterSystem G S] {s :S} : invmap S s = s := by`
2. `simp [CoxeterSystem.Presentation.invmap]`
3. `unfold CoxeterSystem.toMatrix`
4. `apply CoxeterSystem.monoidLift.mapLift.of`
### Key Observations
* The code defines a lemma named `invmap.of_eq`.
* It involves sets `S` and `G`, and a Coxeter system.
* The lemma states that `invmap S s = s`.
* The proof uses simplification, unfolding, and application steps.
### Interpretation
The code snippet is a formal proof within the Lean theorem prover. It defines and proves a property related to the inverse map in the context of Coxeter systems. The lemma `invmap.of_eq` likely establishes a fundamental relationship between an element `s` in a set `S` and its inverse map. The proof strategy involves simplifying the expression, unfolding definitions, and applying relevant theorems or functions. This type of code is used to formally verify mathematical statements and ensure their correctness.
</details>
LeanAgent's proof of invmap.of_eq involves unfolding definitions and applying specific properties of Coxeter systems. This demonstrates LeanAgent's growing understanding, relative to ReProver, of abstract algebra specific to the new repository it has learned from.
Hairy Ball Theorem. Moreover, LeanAgent could not prove sorry theorems from the Hairy Ball Theorem repository during lifelong learning. However, at the end of lifelong learning, LeanAgent again demonstrates a stronger understanding of algebraic topology than ReProver. It proves HairyBallDiff, which states a key step in the Hairy Ball Theorem, demonstrating a grasp of vector spaces, norms, and their connections to topological concepts.
<details>
<summary>extracted/6256159/carbon60.png Details</summary>

### Visual Description
## Code Snippet: HairyBallDiff Theorem
### Overview
The image is a screenshot of a code snippet, likely from a proof assistant or programming environment. It defines a theorem named "HairyBallDiff" and provides a proof using tactics like "use" and "rw". The code appears to be related to mathematical concepts, specifically involving norms and vector units.
### Components/Axes
* **Top-Left**: Three colored circles (red, yellow, green) resembling window control buttons.
* **Main Text**: The code snippet itself, containing the theorem definition and proof steps.
### Detailed Analysis
The code snippet contains the following lines:
* `theorem HairyBallDiff : ∃ x, v x = 0 := by`
* This line defines a theorem named "HairyBallDiff".
* The theorem states: "There exists an x, such that v(x) = 0".
* The `:= by` indicates the start of the proof.
* `use 0`
* This line uses the value 0 as a witness for the existential quantifier.
* `rw [← norm_eq_zero]`
* This line uses the `rw` tactic (likely "rewrite") to apply the lemma `norm_eq_zero` in reverse (indicated by the left arrow).
* `rw [vUnit, norm_zero]`
* This line uses the `rw` tactic to apply the lemmas `vUnit` and `norm_zero`.
### Key Observations
* The code snippet is a formal proof of a theorem.
* The proof relies on rewriting using lemmas related to norms and vector units.
* The theorem likely relates to the Hairy Ball Theorem, which states that there is no nonvanishing continuous tangent vector field on even-dimensional n-spheres.
### Interpretation
The code snippet demonstrates a formal proof of a theorem, likely within a proof assistant like Lean or Coq. The theorem, "HairyBallDiff," seems to be related to a differential version of the Hairy Ball Theorem. The proof uses tactics to manipulate the theorem statement and apply known lemmas, ultimately demonstrating the existence of a point where a vector field equals zero. The use of `norm_eq_zero`, `vUnit`, and `norm_zero` suggests that the proof involves concepts from linear algebra and analysis.
</details>
Crucially, only LeanAgent could prove invmap.of_eq, wedderburn, and HairyBallDiff, demonstrating that it has developed much more advanced theorem-proving capabilities than other setups. These proofs show that LeanAgent can work with highly abstract concepts and apply them to specific mathematical objects.
A.6 Curriculum Learning Analysis
In this section, we aim to answer the following questions: (1) Why does LeanAgent use $e^{S}$ ( $S$ = number of proof steps) as its complexity metric? (2) Why does curriculum learning work in theorem proving? (3) Why does LeanAgent use curriculum learning instead of other lifelong learning methods?
Complexity Measure. There is no universal measure of proof complexity in Lean or other formal systems. One approach, the length-based measure, involves examining the proof length (number of steps or lines) and the size of the proof term in a formal system. While these can indicate verification complexity, they may not fully capture the complexity of discovering a proof (Arana & Stafford, 2023). Moreover, within the NLP literature, many works have related input length to complexity (Zaremba & Sutskever, 2015; Cirik et al., 2016; Spitkovsky et al.; Subramanian et al., 2017; Chang et al., 2021). Starting with shorter sequences and gradually increasing length improves model quality (Li et al., 2024).
These works demonstrate the gains from basing the complexity measure on input length. As such, we consider the equivalent of length in theorem proving to be the number of proof steps. However, a linear scaling of length is naive for theorem proving; it does not account for the combinatorial explosion of possible proof paths as proof length increases. We therefore choose an exponential scaling. A key strength of this choice is that it is easy to compute and requires no additional hyperparameters to tune.
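As an illustration, ordering theorems by this metric requires only the proof-step counts. The sketch below is ours, not LeanAgent's actual interface; the helper names and the `(name, steps)` tuple format are hypothetical:

```python
import math

def complexity(num_proof_steps: int) -> float:
    """Exponential complexity score e^S for a proof with S steps."""
    return math.exp(num_proof_steps)

def build_curriculum(theorems: list[tuple[str, int]]) -> list[str]:
    """Order theorems by increasing e^S complexity.

    `theorems` is a list of (name, number_of_proof_steps) pairs.
    """
    return [name for name, steps in sorted(theorems, key=lambda t: complexity(t[1]))]

# Shorter proofs come earlier in the curriculum.
ordered = build_curriculum([("hard_thm", 12), ("easy_thm", 2), ("medium_thm", 5)])
# ordered == ["easy_thm", "medium_thm", "hard_thm"]
```

Since $e^{S}$ is monotone in $S$, the resulting ordering matches sorting by raw step count; the exponential scale matters when absolute difficulty scores are compared or binned across repositories.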
We now discuss some alternative complexity metrics and why we chose not to use them in LeanAgent. One option is $e^{B}$, where $B$ represents the number of different proof paths that could be explored at each step. Formally, $B$ is defined as the average number of child nodes over the non-leaf nodes of the proof tree; this is sometimes called the branching factor. We refrain from using this metric because computing it becomes expensive for complex proofs. Another option is to measure the complexity of the theorem statement itself, for example by the number of unique symbols, the depth of nested expressions, or the number of quantifiers. However, developing a reliable statement-complexity metric that works across various mathematical domains is challenging, and since LeanAgent focuses on improving proof generation, a metric tied directly to the proof process (number of steps) aligns better with this goal. Dependency-based complexity, where theorems are ordered by their dependency structure within the mathematical library, was not used either: a theorem might depend on many simple results yet be easy to prove, or depend on only a few results yet be very challenging. A topic-based curriculum would also be unsuitable because LeanAgent aims to be a general-purpose framework; a topic-based approach might bias it towards certain mathematical domains and does not account for theorems spanning multiple mathematical concepts. Finally, a premise-based complexity measure would require analyzing premise occurrence frequencies across repositories, which could be computationally expensive, especially for large repositories like SciLean.
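For comparison, computing the rejected branching factor $B$ requires reconstructing and traversing each proof tree. A minimal sketch, assuming a hypothetical dictionary encoding of the tree (node id to list of child ids), illustrates the extra machinery involved:

```python
def branching_factor(children: dict[str, list[str]]) -> float:
    """Average number of children over the non-leaf nodes of a proof tree.

    `children` maps each node id to its list of child node ids;
    leaf nodes map to an empty list.
    """
    non_leaf = [kids for kids in children.values() if kids]
    if not non_leaf:
        return 0.0
    return sum(len(kids) for kids in non_leaf) / len(non_leaf)

# A small proof tree: the root has 3 children, one of which has 2 children.
tree = {"root": ["a", "b", "c"], "a": ["a1", "a2"],
        "b": [], "c": [], "a1": [], "a2": []}
b = branching_factor(tree)  # (3 + 2) / 2 = 2.5
```

Unlike the step count $S$, which falls out of the proof text directly, $B$ demands the full tree structure, which is what makes $e^{B}$ costly for complex proofs.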
Curriculum Learning. Prior work suggests that curriculum learning guides the learner towards better local minima in non-convex optimization problems (Bengio et al., 2009). Crucially, theorem proving involves navigating a highly non-convex optimization landscape, especially for complex mathematical statements. The space of possible proofs is vast, with many potential dead ends and suboptimal solutions to parts of a proof. This makes theorem proving an ideal candidate for benefiting from curriculum learning. Moreover, curriculum learning has been shown to have an effect similar to unsupervised pre-training, acting as a form of regularization (Bengio et al., 2009). For theorem proving, curriculum learning allows LeanAgent to build a foundational knowledge base of mathematics before attempting more complex theorems, naturally leading to continuous improvement and avoiding suboptimal proof strategies early in training. Furthermore, this leads to more robust and generalizable proof techniques that work across a broader range of theorems, explaining LeanAgent's sorry theorem proving performance. Finally, by acting as a regularizer, curriculum learning prevents the model from overfitting to specific types of proofs or mathematical domains. This allows for continuous generalizability, explaining LeanAgent's superior lifelong learning metric scores.
Moreover, other works from the literature show that curriculum learning biases models towards building constructive internal representations (Cirik et al., 2016). Specifically, it allows the model to use the knowledge from earlier steps in later predictions. In theorem proving, this allows LeanAgent to learn basic proof skills, which then become building blocks for more complex proofs. This corroborates our analysis of LeanAgent's progression of proof complexity. Multiple past works also agree that curriculum learning provides larger gains when training data is limited (Zaremba & Sutskever, 2015; Cirik et al., 2016; Spitkovsky et al.), which is the case in formal theorem proving, where the number of formalized theorems and proofs is limited. These works further state that curriculum learning leads to better generalization, supporting our observations of LeanAgent's superior lifelong learning metrics.
This context supports LeanAgent's sorry theorem proving performance and superiority on lifelong learning metrics.
Comparison with Other Lifelong Learning Methods. Many other lifelong learning methods exist, such as those mentioned in Sec. 2, but we chose curriculum learning for the following reasons. Regularization methods, like EWC, slow down learning on important parameters; however, which parameters are important can change as theorem complexity increases. This helps explain the lower sorry theorem proving performance of EWC methods and their scores on lifelong learning metrics. Memory-based techniques store examples from previous tasks to prevent forgetting, but this can greatly affect these methods' balance of stability and plasticity, as seen in the lifelong learning metrics of the Merge All setups, including the negative IP values. Knowledge distillation requires a separate teacher model, whereas curriculum learning is more efficient, providing a path to knowledge accumulation in a single model. Since LeanAgent is a framework, not a model, we refrain from using dynamic architecture adjustment to keep LeanAgent general across LLM architectures. Finally, recent work selectively updates parameters with the largest momentum magnitudes and uses selective reinitialization to maintain plasticity; however, these methods often focus on balancing performance across distinct tasks. In theorem proving, where tasks form a spectrum of increasing complexity, curriculum learning provides a more structured approach to knowledge accumulation.
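To make the contrast with regularization methods concrete, the following is a minimal sketch of the standard EWC penalty, $\frac{\lambda}{2}\sum_i F_i(\theta_i - \theta_i^*)^2$; the function name, flat parameter lists, and the value of `lam` are our own illustrative choices, not part of LeanAgent or the EWC setups evaluated here:

```python
def ewc_penalty(params, old_params, fisher, lam=0.4):
    """EWC regularization term: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    `fisher` holds per-parameter importance estimates (diagonal Fisher
    information); a large F_i slows learning on parameter i by penalizing
    its drift from the value theta*_i learned on previous tasks.
    """
    return 0.5 * lam * sum(
        f * (p - q) ** 2 for p, q, f in zip(params, old_params, fisher)
    )

# Only the first parameter is deemed important (F = [1.0, 0.0]),
# so only its drift from the old value is penalized.
penalty = ewc_penalty([1.0, 2.0], [0.5, 2.0], [1.0, 0.0])  # 0.5 * 0.4 * 0.25 = 0.05
```

As the text notes, the weakness for theorem proving is that the importance estimates $F_i$ are computed on earlier tasks and can become stale as theorem complexity increases.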
A.7 Theorem Proving Performance Score (TPPS) Analysis
In this work, we directly compared the number of proven theorems with ReProver. However, the field of theorem proving, especially in a lifelong learning context, currently lacks standardized performance metrics. To address this concern, this section discusses a possible alternative metric while acknowledging its limitations.
Theorem proving difficulty is inherently non-linear in nature. For example, LeanAgent's significantly improved performance over the baseline across multiple repositories allows it to prove progressively harder theorems. Furthermore, sorry theorems lack ground truth proofs, so proving one is valuable. To address these nuances, one could propose the Theorem Proving Performance Score (TPPS) to emphasize newly proven sorry theorems. Specifically, $\text{LeanAgent TPPS}=(\text{\# ReProver Theorems Proved})+(\text{\# New Theorems Proved}\times X)+1$, where $X$ represents the importance of proving a new theorem, and $\text{ReProver TPPS}=(\text{\# ReProver Theorems Proved})+1$. Then, $\text{Improvement Factor}=\text{LeanAgent TPPS}/\text{ReProver TPPS}$.
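Under these definitions, TPPS and the improvement factor are simple to compute. The sketch below uses the static $X=10$ discussed next; the function names are our own illustration:

```python
def tpps(reprover_proved: int, new_proved: int = 0, x: float = 10.0) -> float:
    """Theorem Proving Performance Score.

    The +1 keeps the improvement factor well-defined when no theorems
    are proved; X weights newly proven sorry theorems.
    """
    return reprover_proved + new_proved * x + 1

def improvement_factor(reprover_proved: int, new_proved: int, x: float = 10.0) -> float:
    """LeanAgent TPPS divided by ReProver TPPS."""
    return tpps(reprover_proved, new_proved, x) / tpps(reprover_proved)

# Example: 20 theorems also proved by ReProver plus 5 new sorry theorems.
factor = improvement_factor(20, 5)  # (20 + 5*10 + 1) / (20 + 1) = 71/21
```

The example makes the metric's sensitivity to $X$ visible: with $X=10$, five new sorry theorems more than triple the score, which is exactly the inflation risk discussed below.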
The core idea behind TPPS is to assign a higher reward to newly proven theorems, aligning with lifelong learning objectives. However, it is important to recognize its preliminary nature and potential shortcomings. First, choosing the parameter $X$ is challenging. One approach is a static value, such as $X=10$, to standardize comparisons between LeanAgent and ReProver across diverse repositories; however, a fixed value may inflate or deflate the metric on some repositories. Alternatively, $X$ could be chosen adaptively based on the difficulty of a repository, but this may similarly produce unrealistic scores. Overall, the TPPS metric focuses on quantity and faces challenges in ensuring fair comparisons across diverse repositories.
Moreover, the TPPS metric can be susceptible to artifacts that artificially inflate performance. For instance, several theorems in the PFR repository were proven on a technicality due to placeholder statements of "$0=1$", which could then be used to prove other "sorry" theorems through the principle of ex falso quodlibet. Examples such as this are due to the weakness of the current state of repositories rather than a fundamental shortcoming of our ML approach, underscoring the need for more robust and well-tested benchmarks.
Furthermore, TPPS does not account for the difficulty of the theorems being proved. While we initially considered using proof length as a proxy for difficulty (e.g., $e^{S}$ where $S$ is the number of proof steps), we found this approach problematic during proving evaluation. LeanAgent sometimes generates shorter proofs for difficult theorems, making proof length an unreliable indicator of theorem difficulty.
These issues expose broader challenges in evaluating theorem-proving systems. The field lacks standardized benchmarks, necessitating carefully curated sets of theorems with varying difficulties across different mathematical domains. Developing reliable metrics to assess theorem difficulty beyond simple measures like proof length or statement complexity is crucial. Future benchmarks should prevent the exploitation of technicalities. Moreover, metrics should consider the mathematical significance of proven theorems, not just their quantity. Finally, ensuring that performance metrics are meaningful and comparable across different mathematical repositories is essential for consistent evaluation.
In light of these considerations, we acknowledge that TPPS, while a step towards quantifying theorem-proving performance in a lifelong learning setting, has many limitations. Future work should focus on developing more sophisticated and robust evaluation frameworks that address these challenges. For example, we plan to investigate more sophisticated measures that reward not only newly proved theorems but also difficult ones. By highlighting these issues, we hope to contribute to the ongoing discussion on how best to evaluate and compare theorem-proving systems, particularly in the context of lifelong learning. The development of standardized, reliable metrics will be crucial for measuring progress and guiding future research in this field.