# APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning
**Authors**: Azim Ospanov, Farzan Farnia, Roozbeh Yousefzadeh
> Huawei Hong Kong Research Center; Department of Computer Science & Engineering, The Chinese University of Hong Kong
Abstract
Formal reasoning and automated theorem proving constitute a challenging subfield of machine learning, in which machines are tasked with proving mathematical theorems using formal languages like Lean. A formal verification system can check whether a formal proof is correct almost instantaneously, but generating a completely correct formal proof with large language models (LLMs) remains a formidable task. The usual approach in the literature is to prompt the LLM many times (up to several thousand) until one of the generated proofs passes the verification system. In this work, we present APOLLO (Automated PrOof repair via LLM and Lean cOllaboration), a modular, model‑agnostic agentic framework that combines the strengths of the Lean compiler with an LLM’s reasoning abilities to achieve better proof‑generation results at low token and sampling budgets. Apollo directs a fully automated process in which the LLM generates proofs for theorems; a set of agents then analyze the proofs, fix syntax errors, identify mistakes using Lean, isolate failing sub‑lemmas, utilize automated solvers, and invoke an LLM on each remaining goal with a low top‑$K$ budget. The repaired sub‑proofs are recombined and reverified, iterating up to a user‑controlled maximum number of attempts. On the miniF2F benchmark, we establish a new state‑of‑the‑art accuracy of 84.9% among sub‑8B‑parameter models (as of August 2025) while keeping the sampling budget below one hundred. Moreover, Apollo raises the state‑of‑the‑art accuracy for Goedel‑Prover‑SFT to 65.6% while cutting sample complexity from 25,600 to a few hundred. General‑purpose models (o3‑mini, o4‑mini) jump from 3–7% to over 40% accuracy. Our results demonstrate that targeted, compiler‑guided repair of LLM outputs yields dramatic gains in both efficiency and correctness, suggesting a general paradigm for scalable automated theorem proving. The codebase is available at https://github.com/aziksh-ospanov/APOLLO
1 Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
Diagram of the common whole-proof generation pipeline. An LLM (blue box, left) sends a "proof attempt" to a Lean Server (center); a decision point (right) either exits the loop (green checkmark) when the proof verifies or continues (red X), looping back to the LLM. A label at the top reads "repeat up to K times", indicating that the generate-and-verify cycle iterates up to K times until a proof passes verification or the budget is exhausted.
</details>
<details>
<summary>x2.png Details</summary>

### Visual Description
Diagram of the proposed Apollo pipeline. An LLM (left) sends "proof attempt(s)" to the Apollo Proof Repair Agent (center, dashed box), which contains an Auto Solver and a Subproof Extractor. The agent exchanges the proof state, compilation errors, and syntax errors with a Lean Server (right). On success the loop exits (green checkmark); otherwise it continues (red X), sending "sub-problem(s) to prove" back to the LLM. The whole process repeats up to r times.
</details>
Figure 1: The summary of whole-proof generation pipeline vs. proposed Apollo agentic pipeline. LLM refers to a chosen formal theorem generator model.
Formal reasoning has emerged as one of the most challenging fields of AI, with recent systems such as AlphaProof and AlphaGeometry performing well at the International Mathematical Olympiad in competition with humans yang2024formal ; alphaproof ; AlphaGeometryTrinh2024 . Formal reasoning relies both on AI models and on a formal verification system that can automatically verify whether a mathematical proof is correct. In recent years, formal verification systems such as Lean lean4paper have facilitated a new way of doing mathematical research by enabling mathematicians to interact with the verification system, and with each other through it, allowing larger numbers of mathematicians to collaborate on a single project. These formal verification systems are therefore also called proof assistants, as one can use them interactively to write a formal proof and instantaneously observe the current state of the proof and any errors or shortcomings reported by the compiler. Immediate access to the Lean compiler’s output helps a mathematician fix possible errors in the proof; at the same time, when a proof passes the compiler with no errors, other collaborators do not need to verify its correctness. This type of collaboration is transforming the field of mathematics, enabling large groups of collaborators to engage in large mathematical research projects such as the Prime Number Theorem And More project kontorovich_tao_2024_primenumbertheoremand . Moreover, it has given rise to the digitization of mathematics buzzard2024mathematical .
On the AI front, large language models (LLMs) are improving at mathematical reasoning in natural language, and at the same time they have shown a remarkable ability to learn the languages of these formal verification systems and write mathematical proofs in formal language. This has led to the field of Automated Theorem Proving (ATP), where the standard approach is to prompt the LLM to generate a number of candidate proofs for a given theorem, which are then verified automatically by the formal verification system. Better models, better training sets, and better training methods have led to significant advances in this field tsoukalas2024putnambenchevaluatingneuraltheoremprovers ; zheng2022minif2fcrosssystembenchmarkformal ; azerbayev2023proofnetautoformalizingformallyproving ; li2024numinamath ; hu2025minictx .
Despite these advances, LLMs do not get the chance to interact with the verification system the way a human does. LLMs generate many possible proofs, sometimes as many as tens of thousands, and the verification system only marks each proof with a binary value of correct or incorrect. This common approach is illustrated in Figure 1. When the formal verification system analyzes a proof, its compiler may generate a long list of issues including syntax errors, incorrect tactics, open goals, etc. In the community’s current approach, all of that compiler feedback remains unused, as the LLM’s proof generation is not directly engaged with the compiler errors of the formal verification system.
In this work, we address this issue by introducing Apollo, a fully automated system which uses a modular, model-agnostic pipeline that combines the strengths of the Lean compiler with the reasoning abilities of LLMs. Apollo directs a fully automated process in which the LLM generates proofs for theorems, a set of agents analyze the proofs, fix the syntax errors, identify the mistakes in the proofs using Lean, isolate failing sub‑lemmas, utilize automated solvers, and invoke an LLM on each remaining goal with a low top‑ $K$ budget. The high-level overview of Apollo appears at the bottom of Figure 1.
Our contributions are as follows:
- We introduce a novel fully automated system that directs a collaboration between LLMs, Lean compiler, and automated solvers for automated theorem proving.
- We evaluate Apollo on the miniF2F-test set using the best available LLMs for theorem proving and demonstrate that Apollo improves the baseline accuracies by significant margins while keeping sampling costs much lower.
- We establish a new SOTA on the miniF2F-test benchmark for medium-sized language models (8B parameters or fewer).
2 Related Works
Formal theorem proving systems. At their core, formal theorem‑proving (FTP) systems employ interactive theorem provers (ITPs) to verify mathematical results. In particular, Lean4 lean4paper is both a functional programming language and an ITP. Each theorem, lemma, conjecture, or definition must be checked by Lean’s trusted kernel, so the compiler’s verdict is binary: either the statement type‑checks (True) or it fails (False). This rigorous verification dramatically increases the reliability of formal mathematics; however, it also increases the complexity of proof development: authors must both comprehend the mathematical concepts and precisely encode them in the formal language. The usefulness of Lean is also shown to go beyond theorem proving, as Jiang et al. jiang2024leanreasoner showed that Lean can be adapted to natural language logical reasoning.
Patching broken programs. Many prior works have explored using feedback to repair broken proofs. In software engineering, “repair” typically refers to program repair, i.e. fixing code that no longer runs programrepair . A common source of errors is a version mismatch that introduces bugs and prevents execution. For example, SED sed_repair uses a neural program‐generation pipeline that iteratively repairs initial generation attempts. Other notable approaches train specialized models to fix broken code based on compiler feedback (e.g. BIFI bifi , TFix tfix ) or leverage unsupervised learning over large bug corpora (e.g. BugLab buglab ).
LLM collaboration with external experts. The use of expert feedback, whether from human users or from a proof assistant’s compiler, has proven effective for repairing broken proofs. Ringer ringer2009typedirected showcased automatic proof repair within the Coq proof assistant bertot2013interactive , enabling the correction of proofs in response to changes in underlying definitions. Jiang et al. jiang2022thor showed that leveraging automatic solvers with generative models yields better performance on formal math benchmarks. More recently, First et al. first2023baldurwholeproofgenerationrepair showed that incorporating compiler feedback into LLM proof generation significantly improves the model’s ability to correct errors and produce valid proofs. Another line of work song2023towards explored the idea of mathematician–LLM collaboration, where human experts are aided by LLMs at the proof-writing stage.
Proof search methods. Numerous recent systems use machine learning to guide the search for formal proofs. One of the methods involves using large language models with structured search strategies, e.g. Best‑First Search (BFS) polu2022formalmathematicsstatementcurriculum ; wu2024internlm25stepproveradvancingautomatedtheorem ; xin2025bfsproverscalablebestfirsttree or Monte Carlo Tree Search (MCTS) lample2022hypertree ; wang2023dt-solver ; alphaproof . While tree‑search methods reliably steer a model toward valid proofs, they result in high inference costs and often explore many suboptimal paths before success. Another line of work leverages retrieval based systems to provide context for proof generators with potentially useful lemmas. One such example is ReProver yang_leandojo_2023 that augments Lean proof generation by retrieving relevant lemmas from a proof corpus and feeding them to an LLM, enabling the model to reuse existing proofs.
Whole proof generation. The use of standalone LLMs for theorem proving has emerged as a major research direction in automated theorem proving. One of the earliest works presented GPT-f polu2020generativelanguagemodelingautomated , a transformer-based model for theorem proving that established LLMs’ ability in formal reasoning. As training frameworks and methods have advanced, the community has produced numerous models that generate complete proofs without external search or tree‑search algorithms. Recent work shows that both supervised models xin2024deepseekproverv15harnessingproofassistant ; lin2025goedelproverfrontiermodelopensource and those trained via reinforcement learning xin2024deepseekproverv15harnessingproofassistant ; zhang2025leanabellproverposttrainingscalingformal ; dong2025stp ; kimina_prover_2025 achieve competitive performance on formal‑math benchmarks. However, whole‑proof generation remains vulnerable to hallucinations that can cause compilation failures even for proofs with correct reasoning chains.
Informal Chain-of-Thought in Formal Mathematics. Several recent works demonstrate that interleaving informal “chain‑of‑thought” (CoT) reasoning with formal proof steps substantially improves whole‑proof generation. For example, developers insert natural‑language mathematical commentary between tactics, yielding large accuracy gains lin2025goedelproverfrontiermodelopensource ; xin2024deepseekproverv15harnessingproofassistant ; azerbayev2024llemma . Wang et al. kimina_prover_2025 further show that special “thinking” tokens let the model plan its strategy and self‑verify intermediate results. The “Draft, Sketch, and Prove” (DSP) jiang2023draftsketchproveguiding and LEGO-Prover wang2024legoprover pipelines rely on an informal proof sketch before attempting to generate the formal proof. Collectively, these studies indicate that providing informal mathematical context, whether as comments, sketches, or dedicated tokens, guides LLMs toward correct proofs and significantly boosts proof synthesis accuracy.
<details>
<summary>x3.png Details</summary>

### Visual Description
Flowchart of the Apollo algorithm. A proof attempt flows through the Syntax Refiner, the Sorrifier, and the Auto Solver; the Auto Solver extracts the remaining subproofs and feeds them to the LLM, and the Proof Assembler splices the repaired pieces back together. A decision node asks "Does the proof compile in Lean4?": if yes, the flow terminates at the Correct Proof block (green); if no, a second node asks "number of attempts > r?", either looping back to the Syntax Refiner or terminating at the Invalid Proof block (red).

The invalid proof shown in the figure (red block):

```lean
import Mathlib
import Aesop
set_option maxHeartbeats 0
open BigOperators Real Nat Topology Rat Set

theorem mathd_numbertheory_495_incorrect (a b : ℕ)
    (h₀ : 0 < a ∧ 0 < b)
    (h₁ : a % 10 = 2) (h₂ : b % 10 = 4)
    (h₃ : Nat.gcd a b = 6) : 108 ≤ Nat.lcm a b := by
  have h4 : a ≥ 6 := by
    have h5 : a % 10 = 2 := h₁
    omega
  have h5 : b ≥ 6 := by
    have h6 : b % 10 = 4 := h₂
    omega
  by_contra h
  push_neg at h
  have h6 : a * b < 648 := by
    have h7 : Nat.lcm a b < 108 := by linarith
    have h8 : a * b = 6 * Nat.lcm a b := by
      rw [Nat.gcd_mul_lcm a b]
      rw [h₃]
    linarith
  have h7 : a ≤ 100 := by
    nlinarith [h₃, h₁, h₂, Nat.gcd_pos_of_pos_left a h₀.right, Nat.gcd_pos_of_pos_right b h₀.right]
  have h8 : b ≤ 100 := by
    nlinarith [h₃, h₁, h₂, Nat.gcd_pos_of_pos_left a h₀.right, Nat.gcd_pos_of_pos_right b h₀.right]
  interval_cases a <;> interval_cases b <;>
    norm_num at * <;> try { contradiction }
```

The repaired proof (green block):

```lean
theorem mathd_numbertheory_495_after_proof_repair (a b : ℕ)
    (h₀ : 0 < a ∧ 0 < b)
    (h₁ : a % 10 = 2) (h₂ : b % 10 = 4)
    (h₃ : Nat.gcd a b = 6) : 108 ≤ Nat.lcm a b := by
  have h4 : a ≥ 6 := by
    have h5 : a % 10 = 2 := h₁
    have h6 : 6 ≤ a := by
      have h : Nat.gcd a b = 6 := h₃
      have h7 : 6 ∣ a := by
        rw [h]
        apply Nat.gcd_dvd_left
      omega
    exact h6
  have h5 : b ≥ 6 := by
    have h6 : b % 10 = 4 := h₂
    by_contra h
    push_neg at h
    have h4 : b < 4 := by
      omega
    rw [h4] at h₃
    have h6 : a.gcd 4 ≤ 4 := by
      apply Nat.gcd_le_right
      norm_num
    linarith
  by_contra h
  push_neg at h
  have h6 : a * b < 648 := by
    have h7 : Nat.lcm a b < 108 := by linarith
    have h8 : a * b = 6 * Nat.lcm a b := by
      rw [Nat.gcd_mul_lcm a b]
      rw [h₃]
    linarith
  have h7 : a ≤ 100 := by
    by_contra h'
    push_neg at h'
    have h7 : a ≥ 101 := by linarith
    have h8 : b ≤ 6 := by nlinarith
    have h9 : b = 6 := by linarith
    norm_num at h₂
    rw [h9] at h₂
  have h8 : b ≤ 100 := by
    by_contra h₈
    push_neg at h₈
    have h9 : b ≤ 107 := by
      have h10 : a * b < 648 := h6
      have h11 : 2 = 6 := h4
      nlinarith
    have h12 : b ≤ 107 := h9
    have h13 : b ≤ 100 := h8
    have h14 : b % 10 = 4 := h₂
    have h15 : b ≥ 101 := by omega
    interval_cases b <;> omega
  interval_cases a <;> interval_cases b <;> norm_num at *
    <;> try { contradiction }
```

The algorithm iterates until a valid proof is found or the maximum number of attempts r is exceeded. The repaired proof replaces the failing justifications of sub-lemmas such as h4 and h5 with ones that compile.
</details>
Figure 2: Overview of the Apollo framework.
3 Our Approach
In this section, we describe Apollo, our framework for transforming an LLM’s raw proof sketch into a fully verified Lean4 proof. Figure 2 illustrates the Apollo pipeline, with the attached Lean code shown before and after repair. We observe that the LLM is often capable of producing a valid proof sketch, yet it frequently struggles to close fine-grained goals. For example, statements h4, h5, h7, and h8 are correct, but the generated proofs for each sub-lemma throw compilation errors. Apollo identifies and fixes such proof steps and guides the LLM towards a correct solution.
Our compiler of choice is the Lean REPL leanrepl , a Read–Eval–Print Loop for Lean4. It provides a complete list of compilation errors and warnings, each with a detailed description and the location (for example, the line number) where it occurred. Errors can include syntax mistakes, nonexistent tactics, unused tactics, and more. We aggregate this information to drive our agent’s dynamic code repair and sub‑problem extraction. The pseudo-code for our framework can be found in Algorithm 1 in the Appendix.
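As an illustration of how such feedback can be aggregated, consider the following sketch (not Apollo's actual code). The message schema follows the open-source Lean REPL, which replies with JSON; the exact field names here should be treated as assumptions.

```python
import json

def collect_errors(repl_output: str) -> list[dict]:
    """Extract error messages (with locations) from a Lean REPL JSON reply.

    Assumes the community Lean REPL's schema: each message carries
    'severity', 'pos' {'line', 'column'}, and a human-readable 'data' field.
    """
    reply = json.loads(repl_output)
    return [
        {"line": m["pos"]["line"], "column": m["pos"]["column"], "message": m["data"]}
        for m in reply.get("messages", [])
        if m["severity"] == "error"
    ]

# Illustrative REPL reply for a proof with one failing tactic:
raw = '''{"messages": [
  {"severity": "error",
   "pos": {"line": 7, "column": 4},
   "endPos": {"line": 7, "column": 9},
   "data": "linarith failed to find a linear combination"}]}'''
print(collect_errors(raw))
```

Each extracted record pins an error to a source location, which is what allows the later stages to target individual lines or blocks rather than the whole proof.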
3.1 Syntax Refiner
The Syntax Refiner catches and corrects superficial compilation errors: missing commas, wrong keywords (e.g. Lean3’s from vs. Lean4’s := by), misplaced brackets, etc. These syntax errors arise when a general‑purpose LLM (such as o3‑mini openai-o3-mini-2025 or o4‑mini openai-o4-mini-2025 ) has not been specialized on Lean4 syntax. By normalizing these common mistakes, this module ensures that subsequent stages operate on syntactically valid code. It is important to note that this module is primarily applicable to general-purpose models not explicitly trained for theorem proving in Lean. In contrast, the Syntax Refiner is usually not invoked for proofs generated by LLMs trained for formal theorem proving.
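As a minimal sketch of this idea (the rule table below is a small hand-written sample, not Apollo's actual rule set), a syntax refiner can normalize common Lean3-isms with shallow pattern rewrites:

```python
import re

# Hypothetical rewrite rules mapping common Lean3 leftovers to Lean4 forms.
RULES = [
    (re.compile(r"\bbegin\b"), "by"),   # Lean3 tactic-block opener
    (re.compile(r"\bend\b\s*$"), ""),   # dangling Lean3 'end'
    (re.compile(r"\bnat\b"), "Nat"),    # Lean3 lowercase namespace
]

def refine_syntax(proof: str) -> str:
    """Apply purely textual fixes line by line; no semantic analysis."""
    fixed_lines = []
    for line in proof.splitlines():
        for pattern, replacement in RULES:
            line = pattern.sub(replacement, line)
        fixed_lines.append(line)
    return "\n".join(fixed_lines)

print(refine_syntax("theorem t : 1 + 1 = 2 := begin\n  refl\nend"))
```

Because these rewrites are cheap and purely syntactic, running them before compilation avoids wasting Lean REPL round trips on errors that never required any reasoning to fix.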
3.2 Sorrifier
The Sorrifier patches any remaining compilation failures by inserting Lean’s sorry placeholder. We first parse the failed proof into a tree of nested proof‑blocks (sub‑lemmas as children). Then we compile the proof with Lean REPL leanrepl , detect the offending line or block, and apply one of three repairs:
1. Line removal, if a single tactic or declaration is invalid but its surrounding block may still succeed.
1. Block removal, if the entire sub‑proof is malformed.
1. Insert sorry, if the block compiles but leaves unsolved goals open.
We repeat this procedure until the file compiles without errors. At that point, every remaining sorry marks a sub‑lemma to be proved in later stages of Apollo. This part of the pipeline guarantees that the proof compiles successfully in Lean, with warnings only about the presence of sorry statements.
Each sorry block corresponds to a sub‑problem that the LLM failed to prove. Such blocks may not type‑check for various reasons, most commonly LLM hallucination. The feedback obtained via the REPL lets us automatically catch these hallucinations, insert a sorry placeholder, and remove invalid proof lines.
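A stripped-down sorrification step might look as follows. This sketch assumes the compiler reports one failing line at a time and only implements the sorry-insertion repair; the real module works on a tree of nested proof blocks and also chooses between line and block removal.

```python
def sorrify(lines: list[str], error_line: int) -> list[str]:
    """Replace the failing tactic line with 'sorry' at the same indentation.

    Simplification: Apollo's Sorrifier additionally considers removing the
    single line or the entire malformed block, guided by the block tree.
    """
    repaired = list(lines)
    bad = repaired[error_line]
    indent = bad[: len(bad) - len(bad.lstrip())]
    repaired[error_line] = indent + "sorry"
    return repaired

proof = [
    "theorem t (a : Nat) : a + 0 = a := by",
    "  have h : a + 0 = a := by",
    "    linarith  -- reported as failing by the compiler",
    "  exact h",
]
print("\n".join(sorrify(proof, error_line=2)))
```

Preserving the indentation matters because Lean4 tactic blocks are whitespace-sensitive; the sorry must sit exactly where the failed tactic did.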
3.3 Auto Solver
The Auto Solver targets each sorry block in turn. It first invokes Lean4’s hint tactic to close the goal. If goals persist, it applies built‑in solvers (nlinarith, ring, simp, etc.) wrapped in try to test combinations automatically. Blocks whose goals remain open stay marked with sorry. After applying the Auto Solver, a proof may be complete with no remaining sorry statements, in which case Apollo has already succeeded in fixing the proof. Otherwise, the process can repeat recursively.
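The Auto Solver loop can be sketched as below; `verify` stands in for a Lean REPL round trip (stubbed out here for illustration), and the tactic list is a small sample of Lean's built-in closers rather than Apollo's full set.

```python
# A small sample of candidate closers; Apollo's real list is larger and
# also tries combinations wrapped in 'try'.
CANDIDATE_TACTICS = ["hint", "nlinarith", "ring", "simp", "omega", "norm_num"]

def auto_solve(goal_block: str, verify) -> str:
    """Try each candidate tactic in place of the sorry placeholder.

    `verify(proof_text) -> bool` abstracts compiling the candidate with
    the Lean REPL. Returns the repaired block, or the original block
    (still containing 'sorry') if no tactic closes the goal.
    """
    for tactic in CANDIDATE_TACTICS:
        candidate = goal_block.replace("sorry", tactic)
        if verify(candidate):
            return candidate
    return goal_block

# Stub verifier for illustration: pretend only 'omega' closes this goal.
fake_verify = lambda proof: "omega" in proof
print(auto_solve("have h4 : a >= 6 := by sorry", fake_verify))
```

Blocks that survive this loop unchanged are exactly the sub-problems that get escalated to the LLM in the next stage.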
3.4 Recursive reasoning and repair
In the case where a proof still has some remaining sorry statements after applying the Auto Solver, Apollo treats each one as a new lemma: it extracts the local context (hypotheses, definitions, and any lemmas proved so far) and recursively tries to prove the lemma by prompting the LLM and repeating the whole process of verification, syntax refining, sorrifying, and applying the Auto Solver.
At each recursive iteration, a lemma may be completely proved, or it may end up with an incomplete proof containing one or more sorry blocks. This allows Apollo to make progress on the original theorem by breaking down incomplete steps further and further until the LLM or the Auto Solver can close the goals. This process can continue up to a user‑specified recursion depth $r$.
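The depth-bounded recursion can be sketched as follows. Here `prove` and `solve` are illustrative stubs, not Apollo's API: `prove(goal)` abstracts an LLM attempt and returns the sub-goals it left open, while `solve(goal)` abstracts the Auto Solver closing a goal directly.

```python
def repair(goal: str, prove, solve, depth: int = 1, r: int = 3) -> bool:
    """Recursively attempt a goal, giving up beyond recursion depth r.

    `prove(goal) -> list[str]` returns the sub-goals still open after an
    LLM proof attempt (empty list = goal fully proved). `solve(goal)
    -> bool` reports whether the Auto Solver closes the goal outright.
    """
    if depth > r:
        return False
    for sub_goal in prove(goal):
        # Try the cheap automated solvers first, then recurse with the LLM.
        if not (solve(sub_goal) or repair(sub_goal, prove, solve, depth + 1, r)):
            return False
    return True

# Toy goal tree: 'main' splits into two lemmas, one of which splits again.
tree = {"main": ["lemma1", "lemma2"], "lemma2": ["lemma2a"]}
prove = lambda g: tree.get(g, [])
solve = lambda g: g in {"lemma1", "lemma2a"}  # pretend these close automatically
print(repair("main", prove, solve))
```

The depth bound is what keeps the overall sampling budget small: rather than resampling the whole theorem thousands of times, the budget is spent only on the sub-lemmas that remain open.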
3.5 Proof Assembler
Finally, the Proof Assembler splices all repaired blocks back into a single file and verifies that no sorry or admit (an alias for sorry) commands remain. If the proof still fails, the entire pipeline repeats (up to the user‑specified recursion depth $r$), allowing further rounds of refinement and sub‑proof generation.
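The completeness check performed at this stage is simple to state; as a minimal sketch (not Apollo's actual implementation):

```python
import re

def is_complete(proof: str) -> bool:
    """An assembled proof counts as complete only if no placeholder remains.

    'admit' is an alias of 'sorry' in Lean4, so both are rejected; word
    boundaries avoid false positives on identifiers containing the words.
    """
    return re.search(r"\b(sorry|admit)\b", proof) is None

print(is_complete("theorem t : True := by trivial"))
print(is_complete("theorem t : True := by sorry"))
```

Only proofs that pass this check and compile cleanly are reported as successes; everything else re-enters the pipeline.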
Apollo’s staged approach of syntax cleaning, sorrifying, automated solving, and targeted LLM‑driven repair yields improvements in both proof‑sample efficiency and final proof correctness.
4 Experiments
In this section, we present empirical results for Apollo on the miniF2F dataset zheng2022minif2fcrosssystembenchmarkformal , which comprises 488 formalized problems drawn from AIME, IMO, and AMC competitions. The benchmark is evenly split into validation and test sets (244 problems each); here we report results on the miniF2F‑test partition. To demonstrate Apollo’s effectiveness, we evaluate it on a range of state‑of‑the‑art whole‑proof generation models. All experiments use Lean v4.17.0 and run on eight NVIDIA A5000 GPUs with 128 CPU cores. We use @32 sampling during sub-proof generation unless stated otherwise. Baseline numbers are sourced from each model’s original publication.
4.1 The effect of applying Apollo on top of the base models
Figure 3 shows the impact of Apollo on two state‑of‑the‑art whole‑proof generation models: Goedel Prover‑SFT lin2025goedelproverfrontiermodelopensource and Kimina‑Prover‑Preview‑Distill‑7B kimina_prover_2025 . Applying Apollo increases Goedel‑Prover‑SFT’s top accuracy from 64.7% to 65.6% while reducing its sampling budget by two orders of magnitude (from 25,600 generated samples to only a few hundred on average). Similarly, Kimina‑Prover‑Preview‑Distill‑7B achieves a new best accuracy of 75.0% with roughly one‑third of the previous sample budget. We report the exact numbers in Table 1.
<details>
<summary>x4.png Details</summary>

### Visual Description
Two line charts of accuracy (%) versus pass budget $K$ on miniF2F‑test. Left (log‑scale x‑axis): Goedel‑Prover‑SFT rises from 57.7% at $K=32$ to 64.7% at $K=25.6$K, while Goedel‑Prover‑SFT + Apollo reaches 65.1% by $K=306$. Right (linear x‑axis): Kimina‑Prover‑Preview‑Distill‑7B rises from 63.2% to 71.0% at $K=1000$, while Kimina + Apollo reaches 74.1% by $K=200$ and plateaus near 74.8% by $K=300$. Apollo improves accuracy across all budgets, with the largest gains at low budgets.
</details>
(a) Model accuracy against sampling budget
<details>
<summary>x5.png Details</summary>

### Visual Description
Two line charts of accuracy (%) versus generated‑token budget $N$ (log scale). Left: Goedel‑Prover‑SFT climbs from 57.7% at 16.1K tokens to 64.7% at 12.7M, while Goedel‑Prover‑SFT + Apollo reaches 65.1% at only 406K tokens. Right: Kimina‑Prover‑Preview‑Distill‑7B climbs from 63.2% at 140K tokens to 70.8% at 4.5M, while Kimina + Apollo reaches 74.1% at 1.3M and plateaus near 74.6%. Apollo consistently outperforms the base models at equal token budgets, most markedly at low budgets.
</details>
(b) Model accuracy against generated token budget
Figure 3: Performance of the base models versus the Apollo‑aided models on the miniF2F‑test dataset at different sample and token budgets.
Note that Apollo’s sampling budget is not fixed: it depends on the recursion depth $r$ and the LLM’s ability to generate sub‑lemmas. For each sub‑lemma that Auto Solver fails to close, we invoke the LLM with a fixed top‑32 sampling budget. Because each generated sub‑lemma may spawn further sub‑lemmas, the total sampling overhead grows recursively, making any single pass@K budget difficult to report. Sub‑lemmas also tend to require far fewer tactics (and thus far fewer tokens) than the main theorem: a theorem might need 100 lines of proof while a sub‑lemma might need only one. Hence, sampling cost does not scale one‑to‑one. To compare with other whole‑proof generation approaches, we therefore report the average number of proof samples and token budgets. We present a more in‑depth analysis of inference budgets in Section B of the Appendix.
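The recursive growth of the sampling budget can be illustrated with a toy cost model (the branching factor and Auto Solver close rate below are hypothetical illustrations, not measured values):

```python
def expected_samples(r: int, k: int = 32, sublemmas: float = 2.0,
                     auto_close: float = 0.5) -> float:
    """Toy model of Apollo's sampling cost at recursion depth r: each goal
    costs k LLM samples, spawns `sublemmas` sub-goals on average, and
    Auto Solver closes a fraction `auto_close` of them for free."""
    if r == 0:
        return k
    open_goals = sublemmas * (1 - auto_close)  # sub-goals still needing the LLM
    return k + open_goals * expected_samples(r - 1, k, sublemmas, auto_close)
```

With one open sub‑goal per level, cost grows only linearly in $r$, consistent with the few‑hundred‑sample averages reported in Table 1; a higher branching factor would make it grow geometrically.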
Table 1: Results comparing sampling cost, accuracy, and token usage before and after applying Apollo. “Cost” denotes the sampling budget $K$ for whole‑proof generation and the recursion depth $r$ for Apollo. Column $N$ reports the average number of tokens generated per problem in "Chain‑of‑Thought" mode. Bold cells highlight the best accuracy. Results are shown on the miniF2F‑test dataset for two models: Kimina‑Prover‑Preview‑Distill‑7B and Goedel‑SFT.
| Goedel‑Prover‑SFT | | | | | | Kimina‑Prover‑Preview‑Distill‑7B | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base Model | | | + Apollo | | | Base Model | | | + Apollo | | |
| $K$ | $N$ | Acc. (%) | $r$ | $N$ | Acc. (%) | $K$ | $N$ | Acc. (%) | $r$ | $N$ | Acc. (%) |
| 32 | 16.1K | 57.6% | 0 | 16.1K | 57.6% | 32 | 140K | 63.1% | 0 | 140K | 63.1% |
| 64 | 31.8K | 59.2% | 2 | 38.3K | 60.7% | | | | 2 | 406K | 68.9% |
| 3200 | 1.6M | 62.7% | 4 | 74.6K | 63.5% | | | | 4 | 816K | 74.2% |
| 25600 | 12.7M | 64.7% | 6 | 179.0K | **65.6%** | 1024 | 4.5M | 70.8% | 6 | 1.3M | **75.0%** |
4.2 Comparison with SoTA LLMs
Table 2 compares whole‑proof generation and tree‑search methods on miniF2F‑test. For each approach, we report its sampling budget, defined as the top‑$K$ parameter for LLM generators or the equivalent search‑depth parameters for tree‑search provers. Since Apollo’s budget varies with recursion depth $r$, we instead report the mean number of proofs sampled during the procedure. We also report the average number of generated tokens as another measure of the computational cost of proof generation. For models whose generated‑token counts could not be collected, we leave the token column blank and report the sampling budget only.
When Apollo is applied on top of any generator, it establishes a new best accuracy for that model. For instance, Goedel‑Prover‑SFT’s accuracy jumps to 65.6%, surpassing the previous best result (which required a 25,600‑sample budget). In contrast, Apollo requires only 362 samples on average. Likewise, Kimina‑Prover‑Preview‑Distill‑7B sees a 4.2% accuracy gain while reducing its sample budget below the original 1,024. To further validate Apollo’s generalizability, we tested the repair framework on Goedel‑V2‑8B lin2025goedelproverv2 , the current state‑of‑the‑art theorem prover. At similar sample and token budgets, Apollo achieves a 0.9% accuracy gain, whereas the base LLM requires twice the budget to reach the same accuracy. Overall, Apollo not only boosts accuracy but does so with a smaller sampling budget, setting a new state‑of‑the‑art result among sub‑8B‑parameter LLMs with a sampling budget of 32.
We also evaluate general‑purpose LLMs (OpenAI o3‑mini and o4‑mini). Without Apollo, these models rarely produce valid Lean4 proofs, since they default to Lean3 syntax or introduce trivial compilation errors. Yet they have a remarkable grasp of mathematical concepts, evident in the proof structures they produce. By applying Apollo’s syntax corrections and solver‑guided refinements, their accuracies increase from single digits (3–7%) to over 40%. These results show that, in many scenarios, discarding and regenerating the whole proof is wasteful: with modest help from the Lean compiler, we can achieve higher accuracy while sampling fewer proofs from LLMs.
Table 2: Comparison of Apollo accuracy against state-of-the-art models on the miniF2F-test dataset. Token budget is computed over all generated tokens by LLM.
| Method | Model size | Sample budget | Token budget | miniF2F-test |
| --- | --- | --- | --- | --- |
| Tree Search Methods | | | | |
| Hypertree Proof Search (lample2022hypertree) | 0.6B | $64 \times 5000$ | - | $41.0\%$ |
| IntLM2.5-SP+BFS+CG (wu2024internlm25stepproveradvancingautomatedtheorem) | 7B | $256 \times 32 \times 600$ | - | $65.9\%$ |
| HPv16+BFS+DC (li2025hunyuanproverscalabledatasynthesis) | 7B | $600 \times 8 \times 400$ | - | $68.4\%$ |
| BFS-Prover (xin2025bfsproverscalablebestfirsttree) | 7B | $2048 \times 2 \times 600$ | - | $70.8\%$ |
| Whole-proof Generation Methods | | | | |
| DeepSeek-R1-Distill-Qwen-7B (deepseekai2025deepseekr1incentivizingreasoningcapability) | 7B | 32 | - | 42.6% |
| Leanabell-GD-RL (zhang2025leanabellproverposttrainingscalingformal) | 7B | 128 | - | $61.1\%$ |
| STP (dong2025stp) | 7B | 25600 | - | $67.6\%$ |
| o3-mini (openai-o3-mini-2025) | - | 1 | - | $3.3\%$ |
| | | 32 | - | $24.6\%$ |
| o4-mini (openai-o4-mini-2025) | - | 1 | - | $7.0\%$ |
| Goedel-SFT (lin2025goedelproverfrontiermodelopensource) | 7B | 32 | 16.1K | $57.6\%$ |
| | | 3200 | 1.6M | $62.7\%$ |
| | | 25600 | 12.7M | $64.7\%$ |
| Kimina-Prover-Preview-Distill (kimina_prover_2025) | 7B | 1 | 4.4K | $52.5\%$ |
| | | 32 | 140K | $63.1\%$ |
| | | 1024 | 4.5M | $70.8\%$ |
| Goedel-V2 (lin2025goedelproverv2) | 8B | 32 | 174K | $83.3\%$ |
| | | 64 | 349K | $84.0\%$ |
| | | 128 | 699K | $84.9\%$ |
| Whole-proof Generation Methods + Apollo | | | | |
| o3-mini + Apollo | - | 8 | - | 40.2% (+36.9%) |
| o4-mini + Apollo | - | 15 | - | 46.7% (+39.7%) |
| Goedel-SFT + Apollo | 7B | 362 | 179K | 65.6% (+0.9%) |
| Kimina-Preview + Apollo | 7B | 307 | 1.3M | 75.0% (+4.2%) |
| Goedel-V2 + Apollo | 8B | 63 | 344K | 84.9% (+0.9%) |
To further assess Apollo, we conducted experiments on PutnamBench tsoukalas2024putnambenchevaluatingneuraltheoremprovers , using Kimina‑Prover‑Preview‑Distill‑7B as the base model. Under the whole‑proof generation pipeline, the LLM produced 10 valid proofs. With Apollo, it produced one additional valid proof while consuming nearly half as many tokens. Results are presented in Table 3.
Overall, our results on a variety of LLMs and benchmarks demonstrate that Apollo consistently consumes fewer tokens while achieving higher accuracy, highlighting its effectiveness.
Table 3: Comparison of Apollo accuracy on the PutnamBench dataset. Token budget is computed over all generated tokens by LLM.
| Method | Model size | Sample budget | Token budget | PutnamBench |
| --- | --- | --- | --- | --- |
| Kimina-Preview | 7B | 32 | 180K | 7/658 |
| Kimina-Preview | 7B | 192 | 1.1M | 10/658 |
| Kimina-Preview+Apollo | 7B | 108 | 579K | 11/658 |
4.3 Distribution of Proof Lengths
We assess how Apollo affects proof complexity by examining proof‑length distributions before and after repair. Here, proof length is defined as the total number of tactics in a proof, which serves as a proxy for proof complexity.
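As a concrete illustration, the tactic-count proxy can be approximated by counting non-empty, non-comment lines in a proof's tactic block (a simple heuristic sketch; the actual counting procedure may differ):

```python
def proof_length(proof: str) -> int:
    """Approximate the number of tactics in a Lean proof as the count of
    non-empty lines that are not line comments ('--')."""
    tactics = [line.strip() for line in proof.splitlines()]
    return sum(1 for t in tactics if t and not t.startswith("--"))
```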
Figure 4 presents violin plots for three base models: Kimina‑Prover‑Preview‑Distill‑7B, Goedel‑Prover‑SFT, and o4‑mini. Each subplot shows two non‑overlapping sets: proofs generated directly by the base model (“before”) and proofs produced after applying Apollo (“after”). We consider only valid proofs verified by the Lean REPL in this setup.
The proofs that Apollo fixes in collaboration with the LLM are considerably longer. For every model we evaluated, the mean length of proofs repaired by Apollo is at least twice that of proofs written by the LLM alone. This increase indicates that the theorems the base model alone could not prove require longer, more structured reasoning chains. By decomposing challenging goals into smaller sub‑lemmas, Apollo enables the generation of these extended proofs, thereby improving the overall success rate.
These results demonstrate that Apollo not only raises accuracy but also systematically guides models to construct deeper, more rigorous proof structures capable of solving harder theorems.
<details>
<summary>x6.png Details</summary>

### Visual Description
Side‑by‑side violin plots of proof length for Kimina‑7B (mean 11.1, bulk of the mass below 20, maximum ≈110) versus Kimina‑7B + Apollo (mean 45.4, bulk below 70, maximum ≈210). Apollo proofs are roughly four times longer on average, with a wider, less skewed distribution.
</details>
(a) Kimina-Preview-Distill-7B
<details>
<summary>x7.png Details</summary>

### Visual Description
Side‑by‑side violin plots of proof length for Goedel‑Prover‑SFT (mean 6.5, maximum ≈44) versus Goedel‑Prover‑SFT + Apollo (mean 13.0, maximum ≈58). Both distributions are right‑skewed; Apollo doubles the mean proof length and widens the distribution.
</details>
(b) Goedel-SFT
<details>
<summary>x8.png Details</summary>

### Visual Description
Side‑by‑side violin plots of proof length for o4‑mini (mean 3.8, median ≈4) versus o4‑mini + Apollo (mean 13.0, median ≈12). Apollo more than triples the mean proof length and lengthens the upper tail of the distribution.
</details>
(c) OpenAI o4-mini
Figure 4: Distribution of proof lengths for selected models before (left) and after (right) applying Apollo. In the “before” plots, proof lengths cover only proofs the base model proved without the help of Apollo; in the “after” plots, proof lengths cover only proofs proved with Apollo’s assistance.
<details>
<summary>x9.png Details</summary>

### Visual Description
Three line graphs of accuracy (%) versus recursion depth $r$. o4‑mini jumps from ≈7% at $r=0$ to ≈38% at $r=1$, then climbs slowly to ≈46% at $r=4$. Goedel‑Prover‑SFT rises from ≈57% at $r=0$ to ≈65% by $r=4$ and plateaus. Kimina‑Prover‑Preview‑Distill‑7B rises from ≈67% at $r=0$ to ≈75% by $r=5$ and plateaus. All three models improve with depth, with diminishing returns at higher $r$.
</details>
Figure 5: Performance of Apollo on miniF2F dataset at various recursion depth parameters $r$ on three different models: o4-mini, Kimina‑Prover‑Preview‑Distill‑7B and Goedel‑Prover‑SFT.
4.4 Impact of recursion depth $r$ on proof accuracy
We evaluate how the recursion‑depth parameter $r$ affects accuracy on miniF2F‑test. A recursion depth of $r=0$ corresponds to the base model’s standalone performance; higher $r$ values permit additional rounds of Apollo. Figure 5 plots accuracy versus $r$ for three models: o4‑mini ($r=0,\dots,4$), Kimina‑Prover‑Preview‑Distill‑7B ($r=0,\dots,6$), and Goedel‑Prover‑SFT ($r=0,\dots,6$).
The o4‑mini model exhibits its largest accuracy gain between $r=0$ and $r=1$, since its raw proof structure is often correct but contains many syntax and proof‑step errors; Syntax Refiner fixes most of them and, together with Auto Solver, produces a spike in accuracy. Beyond $r=1$, accuracy largely plateaus, since o4‑mini was not explicitly designed for formal theorem proving. Even so, its accuracy still creeps upward, suggesting that pushing $r$ further could yield additional gains. In contrast, both Kimina‑Prover‑Preview‑Distill‑7B and Goedel‑Prover‑SFT continue to improve steadily across recursion depths. These results suggest that decomposing challenging problems into smaller parts allows the model to close simpler goals, gradually building toward a complete proof.
Although the total number of proof samples grows with $r$ , even $r=5$ requires only 300–400 LLM requests on average, an order of magnitude below typical budgets in the literature. Moreover, one could further increase $r$ to explore budgets in the thousands if desired.
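The depth-bounded repair procedure described above can be sketched as follows (the callback interfaces `llm_prove`, `lean_check`, and `auto_solve` are hypothetical stand-ins for Apollo's actual agents, not its real API):

```python
from typing import Callable, List, Optional

def apollo_repair(goal: str,
                  llm_prove: Callable[[str], str],
                  lean_check: Callable[[str], List[str]],
                  auto_solve: Callable[[str], Optional[str]],
                  depth: int) -> Optional[str]:
    """Sketch of Apollo's recursion: prove, verify, and recursively repair
    failing sub-goals until the proof checks or the depth budget runs out.
    `lean_check` returns the failing sub-goals ([] means verified)."""
    proof = llm_prove(goal)
    failing = lean_check(proof)
    if not failing:
        return proof
    if depth == 0:
        return None
    for sub in failing:
        # try the automated solvers first, then fall back to the LLM
        patch = auto_solve(sub) or apollo_repair(
            sub, llm_prove, lean_check, auto_solve, depth - 1)
        if patch is None:
            return None
        proof = proof.replace("sorry", patch, 1)  # splice sub-proof back in
    return proof if not lean_check(proof) else None
```

Each level of recursion adds at most one round of LLM calls per open sub-goal, which is why the total sample count stays in the hundreds even at $r=5$.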
4.5 Ablation Studies on different components of Apollo
To evaluate the contribution of each Apollo component, we conducted an ablation study on two base models: o4-mini (a general-purpose model) and Kimina-Prover-Preview-Distill-7B (a theorem-prover model). Results are reported in Table 4. We decompose Apollo into three key modules: Syntax Refiner, Auto Solver, and LLM Re-invoker, where the LLM Re-invoker is the module that re-invokes the LLM to prove the sorrified sub-goals. The Sorrifier and Proof Assembler modules are excluded from this analysis, as they are essential for decomposing problems into sub-goals and assembling proofs and thus cannot be ablated.
From the results, we observe that Syntax Refiner alone provides only a marginal benefit, limited to the general-purpose model o4-mini; it yields no improvement on the dedicated prover. Auto Solver in isolation also produces little accuracy gain, indicating that the built-in solvers usually cannot close the sub-goals that remain after the first round of LLM proof generation. Further reasoning and proof decomposition on those sub-goals by the LLM, however, can increase accuracy significantly. The largest gains arise when Auto Solver and LLM Re-invoker are combined, which corresponds to the full Apollo framework. This demonstrates that Apollo’s repair mechanism is crucial for achieving robust performance across both model types. We extend the ablation to individual module triggers during inference in Section E of the Appendix.
Table 4: Ablation study of different parts of Apollo on o4-mini (general purpose model) and Kimina-Prover-Preview-Distill-7B (theorem prover model).
| Apollo Modules | Model Accuracy | | | |
| --- | --- | --- | --- | --- |
| Syntax Refiner | AutoSolver | LLM Re-Invoker | o4-mini | Kimina-7B |
| ✗ | ✗ | ✗ | 7.0% | 63.1% |
| ✓ | ✗ | ✗ | 7.4% | 63.1% |
| ✗ | ✓ | ✗ | 7.0% | 63.5% |
| ✗ | ✗ | ✓ | 8.2% | 69.3% |
| ✓ | ✓ | ✗ | 20.5% | 63.5% |
| ✓ | ✗ | ✓ | 18.9% | 69.3% |
| ✗ | ✓ | ✓ | 10.2% | 75.0% |
| ✓ | ✓ | ✓ | 46.7% | 75.0% |
5 Limitations
Integration with tree‑search methods. Our evaluation focuses exclusively on applying Apollo to whole‑proof generation by LLMs and does not explore its interaction with traditional tree‑search provers (e.g. BFS, MCTS). Investigating how lemma decomposition and targeted repair could accelerate or prune search paths in those systems is a promising direction for future work.
Dependence on base model’s proof sketch quality. Apollo’s ability to produce a correct formal proof largely depends on whether the base model’s initial output contains a coherent proof sketch. As yousefzadeh2025a observe, models often tend to “cut corners,” producing proofs that are much shorter than the rigorous, multi‑step arguments a valid proof requires. When a base model fails to choose a correct proof strategy (e.g. omitting critical lemmas, suggesting irrelevant tactics, or taking a wrong approach altogether), Apollo is less likely to repair the proof. Enhancing Apollo’s robustness to very poor initial sketches remains an open challenge.
6 Conclusion
In this work, we present Apollo, a novel, modular, fully automated agentic system that combines syntax cleaning, automatic solvers, and LLM‑driven sub‑proof generation to transform an LLM’s initial proof sketch into a fully verified Lean4 proof. This framework harnesses the full power of the Lean compiler and integrated solvers, merging them with LLM systems. Applied across five whole‑proof generation models, ranging from general‑purpose LLMs (o3‑mini, o4‑mini) to specialized provers (Kimina‑Prover‑Preview‑Distill‑7B, Goedel‑Prover‑SFT, Goedel‑V2), Apollo consistently establishes new best accuracies on the miniF2F benchmark while reducing token and sampling budgets by one to two orders of magnitude. Our empirical analysis shows that Apollo not only raises overall proof success rates but also guides models to generate longer, more structured proofs, as evidenced by a drastic increase in average successful proof length. We further demonstrate how the recursion‑depth parameter $r$ trades off sample complexity against accuracy, achieving robust gains with only a few hundred samples. We believe that Apollo’s collaboration between LLMs, automatic solvers, and the Lean compiler defines a paradigm of agentic systems that produce high‑quality, verifiable proofs for increasingly challenging theorems.
Acknowledgments
The work of Farzan Farnia is partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China, Project 14209920, and is partially supported by CUHK Direct Research Grants with CUHK Project No. 4055164 and 4937054. Also, the authors would like to thank the anonymous reviewers for their constructive feedback.
References
- (1) Kaiyu Yang, Gabriel Poesia, Jingxuan He, Wenda Li, Kristin Lauter, Swarat Chaudhuri, and Dawn Song. Formal mathematical reasoning: A new frontier in AI. In Proceedings of the International Conference on Machine Learning, 2025.
- (2) Google DeepMind. AI achieves silver-medal standard solving international mathematical olympiad problems. https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/, 2024. Accessed: 2025-05-08.
- (3) Trieu Trinh, Yuhuai Wu, Quoc Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 2024.
- (4) Leonardo de Moura and Sebastian Ullrich. The Lean 4 theorem prover and programming language. In Automated Deduction – CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings, page 625–635, Berlin, Heidelberg, 2021. Springer-Verlag.
- (5) Alex Kontorovich and Terence Tao. Prime number theorem and more. https://github.com/AlexKontorovich/PrimeNumberTheoremAnd, 2024. Version 0.1.0, released 2024-01-09.
- (6) Kevin Buzzard. Mathematical reasoning and the computer. Bulletin of the American Mathematical Society, 2024.
- (7) George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition, 2024.
- (8) Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. miniF2F: a cross-system benchmark for formal Olympiad-level mathematics, 2022.
- (9) Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. Proofnet: Autoformalizing and formally proving undergraduate-level mathematics, 2023.
- (10) Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf, 2024. GitHub repository.
- (11) Jiewen Hu, Thomas Zhu, and Sean Welleck. miniCTX: Neural theorem proving with (long-)contexts. In The Thirteenth International Conference on Learning Representations, 2025.
- (12) Dongwei Jiang, Marcio Fonseca, and Shay B Cohen. LeanReasoner: Boosting complex logical reasoning with Lean. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7490–7503, 2024.
- (13) Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen. A survey of learning-based automated program repair. ACM Trans. Softw. Eng. Methodol., 33(2), December 2023.
- (14) Kavi Gupta, Peter Ebert Christensen, Xinyun Chen, and Dawn Song. Synthesize, execute and debug: Learning to repair for neural program synthesis. In Advances in Neural Information Processing Systems, volume 33, pages 17685–17695. Curran Associates, Inc., 2020.
- (15) Michihiro Yasunaga and Percy Liang. Break-it-fix-it: Unsupervised learning for program repair. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 11941–11952. PMLR, 18–24 Jul 2021.
- (16) Berkay Berabi, Jingxuan He, Veselin Raychev, and Martin Vechev. Tfix: Learning to fix coding errors with a text-to-text transformer. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 780–791. PMLR, 18–24 Jul 2021.
- (17) Miltiadis Allamanis, Henry Jackson-Flux, and Marc Brockschmidt. Self-supervised bug detection and repair. In Advances in Neural Information Processing Systems, volume 34, pages 27865–27876. Curran Associates, Inc., 2021.
- (18) Talia Ringer. Proof Repair. PhD thesis, University of Washington, 2021.
- (19) Yves Bertot and Pierre Castéran. Interactive theorem proving and program development: Coq’Art: the calculus of inductive constructions. Springer Science & Business Media, 2013.
- (20) Albert Qiaochu Jiang, Wenda Li, Szymon Tworkowski, Konrad Czechowski, Tomasz Odrzygóźdź, Piotr Miłoś, Yuhuai Wu, and Mateja Jamnik. Thor: Wielding hammers to integrate language models and automated theorem provers. In Advances in Neural Information Processing Systems, 2022.
- (21) Emily First, Markus N. Rabe, Talia Ringer, and Yuriy Brun. Baldur: Whole-proof generation and repair with large language models, 2023.
- (22) Peiyang Song, Kaiyu Yang, and Anima Anandkumar. Towards large language models as copilots for theorem proving in lean. In The 3rd Workshop on Mathematical Reasoning and AI at NeurIPS’23, 2023.
- (23) Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. Formal mathematics statement curriculum learning, 2022.
- (24) Zijian Wu, Suozhi Huang, Zhejian Zhou, Huaiyuan Ying, Jiayu Wang, Dahua Lin, and Kai Chen. Internlm2.5-stepprover: Advancing automated theorem proving via expert iteration on large-scale lean problems, 2024.
- (25) Ran Xin, Chenguang Xi, Jie Yang, Feng Chen, Hang Wu, Xia Xiao, Yifan Sun, Shen Zheng, and Kai Shen. BFS-Prover: Scalable best-first tree search for llm-based automatic theorem proving, 2025.
- (26) Guillaume Lample, Marie-Anne Lachaux, Thibaut Lavril, Xavier Martinet, Amaury Hayat, Gabriel Ebner, Aurélien Rodriguez, and Timothée Lacroix. Hypertree proof search for neural theorem proving. In Proceedings of the 36th International Conference on Neural Information Processing Systems, pages 26337–26349, 2022.
- (27) Haiming Wang et al. Dt-solver: Automated theorem proving with dynamic-tree sampling guided by proof-level value function. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12632–12646, 2023.
- (28) Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models, October 2023. arXiv:2306.15626.
- (29) Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving, 2020.
- (30) Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, and Chong Ruan. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search, 2024.
- (31) Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover: A frontier model for open-source automated theorem proving, 2025.
- (32) Jingyuan Zhang, Qi Wang, Xingguang Ji, Yahui Liu, Yang Yue, Fuzheng Zhang, Di Zhang, Guorui Zhou, and Kun Gai. Leanabell-prover: Posttraining scaling in formal reasoning, 2025.
- (33) Kefan Dong and Tengyu Ma. STP: Self-play LLM theorem provers with iterative conjecturing and proving. arXiv preprint arXiv:2502.00212, 2025.
- (34) Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, Junqi Liu, Marco Dos Santos, Flood Sung, Marina Vinyes, Zhenzhe Ying, Zekai Zhu, Jianqiao Lu, Hugues de Saxcé, Bolton Bailey, Chendong Song, Chenjun Xiao, Dehao Zhang, Ebony Zhang, Frederick Pu, Han Zhu, Jiawei Liu, Jonas Bayer, Julien Michel, Longhui Yu, Léo Dreyfus-Schmidt, Lewis Tunstall, Luigi Pagani, Moreira Machado, Pauline Bourigault, Ran Wang, Stanislas Polu, Thibaut Barroyer, Wen-Ding Li, Yazhe Niu, Yann Fleureau, Yangyang Hu, Zhouliang Yu, Zihan Wang, Zhilin Yang, Zhengying Liu, and Jia Li. Kimina-Prover Preview: Towards large formal reasoning models with reinforcement learning, 2025.
- (35) Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen Marcus McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. In The Twelfth International Conference on Learning Representations, 2024.
- (36) Albert Q. Jiang et al. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs, 2023.
- (37) Haiming Wang, Huajian Xin, Chuanyang Zheng, Zhengying Liu, Qingxing Cao, Yinya Huang, Jing Xiong, Han Shi, Enze Xie, Jian Yin, Zhenguo Li, and Xiaodan Liang. LEGO-prover: Neural theorem proving with growing libraries. In The Twelfth International Conference on Learning Representations, 2024.
- (38) Lean FRO. A read-eval-print-loop for Lean 4. https://github.com/leanprover-community/repl, 2023.
- (39) OpenAI. Openai o3-mini. https://openai.com/index/openai-o3-mini/, January 2025.
- (40) OpenAI. Introducing openai o3 and o4-mini. https://openai.com/index/introducing-o3-and-o4-mini/, April 2025.
- (41) Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover-v2: The strongest open-source theorem prover to date, 2025.
- (42) Yang Li, Dong Du, Linfeng Song, Chen Li, Weikang Wang, Tao Yang, and Haitao Mi. HunyuanProver: A scalable data synthesis framework and guided tree search for automated theorem proving, 2025.
- (43) DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in llms via reinforcement learning, 2025.
- (44) Roozbeh Yousefzadeh, Xuenan Cao, and Azim Ospanov. A Lean dataset for International Math Olympiad: Small steps towards writing math proofs for hard problems. Transactions on Machine Learning Research, 2025.
Appendix A Additional implementation details
Algorithm 1 presents high-level pseudocode of the Apollo framework. We highlight that the framework is built upon recursive function calls, up to a maximum recursion depth given by the parameter $r$. $\mathbf{LLM}$ refers to a proof-generator model that takes a problem statement and outputs a formal proof attempt. $\mathbf{LeanCompiler}$ refers to the Lean REPL, which verifies the proof and outputs True or False; in the case of False, it also provides a detailed list of errors, including their locations and types, along with the error messages generated by the Lean compiler. The other functions in the pseudocode ($\mathbf{SyntaxRefiner}$, $\mathbf{Sorrifier}$, $\mathbf{AutoSolver}$, $\mathbf{ProofAssembler}$) represent the sub-modules described in the main text. To give deeper insight into Apollo’s design, we now outline the rationale and responsibilities of each core module.
A.1 Syntax Refiner
The Syntax Refiner class applies a set of rule‐based corrections to eliminate syntax errors in LLM‐generated Lean code. During model experimentation, we collected a corpus of common mistakes, such as stray Lean 3 keywords (`from`, `begin … end`), misplaced tactics, or missing delimiters, and encoded each repair as a regular‐expression rule. For example:
- Convert `from by` to `:= by`.
- Replace Lean 3 blocks (`begin … end`) with the corresponding Lean 4 structure.
- Ensure that commands like `rw` use correctly placed square brackets to avoid compilation errors.
In more complex cases, multiple regex patterns and replacements are composed to restore valid Lean 4 syntax without altering the logical content of the proof sketch.
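To make the rule-based design concrete, the following is a minimal Python sketch of such a regex-driven refiner. It is illustrative only: the rule list and the `refine_syntax` name are ours, and Apollo's actual corpus of repair rules is considerably larger.

```python
import re

# Illustrative rewrite rules (a small subset of the kinds of repairs described
# above); each pair maps a common Lean 3 / malformed pattern to valid Lean 4.
SYNTAX_RULES = [
    (re.compile(r"\bfrom\s+by\b"), ":= by"),                 # 'from by' -> ':= by'
    (re.compile(r"\bbegin\b"), "by"),                        # open a Lean 4 tactic block
    (re.compile(r"\s*\bend\b\s*$", re.MULTILINE), ""),       # drop a dangling Lean 3 'end'
    (re.compile(r"\brw\s+([A-Za-z_][\w.]*)"), r"rw [\1]"),   # bracket bare rw arguments
]

def refine_syntax(proof: str) -> str:
    """Apply each regex rule in order; the logical content is left untouched."""
    for pattern, replacement in SYNTAX_RULES:
        proof = pattern.sub(replacement, proof)
    return proof
```

Because the rules are applied sequentially, later rules can safely assume the normalizations performed by earlier ones.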
A.2 Sorrifier
The Sorrifier module uses a feedback loop between Apollo and the Lean REPL as follows:
1. Apollo sends the current proof candidate to the REPL.
2. The REPL returns a list of error messages and their source locations.
3. Apollo parses the proof into a syntax tree, navigates to each failing sub‐lemma, and attempts to fix it or replace it with a `sorry` placeholder.
4. The updated proof is re‐submitted to the REPL, and this cycle repeats until no errors remain.
If the REPL indicates the original theorem statement itself is malformed (e.g. due to LLM hallucination), Apollo abandons the current proof and requests a fresh generation from the LLM. One example of a malformed formal statement is when the LLM introduces a variable in a hypothesis that was not previously defined. Such a case triggers a compilation error in the formal statement; therefore, when parsing the proof, Apollo catches this error and requests a fresh proof from the LLM. To guarantee consistent variable types during lemma extraction and proof assembly, Apollo sets the following Lean options to true:
```lean
set_option pp.instanceTypes true
set_option pp.numericTypes true
set_option pp.coercions.types true
set_option pp.letVarTypes true
set_option pp.structureInstanceTypes true
set_option pp.mvars.withType true
set_option pp.coercions true
set_option pp.funBinderTypes true
set_option pp.piBinderTypes true
```
so that the compiler reports explicit types (e.g. $\mathbb{Z}$ vs. $\mathbb{R}$ ). This fail-safe prevents trivialization of theorems and ensures correct typing in the final assembled proof.
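As a schematic illustration of the Sorrifier's output (a hypothetical example, not taken from the actual benchmark runs), a proof whose intermediate step fails to type-check would be rewritten as:

```lean
import Mathlib

-- Hypothetical sketch: the failing step is isolated behind a `sorry`, so the
-- surrounding proof still compiles and the extracted goal can be repaired
-- independently by the AutoSolver or by a recursive LLM call.
theorem demo (a b : ℝ) (h : a + b = 2) : (a + b) ^ 2 = 4 := by
  rw [h]
  have key : (2 : ℝ) ^ 2 = 4 := by sorry  -- failing sub-lemma, extracted for repair
  exact key
```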
A.3 AutoSolver
The AutoSolver module takes the “sorrified” proof and applies a sequence of tactics to close as many goals as possible:
1. Hint-based tactics. It first invokes `hint`, which proposes candidate tactics. We filter out any suggestions that merely make progress on the goal and retain only those that fully discharge it.
2. Built-in tactics. If `hint` fails, AutoSolver systematically tries a suite of powerful Lean tactics, such as `nlinarith`, `norm_num`, `norm_cast`, `ring_nf`, `simp`, `simp_all`, and others, often combining them with `try` to catch failures without aborting.
3. Recursive fallback. Any goals that remain unsolved are reported back to Apollo. Apollo queries the REPL for the exact goal statements, spawns corresponding sub-lemmas, and then invokes the full repair loop (LLM generation, Syntax Refiner, Sorrifier, and AutoSolver) on each sub-lemma.
This tiered strategy maximizes automation while ensuring that challenging subgoals are handled via targeted recursion.
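The built-in-tactic tier can be pictured as a cascade of alternatives tried on each open goal. The snippet below is an illustrative sketch in Lean 4, not Apollo's literal tactic script:

```lean
import Mathlib

-- Hypothetical cascade: `first` attempts each closer in turn and succeeds as
-- soon as one of them fully discharges the goal; it fails only if all do.
example (x : ℝ) (hx : 0 < x) : 0 < x ^ 2 := by
  first
  | nlinarith
  | norm_num
  | simp_all
  | positivity
```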
Algorithm 1 Apollo Pseudocode for Formal Theorem Proving
1: Input: problem statement $ps$ , current recursion depth $r_{current}$ , maximum depth $r$
2: if $r_{current}$ is not initialized then
3: $r_{current}← 0$
4: end if
5: if $r_{current}>r$ then
6: return sorry $\triangleright$ If recursion depth reached, return sorry
7: end if
8: $proof←\mathbf{LLM}(ps)$ $\triangleright$ Generate proof from LLM
9: $proof←\mathbf{SyntaxRefiner}(proof)$ $\triangleright$ Refine the proof syntax
10: $proof←\mathbf{Sorrifier}(proof)$ $\triangleright$ Patch failing sub-lemmas
11: $proof←\mathbf{AutoSolver}(proof)$ $\triangleright$ Try built-in Lean solvers
12: if $\mathbf{LeanCompiler}(proof)$ == True then
13: return $proof$
14: end if
15: $r_{current}← r_{current}+1$
16: $n←\mathbf{CountSorries}(proof)$ $\triangleright$ Number of ’sorry’ placeholders
17: $sub\_proofs←[]$
18: for $i=1$ to $n$ do
19: $ps_{goal}←\mathbf{GetTheoremGoal}(proof,i)$
20: $sub\_ps←\mathbf{TransformGoal}(ps_{goal})$
21: $sub\_proofs[i]←\mathbf{Apollo}(sub\_ps,r_{current},r)$
22: end for
23: $repaired\_proof←\mathbf{ProofAssembler}(sub\_proofs)$
24: return $repaired\_proof$
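For readers who prefer executable pseudocode, Algorithm 1 can be sketched in Python as follows. This is a simplified sketch, not Apollo's actual codebase: the LLM, compiler, and repair modules are injected as callables, the goal-transformation step (lines 19–20) is elided, and all names are ours.

```python
def extract_sorries(proof: str) -> list[str]:
    # Treat every line that ends in `sorry` as one unresolved subgoal (line 16).
    return [ln for ln in proof.splitlines() if ln.strip().endswith("sorry")]

def assemble(proof: str, sub_proofs: list[str]) -> str:
    # ProofAssembler (line 23): splice repaired sub-proofs over the sorries, in order.
    out, subs = [], iter(sub_proofs)
    for ln in proof.splitlines():
        out.append(next(subs) if ln.strip().endswith("sorry") else ln)
    return "\n".join(out)

def apollo(ps, llm, compiler, refine, sorrify, autosolve, r_current=0, r=2):
    # Lines 5-7: give up with `sorry` once the recursion budget is spent.
    if r_current > r:
        return "sorry"
    # Lines 8-11: generate, syntax-refine, sorrify, and auto-solve the candidate.
    proof = autosolve(sorrify(refine(llm(ps))))
    # Lines 12-14: accept the proof if Lean verifies it as a whole.
    if compiler(proof):
        return proof
    # Lines 15-24: recurse on each remaining `sorry` goal, then reassemble.
    subs = [apollo(g, llm, compiler, refine, sorrify, autosolve, r_current + 1, r)
            for g in extract_sorries(proof)]
    return assemble(proof, subs)
```

With stubbed collaborators (e.g. an `llm` that returns a partially `sorry`-ed proof and a `compiler` that accepts any sorry-free proof), the function exhibits the same control flow as the pseudocode: one level of recursion repairs the placeholder and the assembled proof is returned.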
Appendix B Discussion on the maximum budget of @K sampling and Apollo
To better highlight Apollo’s contribution to token efficiency, we report token counts under two settings: (1) the budget over all attempted miniF2F-test problems, and (2) the budget restricted to proofs where Apollo was invoked. For instance, if the base LLM proves a theorem without Apollo’s assistance, its token usage is included only in the first setting. In contrast, if the base LLM cannot prove the theorem on its own but succeeds with Apollo’s help, its token usage is counted in the second setting. Results for setting (1) are shown in Table 5, while results for setting (2) appear in Table 6.
We emphasize that, regardless of the token-count setting, Apollo consistently consumes fewer tokens while achieving higher accuracy, highlighting its effectiveness. Throughout this work, we primarily report results under setting (2), as it more directly isolates Apollo’s contribution: the reported token usage is not artificially reduced by problems that the base LLM could solve independently. This makes setting (2) a more representative measure of our contribution.
Table 5: Comparison of maximum/average sample/token budget across $@K$ sampling and Apollo repair on the entire miniF2F-test dataset.
| Model | Max sample budget | Average sample budget | Max token budget | Average token budget |
| --- | --- | --- | --- | --- |
| Goedel-SFT | 25600 | 9730 | 12.7M | 4.8M |
| Goedel-SFT+Apollo | 1100 | 335 | 548K | 167K |
| Kimina-7B | 1024 | 373 | 1.3M | 1.6M |
| Kimina-7B+Apollo | 888 | 189 | 3.8M | 818K |
Table 6: Comparison of maximum/average sample/token budget across $@K$ sampling on the entire miniF2F-test dataset and Apollo repair exclusively on Apollo-assisted problems.
| Model | Max sample budget | Average sample budget | Max token budget | Average token budget |
| --- | --- | --- | --- | --- |
| Goedel-SFT | 25600 | 9730 | 12.7M | 4.8M |
| Goedel-SFT+Apollo | 1100 | 618 | 548K | 307K |
| Kimina-7B | 1024 | 373 | 1.3M | 1.6M |
| Kimina-7B+Apollo | 888 | 367 | 3.8M | 1.6M |
Appendix C Additional evaluation results
We also conducted experiments on the ProofNet azerbayev2023proofnetautoformalizingformallyproving dataset using the Kimina-Prover-Preview-Distill-7B model. Table 7 summarizes the sample/token budgets and accuracy. Incorporating the Apollo agent with Kimina not only improves accuracy by 7 percentage points but also reduces token consumption by roughly 25%.
Table 7: Comparison of Apollo accuracy on the ProofNet dataset. Token budget is computed over all generated tokens by LLM.
| Method | Model size | Sample budget | Token budget | ProofNet |
| --- | --- | --- | --- | --- |
| Kimina-Preview | 7B | 128 | 559K | 11.3% |
| Kimina-Preview+Apollo | 7B | 96 | 415K | 18.3% |
Moreover, we observe rising adoption of REPL-based repair, introduced in yousefzadeh2025a. Currently, models such as Goedel-Prover-V2 are fine-tuned on REPL outputs paired with faulty Lean code, enabling the model to repair missing or incorrect proof fragments using this feedback. Following yousefzadeh2025a, we compare Apollo against a REPL-feedback baseline using the general-purpose o3-mini model. Table 8 presents the results: Apollo outperforms the REPL-feedback approach while requiring a smaller sample budget. We view REPL-based feedback as complementary and expect it to work well in conjunction with recursive, compiler-aided repair of Lean proofs.
Feedback prompt schema
> This is an incorrect proof: [model’s last proof]
> Compilation errors are as follows: [Lean’s error messages]
> Based on this feedback, produce a correct raw Lean code for the following problem: [header] [informal prefix] [formal statement]
Table 8: Comparison of Apollo accuracy against compiler (REPL) feedback based repair on the miniF2F-test dataset. Token budget is omitted since OpenAI API does not provide exact token budget with its response.
| Method | Sample budget | Accuracy |
| --- | --- | --- |
| o3-mini | 1 | 3.3% |
| o3-mini | 32 | 24.6% |
| o3-mini + REPL feedback repair | 5 | 17.2% |
| o3-mini + REPL feedback repair | 10 | 25.4% |
| o3-mini + Apollo | 8 | 40.2% |
We also evaluated Apollo against the RMaxTS method proposed in xin_deepseek-prover-v15_2024. This approach employs Monte-Carlo tree search to explore the proof space and truncates the search whenever Lean compiler errors are encountered. RMaxTS has been shown to further improve the accuracy of whole-proof generation provers. For our evaluation, we used Kimina-Prover-Preview-Distill-7B as the base model. Table 9 reports the results. At a comparable token budget of approximately 2M tokens, Apollo outperforms RMaxTS by 10.7 percentage points. These findings further reinforce the effectiveness of the agentic recursive repair approach.
Appendix D Computational resource considerations
Our emphasis on reducing sample complexity is driven by resource allocation. CPU performance is not a limiting factor in our setup: the Lean kernel remains efficient even under high parallelization, with proof verification typically completing in 6–200 seconds. To guard against inefficient structures, we impose a 5-minute compilation timeout, which is rarely approached in practice.
By contrast, GPU resources were the primary bottleneck. Running Kimina-Prover-Preview-Distill-7B on a single A5000 GPU with a 16,384-token context window takes 700–2,000 seconds per problem at @32 sampling, i.e., upwards of 11.5 minutes each. Increased sampling exacerbates this cost, and even with eight GPUs throughput remains difficult to scale. While Lean-based servers such as Kimina-server continue to improve verification efficiency through CPU parallelization, the growing overhead of sampling highlights the need to minimize LLM invocations alongside improving theorem-proving accuracy.
Table 9: Comparison of Apollo accuracy against other sampling methods on the miniF2F-test dataset. Token budget is computed over all generated tokens by LLM.
| Method | Model size | Sample budget | Token budget | Accuracy |
| --- | --- | --- | --- | --- |
| Kimina-Preview | 7B | 1024 | 4.5M | 70.8% |
| Kimina-Preview+RMaxTS xin_deepseek-prover-v15_2024 | 7B | $4× 128$ | 2.0M | 64.3% |
| Kimina-Preview+Apollo | 7B | 589 | 2.2M | 75.0% |
Appendix E Extension of ablation study
To further elaborate on the ablation results presented in the main text, we measured the number of relevant Apollo module calls during inference. The results are shown in Table 10. Specifically, we counted the number of invocations for each Apollo-assisted problem in the miniF2F-test subset. The findings are consistent with the ablation study reported in the main text.
Table 10: Number of Apollo module triggers across different models
| Model | Syntax Refiner | Auto Solver | LLM Re-Invoker |
| --- | --- | --- | --- |
| Goedel-SFT | 0.0% | 100.0% | 100.0% |
| Kimina-7B | 0.0% | 100.0% | 100.0% |
| o3-mini | 100.0% | 100.0% | 70.0% |
| o4-mini | 94.7% | 100.0% | 64.9% |
Appendix F An example of Apollo proof repair procedure
Figure 6 shows a detailed run of the Apollo framework on mathd_algebra_332. First, the LLM generates an initial proof sketch, which fails to type-check in Lean. Apollo then applies its Syntax Refiner to correct simple errors (e.g. changing “from by” to “:= by”) and to strip out all comment blocks. Next, the Sorrifier replaces each failing sub-lemma with a sorry statement (six in this example). The AutoSolver module then attempts to discharge these sub-lemmas: it resolves four of them, leaving lemmas #5 and #6 unsolved. Apollo recurses on those two: each is passed again through LLM, Syntax Refiner, Sorrifier, and AutoSolver. After this single recursive iteration, all goals are closed. Finally, Apollo reassembles the full proof and returns it to the user.
This example illustrates Apollo’s repair logic. In this case the proof was fixed in one iteration; more complex theorems may require deeper recursive repair.
<details>
<summary>examples/step_by_step/dryrun.png Details</summary>

### Visual Description
## Flow Diagram: APOLLO Framework
### Overview
The image presents a flow diagram illustrating the APOLLO Framework, likely used for formal mathematical proofs, showcasing the different stages and tools involved in verifying a theorem. The diagram consists of multiple boxes, each representing a step or a tool, connected by arrows that indicate the flow of information.
### Components/Axes
* **Title:** APOLLO Framework (centered at the top)
* Boxes: Each box represents a module or step in the framework. The boxes are color-coded and arranged in a top-down flow, with arrows indicating the direction of processing. Modules include "LLM," "Syntax Refiner", "Sorrifier", and "Auto Solver".
* Arrows: Black arrows indicate the flow of data between the modules.
* Text: Each box contains Lean code with theorem statements and proof tactics.
* Color Coding (Top-Left of the Box):
* Red: LLM (Large Language Model)
* Orange: Syntax Refiner
* Yellow: Sorrifier
* Blue: Auto Solver
* Green: LLM + APOLLO, Base Proof + Lemma #5 + Lemma #6
### Detailed Analysis or ### Content Details
**1. Initial Setup (Top of Diagram)**
* **LLM (Red Box):**
* Imports: `Mathlib`, `Aesop`
* Options:
* `set_option maxHeartbeats 0`
* `set_option pp.numericTypes true`
* `set_option pp.coercions.types true`
* Opens: `BigOperators Real Nat Topology Rat`
* Theorem Statement (Informal): Real numbers 'x' and 'y' have an arithmetic mean of 7 and a geometric mean of sqrt(19). Show that x^2 + y^2 = 158.
* Theorem Declaration: `theorem mathd_algebra_332 (x y : R) (h₀ : (x + y) / 2 = (7 : R)) (h₁ : Real.sqrt (x * y) = Real.sqrt (19 : R)) : x^2 + y^2 = (158 : R) := by`
* Proof starts by rewriting the sum as x+y=14
* It substitutes the given equation
* It proves that x*y > 0
* It uses injectivity of sqrt on nonnegatives to deduce x * y = 19
* Finally, it substitutes and simplifies using ring
* **Syntax Refiner (Orange Box):**
* Imports: `Mathlib`, `Aesop`
* Options:
* `set_option maxHeartbeats 0`
* `set_option pp.numericTypes true`
* `set_option pp.coercions.types true`
* Opens: `BigOperators Real Nat Topology Rat`
* Theorem Declaration: `theorem mathd_algebra_332 (x y : R) (h₀ : (x + y) / 2 = (7 : R)) (h₁ : Real.sqrt (x * y) = Real.sqrt (19 : R)) : x^2 + y^2 = (158 : R) := by`
* Proof starts by rewriting the sum as x+y=14
* It substitutes the given equation
* It proves that x*y > 0
* The proof uses `by` and `ring`
* The proof also applies `sqrt_inj`
* **Sorrifier (Yellow Box):**
* Imports: `Mathlib`, `Aesop`
* Options:
* `set_option maxHeartbeats 0`
* `set_option pp.numericTypes true`
* `set_option pp.coercions.types true`
* Opens: `BigOperators Real Nat Topology Rat`
* Theorem Declaration: `theorem mathd_algebra_332 (x y : R) (h₀ : (x + y) / 2 = (7 : R)) (h₁ : Real.sqrt (x * y) = Real.sqrt (19 : R)) : x^2 + y^2 = (158 : R) := by`
* Proof starts by rewriting the sum as x+y=14
* It substitutes the given equation
* It proves that x*y > 0
* There are `sorry` statements here that need to be refined.
* **Auto Solver (Blue Box):**
* Imports: `Mathlib`, `Aesop`
* Options:
* `set_option maxHeartbeats 0`
* `set_option pp.numericTypes true`
* `set_option pp.coercions.types true`
* Opens: `BigOperators Real Nat Topology Rat`
* Theorem Declaration: `theorem mathd_algebra_332 (x y : R) (h₀ : (x + y) / 2 = (7 : R)) (h₁ : Real.sqrt (x * y) = Real.sqrt (19 : R)) : x^2 + y^2 = (158 : R) := by`
* Proof uses:
* `linarith`
* `simp_all only [Real.sqrt_pos, ofNat_pos]`
**2. Intermediate Stages (Middle of Diagram)**
* Multiple iterations of lemma definitions, refining tactics, and utilizing different modules ("LLM", "Syntax Refiner", "Sorrifier", "Auto Solver")
* Lemma Definition 1: `lemma mathd_algebra_332_1 (x y : R) (h₁ : (x * y) = (19 : R)) (sum : x + y = (14 : R)) (prod_pos : (0 : R) < x * y) (h₀ : True) : x * y = (19 : R) := by`
**3. Base Proof and Final Result (Bottom of Diagram)**
* **Base Proof + Lemma #5 + Lemma #6 (White Box):** Indicates that the base proof relies on Lemma #5 and Lemma #6
* **LLM + APOLLO (Green Box):**
* Imports: `Mathlib`, `Aesop`
* Options:
* `set_option maxHeartbeats 0`
* `set_option pp.numericTypes true`
* `set_option pp.coercions.types true`
* Opens: `BigOperators Real Nat Topology Rat`
* Theorem Declaration: `theorem mathd_algebra_332 (x y : R) (h₀ : (x + y) / 2 = (7 : R)) (h₁ : Real.sqrt (x * y) = Real.sqrt (19 : R)) : x^2 + y^2 = (158 : R) := by`
**4. Formal Statement (Bottom Left of Diagram)**
* A restatement of the problem being solved: Given real numbers x and y with an arithmetic mean of 7 and a geometric mean of sqrt(19), show that x^2 + y^2 = 158.
### Key Observations
* The diagram depicts a structured approach to proving a mathematical theorem, leveraging different automated tools and lemmas to arrive at the final result.
* The flow starts with an initial statement and progresses through refinement stages, utilizing tools like "Syntax Refiner," "Sorrifier", and "Auto Solver."
* The use of color-coded boxes helps to visually distinguish the roles of different components in the framework.
* The repeated appearance of the same theorem declaration across different modules suggests an iterative process of refining the proof.
* The presence of "sorry" statements in the intermediate steps indicates incomplete or placeholder proofs.
### Interpretation
The diagram illustrates a framework for formalizing and verifying mathematical proofs using a combination of AI (LLM) and automated reasoning tools. The process begins with an informal statement, which is then translated into formal Lean code. The proof is refined through multiple stages, leveraging tactics provided by different modules, until a complete and verified proof is achieved. This suggests an approach to theorem proving where AI and automated tools collaborate to handle complex mathematical reasoning. The diagram underlines the importance of different proof tactics and strategies in achieving a conclusive result.
</details>
Figure 6: Step-by-step Apollo repair on mathd_algebra_332. The initial proof and all sub-lemma proofs are generated with o4-mini, then systematically repaired and reassembled by Apollo.
<details>
<summary>examples/comparison/short_3.png Details</summary>

### Visual Description
## Code Comparison: LLM vs. LLM + APOLLO
### Overview
The image presents a side-by-side comparison of code snippets, labeled "LLM" on the left and "LLM + APOLLO" on the right. The code appears to be related to mathematical theorem proving, likely using a tool like Lean. The comparison highlights differences in the code required to achieve the same result with and without the APOLLO extension. Specific lines are highlighted in red (for LLM) and green (for LLM + APOLLO) to indicate differences.
### Components/Axes
* **Titles:** "LLM" (top-left, red background) and "LLM + APOLLO" (top-right, green background).
* **Code:** The main content consists of code blocks written in a formal language, likely a theorem prover's syntax.
* **Highlights:** Specific lines of code are highlighted with colored backgrounds (red and green) and numbered annotations (#1, #2, #3).
### Detailed Analysis or Content Details
**Left Side (LLM):**
* Imports:
* `import Mathlib`
* `import Aesop`
* Option Setting:
* `set_option maxHeartbeats 0`
* Open Statement:
* `open BigOperators Real Nat Topology Rat`
* Theorem Definition:
* `theorem mathd_algebra_158_llm (a k : N) (h₀ : Even a)`
* `(h₁ : ↑(Σ k in Finset.range 8, 2 * k + 1) - ↑(Σ k in Finset.range 5, a + 2 * k) = (4 : Z)) :`
* `a = 8 := by`
* `simp_all`
* `ring_nf at *`
* `have eq : (5 : Z) * (↑a : Z) + (2 : Z) * (↑k : Z) = 53 := by`
* `linarith`
* Highlighted Sections:
* **#1 (Red):** `have h_even : ∃ (m : N), a = 2 * m := h₀`
* **#2 (Red):** `have two_divides_53 : ∃ (d : Z), 53 = 2 * d := by`
* `rw [← even_sum] at eq_subst`
* `rw [eq_subst] at eq`
* `exact ⟨(↑5) * (↑m : Z) + ↑k, eq⟩`
* `have not_divides : ¬ ∃ (d : Z), 53 = 2 * d := by`
* `rintro ⟨d, hd⟩`
* `have mod53 : (53 : Z) % 2 = 1 := by`
* `norm_num`
* `have mod_d : (2 * d) % 2 = 0 := by`
* `norm_num`
* `rw [hd] at mod53`
* **#3 (Red):** `rw [mul_mod] at mod53`
* `linarith`
* `have contra : False := by`
* `apply not_divides`
* `exact two_divides_53`
* `exact False.elim contra`
**Right Side (LLM + APOLLO):**
* Imports:
* `import Mathlib`
* `import Aesop`
* Option Setting:
* `set_option maxHeartbeats 0`
* Open Statement:
* `open BigOperators Real Nat Topology Rat`
* Theorem Definition:
* `theorem mathd_algebra_158_apollo (a k : N) (h₀ : Even a)`
* `(h₁ : ↑(∑ k in Finset.range 8, 2 * k + 1) - ↑(∑ k in Finset.range 5, a + 2 * k) = (4 : Z)) :`
* `a = 8 := by`
* `simp_all`
* `ring_nf at *`
* `have eq : (5 : Z) * (↑a : Z) + (2 : Z) * (↑k : Z) = 53 := by`
* `linarith`
* Highlighted Sections:
* **#1 (Green):** `have h_even : ∃ (m : N), a = 2 * m := by`
* `exact Even.exists_two_nsmul a h₀`
* **#2 (Green):** `have two_divides_53 : ∃ (d : Z), 53 = 2 * d := by omega`
* `have not_divides : ¬ ∃ (d : Z), 53 = 2 * d := by`
* `rintro ⟨d, hd⟩`
* `have mod53 : (53 : Z) % 2 = 1 := by`
* `norm_num`
* `have mod_d : (2 * d) % 2 = 0 := by`
* `norm_num`
* `rw [hd] at mod53`
* **#3 (Green):** `rw [mul_mod] at mod53`
* `linarith`
* `have contra : False := by`
* `apply not_divides`
* `exact two_divides_53`
* `exact False.elim contra`
### Key Observations
* The code on both sides aims to prove the same theorem (`mathd_algebra_158`).
* The "LLM + APOLLO" version appears to be more concise in certain sections, particularly in the highlighted areas.
* The highlighted sections (#1, #2, #3) show where APOLLO simplifies the proof process.
* In #1, APOLLO provides a direct proof (`exact Even.exists_two_nsmul a h₀`) compared to the original.
* In #2, APOLLO uses `:= by omega` to automatically prove the statement.
* The rest of the code is identical.
### Interpretation
The image demonstrates the benefits of using the APOLLO extension in a theorem proving environment. APOLLO seems to automate or simplify certain proof steps, leading to more concise and potentially more efficient code. The highlighted sections illustrate specific instances where APOLLO reduces the amount of manual reasoning required. The comparison suggests that APOLLO can improve the user experience and potentially reduce the complexity of theorem proving tasks.
</details>
Figure 7: Proof attempt of mathd_algebra_158 produced by the base model (o4-mini) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
## Appendix G: Comparison between base proofs generated from LLMs and Apollo-repaired counterparts
In this section, we present and discuss examples of initial proof attempts alongside their Apollo-repaired counterparts.
Figures 7, 8, and 9 illustrate cases where Apollo repairs the proof without invoking the LLM again. In each example, the LLM struggles to generate the precise tactics needed to close certain goals; however, we find that regenerating the entire proof is unnecessary, since built-in Lean solvers can discharge those subgoals directly. Moreover, in Figure 7, problematic block #3 is resolved simply by removing it, since the surrounding proof context is already correct. Thus, omitting problematic lines can sometimes yield a valid proof.
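This repair pattern can be sketched in a few lines of Lean. The example below is purely illustrative (the statement and hypothesis names are invented, not taken from the figures): a sub-lemma whose hand-written tactic script fails to compile is replaced wholesale by a single call to a built-in solver.

```lean
import Mathlib

-- Illustrative only: suppose the original tactic block proving `h` failed to
-- compile. The repair replaces the entire block with the built-in solver
-- `omega`, which decides linear arithmetic over ℕ and ℤ.
example (n : ℕ) (h₀ : 2 * n = 10) : n = 5 := by
  have h : n ≤ 5 := by omega  -- repaired sub-goal: one solver call
  omega                        -- the remaining goal is discharged the same way
```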
In Figure 10, one goal is expanded via an LLM query, but the model injects extra tactics that trigger a `no goals to be solved` error. Apollo then repairs the first block using a combination of LLM generation and the AutoSolver, while the second block is removed entirely.
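The `no goals to be solved` failure mode is easy to reproduce with a toy example (illustrative, not the proof from Figure 10): once a tactic closes the goal, any tactic that follows is an error, so the repair is simply to delete the trailing lines.

```lean
import Mathlib

-- Illustrative: `norm_num` already closes the goal, so uncommenting the next
-- line would raise "no goals to be solved"; Apollo's repair removes it.
example : (2 : ℕ) + 2 = 4 := by
  norm_num
  -- ring_nf  -- ← extra tactic injected by the model; deleting it fixes the proof
```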
Figure 11 illustrates a case where the model fails to apply nlinarith to discharge the goal. We observe that, when uncertain, the model often resorts to broad tactics in hopes of closing the goal. Here, the goal is too complex for nlinarith, so Apollo leverages both the LLM and AutoSolver to guide the proof to completion.
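As a toy illustration of this failure mode (an assumed example, not the goal from Figure 11): `nlinarith` can fail on a nonlinear goal when called bare, but succeeds once the right auxiliary term is supplied as a hint, which is the kind of targeted fix Apollo applies.

```lean
import Mathlib

-- Illustrative: passing the square term `sq_nonneg (a - b)` as a hint gives
-- `nlinarith` the nonlinear fact (a - b)^2 ≥ 0 that it needs to close the goal.
example (a b : ℝ) : a * b ≤ (a ^ 2 + b ^ 2) / 2 := by
  nlinarith [sq_nonneg (a - b)]
```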
Figure 12 illustrates an example of a proof that required recursion depth $r=5$ to repair the initial attempt. We observe that the LLM’s overall proof structure is correct, but it over-relies on built-in solvers and rewrites; since the goals are too complex for these tactics, this approach produces a total of nine error blocks. Repairing those blocks requires help from both Apollo’s AutoSolver module and the LLM itself. This shows that when the LLM grasps the correct approach, Apollo can repair fine-grained logic errors to produce a correct proof.
Figure 13 illustrates a case where the LLM produces an incomplete proof sketch. The first half of the proof is correct, but the model fails to discharge the remaining goal and instead falls back on `all_goals positivity`. Apollo detects this error and performs two additional recursive repair iterations to complete the proof.
Figure 14 illustrates a case where the base model takes a lazy approach by trying to close all goals with linarith. In contrast, Apollo’s repaired proof performs additional setup and discharges each goal by first proving the necessary auxiliary hypotheses. This example shows that, although the model often understands the high-level strategy, it lacks the fine-grained tactics (and compiler feedback) needed to close subgoals. Apollo decomposes the proof, identifies the failure points, and applies targeted repairs to generate a fully verified solution.
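This repaired style, in which the necessary auxiliary hypotheses are proved first and then used to discharge the goal, can be sketched as follows (a hypothetical example with invented names, not the proof from Figure 14):

```lean
import Mathlib

-- Illustrative: rather than one broad `linarith` over the whole goal, the
-- repaired proof first establishes the intermediate facts it needs, then
-- closes the goal with them.
example (a b : ℝ) (h₁ : a + b = 10) (h₂ : a - b = 2) : a = 6 := by
  have hb : b = 4 := by linarith
  have ha : a = 6 := by linarith
  exact ha
```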
Figure 15 shows another scenario in which Apollo successfully closes the first two blocks with linarith, but the final block requires a deeper reasoning chain. Apollo augments the proof by introducing and proving the missing lemmas, which leads to a correct solution with a series of rewrites.
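The introduce-a-lemma-then-rewrite pattern looks roughly like this (a minimal sketch with invented names, not the proof from Figure 15): the missing fact is stated as a `have`, proved separately, and then used in a short chain of rewrites.

```lean
import Mathlib

-- Illustrative: the missing lemma is introduced as a `have`, after which the
-- goal falls to a series of rewrites and a numeric check.
example (a : ℝ) (h : a = 2) : a ^ 2 = 4 := by
  have hsq : a ^ 2 = a * a := by ring
  rw [hsq, h]
  norm_num
```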
Figure 16 presents an example of a very large repair. As in Figure 12, Apollo requires $r=5$ to fully fix the proof. The final proof grows from 100 lines to 216, meaning Apollo applied roughly $2.2\times$ as many tactics to close the goal. The repair relied on the combined effort of AutoSolver and the LLM to close all goals, showing that even for large, complex proofs, Apollo can repair each failing sub-block and produce a correct solution.
<details>
<summary>examples/comparison/short_1.png Details</summary>

### Visual Description
## Code Comparison: LLM vs. LLM + APOLLO
### Overview
The image presents a side-by-side comparison of code snippets, labeled "LLM" (left) and "LLM + APOLLO" (right). The code appears to be a mathematical proof or derivation, likely written in a formal language like Lean or similar. The comparison highlights differences in the steps required to reach the same conclusion, with "LLM + APOLLO" using more automated tactics like `linarith` to simplify the proof.
### Components/Axes
* **Headers:**
* Left: "LLM" (red background)
* Right: "LLM + APOLLO" (green background)
* **Code Blocks:** Each side contains a code block with similar structure, including:
* Import statements (`import Mathlib`, `import Aesop`)
* Setting options (`set_option maxHeartbeats 0`)
* Opening operators (`open BigOperators Real Nat Topology Rat`)
* Theorem definition (`theorem mathd_algebra_141_llm` and `theorem mathd_algebra_141_apollo`)
* Proof steps using `have`, `rw`, `norm_num`, `ring`, and `exact`.
* **Annotations:** The code contains annotations marked with "#1", "#2", and "#3", highlighting specific lines or steps. These annotations are colored red in the "LLM" version and green in the "LLM + APOLLO" version.
### Detailed Analysis or ### Content Details
**LLM (Left Side):**
* `import Mathlib`
* `import Aesop`
* `set_option maxHeartbeats 0`
* `open BigOperators Real Nat Topology Rat`
* `theorem mathd_algebra_141_llm (a b : R) (h₁ : a * b = 180) (h₂ : 2 * (a + b) = 54) :`
* `a ^ 2 + b ^ 2 = 369 := by`
* `have h3 : a + b = 27 := by #1` (red highlight)
* `field_simp [h₂]`
* `have h4 : (a + b) ^ 2 = 729 := by`
* `rw [h3]`
* `norm_num`
* `have expand : a ^ 2 + b ^ 2 = (a + b) ^ 2 - 2 * a * b := by`
* `ring`
* `have step1 : a ^ 2 + b ^ 2 = 729 - 2 * a * b := by`
* `rw [expand, h4]`
* `have step2 : 729 - 2 * a * b = 729 - 360 := by #2` (red highlight)
* `rw [h₁]`
* `have step3 : 729 - 360 = 369 := by`
* `norm_num`
* `exact step1.trans (step2.trans step3) #3` (red highlight)
**LLM + APOLLO (Right Side):**
* `import Mathlib`
* `import Aesop`
* `set_option maxHeartbeats 0`
* `open BigOperators Real Nat Topology Rat`
* `theorem mathd_algebra_141_apollo (a b : R) (h₁ : a * b = 180) (h₂ : 2 * (a + b) = 54) :`
* `a ^ 2 + b ^ 2 = 369 := by`
* `have h3 : a + b = 27 := by #1` (green highlight)
* `linarith`
* `have h4 : (a + b) ^ 2 = 729 := by`
* `rw [h3]`
* `norm_num`
* `have expand : a ^ 2 + b ^ 2 = (a + b) ^ 2 - 2 * a * b := by`
* `ring`
* `have step1 : a ^ 2 + b ^ 2 = 729 - 2 * a * b := by`
* `rw [expand, h4]`
* `have step2 : 729 - 2 * a * b = 729 - 360 := by #2` (green highlight)
* `linarith`
* `have step3 : 729 - 360 = 369 := by`
* `norm_num`
* `linarith #3` (green highlight)
### Key Observations
* The "LLM + APOLLO" version uses the `linarith` tactic more frequently, simplifying the proof steps.
* The annotations #1, #2, and #3 highlight the key differences in the proof strategies.
* In the LLM version, `field_simp [h₂]` is used after `have h3`, while in the LLM + APOLLO version, `linarith` is used instead.
* In the LLM version, `rw [h₁]` is used after `have step2`, while in the LLM + APOLLO version, `linarith` is used instead.
* In the LLM version, `exact step1.trans (step2.trans step3)` is used to complete the proof, while in the LLM + APOLLO version, `linarith` is used instead.
### Interpretation
The comparison demonstrates how the addition of the "APOLLO" tactic (likely an automated theorem prover or simplification tool) can significantly streamline mathematical proofs. The "LLM + APOLLO" version requires fewer explicit steps, relying on `linarith` to automatically deduce intermediate results. This suggests that "APOLLO" can automate certain aspects of the proof process, making it more efficient and potentially easier to understand. The annotations highlight specific points where "APOLLO" replaces manual steps, showcasing its impact on the overall proof structure.
</details>
Figure 8: Proof attempt of mathd_algebra_141 produced by the base model (o4-mini) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/short_2.png Details</summary>

### Visual Description
## Code Comparison: LLM vs. LLM + APOLLO
### Overview
The image presents a side-by-side comparison of code snippets, one labeled "LLM" and the other "LLM + APOLLO". Both snippets appear to be related to mathematical theorem proving, likely using a formal verification system. The comparison highlights differences in the code required to prove a specific theorem using two different approaches.
### Components/Axes
* **Left Block:**
* Title: "LLM" (red background)
* Code: A series of commands and a theorem definition.
* **Right Block:**
* Title: "LLM + APOLLO" (green background)
* Code: A series of commands and a theorem definition.
### Detailed Analysis or ### Content Details
**Left Block (LLM):**
* `import Mathlib`
* `import Aesop`
* `set_option maxHeartbeats 0`
* `open BigOperators Real Nat Topology Rat`
* `theorem imo_1983_p6_llm (a b c : R) (h₀ : 0 < a ∧ 0 < b ∧ 0 < c) (h₁ : c < a + b) (h₂ : b < a + c) (h₃ : a < b + c) : 0 ≤ a ^ 2 * b * (a - b) + b ^ 2 * c * (b - c) + c ^ 2 * a * (c - a) := by`
* `nlinarith [sq_nonneg (a - b), sq_nonneg (b - c), sq_nonneg (c - a), sq_nonneg (a - b), sq_nonneg (b - c), sq_nonneg (c - a), mul_pos h₀.left h₀.right.left, mul_pos h₀.left h₀.right.right, mul_pos h₀.right.left h₀.right.right]`
* Annotation: "#1" next to `sq_nonneg (b - c)` in the `nlinarith` line.
**Right Block (LLM + APOLLO):**
* `import Mathlib`
* `import Aesop`
* `set_option maxHeartbeats 0`
* `open BigOperators Real Nat Topology Rat`
* `theorem imo_1983_p6_apollo (a b c : R) (h₀ : 0 < a ∧ 0 < b ∧ 0 < c) (h₁ : c < a + b) (h₂ : b < a + c) (h₃ : a < b + c) : 0 ≤ a ^ 2 * b * (a - b) + b ^ 2 * c * (b - c) + c ^ 2 * a * (c - a) := by`
* `nlinarith [sq_nonneg (a - b), sq_nonneg (b - c), sq_nonneg (c - a), mul_pos (sub_pos.mpr h₁) (sub_pos.mpr h₂), mul_pos (sub_pos.mpr h₂) (sub_pos.mpr h₃), mul_pos (sub_pos.mpr h₃) (sub_pos.mpr h₁)]`
* Annotation: "#1" next to `sq_nonneg (b - c)` in the `nlinarith` line.
### Key Observations
* The code snippets define theorems named `imo_1983_p6_llm` and `imo_1983_p6_apollo`, respectively, which correspond to problem 6 of the 1983 International Mathematical Olympiad (IMO).
* The theorem states that under certain conditions on real numbers `a`, `b`, and `c`, a specific inequality holds.
* The `nlinarith` tactic is used to prove the theorem; it invokes a nonlinear arithmetic solver.
* The key difference lies in the arguments passed to the `nlinarith` command. The "LLM" version requires more explicit hints (multiple repetitions of `sq_nonneg` and explicit calls to `mul_pos` with detailed arguments), while the "LLM + APOLLO" version uses a more concise and automated approach (using `sub_pos.mpr` and fewer explicit hints).
### Interpretation
The image demonstrates the impact of using the "APOLLO" extension in conjunction with "LLM" for proving mathematical theorems. The "LLM + APOLLO" approach significantly simplifies the proof process by reducing the amount of manual guidance required for the linear arithmetic solver. This suggests that "APOLLO" provides a higher level of automation or more powerful reasoning capabilities compared to using "LLM" alone. The annotation "#1" likely refers to a specific point or step in the proof process that is being highlighted.
</details>
Figure 9: Proof attempt of imo_1983_p6 produced by the base model (Kimina-Prover-Preview-Distill-7B) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/short_4.png Details</summary>

### Visual Description
## Code Comparison: LLM vs. LLM + APOLLO
### Overview
The image presents a side-by-side comparison of code snippets, labeled "LLM" on the left and "LLM + APOLLO" on the right. Both snippets appear to be related to mathematical theorem proving, with similar initializations and a theorem definition. The key difference lies in the handling of cases within the induction proof, where "LLM + APOLLO" introduces an additional induction step.
### Components/Axes
* **Headers:**
* Left: "LLM" (red background)
* Right: "LLM + APOLLO" (green background)
* **Code Blocks:** Two code blocks containing mathematical definitions and proofs.
* **Annotations:** "#1" and "#2" are used to highlight specific lines in both code blocks.
### Detailed Analysis or ### Content Details
**Left Code Block (LLM):**
* **Imports:**
* `import Mathlib`
* `import Aesop`
* **Options:**
* `set_option maxHeartbeats 0`
* **Open:**
* `open BigOperators Real Nat Topology Rat`
* **Theorem:**
* `theorem induction_pord1p1on2powklt5on2_llm (n k : N)`
* `(h₀ : 0 < n) :`
* `(∏ k in Finset.Icc 1 n, 1 + (1 : R) / 2 ^ k) < 5 / 2 := by`
* **Proof:**
* `induction n with`
* `| zero =>`
* `contradiction`
* `| succ n hn =>`
* `cases n with`
* `| zero =>`
* `norm_num` `#1` (highlighted in pink)
* `| succ n =>`
* `simp_all [Finset.prod_Icc_succ_top, Nat.cast_succ, div_eq_mul_inv]`
* `norm_num` `#2` (highlighted in pink)
* `ring_nf` (highlighted in pink)
* `linarith` (highlighted in pink)
**Right Code Block (LLM + APOLLO):**
* **Imports:**
* `import Mathlib`
* `import Aesop`
* **Options:**
* `set_option maxHeartbeats 0`
* **Open:**
* `open BigOperators Real Nat Topology Rat`
* **Theorem:**
* `theorem induction_pord1p1on2powklt5on2_apollo (n k : N)`
* `(h₀ : 0 < n) :`
* `(∏ k in Finset.Icc 1 n, 1 + (1 : R) / 2 ^ k) < 5 / 2 := by`
* **Proof:**
* `induction n with`
* `| zero =>`
* `contradiction`
* `| succ n hn =>`
* `cases n with`
* `| zero =>`
* `norm_num` `#1` (highlighted in light green)
* `induction k with`
* `| zero =>`
* `norm_num [Finset.prod_Icc_succ_top] at hn h₀ <;> linarith`
* `| succ k ih =>`
* `cases k with`
* `| zero =>`
* `norm_num [Finset.prod_Icc_succ_top] at hn h₀ <;> linarith`
* `| succ k =>`
* `simp_all [Finset.prod_Icc_succ_top, pow_succ, mul_comm]`
* `linarith`
* `| succ n =>`
* `simp_all [Finset.prod_Icc_succ_top, Nat.cast_succ, div_eq_mul_inv]`
* `norm_num` `#2` (highlighted in light green)
* `ring_nf` (highlighted in light green)
* `linarith` (highlighted in light green)
### Key Observations
* Both code blocks define a theorem named `induction_pord1p1on2powklt5on2_...` with different suffixes (`_llm` vs `_apollo`).
* The initial setup (imports, options, open statements, theorem definition) is identical in both blocks.
* The key difference is in the handling of the `cases n with | zero =>` branch within the `induction n with | succ n hn =>` case. The "LLM + APOLLO" version introduces an additional `induction k with` step, providing more granular control over the proof.
* Annotations `#1` and `#2` highlight corresponding lines in both code blocks, indicating specific points of interest or comparison.
* The highlighting colors (pink for LLM, light green for LLM + APOLLO) visually distinguish the code blocks and their respective annotations.
### Interpretation
The image demonstrates a comparison between two approaches to proving a mathematical theorem. The "LLM + APOLLO" version utilizes a more refined induction strategy, potentially leading to a more robust or efficient proof. The additional `induction k with` step in the "LLM + APOLLO" version suggests a more detailed analysis of the base case within the inductive step. The annotations highlight the specific lines where the proof strategies diverge, allowing for a focused comparison of the two approaches. The use of different highlighting colors enhances the visual clarity of the comparison.
</details>
Figure 10: Proof attempt of induction_pord1p1on2powklt5on2 produced by the base model (Goedel-Prover-SFT) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/short_5.png Details</summary>

### Visual Description
## Code Comparison: LLM vs. LLM + APOLLO
### Overview
The image presents a side-by-side comparison of code snippets, likely from a formal verification system. The left side is labeled "LLM" and the right side is labeled "LLM + APOLLO". Both snippets define a theorem related to real number algebra, but the "LLM + APOLLO" version includes additional steps and lemmas to aid in the proof.
### Components/Axes
* **Titles:**
* Left: "LLM" (red background)
* Right: "LLM + APOLLO" (green background)
* **Code:** The code snippets are written in a language similar to Lean or Coq, used for formal mathematical proofs.
* **Comments/Annotations:** Both sides contain comments and annotations, such as "#1" in the bottom right corner of each code block.
### Detailed Analysis or ### Content Details
**Left Side (LLM):**
* **Imports:**
* `import Mathlib`
* `import Aesop`
* **Options:**
* `set_option maxHeartbeats 0`
* **Open:**
* `open BigOperators Real Nat Topology Rat`
* **Theorem Definition:**
* `theorem mathd_algebra_293_llm (x : NNReal) :`
* `Real.sqrt (60 * x) * Real.sqrt (12 * x) * Real.sqrt (63 * x) = 36 * x * Real.sqrt (35 * x) := by`
* **Proof Steps:**
* `rw [Real.sqrt_mul (by positivity), ← Real.sqrt_mul (by positivity)]`
* `ring_nf`
* `rw [Real.sqrt_eq_iff_mul_self_eq]`
* `ring_nf`
* `nlinarith #1`
**Right Side (LLM + APOLLO):**
* **Imports:**
* `import Mathlib`
* `import Aesop`
* **Options:**
* `set_option maxHeartbeats 0`
* **Open:**
* `open BigOperators Real Nat Topology Rat`
* **Theorem Definition:**
* `theorem mathd_algebra_293_apollo (x : NNReal) :`
* `Real.sqrt (60 * x) * Real.sqrt (12 * x) * Real.sqrt (63 * x) = 36 * x * Real.sqrt (35 * x) := by`
* **Proof Steps:**
* `rw [Real.sqrt_mul (by positivity), ← Real.sqrt_mul (by positivity)]`
* `ring_nf`
* `rw [Real.sqrt_eq_iff_mul_self_eq]`
* `ring_nf`
* **Additional Proof Steps (norm_num block):**
* `norm_num`
* `have h: (√(x : R) * √35)^2 = (x : R) * 35 := by rw [mul_pow]`
* `simp`
* `rw [h]`
* `ring`
* `simp_all only [ofNat_pos, mul_nonneg_iff_of_pos_right, NNReal.zero_le_coe, pow_nonneg]`
* `have hx: x ≥ 0 := by exact _root_.zero_le x`
* `have h1: √((↑x : R) * √(35: R)) * (36: R) ≥ 0 := by`
* `simp_all only [ge_iff_le, _root_.zero_le, NNReal.zero_le_coe, sqrt_mul, ofNat_pos, mul_nonneg_iff_of_pos_right, Real.sqrt_pos, Real.sqrt_nonneg]`
* `have A: 0 ≤ (x : R) := NNReal.coe_nonneg x`
* `have B: 0 ≤ √((x: R) * 35) := sqrt_nonneg ((x: R) * 35)`
* `have C: (0: R) ≤ 36 := by norm_num`
* `exact mul_nonneg (mul_nonneg A B) C`
### Key Observations
* Both sides define the same theorem, but the "LLM + APOLLO" version includes a more detailed proof using the `norm_num` tactic and several supporting lemmas.
* The "LLM" version relies on the `nlinarith` tactic, which likely performs more automated reasoning.
* The "#1" annotation appears on both sides, potentially indicating a specific goal or step in a larger proof.
### Interpretation
The comparison highlights the difference in proof strategies between using a large language model (LLM) alone versus using an LLM augmented with a tool like APOLLO. The "LLM + APOLLO" version demonstrates a more explicit and step-by-step proof, potentially making it easier to understand and verify. The LLM version relies on a more powerful tactic (`nlinarith`), which might be less transparent but more concise. The data suggests that combining LLMs with specialized tools can lead to more robust and verifiable formal proofs.
</details>
Figure 11: Proof attempt of mathd_algebra_293 produced by the base model (Goedel-Prover-SFT) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/medium_1.png Details</summary>

### Visual Description
## Code Comparison: LLM vs. LLM + APOLLO
### Overview
The image presents a side-by-side comparison of two code implementations, labeled "LLM" (left) and "LLM + APOLLO" (right). Both implementations appear to be attempting to prove the same mathematical theorem, "mathd_algebra_184," within a formal verification environment (likely Lean). The code snippets involve importing libraries, setting options, and defining the theorem with its assumptions and proof steps. The comparison highlights differences in the proof strategies and tactics used in each implementation.
### Components/Axes
* **Headers:**
* Top-left: "LLM" (enclosed in a red box)
* Top-right: "LLM + APOLLO" (enclosed in a green box)
* **Code Blocks:** Two distinct blocks of code, one for each implementation.
* **Annotations:** Numerical annotations (e.g., "#1", "#2") are scattered throughout both code blocks, potentially indicating key steps or points of divergence.
### Detailed Analysis or ### Content Details
Here's a breakdown of the code and annotations, comparing the LLM and LLM + APOLLO implementations:
**Common Elements:**
* Both implementations start with the same imports: `import Mathlib` and `import Aesop`.
* Both set the option `set_option maxHeartbeats 0`.
* Both open the namespace `open BigOperators Real Nat Topology Rat`.
* Both define the theorem `theorem mathd_algebra_184` with the same assumptions:
* `(a b : NNReal)`
* `(h₀ : (0 : NNReal) < a ∧ (0 : NNReal) < b)`
* `(h₁ : a ^ (2 : ℕ) = (6 : NNReal) * b)`
* `(h₂ : a ^ (2 : ℕ) = (54 : NNReal) / b)`
* The goal is to prove `a = (3 : NNReal) * NNReal.sqrt (2 : NNReal)`
**LLM (Left - Red Box):**
* **#1:** `have h4 : b ≠ 0 := by linarith [h₀.right]`
* **#2:** `nlinarith`
* **#3:** `have h7 : b ^ 2 = (9 : NNReal) := by nlinarith`
* **#4:** `have h10 : b ^ 2 - (9 : NNReal) = (0 : NNReal) := by linarith`
* **#5:** `linarith`
* **#6:** `-- b - 3 = 0`
* **#7:** `-- b + 3 = 0, but b > 0, contradiction`
* **#8:** `linarith`
* **#9:** `rw [h15, h16]`
**LLM + APOLLO (Right - Green Box):**
* **#1:** `have h4 : b ≠ 0 := by exact ne_of_gt hb`
* **#2:** `have h5 : (6 : NNReal) * b ^ (2 : ℕ) = (54 : NNReal) := by have h6 : b ^ (2 : ℕ) = b * b := by simp [pow_two]`
* **#3:** `have h7 : b ^ (2 : ℕ) = (9 : NNReal) := by`
* **#4:** `have h10 : b ^ 2 - (9 : NNReal) = (0 : NNReal) := by`
* **#5:** `have hb : b = (3 : NNReal) := by have h7' : b ^ 2 = (9 : NNReal) := by simpa using h7`
* **#6:** `cases' (mul_eq_zero.mp h16) with`
* **#7:** `exact h14`
* **#8:** `gcongr`
* **#9:** `<;> nlinarith [h₀.left, show 0 ≤ (3 : NNReal) * √2 by apply mul_nonneg norm_num apply Real.sqrt_nonneg]`
### Key Observations
* The LLM implementation relies heavily on the `nlinarith` tactic, which automatically solves linear arithmetic problems.
* The LLM + APOLLO implementation uses more explicit proof steps and tactics, such as `exact ne_of_gt hb`, `simp [pow_two]`, `gcongr`, and `cases'`.
* Both implementations arrive at similar intermediate steps, but the level of detail and the tactics used differ significantly.
* The annotations highlight specific points where the proof strategies diverge or where particular tactics are applied.
### Interpretation
The comparison suggests that the "LLM + APOLLO" implementation provides a more detailed and explicit proof compared to the "LLM" implementation. The LLM implementation leverages the `nlinarith` tactic to automate many of the arithmetic steps, making the proof shorter but potentially less transparent. The "LLM + APOLLO" implementation, on the other hand, breaks down the proof into smaller, more manageable steps, providing greater control and clarity. The choice between the two implementations depends on the desired level of detail and the specific requirements of the verification environment. The annotations serve as valuable markers for understanding the key differences in the proof strategies employed by each implementation.
</details>
Figure 12: Proof attempt of mathd_algebra_184 produced by the base model (Kimina-Prover-Preview-Distill-7B) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/medium_2.png Details</summary>

### Visual Description
## Code Comparison: LLM vs. LLM + APOLLO
### Overview
The image presents a side-by-side comparison of code snippets, likely from a formal verification system like Lean, used to prove a mathematical theorem. The left section, labeled "LLM," shows a proof attempt using a Large Language Model. The right section, labeled "LLM + APOLLO," shows a proof attempt using a Large Language Model augmented with the APOLLO system. The code includes imports, theorem definitions, and proof steps. The goal is to demonstrate how APOLLO enhances the proof process.
### Components/Axes
* **Headers:**
* Left: "LLM" (Red background)
* Right: "LLM + APOLLO" (Green background)
* **Code Blocks:** Two blocks of code, one under each header, containing Lean code.
* **Annotations:** "#1" is annotated in the right margin of each code block.
* The code uses mathematical notation and Lean-specific syntax.
### Detailed Analysis or ### Content Details
**Left Code Block (LLM):**
* Imports: `import Mathlib`, `import Aesop`
* Sets option `maxHeartbeats` to 0.
* Opens `BigOperators Real Nat Topology Rat`.
* Defines a theorem `mathd_algebra_293_llm` for `x : NNReal` (non-negative real number).
* The theorem states: `Real.sqrt (60 * x) * Real.sqrt (12 * x) * Real.sqrt (63 * x) = 36 * x * Real.sqrt (35 * x)`
* Proof starts with `:= by`.
* `have h1`: Introduces a hypothesis stating `Real.sqrt (60 * x) * Real.sqrt (12 * x) * Real.sqrt (63 * x) = Real.sqrt ((60 * x) * (12 * x) * (63*x))`.
* `rw`: Applies rewrite rules using `Real.sqrt_mul (by positivity)` twice.
* `rw [h1]`: Rewrites using hypothesis `h1`.
* `have h2`: Introduces another hypothesis stating `(60 * x) * (12 * x) * (63 * x) = (36 * x) ^ 2 * (35 * x)`.
* `ring_nf`: Simplifies using ring normal form.
* `<;> simp [x_mul_x]`: Simplifies using the `x_mul_x` simplification rule.
* `<;> linarith [show (0 : R) ≤ x from by positivity]`: Attempts to prove a linear inequality.
* `rw [h2]`: Rewrites using hypothesis `h2`.
* `rw [Real.sqrt_mul (by positivity)]`: Applies rewrite rule.
* `rw [Real.sqrt_sq (by simp [x])]`: Applies rewrite rule.
* `all_goals positivity`: Attempts to prove all remaining goals are positivity conditions.
* The last three lines are highlighted in red, indicating a potential failure or area of concern.
**Right Code Block (LLM + APOLLO):**
* Imports: `import Mathlib`, `import Aesop`
* Sets option `maxHeartbeats` to 0.
* Opens `BigOperators Real Nat Topology Rat`.
* Defines a theorem `mathd_algebra_293_apollo` for `x : NNReal`.
* The theorem states: `Real.sqrt (60 * x) * Real.sqrt (12 * x) * Real.sqrt (63 * x) = 36 * x * Real.sqrt (35 * x)`
* Proof starts with `:= by`.
* `have h1`: Introduces a hypothesis stating `Real.sqrt (60 * x) * Real.sqrt (12 * x) * Real.sqrt (63 * x) = Real.sqrt ((60 * x) * (12 * x) * (63 * x))`.
* `rw`: Applies rewrite rules using `Real.sqrt_mul (by positivity)` twice.
* `rw [h1]`: Rewrites using hypothesis `h1`.
* `have h2`: Introduces another hypothesis stating `(60 * x) * (12 * x) * (63 * x) = (36 * x) ^ 2 * (35 * x)`.
* `ring_nf`: Simplifies using ring normal form.
* `<;> simp [x_mul_x]`: Simplifies using the `x_mul_x` simplification rule.
* `<;> linarith [show (0 : R) ≤ x from by positivity]`: Attempts to prove a linear inequality.
* `try norm_cast; try norm_num; try simp_all; try ring_nf at * ; try native_decide ; try linarith; try nlinarith`: Applies a series of simplification tactics.
* `have h1: √(60: R) = 2 * √15 := by ...`: Introduces and proves a series of hypotheses about square roots of constants.
* `have h2': √(12: R) = 2 * √3 := by ...`
* `have h3 : √(63: R) = 3 * √7 := by ...`
* `have h4: √(↑x: R) ^ (3 : N) = √(↑x : R) * (↑x: R) := by ...`
* `have h41: √(↑x : R) ^ (3 : N) = √(↑x : R) ^ 2 * √(↑x: R) := by ...`
* `have h42: √(↑x : R) ^ 2 = (↑x : R) := by ...`
* `rw [Real.sq_sqrt]`: Applies rewrite rule.
* `exact NNReal.zero_le_coe`: Proves a goal using the fact that non-negative real numbers are greater than or equal to zero.
* `rw [h41, h42]`: Rewrites using hypotheses `h41` and `h42`.
* `linarith`: Attempts to prove a linear inequality.
* `rw [h1, h2, h3, h4]`: Rewrites using hypotheses `h1`, `h2`, `h3`, and `h4`.
* `ring_nf`: Simplifies using ring normal form.
* `<;> simp [mul_assoc]`: Simplifies using the `mul_assoc` simplification rule.
* `try norm_cast; try norm_num; try simp_all; try ring_nf at * ; try native_decide ; try linarith; try nlinarith`: Applies a series of simplification tactics.
* The code continues with more complex manipulations involving square roots and real numbers.
**Right Code Block (LLM + APOLLO) - Continued:**
* `have h2: √(15: ℝ) * √(3: ℝ) * √(7: ℝ) * √(↑x: ℝ) * (↑x : ℝ) * (12 : ℝ) = √(↑x : ℝ) * (↑x: ℝ) * √(35: ℝ) * (36: ℝ) := by`
* `have h21: √(15: ℝ) * √(3: ℝ) * √(7: ℝ) = √(315: ℝ) := by`
* `calc`: Starts a calculation block.
* `√(15: ℝ) * √(3: ℝ) * √(7: ℝ) = √(15 * 3 * 7) := by`
* `rw [Real.sqrt_mul (by norm_num), ← Real.sqrt_mul (by norm_num)]`
* `= √(315: ℝ) := by norm_num`
* `have h22: √(315 : ℝ) * √(↑x : ℝ) * (↑x: ℝ) * (12 : ℝ) = √(↑x : ℝ) * (↑x: ℝ) * √(35: ℝ) * (36: ℝ) := by`
* `have h221: √(315: ℝ) = √(35: ℝ) * (3: ℝ) := by`
* `calc`: Starts a calculation block.
* `√(315: ℝ) = √(35*3^2) := by`
* `= √(35) * √(3^2) := by`
* `rw [Real.sqrt_mul (by norm_num)]`
* `= √(35) * (3: ℝ) := by`
* `rw [Real.sqrt_sq (by norm_num)]`
* `rw [h221]`
* `ring_nf`
* `<;> simp [Real.sqrt_mul]`
* `<;> ring`
* `rw [h21]`
* `have h23: √(315: ℝ) * √(↑x : ℝ) * (↑x : ℝ) * (12: ℝ) = √(15: ℝ) * √(3: ℝ) * √(7: ℝ) * √(↑x : ℝ) * (↑x : ℝ) * (12 : ℝ) := by`
* `rw [h21]`
* `rw [h23]`
* `linarith [h22]`
* `rw [h2]`
* `ring`
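The √-factoring step described in `h221` above can be reproduced in isolation. The following self-contained sketch is ours, not taken from the figure, and assumes a current Mathlib where `√` denotes `Real.sqrt` after `open Real`:

```lean
import Mathlib

open Real

-- Sketch of the factoring step: √315 = √(35 · 3²) = √35 · √(3²) = √35 · 3.
example : √(315 : ℝ) = √(35 : ℝ) * 3 := by
  calc √(315 : ℝ) = √((35 : ℝ) * 3 ^ 2) := by norm_num
    _ = √(35 : ℝ) * √((3 : ℝ) ^ 2) := by rw [Real.sqrt_mul (by norm_num)]
    _ = √(35 : ℝ) * 3 := by rw [Real.sqrt_sq (by norm_num)]
```

Each `by norm_num` discharges the nonnegativity side condition required by `Real.sqrt_mul` and `Real.sqrt_sq`.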
### Key Observations
* Both code blocks start with the same imports and theorem definition.
* The LLM code block is shorter and relies on more general tactics.
* The LLM + APOLLO code block is longer and more explicit, breaking down the proof into smaller steps.
* The LLM code block has the last three lines highlighted in red, suggesting a failure or incomplete proof.
* The LLM + APOLLO code block appears to be more successful, with a more detailed and structured proof.
### Interpretation
The image demonstrates the potential benefits of using the APOLLO system to augment Large Language Models in formal verification. The LLM alone struggles to complete the proof, as indicated by the highlighted lines. APOLLO provides additional guidance and tactics, allowing the system to break down the problem into smaller, more manageable steps. This results in a more detailed and potentially more successful proof. The comparison suggests that APOLLO enhances the reasoning capabilities of LLMs in the context of formal mathematical proofs.
</details>
Figure 13: Proof attempt of mathd_algebra_293 produced by the base model (Kimina-Prover-Preview-Distill-7B) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/medium_3.png Details</summary>

### Visual Description
## Code Comparison: LLM vs. LLM + APOLLO
### Overview
The image presents a side-by-side comparison of code snippets labeled "LLM" and "LLM + APOLLO". Both snippets appear to be related to mathematical proofs or symbolic computations, likely within a formal verification or automated reasoning context. The code includes commands, definitions, and theorems, with annotations and comments. The comparison highlights differences in the approaches or optimizations used in each version.
### Components/Axes
* **Headers:**
* Left: "LLM" (Red background)
* Right: "LLM + APOLLO" (Green background)
* **Code Snippets:** Each section contains a block of code, including imports, options, theorem definitions, and proof steps.
* **Annotations:** Numbered annotations "#1", "#2", "#3" appear in both code snippets, possibly indicating key steps or differences.
### Detailed Analysis
**LLM (Left Side - Red Background)**
* **Imports:**
* `import Mathlib`
* `import Aesop`
* **Options:**
* `set_option maxHeartbeats 0`
* `set_option pp.numericTypes true`
* `set_option pp.coercions.types true`
* **Open:**
* `open BigOperators Real Nat Topology Rat`
* **Theorem:**
* `theorem algebra_2varlineareq_fp3zeq11_3tfm1m5zeqn68_feqn10_zeq7_llm (f z : ℂ)`
* `(h₀ : f + (3 : ℂ) * z = (11 : ℂ))`
* `(h₁ : (3 : ℂ) * (f - 1) - (5 : ℂ) * z = (-68 : ℂ)) :`
* `f = (-10 : ℂ) ∧ z = (7 : ℂ) := by`
* **Proof Steps:**
* `-- Simplify the second equation using mul_sub`
* `have h₂ : 3 * f - 3 - 5 * z = -68 := by`
* `simpa [mul_sub] using h₁`
* `-- Combine like terms to isolate 3*f - 5*z`
* `have h₃ : 3 * f - 5 * z = -65 := by #1`
* `linarith [h₂]`
* `-- Use the two equations to solve for z`
* `have z_eq : z = 7 := by #2`
* `linarith [h₀, h₃]`
* `-- Substitute z back to solve for f`
* `have f_eq : f = -10 := by #3`
* `linarith [h₀, z_eq]`
* `-- Conclude both equalities`
* `exact And.intro f_eq z_eq`
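One likely failure mode in the left-hand proof is that `linarith` only applies over linearly ordered fields, while this theorem is stated over `ℂ`, which has no linear order. As a hedged, self-contained alternative of our own (not Apollo's output), `linear_combination` certifies each equality as an explicit combination of the hypotheses:

```lean
import Mathlib

-- Our own sketch: over ℂ, `linarith` cannot fire, but `linear_combination`
-- closes each goal from an explicit linear combination of h₀ and h₁.
example (f z : ℂ) (h₀ : f + 3 * z = 11)
    (h₁ : 3 * (f - 1) - 5 * z = -68) : f = -10 ∧ z = 7 := by
  refine ⟨?_, ?_⟩
  · linear_combination (5 / 14) * h₀ + (3 / 14) * h₁
  · linear_combination (3 / 14) * h₀ - (1 / 14) * h₁
```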
**LLM + APOLLO (Right Side - Green Background)**
* **Imports:**
* `import Mathlib`
* `import Aesop`
* **Options:**
* `set_option maxHeartbeats 0`
* `set_option pp.numericTypes true`
* `set_option pp.coercions.types true`
* **Open:**
* `open BigOperators Real Nat Topology Rat`
* **Theorem:**
* `theorem algebra_2varlineareq_fp3zeq11_3tfm1m5zeqn68_feqn10_zeq7_apollo (f z : ℂ)`
* `(h₀ : f + (3 : ℂ) * z = (11 : ℂ))`
* `(h₁ : (3 : ℂ) * (f - 1) - (5 : ℂ) * z = (-68 : ℂ)) :`
* `f = (-10 : ℂ) ∧ z = (7 : ℂ) := by`
* **Proof Steps:**
* `have h₂ : 3 * f - 3 - 5 * z = -68 := by`
* `simpa [mul_sub] using h₁`
* `have h₃ : 3 * f - 5 * z = -65 := by #1`
* `try norm_cast; try norm_num; try simp_all;`
* `try ring_nf at *; try native_decide; try`
* `linarith; try nlinarith`
* `have h_expanded : (3 : ℂ) * f - (3 : ℂ) - (5 : ℂ) * z = (-68 : ℂ) := by`
* `gcongr`
* `have h_added : (3 : ℂ) * f - (5 : ℂ) * z = ((3 : ℂ) * f - (3 : ℂ) - (5 : ℂ) * z) + (3 : ℂ) := by`
* `try norm_cast; try norm_num; try`
* `simp_all; try ring_nf at *; try`
* `native_decide; try linarith; try nlinarith`
* `have h_add : f * (3 : ℂ) - z * (5 : ℂ) = (-68 : ℂ) + (3 : ℂ) := by`
* `have lhs : (-3 : ℂ) + (f * (3 : ℂ) - z * (5 : ℂ)) = (f * (3 : ℂ) - z * (5 : ℂ)) := by`
* `exact neg_add_cancel_comm (3 : ℂ) (f * (3 : ℂ) - z * (5 : ℂ))`
* `have rhs : (-68 : ℂ) + (3 : ℂ) = (-68 : ℂ) + (3 : ℂ) := by`
* `exact rfl`
* `rw [← lhs] at h₁`
* `simp_all only`
* `have h_sum : (-68 : ℂ) + (3 : ℂ) = (-65 : ℂ) := by`
* `try norm_cast; try norm_num; try`
* `simp_all; try ring_nf at *; try native_decide;`
* `try linarith; try nlinarith`
* `rw [h_add, h_sum]`
* `have h_subst : (3 : ℂ) * f - (5 : ℂ) * z = (-68 : ℂ) + (3 : ℂ) := by`
* `rw [h_expanded] at h_added`
* `exact h_added`
* `have h_simpl : (-68 : ℂ) + (3 : ℂ) = (-65 : ℂ) := by`
* `norm_num`
* `exact Eq.trans h_subst h_simpl`
* `have z_eq : z = 7 := by #2`
* `try norm_cast; try norm_num; try simp_all;`
* `try ring_nf at *; try native_decide;`
* `try linarith; try nlinarith`
* `have f_eq_def : f = (11 : ℂ) - (3 : ℂ) * z := by`
* `exact eq_sub_of_add_eq h₀`
* `have h3_sub : (3 : ℂ) * ((11 : ℂ) - (3 : ℂ) * z) - (5 : ℂ) * z = (-65 : ℂ) := by`
* `rw [f_eq_def] at h₃`
* `exact h₃`
* `have h3_simplified : (33 : ℂ) - (14 : ℂ) * z = (-65 : ℂ) := by`
* `norm_num`
* `try norm_cast; try norm_num; try`
* `simp_all; try ring_nf at *; try native_decide;`
* `try linarith; try nlinarith`
* `gcongr`
* `have isolate_z : (14 : ℂ) * z = 98 := by`
* `try norm_cast; try norm_num; try`
* `simp_all; try ring_nf at *; try native_decide;`
* `try linarith; try nlinarith`
* `have eq1 : (30 : ℂ) = ((-68 : ℂ) + z * (14 : ℂ)) := by`
* `rw [sub_eq_iff_eq_add] at h₁`
* `gcongr`
* `have eq2 : (30 : ℂ) + (68 : ℂ) = ((-68 : ℂ) + (68 : ℂ)) + z * (14 : ℂ) := by`
* `simp_all only [add_sub_cancel_right, neg_add_cancel_comm, neg_add_cancel, zero_add]`
* `have eq3 : (98 : ℂ) = (0 : ℂ) + z * (14 : ℂ) := by`
* `norm_num at eq2`
* `simp_all only [add_sub_cancel_right, zero_add]`
* `have eq4 : (0 : ℂ) + z * (14 : ℂ) = z * (14 : ℂ) := by`
* `simp`
* `have eq_final : (98 : ℂ) = z * (14 : ℂ) := by`
* `simp_all only [add_sub_cancel_right, neg_add_cancel_comm, neg_add_cancel, zero_add]`
* `gcongr`
* `have z_sol : z = (7 : ℂ) := by`
* `try norm_cast; try norm_num; try simp_all;`
* `try ring_nf at *; try native_decide; try`
* `linarith; try nlinarith`
* `have h_nonzero : (14 : ℂ) ≠ 0 := by`
* `norm_num`
* `have z_inv : z = 98 * ((14 : ℂ)⁻¹) := by`
* `exact (eq_mul_inv_iff_mul_eq₀ h_nonzero).mpr isolate_z`
* `have z_div : z = 98 / (14 : ℂ) := by`
* `exact z_inv`
* `have z_simpl : 98 / (14 : ℂ) = (7 : ℂ) := by`
* `norm_num`
* `rw [z_div, z_simpl]`
* `exact z_sol`
* `have f_eq : f = -10 := by #3`
* `try norm_cast; try norm_num; try simp_all;`
* `try ring_nf at *; try native_decide; try`
* `linarith; try nlinarith`
* `have h_f : f = (11 : ℂ) - (21 : ℂ) := by`
* `exact eq_sub_of_add_eq' h₀`
* `have h_num : (11 : ℂ) - (21 : ℂ) = (-10 : ℂ) := by`
* `norm_num`
* `rw [h_f, h_num]`
* `exact And.intro f_eq z_eq`
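The repeated `try …` chains in the listing above are Apollo's generic fallback cascade, inserted at each failing goal. As a minimal illustration of our own (the numeric goal here is ours, not from the figure), each tactic is wrapped in `try`, so the chain never fails and the first applicable tactic closes the goal:

```lean
import Mathlib

-- Fallback cascade sketch: `norm_num` closes this goal; the remaining
-- tactics (including `linarith`, inapplicable over ℂ) are skipped by `try`.
example : (-68 : ℂ) + 3 = -65 := by
  try norm_cast; try norm_num; try simp_all;
  try ring_nf at *; try linarith; try nlinarith
```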
### Key Observations
* **Similarities:** Both code snippets address the same theorem and initial setup (imports, options, theorem statement). They both aim to prove `f = -10` and `z = 7`.
* **Differences:** The "LLM + APOLLO" version includes more explicit and detailed proof steps, using tactics like `try norm_cast`, `try norm_num`, `try simp_all`, `try ring_nf at *`, `try native_decide`, `try linarith`, and `try nlinarith`. It also introduces intermediate lemmas (e.g., `h_expanded`, `h_added`, `h_add`, `h_subst`, `isolate_z`, `eq1`, `eq2`, `eq3`, `eq4`, `eq_final`, `z_sol`, `z_inv`, `z_div`, `z_simpl`, `h_f`, `h_num`) to break down the proof into smaller, more manageable steps. The LLM version uses `linarith` more directly.
* **Annotations:** The annotations `#1`, `#2`, and `#3` likely mark corresponding steps in both proofs where significant transformations or deductions occur.
### Interpretation
The "LLM + APOLLO" version likely represents an enhanced or more explicit proof strategy compared to the "LLM" version. The additional proof steps and tactics in "LLM + APOLLO" suggest a more robust or automated approach, possibly designed to be more easily verified or adapted to similar problems. The "LLM" version might be a more concise or human-readable proof, while "LLM + APOLLO" prioritizes rigor and automation. The use of tactics like `try norm_cast` and `try linarith` indicates an attempt to handle potential type casting and linear arithmetic automatically. The APOLLO addition seems to be adding more explicit steps to guide the proof assistant.
</details>
Figure 14: Proof attempt of algebra_2varlineareq_fp3zeq11_3tfm1m5zeqn68_feqn10_zeq7 produced by the base model (o4-mini) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/medium_4.png Details</summary>

### Visual Description
## Code Comparison: LLM vs. LLM + APOLLO
### Overview
The image presents a side-by-side comparison of code snippets, labeled "LLM" (left) and "LLM + APOLLO" (right), likely representing different approaches or versions of a mathematical proof or computation. The code appears to be written in a language related to theorem proving or formal verification, possibly Lean. The comparison highlights differences in the code structure and specific commands used.
### Components/Axes
* **Headers:**
* Left: "LLM" (red background)
* Right: "LLM + APOLLO" (green background)
* **Code Blocks:** Two distinct code blocks, one under each header.
* **Annotations:** Numerical annotations (#1, #2, #3) highlight specific lines or sections in each code block.
* The code includes imports, options settings, theorem definitions, and proof steps.
### Detailed Analysis
**Left Code Block (LLM):**
```lean
import Mathlib
import Aesop

set_option maxHeartbeats 0

open BigOperators Real Nat Topology Rat

theorem amc12a_2019_p12_llm (x y : ℝ) (h₀ : x ≠ 1 ∧ y ≠ 1)
    (h₁ : Real.log x / Real.log 2 = Real.log 16 / Real.log y) (h₂ : x * y = 64) :
    (Real.log (x / y) / Real.log 2) ^ 2 = 20 := by
  have x_nonzero : x ≠ 0 := by -- #1
    rintro rfl
    simp at h₁
  have y_nonzero : y ≠ 0 := by -- #2
    rintro rfl
    simp at h₁
  field_simp [Real.log_div, Real.log_mul, Real.log_inv, Real.log_pow] at h₁
  nlinarith -- #3
```
**Right Code Block (LLM + APOLLO):**
```lean
import Mathlib
import Aesop

set_option maxHeartbeats 0

open BigOperators Real Nat Topology Rat

theorem amc12a_2019_p12_apollo (x y : ℝ) (h₀ : x ≠ 1 ∧ y ≠ 1)
    (h₁ : Real.log x / Real.log 2 = Real.log 16 / Real.log y) (h₂ : x * y = 64) :
    (Real.log (x / y) / Real.log 2) ^ 2 = 20 := by
  have x_nonzero : x ≠ 0 := by -- #1
    rintro rfl
    simp at h₁
    linarith
  have y_nonzero : y ≠ 0 := by -- #2
    rintro rfl
    simp at h₁
    linarith
  field_simp [Real.log_div, Real.log_mul, Real.log_inv, Real.log_pow] at h₁
  ring_nf at h₁
  field_simp [Real.log_mul, Real.log_rpow, Real.log_pow, mul_comm] at h₁
  let L := Real.log (2 : ℝ)
  let X := Real.log x
  let Y := Real.log y
  have hXY : X + Y = Real.log 64 := by
    rw [Real.log_mul x_nonzero y_nonzero]
    rw [h₂]
  have hXY' : X + Y = 6 * L := by
    rw [hXY]
    have hL : 6 * Real.log (2 : ℝ) = Real.log (64) := by
      have h₂_pos : (2 : ℝ) > 0 := by norm_num
      have H : Real.log (2^6) = 6 * Real.log (2 : ℝ) := by
        simp
      rw [← H]
      norm_num
    gcongr
  have h16 : Real.log (16 : ℝ) = 4 * L := by
    norm_num
    have h₂_pos : (2 : ℝ) > 0 := by norm_num
    have H' : Real.log (2^4) = (4 : ℝ) * Real.log (2 : ℝ) := by
      simp
    rw [← H']
    norm_num
  rw [h16] at h₁
  have hX : X = (4 * L^2) / Y := by
    have hL : Real.log (2 : ℝ) = L := rfl
    rw [hL] at h₁
    have hL' : L * (4 * L) = 4 * L^2 := by
      linarith
    rw [hL'] at h₁
    exact h₁
  have hXY_prod : X * Y = 4 * L^2 := by
    rw [hX]
    field_simp [y_nonzero]
    ring_nf
    have Y_nonzero : Y ≠ 0 := by
      have y_log_nonzero : Real.log y ≠ 0 := by
        intro H
        have hy_exp : y = Real.exp (Real.log y) := by simp_all only [ne_eq, div_zero, log_eq_zero, false_or, log_neg_eq_log, log_one, exp_zero, mul_neg, mul_one, neg_neg, OfNat.one_ne_ofNat]
        rw [H, Real.exp_zero] at hy_exp
        have y_eq_one : y = 1 := hy_exp
        exact (h₀.right) y_eq_one
      exact y_log_nonzero
    simp_all only [ne_eq, isUnit_iff_ne_zero, not_false_eq_true, IsUnit.mul_inv_cancel_right]
  have diff_eq : X^2 + Y^2 - 2 * X * Y = (X + Y)^2 - 4 * (X * Y) := by ring
  rw [hXY'] at diff_eq
  rw [hXY_prod] at diff_eq
  linarith
```
**Annotations:**
* `#1`: Marks the "have x_nonzero" line in both code blocks.
* `#2`: Marks the "have y_nonzero" line in both code blocks.
* `#3`: Marks the "nlinarith" line in the LLM code block and a section of code starting with `let L := Real.log (2 : R)` in the LLM + APOLLO code block.
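The closing steps of the repaired proof rest on the polynomial identity stated in `diff_eq`: the squared difference is determined by the sum and the product of the logarithms. In isolation, `ring` certifies it directly:

```lean
import Mathlib

-- The identity behind `diff_eq`: (X - Y)² expressed via X + Y and X * Y,
-- so hXY' and hXY_prod pin down the value of (log(x/y) / log 2)².
example (X Y : ℝ) : X ^ 2 + Y ^ 2 - 2 * X * Y = (X + Y) ^ 2 - 4 * (X * Y) := by
  ring
```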
### Key Observations
* The "LLM + APOLLO" code block is significantly longer and more detailed than the "LLM" code block.
* The "LLM" code uses `nlinarith` to complete the proof, while "LLM + APOLLO" expands the proof with more explicit steps.
* The "LLM + APOLLO" code introduces intermediate variables (L, X, Y) and uses rewrite rules (`rw`) extensively.
* The annotations highlight key differences in the proof strategies.
### Interpretation
The image illustrates the difference between a more concise proof ("LLM") and a more detailed, step-by-step proof ("LLM + APOLLO"). The "LLM + APOLLO" version likely benefits from the assistance of the APOLLO tool, which seems to guide the proof process by suggesting intermediate steps and rewrite rules. The increased verbosity in "LLM + APOLLO" may make the proof easier to understand and verify, but it comes at the cost of increased code length. The use of `linarith` in "LLM + APOLLO" suggests a more targeted approach to linear arithmetic simplification compared to the more general `nlinarith` used in "LLM". The APOLLO version breaks down the proof into smaller, more manageable steps, potentially making it more robust and easier to debug.
</details>
Figure 15: Proof attempt of amc12a_2019_p12 produced by the base model (Goedel-Prover-SFT) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/long.png Details</summary>

### Visual Description
## Code Comparison: LLM vs. LLM + APOLLO
### Overview
The image presents a side-by-side comparison of code snippets, labeled "LLM" (left, red background) and "LLM + APOLLO" (right, green background). The code appears to be a formal proof or derivation, possibly in a language like Lean or Coq, dealing with number theory and algebraic manipulations. The comparison highlights differences in the proof strategies or steps taken by the two approaches.
### Components/Axes
* **Headers:**
* Left (Red): "LLM"
* Right (Green): "LLM + APOLLO"
* **Code Blocks:** Two columns of code, each representing a different approach to the same theorem.
* **Annotations:** Numerical annotations (e.g., "#1", "#2") are present within the code, possibly indicating specific points of interest or differences.
### Detailed Analysis
**Left Column (LLM):**
* **Theorem Statement:** `theorem mathd_algebra_289_llm (k t m n: ℕ) (h0: Nat.Prime m ∧ Nat.Prime n) (h1: t < k) (h2: (k ^ 2: ℤ) - m * k + n = 0) (h3: (t ^ 2: ℤ) - m * t + n = 0): m ^ n + n ^ m + k * t + t ^ k = 20 := by`
* **Key Steps:**
* Derives `have h4: (k : ℤ) ^ 2 - m * k + n = 0 := by exact_mod_cast h2`
* Derives `have h5: (t : ℤ) ^ 2 - m * t + n = 0 := by exact_mod_cast h3`
* Uses `linarith` to derive several intermediate results.
* Applies `mul_eq_zero.mp h9` and `resolve_left` with `omega`.
* Uses `nlinarith` with `[h4, h5, h12]`.
* Derives `have k = 1 := by nlinarith #1`
* Derives `have t = 1 := by omega`
* Derives `have n = 1 := by omega #2`
* Derives `have k = 1 := by nlinarith #3`
* Uses `interval_cases k <;> omega #4`
**Right Column (LLM + APOLLO):**
* **Theorem Statement:** `theorem mathd_algebra_289_apollo (k t m n: ℕ) (h0: Nat.Prime m ∧ Nat.Prime n) (h1: t < k) (h2: (k ^ 2: ℤ) - m * k + n = 0) (h3: (t ^ 2: ℤ) - m * t + n = 0): m ^ n + n ^ m + k * t + t ^ k = 20 := by`
* **Key Steps:**
* Similar initial steps as LLM, deriving `h4` and `h5` using `exact_mod_cast`.
* Derives `have h24: (k : ℤ) + (t : ℤ) = m := by #5 exact h12`
* Uses `gcongr` and `rw` (rewrite) at various points.
* Applies `simp_all` with various arguments like `Nat.cast_add`, `Nat.cast_one`, etc.
* Uses `by_contra h` and `push_neg at h` for proof by contradiction.
* Derives `have k % 2 = 0 := by omega #6`
* Derives `have (m : ℤ) = (k : ℤ) + 1 := by linarith #7`
* Derives `have k = 2 := by interval_cases k <;> omega #8`
**Annotations:**
* `#1`, `#2`, `#3`, `#4`, `#5`, `#6`, `#7`, `#8`: These annotations highlight specific lines or sections of code, likely indicating points of comparison or significant steps in the proof.
### Key Observations
* Both approaches start with the same theorem and initial assumptions.
* The "LLM + APOLLO" approach seems to incorporate more explicit simplifications and rewrites using lemmas and theorems from the `Nat` library.
* The use of `interval_cases` and proof by contradiction (`by_contra h`) is more prominent in the "LLM + APOLLO" approach.
* The annotations suggest specific points where the two approaches diverge or where key steps are taken.
### Interpretation
The image demonstrates two different strategies for proving the same theorem. The "LLM + APOLLO" approach likely benefits from additional knowledge or tools (represented by "APOLLO") that allow for more direct simplification and application of relevant lemmas. The annotations highlight the critical junctures where these differences in strategy become apparent. The presence of "omega" suggests the use of an automated theorem prover to discharge certain subgoals. The comparison suggests that "LLM + APOLLO" may be more efficient or require less manual intervention than the pure "LLM" approach.
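The `interval_cases k <;> omega` pattern mentioned above can be illustrated on a toy goal of our own (the bounds here are chosen for illustration and are not from the figure):

```lean
import Mathlib

-- `interval_cases k` enumerates the values of k allowed by h1 and h2
-- (here k = 2 and k = 3), and `omega` discharges each resulting
-- linear-arithmetic case.
example (k : ℕ) (h1 : 1 < k) (h2 : k < 4) : k = 2 ∨ k = 3 := by
  interval_cases k <;> omega
```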
</details>
Figure 16: Proof attempt of mathd_algebra_289 produced by the base model (Kimina-Prover-Preview-Distill-7B) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals. The import headers match those of the other presented examples and are omitted from this diagram for readability.