# APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning
**Authors**: Azim Ospanov, Farzan Farnia, Roozbeh Yousefzadeh
> Huawei Hong Kong Research Center; Department of Computer Science & Engineering, The Chinese University of Hong Kong
Abstract
Formal reasoning and automated theorem proving constitute a challenging subfield of machine learning, in which machines are tasked with proving mathematical theorems using formal languages like Lean. A formal verification system can check almost instantaneously whether a formal proof is correct, but generating a completely correct formal proof with large language models (LLMs) remains a formidable task. The usual approach in the literature is to prompt the LLM many times (up to several thousand) until one of the generated proofs passes the verification system. In this work, we present APOLLO (Automated PrOof repair via LLM and Lean cOllaboration), a modular, model-agnostic agentic framework that combines the strengths of the Lean compiler with an LLM's reasoning abilities to achieve better proof-generation results at a low token and sampling budget. Apollo directs a fully automated process in which the LLM generates proofs for theorems; a set of agents analyze the proofs, fix the syntax errors, identify the mistakes in the proofs using Lean, isolate failing sub-lemmas, utilize automated solvers, and invoke an LLM on each remaining goal with a low top-$K$ budget. The repaired sub-proofs are recombined and reverified, iterating up to a user-controlled maximum number of attempts. On the miniF2F benchmark, we establish a new state-of-the-art accuracy of 84.9% among sub-8B-parameter models (as of August 2025) while keeping the sampling budget below one hundred. Moreover, Apollo raises the state-of-the-art accuracy for Goedel-Prover-SFT to 65.6% while cutting sample complexity from 25,600 to a few hundred. General-purpose models (o3-mini, o4-mini) jump from 3-7% to over 40% accuracy. Our results demonstrate that targeted, compiler-guided repair of LLM outputs yields dramatic gains in both efficiency and correctness, suggesting a general paradigm for scalable automated theorem proving. The codebase is available at https://github.com/aziksh-ospanov/APOLLO
1 Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Whole-Proof Generation Pipeline
### Overview
The image depicts a diagram illustrating a "Common Approach: Whole-Proof Generation Pipeline". It shows a cyclical process involving a Large Language Model (LLM), a Lean Server, and a confirmation/exit loop. The process is repeated up to K times.
### Components/Axes
The diagram consists of three main components:
1. **LLM:** Represented by a stylized brain with gears inside, topped with a blue cylinder labeled "LLM".
2. **Lean Server:** A rectangular block labeled "Lean Server" containing the letters "LVN" in a stylized font.
3. **Confirmation/Exit Loop:** A screen with two buttons: a green checkmark labeled "exit loop" and a red X labeled "continue".
There are also two text labels:
* "proof attempt" - connecting the LLM to the Lean Server.
* "repeat up to K times" - positioned above the Lean Server, indicating the iterative nature of the process.
### Detailed Analysis or Content Details
The diagram illustrates a flow of information:
1. The LLM generates a "proof attempt".
2. This attempt is sent to the Lean Server.
3. The Lean Server processes the attempt.
4. The Lean Server outputs a result, which is presented to a confirmation/exit loop.
5. The user (or an automated system) can choose to "exit loop" (green checkmark) or "continue" (red X).
6. If "continue" is selected, the process repeats from step 1, up to a maximum of K iterations.
The LLM is depicted in the top-left corner. The Lean Server occupies the central portion of the diagram. The confirmation/exit loop is located on the right side. The entire process is enclosed within a grey rectangular border.
### Key Observations
The diagram highlights an iterative approach to proof generation. The "K times" repetition suggests that the system is designed to refine the proof through multiple attempts. The Lean Server appears to be the core processing unit, while the LLM provides the initial proof attempts. The confirmation/exit loop allows for human or automated intervention in the process.
### Interpretation
This diagram illustrates a common workflow in automated theorem proving or formal verification. The LLM acts as a creative engine, generating potential proofs. The Lean Server serves as a rigorous validator, checking the correctness of the proof. The iterative loop allows the system to learn from its mistakes and improve the quality of the generated proofs. The parameter "K" represents the maximum number of attempts, balancing the computational cost of refinement against the desired level of confidence in the final proof. The diagram suggests a hybrid approach, combining the generative power of LLMs with the formal rigor of Lean-based verification systems. The use of "LVN" within the Lean Server block is unclear without further context, but could represent a specific algorithm or component within the server. The diagram is a high-level overview and does not provide details about the specific algorithms or techniques used in each component.
</details>
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Apollo Pipeline
### Overview
The image depicts a diagram illustrating the proposed "Apollo Pipeline," a system for automated proof repair. The pipeline consists of several components: a Large Language Model (LLM), an Apollo Proof Repair Agent, and a Lean Server. The process involves iterative attempts to generate and refine proofs, with feedback loops between the components.
### Components/Axes
The diagram is structured with three main components, arranged horizontally from left to right:
* **LLM:** (Leftmost) Represented by a blue box with a brain-like graphic.
* **Apollo Proof Repair Agent:** (Center) Represented by a yellow box containing a wrench and a computer screen. This box also contains two sub-components: "Auto Solver" and "Subproof Extractor".
* **Lean Server:** (Rightmost) Represented by a teal box with a stylized "Lean" text. This box contains a control panel with "exit loop" and "continue" buttons.
Arrows indicate the flow of information and control between these components. Text labels describe the data being passed between them. A loop is indicated above the pipeline with the text "repeat up to r times".
### Detailed Analysis or Content Details
1. **LLM to Apollo Proof Repair Agent:** An arrow labeled "proof attempt(s)" points from the LLM to the Apollo Proof Repair Agent. Another arrow labeled "sub-problem(s) to prove" also points from the LLM to the Apollo Proof Repair Agent.
2. **Apollo Proof Repair Agent to Lean Server:** An arrow labeled "proof state compilation errors syntax errors" points from the Apollo Proof Repair Agent to the Lean Server.
3. **Lean Server Feedback:** The Lean Server provides feedback via two arrows:
* A red arrow points from the Lean Server to the Apollo Proof Repair Agent.
* A green arrow points from the Lean Server to the Apollo Proof Repair Agent.
4. **Apollo Proof Repair Agent Sub-components:**
* **Auto Solver:** Depicted with interconnected circles and lines.
* **Subproof Extractor:** Depicted with a magnifying glass and a question mark.
5. **Lean Server Control Panel:** The control panel on the Lean Server has two buttons:
* "exit loop" with a green checkmark.
* "continue" with a red X.
6. **Loop Indicator:** The text "repeat up to r times" is positioned above the Apollo Proof Repair Agent, indicating an iterative process.
### Key Observations
The diagram highlights a closed-loop system where the LLM generates proof attempts, the Apollo Proof Repair Agent refines them, and the Lean Server provides feedback. The iterative nature of the process is emphasized by the "repeat up to r times" loop. The Lean Server acts as a validator and controller, determining whether to continue refining the proof or exit the loop. The sub-components within the Apollo Proof Repair Agent suggest a modular approach to proof repair, with dedicated modules for automated solving and subproof extraction.
### Interpretation
The Apollo Pipeline diagram illustrates a sophisticated approach to automated theorem proving. The LLM serves as the initial proof generator, leveraging its language understanding capabilities to formulate potential proofs. The Apollo Proof Repair Agent then acts as a specialized repair system, utilizing both automated solving techniques and subproof extraction to address errors and refine the proof. The Lean Server provides a formal verification layer, ensuring the correctness of the proof and controlling the iterative process.
The diagram suggests a system designed to overcome the limitations of purely LLM-based theorem proving by incorporating formal verification and specialized repair mechanisms. The iterative loop indicates a robust approach to handling complex proofs, allowing the system to refine its attempts until a valid proof is found or a maximum number of iterations is reached. The modular design of the Apollo Proof Repair Agent, with its Auto Solver and Subproof Extractor components, suggests a flexible and adaptable system capable of handling a wide range of proof challenges. The use of Lean as the formal verification system implies a focus on mathematical rigor and correctness.
</details>
Figure 1: Summary of the whole-proof generation pipeline vs. the proposed Apollo agentic pipeline. LLM refers to a chosen formal theorem generator model.
Formal reasoning has emerged as one of the most challenging fields of AI, with recent achievements such as AlphaProof and AlphaGeometry performing well at the International Math Olympiad in competition with humans yang2024formal ; alphaproof ; AlphaGeometryTrinh2024 . Formal reasoning relies both on AI models and on a formal verification system that can automatically verify whether a mathematical proof is correct. In recent years, formal verification systems such as Lean lean4paper have facilitated a new way of doing mathematical research by enabling mathematicians to interact with the verification system, and with each other through it, allowing larger groups of mathematicians to collaborate on a single project. These formal verification systems are therefore also called proof assistants, as one can use them interactively to write a formal proof and instantaneously observe the current state of the proof and any errors or shortcomings reported by the compiler. Immediate access to the output of the Lean compiler can help a mathematician fix possible errors in the proof. At the same time, when a proof passes the compiler with no error, other collaborators do not need to verify its correctness. This type of collaboration is transforming the field of mathematics, enabling large groups of collaborators to engage in large projects of mathematical research such as the Prime Number Theorem And More project kontorovich_tao_2024_primenumbertheoremand . Moreover, it has given rise to the digitization of mathematics buzzard2024mathematical .
On the AI front, large language models (LLMs) are improving at mathematical reasoning in natural language and, at the same time, have shown a remarkable ability to learn the language of these formal verification systems and write mathematical proofs in formal language. This has led to the field of Automated Theorem Proving (ATP), where the standard approach is to prompt the LLM to generate a number of candidate proofs for a given theorem, which are then verified automatically by the formal verification system. Better models, better training sets, and better training methods have led to significant advances in this field tsoukalas2024putnambenchevaluatingneuraltheoremprovers ; zheng2022minif2fcrosssystembenchmarkformal ; azerbayev2023proofnetautoformalizingformallyproving ; li2024numinamath ; hu2025minictx .
Despite these advances, LLMs do not really get the chance to interact with the verification system the way a human does. LLMs generate many candidate proofs, sometimes as many as tens of thousands, and the verification system only marks each proof with a binary verdict of correct or incorrect. This common approach is illustrated in Figure 1. When the formal verification system analyzes a proof, its compiler may generate a long list of issues including syntax errors, incorrect tactics, open goals, etc. In the current approach of the community, all of this feedback from the compiler remains unused, as the LLM's proof generation is not directly engaged with the compiler errors of the formal verification system.
In this work, we address this issue by introducing Apollo, a fully automated system which uses a modular, model-agnostic pipeline that combines the strengths of the Lean compiler with the reasoning abilities of LLMs. Apollo directs a fully automated process in which the LLM generates proofs for theorems; a set of agents analyze the proofs, fix the syntax errors, identify the mistakes in the proofs using Lean, isolate failing sub-lemmas, utilize automated solvers, and invoke an LLM on each remaining goal with a low top-$K$ budget. The high-level overview of Apollo appears at the bottom of Figure 1.
Our contributions are as follows:
- We introduce a novel fully automated system that directs a collaboration between LLMs, the Lean compiler, and automated solvers for automated theorem proving.
- We evaluate Apollo on the miniF2F-test set using the best available LLMs for theorem proving and demonstrate that Apollo improves the baseline accuracies by significant margins while keeping the sampling costs much lower.
- We establish a new SOTA on the miniF2F-test benchmark for medium-sized language models (with parameter size 8B or less).
2 Related Works
Formal theorem proving systems. At their core, formal theorem-proving (FTP) systems employ interactive theorem provers (ITPs) to verify mathematical results. In particular, Lean4 lean4paper is both a functional programming language and an ITP. Each theorem, lemma, conjecture, or definition must be checked by Lean's trusted kernel, so the compiler's verdict is binary: either the statement type-checks (True) or it fails (False). This rigorous verification dramatically increases the reliability of formal mathematics; however, it also increases the complexity of proof development: authors must both comprehend the mathematical concepts and precisely encode them in the formal language. The usefulness of Lean is also shown to go beyond theorem proving, as Jiang et al. jiang2024leanreasoner showed that Lean can be adapted to natural language logical reasoning.
Patching broken programs. Many prior works have explored using feedback to repair broken proofs. In software engineering, "repair" typically refers to program repair, i.e. fixing code that no longer runs programrepair . A common source of errors is a version mismatch that introduces bugs and prevents execution. For example, SED sed_repair uses a neural program-generation pipeline that iteratively repairs initial generation attempts. Other notable approaches train specialized models to fix broken code based on compiler feedback (e.g. BIFI bifi , TFix tfix ) or leverage unsupervised learning over large bug corpora (e.g. BugLab buglab ).
LLM collaboration with external experts. The use of expert feedback, whether from human users or from a proof assistant's compiler, has proven effective for repairing broken proofs. Ringer ringer2009typedirected showcased automatic proof repair within the Coq proof assistant bertot2013interactive , enabling the correction of proofs in response to changes in underlying definitions. Jiang et al. jiang2022thor showed that leveraging automatic solvers alongside generative models yields better performance on formal math benchmarks. More recently, First et al. first2023baldurwholeproofgenerationrepair showed that incorporating compiler feedback into LLM proof generation significantly improves the model's ability to correct errors and produce valid proofs. Another line of work song2023towards explored mathematician-LLM collaboration, where human experts are aided by LLMs at the proof-writing stage.
Proof search methods. Numerous recent systems use machine learning to guide the search for formal proofs. One family of methods pairs large language models with structured search strategies, e.g. Best-First Search (BFS) polu2022formalmathematicsstatementcurriculum ; wu2024internlm25stepproveradvancingautomatedtheorem ; xin2025bfsproverscalablebestfirsttree or Monte Carlo Tree Search (MCTS) lample2022hypertree ; wang2023dt-solver ; alphaproof . While tree-search methods reliably steer a model toward valid proofs, they incur high inference costs and often explore many suboptimal paths before success. Another line of work leverages retrieval-based systems that provide proof generators with potentially useful lemmas as context. One such example is ReProver yang_leandojo_2023 , which augments Lean proof generation by retrieving relevant lemmas from a proof corpus and feeding them to an LLM, enabling the model to reuse existing proofs.
Whole proof generation. The use of standalone LLMs for theorem proving has emerged as a major research direction in automated theorem proving. One of the earliest works presented GPT-f polu2020generativelanguagemodelingautomated , a transformer-based model for theorem proving that established LLMs' ability in formal reasoning. As training frameworks and methods advance, the community has produced numerous models that generate complete proofs without external search or tree-search algorithms. Recent work shows that both supervised models xin2024deepseekproverv15harnessingproofassistant ; lin2025goedelproverfrontiermodelopensource and those trained via reinforcement learning xin2024deepseekproverv15harnessingproofassistant ; zhang2025leanabellproverposttrainingscalingformal ; dong2025stp ; kimina_prover_2025 achieve competitive performance on formal-math benchmarks. However, whole-proof generation remains vulnerable to hallucinations that can cause compilation failures even for proofs with correct reasoning chains.
Informal Chain-of-Thought in Formal Mathematics. Several recent works demonstrate that interleaving informal "chain-of-thought" (CoT) reasoning with formal proof steps substantially improves whole-proof generation. For example, developers insert natural-language mathematical commentary between tactics, yielding large accuracy gains lin2025goedelproverfrontiermodelopensource ; xin2024deepseekproverv15harnessingproofassistant ; azerbayev2024llemma . Wang et al. kimina_prover_2025 further show that special "thinking" tokens let the model plan its strategy and self-verify intermediate results. The "Draft, Sketch, and Prove" (DSP) jiang2023draftsketchproveguiding and LEGO-Prover wang2024legoprover pipelines rely on an informal proof sketch before attempting to generate the formal proof. Collectively, these studies indicate that providing informal mathematical context, whether as comments, sketches, or dedicated tokens, guides LLMs toward correct proofs and significantly boosts proof synthesis accuracy.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: Apollo Algorithm - Automated Theorem Proving
### Overview
This diagram illustrates the Apollo Algorithm, a system for automated theorem proving. It depicts a workflow involving several components: Syntax Refiner, Sorrifier, Auto Solver, Proof Assembler, and a Large Language Model (LLM). The process begins with a theorem and iteratively refines it until a valid proof is constructed, or a maximum number of attempts is reached. The diagram also shows examples of invalid and correct proof outputs.
### Components/Axes
The diagram is structured around a central flow chart with supporting text blocks. Key components include:
* **Apollo Algorithm** (Title, top-center)
* **Syntax Refiner:** (Top-left) - Refines the syntax of the theorem.
* **Sorrifier:** (Top-center-left) - Transforms the theorem into a suitable form for the Auto Solver.
* **Auto Solver:** (Top-center-right) - Attempts to solve the theorem.
* **Proof Assembler:** (Bottom-center) - Assembles the proof steps.
* **LLM:** (Top-right) - Large Language Model, used to feed information and assemble the proof.
* **Decision Nodes:** (Throughout) - Represent yes/no questions determining the flow of the algorithm.
* **Invalid Proof:** (Bottom-left) - Displays an example of an invalid proof attempt.
* **Correct Proof:** (Bottom-right) - Displays an example of a correct proof.
* **Flow Arrows:** Indicate the direction of the process.
* **Labels:** "number of attempts > ?", "Does the proof compile in Lean?"
### Detailed Analysis or Content Details
**Flow Chart:**
1. The process starts with a theorem.
2. It passes through the Syntax Refiner.
3. Then to the Sorrifier.
4. Next, the Auto Solver attempts to solve it.
5. A decision node asks: "number of attempts > ?".
* If **YES**, the process loops back to the Syntax Refiner.
* If **NO**, another decision node asks: "Does the proof compile in Lean?".
* If **YES**, the Proof Assembler assembles the proof, and the LLM is used to refine it.
* If **NO**, the process ends with an "Invalid Proof".
**Invalid Proof (Left Block):**
* **Title:** `theorem math_numbertheory_499_incorrect (a b : Nat) (ha : a ≤ 10) (hb : b ≤ 10) (h1 : Nat.gcd a b = 6) : Nat.lcm a b = 60`
* `(a b : Nat)`
* `(ha : a ≤ 10)`
* `(hb : b ≤ 10)`
* `(h1 : Nat.gcd a b = 6)`
* `have h4 : a ≥ 6 := by`
* `have h5 : a ≤ 10 := h1`
* `omega`
* `have h5 : b ≥ 6 := by`
* `have h6 : b ≤ 10 := h1`
* `omega`
* `have h7 : a ≤ 30 := by`
* `have h8 : b ≤ 30 := by`
* `omega`
* `have h9 : Nat.lcm a b ≤ 60 := by`
* `have h10 : Nat.lcm a b ≥ 60 := by`
* `omega`
* `have h11 : Nat.lcm a b = 60 := by`
* `apply Nat.lcm_eq`
* `have h12 : a = 6 := by`
* `have h13 : b = 10 := by`
* `omega`
* `have h14 : Nat.gcd 6 10 = 6 := by`
* `apply Nat.gcd_eq`
* `have h15 : Nat.lcm 6 10 = 60 := by`
* `apply Nat.lcm_eq`
* `QED`
**Correct Proof (Right Block):**
* **Title:** `theorem math_numbertheory_499_after_proof_repair (a b : Nat) (ha : a ≤ 10) (hb : b ≤ 10) (h1 : Nat.gcd a b = 6) : Nat.lcm a b = 60`
* `(a b : Nat)`
* `(ha : a ≤ 10)`
* `(hb : b ≤ 10)`
* `(h1 : Nat.gcd a b = 6)`
* `have h2 : a ≤ 10 := by`
* `linarith`
* `have h3 : a ≥ 6 := by`
* `by_contra h1`
* `push_neg`
* `have h4 : a ≤ 101 := by`
* `linarith`
* `have h5 : b ≥ 6 := by`
* `linarith`
* `have h6 : b ≤ 10 := by`
* `linarith`
* `rw [h1]`
* `norm_num at h6`
* `have h7 : b ≤ 100 := by`
* `linarith`
* `have h8 : b ≤ 10 := by`
* `by_contra h1`
* `push_neg`
* `have h9 : b ≤ 101 := by`
* `linarith`
* `have h10 : Nat.lcm a b ≤ 60 := by`
* `have h11 : Nat.lcm a b ≥ 60 := by`
* `linarith`
* `have h12 : Nat.lcm a b = 60 := by`
* `apply Nat.lcm_eq`
* `QED`
### Key Observations
* The diagram highlights an iterative process of theorem refinement.
* The "Invalid Proof" block shows a proof attempt that failed, likely due to a logical error or inability to compile in Lean.
* The "Correct Proof" block demonstrates a successful proof, with steps labeled using Lean's proof tactics (e.g., `linarith`, `omega`, `apply`).
* The use of the LLM suggests a role in guiding the proof assembly process.
* The decision nodes indicate that the algorithm can either converge to a solution or terminate with an invalid proof.
### Interpretation
The Apollo Algorithm represents a sophisticated approach to automated theorem proving. It combines traditional automated reasoning techniques (Auto Solver, Lean compilation) with the power of Large Language Models to improve the efficiency and success rate of proof construction. The iterative nature of the algorithm, guided by the decision nodes, allows it to explore different proof strategies and refine the theorem until a valid proof is found. The comparison between the "Invalid Proof" and "Correct Proof" blocks illustrates the challenges of automated reasoning and the importance of careful proof step selection. The LLM likely plays a role in suggesting proof tactics or identifying potential errors, thereby accelerating the proof process. The diagram suggests a system designed to not only find proofs but also to learn from failed attempts, potentially improving its performance over time. The use of Lean as a compilation target indicates a focus on formal verification and mathematical rigor. The diagram is a high-level overview and doesn't detail the specific algorithms or techniques used within each component.
</details>
Figure 2: Overview of the Apollo framework.
3 Our Approach
In this section we describe Apollo, our framework for transforming an LLM's raw proof sketch into a fully verified Lean4 proof. Figure 2 illustrates the Apollo pipeline process, with the attached Lean code showing a proof before and after repair. We observe that the LLM is often capable of producing a valid proof sketch; however, it often struggles with closing fine-grained goals. For example, statements h4, h5, h7, and h8 are correct, but the generated proofs for each sub-lemma throw compilation errors. Apollo identifies and fixes such proof steps and guides the LLM towards a correct solution.
Our compiler of choice is the Lean REPL leanrepl , a "Read-Eval-Print Loop" for Lean4. It provides a complete list of compilation errors and warnings, each with a detailed description and the location (for example, the line number) where it occurred. Errors can include syntax mistakes, nonexistent tactics, unused tactics, and more. We aggregate this information to drive our agent's dynamic code repair and sub-problem extraction. The pseudo-code for our framework can be found in Algorithm 1 in the Appendix.
3.1 Syntax Refiner
The Syntax Refiner catches and corrects superficial compilation errors: missing commas, wrong keywords (e.g. Lean3's `from` vs. Lean4's `:= by`), misplaced brackets, etc. These syntax errors arise when a general-purpose LLM (such as o3-mini openai-o3-mini-2025 or o4-mini openai-o4-mini-2025 ) has not been specialized on Lean4 syntax. By normalizing these common mistakes, this module ensures that subsequent stages operate on syntactically valid code. It is important to note that this module is primarily applicable to general-purpose models not explicitly trained for theorem proving in Lean. In contrast, the Syntax Refiner is usually not invoked for proofs generated by LLMs trained for formal theorem proving.
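As a concrete illustration of the kind of superficial repair this stage performs, consider the sketch below. The theorem and identifiers are hypothetical, not taken from the paper's corpus; the example only shows Lean 3-style output being normalized into Lean 4 syntax:

```lean
-- Before repair (Lean 3 style): `begin ... end` blocks and the lowercase
-- `nat` namespace do not exist in Lean 4, so this fails to compile.
-- theorem mul_comm_example (a b : Nat) : a * b = b * a :=
-- begin
--   exact nat.mul_comm a b
-- end

-- After the Syntax Refiner: the Lean 4 `:= by` form with corrected names.
theorem mul_comm_example (a b : Nat) : a * b = b * a := by
  exact Nat.mul_comm a b
```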
3.2 Sorrifier
The Sorrifier patches any remaining compilation failures by inserting Lean's `sorry` placeholder. We first parse the failed proof into a tree of nested proof-blocks (sub-lemmas as children). Then we compile the proof with the Lean REPL leanrepl , detect the offending line or block, and apply one of three repairs:
1. Line removal, if a single tactic or declaration is invalid but its surrounding block may still succeed.
2. Block removal, if the entire sub-proof is malformed.
3. Insertion of `sorry`, if the block compiles but leaves unsolved goals open.
We repeat this procedure until the file compiles without errors. At that point, every remaining `sorry` marks a sub-lemma to be proved in later stages of Apollo. This part of the pipeline guarantees that the proof compiles successfully in Lean, with warnings about the presence of `sorry` statements.
Each `sorry` block corresponds to a sub-problem that the LLM failed to prove. Such blocks may not type-check for various reasons, most commonly LLM hallucination. The feedback obtained via the REPL lets us automatically catch these hallucinations, insert a `sorry` placeholder, and remove invalid proof lines.
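A minimal sketch of what a sorrified file might look like (the toy theorem below is illustrative, not from the benchmark): the sub-proof that failed to type-check is replaced by `sorry`, so the file compiles with a warning while the open goal stays isolated for the next stages.

```lean
theorem toy_example (a : Nat) (ha : a ≤ 10) : a < 12 := by
  have h1 : a < 11 := by
    omega
  have h2 : (11 : Nat) < 12 := by
    -- the original tactic here failed to type-check; the Sorrifier removed
    -- the offending line and inserted a placeholder, leaving this goal open
    sorry
  omega
```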
3.3 Auto Solver
The Auto Solver targets each `sorry` block in turn. It first invokes Lean4's `hint` tactic to close the goal. If goals persist, it applies built-in solvers (`nlinarith`, `ring`, `simp`, etc.) wrapped in `try` to test combinations automatically. Blocks whose goals remain open stay marked with `sorry`. After applying the Auto Solver, a proof may be complete with no `sorry`s, in which case Apollo has already succeeded in fixing the proof. Otherwise, the process can repeat recursively.
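The cascade idea can be sketched as follows, assuming Mathlib's tactic library is available; the exact set and ordering of solvers Apollo tries may differ from this illustration:

```lean
import Mathlib.Tactic

-- Try a sequence of built-in solvers until one closes the goal; if every
-- candidate fails, the block would stay marked with `sorry`.
example (x : ℝ) (hx : 0 ≤ x) : 0 ≤ x ^ 2 := by
  first
    | nlinarith
    | ring_nf
    | simp
    | positivity
```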
3.4 Recursive reasoning and repair
In the case where a proof still has some remaining `sorry` statements after applying the Auto Solver, Apollo can consider each one as a new lemma, i.e., extract its local context (hypotheses, definitions, and any lemmas proved so far), and recursively try to prove the lemmas by prompting the LLM and repeating the whole process of verification, syntax refining, sorrifying, and applying the Auto Solver.
At each of these recursive iterations, a lemma may be completely proved, or it may end up with an incomplete proof containing one or a few `sorry` blocks. This allows Apollo to make progress on proving the original theorem by breaking down the incomplete steps further and further until the LLM or the Auto Solver can close the goals. This process can continue up to a user-specified recursion depth $r$.
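For instance, a remaining `sorry` goal can be lifted into a standalone lemma whose arguments are the local hypotheses. The statement below is a hypothetical example of such an extraction, handed back to the LLM as a fresh theorem:

```lean
-- Extracted sub-goal: the local context (a, b, h1) becomes explicit
-- arguments of a new lemma, to be proved by a recursive round of the
-- pipeline (LLM prompt, syntax refining, sorrifying, auto solving).
lemma extracted_subgoal (a b : Nat) (h1 : Nat.gcd a b = 6) : 6 ∣ a := by
  sorry
```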
3.5 Proof Assembler
Finally, the Proof Assembler splices all repaired blocks back into a single file and verifies that no `sorry` or `admit` (an alias for `sorry`) commands remain. If the proof still fails, the entire pipeline repeats (up to a user-specified recursion depth $r$), allowing further rounds of refinement and sub-proof generation.
Apollo's staged approach of syntax cleaning, "sorrifying," automated solving, and targeted LLM-driven repair yields improvements in both proof-sample efficiency and final proof correctness.
4 Experiments
In this section, we present empirical results for Apollo on the miniF2F dataset zheng2022minif2fcrosssystembenchmarkformal , which comprises 488 formalized problems drawn from AIME, IMO, and AMC competitions. The benchmark is evenly split into validation and test sets (244 problems each); here we report results on the miniF2F-test partition. To demonstrate Apollo's effectiveness, we evaluate it on a range of state-of-the-art whole-proof generation models. All experiments use Lean v4.17.0 and run on eight NVIDIA A5000 GPUs with 128 CPU cores. We used pass@32 sampling during sub-proof generation unless stated otherwise. Baseline numbers are sourced from each model's original publication.
4.1 The effect of applying Apollo on top of the base models
Figure 3 shows the impact of Apollo on two state-of-the-art whole-proof generation models: Goedel-Prover-SFT lin2025goedelproverfrontiermodelopensource and Kimina-Prover-Preview-Distill-7B kimina_prover_2025 . Applying Apollo increases Goedel-Prover-SFT's top accuracy from 64.7% to 65.6% while reducing its sampling budget by two orders of magnitude (from 25,600 generated samples to only a few hundred on average). Similarly, Kimina-Prover-Preview-Distill-7B achieves a new best accuracy of 75.0% with roughly one-third of the previous sample budget. We report the exact numbers in Table 1.
<details>
<summary>x4.png Details</summary>

### Visual Description
## Line Charts: Performance of Language Models
### Overview
The image presents two line charts comparing the performance (Accuracy in %) of different language models against a varying pass budget, plotted on a logarithmic x-axis in the first chart. The first chart covers "Goedel-SFT" and "Goedel-SFT + Apollo", while the second covers "Kimina-Prover-Preview-Distill-7B" and "Kimina-Prover-Preview-Distill-7B + Apollo".
### Components/Axes
**Chart 1: Performance of Goedel-SFT**
* **X-axis:** Pass budget (K) - log scale. Markers at 32, 80, 150, 306, 1.0k, 25.6k.
* **Y-axis:** Accuracy (%) - Scale from approximately 58% to 65%.
* **Legend:**
* Blue Line: Goedel-Prover-SFT
* Orange Line: Goedel-Prover-SFT + Apollo
**Chart 2: Performance of Kimina-Prover-Preview-Distill-7B**
* **X-axis:** Pass budget (K) - Scale from 0 to 1000. Markers at 0, 200, 400, 600, 800, 1000.
* **Y-axis:** Accuracy (%) - Scale from approximately 64% to 75%.
* **Legend:**
* Red Line: Kimina-Prover-Preview-Distill-7B
* Green Line: Kimina-Prover-Preview-Distill-7B + Apollo
### Detailed Analysis or Content Details
**Chart 1: Goedel-SFT Performance**
* **Goedel-Prover-SFT (Blue Line):** The line slopes upward, indicating increasing accuracy with increasing pass budget.
* At 32K: Approximately 58.5% accuracy.
* At 80K: Approximately 60.2% accuracy.
* At 150K: Approximately 61.8% accuracy.
* At 306K: Approximately 62.5% accuracy.
* At 1.0K: Approximately 62.7% accuracy.
* At 25.6K: Approximately 64.5% accuracy.
* **Goedel-Prover-SFT + Apollo (Orange Line):** The line initially rises sharply, then plateaus.
* At 32K: Approximately 63.5% accuracy.
* At 80K: Approximately 64.2% accuracy.
* At 150K: Approximately 65.0% accuracy.
* At 306K: Approximately 65.2% accuracy.
* At 1.0K: Approximately 65.1% accuracy.
* At 25.6K: Approximately 65.0% accuracy.
**Chart 2: Kimina-Prover-Preview-Distill-7B Performance**
* **Kimina-Prover-Preview-Distill-7B (Red Line):** The line slopes upward, but at a decreasing rate.
* At 0K: Approximately 64.2% accuracy.
* At 200K: Approximately 68.5% accuracy.
* At 400K: Approximately 69.5% accuracy.
* At 600K: Approximately 69.8% accuracy.
* At 800K: Approximately 70.0% accuracy.
* At 1000K: Approximately 70.2% accuracy.
* **Kimina-Prover-Preview-Distill-7B + Apollo (Green Line):** The line rises sharply initially, then plateaus at a higher accuracy than the base model.
* At 0K: Approximately 71.5% accuracy.
* At 200K: Approximately 74.2% accuracy.
* At 400K: Approximately 74.5% accuracy.
* At 600K: Approximately 74.4% accuracy.
* At 800K: Approximately 74.3% accuracy.
* At 1000K: Approximately 74.2% accuracy.
### Key Observations
* For both models, adding "Apollo" consistently improves accuracy, especially at lower pass budgets.
* The improvement from "Apollo" diminishes as the pass budget increases, suggesting a point of diminishing returns.
* The Goedel-SFT model shows a more consistent improvement with increasing pass budget compared to the Kimina model.
* The Kimina model with Apollo starts at a significantly higher accuracy than the Goedel model with Apollo.
### Interpretation
The charts demonstrate the impact of "Pass budget" and the "Apollo" enhancement on the accuracy of two different language models. The "Pass budget" likely represents the computational resources allocated to the model during a verification or training process. The "Apollo" component appears to be an additional module or technique that boosts performance, particularly when computational resources are limited.
The diminishing returns observed with increasing pass budget and "Apollo" suggest that there's a trade-off between computational cost and accuracy gains. The initial steep increase in accuracy with "Apollo" indicates that it effectively leverages limited resources, while the plateau suggests that further investment in pass budget yields smaller improvements.
The difference in the overall accuracy levels between the Goedel and Kimina models suggests inherent differences in their architectures or training data. The Kimina model, even without Apollo, achieves higher accuracy than the Goedel model, indicating a potentially more robust or efficient base model. The charts provide valuable insights for optimizing resource allocation and model selection based on desired accuracy levels and computational constraints.
</details>
(a) Model accuracy against sampling budget
<details>
<summary>x5.png Details</summary>

Two line charts of accuracy (%) on miniF2F-test versus generated-token budget $N$ (log-scale x-axes). Left, Goedel-Prover-SFT (16.3K to 12.7M tokens): the base model (blue) rises from roughly 58.8% to roughly 64.5%, while the + Apollo curve (orange) reaches about 65% by roughly 406K tokens, a small fraction of the roughly 12.7M tokens the base model spends to reach 64.5%. Right, Kimina-Prover-Preview-Distill-7B (140K to 4.5M tokens): the base model (red) rises from about 64% to about 69.5%, while the + Apollo curve (green) jumps to roughly 74.5% by 1.3M tokens and plateaus. At matched token budgets, Apollo gives the largest advantage in the low-budget regime.
</details>
(b) Model accuracy against generated token budget
Figure 3: Performance of the base models versus the Apollo-aided models on the miniF2F-test dataset at different sample budgets and token lengths.
Note that Apollo's sampling budget is not fixed: it depends on the recursion depth $r$ and the LLM's ability to generate sub-lemmas. For each sub-lemma that Auto Solver fails to close, we invoke the LLM with a fixed top-32 sampling budget. Because each generated sub-lemma may spawn further sub-lemmas, the total sampling overhead grows recursively, making any single pass@K budget difficult to report. Sub-lemmas also tend to require far fewer tactics (and thus far fewer tokens) than the main theorem: a theorem might need 100 lines of proof while a sub-lemma might need only one. Hence, sampling cost does not scale one-to-one. To compare with other whole-proof generation approaches, we therefore report the average number of proof samples and token budgets. We present a more in-depth analysis of inference budgets in Section B of the Appendix.
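To make the recursive growth concrete, the following toy model sketches how the expected number of LLM samples could scale with recursion depth. The per-goal budget `k` matches the top-32 setting above, but the branching factor is a hypothetical parameter chosen for illustration, not a measured property of Apollo.

```python
def expected_samples(depth: int, k: int = 32, branching: float = 1.2) -> float:
    """Toy model of total sampling cost under recursive repair.

    Assumes each round spends k samples per open goal and that, on
    average, each round leaves `branching` unsolved sub-lemmas per goal.
    The branching value is an illustrative assumption, not a measurement.
    """
    total, open_goals = 0.0, 1.0
    for _ in range(depth + 1):
        total += k * open_goals
        open_goals *= branching  # sub-lemmas may spawn further sub-lemmas
    return total

# Geometric growth in depth: at these toy settings, depth 5 lands in the
# low hundreds of samples, far below a flat pass@25600 budget.
for r in (0, 2, 5):
    print(r, round(expected_samples(r)))
```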
Table 1: Results comparing sampling cost, accuracy, and token usage before and after applying Apollo. "Cost" denotes the sampling budget $K$ for whole-proof generation and the recursion depth $r$ for Apollo. Column $N$ reports the average number of tokens generated per problem in "Chain-of-Thought" mode. Bold cells highlight the best accuracy. Results are shown on the miniF2F-test dataset for two models: Kimina-Prover-Preview-Distill-7B and Goedel-SFT.
| Goedel-Prover-SFT | | | | | | Kimina-Prover-Preview-Distill-7B | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base Model | | | + Apollo | | | Base Model | | | + Apollo | | |
| $K$ | $N$ | Acc. (%) | $r$ | $N$ | Acc. (%) | $K$ | $N$ | Acc. (%) | $r$ | $N$ | Acc. (%) |
| 32 | 16.1K | 57.6% | 0 | 16.1K | 57.6% | 32 | 140K | 63.1% | 0 | 140K | 63.1% |
| 64 | 31.8K | 59.2% | 2 | 38.3K | 60.7% | | | | 2 | 406K | 68.9% |
| 3200 | 1.6M | 62.7% | 4 | 74.6K | 63.5% | | | | 4 | 816K | 74.2% |
| 25600 | 12.7M | 64.7% | 6 | 179.0K | **65.6%** | 1024 | 4.5M | 70.8% | 6 | 1.3M | **75.0%** |
4.2 Comparison with SoTA LLMs
Table 2 compares whole-proof generation and tree-search methods on miniF2F-test. For each approach, we report its sampling budget, defined as the top-$K$ parameter for LLM generators or the equivalent search-depth parameters for tree-search provers. Since Apollo's budget varies with recursion depth $r$, we instead report the mean number of proofs sampled during the procedure. We additionally report the average number of generated tokens as another measure of the computational cost of proof generation. For models whose generated-token counts could not be collected, we leave the token column blank and report the sampling budget only.
When Apollo is applied on top of any generator, it establishes a new best accuracy for that model. For instance, Goedel-Prover-SFT's accuracy jumps to 65.6%, surpassing the previous best result (which required a budget of 25,600 samples); Apollo requires only 362 samples on average. Likewise, Kimina-Prover-Preview-Distill-7B sees a 4.2% accuracy gain while reducing its sample budget below the original 1,024. To further validate Apollo's generalizability, we tested the repair framework on Goedel-V2-8B lin2025goedelproverv2, the current state-of-the-art theorem prover. We observe that, at similar sample and token budgets, Apollo achieves a 0.9% accuracy gain, whereas the base LLM requires twice the budget to reach the same accuracy. Overall, Apollo not only boosts accuracy but does so with a smaller sampling budget, setting a new state-of-the-art result among sub-8B-parameter LLMs with a sampling budget of 32.
We also evaluate general-purpose LLMs (OpenAI o3-mini and o4-mini). Without Apollo, these models rarely produce valid Lean4 proofs, since they default to Lean3 syntax or introduce trivial compilation errors. Yet they have a remarkable grasp of mathematical concepts, evident in the proof structures they produce. By applying Apollo's syntax corrections and solver-guided refinements, their accuracies increase from single digits (3-7%) to over 40%. These results demonstrate that, in many scenarios, discarding and regenerating the whole proof is overly wasteful: with a little help from the Lean compiler, we can achieve better accuracies while sampling fewer proofs from LLMs.
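As an illustration of the syntax gap, consider the two styles below for a toy statement (the example is ours, not drawn from the benchmark): Lean3 wraps tactic proofs in `begin ... end`, which no longer compiles in Lean4, where a `by` block is used instead.

```lean
-- Lean3 style, as general-purpose models often emit (fails in Lean4):
-- theorem two_add_two : 2 + 2 = 4 := begin norm_num end

-- Lean4 / Mathlib style, the form Apollo's Syntax Refiner targets:
theorem two_add_two : 2 + 2 = 4 := by norm_num
```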
Table 2: Comparison of Apollo accuracy against state-of-the-art models on the miniF2F-test dataset. Token budget is computed over all tokens generated by the LLM.
| Method | Model size | Sample budget | Token budget | miniF2F-test |
| --- | --- | --- | --- | --- |
| Tree Search Methods | | | | |
| Hypertree Proof Search (lample2022hypertree) | 0.6B | $64\times 5000$ | - | 41.0% |
| IntLM2.5-SP+BFS+CG (wu2024internlm25stepproveradvancingautomatedtheorem) | 7B | $256\times 32\times 600$ | - | 65.9% |
| HPv16+BFS+DC (li2025hunyuanproverscalabledatasynthesis) | 7B | $600\times 8\times 400$ | - | 68.4% |
| BFS-Prover (xin2025bfsproverscalablebestfirsttree) | 7B | $2048\times 2\times 600$ | - | 70.8% |
| Whole-proof Generation Methods | | | | |
| DeepSeek-R1-Distill-Qwen-7B (deepseekai2025deepseekr1incentivizingreasoningcapability) | 7B | 32 | - | 42.6% |
| Leanabell-GD-RL (zhang2025leanabellproverposttrainingscalingformal) | 7B | 128 | - | 61.1% |
| STP (dong2025stp) | 7B | 25600 | - | 67.6% |
| o3-mini (openai-o3-mini-2025) | - | 1 | - | 3.3% |
| | | 32 | - | 24.6% |
| o4-mini (openai-o4-mini-2025) | - | 1 | - | 7.0% |
| Goedel-SFT (lin2025goedelproverfrontiermodelopensource) | 7B | 32 | 16.1K | 57.6% |
| | | 3200 | 1.6M | 62.7% |
| | | 25600 | 12.7M | 64.7% |
| Kimina-Prover-Preview-Distill (kimina_prover_2025) | 7B | 1 | 4.4K | 52.5% |
| | | 32 | 140K | 63.1% |
| | | 1024 | 4.5M | 70.8% |
| Goedel-V2 (lin2025goedelproverv2) | 8B | 32 | 174K | 83.3% |
| | | 64 | 349K | 84.0% |
| | | 128 | 699K | 84.9% |
| Whole-proof Generation Methods + Apollo | | | | |
| o3-mini + Apollo | - | 8 | - | 40.2% (+36.9%) |
| o4-mini + Apollo | - | 15 | - | 46.7% (+39.7%) |
| Goedel-SFT + Apollo | 7B | 362 | 179K | 65.6% (+0.9%) |
| Kimina-Preview + Apollo | 7B | 307 | 1.3M | 75.0% (+4.2%) |
| Goedel-V2 + Apollo | 8B | 63 | 344K | 84.9% (+0.9%) |
To further assess Apollo, we conducted experiments on PutnamBench tsoukalas2024putnambenchevaluatingneuraltheoremprovers, using Kimina-Prover-Preview-Distill-7B as the base model. Under the whole-proof generation pipeline, the LLM produced 10 valid proofs. With Apollo, the LLM produced one additional valid proof while consuming nearly half as many tokens. Results are presented in Table 3.
Overall, our results on a variety of LLMs and benchmarks demonstrate that Apollo consistently consumes fewer tokens while achieving higher accuracy, highlighting its effectiveness.
Table 3: Comparison of Apollo accuracy on the PutnamBench dataset. Token budget is computed over all tokens generated by the LLM.
| Method | Model size | Sample budget | Token budget | PutnamBench |
| --- | --- | --- | --- | --- |
| Kimina-Preview | 7B | 32 | 180K | 7/658 |
| Kimina-Preview | 7B | 192 | 1.1M | 10/658 |
| Kimina-Preview+Apollo | 7B | 108 | 579K | 11/658 |
4.3 Distribution of Proof Lengths
We assess how Apollo affects proof complexity by examining proof-length distributions before and after repair. Here, proof length is defined as the total number of tactics in a proof, which serves as a proxy for proof complexity.
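The paper does not specify its tactic-counting implementation; a faithful count would walk Lean's syntax tree. As a purely illustrative approximation, one can count the non-empty, non-comment lines inside a tactic block:

```python
def proof_length(lean_proof: str) -> int:
    """Heuristic proof-length proxy: number of non-empty, non-comment
    lines after the first `by` keyword. Illustrative only; a faithful
    tactic count would parse Lean's syntax tree."""
    count, in_proof = 0, False
    for line in lean_proof.splitlines():
        stripped = line.strip()
        if not in_proof:
            # enter the tactic block at the first `:= by` (or trailing `by`)
            if stripped.endswith("by") or ":= by" in stripped:
                in_proof = True
            continue
        if stripped and not stripped.startswith("--"):
            count += 1
    return count

example = """theorem two_mul (n : Nat) : 2 * n = n + n := by
  induction n with
  | zero => rfl
  | succ k ih => rw [Nat.mul_succ, ih]
"""
print(proof_length(example))
```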
Figure 4 presents violin plots for three base models: Kimina-Prover-Preview-Distill-7B, Goedel-Prover-SFT, and o4-mini. Each subplot shows two non-overlapping sets: proofs generated directly by the base model ("before") and proofs produced after applying Apollo ("after"). We only consider valid proofs verified by REPL in this setup.
The proofs that Apollo succeeds in fixing, in collaboration with the LLM, are considerably longer. In each model scenario, the mean length of proofs fixed by Apollo exceeds that of proofs written by the LLM alone by at least a factor of two. This increase indicates that the theorems the base model alone could not prove require longer, more structured reasoning chains. By decomposing challenging goals into smaller sub-lemmas, Apollo enables the generation of these extended proofs, thereby improving the overall success rate.
These results demonstrate that Apollo not only raises accuracy but also systematically guides models to construct deeper, more rigorous proof structures capable of solving harder theorems.
<details>
<summary>x6.png Details</summary>

Violin plots (with overlaid boxplots) comparing proof-length distributions for Kimina-7B versus Kimina-7B + Apollo; the y-axis is proof length (0 to 200). The base model's distribution concentrates at short proofs (mean 11.1), while the + Apollo distribution is wider and shifted upward (mean 45.4). Both distributions are right-skewed, with the Apollo tail extending considerably further.
</details>
(a) Kimina-Preview-Distill-7B
<details>
<summary>x7.png Details</summary>

Violin plots comparing proof-length distributions for Goedel-Prover-SFT versus Goedel-Prover-SFT + Apollo; the y-axis is proof length (0 to roughly 60). The base model's proofs average 6.5 tactics with a tight spread, while the + Apollo proofs average 13.0 with a visibly wider distribution.
</details>
(b) Goedel-SFT
<details>
<summary>x8.png Details</summary>

Violin plots comparing proof-length distributions for o4-mini versus o4-mini + Apollo; the y-axis is proof length (0 to roughly 60). The base model's proofs are short and concentrated (mean 3.8, extending to about 20), while the + Apollo proofs are longer and far more dispersed (mean 13.0, extending to about 60).
</details>
(c) OpenAI o4-mini
Figure 4: Distribution of proof lengths for selected models before (left) and after (right) applying Apollo. In the "before" plots, proof lengths cover only proofs the base model proved without the help of Apollo; in the "after" plots, proof lengths cover only proofs proved with Apollo's assistance.
<details>
<summary>x9.png Details</summary>

Three line charts of accuracy (%) versus recursion depth $r$, one per model. o4-mini (blue, $r=0$ to $4$): accuracy jumps from roughly 9% at $r=0$ to roughly 38% at $r=1$, then grows slowly into the mid-40s. Goedel-Prover-SFT (red, $r=0$ to $6$): accuracy rises from about 58% to about 64%, largely plateauing by $r=4$. Kimina-Prover-Preview-Distill-7B (purple, $r=0$ to $6$): accuracy climbs steadily to about 75% by $r=6$. All three models improve with recursion depth, with the sharpest gains at small $r$.
</details>
Figure 5: Performance of Apollo on the miniF2F dataset at various recursion depth parameters $r$ on three different models: o4-mini, Kimina-Prover-Preview-Distill-7B, and Goedel-Prover-SFT.
4.4 Impact of recursion depth $r$ on proof accuracy
We evaluate how the recursion-depth parameter $r$ affects accuracy on miniF2F-test. A recursion depth of $r=0$ corresponds to the base model's standalone performance; higher $r$ values permit additional rounds of Apollo. Figure 5 plots accuracy versus $r$ for three models: o4-mini ($r=0,\dots,4$), Kimina-Prover-Preview-Distill-7B ($r=0,\dots,6$), and Goedel-Prover-SFT ($r=0,\dots,6$).
The o4-mini model exhibits its largest accuracy gain between $r=0$ and $r=1$, since its raw proof structure is often correct but contains many syntax and proof-step errors. The Syntax Refiner fixes most of them and, together with the Auto Solver, produces a spike in accuracy. Beyond $r=1$, accuracy largely plateaus, since o4-mini was not explicitly designed for formal theorem proving; even so, it continues to creep upward, suggesting that pushing $r$ further could raise o4-mini's accuracy still more. In contrast, both Kimina-Prover-Preview-Distill-7B and Goedel-Prover-SFT continue to improve steadily across recursion depths. These results suggest that decomposing challenging problems into smaller parts allows the model to close simpler goals, gradually building toward a complete proof.
Although the total number of proof samples grows with $r$, even $r=5$ requires only 300–400 LLM requests on average, an order of magnitude below typical budgets in the literature. One could further increase $r$ to explore budgets in the thousands if desired.
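Schematically, one recursion level of the repair procedure described above can be written as the following pseudocode; the helper names (`compile`, `sorrify`, `auto_solve`, `llm_prove`, `fill_sorry`) are placeholders for the Lean REPL interface, the Sorrifier, Auto Solver, LLM calls, and Proof Assembler, not Apollo's actual API:

```
repair(theorem, proof, r, depth):
    ok, failing_goals = compile(theorem, proof)      # Lean REPL check
    if ok: return proof
    if depth >= r: return failure                    # recursion budget spent
    sketch = sorrify(proof, failing_goals)           # failing steps -> `sorry`
    for goal in failing_goals:
        sub = auto_solve(goal)                       # try built-in solvers first
        if sub is failure:
            sub = llm_prove(goal, k = 32)            # low per-goal budget
            sub = repair(goal, sub, r, depth + 1)    # sub-lemmas may recurse
        if sub is failure: return failure
        sketch = fill_sorry(sketch, goal, sub)       # reassemble the proof
    return sketch
```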
4.5 Ablation Studies on different components of Apollo
To evaluate the contribution of each Apollo component, we conducted an ablation study on two base models: o4-mini (a general-purpose model) and Kimina-Prover-Preview-Distill-7B (a theorem-prover model). Results are reported in Table 4. We decompose Apollo into three key modules: Syntax Refiner, Auto Solver, and LLM Re-invoker, where the LLM Re-invoker is the module that re-invokes the LLM to prove the sorrified sub-goals. The Sorrifier and Proof Assembler modules are excluded from this analysis: they are essential for decomposing problems into sub-goals and assembling proofs, and thus cannot be ablated.
From the results, we observe that the Syntax Refiner alone provides only a marginal benefit, and its effect is limited to the general-purpose model o4-mini; it yields no improvement on the dedicated prover. The Auto Solver, used in isolation, also produces little accuracy gain, indicating that built-in solvers usually cannot close the sub-goals that remain after the first round of LLM proof generation. Further reasoning and proof decomposition on those sub-goals by the LLM, however, can significantly increase accuracy: the largest gains arise when the Auto Solver and LLM Re-invoker are combined, which corresponds to the full Apollo framework. This demonstrates that Apollo's repair mechanism is crucial for achieving robust performance across both model types. We extend the ablation to individual module triggers during inference in Section E of the Appendix.
Table 4: Ablation study of different parts of Apollo on o4-mini (general purpose model) and Kimina-Prover-Preview-Distill-7B (theorem prover model).
| Apollo Modules | | | Model Accuracy | |
| --- | --- | --- | --- | --- |
| Syntax Refiner | Auto Solver | LLM Re-Invoker | o4-mini | Kimina-7B |
| ✗ | ✗ | ✗ | 7.0% | 63.1% |
| ✓ | ✗ | ✗ | 7.4% | 63.1% |
| ✗ | ✓ | ✗ | 7.0% | 63.5% |
| ✗ | ✗ | ✓ | 8.2% | 69.3% |
| ✓ | ✓ | ✗ | 20.5% | 63.5% |
| ✓ | ✗ | ✓ | 18.9% | 69.3% |
| ✗ | ✓ | ✓ | 10.2% | 75.0% |
| ✓ | ✓ | ✓ | 46.7% | 75.0% |
5 Limitations
Integration with tree-search methods. Our evaluation focuses exclusively on applying Apollo to whole-proof generation by LLMs and does not explore its interaction with traditional tree-search provers (e.g. BFS, MCTS). Investigating how lemma decomposition and targeted repair could accelerate or prune search paths in those systems is a promising direction for future work.
Dependence on base model's proof sketch quality. Apollo's ability to produce a correct formal proof largely depends on the base model and on whether its initial attempt contains a coherent proof sketch. As [44] observe, models often tend to "cut corners," producing proofs that are much shorter than the rigorous, multi-step arguments required for a valid proof. When a base model fails to adopt a correct proof strategy (e.g. omitting critical lemmas, suggesting irrelevant tactics, or taking a wrong approach), Apollo is less likely to repair the resulting proof. Enhancing Apollo's robustness to very poor initial sketches remains an open challenge.
6 Conclusion
In this work, we present Apollo, a novel, modular, fully automated agentic system that combines syntax cleaning, automatic solvers, and LLM-driven sub-proof generation to transform an LLM's initial proof sketch into a fully verified Lean 4 proof. The framework harnesses the full power of the Lean compiler and integrated solvers, merging them with LLM systems. Applied across five whole-proof generation models, ranging from general-purpose LLMs (o3-mini, o4-mini) to specialized provers (Kimina-Prover-Preview-Distill-7B, Goedel-Prover-SFT, Goedel-Prover-V2), Apollo consistently establishes new best accuracies on the miniF2F benchmark while reducing token and sampling budgets by one to two orders of magnitude. Our empirical analysis shows that Apollo not only raises overall proof success rates but also guides models to generate longer, more structured proofs, as evidenced by a drastic increase in average successful proof length. We further demonstrate how the recursion-depth parameter $r$ trades off sample complexity against accuracy, achieving robust gains with only a few hundred samples. We believe that Apollo's collaboration between LLMs, automatic solvers, and the Lean compiler defines a paradigm of agentic systems that produce high-quality, verifiable proofs for increasingly challenging theorems.
Acknowledgments
The work of Farzan Farnia is partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China, Project 14209920, and is partially supported by CUHK Direct Research Grants with CUHK Project No. 4055164 and 4937054. Also, the authors would like to thank the anonymous reviewers for their constructive feedback.
References
- (1) Kaiyu Yang, Gabriel Poesia, Jingxuan He, Wenda Li, Kristin Lauter, Swarat Chaudhuri, and Dawn Song. Formal mathematical reasoning: A new frontier in AI. In Proceedings of the International Conference on Machine Learning, 2025.
- (2) Google DeepMind. AI achieves silver-medal standard solving international mathematical olympiad problems. https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/, 2024. Accessed: 2025-05-08.
- (3) Trieu Trinh, Yuhuai Wu, Quoc Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 2024.
- (4) Leonardo de Moura and Sebastian Ullrich. The lean 4 theorem prover and programming language. In Automated Deduction – CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings, pages 625–635, Berlin, Heidelberg, 2021. Springer-Verlag.
- (5) Alex Kontorovich and Terence Tao. Prime number theorem and more. https://github.com/AlexKontorovich/PrimeNumberTheoremAnd, 2024. Version 0.1.0, released 2024-01-09.
- (6) Kevin Buzzard. Mathematical reasoning and the computer. Bulletin of the American Mathematical Society, 2024.
- (7) George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition, 2024.
- (8) Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. miniF2F: a cross-system benchmark for formal Olympiad-level mathematics, 2022.
- (9) Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. Proofnet: Autoformalizing and formally proving undergraduate-level mathematics, 2023.
- (10) Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf, 2024. GitHub repository.
- (11) Jiewen Hu, Thomas Zhu, and Sean Welleck. miniCTX: Neural theorem proving with (long-)contexts. In The Thirteenth International Conference on Learning Representations, 2025.
- (12) Dongwei Jiang, Marcio Fonseca, and Shay B Cohen. Leanreasoner: Boosting complex logical reasoning with lean. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7490–7503, 2024.
- (13) Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen. A survey of learning-based automated program repair. ACM Trans. Softw. Eng. Methodol., 33(2), December 2023.
- (14) Kavi Gupta, Peter Ebert Christensen, Xinyun Chen, and Dawn Song. Synthesize, execute and debug: Learning to repair for neural program synthesis. In Advances in Neural Information Processing Systems, volume 33, pages 17685–17695. Curran Associates, Inc., 2020.
- (15) Michihiro Yasunaga and Percy Liang. Break-it-fix-it: Unsupervised learning for program repair. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 11941–11952. PMLR, 18–24 Jul 2021.
- (16) Berkay Berabi, Jingxuan He, Veselin Raychev, and Martin Vechev. Tfix: Learning to fix coding errors with a text-to-text transformer. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 780–791. PMLR, 18–24 Jul 2021.
- (17) Miltiadis Allamanis, Henry Jackson-Flux, and Marc Brockschmidt. Self-supervised bug detection and repair. In Advances in Neural Information Processing Systems, volume 34, pages 27865–27876. Curran Associates, Inc., 2021.
- (18) Talia Ringer. Proof Repair. PhD thesis, University of Washington, 2021. Ph.D. dissertation.
- (19) Yves Bertot and Pierre Castéran. Interactive theorem proving and program development: Coq'Art: the calculus of inductive constructions. Springer Science & Business Media, 2013.
- (20) Albert Qiaochu Jiang, Wenda Li, Szymon Tworkowski, Konrad Czechowski, Tomasz Odrzygóźdź, Piotr MiƂoƛ, Yuhuai Wu, and Mateja Jamnik. Thor: Wielding hammers to integrate language models and automated theorem provers. In Advances in Neural Information Processing Systems, 2022.
- (21) Emily First, Markus N. Rabe, Talia Ringer, and Yuriy Brun. Baldur: Whole-proof generation and repair with large language models, 2023.
- (22) Peiyang Song, Kaiyu Yang, and Anima Anandkumar. Towards large language models as copilots for theorem proving in lean. In The 3rd Workshop on Mathematical Reasoning and AI at NeurIPSâ23, 2023.
- (23) Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. Formal mathematics statement curriculum learning, 2022.
- (24) Zijian Wu, Suozhi Huang, Zhejian Zhou, Huaiyuan Ying, Jiayu Wang, Dahua Lin, and Kai Chen. Internlm2.5-stepprover: Advancing automated theorem proving via expert iteration on large-scale lean problems, 2024.
- (25) Ran Xin, Chenguang Xi, Jie Yang, Feng Chen, Hang Wu, Xia Xiao, Yifan Sun, Shen Zheng, and Kai Shen. BFS-Prover: Scalable best-first tree search for llm-based automatic theorem proving, 2025.
- (26) Guillaume Lample, Marie-Anne Lachaux, Thibaut Lavril, Xavier Martinet, Amaury Hayat, Gabriel Ebner, Aurélien Rodriguez, and Timothée Lacroix. Hypertree proof search for neural theorem proving. In Proceedings of the 36th International Conference on Neural Information Processing Systems, pages 26337–26349, 2022.
- (27) Haiming Wang et al. Dt-solver: Automated theorem proving with dynamic-tree sampling guided by proof-level value function. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12632–12646, 2023.
- (28) Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models, October 2023. arXiv:2306.15626.
- (29) Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving, 2020.
- (30) Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, and Chong Ruan. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search, 2024.
- (31) Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover: A frontier model for open-source automated theorem proving, 2025.
- (32) Jingyuan Zhang, Qi Wang, Xingguang Ji, Yahui Liu, Yang Yue, Fuzheng Zhang, Di Zhang, Guorui Zhou, and Kun Gai. Leanabell-prover: Posttraining scaling in formal reasoning, 2025.
- (33) Kefan Dong and Tengyu Ma. STP: Self-play LLM theorem provers with iterative conjecturing and proving. arXiv preprint arXiv:2502.00212, 2025.
- (34) Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, Junqi Liu, Marco Dos Santos, Flood Sung, Marina Vinyes, Zhenzhe Ying, Zekai Zhu, Jianqiao Lu, Hugues de Saxcé, Bolton Bailey, Chendong Song, Chenjun Xiao, Dehao Zhang, Ebony Zhang, Frederick Pu, Han Zhu, Jiawei Liu, Jonas Bayer, Julien Michel, Longhui Yu, Léo Dreyfus-Schmidt, Lewis Tunstall, Luigi Pagani, Moreira Machado, Pauline Bourigault, Ran Wang, Stanislas Polu, Thibaut Barroyer, Wen-Ding Li, Yazhe Niu, Yann Fleureau, Yangyang Hu, Zhouliang Yu, Zihan Wang, Zhilin Yang, Zhengying Liu, and Jia Li. Kimina-Prover Preview: Towards large formal reasoning models with reinforcement learning, 2025.
- (35) Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen Marcus McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. In The Twelfth International Conference on Learning Representations, 2024.
- (36) Albert Q. Jiang et al. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs, 2023.
- (37) Haiming Wang, Huajian Xin, Chuanyang Zheng, Zhengying Liu, Qingxing Cao, Yinya Huang, Jing Xiong, Han Shi, Enze Xie, Jian Yin, Zhenguo Li, and Xiaodan Liang. LEGO-prover: Neural theorem proving with growing libraries. In The Twelfth International Conference on Learning Representations, 2024.
- (38) Lean FRO. A read-eval-print-loop for Lean 4. https://github.com/leanprover-community/repl, 2023.
- (39) OpenAI. Openai o3-mini. https://openai.com/index/openai-o3-mini/, January 2025.
- (40) OpenAI. Introducing openai o3 and o4-mini. https://openai.com/index/introducing-o3-and-o4-mini/, April 2025.
- (41) Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover-v2: The strongest open-source theorem prover to date, 2025.
- (42) Yang Li, Dong Du, Linfeng Song, Chen Li, Weikang Wang, Tao Yang, and Haitao Mi. HunyuanProver: A scalable data synthesis framework and guided tree search for automated theorem proving, 2025.
- (43) DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in llms via reinforcement learning, 2025.
- (44) Roozbeh Yousefzadeh, Xuenan Cao, and Azim Ospanov. A Lean dataset for International Math Olympiad: Small steps towards writing math proofs for hard problems. Transactions on Machine Learning Research, 2025.
- (45) Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, and Chong Ruan. DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search, August 2024. arXiv:2408.08152.
Appendix A Additional implementation details
Algorithm 1 presents high-level pseudocode of the Apollo framework. We highlight that the framework is built upon recursive function calls up to a maximum recursion depth given by the parameter $r$. $\mathbf{LLM}$ refers to a proof-generator model that takes in a problem statement and outputs a formal proof attempt. $\mathbf{LeanCompiler}$ refers to the Lean REPL, which verifies the proof and outputs True or False; in the case of False, it also provides a detailed list of errors, their locations and types, along with the error messages generated by the Lean compiler. The other functions in the pseudocode ($\mathbf{SyntaxRefiner}$, $\mathbf{Sorrifier}$, $\mathbf{AutoSolver}$, $\mathbf{ProofAssembler}$) represent the sub-modules described in the main text. To give deeper insight into Apollo's design, we now outline the rationale and responsibilities of each core module.
A.1 Syntax Refiner
The Syntax Refiner class applies a set of rule-based corrections to eliminate syntax errors in LLM-generated Lean code. During model experimentation, we collected a corpus of common mistakes, such as stray Lean 3 keywords (`from`, `begin … end`), misplaced tactics, or missing delimiters, and encoded each repair as a regular-expression rule. For example:
- Convert `from by` to `:= by`.
- Replace Lean 3 blocks (`begin … end`) with the corresponding Lean 4 structure.
- Ensure that tactics like `rw` use correctly placed square brackets to avoid compilation errors.
In more complex cases, multiple regex patterns and replacements are composed to restore valid Lean 4 syntax without altering the logical content of the proof sketch.
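A minimal sketch of such rule-based rewriting (the three rules here mirror the examples above; Apollo's actual rule corpus is larger, and the real `begin … end` conversion also adjusts indentation):

```python
import re

# Each repair is a (compiled pattern, replacement) pair applied with re.sub.
RULES = [
    # Lean 3 `from by` -> Lean 4 `:= by`
    (re.compile(r"\bfrom\s+by\b"), ":= by"),
    # Simplified `begin ... end` -> `by ...` (real refiner also re-indents)
    (re.compile(r"\bbegin\b(.*?)\bend\b", re.DOTALL), r"by\1"),
    # Wrap a bare rewrite lemma in square brackets: `rw foo` -> `rw [foo]`
    (re.compile(r"\brw\s+([A-Za-z_][\w.]*)"), r"rw [\1]"),
]

def refine(source: str) -> str:
    """Apply every syntax-repair rule in order to the generated Lean code."""
    for pattern, repl in RULES:
        source = pattern.sub(repl, source)
    return source

print(refine("theorem t : 1 + 1 = 2 from by rw add_comm"))
```

Because the rules only touch surface syntax, the logical content of the proof sketch is preserved.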
A.2 Sorrifier
The Sorrifier module uses a feedback loop between Apollo and the Lean REPL as follows:
1. Apollo sends the current proof candidate to the REPL.
2. The REPL returns a list of error messages and their source locations.
3. Apollo parses the proof into a syntax tree, navigates to each failing sub-lemma, and attempts to fix it or replace it with a sorry placeholder.
4. The updated proof is re-submitted to the REPL, and this cycle repeats until no errors remain.
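As a hypothetical illustration (this theorem and its tactics are invented for exposition and do not appear in our evaluation), sorrification replaces a failing step with `sorry` so that the rest of the proof still compiles and the open goal can be repaired in isolation:

```lean
-- Before: the whole proof fails because one step does not close.
theorem ex (x : ℝ) (h : x ≥ 2) : x ^ 2 ≥ 4 := by
  have key : x * x ≥ 4 := by nlinarith   -- suppose this step errors
  calc x ^ 2 = x * x := by ring
    _ ≥ 4 := key

-- After sorrification: only `key` remains open as a `sorry` placeholder,
-- and the surrounding proof structure now type-checks.
theorem ex' (x : ℝ) (h : x ≥ 2) : x ^ 2 ≥ 4 := by
  have key : x * x ≥ 4 := by sorry
  calc x ^ 2 = x * x := by ring
    _ ≥ 4 := key
```

The open goal behind `sorry` is then extracted as a standalone sub-lemma for the Auto Solver or a fresh LLM call.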
If the REPL indicates that the original theorem statement itself is malformed (e.g. due to LLM hallucination), Apollo abandons the current proof and requests a fresh generation from the LLM. One example of a malformed formal statement is when the LLM introduces a variable in a hypothesis that was not previously defined. Such a case triggers a compilation error in the formal statement; therefore, when parsing the proof, Apollo catches this error and requests a fresh proof from the LLM. To guarantee consistent variable types during lemma extraction and proof assembly, Apollo sets the following Lean options to true:
```lean
set_option pp.instanceTypes true
set_option pp.numericTypes true
set_option pp.coercions.types true
set_option pp.letVarTypes true
set_option pp.structureInstanceTypes true
set_option pp.mvars.withType true
set_option pp.coercions true
set_option pp.funBinderTypes true
set_option pp.piBinderTypes true
```
so that the compiler reports explicit types (e.g. $\mathbb{Z}$ vs. $\mathbb{R}$ ). This fail-safe prevents trivialization of theorems and ensures correct typing in the final assembled proof.
A.3 AutoSolver
The AutoSolver module takes the "sorrified" proof and applies a sequence of tactics to close as many goals as possible:
1. Hint-based tactics. It first invokes `hint`, which proposes candidate tactics. We filter out any suggestions that merely make progress on the goal and retain only those that fully discharge it.
2. Built-in tactics. If `hint` fails, AutoSolver systematically tries a suite of powerful Lean tactics, such as `nlinarith`, `norm_num`, `norm_cast`, `ring_nf`, `simp`, `simp_all`, and others, often wrapping them in `try` to catch failures without aborting.
3. Recursive fallback. Any goals that remain unsolved are reported back to Apollo. Apollo queries the REPL for the exact goal statements, spawns corresponding sub-lemmas, and then invokes the full repair loop (LLM generation, Syntax Refiner, Sorrifier, and AutoSolver) on each sub-lemma.
This tiered strategy maximizes automation while ensuring that challenging subgoals are handled via targeted recursion.
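A minimal Lean 4 sketch of such a tactic cascade (the example goal is hypothetical; the tactic names are those listed above): `first` attempts each candidate closer in order, so a failing tactic simply falls through to the next instead of aborting the proof.

```lean
-- Try a sequence of increasingly general closers on a single goal.
example (a b : ℝ) (h : a ≤ b) : a ≤ b + 1 := by
  first
    | norm_num        -- pure numeric goals
    | nlinarith       -- nonlinear arithmetic over ordered fields
    | simp_all        -- simplification using all hypotheses
    | linarith        -- linear arithmetic fallback
```

In Apollo this cascade runs on each sorrified sub-goal; only the goals that survive every tier are handed back to the LLM.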
Algorithm 1 Apollo Pseudocode for Formal Theorem Proving
1: **Input:** problem statement $ps$, current recursion depth $r_{current}$, maximum depth $r$
2: if $r_{current}$ is not initialized then
3: $r_{current}\leftarrow 0$
4: end if
5: if $r_{current}>r$ then
6: return sorry $\triangleright$ If recursion depth reached, return sorry
7: end if
8: $proof\leftarrow\mathbf{LLM}(ps)$ $\triangleright$ Generate proof from LLM
9: $proof\leftarrow\mathbf{SyntaxRefiner}(proof)$ $\triangleright$ Refine the proof syntax
10: $proof\leftarrow\mathbf{Sorrifier}(proof)$ $\triangleright$ Patch failing sub-lemmas
11: $proof\leftarrow\mathbf{AutoSolver}(proof)$ $\triangleright$ Try built-in Lean solvers
12: if $\mathbf{LeanCompiler}(proof)$ == True then
13: return $proof$
14: end if
15: $r_{current}\leftarrow r_{current}+1$
16: $n\leftarrow\mathbf{CountSorries}(proof)$ $\triangleright$ Number of "sorry" placeholders
17: $sub\_proofs\leftarrow[]$
18: for $i=1$ to $n$ do
19: $ps_{goal}\leftarrow\mathbf{GetTheoremGoal}(proof,i)$
20: $sub\_ps\leftarrow\mathbf{TransformGoal}(ps_{goal})$
21: $sub\_proofs[i]\leftarrow\mathbf{Apollo}(sub\_ps,r_{current},r)$
22: end for
23: $repaired\_proof\leftarrow\mathbf{ProofAssembler}(sub\_proofs)$
24: return $repaired\_proof$
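The recursive control flow of Algorithm 1 can be traced with a minimal runnable sketch; the LLM, compiler, and goal-extraction components are stubbed out with placeholder lambdas, so this shows only the recursion structure, not the real implementation:

```python
# Stub-driven sketch of Algorithm 1's recursion. `llm`, `compiles`, and
# `subgoals` stand in for the LLM, the Lean REPL check, and sorry-goal
# extraction; joining with " ; " stands in for ProofAssembler.
def apollo(statement, llm, compiles, subgoals, depth=0, max_depth=2):
    if depth > max_depth:
        return "sorry"                       # recursion budget exhausted
    proof = llm(statement)                   # generate (+ refine/sorrify/auto-solve)
    if compiles(proof):
        return proof                         # verified proof, done
    parts = [apollo(g, llm, compiles, subgoals, depth + 1, max_depth)
             for g in subgoals(proof)]       # recurse on each open sub-goal
    return " ; ".join(parts)                 # reassemble repaired sub-proofs

# Toy instantiation: "hard" splits into two sub-goals the stub LLM can close.
proof = apollo(
    "hard",
    llm=lambda s: "sorry" if s == "hard" else f"proved({s})",
    compiles=lambda p: "sorry" not in p,
    subgoals=lambda p: ["easy1", "easy2"],
)
print(proof)  # proved(easy1) ; proved(easy2)
```

The depth check at the top mirrors lines 5-7 of the pseudocode: once the budget $r$ is exhausted, the unresolved goal is returned as `sorry` and the assembled proof fails verification, triggering another top-level attempt.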
Appendix B Discussion on the maximum budget of @K sampling and Apollo
To better highlight Apollo's contribution to token efficiency, we report token counts under two settings: (1) the budget over all attempted miniF2F-test problems, and (2) the budget restricted to proofs where Apollo was invoked. For instance, if the base LLM proves a theorem without Apollo's assistance, its token usage is included only in the first setting. In contrast, if the base LLM cannot prove the theorem on its own but succeeds with Apollo's help, its token usage is counted in the second setting. Results for setting (1) are shown in Table 5, while results for setting (2) appear in Table 6.
We emphasize that, regardless of the token-count setting, Apollo consistently consumes fewer tokens while achieving higher accuracy, highlighting its effectiveness. Throughout this work, we primarily report results under setting (2), as it more directly reflects Apollo's contribution: the reported token usage is not artificially reduced by problems that the base LLM could solve independently. This makes setting (2) a more representative measure of our contribution.
Table 5: Comparison of maximum/average sample/token budget across $@K$ sampling and Apollo repair on the entire miniF2F-test dataset.
| Model | Max sample budget | Average sample budget | Max token budget | Average token budget |
| --- | --- | --- | --- | --- |
| Goedel-SFT | 25600 | 9730 | 12.7M | 4.8M |
| Goedel-SFT+Apollo | 1100 | 335 | 548K | 167K |
| Kimina-7B | 1024 | 373 | 1.3M | 1.6M |
| Kimina-7B+Apollo | 888 | 189 | 3.8M | 818K |
Table 6: Comparison of maximum/average sample/token budget across $@K$ sampling on the entire miniF2F-test dataset and Apollo repair exclusively on Apollo-assisted problems.
| Model | Max sample budget | Average sample budget | Max token budget | Average token budget |
| --- | --- | --- | --- | --- |
| Goedel-SFT | 25600 | 9730 | 12.7M | 4.8M |
| Goedel-SFT+Apollo | 1100 | 618 | 548K | 307K |
| Kimina-7B | 1024 | 373 | 1.3M | 1.6M |
| Kimina-7B+Apollo | 888 | 367 | 3.8M | 1.6M |
Appendix C Additional evaluation results
We also conducted experiments on the ProofNet dataset [9] using the Kimina-Prover-Preview-Distill-7B model. Table 7 summarizes the sample/token budgets and accuracy. Incorporating the Apollo agent with Kimina not only improves accuracy by 7 percentage points but also reduces token consumption by 25%.
Table 7: Comparison of Apollo accuracy on the ProofNet dataset. Token budget is computed over all generated tokens by LLM.
| Method | Model size | Sample budget | Token budget | ProofNet |
| --- | --- | --- | --- | --- |
| Kimina-Preview | 7B | 128 | 559K | 11.3% |
| Kimina-Preview+Apollo | 7B | 96 | 415K | 18.3% |
Moreover, we observe rising adoption of REPL-based repair, introduced in [44]. Currently, models such as Goedel-Prover-V2 are fine-tuned on REPL outputs paired with faulty Lean code, enabling the model to repair missing or incorrect proof fragments using this feedback. Following [44], we compare Apollo against a REPL-feedback baseline using the general-purpose o3-mini model. Table 8 presents the results: Apollo outperforms the REPL-feedback approach while requiring a smaller sample budget. We view REPL-based feedback as complementary and expect it to work well in conjunction with recursive, compiler-aided repair of Lean proofs.
Feedback prompt schema:
> This is an incorrect proof: [model's last proof]
> Compilation errors are as follows: [Lean's error messages]
> Based on this feedback, produce a correct raw Lean code for the following problem:
> [header] [informal prefix] [formal statement]
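For concreteness, the schema above can be assembled as a plain string template; the bracketed slot names follow the schema, while the helper name and example values are ours:

```python
# Build the REPL-feedback prompt from its five slots. The function name is
# illustrative; only the slot layout comes from the schema above.
def feedback_prompt(last_proof, errors, header, informal_prefix, formal_statement):
    return (
        f"This is an incorrect proof: {last_proof}\n"
        f"Compilation errors are as follows: {errors}\n"
        "Based on this feedback, produce a correct raw Lean code "
        "for the following problem:\n"
        f"{header}\n{informal_prefix}\n{formal_statement}"
    )

p = feedback_prompt("by simp", "unsolved goals", "import Mathlib",
                    "/-- Show that 1 + 1 = 2. -/",
                    "theorem t : 1 + 1 = 2 := by sorry")
print(p.splitlines()[0])  # This is an incorrect proof: by simp
```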
Table 8: Comparison of Apollo accuracy against compiler (REPL) feedback-based repair on the miniF2F-test dataset. Token budget is omitted since the OpenAI API does not report exact token counts with its responses.
| Method | Sample budget | Accuracy |
| --- | --- | --- |
| o3-mini | 1 | 3.3% |
| o3-mini | 32 | 24.6% |
| o3-mini + REPL feedback repair | 5 | 17.2% |
| o3-mini + REPL feedback repair | 10 | 25.4% |
| o3-mini + Apollo | 8 | 40.2% |
We also evaluated Apollo against the RMaxTS method proposed in [45]. This approach employs Monte Carlo tree search to explore the proof space and truncates the search whenever Lean compiler errors are encountered. RMaxTS has been shown to further improve the accuracy of whole-proof generation provers. For our evaluation, we used Kimina-Prover-Preview-Distill-7B as the base model. Table 9 reports the results. At a comparable token budget of approximately 2M tokens, Apollo outperforms RMaxTS by 10.7 percentage points. These findings further reinforce the effectiveness of the agentic recursive repair approach.
Appendix D Computational resource considerations
Our emphasis on reducing sample complexity is driven by resource allocation. CPU performance is not a limiting factor in our setup: the Lean kernel remains efficient even under high parallelization, with proof verification typically completing in 6–200 seconds. To guard against inefficient structures, we impose a 5-minute compilation timeout, which is rarely approached in practice.
By contrast, GPU resources were the primary bottleneck. Running Kimina-Prover-Preview-Distill-7B on a single A5000 GPU with a 16,384-token context window takes 700–2,000 seconds per problem (over 11.5 minutes) at @32 sampling. Increased sampling exacerbates this cost, and even with eight GPUs throughput remains difficult to scale. While Lean-based servers such as Kimina-server continue to improve verification efficiency through CPU parallelization, the growing overhead of sampling highlights the need to minimize LLM invocations alongside improving theorem-proving accuracy.
Table 9: Comparison of Apollo accuracy against other sampling methods on the miniF2F-test dataset. Token budget is computed over all generated tokens by LLM.
| Method | Model size | Sample budget | Token budget | Accuracy |
| --- | --- | --- | --- | --- |
| Kimina-Preview | 7B | 1024 | 4.5M | 70.8% |
| Kimina-Preview+RMaxTS [45] | 7B | $4\times 128$ | 2.0M | 64.3% |
| Kimina-Preview+Apollo | 7B | 589 | 2.2M | 75.0% |
Appendix E Extension of ablation study
To further elaborate on the ablation results presented in the main text, we measured the number of relevant Apollo module calls during inference. The results are shown in Table 10. Specifically, we counted the number of invocations for each Apollo-assisted problem in the miniF2F-test subset. The findings are consistent with the ablation study reported in the main text.
Table 10: Number of Apollo module triggers across different models
| Model | Syntax Refiner | Auto Solver | LLM Re-Invoker |
| --- | --- | --- | --- |
| Goedel-SFT | 0.0% | 100.0% | 100.0% |
| Kimina-7B | 0.0% | 100.0% | 100.0% |
| o3-mini | 100.0% | 100.0% | 70.0% |
| o4-mini | 94.7% | 100.0% | 64.9% |
Appendix F An example of Apollo proof repair procedure
Figure 6 shows a detailed run of the Apollo framework on mathd_algebra_332. First, the LLM generates an initial proof sketch, which fails to type-check in Lean. Apollo then applies its Syntax Refiner to correct simple errors (e.g. changing "from by" to ":= by") and to strip out all comment blocks. Next, the Sorrifier replaces each failing sub-lemma with a sorry statement (six in this example). The AutoSolver module then attempts to discharge these sub-lemmas: it resolves four of them, leaving lemmas #5 and #6 unsolved. Apollo recurses on those two: each is passed again through the LLM, Syntax Refiner, Sorrifier, and AutoSolver. After this single recursive iteration, all goals are closed. Finally, Apollo reassembles the full proof and returns it to the user.
This example illustrates Apollo's repair logic. In this case the proof was fixed in one iteration; more complex theorems may require deeper recursive repair.
<details>
<summary>examples/step_by_step/dryrun.png Details</summary>


</details>
Figure 6: Step-by-step Apollo repair on mathd_algebra_332. The initial proof and all sub-lemma proofs are generated with o4-mini, then systematically repaired and reassembled by Apollo.
<details>
<summary>examples/comparison/short_3.png Details</summary>

### Visual Description
## Code Comparison: LLM vs. LLM + APOLLO
### Overview
The image presents a side-by-side comparison of two code blocks, labeled "LLM" and "LLM + APOLLO". Both blocks are formal mathematical proofs written in a formal proof language (likely Lean, given keywords like `theorem`, `linarith`, and `exact`). The code aims to prove a statement related to divisibility by 53. The "LLM + APOLLO" block shows additional steps and annotations compared to the "LLM" block.
### Components/Axes
There are no axes or traditional chart components. The image consists of two independent code blocks, each with the following elements:
* **Import Statements:** `import Mathlib`, `import Aesop`, `set_option maxHeartbeats 0`
* **Theorem Declaration:** `theorem mathd_algebra_158_llm (a k : Nat) (h₀ : Even a)` and `theorem mathd_algebra_158_apollo (a k : Nat) (h₀ : Even a)`
* **Proof Body:** A series of lines representing logical steps, commands, and assertions.
* **Annotations:** Lines starting with `#1`, `#2`, `#3`, `#4` in the "LLM + APOLLO" block, indicating specific steps or points in the proof.
### Detailed Analysis or Content Details
**LLM Block:**
```
import Mathlib
import Aesop
set_option maxHeartbeats 0
open BigOperators Real Nat Topology Rat
theorem mathd_algebra_158_llm (a k : Nat) (h₀ : Even a)
(h₁ : (∑ k in Finset.range 8, 2 * k + 1) - (∑ k in Finset.range 5, a + 2 * k) = (4 : ℤ)) :
a = 8 := by
simp_all
ring_nf at *
have eq : (5 : ℤ) * ((↑a : ℤ) + (2 : ℤ)) * (↑k : ℤ) = 53 := by
linarith
have h_even : ∃ (m : ℕ), a = 2 * m := h₀
have two_divides_53 : ∃ (d : ℤ), 53 = 2 * d := #2
rw [← even_sum] at eq_subst
eq [rw_subst] at eq
exact ⟨15⟩ * ((↑m : ℤ) + 1k, eq
have not_divides : ¬ ∃ (d : ℤ), 53 = 2 * d := by
rintro ⟨d, h⟩
have mod53 : (53 : ℤ) % 2 = 1 :=
by norm_num
have mod_d : (2 * d) % 2 = 0 :=
by norm_num
have [mul] mod53 #3
linarith
have contra : False := by
rw [not_divides]
apply not_introduction
```
**LLM + APOLLO Block:**
```
import Mathlib
import Aesop
set_option maxHeartbeats 0
open BigOperators Real Nat Topology Rat
theorem mathd_algebra_158_apollo (a k : Nat) (h₀ : Even a)
(h₁ : (∑ k in Finset.range 8, 2 * k + 1) - (∑ k in Finset.range 5, a + 2 * k) = (4 : ℤ)) :
a = 8 := by
simp_all
ring_nf at *
have eq : (5 : ℤ) * ((↑a : ℤ) + (2 : ℤ)) * (↑k : ℤ) = 53 := by
linarith
have h_even : ∃ (m : ℕ), a = 2 * m := h₀ #1
exact Even.exists_two_nsmul a h₀
have two_divides_53 : ∃ (d : ℤ), 53 = 2 * d := #2
:= by omega
have two_divides_53 : ∃ (d : ℤ), 53 = 2 * d := #2
rintro ⟨d, hd⟩
have mod53 : (53 : ℤ) % 2 = 1 := #4
norm_num
have mod_d : (2 * d) % 2 = 0 := #3
rw [hd] at num53
have [mul] mod53 := mod53 #3
linarith
have contra : False := by
rw [not_divides]
apply contra_introduction
```
**Key Differences (LLM + APOLLO vs. LLM):**
* **Annotations:** The "LLM + APOLLO" block includes annotations `#1`, `#2`, `#3`, `#4` which are absent in the "LLM" block.
* **`exact Even.exists_two_nsmul a h₀`:** This line replaces `have h_even : ∃ (m : ℕ), a = 2 * m := h₀` in the "LLM + APOLLO" block.
* **`:= by omega`:** This line is added after `have two_divides_53 : ∃ (d : ℤ), 53 = 2 * d := #2` in the "LLM + APOLLO" block.
* **`rintro ⟨d, hd⟩`:** This line is added in the "LLM + APOLLO" block.
* **`rw [hd] at num53`:** This line is added in the "LLM + APOLLO" block.
* **`have [mul] mod53 := mod53 #3`:** This line replaces `have [mul] mod53 #3` in the "LLM + APOLLO" block.
* **`apply contra_introduction`:** This line replaces `apply not_introduction` in the "LLM + APOLLO" block.
### Key Observations
* The "LLM + APOLLO" block appears to be a refined or improved version of the "LLM" block. The annotations suggest a step-by-step refinement process.
* The addition of `omega` suggests the use of an automated tactic for solving the proof.
* The changes in the "LLM + APOLLO" block seem to streamline the proof and potentially make it more efficient.
* The code is highly symbolic and requires domain expertise (Lean theorem prover, number theory) to fully understand.
### Interpretation
The image demonstrates the effect of applying "APOLLO" (likely an automated proof assistant or tactic) to a mathematical proof generated by an "LLM" (Large Language Model). The LLM generates an initial proof attempt, and APOLLO refines it by:
1. **Adding Annotations:** Providing step-by-step markers for clarity and debugging.
2. **Utilizing Automated Tactics:** Employing `omega` to automatically solve parts of the proof.
3. **Streamlining Logic:** Replacing manual steps with more concise and efficient commands.
4. **Correcting Errors:** Replacing `not_introduction` with `contra_introduction`.
This suggests a collaborative approach to formal verification, where the LLM provides a starting point, and APOLLO leverages its automated reasoning capabilities to complete and optimize the proof. The image highlights the potential of combining the generative power of LLMs with the rigor of automated theorem provers. The fact that the annotations are numbered sequentially suggests a process of iterative refinement. The changes made by APOLLO appear to be aimed at making the proof more concise, efficient, and logically sound.
</details>
Figure 7: Proof attempt of mathd_algebra_158 produced by the base model (o4-mini) versus after Apollo's targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
Appendix G Comparison between base proofs generated from LLMs and Apollo-repaired counterparts
In this section, we present and discuss examples of initial proof attempts alongside their Apollo-repaired counterparts.
Figures 7, 8, and 9 illustrate cases where Apollo repairs the proof without invoking the LLM again. In each example, the LLM struggles to generate the precise tactics needed to close certain goals; however, we find that regenerating the entire proof is unnecessary: built-in Lean solvers can discharge those subgoals directly. Moreover, in Figure 7, problematic block #3 is resolved simply by removing it, since the surrounding proof context is correct. Thus, omitting problematic lines can sometimes yield a valid proof.
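As a schematic illustration (not one of the paper's examples), a subgoal that the LLM left to a failing tactic can often be discharged directly by a built-in solver:

```lean
-- Hypothetical sketch: a linear-arithmetic subgoal that a built-in
-- Lean solver closes directly, with no further LLM call needed.
example (a : ℕ) (h : 2 * a = 10) : a = 5 := by
  omega
```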
In Figure 10, one goal is expanded via an LLM query, but the model injects extra tactics that trigger a "no goals to be solved" error. Apollo then repairs the first block using a combination of LLM generation and AutoSolver, while the second block is removed entirely.
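The "no goals" failure mode can be reproduced in miniature: once a tactic closes the goal, any tactic written after it errors out, so the repair is simply to truncate the block. A minimal hypothetical sketch:

```lean
-- `norm_num` already closes the goal here; any trailing tactic after it
-- would raise "no goals to be solved", so an Apollo-style repair keeps
-- only the prefix that suffices.
example : (2 : ℕ) ^ 5 = 32 := by
  norm_num
```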
Figure 11 illustrates a case where the model fails to apply nlinarith to discharge the goal. We observe that, when uncertain, the model often resorts to broad tactics in hopes of closing the goal. Here, the goal is too complex for nlinarith, so Apollo leverages both the LLM and AutoSolver to guide the proof to completion.
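The hint-sensitivity of `nlinarith` can be shown on a small example (a hypothetical sketch, unrelated to the paper's goals): the tactic may fail on a nonlinear goal until an auxiliary square term is supplied.

```lean
-- Without the hint, nlinarith has no nonlinear fact to work with;
-- sq_nonneg (a - b) expands to a² - 2ab + b² ≥ 0, which makes the
-- goal linear in the monomials a², ab, b².
example (a b : ℝ) : a * b ≤ (a ^ 2 + b ^ 2) / 2 := by
  nlinarith [sq_nonneg (a - b)]
```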
Figure 12 illustrates a proof that required a recursion depth of $r=5$ to repair the initial attempt. The LLM's overall proof structure is correct, but it over-relies on built-in solvers and rewrites; because the goals are too complex for those tactics, this approach leaves a total of nine error blocks. Repairing those blocks requires both Apollo's AutoSolver module and the LLM itself. This shows that when the LLM grasps the correct approach, Apollo can repair the fine-grained logic errors to produce a correct proof.
Figure 13 illustrates a case where the LLM produces an incomplete proof sketch. The first half of the proof is correct, but the model fails to discharge the remaining goal and instead uses `all_goals positivity`. Apollo detects this error and performs two additional recursive repair iterations to complete the proof.
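For context, `positivity` succeeds only when every subterm of the goal is visibly nonnegative; a hypothetical sketch of the case where it does apply:

```lean
-- positivity discharges goals of the form 0 ≤ e or 0 < e by recursing
-- on the structure of e; here x² ≥ 0 and 1 > 0 make the sum positive.
example (x : ℝ) : 0 ≤ x ^ 2 + 1 := by
  positivity
```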
Figure 14 illustrates a case where the base model takes a lazy approach by trying to close all goals with linarith. In contrast, Apolloâs repaired proof performs additional setup and discharges each goal by first proving the necessary auxiliary hypotheses. This example shows that, although the model often understands the high-level strategy, it lacks the fine-grained tactics (and compiler feedback) needed to close subgoals. Apollo decomposes the proof, identifies the failure points, and applies targeted repairs to generate a fully verified solution.
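The repaired style described here can be sketched on the statement from Figure 8 (a compressed, hypothetical rendering, not Apollo's actual output): auxiliary hypotheses are proved first, and the main goal is then closed over them.

```lean
-- Prove a + b = 27 and (a + b)² = 729 as stepping stones, then let
-- nonlinear arithmetic combine them with a * b = 180 to conclude.
example (a b : ℝ) (h₀ : a * b = 180) (h₁ : 2 * (a + b) = 54) :
    a ^ 2 + b ^ 2 = 369 := by
  have hsum : a + b = 27 := by linarith
  have hsq : (a + b) ^ 2 = 729 := by rw [hsum]; norm_num
  nlinarith [hsq, h₀]
```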
Figure 15 shows another scenario in which Apollo successfully closes the first two blocks with linarith, but the final block requires a deeper reasoning chain. Apollo augments the proof by introducing and proving the missing lemmas, which leads to a correct solution with a series of rewrites.
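A toy sketch of this lemma-introduction pattern (hypothetical, not taken from Figure 15): the missing intermediate fact is stated and proved, then the goal is finished by rewriting with it.

```lean
-- Introduce the intermediate lemma hx, rewrite the goal with it,
-- and close the resulting numeric goal with norm_num.
example (x : ℝ) (h : x = 3) : x ^ 2 = 9 := by
  have hx : x ^ 2 = 3 ^ 2 := by rw [h]
  rw [hx]; norm_num
```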
Figure 16 presents an example of a very large repair. As in Figure 12, Apollo requires $r=5$ to fully fix the proof. The final proof grows from 100 lines to 216, meaning Apollo applied roughly $2.2\times$ as many tactics to close the goal. The repair relied on the combined effort of AutoSolver and the LLM to close all goals. This shows that, even for large, complex proofs, Apollo can repair each failing sub-block and generate a correct solution.
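The overall repair process described in this section can be sketched as a recursive loop. The following is a hypothetical Python sketch; `check` and the entries of `solvers` stand in for the real Lean-compiler and AutoSolver/LLM components, and the names are illustrative, not Apollo's actual API.

```python
def repair(proof, check, solvers, depth=0, max_depth=5):
    """Isolate failing blocks and patch them until the checker accepts
    the proof or the recursion budget (max_depth, i.e. r) is exhausted."""
    errors = check(proof)                 # failing sub-blocks, if any
    if not errors:
        return proof                      # proof already closes all goals
    if depth >= max_depth:
        return None                       # give up past the recursion budget
    for block in errors:
        for solve in solvers:             # try AutoSolver first, then the LLM
            fixed = solve(block)
            if fixed is not None:
                proof = proof.replace(block, fixed)
                break
        else:
            proof = proof.replace(block, "")  # deletion as a last resort
    return repair(proof, check, solvers, depth + 1, max_depth)

# Toy usage: "sorry" marks a failing block; the solver swaps in "omega".
check = lambda p: ["sorry"] if "sorry" in p else []
solvers = [lambda b: "omega" if b == "sorry" else None]
print(repair("by\n  sorry", check, solvers))
```

In the real system the checker is the Lean compiler and the fallback solver is an LLM query with a low top-$K$ budget; the deletion branch mirrors the observation above that omitting a problematic line sometimes suffices.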
<details>
<summary>examples/comparison/short_1.png Details</summary>

### Visual Description
## Code Block: Theorem Proofs - LLM vs. LLM + APOLLO
### Overview
The image presents two side-by-side code blocks, likely representing the output of an automated theorem prover. Both blocks demonstrate a proof attempt for the same theorem, `theorem mathd_algebra_141_llm` and `theorem mathd_algebra_141_apollo`. The left block is labeled "LLM" and the right block is labeled "LLM + APOLLO". The code is written in a formal proof language, likely Lean 4 with Mathlib, with commands for simplification, rewriting, and `norm_num`. The blocks show a step-by-step derivation, with each line labeled with a step number (`#1`, `#2`, `#3`).
### Components/Axes
There are no axes or traditional chart components. The structure is purely textual, consisting of commands and their outputs. Key elements include:
* **Theorem Statement:** `theorem mathd_algebra_141_llm (a b : ℝ) (h₀ : a * b = 180) (h₁ : 2 * (a + b) = 54) :`
* **Commands:** `import Mathlib`, `import Aesop`, `set_option maxHeartbeats 0`, `open BigOperators Real Nat Topology Rat`, `have`, `field_simp`, `rw`, `norm_num`, `ring`, `exact`, `trans`
* **Step Labels:** `#1`, `#2`, `#3`
* **Intermediate Results:** Lines showing the transformation of the expression.
* **Labels:** "LLM" and "LLM + APOLLO" indicating the prover used.
### Detailed Analysis or Content Details
**Left Block (LLM):**
1. `import Mathlib`
2. `import Aesop`
3. `set_option maxHeartbeats 0`
4. `open BigOperators Real Nat Topology Rat`
5. `theorem mathd_algebra_141_llm (a b : ℝ) (h₀ : a * b = 180) (h₁ : 2 * (a + b) = 54) :`
6. `have h3 : a + b = 27 := by #1`
7. `field_simp [hâ]`
8. `have h4 : (a + b) ^ 2 = 729 := by`
9. `rw [h3]`
10. `norm_num`
11. `have expand : a ^ 2 + 2 * a * b + b ^ 2 = (a + b) ^ 2 - 2 * a * b := by`
12. `ring`
13. `have step1 : a ^ 2 + b ^ 2 = 729 - 2 * a * b := by`
14. `rw [expand, h4]`
15. `have step2 : 729 - 2 * a * b = 729 - 2 * a * b := by #2`
16. `rw [h3]`
17. `have step3 : 729 - 360 = 369 := by`
18. `norm_num`
19. `exact step1.trans (step2.trans step3) #3`
**Right Block (LLM + APOLLO):**
1. `import Mathlib`
2. `import Aesop`
3. `set_option maxHeartbeats 0`
4. `open BigOperators Real Nat Topology Rat`
5. `theorem mathd_algebra_141_apollo (a b : ℝ) (h₀ : a * b = 180) (h₁ : 2 * (a + b) = 54) :`
6. `have h3 : a + b = 27 := by #1`
7. `linarith`
8. `have h4 : (a + b) ^ 2 = 729 := by`
9. `rw [h3]`
10. `norm_num`
11. `have expand : a ^ 2 + 2 * a * b + b ^ 2 = (a + b) ^ 2 - 2 * a * b := by`
12. `ring`
13. `have step1 : a ^ 2 + b ^ 2 = 729 - 2 * a * b := by`
14. `rw [expand, h4]`
15. `have step2 : 729 - 2 * a * b = 729 - 360 := by #2`
16. `linarith`
17. `have step3 : 729 - 360 = 369 := by`
18. `norm_num`
19. `linarith #3`
### Key Observations
* Both blocks start with the same imports and theorem statement.
* Both blocks derive `h3 : a + b = 27` in the first step (`#1`).
* The primary difference lies in the commands used for simplification. The "LLM" block uses `field_simp`, while the "LLM + APOLLO" block uses `linarith`.
* Step `#2` differs significantly. In the LLM block, it's `have step2 : 729 - 2 * a * b = 729 - 2 * a * b := by #2`, which appears to be a trivial identity. In the LLM + APOLLO block, it's `have step2 : 729 - 2 * a * b = 729 - 360 := by #2`, which simplifies the expression.
* The final step also differs. LLM uses `exact step1.trans (step2.trans step3)`, while LLM + APOLLO uses `linarith`.
### Interpretation
The image demonstrates a comparison between two theorem proving approaches: a base LLM and an LLM augmented with APOLLO. APOLLO appears to be a tool or strategy that enhances the LLM's ability to simplify expressions and progress through the proof. The difference in step `#2` is particularly telling. The LLM gets stuck on a trivial identity, while APOLLO successfully simplifies the expression, leading to a more direct path to the solution. The use of `linarith` in the APOLLO-enhanced block suggests that APOLLO leverages linear arithmetic reasoning capabilities. The final `linarith` command in the APOLLO block likely completes the proof by applying linear arithmetic to the remaining expression.
The image suggests that APOLLO significantly improves the performance of the LLM in this specific theorem proving task, enabling it to find a more efficient and concise proof. This highlights the potential of combining LLMs with specialized reasoning tools to enhance their problem-solving abilities in formal domains. The fact that the theorem statement is identical in both blocks suggests that the core logic is the same, but the APOLLO augmentation provides a more effective strategy for applying that logic.
</details>
Figure 8: Proof attempt of mathd_algebra_141 produced by the base model (o4-mini) versus after Apollo's targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/short_2.png Details</summary>

### Visual Description
## Code Comparison: LLM vs. LLM + APOLLO
### Overview
The image presents a side-by-side comparison of code generated by two systems: "LLM" (left) and "LLM + APOLLO" (right). Both systems use a formal verification environment, likely Lean 4 with Mathlib, based on the syntax and imports. The code snippets define a theorem `imo_1983_p6` and prove it with `nlinarith`.
### Components/Axes
There are no axes or traditional chart components. The image consists of two distinct code blocks, each with a header indicating the system that generated it ("LLM" and "LLM + APOLLO"). Each code block contains the following elements:
* `import` statements: `Mathlib` and `Aesop`
* `set_option` statement: `maxHeartbeats 0`
* `open` statement: `BigOperators Real Nat Topology Rat`
* `theorem` definition: `imo_1983_p6`
* `nlinarith` proof block: containing a series of constraints and operations.
### Detailed Analysis or Content Details
**LLM (Left)**
* **Imports:** `import Mathlib`, `import Aesop`
* **Option:** `set_option maxHeartbeats 0`
* **Open:** `open BigOperators Real Nat Topology Rat`
* **Theorem:** `theorem imo_1983_p6_llm (a b c : ℝ) (h₀ : 0 < a ∧ 0 < b ∧ 0 < c) (h₁ : c < a + b) (h₂ : b < a + c) (h₃ : a < b + c) : 0 ≤ a ^ 2 * b * (a - b) + b ^ 2 * c * (b - c) + c ^ 2 * a * (c - a) := by`
* **nlinarith:**
* `nlinarith [sq_nonneg (a - b), sq_nonneg (b - c), sq_nonneg (c - a), mul_pos h₀.left h₀.right.left, mul_pos h₀.left h₀.right.right, mul_pos h₀.right.left h₀.right.right]`
**LLM + APOLLO (Right)**
* **Imports:** `import Mathlib`, `import Aesop`
* **Option:** `set_option maxHeartbeats 0`
* **Open:** `open BigOperators Real Nat Topology Rat`
* **Theorem:** `theorem imo_1983_p6_apollo (a b c : ℝ) (h₀ : 0 < a ∧ 0 < b ∧ 0 < c) (h₁ : c < a + b) (h₂ : b < a + c) (h₃ : a < b + c) : 0 ≤ a ^ 2 * b * (a - b) + b ^ 2 * c * (b - c) + c ^ 2 * a * (c - a) := by`
* **nlinarith:**
* `nlinarith [sq_nonneg (a - b), sq_nonneg (b - c), sq_nonneg (c - a), mul_pos (sub_pos.mpr h₁) (sub_pos.mpr h₂), mul_pos (sub_pos.mpr h₂) (sub_pos.mpr h₃), mul_pos (sub_pos.mpr h₃) (sub_pos.mpr h₁)]`
**Differences in nlinarith:**
The key difference lies within the `nlinarith` hint list. The "LLM" version passes projections of the positivity hypothesis, such as `mul_pos h₀.left h₀.right.left`. The "LLM + APOLLO" version instead uses `mul_pos (sub_pos.mpr h₁) (sub_pos.mpr h₂)` and similar constructions involving `sub_pos.mpr`. This suggests that APOLLO introduces a step to explicitly prove the positivity of the triangle-inequality differences before multiplying, potentially leading to a more robust or complete proof.
### Key Observations
The code is nearly identical between the two systems, except for the `nlinarith` block. The "LLM + APOLLO" version appears to be more verbose and explicit in its reasoning, using `sub_pos.mpr` to demonstrate the positivity of terms before applying `mul_pos`. This could indicate that APOLLO is adding a layer of proof automation or simplification.
### Interpretation
The image demonstrates a comparison between a baseline LLM and an LLM augmented with APOLLO for formal verification. The theorem `imo_1983_p6` represents a mathematical inequality. The difference in the `nlinarith` blocks suggests that APOLLO is enhancing the LLM's ability to generate proofs by adding more explicit reasoning steps. The use of `sub_pos.mpr` indicates that APOLLO is actively verifying the positivity of terms, which is crucial for the correctness of the proof. This suggests that APOLLO is not simply generating code, but also contributing to the proof strategy itself. The `maxHeartbeats 0` setting suggests that the systems are configured to avoid timeouts during proof search. The fact that both systems produce valid code (at least syntactically) indicates that the LLM has a good understanding of Lean syntax and semantics. The addition of APOLLO appears to refine the proof process, potentially making it more reliable and complete.
</details>
Figure 9: Proof attempt of imo_1983_p6 produced by the base model (Kimina-Prover-Preview-Distill-7B) versus after Apollo's targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/short_4.png Details</summary>

### Visual Description
## Code Snippets: LLM vs. LLM + APOLLO
### Overview
The image presents two code snippets from a formal verification environment (Lean 4). The snippets define theorems named `induction_pord1p1on2powklt5on2_llm` and `induction_pord1p1on2powklt5on2_apollo` respectively. The code uses mathematical notation and logical operators. The snippets are visually similar, but the right-hand side (labeled "LLM + APOLLO") contains additional lines of code compared to the left-hand side ("LLM").
### Components/Axes
There are no axes or traditional chart components. The image consists of two blocks of text representing code. Each block has a header indicating "LLM" or "LLM + APOLLO". The code is structured with indentation to represent the logical hierarchy.
### Detailed Analysis or Content Details
**Left Snippet (LLM):**
```
import Mathlib
import Aesop
set_option maxHeartbeats 0
open BigOperators Real Nat Topology Rat
theorem induction_pord1p1on2powklt5on2_llm (n : ℕ)
(h₀ : 0 < n) :
(∏ k in Finset.Icc 1 n, (1 + (1 : ℝ) / 2 ^ 2 ^ k)) < 5 /
2 := by
induction n with
| zero =>
contradiction
| succ n hh =>
cases n with
| zero =>
norm_num #1
| succ n =>
simp_all [Finset.prod_Icc_succ_top, Nat.cast_succ, div_eq_mul_inv]
norm_num #2
ring_nf
linarith
```
* **Imports:** `Mathlib`, `Aesop`
* **Option:** `set_option maxHeartbeats 0`
* **Open Namespace:** `BigOperators Real Nat Topology Rat`
* **Theorem Name:** `induction_pordipion2powklt5on2_llm`
* **Input:** `n : ℕ` (n is a natural number), `h₀ : 0 < n` (n is greater than 0)
* **Statement:** `(∏ k in Finset.Icc 1 n, (1 + (1 : ℝ) / 2 ^ 2 ^ k)) < 5 / 2` (the product over k from 1 to n of 1 + 1/2^(2^k) is less than 5/2)
* **Proof:** `by induction n with ...` (Proof by induction on n)
* **Base Case:** `zero => contradiction` (If n is zero, a contradiction arises)
* **Inductive Step:** `succ n hh => cases n with ...` (If n is a successor, consider two cases)
* **Case 1:** `zero => norm_num #1` (If n is zero, apply `norm_num #1`)
* **Case 2:** `succ n => simp_all [Finset.prod_Icc_succ_top, Nat.cast_succ, div_eq_mul_inv] norm_num #2 ring_nf linarith` (If n is a successor, apply simplification rules, `norm_num #2`, ring normalization, and linear arithmetic)
**Right Snippet (LLM + APOLLO):**
```
import Mathlib
import Aesop
set_option maxHeartbeats 0
open BigOperators Real Nat Topology Rat
theorem induction_pord1p1on2powklt5on2_apollo (n : ℕ)
(h₀ : 0 < n) :
(∏ k in Finset.Icc 1 n, (1 + (1 : ℝ) / 2 ^ 2 ^ k)) < 5 /
2 := by
induction n with
| zero =>
contradiction
| succ n hh =>
cases n with
| zero =>
norm_num #1
| succ n =>
induction k with
| zero =>
norm_num
| succ k ih =>
[Finset.prod_Icc_succ_top] at hn h₀ => linarith
| succ k =>
simp_all
[Finset.prod_Icc_succ_top, pow_succ, mul_comm]
linarith
| succ n =>
simp_all
[Finset.prod_Icc_succ_top, Nat.cast_succ, div_eq_mul_inv]
norm_num #2
ring_nf
linarith
```
* The initial lines are identical to the left snippet.
* The key difference is within the `succ n hh` case. Instead of directly applying simplification rules, it introduces an inner `induction k with ...`
* **Inner Base Case:** `zero => norm_num`
* **Inner Inductive Step:** `succ k ih => [Finset.prod_Icc_succ_top] at hn h₀ => linarith` and `succ k => simp_all [Finset.prod_Icc_succ_top, pow_succ, mul_comm] linarith`
* The original simplification steps from the left snippet are also present, but nested within the inner inductive step.
### Key Observations
* The "LLM + APOLLO" snippet appears to be a more refined or expanded version of the "LLM" snippet.
* The addition of the inner `induction k with` suggests that APOLLO introduces an additional level of inductive reasoning within the proof.
* The use of `at hn h₀` indicates that the `Finset.prod_Icc_succ_top` lemma is being applied to hypotheses `hn` and `h₀`.
* The `#1` and `#2` annotations next to `norm_num` likely refer to specific normalization tactics or lemmas.
### Interpretation
The image demonstrates a potential workflow in automated theorem proving. The "LLM" snippet represents an initial attempt at proving the theorem, while "LLM + APOLLO" shows a refined proof strategy. APOLLO seems to enhance the proof by introducing an inner inductive step, potentially breaking down the problem into smaller, more manageable subproblems. This suggests that APOLLO acts as a proof search engine or tactic selector, guiding the LLM towards a more complete and efficient proof. The difference in code length and complexity highlights the added reasoning capability provided by APOLLO. The annotations (`#1`, `#2`) suggest that the system is utilizing specific normalization procedures during the proof process. The overall data suggests a system where an LLM generates a proof, and APOLLO refines it through additional reasoning steps.
</details>
Figure 10: Proof attempt of induction_pord1p1on2powklt5on2 produced by the base model (Goedel-Prover-SFT) versus after Apollo's targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/short_5.png Details</summary>

### Visual Description
## Code Snippets: LLM vs LLM + APOLLO
### Overview
The image presents two code snippets, labeled "LLM" on the left and "LLM + APOLLO" on the right. Both snippets appear to be written in a formal mathematical/programming language, likely related to theorem proving or formal verification. They both contain similar initial lines, but diverge significantly in their subsequent content. The code is densely packed with mathematical notation and function calls.
### Components/Axes
There are no axes or traditional chart components. The image consists entirely of text-based code. The two code blocks are visually separated by a vertical line. Each block has a header indicating "LLM" or "LLM + APOLLO". Both blocks contain lines starting with keywords like `import`, `set_option`, `open`, `theorem`, `rw`, `ring_nf`, `nlinarith`, `have`, `simp`, `exact`, and `norm_num`.
### Detailed Analysis or Content Details
**Left Block (LLM):**
* `import Mathlib`
* `import Aesop`
* `set_option maxHeartbeats 0`
* `open BigOperators Real Nat Topology Rat`
* `theorem mathd_algebra_293_llm (x : NNReal) : Real.sqrt(60 * x) * Real.sqrt(12 * x) * Real.sqrt(63 * x) = 36 * x * Real.sqrt(35 * x) := by`
* `rw [← Real.sqrt_mul (by positivity), ← Real.sqrt_mul (by positivity)]`
* `ring_nf`
* `rw [Real.sqrt_eq_iff mul_self_eq]`
* `ring_nf`
* `nlinarith [#1`
**Right Block (LLM + APOLLO):**
* `import Mathlib`
* `import Aesop`
* `set_option maxHeartbeats 0`
* `open BigOperators Real Nat Topology Rat`
* `theorem mathd_algebra_293_apollo (x : NNReal) : Real.sqrt(60 * x) * Real.sqrt(12 * x) * Real.sqrt(63 * x) = 36 * x * Real.sqrt(35 * x) := by`
* `rw [← Real.sqrt_mul (by positivity), ← Real.sqrt_mul (by positivity)]`
* `ring_nf`
* `rw [Real.sqrt_eq_iff mul_self_eq]`
* `ring_nf`
* `norm_num`
* `have h : (√(x : ℝ) * √35) ^ 2 = (x : ℝ) * 35 := by`
* `rw [mul_pow]`
* `simp`
* `ring`
* `simp_all only [ofNat_pos, mul_nonneg_iff_of_pos_right, NNReal.zero_le_coe, pow_nonneg]`
* `have hx : x ≥ 0 := by`
* `exact _root_zero_le_x`
* `have h1 : (↑x : ℝ) * √(35 : ℝ) * (36 : ℝ) ≥ 0 := by`
* `simp_all only [ge_iff_le, _root_zero_le, NNReal.zero_le_coe, sqrt_mul, ofNat_pos, mul_nonneg_iff_of_pos_right, Real.sqrt_pos, Real.sqrt_nonneg]`
* `have A : 0 ≤ (x : ℝ) := NNReal.coe_nonneg x`
* `have B : 0 ≤ √(35 : ℝ) := sqrt_nonneg (35 : ℝ)`
* `have C : (0 : ℝ) ≤ 36 := by norm_num`
* `exact mul_nonneg (mul_nonneg A B) C`
### Key Observations
The code snippets are very similar at the beginning, suggesting a common starting point. The "LLM + APOLLO" block contains significantly more lines of code and introduces new variables (e.g., `h`, `hx`, `h1`, `A`, `B`, `C`) and more complex logical steps. The "LLM + APOLLO" block appears to be attempting a more detailed or complete proof of the theorem. The presence of `norm_num` in the right block suggests a normalization step. The `#1` at the end of the left block might indicate a reference to a specific tactic or goal within the theorem proving environment.
### Interpretation
The image demonstrates a comparison between two approaches to proving a mathematical theorem. The "LLM" block represents a simpler, potentially incomplete attempt, while the "LLM + APOLLO" block represents a more sophisticated approach, likely leveraging additional tools or tactics (represented by "APOLLO") to achieve a more rigorous or complete proof. The increased complexity in the "LLM + APOLLO" block suggests that the theorem requires more detailed reasoning and manipulation to be proven effectively. The code snippets are likely part of a larger system for automated theorem proving, where the goal is to verify the correctness of mathematical statements using computer-assisted methods. The differences between the two blocks highlight the challenges and trade-offs involved in designing and implementing such systems. The use of `simp`, `ring`, and `norm_num` suggests simplification, ring arithmetic, and numerical normalization are being employed as proof strategies.
</details>
Figure 11: Proof attempt of mathd_algebra_293 produced by the base model (Goedel-Prover-SFT) versus after Apollo's targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/medium_1.png Details</summary>

### Visual Description
## Mathematical Code Snippets: Formal Theorem Representations
### Overview
The image presents three columns of mathematical code from a formal theorem prover, likely Lean 4 with Mathlib. The code defines theorems and related logic, with a focus on real-number arithmetic and inequalities. The left column is labeled "LLM", the center "LLM + APOLLO", and the right column contains additional code snippets. The code is densely packed and uses a specialized syntax.
### Components/Axes
There are no axes or traditional chart components. The image consists entirely of text-based code. The columns serve as organizational dividers. The code uses a consistent indentation style. The code snippets are separated by horizontal lines.
### Detailed Analysis or Content Details
**Column 1: LLM**
* **Header:** `import Mathlib` and `open BigOperators Real Nat Topology Rat`
* **Theorem 1:** `theorem mathd_algebra_14_11m`
* `(a b : ℝ)`
* `have h1 : a + b ≥ 0 : ℝ`
* `have h2 : a - b ≥ 0 : ℝ`
* `have h3 : (3 * NNReal) ≤ (54 * NNReal) : b`
* `have h4 : b ≤ h1 : b`
* `have h5 : h1 ≤ (3 * NNReal) : b`
* `have h6 : h2 ≤ h1 : b`
* **Theorem 2:** `theorem mathd_algebra_14_12m`
* `(a b : ℝ)`
* `have h1 : a + b ≥ 0 : ℝ`
* `have h2 : a - b ≥ 0 : ℝ`
* `have h3 : (3 * NNReal) ≤ (54 * NNReal) : b`
* `have h4 : b ≤ h1 : b`
* `have h5 : h1 ≤ (3 * NNReal) : b`
* `have h6 : h2 ≤ h1 : b`
* **Theorem 3:** `theorem mathd_algebra_14_13m`
* `(a b : ℝ)`
* `have h1 : a + b ≥ 0 : ℝ`
* `have h2 : a - b ≥ 0 : ℝ`
* `have h3 : (3 * NNReal) ≤ (54 * NNReal) : b`
* `have h4 : b ≤ h1 : b`
* `have h5 : h1 ≤ (3 * NNReal) : b`
* `have h6 : h2 ≤ h1 : b`
* **Theorem 4:** `theorem mathd_algebra_14_14m`
* `(a b : ℝ)`
* `have h1 : a + b ≥ 0 : ℝ`
* `have h2 : a - b ≥ 0 : ℝ`
* `have h3 : (3 * NNReal) ≤ (54 * NNReal) : b`
* `have h4 : b ≤ h1 : b`
* `have h5 : h1 ≤ (3 * NNReal) : b`
* `have h6 : h2 ≤ h1 : b`
**Column 2: LLM + APOLLO**
* **Header:** `import Mathlib` and `open BigOperators Real Nat Topology Rat`
* **Theorem 1:** `theorem mathd_algebra_14_11m_apollo`
* `(a b : ℝ)`
* `have h1 : a + b ≥ 0 : ℝ`
* `have h2 : a - b ≥ 0 : ℝ`
* `have h3 : (3 * NNReal) ≤ (54 * NNReal) : b`
* `have h4 : b ≤ h1 : b`
* `have h5 : h1 ≤ (3 * NNReal) : b`
* `have h6 : h2 ≤ h1 : b`
* **Theorem 2:** `theorem mathd_algebra_14_12m_apollo`
* `(a b : ℝ)`
* `have h1 : a + b ≥ 0 : ℝ`
* `have h2 : a - b ≥ 0 : ℝ`
* `have h3 : (3 * NNReal) ≤ (54 * NNReal) : b`
* `have h4 : b ≤ h1 : b`
* `have h5 : h1 ≤ (3 * NNReal) : b`
* `have h6 : h2 ≤ h1 : b`
* **Theorem 3:** `theorem mathd_algebra_14_13m_apollo`
* `(a b : ℝ)`
* `have h1 : a + b ≥ 0 : ℝ`
* `have h2 : a - b ≥ 0 : ℝ`
* `have h3 : (3 * NNReal) ≤ (54 * NNReal) : b`
* `have h4 : b ≤ h1 : b`
* `have h5 : h1 ≤ (3 * NNReal) : b`
* `have h6 : h2 ≤ h1 : b`
* **Theorem 4:** `theorem mathd_algebra_14_14m_apollo`
* `(a b : ℝ)`
* `have h1 : a + b ≥ 0 : ℝ`
* `have h2 : a - b ≥ 0 : ℝ`
* `have h3 : (3 * NNReal) ≤ (54 * NNReal) : b`
* `have h4 : b ≤ h1 : b`
* `have h5 : h1 ≤ (3 * NNReal) : b`
* `have h6 : h2 ≤ h1 : b`
**Column 3: LLM + APOLLO**
* **Code Snippets:** Contains various code blocks, including:
* `cases (mul_eq_zero_mp his)`
* `norm_num`
* `cases (mul_eq_zero_mp his)`
* `simp`
* `rw [mul_comm]`
* `cases (add_eq_zero_iff_eq_neg)`
* `cases (mul_eq_zero_mp his)`
* `simp`
* `rw [mul_comm]`
* `cases (add_eq_zero_iff_eq_neg)`
* `cases (mul_eq_zero_mp his)`
* `simp`
* `rw [mul_comm]`
* `cases (add_eq_zero_iff_eq_neg)`
* `cases (mul_eq_zero_mp his)`
* `simp`
* `rw [mul_comm]`
* `cases (add_eq_zero_iff_eq_neg)`
### Key Observations
* The theorems in the first two columns are nearly identical, suggesting that "APOLLO" might be a refinement or extension of the "LLM" system.
* The theorems in columns 1 and 2 have a repetitive structure, defining variables `a` and `b` as real numbers (`ℝ`) and establishing inequalities.
* The type `NNReal` appears frequently, representing the nonnegative real numbers.
* The third column contains code snippets that appear to be simplification or case analysis steps within a proof environment.
* The code uses a functional style with `have` statements to introduce assumptions and `rw` (rewrite) and `simp` (simplify) commands to manipulate expressions.
### Interpretation
The image showcases the output of a formal theorem prover. The code represents attempts to prove mathematical statements about real numbers and inequalities. The repetition in the first two columns suggests a comparison between two versions of the same proof attempt. The third column provides a glimpse into the proof process itself, showing how the system simplifies and case-splits expressions to reach a conclusion. `NNReal` is Mathlib's type of nonnegative real numbers, so its presence indicates that some quantities in the goals are constrained to be nonnegative. The code is highly technical and requires familiarity with formal logic and mathematical proof techniques to fully understand. The overall purpose is to rigorously verify mathematical statements using a computer-assisted proof system.
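The `have`-based style described above can be illustrated with a minimal Lean sketch. The statement below is a hypothetical stand-in chosen for readability, not the actual theorem from the figure: intermediate facts are introduced with `have` and discharged by arithmetic automation.

```lean
import Mathlib

-- Hypothetical example (not the miniF2F statement from the figure):
-- an intermediate fact is introduced with `have`, then linear
-- arithmetic closes the goal.
example (a b : ℝ) (h₁ : a + b = 54) (h₂ : a - b = 30) : a = 42 := by
  have h3 : 2 * a = 84 := by linarith
  linarith
```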
</details>
Figure 12: Proof attempt of mathd_algebra_184 produced by the base model (Kimina-Prover-Preview-Distill-7B) versus after Apollo's targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/medium_2.png Details</summary>

### Visual Description
## Diagram: Mathematical Proof/Code Snippets
### Overview
The image presents a side-by-side comparison of mathematical code and proof snippets, likely generated by a Large Language Model (LLM) and a system called "APOLLO". The content appears to be related to real number topology and calculations involving square roots and norms. The diagram is structured as a series of code blocks with annotations indicating the flow of logic or derivation steps. The left side is labeled "LLM", the center "APOLLO", and the right side "LLM + APOLLO".
### Components/Axes
The diagram is not a traditional chart with axes. Instead, it consists of several distinct code blocks and text annotations. Key components include:
* **Headers:** "LLM", "APOLLO", "LLM + APOLLO" at the top of each column.
* **Code Blocks:** Representing mathematical expressions and code snippets. These are formatted with indentation and annotations.
* **Annotations:** Textual notes (e.g., "by positivity", "by norm_num", "calc") indicating the reasoning or operation applied.
* **Mathematical Expressions:** Involving `Real.sqrt`, the opened namespaces `Real Nat Topology Rat`, `norm_num`, and variables like `x`, `y`, `z`.
* **Footer:** Contains information about the model versions and a timestamp.
### Detailed Analysis or Content Details
**LLM Column (Left):**
* `import Mathlib`
* `import Aesop`
* `set_option maxHeartbeats 0`
* `open BigOperators Real Nat Topology Rat`
* `theorem mathd_algebra_293_llm (x : NNReal) : Real.sqrt (60 * x) * Real.sqrt (12 * x) * Real.sqrt (63 * x) = Real.sqrt (36 * x) * Real.sqrt (35 * x) := by`
* `have hi : Real.sqrt (60 * x) * Real.sqrt (12 * x) * Real.sqrt (63 * x) = Real.sqrt (60 * x) * (12 * x) * Real.sqrt (63 * x) := by`
* `rw [Real.sqrt_mul (by positivity), Real.sqrt_mul (by positivity)]`
* `have h1 : (60 * x) * (12 * x) * (63 * x) = (36 * x) * (35 * x) := by`
* `rw [Real.sqrt_mul (by positivity), Real.sqrt_mul (by positivity)]`
* `have h2 : (60 * x) * (12 * x) * (63 * x) = (36 * x) * (35 * x) := by`
* `simp`
* `trace_state 'show [norm x, try rfl by positivity]'`
* `apply h1; try ring_nf; abel`
* `have h3 : (60 * x) * (12 * x) * (63 * x) = (36 * x) * (35 * x) := by`
* `rw [Real.sqrt_mul (by positivity), Real.sqrt_mul (by positivity)]`
* `have h4 : (60 * x) * (12 * x) * (63 * x) = (36 * x) * (35 * x) := by`
* `simp`
* `trace_state 'show [norm x, try rfl by positivity]'`
* `apply h3; try ring_nf; abel`
**APOLLO Column (Center):**
* `import Mathlib`
* `import Aesop`
* `set_option maxHeartbeats 0`
* `open BigOperators Real Nat Topology Rat`
* `theorem mathd_algebra_293_apollo (x : NNReal) : Real.sqrt (60 * x) * Real.sqrt (12 * x) * Real.sqrt (63 * x) = Real.sqrt (36 * x) * Real.sqrt (35 * x) := by`
* `have hi : Real.sqrt (60 * x) * Real.sqrt (12 * x) * Real.sqrt (63 * x) = Real.sqrt (60 * x) * (12 * x) * Real.sqrt (63 * x) := by`
* `rw [Real.sqrt_mul (by positivity)]`
* `have h1 : (60 * x) * (12 * x) * (63 * x) = (36 * x) * (35 * x) := by`
* `rw [Real.sqrt_mul (by positivity)]`
* `have h2 : (60 * x) * (12 * x) * (63 * x) = (36 * x) * (35 * x) := by`
* `calc`
* `√(15 * 3 * 7) := by`
* `√(15 * 3 * 7) := by`
* `rw [Real.sqrt_mul (by norm_num)]`
* `have h2a : √(315) = √(3 * 3 * 5 * 7) := by`
* `(36 * 12) + √(21) := √(3 * √(3 * √(5 * 7))) := by`
* `have h2b : √(315) = √(35 * 9) := by`
* `calc`
* `√(35 * 9) = 3 * √(35) := by`
* `rw [Real.sqrt_mul (by norm_num)]`
* `have h2b1 : √(35 * 9) = 3 * √(35) := by`
* `rw [Real.sqrt_mul (by norm_num)]`
**LLM + APOLLO Column (Right):**
* `have h2 : (√(15) * √(3) * √(7)) * √(7) := by`
* `have h2 : (√(15) * √(3) * √(7)) * √(7) := by`
* `calc`
* `√(15 * 3 * 7) := by`
* `√(15 * 3 * 7) := by`
* `rw [Real.sqrt_mul (by norm_num)]`
* `have h2a : √(315) = √(3 * 3 * 5 * 7) := by`
* `(36 * 12) + √(21) := √(3 * √(3 * √(5 * 7))) := by`
* `have h2b : √(315) = √(35 * 9) := by`
* `calc`
* `√(35 * 9) = 3 * √(35) := by`
* `rw [Real.sqrt_mul (by norm_num)]`
* `have h2b1 : √(35 * 9) = 3 * √(35) := by`
* `rw [Real.sqrt_mul (by norm_num)]`
**Footer:**
* `model_version = 'v0.3.26'`
* `date = '2024-01-26'`
* `commit = '8f8f599'`
### Key Observations
* The LLM and APOLLO columns attempt to prove the same theorem.
* The APOLLO column includes more detailed calculation steps ("calc") and intermediate results.
* Both columns utilize `Real.sqrt_mul` with the justification "by positivity" or "by norm_num".
* The "trace\_state" annotations in the LLM column suggest debugging or tracing of the proof process.
* The LLM + APOLLO column appears to be a combination of the outputs from the other two columns.
### Interpretation
The diagram demonstrates a comparison of mathematical reasoning and proof generation between a standard LLM and the APOLLO system. APOLLO seems to provide a more step-by-step, detailed breakdown of the calculation, potentially making the proof more transparent and easier to verify. The use of "calc" and intermediate results suggests a more explicit computational approach. The LLM, while attempting the same proof, relies more on higher-level transformations and annotations like "trace\_state". The combination in the "LLM + APOLLO" column suggests a potential synergy where APOLLO's detailed calculations can augment the LLM's reasoning capabilities. The footer information indicates the specific versions of the models and the date of the output, providing context for reproducibility and analysis. The overall goal appears to be to evaluate and improve the ability of LLMs to perform rigorous mathematical proofs. The repeated use of `Real.sqrt_mul` suggests this is a core operation in the proof process, and the "by positivity" justification indicates a reliance on the non-negativity of real numbers.
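The `Real.sqrt_mul` step that both columns lean on, such as `√(35 * 9) = 3 * √35`, can be sketched as follows. The lemma names and signatures follow current Mathlib and may differ slightly across versions:

```lean
import Mathlib

-- Sketch of the √(35 * 9) = 3 * √35 step from the APOLLO column:
-- reorder the radicand, split the square root with `Real.sqrt_mul`,
-- and evaluate √(3 ^ 2) with `Real.sqrt_sq`.
example : Real.sqrt (35 * 9) = 3 * Real.sqrt 35 := by
  rw [show (35 : ℝ) * 9 = 9 * 35 by ring,
      Real.sqrt_mul (by norm_num : (0 : ℝ) ≤ 9),
      show (9 : ℝ) = 3 ^ 2 by norm_num,
      Real.sqrt_sq (by norm_num : (0 : ℝ) ≤ 3)]
```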
</details>
Figure 13: Proof attempt of mathd_algebra_293 produced by the base model (Kimina-Prover-Preview-Distill-7B) versus after Apollo's targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/medium_3.png Details</summary>

### Visual Description
## Code Snippets: Mathematical Proofs & Computations
### Overview
The image contains three columns of code snippets, likely representing mathematical proofs or computations performed in a formal verification system (Lean 4 with Mathlib, based on the syntax). The snippets appear to be related to algebraic manipulations and simplification, potentially within a topological context. The left column is labeled "LLM", the middle column "APOLLO", and the right column "LLM + APOLLO". The code is primarily in English keywords interspersed with mathematical notation.
### Components/Axes
There are no axes or traditional chart components. The image is structured as three independent code blocks. Each block contains a series of lines, each representing a step in a proof or computation. The code uses keywords like `theorem`, `have`, `by`, `simp`, `rewrite`, `induction`, `assert`, `elim`, `constructor`, `apply`, `auto`, `done`, `norm`, `try`, `linarith`, `ring`, `field`, `solve`, `split`, `left`, `right`, `destruct`, `intros`, `generalize`, `replace`, `eq`, `forall`, `exists`, `in`, `of`, `with`, `using`, `native`, `decision`. Mathematical symbols include +, -, *, /, =, <, >, (, ), {, }, [, ], :, ;, ^, and various Greek letters.
### Detailed Analysis or Content Details
**LLM (Left Column)**
* `set_option maxHeartbeats 0`
* `set_option pp.numericTypes true`
* `set_option pp.coercions.types true`
* `open BigOperators Real Nat Topology Rat`
* `theorem algebra_2varlineareq_fp3zeq11_3tfm1m5zeqn68_feqn10_zeq7` (name partially illegible in the image; it matches the caption below)
* `(h1 : (f x) = (g x))`: A hypothesis stating that f(x) equals g(x).
* `(h2 : (f x) = (11 : Z))`: A hypothesis stating that f(x) equals 11.
* `(h3 : (g x) = (5 : Z))`: A hypothesis stating that g(x) equals 5.
* `(C) : (11 : Z) = (5 : Z)`: The goal to prove, which is a contradiction.
* `have (3 * f x) = (3 * 11 : Z)` by `simp`
* `have (3 * f x) = 33 : Z` by `simp`
* `have (3 * g x) = (3 * 5 : Z)` by `simp`
* `have (3 * g x) = 15 : Z` by `simp`
* `have (3 * f x) - (3 * g x) = 33 : Z - 15 : Z` by `simp`
* `have (3 * f x) - (3 * g x) = 18 : Z` by `simp`
* `have 3 * (f x - g x) = 18 : Z` by `simp`
* `have (f x - g x) = 6 : Z` by `ring`
* `have (11 : Z) - (5 : Z) = 6 : Z` by `simp`
* `have 6 : Z = 6 : Z` by `simp`
* `assert False` by `contradiction`
* `done`
**APOLLO (Middle Column)**
* `set_option maxHeartbeats 0`
* `set_option pp.numericTypes true`
* `set_option pp.coercions.types true`
* `open BigOperators Real Nat Topology Rat`
* `theorem algebra_2varlineareq_fp3zeq11_3tfm1m5zeqn68_feqn10_zeq7` (name partially illegible in the image; it matches the caption below)
* `(h1 : (f x) = (g x))`: A hypothesis stating that f(x) equals g(x).
* `(h2 : (f x) = (11 : Z))`: A hypothesis stating that f(x) equals 11.
* `(h3 : (g x) = (5 : Z))`: A hypothesis stating that g(x) equals 5.
* `(C) : (11 : Z) = (5 : Z)`: The goal to prove, which is a contradiction.
* `have (3 * f x) = (3 * 11 : Z)` by `simp`
* `have (3 * f x) = 33 : Z` by `simp`
* `have (3 * g x) = (3 * 5 : Z)` by `simp`
* `have (3 * g x) = 15 : Z` by `simp`
* `have (3 * f x) - (3 * g x) = 33 : Z - 15 : Z` by `simp`
* `have (3 * f x) - (3 * g x) = 18 : Z` by `simp`
* `have 3 * (f x - g x) = 18 : Z` by `simp`
* `have (f x - g x) = 6 : Z` by `ring`
* `have (11 : Z) - (5 : Z) = 6 : Z` by `simp`
* `have 6 : Z = 6 : Z` by `simp`
* `assert False` by `contradiction`
* `done`
**LLM + APOLLO (Right Column)**
* The transcription of this column is unrecoverable: the rendered text degenerates into a long run of nested `(0 : (0 : …` coercion annotations.
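The argument sketched in the LLM and APOLLO columns, deriving `(11 : Z) = (5 : Z)` from jointly inconsistent linear hypotheses, can be reconstructed in Lean as follows. This is a minimal reconstruction under assumed hypothesis names, not the verbatim proof from the figure:

```lean
import Mathlib

-- From f x = g x, f x = 11, and g x = 5 the hypotheses are jointly
-- inconsistent, so linear arithmetic over ℤ closes the (false) goal.
example (f g : ℤ → ℤ) (x : ℤ)
    (h1 : f x = g x) (h2 : f x = 11) (h3 : g x = 5) :
    (11 : ℤ) = 5 := by
  linarith
```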
</details>
Figure 14: Proof attempt of algebra_2varlineareq_fp3zeq11_3tfm1m5zeqn68_feqn10_zeq7 produced by the base model (o4-mini) versus after Apollo's targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/medium_4.png Details</summary>

### Visual Description
## Code Snippets: Mathematical Proofs
### Overview
The image presents three code snippets, likely from a mathematical software environment (Lean with Mathlib). The snippets appear to contain formal proofs or theorem derivations, written in a symbolic logic/programming language. The snippets are arranged side-by-side, with the third snippet being significantly wider. The code is primarily in English, with some symbolic notation.
### Components/Axes
There are no axes or traditional chart components. The snippets consist of lines of code, with indentation indicating structure. Each snippet begins with `import Mathlib` or `import Aesop` and `set_option maxHeartbeats 0`. Each snippet contains a theorem name (e.g., `theorem amc12a_2019_p12_llm (x y : ℝ) (h₀ : x ≠ 1 ∧ y ≠ 1)`) followed by a series of logical statements and commands (e.g., `have h1 : Real.log(x/y) / Real.log 2 ^ 2 = 20 : by`).
### Detailed Analysis or Content Details
**Snippet 1 (Left)**
* `import Mathlib`
* `import Aesop`
* `set_option maxHeartbeats 0`
* `open BigOperators Real Nat Topology Rat`
* `theorem amc12a_2019_p12_llm (x y : ℝ) (h₀ : x ≠ 1 ∧ y ≠ 1)`
* `have h1 : Real.log x / Real.log 2 = Real.log 16 / Real.log y (by)`
* `(Real.log(x / y) / Real.log 2) ^ 2 = 20 : by`
* `rintro rfl`
* `simp at h₀`
* `have y_nonzero : y ≠ 0 : by rfl`
* `rintro rfl`
* `simp at h₀`
* `field_simp [Real.log_div, Real.log_mul, Real.log_inv, Real.log_pow] at h₀`
* `nlinarith`
**Snippet 2 (Center)**
* `import Mathlib`
* `import Aesop`
* `set_option maxHeartbeats 0`
* `open BigOperators Real Nat Topology Rat`
* `theorem amc12a_2019_p12_apollo (x y : ℝ) (h₀ : x ≠ 1 ∧ y ≠ 1)`
* `have h1 : Real.log x / Real.log 2 = Real.log 16 / Real.log y (by)`
* `(Real.log(x / y) / Real.log 2) ^ 2 = 20 : by`
* `have x_nonzero : x ≠ 0 : by rfl`
* `rintro rfl`
* `simp at h₀`
* `linarith`
* `have y_nonzero : y ≠ 0 : by rfl`
* `rintro rfl`
* `simp at h₀`
* `linarith`
* `field_simp [Real.log_div, Real.log_mul, Real.log_inv, Real.log_pow, ring.nat]`
* `rw [h1]`
* `have h2 : Real.log x = Real.log 16 + Real.log y : by`
* `simp`
* `have h3 : Real.log(2^4) = 4*Real.log 2 : by`
* `simp`
* `rw [h3]`
* `have h4 : Real.log x = 4*Real.log 2 + Real.log y : by`
* `simp`
* `have h5 : Real.log(x * y) = Real.log x + Real.log y : by`
* `simp`
* `rw [h5]`
* `have h6 : Real.log(x * y) = 4*Real.log 2 + 2*Real.log y : by`
* `simp`
* `have h7 : Real.log(x * y) = Real.log(2^4 * y^2) : by`
* `simp`
* `exact h7`
**Snippet 3 (Right)**
* `have h8 : Real.log(2^6) = 6 * Real.log(2 : ℝ) : by`
* `simp`
* `rw [← h]`
* `norm_num`
* `gcongr`
* `have h16 : Real.log(16 : ℝ) = 4 * Real.log(2 : ℝ) : by`
* `norm_num`
* `have h2 : pos : (2 : ℝ) > 0 : by norm_num`
* `have h' : Real.log(2^4) = (4 : ℝ) * Real.log(2 : ℝ) : by`
* `simp`
* `rw [← h']`
* `norm_num`
* `rw [h16] at h`
* `have hL : L = (4 * L^2) / Y : by`
* `rw [h] at h₀`
* `have hL' : L = (4 * L^2) / Y : by`
* `rw [h] at h₀`
* `have hx : X = (4 * L^2) / Y : by`
* `have hL : Real.log(2 : ℝ) = L : rfl`
* `rw [hL] at h₀`
* `have hx : prod : X * Y = 4 * L^2 : by`
* `rw [hx]`
* `have hL : Y = 4 * L^2 : by`
* `rw [hL]`
* `have hx : X = pos : (2 : ℝ) > 0 : by`
* `have hx : X = 4 * L^2 : by`
* `rw [hx]`
* `have hx : prod : X * Y = 4 * L^2 : by`
* `rw [hx]`
* `have hx : Real.log x = Real.log(2^4 * y^2) : by`
* `rw [h16, h8]`
* `have hx : Real.log x = Real.log(2^4) + Real.log(y^2) : by`
* `simp`
* `have hx : Real.log x = 4 * Real.log(2 : ℝ) + 2 * Real.log y : by`
* `rw [hL]`
* `have hx : Real.log x = 4 * L + 2 * Real.log y : by`
* `simp`
* `have hx : Real.log x = 4 * L + 2 * Real.log y : by`
* `exact hx`
### Key Observations
The code snippets appear to be attempting to prove a mathematical statement involving logarithms. The use of `simp`, `rw`, `linarith`, and `field_simp` suggests simplification, rewriting, linear arithmetic, and field simplification operations, respectively. The snippets share common elements like the import statements and the initial theorem definition, suggesting they are related attempts to solve the same problem. The rightmost snippet is significantly longer and more complex, potentially representing a more complete or refined proof.
### Interpretation
The code snippets represent a formal proof attempt within a mathematical software environment. The goal appears to be to manipulate logarithmic expressions to arrive at a desired conclusion. The use of automated proof tactics (like `simp`, `rw`, `linarith`) indicates a desire to leverage computational assistance in the proof process. The repeated use of `have` statements suggests a step-by-step construction of the proof, with each statement building upon previous ones. The snippets demonstrate the power of formal methods in mathematical reasoning, where proofs are expressed in a precise and unambiguous language that can be verified by a computer. The differences between the snippets likely represent different approaches or refinements in the proof strategy. The presence of `set_option maxHeartbeats 0` indicates that Lean's default computation budget has been disabled, allowing long-running tactic executions that would otherwise time out.
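The core logarithm manipulation visible in these snippets corresponds to a standard Mathlib lemma. For instance, the `Real.log(x * y)` splitting step can be sketched as below (a minimal illustration, not the figure's exact proof):

```lean
import Mathlib

-- Splitting a logarithm of a product requires both factors nonzero;
-- `Real.log_mul` provides exactly this in Mathlib.
example (x y : ℝ) (hx : x ≠ 0) (hy : y ≠ 0) :
    Real.log (x * y) = Real.log x + Real.log y :=
  Real.log_mul hx hy
```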
</details>
Figure 15: Proof attempt of amc12a_2019_p12 produced by the base model (Goedel-Prover-SFT) versus after Apollo's targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/long.png Details</summary>

### Visual Description
## Textual Output: Code Snippets - Mathematical/Logical Expressions
### Overview
The image presents a collection of code snippets, likely representing mathematical or logical expressions, possibly related to formal verification or theorem proving. The snippets are arranged in a grid-like fashion, with three main columns labeled "LLM", "Theorem_Math_Algebra_200_LLM_Apollo (k in Z)", and "LLM + APOLLO". Each snippet appears to define or manipulate variables and functions, using a syntax that resembles a functional programming language or a specialized logic language. The snippets are densely packed with text, making precise extraction challenging.
### Components/Axes
There are no explicit axes or legends in the traditional sense. The structure is defined by the three column headers:
* **LLM:** Leftmost column.
* **Theorem\_Math\_Algebra\_200\_LLM\_Apollo (k in Z):** Middle column, with a longer title indicating a specific theorem or mathematical context. The "(k in Z)" suggests a variable 'k' belonging to the set of integers.
* **LLM + APOLLO:** Rightmost column, indicating a combination of LLM and Apollo.
Each column contains multiple code snippets, each appearing to be a self-contained unit of code.
### Detailed Analysis or Content Details
Due to the density and complexity of the code, a complete transcription is impractical. Instead, I will provide representative snippets and observations from each column, focusing on recurring patterns and keywords. I will attempt to extract key function names and variable definitions. Note that the transcription will be approximate due to image quality and the specialized nature of the code.
**LLM Column (Left):**
* `have h1 : (k : Z) -> (m : Z) -> (n : Z) -> by exact_cast h2`
* (the line above repeats, essentially unchanged, roughly fifteen times down the column)
**Theorem\_Math\_Algebra\_200\_LLM\_Apollo (k in Z) Column (Middle):**
* `have h2 : (k : Z) -> (m : Z) -> (n : Z) -> by exact_cast h1`
* (the line above repeats, essentially unchanged, roughly fifteen times down the column)
**LLM + APOLLO Column (Right):**
* `have h1 : (k : Z) -> (m : Z) -> (n : Z) -> by exact_cast h2`
* `have h1 : (k : Z) -> (m : Z) -> (n : Z) -> by exact_cast h2`
* `have h1 : (k : Z) -> (m : Z) -> (n : Z) -> by exact_cast h2`
* `have h1 : (k : Z) -> (m : Z) -> (n : Z) -> by exact_cast h2`
* `have h1 : (k : Z) -> (m : Z) -> (n : Z) -> by exact_cast h2`
* `have h1 : (k : Z) -> (m : Z) -> (n : Z) -> by exact_cast h2`
* `have h1 : (k : Z) -> (m : Z) -> (n : Z) -> by exact_cast h2`
* `have h1 : (k : Z) -> (m : Z) -> (n : Z) -> by exact_cast h2`
* `have h1 : (k : Z) -> (m : Z) -> (n : Z) -> by exact_cast h2`
* `have h1 : (k : Z) -> (m : Z) -> (n : Z) -> by exact_cast h2`
* `have h1 : (k : Z) -> (m : Z) -> (n : Z) -> by exact_cast h2`
* `have h1 : (k : Z) -> (m : Z) -> (n : Z) -> by exact_cast h2`
* `have h1 : (k : Z) -> (m : Z) -> (n : Z) -> by exact_cast h2`
* `have h1 : (k : Z) -> (m : Z) -> (n : Z) -> by exact_cast h2`
* `have h1 : (k : Z) -> (m : Z) -> (n : Z) -> by exact_cast h2`
**Common Patterns:**
* The structure `have h[number] : (k : Z) -> (m : Z) -> (n : Z) -> by exact_cast h[number]` appears repeatedly across all columns. This suggests a common pattern of defining a function `h` that takes three integer arguments (k, m, n) and uses `exact_cast` to convert or validate a value based on another function `h`.
* The use of type annotations `(k : Z)`, `(m : Z)`, `(n : Z)` indicates that the code is strongly typed and deals with integers.
* The `exact_cast` keyword suggests a type conversion or assertion mechanism.
### Key Observations
* The code snippets are highly repetitive, particularly within each column. This might indicate a systematic exploration of different cases or a repetitive application of a theorem.
* The consistent use of `exact_cast` suggests a focus on type safety and correctness.
* The presence of "LLM" and "Apollo" in the column headers suggests that these are different systems or components involved in the process.
### Interpretation
The image presents a series of code snippets likely generated during an automated theorem proving or formal verification process. The "LLM" column might represent the initial output of a Large Language Model, while the "Theorem\_Math\_Algebra\_200\_LLM\_Apollo" column represents the application of a specific theorem or algebraic manipulation using the Apollo system. The "LLM + APOLLO" column could represent a combined or refined output.
The repetitive nature of the code suggests that the system is systematically exploring different possibilities or applying a theorem to a range of cases. The `exact_cast` keyword highlights the importance of type correctness in this process. The overall goal appears to be to verify or prove a mathematical statement using a combination of LLM-generated code and a more specialized theorem proving system (Apollo).
The image provides a glimpse into the interaction between AI (LLM) and formal methods (Apollo) in the context of mathematical reasoning. It demonstrates how LLMs can be used to generate code that can then be verified or refined using more rigorous tools. The repetition suggests a search for a valid proof or a systematic exploration of the solution space.
</details>
Figure 16: Proof attempt of mathd_algebra_289 produced by the base model (Kimina-Prover-Preview-Distill-7B) versus after Apollo's targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected proof assembled by Apollo, which closes all of the goals. Import headers match those of the other presented examples and are omitted here for readability.
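For context, a common class of error that Apollo's repair agents fix in proofs like the one above involves coercions between numeric types, typically closed in Lean/Mathlib with the `exact_mod_cast` tactic. The following is a minimal illustrative sketch of that pattern (the hypothesis names and statement are invented for illustration, not taken from mathd_algebra_289):

```lean
import Mathlib

-- Illustrative only: the hypothesis h2 lives over ℚ, but the goal is over ℤ.
-- `exact_mod_cast` inserts the missing cast lemmas so h2 closes the goal.
example (k m n : ℤ) (h2 : (k : ℚ) + m = n) : k + m = n := by
  exact_mod_cast h2
```

A raw LLM output might instead attempt `exact h2` here, which fails to type-check because of the ℚ/ℤ mismatch; replacing it with the cast-aware tactic is exactly the kind of small, compiler-guided edit the repair loop applies.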