# APOLLO: Automated LLM and Lean Collaboration for Advanced Formal Reasoning
**Authors**: Azim Ospanov, Farzan Farnia, Roozbeh Yousefzadeh
> Huawei Hong Kong Research Center; Department of Computer Science & Engineering, The Chinese University of Hong Kong
## Abstract
Formal reasoning and automated theorem proving constitute a challenging subfield of machine learning, in which machines are tasked with proving mathematical theorems using formal languages like Lean. A formal verification system can check almost instantaneously whether a formal proof is correct, but generating a completely correct formal proof with large language models (LLMs) remains a formidable task. The usual approach in the literature is to prompt the LLM many times (up to several thousand) until one of the generated proofs passes the verification system. In this work, we present APOLLO (Automated PrOof repair via LLM and Lean cOllaboration), a modular, model-agnostic agentic framework that combines the strengths of the Lean compiler with an LLM's reasoning abilities to achieve better proof-generation results at low token and sampling budgets. Apollo directs a fully automated process in which the LLM generates proofs for theorems, and a set of agents analyze the proofs, fix syntax errors, identify mistakes using Lean, isolate failing sub-lemmas, invoke automated solvers, and call an LLM on each remaining goal with a low top-$K$ budget. The repaired sub-proofs are recombined and reverified, iterating up to a user-controlled maximum number of attempts. On the miniF2F benchmark, we establish a new state-of-the-art accuracy of 84.9% among sub-8B-parameter models (as of August 2025) while keeping the sampling budget below one hundred. Moreover, Apollo raises the state-of-the-art accuracy for Goedel-Prover-SFT to 65.6% while cutting sample complexity from 25,600 to a few hundred. General-purpose models (o3-mini, o4-mini) jump from 3–7% to over 40% accuracy. Our results demonstrate that targeted, compiler-guided repair of LLM outputs yields dramatic gains in both efficiency and correctness, suggesting a general paradigm for scalable automated theorem proving. The codebase is available at https://github.com/aziksh-ospanov/APOLLO
## 1 Introduction
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Whole-Proof Generation Pipeline
A flowchart titled "Common Approach: Whole-Proof Generation Pipeline." An LLM (blue box with a brain icon) sends a "proof attempt" to a Lean Server (white box), which returns a binary verdict: a green checkmark ("got it!") on success or a red 'X' ("nope!") on failure. A black arrow labeled "repeat up to K times" loops from the verdict back to the LLM. The diagram captures the standard trial-and-error paradigm: the LLM proposes whole proofs, the symbolic Lean verifier serves as the sole source of truth, and generation simply repeats until a proof passes or the budget of K attempts is exhausted; the verifier's rich error output plays no role beyond accept/reject.
</details>
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Apollo Proof Repair Pipeline Flowchart
A flowchart titled "Our Proposed Apollo Pipeline." An LLM (blue box with a brain icon) sends "proof attempt(s)" to the Apollo Proof Repair Agent (a dashed-border box with a wrench icon), which is supported by two sub-modules, an "Auto Solver" and a "Subproof Extractor." The agent forwards the current proof state together with compilation and syntax errors to a Lean Server and receives error feedback in return; it also sends "sub-problem(s) to prove" back to the LLM. A control box beneath the Lean Server either exits the loop (green checkmark) when the proof is valid or continues (red 'X'), and an overarching arrow labeled "repeat up to r times" closes the loop back to the LLM. In contrast to the whole-proof pipeline above, the compiler's error messages actively drive decomposition and repair rather than serving only as an accept/reject signal, and the parameter r bounds the repair budget.
</details>
Figure 1: Summary of the whole-proof generation pipeline (top) vs. the proposed Apollo agentic pipeline (bottom). LLM refers to the chosen formal theorem-generator model.
Formal reasoning has emerged as one of the most challenging fields of AI, with recent systems such as AlphaProof and AlphaGeometry competing with humans at the International Mathematical Olympiad yang2024formal ; alphaproof ; AlphaGeometryTrinh2024 . Formal reasoning relies both on AI models and on a formal verification system that can automatically check whether a mathematical proof is correct. In recent years, formal verification systems such as Lean lean4paper have enabled a new mode of mathematical research: mathematicians interact with the verification system, and with each other through it, allowing large numbers of collaborators to work on a single project. These systems are accordingly called proof assistants, since one can use them interactively to write a formal proof and instantly observe the current proof state along with any errors or shortcomings reported by the compiler. Immediate access to the Lean compiler's output helps a mathematician fix errors in a proof; conversely, once a proof passes the compiler without error, other collaborators need not re-verify its correctness. This style of collaboration is transforming mathematics, enabling large groups to engage in major research projects such as the Prime Number Theorem And More project kontorovich_tao_2024_primenumbertheoremand , and has spurred the digitization of mathematics buzzard2024mathematical .
On the AI front, large language models (LLMs) are improving at mathematical reasoning in natural language, and they have also shown a remarkable ability to learn the languages of these formal verification systems and write mathematical proofs formally. This has given rise to the field of Automated Theorem Proving (ATP), where the standard approach is to prompt an LLM to generate a number of candidate proofs for a given theorem, which are then verified automatically by the formal verification system. Better models, better training sets, and better training methods have led to significant advances in this field tsoukalas2024putnambenchevaluatingneuraltheoremprovers ; zheng2022minif2fcrosssystembenchmarkformal ; azerbayev2023proofnetautoformalizingformallyproving ; li2024numinamath ; hu2025minictx .
Despite these advances, LLMs do not get to interact with the verification system the way a human does. An LLM generates many candidate proofs, sometimes tens of thousands, and the verification system merely marks each as correct or incorrect. This common approach is illustrated in Figure 1. Yet when the formal verification system analyzes a proof, its compiler may produce a long list of issues, including syntax errors, incorrect tactics, and open goals. In the community's current approach, all of this compiler feedback goes unused, since the LLM's proof generation is never directly engaged with the compiler errors of the formal verification system.
In this work, we address this issue by introducing Apollo, a fully automated, modular, model-agnostic pipeline that combines the strengths of the Lean compiler with the reasoning abilities of LLMs. Apollo directs a fully automated process in which the LLM generates proofs for theorems, and a set of agents analyze the proofs, fix syntax errors, identify mistakes using Lean, isolate failing sub-lemmas, invoke automated solvers, and call an LLM on each remaining goal with a low top-$K$ budget. A high-level overview of Apollo appears at the bottom of Figure 1.
Our contributions are as follows:
- We introduce a novel, fully automated system that directs a collaboration between LLMs, the Lean compiler, and automated solvers for automated theorem proving.
- We evaluate Apollo on the miniF2F-test set using the best available theorem-proving LLMs and demonstrate that it improves baseline accuracies by significant margins while keeping sampling costs much lower.
- We establish a new SOTA on the miniF2F-test benchmark for medium-sized language models (8B parameters or fewer).
## 2 Related Works
Formal theorem proving systems. At their core, formal theorem-proving (FTP) systems employ interactive theorem provers (ITPs) to verify mathematical results. In particular, Lean4 lean4paper is both a functional programming language and an ITP. Each theorem, lemma, conjecture, or definition must be checked by Lean's trusted kernel, so the compiler's verdict is binary: either the statement type-checks (True) or it fails (False). This rigorous verification dramatically increases the reliability of formal mathematics; however, it also increases the complexity of proof development: authors must both comprehend the mathematical concepts and precisely encode them in the formal language. Lean's usefulness also extends beyond theorem proving: Jiang et al. jiang2024leanreasoner showed that it can be adapted to natural-language logical reasoning.
Patching broken programs. Many prior works have explored using feedback to repair broken proofs. In software engineering, “repair” typically refers to program repair, i.e. fixing code that no longer runs programrepair . A common source of errors is a version mismatch that introduces bugs and prevents execution. For example, SED sed_repair uses a neural program‐generation pipeline that iteratively repairs initial generation attempts. Other notable approaches train specialized models to fix broken code based on compiler feedback (e.g. BIFI bifi , TFix tfix ) or leverage unsupervised learning over large bug corpora (e.g. BugLab buglab ).
LLM collaboration with external experts. The use of expert feedback, whether from human users or from a proof assistant's compiler, has proven effective for repairing broken proofs. Ringer ringer2009typedirected showcased automatic proof repair within the Coq proof assistant bertot2013interactive , enabling the correction of proofs in response to changes in underlying definitions. Jiang et al. jiang2022thor showed that combining automatic solvers with generative models yields better performance on formal-math benchmarks. More recently, First et al. first2023baldurwholeproofgenerationrepair showed that incorporating compiler feedback into LLM proof generation significantly improves the model's ability to correct errors and produce valid proofs. Another line of work song2023towards explored the idea of mathematician–LLM collaboration, where human experts are aided by LLMs at the proof-writing stage.
Proof search methods. Numerous recent systems use machine learning to guide the search for formal proofs. One approach couples large language models with structured search strategies, e.g., Best-First Search (BFS) polu2022formalmathematicsstatementcurriculum ; wu2024internlm25stepproveradvancingautomatedtheorem ; xin2025bfsproverscalablebestfirsttree or Monte Carlo Tree Search (MCTS) lample2022hypertree ; wang2023dt-solver ; alphaproof . While tree-search methods reliably steer a model toward valid proofs, they incur high inference costs and often explore many suboptimal paths before success. Another line of work leverages retrieval-based systems to supply proof generators with potentially useful lemmas as context. One such example is ReProver yang_leandojo_2023 , which augments Lean proof generation by retrieving relevant lemmas from a proof corpus and feeding them to an LLM, enabling the model to reuse existing proofs.
Whole proof generation. The use of standalone LLMs for theorem proving has emerged as a major research direction in automated theorem proving. One of the earliest works presented GPT-f polu2020generativelanguagemodelingautomated , a transformer-based model for theorem proving that established LLMs' ability in formal reasoning. As training frameworks and methods have advanced, the community has produced numerous models that generate complete proofs without external search or tree-search algorithms. Recent work shows that both supervised models xin2024deepseekproverv15harnessingproofassistant ; lin2025goedelproverfrontiermodelopensource and those trained via reinforcement learning xin2024deepseekproverv15harnessingproofassistant ; zhang2025leanabellproverposttrainingscalingformal ; dong2025stp ; kimina_prover_2025 achieve competitive performance on formal-math benchmarks. However, whole-proof generation remains vulnerable to hallucinations that can cause compilation failures even for proofs with correct reasoning chains.
Informal Chain-of-Thought in Formal Mathematics. Several recent works demonstrate that interleaving informal "chain-of-thought" (CoT) reasoning with formal proof steps substantially improves whole-proof generation. For example, developers insert natural-language mathematical commentary between tactics, yielding large accuracy gains lin2025goedelproverfrontiermodelopensource ; xin2024deepseekproverv15harnessingproofassistant ; azerbayev2024llemma . Wang et al. kimina_prover_2025 further show that special "thinking" tokens let the model plan its strategy and self-verify intermediate results. The "Draft, Sketch, and Prove" (DSP) jiang2023draftsketchproveguiding and LEGO-Prover wang2024legoprover pipelines rely on an informal proof sketch before attempting to generate the formal proof. Collectively, these studies indicate that providing informal mathematical context, whether as comments, sketches, or dedicated tokens, guides LLMs toward correct proofs and significantly boosts proof-synthesis accuracy.
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: Apollo Algorithm - Automated Theorem Proving Pipeline
A flowchart titled "Apollo Algorithm" sits above a side-by-side Lean4 code comparison. The pipeline runs an invalid proof through the Syntax Refiner, the Sorrifier, and the Auto Solver; extracted subproofs are fed to the LLM, and the Proof Assembler splices the repaired pieces back into a full proof. Two decision checks then govern the loop: whether the number of repair attempts has exceeded the budget `c`, and whether the assembled proof compiles in Lean4. Until the proof compiles (or the budget runs out), the process returns to the Syntax Refiner for another iteration. Below the flowchart, a red-bordered block shows the "Invalid Proof" and a green-bordered block shows the repaired "Correct Proof."
**Code Block Transcription & Comparison:**
**Invalid Proof (Left Block - Lean4 Code):**
```lean
import Mathlib
import Aesop
set_option maxHeartbeats 0
open BigOperators Real Topology Rat Set

theorem mathd_numbertheory_495_incorrect (a b : ℕ)
    (h₀ : 0 < a ∧ 0 < b)
    (h₁ : a % 10 = 2) (h₂ : b % 10 = 4)
    (h₃ : Nat.gcd a b = 6) : 108 ≤ Nat.lcm a b := by
  have h4 : a ≥ 6 := by
    have h5 : a % 10 = 2 := h₁
    omega -- #1 (Highlighted in Pink)
  have h5 : b ≥ 6 := by
    have h6 : b % 10 = 4 := h₂
    omega -- #2 (Highlighted in Pink)
  by_contra h
  push_neg at h
  have h6 : a * b < 648 := by
    have h7 : Nat.lcm a b < 108 := by linarith
    have h8 : a * b = 6 * Nat.lcm a b := by
      rw [Nat.gcd_mul_lcm a b]
      rw [h₃]
    linarith
  have h7 : a ≤ 100 := by
    nlinarith [h₀.1, h₀.2, Nat.gcd_pos_of_pos_left a h₀.1,
      h₀.2, Nat.gcd_pos_of_pos_right b h₀.2] -- #3 (Highlighted in Pink)
  have h9 : b ≤ 107 := by
    interval_cases a <;> interval_cases b <;>
      norm_num at *
    <;> try { contradiction }
```
* **Key Errors:** Lines marked `omega` (#1, #2) and a complex `nlinarith` tactic (#3) are highlighted, indicating failed proof steps.
**Correct Proof (Right Block - Lean4 Code):**
```lean
...
theorem mathd_numbertheory_495_after_proof_repair
    (a b : ℕ)
    (h₀ : 0 < a ∧ 0 < b)
    (h₁ : a % 10 = 2) (h₂ : b % 10 = 4)
    (h₃ : Nat.gcd a b = 6) : 108 ≤ Nat.lcm a b := by
  have h4 : a ≥ 6 := by
    have h5 : a % 10 = 2 := h₁
    have h : a.gcd b = 6 := h₃
    have h6 : 6 ∣ a := by
      apply Nat.gcd_dvd_left
    exact h6
  have h5 : b ≥ 6 := by
    have h6 : b % 10 = 4 := h₂
    omega -- #1 (Now Correct)
  by_contra h
  push_neg at h
  have h4 : b = 4 := by
    rw [h₂]
    apply Nat.gcd_le_right
    omega
  have h5 : a % 10 = 2 := h₁
  have h6 : a.gcd 4 = 6 := by
    apply Nat.gcd_le_right
    norm_num at h
    linarith
  push_neg at h
  have h7 : a ≤ 100 := by
    by_contra h'
    push_neg at h'
    have h7 : a ≥ 101 := by linarith
    have h8 : b ≥ 6 := by linarith
    have h9 : b ≤ 100 := by linarith
    rw [h9] at h8
    norm_num at h8
  have h8 : b ≤ 100 := by
    by_contra h'
    push_neg at h'
    have h9 : b ≤ 107 := by
      have h10 : a * b < 648 := h₆
      have h11 : a ≥ 6 := h₄
      linarith
    have h12 : b > 100 := h'
    have h13 : b ≤ 107 := h₉
    have h14 : b % 10 = 4 := h₂
    have h15 : b ≥ 101 := by omega
    interval_cases b <;> omega
  interval_cases a <;> interval_cases b <;> norm_num at *
    <;> try { contradiction }
```
* **Corrections:** The erroneous `omega` tactics are replaced with more detailed justifications (e.g., using `Nat.gcd_dvd_left`). The complex `nlinarith` block is replaced with a structured case analysis (`interval_cases`) and simpler arithmetic reasoning.
### Key Observations
1. **Iterative Repair Loop:** The core of the algorithm is a feedback loop that repeatedly refines a proof until it compiles or a limit is reached.
2. **Hybrid Approach:** It combines traditional automated theorem proving components (Syntax Refiner, Auto Solver) with a modern LLM for creative proof repair.
3. **Error Localization:** The "Invalid Proof" block visually highlights specific failing tactics (`omega`, `nlinarith`), which are the targets for the LLM's repair.
4. **Proof Strategy Shift:** The correct proof abandons the overly broad `nlinarith` tactic in favor of more granular, step-by-step reasoning and case analysis, which is a common pattern in human-written proofs.
5. **Spatial Layout:** The flowchart flows left-to-right, with the critical decision loops on the left side. The proof comparison is placed directly below, creating a clear "problem" (left) and "solution" (right) visual relationship.
### Interpretation
The diagram portrays automated proof repair as a neuro-symbolic loop: the structured pipeline and the Lean4 verifier supply formal ground truth, while the LLM proposes alternative reasoning where automated solvers fail. The flowchart defines the process; the code blocks show its input and output, with red and green borders marking failure and success. Notably, the repaired proof is not a minor tweak but a different proof strategy, which suggests the LLM reasons about proof structure rather than merely patching syntax, and the attempt limit `c` guards against unbounded repair of irreparable proofs. The whole system hinges on Lean4 as the ground-truth verifier.
</details>
Figure 2: Overview of the Apollo framework.
## 3 Our Approach
In this section we describe Apollo, our framework for transforming an LLM's raw proof sketch into a fully verified Lean4 proof. Figure 2 illustrates the Apollo pipeline together with a Lean proof before and after repair. We observe that the LLM is often capable of producing a valid proof sketch, yet it struggles to close fine-grained goals. For example, the statements h4, h5, h7, and h8 are correct, but the generated proofs of these sub-lemmas throw compilation errors. Apollo identifies and fixes such proof steps and guides the LLM toward a correct solution.
Our compiler interface of choice is the Lean REPL leanrepl , a Read–Eval–Print Loop for Lean4. It provides a complete list of compilation errors and warnings, each with a detailed description and the location (for example, the line number) where it occurred. Errors include syntax mistakes, nonexistent tactics, unused tactics, and more. We aggregate this information to drive our agent's dynamic code repair and sub-problem extraction. Pseudo-code for our framework appears in Algorithm 1 in the Appendix.
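The aggregation step can be sketched as follows, assuming the JSON message shape emitted by the leanprover-community Lean REPL (`messages` with `severity`/`pos`/`data`, plus a separate `sorries` list); the helper `collect_diagnostics` is our illustration, not part of Apollo's codebase:

```python
import json

def collect_diagnostics(repl_response: dict) -> list[dict]:
    """Flatten a Lean REPL response into (severity, line, text) records.

    Assumes the JSON shape used by leanprover-community/repl, e.g.
    {"messages": [{"severity": "error", "pos": {"line": 3, "column": 2},
                   "data": "unknown tactic"}], "sorries": [...]}.
    """
    diags = []
    for msg in repl_response.get("messages", []):
        diags.append({
            "severity": msg["severity"],
            "line": msg["pos"]["line"],
            "text": msg["data"],
        })
    # Unsolved `sorry` goals are reported separately; treat them as repair targets.
    for s in repl_response.get("sorries", []):
        diags.append({
            "severity": "sorry",
            "line": s["pos"]["line"],
            "text": s.get("goal", ""),
        })
    return diags

response = json.loads(
    '{"messages": [{"severity": "error", '
    '"pos": {"line": 5, "column": 4}, "data": "linarith failed"}], '
    '"sorries": [{"pos": {"line": 9, "column": 2}, "goal": "a >= 6"}]}'
)
print(collect_diagnostics(response))
```

Records like these, keyed by line number, are what the later stages consume to decide which line or block to repair.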
### 3.1 Syntax Refiner
The Syntax Refiner catches and corrects superficial compilation errors: missing commas, wrong keywords (e.g., Lean3's from vs. Lean4's := by), misplaced brackets, and so on. Such errors arise when a general-purpose LLM (such as o3-mini openai-o3-mini-2025 or o4-mini openai-o4-mini-2025 ) has not been specialized on Lean4 syntax. By normalizing these common mistakes, the module ensures that subsequent stages operate on syntactically valid code. Note that this module mainly benefits general-purpose models not explicitly trained for theorem proving in Lean; it is rarely invoked on proofs generated by LLMs trained for formal theorem proving.
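To illustrate the idea, here is a minimal, hypothetical rule set mapping common Lean3-isms to their Lean4 counterparts; the paper does not specify the Syntax Refiner's actual rules, so both the rules and the function name below are our own:

```python
import re

# Illustrative normalization rules (not the module's real rule set):
# each pattern rewrites a common Lean3 construct into Lean4 syntax.
RULES = [
    (re.compile(r"\bbegin\b"), "by"),               # Lean3 proof blocks
    (re.compile(r"\bend\b\s*$"), ""),               # dangling Lean3 `end`
    (re.compile(r"λ\s*(\w+)\s*,"), r"fun \1 =>"),   # lambda arrow syntax
    (re.compile(r":=\s*from\b"), ":= by exact"),    # Lean3 `from` term proofs
]

def refine_syntax(src: str) -> str:
    """Apply every rewrite rule to every line of the proof source."""
    lines = []
    for line in src.splitlines():
        for pat, repl in RULES:
            line = pat.sub(repl, line)
        lines.append(line)
    return "\n".join(lines)

print(refine_syntax("theorem t : 1 = 1 := from rfl"))
```

A real refiner would also use the REPL's error locations to apply rewrites only where the compiler actually complains, rather than rewriting blindly.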
### 3.2 Sorrifier
The Sorrifier patches any remaining compilation failures by inserting Lean's sorry placeholder. We first parse the failed proof into a tree of nested proof blocks (with sub-lemmas as children). We then compile the proof with the Lean REPL leanrepl , detect the offending line or block, and apply one of three repairs:
1. Line removal, if a single tactic or declaration is invalid but its surrounding block may still succeed.
1. Block removal, if the entire sub‑proof is malformed.
1. Insert sorry, if the block compiles but leaves unsolved goals open.
We repeat this procedure until the file compiles without errors. At that point, every remaining sorry marks a sub-lemma to be proved in later stages of Apollo. This stage guarantees that the proof compiles in Lean, albeit with warnings about the remaining sorry statements.
Each sorry block corresponds to a sub-problem that the LLM failed to prove. Such blocks may fail to type-check for various reasons, most commonly LLM hallucination. The feedback obtained via the REPL lets us automatically catch these hallucinations, insert a sorry placeholder, and remove the invalid proof lines.
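A much-simplified sketch of the sorry-insertion repair is shown below. The real Sorrifier works on a parsed block tree and chooses among line removal, block removal, and sorry insertion based on REPL feedback; this toy version (the function name is ours) only swaps reported failing lines for sorry, using indentation as a stand-in for block structure:

```python
def sorrify(proof_lines: list[str], failing_lines: set[int]) -> list[str]:
    """Replace each failing tactic line with `sorry` at the same indentation.

    `failing_lines` holds 1-indexed line numbers reported by the Lean REPL.
    """
    out = []
    for i, line in enumerate(proof_lines, start=1):
        if i in failing_lines:
            indent = len(line) - len(line.lstrip())
            out.append(" " * indent + "sorry")  # keep the block well-indented
        else:
            out.append(line)
    return out

broken = [
    "theorem demo (a : Nat) (h : a % 10 = 2) : a >= 2 := by",
    "  have h1 : a >= 2 := by",
    "    nlinarith [h]",  # suppose the REPL reports an error on line 3
    "  exact h1",
]
print("\n".join(sorrify(broken, {3})))
```

After this pass the file compiles (with sorry warnings), and each placeholder marks a sub-goal for the Auto Solver and the LLM.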
### 3.3 Auto Solver
The Auto Solver targets each sorry block in turn. It first invokes Lean4's hint tactic to close the goal. If goals persist, it applies built‑in solvers (nlinarith, ring, simp, etc.) wrapped in try to test combinations automatically. Blocks whose goals remain open stay marked with sorry. After applying the Auto Solver, a proof may be complete with no sorry placeholders remaining, in which case Apollo has already succeeded in fixing the proof. Otherwise, the process can repeat recursively.
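One way to organize this search is to enumerate candidate closing tactics, singletons first and then short alternations, and try each in place of a sorry. The tactic list and the use of Lean4's `first` combinator below are illustrative:

```python
from itertools import combinations

# Built-in finishing tactics the Auto Solver tries (list is illustrative).
TACTICS = ["nlinarith", "ring", "simp", "norm_num", "omega"]

def candidate_replacements(max_combo=2):
    """Yield tactic snippets to try in place of a `sorry`: singletons
    first, then short alternations of increasing length."""
    for t in TACTICS:
        yield t
    for k in range(2, max_combo + 1):
        for combo in combinations(TACTICS, k):
            # `first | t1 | t2` tries each alternative in order in Lean 4.
            yield "first | " + " | ".join(combo)

cands = list(candidate_replacements())
```

Each candidate is cheap to verify, since the REPL only needs to re-check the affected block rather than the whole file.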
### 3.4 Recursive reasoning and repair
In the case where a proof still has some remaining sorry statements after applying the Auto Solver, Apollo can consider each as a new lemma, i.e., extract its local context (hypotheses, definitions, and any lemmas proved so far), and recursively try to prove the lemmas by prompting the LLM and repeating the whole process of verification, syntax refining, sorrifying, and applying the Auto Solver.
At each of these recursive iterations, a lemma may be completely proved, or it may end up with an incomplete proof containing one or a few sorry blocks. This allows Apollo to make progress on the original theorem by breaking down the incomplete steps further and further until the LLM or the Auto Solver can close the goals. This process can continue up to a user‑specified recursion depth $r$ .
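The recursion can be sketched as follows. All helper callables (`prove`, `verify`, `auto_solve`, `assemble`) are assumed interfaces, not the framework's actual API; Algorithm 1 in the Appendix gives the authors' pseudo-code:

```python
def repair(goal, depth, r, prove, verify, auto_solve, assemble):
    """Recursive repair loop (sketch). `prove` samples LLM proofs for a
    goal, `verify` returns the remaining open sub-goals, `auto_solve`
    applies built-in tactics, `assemble` splices sub-proofs back in."""
    if depth > r:                       # respect the user-set recursion depth
        return None
    for proof in prove(goal):
        proof = auto_solve(proof)
        open_goals = verify(proof)
        if not open_goals:
            return proof                # proof is complete
        subproofs = {}
        for sub_goal in open_goals:     # each remaining `sorry` is a new lemma
            sub = repair(sub_goal, depth + 1, r,
                         prove, verify, auto_solve, assemble)
            if sub is None:
                break                   # this candidate proof is a dead end
            subproofs[sub_goal] = sub
        else:
            return assemble(proof, subproofs)
    return None                         # budget exhausted at this depth

# Demo with stub helpers: the "LLM" proves every goal in one shot.
demo = repair("lemma t", 0, 2,
              prove=lambda g: [g + " := by trivial"],
              verify=lambda p: [],
              auto_solve=lambda p: p,
              assemble=lambda p, s: p)
```

The depth cap $r$ is what keeps the otherwise unbounded branching of sub-lemmas under user control.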
### 3.5 Proof Assembler
Finally, the Proof Assembler splices all repaired blocks back into a single file and verifies that no sorry or admit (alias for "sorry") commands remain. If the proof still fails, the entire pipeline repeats (up to a user‑specified recursion depth $r$ ), allowing further rounds of refinement and sub‑proof generation.
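The splicing and the final completeness check are simple text operations. The sketch below fills sorry holes left to right with repaired sub-proofs (a simplification of the block-aware assembly described above):

```python
import re

# A proof is only final once no placeholder survives anywhere in the file.
INCOMPLETE = re.compile(r"\b(sorry|admit)\b")

def assemble(skeleton: str, subproofs: list) -> str:
    """Fill each `sorry` hole, left to right, with its repaired sub-proof."""
    for body in subproofs:
        skeleton = skeleton.replace("sorry", body, 1)
    return skeleton

def is_complete(proof: str) -> bool:
    """True only if no `sorry` or `admit` placeholder remains."""
    return INCOMPLETE.search(proof) is None

final = assemble(
    "theorem t : P ∧ Q := by\n  constructor\n  · sorry\n  · sorry",
    ["exact hp", "exact hq"],
)
```

A proof that passes `is_complete` still goes through one last full REPL verification before being accepted.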
Apollo's staged approach (syntax cleaning, "sorrifying," automated solving, and targeted LLM‑driven repair) yields improvements in both proof‑sample efficiency and final proof correctness.
## 4 Experiments
In this section, we present empirical results for Apollo on the miniF2F dataset zheng2022minif2fcrosssystembenchmarkformal , which comprises 488 formalized problems drawn from AIME, IMO, and AMC competitions. The benchmark is evenly split into validation and test sets (244 problems each); here we report results on the miniF2F‑test partition. To demonstrate Apollo's effectiveness, we evaluate it on a range of state‑of‑the‑art whole‑proof generation models. All experiments use Lean v4.17.0 and run on eight NVIDIA A5000 GPUs with 128 CPU cores. We used a top‑32 sampling budget during sub‑proof generation unless stated otherwise. Baseline numbers are sourced from each model's original publication.
### 4.1 The effect of applying Apollo on top of the base models
Figure 3 shows the impact of Apollo on two state‑of‑the‑art whole‑proof generation models: Goedel‑Prover‑SFT lin2025goedelproverfrontiermodelopensource and Kimina‑Prover‑Preview‑Distill‑7B kimina_prover_2025 . Applying Apollo increases Goedel‑Prover‑SFT's top accuracy from 64.7% to 65.6% while reducing its sampling budget by two orders of magnitude (from 25,600 generated samples to only a few hundred on average). Similarly, Kimina‑Prover‑Preview‑Distill‑7B achieves a new best accuracy of 75.0% with roughly one‑third of the previous sample budget. We report the exact numbers in Table 1.
<details>
<summary>x4.png Details</summary>

### Visual Description
Two side-by-side line charts of accuracy (%) on miniF2F-test against the pass budget $K$. Left ("Performance of Goedel-Prover-SFT", log-scale x-axis): the base model (blue) rises gradually from ~57.5% at $K=32$ to ~64.5% at $K=21{,}600$, while Goedel-Prover-SFT + Apollo (orange) climbs more steeply, reaching ~65% by $K\approx 306$. Right ("Performance of Kimina-Prover-Preview-Distill-7B", linear x-axis): the base model (red) improves steadily to ~70.8% at $K=1024$, while the + Apollo variant (green) rises sharply at low budgets and plateaus near 75% after roughly 400 samples. In both panels the Apollo curve dominates the base curve at every budget, with the largest gains at low budgets.
</details>
(a) Model accuracy against sampling budget
<details>
<summary>x5.png Details</summary>

### Visual Description
Two side-by-side line charts of accuracy (%) on miniF2F-test against the generated-token budget $N$ (log-scale x-axis). Left ("Performance of Goedel-Prover-SFT"): the base model (blue) rises from ~57.5% at 16.1K tokens to ~64.5% at 12.7M, while Goedel-Prover-SFT + Apollo (orange) reaches ~65.5% at only 400K tokens. Right ("Performance of Kimina-Prover-Preview-Distill-7B"): the base model (red) rises from ~63.5% at 140K tokens to ~71% at 4.5M, while the + Apollo variant (green) reaches ~74.5% by 1.3M tokens and plateaus near 75%. In both panels the Apollo-aided model attains higher accuracy at a substantially smaller token budget.
</details>
(b) Model accuracy against generated token budget
Figure 3: Performance of base models against the Apollo aided models on miniF2F-test dataset at different sample budgets and token length.
Note that Apollo’s sampling budget is not fixed: it depends on the recursion depth $r$ and the LLM’s ability to generate sub‑lemmas. For each sub‑lemma that Auto Solver fails to close, we invoke the LLM with a fixed top‑32 sampling budget. Because each generated sub‑lemma may spawn further sub‑lemmas, the total sampling overhead grows recursively, making any single @K budget difficult to report. Sub‑lemmas also tend to require far fewer tactics (and thus far fewer tokens) than the main theorem. A theorem might need 100 lines of proof while a sub-lemma might need only 1 line. Hence, sampling cost does not scale one‑to‑one. Therefore, to compare with other whole-proof generation approaches, we report the average number of proof samples and token budgets. We present more in-depth analysis of inference budgets in Section B of the Appendix.
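To make the recursive overhead concrete, suppose (as a simplifying assumption; in practice the Auto Solver closes many goals without any LLM call) that each proof attempt spawns at most $b$ unproved sub‑lemmas per round. With the fixed per‑goal budget $K = 32$, the total number of LLM samples after $r$ recursive rounds is then bounded by a geometric sum:

$$
S \;\le\; K \sum_{i=0}^{r} b^{i} \;=\; K\,\frac{b^{r+1}-1}{b-1}.
$$

The average sample counts we report are far below this worst case, since the recursion rarely reaches full depth on every branch.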
Table 1: Table results comparing sampling cost, accuracy, and token usage before and after applying Apollo. “Cost” denotes the sampling budget $K$ for whole‑proof generation and the recursion depth $r$ for Apollo. Column $N$ reports the average number of tokens generated per problem in "Chain‑of‑Thought" mode. Bold cells highlight the best accuracy. Results are shown on the miniF2F‑test dataset for two models: Kimina‑Prover‑Preview‑Distill‑7B and Goedel‑SFT.
| Goedel‑Prover‑SFT | | | | | | Kimina‑Prover‑Preview‑Distill‑7B | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base Model | | | + Apollo | | | Base Model | | | + Apollo | | |
| $K$ | $N$ | Acc. (%) | $r$ | $N$ | Acc. (%) | $K$ | $N$ | Acc. (%) | $r$ | $N$ | Acc. (%) |
| 32 | 16.1K | 57.6% | 0 | 16.1K | 57.6% | 32 | 140K | 63.1% | 0 | 140K | 63.1% |
| 64 | 31.8K | 59.2% | 2 | 38.3K | 60.7% | | | | 2 | 406K | 68.9% |
| 3200 | 1.6M | 62.7% | 4 | 74.6K | 63.5% | | | | 4 | 816K | 74.2% |
| 25600 | 12.7M | 64.7% | 6 | 179.0K | **65.6%** | 1024 | 4.5M | 70.8% | 6 | 1.3M | **75.0%** |
### 4.2 Comparison with SoTA LLMs
Table 2 compares whole‑proof generation and tree‑search methods on miniF2F‑test. For each approach, we report its sampling budget, defined as the top‑ $K$ parameter for LLM generators or the equivalent search‑depth parameters for tree‑search provers. Since Apollo's budget varies with recursion depth $r$ , we instead report the mean number of proofs sampled during the procedure. We also report the average number of generated tokens as an additional measure of the computational cost of proof generation. For some models we were unable to collect generated‑token counts; we leave those entries blank and report the sampling budget only.
When Apollo is applied on top of any generator, it establishes a new best accuracy for that model. For instance, Goedel‑Prover‑SFT's accuracy jumps to 65.6%, surpassing the previous best result (which required a sampling budget of 25,600); Apollo requires only 362 samples on average. Likewise, Kimina‑Prover‑Preview‑Distill‑7B sees a 4.2% accuracy gain while reducing its sample budget below the original 1,024. To further validate Apollo's generalizability, we tested the repair framework on Goedel-V2-8B lin2025goedelproverv2 , the current state-of-the-art theorem prover. We observe that, at similar sample and token budgets, Apollo achieves a 0.9% accuracy gain, whereas the base LLM requires twice the budget to reach the same accuracy. Overall, Apollo not only boosts accuracy but does so with a smaller sampling budget, setting a new state-of-the-art result among sub‑8B‑parameter LLMs at a per‑invocation sampling budget of 32.
We also evaluate general‑purpose LLMs (OpenAI o3‑mini and o4‑mini). Without Apollo, these models rarely produce valid Lean4 proofs, since they default to Lean3 syntax or introduce trivial compilation errors. Yet they have a remarkable grasp of mathematical concepts, which is evident in the proof structures they often produce. By applying Apollo's syntax corrections and solver‑guided refinements, their accuracies increase from single digits (3–7%) to over 40%. These results demonstrate that in many scenarios discarding and regenerating the whole proof is overly wasteful, and with a little help from the Lean compiler, we can achieve better accuracies while sampling fewer proofs from LLMs.
Table 2: Comparison of Apollo accuracy against state-of-the-art models on the miniF2F-test dataset. Token budget is computed over all generated tokens by LLM.
| Method | Model size | Sample budget | Token budget | miniF2F-test |
| --- | --- | --- | --- | --- |
| Tree Search Methods | | | | |
| Hypertree Proof Search (lample2022hypertree, ) | 0.6B | $64 \times 5000$ | - | 41.0% |
| InternLM2.5-SP+BFS+CG (wu2024internlm25stepproveradvancingautomatedtheorem, ) | 7B | $256 \times 32 \times 600$ | - | 65.9% |
| HPv16+BFS+DC (li2025hunyuanproverscalabledatasynthesis, ) | 7B | $600 \times 8 \times 400$ | - | 68.4% |
| BFS-Prover (xin2025bfsproverscalablebestfirsttree, ) | 7B | $2048 \times 2 \times 600$ | - | 70.8% |
| Whole-proof Generation Methods | | | | |
| DeepSeek-R1-Distill-Qwen-7B (deepseekai2025deepseekr1incentivizingreasoningcapability, ) | 7B | 32 | - | 42.6% |
| Leanabell-GD-RL (zhang2025leanabellproverposttrainingscalingformal, ) | 7B | 128 | - | 61.1% |
| STP (dong2025stp, ) | 7B | 25600 | - | 67.6% |
| o3-mini (openai-o3-mini-2025, ) | - | 1 | - | 3.3% |
| | | 32 | - | 24.6% |
| o4-mini (openai-o4-mini-2025, ) | - | 1 | - | 7.0% |
| Goedel-SFT (lin2025goedelproverfrontiermodelopensource, ) | 7B | 32 | 16.1K | 57.6% |
| | | 3200 | 1.6M | 62.7% |
| | | 25600 | 12.7M | 64.7% |
| Kimina-Prover-Preview-Distill (kimina_prover_2025, ) | 7B | 1 | 4.4K | 52.5% |
| | | 32 | 140K | 63.1% |
| | | 1024 | 4.5M | 70.8% |
| Goedel-V2 (lin2025goedelproverv2, ) | 8B | 32 | 174K | 83.3% |
| | | 64 | 349K | 84.0% |
| | | 128 | 699K | 84.9% |
| Whole-proof Generation Methods + Apollo | | | | |
| o3-mini + Apollo | - | 8 | - | 40.2% (+36.9%) |
| o4-mini + Apollo | - | 15 | - | 46.7% (+39.7%) |
| Goedel-SFT + Apollo | 7B | 362 | 179K | 65.6% (+0.9%) |
| Kimina-Preview + Apollo | 7B | 307 | 1.3M | 75.0% (+4.2%) |
| Goedel-V2 + Apollo | 8B | 63 | 344K | 84.9% (+0.9%) |
To further assess Apollo, we conducted experiments on PutnamBench tsoukalas2024putnambenchevaluatingneuraltheoremprovers , using Kimina-Prover-Preview-Distill-7B as the base model. Under the whole-proof generation pipeline, the LLM produced 10 valid proofs. With Apollo, it produced one additional valid proof while consuming nearly half as many tokens. Results are presented in Table 3.
Overall, our results on a variety of LLMs and benchmarks demonstrate that Apollo consistently consumes fewer tokens while achieving higher accuracy, highlighting its effectiveness.
Table 3: Comparison of Apollo accuracy on the PutnamBench dataset. Token budget is computed over all generated tokens by LLM.
| Method | Model size | Sample budget | Token budget | PutnamBench |
| --- | --- | --- | --- | --- |
| Kimina-Preview | 7B | 32 | 180K | 7/658 |
| Kimina-Preview | 7B | 192 | 1.1M | 10/658 |
| Kimina-Preview+Apollo | 7B | 108 | 579K | 11/658 |
### 4.3 Distribution of Proof Lengths
We assess how Apollo affects proof complexity by examining proof‑length distributions before and after repair. Here, proof length is defined as the total number of tactics in a proof, which serves as a proxy for proof complexity.
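A minimal operationalization of this metric, under the simplifying assumption that tactics occupy their own lines or are joined by `;`/`<;>` combinators (the paper's exact counting procedure may differ), is:

```python
def proof_length(proof: str) -> int:
    """Approximate tactic count: one tactic per non-empty, non-comment
    line after the `:= by` header, with `;`-joined tactics counted
    separately (a simplification of real Lean tactic structure)."""
    lines = proof.splitlines()
    body = lines[1:] if lines and ":= by" in lines[0] else lines
    count = 0
    for line in body:
        stripped = line.strip()
        if stripped and not stripped.startswith("--"):
            count += sum(1 for part in stripped.split(";") if part.strip())
    return count
```

Counting tactics rather than characters makes the metric insensitive to identifier length and whitespace.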
Figure 4 presents violin plots for three base models: Kimina‑Prover‑Preview‑Distill‑7B, Goedel‑Prover‑SFT, and o4‑mini. Each subplot shows two non‑overlapping sets: proofs generated directly by the base model (“before”) and proofs produced after applying Apollo (“after”). We only consider valid proofs verified by REPL in this setup.
The proofs that Apollo, in collaboration with the LLM, succeeds in fixing are considerably longer. In each model scenario, the mean length of proofs fixed by Apollo is at least twice that of proofs written by the LLM alone. This increase indicates that the proofs which the base model alone could not complete require longer, more structured reasoning chains. By decomposing challenging goals into smaller sub‑lemmas, Apollo enables the generation of these extended proofs, thereby improving the overall success rate.
These results demonstrate that Apollo not only raises accuracy but also systematically guides models to construct deeper, more rigorous proof structures capable of solving harder theorems.
(a) Kimina-Preview-Distill-7B
<details>
<summary>x7.png Details</summary>

### Visual Description
Violin plots (with embedded box plots) of proof-length distributions for Goedel-Prover-SFT: proofs written by the base model have a mean length of 6.5 tactics and a tight, right-skewed distribution concentrated at short proofs, while proofs completed with Apollo have a mean length of 13.0 tactics and a wider, less skewed distribution shifted toward longer proofs.
</details>
(b) Goedel-SFT
(c) OpenAI o4-mini
Figure 4: Distribution of proof lengths for selected models before (left) and after (right) applying Apollo. In the “before” plots, proof lengths cover only proofs the base model proved without the help of Apollo; in the “after” plots, proof lengths cover only proofs proved with Apollo’s assistance.
Figure 5: Performance of Apollo on miniF2F dataset at various recursion depth parameters $r$ on three different models: o4-mini, Kimina‑Prover‑Preview‑Distill‑7B and Goedel‑Prover‑SFT.
### 4.4 Impact of recursion depth $r$ on proof accuracy
We evaluate how the recursion-depth parameter $r$ affects accuracy on miniF2F-test. A recursion depth of $r=0$ corresponds to the base model’s standalone performance; higher $r$ values permit additional rounds of Apollo. Figure 5 plots accuracy versus $r$ for three models: o4-mini ($r=0,\dots,4$), Kimina-Prover-Preview-Distill-7B ($r=0,\dots,6$), and Goedel-Prover-SFT ($r=0,\dots,6$).
The o4-mini model exhibits its largest accuracy gain between $r=0$ and $r=1$, since its raw proofs often have a correct overall structure but contain many syntax and proof-step errors. The Syntax Refiner fixes most of these and, together with the Auto Solver, produces a spike in accuracy. Beyond $r=1$, accuracy largely plateaus, since o4-mini was not explicitly designed for formal theorem proving. We note that even though accuracy gains slow drastically for o4-mini, accuracy still increases at a slow rate, suggesting that pushing the parameter $r$ further could yield additional gains. In contrast, both Kimina-Prover-Preview-Distill-7B and Goedel-Prover-SFT continue to improve steadily across recursion depths. These results suggest that decomposing challenging problems into smaller parts allows the model to close simpler goals, gradually building towards a complete proof.
Although the total number of proof samples grows with $r$ , even $r=5$ requires only 300–400 LLM requests on average, an order of magnitude below typical budgets in the literature. Moreover, one could further increase $r$ to explore budgets in the thousands if desired.
### 4.5 Ablation Studies on different components of Apollo
To evaluate the contribution of each Apollo component, we conducted an ablation study on two base models: o4-mini (a general-purpose model) and Kimina-Prover-Preview-Distill-7B (a theorem-prover model). Results are reported in Table 4. We decomposed Apollo into three key modules: Syntax Refiner, Auto Solver, and LLM Re-invoker. The LLM Re-invoker is the module that re-invokes the LLM to prove the sorrified sub-goals. The Sorrifier and Proof Assembler modules are excluded from this analysis, as they are essential for decomposing problems into sub-goals and assembling proofs, and thus cannot be ablated.
From the results, we observe that the Syntax Refiner alone provides only a marginal benefit, and its effect is limited to the general-purpose model o4-mini; it yields no improvement on the dedicated prover. The Auto Solver, used in isolation, also produces little accuracy gain, indicating that built-in solvers usually cannot prove the failed sub-goals left after the first round of LLM proof generation. Further reasoning and proof decomposition on those sub-goals by the LLM, however, can significantly increase accuracy. The largest gains arise when the Auto Solver and LLM Re-invoker are combined, which corresponds to the full Apollo framework. This demonstrates that Apollo’s repair mechanism is crucial for achieving robust performance across both model types. We extend the ablation to individual module triggers during inference in Section E of the Appendix.
Table 4: Ablation study of different parts of Apollo on o4-mini (general purpose model) and Kimina-Prover-Preview-Distill-7B (theorem prover model).
| Syntax Refiner | Auto Solver | LLM Re-Invoker | o4-mini | Kimina-7B |
| --- | --- | --- | --- | --- |
| ✗ | ✗ | ✗ | 7.0% | 63.1% |
| ✓ | ✗ | ✗ | 7.4% | 63.1% |
| ✗ | ✓ | ✗ | 7.0% | 63.5% |
| ✗ | ✗ | ✓ | 8.2% | 69.3% |
| ✓ | ✓ | ✗ | 20.5% | 63.5% |
| ✓ | ✗ | ✓ | 18.9% | 69.3% |
| ✗ | ✓ | ✓ | 10.2% | 75.0% |
| ✓ | ✓ | ✓ | 46.7% | 75.0% |
## 5 Limitations
**Integration with tree-search methods.** Our evaluation focuses exclusively on applying Apollo to whole-proof generation by LLMs and does not explore its interaction with traditional tree-search provers (e.g. BFS, MCTS). Investigating how lemma decomposition and targeted repair could accelerate or prune search paths in those systems is a promising direction for future work.
**Dependence on base model’s proof sketch quality.** Apollo’s ability to produce a correct formal proof largely depends on the base model and whether its initial attempt contains a coherent proof sketch. As Yousefzadeh et al. (44) observe, models often tend to “cut corners,” producing proofs that are much shorter than the rigorous, multi-step arguments required for a valid proof. When a base model fails to devise a correct proof strategy (e.g. omitting critical lemmas, suggesting irrelevant tactics, or taking a wrong approach), Apollo is less likely to repair such a proof. Enhancing the robustness of Apollo to very poor initial sketches remains an open challenge.
## 6 Conclusion
In this work, we present Apollo, a novel, modular, fully automated agentic system that combines syntax cleaning, automatic solvers, and LLM-driven sub-proof generation to transform an LLM’s initial proof sketch into a fully verified Lean 4 proof. This framework harnesses the full power of the Lean compiler and integrated solvers, merging them with LLM systems. Applied across five whole-proof generation models, ranging from general-purpose LLMs (o3-mini, o4-mini) to specialized provers (Kimina-Prover-Preview-Distill-7B, Goedel-Prover-SFT, Goedel-V2), Apollo consistently establishes new best accuracies on the miniF2F benchmark while reducing token and sampling budgets by one to two orders of magnitude. Our empirical analysis shows that Apollo not only raises overall proof success rates but also guides models to generate longer, more structured proofs, as evidenced by a drastic increase in average successful proof length. We further demonstrate how the recursion-depth parameter $r$ trades off sample complexity against accuracy, achieving robust gains with only a few hundred samples. We believe that Apollo’s collaboration between LLMs, automatic solvers, and the Lean compiler defines a paradigm of agentic systems that produce high-quality, verifiable proofs for increasingly challenging theorems.
## Acknowledgments
The work of Farzan Farnia is partially supported by a grant from the Research Grants Council of the Hong Kong Special Administrative Region, China, Project 14209920, and is partially supported by CUHK Direct Research Grants with CUHK Project No. 4055164 and 4937054. Also, the authors would like to thank the anonymous reviewers for their constructive feedback.
## References
- (1) Kaiyu Yang, Gabriel Poesia, Jingxuan He, Wenda Li, Kristin Lauter, Swarat Chaudhuri, and Dawn Song. Formal mathematical reasoning: A new frontier in AI. In Proceedings of the International Conference on Machine Learning, 2025.
- (2) Google DeepMind. AI achieves silver-medal standard solving international mathematical olympiad problems. https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/, 2024. Accessed: 2025-05-08.
- (3) Trieu Trinh, Yuhuai Wu, Quoc Le, He He, and Thang Luong. Solving olympiad geometry without human demonstrations. Nature, 2024.
- (4) Leonardo de Moura and Sebastian Ullrich. The Lean 4 theorem prover and programming language. In Automated Deduction – CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings, page 625–635, Berlin, Heidelberg, 2021. Springer-Verlag.
- (5) Alex Kontorovich and Terence Tao. Prime number theorem and more. https://github.com/AlexKontorovich/PrimeNumberTheoremAnd, 2024. Version 0.1.0, released 2024-01-09.
- (6) Kevin Buzzard. Mathematical reasoning and the computer. Bulletin of the American Mathematical Society, 2024.
- (7) George Tsoukalas, Jasper Lee, John Jennings, Jimmy Xin, Michelle Ding, Michael Jennings, Amitayush Thakur, and Swarat Chaudhuri. Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition, 2024.
- (8) Kunhao Zheng, Jesse Michael Han, and Stanislas Polu. miniF2F: a cross-system benchmark for formal Olympiad-level mathematics, 2022.
- (9) Zhangir Azerbayev, Bartosz Piotrowski, Hailey Schoelkopf, Edward W. Ayers, Dragomir Radev, and Jeremy Avigad. Proofnet: Autoformalizing and formally proving undergraduate-level mathematics, 2023.
- (10) Jia Li, Edward Beeching, Lewis Tunstall, Ben Lipkin, Roman Soletskyi, Shengyi Costa Huang, Kashif Rasul, Longhui Yu, Albert Jiang, Ziju Shen, Zihan Qin, Bin Dong, Li Zhou, Yann Fleureau, Guillaume Lample, and Stanislas Polu. Numinamath. https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf, 2024. GitHub repository.
- (11) Jiewen Hu, Thomas Zhu, and Sean Welleck. miniCTX: Neural theorem proving with (long-)contexts. In The Thirteenth International Conference on Learning Representations, 2025.
- (12) Dongwei Jiang, Marcio Fonseca, and Shay B Cohen. Leanreasoner: Boosting complex logical reasoning with lean. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7490–7503, 2024.
- (13) Quanjun Zhang, Chunrong Fang, Yuxiang Ma, Weisong Sun, and Zhenyu Chen. A survey of learning-based automated program repair. ACM Trans. Softw. Eng. Methodol., 33(2), December 2023.
- (14) Kavi Gupta, Peter Ebert Christensen, Xinyun Chen, and Dawn Song. Synthesize, execute and debug: Learning to repair for neural program synthesis. In Advances in Neural Information Processing Systems, volume 33, pages 17685–17695. Curran Associates, Inc., 2020.
- (15) Michihiro Yasunaga and Percy Liang. Break-it-fix-it: Unsupervised learning for program repair. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 11941–11952. PMLR, 18–24 Jul 2021.
- (16) Berkay Berabi, Jingxuan He, Veselin Raychev, and Martin Vechev. Tfix: Learning to fix coding errors with a text-to-text transformer. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 780–791. PMLR, 18–24 Jul 2021.
- (17) Miltiadis Allamanis, Henry Jackson-Flux, and Marc Brockschmidt. Self-supervised bug detection and repair. In Advances in Neural Information Processing Systems, volume 34, pages 27865–27876. Curran Associates, Inc., 2021.
- (18) Talia Ringer. Proof Repair. PhD thesis, University of Washington, 2021. Ph.D. dissertation.
- (19) Yves Bertot and Pierre Castéran. Interactive theorem proving and program development: Coq’Art: the calculus of inductive constructions. Springer Science & Business Media, 2013.
- (20) Albert Qiaochu Jiang, Wenda Li, Szymon Tworkowski, Konrad Czechowski, Tomasz Odrzygóźdź, Piotr Miłoś, Yuhuai Wu, and Mateja Jamnik. Thor: Wielding hammers to integrate language models and automated theorem provers. In Advances in Neural Information Processing Systems, 2022.
- (21) Emily First, Markus N. Rabe, Talia Ringer, and Yuriy Brun. Baldur: Whole-proof generation and repair with large language models, 2023.
- (22) Peiyang Song, Kaiyu Yang, and Anima Anandkumar. Towards large language models as copilots for theorem proving in lean. In The 3rd Workshop on Mathematical Reasoning and AI at NeurIPS’23, 2023.
- (23) Stanislas Polu, Jesse Michael Han, Kunhao Zheng, Mantas Baksys, Igor Babuschkin, and Ilya Sutskever. Formal mathematics statement curriculum learning, 2022.
- (24) Zijian Wu, Suozhi Huang, Zhejian Zhou, Huaiyuan Ying, Jiayu Wang, Dahua Lin, and Kai Chen. Internlm2.5-stepprover: Advancing automated theorem proving via expert iteration on large-scale lean problems, 2024.
- (25) Ran Xin, Chenguang Xi, Jie Yang, Feng Chen, Hang Wu, Xia Xiao, Yifan Sun, Shen Zheng, and Kai Shen. BFS-Prover: Scalable best-first tree search for llm-based automatic theorem proving, 2025.
- (26) Guillaume Lample, Marie-Anne Lachaux, Thibaut Lavril, Xavier Martinet, Amaury Hayat, Gabriel Ebner, Aurélien Rodriguez, and Timothée Lacroix. Hypertree proof search for neural theorem proving. In Proceedings of the 36th International Conference on Neural Information Processing Systems, pages 26337–26349, 2022.
- (27) Haiming Wang et al. Dt-solver: Automated theorem proving with dynamic-tree sampling guided by proof-level value function. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12632–12646, 2023.
- (28) Kaiyu Yang, Aidan M. Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models, October 2023. arXiv:2306.15626.
- (29) Stanislas Polu and Ilya Sutskever. Generative language modeling for automated theorem proving, 2020.
- (30) Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, and Chong Ruan. Deepseek-prover-v1.5: Harnessing proof assistant feedback for reinforcement learning and monte-carlo tree search, 2024.
- (31) Yong Lin, Shange Tang, Bohan Lyu, Jiayun Wu, Hongzhou Lin, Kaiyu Yang, Jia Li, Mengzhou Xia, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover: A frontier model for open-source automated theorem proving, 2025.
- (32) Jingyuan Zhang, Qi Wang, Xingguang Ji, Yahui Liu, Yang Yue, Fuzheng Zhang, Di Zhang, Guorui Zhou, and Kun Gai. Leanabell-prover: Posttraining scaling in formal reasoning, 2025.
- (33) Kefan Dong and Tengyu Ma. STP: Self-play LLM theorem provers with iterative conjecturing and proving. arXiv preprint arXiv:2502.00212, 2025.
- (34) Haiming Wang, Mert Unsal, Xiaohan Lin, Mantas Baksys, Junqi Liu, Marco Dos Santos, Flood Sung, Marina Vinyes, Zhenzhe Ying, Zekai Zhu, Jianqiao Lu, Hugues de Saxcé, Bolton Bailey, Chendong Song, Chenjun Xiao, Dehao Zhang, Ebony Zhang, Frederick Pu, Han Zhu, Jiawei Liu, Jonas Bayer, Julien Michel, Longhui Yu, Léo Dreyfus-Schmidt, Lewis Tunstall, Luigi Pagani, Moreira Machado, Pauline Bourigault, Ran Wang, Stanislas Polu, Thibaut Barroyer, Wen-Ding Li, Yazhe Niu, Yann Fleureau, Yangyang Hu, Zhouliang Yu, Zihan Wang, Zhilin Yang, Zhengying Liu, and Jia Li. Kimina-Prover Preview: Towards large formal reasoning models with reinforcement learning, 2025.
- (35) Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen Marcus McAleer, Albert Q. Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. In The Twelfth International Conference on Learning Representations, 2024.
- (36) Albert Q. Jiang et al. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs, 2023.
- (37) Haiming Wang, Huajian Xin, Chuanyang Zheng, Zhengying Liu, Qingxing Cao, Yinya Huang, Jing Xiong, Han Shi, Enze Xie, Jian Yin, Zhenguo Li, and Xiaodan Liang. LEGO-prover: Neural theorem proving with growing libraries. In The Twelfth International Conference on Learning Representations, 2024.
- (38) Lean FRO. A read-eval-print-loop for Lean 4. https://github.com/leanprover-community/repl, 2023.
- (39) OpenAI. OpenAI o3-mini. https://openai.com/index/openai-o3-mini/, January 2025.
- (40) OpenAI. Introducing openai o3 and o4-mini. https://openai.com/index/introducing-o3-and-o4-mini/, April 2025.
- (41) Yong Lin, Shange Tang, Bohan Lyu, Ziran Yang, Jui-Hui Chung, Haoyu Zhao, Lai Jiang, Yihan Geng, Jiawei Ge, Jingruo Sun, Jiayun Wu, Jiri Gesi, David Acuna, Kaiyu Yang, Hongzhou Lin, Yejin Choi, Danqi Chen, Sanjeev Arora, and Chi Jin. Goedel-prover-v2: The strongest open-source theorem prover to date, 2025.
- (42) Yang Li, Dong Du, Linfeng Song, Chen Li, Weikang Wang, Tao Yang, and Haitao Mi. HunyuanProver: A scalable data synthesis framework and guided tree search for automated theorem proving, 2025.
- (43) DeepSeek-AI. DeepSeek-R1: Incentivizing reasoning capability in llms via reinforcement learning, 2025.
- (44) Roozbeh Yousefzadeh, Xuenan Cao, and Azim Ospanov. A Lean dataset for International Math Olympiad: Small steps towards writing math proofs for hard problems. Transactions on Machine Learning Research, 2025.
- (45) Huajian Xin, Z. Z. Ren, Junxiao Song, Zhihong Shao, Wanjia Zhao, Haocheng Wang, Bo Liu, Liyue Zhang, Xuan Lu, Qiushi Du, Wenjun Gao, Qihao Zhu, Dejian Yang, Zhibin Gou, Z. F. Wu, Fuli Luo, and Chong Ruan. DeepSeek-Prover-V1.5: Harnessing Proof Assistant Feedback for Reinforcement Learning and Monte-Carlo Tree Search, August 2024. arXiv:2408.08152.
## Appendix A Additional implementation details
Algorithm 1 presents high-level pseudocode for the Apollo framework. We highlight that the framework is built upon recursive function calls up to a maximum recursion depth given by the parameter $r$. $LLM$ refers to a proof generator model that takes in a proof statement and outputs a formal proof attempt. $LeanCompiler$ refers to the Lean REPL, which verifies the proof and outputs True or False. In the case of False, it also provides a detailed list of errors, their locations and types, along with the error messages generated by the Lean compiler. The other functions in the pseudocode ($SyntaxRefiner$, $Sorrifier$, $AutoSolver$, $ProofAssembler$) represent the sub-modules described in the main text. To give deeper insight into Apollo’s design, we now outline the rationale and responsibilities of each core module.
### A.1 Syntax Refiner
The Syntax Refiner class applies a set of rule‐based corrections to eliminate syntax errors in LLM‐generated Lean code. During model experimentation, we collected a corpus of common mistakes, such as stray Lean 3 keywords (from, begin … end), misplaced tactics, or missing delimiters, and encoded each repair as a regular‐expression rule. For example:
- Convert `from by` to `:= by`.
- Replace Lean 3 blocks (`begin … end`) with the corresponding Lean 4 structure.
- Ensure that commands like `rw` use correctly placed square brackets to avoid compilation errors.
In more complex cases, multiple regex patterns and replacements are composed to restore valid Lean 4 syntax without altering the logical content of the proof sketch.
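As a concrete illustration, the rule table described above can be realized as an ordered list of compiled regular expressions applied line by line. The sketch below is illustrative rather than APOLLO’s actual implementation; the rule set and the helper name `refine_syntax` are assumptions.

```python
import re

# Hypothetical rule table: each entry is (compiled pattern, replacement).
# The rules mirror the examples in the text, not the full APOLLO rule set.
LEAN3_TO_LEAN4_RULES = [
    (re.compile(r"\bfrom\s+by\b"), ":= by"),                # `from by` -> `:= by`
    (re.compile(r"\bbegin\b"), "by"),                       # Lean 3 block opener
    (re.compile(r"\bend\b\s*$"), ""),                       # trailing Lean 3 `end`
    (re.compile(r"\brw\s+([A-Za-z_][\w.]*)"), r"rw [\1]"),  # bracket `rw` arguments
]

def refine_syntax(proof: str) -> str:
    """Apply each regex rule to every line, leaving logical content untouched."""
    refined = []
    for line in proof.splitlines():
        for pattern, replacement in LEAN3_TO_LEAN4_RULES:
            line = pattern.sub(replacement, line)
        refined.append(line)
    return "\n".join(refined)
```

Because rules are applied in order, later rules see the output of earlier ones, which is how composed repairs (as mentioned above) can restore valid Lean 4 syntax.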
### A.2 Sorrifier
The Sorrifier module uses a feedback loop between Apollo and the Lean REPL as follows:
1. Apollo sends the current proof candidate to REPL.
1. The REPL returns a list of error messages and their source locations.
1. Apollo parses the proof into a syntax tree, navigates to each failing sub‐lemma, and attempts to fix it or replace it with a sorry placeholder.
1. The updated proof is re-submitted to the REPL, and this cycle repeats until no errors remain.
If the REPL indicates that the original theorem statement itself is malformed (e.g. due to LLM hallucination), Apollo abandons the current proof and requests a fresh generation from the LLM. One example of a malformed formal statement is when the LLM introduces a variable in a hypothesis that was not previously defined. Such a case triggers a compilation error in the formal statement; therefore, when parsing the proof, Apollo catches this error and requests a fresh proof from the LLM. To guarantee consistent variable types during lemma extraction and proof assembly, Apollo sets the following Lean options to true:
```lean
set_option pp.instanceTypes true
set_option pp.numericTypes true
set_option pp.coercions.types true
set_option pp.letVarTypes true
set_option pp.structureInstanceTypes true
set_option pp.mvars.withType true
set_option pp.coercions true
set_option pp.funBinderTypes true
set_option pp.piBinderTypes true
```
so that the compiler reports explicit types (e.g. $ℤ$ vs. $ℝ$ ). This fail-safe prevents trivialization of theorems and ensures correct typing in the final assembled proof.
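The Sorrifier’s feedback loop can be sketched in a few lines of Python. Here `check` is a hypothetical callable standing in for the Lean REPL, returning `(line, message)` pairs for failing locations; the line-level granularity of the patching is a simplifying assumption, as the real module navigates a parsed syntax tree.

```python
def sorrify(proof_lines, check, max_rounds=10):
    """Replace each failing line with `sorry` until Lean reports no errors.

    proof_lines: list of source lines of the current proof candidate.
    check: callable(src) -> list of (line_index, error_message) pairs.
    """
    for _ in range(max_rounds):
        errors = check("\n".join(proof_lines))
        if not errors:                      # no errors remain: done
            return proof_lines
        for line_no, _msg in errors:        # patch each failing location
            indent = len(proof_lines[line_no]) - len(proof_lines[line_no].lstrip())
            proof_lines[line_no] = " " * indent + "sorry"
    return proof_lines
```

A mock `check` that flags lines containing an unknown tactic is enough to exercise the loop; the real loop terminates when the REPL accepts the sorrified proof.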
### A.3 AutoSolver
The AutoSolver module takes the “sorrified” proof and applies a sequence of tactics to close as many goals as possible:
1. Hint-based tactics. It first invokes `hint`, which proposes candidate tactics. We filter out any suggestions that merely make progress on the goal and retain only those that fully discharge it.
1. Built-in tactics. If `hint` fails, AutoSolver systematically tries a suite of powerful Lean tactics, such as `nlinarith`, `norm_num`, `norm_cast`, `ring_nf`, `simp`, `simp_all`, and others, often wrapping them in `try` to catch failures without aborting.
1. Recursive fallback. Any goals that remain unsolved are reported back to Apollo. Apollo queries the REPL for the exact goal statements, spawns corresponding sub-lemmas, and then invokes the full repair loop (LLM generation, Syntax Refiner, Sorrifier, and AutoSolver) on each sub-lemma.
This tiered strategy maximizes automation while ensuring that challenging subgoals are handled via targeted recursion.
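The tiered strategy can be sketched as a simple cascade. `try_tactic` is a hypothetical callable wrapping the Lean REPL that reports whether a tactic fully closes a goal; the fallback list mirrors the tactics named above.

```python
# Built-in tactics tried in a fixed order when hint-based suggestions fail.
FALLBACK_TACTICS = [
    "nlinarith", "norm_num", "norm_cast", "ring_nf", "simp", "simp_all",
]

def auto_solve(goal, try_tactic, hint_suggestions=()):
    """Return the first tactic that fully discharges `goal`, or None.

    try_tactic: callable(goal, tactic) -> bool, True iff the goal is closed.
    hint_suggestions: candidate tactics proposed by `hint` (tier 1).
    """
    # Tier 1: keep only hint suggestions that fully close the goal.
    for tactic in hint_suggestions:
        if try_tactic(goal, tactic):
            return tactic
    # Tier 2: systematically try the built-in tactic suite.
    for tactic in FALLBACK_TACTICS:
        if try_tactic(goal, tactic):
            return tactic
    # Tier 3: unsolved -- the caller escalates to recursive LLM repair.
    return None
```

Returning `None` corresponds to the recursive fallback: the goal is reported back to Apollo, which spawns a sub-lemma and re-enters the full repair loop.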
Algorithm 1 Apollo Pseudocode for Formal Theorem Proving
1: **Input:** problem statement $ps$, current recursion depth $r_{current}$, maximum depth $r$
2: if $r_{current}$ is not initialized then
3: $r_{current} ← 0$
4: end if
5: if $r_{current} > r$ then
6: return sorry $\triangleright$ If the maximum recursion depth is reached, return sorry
7: end if
8: $proof ← LLM(ps)$ $\triangleright$ Generate proof from LLM
9: $proof ← SyntaxRefiner(proof)$ $\triangleright$ Refine the proof syntax
10: $proof ← Sorrifier(proof)$ $\triangleright$ Patch failing sub-lemmas
11: $proof ← AutoSolver(proof)$ $\triangleright$ Try built-in Lean solvers
12: if $LeanCompiler(proof)$ == True then
13: return $proof$
14: end if
15: $r_{current} ← r_{current} + 1$
16: $n ← CountSorries(proof)$ $\triangleright$ Number of `sorry` placeholders
17: $sub\_proofs ← [\,]$
18: for $i = 1$ to $n$ do
19: $ps_{goal} ← GetTheoremGoal(proof, i)$
20: $sub\_ps ← TransformGoal(ps_{goal})$
21: $sub\_proofs[i] ← Apollo(sub\_ps, r_{current}, r)$
22: end for
23: $repaired\_proof ← ProofAssembler(sub\_proofs)$
24: return $repaired\_proof$
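The control flow of Algorithm 1 can be sketched in Python. The `prove`, `verify`, and `split` callables below are hypothetical stand-ins for the LLM-plus-refiner stages, the Lean compiler check, and the sorry-extraction step, respectively:

```python
def assemble(proof: str, sub_proofs: list[str]) -> str:
    """Splice each repaired sub-proof back in place of its `sorry`."""
    for sp in sub_proofs:
        proof = proof.replace("sorry", sp, 1)
    return proof

def apollo(ps: str, prove, verify, split,
           depth: int = 0, max_depth: int = 3) -> str:
    """Recursive repair-loop sketch: generate, check, split, recurse."""
    if depth > max_depth:
        return "sorry"            # depth exhausted: leave a placeholder
    proof = prove(ps)             # LLM + Syntax Refiner + Sorrifier + AutoSolver
    if verify(proof):             # the Lean compiler accepts the proof
        return proof
    sub_goals = split(proof)      # one sub-lemma per remaining `sorry`
    subs = [apollo(g, prove, verify, split, depth + 1, max_depth)
            for g in sub_goals]
    return assemble(proof, subs)
```

With toy callables, where `prove` returns a one-`sorry` skeleton for the main goal and a complete proof for the sub-goal, `apollo` returns the fully assembled string.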
## Appendix B Discussion on the maximum $@K$ budget and Apollo
To better highlight Apollo’s contribution to token efficiency, we report token counts under two settings: (1) the budget over all attempted miniF2F-test problems, and (2) the budget restricted to proofs where Apollo was invoked. For instance, if the base LLM proves a theorem without Apollo’s assistance, its token usage is included only in the first setting. In contrast, if the base LLM cannot prove the theorem on its own but succeeds with Apollo’s help, its token usage is counted in the second setting. Results for setting (1) are shown in Table 5, while results for setting (2) appear in Table 6.
We emphasize that, regardless of the token-count setting, Apollo consistently consumes fewer tokens while achieving higher accuracy. Throughout this work, we primarily report results under setting (2), as it more directly isolates Apollo’s contribution: the reported token usage is not artificially reduced by problems that the base LLM could solve independently, making it a more representative measure.
Table 5: Comparison of maximum/average sample/token budget across $@K$ sampling and Apollo repair on the entire miniF2F-test dataset.
| Model | Max sample budget | Average sample budget | Max token budget | Average token budget |
| --- | --- | --- | --- | --- |
| Goedel-SFT | 25600 | 9730 | 12.7M | 4.8M |
| Goedel-SFT+Apollo | 1100 | 335 | 548K | 167K |
| Kimina-7B | 1024 | 373 | 1.3M | 1.6M |
| Kimina-7B+Apollo | 888 | 189 | 3.8M | 818K |
Table 6: Comparison of maximum/average sample/token budget across $@K$ sampling on the entire miniF2F-test dataset and Apollo repair exclusively on Apollo-assisted problems.
| Model | Max sample budget | Average sample budget | Max token budget | Average token budget |
| --- | --- | --- | --- | --- |
| Goedel-SFT | 25600 | 9730 | 12.7M | 4.8M |
| Goedel-SFT+Apollo | 1100 | 618 | 548K | 307K |
| Kimina-7B | 1024 | 373 | 1.3M | 1.6M |
| Kimina-7B+Apollo | 888 | 367 | 3.8M | 1.6M |
## Appendix C Additional evaluation results
We also conducted experiments on the ProofNet dataset azerbayev2023proofnetautoformalizingformallyproving using the Kimina-Prover-Preview-Distill-7B model. Table 7 summarizes the sample/token budget and accuracy. Incorporating the Apollo agent with Kimina not only improves accuracy by 7 percentage points but also reduces token consumption by roughly 25%.
Table 7: Comparison of Apollo accuracy on the ProofNet dataset. Token budget is computed over all generated tokens by LLM.
| Method | Model size | Sample budget | Token budget | ProofNet |
| --- | --- | --- | --- | --- |
| Kimina-Preview | 7B | 128 | 559K | 11.3% |
| Kimina-Preview+Apollo | 7B | 96 | 415K | 18.3% |
Moreover, we observe rising adoption of REPL-based repair, introduced in yousefzadeh2025a . Currently, models such as Goedel-Prover-V2 are fine-tuned on REPL outputs paired with faulty Lean code, enabling the model to repair missing or incorrect proof fragments using this feedback. Following yousefzadeh2025a , we compare Apollo against a REPL-feedback baseline using the general-purpose o3-mini model. Table 8 presents the results: Apollo outperforms the REPL-feedback approach while requiring a smaller sample budget. We view REPL-based feedback as complementary and expect it to work well in conjunction with recursive, compiler-aided repair of Lean proofs.
Feedback prompt schema:

```
This is an incorrect proof: [model's last proof]
Compilation errors are as follows: [Lean's error messages]
Based on this feedback, produce a correct raw Lean code for the following problem:
[header] [informal prefix] [formal statement]
```
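A minimal sketch of how such a feedback prompt might be assembled; the function name and argument order are our own illustration, not taken from the paper's codebase:

```python
def feedback_prompt(last_proof: str, errors: str, header: str,
                    informal_prefix: str, formal_statement: str) -> str:
    """Fill the feedback-prompt schema with the model's last attempt
    and the Lean compiler's error messages."""
    return (
        f"This is an incorrect proof: {last_proof}\n"
        f"Compilation errors are as follows: {errors}\n"
        "Based on this feedback, produce a correct raw Lean code "
        "for the following problem:\n"
        f"{header} {informal_prefix} {formal_statement}"
    )
```

Each repair round re-invokes the model with the previous attempt and the fresh compiler errors substituted into this template.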
Table 8: Comparison of Apollo accuracy against compiler (REPL) feedback based repair on the miniF2F-test dataset. Token budget is omitted since the OpenAI API does not report exact token counts with its responses.
| Method | Sample budget | Accuracy |
| --- | --- | --- |
| o3-mini | 1 | 3.3% |
| o3-mini | 32 | 24.6% |
| o3-mini + REPL feedback repair | 5 | 17.2% |
| o3-mini + REPL feedback repair | 10 | 25.4% |
| o3-mini + Apollo | 8 | 40.2% |
We also evaluated Apollo against the RMaxTS method proposed in xin_deepseek-prover-v15_2024 . This approach employs Monte Carlo Tree Search to explore the proof space and truncates the search whenever Lean compiler errors are encountered. RMaxTS has been shown to further improve the accuracy of whole-generation provers. For our evaluation, we used Kimina-Prover-Preview-Distill-7B as the base model. Table 9 reports the results. At a comparable token budget of approximately 2M tokens, Apollo outperforms RMaxTS by 10.7 percentage points. These findings further reinforce the effectiveness of the agentic recursive repair approach.
## Appendix D Computational resource considerations
Our emphasis on reducing sample complexity is driven by resource allocation. CPU performance is not a limiting factor in our setup: the Lean kernel remains efficient even under high parallelization, with proof verification typically completing in 6–200 seconds. To guard against inefficient structures, we impose a 5-minute compilation timeout, which is rarely approached in practice.
By contrast, GPU resources were the primary bottleneck. Running Kimina-Prover-Preview-Distill-7B on a single A5000 GPU with a 16,384-token context window takes 700–2,000 seconds per problem (over 11.5 minutes) at @32 sampling. Increased sampling exacerbates this cost, and even with eight GPUs throughput remains difficult to scale. While Lean-based servers such as Kimina-server continue to improve verification efficiency through CPU parallelization, the growing overhead of sampling highlights the need to minimize LLM invocations alongside improving theorem-proving accuracy.
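To put these timings in perspective, a rough back-of-the-envelope estimate (assuming the 244-problem miniF2F-test split and the observed per-problem range above):

```python
# rough single-GPU wall-clock estimate; numbers are illustrative,
# taken from the observed 700-2,000 s/problem range at @32 sampling
PROBLEMS = 244  # size of the miniF2F-test split

for seconds in (700, 2000):
    hours = PROBLEMS * seconds / 3600
    print(f"{seconds:>4} s/problem -> {hours:.0f} GPU-hours for a full pass")
```

Even at the lower bound, a single full pass over the benchmark consumes tens of GPU-hours, which is why reducing the number of LLM invocations matters more than CPU-side verification speed.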
Table 9: Comparison of Apollo accuracy against other sampling methods on the miniF2F-test dataset. Token budget is computed over all generated tokens by LLM.
| Method | Model size | Sample budget | Token budget | Accuracy |
| --- | --- | --- | --- | --- |
| Kimina-Preview | 7B | 1024 | 4.5M | 70.8% |
| Kimina-Preview+RMaxTS xin_deepseek-prover-v15_2024 | 7B | $4× 128$ | 2.0M | 64.3% |
| Kimina-Preview+Apollo | 7B | 589 | 2.2M | 75.0% |
## Appendix E Extension of ablation study
To further elaborate on the ablation results presented in the main text, we measured the number of relevant Apollo module calls during inference. The results are shown in Table 10. Specifically, we counted the number of invocations for each Apollo-assisted problem in the miniF2F-test subset. The findings are consistent with the ablation study reported in the main text.
Table 10: Number of Apollo module triggers across different models
| Model | Syntax Refiner | Auto Solver | LLM Re-Invoker |
| --- | --- | --- | --- |
| Goedel-SFT | 0.0% | 100.0% | 100.0% |
| Kimina-7B | 0.0% | 100.0% | 100.0% |
| o3-mini | 100.0% | 100.0% | 70.0% |
| o4-mini | 94.7% | 100.0% | 64.9% |
## Appendix F An example of Apollo proof repair procedure
Figure 6 shows a detailed run of the Apollo framework on mathd_algebra_332. First, the LLM generates an initial proof sketch, which fails to type-check in Lean. Apollo then applies its Syntax Refiner to correct simple errors (e.g. changing “from by” to “:= by”) and to strip out all comment blocks. Next, the Sorrifier replaces each failing sub-lemma with a sorry statement (six in this example). The AutoSolver module then attempts to discharge these sub-lemmas: it resolves four of them, leaving lemmas #5 and #6 unsolved. Apollo recurses on those two: each is passed again through LLM, Syntax Refiner, Sorrifier, and AutoSolver. After this single recursive iteration, all goals are closed. Finally, Apollo reassembles the full proof and returns it to the user.
This example illustrates Apollo’s repair logic. In this case the proof was fixed in one iteration; more complex theorems may require deeper recursive repair.
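For intuition, a compressed version of the assembled argument for this theorem could look like the sketch below, assuming the product fact $x \cdot y = 19$ has already been extracted from the square-root hypothesis (as Lemma #5 does in the figure):

```lean
import Mathlib

-- illustrative sketch, not Apollo's verbatim output
example (x y : ℝ) (h₀ : (x + y) / 2 = 7) (h₁ : x * y = 19) :
    x ^ 2 + y ^ 2 = 158 := by
  have hsum : x + y = 14 := by linarith       -- from the arithmetic mean
  have hsq : x ^ 2 + y ^ 2 = (x + y) ^ 2 - 2 * (x * y) := by ring
  rw [hsq, hsum, h₁]                          -- substitute both derived facts
  norm_num                                    -- 14 ^ 2 - 2 * 19 = 158
```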
<details>
<summary>examples/step_by_step/dryrun.png Details</summary>

### Visual Description
## [Diagram]: APOLLO Framework (Formal Mathematical Proof Pipeline)
### Overview
The image depicts the **APOLLO Framework**, a pipeline for generating formal mathematical proofs (in the Lean theorem prover) from a problem statement. It illustrates a modular approach to proof generation, with components for language modeling (LLM), syntax refinement, handling incomplete proofs ("sorry" placeholders), and automated proof completion. The diagram includes three parallel proof attempts (base, Lemma #5, Lemma #6) and a final combined proof.
### Components/Axes
- **Top Section (Proof Pipelines)**: Three parallel workflows (LLM → Syntax Refiner → Sorrifier → Auto Solver) for different proof attempts (base, #5, #6).
- **Color-Coded Components**:
- Red: LLM (generates initial proof sketches).
- Pink: Syntax Refiner (refines Lean syntax).
- Yellow: Sorrifier (handles "sorry" placeholders in Lean).
- Blue: Auto Solver (completes proofs with automated tactics).
- **Arrows**: Show the flow of proof data between components.
- **Bottom Section**:
- **Formal Statement (Gray Box)**: The problem statement in Lean.
- **LLM + APOLLO (Green Box)**: The final combined proof (base proof + Lemmas #5, #6).
### Detailed Analysis
#### Top Pipelines (Proof Attempts)
1. **Base Proof (Top-Left Pipeline)**:
- **LLM (Red)**: Generates a Lean proof with steps like `have sum : x + y = 14 := by ...` (deriving \( x + y = 14 \) from the arithmetic mean) and `have prod : x * y = 19 := by ...` (deriving \( xy = 19 \) from the geometric mean).
- **Syntax Refiner (Pink)**: Refines syntax (e.g., `have sum : x + y = (14 : ℝ) := by ...`).
- **Sorrifier (Yellow)**: Replaces "sorry" placeholders (e.g., `sorry #1`, `sorry #2`) with valid tactics.
- **Auto Solver (Blue)**: Completes the proof using tactics like `linarith` (linear arithmetic) and `ring` (ring theory).
2. **Lemma #5 (Middle Pipeline)**:
- **LLM (Red)**: Proves a lemma about square roots: `lemma mathd_algebra_332_1 (x y : ℝ) (h : Real.sqrt (x * y) = Real.sqrt (19 : ℝ)) : x * y = (19 : ℝ) := by ...` (justifying \( xy = 19 \) from the geometric mean).
- **Syntax Refiner (Pink)**: Refines the lemma’s syntax.
- **Sorrifier (Yellow)**: Handles "sorry" placeholders in the lemma.
- **Auto Solver (Blue)**: Completes the lemma proof.
3. **Lemma #6 (Bottom Pipeline)**:
- **LLM (Red)**: Proves a lemma about algebraic manipulation: `lemma mathd_algebra_332_2 (x y : ℝ) (h : x + y = (14 : ℝ)) (h1 : x * y = (19 : ℝ)) : x ^ 2 + y ^ 2 = (158 : ℝ) := by ...` (deriving \( x^2 + y^2 = 158 \) from \( x + y = 14 \) and \( xy = 19 \)).
- **Syntax Refiner (Pink)**: Refines the lemma’s syntax.
- **Sorrifier (Yellow)**: Handles "sorry" placeholders.
- **Auto Solver (Blue)**: Completes the lemma proof.
#### Bottom Section
- **Formal Statement (Gray Box)**:
The problem: *"Real numbers \( x \) and \( y \) have an arithmetic mean of 7 and a geometric mean of \( \sqrt{19} \). Find \( x^2 + y^2 \). Show that it is 158."*
Translated to Lean:
```lean
theorem mathd_algebra_332 (x y : ℝ)
(h₀ : (x + y) / 2 = (7 : ℝ))
(h₁ : Real.sqrt (x * y) = Real.sqrt (19 : ℝ)) :
x ^ 2 + y ^ 2 = (158 : ℝ) := by ...
```
- **LLM + APOLLO (Green Box)**:
The final combined proof, integrating the base proof and Lemmas #5, #6. Key steps:
- Derive \( x + y = 14 \) (from \( (x + y)/2 = 7 \)) and \( xy = 19 \) (from \( \sqrt{xy} = \sqrt{19} \)).
- Use \( x^2 + y^2 = (x + y)^2 - 2xy \) to compute \( 14^2 - 2 \cdot 19 = 158 \).
### Key Observations
- **Modular Proof Generation**: The framework breaks the proof into smaller lemmas (#5, #6) and a base proof, then combines them.
- **Component Roles**: LLM generates initial proof sketches, Syntax Refiner fixes syntax, Sorrifier handles incomplete parts ("sorry"), and Auto Solver completes the proof.
- **Lean Tactics**: Uses tactics like `linarith` (linear arithmetic), `ring` (ring theory), `exact` (exact proof), and `norm_num` (numeric normalization).
- **Problem Context**: The problem relies on arithmetic/geometric mean properties: \( x + y = 14 \) (arithmetic mean) and \( xy = 19 \) (geometric mean), leading to \( x^2 + y^2 = 158 \).
### Interpretation
The APOLLO Framework demonstrates a pipeline for **automated theorem proving** in Lean, leveraging LLMs to generate proof sketches, then refining and completing them with specialized components. The modular approach (breaking into lemmas) manages complexity, while the final proof combines algebraic manipulation (from the base proof) with square-root and arithmetic lemmas. This illustrates how AI can assist in formal mathematics by reducing manual effort in theorem proving, enabling more efficient exploration of mathematical truths.
</details>
Figure 6: Step-by-step Apollo repair on mathd_algebra_332. The initial proof and all sub-lemma proofs are generated with o4-mini, then systematically repaired and reassembled by Apollo.
<details>
<summary>examples/comparison/short_3.png Details</summary>

### Visual Description
## [Code Comparison Diagram]: LLM vs. LLM + APOLLO Proof Generation
### Overview
The image presents a side-by-side comparison of two Lean 4 code blocks. The left block, titled "LLM" and outlined in red, shows a proof generated by a Large Language Model. The right block, titled "LLM + APOLLO" and outlined in green, shows a proof for the same theorem generated by an LLM augmented with a system called APOLLO. The comparison highlights differences in proof strategy and conciseness, with specific lines of code highlighted in corresponding colors to indicate key differences.
### Components/Axes
* **Layout**: Two vertical panels.
* **Left Panel**: Header "LLM" in a red box. Code block with a red outline. Certain lines are highlighted with a light red background.
* **Right Panel**: Header "LLM + APOLLO" in a green box. Code block with a green outline. Certain lines are highlighted with a light green background.
* **Content**: Both panels contain formal mathematical proof code written in the Lean 4 programming language and theorem prover.
* **Highlighted Lines**: Numbered tags (#1, #2, #3) are placed next to specific lines in both panels to draw direct comparisons.
### Detailed Analysis
**Common Code (Present in Both Panels):**
The initial setup and theorem statement are identical.
```lean
import Mathlib
import Aesop
set_option maxHeartbeats 0
open BigOperators Real Nat Topology Rat
theorem mathd_algebra_158_llm (a k : ℕ) (h₀ : Even a)
(h₁ : ↑(∑ k in Finset.range 8, 2 * k + 1) - ↑(∑ k in Finset.range 5, a + 2 * k) = (4 : ℤ)) :
a = 8 := by
simp_all
ring_nf at *
have eq : (5 : ℤ) * (↑a : ℤ) + (2 : ℤ) * (↑k : ℤ) = 53 := by
linarith
```
*The theorem `mathd_algebra_158` (named `_llm` on the left, `_apollo` on the right) states: Given natural numbers `a` and `k`, where `a` is even, and a hypothesis `h₁` about the difference of two sums equaling 4, prove that `a = 8`.*
**Divergent Proof Steps (Highlighted Comparisons):**
**Comparison #1: Proving `a` is even.**
* **LLM (Left, Red Highlight):**
```lean
have h_even : ∃ (m : ℕ), a = 2 * m := h₀
```
*Directly uses the hypothesis `h₀` (which states `Even a`).*
* **LLM + APOLLO (Right, Green Highlight):**
```lean
have h_even : ∃ (m : ℕ), a = 2 * m := by
exact Even.exists_two_nsmul a h₀
```
*Uses a tactic block and the specific lemma `Even.exists_two_nsmul` to unpack the definition of `Even`.*
**Comparison #2: Proving 53 is even (a step towards contradiction).**
* **LLM (Left, Red Highlight):**
```lean
have two_divides_53 : ∃ (d : ℤ), 53 = 2 * d := by
rw [← even_sum] at eq_subst
rw [eq_subst] at eq
exact ⟨(↑5) * (↑m : ℤ) + ↑k, eq⟩
```
*Attempts a complex rewrite strategy involving a previously defined `eq_subst` (not shown in the common code) and constructs a proof term.*
* **LLM + APOLLO (Right, Green Highlight):**
```lean
have two_divides_53 : ∃ (d : ℤ), 53 = 2 * d := by omega
```
*Uses the `omega` tactic, a powerful automated decision procedure for linear integer arithmetic, to discharge the goal directly.*
**Comparison #3: A rewrite step within the contradiction proof.**
* **LLM (Left, Red Highlight):**
```lean
rw [mul_mod] at mod53
```
* **LLM + APOLLO (Right, Green Highlight):**
```lean
rw [mul_mod] at mod53
```
*This line is identical in both proofs. The highlight indicates it is part of the divergent proof context but the tactic call itself is the same.*
**Remaining Common Code (Post-Divergence):**
Both proofs converge again after the highlighted sections to finish the contradiction:
```lean
have not_divides : ¬ ∃ (d : ℤ), 53 = 2 * d := by
rintro ⟨d, hd⟩
have mod53 : (53 : ℤ) % 2 = 1 := by norm_num
have mod_d : (2 * d) % 2 = 0 := by norm_num
rw [hd] at mod53
linarith
have contra : False := by
apply not_divides
exact two_divides_53
exact False.elim contra
```
### Key Observations
1. **Proof Conciseness**: The APOLLO-augmented proof is significantly shorter and more direct. The key differences are at tags #1 and #2.
2. **Tactic Sophistication**: The LLM proof uses more manual, low-level tactics (`rw`, constructing proof terms). The APOLLO proof leverages higher-level, automated tactics (`omega`) and standard library lemmas (`Even.exists_two_nsmul`).
3. **Identical Structure**: Despite the different tactics, the overall logical structure of the proof (establishing an equation, deriving a contradiction from the parity of 53) is the same in both versions.
4. **Language**: The code is written in **Lean 4**, a functional programming language and interactive theorem prover. All text is in English and Lean 4's syntax.
### Interpretation
This diagram serves as a technical demonstration of how an augmentation system (APOLLO) can enhance the output of a base LLM for formal mathematical reasoning.
* **What the data suggests**: The LLM alone can produce a correct but verbose and somewhat "naive" proof, using rewrites and explicit term construction. APOLLO guides or refines the output to produce a proof that is more idiomatic, concise, and leverages the prover's automation (`omega`) and extensive mathematical library more effectively.
* **How elements relate**: The side-by-side layout with color-coded highlights creates a direct visual diff, making it easy to compare the proof strategies line-by-line. The numbered tags explicitly link corresponding conceptual steps (proving evenness, proving 53 is even) across the two implementations.
* **Notable patterns**: The most striking pattern is the replacement of a multi-line, manual proof for `two_divides_53` with a single tactic call (`by omega`). This indicates that APOLLO is particularly effective at identifying goals that can be solved by the prover's built-in automation, a key skill in efficient formal proof writing.
* **Underlying significance**: The comparison illustrates a potential path for AI-assisted formal verification: using LLMs to generate initial proof sketches or steps, and then employing specialized systems (like APOLLO) to refine, optimize, and complete them into polished, efficient proofs. This hybrid approach aims to combine the creative, high-level reasoning of LLMs with the precision and rigor of symbolic theorem provers.
</details>
Figure 7: Proof attempt of mathd_algebra_158 produced by the base model (o4-mini) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
## Appendix G Comparison between base proofs generated from LLMs and Apollo-repaired counterparts
In this section, we present and discuss examples of initial proof attempts alongside their Apollo-repaired counterparts.
Figures 7, 8, 9 illustrate cases where Apollo repairs the proof without invoking the LLM again. In each example, the LLM struggles to generate the precise tactics needed to close certain goals; however, we find that regenerating the entire proof is unnecessary. Instead, built-in Lean solvers can discharge those subgoals directly. Moreover, in Figure 7, problematic block #3 is resolved simply by removing it, since the surrounding proof context is correct. Thus, omitting problematic lines can sometimes yield a valid proof.
In Figure 10, one goal is expanded via an LLM query, but the model injects extra tactics that trigger a `no goals to be solved` error. Apollo then repairs the first block using a combination of LLM generation and AutoSolver, while the second block is removed entirely.
Figure 11 illustrates a case where the model fails to apply nlinarith to discharge the goal. We observe that, when uncertain, the model often resorts to broad tactics in hopes of closing the goal. Here, the goal is too complex for nlinarith, so Apollo leverages both the LLM and AutoSolver to guide the proof to completion.
Figure 12 illustrates an example of a proof that required a recursion depth of $r=5$ to repair the initial attempt. The LLM’s proof structure is correct; however, it over-relies on built-in solvers and rewrites. Since the goals are too complex, this approach leads to a total of nine error blocks. Repairing those blocks requires aid from both Apollo’s AutoSolver module and the LLM itself. This shows that, once the LLM grasps the correct approach, Apollo is able to repair fine-grained logic errors to produce a correct proof.
Figure 13 illustrates a case where the LLM produces an incomplete proof sketch. The first half of the proof is correct, but the model fails to discharge the remaining goal and instead uses all_goals positivity. Apollo detects this error and performs two additional recursive repair iterations to complete the proof.
Figure 14 illustrates a case where the base model takes a lazy approach by trying to close all goals with linarith. In contrast, Apollo’s repaired proof performs additional setup and discharges each goal by first proving the necessary auxiliary hypotheses. This example shows that, although the model often understands the high-level strategy, it lacks the fine-grained tactics (and compiler feedback) needed to close subgoals. Apollo decomposes the proof, identifies the failure points, and applies targeted repairs to generate a fully verified solution.
Figure 15 shows another scenario in which Apollo successfully closes the first two blocks with linarith, but the final block requires a deeper reasoning chain. Apollo augments the proof by introducing and proving the missing lemmas, which leads to a correct solution with a series of rewrites.
Figure 16 presents an example of a very large repair. As in Figure 12, Apollo requires $r=5$ to fully fix the proof. The final proof grows from 100 lines to 216, meaning Apollo applied roughly $2.2×$ as many tactics to close the goal. The repair relied on the combined effort of AutoSolver and the LLM to close all goals. This shows that, even for large, complex proofs, Apollo can repair each failing sub-block and generate a correct solution.
<details>
<summary>examples/comparison/short_1.png Details</summary>

### Visual Description
## Diagram: Comparison of Lean 4 Proof Strategies (LLM vs. LLM + APOLLO)
### Overview
The image presents a side-by-side comparison of two Lean 4 code implementations for proving the same mathematical theorem. The left panel, outlined in red, is labeled "LLM" and shows a proof generated by a Large Language Model. The right panel, outlined in green, is labeled "LLM + APOLLO" and shows a proof that incorporates the APOLLO system. The comparison highlights differences in the tactics and steps used to arrive at the same conclusion.
### Components/Axes
The diagram consists of two primary, vertically oriented rectangular panels:
* **Left Panel (LLM):** Bordered in red. Header text: "LLM".
* **Right Panel (LLM + APOLLO):** Bordered in green. Header text: "LLM + APOLLO".
Each panel contains a block of Lean 4 code. Within the code, specific lines or sequences are highlighted with a background color matching the panel's border (red or green) and are annotated with a numbered tag (e.g., `#1`, `#2`, `#3`).
### Detailed Analysis
Both code blocks begin with identical setup lines and state the same theorem.
**Common Setup & Theorem Statement:**
```lean
import Mathlib
import Aesop
set_option maxHeartbeats 0
open BigOperators Real Nat Topology Rat
theorem mathd_algebra_141_<variant> (a b : ℝ) (h₁ : a * b = 180) (h₂ : 2 * (a + b) = 54) :
a ^ 2 + b ^ 2 = 369 := by
```
* The theorem name differs: `mathd_algebra_141_llm` (left) vs. `mathd_algebra_141_apollo` (right).
* The goal is to prove `a ^ 2 + b ^ 2 = 369` given `a * b = 180` and `2 * (a + b) = 54`.
**Proof Step Comparison:**
| Step / Goal | LLM Proof (Left Panel) | LLM + APOLLO Proof (Right Panel) |
| :--- | :--- | :--- |
| **1. Deduce `a + b = 27`** | `have h3 : a + b = 27 := by` <br> `field_simp [h₂]` <br> *(Highlighted Red, Tag #1)* | `have h3 : a + b = 27 := by` <br> `linarith` <br> *(Highlighted Green, Tag #1)* |
| **2. Compute `(a + b)^2 = 729`** | `have h4 : (a + b) ^ 2 = 729 := by` <br> `rw [h3]` <br> `norm_num` | `have h4 : (a + b) ^ 2 = 729 := by` <br> `rw [h3]` <br> `norm_num` |
| **3. State algebraic identity** | `have expand : a ^ 2 + b ^ 2 = (a + b) ^ 2 - 2 * a * b := by` <br> `ring` | `have expand : a ^ 2 + b ^ 2 = (a + b) ^ 2 - 2 * a * b := by` <br> `ring` |
| **4. Substitute into identity** | `have step1 : a ^ 2 + b ^ 2 = 729 - 2 * a * b := by` <br> `rw [expand, h4]` | `have step1 : a ^ 2 + b ^ 2 = 729 - 2 * a * b := by` <br> `rw [expand, h4]` |
| **5. Substitute `a*b=180`** | `have step2 : 729 - 2 * a * b = 729 - 360 := by` <br> `rw [h₁]` <br> *(Highlighted Red, Tag #2)* | `have step2 : 729 - 2 * a * b = 729 - 360 := by` <br> `linarith` <br> *(Highlighted Green, Tag #2)* |
| **6. Compute final value** | `have step3 : 729 - 360 = 369 := by` <br> `norm_num` | `have step3 : 729 - 360 = 369 := by` <br> `norm_num` |
| **7. Conclude proof** | `exact step1.trans (step2.trans step3)` <br> *(Highlighted Red, Tag #3)* | `linarith` <br> *(Highlighted Green, Tag #3)* |
### Key Observations
1. **Proof Structure:** Both proofs follow the same logical sequence: derive `a+b`, square it, use the identity `(a+b)^2 = a^2 + b^2 + 2ab`, substitute known values, and compute the result.
2. **Tactic Divergence:** The key difference lies in the tactics used at three critical junctures (highlighted and tagged #1, #2, #3).
* **Step #1:** The LLM uses `field_simp [h₂]` (a simplification tactic for field arithmetic), while APOLLO uses `linarith` (a tactic for linear arithmetic).
* **Step #2:** The LLM uses the explicit rewrite `rw [h₁]` to substitute `a*b=180`. APOLLO again uses `linarith`, which can infer this substitution automatically from the context.
* **Step #3 (Conclusion):** The LLM constructs a detailed proof term using `exact step1.trans (step2.trans step3)` to chain the equalities. APOLLO simply uses `linarith` again, which solves the final linear arithmetic goal.
3. **Conciseness:** The APOLLO version is more concise, relying on the powerful, automated `linarith` tactic in place of more manual or specialized tactics (`field_simp`, explicit rewrites, proof term construction).
### Interpretation
This diagram serves as a technical demonstration of how the APOLLO system augments or refines the output of a base LLM for formal theorem proving in Lean 4.
* **What it demonstrates:** The comparison suggests that APOLLO enhances proof automation. It replaces LLM-generated, potentially more verbose or less general tactic calls with a uniform, powerful automated tactic (`linarith`). This indicates APOLLO's strength in recognizing and solving linear arithmetic subgoals that arise during a proof.
* **Relationship between elements:** The side-by-side layout directly contrasts the "raw" LLM output with the "refined" LLM+APOLLO output. The colored highlights and numbered tags (#1, #2, #3) explicitly map the points of divergence, making the comparison easy to follow.
* **Significance:** The outlier is the LLM's use of `exact step1.trans...` in the final step. This is a valid but more low-level approach compared to APOLLO's strategy of delegating to `linarith`. The overall trend shown is toward greater automation and tactic uniformity when APOLLO is involved, which can lead to more robust and maintainable proofs. The image argues for the utility of systems like APOLLO in bridging the gap between LLM-generated proof sketches and fully automated, polished formal proofs.
</details>
Figure 8: Proof attempt of mathd_algebra_141 produced by the base model (o4-mini) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/short_2.png Details</summary>

### Visual Description
## Code Comparison Diagram: LLM vs. LLM + APOLLO Proof Scripts
### Overview
The image displays a side-by-side comparison of two Lean 4 code blocks. The left block, outlined in red, is labeled "LLM". The right block, outlined in green, is labeled "LLM + APOLLO". Both blocks contain code to prove the same mathematical theorem (`imo_1983_p6`), but they employ different proof tactics, suggesting a comparison of proof strategies or the effect of an additional tool (APOLLO).
### Components/Axes
The image is divided into two primary, vertically-aligned rectangular regions:
1. **Left Region (LLM):** Bordered in red. Contains a complete Lean 4 proof script.
2. **Right Region (LLM + APOLLO):** Bordered in green. Contains a complete Lean 4 proof script.
Each region contains the following structural components:
* **Header Label:** A colored box at the top center of each region ("LLM" in red, "LLM + APOLLO" in green).
* **Code Block:** Monospaced text representing Lean 4 code.
* **Highlighted Proof Tactic:** The final `nlinarith` tactic call is highlighted with a background color matching the region's border (light red on the left, light green on the right). A small tag `#1` appears at the end of the first line of this tactic in both blocks.
### Detailed Analysis / Content Details
**1. Common Code Structure (Both Blocks):**
Both scripts share identical setup and theorem statements:
```lean
import Mathlib
import Aesop
set_option maxHeartbeats 0
open BigOperators Real Nat Topology Rat
theorem imo_1983_p6_... (a b c : ℝ) (h₀ : 0 < a ∧ 0 < b ∧ 0 < c) (h₁ : c < a + b) (h₂ : b < a + c) (h₃ : a < b + c) : 0 ≤ a ^ 2 * b * (a - b) + b ^ 2 * c * (b - c) + c ^ 2 * a * (c - a) := by
```
* **Imports:** `Mathlib`, `Aesop`.
* **Option:** `maxHeartbeats` set to 0 (disabling the timeout).
* **Opened Namespaces:** `BigOperators`, `Real`, `Nat`, `Topology`, `Rat`.
* **Theorem Hypotheses:** `a, b, c` are positive real numbers (`h₀`), and they satisfy the triangle inequalities (`h₁`, `h₂`, `h₃`).
* **Theorem Conclusion:** A specific cyclic inequality involving squares and products of `a, b, c`.
**2. Divergent Proof Tactics (The Key Difference):**
The proof scripts differ entirely in the tactic used after `:= by`.
* **Left Block (LLM) Proof Tactic:**
```lean
nlinarith [sq_nonneg (a - b), sq_nonneg (b - c),
sq_nonneg (c - a), sq_nonneg (a - b),
sq_nonneg (b - c), sq_nonneg (c - a),
mul_pos h₀.left h₀.right.left,
mul_pos h₀.left h₀.right.right,
mul_pos h₀.right.left h₀.right.right]
```
* **Tactic:** `nlinarith` (non-linear arithmetic).
* **Provided Facts:** A list of 9 facts:
1. `sq_nonneg (a - b)` (repeated twice)
2. `sq_nonneg (b - c)` (repeated twice)
3. `sq_nonneg (c - a)` (repeated twice)
4. `mul_pos h₀.left h₀.right.left` (product of `a>0` and `b>0` is positive)
5. `mul_pos h₀.left h₀.right.right` (product of `a>0` and `c>0` is positive)
6. `mul_pos h₀.right.left h₀.right.right` (product of `b>0` and `c>0` is positive)
* **Right Block (LLM + APOLLO) Proof Tactic:**
```lean
nlinarith [sq_nonneg (a - b), sq_nonneg (b - c),
sq_nonneg (c - a),
mul_pos (sub_pos.mpr h₁)
(sub_pos.mpr h₂), mul_pos
(sub_pos.mpr h₂)
(sub_pos.mpr h₃), mul_pos (sub_pos.mpr h₃)
(sub_pos.mpr h₁)]
```
* **Tactic:** `nlinarith` (non-linear arithmetic).
* **Provided Facts:** A list of 6 facts:
1. `sq_nonneg (a - b)`
2. `sq_nonneg (b - c)`
3. `sq_nonneg (c - a)`
4. `mul_pos (sub_pos.mpr h₁) (sub_pos.mpr h₂)` (product of `(a+b-c)>0` and `(a+c-b)>0` is positive)
5. `mul_pos (sub_pos.mpr h₂) (sub_pos.mpr h₃)` (product of `(a+c-b)>0` and `(b+c-a)>0` is positive)
6. `mul_pos (sub_pos.mpr h₃) (sub_pos.mpr h₁)` (product of `(b+c-a)>0` and `(a+b-c)>0` is positive)
### Key Observations
1. **Identical Goal:** Both proofs aim to establish the same inequality under the same hypotheses.
2. **Proof Strategy Divergence:** The core difference lies in the auxiliary facts supplied to the `nlinarith` automation tactic.
3. **Fact Selection - LLM:** The LLM proof uses a redundant set of non-negativity facts for squares (`sq_nonneg`) and basic positivity facts derived directly from `h₀` (`a,b,c > 0`).
4. **Fact Selection - LLM + APOLLO:** The APOLLO-assisted proof uses a more targeted set. It includes the three necessary `sq_nonneg` facts without repetition. Crucially, it derives positivity facts from the *triangle inequalities* (`h₁, h₂, h₃`) using `sub_pos.mpr`, which transforms `c < a + b` into `0 < a + b - c`. The supplied `mul_pos` facts are products of these derived positive terms.
5. **Conciseness:** The APOLLO version provides fewer total facts (6 vs. 9) and avoids redundancy.
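The derived positivity facts described above can be reproduced in isolation. A minimal standalone sketch (assuming only Mathlib; not taken from the figure):

```lean
import Mathlib

-- `sub_pos.mpr` turns a triangle inequality such as `c < a + b`
-- into the positivity fact `0 < a + b - c`; `mul_pos` then
-- multiplies two such positive terms — exactly the shape of the
-- auxiliary facts fed to `nlinarith` in the APOLLO proof.
example (a b c : ℝ) (h₁ : c < a + b) (h₂ : b < a + c) :
    0 < (a + b - c) * (a + c - b) :=
  mul_pos (sub_pos.mpr h₁) (sub_pos.mpr h₂)
```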
### Interpretation
This diagram illustrates a comparison between two automated theorem proving approaches for a classic IMO problem. The "LLM" label suggests a baseline proof generated by a Large Language Model. The "LLM + APOLLO" label indicates a proof where the LLM's output is augmented or guided by a system named APOLLO.
The key insight is in the **quality and relevance of the auxiliary facts** fed to the `nlinarith` tactic. The LLM-generated proof uses a somewhat brute-force approach, supplying many basic, loosely related facts (including duplicates). In contrast, the APOLLO-assisted proof demonstrates a more sophisticated understanding of the problem structure. It correctly identifies that the triangle inequalities (`h₁, h₂, h₃`) are the critical source of positivity for the terms `(a+b-c)`, `(a+c-b)`, and `(b+c-a)`, which are central to the inequality's proof. By providing these derived facts, the APOLLO system likely makes the `nlinarith` tactic's job more efficient and direct.
Therefore, the image suggests that APOLLO enhances the LLM's capability by improving its **strategic selection of lemmas** or its ability to **perform intermediate reasoning steps** (like applying `sub_pos.mpr`) before invoking automation. This leads to a more elegant, concise, and logically focused proof script. The comparison highlights the difference between generating syntactically correct code and generating optimally informative code for automated provers.
</details>
Figure 9: Proof attempt of imo_1983_p6 produced by the base model (Kimina-Prover-Preview-Distill-7B) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/short_4.png Details</summary>

### Visual Description
## Diagram: Side-by-Side Code Comparison (LLM vs. LLM + APOLLO)
### Overview
The image displays a side-by-side comparison of two blocks of formal mathematical proof code written in the Lean 4 language. The left panel, titled "LLM" in a red header, shows a proof generated by a Large Language Model. The right panel, titled "LLM + APOLLO" in a green header, shows a more detailed and structured proof, presumably generated or enhanced by a system called APOLLO. The comparison highlights differences in proof strategy, detail, and the use of automated tactics.
### Components/Axes
The image is divided into two vertical panels of equal width.
**Left Panel (LLM):**
* **Header:** A red rectangular box at the top containing the text "LLM".
* **Code Block:** A block of Lean 4 code on a light pink background.
* **Annotations:** Two highlighted sections in pink, labeled `#1` and `#2`.
**Right Panel (LLM + APOLLO):**
* **Header:** A green rectangular box at the top containing the text "LLM + APOLLO".
* **Code Block:** A block of Lean 4 code on a light green background.
* **Annotations:** Two highlighted sections in green, labeled `#1` and `#2`. The `#1` annotation is significantly more extensive than its counterpart on the left.
### Detailed Analysis
**1. Common Code Structure (Both Panels):**
Both proofs begin with identical setup lines:
```lean
import Mathlib
import Aesop
set_option maxHeartbeats 0
open BigOperators Real Nat Topology Rat
theorem induction_pord1p1on2powklt5on2_llm (n k : ℕ) (h₀ : 0 < n) :
(∏ k in Finset.Icc 1 n, 1 + (1 : ℝ) / 2 ^ k) < 5 / 2 := by
```
This defines a theorem named `induction_pord1p1on2powklt5on2_llm` (the right panel uses `_apollo`). The theorem states that for natural numbers `n` and `k`, with the hypothesis `h₀ : 0 < n`, the product of the terms `(1 + 1/2^k)` for `k` from 1 to `n` is strictly less than 5/2.
**2. Proof Strategy - Left Panel (LLM):**
The proof proceeds by induction on `n`.
```lean
induction n with
| zero =>
  contradiction
| succ n hn =>
  cases n with
  | zero =>
    norm_num -- Annotation #1
  | succ n =>
    simp_all [Finset.prod_Icc_succ_top, Nat.cast_succ, div_eq_mul_inv]
    norm_num -- Annotation #2
    ring_nf
    linarith
```
* **Base Case (`zero`):** Discharged by `contradiction` (since `h₀ : 0 < 0` is false).
* **Inductive Step (`succ n hn`):** Uses `cases n` to split into two subcases.
* Subcase `zero` (i.e., `n=1`): Uses `norm_num` (Annotation #1) to verify the numeric inequality.
* Subcase `succ n` (i.e., `n ≥ 2`): Uses `simp_all` with specific lemmas, then `norm_num` (Annotation #2), `ring_nf` for normalization, and `linarith` for linear arithmetic.
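For the subcase `n = 1` handled by `norm_num` (Annotation #1), the goal reduces to a closed numeric inequality:

```latex
\prod_{k=1}^{1}\Bigl(1 + \frac{1}{2^{k}}\Bigr) \;=\; 1 + \frac{1}{2} \;=\; \frac{3}{2} \;<\; \frac{5}{2}
```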
**3. Proof Strategy - Right Panel (LLM + APOLLO):**
The proof also begins with induction on `n` but introduces a nested induction on `k` within the inductive step.
```lean
induction n with
| zero =>
  contradiction
| succ n hn =>
  cases n with
  | zero =>
    norm_num
    induction k with -- nested induction begins
    | zero =>
      norm_num [Finset.prod_Icc_succ_top] at hn h₀ <;> linarith
    | succ k ih =>
      cases k with
      | zero =>
        norm_num [Finset.prod_Icc_succ_top] at hn h₀ <;> linarith
      | succ k =>
        simp_all [Finset.prod_Icc_succ_top, pow_succ, mul_comm]
        linarith
    -- end of nested induction block (Annotation #1)
  | succ n =>
    simp_all [Finset.prod_Icc_succ_top, Nat.cast_succ, div_eq_mul_inv]
    norm_num -- Annotation #2
    ring_nf
    linarith
```
* **Base Case (`zero`):** Identical to the left panel.
* **Inductive Step (`succ n hn`):** Uses `cases n`.
* Subcase `zero` (i.e., `n=1`): This is where the major difference lies (Annotation #1). Instead of a single `norm_num`, it performs a **nested induction on `k`**. This inner induction has its own base case (`zero`) and inductive step (`succ k ih`), which is further split. This suggests a more granular, step-by-step verification of the product inequality for the specific case where `n=1`, possibly to handle the product over `k` more explicitly.
* Subcase `succ n` (i.e., `n ≥ 2`): The code is identical to the left panel's corresponding subcase, including Annotation #2.
**4. Annotations:**
* **Annotation #1 (Left - LLM):** A single tactic: `norm_num`.
* **Annotation #1 (Right - LLM + APOLLO):** A multi-line block implementing a nested induction on `k` with several tactics (`norm_num`, `induction`, `cases`, `simp_all`, `linarith`). This is the core differentiator.
* **Annotation #2 (Both Panels):** Identical in both: `norm_num`.
### Key Observations
1. **Structural Divergence:** The primary difference is in the subcase where `n=1`. The LLM proof uses a single tactic, while the LLM+APOLLO proof expands this into a detailed nested induction on the product index `k`.
2. **Proof Granularity:** The APOLLO-enhanced proof is more verbose and structured, breaking down the problem further. This may indicate a strategy to increase proof reliability or provide more detailed steps for verification.
3. **Identical Core Logic:** For the base case (`n=0`) and the inductive subcase for `n ≥ 2`, the proofs are identical. The enhancement is localized to a specific, potentially tricky part of the proof.
4. **Use of Tactics:** Both proofs use a similar set of Lean tactics (`induction`, `cases`, `norm_num`, `simp_all`, `linarith`), but APOLLO's version uses them in a more nested and explicit sequence.
### Interpretation
This diagram illustrates a comparison between a standard LLM-generated formal proof and one augmented by a system named APOLLO. The key takeaway is that **APOLLO appears to enhance the proof by introducing more detailed, structured reasoning in complex subgoals.**
* **What the data suggests:** The LLM alone produced a correct but minimally detailed proof for the `n=1` case. The APOLLO system identified this as a point requiring deeper justification and automatically generated a nested induction on `k` to provide that justification. This suggests APOLLO may function as a **proof refinement or debugging tool**, expanding high-level tactic calls into more fundamental steps.
* **How elements relate:** The two panels are direct counterparts. The identical headers and theorem names establish they are solving the same problem. The divergence in the `| zero =>` subcase under `| succ n hn =>` is the focal point, demonstrating APOLLO's intervention.
* **Notable patterns/anomalies:** The most significant pattern is the transformation of a single `norm_num` tactic into a multi-step inductive argument. This is not an error but a deliberate enrichment of the proof. It implies that for full formal verification or educational purposes, the APOLLO-style proof is more robust and transparent, as it leaves fewer steps to automated "black box" tactics. The fact that the rest of the proof remains unchanged suggests APOLLO's modifications are targeted and context-aware.
</details>
Figure 10: Proof attempt of induction_pord1p1on2powklt5on2 produced by the base model (Goedel-Prover-SFT) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/short_5.png Details</summary>

### Visual Description
## [Code Comparison Diagram]: Lean 4 Theorem Proofs - LLM vs. LLM + APOLLO
### Overview
The image presents a side-by-side comparison of two Lean 4 code blocks, each containing a proof for the same mathematical theorem. The left block, outlined in red, is labeled "LLM". The right block, outlined in green, is labeled "LLM + APOLLO". The comparison visually highlights differences in the proof strategies and tactics employed by a standard Large Language Model (LLM) versus an LLM augmented with a system called APOLLO.
### Components/Axes
The image is divided into two primary, vertically oriented rectangular regions:
1. **Left Region (LLM):** Bordered in red. Contains a complete Lean 4 proof script.
2. **Right Region (LLM + APOLLO):** Bordered in green. Contains a complete Lean 4 proof script for the same theorem.
**Labels:**
* Top of Left Region: `LLM` (black text on a red background).
* Top of Right Region: `LLM + APOLLO` (black text on a green background).
**Embedded Annotations:**
* In the LLM proof, the tactic `nlinarith #1` is highlighted with a pink background.
* In the LLM + APOLLO proof, the annotation `#1` is highlighted with a green background at the end of the `norm_num` line.
### Detailed Analysis / Content Details
**Left Block (LLM) - Full Transcription:**
```lean
import Mathlib
import Aesop
set_option maxHeartbeats 0
open BigOperators Real Nat Topology Rat
theorem mathd_algebra_293_llm (x : NNReal) :
Real.sqrt (60 * x) * Real.sqrt (12 * x) * Real.sqrt (63 * x) = 36 * x * Real.sqrt (35 * x) := by
rw [← Real.sqrt_mul (by positivity), ← Real.sqrt_mul (by positivity)]
ring_nf
rw [Real.sqrt_eq_iff_mul_self_eq]
ring_nf
nlinarith #1
```
**Right Block (LLM + APOLLO) - Full Transcription:**
```lean
import Mathlib
import Aesop
set_option maxHeartbeats 0
open BigOperators Real Nat Topology Rat
theorem mathd_algebra_293_apollo (x : NNReal) :
Real.sqrt (60 * x) * Real.sqrt (12 * x) * Real.sqrt (63 * x) = 36 * x * Real.sqrt (35 * x) := by
rw [← Real.sqrt_mul (by positivity), ← Real.sqrt_mul (by positivity)]
ring_nf
rw [Real.sqrt_eq_iff_mul_self_eq]
ring_nf
norm_num #1
have h : (√(x : ℝ) * √35)^2 = (x : ℝ) * 35 := by
rw [mul_pow]
simp
rw [h]
ring
simp_all only [ofNat_pos, mul_nonneg_iff_of_pos_right, NNReal.zero_le_coe, pow_nonneg]
have hx : x ≥ 0 := by
exact _root_.zero_le x
have h1 : √((↑x : ℝ) * √(35 : ℝ)) * (36 : ℝ) ≥ 0 := by
simp_all only [ge_iff_le, _root_.zero_le, NNReal.zero_le_coe, sqrt_mul, ofNat_pos, mul_nonneg_iff_of_pos_right, Real.sqrt_pos, Real.sqrt_nonneg]
have A : 0 ≤ (x : ℝ) := NNReal.coe_nonneg x
have B : 0 ≤ √((x : ℝ) * 35) := sqrt_nonneg ((x : ℝ) * 35)
have C : (0 : ℝ) ≤ 36 := by norm_num
exact mul_nonneg (mul_nonneg A B) C
```
**Structural Comparison:**
* **Common Initial Steps:** Both proofs begin with identical setup (imports, options, open commands) and the same initial tactic sequence: two rewrites using `Real.sqrt_mul`, followed by `ring_nf`, a rewrite using `Real.sqrt_eq_iff_mul_self_eq`, and another `ring_nf`.
* **Divergence Point:** The proofs diverge after the second `ring_nf`.
* The **LLM** proof concludes with a single tactic: `nlinarith #1`.
* The **LLM + APOLLO** proof continues with a detailed, multi-step justification. It starts with `norm_num #1`, then establishes a helper lemma `h` about squaring a product of square roots, rewrites using it, applies `ring` and `simp_all`, and then explicitly constructs three hypotheses (`hx`, `h1`, `A`, `B`, `C`) to prove the non-negativity of the terms involved. The final step is `exact mul_nonneg (mul_nonneg A B) C`.
### Key Observations
1. **Proof Length and Explicitness:** The LLM + APOLLO proof is significantly longer and more verbose. It explicitly states and proves intermediate facts (like `h`, `hx`, `h1`, `A`, `B`, `C`) that the LLM proof likely leaves to the `nlinarith` tactic to infer.
2. **Tactic Choice:** The LLM relies on the powerful, automated `nlinarith` (nonlinear arithmetic) tactic to close the goal. The LLM + APOLLO version uses a combination of more basic tactics (`norm_num`, `ring`, `simp_all`) and explicit `have` statements to build the proof step-by-step.
3. **Visual Highlighting:** The pink highlight on `nlinarith #1` in the LLM block and the green highlight on `#1` in the LLM + APOLLO block draw attention to the critical, diverging steps in each proof.
4. **Theorem Naming:** The theorems are named `mathd_algebra_293_llm` and `mathd_algebra_293_apollo`, indicating they are two versions of the same problem (likely from a dataset like Mathlib's `mathd` series).
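The closing non-negativity argument follows a standard Mathlib pattern. A minimal standalone sketch (the names `hx`/`hy` are illustrative, not from the figure):

```lean
import Mathlib

-- `mul_nonneg` composes pairwise: from `0 ≤ x` and `0 ≤ y` we get
-- `0 ≤ x * y`, and chaining it once more absorbs the constant 36,
-- mirroring `exact mul_nonneg (mul_nonneg A B) C` in the right panel.
example (x y : ℝ) (hx : 0 ≤ x) (hy : 0 ≤ y) : 0 ≤ x * y * 36 :=
  mul_nonneg (mul_nonneg hx hy) (by norm_num)
```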
### Interpretation
This image is a technical comparison designed to showcase the effect of the "APOLLO" system on an LLM's code generation, specifically for formal theorem proving in Lean.
* **What it demonstrates:** It suggests that APOLLO guides the LLM to produce proofs that are more **explicit, structured, and possibly more reliable**. Instead of relying on a single, opaque automation tactic (`nlinarith`), the APOLLO-augmented model generates a proof that breaks down the reasoning into human-readable steps, explicitly verifying the non-negativity conditions required for the square root manipulations.
* **Relationship between elements:** The side-by-side layout forces a direct comparison. The identical initial steps establish a common baseline, making the divergence in the final proof strategy starkly clear. The color coding (red for baseline LLM, green for enhanced APOLLO) reinforces a "before/after" or "problem/solution" narrative.
* **Underlying message:** The detailed proof on the right is presented as superior—not necessarily because it's shorter (it's not), but because it is more transparent and constructs its argument from more fundamental principles. This aligns with goals in formal verification where proof clarity, maintainability, and independence from complex automation are highly valued. The image argues that APOLLO helps the LLM move from generating merely *correct* code to generating *well-structured, explanatory* code.
</details>
Figure 11: Proof attempt of mathd_algebra_293 produced by the base model (Goedel-Prover-SFT) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/medium_1.png Details</summary>

### Visual Description
## [Code Comparison]: Lean 4 Theorem Proofs - LLM vs. LLM + APOLLO
### Overview
The image displays a side-by-side comparison of two Lean 4 code proofs for the same mathematical theorem. The left column, labeled "LLM" in a red header, shows a proof generated by a Large Language Model alone. The right column, labeled "LLM + APOLLO" in a green header, shows a proof generated by an LLM combined with a system named APOLLO. Both proofs aim to solve the same algebraic problem involving non-negative real numbers (`NNReal`).
### Components/Axes
The image is structured as two vertical code panels.
* **Left Panel (LLM):** Bordered in red. Header text: "LLM".
* **Right Panel (LLM + APOLLO):** Bordered in green. Header text: "LLM + APOLLO".
* **Code Language:** Lean 4, a formal proof language.
* **Highlighted Steps:** Both proofs contain numbered steps (e.g., `#1`, `#2`) highlighted with colored backgrounds (red in the left panel, green in the right) to indicate key proof steps or tactics.
### Detailed Analysis
#### **Left Panel: LLM Proof**
**Theorem Statement:**
```lean
theorem mathd_algebra_184_llm
(a b : NNReal)
(h₀ : (0 : NNReal) < a ∧ (0 : NNReal) < b)
(h₁ : a ^ (2 : ℕ) = (6 : NNReal) * b)
(h₂ : a ^ (2 : ℕ) = (54 : NNReal) / b) :
a = (3 : NNReal) * (NNReal.sqrt : NNReal → NNReal) (2 : NNReal) := by
```
**Proof Strategy & Key Steps:**
1. **Step #1 (`have h4`):** Establishes `b ≠ 0` using `linarith` on `h₀.right` (which states `0 < b`).
2. **Step #2 (`nlinarith`):** A nonlinear arithmetic tactic used to derive `h7 : b ^ 2 = (9 : NNReal)` from previous hypotheses.
3. **Step #3 (`nlinarith`):** Another use of nonlinear arithmetic, likely to handle the equation `b^2 = 9`.
4. **Step #4 (`linarith`):** Used to prove `h10 : b ^ 2 - (9 : NNReal) = (0 : NNReal)`.
5. **Step #5 (`linarith`):** Used in the context of `ring_nf at h10` to simplify the ring expression.
6. **Step #6 & #7 (`linarith`):** Used within a `cases'` statement to handle the two possible solutions from `(b - 3) * (b + 3) = 0`. Step #6 handles `b - 3 = 0`. Step #7 handles `b + 3 = 0` and derives a contradiction with `b > 0` (from `h9`).
7. **Step #8 (`linarith`):** Used to prove `h14 : a ^ 2 = (18 : NNReal)` after rewriting with `hb`.
8. **Step #9 (`rw [h15, h16]`):** The final rewrite step using previously established equalities for `a` and `Real.sqrt 18` to conclude the proof with `exact ha`.
**Overall Flow:** The LLM proof follows a relatively direct path: establish `b ≠ 0`, derive `b^2 = 9`, solve for `b = 3` (discarding `b = -3`), substitute to find `a^2 = 18`, and then express `a` in the required form.
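The informal algebra underlying both proofs can be summarized in a few lines:

```latex
a^2 = 6b \;\text{ and }\; a^2 = \tfrac{54}{b}
  \;\Rightarrow\; 6b = \tfrac{54}{b}
  \;\Rightarrow\; b^2 = 9
  \;\Rightarrow\; b = 3 \quad (\text{since } b > 0),
\qquad
a^2 = 6 \cdot 3 = 18
  \;\Rightarrow\; a = \sqrt{18} = 3\sqrt{2}.
```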
#### **Right Panel: LLM + APOLLO Proof**
**Theorem Statement:**
```lean
theorem mathd_algebra_184_apollo
(a b : NNReal)
(h₀ : (0 : NNReal) < a ∧ (0 : NNReal) < b)
(h₁ : a ^ (2 : ℕ) = (6 : NNReal) * b)
(h₂ : a ^ (2 : ℕ) = (54 : NNReal) / b) :
a = (3 : NNReal) * (NNReal.sqrt : NNReal → NNReal) (2 : NNReal) := by
```
**Proof Strategy & Key Steps:**
The proof is significantly longer and more granular, breaking down each logical step into explicit `have` statements and using a wider variety of tactics.
1. **Step #1 (`have h4`):** Similar to the LLM proof, establishes `b ≠ 0` but uses `exact ne_of_gt hb` where `hb` is `0 < b`.
2. **Step #2 (`have h5`):** Derives `h5 : (6 : NNReal) * b ^ (2 : ℕ) = (54 : NNReal)` using `simp [pow_two]`, `rw [h6]`, and a sequence of tactics (`try norm_cast; try norm_num; ...`).
3. **Step #3 (`have h7`):** Derives `h7 : b ^ (2 : ℕ) = (9 : NNReal)` using `gcongr` (Mathlib's generalized-congruence tactic for monotonicity goals).
4. **Step #4 (`have h10`):** Proves `h10 : b ^ 2 - (9 : NNReal) = (0 : NNReal)` using `simp_all only [...]` with a specific list of simplification rules.
5. **Step #5 (`have hb`):** Concludes `hb : b = (3 : NNReal)` using `ring_nf at h10` and then `have h7'` and `have h11`.
6. **Step #6 (`have h13`):** Within a `cases'` branch, proves `h13 : b ^ 2 = 9 := by simpa using h7`.
7. **Step #7 (`simp_all only [...]`):** Another simplification step using a specific set of rules.
8. **Step #8 (`gcongr`):** Used to prove `h14 : a ^ 2 = (18 : NNReal)`.
9. **Step #9 (`have h19`):** A detailed calculation (`calc`) block proving `√18 = 3 * √2` by breaking it into `√(3^2 * 2) = √(3^2) * √2 = 3 * √2`, citing `Real.sqrt_mul` and `Real.sqrt_sq`.
**Overall Flow:** The APOLLO-assisted proof is more verbose and methodical. It explicitly constructs intermediate facts, uses a broader set of tactics (`gcongr`, `simp_all` with explicit rule lists, `calc` blocks), and provides more detailed justifications for each step, particularly in the final algebraic manipulation involving square roots.
### Key Observations
1. **Proof Length & Detail:** The LLM + APOLLO proof is approximately 2-3 times longer than the LLM-only proof. It contains many more intermediate `have` statements.
2. **Tactic Usage:** The LLM proof relies heavily on automated arithmetic solvers (`linarith`, `nlinarith`). The APOLLO proof uses these as well but supplements them with more structural tactics (`gcongr`, `simp_all` with explicit configurations) and manual calculation steps (`calc`).
3. **Transparency:** The APOLLO proof makes the logical structure more transparent. For example, the derivation of `√18 = 3√2` is fully spelled out in a `calc` block (Step #9), whereas in the LLM proof, it is handled by a single rewrite (`rw [Real.sqrt_eq_iff_sq_eq]`) and arithmetic tactics.
4. **Step Numbering:** The highlighted step numbers (`#1`, `#2`, etc.) do not correspond between the two proofs. They are internal to each proof's presentation.
### Interpretation
This image illustrates a comparison between two approaches to automated theorem proving in Lean 4.
* **What the data suggests:** The "LLM + APOLLO" system produces proofs that are more detailed, structured, and potentially more robust. The increased verbosity is not redundancy; it represents a more explicit encoding of the proof's logical dependencies. The use of tactics like `gcongr` and configured `simp_all` suggests a more controlled application of the simplifier compared to the LLM's reliance on powerful but sometimes opaque arithmetic decision procedures.
* **How elements relate:** Both proofs start from identical theorem statements and hypotheses. They follow the same high-level mathematical strategy (solve for `b`, then `a`). The divergence is in the *formal* strategy: the LLM seeks the shortest path using automation, while APOLLO guides the proof construction to be more explanatory and granular.
* **Notable trends/anomalies:** The key trend is the trade-off between proof compactness and proof transparency/explainability. The LLM proof is compact but may be harder for a human to follow or for another system to verify in a different context. The APOLLO proof is longer but each step is easier to validate independently. The detailed `calc` block for the square root identity is a prime example of this transparency. There are no mathematical outliers, as both proofs correctly arrive at the same conclusion. The "outlier" is the methodological difference itself, highlighting a design choice in AI-assisted formal verification: prioritizing succinctness versus prioritizing human-readable, step-by-step justification.
</details>
Figure 12: Proof attempt of mathd_algebra_184 produced by the base model (Kimina-Prover-Preview-Distill-7B) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/medium_2.png Details</summary>

### Visual Description
## Code Comparison Diagram: LLM vs. LLM + APOLLO Proof Generation
### Overview
The image displays a side-by-side comparison of two Lean 4 code blocks, each contained within a colored, rounded rectangle. The left block, outlined in red, is labeled "LLM". The right block, outlined in green, is labeled "LLM + APOLLO". Both blocks contain formal mathematical proofs for what appears to be the same underlying algebraic theorem, but the proof strategies and lengths differ significantly. The "LLM + APOLLO" proof is substantially longer and more detailed.
### Components/Axes
* **Layout:** Two vertical panels.
* **Left Panel (LLM):** Red border, light pink background. Header text: "LLM" in a red box.
* **Right Panel (LLM + APOLLO):** Green border, light green background. Header text: "LLM + APOLLO" in a green box.
* **Content Type:** Formal proof code written in the Lean 4 theorem prover language.
* **Language:** The primary language is Lean 4 code. Mathematical notation is embedded within the code (e.g., `√`, `*`, `^`, `:=`).
* **Annotations:** The left panel contains a highlighted red section with the comment `#1`.
### Detailed Analysis
#### **Left Panel: LLM Proof**
**Theorem Name:** `mathd_algebra_293_llm`
**Goal:** Prove that for `x : NNReal`, `Real.sqrt (60 * x) * Real.sqrt (12 * x) * Real.sqrt (63 * x) = 36 * x * Real.sqrt (35 * x)`.
**Proof Structure (Transcribed):**
```lean
import Mathlib
import Aesop
set_option maxHeartbeats 0
open BigOperators Real Nat Topology Rat
theorem mathd_algebra_293_llm (x : NNReal) :
Real.sqrt (60 * x) * Real.sqrt (12 * x) * Real.sqrt (63 * x) = 36 * x * Real.sqrt (35 * x) := by
have h1 : Real.sqrt (60 * x) * Real.sqrt (12 * x) * Real.sqrt (63 * x) = Real.sqrt ((60 * x) * (12 * x) * (63 * x)) := by
rw [← Real.sqrt_mul (by positivity), ← Real.sqrt_mul (by positivity)]
rw [h1]
have h2 : (60 * x) * (12 * x) * (63 * x) = (36 * x) ^ 2 * (35 * x) := by
ring_nf
<;> simp [x_mul_x]
<;> linarith [show (0 : ℝ) ≤ x from by positivity]
rw [h2] -- #1
rw [Real.sqrt_mul (by positivity)]
rw [Real.sqrt_sq (by simp [x])]
all_goals positivity
```
* **Key Steps:** The proof uses a `have` tactic to assert an intermediate equality `h1` by combining square roots. It then rewrites using `h1` and asserts a second equality `h2` about the product inside the square root. The final steps rewrite using `h2` and simplify the square root of a square.
* **Highlighted Section:** The lines from `rw [h2]` to `all_goals positivity` are highlighted in a darker pink/red, with a comment `#1` in the right margin.
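The key identity `h2` rests on a simple arithmetic check of the coefficients:

```latex
(60x)(12x)(63x) \;=\; 45360\,x^{3} \;=\; (1296 \cdot 35)\,x^{3} \;=\; (36x)^{2}(35x),
\qquad 36^{2} = 1296,\quad 1296 \cdot 35 = 45360.
```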
#### **Right Panel: LLM + APOLLO Proof**
**Theorem Name:** `mathd_algebra_293_apollo`
**Goal:** Identical to the LLM proof.
**Proof Structure (Transcribed - Key Sections):**
The proof is much longer. It begins with similar imports and setup, then proceeds with a more granular, step-by-step approach.
**Initial Steps (Similar to LLM):**
```lean
have h1 : Real.sqrt (60 * x) * Real.sqrt (12 * x) * Real.sqrt (63 * x) = Real.sqrt ((60 * x) * (12 * x) * (63 * x)) := by
rw [← Real.sqrt_mul (by positivity), ← Real.sqrt_mul (by positivity)]
rw [h1]
have h2 : (60 * x) * (12 * x) * (63 * x) = (36 * x) ^ 2 * (35 * x) := by
ring_nf
<;> simp [x_mul_x]
<;> linarith [show (0 : ℝ) ≤ x from by positivity]
```
**Divergence and Elaboration (After `rw [h2]`):**
Instead of directly simplifying the square root, the proof enters a long sequence of `try` tactics and then explicitly breaks down the square roots of the numerical coefficients.
```lean
try norm_cast; try norm_num; try simp_all; try ring_nf at *; try native_decide; try linarith; try nlinarith
have h1 : √(60 : ℝ) = 2 * √15 := by
rw [Real.sqrt_eq_iff_sq_eq] <;> norm_num
<;> ring_nf
<;> norm_num
have h2' : √(12 : ℝ) = 2 * √3 := by
... (similar structure)
have h3 : √(63 : ℝ) = 3 * √7 := by
...
have h4 : √(↑x : ℝ) ^ (3 : ℕ) = √(↑x : ℝ) * (↑x : ℝ) := by
...
```
The proof then continues with a lengthy `calc` block that meticulously shows the equality step-by-step, factoring and recombining the square roots of the prime factors (15, 3, 7) and the variable `x`. It uses lemmas like `h21`, `h22`, `h221`, and `h23` to manage intermediate results.
**Final Lines:**
```lean
rw [h23]
linarith [h22]
rw [← h2]
ring
```
### Key Observations
1. **Proof Strategy:** The LLM proof is concise, relying on powerful automation tactics (`ring_nf`, `simp`, `linarith`) to close the goal quickly after asserting the key algebraic identity `h2`. The LLM+APOLLO proof is verbose and explicit, manually decomposing the problem into lemmas about individual square roots and then reconstructing the proof in a `calc` block.
2. **Length and Detail:** The LLM+APOLLO proof is approximately 3-4 times longer in terms of lines of code.
3. **Use of Automation:** The LLM proof uses automation aggressively at the end. The LLM+APOLLO proof uses a block of `try` tactics early on but then proceeds with mostly manual, detailed steps.
4. **Highlighted Region:** The red highlight in the LLM panel draws attention to the final simplification steps, which are the point where the concise automation is applied.
### Interpretation
This image is a technical comparison demonstrating two different approaches to automated theorem proving in Lean 4 for the same mathematical statement.
* **What it shows:** It contrasts a "standard" LLM-generated proof (left) with a proof generated or augmented by a system called "APOLLO" (right). The LLM proof is efficient and relies on the prover's built-in automation. The APOLLO-augmented proof is more pedagogical and explicit, breaking the problem down into fundamental algebraic steps.
* **Relationship between elements:** The side-by-side layout invites direct comparison. The identical theorem statement and initial steps establish a common ground, highlighting the divergence in proof strategy after the key identity `h2` is established.
* **Potential Meaning:** The comparison likely aims to showcase a trade-off. The LLM proof is shorter and may be preferred for efficiency. The APOLLO proof, while longer, is more transparent, easier for a human to follow, and may be more robust or generalizable because it avoids relying on the "black box" of powerful automation tactics. The highlighted section in the LLM proof may indicate the specific point where APOLLO's more detailed approach begins. This could be illustrating APOLLO's ability to generate more verifiable, step-by-step proofs compared to an LLM that might take shortcuts via automation.
</details>
Figure 13: Proof attempt of mathd_algebra_293 produced by the base model (Kimina-Prover-Preview-Distill-7B) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/medium_3.png Details</summary>

### Visual Description
## [Code Comparison]: Lean 4 Proof Strategies for a System of Linear Equations
### Overview
The image displays a side-by-side comparison of three Lean 4 code blocks, each attempting to formally prove the same mathematical theorem. The theorem solves a system of two linear equations in variables `f` and `z` over the complex numbers (`ℂ`). The comparison highlights differences in proof strategy, verbosity, and the use of automation tactics between a baseline "LLM" approach and an enhanced "LLM + APOLLO" approach.
### Components/Axes
The image is divided into three vertical columns:
1. **Left Column (Header: "LLM" in a red box):** Contains a concise proof attempt. Some lines are highlighted with a pink background.
2. **Middle Column (Header: "LLM + APOLLO" in a green box):** Contains a much more verbose and detailed proof attempt.
3. **Right Column (No header, green background):** Contains a third, distinct proof attempt, also detailed. It appears to be a continuation or alternative to the middle column's proof.
**Common Code Structure (All Columns):**
* **Imports:** `import Mathlib`, `import Aesop`
* **Options:** `set_option maxHeartbeats 0`, `set_option pp.numericTypes true`, `set_option pp.coercions.types true`
* **Open Sections:** `open BigOperators Real Nat Topology Rat`
* **Theorem Statement:** All three prove a theorem with the same core statement (solving for `f` and `z` given two equations), but with different internal names.
### Detailed Analysis
#### **1. Left Column: "LLM" Proof**
* **Theorem Name:** `algebra_2varlineareq_fp3zeq11_3tfm1m5zeqn68_feqn10_zeq7_llm`
* **Hypotheses:**
* `h0 : f + (3 : ℂ) * z = (11 : ℂ)`
* `h1 : (3 : ℂ) * (f - 1) - (5 : ℂ) * z = (-68 : ℂ)`
* **Goal:** `f = (-10 : ℂ) ∧ z = (7 : ℂ)`
* **Proof Strategy:** A short, direct proof using the `linarith` (linear arithmetic) tactic.
1. Simplifies the second equation (`h1`) using `mul_sub` to get `h2 : 3 * f - 3 - 5 * z = -68`.
2. Combines like terms to isolate `3*f - 5*z`, creating `h3 : 3 * f - 5 * z = -65`. **This line is highlighted in pink.**
3. Uses `linarith [h0, h3]` to solve for `z`, creating `z_eq : z = 7`. **This line is highlighted in pink.**
4. Uses `linarith [h0, z_eq]` to solve for `f`, creating `f_eq : f = -10`. **This line is highlighted in pink.**
5. Concludes with `exact And.intro f_eq z_eq`.
* **Observation:** The proof is compact but relies heavily on the `linarith` tactic to perform multiple algebraic steps implicitly. Note that `linarith` requires a linearly ordered field and is therefore not applicable to goals over `ℂ`; this is consistent with the figure caption's note that this attempt fails to compile, and the pink highlights likely mark the failing lines.
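For reference, `linarith` does close goals of exactly this shape when the variables live in a linearly ordered field such as `ℝ` (it carries no meaning over `ℂ`, which has no order). A minimal sketch under that substitution, with our own anonymous example rather than a line from the figure:

```lean
import Mathlib

-- Hypothetical illustration: over ℝ, `linarith` closes each component
-- of the conjunction directly from the two linear hypotheses.
example (f z : ℝ) (h0 : f + 3 * z = 11)
    (h1 : 3 * (f - 1) - 5 * z = -68) : f = -10 ∧ z = 7 := by
  constructor
  · linarith  -- eliminates z to solve for f
  · linarith  -- eliminates f to solve for z
```

This is a sketch of the tactic's intended use, not the figure's proof; the original theorem is stated over `ℂ`, where a different tactic is needed.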
#### **2. Middle Column: "LLM + APOLLO" Proof**
* **Theorem Name:** `algebra_2varlineareq_fp3zeq11_3tfm1m5zeqn68_feqn10_zeq7_apollo`
* **Hypotheses:** Identical to the left column (`h0`, `h1`).
* **Goal:** Identical to the left column.
* **Proof Strategy:** An extremely verbose, step-by-step proof that explicitly performs each algebraic manipulation. It uses a wide array of tactics (`norm_num`, `simp_all`, `ring_nf`, `native_decide`, `linarith`, `nlinarith`) in a trial-and-error fashion at many steps.
* **Key Steps:**
* Creates `h2` and `h3` similar to the LLM proof but with explicit tactic sequences.
* Expands and manipulates equations to isolate terms (e.g., `h_expanded`, `h_added`, `h_add`, `h_sum`, `h_subst`, `h_simpl`).
* Solves for `z` (`z_eq`) and `f` (`f_eq_def`, `f_eq`) through explicit substitution and simplification, culminating in `have isolate_z : (14 : ℂ) * z = 98`.
* The proof in this column is cut off at the bottom with an ellipsis (`...`), indicating it continues.
#### **3. Right Column: Alternative Detailed Proof**
* **Theorem Name:** Not fully visible, but the code is a continuation of a proof.
* **Proof Strategy:** This column shows a different detailed proof path, focusing on manipulating equations to isolate `z`.
* **Key Steps:**
* Defines intermediate equalities `eq3`, `eq4`, and `eq_final`.
* Solves for `z` using `have z_sol : z = (7 : ℂ) := by norm_num`.
* Explicitly handles the non-zero denominator: `have h_nonzero : (14 : ℂ) ≠ 0 := by norm_num`.
* Uses the inverse to solve: `have z_inv : z = 98 * ((14 : ℂ)⁻¹) := by exact (eq_mul_inv_iff_mul_eq₀ h_nonzero).mpr isolate_z`.
* Simplifies the division: `have z_simpl : 98 / (14 : ℂ) = (7 : ℂ) := by norm_num`.
* Solves for `f` (`f_eq`) using the first hypothesis `h0`.
* Concludes with `exact And.intro f_eq z_eq`.
### Key Observations
1. **Proof Length & Transparency:** The "LLM" proof is short and opaque, relying on `linarith` as a black box. The "LLM + APOLLO" proofs are several times longer, making every algebraic step explicit.
2. **Tactic Usage:** The enhanced proofs employ a much broader and more repetitive set of tactics, suggesting a search-based or guided proof strategy.
3. **Common Goal:** All three attempts target the same statement, `f = -10` and `z = 7`; per the figure caption, only the Apollo-repaired version compiles.
4. **Visual Coding:** The red "LLM" header and pink highlights contrast with the green "LLM + APOLLO" header and green background, visually framing the comparison as between a baseline and an improved method.
### Interpretation
This image is a technical comparison of automated theorem proving strategies in the Lean 4 proof assistant. It demonstrates the difference between a proof generated by a standard Large Language Model (LLM) and one enhanced by a system called APOLLO.
* **What it shows:** The LLM produces a concise, minimally explanatory attempt that, per the figure caption, fails to compile in Lean. The APOLLO-enhanced proofs, while verbose to the point of being impractical for a human to write, decompose the problem into a long chain of simple, verifiable steps. This makes the proof's logic completely transparent and robust.
* **Why it matters:** In formal verification, the goal is not just correctness but also trust and understandability. The verbose proofs act as a "certificate" where each step can be independently checked by simple tactics (`norm_num`). This approach is less elegant but potentially more reliable and easier to debug. The comparison likely aims to showcase APOLLO's ability to generate more detailed, trustworthy, and human-auditable formal proofs compared to a vanilla LLM.
* **Underlying Pattern:** The core mathematical task is solving the linear system:
1. `f + 3z = 11`
2. `3(f - 1) - 5z = -68` → `3f - 3 - 5z = -68` → `3f - 5z = -65`
Subtracting 3 times the first equation from the second: `(3f - 5z) - 3*(f + 3z) = -65 - 3*11` → `-14z = -98` → `z = 7`. Substituting back gives `f = -10`. The APOLLO proofs essentially mechanize this manual algebraic process in excruciating detail.
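Over `ℂ`, where order-based tactics do not apply, this elimination can be certified with Mathlib's `linear_combination` tactic, which verifies an explicit linear combination of the hypotheses by ring normalization. A sketch with coefficients derived from the elimination above (our own reconstruction, not a proof taken from the figure):

```lean
import Mathlib

example (f z : ℂ) (h0 : f + 3 * z = 11)
    (h1 : 3 * (f - 1) - 5 * z = -68) : f = -10 ∧ z = 7 := by
  refine ⟨?_, ?_⟩
  -- (5/14)·h0 + (3/14)·h1 reduces to f + 10 = 0
  · linear_combination (5 / 14 : ℂ) * h0 + (3 / 14 : ℂ) * h1
  -- (3/14)·h0 − (1/14)·h1 reduces to z − 7 = 0
  · linear_combination (3 / 14 : ℂ) * h0 - (1 / 14 : ℂ) * h1
```

The coefficients are exactly those of the manual elimination: scaling `h0` by 3 and subtracting it from the normalized `h1` isolates `z`, and back-substitution isolates `f`.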
</details>
Figure 14: Proof attempt of algebra_2varlineareq_fp3zeq11_3tfm1m5zeqn68_feqn10_zeq7 produced by the base model (o4-mini) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/medium_4.png Details</summary>

### Visual Description
## Code Comparison Diagram: LLM vs. LLM + APOLLO Proof Attempts
### Overview
The image displays a side-by-side comparison of two Lean 4 code blocks, each attempting to prove the same mathematical theorem (`amc12a_2019_p12`). The left block, outlined in red and labeled "LLM," shows a shorter, more direct proof attempt. The right block, outlined in green and labeled "LLM + APOLLO," shows a longer, more detailed proof that introduces intermediate variables and lemmas. The comparison likely illustrates the difference in proof strategy or capability between a base Large Language Model (LLM) and an LLM augmented with a system named "APOLLO."
### Components/Axes
The image is structured as two vertical panels:
1. **Left Panel (LLM):**
* **Header:** A red rectangular label at the top center containing the text "LLM".
* **Border:** A red rounded rectangle enclosing the code.
* **Content:** Lean 4 code with specific lines highlighted in a light red/pink background. Annotations `#1`, `#2`, and `#3` are placed next to certain lines.
2. **Right Panel (LLM + APOLLO):**
* **Header:** A green rectangular label at the top center containing the text "LLM + APOLLO".
* **Border:** A green rounded rectangle enclosing the code.
* **Content:** Lean 4 code with a light green background. The code is more extensive. Annotations `#1`, `#2`, and `#3` are also present, corresponding to similar logical steps as in the left panel. The code continues beyond the visible frame, indicated by an ellipsis (`...`) at the top and bottom.
### Detailed Analysis
**Theorem Statement (Common to Both):**
Both blocks begin with identical setup and theorem statements:
```lean
import Mathlib
import Aesop
set_option maxHeartbeats 0
open BigOperators Real Nat Topology Rat
theorem amc12a_2019_p12_... (x y : ℝ) (h₀ : x ≠ 1 ∧ y ≠ 1)
(h₁ : Real.log x / Real.log 2 = Real.log 16 / Real.log y) (h₂ : x * y = 64) :
(Real.log (x / y) / Real.log 2) ^ 2 = 20 := by
```
The theorem concerns real numbers `x` and `y`, with hypotheses about their logs and product, aiming to prove a specific equation involving the log of their ratio.
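The target identity follows from a short base-2 logarithm computation. Writing $a = \log_2 x$ and $b = \log_2 y$ (our notation, not the figure's), the hypotheses and goal become:

```latex
h_1\colon\ \frac{\log x}{\log 2} = \frac{\log 16}{\log y}
  \;\Longleftrightarrow\; a = \frac{4}{b}
  \;\Longrightarrow\; ab = 4,
\qquad
h_2\colon\ xy = 64 \;\Longrightarrow\; a + b = 6,
```

```latex
\left(\frac{\log(x/y)}{\log 2}\right)^{2}
  = (a - b)^2
  = (a + b)^2 - 4ab
  = 36 - 16
  = 20.
```

The substitution $L$, $X$, $Y$ described in the right panel mechanizes precisely this change of variables.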
**Left Panel (LLM) Proof Steps:**
1. **`have x_nonzero : x ≠ 0 := by`** (Highlighted, annotated `#1`):
* `rintro rfl`
* `simp at h₁`
2. **`have y_nonzero : y ≠ 0 := by`** (Highlighted, annotated `#2`):
* `rintro rfl`
* `simp at h₁`
3. **`field_simp [Real.log_div, Real.log_mul, Real.log_inv, Real.log_pow] at h₁ ⊢`**
4. **`nlinarith`** (Highlighted, annotated `#3`)
**Right Panel (LLM + APOLLO) Proof Steps:**
The proof is more elaborate. Key differences and extensions include:
1. **`have x_nonzero : x ≠ 0 := by`** (Annotated `#1`):
* `rintro rfl`
* `simp at h₁`
* `linarith` (Uses a different tactic than the LLM version).
2. **`have y_nonzero : y ≠ 0 := by`** (Annotated `#2`):
* `rintro rfl`
* `simp at h₁`
* `linarith`
3. **`field_simp` and `ring_nf` tactics** are applied to `h₁`.
4. **Introduction of Intermediate Variables** (Annotated `#3`):
* `let L := Real.log (2 : ℝ)`
* `let X := Real.log x`
* `let Y := Real.log y`
5. **Derivation of Lemmas:** The proof then builds a chain of logical steps using these variables:
* `have hXY : X + Y = Real.log 64 := by ...`
* `have hXY' : X + Y = 6 * L := by ...`
* `have hL : 6 * Real.log (2 : ℝ) = Real.log (64) := by ...`
* `have h2_pos : (2 : ℝ) > 0 := by norm_num`
* (The code continues with more `have` statements, defining `h16`, `h2_pos`, `H'`, `hX`, `hL'`, `hXY_prod`, `Y_nonzero`, `y_log_nonzero`, `hy_exp`, `y_eq_one`, `diff_eq`, etc., before concluding with `linarith`).
### Key Observations
1. **Proof Strategy:** The LLM proof is concise, relying on `field_simp` and a single `nlinarith` call. The LLM + APOLLO proof is methodical, explicitly defining logarithmic substitutions (`L`, `X`, `Y`) and proving numerous intermediate equalities and properties.
2. **Tactic Usage:** The LLM version uses `nlinarith` (nonlinear arithmetic) at the end. The APOLLO version uses `linarith` (linear arithmetic) multiple times and employs more rewriting (`rw`) and simplification (`simp`) steps.
3. **Structure:** The APOLLO proof is highly structured, with each logical step isolated in a `have` block. This makes the proof more readable and verifiable but significantly longer.
4. **Annotations:** The `#1`, `#2`, `#3` annotations in both panels highlight analogous points in the proofs: establishing non-zero conditions (`#1`, `#2`) and the main proof strategy or variable introduction (`#3`).
### Interpretation
This diagram serves as a qualitative comparison of proof generation capabilities. It suggests that the "APOLLO" augmentation leads to a more **structured, explicit, and human-readable** proof style.
* **LLM Approach:** Represents a more direct, "black-box" style where complex tactics (`nlinarith`) are used to close the goal after some preprocessing. This is efficient but may be less instructive and harder to debug if it fails.
* **LLM + APOLLO Approach:** Represents a **decompositional** strategy. By introducing `L`, `X`, and `Y`, it transforms the problem about `x` and `y` into a problem about their logarithms, which is a standard technique for solving such equations. The extensive use of intermediate lemmas (`have` statements) breaks the proof into small, manageable, and self-contained steps. This mirrors how a human mathematician might structure the proof, enhancing transparency and trust in the result.
The key takeaway concerns the **process and explainability** of the proof as much as its correctness: the theorem statement is identical in both panels, but per the figure caption only the Apollo-repaired proof compiles. The APOLLO-augmented system appears to prioritize generating a proof that is not just correct, but also **verifiable and pedagogically clear**, which is a crucial goal in automated reasoning and AI-assisted mathematics. The ellipsis (`...`) indicates the APOLLO proof is even longer than shown, further emphasizing its thoroughness.
</details>
Figure 15: Proof attempt of amc12a_2019_p12 produced by the base model (Goedel-Prover-SFT) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals.
<details>
<summary>examples/comparison/long.png Details</summary>

### Visual Description
## Code Comparison Diagram: Lean 4 Theorem Proofs (LLM vs. LLM + APOLLO)
### Overview
The image displays a side-by-side comparison of three Lean 4 proof scripts for the same mathematical theorem (`mathd_algebra_289`). The leftmost column shows a proof generated by a base LLM, highlighted with red annotations. The middle and right columns show a proof generated by an LLM augmented with a system called "APOLLO," highlighted with green annotations. The comparison aims to demonstrate differences in proof strategy, tactic usage, and structure between the two methods.
### Components/Axes
The image is structured into three vertical columns, each with a header and a block of Lean 4 code.
1. **Column Headers (Top of each column):**
* **Left Column:** `LLM` (Red background, white text)
* **Middle Column:** `LLM + APOLLO` (Green background, white text)
* **Right Column:** `LLM + APOLLO` (Green background, white text)
2. **Code Blocks:** Each column contains a continuous block of Lean 4 code, starting with a `theorem` declaration and ending with `exact` or a similar concluding tactic. Lines within the code are annotated with colored highlights and numbered tags (e.g., `#1`, `#2`).
3. **Annotation System:**
* **Red Highlights (Left Column):** Indicate specific lines or steps in the base LLM proof. Each highlighted line has a small red tag with a number (e.g., `#1`, `#2`) on its right side.
* **Green Highlights (Middle & Right Columns):** Indicate corresponding or alternative steps in the APOLLO-augmented proof. Each highlighted line has a small green tag with a number (e.g., `#1`, `#2`) on its right side.
### Detailed Analysis
#### **Left Column: LLM Proof**
* **Theorem:** `theorem mathd_algebra_289_llm (k t m n : ℕ) (h₀ : Nat.Prime m ∧ Nat.Prime n) (h₁ : t < k) (h₂ : (k ^ 2 : ℤ) - m * k + n = 0) (h₃ : (t ^ 2 : ℤ) - m * t + n = 0) : m ^ n + n ^ m + k ^ t + t ^ k = 20 := by`
* **Proof Structure:** The proof proceeds through a series of `have` statements, introducing intermediate hypotheses (`h4` through `h27`). It uses tactics like `linarith`, `nlinarith`, `omega`, `ring_nf`, `norm_num`, `exact_mod_cast`, and `by_contra`.
* **Annotated (Red) Lines:**
* `#1`: `have : t = 1 := by nlinarith`
* `#2`: `have : n = 1 := by omega`
* `#3`: `have : k = 1 := by nlinarith`
* `#4`: `interval_cases k <;> omega`
* `#5`: `have h24 : (k : ℤ) + (t : ℤ) = m := by exact h12`
* `#6`: `have h34 : k % 2 = 0 := by omega`
* `#7`: `rw [h35]`
* `#8`: `have h30 : k = 2 := by interval_cases k <;> omega`
#### **Middle Column: LLM + APOLLO Proof (Part 1)**
* **Theorem:** `theorem mathd_algebra_289_apollo (k t m n : ℕ) (h₀ : Nat.Prime m ∧ Nat.Prime n) (h₁ : t < k) (h₂ : (k ^ 2 : ℤ) - m * k + n = 0) (h₃ : (t ^ 2 : ℤ) - m * t + n = 0) : m ^ n + n ^ m + k ^ t + t ^ k = 20 := by`
* **Proof Structure:** This proof also uses a sequence of `have` statements but employs a different set of tactics, often combining multiple tactics in a single step (e.g., `try norm_cast ; try norm_num ; try simp_all ; try ring_nf at * ; try native_decide ; try linarith ; try nlinarith`). It makes heavy use of `simp_all only [...]` with specific lemma lists.
* **Annotated (Green) Lines:**
* `#1`: `have : t = 1 := by try norm_cast ; try norm_num ; try simp_all ; try ring_nf at * ; try native_decide ; try linarith ; try nlinarith`
* `#2`: `have : n = 1 := by simp_all only [sub_self, mul_zero, gt_iff_lt]`
* `#3`: `have : k = 1 := by simp_all only [Nat.cast_one, sub_self, mul_zero, gt_iff_lt, mul_eq_one, le_refl]`
* `#4`: `simp_all only [sub_self, mul_zero, gt_iff_lt]`
#### **Right Column: LLM + APOLLO Proof (Part 2 - Continuation)**
* **Proof Structure:** This column continues the proof from the middle column. It introduces more hypotheses (`h21` through `h40`) and uses tactics like `gcongr`, `linarith`, `omega`, `rcases`, `apply`, `calc`, and `norm_num`.
* **Annotated (Green) Lines:**
* `#5`: `have h24 : (k : ℤ) + (t : ℤ) = m := by gcongr`
* `#6`: `have h34 : k % 2 = 0 := by try norm_cast ; try norm_num ; try simp_all ; try ring_nf at * ; try native_decide ; try linarith ; try nlinarith`
* `#7`: `try simp_all ; try norm_num ; try ring_nf at * ; try native_decide ; try linarith ; try nlinarith`
* `#8`: `have h30 : k = 2 := by try norm_cast ; try norm_num ; try simp_all ; try ring_nf at * ; try native_decide ; try linarith ; try nlinarith`
### Key Observations
1. **Proof Strategy Divergence:** The base LLM proof (red) relies heavily on arithmetic solvers (`linarith`, `nlinarith`, `omega`) and manual case analysis (`interval_cases`). The APOLLO-augmented proof (green) uses more structured simplification (`simp_all only [...]`) and broader, combined tactic blocks (`try ... ; try ...`).
2. **Tactic Specificity:** The APOLLO proof often specifies lists of lemmas for `simp_all` (e.g., `[sub_self, mul_zero, gt_iff_lt]`), suggesting a more targeted simplification approach. The base LLM proof uses more generic arithmetic tactics.
3. **Annotation Correspondence:** The numbered annotations (`#1` through `#8`) appear to mark analogous steps or decision points in the two proofs, allowing for direct comparison of how each method handles the same logical subgoal.
4. **Code Density:** The APOLLO proof columns appear slightly denser, with more complex single-line tactic calls, while the base LLM proof has more discrete, single-tactic lines.
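The bundled `try ... ; try ...` pattern described above amounts to running a fixed battery of finishing tactics in sequence, where each failed tactic is silently skipped and the chain ends once some tactic closes the goal. A toy illustration with our own example, not a line from the figure:

```lean
import Mathlib

-- Each `try` swallows failure: `norm_num` and `simp_all` make no
-- progress here and are skipped, and `linarith` finally closes the
-- goal from the two inequalities.
example (a b : ℝ) (h₁ : a ≤ b) (h₂ : b ≤ a) : a = b := by
  try norm_num ; try simp_all ; try ring_nf at * ; try linarith
```

This "try everything relevant" style trades proof minimality for robustness: the script succeeds as long as any tactic in the battery can finish the goal.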
### Interpretation
This diagram is a technical comparison designed to evaluate the performance of an AI system ("APOLLO") that assists or augments a base Large Language Model (LLM) in generating formal mathematical proofs in Lean 4.
* **What the data suggests:** The comparison implies that the APOLLO system guides the LLM towards a different, potentially more robust or systematic proof strategy. The use of targeted simplification (`simp_all` with specific lemmas) and combined tactic blocks may lead to proofs that are less reliant on brute-force arithmetic solving and more on algebraic simplification and logical structure.
* **How elements relate:** The red and green highlights create a direct visual mapping between corresponding proof steps. This allows a reviewer to assess not just the final output, but the *process*—how each system arrives at intermediate conclusions like `t = 1`, `n = 1`, or `k = 2`.
* **Notable patterns/anomalies:** The most striking pattern is the shift from sequential, single-tactic reasoning in the LLM column to bundled, multi-tactic attempts in the APOLLO columns. This could indicate that APOLLO promotes a "try everything relevant" approach at key junctures, which might be more successful in navigating complex proof states. The absence of `interval_cases` in the annotated APOLLO steps (compared to its use in LLM steps `#4` and `#8`) is a notable strategic difference, suggesting APOLLO finds alternative paths to the same conclusions.
**Language Note:** The code is written in the **Lean 4** programming language and proof assistant. All comments, tactic names, and syntax are part of this formal language. The theorem statement and proof steps are in mathematical logic expressed through Lean's syntax.
</details>
Figure 16: Proof attempt of mathd_algebra_289 produced by the base model (Kimina-Prover-Preview-Distill-7B) versus after Apollo’s targeted repair. The left shows the original end-to-end LLM output, which does not compile in Lean; the right shows the corrected, repaired proof assembled by Apollo that closes all of the goals. Import headers are the same as in the other presented examples; for readability, they are omitted from this diagram.