# An In-Context Learning Agent for Formal Theorem-Proving
## Abstract
We present an in-context learning agent for formal theorem-proving in environments like Lean and Coq. Current state-of-the-art models for the problem are finetuned on environment-specific proof data. By contrast, our approach, called Copra, repeatedly asks a high-capacity, general-purpose large language model (GPT-4) to propose tactic applications from within a stateful backtracking search. Proposed tactics are executed in the underlying proof environment. Feedback from the execution is used to build the prompt for the next model query, along with selected information from the search history and lemmas retrieved from an external database. We evaluate our implementation of Copra on the $\mathtt{miniF2F}$ benchmark for Lean and a set of Coq tasks from the CompCert project. On these benchmarks, Copra significantly outperforms few-shot invocations of GPT-4. It also compares favorably against finetuning-based approaches, outperforming ReProver, a state-of-the-art finetuned approach for Lean, in terms of the pass@ $1$ metric. Our code and data are available at https://github.com/trishullab/copra.
## 1 Introduction
The ability of large language models (LLMs) to learn in-context (Brown et al., 2020) is among the most remarkable recent findings in machine learning. Since its introduction, in-context learning has proven fruitful in a broad range of domains, from text and code generation (OpenAI, 2023b) to image generation (Ramesh et al., 2021) to game-playing (Wang et al., 2023a). In many tasks, it is now best practice to prompt a general-purpose, black-box LLM rather than finetune smaller-scale models.
In this paper, we investigate the value of in-context learning in the discovery of formal proofs written in frameworks like Lean (de Moura et al., 2015) and Coq (Huet et al., 1997). Such frameworks allow proof goals to be iteratively simplified using tactics such as variable substitution and induction. A proof is a sequence of tactics that iteratively “discharges” a given goal.
Automatic theorem-proving is a longstanding challenge in computer science (Newell et al., 1957). Traditional approaches to the problem were based on discrete search and had difficulty scaling to complex proofs (Bundy, 1988; Andrews & Brown, 2006; Blanchette et al., 2011). More recent work has used neural models — most notably, autoregressive language models (Polu & Sutskever, 2020; Han et al., 2021; Yang et al., 2023) — that generate a proof tactic by tactic.
The current crop of such models is either trained from scratch or fine-tuned on formal proofs written in a specific proof framework. By contrast, our method uses a highly capable, general-purpose, black-box LLM (GPT-4-turbo (OpenAI, 2023a) For brevity, we refer to GPT-4-turbo as GPT-4 in the rest of this paper. ) that can learn in context. We show that few-shot prompting of GPT-4 is not effective at proof generation. However, one can achieve far better performance with an in-context learning agent (Yao et al., 2022; Wang et al., 2023a; Shinn et al., 2023) that repeatedly invokes GPT-4 from within a higher-level backtracking search and uses retrieved knowledge and rich feedback from the proof environment. Without any framework-specific training, our agent achieves performance comparable to — and by some measures better than — state-of-the-art finetuned approaches.
<details>
<summary>extracted/5780501/copramainfigure.png Details</summary>

### Visual Description
\n
## Diagram: COPRAsystem Workflow
### Overview
The image depicts a workflow diagram illustrating the COPRAsystem, a process for formal theorem proving with optional informal hints. The diagram shows a cyclical process with five numbered steps and feedback loops, all centered around a central COPRAsystem logo. The process begins with prompt synthesis and ends with a "QED" (Quod Erat Demonstrandum - which was to be demonstrated) result.
### Components/Axes
The diagram consists of several rectangular blocks representing steps in the process, connected by arrows indicating the flow of information. The central element is a circular logo labeled "COPRAsystem". There is also a block at the top labeled "Formal theorem + informal hints (optional)". The diagram is enclosed within a dashed grey border, suggesting a closed system or iterative process.
### Detailed Analysis or Content Details
The diagram details the following steps:
1. **1. Prompt Synthesis:** A light green rectangular block. An arrow originates from "Formal theorem + informal hints (optional)" and points to this block. An arrow also originates from this block and points to "2. Tactic Parsing".
2. **2. Tactic Parsing:** A light blue rectangular block. An arrow originates from "1. Prompt Synthesis" and points to this block. An arrow also originates from this block and points to "3. Execute the tactic via environment (Lean, Coq)".
3. **3. Execute the tactic via environment (Lean, Coq):** A yellow rectangular block. An arrow originates from "2. Tactic Parsing" and points to this block. An arrow also originates from this block and points to "Feedback".
4. **4. Augment the prompt; Backtrack if needed:** A pink rectangular block. An arrow originates from "3. Execute the tactic via environment (Lean, Coq)" and points to this block. An arrow also originates from this block and points back to "1. Prompt Synthesis".
5. **5. Query lemma database:** A light green rectangular block. An arrow originates from "1. Prompt Synthesis" and points to "Lemma Database". An arrow also originates from "Lemma Database" and points to "4. Augment the prompt; Backtrack if needed".
6. **Lemma Database:** A light green rectangular block. Arrows originate from "5. Query lemma database" and point to "4. Augment the prompt; Backtrack if needed".
7. **Feedback:** A pink rectangular block. An arrow originates from "3. Execute the tactic via environment (Lean, Coq)" and points to this block. An arrow also originates from this block and points to "QED".
8. **QED:** A teal rectangular block. An arrow originates from "Feedback" and points to this block.
9. **COPRAsystem Logo:** A circular logo in the center of the diagram, containing a stylized green design.
10. **Formal theorem + informal hints (optional):** A yellow rectangular block at the top of the diagram.
The arrows indicate a flow of information. The dashed grey border encompasses steps 1-5, the Lemma Database, and the COPRAsystem logo, suggesting an iterative process. The feedback loop from "Execute the tactic" to "Augment the prompt" and back to "Prompt Synthesis" indicates a refinement process.
### Key Observations
The diagram highlights the iterative nature of the COPRAsystem. The process is not strictly linear; there are feedback loops that allow for refinement and backtracking. The inclusion of "informal hints (optional)" suggests that the system can operate with varying levels of user guidance. The "QED" block signifies the successful completion of the theorem proving process.
### Interpretation
The diagram illustrates a system designed to automate or assist in the process of formal theorem proving. The COPRAsystem appears to leverage a lemma database to aid in the synthesis of prompts and tactics. The iterative nature of the process, with feedback loops, suggests that the system is capable of handling complex theorems and refining its approach based on the results of tactic execution. The optional use of informal hints indicates a degree of flexibility, allowing users to provide guidance when needed. The overall structure suggests a sophisticated system that combines automated reasoning with human input to achieve formal verification of theorems. The diagram is a high-level overview and does not provide details on the specific algorithms or techniques used within each step.
</details>
Figure 1: Overview of Copra. Copra takes as input a formal theorem statement and performs a stateful proof search through an in-context learning agent. At each step of search, the agent has access to the search history, error feedback from the proof environment, and relevant lemmas retrieved from a database, which are formatted according to a prompt serialization protocol.
Figure 1 gives an overview of our agent, called Copra. Copra is an acronym for “In- co ntext Pr over A gent”. The agent takes as input a formal statement of a theorem and optional natural-language hints about how to prove the theorem. At each time step, it prompts the LLM to select the next tactic to apply. LLM-selected tactics are “executed” in the underlying proof assistant. If the tactic fails, Copra records this information and uses it to avoid future failures.
Additionally, the agent uses lemmas and definitions retrieved from an external database to simplify proofs. Finally, we use a symbolic procedure (Sanchez-Stern et al., 2020) to only apply LLM-selected tactics when they simplify the current proof goals (ruling out, among other things, cyclic tactic sequences).
Our use of a general-purpose LLM makes it easy to integrate our approach with different proof environments. Our current implementation of Copra allows for proof generation in both Lean and Coq. To our knowledge, this implementation is the first proof generation system with this capability.
We evaluate Copra using the $\mathtt{miniF2F}$ (Zheng et al., 2021) benchmark for competition-level mathematical reasoning in Lean, as well as a set of Coq proof tasks (Sanchez-Stern et al., 2020) from the CompCert (Leroy, 2009) project on verified compilation. In both settings, Copra outperforms few-shot calls to GPT-4. On $\mathtt{miniF2F}$ , Copra outperforms ReProver, a state-of-the-art finetuned approach for Lean theorem-proving, in terms of the established pass@1 metric. Using a refinement of the pass@1 metric, we show that Copra can converge to correct proofs with fewer model queries than ReProver as well as Proverbot9001, a state-of-the-art approach for Coq. Finally, we show that a GPT-4-scale model is critical for this setting; an instantiation of Copra with GPT-3.5 is less effective.
To summarize our contributions, we offer: (i) a new approach to formal theorem-proving, called Copra, that combines in-context learning with search, knowledge retrieval, and the use of rich feedback from the underlying proof environment; (ii) a systematic study of the effectiveness of the GPT-4-based instantiation of Copra, compared to few-shot GPT-4 invocations, ablations based on GPT-3.5, and state-of-the-art finetuned baselines; (iii) an implementation of Copra, which is the first open-source theorem-proving system to be integrated with both the Lean and Coq proof environments.
## 2 Problem Formulation
A (tactic-based) theorem-prover starts with a set of unmet proof obligations and applies a sequence of proof tactics to progressively eliminate these obligations. Each obligation $o$ is a pair $(g,h)$ , where $g$ is a goal and $h$ is a hypothesis. The goal $g$ consists of the propositions that need to be proved in order to meet $o$ ; the hypothesis $h$ is the set of assumptions that can be made in the proof of $g$ . The prover seeks to reduce the obligation set to the empty set.
[roundcorner=10pt]
(a) ⬇ theorem mod_arith_2 (x : ℕ) : x % 2 = 0 → (x * x) % 2 = 0 := begin intro h, rw mul_mod, rw h, rw zero_mul, refl, end (b) ⬇ x: ℕ h: x % 2 = 0 ⊢ x * x % 2 = 0
(c) ⬇ begin intro h, have h1 : x = 2 * (x / 2) := (nat. mul_div_cancel ’ h). symm, rw h1, rw nat. mul_div_assoc _ (show 2 ∣ 2, from dvd_refl _), rw [mul_assoc, nat. mul_mod_right], end
Figure 2: (a) A Lean theorem and a correct proof found by Copra. (b) Proof state after the first tactic. (c) An incorrect proof generated by GPT-4.
Figure 2 -(a) shows a formal proof, in the Lean language (de Moura et al., 2015), of a basic theorem about modular arithmetic (the proof is automatically generated using Copra). The first step of the proof is the $\mathtt{intro}$ tactic, which changes a goal $P\rightarrow Q$ to a hypothesis $P$ and a goal $Q$ . The next few steps use the $\mathtt{rw}$ (rewrite) tactic, which applies substitutions to goals and hypotheses. The last step is an application of the $\mathtt{refl}$ (reflexivity) tactic, which asserts definitional equivalences.
In code generation settings, LLMs like GPT-4 can often generate complex programs from one-shot queries. However, one-shot queries to GPT-4 often fail at even relatively simple formal proof tasks.
Figure 2 -(c) shows a GPT-4-generated incorrect Lean proof of our example theorem.
By contrast, we follow a classic view of theorem-proving as a discrete search problem (Bundy, 1988). Specifically, we offer an agent that searches the state space of the underlying proof environment and discovers sequences of tactic applications that constitute proofs. The main differences from classical symbolic approaches are that the search is LLM-guided, history-dependent, and can use natural-language insights provided by the user or the environment.
Now we formalize our problem statement. Abstractly, let a proof environment consist of:
- A set of states $\mathcal{O}$ . Here, a state is either a set $O=\{o_{1},\dots,o_{k}\}$ of obligations $o_{i}$ , or of the form $(O,w)$ , where $O$ is a set of obligations and $w$ is a textual error message. States of the latter form are error states, and the set of such states is denoted by $\mathit{Err}$ .
- A set of initial states, each consisting of a single obligation $(g_{\mathit{in}},h_{\mathit{in}})$ extracted from a user-provided theorem.
- A unique goal state $\mathtt{QED}$ is the empty obligation set.
- A finite set of proof tactics.
- A transition function $T(O,a)$ , which determines the result of applying a tactic $a$ to a state $O$ . If $a$ can be successfully applied at state $O$ , then $T(O,a)$ is the new set of obligations resulting from the application. If $a$ is a “bad” tactic, then $T(O,a)$ is an error state $(O,w_{e})$ , where $w_{e}$ is some feedback that explains the error. Error states $(O,w)$ satisfy the property that $T((O,w),a)=(O,w)$ for all $a$ .
- A set of global contexts, each of which is a natural-language string that describes the theorem to be proven and insights (Jiang et al., 2022b) about how to prove it. The theorem-proving agent can take a global context as an optional input and may use it to accelerate search.
Let us lift the transition function $T$ to tactic sequences by defining:
$$
T(O,\alpha)=\left\{\begin{array}[]{l}T(O,a_{1})\qquad\textrm{if $n=1$}\\
T(T(O,\langle a_{1},\dots,a_{n-1}\rangle),a_{n})~{}~{}~{}\textrm{otherwise.}
\end{array}\right.
$$
The theorem-proving problem is now defined as follows:
**Problem 1 (Theorem-proving)**
*Given an initial state $O_{\mathit{in}}$ and an optional global context $\chi$ , find a tactic sequence $\alpha$ (a proof) such that $T(O_{\mathit{in}},\alpha)=\mathtt{QED}$ .*
In practice, we aim for proofs to be as short as possible.
## 3 The Copra Agent
The Copra agent solves the theorem-proving problem via a GPT-4-directed depth-first search over tactic sequences. Figure 3 shows pseudocode for the agent. At any given point, the algorithm maintains a stack of environment states and a “failure dictionary” $\mathit{Bad}$ that maps states to sets of tactics that are known to be “unproductive” at the state.
At each search step, it pushes the current state on the stack and retrieves external lemmas and definitions relevant to the state. After this, it serializes the stack, $\mathit{Bad}(O)$ , the retrieved information, and the optional global context into a prompt and feeds it to the LLM. The LLM’s output is parsed into a tactic and executed in the environment.
One outcome of the tactic could be that the agent arrives at $\mathtt{QED}$ or makes some progress in the proof. A second possibility is that the new state is an error. A third possibility is that the tactic is not incorrect but does not represent progress in the proof. We detect this scenario using a symbolic progress check as described below. The agent terminates successfully after reaching $\mathtt{QED}$ and rejects the new state in the second and third cases. Otherwise, it recursively continues the proof from the new state. After issuing a few queries to the LLM, the agent backtracks.
Progress Checks.
Following Sanchez-Stern et al. (2020), we define a partial order $\sqsupseteq$ over states that indicates when a state is “at least as hard” as another. Formally, for states $O_{1}$ and $O_{2}$ with $O_{1}\notin\mathit{Err}$ and $O_{2}\notin\mathit{Err}$ , $O_{1}\sqsupseteq O_{2}$ iff
$$
\forall o_{i}=(g_{i},h_{i})\in O_{2}.\qquad\exists o_{k}=(g_{k},h_{k})\in O_{1
}.\;g_{k}=g_{i}\land(h_{k}\subseteq h_{i}).
$$
Intuitively, $O_{1}\sqsupseteq O_{2}$ if for every obligation in $O_{2}$ , there is a stronger obligation in $O_{1}$ . While comparing the difficulty of arbitrary proof states is not well-defined, our definition helps us eliminate some straightforward cases. Particularly, these cases include those proof states whose goals match exactly and one set of hypotheses contains the other.
$\proc{Copra}(O,\chi)$ $\proc{Push}(\mathit{st},O)$ $\rho\leftarrow\proc{Retrieve}(O)$ $j\leftarrow 1$ $t$ $p\leftarrow\proc{Promptify}(\mathit{st},\mathit{Bad}(O),\rho,\chi)$ $a\sim\proc{ParseTactic}(\proc{Llm}(p))$ $O^{\prime}\leftarrow T(O,a)$ $O^{\prime}=\mathtt{QED}$ terminate successfully $O^{\prime}\in\mathit{Err}$ or $\exists O^{\prime\prime}\in\mathit{st}.\ O^{\prime}\sqsupseteq O^{\prime\prime}$ add $a$ to $\mathit{Bad}(O)$ Copra( $O^{\prime}$ , $\chi$ ) $\proc{Pop}(\mathit{st})$
Figure 3: The search procedure in Copra. $T$ is the environment’s transition function; $\mathit{st}$ is a stack, initialized to be empty. $\mathit{Bad}(O)$ is a set of tactics, initialized to $\emptyset$ , that are known to be bad at $O$ . $\proc{Llm}$ is an LLM, $\proc{Promptify}$ generates a prompt, $\proc{ParseTactic}$ parses the output of the LLM into a tactic (repeatedly querying the LLM in case there are formatting errors in its output), and $\proc{Retrieve}$ gathers relevant lemmas and definitions from an external source. The overall procedure is called with a state $O_{\mathit{in}}$ and an optional global context $\chi$ .
As shown in Figure 3, the Copra procedure uses the $\sqsupseteq$ relation to only take actions that lead to progress in the proof. Using our progress check, Copra avoids cyclic tactic sequences that would cause nontermination.
Prompt Serialization.
The routines $\proc{Promptify}$ and $\proc{ParseTactic}$ together constitute the prompt serialization protocol and are critical to the success of the policy. Now we elaborate on these procedures.
$\proc{Promptify}$ carefully places the different pieces of information relevant to the proof in the prompt. It also includes logic for trimming this information to fit the most relevant parts in the LLM’s context window. Every prompt has two parts: the “system prompt” (see more details in Section A.3) and the “agent prompt.”
The agent prompts are synthetically generated using a context-free grammar and contain information about the state stack (including the current proof state), the execution feedback for the previous action, and the set of actions we know to avoid at the current proof state.
The system prompt describes the rules of engagement for the LLM. It contains a grammar (distinct from the one for agent prompts) that we expect the LLMs to follow when it proposes a course of action. The grammar carefully incorporates cases when the response is incomplete because of the LLM’s token limits. We parse partial responses to extract the next tactic using the $\proc{ParseTactic}$ routine. $\proc{ParseTactic}$ also identifies any formatting errors in the LLM’s responses, possibly communicating with the LLM multiple times until these errors are resolved.
<details>
<summary>extracted/5780501/Figure4.png Details</summary>

### Visual Description
## Data Table: Interactive Theorem Proving Results
### Overview
This image presents a data table documenting the results of an interactive theorem proving process. The table tracks the state of a proof, the steps taken, the interaction result (success or failure), and the Large Language Model (LLM) response to each query. The table is divided into four columns, each representing a separate query/attempt to prove a mathematical statement.
### Components/Axes
The table has the following row labels:
* **Serialized Proof State:** Represents the current state of the proof, including goals and hypotheses.
* **Stack & Failure Dictionary:** Details the steps attempted and any failures encountered during the proof process.
* **Interaction Result:** Indicates whether the interaction with the theorem prover was successful or resulted in an error.
* **LLM Response:** Shows the query submitted to the LLM and the LLM's response.
The table has four columns, labeled:
* **Query #1**
* **Query #2**
* **Query #3**
* **Query #4**
### Detailed Analysis or Content Details
**Query #1:**
* **Serialized Proof State:** `x * x % 2 = 0 [GOAL] 1 [HYPOTHESES] 1 [HYPOTHESIS] x : ℕ [HYPOTHESIS] h : x % 2 = 0`
* **Stack & Failure Dictionary:** `[LAST STEP] intro h, [STEPS] [STEP] intro h, [LAST STEP] rw h`
* **Interaction Result:** `[SUCCESS] [END]`
* **LLM Response:** `[RUN TACTIC] rw h, [END]`
**Query #2:**
* **Serialized Proof State:** `x % 2 * (x % 2) % 2 = 0 [GOAL] 1 [HYPOTHESES] 1 [HYPOTHESIS] x : ℕ [HYPOTHESIS] h : x % 2 = 0`
* **Stack & Failure Dictionary:** `[STEPS] [STEP] intro h, [LAST STEP] rw h, [INCORRECT STEPS] [STEP] rw h, [LAST STEP] apply nat.mul_mod_right`
* **Interaction Result:** `[ERROR MESSAGE] Got error in 'rw h': error: rewrite tactic failed, did not find instance of the pattern in the target expression x * x % 2` [END]
* **LLM Response:** `[RUN TACTIC] apply nat.mul_mod_right, [END]`
**Query #3:**
* **Serialized Proof State:** `x % 2 * (x % 2) % 2 = 0 [GOAL] 1 [HYPOTHESES] 1 [HYPOTHESIS] x : ℕ [HYPOTHESIS] h : x % 2 = 0`
* **Stack & Failure Dictionary:** `[STEPS] [STEP] intro h, [LAST STEP] rw h, [LAST STEP] apply nat.mul_mod_mod,`
* **Interaction Result:** `[ERROR MESSAGE] Got error in 'apply nat.mul_mod_mod_right': error: invalid apply tactic, failed to unify x * x % 2 = 0 with ?m_1 * ?m_2 % ?m_1 = 0` [END]
* **LLM Response:** `[RUN TACTIC] rw nat.mul_mod, [END]`
**Query #4:**
* **Serialized Proof State:** `x % 2 * (x % 2) % 2 = 0 [GOAL] 1 [HYPOTHESES] 1 [HYPOTHESIS] x : ℕ [HYPOTHESIS] h : x % 2 = 0`
* **Stack & Failure Dictionary:** `[STEPS] [STEP] intro h, [LAST STEP] rw h`
* **Interaction Result:** `[SUCCESS] [END]`
* **LLM Response:** `[RUN TACTIC] rw h, [END]`
### Key Observations
* Query #1 and #4 were successful, while Queries #2 and #3 resulted in errors.
* The errors in Queries #2 and #3 relate to the `rw` (rewrite) and `apply` tactics failing to find matching patterns or unify expressions.
* The LLM consistently suggests `rw h` as a tactic, and it succeeds in Queries #1 and #4.
* The initial goal is `x * x % 2 = 0`, and the hypothesis is `x : ℕ` and `h : x % 2 = 0`.
* The serialized proof state remains consistent across all queries.
### Interpretation
This data table demonstrates the iterative process of interactive theorem proving, where an LLM is used to suggest tactics to advance the proof. The table highlights the challenges of automated reasoning, as the LLM's suggestions are not always successful and can lead to errors. The success of `rw h` in Queries #1 and #4 suggests that this tactic is appropriate for the given goal and hypothesis. The failures in Queries #2 and #3 indicate that the LLM sometimes struggles to apply tactics correctly or to find the right patterns for rewriting expressions. The error messages provide valuable information for debugging the proof process and understanding the limitations of the LLM's reasoning capabilities. The table provides a record of the interaction between a human prover (or automated system) and an LLM, showcasing the potential and challenges of using LLMs to assist in formal verification.
</details>
Figure 4: A “conversation” between the LLM and the search algorithm in Copra. Query #1, Query #2, …represent queries made as the proof search progresses. The column labeled Query $\#i$ depicts the prompt at time step $i$ and the corresponding LLM response. The LLM response is parsed into a tactic and executed, with the error feedback incorporated into the next prompt.
Figure 4 illustrates a typical “conversation” between the LLM and Copra ’s search algorithm via the prompt serialization protocol. Note, in particular, the prompt’s use of the current goals, the stack, and error feedback for the last executed tactic.
## 4 Evaluation
Benchmarks.
We primarily evaluate Copra on the Lean proof environment using $\mathtt{miniF2F}$ - $\mathtt{test}$ (Zheng et al., 2021) benchmark. This benchmark comprises 244 formalizations of mathematics problems from (i) the MATH dataset (Hendrycks et al., 2021), (ii) high school mathematics competitions, (iii) hand-designed problems mirroring the difficulty of (ii). Broadly, these problems fall under three categories: number theory, algebra, and induction.
To evaluate Copra ’s ability to work with multiple proof frameworks, we perform a secondary set of experiments on the Coq platform. Our benchmark here consists of a set of a theorems drawn from the CompCert compiler verification project (Leroy, 2009). The dataset was originally evaluated on the Proverbot9001 system (Sanchez-Stern et al., 2020). Due to budgetary constraints, our evaluation for CompCert uses 118 of the 501 theorems used in the original evaluation of Proverbot9001. For fairness, we include all 98 theorems proved by Proverbot9001 in our system. The remaining theorems are randomly sampled.
Implementing COPRA.
Our implementation of Copra is LLM-agnostic, and functions as long as the underlying language model is capable of producing responses which parse according to the output grammar. We instantiate the LLM to be GPT-4-turbo as a middle-ground between quality and affordability. Because of the substantial cost of GPT-4 queries, we cap the number of queries that Copra can make at 60 for the majority of our experiments.
In both Lean and Coq, we instantiate the retrieval mechanism to be a BM25 search. For our Lean experiments, our retrieval database is mathlib (mathlib Community, 2020), a mathematics library developed by the Lean community. Unlike $\mathtt{miniF2F}$ , our CompCert-based evaluation set is accompanied by a training set. Since Copra relies on a black-box LLM, we do not perform any training with the train set, but do use it as the retrieval database for our Coq experiments.
Inspired by DSP (Jiang et al., 2022b), we measure the efficacy of including the natural-language (informal) proof of the theorem at each search step in our Learn experiments. We use the informal theorem statements in $\mathtt{miniF2F}$ - $\mathtt{test}$ to generate informal proofs of theorems using few-shot prompting of the underlying LLM. The prompt has several examples, fixed over all invocations, of informal proof generation given the informal theorem statement. These informal proofs are included as part of the global context. Unlike $\mathtt{miniF2F}$ , which consists of competition problems originally specified in natural language, our CompCert evaluation set comes from software verification tasks that lack informal specifications. Hence, we do not run experiments with informal statements and proofs for Coq.
Initially, we run Copra without access to the retrieval database. By design, Copra may backtrack out of its proof search before the cap of 60 queries has been met. With the remaining attempts on problems yet to be solved, Copra is restarted to operate with retrieved information, and then additionally with informal proofs. This implementation affords lesser expenses, as appending retrieved lemmas and informal proofs yields longer prompts, which incur higher costs. More details are discussed in Section A.1.1.
Baselines.
We perform a comparison with the few-shot invocation of GPT-4 in both the $\mathtt{miniF2F}$ - $\mathtt{test}$ and CompCert domains. The “system prompt” in the few-shot approach differs from that of Copra, containing instructions to generate the entire proof in one response, rather than going step-by-step. For both Copra and the few-shot baseline, the prompt contains several proof examples which clarify the required format of the LLM response. The proof examples, fixed across all test cases, contain samples broadly in the same categories as, but distinct from, those problems in the evaluation set.
For our fine-tuned baselines, a challenge is that, to our knowledge, no existing theorem-proving system targets both Lean and Coq. Furthermore, all open-source theorem proving systems have targeted a single proof environment. As a result, we had to choose different baselines for the Lean ( $\mathtt{miniF2F}$ ) and Coq (CompCert) domains.
Our fine-tuned baseline for $\mathtt{miniF2F}$ - $\mathtt{test}$ is ReProver, a state-of-the-art retrieval-augmented open-source system that is part of the LeanDojo project (Yang et al., 2023). We also compare against Llemma -7b and Llemma -34b (Azerbayev et al., 2023). In the CompCert task, we compare with Proverbot9001 (Sanchez-Stern et al., 2020), which, while not LLM-based, is the best public model for Coq.
Ablations.
We consider ablations of Copra which use GPT-3.5 and CodeLlama (Roziere et al., 2023) as the underlying model. We also measure the effectiveness of backtracking in Copra ’s proof search. Additionally, we ablate for retrieved information and informal proofs as part of the global context during search.
Metric.
The standard metric for evaluating theorem-provers is pass@k (Lample et al., 2022). In this metric, a prover is given a budget of $k$ proof attempts; the method is considered successful if one of these attempts leads to success. We consider a refinement of this metric called pass@k-with-n-queries. Here, we measure the number of correct proofs that a prover can generate given k attempts, each with a budget of $n$ queries from the LLM or neural model. We enforce that a single query includes exactly one action (a sequence of tactics) to be used in the search. We want this metric to be correlated with the number of correct proofs that the prover produces within a specified wall-clock time budget; however, the cost of an inference query on an LLM is proportional to the number of responses generated per query. To maintain the correlation between the number of inference queries and wall-clock time, we restrict each inference on the LLM to a single response (see Section A.1.2 for more details).
$$
\mathtt{miniF2F} \mathtt{test} \times \times \times \times \times 60^{\ddagger} \times \times \times \times \times \times \times \times \times \times \tag{600}
$$
Table 1: Aggregate statistics for Copra and the baselines on $\mathtt{miniF2F}$ dataset. Note the various ablations of Copra with and without retrieval or informal proofs. The timeout is a wall-clock time budget in seconds allocated across all attempts. Unless otherwise specified, (i) Copra uses GPT-4 as the LLM (ii) the temperature ( $T$ ) is set to 0.
Copra vs. Few-Shot LLM Queries.
Statistics for the two approaches, as well as a comparison with the few-shot GPT-3.5 and GPT-4 baselines, appear in Table 1. As we see, Copra offers a significant advantage over the few-shot approaches. For example, Copra solves more than twice as many problems as the few-shot GPT-4 baseline, which indicates that the state information positively assists the search. Furthermore, we find that running the few-shot baseline with 60 To ensure fairness we matched the number of attempts for Few-Shot (GPT-4) baseline with the number of queries Copra took to pass or fail for each theorem. attempts (provided a nonzero temperature), exhibits considerably worse performance compared to Copra. This indicates that Copra makes more efficient use of queries to GPT-4 than the baseline, even when queries return a single tactic (as part of a stateful search), as opposed to a full proof. We include further details in Section A.1.1.
Copra is capable of improving the performance of GPT-3.5, CodeLlama over its few-shot counterpart. We note that the use of GPT-4 seems essential, as weaker models like GPT-3.5 or CodeLlama have a reduced ability to generate responses following the specified output format.
Comparison with Finetuned Approaches on $\mathtt{miniF2F}$ .
Figure 5 shows our comparison results for the $\mathtt{miniF2F}$ domain. With a purely in-context learning approach, Copra outperforms ReProver, proving within just 60 queries theorems that ReProver could not solve after a thousand queries. This is remarkable given that ReProver was finetuned on a curated proof-step dataset derived from Lean’s mathlib library. For fairness, we ran ReProver multiple times with 16, 32, and 64 (default) as the maximum number of queries per proof-step. We obtained the highest success rates for ReProver with 64 queries per proof-step. We find that Copra without retrieval outperforms ReProver.
Our pass@1 performance surpasses those of the best open-source approaches for $\mathtt{miniF2F}$ - $\mathtt{test}$ on Lean. Copra proves 29.09% of theorems in $\mathtt{miniF2F}$ - $\mathtt{test}$ theorems, which exceeds that of Llemma -7b & Llemma -34b, (Azerbayev et al., 2023) and ReProver (Yang et al., 2023). It is important to note that these methods involve training on curated proof data, while Copra relies only on in-context learning.
We also establish a correlation between the number of queries needed for a proof and wall-clock time in Figure 6 and Table 3 (more details are discussed in Section A.1.3). Although the average time per query is higher for Copra, Copra still finds proofs almost 3x faster than ReProver. This can explained by the fact that our search is more effective as it uses 16x fewer queries than ReProver (see Table 2). The average time per query includes the time taken to execute the generated proof step by the interactive theorem prover. Hence, a more effective model can be slow in generating responses while still proving faster compared to a model with quicker inference, by avoiding the waste of execution time on bad proof-steps.
<details>
<summary>extracted/5780501/img-miniF2F-pass-k-guidance-steps.png Details</summary>

### Visual Description
\n
## Line Chart: pass@1-with-n-queries
### Overview
This line chart visualizes the number of theorems proved as a function of the number of queries (n) for four different models: COPRAGPT-4-turbo with and without retrieval, and ReProver with and without retrieval. The chart aims to compare the performance of these models in theorem proving based on the number of queries made.
### Components/Axes
* **Title:** pass@1-with-n-queries (positioned at the top-center)
* **X-axis:** Number of Queries (n) - Scale ranges from 0 to 3500, with tick marks at intervals of 500.
* **Y-axis:** Number of Theorems Proved - Scale ranges from 0 to 70, with tick marks at intervals of 10.
* **Legend:** Located in the top-right corner, containing the labels and corresponding colors for each data series.
* COPRA (GPT-4-turbo) (with Retrieval) - Orange
* ReProver (with Retrieval) - Blue
* COPRA (GPT-4-turbo) (without Retrieval) - Green
* ReProver (without Retrieval) - Red
### Detailed Analysis
* **COPRA (GPT-4-turbo) (with Retrieval) - Orange:** The line starts at approximately 0 theorems proved at 0 queries. It rapidly increases to around 65 theorems proved by approximately 400 queries, then plateaus around 68 theorems proved for the remainder of the query range.
* **ReProver (with Retrieval) - Blue:** The line begins at 0 theorems proved at 0 queries. It increases steadily, reaching approximately 58 theorems proved by 800 queries, and continues to increase, reaching around 62 theorems proved at 3500 queries.
* **COPRA (GPT-4-turbo) (without Retrieval) - Green:** The line starts at 0 theorems proved at 0 queries. It rises quickly to approximately 68 theorems proved by 200 queries, and then remains relatively flat, hovering around 68 theorems proved for the rest of the query range.
* **ReProver (without Retrieval) - Red:** The line starts at 0 theorems proved at 0 queries. It increases gradually, reaching approximately 52 theorems proved by 1000 queries, and then plateaus around 55 theorems proved for the remainder of the query range.
### Key Observations
* COPRA (GPT-4-turbo) with and without retrieval demonstrates the fastest initial performance, reaching a high number of theorems proved with a relatively small number of queries.
* ReProver with retrieval shows a slower but more consistent increase in theorems proved as the number of queries increases.
* ReProver without retrieval exhibits the slowest performance, with the lowest number of theorems proved across all query ranges.
* The performance of COPRA (GPT-4-turbo) plateaus quickly, suggesting diminishing returns with increased queries.
### Interpretation
The data suggests that COPRA (GPT-4-turbo) is more efficient at proving theorems initially, regardless of retrieval, compared to ReProver. The inclusion of retrieval appears to have a more significant positive impact on COPRA's performance, as it reaches a higher plateau. ReProver benefits from retrieval, but its overall performance remains lower than COPRA. The plateauing of COPRA's performance indicates that after a certain point, additional queries do not significantly contribute to proving more theorems, potentially due to the model encountering more complex or unsolvable theorems. The consistent, albeit slower, increase in ReProver's performance with retrieval suggests that it may be more capable of tackling a wider range of theorems, even if it requires more queries. The chart highlights the trade-off between initial efficiency (COPRA) and sustained performance (ReProver). The "pass@1" metric suggests that the models are evaluated on whether they can prove a theorem with a single attempt, and the "with-n-queries" aspect explores how performance scales with the number of attempts.
</details>
Figure 5: Copra vs. ReProver on the $\mathtt{miniF2F}$ benchmark.
Qualitative Analysis of Proofs.
We performed an analysis of the different categories of $\mathtt{miniF2F}$ problems solved by Copra and ReProver. Figure 8 and Figure 9 (in Section A.2) show that Copra proves more theorems in most of the categories and takes fewer steps consistently across all categories of problems in $\mathtt{miniF2F}$ as compared to ReProver. Additionally, we find that certain kinds of problems, for example, International Mathematics Olympiad (IMO) problems and theorems that require induction, are difficult for both approaches.
From our qualitative analysis, there are certain kinds of problems where our language-agent approach seems especially helpful.
<details>
<summary>extracted/5780501/img-miniF2F-pass-k-seconds.png Details</summary>

### Visual Description
## Line Chart: pass@n-seconds
### Overview
The image presents a line chart comparing the performance of two theorem proving systems, COPRAGPT-4-turbo and ReProver, with and without retrieval augmentation, measured by the number of theorems proved over time. The chart displays the cumulative number of theorems proven as a function of wall-clock time in seconds.
### Components/Axes
* **Title:** pass@n-seconds (positioned at the top-center)
* **X-axis:** Wall-Clock Time in Seconds (n) - ranging from approximately 0 to 600 seconds.
* **Y-axis:** Number of Theorems Proved - ranging from 0 to 70.
* **Legend:** Located in the top-right corner, listing the following data series:
* COPRA (GPT-4-turbo) (with Retrieval) - Yellow
* ReProver (with Retrieval) - Blue
* COPRA (GPT-4) (without Retrieval) - Green
* ReProver (without Retrieval) - Red
### Detailed Analysis
The chart shows the cumulative number of theorems proven by each system over time.
* **COPRA (GPT-4-turbo) (with Retrieval) - Yellow:** This line starts at approximately 0 theorems at 0 seconds and steadily increases, reaching approximately 68 theorems at around 400 seconds, and plateaus.
* **ReProver (with Retrieval) - Blue:** This line also starts at 0 theorems at 0 seconds. It increases more slowly than the yellow line initially, but then accelerates, reaching approximately 55 theorems at around 400 seconds, and plateaus.
* **COPRA (GPT-4) (without Retrieval) - Green:** This line starts at 0 theorems at 0 seconds and increases at a moderate pace, reaching approximately 60 theorems at around 350 seconds, and plateaus.
* **ReProver (without Retrieval) - Red:** This line starts at 0 theorems at 0 seconds and increases very slowly initially. It begins to accelerate around 200 seconds, reaching approximately 45 theorems at 500 seconds.
Approximate Data Points (extracted by visually estimating from the chart):
| Time (seconds) | COPRA (GPT-4-turbo) (with Retrieval) | ReProver (with Retrieval) | COPRA (GPT-4) (without Retrieval) | ReProver (without Retrieval) |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 50 | 10 | 5 | 8 | 2 |
| 100 | 20 | 10 | 15 | 5 |
| 200 | 40 | 20 | 30 | 10 |
| 300 | 55 | 35 | 45 | 20 |
| 400 | 68 | 55 | 60 | 35 |
| 500 | 68 | 55 | 60 | 45 |
| 600 | 68 | 55 | 60 | 45 |
### Key Observations
* COPRA (GPT-4-turbo) with retrieval consistently outperforms all other methods in terms of the number of theorems proven within the first 400 seconds.
* ReProver with retrieval performs better than ReProver without retrieval, indicating the benefit of retrieval augmentation.
* COPRA (GPT-4) without retrieval performs better than ReProver with and without retrieval.
* ReProver without retrieval is the slowest to prove theorems.
* All lines appear to plateau after approximately 400 seconds, suggesting diminishing returns in theorem proving performance with increased time.
### Interpretation
The data suggests that COPRA (GPT-4-turbo) with retrieval is the most effective theorem proving system among those tested, demonstrating a significantly higher rate of theorem proving compared to the other methods. The consistent improvement observed when using retrieval augmentation across both COPRA and ReProver indicates that providing relevant information to the theorem provers enhances their performance. The plateauing of all lines suggests that there is a limit to the number of theorems that can be proven within the given timeframe, potentially due to the complexity of the remaining theorems or the inherent limitations of the systems. The differences in performance between the systems could be attributed to variations in their underlying algorithms, training data, or computational resources. The chart provides valuable insights into the effectiveness of different theorem proving approaches and the benefits of incorporating retrieval augmentation.
</details>
Figure 6: Copra vs. ReProver on the $\mathtt{miniF2F}$ benchmark on the pass@n-seconds metric.
For instance, Figure 7 shows a problem in the ‘numbertheory’ category that ReProver could not solve. More examples of interesting proofs that Copra found appear in the Section A.2 in Figure 10.
Effectiveness of Backtracking, Informal Proofs, and Retrieval.
We show the ablation for the backtracking feature of Copra in Table 1. We find that backtracking is useful when proofs are longer or more complex, as Copra is more likely to make mistakes which require amending. We include additional examples in Figure 12 (in Section A.2).
From Table 1, we see that retrieval helps in proving more problems. Retrieval reduces the tendency of the model to hallucinate the lemma names and provides the model with existing lemma definitions it otherwise may not have known. We include a proof generated by Copra that uses a retrieved lemma in Figure 7.
We experiment with adding model-generated informal proofs as global context in Copra ’s proof search. As evidenced in Table 1, Copra is able to outperform PACT (Han et al., 2021) and the state-of-the-art (29.6%) expert iteration method (Polu et al., 2022) in a $pass$ @ $1$ search through the incorporation of informal proofs. Furthermore, increasing the maximum query count to 100 enables a further increase in Copra ’s performance to 30.74%. An example of a proof found when incorporating informal proofs is shown in Figure 15 (see Section A.5).
Test-Set Memorization Concerns.
Given that the pretraining corpus of GPT-4 is not publicly available, it is imperative to assess for the possibility of test-set leakage in our experiments. We perform an analysis of those proofs checked into the open-source $\mathtt{miniF2F}$ repository compared to those proofs Copra discovers.
We categorize those proofs available in $\mathtt{miniF2F}$ according to the number of tactics required and the complexity of the tactic arguments. These results can be found in detail in Section A.1.4. As seen in Table 4, we find that Copra reproduces none of the “long” proofs in $\mathtt{miniF2F}$ - $\mathtt{test}$ repository. Setting aside those cases where the proof consists of a single tactic, approximately 91.93% of the proofs generated by Copra either bear no overlap to proofs in $\mathtt{miniF2F}$ , or no proof has been checked in. We provide examples of proofs generated by our approach compared to those included in $\mathtt{miniF2F}$ in Figure 11.
[roundcorner=10pt]
⬇
theorem mathd_numbertheory_100
(n : ℕ)
(h ₀ : 0 < n)
(h ₁ : nat. gcd n 40 = 10)
(h ₂ : nat. lcm n 40 = 280) :
n = 70 :=
begin
have h ₃ : n * 40 = 10 * 280 := by rw [← nat. gcd_mul_lcm n 40, h ₁, h ₂],
exact (nat. eq_of_mul_eq_mul_right (by norm_num : 0 < 40) h ₃),
end
Figure 7: A theorem that requires retrieval. Copra used BM25 to find the lemma “ $\texttt{gcd\_mul\_lcm}:(\texttt{n}\;\texttt{m}:\mathbb{N}+):(\texttt{gcd}\; \texttt{n}\;\texttt{m})*(\texttt{lcm}\;\texttt{n}\;\texttt{m})=\texttt{n}* \texttt{m}$ ” which led to the proof while ReProver failed to prove this theorem. It is important to note that just retrieving the correct lemma is not sufficient, but knowing how to use it correctly is equally important. The Figure 18 shows how Copra utilizes the capabilities of LLM to correctly use the retrieved lemma in the proof.
Coq Experiments Copra can prove a significant portion of the theorems in our Coq evaluation set. As shown in Figure 16 (in Section A.5), Copra slightly outperforms Proverbot9001 when both methods are afforded the same number of queries. Copra with retrieval is capable of proving 57 of 118 theorems within our cap of 60 queries. Furthermore, Copra exceeds the few-shot baselines utilizing GPT-4 and GPT-3.5, which could prove 36 and 10 theorems, respectively, on our CompCert-based evaluation set. Some example Coq proofs generated by Copra are shown in Section A.5 in Figure 17.
## 5 Related Work
Neural Theorem-Proving.
There is a sizeable literature on search-based theorem-proving techniques based on supervised learning. Neural models are trained to predict a proof step given a context and the proof state, then employed to guide a search algorithm (e.g. best-first or depth-limited search) to synthesize the complete proof.
Early methods of this sort (Yang & Deng, 2019; Sanchez-Stern et al., 2020; Huang et al., 2019) used small-scale neural networks as proof step predictors. Subsequent methods, starting with GPT- $f$ (Polu & Sutskever, 2020), have used language models trained on proof data.
PACT (Han et al., 2021) enhanced the training of such models with a set of self-supervised auxiliary tasks. Lample et al. (2022) introduced HyperTree Proof Search, which uses a language model trained on proofs to guide an online MCTS-inspired, search algorithm.
Among results from the very recent past, ReProver trains a retrieval-augmented transformer for proof generation in Lean. Llemma performs continued pretraining of the CodeLlama 7B & 34B on a math-focused corpus. Baldur (First et al., 2023) generates the whole proof in one-shot using an LLM and then performs a single repair step by passing error feedback through an LLM finetuned on (incorrect proof, error message, correct proof) tuples. AlphaGeometry (Trinh et al., 2024) integrates a transformer model trained on synthetic geometry data with a symbolic deduction engine to prove olympiad geometry problems. In contrast to these approaches, Copra is entirely based on in-context learning.
We evaluated Copra in the Lean and Coq environments. However, significant attention has been applied to theorem proving with LLMs in the interactive theorem prover Isabelle (Paulson, 1994). Theorem-proving systems for Lean and Isabelle are not directly comparable due to the substantial differences in automation provided by each language. Isabelle is equipped with Sledgehammer (Paulson & Blanchette, 2015), a powerful automated reasoning tool that calls external automated theorem provers such as E (Schulz, 2002) and Z3 (De Moura & Bjørner, 2008) to prove goals. Thor (Jiang et al., 2022a) augmented the PISA dataset (Jiang et al., 2021) to include successful Sledgehammer invocations, and trained a language model to additionally predict hammer applications. Integrating these ideas with the Copra approach is an interesting subject of future work.
The idea of using informal hints to guide proofs was first developed in DSP (Jiang et al., 2022b), which used an LLM to translate informal proofs to formal sketches that were then completed with Isabelle’s automated reasoning tactics. Zhao et al. (2023) improved on DSP by rewriting the informal proofs to exhibit a more formal structure and employs a diffusion model to predict the optimal ordering of the few-shot examples. LEGOProver (Wang et al., 2023b) augmented DSP with a skill library that grows throughout proof search. Lyra (Zheng et al., 2023) iterated on DSP by utilizing error feedback several times to modify the formal sketch, using automated reasoning tools to amend incorrect proofs of intermediate hypotheses. DSP, LEGOProver, and Lyra heavily rely on Isabelle’s hammer capabilities (which are not present in other ITPs like Lean) and informal proofs for formal proof generation (see Section A.1.5 for details). Moreover, the requirement for informal proofs in these approaches prohibits their application in the software verification domain, where the notion of informal proof is not well-defined. However, we propose a domain-agnostic stateful search approach within an in-context learning agent which iteratively uses execution feedback from the ITP with optional use of external signals like retrieval and informal proofs.
In-Context Learning Agents.
Several distinct in-context learning agent architectures have been proposed recently (Significant-Gravitas, 2023; Yao et al., 2022; Shinn et al., 2023; Wang et al., 2023a). These models combine an LLM’s capability to use tools Schick et al. (2023), decompose a task into subtasks (Wei et al., 2022; Yao et al., 2023), and self-reflect (Shinn et al., 2023). However, Copra is the first in-context learning agent for theorem-proving.
## 6 Conclusion
We have presented Copra, the first in-context-learning approach to theorem-proving in frameworks like Lean and Coq. The approach departs from prior LLM-based theorem-proving techniques by performing a history-dependent backtracking search by utilizing in-context learning, the use of execution feedback from the underlying proof environment, and retrieval from an external database. We have empirically demonstrated that Copra significantly outperforms few-shot LLM invocations at proof generation and also compares favorably to finetuned approaches.
As for future work, we gave GPT-4 a budget of at most 60 queries per problem for cost reasons. Whether the learning dynamics would drastically change with a much larger inference budget remains to be seen. Also, it is unclear whether a GPT-4-scale model is truly essential for our task. We have shown that the cheaper GPT-3.5 agent is not competitive against our GPT-4 agent; however, it is possible that a Llama-scale model that is explicitly finetuned on interactions between the model and the environment would have done better. While data on such interactions is not readily available, a fascinating possibility is to generate such data synthetically using the search mechanism of Copra.
## 7 Reproducibility Statement
We are releasing all the code needed to run Copra as supplementary material. The code contains all “system prompts” described in Section Section A.3 and Section A.4, along with any other relevant data needed to run Copra. However, to use our code, one must use their own OpenAI API keys. An issue with reproducibility in our setting is that the specific models served via the GPT-4 and GPT-3.5 APIs may change over time. In our experiments, we set the “temperature” parameter to zero to ensure the LLM outputs are as deterministic as possible.
Funding Acknowledgements. This work was partially supported by NSF awards CCF-1918651, CCF-2403211, and CCF-2212559, and a gift from the Ashar Aziz Foundation.
## References
- Andrews & Brown (2006) Andrews, P. B. and Brown, C. E. Tps: A hybrid automatic-interactive system for developing proofs. Journal of Applied Logic, 4(4):367–395, 2006.
- Azerbayev et al. (2023) Azerbayev, Z., Schoelkopf, H., Paster, K., Santos, M. D., McAleer, S., Jiang, A. Q., Deng, J., Biderman, S., and Welleck, S. Llemma: An open language model for mathematics, 2023.
- Blanchette et al. (2011) Blanchette, J. C., Bulwahn, L., and Nipkow, T. Automatic proof and disproof in isabelle/hol. In Frontiers of Combining Systems: 8th International Symposium, FroCoS 2011, Saarbrücken, Germany, October 5-7, 2011. Proceedings 8, pp. 12–27. Springer, 2011.
- Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Bundy (1988) Bundy, A. The use of explicit plans to guide inductive proofs. In 9th International Conference on Automated Deduction: Argonne, Illinois, USA, May 23–26, 1988 Proceedings 9, pp. 111–120. Springer, 1988.
- De Moura & Bjørner (2008) De Moura, L. and Bjørner, N. Z3: an efficient smt solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS’08/ETAPS’08, pp. 337–340, Berlin, Heidelberg, 2008. Springer-Verlag. ISBN 3540787992.
- de Moura et al. (2015) de Moura, L., Kong, S., Avigad, J., Van Doorn, F., and von Raumer, J. The Lean theorem prover (system description). In Automated Deduction-CADE-25: 25th International Conference on Automated Deduction, Berlin, Germany, August 1-7, 2015, Proceedings 25, pp. 378–388. Springer, 2015.
- First et al. (2023) First, E., Rabe, M. N., Ringer, T., and Brun, Y. Baldur: whole-proof generation and repair with large language models. arXiv preprint arXiv:2303.04910, 2023.
- Han et al. (2021) Han, J. M., Rute, J., Wu, Y., Ayers, E. W., and Polu, S. Proof artifact co-training for theorem proving with language models. arXiv preprint arXiv:2102.06203, 2021.
- Hendrycks et al. (2021) Hendrycks, D., Burns, C., Kadavath, S., Arora, A., Basart, S., Tang, E., Song, D., and Steinhardt, J. Measuring mathematical problem solving with the math dataset. 2021.
- Huang et al. (2019) Huang, D., Dhariwal, P., Song, D., and Sutskever, I. Gamepad: A learning environment for theorem proving. In ICLR, 2019.
- Huet et al. (1997) Huet, G., Kahn, G., and Paulin-Mohring, C. The coq proof assistant a tutorial. Rapport Technique, 178, 1997.
- Jiang et al. (2021) Jiang, A. Q., Li, W., Han, J. M., and Wu, Y. Lisa: Language models of isabelle proofs. In 6th Conference on Artificial Intelligence and Theorem Proving, pp. 378–392, 2021.
- Jiang et al. (2022a) Jiang, A. Q., Li, W., Tworkowski, S., Czechowski, K., Odrzygóźdź, T., Miłoś, P., Wu, Y., and Jamnik, M. Thor: Wielding hammers to integrate language models and automated theorem provers. Advances in Neural Information Processing Systems, 35:8360–8373, 2022a.
- Jiang et al. (2022b) Jiang, A. Q., Welleck, S., Zhou, J. P., Li, W., Liu, J., Jamnik, M., Lacroix, T., Wu, Y., and Lample, G. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. arXiv preprint arXiv:2210.12283, 2022b.
- Lample et al. (2022) Lample, G., Lacroix, T., Lachaux, M.-A., Rodriguez, A., Hayat, A., Lavril, T., Ebner, G., and Martinet, X. Hypertree proof search for neural theorem proving. Advances in Neural Information Processing Systems, 35:26337–26349, 2022.
- Leroy (2009) Leroy, X. Formal verification of a realistic compiler. Communications of the ACM, 52(7):107–115, 2009.
- mathlib Community (2020) mathlib Community, T. The lean mathematical library. In Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs, POPL ’20. ACM, January 2020. doi: 10.1145/3372885.3373824. URL http://dx.doi.org/10.1145/3372885.3373824.
- Newell et al. (1957) Newell, A., Shaw, J. C., and Simon, H. A. Empirical explorations of the logic theory machine: a case study in heuristic. In Papers presented at the February 26-28, 1957, western joint computer conference: Techniques for reliability, pp. 218–230, 1957.
- OpenAI (2023a) OpenAI. GPT-4 technical report, 2023a.
- OpenAI (2023b) OpenAI. GPT-4 and GPT-4 turbo, 2023b. URL https://platform.openai.com/docs/models/gpt-4-and-gpt-4-turbo.
- Paulson & Blanchette (2015) Paulson, L. and Blanchette, J. Three years of experience with sledgehammer, a practical link between automatic and interactive theorem provers. 02 2015. doi: 10.29007/tnfd.
- Paulson (1994) Paulson, L. C. Isabelle: A generic theorem prover. Springer, 1994.
- Polu & Sutskever (2020) Polu, S. and Sutskever, I. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393, 2020.
- Polu et al. (2022) Polu, S., Han, J. M., Zheng, K., Baksys, M., Babuschkin, I., and Sutskever, I. Formal mathematics statement curriculum learning. arXiv preprint arXiv:2202.01344, 2022.
- Ramesh et al. (2021) Ramesh, A., Pavlov, M., Goh, G., Gray, S., Voss, C., Radford, A., Chen, M., and Sutskever, I. Zero-shot text-to-image generation. In International Conference on Machine Learning, pp. 8821–8831. PMLR, 2021.
- Roziere et al. (2023) Roziere, B., Gehring, J., Gloeckle, F., Sootla, S., Gat, I., Tan, X. E., Adi, Y., Liu, J., Remez, T., Rapin, J., et al. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950, 2023.
- Sanchez-Stern et al. (2020) Sanchez-Stern, A., Alhessi, Y., Saul, L., and Lerner, S. Generating correctness proofs with neural networks. In Proceedings of the 4th ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp. 1–10, 2020.
- Schick et al. (2023) Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.
- Schulz (2002) Schulz, S. E - a brainiac theorem prover. AI Commun., 15(2,3):111–126, aug 2002. ISSN 0921-7126.
- Shinn et al. (2023) Shinn, N., Cassano, F., Labash, B., Gopinath, A., Narasimhan, K., and Yao, S. Reflexion: Language agents with verbal reinforcement learning. arXiv preprint arXiv:2303.11366, 2023.
- Significant-Gravitas (2023) Significant-Gravitas. Autogpt. https://github.com/Significant-Gravitas/Auto-GPT, 2023.
- Touvron et al. (2023) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Trinh et al. (2024) Trinh, T. H., Wu, Y., Le, Q. V., He, H., and Luong, T. Solving olympiad geometry without human demonstrations. Nature, 625(7995):476–482, 2024.
- Wang et al. (2023a) Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023a.
- Wang et al. (2023b) Wang, H., Xin, H., Zheng, C., Li, L., Liu, Z., Cao, Q., Huang, Y., Xiong, J., Shi, H., Xie, E., Yin, J., Li, Z., Liao, H., and Liang, X. Lego-prover: Neural theorem proving with growing libraries. 2023b.
- Wei et al. (2022) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- Yang & Deng (2019) Yang, K. and Deng, J. Learning to prove theorems via interacting with proof assistants. In International Conference on Machine Learning, pp. 6984–6994. PMLR, 2019.
- Yang et al. (2023) Yang, K., Swope, A. M., Gu, A., Chalamala, R., Song, P., Yu, S., Godil, S., Prenger, R., and Anandkumar, A. Leandojo: Theorem proving with retrieval-augmented language models. arXiv preprint arXiv:2306.15626, 2023.
- Yao et al. (2022) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
- Yao et al. (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
- Zhao et al. (2023) Zhao, X., Li, W., and Kong, L. Decomposing the enigma: Subgoal-based demonstration learning for formal theorem proving. 2023.
- Zheng et al. (2023) Zheng, C., Wang, H., Xie, E., Liu, Z., Sun, J., Xin, H., Shen, J., Li, Z., and Li, Y. Lyra: Orchestrating dual correction in automated theorem proving. arXiv preprint arXiv:2309.15806, 2023.
- Zheng et al. (2021) Zheng, K., Han, J. M., and Polu, S. Minif2f: a cross-system benchmark for formal olympiad-level mathematics. arXiv preprint arXiv:2109.00110, 2021.
## Appendix A Appendix
### A.1 Evaluation Details
#### A.1.1 Copra Implementation Setup Details
We introduce a common proof environment for Copra, which can also be used by any other approach for theorem-proving. The proof environment has a uniform interface that makes Copra work seamlessly for both Lean and Coq. It also supports the use of retrieval from external lemma repositories and the use of informal proofs when available (as in the case of $\mathtt{miniF2F}$ ). In the future, we plan to extend Copra to support more proof languages.
Copra provides support for various LLMs other than the GPT-series, including open-sourced LLMs like Llama 2 (Touvron et al., 2023) and Code Llama (Roziere et al., 2023). For most of our experiments, all the theorems are searched within a timeout of 600 seconds (10 minutes) and with a maximum of 60 LLM inference calls (whichever exhausts first). To make it comparable across various LLMs, only one response is generated per inference. All these responses are generated with the temperature set to zero, which ensures that the responses generated are more deterministic, focused, and comparable. In one of our ablations with few-shot GPT-4, we use set the temperature to $0.7$ and run more than 1 attempt. The number of attempts for few-shot GPT-4 matches the number of queries Copra takes to either successfully prove or fail to prove a certain theorem. This comparison is not completely fair for Copra because individual queries focus on a single proof step and cannot see the original goal, yet measures a useful experiment which indicates the benefits of Copra ’s stateful search under a fixed budget of API calls to GPT-4.
We use GPT-3.5, GPT-4 (OpenAI, 2023b), and CodeLLama (Roziere et al., 2023) to test the capabilities of Copra. We find that it is best to use Copra ’s different capabilities in an ensemble to enhance its performance while also minimizing cost. Therefore, we first try to find the proof without retrieval, then with retrieval, and finally with informal proofs (if available) only when we fail with retrieval. The informal proofs are used more like informal sketches for the formal proof and are generated in a separate few-shot invocation of the LLM. Thereafter, Copra simply adds the informal proofs to its prompt which guides the search. To ensure fairness in comparison, we make sure that the number of guidance steps is capped at 60 and the 10-minute timeout is spread across all these ensemble executions. The use of a failure dictionary (see Section 3) enables fast failure (see Table 3) which helps in using more strategies along with Copra within the 60 queries cap and 10-minute timeout. From Table 2, it is clear that despite the significant overlap between the different ablations, the ensemble covers more cases than just using one strategy with Copra. While extra lemmas are often useful, the addition of extra information from retrieval may also be misleading because the retriever can find lemmas that are not relevant to proving the goal, so the best performance is acquired by using the different capabilities in the manner we describe. Similarly, the use of informal proofs as a sketch for formal proofs can potentially increase the number of steps needed in a formal proof, decreasing the search efficacy. For example, an informal proof can suggest the use of multiple rewrites for performing some algebraic manipulation which can be easily handled with powerful tactics like linarith in Lean. Table 5 shows the number of tokens used in our prompts for different datasets.
#### A.1.2 Metric: pass@k-with-n-queries
We consider the metric pass@k-with-n-queries to assess the speed of the proposed approach and the effectiveness of the LLM or neural network to guide the proof search. It is a reasonable metric because it does a more even-handed trade-off in accounting for the time taken to complete a proof and at the same time ignores very low-level hardware details.
Different approaches need a different amount of guidance from a neural model to find the right proof. For example, approaches like Baldur (First et al., 2023), DSP (Jiang et al., 2022b), etc., generate the whole proof all at once. On the other hand, GPT- $f$ (Polu & Sutskever, 2020), PACT (Han et al., 2021), ReProver (Yang et al., 2023), Proverbot (Sanchez-Stern et al., 2020), and Copra generate proofs in a step-by-step fashion. We argue that pass@k-with-n-queries is a fairer metric to compare these different types of approaches because it correlates with the effectiveness of the proof-finding algorithm in an implementation-agnostic way. Since the exact time might not always be a good reflection of the effectiveness because of hardware differences, network throttling, etc., it makes sense to not compare directly on metrics like pass@k-minutes or pass@k-seconds. Not only these metrics will be brittle and very sensitive to the size, hardware, and other implementation details of the model, but not every search implementation will be based on a timeout. For example, Proverbot9001 does not use timeout-based search (and hence we don’t compare on the basis of time with Proverbot9001).
#### A.1.3 pass@k-with-n-queries versus wall-clock time
We show that pass@k-with-n-queries, correlates well with wall-clock time for finding proofs by using the metric pass@k-seconds. pass@k-seconds measures the number of proofs that an approach can find in less than $k$ seconds. The plot in Figure 6 shows that pass@k-seconds follows the same trend as pass@k-with-n-queries as shown in Figure 5.
| Evaluation on $\mathtt{miniF2F}$ - $\mathtt{test}$ | | | |
| --- | --- | --- | --- |
| Approach | Avg. Query in Total | Avg. Query on Failure | Avg. Query on Pass |
| ReProver (- Retrieval) | 350.7 | 427.24 | 81.6 |
| ReProver | 1015.32 | 1312.89 | 122.62 |
| Copra | 21.73 | 28.23 | 3.83 |
| Copra ( $+$ Retrieval) | 37.88 | 50.90 | 7.38 |
Table 2: Aggregate effectiveness statistics for Copra and the baselines on $\mathtt{miniF2F}$ dataset. We can see the various ablations of Copra with and without retrieval.
| Approach | Avg. Time In Seconds | | | | | |
| --- | --- | --- | --- | --- | --- | --- |
| Per Proof | Per Query | | | | | |
| On Pass | On Fail | All | On Pass | On Fail | All | |
| ReProver (on CPU - retrieval) | 279.19 | 618.97 | 543.78 | 3.42 | 1.45 | 1.55 |
| ReProver (on GPU - retrieval) | 267.94 | 601.35 | 520.74 | 2.06 | 0.44 | 0.48 |
| ReProver (on GPU + retrieval) | 301.20 | 605.29 | 529.27 | 2.46 | 0.46 | 0.52 |
| Copra (GPT-3.5) | 39.13 | 134.26 | 122.21 | 15.97 | 9.43 | 9.53 |
| Copra (GPT-4) | 67.45 | 370.71 | 289.92 | 17.61 | 13.09 | 13.34 |
| Copra (GPT-4 + retrieval) | 117.85 | 678.51 | 510.78 | 15.97 | 13.33 | 13.48 |
Table 3: Average time taken by our approach (Copra) and ReProver on $\mathtt{miniF2F}$ dataset. We split the values according to the success of the proof search on that problem. We also report values per query.
We can use the comparison of Copra with ReProver (Yang et al., 2023) on the miniF2F dataset to explain the correlation between finding proofs fast and pass@k-with-n-queries. From Table 3, we know that on average the time taken per guidance (which includes time taken to execute the proof steps on ITP as well) is around 0.52 seconds for ReProver and 13.48 seconds for Copra. Given that ReProver ’s guidance LLM is small, we can assume that ReProver does not take any time (zero time) to query its LLM and spends most of the 0.52 seconds running the proof steps on ITP. Now, we can reasonably estimate GPT-4 average response time to be approximately 13 seconds from Table 3. However, we see that the number of guidance used by ReProver (from Table 2) is about 16x higher on success. Interestingly, this also shows up in the wall clock time, which is around 3x higher for ReProver on success, so there is a tradeoff between the two, but the number of queries dominates when the guidance model is of low quality. Hence, given a high-quality guidance model, we can empirically argue that asymptotically the search will converge to a proof faster (assuming a proof is achievable with the guidance model).
#### A.1.4 Data Leakage in GPT-4
A key risk with closed-source pretrained models like GPT-4 is data leakage, i.e., overlaps between the evaluation set and the pretraining set. Naturally, we cannot be certain that Copra does not benefit from such leakage. However, there are several reasons to believe that data leakage is not a significant contributor to our results.
First, we note that Copra significantly outperforms few-shot invocations of GPT-4. If the results on Copra were significantly tainted by data leakage, we would have expected better performance from few-shot GPT-4.
Second, it is highly unlikely that GPT-4 has been trained on proof-state and tactic pair generated by hooking up the Lean ITP. Not all the formal proofs of the miniF2F test dataset are available online (only 80 proofs are available in Lean). Furthermore, if one were to manually annotate (proof state, tactic) pairs, one would need ground truth tactics to annotate with, the majority of which do not appear in $\mathtt{miniF2F}$ - $\mathtt{test}$ . Given that GPT-4 is a general-purpose LLM, it is highly unlikely that while training GPT-4 the $\mathtt{miniF2F}$ dataset was first manually annotated, and then proof-state and tactic pair information was collected by hooking up the Lean ITP.
Also, in our agent interactions, we limit ourselves only to the goal at that point. There is no mention of the original theorem anywhere (except for the very first proof-state), so the chances that GPT-4 can correlate any intermediate state with the original theorem are slim, unless it has learned a model of Lean’s kernel, which is highly unlikely. It is also improbable that GPT-4 has seen the proof-state in the same format that we use, let alone using the execution feedback which has not been used in any known previous works for Lean.
One could hypothesize that some of the few-shot GPT-4 proofs might be influenced by potential training on the $\mathtt{miniF2F}$ dataset. However, this does not seem to be true because we see that most of the proofs we generated were either not mentioned in the $\mathtt{miniF2F}$ test dataset or completely different from the manually written proofs in the $\mathtt{miniF2F}$ test dataset (including the first step mismatch). Table 4 shows the detailed analysis of proofs generated by Copra and the proofs mentioned in $\mathtt{miniF2F}$ test dataset for Lean. From the Table 4, it is clear that most of the proofs generated by Copra are different from the proofs mentioned in the $\mathtt{miniF2F}$ . The ones that are exactly the same are simple single-tactic proofs that just use exactly one of the linarith, nlinarith, or norm_num tactics without any arguments. If we set aside these straightforward simple cases, then about $91.93\$ of the proofs generated by Copra are either different from the proofs mentioned in the $\mathtt{miniF2F}$ or do not have a proof mentioned in the $\mathtt{miniF2F}$ dataset. Out of all proofs generated by Copra about 25.33% proofs are for theorems that have no proofs mentioned in the $\mathtt{miniF2F}$ test dataset as compared to 22.95% for ReProver. Some of the proofs generated by our approach as compared to proofs mentioned in the $\mathtt{miniF2F}$ test dataset are shown in Figure 11.
| | Proofs found in $\mathtt{miniF2F}$ - $\mathtt{test}$ | Proofs NOT in $\mathtt{miniF2F}$ | Total | | | | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Single-Tactic Simple Proofs | Two-Tactic Proofs | Longer OR Complex Proofs | Total | | | | | |
| Tactics Used —— Proof Count | linarith | norm_num | nlinarith | two tactics | $>2$ tactics OR 1 tactic multi-args | | sorry | |
| Proof Count | 11 | 12 | 2 | 16 | 39 | 80 | 164 | 244 |
| Exact Match Copra Count | 7 | 10 | 0 | 5 | 0 | 22 | 0 | 22 |
| $1^{st}$ Tactic Match Copra Count | 7 | 10 | 0 | 8 | 4 | 29 | 0 | 29 |
| Distinct Copra Count | 4 | 2 | 2 | 9 | 17 | 34 | 19 | 53 / 75 70.67% |
| Distinct Copra Count ex Single-Tactic | - | - | - | 9 | 17 | 34 | 19 | 53 / 58 91.37% |
| All Copra Count | 11 | 12 | 2 | 14 | 17 | 56 | 19 | 75 |
Table 4: Analysis of proof generated by Copra (GPT-4 + Retrieval + Informal) on $\mathtt{miniF2F}$ test dataset for Lean. See Table 1 for details about various ablations.
Finally, the ability of agent interactions to enhance the basic LLM approach seems to transcend OpenAI’s LLMs. We ran Copra on the recently released CodeLlama. From Table 2, Copra improved CodeLlama’s capabilities to prove theorems by about $5\$ on $\mathtt{miniF2F}$ dataset. This indicates that the in-context learning capabilities that we build are transferable and LLM-agnostic.
#### A.1.5 Comparison with methods using Isabelle and informal proofs
An interactive theorem prover (ITP) is a software tool to assist with the development of formal proofs by human-machine collaboration. This involves a sort of interactive proof editor, or other interfaces, with which a human can guide the search for proofs. A formal proof written in an interactive theorem prover can be verified automatically by computers, whereas an informal proof is written in natural language can only be verified by a human. Generally, formal proofs are much more rigorous and pedantic than informal proofs. So informal proof can be loosely considered as a proof sketch based on which one can write rigorous machine-checkable formal proofs.
Note that the accuracy numbers of DSP-like approaches (Jiang et al., 2022b; Zhao et al., 2023; Zheng et al., 2021) are not directly comparable to ours because they use a different proof language. These approaches use Isabelle, which, unlike Lean, allows the use of powerful automatic reasoning tools like Sledgehammer. Methods following the DSP pipeline use informal proofs for formal proof synthesis. While this strategy works well on mathematics-competition benchmarks like $\mathtt{miniF2F}$ , it is less applicable to domains such as software verification, where there is often no informal specification, as well as domains that use customized, domain-specific formalizations. Furthermore, having access to informal proofs (human-written or LLM-generated) shifts the problem of synthesizing the formal proof towards an autoformalization problem, as the LLM is likely to have seen correct natural language proofs of $\mathtt{miniF2F}$ problems in its training. Additionally, unlike DSP-like approaches which tend to use pass@100 or pass@200, we only use pass@1.
### A.2 Example Proofs Generated for $\mathtt{miniF2F}$
Figure 10 shows some other interesting proofs generated by our approach on $\mathtt{miniF2F}$ dataset. Figure 8 and Figure 9 shows the breakdown of theorems proved in various categories by Copra versus ReProver. Figure 10 shows some interesting $\mathtt{miniF2F}$ proofs as done by Copra. Figure 11 shows the comparison of proofs mentioned in the $\mathtt{miniF2F}$ repository versus the proofs discovered by Copra. Figure 12 shows the proofs which were only possible because of our backtracking feature.
<details>
<summary>extracted/5780501/img-miniF2F-problems-per-category.png Details</summary>

### Visual Description
\n
## Bar Chart: Theorems Proved by COPRA and ReProver
### Overview
This bar chart compares the number of theorems proved by two automated theorem provers, COPRA and ReProver, across several problem categories. The x-axis represents the problem category, and the y-axis represents the number of theorems proved. Each problem category has two bars, one for COPRA (green) and one for ReProver (orange).
### Components/Axes
* **X-axis Title:** Problem Category
* **Y-axis Title:** Number of Theorems Proved
* **X-axis Categories:** mathd\_algebra, mathd\_numbertheory, amc, aime, algebra, imo, induction, numbertheory
* **Legend:**
* Orange: ReProver
* Green: COPRA
### Detailed Analysis
The chart displays the number of theorems proved for each category by each prover.
* **mathd\_algebra:** COPRA proves approximately 39 theorems, while ReProver proves approximately 32 theorems.
* **mathd\_numbertheory:** COPRA proves approximately 24 theorems, while ReProver proves approximately 26 theorems.
* **amc:** COPRA proves approximately 9 theorems, while ReProver proves approximately 3 theorems.
* **aime:** COPRA proves approximately 2 theorems, while ReProver proves approximately 1 theorem.
* **algebra:** COPRA proves approximately 2 theorems, while ReProver proves approximately 1 theorem.
* **imo:** COPRA proves 0 theorems, while ReProver proves 0 theorems.
* **induction:** COPRA proves 0 theorems, while ReProver proves 0 theorems.
* **numbertheory:** COPRA proves 0 theorems, while ReProver proves 0 theorems.
### Key Observations
* COPRA consistently proves more theorems than ReProver in the categories where both provers have proven theorems.
* COPRA demonstrates a significant advantage in the 'mathd\_algebra' category.
* Both provers fail to prove any theorems in the 'imo', 'induction', and 'numbertheory' categories.
* The 'aime' and 'algebra' categories have very low theorem counts for both provers.
### Interpretation
The data suggests that COPRA is generally more effective than ReProver at proving theorems in the tested categories. The large difference in 'mathd\_algebra' suggests that COPRA may have specific strengths in this area. The failure of both provers to solve problems in 'imo', 'induction', and 'numbertheory' indicates that these categories may be particularly challenging for current automated theorem proving techniques, or that the test set for these categories was particularly difficult. The low counts in 'aime' and 'algebra' might indicate that these categories are less amenable to automated proof, or that the test sets were small. The chart provides a comparative performance assessment of the two theorem provers, highlighting their relative strengths and weaknesses across different mathematical problem domains.
</details>
Figure 8: Breakdown of theorems proved in various categories by Copra (GPT-4 + retrieval + informal proof): Problems solved in different categories in $\mathtt{miniF2F}$ test dataset. Notice that ‘IMO’ and ‘Induction’ problems are hard for both approaches. Copra does more problems than ReProver in most of the categories.
<details>
<summary>extracted/5780501/img-miniF2F-steps-per-category.png Details</summary>

### Visual Description
\n
## Bar Chart: Average Number of Queries by Problem Category
### Overview
This image presents a bar chart comparing the average number of queries for two systems, ReProver and COPRA, across several problem categories. The chart displays the average number of queries (n) on the y-axis and the problem category on the x-axis.
### Components/Axes
* **X-axis:** Problem Category. Categories include: imo, induction, numbertheory, algebra, aime, amc, mathd\_numbertheory, mathd\_algebra.
* **Y-axis:** Avg. Number of Queries (n). Scale ranges from 0 to 1750, with increments of 250.
* **Legend:** Located at the top-right of the chart.
* ReProver (Orange)
* COPRA (Green)
### Detailed Analysis
The chart consists of paired bars for each problem category, representing the average number of queries for ReProver and COPRA.
* **imo:** ReProver: approximately 1250 queries. COPRA: approximately 25 queries.
* **induction:** ReProver: approximately 1500 queries. COPRA: approximately 25 queries.
* **numbertheory:** ReProver: approximately 1275 queries. COPRA: approximately 50 queries.
* **algebra:** ReProver: approximately 1725 queries. COPRA: approximately 25 queries.
* **aime:** ReProver: approximately 1400 queries. COPRA: approximately 25 queries.
* **amc:** ReProver: approximately 1100 queries. COPRA: approximately 25 queries.
* **mathd\_numbertheory:** ReProver: approximately 950 queries. COPRA: approximately 25 queries.
* **mathd\_algebra:** ReProver: approximately 600 queries. COPRA: approximately 25 queries.
Visually, ReProver consistently generates a significantly higher number of queries than COPRA across all problem categories. The bars for COPRA are very short and often appear at the same level, close to the x-axis. ReProver's bars vary in height, indicating different levels of query generation depending on the problem category.
### Key Observations
* ReProver consistently outperforms COPRA in terms of the average number of queries generated.
* The highest number of queries for ReProver is observed in the "algebra" category (approximately 1725).
* COPRA's query count remains consistently low across all categories, hovering around 25 queries.
* The difference in query counts between ReProver and COPRA is substantial for all problem categories.
### Interpretation
The data suggests that ReProver is significantly more active or complex in its problem-solving approach compared to COPRA, as evidenced by the much higher number of queries generated. This could indicate that ReProver explores a wider range of possibilities or employs a more detailed search strategy. The consistent low query count for COPRA might suggest a simpler or more direct approach, potentially sacrificing thoroughness for efficiency. The variation in ReProver's query count across different problem categories indicates that the complexity of the problem influences the extent of its search process. The "algebra" category appears to be the most challenging for ReProver, requiring the highest number of queries to resolve. The consistent low performance of COPRA suggests it may be limited in its ability to handle the complexity of these problem categories.
</details>
Figure 9: Breakdown of theorems proved in various categories by Copra (GPT-4 + retrieval + informal proof): Number of queries needed in different categories in $\mathtt{miniF2F}$ test dataset. Notice that Copra takes very less queries to solve the problems in each category.
[roundcorner=10pt]
(a) ⬇ theorem mathd_numbertheory_100 (n : ℕ) (h ₀ : 0 < n) (h ₁ : nat. gcd n 40 = 10) (h ₂ : nat. lcm n 40 = 280) : n = 70 := begin have h ₃ : n * 40 = 10 * 280 := by rw [← nat. gcd_mul_lcm n 40, h ₁, h ₂], exact (nat. eq_of_mul_eq_mul_right (by norm_num : 0 < 40) h ₃), end (b) ⬇ theorem mathd_numbertheory_447 : $ \ sum$ k in finset. filter (λ x, 3∣ x) (finset. Icc 1 49), (k % 10) = 78 := begin norm_num [finset. sum_filter, finset. Icc], ring, end
(c) ⬇ theorem mathd_algebra_158 (a : ℕ) (h ₀ : even a) (h ₁ : ↑ $ \ sum$ k in finset. range 8, (2 * k + 1) - ↑ $ \ sum$ k in finset. range 5, (a + 2 * k) = (4:ℤ)) : a = 8 := begin simp only [finset. sum_range_succ, finset. sum_range_zero, add_zero, finset. sum_const, finset. card_range, nsmul_eq_mul, finset. sum_add_distrib, finset. mul_sum, finset. sum_mul] at h ₁, norm_num at h ₁, linarith, end (d) ⬇ theorem mathd_numbertheory_341 (a b c : ℕ) (h ₀ : a ≤ 9 ∧ b ≤ 9 ∧ c ≤ 9) (h ₁ : nat. digits 10 ((5^100) % 1000) = [c, b, a]) : a + b + c = 13 := begin norm_num at h ₁, cases h ₁, linarith, end
(e) ⬇ theorem amc12b_2002_p7 (a b c : ℕ) (h ₀ : 0 < a ∧ 0 < b ∧ 0 < c) (h ₁ : b = a + 1) (h ₂ : c = b + 1) (h ₃ : a * b * c = 8 * (a + b + c)) : a ^2 + (b ^2 + c ^2) = 77 := begin rw [h ₁, h ₂] at h ₃, ring_nf at h ₃, rw [h ₁, h ₂], ring_nf, simp [h ₁, h ₂] at *, ring_nf at *, nlinarith, end
Figure 10: Some other interesting proofs generated for $\mathtt{miniF2F}$ by Copra. The length of the proofs generated shows that interaction with the environment helps in fixing the errors encountered while writing long proofs. These long sequences of rewrites are not easy to synthesize without knowing the exact execution feedback from the environment which often contains the hint to fix the rewrites.
[roundcorner=10pt]
(a.1) ⬇ theorem mathd_numbertheory_3 : ($ \ sum$ x in finset. range 10, ((x + 1)^2)) % 10 = 5 := begin dec_trivial!, end (b.1) ⬇ theorem mathd_algebra_209 ($ \ sigma$ : equiv ℝ ℝ) (h ₀ : $ \ sigma$.2 2 = 10) (h ₁ : $ \ sigma$.2 10 = 1) (h ₂ : $ \ sigma$.2 1 = 2) : $ \ sigma$.1 ($ \ sigma$.1 10) = 1 := begin rw [← h ₀, ← h ₂], simp, end (c.1) ⬇ theorem mathd_numbertheory_229 : (5^30) % 7 = 1 := begin have five_to_thirty_is_one : (5^30 : zmod 7) = 1 := begin have five_to_the_six_is_one : (5^6 : zmod 7) = 1, by dec_trivial, have break_power : (5^30 : zmod 7) = (5^6)^5, by norm_num, rw break_power, rw five_to_the_six_is_one, norm_num, end, change 5^30 ≡ 1 [MOD 7], rw ← zmod. eq_iff_modeq_nat, exact_mod_cast five_to_thirty_is_one, end
(a.2) ⬇ theorem mathd_numbertheory_3 : ($ \ sum$ x in finset. range 10, ((x + 1)^2)) % 10 = 5 := begin simp only [finset. sum_range_succ, finset. sum_range_zero, add_zero, pow_two], norm_num, end (b.2) ⬇ theorem mathd_algebra_209 ($ \ sigma$: equiv ℝ ℝ) (h ₀ : $ \ sigma$.2 2 = 10) (h ₁ : $ \ sigma$.2 10 = 1) (h ₂ : $ \ sigma$.2 1 = 2) : $ \ sigma$.1 ($ \ sigma$.1 10) = 1 := begin apply_fun σ. inv_fun at h ₁, rw equiv. inv_fun_as_coe at h ₁, rw ← h ₀ at h ₁, rw ← h ₂ at h ₁, rw ← h ₀, rw equiv. inv_fun_as_coe at h ₀, apply_fun σ at h ₁, rw equiv. inv_fun_as_coe at h ₁, rw ← h ₂, rw equiv. inv_fun_as_coe at h ₂, simp [equiv. symm_apply_apply], end (c.2) ⬇ theorem mathd_numbertheory_229 : (5^30) % 7 = 1 := begin norm_num, end
Figure 11: Some proofs found by Copra as compared to the proofs mentioned in the $\mathtt{miniF2F}$ test dataset. It is interesting to see that the proofs generated by Copra are different from the proofs mentioned in the repository. This is especially true when the proofs are longer. It is also worth noting that occasionally Copra can find very simple proofs for longer proofs mentioned in $\mathtt{miniF2F}$ test dataset. (a.1), (b.1), and (c.1) show the proofs as mentioned in the $\mathtt{miniF2F}$ dataset, while (a.2), (b.2), and (c.2) show the corresponding proofs generated by Copra.
[roundcorner=10pt]
(a) ⬇ theorem mathd_algebra_158 (a : ℕ) (h ₀ : even a) (h ₁ : ↑ $ \ sum$ k in finset. range 8, (2 * k + 1) - ↑ $ \ sum$ k in finset. range 5, (a + 2 * k) = (4:ℤ)) : a = 8 := begin simp only [finset. sum_range_succ, finset. sum_range_zero, add_zero, finset. sum_const, finset. card_range, nsmul_eq_mul, finset. sum_add_distrib, finset. mul_sum, finset. sum_mul] at h ₁, norm_num at h ₁, linarith, end (b) ⬇ theorem mathd_algebra_44 (s t : ℝ) (h ₀ : s = 9 - 2 * t) (h ₁ : t = 3 * s + 1) : s = 1 ∧ t = 4 := begin split, rw h ₁ at h ₀, rw h ₀, linarith, rw h ₀ at h ₁, rw h ₁ at h ₀, linarith, end
(c) ⬇ theorem aime_1990_p4 (x : ℝ) (h ₀ : 0 < x) (h ₁ : x ^2 - 10 * x - 29 ≠ 0) (h ₂ : x ^2 - 10 * x - 45 ≠ 0) (h ₃ : x ^2 - 10 * x - 69 ≠ 0) (h ₄ : 1 / (x ^2 - 10 * x - 29) + 1 / (x ^2 - 10 * x - 45) - 2 / (x ^2 - 10 * x - 69) = 0) : x = 13 := begin ring_nf at h ₄, have h ₅ : (x ^ 2 - 10 * x - 29) * (x ^ 2 - 10 * x - 45) * (x ^ 2 - 10 * x - 69) ≠ 0 := mul_ne_zero (mul_ne_zero h ₁ h ₂) h ₃, field_simp [h ₅] at h ₄, ring_nf at h ₄, nlinarith, end (d) ⬇ theorem amc12b_2002_p7 (a b c : ℕ) (h ₀ : 0 < a ∧ 0 < b ∧ 0 < c) (h ₁ : b = a + 1) (h ₂ : c = b + 1) (h ₃ : a * b * c = 8 * (a + b + c)) : a ^2 + (b ^2 + c ^2) = 77 := begin rw [h ₁, h ₂] at h ₃, ring_nf at h ₃, rw [h ₁, h ₂], ring_nf, simp [h ₁, h ₂] at *, ring_nf at *, nlinarith, end
(e) ⬇ theorem mathd_algebra_113 (x : ℝ) : x ^2 - 14 * x + 3 ≥ 7^2 - 14 * 7 + 3 := begin apply le_of_sub_nonneg, ring_nf, have : (x - 14) * x + 49 = (x - 7) ^ 2, by ring, rw this, apply pow_two_nonneg, end
Figure 12: Some interesting proofs generated for $\mathtt{miniF2F}$ dataset which were generated because of Copra ’s backtracking capabilities.
### A.3 System Prompts For $\mathtt{miniF2F}$
Parts of the ‘system prompt’ used by Copra for theorem proving in Lean are shown in Fig. 13.
[roundcorner=10pt] [breaklines]md You are a proficient formal theorem-proving agent in Lean 3. You are tasked with predicting the next proof step given the current proof state, which is described in the following format: 1. All the goals are described under the ‘[GOALS]‘ keyword. Each goal following ‘[GOALS]‘ is described under the keyword ‘[GOAL] i‘, where ‘i‘ is a positive integer. For example, ‘[GOAL] 1‘, ‘[GOAL] 2‘, etc. 2. Within each ‘[GOAL] i‘ keyword, the goal is described as a human-readable serialized version of the proof state as shown while running the ‘lean‘ command. Each goal may be accompanied by hypotheses, which are described under the keyword ‘[HYPOTHESES] i‘. Each hypothesis following ‘[HYPOTHESES] i‘ starts with the prefix ‘[HYPOTHESIS]‘. Apart from goals and hypotheses, the OPTIONAL keywords ‘[DEFINITIONS] i‘ and ‘[THEOREMS] i‘ may appear, which respectively describe the relevant definitions of symbols and terms used in that goal and some potentially useful theorems or lemmas which might help in proving the goal. Each definition within ‘[DEFINITIONS]‘ starts with the prefix ‘[DEFINITION]‘. Similarly, each theorem/lemma within ‘[THEOREMS]‘ starts with the prefix ‘[THEOREM]‘. If you choose to use a theorem described in ‘[THEOREMS] i‘, be SURE that it applies and is useful for proving the goal. 3. The ‘[STEPS]‘ keyword is used to describe the proof-steps which were used to obtain the current proof state from the original theorem. Each proof step starts with the prefix ‘[STEP]‘ and is a valid Lean tactic. For example, ‘[STEPS][STEP]rw h 1 at h 2,[STEP]linarith,‘. 4. Sometimes the ‘[INCORRECT STEPS]‘ keyword will appear, which describes proof-steps which should NOT be generated. For example, ‘[INCORRECT STEPS][STEP]apply h 1,[STEP]rw ←h 1 ‘. **DO NOT** generate these ‘[INCORRECT STEPS]‘ again, as they are failed proof steps which have already been tried. Re-generating such proof steps will cause backtracking and early termination of your proof search. 5. There is also an optional ‘[LAST STEP]‘ keyword which describes the proof-step generated last time. If the proof-step was incorrect, then it is also followed by an error message from Lean 3 environment described with the ‘[ERROR MESSAGE]‘ keyword. For example, ‘[LAST STEP]linarith, [ERROR MESSAGE]linarith failed to find a contradiction: y : $\mathbb{R}$ , 1 : x = 3 - 2 * y, 2 : 2 * x - y = 1 $\vdash$ false‘. You can use the error message as guidance in predicting a correct proof-step. Do not generate tactics which you believe will result in the same error. If the proof-step was correct then it is followed by the keyword ‘[SUCCESS]‘. For example, ‘[LAST STEP]linarith,[SUCCESS]‘. Do NOT generate the last proof-step again if it was NOT successful, this will also cause early termination of your proof search. 6. Sometimes there can be errors in the format of the generated response. This is reported using the keyword ‘[ERROR]‘ followed by the error message. For example, ‘[ERROR] response: ’Great! The proof is complete.’, Reason: ’stop’. Please respond only in the format specified.[END]‘. This means that the response generated by you was not in the specified format.
Start your response with ‘[RUN TACTIC]‘ followed by the tactic which will help in proving the current proof state, and then ‘[END]‘. For example, ‘[RUN TACTIC] induction c, [END]‘. Do NOT finish the proof in one shot ending with ‘end‘, which signifies the end of a Lean 3 proof. Generate exactly ONE proof-step. Multiple proof steps are more error prone, because you will not get a chance to see intermediate proof state descriptions………….
Figure 13: Parts of “system prompt” used by Copra for Lean.
### A.4 System Prompts For CompCert
Parts of the ‘system prompt’ used by Copra for theorem proving in Coq are shown in Fig. 14.
[roundcorner=10pt] [breaklines]md You are a proficient formal theorem-proving agent in Coq. You are tasked with predicting the next proof step given the current proof state, which is described in the following format: 1. All the goals are described under the ‘[GOALS]‘ keyword. Each goal following ‘[GOALS]‘ is described under the keyword ‘[GOAL] i‘, where ‘i‘ is a positive integer. For example, ‘[GOAL] 1‘, ‘[GOAL] 2‘, etc. 2. Within each ‘[GOAL] i‘ keyword, the goal is described as a human-readable serialized version of the proof state as shown while running the ‘lean‘ command. Each goal may be accompanied by hypotheses, which are described under the keyword ‘[HYPOTHESES] i‘. Each hypothesis following ‘[HYPOTHESES] i‘ starts with the prefix ‘[HYPOTHESIS]‘. Apart from goals and hypotheses, the OPTIONAL keywords ‘[DEFINITIONS] i‘ and ‘[THEOREMS] i‘ may appear, which respectively describe the relevant definitions of symbols and terms used in that goal and some potentially useful theorems or lemmas which might help in proving the goal. Each definition within ‘[DEFINITIONS]‘ starts with the prefix ‘[DEFINITION]‘. Similarly, each theorem/lemma within ‘[THEOREMS]‘ starts with the prefix ‘[THEOREM]‘. If you choose to use a theorem described in ‘[THEOREMS] i‘, be SURE that it applies and is useful for proving the goal. 3. The ‘[STEPS]‘ keyword is used to describe the proof-steps which were used to obtain the current proof state from the original theorem. Each proof step starts with the prefix ‘[STEP]‘, and is a valid Coq tactic ending with a ‘.‘. For example, ‘[STEPS][STEP]intros a.[STEP]induction a.‘ 4. Sometimes the ‘[INCORRECT STEPS]‘ keyword will appear, which describes proof-steps which should NOT be generated. For example, ‘[INCORRECT STEPS][STEP]apply mul_assoc.[STEP]rewrite ¡- H.‘. **DO NOT** generate these ‘[INCORRECT STEPS]‘ again, as they are failed proof steps which have already been tried. Re-generating such proof steps will cause backtracking and early termination of your proof search. 5. There is also an optional ‘[LAST STEP]‘ keyword which describes the proof-step generated last time. If the proof-step was incorrect, then it is also followed by an error message from Coq environment described with the ‘[ERROR MESSAGE]‘ keyword. For example, ‘[LAST STEP]reflexivity.[ERROR MESSAGE]Error: In environment: nat to unify ”n” with ”n + 0”.‘. You can use the error message as guidance in predicting a correct proof-step. Do not generate tactics which you believe will result in the same error. If the proof-step was correct then it is followed by the keyword ‘[SUCCESS]‘. For example, ‘[LAST STEP]reflexivity.[SUCCESS]‘. Do NOT generate the last proof-step again if it was NOT successful, this will also cause early termination of your proof search. 6. Sometimes there can be errors in the format of the generated response. This is reported using the keyword ‘[ERROR]‘ followed by the error message. For example, ‘[ERROR] response: ’Great! The proof is complete.’, Reason: ’stop’. Please respond only in the format specified.[END]‘. This means that the response generated by you was not in the specified format.
Start your response with ‘[RUN TACTIC]‘, followed by the tactic which will help in proving the current proof state, and then ‘[END]‘. For example, ‘[RUN TACTIC]destruct c.[END]‘. Do NOT finish the proof in one shot ending with ‘Qed‘, which signifies the end of a Coq proof. Generate exactly ONE proof-step. Multiple proof steps are more error prone, because you will not get a chance to see intermediate proof state descriptions. In particular, do not generate tactics which could be split into smaller tactics as you will not be able to view the intermediate proof states during these rewrites. The tactic ‘admit‘ is NOT a valid proof step, do NOT generate it.
Ensure that the proof step you generate is (1) valid (2) helpful …..
Figure 14: Parts of “system prompt” used by Copra for Coq.
### A.5 Example Proofs generated For CompCert
[roundcorner=10pt]
⬇
theorem mathd_algebra_107
(x y : ℝ)
(h ₀ : x ^2 + 8 * x + y ^2 - 6 * y = 0) :
(x + 4)^2 + (y -3)^2 = 5^2 :=
begin
have h ₁ : (x + 4) ^ 2 = x ^ 2 + 8 * x + 16 := by ring,
have h ₂ : (y - 3) ^ 2 = y ^ 2 - 6 * y + 9 := by ring,
rw [h ₁, h ₂],
ring_nf,
linarith,
end
Figure 15: An interesting proof generated by Copra while using informal proofs hints generated via GPT-4.
<details>
<summary>extracted/5780501/img-compcert-pass-k-guidance-steps-gpt-4-turbo.png Details</summary>

### Visual Description
## Line Chart: Theorem Proving Performance
### Overview
This line chart compares the performance of two theorem proving systems, COPRAGPT-4-turbo (with retrieval) and Proverbot9001, based on the number of theorems proved as a function of the number of queries submitted. The chart title indicates the metric being evaluated is "pass@1-with-n-queries".
### Components/Axes
* **X-axis:** "Number of Queries (n)", ranging from 0 to 60, with tick marks at intervals of 10.
* **Y-axis:** "Number of Theorems Proved", ranging from 0 to 60, with tick marks at intervals of 10.
* **Lines:**
* COPRAGPT-4-turbo (with retrieval) - represented by a solid orange line.
* Proverbot9001 - represented by a solid blue line.
* **Legend:** Located in the bottom-right corner of the chart.
* **Title:** "pass@1-with-n-queries" positioned at the top-center of the chart.
### Detailed Analysis
**COPRAGPT-4-turbo (with retrieval) - Orange Line:**
The orange line starts at approximately 0 theorems proved at 0 queries. It exhibits a steep upward slope initially, reaching approximately 15 theorems proved at around 8 queries. The slope gradually decreases, leveling off around 55-58 theorems proved between 25 and 60 queries. There is a plateau between approximately 20 and 40 queries where the number of theorems proved remains relatively constant.
* (0, 0)
* (5, ~2)
* (10, ~15)
* (15, ~25)
* (20, ~35)
* (25, ~45)
* (30, ~48)
* (35, ~50)
* (40, ~52)
* (45, ~54)
* (50, ~55)
* (60, ~58)
**Proverbot9001 - Blue Line:**
The blue line also starts at approximately 0 theorems proved at 0 queries. It has a more gradual initial slope compared to the orange line, reaching approximately 8 theorems proved at around 8 queries. The slope increases more noticeably between 8 and 20 queries, reaching approximately 32 theorems proved at 20 queries. The slope then decreases, and the line plateaus around 52-54 theorems proved between 40 and 60 queries. The line has more pronounced step-like increases than the orange line.
* (0, 0)
* (5, ~3)
* (10, ~8)
* (15, ~18)
* (20, ~32)
* (25, ~38)
* (30, ~42)
* (35, ~46)
* (40, ~50)
* (45, ~52)
* (50, ~53)
* (60, ~54)
### Key Observations
* COPRAGPT-4-turbo (with retrieval) consistently proves more theorems than Proverbot9001 across the entire range of queries.
* Both systems exhibit diminishing returns as the number of queries increases, with the rate of theorem proving slowing down.
* Proverbot9001 shows more discrete jumps in the number of theorems proved, suggesting it may solve theorems in batches or have a more step-wise learning process.
* The orange line (COPRAGPT-4-turbo) has a smoother curve, indicating a more continuous improvement in performance.
### Interpretation
The chart demonstrates that COPRAGPT-4-turbo (with retrieval) outperforms Proverbot9001 in theorem proving, as measured by the number of theorems proved per query. The "pass@1-with-n-queries" metric suggests that the systems are evaluated on their ability to correctly prove theorems given a single attempt per query. The diminishing returns observed in both systems indicate that there is a limit to the effectiveness of simply increasing the number of queries. The differences in the smoothness of the curves suggest that the two systems employ different strategies for theorem proving. The step-like increases in Proverbot9001's performance could be due to the system learning from its mistakes and making significant progress in batches. The plateauing of both lines suggests that the systems are approaching their maximum performance level within the tested query range. This data could be used to compare the efficiency and effectiveness of different theorem proving approaches and to identify areas for improvement.
</details>
Figure 16: Copra vs. Proverbot9001 on the CompCert benchmark.
Figure 16 shows the number of proofs done by Copra versus Proverbot9001 varying with the number of queries on CompCert benchmark. Figure 17 shows some interesting proofs generated by our approach on the CompCert dataset.
[roundcorner=10pt]
(a) [breaklines]Coq gss : forall l v m, (set l v m) l = match l with R r =¿ v — S sl ofs ty =¿ Val.load_result (chunk_of_type ty) v end. Proof. intros l v m. destruct l as [r — s o t]. - unfold set. destruct (Loc.eq (R r) (R r)); [reflexivity — contradiction]. - unfold set. destruct (Loc.eq (S s o t) (S s o t)); [reflexivity — contradiction]. Qed. (b) [breaklines]Coq eq : forall (p q: loc), p = q + p ¡¿ q. Proof. decide equality. - apply mreg_eq. - decide equality. - decide equality. apply Pos.eq_dec. decide equality. - decide equality. Qed.
(c) [breaklines]Coq disjoint_cons_right : forall a l1 l2, disjoint l1 (a :: l2) -¿ disjoint l1 l2. Proof. intros a l1 l2 H. unfold disjoint. intros x1 x2 H1 H2. apply H. assumption. right. assumption. Qed. (d) [breaklines]Coq eq_int_type : forall (x y: int_type), x=y + x¡¿y. Proof. decide equality. Qed.
(e) [breaklines]Coq set_locals_lessdef : forall e1 e2, env_lessdef e1 e2 -¿ forall il, env_lessdef (set_locals il e1) (set_locals il e2). Proof. intros e1 e2 H. induction il as [— a il’]. - apply H. - intros. apply set_var_lessdef. apply IHil’. apply Val.lessdef_refl. Qed.
Figure 17: Some other interesting proofs generated for CompCert by Copra. We can see that these proofs are long, and often use ‘apply’ tactic which shows that Copra can effectively use the retrieved information to discharge given proof states.
[roundcorner=10pt] [breaklines]md ‘[GOALS]‘ ‘[GOAL] 1‘ n = 70 ‘# The goal we want to prove (or proof state)‘ ‘[HYPOTHESES] 1 # Assumptions about the goal‘ ‘[HYPOTHESIS]‘ n : $\mathbb{N}$ ‘[HYPOTHESIS]‘ h 0 : 0 ¡ n ‘[HYPOTHESIS]‘ h 1 : n.gcd 40 = 10 ‘[HYPOTHESIS]‘ h 2 : n.lcm 40 = 280
‘[THEOREMS] 1 # The retrieved lemmas from BM25 which can help prove goal‘ ‘[THEOREM]‘ pnat.gcd_mul_lcm : (n m : $\mathbb{N}$ +) : (gcd n m) * (lcm n m) = n * m ‘[THEOREM]‘ tactic.is_prime_helper : (n : $\mathbb{N}$ ) (h 1 : 1 ¡ n) (h 2 : nat.min_fac n = n) : nat.prime n ‘[THEOREM]‘ pnat.lcm_coe : (n m : $\mathbb{N}$ +) ….
‘[INCORRECT STEPS] # Previous mistakes on the given proof state‘ ‘# Each mistake is listed with the ‘STEP’ keyword.‘ ‘[STEP]‘ apply nat.eq_of_mul_eq_mul_left (by norm_num : 0 ¡ 40), rw ←pnat.gcd_mul_lcm, repeat rw pnat.gcd_coe , repeat rw pnat.lcm_coe , norm_num, …. …. …. ‘[STEP]‘ have h 3 : n * 40 = nat.gcd n 40 * nat.lcm n 40, from pnat.gcd_mul_lcm (⟨n, h 0 ⟩) ⟨40, dec_trivial⟩, ‘[STEP]‘ rw nat.gcd_mul_lcm, ‘[STEP]‘ rw pnat.gcd_mul_lcm at h 1,
‘[LAST STEP]‘ # Last prediction have h 3 : n * 40 = 10 * 280, from pnat.gcd_mul_lcm (subtype.mk n h 0) (subtype.mk 40 dec_trivial), linarith,
‘[ERROR MESSAGE] # Error message on the last step‘ Got error in ’have h 3 : n * 40 = 10 * 2…’: error: invalid type ascription, term has type pnat.gcd ⟨n, h 0 ⟩ ⟨40, _⟩ * pnat.lcm ⟨n, h 0 ⟩ ⟨40, _⟩ = ⟨n, h 0 ⟩ * ⟨40, _⟩ … ‘[END]‘ ‘—————————————————————————‘ ‘# Prediction by GPT-4‘ ‘[RUN TACTIC]‘ have h 3 : n * 40 = 10 * 280, by rw [←nat.gcd_mul_lcm n 40, h 1, h 2], linarith, ‘[END]‘
Figure 18: We can see from this example that to be able to capitalize the well-retrieved lemmas, the models need to learn how to use those lemmas as well. In our case, BM25 does a very good job of retrieving the best lemmas, however, GPT-4 could not use it correctly in the first couple of tries. It was only because of our novel stateful backtracking approach that we could capture the previous mistakes along with the last error message all in the same prompt which dissuaded the GPT-4 from making repeated mistakes and getting to the right prediction. It is also important to note that we could not solve this problem without retrieval, but once we know the list of relevant lemmas, knowing how to use them becomes extremely important. Given this example, it should not be surprising that even when other retrieval methods like DPR perform better than BM25, Copra has an edge because it can act as per the rich feedback it gets from the environment. Another interesting idea one can try is to let Copra create a search query to control the retrieved lemma, which can make a simple BM25 search more effective than DPR.
| | Prompt Token Statistics for different settings | | |
| --- | --- | --- | --- |
| Setting | Min. Token Count | Max. Token Count | Avg. Token Count |
| $\mathtt{miniF2F}$ | 33 | 1828 | 322.61 |
| $\mathtt{miniF2F}$ (+ Retrieval) | 181 | 4355 | 824.21 |
| $\mathtt{miniF2F}$ (+ Retrieval + Informal) | 192 | 4355 | 955.20 |
| CompCert ( $+$ Retrieval) | 292 | 116574 | 3219.79 |
Table 5: Aggregate prompt token statistics for Copra on $\mathtt{miniF2F}$ and CompCert dataset.