# Learn from Failure: Fine-Tuning LLMs with Trial-and-Error Data for Intuitionistic Propositional Logic Proving
## Abstract
Recent advances in Automated Theorem Proving have shown the effectiveness of leveraging a (large) language model that generates tactics (i.e., proof steps) to search through proof states. Current models, trained solely on successful proof paths, face a discrepancy at the inference stage: they must sample and try various tactics at each proof state until finding success, whereas their training never incorporates learning from failed attempts. Intuitively, a tactic that leads to a failed search path indicates that similar tactics should receive less attention during subsequent trials. In this paper, we demonstrate the benefit of training models that additionally learn from failed search paths. Facing the lack of such trial-and-error data in existing open-source theorem-proving datasets, we curate a dataset of intuitionistic propositional logic theorems and formalize it in Lean, such that we can reliably check the correctness of proofs. We compare our model trained on relatively short trial-and-error information (TrialMaster) with models trained only on the correct paths, and find that the former solves more unseen theorems at a lower search cost.
Chenyang An¹\*, Zhibo Chen²\* (0000-0003-0045-5024), Qihao Ye¹ (0000-0002-7369-757X), Emily First¹, Letian Peng¹, Jiayun Zhang¹, Zihan Wang¹†, Sorin Lerner¹†, Jingbo Shang¹†
¹University of California, San Diego  ²Carnegie Mellon University
\*The first two authors contributed equally to this work. †Corresponding authors.
{c5an, q8ye, emfirst, lepeng, jiz069, ziw224, lerner, jshang}@ucsd.edu, zhiboc@andrew.cmu.edu
## 1 Introduction
Automated Theorem Proving is a challenging task that has recently gained popularity in the machine-learning community. Researchers build neural theorem provers to synthesize formal proofs of mathematical theorems Yang et al. (2023); Welleck et al. (2021); Lample et al. (2022); Mikuła et al. (2023); Wang et al. (2023); Bansal et al. (2019); Davies et al. (2021); Wu et al. (2021); Rabe et al. (2020); Kusumoto et al. (2018); Irving et al. (2016). Typically, a neural theorem prover, given a partial proof and the current proof state, uses a neural model to predict the next likely proof step, or tactic. The neural models utilize different architectures like LSTMs Sekiyama et al. (2017), CNNs Irving et al. (2016), DNNs Sekiyama and Suenaga (2018), GNNs Bansal et al. (2019); Wang et al. (2017) and RNNs Wang and Deng (2020), though most recent work has begun to explore the use of transformer-based large language models (LLMs) due to their emerging reasoning abilities.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Comparison of Tactic Selection Methods (Conventional vs. TrialMaster)
### Overview
The image is a technical diagram illustrating and comparing two methods for selecting tactics in a system that interacts with a "Lean call" (likely a reference to the Lean theorem prover). It contrasts a "Conventional Method" with a method called "TrialMaster," focusing on how they handle tactic probabilities after a backtracking event. The diagram uses a state-transition graph with nodes and arrows to represent the process flow.
### Components/Axes
**Legend (Top-Left):**
* **Symbol `k` inside a circle:** "state k after Lean call"
* **Symbol `?` inside a circle:** "state waiting for Lean call"
* **Arrow `→`:** "tactic" (indicating a transition labeled with a tactic)
**Main Diagram Structure:**
The diagram is divided into three main sections:
1. **Left Section (Initial State):** Shows the starting point before any method is applied.
2. **Top-Right Section (Conventional Method):** Shows the outcome using the conventional approach.
3. **Bottom-Right Section (TrialMaster):** Shows the outcome using the TrialMaster approach.
**Visual Elements:**
* **Nodes:** Circles containing either a number (e.g., `0`, `1`) or a question mark `?`.
* **Arrows:** Directed lines connecting nodes, representing state transitions. Some arrows are blue.
* **Shaded Ovals:** Gray ellipses grouping certain nodes and transitions, indicating a sub-process or focus area.
* **Text Annotations:** Probabilities, labels, and descriptive sentences placed near the graphical elements.
### Detailed Analysis
**1. Initial State (Left Section):**
* A starting node `0` (state 0 after a Lean call) has an outgoing arrow to a node `?` (state waiting for a Lean call).
* **Associated Text:**
* `tactic 1: 0.6`
* `tactic 2: 0.3`
* `tactic 3: 0.1`
* **Descriptive Text:** "LLM generates tactics; tactic 1 is selected first."
* **Flow:** An arrow points from this initial setup to the next stage, labeled with the blue text "**backtrack**". This indicates that the first selected tactic (tactic 1) led to a need to backtrack.
**2. Backtrack Stage (Center):**
* The diagram shows a node `0` with a blue arrow pointing to a node `1`. Node `1` is inside a gray shaded oval that also contains an outgoing arrow leading to "..." (ellipsis, indicating continuation).
* This represents the state after backtracking from the failed attempt with tactic 1. The system is now at state `1`.
**3. Conventional Method (Top-Right):**
* **Flow:** From the backtrack stage (state `1`), the process continues. The diagram shows node `0` (the original state) with two outgoing arrows:
* One blue arrow to node `1` (the backtracked path).
* One black arrow to a new node `?`.
* **Associated Text (next to the new `?` node):**
* `tactic 2: 0.3`
* `tactic 3: 0.1`
* The word "**unchanged**" is written in bold italics below these probabilities.
* **Descriptive Text:** "tactic 2 is then selected."
* **Interpretation:** The conventional method does not update the probabilities of the remaining tactics (tactic 2 and tactic 3) after the backtracking event. Their probabilities remain at the initial values (0.3 and 0.1, respectively). Tactic 2, having the highest remaining probability, is selected next.
**4. TrialMaster (Bottom-Right):**
* **Flow:** Identical graphical structure to the Conventional Method branch: node `0` with a blue arrow to node `1` and a black arrow to a new node `?`.
* **Associated Text (next to the new `?` node):**
* `tactic 2: 0.2`
* `tactic 3: 0.8`
* The word "**updated**" is written in bold, red italics below these probabilities.
* **Descriptive Text:** "LLM generates tactics with all history paths including backtracking. tactic 3 is then selected."
* **Interpretation:** The TrialMaster method updates the tactic probabilities after incorporating the history of the backtracking event. The probability for tactic 2 decreases from 0.3 to 0.2, while the probability for tactic 3 increases significantly from 0.1 to 0.8. Consequently, tactic 3, now having the highest probability, is selected next.
### Key Observations
1. **Probability Update Mechanism:** The core difference between the two methods is the dynamic updating of tactic probabilities. The Conventional Method uses static probabilities, while TrialMaster adjusts them based on execution history (including failures/backtracks).
2. **Dramatic Probability Shift:** In the TrialMaster example, the probability of tactic 3 increases eightfold (from 0.1 to 0.8) after the backtrack, making it the dominant choice. This suggests the system learned that tactic 1 failed and tactic 2 might be less promising, thereby boosting the relative likelihood of tactic 3.
3. **Visual Consistency:** The graphical structure (nodes, arrows, shaded ovals) is identical for both methods in the right-hand sections, emphasizing that the difference is purely in the associated probability values and the selection logic.
4. **Color Coding:** Blue is used for the "backtrack" label and the arrow representing the backtracking path. Red is used exclusively for the word "updated" in the TrialMaster section, highlighting the key action of that method.
### Interpretation
This diagram illustrates a conceptual improvement in automated reasoning or proof-search systems. The "Conventional Method" represents a naive or memoryless approach where each decision is made based on initial, fixed probabilities, ignoring the outcome of previous attempts. This can lead to inefficient repetition of less promising paths.
**TrialMaster** proposes a more adaptive, history-aware strategy. By updating tactic probabilities based on the success or failure (backtracking) of previous choices, it mimics a form of reinforcement learning. The system "remembers" that tactic 1 led to a dead end and recalibrates its strategy, significantly increasing the probability assigned to tactic 3. This likely leads to more efficient exploration of the search space, avoiding repeated failures and focusing on potentially more fruitful tactics.
The diagram serves as a high-level, intuitive explanation of how incorporating historical context (specifically, backtracking information) into the decision-making process of a Large Language Model (LLM) can lead to different and presumably better tactic selection outcomes. The specific numerical values (0.6, 0.3, 0.1 → 0.2, 0.8) are illustrative examples of this updating mechanism in action.
</details>
Figure 1: A simple example for how learning trial-and-error data impacts inference distribution. See Figure 4 for a concrete case.
An interactive proof assistant, such as Lean de Moura et al. (2015), Coq Barras et al. (1997) or Isabelle Nipkow et al. (2002), evaluates the model’s predicted candidate proof steps, returning either new proof states or errors. Neural theorem provers iterate on this procedure, performing proof search, e.g., a depth-first search (DFS), to traverse the space of possible proofs. An example of a DFS proof search is illustrated in Figure 2(a), where the prover progressively generates new tactics if the attempted tactics result in incorrect proofs.
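The DFS loop just described can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_tactics` and `apply_tactic` are hypothetical stand-ins for the LLM tactic generator and the proof assistant, with hard-coded behavior so the example is self-contained.

```python
# Hypothetical sketch of DFS proof search: an LLM-backed generator proposes
# candidate tactics for each proof state, and a proof assistant (stubbed here)
# either returns a new state or rejects the tactic with an error.

def generate_tactics(state, n_sampled=2):
    """Stand-in for an LLM call: return candidate tactics for a state."""
    table = {
        "s0": ["intro h1"],
        "s1": ["apply or.inr", "apply or.inl"],
        "s2": ["exact h1"],
    }
    return table.get(state, [])[:n_sampled]

def apply_tactic(state, tactic):
    """Stand-in for the proof assistant: new state, 'error', or 'complete'."""
    transitions = {
        ("s0", "intro h1"): "s1",
        ("s1", "apply or.inr"): "error",   # dead end, forces backtracking
        ("s1", "apply or.inl"): "s2",
        ("s2", "exact h1"): "complete",
    }
    return transitions.get((state, tactic), "error")

def dfs(state, path):
    if state == "complete":
        return path
    for tactic in generate_tactics(state):
        result = apply_tactic(state, tactic)
        if result == "error":
            continue  # this candidate failed; try the next one
        found = dfs(result, path + [tactic])
        if found is not None:
            return found
    return None  # no candidate worked; caller backtracks further up the tree

print(dfs("s0", []))  # → ['intro h1', 'apply or.inl', 'exact h1']
```

Note that when `apply or.inr` fails, the search silently falls back to the next candidate; the failure itself is never shown to the generator, which is precisely the limitation discussed next.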
Such provers are usually trained on a dataset containing only the correct proof paths. This, however, presents a limitation: during inference, the prover has no way to leverage the failed paths it has already explored. Such failure information is intuitively beneficial, as it could steer the model toward generating tactics similar to the failed ones more sparingly. At the very least, the failure information should help the model avoid regenerating tactics that have already failed. See Figure 1.
In this paper, we empirically verify this intuition. We compare the conventional model, trained on correct proof paths only, with TrialMaster, a model trained on whole proof trees containing both correct and incorrect paths. See Figure 2(b). As such, TrialMaster can make predictions based on failure information at inference time.
Since current open-source Automated Theorem Proving datasets do not contain complete proof trees, we create such a dataset, PropL, written in Lean. We focus on theorems of intuitionistic propositional logic. A formula $A$ of intuitionistic propositional logic is true if and only if $\vdash A$ has a derivation according to the rules of intuitionistic propositional logic. These rules are listed in Figure 6 in Appendix A with explanations. This is to be distinguished from classical logic, where a theorem of classical propositional logic admits a proof if and only if it evaluates to true under every truth-value assignment to the propositions that appear in it.
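To make the contrast concrete, the law of excluded middle, $p ∨ ¬p$, is a classical tautology (it is true whether $p$ is assigned true or false) yet has no derivation in intuitionistic propositional logic. A Lean-style rendering of the statement, with `sorry` marking the absence of a constructive proof, might look like:

```lean
-- Classically valid: true under both truth assignments to p.
-- Intuitionistically unprovable: a derivation of ⊢ p ∨ ¬p would have to
-- prove one disjunct outright, and neither follows from no assumptions.
theorem excluded_middle (p : Prop) : p ∨ ¬p := by
  sorry -- provable only with classical axioms such as Classical.em
```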
We give two elementary examples of statements of intuitionistic propositional logic: the first admits a proof, whereas the second does not. We then give an example of a proof that contains a backtracking step.
```lean
theorem theorem_1 : p1 → p1 ∨ p2 := by
  intro h1
  apply or.inl
  exact h1
```
The first theorem states that p1 implies p1 ∨ p2, where ∨ stands for "or". To show this, we assume p1 and prove either p1 or p2; we prove p1 in this case. The last three lines are tactics (proof steps) in Lean used to prove this fact.
Our second example looks like the following.
```lean
theorem theorem_2 : p1 → p1 ∧ p2 := by
```
The second theorem states that p1 implies p1 and p2. This theorem has no proof: when we assume p1, we cannot show both p1 and p2.
Proofs might include backtracking instructions. Here is another possible proof of theorem 1. After step 2, there is no possible proof step that can lead to a final solution, so we backtrack to step 1 and try a different proof step.
```lean
theorem theorem_1 : p1 → p1 ∨ p2 := by
  intro h1 -- this is step 1
  apply or.inr -- this is step 2
  no solution, backtrack to step 1
  apply or.inl
  exact h1
```
Note that "no solution, backtrack to step 1" is not a tactic in Lean. It simply tells the system to backtrack to step 1 and start reasoning from there.
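Such a search history can be flattened into a single training sequence that interleaves tactics with backtracking instructions. The sketch below is a hypothetical serialization format for illustration, not necessarily the exact one PropL uses:

```python
# Illustrative sketch: flatten a proof search containing a failed branch into
# one training string, so the model sees the trial-and-error history rather
# than only the cleaned-up correct path.

def linearize(steps):
    """Join (kind, payload) proof events into a single training string."""
    parts = []
    for kind, payload in steps:
        if kind == "tactic":
            parts.append(payload)
        else:  # a backtracking instruction, kept verbatim in the sequence
            parts.append(f"no solution, backtrack to step {payload}")
    return "\n".join(parts)

# The search trace corresponding to the backtracking proof of theorem 1 above.
search_trace = [
    ("tactic", "intro h1"),
    ("tactic", "apply or.inr"),
    ("backtrack", 1),
    ("tactic", "apply or.inl"),
    ("tactic", "exact h1"),
]
print(linearize(search_trace))
```

Training on such sequences conditions the model's next-tactic distribution on failed branches, rather than only on the successful path.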
Specifically, our PropL dataset is created through a two-stage process. First, we generate a comprehensive set of propositional logic theorems by uniformly sampling from all possible theorems, using a bijection between natural numbers and propositions to ensure representativeness. Second, we construct proofs using a focusing method with polarization, recording the detailed trial-and-error search process, including both successful and backtracked steps, thereby capturing the complexity and nuances of theorem proving in intuitionistic propositional logic. Thus, our dataset is complete, scalable, and representative. The proofs in our dataset come with trial-and-error information, generated by the Focused Proof Search (FPS) algorithm McLaughlin and Pfenning (2009); Liang and Miller (2009); Pfenning (2017).
We verify the effectiveness of incorporating failed trials during training and inference through experiments on PropL, observing that TrialMaster achieves a higher proof-search success rate and lower search cost than the conventional model trained only on correct proof paths. Our experiments further indicate that our model can perform backtracking without help from an external system.
Our main contributions are as follows:
- We establish PropL, a complete, scalable, and representative benchmark for intuitionistic propositional logic theorems formalized in Lean. PropL includes proofs with trial-and-error information, generated by the FPS algorithm.
- We demonstrate that for intuitionistic propositional logic theorem proving, incorporating trial-and-error information into training and proving outperforms a conventional model that is trained on correct proofs only.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: LLM Fine-Tuning vs. DFS Inference for Theorem Proving
### Overview
The image is a technical diagram comparing two processes: **LLM Fine-Tuning** (left panel) and **DFS Inference** (right panel). It illustrates how a Large Language Model (LLM) is trained to generate proofs for mathematical theorems and how it is then used during inference with a Depth-First Search (DFS) strategy, integrated with a formal verification tool called "Lean." The diagram uses flowcharts, state trees, and annotated steps to explain the workflows.
### Components/Axes
The diagram is divided into two main vertical panels separated by a double line.
**Left Panel: LLM Fine-Tuning**
* **Title:** "LLM Fine-Tuning" (top-left, bold).
* **Process Flow:**
1. Starts with "theorem (i.e., state 0)".
2. An arrow points down to a yellow box labeled "LLM".
3. An arrow points down from the LLM to "predicted proof (e.g., state 0; tactic 1; ...; state 7)".
4. An upward orange arrow labeled "compute loss" points from a yellow box containing a training sample to the "predicted proof".
5. The yellow training sample box contains: "state 0; tactic 6; state 6; tactic 7; state 7."
6. Below this box is a small tree diagram labeled "proof tree of a training sample". The tree has nodes 0, 1, 2, 6, 7. Node 7 has a green checkmark (✓). A blue curved arrow points from node 0 to node 6.
**Right Panel: DFS Inference**
* **Title:** "DFS Inference" (top-center, bold).
* **Legend (Top of Right Panel):**
* `ⓚ` "state k by calling Lean"
* `?` "waiting to call Lean"
* `→` "tactic generated by LLM" (black arrow)
* `→` "backtrack to the last valid parent state" (blue arrow)
* **Main Content:** The inference process is shown in three sequential columns, separated by dashed vertical lines, representing steps in a search.
* **Column 1 (Step 3):**
* **Top:** A state tree showing nodes 0, 1, 2, 3, and several `?` nodes. Node 3 has a red cross (✗). A blue arrow backtracks from node 3 to node 2.
* **Bottom (Gray Box):** A flowchart showing the sequence: "state 0; tactic 1; state 1; tactic 2; state 2" -> "LLM" -> "tactic 3, tactic 4" -> "Lean" -> "state 3: 'error'" (in red). Labeled "step 3" at the bottom.
* **Column 2 (Step 4):**
* **Top:** The state tree expands. Node 4 is added with a red cross (✗). A blue arrow backtracks from node 4 to node 2.
* **Bottom (Gray Box):** Similar flowchart: "state 0; tactic 1; state 1; tactic 2; state 2" -> "LLM" -> "tactic 3, tactic 4" -> "Lean" -> "state 4: 'error'" (in red). An annotation points to the "LLM" box: "no LLM execution at this step". Labeled "step 4" at the bottom.
* **Column 3 (Last Step):**
* **Top:** The state tree shows a successful path. From node 1, a black arrow points to node 6. From node 6, a black arrow points to node 7, which has a green checkmark (✓). Nodes 3, 4, and 5 have red crosses (✗). Blue arrows show backtracking from failed branches.
* **Bottom (Gray Box):** Flowchart: "state 0; tactic 6; state 6" -> "LLM" -> "tactic 7" -> "Lean" -> "state 7: 'complete'" (in green). Labeled "last step" at the bottom.
### Detailed Analysis
**LLM Fine-Tuning Process:**
1. **Input:** A theorem, represented as an initial proof state ("state 0").
2. **Model:** An LLM takes this state as input.
3. **Output:** The LLM generates a "predicted proof," which is a sequence of states and tactics (e.g., "state 0; tactic 1; ...; state 7").
4. **Training Objective:** The loss is computed by comparing the LLM's predicted proof sequence against a ground-truth proof sequence from a training sample (e.g., "state 0; tactic 6; state 6; tactic 7; state 7."). The associated proof tree shows the correct path from state 0 to state 7 via state 6.
**DFS Inference Process:**
1. **Search Strategy:** Uses Depth-First Search to explore possible proof paths.
2. **State Representation:** Circled numbers (`ⓚ`) represent proof states verified by calling Lean. Circled question marks (`?`) represent states waiting for verification.
3. **Action Flow:**
* The LLM generates one or more candidate tactics for the current state (e.g., "tactic 3, tactic 4" for state 2).
* These tactics are sent to "Lean" for verification.
* Lean returns either a new valid state (`ⓚ`) or an "error."
4. **Backtracking:** If a tactic leads to an error (marked with a red `✗`), the search backtracks (blue arrow) to the last valid parent state to try a different tactic.
5. **Progression:**
* **Step 3:** From state 2, tactics 3 and 4 are tried. Both lead to errors (states 3 and 4).
* **Step 4:** The search backtracks to state 2. The diagram notes "no LLM execution at this step," implying the previously generated tactics (3 & 4) are being exhausted or the search is moving to a different branch.
* **Last Step:** The search finds a successful path. From state 1, tactic 6 leads to state 6. From state 6, the LLM generates tactic 7, which Lean verifies, leading to state 7 marked "complete" (green text).
### Key Observations
1. **Integration of LLM and Formal Verifier:** The core workflow involves an LLM proposing tactics and a formal system (Lean) verifying them, creating a human-AI collaborative proof search.
2. **Error-Driven Search:** The DFS is guided by verification results. Errors trigger backtracking, making the search systematic.
3. **Training vs. Inference:** Fine-tuning trains the LLM to mimic complete proof sequences. Inference uses the LLM as a generator within a search algorithm that can recover from its mistakes via backtracking.
4. **Visual Coding:** Colors and symbols are used consistently: yellow for LLM, blue for Lean, red for errors/failure, green for success/completion, blue arrows for backtracking.
5. **State Tree Evolution:** The state trees at the top of the DFS panel visually map the search history, showing explored nodes, dead ends (✗), and the final successful path (✓).
### Interpretation
This diagram explains a methodology for **neural theorem proving**. It demonstrates how to bridge the gap between the flexible, pattern-matching capabilities of LLMs and the rigorous, step-by-step verification of formal systems.
* **The Fine-Tuning Stage** teaches the LLM the "shape" of valid proofs by training it on sequences of states and tactics from a dataset. The model learns to predict the next step in a proof.
* **The Inference Stage** addresses the LLM's fallibility. Instead of relying on a single, potentially incorrect prediction, it uses the LLM as a heuristic tactic generator within a classical search algorithm (DFS). The formal verifier (Lean) acts as an oracle, providing ground truth on whether a proposed step is valid. This makes the system robust; a wrong guess doesn't fail the entire proof but simply causes the search to backtrack and try an alternative.
The key insight is that **LLMs can accelerate formal proof search by proposing promising tactics**, while **formal verification ensures correctness and guides the search away from dead ends**. The "no LLM execution at this step" note in Step 4 is particularly insightful—it suggests the system may cache or reuse previously generated tactics, optimizing the search process. This hybrid approach leverages the strengths of both neural networks (intuition, creativity) and symbolic AI (rigor, reliability) for a complex reasoning task.
</details>
(a) Conventional system with depth-first search
<details>
<summary>x3.png Details</summary>

### Visual Description
## Diagram: LLM Fine-Tuning and TrialMaster Inference Process
### Overview
The image is a technical diagram illustrating two related processes for using Large Language Models (LLMs) in formal theorem proving. The top section depicts a standard "LLM Fine-Tuning" pipeline. The bottom, more detailed section illustrates a multi-step inference process called "TrialMaster Inference," which incorporates backtracking and interaction with a proof assistant (Lean).
### Components/Axes
The diagram is divided into two primary horizontal sections, separated by a double line.
**Top Section: LLM Fine-Tuning**
* **Left Side:** A flow starts with the text `theorem (i.e., state 0)` pointing to a yellow box labeled `LLM`.
* **Middle:** The `LLM` box points to `predicted proof (e.g., state 0; tactic 1; ...; state 4)`.
* **Right Side:** A dashed box contains the `ground truth`. Inside:
* A text sequence: `state 0; tactic 1; state 1; tactic 2; state 2; backtrack to state 0; tactic 3; state 3; tactic 4; state 4.`
* A small state transition graph showing nodes 0, 1, 2, 3, 4. Arrows show transitions: 0→1, 1→2, 2→0 (blue backtrack arrow), 0→3, 3→4. Node 4 has a green checkmark.
* **Connection:** An orange arrow labeled `compute loss` points from the `ground truth` box back to the `predicted proof` text.
**Bottom Section: TrialMaster Inference**
* **Legend (Top-Left):** A gray box defines symbols: `(k) state k → tactic → backtrack`. It shows a circle with a number (state), a black arrow (tactic application), and a blue arrow (backtrack).
* **Main Process (Five Sequential Columns):** Each column represents a step in the inference process, showing a state diagram at the top and a detailed flow below.
* **Column 1:**
* *Top Diagram:* Single node `0`.
* *Flow:* `state 0` → `LLM` (yellow box) → `tactic 1` → `Lean` (blue box) → `state 1`. A curved arrow loops from `state 1` back to the `Lean` box input.
* **Column 2:**
* *Top Diagram:* Nodes `0` and `1` with a black arrow from 0 to 1.
* *Flow:* `state 0; tactic 1; state 1` → `LLM` → `tactic 2` → `Lean` → `state 2`. A curved arrow loops from `state 2` back to the `Lean` box input.
* **Column 3:**
* *Top Diagram:* Nodes `0`, `1`, `2` with black arrows 0→1, 1→2.
* *Flow:* `state 0; tactic 1; tactic 2; state 2` → `LLM` → `"backtrack to state 0"` (blue text).
* **Column 4:**
* *Top Diagram:* Nodes `0`, `1`, `2` with black arrows 0→1, 1→2, and a blue arrow from 2 back to 0.
* *Flow:* `state 0; ...; state 2; backtrack to state 0` → `LLM` → `tactic 3` → `Lean` → `state 3`. A curved arrow loops from `state 3` back to the `Lean` box input.
* **Column 5:**
* *Top Diagram:* Nodes `0`, `1`, `2`, `3`, `4`. Black arrows: 0→1, 1→2, 0→3, 3→4. Blue arrow: 2→0. Node 4 has a green checkmark.
* *Flow:* `state 0; ...; state 3` → `LLM` → `tactic 4` → `Lean` → `state 4: "complete"` (green text).
### Detailed Analysis
The diagram contrasts two methodologies:
1. **LLM Fine-Tuning (Top):** This is a standard supervised learning setup. The model is trained to minimize the loss between its entire predicted proof sequence and a fixed ground truth sequence. The ground truth includes a backtrack step, indicating the training data contains non-linear proof attempts.
2. **TrialMaster Inference (Bottom):** This is an interactive, step-by-step generation process.
* **State:** Represented by circled numbers (0, 1, 2...). It is the current goal or context in the proof.
* **Tactic:** An action proposed by the LLM to advance the proof from the current state.
* **Lean:** The interactive proof assistant that executes the tactic on the current state, producing a new state.
* **Backtrack:** A critical operation where the LLM can decide to revert to a previous state (e.g., from state 2 back to state 0) if it determines the current path is unfruitful. This is shown with a blue arrow in the state diagrams and blue text in the flow.
* **Process Flow:** The LLM receives the current proof state/history, proposes a tactic or a backtrack command. If a tactic is given, Lean executes it to produce a new state. This repeats until a "complete" state is reached.
### Key Observations
* The **state diagrams** at the top of each inference column provide a visual summary of the proof search tree being built. The final diagram (Column 5) matches the structure shown in the `ground truth` of the fine-tuning section.
* The **curved arrows** in the first, second, and fourth inference columns suggest a feedback loop where the new state is fed back into the process for the next LLM call.
* The **color coding** is consistent: Yellow for LLM, Blue for Lean/Backtrack text, Green for success ("complete" and checkmark).
* The **"backtrack to state 0"** command in Column 3 is generated directly by the LLM, not by Lean, indicating the model learns to make strategic control decisions.
### Interpretation
This diagram explains a sophisticated method for improving LLM-based theorem proving. The core insight is that **proof search is not a single, linear forward pass**. The "TrialMaster Inference" system explicitly models and enables:
1. **Interactive Refinement:** The LLM interacts with a formal verifier (Lean) at each step, grounding its predictions in a rigorous environment.
2. **Strategic Backtracking:** The model can recognize dead ends and retreat to an earlier, promising state, mimicking human proof-search behavior. This is a significant advancement over models that can only generate a single, monolithic proof script.
3. **Learning from Non-Linear Proofs:** The fine-tuning data (top section) includes backtracking steps, teaching the model that proofs can involve exploration and correction. The inference process (bottom section) then operationalizes this ability.
The relationship between the two sections is causal: the fine-tuning process trains the LLM to understand proof states, tactics, *and* backtracking, which is then deployed in the more complex, iterative TrialMaster inference loop. The ultimate goal is to produce a verified, complete proof (state 4) by dynamically navigating a proof tree.
</details>
(b) Our training and inference methodology
Figure 2: Method comparison. (a) A conventional system: The tactic generator (i.e., LLM) is fine-tuned on correct proof paths only. During inference, the trained tactic generator produces $N_{\text{sampled}}$ (e.g., 2 in the example) tactics at a time. If Lean decides that the current tactic is wrong, the system backtracks to the last valid state and tries other candidate tactics. (b) Our methodology: The tactic generator is fine-tuned on proofs with trial-and-error. During inference, we take the first tactic it generates and feed that into Lean for state checking at each step.
## 2 Related Work
Automated Theorem Proving. Automated Theorem Proving has evolved significantly since its inception, focusing mainly on developing computer programs that can autonomously prove mathematical theorems. Early ATP systems (mechanical theorem proving) were based on first-order logic Clocksin and Mellish (2003); Chang and Lee (2014), where the resolution method Robinson (1965) played a crucial role. Recent progress in Automated Theorem Proving has been marked by the integration of machine learning Bansal et al. (2019); Davies et al. (2021); Wagner (2021), especially LLMs Yang et al. (2023); Polu and Sutskever (2020); Han et al. (2021); Welleck et al. (2021); Jiang et al. (2022), and heuristic methods Holden and Korovin (2021), aimed at amplifying the efficiency and capacity of Automated Theorem Proving systems. Within the domain of LLMs, formal mathematical languages like Metamath Megill and Wheeler (2019), Lean de Moura et al. (2015), Isabelle Nipkow et al. (2002), and Coq Barras et al. (1997), serve as a bridge, enabling the precise expression and verification of mathematical theorems and concepts through a computer-verifiable format, thereby mitigating hallucination risks Nawaz et al. (2019). COPRA incorporates backtracking information into a prompt and sends the prompt to GPT-4 without fine-tuning it to perform proof search Thakur et al. (2023). Baldur fine-tunes LLMs with proofs and error information given by the proof assistant First et al. (2023). In contrast, our work focuses on fine-tuning LLMs with the complete past proof history without using the error message from the proof assistant.
Propositional Logic Problem. Early implementations of ATP systems demonstrated the potential for computers to automate logical deductions, with notable examples including the Logic Theorist Crevier (1993); McCorduck (2004); Russell and Norvig (2010) and Gilmore’s program Davis (2001); Gilmore (1960). These systems laid the groundwork for the resolution of propositional logic problems, showcasing the ability of automated systems to handle logical reasoning tasks. Recent advancements in Automated Theorem Proving have revisited propositional logic problems, integrating modern computational techniques. Sekiyama and Suenaga (2018) have employed Deep Neural Networks (DNNs) as a statistical approach to generate proofs for these theorems, while Kusumoto et al. (2018) have explored graph representations coupled with reinforcement learning to find proofs. Furthermore, sequence-to-sequence neural networks have been applied for deriving proof terms in intuitionistic propositional logic Sekiyama et al. (2017). This area of research is particularly intriguing due to the simplicity and importance of propositional logic in mathematics, and there is a growing interest in evaluating the capability of LLMs in tackling this specific mathematical domain.
Trial-and-Error. The Chain-of-Thought (CoT) approach Wei et al. (2022); Wang et al. (2022); Zhou et al. (2022); Fu et al. (2022); Chu et al. (2023); Yu et al. (2023) demonstrates that LLMs can be guided to perform step-by-step reasoning by incorporating intermediate reasoning steps in their prompts. This concept is expanded in later research, such as the Tree of Thoughts (ToT) Yao et al. (2023), which organizes reasoning into a tree structure, and the Graph of Thoughts (GoT) Besta et al. (2023), which adopts a graph format for thought structuring. Trial-and-error complements structured reasoning by allowing the model to empirically test the hypotheses it generates, thereby refining its reasoning based on feedback from interactions or simulations. The Boosting of Thoughts (BoT) prompting framework Chen et al. (2024) iteratively explores and evaluates multiple trees of thoughts to gain trial-and-error reasoning experience, using error analysis from the LLM to revise the prompt. Lyra introduces correction mechanisms to improve the performance of the tactic generator Zheng et al. (2023).
## 3 PropL: A New Dataset for Intuitionistic Propositional Logic Theorems in Lean
Our aim is to experimentally validate that trial-and-error information can enhance a model’s ability to backtrack and generate tactics for theorem-proving tasks. Since existing open-source theorem-proving datasets lack information on trial-and-error processes, we developed PropL, a dataset based on theorems of intuitionistic propositional logic. PropL uniquely includes proofs that encapsulate the complete search process, incorporating the trial-and-error data generated by the FPS algorithm. Our dataset has two further benefits. First, it is formalized in Lean, so the validity of the theorems and proofs is guaranteed, and tactics generated by a model trained on PropL can be sent directly to Lean for checking. Second, PropL is representative of all intuitionistic propositional logic theorems: by uniformly sampling integers, we can use a bijection between natural numbers and propositions to uniformly sample propositions. This bijection is explained in the Theorem Generation section.
### 3.1 Data Generation of PropL
PropL comprises theorems uniformly sampled from the entire set of propositional logic theorems, and it includes several proof types for each theorem. We describe only the proof types used in this paper; for additional information about the dataset, please refer to the GitHub repository and Huggingface.
The construction of PropL involves two primary stages: generating propositional logic theorems, and generating proofs for these theorems with an existing algorithm.
Theorem Generation. Consider the set of propositions $A$ with at most $p$ atomic propositions. $A$ can be inductively generated by the following grammar:
$$
A,B::=P_i\mid T\mid F\mid A∧ B\mid A∨ B\mid A→ B,
$$
where $P_i$ is the $i$-th atomic proposition with $1≤ i≤ p$, $T$ stands for True, and $F$ stands for False. Note that an atomic proposition is simply a statement such as "Today is sunny"; the concrete meaning of the statement is irrelevant to the theorem-proving task in this paper. A connective $∧$, $∨$, or $→$ is called an internal node of the proposition. We assign the following lexicographic order to propositions:
1. Number of internal nodes (increasing order)
1. Number of internal nodes of the left child (decreasing order)
1. Top level connective, $T<F<P_1<⋯<P_p<∧<∨<→$
1. The left child, compared recursively using criteria 2-5
1. The right child, compared recursively using criteria 2-5
For a fixed upper bound $p$ for the number of atomic propositions, we establish a bijection between the natural numbers and the set of propositional logic formulas. The counting can be made efficient using the Catalan Numbers Atkinson and Sack (1992). Figure 3 gives an example of mapping between propositions and natural numbers. Details about the encoding and decoding algorithms are provided in Appendix B.
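As a sketch of the counting that underlies this bijection (our own illustration, not the paper's Algorithm 1): a proposition with $n$ internal nodes has one of $C_n$ tree shapes (the $n$-th Catalan number), each internal node carries one of the 3 connectives, and each of the $n+1$ leaves is one of $T$, $F$, or $P_1,…,P_p$:

```python
from math import comb

def catalan(n: int) -> int:
    # n-th Catalan number: number of binary-tree shapes with n internal nodes
    return comb(2 * n, n) // (n + 1)

def num_propositions(n: int, p: int) -> int:
    # shapes * connective choices per internal node * leaf choices per leaf
    return catalan(n) * 3 ** n * (p + 2) ** (n + 1)

# e.g. with p = 2 there are p + 2 = 4 propositions with zero internal nodes:
# T, F, P1, P2
```

Ranking propositions with fewer internal nodes first (criterion 1 above), and breaking ties with the remaining criteria, then assigns each proposition a unique natural-number ID, as in Figure 3.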
<details>
<summary>x4.png Details</summary>

### Visual Description
## Diagram: Tree Representation of a Logical Theorem
### Overview
The image is a technical diagram illustrating the structural decomposition of a propositional logic theorem into a tree format, along with a conceptual encoding/decoding process to a unique identifier. The diagram is composed of three primary sections: a logical formula at the top, a tree representation in a central box, and an encoding/decoding process with an ID at the bottom.
### Components/Axes
The diagram contains the following textual and symbolic components, listed with their spatial grounding:
1. **Top Section (Header):**
* A centered logical formula: `((P₁ ∨ P₂) → F) → ((P₁ → F) ∧ (P₂ → F))`
* A double-headed vertical arrow (`↕`) connects this formula to the box below.
2. **Central Section (Main Diagram):**
* A rectangular box with a light gray background.
* **Label (Top-Right inside box):** `tree representation of theorem`
* **Tree Structure:** A hierarchical tree with circular nodes connected by lines.
* **Root Node (Top Center):** Contains the symbol `→` (implication).
* **Internal Nodes (White circles):** Contain logical operators: `→`, `∨` (disjunction), `∧` (conjunction).
* **Leaf Nodes (Gray circles):** Contain propositional variables: `P₁`, `P₂`, and `F`.
* **Tree Hierarchy (from root down):**
* The root `→` has two children:
* Left child: An internal node `→`.
* Right child: An internal node `∧`.
* The left `→` node has two children:
* Left child: An internal node `∨`.
* Right child: A leaf node `F`.
* The `∨` node has two children: leaf nodes `P₁` and `P₂`.
* The right `∧` node has two children:
* Left child: An internal node `→`.
* Right child: An internal node `→`.
* The left `→` under `∧` has children: leaf nodes `P₁` and `F`.
* The right `→` under `∧` has children: leaf nodes `P₂` and `F`.
3. **Bottom Section (Footer):**
* **Left Side:** The word `encoding` with a downward-pointing arrow (`↓`) below it.
* **Right Side:** The word `decoding` with an upward-pointing arrow (`↑`) below it.
* **Bottom Center:** The text `ID: 760387005`.
### Detailed Analysis
The diagram explicitly maps the syntactic structure of the logical theorem `((P₁ ∨ P₂) → F) → ((P₁ → F) ∧ (P₂ → F))` to a tree data structure.
* **Formula to Tree Mapping:** The tree is a direct parse tree for the formula.
* The main connective (the outermost `→`) becomes the root.
* The left sub-formula `(P₁ ∨ P₂) → F` becomes the left subtree.
* The right sub-formula `(P₁ → F) ∧ (P₂ → F)` becomes the right subtree.
* This decomposition continues recursively until reaching the atomic propositions (`P₁`, `P₂`, `F`) as leaves.
* **Visual Coding:** A clear visual distinction is made between operator nodes (white circles) and operand/variable nodes (gray circles).
* **Process Flow:** The arrows labeled `encoding` (pointing down to the ID) and `decoding` (pointing up from the ID) suggest a bidirectional process. This implies the tree structure (and thus the theorem) can be transformed into the unique numerical identifier `760387005`, and vice-versa.
### Key Observations
1. **Structural Fidelity:** The tree is a faithful and complete representation of the formula's syntax. Every operator and variable from the linear string formula has a corresponding node in the tree.
2. **Balanced Representation:** The right side of the tree (under the `∧` node) is perfectly symmetrical, representing the two similar implications `(P₁ → F)` and `(P₂ → F)`.
3. **Unique Identifier:** The presence of a specific ID (`760387005`) suggests this diagram may be part of a larger system for cataloging, indexing, or processing logical theorems computationally.
### Interpretation
This diagram serves as a pedagogical or technical illustration of how symbolic logic can be structured for computational processing.
* **What it demonstrates:** It shows the transformation of a human-readable logical statement into a structured, hierarchical format (the tree) that is more amenable to algorithmic manipulation. The encoding/decoding step further abstracts this structure into a compact, unique identifier, which is a common practice in computer science for efficient storage, retrieval, and comparison of complex data structures.
* **Relationships:** The core relationship is the equivalence between the linear formula, its tree representation, and its encoded ID. The tree acts as an intermediate, explicit representation of the formula's syntactic dependencies.
* **Notable Implications:** The specific theorem shown is a tautology in classical logic (it's always true). The diagram might be used in contexts like automated theorem proving, formal verification, or the study of logic in computer science, where representing and manipulating such formulas efficiently is crucial. The ID implies a system where theorems are treated as discrete, addressable objects.
</details>
Figure 3: An illustration of the bijection between a proposition and a natural number, where gray nodes are leaf nodes. ID is computed using Algorithm 1 with $n=6$ and $p=2$ in this case.
Proof Generation. Given a randomly sampled theorem, its proof is constructed using the focusing method with polarization McLaughlin and Pfenning (2009); Liang and Miller (2009); Pfenning (2017). Proof search is divided into two stages: inversion and chaining. The inversion phase mechanically breaks down negative connectives (e.g., implications) in the goal and positive connectives (e.g., disjunctions) in the premises. After inversion, chaining picks an implication in the premises or shows one of the disjuncts in the conclusion, with backtracking. The proof search terminates when the same atomic proposition appears in both the premises and the conclusion. Examples of a proof with trial-and-error information (backtracking) and with that information removed are shown in Figure 4.
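As a concrete illustration (our own sketch in Lean 4 tactic syntax; the theorem name is ours), the successful path of the Figure 4 example corresponds to the following Lean proof:

```lean
theorem disj_example (p₂ p₃ : Prop) : p₂ → p₃ ∨ p₂ := by
  intro h₁          -- inversion: break down the implication in the goal
  exact Or.inr h₁   -- chaining: show the right disjunct from the premise h₁
```

The abandoned attempt `apply Or.inl` would leave the unprovable goal `p₃`; it is exactly this kind of failed branch that the proofs with trial-and-error information record.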
Once proofs are generated, we use them to fine-tune models and start the proof search on the test set.
The polarization of the connectives affects the behavior of the inversion and the search procedure. We choose to uniformly polarize conjunctions that occur negatively (e.g. on the right-hand side of an arrow) as negative and polarize conjunctions that occur positively (e.g. on the left-hand side of an arrow) as positive. Atomic propositions are assigned polarities based on the connective that they first appear under.
To improve the runtime of the search procedure, we make the additional assumption that once an implication is picked, it cannot be used to show its own premise. In theory this introduces incompleteness into the search procedure, but in practice it affected only 1 theorem out of around 1,000 randomly sampled provable theorems.
### 3.2 Construction of Training and Test Sets
In this section, we explain how we construct the datasets for training and evaluation. We want to avoid training and testing on similar data. Moreover, to test model performance on harder out-of-distribution (OOD) tasks, we ensure that the proofs in the training data are shorter than the proofs in the test data.
Given PropL, we fix the number of internal nodes in each theorem statement to 16 (explained in the data generation section). We then uniformly sample 200,000 theorems from PropL using the integer-theorem bijection explained above. This ensures that the theorems studied in this paper are representative of propositional logic theorems in general.
We first apply our deterministic algorithm to generate proofs of the 200,000 theorems and then remove the trial-and-error information from those proofs, obtaining a word-length distribution over proofs without trial-and-error information. Next, to ensure the diversity of the trial-and-error information, we randomly select propositions to focus on during the chaining phase of the proof search and generate 10 different proofs with backtracking for each theorem. Averaging the lengths of these 10 proofs yields a second word-length distribution, over proofs with trial-and-error information.
We then split the 200,000 theorems into training and test sets based on the two word-length distributions above. The proofs of the training theorems (109,887 in total) fall within the lower 0.66 quantile of both distributions, as do the proofs of the in-distribution test set (1,000 in total). The proofs of the out-of-distribution test theorems (1,000 in total) lie above the 0.8 quantile of both distributions. We note that the baseline model and TrialMaster are trained on proofs of the same set of theorems.
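The quantile-based split can be sketched as follows (our own illustration of the procedure; the helper names and the exact threshold logic are assumptions, not the paper's code):

```python
def quantile(xs, q):
    # linear-interpolation quantile of a list of numbers
    s = sorted(xs)
    pos = (len(s) - 1) * q
    i, frac = int(pos), pos - int(pos)
    return s[i] if i + 1 >= len(s) else s[i] + frac * (s[i + 1] - s[i])

def split_by_proof_length(lens_plain, lens_tae, train_q=0.66, ood_q=0.8):
    """Index theorems by proof word length.

    lens_plain: word length of each proof without trial-and-error
    lens_tae:   average word length of the 10 proofs with trial-and-error
    """
    lo_p, lo_t = quantile(lens_plain, train_q), quantile(lens_tae, train_q)
    hi_p, hi_t = quantile(lens_plain, ood_q), quantile(lens_tae, ood_q)
    # training set and in-distribution test set are drawn from train_pool
    train_pool = [i for i, (a, b) in enumerate(zip(lens_plain, lens_tae))
                  if a <= lo_p and b <= lo_t]
    # out-of-distribution test set is drawn from ood_pool
    ood_pool = [i for i, (a, b) in enumerate(zip(lens_plain, lens_tae))
                if a > hi_p and b > hi_t]
    return train_pool, ood_pool
```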
<details>
<summary>x5.png Details</summary>

### Visual Description
## Diagram: Comparison of Proof Strategies With and Without Backtracking
### Overview
The image is a technical diagram comparing two approaches to constructing a formal logical proof: one that proceeds linearly without backtracking and one that incorporates a backtracking step to explore an alternative path. The diagram is split vertically by a dashed line, with the left side titled "Without Backtracking" and the right side titled "With Backtracking". Both sides begin with the same initial logical statement and use a series of "Tactics" to reach a "Complete" state.
### Components/Axes
The diagram is a flowchart-style state transition diagram. There are no traditional chart axes. The key components are:
* **Titles:** "Without Backtracking" (top left), "With Backtracking" (top right).
* **States:** Represented by text blocks, often with a label (e.g., "Initial State", "State 1") and a logical sequent.
* **Tactics:** Represented by text in pinkish-red boxes (e.g., "Intro: h₁", "Apply Or.inr"). These are the actions that transform one state into the next.
* **Arrows:** Solid black arrows indicate the forward progression of the proof. A dashed blue arrow labeled "backtrack" indicates a return to a previous state to try a different tactic.
* **Completion Indicator:** A green box labeled "Complete" signifies the end of a successful proof.
* **Grouping Label:** A bracket on the left side groups the tactics under the label "Tactics".
### Detailed Analysis
The proof attempts to demonstrate the logical theorem: `⊢ p₂ → p₃ ∨ p₂`.
**Left Side: Without Backtracking**
1. **Initial State:** `⊢ p₂ → p₃ ∨ p₂`
2. **Tactic Applied:** `Intro: h₁` (Introduces hypothesis `h₁: p₂`).
3. **State 1:** `h₁ : p₂ ⊢ p₃ ∨ p₂`
4. **Tactic Applied:** `Apply Or.inr` (Chooses to prove the right disjunct, `p₂`).
5. **State 2:** `h₁ : p₂ ⊢ p₂`
6. **Tactic Applied:** `Exact h₁` (Uses hypothesis `h₁` directly to close the goal).
7. **State 3 / Complete:** The proof is successfully finished.
**Right Side: With Backtracking**
1. **Initial State:** `⊢ p₂ → p₃ ∨ p₂` (Identical to left side).
2. **Tactic Applied:** `Intro h₁` (Identical to left side).
3. **State 1:** `h₁ : p₂ ⊢ p₃ ∨ p₂` (Identical to left side).
4. **First Tactic Attempt:** `Apply Or.inl` (Chooses to prove the left disjunct, `p₃`).
5. **State 2:** `h₁ : p₂ ⊢ p₃` (This state is highlighted with a light blue background, indicating it is a dead end or problematic path).
6. **Backtrack Action:** A dashed blue arrow labeled "backtrack" curves from State 2 back to State 1. This indicates the proof engine abandons the `Apply Or.inl` path.
7. **Alternative Tactic Applied:** From State 1, the tactic `Apply Or.inr` is now applied (this is the same tactic that was used directly on the left side).
8. **State 3:** `h₁ : p₂ ⊢ p₂` (This matches State 2 from the left-side proof).
9. **Tactic Applied:** `Exact h₁`.
10. **State 4 / Complete:** The proof is successfully finished.
### Key Observations
* **Identical Starting Point:** Both proofs begin with the same goal and initial tactic (`Intro`).
* **Divergent First Choice:** The critical difference is the first choice of disjunction elimination tactic. The left side correctly chooses `Or.inr` immediately. The right side first attempts `Or.inl`, which leads to an unprovable state (`⊢ p₃`).
* **Backtracking Mechanism:** The right side explicitly models the process of recognizing a failed path (`State 2`), returning to the decision point (`State 1`), and trying the alternative (`Or.inr`).
* **Outcome:** Both strategies ultimately succeed, but the backtracking strategy involves an extra, failed step. The "Without Backtracking" path is more efficient.
* **Visual Coding:** The use of a blue highlight on the dead-end state and a blue dashed arrow for the backtrack action clearly distinguishes the exploratory, non-linear nature of the right-side process.
### Interpretation
This diagram illustrates a fundamental concept in automated theorem proving and interactive proof assistants: **proof search strategy**.
* **What it demonstrates:** It shows that for a given goal, there may be multiple applicable tactics (here, `Or.inl` and `Or.inr` for proving a disjunction). A naive or uninformed search might try an incorrect path first.
* **The role of backtracking:** Backtracking is a control mechanism that allows a proof search to recover from poor tactical choices. It enables exhaustive exploration of the proof space by systematically trying alternatives when a path fails. The diagram visually argues that while backtracking guarantees finding a proof if one exists (given enough time and resources), a more intelligent or heuristic-driven strategy (like the one on the left that picks the correct tactic first) is more efficient.
* **Underlying logic:** The specific theorem `p₂ → p₃ ∨ p₂` is a tautology. The proof hinges on the fact that if `p₂` is true, then the disjunction `p₃ ∨ p₂` is true regardless of `p₃`. The tactic `Or.inr` exploits this by focusing on proving `p₂`, which is directly available from the hypothesis. The failed `Or.inl` path attempts to prove `p₃`, which is not supported by the available hypotheses (`h₁: p₂`), hence the dead end.
* **Broader context:** This is a microcosm of how complex proofs are built. In larger proofs, the search space is vast, and the choice of tactics, along with the ability to backtrack, becomes crucial for automation and user-guided proof development. The diagram serves as a clear pedagogical tool for explaining this core concept.
</details>
Figure 4: Two proofs for one propositional logic theorem with tactics and states in Lean.
## 4 Methodology
### 4.1 LLM Fine-Tuning
We utilize the training set formed in PropL for training both TrialMaster and the tactic generator in the DFS system. The numbers of theorems used in the training and test datasets are presented in Table 1.
LLM fine-tuning with trial-and-error. For each theorem in the training set, we randomly select two of the five shortest among its ten proofs with trial-and-error information and use them to train TrialMaster. Refer to Figure 2(b) for a description of this training process.
LLM fine-tuning in a DFS system. For the tactic generator of the DFS system, we employ the deterministic FPS algorithm to generate the proofs of the theorems in the training set. The trial-and-error information is removed from the proofs. The LLM is then fine-tuned on the proofs without trial-and-error information as the conventional methods do. Figure 2(a) illustrates the training process of the DFS system.
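Conceptually, removing trial-and-error amounts to keeping only the tactics on the path from the initial state to the final state and discarding abandoned branches. A minimal sketch (our own illustration; the trace format is an assumption):

```python
def strip_trial_and_error(trace, final_state):
    """Keep only the successful path of a proof search.

    trace: (from_state, tactic, to_state) triples in the order the search
    tried them; states are numbered in order of creation, so backtracking
    shows up as a later tactic applied to an earlier state.
    """
    parent = {to: (frm, tac) for frm, tac, to in trace}
    tactics = []
    state = final_state
    while state in parent:        # walk back from the final state to the root
        frm, tac = parent[state]
        tactics.append(tac)
        state = frm
    return tactics[::-1]
```

On the Figure 4 example, a trace that first tries `Or.inl`, backtracks, and then succeeds with `Or.inr` reduces to the three tactics of the clean proof.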
Table 1: The scale of training and test split in our PropL dataset.
| Split | # Theorems |
| --- | --- |
| Training set | 109,887 |
| In-distribution test set | 1,000 |
| Out-of-distribution test set | 1,000 |
### 4.2 Inference
Inference method of the model trained with trial-and-error. TrialMaster conducts inference on its own, without help from a backtracking system such as DFS or BFS. It outputs two kinds of tactics: Lean tactics and backtrack instructions. An example backtrack instruction is “no solution, return to state 2 [that leads to state 4]”, where state 4 is the current state. When TrialMaster searches for a proof on the test set, it is prompted with the full history path, including previous tactics, states, earlier backtracking, and failed search paths, and it outputs the entire remaining proof path. Nonetheless, we use only the first tactic in the output and employ Lean as a calculator to determine the next state, thereby ensuring the correctness of the state following the tactic. If the tactic output by TrialMaster is a backtrack instruction, the model is then prompted with the full proof-search history, including the backtrack instruction and the state that the instruction says to return to. Otherwise, the tactic and the current state are fed into Lean to produce the subsequent state; TrialMaster is then prompted with the entire proof tree, including the state Lean computed, and outputs a tactic again. This process repeats until Lean identifies that the proof is complete or any Lean error occurs. We note that TrialMaster outputs only one tactic at each state, using greedy search.
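This inference loop can be sketched as follows (our own illustration; `generate_tactic` and `apply_tactic` stand in for the fine-tuned model and the Lean checker, and the history format is an assumption):

```python
def run_trialmaster(init_state, generate_tactic, apply_tactic, max_steps=65):
    """Single-model proof search: the model itself emits backtrack instructions.

    generate_tactic(history) -> str: first tactic of the model's greedy output.
    apply_tactic(state, tactic) -> ("done" | "state" | "error", next_state).
    """
    states = [init_state]                 # states numbered 0, 1, 2, ...
    history = [("state", 0, init_state)]  # full search history fed to the model
    current = 0
    for _ in range(max_steps):
        tactic = generate_tactic(history)
        if tactic.startswith("no solution, return to state"):
            # backtrack instruction: record it and resume from the named state
            target = int(tactic.split("return to state")[1].split()[0])
            history.append(("backtrack", tactic))
            history.append(("state", target, states[target]))
            current = target
            continue
        status, next_state = apply_tactic(states[current], tactic)
        if status == "error":
            return False, history         # any Lean error ends the attempt
        history.append(("tactic", tactic))
        if status == "done":
            return True, history          # Lean reports the proof is complete
        states.append(next_state)
        current = len(states) - 1
        history.append(("state", current, next_state))
    return False, history
```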
Inference method of the DFS system. The DFS system has two hyperparameters: the temperature $t$ and the number of sampled tactics $N_{sampled}$. The temperature $t$ controls the diversity of the model outputs: as $t$ increases, the outputs become more varied. The second hyperparameter determines how many candidate tactics the tactic generator of the DFS system produces for a new proof state.
During inference, the LLM in the DFS system produces $N_{sampled}$ candidate tactics at each new state, making only one inference per proof state. If any two generated tactics for the same state are identical, we remove one of them for efficiency, and we also discard tactic suggestions that violate the grammar of Lean. The system tries untried tactics in depth-first order; if it exhausts all tactics for a state without finding a valid one, it returns to the parent state and continues with that state’s untried tactics. The overview is presented in Figure 2(a).
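For comparison with the loop above, the DFS baseline's search can be sketched as follows (again our own illustration; `sample_tactics` stands for the tactic generator after deduplication and grammar filtering):

```python
def dfs_prove(init_state, sample_tactics, apply_tactic, depth_limit=65):
    """Depth-first proof search over model-suggested tactics.

    sample_tactics(state) -> deduplicated, grammatical candidate tactics
    (at most N_sampled); one model inference per new proof state.
    apply_tactic(state, tactic) -> ("done" | "state" | "error", next_state).
    """
    def dfs(state, depth):
        if depth > depth_limit:
            return False
        for tactic in sample_tactics(state):
            status, nxt = apply_tactic(state, tactic)
            if status == "done":
                return True
            if status == "state" and dfs(nxt, depth + 1):
                return True
            # this branch failed: fall through to the next candidate,
            # i.e. backtrack to the current (parent) state
        return False
    return dfs(init_state, 0)
```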
To fully exploit the ability of the DFS system, we vary its hyperparameters, i.e., the temperature and the number of sampled tactics. We count how many times Lean is called to check tactics for both the DFS system and TrialMaster during the inference stage.
#### Why do we choose DFS over BFS?
While breadth-first search (BFS) is also popular for building neural provers in Automated Theorem Proving, we opted for DFS as our baseline in the context of propositional logic theorem proving. Because only a small number (around 20) of tactics is applicable at any step of the search for intuitionistic propositional logic theorems, DFS is more efficient than BFS without compromising the success rate.
## 5 Evaluation
### 5.1 Experimental Setup
Base LLM. We use Llama-2-7b-hf Touvron et al. (2023) as the backbone LLM for tactic generation. Models are trained on two A100 GPUs for a single epoch with batch size $4$. Huggingface Wolf et al. (2019) is used for fine-tuning the models, and vLLM Kwon et al. (2023) is used for efficient inference. The learning rate of all training runs is $2 \times 10^{-5}$.
Hyperparameters. In the experiments, we vary the temperature $t$ over $\{0.3, 0.7, 1.2, 1.5, 2.0\}$ and the number of sampled tactics $N_{sampled}$ over $\{2, 5, 10, 15, 20\}$. In all experiments with temperature $t$ below 1.5, only very few (fewer than 15) theorems were still under search when the total number of steps reached 65. For $t$ above 1.5, $N_{Lean}$ starts to increase dramatically: at temperature 2 with $N_{sampled}=20$, $N_{Lean}$ climbs to 32,171 by the time the total number of steps reaches 65. We therefore set 65 as the search-step limit to control time complexity.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Line Chart: Success Rate vs. Temperature for Two Methods
### Overview
The image is a 2D line chart comparing the performance of two methods, "TrialMaster" and "DFS," across a range of temperature values. The chart plots "Success Rate" on the vertical axis against "Temperature" on the horizontal axis. A single data point is shown for TrialMaster, while a connected line shows the performance trend for DFS across five temperature points.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis (Horizontal):**
* **Label:** "Temperature"
* **Scale:** Linear, ranging from 0 to 2.
* **Major Tick Marks:** 0, 0.5, 1, 1.5, 2.
* **Y-Axis (Vertical):**
* **Label:** "Success Rate"
* **Scale:** Linear, ranging from 0.6 to 1.
* **Major Tick Marks:** 0.6, 0.7, 0.8, 0.9, 1.
* **Legend:**
* **Position:** Bottom-right corner of the plot area.
* **Entry 1:** Orange square marker labeled "TrialMaster".
* **Entry 2:** Light blue line with open circle markers labeled "DFS".
* **Data Point Annotations:** Each data point has a numerical label placed directly above it. These numbers (e.g., 17993, 12749) likely represent sample sizes, identifiers, or another metric associated with that specific experimental condition.
### Detailed Analysis
**1. TrialMaster Data Series:**
* **Visual Trend:** A single, isolated data point. No trend can be inferred.
* **Data Point:**
* **Temperature (X):** 0
* **Success Rate (Y):** ~0.89 (visually estimated between 0.8 and 0.9, closer to 0.9).
* **Annotation:** "17993"
* **Marker:** Solid orange square.
**2. DFS Data Series:**
* **Visual Trend:** The line shows a clear, consistent upward slope from left to right, indicating that the Success Rate for the DFS method increases as Temperature increases.
* **Data Points (from left to right):**
* **Point 1:**
* Temperature (X): ~0.3
* Success Rate (Y): ~0.67 (visually estimated between 0.6 and 0.7).
* Annotation: "12749"
* **Point 2:**
* Temperature (X): ~0.7
* Success Rate (Y): ~0.77 (visually estimated between 0.7 and 0.8).
* Annotation: "14515"
* **Point 3:**
* Temperature (X): ~1.2
* Success Rate (Y): ~0.82 (visually estimated just above the 0.8 grid line).
* Annotation: "16844"
* **Point 4:**
* Temperature (X): ~1.5
* Success Rate (Y): ~0.83 (visually estimated slightly higher than the previous point).
* Annotation: "19034"
* **Point 5:**
* Temperature (X): ~1.8
* Success Rate (Y): ~0.88 (visually estimated just below the 0.9 grid line).
* Annotation: "31101"
### Key Observations
1. **Performance Comparison:** At Temperature = 0, the single TrialMaster point (~0.89) has a higher Success Rate than the first DFS point at Temperature ~0.3 (~0.67).
2. **DFS Improvement:** The DFS method shows a strong positive correlation between Temperature and Success Rate. Its performance improves from ~0.67 to ~0.88 as Temperature increases from ~0.3 to ~1.8.
3. **Convergence:** At the highest plotted temperature (~1.8), the DFS method's Success Rate (~0.88) approaches the performance level of the TrialMaster method at Temperature 0 (~0.89).
4. **Annotation Pattern:** The numerical annotations for DFS increase monotonically with both Temperature and Success Rate (12749 → 14515 → 16844 → 19034 → 31101). The annotation for TrialMaster (17993) falls within the range of the DFS annotations.
### Interpretation
This chart likely presents results from an experiment or simulation where "Temperature" is a control parameter influencing the performance of two different algorithms or processes ("TrialMaster" and "DFS"). The "Success Rate" is the primary performance metric.
* **What the data suggests:** The DFS method is highly sensitive to the Temperature parameter, exhibiting a clear performance benefit from higher temperatures. In contrast, TrialMaster is only shown at Temperature=0, where it performs very well. This could imply that TrialMaster is a specialized method designed for, or only effective at, zero temperature (perhaps a deterministic or baseline approach), while DFS is a stochastic or adaptive method that thrives with increased randomness or exploration (higher temperature).
* **Relationship between elements:** The chart sets up a direct comparison. It asks the viewer to consider whether the high, single-point performance of TrialMaster at T=0 is preferable to the tunable, improving performance of DFS. The increasing annotations for DFS might correlate with increased computational cost, sample size, or another resource alongside the performance improvement.
* **Unexplained elements:** The numerical annotations (e.g., 17993, 12749) are not defined in the chart; their meaning (e.g., iterations, dataset size, cost) is essential for a full technical understanding of the trade-offs involved. The chart does not provide a legend or axis label for these annotations, leaving their interpretation open.
</details>
(a) Temperature $t$
<details>
<summary>x7.png Details</summary>

### Visual Description
## Line Chart: Success Rate vs. Number of Sampled Tactics
### Overview
The image is a 2D line chart comparing the performance of two methods, "TrialMaster" and "DFS," based on their "Success Rate" as a function of the "# of Sampled Tactic." The chart includes a grid, labeled axes, data points with numerical annotations, and a legend.
### Components/Axes
* **Chart Type:** Line chart with markers.
* **X-Axis:** Labeled "# of Sampled Tactic". The axis has major tick marks and grid lines at values 0, 5, 10, 15, and 20.
* **Y-Axis:** Labeled "Success Rate". The axis has major tick marks and grid lines at values 0.6, 0.7, 0.8, 0.9, and 1.0.
* **Legend:** Located in the bottom-right quadrant of the chart area.
* An orange square symbol is labeled "TrialMaster".
* A light blue line with an open circle marker is labeled "DFS".
* **Data Series & Annotations:**
[Chart summary: success rate vs. number of sampled tactics $N_{sampled}$, with the number of Lean calls annotated at each marker. TrialMaster attains a success rate of about 0.89 with 17,993 Lean calls as a single point; the DFS curve rises from about 0.67 (13,167 Lean calls) through 0.79 (15,462), 0.82 (16,844), and 0.84 (17,047) to 0.85 (17,432) as $N_{sampled}$ grows to 20, approaching but not surpassing TrialMaster.]
</details>
(b) Sampled Tactics $N_{sampled}$
<details>
<summary>x8.png Details</summary>

[Chart summary: bar chart of the number of Lean calls (thousands) per method. TrialMaster is lowest at roughly 18,000; DFS ($t$=1.5, 20 tactics) and DFS ($t$=1.5, 10 tactics) are both roughly 19,500; DFS ($t$=1.8, 10 tactics) is by far the highest at roughly 31,000.]
</details>
(c) Search Cost
Figure 5: Experiment results on the OOD task. (a) We fix $N_{sampled}=10$ to see the impact of temperature on the DFS system. (b) We fix $t=1.2$ to see the impact of the number of sampled tactics. The number of Lean calls is noted beside each marker. (c) Comparison of $N_{Lean}$ between our method and the three DFS systems with the highest success rates. In summary, training with trial-and-error achieves a higher success rate at a relatively lower search cost than the DFS systems.
### 5.2 Evaluation Metrics
Proof search success rate. We use proof search success rate as our primary metric: the fraction of search attempts that succeed. A search attempt for a theorem is marked as successful if Lean outputs the state “no goal, the proof is complete”, meaning the theorem is proved after applying a tactic produced by the model. For TrialMaster, a search attempt ends and fails immediately when one of the following conditions holds: 1) the word length of the proof tree with trial-and-error exceeds $1500$ (to respect the context length of the model), or 2) the tactic produced by the model at any step induces a Lean error. For the conventional DFS system, a search attempt fails when one of the following conditions holds: 1) the word length of the proof tree without trial-and-error exceeds 1500, or 2) all tactics generated for the initial state have been explored and failed, or 3) the total number of search steps exceeds 65 (see Section 5.1 for the choice of this value). We note that the 1500-word limit is stricter for our method, which outputs entire proof paths including trial-and-error and is therefore more likely to hit the limit.
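The DFS baseline's termination conditions can be sketched as a search loop. The code below is a hedged illustration, not the paper's implementation: `apply_tactic` stands in for an actual Lean call, the state encoding is ours, and the word count is a rough stand-in for the 1500-word proof-tree limit.

```python
from typing import Callable, List, Optional

def dfs_proof_search(
    initial_state: str,
    generate_tactics: Callable[[str], List[str]],       # tactic generator (a fine-tuned LM in the paper)
    apply_tactic: Callable[[str, str], Optional[str]],  # stand-in for a Lean call: next state, "" if proved, None on error
    max_steps: int = 65,
    max_words: int = 1500,
):
    """Return (success, n_lean_calls, proof) under the DFS failure conditions."""
    n_lean = 0   # number of Lean calls (the search-cost metric N_Lean)
    steps = 0
    # Each stack frame: (state, untried tactics at that state, proof path so far).
    stack = [(initial_state, generate_tactics(initial_state), [])]
    while stack:
        state, tactics, proof = stack[-1]
        if not tactics:
            stack.pop()          # all tactics at this state failed: backtrack
            continue
        if steps >= max_steps:   # total search steps exceeded
            return False, n_lean, None
        tactic = tactics.pop(0)
        steps += 1
        n_lean += 1              # every tactic application is checked by Lean
        # Rough stand-in for the 1500-word proof-tree limit:
        if sum(len(t.split()) for t in proof) + len(tactic.split()) > max_words:
            return False, n_lean, None
        new_state = apply_tactic(state, tactic)
        if new_state is None:    # Lean error: try the next candidate tactic
            continue
        if new_state == "":      # no goals remain: the theorem is proved
            return True, n_lean, proof + [tactic]
        stack.append((new_state, generate_tactics(new_state), proof + [tactic]))
    return False, n_lean, None   # tactics for the initial state exhausted
```

With a toy state space, the returned `n_lean` counter shows how every tactic attempt, including failed ones, adds to the search cost.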
Search cost. We define a second metric to assess search cost: the total number of Lean calls for tactic checking during proof search over the entire test set, denoted $N_{Lean}$. Given the same proof search success rate, a lower $N_{Lean}$ indicates a more efficient proof-search system. Note that backtracking instructions from our method do not require a Lean call to check the state and, consequently, do not add search cost.
Table 2: Performance on in-distribution task. Both methods perform well for propositional logic.
| Method | $t$ | $N_{sampled}$ | Success Rate |
| --- | --- | --- | --- |
| DFS | 1.2 | 2 | 99.5% |
| DFS | 1.2 | 5 | 99.9% |
| DFS | 1.2 | 10 | 99.6% |
| DFS | 2.0 | 2 | 75.9% |
| DFS | 2.0 | 5 | 97.3% |
| DFS | 2.0 | 10 | 99.0% |
### 5.3 Results and Analysis
TrialMaster outperforms the conventional DFS system. We begin by evaluating the methods on the in-distribution test set. Table 2 shows that both our method and the DFS system perform exceptionally well, achieving a success rate of nearly 100% in most configurations. This suggests that Llama-7b effectively masters in-distribution intuitionistic propositional logic theorems. We then compare the methods on the out-of-distribution task; the results are presented in Figure 5. Our method with trial-and-error significantly outperforms the DFS system across various hyperparameter configurations. Additionally, we observe that feeding more proofs without trial-and-error into LLM fine-tuning does not further improve performance.
Impact of hyperparameters in the DFS system. As Figure 5 shows, on the OOD task the success rate of the DFS system rises as we increase the temperature $t$ or the number of sampled tactics $N_{sampled}$, but the search cost (reflected by $N_{Lean}$) rises as well. Specifically, increasing $N_{sampled}$ lets the DFS system explore a larger pool of candidate tactics during the search, leading to more Lean calls. In contrast, our method performs a greedy search, generating only one tactic for each new state. Likewise, as $t$ increases, the tactic generator of the DFS system produces more diverse tactics at each proof state, improving the system's performance but at a higher search cost.
Table 3: Ablation study: Comparison of TrialMaster and model trained without trial-and-error information on OOD task
| Method | Success Rate |
| --- | --- |
| TrialMaster | 88.7% |
| Model - proof w/o t.a.e. | 59.3% |
TrialMaster achieves high success rates at lower search cost. For a direct comparison of search costs, we plot the $N_{Lean}$ values of our method alongside those of the three DFS systems with the highest success rates among all the hyperparameters we experimented with, i.e., $t=1.5$, $N_{sampled}=20$ (87.2%); $t=1.5$, $N_{sampled}=10$ (86.1%); and $t=1.8$, $N_{sampled}=10$ (88.4%). This comparison is illustrated in Figure 5(c). Notably, the DFS systems that come closest to our model's performance exhibit significantly higher search costs. With $t=1.8$ and $N_{sampled}=10$, the DFS system's $N_{Lean}$ reaches 31,101, which is 72% higher than that of our method with trial-and-error, even though its success rate is 0.3% lower. The high search cost makes the DFS system with high temperatures unfavorable. These results demonstrate that training with trial-and-error produces higher-quality tactics, achieving a higher success rate at relatively lower search cost.
Model learns backtracking capability from trial-and-error data. In the experiments, we find that TrialMaster successfully acquires the backtracking capability from proofs with trial-and-error information: during TrialMaster's proof search for theorems in the test set, every backtracking instruction produced by the LLM adheres to the correct format and points to an existing state number.
Including failed search paths helps TrialMaster to learn. The following experiment shows that adding failed search paths to the training data for TrialMaster yields an overall gain. In this experiment, the model is trained to learn only the correct search paths and the backtracking instructions; it is not trained on the failed search paths (we do not compute the loss on the failed search paths during training). This model's proof success rate is 75.6%, well below TrialMaster's 88.7%. Its $N_{Lean}$ is 13,600, lower than TrialMaster's, which is expected since the model never learns to predict failed search paths. We attribute TrialMaster's higher proof search success rate to the failed search paths themselves: by strategically exploring search paths that may fail, TrialMaster gains a more comprehensive view of the problem, which then leads it to the correct search paths.
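The difference between the two training setups comes down to which tokens contribute to the fine-tuning loss. A minimal sketch of the masking step, using the common Hugging Face convention that label positions set to -100 are ignored by the cross-entropy loss; the token/mask representation here is illustrative, not the paper's actual training code.

```python
IGNORE_INDEX = -100  # Hugging Face convention: labels equal to -100 are skipped by the loss

def labels_with_failed_masked(token_ids, is_failed_path):
    """Ablation variant: tokens belonging to failed search paths contribute no loss.

    token_ids: token ids of a proof serialized with trial-and-error.
    is_failed_path: parallel list of bools, True where the token lies on a failed path.
    """
    return [IGNORE_INDEX if bad else tok for tok, bad in zip(token_ids, is_failed_path)]

def labels_full(token_ids):
    """TrialMaster variant: every token of the trial-and-error proof is supervised."""
    return list(token_ids)

# Toy example: six tokens, where positions 2-3 belong to a failed branch.
toks = [10, 11, 12, 13, 14, 15]
mask = [False, False, True, True, False, False]
print(labels_with_failed_masked(toks, mask))  # [10, 11, -100, -100, 14, 15]
```

Under this scheme both variants see the same input sequences; only the supervised positions differ.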
### 5.4 Ablation Study
To evaluate the effectiveness of training with trial-and-error, we craft an ablated version of our method in which the LLM is fine-tuned on data containing only the correct paths and performs inference in the same way as our method (i.e., producing one tactic at a time and applying Lean for state checking). We denote this ablated version as Model - proof w/o t.a.e. For both methods, we mark a search attempt as failed if a tactic induces a Lean error or the search exceeds the 1500-word limit. The results are shown in Table 3. The gap between the success rates of the two models is 29.4 percentage points, which is significant. This clearly shows that failed search states and trial-and-error information tremendously enhance the model's capability on theorem-proving tasks.
Table 4: Comparison of models trained on different lengths of proofs with trial-and-error on OOD task.
| Method | Success Rate |
| --- | --- |
| Model - short proof w/ t.a.e. | 88.7% |
| Model - long proof w/ t.a.e. | 72.4% |
### 5.5 Exploratory Study: Effect of Training Proof Length on Model Performance
Since the FPS algorithm behind the PropL dataset can generate multiple proofs of variable length, we conduct an exploratory study to assess the impact of proof length on model performance. We fine-tune two models using proofs with different lengths of trial-and-error information. For the first model, which is our TrialMaster, the training data is derived by randomly selecting two of the four shortest proofs among the ten available for each theorem in PropL. We denote it as Model - short proof w/ t.a.e. In contrast, the training data of the second model is formed by randomly selecting two of the ten available proofs for each theorem, irrespective of their lengths. We denote it as Model - long proof w/ t.a.e. For both models, we use greedy search so that they generate one tactic per state. We evaluate both models on our 1,000-theorem OOD test set. The results are shown in Table 4. The model trained on shorter proofs achieves a higher success rate. We attribute this to the fact that longer proofs carry more trial-and-error information, and too many failed search paths may lower the quality of the training data and thereby hurt the model's performance.
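The two training-data selections can be sketched as follows; function names and the length criterion (string length as a proxy for proof length) are illustrative choices of ours, not the released code.

```python
import random

def short_proof_training_set(proofs, k=2, pool=4):
    """'Model - short proof w/ t.a.e.' data: sample k of the pool shortest proofs per theorem."""
    return random.sample(sorted(proofs, key=len)[:pool], k)

def any_proof_training_set(proofs, k=2):
    """'Model - long proof w/ t.a.e.' data: sample k proofs regardless of length."""
    return random.sample(proofs, k)
```

Applied per theorem over its ten generated proofs, the first selector biases training toward concise trial-and-error traces while the second preserves the full length distribution.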
## 6 Conclusion and Future Work
In this paper, we study Automated Theorem Proving in formalized environments. We create a complete, scalable, and representative dataset of intuitionistic propositional logic theorems in Lean. We demonstrate that leveraging information from failed search states and backtracking not only teaches models how to backtrack effectively, but also helps in developing better tactics than those generated by models trained without access to backtracking insights. We release our datasets on GitHub and Hugging Face. The PropL dataset is available at https://huggingface.co/datasets/KomeijiForce/PropL. Model weights are available at https://huggingface.co/KomeijiForce/llama-2-7b-propositional-logic-prover. Generation code for theorems and proofs is available at https://github.com/ucsd-atp/PropL.
A natural extension of our research involves investigating whether trial-and-error information is beneficial for more general mathematical theorem-proving settings. Exploring this avenue could provide valuable insights into the effectiveness of our approach across broader mathematical domains.
## Limitations
One limitation of our study is that some proof attempts are forced to stop due to the prompt exceeding the context length of 1500 tokens. This constraint may potentially influence our results by truncating the available information during the proof search process.
Furthermore, our method was not evaluated on general mathematical theorems. This limitation arises from both the scarcity of proofs containing trial-and-error information in current math libraries and the intrinsic challenges associated with producing proofs, whether with or without backtracking, for general mathematical theorems in a formalized setting.
Automated theorem proving with LLMs is an emerging area in machine learning, and there is still a lack of LLM baselines to compare with our method. We establish a fundamental baseline, but cumulative work from the community is still needed to provide methods for comparison.
## Ethical Consideration
Our work trains large language models to automatically prove propositional logic theorems, which generally does not raise ethical concerns.
## Acknowledgments
This work is supported by the National Science Foundation under grants CCF-1955457 and CCF-2220892. This work is also sponsored in part by NSF CAREER Award 2239440, NSF Proto-OKN Award 2333790, as well as generous gifts from Google, Adobe, and Teradata.
## References
- Atkinson and Sack (1992) Michael D Atkinson and J-R Sack. 1992. Generating binary trees at random. Information Processing Letters, 41(1):21–23.
- Bansal et al. (2019) Kshitij Bansal, Christian Szegedy, Markus N Rabe, Sarah M Loos, and Viktor Toman. 2019. Learning to reason in large theories without imitation. arXiv preprint arXiv:1905.10501.
- Barras et al. (1997) Bruno Barras, Samuel Boutin, Cristina Cornes, Judicaël Courant, Jean-Christophe Filliatre, Eduardo Gimenez, Hugo Herbelin, Gerard Huet, Cesar Munoz, Chetan Murthy, et al. 1997. The Coq proof assistant reference manual: Version 6.1. Ph.D. thesis, Inria.
- Besta et al. (2023) Maciej Besta, Nils Blach, Ales Kubicek, Robert Gerstenberger, Lukas Gianinazzi, Joanna Gajda, Tomasz Lehmann, Michal Podstawski, Hubert Niewiadomski, Piotr Nyczyk, et al. 2023. Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv:2308.09687.
- Chang and Lee (2014) Chin-Liang Chang and Richard Char-Tung Lee. 2014. Symbolic logic and mechanical theorem proving. Academic Press.
- Chen et al. (2024) Sijia Chen, Baochun Li, and Di Niu. 2024. Boosting of thoughts: Trial-and-error problem solving with large language models. In The Twelfth International Conference on Learning Representations.
- Chu et al. (2023) Zheng Chu, Jingchang Chen, Qianglong Chen, Weijiang Yu, Tao He, Haotian Wang, Weihua Peng, Ming Liu, Bing Qin, and Ting Liu. 2023. A survey of chain of thought reasoning: Advances, frontiers and future. arXiv preprint arXiv:2309.15402.
- Clocksin and Mellish (2003) William F Clocksin and Christopher S Mellish. 2003. Programming in PROLOG. Springer Science & Business Media.
- Crevier (1993) Daniel Crevier. 1993. AI: the tumultuous history of the search for artificial intelligence. Basic Books, Inc.
- Davies et al. (2021) Alex Davies, Petar Veličković, Lars Buesing, Sam Blackwell, Daniel Zheng, Nenad Tomašev, Richard Tanburn, Peter Battaglia, Charles Blundell, András Juhász, et al. 2021. Advancing mathematics by guiding human intuition with AI. Nature, 600(7887):70–74.
- Davis (2001) Martin Davis. 2001. The early history of automated deduction. Handbook of automated reasoning, 1:3–15.
- de Moura et al. (2015) Leonardo de Moura, Soonho Kong, Jeremy Avigad, Floris Van Doorn, and Jakob von Raumer. 2015. The Lean theorem prover (system description). In Automated Deduction-CADE-25: 25th International Conference on Automated Deduction, Berlin, Germany, August 1-7, 2015, Proceedings 25, pages 378–388. Springer.
- First et al. (2023) Emily First, Markus Rabe, Talia Ringer, and Yuriy Brun. 2023. Baldur: Whole-proof generation and repair with large language models. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 1229–1241.
- Fu et al. (2022) Yao Fu, Hao Peng, Ashish Sabharwal, Peter Clark, and Tushar Khot. 2022. Complexity-based prompting for multi-step reasoning. arXiv preprint arXiv:2210.00720.
- Gilmore (1960) P. C. Gilmore. 1960. A proof method for quantification theory: Its justification and realization. IBM Journal of Research and Development, 4(1):28–35.
- Han et al. (2021) Jesse Michael Han, Jason Rute, Yuhuai Wu, Edward W Ayers, and Stanislas Polu. 2021. Proof artifact co-training for theorem proving with language models. arXiv preprint arXiv:2102.06203.
- Holden and Korovin (2021) Edvard K Holden and Konstantin Korovin. 2021. Heterogeneous heuristic optimisation and scheduling for first-order theorem proving. In International Conference on Intelligent Computer Mathematics, pages 107–123. Springer.
- Irving et al. (2016) Geoffrey Irving, Christian Szegedy, Alexander A Alemi, Niklas Eén, François Chollet, and Josef Urban. 2016. Deepmath-deep sequence models for premise selection. Advances in neural information processing systems, 29.
- Jiang et al. (2022) Albert Q Jiang, Sean Welleck, Jin Peng Zhou, Wenda Li, Jiacheng Liu, Mateja Jamnik, Timothée Lacroix, Yuhuai Wu, and Guillaume Lample. 2022. Draft, sketch, and prove: Guiding formal theorem provers with informal proofs. arXiv preprint arXiv:2210.12283.
- Kusumoto et al. (2018) Mitsuru Kusumoto, Keisuke Yahata, and Masahiro Sakai. 2018. Automated theorem proving in intuitionistic propositional logic by deep reinforcement learning. arXiv preprint arXiv:1811.00796.
- Kwon et al. (2023) Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626.
- Lample et al. (2022) Guillaume Lample, Timothee Lacroix, Marie-Anne Lachaux, Aurelien Rodriguez, Amaury Hayat, Thibaut Lavril, Gabriel Ebner, and Xavier Martinet. 2022. Hypertree proof search for neural theorem proving. Advances in Neural Information Processing Systems, 35:26337–26349.
- Liang and Miller (2009) Chuck C. Liang and Dale Miller. 2009. Focusing and polarization in linear, intuitionistic, and classical logics. Theor. Comput. Sci., 410(46):4747–4768.
- McCorduck (2004) Pamela McCorduck. 2004. Machines Who Think (2Nd Ed.). A. K. Peters.
- McLaughlin and Pfenning (2009) Sean McLaughlin and Frank Pfenning. 2009. Efficient intuitionistic theorem proving with the polarized inverse method. In Automated Deduction - CADE-22, 22nd International Conference on Automated Deduction, Montreal, Canada, August 2-7, 2009. Proceedings, volume 5663 of Lecture Notes in Computer Science, pages 230–244. Springer.
- Megill and Wheeler (2019) Norman Megill and David A Wheeler. 2019. Metamath: a computer language for mathematical proofs. Lulu.com.
- Mikuła et al. (2023) Maciej Mikuła, Szymon Antoniak, Szymon Tworkowski, Albert Qiaochu Jiang, Jin Peng Zhou, Christian Szegedy, Łukasz Kuciński, Piotr Miłoś, and Yuhuai Wu. 2023. Magnushammer: A transformer-based approach to premise selection. arXiv preprint arXiv:2303.04488.
- Nawaz et al. (2019) M Saqib Nawaz, Moin Malik, Yi Li, Meng Sun, and M Lali. 2019. A survey on theorem provers in formal methods. arXiv preprint arXiv:1912.03028.
- Nipkow et al. (2002) Tobias Nipkow, Markus Wenzel, and Lawrence C Paulson. 2002. Isabelle/HOL: a proof assistant for higher-order logic. Springer.
- Pfenning (2017) Frank Pfenning. 2017. Lecture notes on focusing.
- Polu and Sutskever (2020) Stanislas Polu and Ilya Sutskever. 2020. Generative language modeling for automated theorem proving. arXiv preprint arXiv:2009.03393.
- Rabe et al. (2020) Markus N Rabe, Dennis Lee, Kshitij Bansal, and Christian Szegedy. 2020. Mathematical reasoning via self-supervised skip-tree training. arXiv preprint arXiv:2006.04757.
- Robinson (1965) John Alan Robinson. 1965. A machine-oriented logic based on the resolution principle. Journal of the ACM (JACM), 12(1):23–41.
- Russell and Norvig (2010) Stuart J Russell and Peter Norvig. 2010. Artificial Intelligence: A Modern Approach. Pearson, London.
- Sekiyama et al. (2017) Taro Sekiyama, Akifumi Imanishi, and Kohei Suenaga. 2017. Towards proof synthesis guided by neural machine translation for intuitionistic propositional logic. arXiv preprint arXiv:1706.06462.
- Sekiyama and Suenaga (2018) Taro Sekiyama and Kohei Suenaga. 2018. Automated proof synthesis for propositional logic with deep neural networks. arXiv preprint arXiv:1805.11799.
- Stanley (2015) Richard P Stanley. 2015. Catalan numbers. Cambridge University Press.
- Thakur et al. (2023) Amitayush Thakur, Yeming Wen, and Swarat Chaudhuri. 2023. A language-agent approach to formal theorem-proving. arXiv preprint arXiv:2310.04353.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Wagner (2021) Adam Zsolt Wagner. 2021. Constructions in combinatorics via neural networks. arXiv preprint arXiv:2104.14516.
- Wang et al. (2023) Haiming Wang, Ye Yuan, Zhengying Liu, Jianhao Shen, Yichun Yin, Jing Xiong, Enze Xie, Han Shi, Yujun Li, Lin Li, et al. 2023. Dt-solver: Automated theorem proving with dynamic-tree sampling guided by proof-level value function. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12632–12646.
- Wang and Deng (2020) Mingzhe Wang and Jia Deng. 2020. Learning to prove theorems by learning to generate theorems. Advances in Neural Information Processing Systems, 33:18146–18157.
- Wang et al. (2017) Mingzhe Wang, Yihe Tang, Jian Wang, and Jia Deng. 2017. Premise selection for theorem proving by deep graph embedding. Advances in neural information processing systems, 30.
- Wang et al. (2022) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2022. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
- Welleck et al. (2021) Sean Welleck, Jiacheng Liu, Ronan Le Bras, Hannaneh Hajishirzi, Yejin Choi, and Kyunghyun Cho. 2021. Naturalproofs: Mathematical theorem proving in natural language. arXiv preprint arXiv:2104.01112.
- Wolf et al. (2019) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. 2019. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
- Wu et al. (2021) Minchao Wu, Michael Norrish, Christian Walder, and Amir Dezfouli. 2021. Tacticzero: Learning to prove theorems from scratch with deep reinforcement learning. Advances in Neural Information Processing Systems, 34:9330–9342.
- Yang et al. (2023) Kaiyu Yang, Aidan M Swope, Alex Gu, Rahul Chalamala, Peiyang Song, Shixing Yu, Saad Godil, Ryan Prenger, and Anima Anandkumar. 2023. LeanDojo: Theorem Proving with Retrieval-Augmented Language Models. arXiv preprint arXiv:2306.15626.
- Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601.
- Yu et al. (2023) Fei Yu, Hongbo Zhang, and Benyou Wang. 2023. Nature language reasoning, a survey. arXiv preprint arXiv:2303.14725.
- Zheng et al. (2023) Chuanyang Zheng, Haiming Wang, Enze Xie, Zhengying Liu, Jiankai Sun, Huajian Xin, Jianhao Shen, Zhenguo Li, and Yu Li. 2023. Lyra: Orchestrating dual correction in automated theorem proving. arXiv preprint arXiv:2309.15806.
- Zhou et al. (2022) Denny Zhou, Nathanael Schärli, Le Hou, Jason Wei, Nathan Scales, Xuezhi Wang, Dale Schuurmans, Claire Cui, Olivier Bousquet, Quoc Le, et al. 2022. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625.
## Appendix A Inference Rules for Intuitionistic Propositional Logic
Figure 6 shows the inference rules for intuitionistic propositional logic. Rules relate sequents of the form $Γ\vdash A$ , where $Γ$ is an unordered list of assumptions. A derivation is constructed from the axiom (Assumption) using the derivation rules until $Γ$ is empty.
For example, the ( $→$ -I) rule says that from a derivation of $Γ\vdash B$ under the assumption of $Γ,A$ , we can get a derivation of $Γ\vdash A→ B$ . The ( $→$ -E) rule says that from a derivation of $Γ\vdash A→ B$ and a derivation of $Γ\vdash A$ we can derive $Γ\vdash B$ .
- (Assumption) $\dfrac{}{Γ, A \vdash A}$
- (T-I) $\dfrac{}{Γ \vdash T}$
- (F-E) $\dfrac{Γ \vdash F}{Γ \vdash A}$
- ($∧$-I) $\dfrac{Γ \vdash A \quad Γ \vdash B}{Γ \vdash A ∧ B}$
- ($∧$-E1) $\dfrac{Γ \vdash A ∧ B}{Γ \vdash A}$
- ($∧$-E2) $\dfrac{Γ \vdash A ∧ B}{Γ \vdash B}$
- ($∨$-I1) $\dfrac{Γ \vdash A}{Γ \vdash A ∨ B}$
- ($∨$-I2) $\dfrac{Γ \vdash B}{Γ \vdash A ∨ B}$
- ($∨$-E) $\dfrac{Γ \vdash A ∨ B \quad Γ, A \vdash C \quad Γ, B \vdash C}{Γ \vdash C}$
- ($→$-I) $\dfrac{Γ, A \vdash B}{Γ \vdash A → B}$
- ($→$-E) $\dfrac{Γ \vdash A → B \quad Γ \vdash A}{Γ \vdash B}$
Figure 6: Natural Deduction Rules for Intuitionistic Propositional Logic
Here is an example derivation of $\texttt{p1} → \texttt{p1} ∨ \texttt{p2}$:
$$
\dfrac{\dfrac{\dfrac{}{\texttt{p1} \vdash \texttt{p1}}\ \text{(Assumption)}}{\texttt{p1} \vdash \texttt{p1} ∨ \texttt{p2}}\ (∨\text{-I1})}{\vdash \texttt{p1} → \texttt{p1} ∨ \texttt{p2}}\ (→\text{-I})
$$
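For reference, the same theorem goes through in Lean 4 tactic mode; this is a sketch, and the dataset's actual phrasing of theorems and tactics may differ.

```lean
-- →-I corresponds to `intro`, ∨-I1 to `left`, and Assumption to `exact`.
example (p1 p2 : Prop) : p1 → p1 ∨ p2 := by
  intro h
  left
  exact h
```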
## Appendix B Uniformly Distributed Data Explanation
In this section, we discuss the uniform characteristics of our dataset, particularly emphasizing the one-to-one mapping between propositions and natural numbers. This bijection allows us to simply sample from the natural numbers to ensure the dataset exhibits uniformity.
Algorithm 1 Encoding
1: Input: a tree $T$ representing a proposition, with $n$ the number of internal nodes and $p$ the number of atomic propositions
2: Output: a natural number
3: return $3^n(p+2)^{n+1} ×$ ShapeNumber($T$) $+$ AssignmentNumber($T$)
4: function ShapeNumber($T$)
5: if $T$ is a single node then return $0$
6: end if
7: $T_l, T_r ←$ left and right sub-trees of $T$
8: $n_l, n_r ←$ number of internal nodes in $T_l, T_r$ $\triangleright$ $n_l+n_r+1$ internal nodes in total
9: return $\sum_{i=1}^{n_l} C_{i-1} C_{n_l+n_r+1-i} + C_{n_r} ×$ ShapeNumber($T_l$) $+$ ShapeNumber($T_r$)
10: end function
11: function AssignmentNumber($T$)
12: $N ←$ NodeNumber(root node of $T$)
13: if $T$ is a single node then return $N$
14: end if
15: $T_l, T_r ←$ left and right sub-trees of $T$
16: $n_r ←$ number of internal nodes in $T_r$
17: $A_r ← 3^{n_r}(p+2)^{n_r+1}$ $\triangleright$ number of possible assignments in the right sub-tree
18: return $3A_r ×$ AssignmentNumber($T_l$) $+ A_r × N +$ AssignmentNumber($T_r$)
19: end function
20: function NodeNumber($N$)
21: switch $N$
22: case $∧$ or $T$: return $0$
23: case $∨$ or $F$: return $1$
24: case $→$: return $2$
25: case $P_i$: return $i+1$ $\triangleright$ $1 ≤ i ≤ p$
26: end function
### B.1 Catalan Number
The Catalan number $C_n$ is applicable for counting full binary trees that consist of $n+1$ leaf nodes, corresponding to exactly $n$ internal nodes Stanley (2015). Additionally, it can be calculated through recursion as shown:
$$
C_n = \sum_{i=1}^{n} C_{i-1}\, C_{n-i}.
$$
The first Catalan numbers for $n=0,1,2,3,4$ are $1,1,2,5,14$ . A concise interpretation of this recursion is as follows: it involves counting the number of internal nodes in the left sub-tree, amounting to $i-1$ , and then in the right sub-tree, amounting to $n-i$ , for each of the $n$ scenarios in computing $C_n$ .
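The recursion can be checked against the closed form $C_n = \binom{2n}{n}/(n+1)$ in a few lines of illustrative Python:

```python
from math import comb

def catalan(n: int) -> int:
    """C_n via the recursion C_n = sum_{i=1}^{n} C_{i-1} C_{n-i}, with C_0 = 1."""
    C = [1]
    for m in range(1, n + 1):
        C.append(sum(C[i - 1] * C[m - i] for i in range(1, m + 1)))
    return C[n]

# The first values match the text: 1, 1, 2, 5, 14
assert [catalan(n) for n in range(5)] == [1, 1, 2, 5, 14]
# Cross-check against the closed form C_n = binom(2n, n) / (n + 1)
assert all(catalan(n) == comb(2 * n, n) // (n + 1) for n in range(10))
```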
Algorithm 2 Decoding
1: A natural number ID, with $n$ indicating the number of internal nodes and $p$ representing the number of atomic propositions
2: Output a proposition tree
3: quotient, remainder $←$ DivMod (ID, $3^n(p+2)^n+1$ )
4: $S←$ TreeShape (quotient, $n$ )
5: $A←$ TreeAssignment ( $S$ , remainder)
6: return a tree with shape $S$ and assignment $A$
7: function TreeShape ( $N$ , $n$ )
8: if $n$ is $0 0$ then return a single node
9: end if
10: $n_l←\max(\{k \mid \sum_{i=1}^{k}C_{i-1}C_{n-i}≤ N\}∪\{0\})$
11: $n_r← n-n_l-1$
12: remaining $← N-\sum_{i=1}^{n_l}C_{i-1}C_{n-i}$
13: $N_l,N_r←$ DivMod (remaining, $C_{n_r}$ )
14: return a tree with left and right sub-trees shaped by TreeShape ( $N_l$ , $n_l$ ) and TreeShape ( $N_r$ , $n_r$ ), respectively
15: end function
16: function TreeAssignment ( $S$ , $N$ )
17: Perform an inorder traversal of the tree with shape $S$. Interpret each leaf node as a digit in base $p+2$ and each internal node as a digit in base 3, and recover the assignment whose value equals $N$, using the NodeNumber function introduced in Algorithm 1
18: end function
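A minimal executable sketch of the shape bijection (ShapeNumber from Algorithm 1 and TreeShape from Algorithm 2), with a leaf represented as `None` and an internal node as a pair of sub-trees; the representation and function names are ours:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def catalan(n):
    return 1 if n == 0 else sum(catalan(i - 1) * catalan(n - i) for i in range(1, n + 1))

def internal(t):
    """Number of internal nodes of a tree (None = leaf, (l, r) = internal node)."""
    return 0 if t is None else 1 + internal(t[0]) + internal(t[1])

def shape_number(t):
    if t is None:
        return 0
    left, right = t
    nl, nr = internal(left), internal(right)
    n = nl + nr + 1
    offset = sum(catalan(i - 1) * catalan(n - i) for i in range(1, nl + 1))
    return offset + catalan(nr) * shape_number(left) + shape_number(right)

def tree_shape(N, n):
    if n == 0:
        return None
    # Largest left-subtree size nl whose block of ranks starts at or below N.
    nl, acc = 0, 0
    while acc + catalan(nl) * catalan(n - nl - 1) <= N:
        acc += catalan(nl) * catalan(n - nl - 1)
        nl += 1
    nr = n - nl - 1
    Nl, Nr = divmod(N - acc, catalan(nr))
    return (tree_shape(Nl, nl), tree_shape(Nr, nr))

# Round trip: every rank in [0, C_n) decodes to a distinct tree that encodes back.
for n in range(5):
    assert sorted(shape_number(tree_shape(N, n)) for N in range(catalan(n))) == list(range(catalan(n)))
```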
### B.2 Bijection between Propositions and Natural Numbers
As depicted in Figure 3, every proposition corresponds to a unique tree representation. Consequently, it suffices to identify a bijection between full binary trees and natural numbers.
For every full binary tree possessing $n$ internal nodes and $p$ atomic propositions, there exist
$$
3^{\#\text{internal nodes}}(p+2)^{\#\text{leaf nodes}}=3^n(p+2)^{n+1}
$$
distinct cases. Observe that the choices available for internal nodes include conjunction ( $∧$ ), disjunction ( $∨$ ), and implication ( $→$ ); whereas for leaf nodes, the choices encompass true ( $T$ ), false ( $F$ ), and a set of propositions $P_1,…,P_p$ .
This counting enables an efficient ranking of all full binary trees with $n$ internal nodes, and the ranking establishes a bijection with the natural numbers: each full binary tree corresponds to a unique natural number. This sets the stage for two critical processes: encoding (see Algorithm 1), which maps a proposition tree to a natural number, and decoding (see Algorithm 2), which maps a natural number back to the corresponding proposition tree.
Having established the bijection between full binary trees and natural numbers, it becomes apparent that uniformly sampling from the set of natural numbers will, in turn, result in a uniform sampling of full binary trees.
The parameter $n$ in Algorithm 1 is not critical for the functionality of the algorithm, as $n$ can be counted from $T$; it is retained to emphasize that the basic encoding and decoding procedures are designed for a fixed number of internal nodes $n$. For example, ID 0 denotes different propositions depending on the specified value of $n$. Accordingly, the decoding procedure in Algorithm 2 uniformly generates proposition trees with a fixed $n$. To sample uniformly from a set of trees with various $n$, one can simply merge the rankings of full binary trees with different $n$, which is straightforward. The same merging applies to trees with varying $p$.
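The merging step can be sketched as follows: a fixed $n$ contributes $C_n \cdot 3^n(p+2)^{n+1}$ IDs, so drawing one global ID uniformly over the concatenated ranges and locating its block yields a size $n$ with the correct probability. A hypothetical Python sketch, with all names ours:

```python
import random

def catalan(n):
    # Closed-form recurrence C_{k+1} = C_k * 2(2k+1) / (k+2).
    c = 1
    for k in range(n):
        c = c * 2 * (2 * k + 1) // (k + 2)
    return c

def total_ids(n, p):
    # Propositions with n internal nodes: C_n shapes times 3^n (p+2)^(n+1) assignments.
    return catalan(n) * 3 ** n * (p + 2) ** (n + 1)

def sample_size(max_n, p, rng=random):
    # Draw n in [0, max_n] proportionally to the number of propositions of that size;
    # the returned gid is then a valid ID among the size-n propositions.
    totals = [total_ids(n, p) for n in range(max_n + 1)]
    gid = rng.randrange(sum(totals))
    for n, t in enumerate(totals):
        if gid < t:
            return n, gid
        gid -= t
```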
Since proof lengths are unknown at the time propositions are generated, we first sample propositions uniformly and then filter the selection, excluding propositions according to the proof lengths computed by the proof generation algorithm.
## Appendix C Examples
In Figure 7, we show an example Lean proof for a theorem in Figure 3. Lines preceded by '--' are comments solely for explanatory purposes.
variable (p1 p2 p3 p4 p5 : Prop)
example : (((p1 ∨ p2) → False) → ((p1 → False) ∧ (p2 → False))) := by
  -- Implications on the right can always be decomposed.
  -- Introduce an assumption h1 that says ((p1 ∨ p2) → False).
  intro h1
  -- Now we want to show ((p1 → False) ∧ (p2 → False)).
  -- Conjunctions on the right can always be decomposed.
  -- We then need to show (p1 → False) and (p2 → False) separately.
  apply And.intro
  -- We are showing (p1 → False).
  -- Implications on the right can always be decomposed.
  -- We introduce assumption h2 for p1, and we try to show False.
  intro h2
  -- We want to use the implication h1, so we show its premise.
  have h3 : (p1 ∨ p2) := by
    -- Show the left disjunct. (The right disjunct leads to a TAE.)
    apply Or.inl
    -- One of the premises coincides with the conclusion.
    exact h2
  -- We have shown the premise of h1 (p1 ∨ p2),
  -- so we can now derive its conclusion (False), denoted by h4.
  let h4 := h1 h3
  -- False on the left can always be used.
  apply False.elim h4
  -- We have shown (p1 → False) and now we show (p2 → False).
  -- Implications on the right can always be decomposed.
  -- We introduce assumption h5 for p2, and we try to show False.
  intro h5
  -- We want to use the implication h1, so we show its premise.
  have h6 : (p1 ∨ p2) := by
    -- Show the right disjunct. (The left disjunct leads to a TAE.)
    apply Or.inr
    -- One of the premises coincides with the conclusion.
    exact h5
  -- We have shown the premise of h1 (p1 ∨ p2),
  -- so we can now derive its conclusion (False), denoted by h7.
  let h7 := h1 h6
  -- False on the left can always be used.
  apply False.elim h7
Figure 7: An example of an intuitionistic propositional logic theorem with its proof in Lean