# Proving the Coding Interview: A Benchmark for Formally Verified Code Generation
**Authors**: Quinn Dougherty, Ronak Mehta
> Beneficial AI Foundation, quinn.dougherty92@gmail.com
> Independent, ronakrm@gmail.com
Abstract
We introduce the Formally Verified Automated Programming Progress Standards, or FVAPPS, a benchmark of 4715 samples for writing programs and proving their correctness, the largest formal verification benchmark to date, including 1083 curated and quality-controlled samples. Previously, APPS provided a benchmark and dataset for programming puzzles to be completed in Python and checked against unit tests, of the kind seen in technical assessments in the software engineering industry. Building upon recent approaches for benchmarks in interactive theorem proving, we generalize the unit tests to Lean 4 theorems given without proof (i.e., using Lean’s “sorry” keyword). On the 406 theorems of 100 randomly selected samples, Sonnet correctly proves 30% and Gemini correctly proves 18%. We challenge the machine learning and program synthesis communities to solve each general-purpose programming problem and prove its associated correctness specifications. The benchmark is available at https://huggingface.co/datasets/quinn-dougherty/fvapps.
Index Terms: theorem proving, machine learning, program synthesis, formal verification
I Introduction
Large Language Models (LLMs) have demonstrated impressive capabilities spanning creative writing, tutoring and education, information retrieval and distillation, common sense reasoning, complex mathematical reasoning, and even code generation [chatbotarena, gpt4, llama3, deepseek]. Particular attention has been given to LLM-assisted software development, given its economic value and its position between natural language and formal logic.
We have seen rapid advancements in code generation capabilities [Singh2023CodeFusionAP, jimenez2023swe, Hendrycks2021MeasuringCC]. With this have come strong concerns about the misuse and safety of code generated by these models, and about the ability to prove or guarantee that the produced code is sound. An approach commonly employed today is to run LLM-generated code in restricted development environments and to maintain tight feedback loops with software engineers who understand the unique failure modes of their particular complex software. This pair-programming-style approach is common in LLM chatbot interfaces, as well as in programming assistants more tightly integrated into the software engineering development cycle such as GitHub Copilot [chen2021evaluating] and Cursor (Anysphere Inc., 2023).
However, recent paradigm shifts in the ability to develop and enhance capabilities through supervised fine-tuning have led to fully autonomous software engineering, which includes code generation and execution in the development loop (e.g., Devin, Cognition AI, Inc. 2023, [wang2024openhandsopenplatformai]). Without a human evaluating code before it is run, strong assumptions or restrictions are necessary to ensure that generated code is safe and correct. Critically, as these tools become ubiquitous across applications that touch the real world, guarantees of their safety and correctness will become a requirement. Code generated for managing medical devices, managing patient records, securing private personal information, controlling autonomous vehicles, cybersecurity defense systems, and military use will all require strong guarantees before users, organizations, and governments can make effective use of these tools.
While software quality assurance primarily relies on extensive testing, the emerging field of deductive verification offers a more rigorous alternative by mathematically proving code correctness against formal specifications. However, this approach requires deep system knowledge, explicit property definitions, and often demands rewriting existing code in specialized proof assistant languages—a significant challenge given the current state of tools. As Large Language Models (LLMs) advance in code generation capabilities, they may eventually enable automated creation of deductively verifiable code, allowing us to defer trust to theorem provers rather than testing alone. This could revolutionize mission-critical software development by providing formal guarantees against bugs and hallucinations, though current LLM capabilities fall short of this goal, and it remains unclear whether fine-tuning or prompt engineering can bridge this gap.
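To make the contrast concrete, the following toy example (ours, not from the benchmark; a minimal sketch assuming a recent Lean 4 toolchain with the omega tactic, and hypothetical names double and double_eq_two_mul) shows the shape of deductive verification: a definition is the implementation, a theorem is the formal specification, and the proof is checked by Lean rather than by running tests.

```lean
-- Implementation.
def double (n : Nat) : Nat := n + n

-- Specification and machine-checked proof: no test suite required.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```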
Contributions. We deliver a set of 4715 Lean 4 files, each consisting of a solution function signature and theorem statements, all left unproven via the “sorry” keyword so that the lean executable compiles the file without actual proofs or implementations. We provide tags for samples that pass additional, rigorous quality assurance, defined by inline unit testing and property tests in Lean.
II The FVAPPS Benchmark
[Figure 1 flowchart: Preprocess Dataset → Generate Python Property Tests → Generate Lean Theorem Statements → Generate Solutions and #guard_msgs Unit Tests → Plausible check. Each stage loops back on failure (the python, pytest, and lean checks gate progression); samples are ultimately categorized as Guarded and Plausible, Guarded, or, after 10 failures at the solution stage, Unguarded.]
Figure 1: Benchmark generation pipeline for creating coding interview theorem statements in Lean from APPS questions and solutions.
A strong benchmark for evaluating model capabilities in theorem proving should have a number of desirable properties. With a particular focus on software engineering, the algorithms and the properties proved about them should be representative of the tasks used to evaluate skill in the labor force today. Coding interview questions fit this requirement naturally. Since generic coding capability is already well evaluated, we can build upon an existing benchmark of coding ability to generate our formally verified one.
We begin with APPS, a benchmark and dataset for programming puzzle solving [Hendrycks2021MeasuringCC]. Developed largely by scraping the interview preparation platforms CodeForces and LeetCode, each sample consists of a natural language question, a ternary flag marking difficulty, a list of correct solutions in Python, and a few input-output test case pairs.
II-A An Example Task
```lean
def solve_elections (n : Nat) (voters : List (Nat × Nat)) : Nat := sorry

theorem solve_elections_nonnegative (n : Nat) (voters : List (Nat × Nat)) :
    solve_elections n voters >= 0 := sorry

theorem solve_elections_upper_bound (n : Nat) (voters : List (Nat × Nat)) :
    solve_elections n voters <= List.foldl (λ acc (pair : Nat × Nat) => acc + pair.2) 0 voters := sorry

theorem solve_elections_zero_votes (n : Nat) (voters : List (Nat × Nat)) :
    (List.all voters (fun pair => pair.1 = 0)) -> solve_elections n voters = 0 := sorry

theorem solve_elections_single_zero_vote : solve_elections 1 [(0, 5)] = 0 := sorry
```
Figure 2: FVAPPS sample 23, derived from train sample 23 of the APPS source. The def is where the solver implements the function; each theorem is a correctness specification.
As an example, consider the crooked election problem:
There are $n$ voters, and two ways to convince each of them to vote for you. The first way to convince the $i$-th voter is to pay him $p_i$ coins. The second way is to make $m_i$ other voters vote for you, and the $i$-th voter will vote for free. Moreover, the process of such voting takes place in several steps. For example, if there are five voters with $m_1=1$, $m_2=2$, $m_3=2$, $m_4=4$, $m_5=5$, then you can buy the vote of the fifth voter, and eventually everyone will vote for you. Set of people voting for you will change as follows: $\{5\} \to \{1,5\} \to \{1,2,3,5\} \to \{1,2,3,4,5\}$.
Calculate the minimum number of coins you have to spend so that everyone votes for you.
To get everyone to vote for you, you can either purchase votes at each voter’s declared price, or convert each voter by meeting their declared groupthink threshold. This is the setup of sample 0023, derived from APPS problem train/23 (https://huggingface.co/datasets/codeparrot/apps/viewer/all/train?row=23). Figure 2 shows our pipeline’s output as shipped in our dataset. Crucially, it is one def and multiple theorems, though many samples include extra definitions that are helper functions for either the solution or some theorem statement. The def is where the problem solution is implemented, a prerequisite to proving theorems about its correctness.
While a number of theorem provers now exist, we choose Lean because of its explosive growth in mathematics https://leanprover-community.github.io/mathlib_stats.html and its fertile potential for general-purpose programming [christiansen2023fpinlean]. Lean is capable of expressing arbitrary programs, taking advantage of the identity monad and the partial keyword.
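As a minimal sketch (ours, not from the dataset; the function names collatzSteps and sumSquares are hypothetical) of these two escape hatches: partial admits general recursion without a termination proof, and Id.run do gives an imperative style via the identity monad.

```lean
-- General recursion without a termination argument; Lean only requires
-- the return type to be Inhabited.
partial def collatzSteps (n : Nat) : Nat :=
  if n ≤ 1 then 0
  else if n % 2 == 0 then 1 + collatzSteps (n / 2)
  else 1 + collatzSteps (3 * n + 1)

-- Imperative-style code in the identity monad, with mutation and a for loop.
def sumSquares (xs : List Nat) : Nat := Id.run do
  let mut acc := 0
  for x in xs do
    acc := acc + x * x
  return acc
```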
II-B Generation Pipeline

[Figure 3 diagram: prompt $p_0$ → Sonnet 3.5 → candidate code $c_k$ → write → run; an exit code of 0 ends the loop, while an exit code of 1 applies $\lambda c.\, c + \texttt{stderr}$ to form the next prompt $p_{k+1}$.]
Figure 3: Our generic scaffolding loop used at various stages of our pipeline. The run element is replaced with the python, pytest, lean, or lake build executable respectively.
We use Anthropic’s Claude Sonnet 3.5 (model version claude-3-5-sonnet-20241022, accessed October 25, 2024) to power five core parts of our benchmark creation pipeline. In stages one through three, whenever a non-deductive LLM call is used for processing, we allow a maximum of 5 attempts before "failing" on that sample; stage four allows a maximum of ten, and stage five involves no LLM call. These loops include simple instructions that pass relevant error messages back to the model and ask for an updated response. Given that subsets of our pipeline fall within the scope of what current LLMs are capable of, we construct scaffolding loops for iterating on model-based solutions. Broadly, we use outputs from python, pytest, lean, and lake build to ensure the generated code is correct, and we collect standard error and output to send back to the model when it is incorrect (Figure 3).
We first preprocess the original APPS dataset, aggregating unit tests and solutions into one file with a particular function definition and architecture that better lends itself to automated unit testing. This step also selects a single solution from the set of solutions provided by the original APPS dataset.
Property-based tests are procedurally generated unit tests, exercising boundary conditions implied by the input types [MacIver2019Hypothesis]. We use the scaffolding loop outlined above, specifically feeding pytest’s standard error and standard output messages back into the next API call. When pytest returns an empty standard error and an exit code of zero, we write the final file of tests to disk and exit the loop.
The lean executable is used as our supervisor for the third step, again with standard error and standard output fed back to the model API until a valid Lean file is constructed. The model is prompted to explicitly convert the property tests into unproven theorem statements (stated with sorry). The lean executable (Lean version 4.12.0, x86_64-unknown-linux-gnu, commit dc2533473114, Release) is used to typecheck the theorem statements, ignoring the sorrys but ensuring the statements are well-formed.
Step four of our process adds stricter quality assurance to our Lean outputs. Using the unit tests generated by our preprocessing step as assertions in Python, we deductively parse these assertions to construct inline unit tests in Lean via #guard_msgs (https://github.com/leanprover/lean4/blob/master/src/Lean/Elab/GuardMsgs.lean). In parallel, we translate the Python solutions into Lean 4 as defs, and require that the guard messages "pass". While theorem statements and property tests are broadly our primary goal and focus, these unit tests provide an easy way to filter out solutions that are not correct in some minimal form.
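As a minimal illustration of what such an inline unit test looks like (the function addOne is a toy of ours, not an actual FVAPPS sample, and the exact assertions generated by the pipeline may differ): #guard_msgs makes the build fail unless the guarded command emits exactly the expected message.

```lean
def addOne (n : Nat) : Nat := n + 1

-- The file fails to build unless #eval produces exactly the message "4",
-- turning an APPS-style unit test into a compile-time check.
/-- info: 4 -/
#guard_msgs in
#eval addOne 3
```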
The last phase of our pipeline involves ensuring that the theorems we wish to provide are probably true, using Lean’s property-based testing framework Plausible (https://reservoir.lean-lang.org/@leanprover-community/plausible). Any theorems that are not "plausible" at this stage, given our solutions that pass unit tests, are filtered out.
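The sketch below (ours; assuming the plausible tactic exposed by the Plausible package on a recent toolchain) shows how this check behaves: the tactic randomly tests the statement and fails with a counterexample if it finds one, so false theorem statements can be filtered; when no counterexample is found the goal is merely admitted, which is probabilistic evidence rather than a proof.

```lean
import Plausible

-- Expected to pass: random testing finds no counterexample, and the goal
-- is admitted (random testing is not a proof).
example (xs : List Nat) : xs.reverse.reverse = xs := by
  plausible

-- A statement like the following would be rejected with a concrete
-- counterexample (e.g. n = 0) and hence filtered from the benchmark:
-- example (n : Nat) : n + 1 ≤ n := by
--   plausible
```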
All code and prompts used for the generation can be found in our open-source GitHub repository: https://github.com/quinn-dougherty/fvapps
II-C Results
| Iterations | Pytest agent | Lean agent |
| --- | --- | --- |
| 0 | 1963 | 1735 |
| 1 | 1915 | 1450 |
| 2 | 584 | 842 |
| 3 | 143 | 394 |
| 4 | 75 | 183 |
| 5 | 35 | 111 |
TABLE I: Iterations needed for the Pytest and Lean agents to converge, filtered to those samples that made it past Stage 3 of the pipeline.
We provide three subsets of our benchmark according to the assurance level, depending on when the samples were chosen and filtered in our pipeline.
Unguarded
This largest set represents all examples that generated valid theorems, ending at Stage 3 of our full pipeline. Table I shows the number of execution-generation loops needed to pass our generation criterion in Stages 2 and 3. The Pytest agent took on average 0.846 iterations (with the 0th position as initialization prompt), and the Lean agent took on average 1.188 iterations.
[Figure 4 histogram: number of defs and theorems per sample (N = 4715); x-axis: number of theorems, y-axis: number of samples. Nearly all samples have ten or fewer theorems.]
Figure 4: Total number of defs and theorems across FVAPPS samples.
Of the 4715 successful samples, the median number of theorems is four (Figure 4).
Guarded
Samples that passed Stage 4 of our pipeline are those for which a solution passing the simple unit tests could be found. While we do not include these solutions in the dataset, the definitions have been verified to admit a solution under which the basic unit tests pass. Importantly, this means that claude-3-5-sonnet-20241022 was able to solve the problem with our scaffold. 2089 samples passed this level of assurance.
Guarded and Plausible
1083 samples passed through all five parts of our pipeline. These problems are solvable and well-stated to the highest probability we can establish short of a full proof (which we leave to the LLMs and other program synthesis methods). Note that the definitions in both the Guarded and the Guarded and Plausible sets are in some sense "easy", as current language models were able to provide them.
All 4715 samples are provided as a dataset on HuggingFace with appropriate tagging: https://huggingface.co/datasets/quinn-dougherty/fvapps
III Baseline Experiments
Our baseline task is to implement and prove each occurrence of the keyword "sorry" in the benchmark’s source specification. We provide preliminary evaluations here, using a loop similar to the generation pipeline. We attempt to solve 101 randomly sampled FVAPPS problems with both Claude 3.5 Sonnet (new) and Gemini 1.5 Pro (retrieved in early November 2024). For a direct human comparison, we allocated ten hours for a baseliner to attempt one sample.
III-A Model Baseline Details
We use Lean 4.12.0 throughout our benchmark generation and baseline, pinned with Mathlib available in the environment (Mathlib commit 809c3fb3b5c8f5d7dace56e200b426187516535a). We attempt to produce a partial baseline for a subset of FVAPPS using claude-3-5-sonnet-20241022 (the same model that generated the task).
A note on open source models. We spent some time attempting to solve benchmark samples with small and medium-sized open source models. In particular, we attempted simple, single-shot autocompletion of theorem statements using DeepSeek’s DeepSeek-Prover-V1.5-RL (https://huggingface.co/deepseek-ai/DeepSeek-Prover-V1.5-RL) [xin2024deepseekproverv15harnessingproofassistant], but were unable to find a one-shot or loop-based scaffolding solution that made sufficient progress. An important part of this task seems to be the ability to iterate, and while models fine-tuned to reduce loss on next-token prediction in Lean may eventually prove useful, their currently limited capability on broader, less narrow goals is insufficient.
III-A 1 Prompt Details
Because Lean 4 is in active development and interfaces and functions change with reasonable frequency, we found it was critical to give the LLM sufficient context to avoid common pitfalls associated with biases it may have due to outdated training data. We found significantly higher success rates by following a number of key insights by Victor Taelin (https://github.com/VictorTaelin/AI-scripts/), including 1) direct, explicit description of the problem formulation and goal, 2) details of the Lean 4 environment that may be significantly different from its pretraining dataset, and 3) multiple examples of theorem statements and solutions to theorem statements.
III-A 2 Scaffolding Loop (Agent)
We initially provide the model with a system prompt consisting of the details in the previous section, and prompt the model first with the original English-language description of the problem and the sorry’d benchmark .lean file (Figure 2), with instructions to implement the sorrys. After the model responds, any code block provided is extracted, written to a Basic.lean file, and run in a custom managed environment with specifically pinned libraries using lake build. The full standard input and output, alongside the return code, are sent back to the model with the instruction to address the issues.
Initial attempts were unsuccessful in solving all of the theorem statements at once; to give the model a better chance at solving the problem, we present one statement at a time, starting with the definitions and then appending each theorem independently, to be proven separately from the rest.
III-B Human Baseline
A human attempted the function and one of the theorems from sample 0023. It took them 10 hours to complete the function shown in Figure 5, since they needed termination proofs for two recursors. In the time allotted, they did not make it through any of the proofs. One problem they ran into with the proofs was a stack overflow when trying to unfold the recursors. Given this, we do not think a full human baseline is a feasible option.
```lean
def go (voters holdouts : List (Nat × Nat)) (expense conversions : Nat) : Nat :=
  if conversions ≥ voters.length then
    expense
  else
    let filtered := filter_cascade holdouts conversions
    if filtered.isEmpty then
      expense
    else
      let bestValue := filtered.argmax? utility |>.get!
      let holdouts' := filtered.eraseP (· == bestValue)
      go voters
        holdouts'
        (expense + bestValue.snd)
        (conversions + 1)
termination_by (voters.length - conversions)
decreasing_by omega

def solve_elections (n : Nat) (voters : List (Nat × Nat)) : Nat :=
  go voters voters 0 0
```
Figure 5: Human baseline’s solution to the definition for sample 23 of FVAPPS (helper recursor omitted for brevity, available on GitHub)
III-C Comparing Model Baseline and Human Baseline on One Sample
On sample 23 regarding the crooked election, both successfully implemented the definition of the function (Appendix -C) and succeeded on no nontrivial proofs. We let Sonnet attempt the second theorem for 100 loops, and while it made a valiant effort, it diverged. The human and Sonnet clearly took two very different approaches, but it remains to be seen which approach has better proof ergonomics.
III-D Model Performance on More Samples
[Figure 6 bar chart: qualitative assessment of model baseline results; counts per category (Easy Branch of If, Effectively Unit Test, Non-negativity of Nat, Too Trivial, Actually Nontrivial, Bug, Model Declares Untrue) for Sonnet and Gemini.]
Figure 6: Qualitative categories of theorem solutions in the 100 samples, first two theorems each sample. The red box shows completely spurious results, either bugs or a substitution of a quantified variable with a single value. The green box shows the most nontrivial results. The other categories are neither spurious nor impressive, though they require some syntactic fluency that many language models would fail at.
Both Sonnet and Gemini ran for a maximum of 25 loops for the defs and a maximum of 50 loops for each theorem. Sonnet completed all but seven of the definitions, and 43 samples had at least one theorem completed. Gemini completed only 71 of the definitions, and 23 samples had at least one theorem completed. Figure 6 shows the breakdown across our different buckets, assessed qualitatively. The green box surrounds the best solutions, nontrivial proofs that involve both intricate reasoning and correct syntax. The red box shows completely spurious results, either bugs or a substitution of a quantified variable with a single value. The other categories are neither spurious nor impressive, though they require some syntactic fluency that many language models would fail at. Importantly, Gemini declared the second theorem of sample 3078 incorrect/unsolvable in a comment, while Sonnet nontrivially solved it.
Figures 7 and 8 show how many samples took how many attempts. Of the theorems that were eventually completed, roughly 20% of each model’s were done in zero or one iteration of the loop.
[Figure 7 bar chart: number of samples vs. number of theorem attempts (0 to 25) for Sonnet and Gemini; most solved theorems are completed within the first few attempts, with a long, thin tail at higher attempt counts.]
Figure 7: Number of theorem attempts it took to solve a theorem, conditional on that theorem succeeding.
Of the 406 theorems attempted, Sonnet accomplished 121 ( $30\%$ ) and Gemini accomplished 74 ( $18.5\%$ ).
III-E Discussion
We noticed that Sonnet solved many problems via imperative programming with the identity monad (Id.run do). However, this kind of solution fails to capture inductive structure that lends itself to proofs. This means future solutions to FVAPPS will take one of two directions: an unwieldy and uninterpretable proof against the identity-monad definition, or solutions factored in a way that can exploit inductive structure.
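The toy contrast below (ours, not taken from the baseline outputs; sumImp, sumRec, and sumRec_append are hypothetical names) illustrates the point: the imperative definition hides the recursion inside the identity monad, while the structurally recursive one unfolds directly under induction.

```lean
-- Imperative style: correct, but the loop state resists induction.
def sumImp (xs : List Nat) : Nat := Id.run do
  let mut acc := 0
  for x in xs do
    acc := acc + x
  return acc

-- Structural recursion: the definition mirrors the shape of the proof.
def sumRec : List Nat → Nat
  | [] => 0
  | x :: xs => x + sumRec xs

theorem sumRec_append (xs ys : List Nat) :
    sumRec (xs ++ ys) = sumRec xs + sumRec ys := by
  induction xs with
  | nil => simp [sumRec]
  | cons x xs ih => simp [sumRec, ih, Nat.add_assoc]
```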
Table II shows the split across our quality assurance levels for the entire baseline set and for the subsets that were solved. While this sample is small, we roughly find that our higher-quality samples that are both guarded and plausible are somewhat more likely to be solvable by current models.
| Assurance level | | | |
| --- | --- | --- | --- |
| Unguarded | 69 | 41 | 28 |
| Guarded | 18 | 12 | 11 |
| Guarded and Plausible | 14 | 7 | 4 |
| Total | 101 | 60 | 43 |
TABLE II: Baseline results split across quality assurance stages of our generation pipeline.
IV Related Work
Research and development into stronger programming guarantees spans a number of large and growing communities in the software landscape. While interest in more strongly typed programming paradigms is growing (e.g., efforts to adopt Rust in the Linux kernel, https://github.com/Rust-for-Linux/linux, and the increasing availability of static typing in Python, https://realpython.com/python312-typing/), there remains a smaller group of people and programming languages focused directly on formal verification. Coq [coq], Dafny [dafnybook], Isabelle [isabellebook], and Lean [demoura2015lean] all target theorem proving and formal verification as first-class goals, albeit with varying visions and core foundations.
Automating programming with machine learning or program synthesis is a rich area [ellis2020dreamcodergrowinggeneralizableinterpretable, austin2021programsynthesislargelanguage].
IV-A Datasets and Benchmarks
Mathematical theorem proving datasets, formal and informal, are the most extensive, with benchmarks ranging from 15,000 to over 100,000 proofs [coqgym, leandojo, jiang2021lisa, naturalproofs], plus the Archive of Formal Proofs containing around 1 million lines of code [afp]. Unverified program synthesis benchmarks are typically an order of magnitude smaller, containing thousands of programs, with APPS being the largest at 10,000 programs [Hendrycks2021MeasuringCC] (though LiveCodeBench continues to grow [livecodebench]). In stark contrast, formal software verification benchmarks have historically been far more limited in size, with the available datasets Clover and dafny-synthesis containing on the order of 100 programs each [clover, dafny-synthesis], and DafnyBench [loughridge2024dafnybench] reaching 782.
IV-B Models
Models like DeepSeek [xin2024deepseekproveradvancingtheoremproving, xin2024deepseekproverv15harnessingproofassistant] and AlphaProof [deepmind2024imo] have advanced the ability of AI to do math, both formally and informally. DeepSeek Prover has demonstrated significant capabilities in formal mathematics, achieving state-of-the-art performance on the Lean theorem proving benchmark. The model combines large-scale pretraining on mathematical texts with specialized fine-tuning on formal proof corpora, enabling it to generate valid formal proofs for complex mathematical statements.
AlphaProof [deepmind2024imo] represents another milestone in automated theorem proving, showing particular strength in creative mathematical problem-solving. Its success on IMO-level problems demonstrates the potential for AI systems to engage in sophisticated mathematical reasoning, though the gap between informal mathematical proofs and formal verification remains significant.
In the domain of program synthesis and verification, models have evolved along several tracks. Traditional program synthesis approaches using symbolic techniques and SMT solvers have been augmented by neural methods [wang2023legoprover]. Large language models fine-tuned on code, such as CodeLlama [Rozire2023CodeLO] and GPT-4 [gpt4], have shown promising results in generating functionally correct programs, but their capabilities in formal verification remain limited.
Recent work has begun to bridge this gap [Baif2024DafnyCopilot, yang2024autoverusautomatedproofgeneration] demonstrating the potential for AI assistance in formal program verification. However, these systems typically require significant human guidance and cannot yet autonomously generate both correct programs and their formal proofs. The challenge of simultaneously reasoning about program semantics and verification conditions remains an open problem in the field.
V Discussion
Our current presentation has a few limitations and clear directions for future work. While the samples in our final guarded and plausible set are quite high quality, it is nonetheless possible that the theorem statements do not correspond to desirable properties, or that they do not, in union, cover the full range of properties one would ideally prefer. While manual spot-checking did not reveal any systematic failures after the complete five-stage processing, it is nonetheless possible that particular theorems are inapplicable or vacuous in some form.
The FVAPPS benchmark evaluates AI systems on both program synthesis and formal verification by extending APPS with Lean 4 theorem specifications. While current language models can implement programming solutions (see, e.g., https://www.anthropic.com/news/3-5-models-and-computer-use and https://openai.com/index/introducing-swe-bench-verified/), they struggle with formal verification proofs, highlighting the gap between imperative programming and proof-amenable structures. The benchmark’s difficulty is demonstrated by the limited progress of both human baselines and models. FVAPPS opens research directions in simultaneous implementation and verification reasoning, automated proof-amenable solution structuring, and scaling formal verification, providing a framework to measure progress toward verified automated programming.
Acknowledgment
We thank Buck Shlegeris, Jason Gross, and Rajashree Agarwal for valuable feedback and discussion during early versions of this work. We are also grateful to the Anthropic External Researcher program for providing API credits for benchmark generation, and to FAR.ai for providing operations and administrative support during the course of the project.
-A Details of buckets in Figure 6
Table III declares which samples in particular fall into Figure 6’s buckets. Of note, sample 3078 is declared untrue in a code comment by Gemini, but solved completely by Sonnet. Otherwise, many samples fall into the same buckets across models.
TABLE III: Comparison of theorem classifications between Sonnet and Gemini
-B Number of definition attempts it took to solve a function, conditional on it succeeding
Figure 8 is a companion to Figure 7 but refers to definitions instead of theorems. Of the defs that succeeded at all, 50% of Sonnet’s and 31% of Gemini’s succeeded in one shot.
[Figure 8 bar chart: number of samples vs. number of definition attempts for Sonnet and Gemini; most definitions are solved within the first few attempts, with Gemini showing a longer tail of high attempt counts.]
Figure 8: Definition attempts needed to solve a function, conditional on success.
-C Baseline solutions to problem 23
Gemini accomplished its definition (Figure 10) in 11 attempts, while Sonnet’s attempt (Figure 9) failed after 50 attempts.
```lean
def solve_elections (n : Nat) (voters : List (Nat × Nat)) : Nat := Id.run do
  let range := List.range (n + 1)
  let mut best := none
  for k in range do
    match try_subset ⟨n, voters, 0, k, []⟩ with
    | some cost =>
      match best with
      | none => best := some cost
      | some b => if cost < b then best := some cost
    | none => continue
  best.getD 0
```
Figure 9: claude-3-5-sonnet-20241022’s attempt at the definition for sample 23 of FVAPPS (try_subset omitted for brevity, available on GitHub)
```lean
def solve_elections (n : Nat) (voters : List (Nat × Nat)) : Nat :=
  let voters := voters.toArray.qsort
    (fun v₁ v₂ => (v₁.fst < v₂.fst) || (v₁.fst == v₂.fst && v₁.snd > v₂.snd)) |>.toList -- Use boolean comparisons for sorting
  let rec go (cost : Nat) (numVoters : Nat) (remainingVoters : List (Nat × Nat)) : Nat :=
    if numVoters == n then cost
    else
      match remainingVoters with
      | [] => cost
      | (m, p) :: rest =>
        let buyCost := go (cost + p) (numVoters) rest
        let freeCost := if m <= numVoters then go cost (min n (numVoters + 1)) rest else buyCost
        min buyCost freeCost
  go 0 0 voters
```
Figure 10: gemini-1.5-pro’s solution to the definition for sample 23 of FVAPPS