# Demystifying GPT Self-Repair for Code Generation
**Authors**:
- Theo X. Olausson (MIT EECS & CSAIL)
- Jeevana Priya Inala
> Correspondence to theoxo@mit.edu. Work partially done while T.X.O. was at Microsoft Research.
Abstract
Large Language Models (LLMs) have shown remarkable aptitude in code generation but still struggle on challenging programming tasks. Self-repair—in which the model debugs and fixes mistakes in its own code—has recently become a popular way to boost performance in these settings. However, the literature contains only very limited studies of how and when self-repair works effectively, and one might wonder to what extent a model is really capable of providing accurate feedback on why code is wrong when that code was generated by the same model. In this paper, we analyze GPT-3.5 and GPT-4’s ability to perform self-repair on APPS, a challenging dataset consisting of diverse coding challenges. To do so, we first establish a new evaluation strategy dubbed pass@t that measures the pass rate of the tasks against the total number of tokens sampled from the model, enabling a fair comparison to purely sampling-based approaches. With this evaluation strategy, we find that self-repair is effective only with GPT-4. We also observe that self-repair is bottlenecked by the feedback stage; using GPT-4 to give feedback on the programs generated by GPT-3.5 and using expert human programmers to give feedback on the programs generated by GPT-4, we unlock significant performance gains.
1 Introduction
Large language models (LLMs) have proven capable of generating code snippets from natural language specifications, but still struggle on complex coding challenges such as those found in competitions and professional software engineering interviews. Recent work has sought to improve performance by leveraging self-repair [Gupta et al., 2020, Le et al., 2022, Chen et al., 2023b, Zhang et al., 2023], in which the model introspects and corrects mistakes in its own code. Figure 1 shows a typical workflow of a self-repair based approach. First, given a specification, a program is sampled from a code generation model; this program is then executed on a suite of unit tests provided as part of the specification; if the program fails on any unit test, then the error message and the faulty program are given to a feedback generation model, which outputs a short explanation of why the code failed; finally, the feedback is passed to a repair model, which generates a fixed version of the program. On the surface, this is a very attractive idea. It allows the system to overcome mistakes caused by unfortunate samples during decoding; easily incorporates feedback during the repair phase from symbolic systems such as compilers, static analysis tools, and execution engines; and mimics the trial-and-error way in which human software engineers write code.
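For illustration, the four-stage workflow just described can be sketched as follows. The function names (`generate`, `give_feedback`, `repair`) are placeholders for calls to the code and feedback models, not part of any real API; only the execution stage is concrete.

```python
def self_repair(spec, tests, generate, give_feedback, repair):
    """One round of the self-repair loop from Figure 1 (illustrative sketch)."""
    program = generate(spec)                        # (2) sample an initial program
    error = run_tests(program, tests)               # (3) execute against unit tests
    if error is None:
        return program                              # all tests pass: done
    feedback = give_feedback(spec, program, error)  # (4) explain why the code failed
    return repair(spec, program, error, feedback)   # (5) attempt a fixed version

def run_tests(program, tests):
    """Return None if all tests pass, else a short error message."""
    namespace = {}
    try:
        exec(program, namespace)
        for args, expected in tests:
            actual = namespace["f"](*args)
            if actual != expected:
                return f"f{args} returned {actual!r}, expected {expected!r}"
    except Exception as e:
        return f"{type(e).__name__}: {e}"
    return None
```

A system built this way can loop, feeding each repaired program back through the execution stage until the tests pass or a budget is exhausted.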
**(1) Specification and unit tests:** Given is a string $s$ representing the day of the week today. $s$ is one of SUN, MON, TUE, WED, THU, FRI, or SAT. After how many days is the next Sunday (tomorrow or later)?

```python
# UNIT TESTS (EXECUTABLE)
assert f('MON') == 6
assert f('WED') == 4
assert f('SUN') == 7
```

**(2) Initial program:**

```python
def f(s):
    return (7 - ['SUN', ..., 'FRI', 'SAT'].index(s)) % 7
```

**(3) Error message:** Given input 'SUN', the program returned 0, but the expected output was 7.

**(4) Feedback:** The code does not account for the case where the input is 'SUN' and the output should be 7. This can be fixed by removing the modulo operation.

**(5) Repaired program:**

```python
def f(s):
    return (7 - ['SUN', ..., 'FRI', 'SAT'].index(s))  # % 7
```
Figure 1: Self-repair with separate code and feedback models. First, a user gives a specification in the form of text and a suite of unit tests (1). Then, a code model (blue) generates a program (2). The program is checked against the unit tests using a symbolic execution engine, and an error message is returned (3). In order to provide more signal to the code model, textual feedback as to why this happened is provided by a feedback model (yellow; 4). Finally, this feedback is used by the code model to repair the program (5).
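The fault in Figure 1 is easy to reproduce. Below, both versions of `f` are written out with the elided day list filled in (its contents follow directly from the problem statement):

```python
# The seven day names, in the order implied by the problem statement.
DAYS = ['SUN', 'MON', 'TUE', 'WED', 'THU', 'FRI', 'SAT']

def f_initial(s):
    # Component (2): fails on 'SUN', for which it returns 0 instead of 7.
    return (7 - DAYS.index(s)) % 7

def f_repaired(s):
    # Component (5): the modulo is removed, as the feedback suggests.
    return 7 - DAYS.index(s)
```

`f_initial` passes the first two unit tests but returns 0 on `'SUN'`; `f_repaired` passes all three.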
However, it is important to remember that self-repair requires more invocations of the model, thus increasing the computational cost. In particular, whether self-repair is a winning strategy or not ultimately boils down to whether you would—at an equivalent compute budget—have had a greater chance of success if you had simply drawn more code samples i.i.d. from the model and checked them against the suite of unit tests provided as part of the task. Crucially, the effectiveness of self-repair depends not only on the model’s ability to generate code, which has been studied extensively in the literature, but also on its ability to identify how the code (generated by the model itself) is wrong with respect to the task specification. As far as we are aware, no previous or contemporary work has attempted to study the effect of this stage in detail.
In this paper, we study the effectiveness of self-repair with GPT-3.5 [Ouyang et al., 2022, OpenAI, 2022] and GPT-4 [OpenAI, 2023] when solving competition-level code generation tasks. We begin by proposing a new evaluation strategy dubbed pass@t, in which the likelihood of obtaining a correct program (with respect to the given unit tests) is weighed against the total number of tokens sampled from the model. Using this instead of the traditional pass@k [Chen et al., 2021, Kulal et al., 2019] metric (which weighs pass rate against the number of trials), we are able to accurately compare performance gained through self-repair against any additional work done by the model when generating the feedback and carrying out the repair. Using this new evaluation strategy, we then carefully study the dynamics of the self-repair process under a range of hyper-parameters. Finally, given our primary objective of gaining insight into the state-of-the-art code generation models’ ability to reflect upon and debug their own code, we carry out a set of experiments in which we investigate the impact of improving the feedback stage alone. We do so by analyzing the impact of using a stronger feedback generation model than the code generation model (using GPT-4 to generate feedback on programs produced by GPT-3.5), as well as by carrying out a study in which human participants provide feedback on incorrect programs, in order to compare model-generated self-feedback to that provided by human programmers.
From our experiments, we find that:
1. When taking the cost of inspection and repair into account, performance gains from self-repair can only be seen with GPT-4; for GPT-3.5, the pass rate with repair is lower than or equal to that of the baseline, no-repair approach at all budgets.
2. Even for GPT-4, performance gains are modest at best ($66\% \to 71\%$ pass rate with a budget of 7000 tokens, roughly the cost of 45 i.i.d. GPT-4 samples) and depend on having sufficient diversity in the initial programs.
3. Replacing GPT-3.5’s explanations of what is wrong with feedback produced by GPT-4 leads to better self-repair performance, even beating the baseline, no-repair GPT-3.5 approach ($50\% \to 54\%$ at 7000 tokens).
4. Replacing GPT-4’s own explanations with those of a human programmer improves repair significantly, leading to a 57% increase in the number of repaired programs which pass the tests.
2 Related work
Program synthesis with large language models. The use of large language models for program synthesis has been studied extensively in the literature [Li et al., 2022, Austin et al., 2021, Chen et al., 2021, Le et al., 2022, Fried et al., 2023, Nijkamp et al., 2023, Chowdhery et al., 2022, Touvron et al., 2023, Li et al., 2023]. This literature has predominantly focused on evaluating models in terms of either raw accuracy or the pass@k metric [Kulal et al., 2019, Chen et al., 2021], often leveraging filtering techniques based on execution [Li et al., 2022, Shi et al., 2022] or ranking [Chen et al., 2021, Inala et al., 2022, Zhang et al., 2022] to reduce the number of samples which are considered for the final answer. In contrast, our work focuses on evaluating the models from the point of view of minimizing the number of samples that need to be drawn from the model in the first place. Our work is also different in that we assume access to the full collection of input-output examples, as is typically done in inductive synthesis [Kitzelmann, 2010, Polozov and Gulwani, 2015, Gulwani et al., 2017, Chen et al., 2019a, Ellis et al., 2021]. In particular, unlike some prior work [Li et al., 2022, Shi et al., 2022], we do not make a distinction between public tests used for filtering and private tests used to determine correctness, since our method does not involve filtering the outputs.
Code repair. Statistical and learning-based techniques for code repair have a rich history in both the programming languages and machine learning communities, although they have traditionally been used predominantly to repair human-written code [Long and Rinard, 2016, Bader et al., 2019, Le Goues et al., 2021, Yasunaga and Liang, 2021, Chen et al., 2019b, Mesbah et al., 2019, Wang et al., 2018]. More recently, repair has been used as a post-processing step to improve code that was itself automatically synthesized, both in domain-specific languages [Gupta et al., 2020] and in general-purpose code [Le et al., 2022, Yasunaga and Liang, 2021, 2020]. Our contribution differs from most prior work in this literature in its use of textual feedback for repair, which is possible thanks to the above-mentioned rise in the use of LLMs for program synthesis.
Contemporary work on LLM self-repair. Recognizing that there is much contemporary work seeking to self-repair with LLMs, we now briefly highlight a few such papers which are particularly close to our work. Zhang et al. [2023] explore self-repair without natural language feedback on APPS [Hendrycks et al., 2021] using a diverse range of fine-tuned models. They also experiment with prompt-based repair using Codex [Chen et al., 2021], InCoder [Fried et al., 2023], and CodeGen [Nijkamp et al., 2023]. Notably, their framework does not consider the cost associated with feedback and repair, which presents a significantly different perspective on self-repair. Similarly, Chen et al. [2023b] assess Codex’s ability to self-repair across a variety of tasks, in a framework that closely resembles that which we study in this work. However, their study differs from ours in terms of the models considered, the evaluation strategy, and, most importantly, the research goal, as we specifically aim to investigate the significance of the textual feedback stage. Self-repair, or frameworks with other names that are conceptually very similar to it, has also been used in contexts outside of code generation. Peng et al. [2023] use self-repair to mitigate hallucinations and improve factual grounding in a ChatGPT-based web search assistant, in which the model revises its initial response based on self-generated feedback. Similarly, Madaan et al. [2023] present a framework in which a model iteratively provides feedback on and revises its output until a stopping criterion is reached; they apply this framework to a range of tasks, including dialogue and code optimization. Ultimately, we see our work, in which we use the novel evaluation metric pass@t to investigate the significance of the textual feedback stage in competition-level self-repair, as being complementary to contemporary research which uses traditional metrics to evaluate self-repair in a broader context. 
We are eager to see what the implications of our results will be in these other domains.
3 Methodology
3.1 Self-Repair Overview
As shown in Figure 1, our self-repair approach involves 4 stages: code generation, code execution, feedback generation, and code repair. We now formally define these four stages.
Code generation. Given a specification $\psi$ , a programming model $M_{P}$ first generates $n_{p}$ samples i.i.d., which we denote
$$
\{p_{i}\}_{i=1}^{n_{p}}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}M_{P}(\psi)
$$
Code execution. These $n_{p}$ code samples are then executed against a test bed. Recall from Section 2 that we assume that we have access to the full set of tests in executable form (see Section 5 for a brief discussion on the validity of this assumption in software engineering domains). Thus, we stop if any sample passes all of the tests, since a satisfying program has then been found. Otherwise, we collect the error messages $\{e_{i}\}_{i}$ returned by the execution environment. These error messages either contain the compile/runtime error information or an example input on which the program’s output differs from the expected one. An example is shown in Figure 1 (component 3).
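The execution stage with early stopping can be sketched as follows. Here `check` is an illustrative callable that returns `None` when a program passes every test and an error message otherwise; both names are ours, not the paper's implementation.

```python
def execute_stage(samples, check):
    """Run candidate programs in order; return a passing one, or None plus
    the error messages e_i collected for every failing sample tried."""
    errors = {}
    for i, program in enumerate(samples):
        e = check(program)
        if e is None:          # all unit tests pass: a satisfying program exists
            return program, errors
        errors[i] = e          # e_i: compile/runtime error, or a failing input
                               # with the expected vs. actual output
    return None, errors
```

If no sample passes, the collected `errors` feed the feedback generation stage.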
Feedback generation. Since the error messages from the execution environment are usually very high-level, they provide little signal for repair. Therefore, as an intermediate step, we use a feedback model to produce a more detailed explanation of what went wrong; Figure 1 (component 4) shows an example. Formally, in this stage, we generate $n_{f}$ feedback strings, $\{f_{ij}\}_{j}$ , for each wrong program, $p_{i}$ , as follows:
$$
\{f_{ij}\}_{j=1}^{n_{f}}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}M_{F}(\psi;p_{i};e_{i})
$$
Having an explicit feedback generation step allows us to ablate this component so that we can study its significance in isolation.
Code repair. In the final step, for each initial program $p_{i}$ and feedback $f_{ij}$, $n_{r}$ candidate repaired programs are sampled from $M_{P}$ (we use the same model for both the initial code generation and the code repair, since these are fundamentally similar tasks):
$$
\{r_{ijk}\}_{k=1}^{n_{r}}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}M_{P}(\psi;p_{i};e_{i};f_{ij})
$$
Repair tree. We call the tree of interleaved text and programs produced by this procedure—rooted in the specification $\psi$ , then branching into initial programs $p_{i}$ , each of which branches into feedback $f_{ij}$ and then repairs $r_{ijk}$ —a repair tree, $T$ (Figure 2).
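A minimal sketch of the repair tree as a data structure follows; the class and field names are illustrative, not the authors' implementation, and `tree_size` counts characters as a stand-in for tokens.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class FeedbackNode:
    text: str                                          # feedback string f_ij
    repairs: List[str] = field(default_factory=list)   # repaired programs r_ijk

@dataclass
class ProgramNode:
    code: str                                          # initial program p_i
    error: Optional[str] = None                        # e_i, if the program failed
    feedback: List[FeedbackNode] = field(default_factory=list)

@dataclass
class RepairTree:
    spec: str                                          # specification psi (root)
    programs: List[ProgramNode] = field(default_factory=list)

def tree_size(tree: RepairTree, count=len) -> int:
    """Total program and feedback 'tokens' in the tree (here: characters)."""
    n = 0
    for p in tree.programs:
        n += count(p.code)
        for f in p.feedback:
            n += count(f.text) + sum(count(r) for r in f.repairs)
    return n
```

A size function of this shape is what the pass@t evaluation in Section 3.2 charges against the token budget.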
Caveat: jointly sampling feedback and repair. The general framework presented above does not require the programming model and feedback model to be the same, thus allowing for the use of specialized models in the system. However, when $M_{P}=M_{F}$ we jointly generate both the feedback and the repaired program in a single API call, since both GPT-3.5 and GPT-4 have a natural tendency to interleave text and code in their responses. See Appendix E for a detailed look at how the prompt differs between this and the previous setting. Formally, we denote this as
$$
\{(f_{ij},r_{ij})\}_{j=1}^{n_{fr}}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}M_{P}(\psi;p_{i};e_{i})
$$
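When feedback and repair are sampled jointly, the single response interleaves text and code and must be split back apart. One simple way to do so, assuming the repaired program arrives in a fenced code block (an assumption about the response format, not the paper's implementation):

```python
import re

def split_feedback_and_repair(response: str):
    """Split a joint response into (feedback text, repaired program)."""
    pattern = r"```(?:python)?\n(.*?)```"
    m = re.search(pattern, response, re.DOTALL)
    code = m.group(1).strip() if m else ""
    # The feedback is whatever remains once the code block is removed.
    text = re.sub(pattern, "", response, flags=re.DOTALL).strip()
    return text, code
```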
Figure 2: A repair tree begins with a specification $\psi$ (root node), then grows into initial programs, feedback, and repairs.
3.2 pass@t: pass rate vs. token count
Since self-repair requires several dependent model invocations of non-uniform cost, this is a setting in which pass@ $k$ —the likelihood of obtaining a correct program in $k$ i.i.d. samples—is not a suitable metric for comparing and evaluating various hyper-parameter choices of self-repair. Instead, we measure the pass rate as a function of the total number of tokens sampled from the model, a metric which we call pass@t.
Formally, suppose that you are given a dataset $D=\{\psi_{d}\}_{d}$ and a chosen set of values for the hyper-parameters $(M_{P},M_{F},n_{p},n_{f},n_{r})$ . Let $T_{d}^{i}\sim M(\psi_{d})$ denote a repair tree that is sampled as described in Section 3.1 for the task $\psi_{d}$ ; let $\text{size}(T_{d}^{i})$ denote the total number of program and feedback tokens in the repair tree; and say that $T_{d}^{i}\models\psi_{d}$ is true if, and only if, $T_{d}^{i}$ has at least one leaf program that satisfies the unit tests in the specification $\psi_{d}$ . Then the pass@t metric of this choice of hyper-parameters is defined as the expected pass rate at the number of tokens which you would expect to generate with this choice of hyper-parameters:
$$
\texttt{pass@t}\triangleq\mathop{\mathbb{E}}_{\substack{\psi_{d}\sim D\\ T_{d}^{i}\sim M(\psi_{d})}}\left[T_{d}^{i}\models\psi_{d}\right]\quad\textbf{at}\quad t=\mathop{\mathbb{E}}_{\substack{\psi_{d}\sim D\\ T_{d}^{i}\sim M(\psi_{d})}}\left[\text{size}(T_{d}^{i})\right]
$$
In our experiments, we plot bootstrapped estimates of these two quantities. To obtain these, we first generate a very large repair tree for each task specification, with: $N_{p}≥ n_{p}$ initial program samples; $N_{f}≥ n_{f}$ feedback strings per wrong program; and $N_{r}≥ n_{r}$ repair candidates per feedback string. Given a setting of $(n_{p},n_{f},n_{r})$ , we then sub-sample (with replacement) $N_{t}$ different repair trees from this frozen dataset. Finally, we compute the sample mean and standard deviation of the pass rate and the tree size over these $N_{t}$ trees. Estimating the pass@t in this way greatly reduces the computational cost of our experiments, since we can reuse the same initial dataset to compute the estimates for all of the various choices of $n_{p},n_{f}$ , and $n_{r}$ .
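The bootstrapping procedure can be sketched for a single task as follows. The arguments describe the frozen pre-generated dataset: `init_pass[i]` says whether initial program $i$ already passed, `passes[i][j]` whether the $j$-th feedback+repair sample fixed wrong program $i$, and `prog_tok`/`fr_tok` hold the corresponding token counts. All names are illustrative, and the sketch assumes joint feedback-and-repair sampling ($n_{r}=1$), as in most of the paper's experiments.

```python
import random

def bootstrap_pass_at_t(init_pass, passes, prog_tok, fr_tok,
                        n_p, n_fr, n_trees=1000, seed=0):
    """Estimate (pass rate, expected tree size) for one task by sub-sampling
    n_trees repair trees, with replacement, from the frozen dataset."""
    rng = random.Random(seed)
    N_p = len(init_pass)
    n_ok, tok_total = 0, 0
    for _ in range(n_trees):
        ok, size = False, 0
        for _ in range(n_p):
            i = rng.randrange(N_p)        # draw an initial program
            size += prog_tok[i]
            if init_pass[i]:              # already correct: the tree succeeds
                ok = True
                continue
            for _ in range(n_fr):         # draw joint feedback+repair samples
                j = rng.randrange(len(passes[i]))
                size += fr_tok[i][j]
                ok = ok or passes[i][j]
        n_ok += ok
        tok_total += size
    return n_ok / n_trees, tok_total / n_trees
```

Averaging these per-task estimates over the dataset yields the plotted pass@t curves; reusing the frozen samples is what makes sweeping over $(n_{p}, n_{f}, n_{r})$ cheap.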
We use $N_{p}=50$ for all experiments, and consider $n_{p}≤ 25$ for the self-repair approaches and $n_{p}≤ 50$ for the baseline, no-repair approach. Similarly, for the feedback strings, we use $N_{f}=25$ and $n_{f}≤ 10$ (except for Section 4.2, in which we only consider $n_{f}=1$ and therefore settle for $N_{f}=10$ instead). For the repair candidates, since we do joint sampling of feedback and repair in most of our experiments, we set $N_{r}=n_{r}=1$ . Finally, we use $N_{t}=1000$ for all settings.
4 Experiments
In this section, we carry out experiments to answer the following research questions: (a) In the context of challenging programming puzzles, is self-repair better than i.i.d. sampling without repair for the models we consider? If so, under what hyper-parameters is self-repair most effective? (b) Would a stronger feedback model boost the model’s repair performance? (c) Would keeping a human in the loop to provide feedback unlock better repair performance even for the strongest model?
We evaluate these hypotheses on Python programming challenges from the APPS dataset [Hendrycks et al., 2021]. The APPS dataset contains a diverse range of programming challenges paired with a suite of tests, making it a perfect (and challenging) setting to study self-repair in. To keep our experiments tractable, we evaluate on a subset of the APPS test set, consisting of 300 tasks. These tasks are proportionally sampled in accordance with the frequency of the different difficulty levels in the test set: 180 interview-level questions, 60 competition-level questions, and 60 introductory-level questions (listed in Appendix F). We use GPT-3.5 [Ouyang et al., 2022, OpenAI, 2022] and GPT-4 [OpenAI, 2023] as our models of choice, and implement self-repair using templated string concatenation with one-shot prompting; our prompts are given in Appendix E. When appropriate, we compare against a baseline without repair. This baseline, shown with a black line in the plots, is simply i.i.d. sampling from the corresponding model (e.g., GPT-4 when we explore whether GPT-4 is capable of self-repair). Based on preliminary experiments, we set the decoding temperature to $0.8$ for all the models to encourage diverse samples.
(a) Mean pass rate vs. number of tokens generated. Black line is i.i.d. sampling without repair from GPT-3.5. Note that the error bars are often smaller than the markers; all settings have a standard deviation of less than 1.5 absolute points on the y-axis. Results truncated at $t=10,000$ .
(b) Normalized mean pass rate relative to the (interpolated) baseline at an equivalent budget (number of tokens). Cells for which the number of tokens generated exceeds 50 samples from the GPT-3.5 baseline marked O.O.B. (out of bounds).
Figure 3: Pass rate versus number of tokens generated for various settings of $n_{p}$ (number of initial programs) and $n_{fr}$ (number of repairs sampled per program). GPT-3.5 is used for all samples, including the baseline.
4.1 Self-repair requires strong models and diverse initial samples
In this subsection, we consider the setup where $M_{P}=M_{F}∈\{\text{GPT-3.5, GPT-4}\}$ , i.e., where a single model is used for both code/repair generation and feedback generation. To evaluate whether self-repair leads to better pass@t than a no-repair, i.i.d. sampling-based baseline approach, we vary $n_{p}$ and $n_{fr}$ —that is, the number of initial i.i.d. base samples and the number of joint feedback-repair samples drawn from $M_{P}$ —over the range $(n_{p},n_{fr})∈\{1,2,5,10,25\}×\{1,3,5,10\}$ . Recall that when $M_{P}=M_{F}$ , we jointly sample $n_{fr}$ pairs of feedback strings and repair programs instead of sampling them one after the other (Section 3.1).
The results are shown in Figure 3 for GPT-3.5 and Figure 4 for GPT-4. In the left-hand subplots, the color of each dot indicates the number of initial samples ( $n_{p}$ ), while its shape indicates the number of feedback-repair samples ( $n_{fr}$ ). In the right-hand subplots, we show a heat map with the two hyper-parameters along the axes, where the value in each cell is the mean pass rate with self-repair normalized by the mean pass rate of the baseline, no-repair approach at the same token budget (i.e., pass@t at the same value of $t$ ). A normalized mean pass rate of 1 means that self-repair matches the baseline at that token budget; a value greater than 1 means self-repair outperforms it.
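The normalization described above can be sketched as follows. This is an illustrative reconstruction (the function name and array-based interface are our own, not the paper's code), with the baseline curve linearly interpolated at the self-repair run's token budget:

```python
import numpy as np

def normalized_pass_rate(repair_budget, repair_pass,
                         baseline_budgets, baseline_passes):
    """Divide a self-repair mean pass rate by the no-repair baseline's
    mean pass rate at the same token budget, linearly interpolating
    the baseline curve between measured budgets."""
    baseline_at_budget = np.interp(repair_budget, baseline_budgets, baseline_passes)
    # A value above 1 indicates self-repair beats the baseline at equal cost.
    return repair_pass / baseline_at_budget
```

For example, a self-repair run reaching a 0.50 pass rate at a budget where the interpolated baseline reaches 0.45 would score roughly 1.11.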
From the plots, we can see that for the GPT-3.5 model, the pass@t is lower than or equal to the corresponding baseline (black line) for all settings of $n_{p},n_{fr}$ , clearly showing that self-repair is not an effective strategy for GPT-3.5. On the other hand, for GPT-4, there are several values of $n_{p},n_{fr}$ for which the pass rate with self-repair is significantly better than that of the baseline. For example, with $n_{p}=10,n_{fr}=3$ the pass rate increases from 65% to 70%, and with $n_{p}=25,n_{fr}=1$ it increases from 65% to 71%.
Our experiments also show a clear trend with respect to the relationship between the hyper-parameters. Given a fixed number of feedback-repairs ( $n_{fr}$ ), increasing the number of initial programs ( $n_{p}$ ) (i.e., moving right along the x-axis on the heat maps) consistently leads to relative performance gains for both models. On the other hand, fixing $n_{p}$ and increasing $n_{fr}$ (i.e., moving up along the y-axis on the heat maps) does not appear to be worth the additional cost incurred, giving very marginal gains at higher budgets and even decreasing performance at lower budgets. This suggests that, given a fixed budget, the most important factor determining whether self-repair will lead to a correct program or not is the diversity of the base samples that are generated up-front, rather than the diversity of the repairs sampled. Having more initial samples increases the likelihood of there being at least one program which is close to the ideal program and, hence, can be successfully repaired.
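The intuition in the last sentence can be made concrete with a back-of-the-envelope model (our own simplification, not an analysis from the paper): if each base sample were independently "repairable" with some probability $q$, the chance that at least one of the $n_{p}$ initial samples is repairable grows rapidly with $n_{p}$:

```python
def p_at_least_one_repairable(q, n_p):
    """Probability that at least one of n_p i.i.d. base samples is
    'repairable', under the simplifying (hypothetical) assumption that
    each sample is repairable independently with probability q."""
    return 1.0 - (1.0 - q) ** n_p

# With q = 0.2: n_p = 1 gives 0.20, n_p = 5 gives ~0.67, n_p = 25 gives ~0.996.
```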
Since $n_{fr}=1$ is the best choice for the hyper-parameter $n_{fr}$ , we next isolate the effect of the number of initial programs, $n_{p}$ , by exploring a denser set of possible values: $(n_{p},n_{fr})∈\{1,2,\ldots,25\}×\{1\}$ . The plots are shown in Figure 5 for both $M_{P}=M_{F}∈\{\text{GPT-3.5},\text{GPT-4}\}$ and the baseline, no-repair approaches. Note that since $n_{fr}$ is fixed, in these plots there is a direct correlation between $n_{p}$ and the total number of tokens, $t$ . Again, we see that self-repair is not an effective strategy for GPT-3.5, but that it is effective for GPT-4—especially at larger token budgets ( $t ≥ 5000$ ), where it increases the pass rate by over 5 points.
(a) Mean pass rate vs. number of tokens generated. Black line is i.i.d. sampling without repair from GPT-4. Note that the error bars are often smaller than the markers; all settings have a standard deviation of less than 1.5 absolute points on the y-axis. Results truncated at $t=10,000$ .
(b) Normalized mean pass rate relative to the (interpolated) baseline at an equivalent budget (number of tokens). Cells for which the number of tokens generated exceeds 50 samples from the GPT-4 baseline marked O.O.B. (out of bounds).
Figure 4: Pass rate versus number of tokens generated for various settings of $n_{p}$ (number of initial programs) and $n_{fr}$ (number of repairs per failing program). GPT-4 is used for all samples, including the baseline.
4.2 GPT-4 feedback improves GPT-3.5 repair
Next, we conduct an experiment in which we evaluate the impact of using a separate, stronger model to generate the feedback. This is to test the hypothesis that self-repair is held back (especially for GPT-3.5) by the model’s inability to introspect and debug its own code.
For this experiment, we set $M_{P}$ = GPT-3.5 and $M_{F}$ = GPT-4 and vary the hyper-parameters as $(n_{p},n_{f},n_{r})∈\{1,2,\ldots,25\}×\{1\}×\{1\}$ , similarly to the previous experiment. Note that since we are now operating in a setting in which the feedback and repair stages must be separated, we have three hyper-parameters— $n_{p},n_{f},n_{r}$ —instead of two— $n_{p},n_{fr}$ (Section 3.1). To keep the computational budget tractable, and since the variance was seen to be very low in the previous experiment, we use $N_{f}=10$ instead of $N_{f}=25$ for this experiment (see Section 3.2).
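The decoupled setting can be sketched as below. The `sample_*` callables stand in for model calls and are stubs of our own devising; the paper's actual prompts and sampling details live in its Section 3 and appendices:

```python
def decoupled_self_repair(spec, run_tests,
                          sample_program, sample_feedback, sample_repair,
                          n_p, n_f, n_r):
    """Sketch of self-repair with separate feedback and repair models:
    draw n_p initial programs; for each failing one, draw n_f feedback
    strings; for each feedback string, draw n_r candidate repairs.
    Returns the first program passing all unit tests, or None."""
    for _ in range(n_p):
        program = sample_program(spec)
        error = run_tests(program)          # None if all unit tests pass
        if error is None:
            return program                  # initial sample already correct
        for _ in range(n_f):
            feedback = sample_feedback(spec, program, error)
            for _ in range(n_r):
                repair = sample_repair(spec, program, error, feedback)
                if run_tests(repair) is None:
                    return repair
    return None  # no candidate passed all unit tests
```

Setting `sample_feedback` to a stronger model than `sample_program`/`sample_repair` gives the mixed configuration studied in this experiment.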
The results for this experiment are shown in Figure 5 (bright blue line). We observe that, in terms of absolute performance, $M_{P}=$ GPT-3.5, $M_{F}=$ GPT-4 does break through the performance barrier, becoming marginally more efficient than i.i.d. sampling from GPT-3.5. This suggests that the textual feedback stage itself is of crucial importance, and that improving it relieves the bottleneck in GPT-3.5 self-repair.
Figure 5: Mean pass rate for each model when $n_{fr}$ (or $n_{f}$ and $n_{r}$ ) = 1. Shaded region is $± 1$ standard deviation. Complete breakdown per difficulty in Appendix A.
| Difficulty | GPT-4 feedback | Human feedback |
| --- | --- | --- |
| Introductory | 42.64% | 62.21% |
| Interview | 19.33% | 45.67% |
| Competition | 3.67% | 14.67% |
| Overall | 33.30% | 52.60% |
Figure 6: Success rate of repair with GPT-4’s explanations vs. with those of our human participants.
4.3 Human feedback significantly improves the success rate of GPT-4 repair
For our final experiment, we consider the effect of using an expert human programmer’s feedback when performing repair with stronger models such as GPT-4. The goal of this study is not to do a direct comparison between a human-in-the-loop approach vs. self-repair, since a human-in-the-loop approach imposes more cognitive burden, which we do not study. Instead, our goal is to understand how the model’s ability to identify mistakes in the code compares to that of a human, and how this affects downstream performance in self-repair. We thus conduct both qualitative and quantitative analyses of the impact of human feedback on self-repair.
Data collection methodology. We first sample 20 tasks $\{\psi_{i}\}_{i=1}^{20}$ from the APPS test set; to make the data collection process less time-consuming for the participants of the study, we skew the distribution towards easier tasks (14 introductory; 3 interview; 3 competition). For each task $\psi_{i}$ , we then sample two failing GPT-4 completions $p_{i,1},p_{i,2}$ , making for a total of $20· 2=40$ programs to refine. We recruit 16 participants, consisting of 15 graduate students and one professional machine learning engineer. Each participant is provided with five different base programs based on their level of experience with Python and competitive programming. Each program is taken from a distinct task; participants are never shown two different programs belonging to the same task. Participants are then asked to explain, in their own words, what the program is doing wrong. To reduce the cognitive load for participants, each program $p_{i,j}$ is accompanied by the error message $e_{i,j}$ and two feedback strings $f_{i,j,1},f_{i,j,2}$ sampled from GPT-4. We obtain these feedback strings by randomly sampling from the feedback-repair pairs used in the previous experiments and removing the code block. Note that each of the 40 programs is shown to two different participants, to reduce variance caused by participants' skill levels and writing styles. Participants were told to spend approximately one hour on the study overall, and were compensated with a $15 gift card. This human data collection was approved by our Institutional Review Board (IRB) and carried out exclusively through an online survey. See Appendix B for a complete copy of the instructions provided to our participants.
Quantitative Analysis. Having obtained two human-written pieces of feedback $h_{i,j,1},h_{i,j,2}$ for each program $p_{i,j}$ , we sample 25 repaired programs
$$
\{r_{l}\}_{l=1}^{25}\stackrel{\scriptstyle i.i.d.}{\sim}\text{GPT-4}(\psi_{i};p_{i,j};e_{i,j};f)
$$
for $f∈\{h_{i,j,1},h_{i,j,2},f_{i,j,1},f_{i,j,2}\}$ . That is: we ask GPT-4 to generate 25 candidate repairs for each program, conditioned on the specification, the initial program, and a feedback string which is either set to one of GPT-4’s own feedback strings or to one provided by a participant. Finally, we execute all of these candidate repairs against the test bed, and take note of how often they pass.
The results are summarized in Figure 6, with a complete task-by-task breakdown in Appendix C. We note first of all that the overall success rate increases by a factor of over $1.57×$ when we replace GPT-4's own debugging with that of our human participants. Perhaps unsurprisingly, the relative difference grows as the problems get harder, indicating that GPT-4's ability to produce accurate and useful feedback trails further behind our human participants' when the task (and code) becomes more complex.
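As a quick sanity check of the factor quoted above, the overall success rates reported in the table give:

```python
# Overall repair success rates (fraction of candidate repairs passing
# all unit tests), taken from the results table above.
gpt4_feedback_rate = 0.3330   # repairs conditioned on GPT-4's own feedback
human_feedback_rate = 0.5260  # repairs conditioned on human feedback

ratio = human_feedback_rate / gpt4_feedback_rate
print(round(ratio, 2))  # prints 1.58, i.e. an increase of over 1.57x
```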
Qualitative Analysis. In this section, we qualitatively analyze the difference between the feedback provided by the human participants and the feedback provided by GPT-4. We manually go through all of GPT-4’s and the participants’ feedback and note down whether the feedback: (a) seems, at a cursory glance, to be correct, or if it is obviously inaccurate; (b) explicitly suggests a small change to the code (e.g. "change the condition on line X"); (c) explicitly suggests a large change to the code (e.g. "frame the problem as min-cut instead of shortest-path"); (d) contains blocks of pseudocode or Python (which GPT-4’s feedback never does, per our experiment design); or (e) expresses uncertainty (using phrases such as "unsure", "it appears", etc.). We do not count individual single-line statements/expressions such as “ $x=5$ ” as pseudocode or Python. Examples of each category are shown in Appendix D. We find that
- Only 2/80 human-contributed feedback strings include pseudocode or explicit Python; that is, almost all human feedback we obtain is natural language interleaved with occasional single-statement math/code expressions.
- GPT-4’s feedback is much more likely to be obviously inaccurate (32/80 vs. 7/80 for human feedback).
- GPT-4 is more likely to explicitly suggest small changes (54/80 vs 42/80; 28/48 vs. 38/73 when seemingly correct), while our human participants show a slightly greater tendency to suggest high-level changes (23/80 vs. 18/80 for GPT-4; 21/73 vs. 13/48 when seemingly correct).
- Our human participants sometimes express uncertainty (7/80); GPT-4 never does (0/80).
This further analysis suggests that the results in Figure 6 are not due to artefacts such as our participants providing explicit code blocks which the model simply copies. Instead, the difference in performance appears to be caused by a combination of more accurate feedback, a greater ability to suggest high-level, large-scale changes to the code when needed, and our participants' ability to express their uncertainty (instead of confidently giving potentially inaccurate feedback).
5 Limitations
Firstly, to reduce computational cost, we pre-populate and then sub-sample from large repair trees, which introduces statistical bias. We mitigate this by taking a conservative view of the uncertainty in pass@t, using the maximum standard deviation observed across all points. We also note that this standard deviation, which is obtained at values of $(n_{p},n_{f},n_{r})$ small enough that our pre-populated repair trees contain very many samples of them, is very low ( $<2\%$ pass rate for all models). While these measures do not completely eliminate the risk of bias in our results, forgoing this amortization would have required significantly more compute.
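The amortization can be sketched as follows (an illustrative reconstruction with names of our own choosing, not the paper's code): instead of re-querying the model for every hyper-parameter setting, repeatedly sub-sample from a pre-populated pool of (token cost, passed) outcomes:

```python
import random

def estimate_pass_rate(pool, n_p, n_seeds=1000, seed=0):
    """Monte Carlo estimate of the probability that at least one of n_p
    samples drawn (without replacement) from a pre-populated pool of
    (token_count, passed) outcomes passes all unit tests."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_seeds):
        batch = rng.sample(pool, n_p)       # sub-sample the pre-populated pool
        hits += any(passed for _, passed in batch)
    return hits / n_seeds
```

The bias mentioned above arises because the pool is finite: sub-samples drawn without replacement from a fixed pool are not fully independent draws from the model itself.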
Secondly, we assume access to an executable suite of unit tests for each task. We do not, for example, require the model to extract tests from textual specifications. While this assumption may seem out of place in the era of chat-style assistants like ChatGPT [OpenAI, 2022], it does align well with established software engineering practices like Test-Driven Development [Astels, 2003]. Furthermore, techniques which automatically synthesize test cases given a specification [Li et al., 2022, Chen et al., 2023a] may relieve some of the user burden.
Finally, our study on human data did not track how much time the participants took to debug the programs. As a result, we can only evaluate the quality of the feedback (and the impact this has on repair). Further research at the intersection of HCI, AI, and program synthesis is needed to explore when and how human intervention should be leveraged, as well as how programming assistants should be designed to facilitate this style of interaction.
6 Broader Impact
Any tool that improves the productivity of people writing software will necessarily also increase the productivity of people writing software with malicious intent. It is also important to remember that research on LLMs comes at a very high environmental cost. Although we exclusively use publicly available pre-trained models in this work, and so do not train any models of our own, even inference comes with a significant carbon footprint at scale. At the same time, this work—in which we weigh model performance against the computational cost of obtaining it, and through which we learn more about when and how these models do and do not work—is a step towards more sample-efficient usage paradigms.
7 Conclusion
In this paper, we investigated the role of textual feedback in self-repair. We presented pass@t, a new evaluation strategy which takes the cost of carrying out repair into account, and then used this metric to show that (1) GPT-3.5 is not capable of carrying out self-repair on challenging coding tasks, and (2) while performance gains are seen in GPT-4, they are modest and rely on achieving sufficient diversity in the initial programs. Furthermore, by ablating the feedback stage we found that (3) substituting GPT-3.5’s feedback with GPT-4’s improved performance, even surpassing GPT-3.5’s baseline. Finally, we carried out an experiment with human participants, in which we found that (4) replacing GPT-4’s self-generated feedback with feedback provided by an experienced programmer increased the number of repaired programs which pass all unit tests by 57%.
Acknowledgments and Disclosure of Funding
T.X. Olausson is supported by the Defense Advanced Research Projects Agency (DARPA) under the ASKEM program, award HR00112220042. T.X. Olausson was also supported through a position at Microsoft Research for part of the time period during which this work was carried out. A. Solar-Lezama is supported by the National Science Foundation (NSF) and Intel Corporation through NSF Grant CCF:2217064. This work benefited greatly from discussion with several colleagues at Microsoft Research. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, the Defense Advanced Research Projects Agency, Intel Corporation, or Microsoft Research.
References
- Astels [2003] D. Astels. Test Driven Development: A Practical Guide. Prentice Hall Professional Technical Reference, 2003. ISBN 0131016490.
- Austin et al. [2021] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program Synthesis with Large Language Models, 2021. arXiv preprint arXiv:2108.07732. https://arxiv.org/abs/2108.07732.
- Bader et al. [2019] J. Bader, A. Scott, M. Pradel, and S. Chandra. Getafix: Learning to fix bugs automatically. Proc. ACM Program. Lang., 3(OOPSLA), Oct 2019. doi: 10.1145/3360585.
- Chen et al. [2023a] B. Chen, F. Zhang, A. Nguyen, D. Zan, Z. Lin, J.-G. Lou, and W. Chen. CodeT: Code generation with generated tests. In International Conference on Learning Representations, 2023a.
- Chen et al. [2021] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating Large Language Models Trained on Code, 2021. arXiv preprint arXiv:2107.03374. https://arxiv.org/abs/2107.03374.
- Chen et al. [2019a] X. Chen, C. Liu, and D. Song. Execution-Guided Neural Program Synthesis. In International Conference on Learning Representations, 2019a.
- Chen et al. [2023b] X. Chen, M. Lin, N. Schärli, and D. Zhou. Teaching Large Language Models to Self-Debug, 2023b. arXiv preprint arXiv:2304.05128. https://arxiv.org/abs/2304.05128.
- Chen et al. [2019b] Z. Chen, S. Kommrusch, M. Tufano, L.-N. Pouchet, D. Poshyvanyk, and M. Monperrus. SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair. IEEE Transactions on Software Engineering, 2019b.
- Chowdhery et al. [2022] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. PaLM: Scaling Language Modeling with Pathways, 2022. arXiv preprint arXiv:2204.02311. https://arxiv.org/abs/2204.02311.
- Ellis et al. [2021] K. Ellis, C. Wong, M. Nye, M. Sablé-Meyer, L. Morales, L. Hewitt, L. Cary, A. Solar-Lezama, and J. B. Tenenbaum. DreamCoder: Bootstrapping Inductive Program Synthesis with Wake-Sleep Library Learning. In The International Conference on Programming Language Design and Implementation, 2021.
- Fried et al. [2023] D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, W.-t. Yih, L. Zettlemoyer, and M. Lewis. InCoder: A generative model for code infilling and synthesis. In International Conference on Learning Representations, 2023.
- Gulwani et al. [2017] S. Gulwani, O. Polozov, and R. Singh. Program Synthesis. Foundations and Trends® in Programming Languages Series. Now Publishers, 2017. ISBN 9781680832921.
- Gupta et al. [2020] K. Gupta, P. E. Christensen, X. Chen, and D. Song. Synthesize, Execute and Debug: Learning to Repair for Neural Program Synthesis. In Advances in Neural Information Processing Systems, 2020.
- Hendrycks et al. [2021] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt. Measuring Coding Challenge Competence With APPS. In Advances in Neural Information Processing Systems, 2021.
- Inala et al. [2022] J. P. Inala, C. Wang, M. Yang, A. Codas, M. Encarnación, S. Lahiri, M. Musuvathi, and J. Gao. Fault-Aware Neural Code Rankers. In Advances in Neural Information Processing Systems, 2022.
- Kitzelmann [2010] E. Kitzelmann. Inductive Programming: A Survey of Program Synthesis Techniques. In Approaches and Applications of Inductive Programming: Third International Workshop, 2010.
- Kulal et al. [2019] S. Kulal, P. Pasupat, K. Chandra, M. Lee, O. Padon, A. Aiken, and P. S. Liang. SPoC: Search-based Pseudocode to Code. In Advances in Neural Information Processing Systems, 2019.
- Le et al. [2022] H. Le, Y. Wang, A. D. Gotmare, S. Savarese, and S. C. H. Hoi. CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning. In Advances in Neural Information Processing Systems, 2022.
- Le Goues et al. [2021] C. Le Goues, M. Pradel, A. Roychoudhury, and S. Chandra. Automatic Program Repair. IEEE Software, 38(4):22–27, July 2021. ISSN 0740-7459. doi: 10.1109/MS.2021.3072577.
- Li et al. [2023] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, et al. StarCoder: may the source be with you!, 2023. arXiv preprint arXiv:2305.06161. https://arxiv.org/abs/2305.06161.
- Li et al. [2022] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago, et al. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022. doi: 10.1126/science.abq1158.
- Long and Rinard [2016] F. Long and M. Rinard. Automatic Patch Generation by Learning Correct Code. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2016.
- Madaan et al. [2023] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al. Self-Refine: Iterative Refinement with Self-Feedback, 2023. arXiv preprint arXiv:2303.17651. https://arxiv.org/abs/2303.17651.
- Mesbah et al. [2019] A. Mesbah, A. Rice, E. Johnston, N. Glorioso, and E. Aftandilian. DeepDelta: Learning to Repair Compilation Errors. In Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019.
- Nijkamp et al. [2023] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. In International Conference on Learning Representations, 2023.
- OpenAI [2022] OpenAI. Introducing ChatGPT, 2022. Blog post. https://openai.com/blog/chatgpt [Accessed 5/17/2023].
- OpenAI [2023] OpenAI. GPT-4 Technical Report, 2023. arXiv preprint arXiv:2303.08774. https://arxiv.org/abs/2303.08774.
- Ouyang et al. [2022] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022.
- Peng et al. [2023] B. Peng, M. Galley, P. He, H. Cheng, Y. Xie, Y. Hu, Q. Huang, L. Liden, Z. Yu, W. Chen, and J. Gao. Check your facts and try again: Improving large language models with external knowledge and automated feedback, 2023. arXiv preprint arXiv:2302.12813. https://arxiv.org/abs/2302.12813.
- Polozov and Gulwani [2015] O. Polozov and S. Gulwani. FlashMeta: A Framework for Inductive Program Synthesis. In ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, 2015.
- Shi et al. [2022] F. Shi, D. Fried, M. Ghazvininejad, L. Zettlemoyer, and S. I. Wang. Natural Language to Code Translation with Execution. In Empirical Methods in Natural Language Processing, 2022.
- Touvron et al. [2023] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models, 2023. arXiv preprint arXiv:2302.13971. https://arxiv.org/abs/2302.13971.
- Wang et al. [2018] K. Wang, R. Singh, and Z. Su. Dynamic Neural Program Embedding for Program Repair. In International Conference on Learning Representations, 2018.
- Yasunaga and Liang [2020] M. Yasunaga and P. Liang. Graph-based, Self-supervised Program Repair from Diagnostic Feedback. In International Conference on Machine Learning, 2020.
- Yasunaga and Liang [2021] M. Yasunaga and P. Liang. Break-It-Fix-It: Unsupervised Learning for Program Repair. In International Conference on Machine Learning, 2021.
- Zhang et al. [2023] K. Zhang, Z. Li, J. Li, G. Li, and Z. Jin. Self-Edit: Fault-Aware Code Editor for Code Generation, 2023. arXiv preprint arXiv:2305.04087. https://arxiv.org/abs/2305.04087.
- Zhang et al. [2022] T. Zhang, T. Yu, T. B. Hashimoto, M. Lewis, W.-t. Yih, D. Fried, and S. I. Wang. Coder Reviewer Reranking for Code Generation, 2022. arXiv preprint arXiv:2211.16490. https://arxiv.org/abs/2211.16490.
Appendix A Self-Repair Results Per Difficulty
*(Figure panel x8.png — introductory problems: line chart of mean pass rate vs. mean number of tokens generated; one line per number of initial programs n_p ∈ {1, 2, 5, 10, 25}, with markers for the number of feedback-repairs n_fr ∈ {1, 3, 5, 10}. Pass rates rise steeply at small token budgets and plateau around 0.85–0.91, with larger n_p giving higher pass rates and smaller error bars.)*
*(Figure panel x9.png — introductory problems: heatmap over the number of initial programs n_p (columns) and the number of feedback-repairs n_fr (rows); "O.O.B." marks settings outside the evaluated range. Cell values as shown in the figure:)*

| n_fr \ n_p | 1 | 2 | 5 | 10 | 25 |
|---|---|---|---|---|---|
| 1 | 0.92 | 0.97 | 1.00 | 1.01 | 1.01 |
| 3 | 0.88 | 0.94 | 0.99 | 1.00 | O.O.B. |
| 5 | 0.87 | 0.94 | 0.98 | 1.00 | O.O.B. |
| 10 | 0.87 | 0.93 | 0.97 | O.O.B. | O.O.B. |
*(Figure panel x10.png — interview problems: line chart of mean pass rate vs. mean number of tokens generated with shaded confidence intervals; one line per n_p ∈ {1, 2, 5, 10, 25}, markers for n_fr ∈ {1, 3, 5, 10}. Pass rates climb from roughly 0.2–0.4 and plateau around 0.5–0.6, with higher n_p consistently above lower n_p.)*
*(Figure panel x11.png — interview problems: heatmap over the number of initial programs n_p (columns) and the number of feedback-repairs n_fr (rows); "O.O.B." marks settings outside the evaluated range. Cell values as shown in the figure:)*

| n_fr \ n_p | 1 | 2 | 5 | 10 | 25 |
|---|---|---|---|---|---|
| 1 | 0.84 | 0.89 | 0.93 | 0.97 | O.O.B. |
| 3 | 0.77 | 0.84 | 0.91 | 0.98 | O.O.B. |
| 5 | 0.75 | 0.82 | 0.91 | O.O.B. | O.O.B. |
| 10 | 0.73 | 0.82 | O.O.B. | O.O.B. | O.O.B. |
*(Figure panel x12.png — competition problems: line chart of mean pass rate vs. mean number of tokens generated with error bars; one line per n_p ∈ {1, 2, 5, 10, 25}, markers for n_fr ∈ {1, 3, 5, 10}. Pass rates stay low, rising from roughly 0.05 to 0.15–0.19, with substantial variability across settings.)*
*(Figure panel x13.png — competition problems: heatmap over the number of initial programs n_p (columns) and the number of feedback-repairs n_fr (rows); "O.O.B." marks settings outside the evaluated range. Cell values as shown in the figure:)*

| n_fr \ n_p | 1 | 2 | 5 | 10 | 25 |
|---|---|---|---|---|---|
| 1 | 0.87 | 0.93 | 1.05 | 1.13 | O.O.B. |
| 3 | 0.81 | 0.91 | 1.08 | O.O.B. | O.O.B. |
| 5 | 0.79 | 0.91 | 1.09 | O.O.B. | O.O.B. |
| 10 | 0.78 | 0.93 | O.O.B. | O.O.B. | O.O.B. |
Figure 7: GPT-3.5 results from Figure 3 (Section 4.1) per difficulty (row), from top to bottom: introductory, interview, and competition.
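The heatmaps above report a single scalar per (n<sub>p</sub>, n<sub>fr</sub>) cell, clustered around 1.0, with "O.O.B." where no value is reported. As a hedged illustration of how such a table might be filled in, the toy helper below normalizes a setting's mean pass rate by a baseline value and emits "O.O.B." when the setting's mean token cost exceeds an assumed evaluation budget; the budget, function names, and normalization here are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch of filling one cell of a relative pass-rate heatmap.
# TOKEN_BUDGET and the normalization against a baseline are assumptions
# made for illustration, not the paper's exact definition.

TOKEN_BUDGET = 10_000  # assumed evaluation budget (tokens per task)

def heatmap_cell(repair_pass_rate: float, repair_tokens: float,
                 baseline_pass_rate: float) -> str:
    """Return the normalized cell value, or 'O.O.B.' if over budget."""
    if repair_tokens > TOKEN_BUDGET:
        return "O.O.B."
    return f"{repair_pass_rate / baseline_pass_rate:.2f}"

# Example: a setting passing 52% of tasks vs. a 46% baseline -> "1.13"
print(heatmap_cell(0.52, 8_200, 0.46))
```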
<details>
<summary>x14.png Details</summary>

### Visual Description

Line chart of mean pass rate (y-axis, 0.0–1.0) against mean number of tokens generated (x-axis, 0–10000). Colored lines correspond to the number of initial programs n<sub>p</sub> ∈ {1, 2, 5, 10, 25}; marker shapes correspond to the number of feedback-repairs n<sub>fr</sub> ∈ {1, 3, 5, 10}. All curves rise steeply and then plateau after roughly 2000 tokens, with larger n<sub>p</sub> consistently higher: the n<sub>p</sub> = 1 curve plateaus near 0.90 and the n<sub>p</sub> = 25 curve near 0.98.
</details>
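The pass-rate-vs-tokens curves in these figures trade accuracy against sampling budget. As a sketch of how such a curve can be constructed from i.i.d. samples (not necessarily the paper's exact pass@t procedure), the standard unbiased pass@k estimator can be paired with the mean token cost of drawing k samples; the sample counts and token cost below are made up.

```python
# Sketch: build a pass-rate-vs-tokens curve from n i.i.d. samples per task,
# c of which pass, using the standard unbiased pass@k estimator. The inputs
# (50 samples, 10 passing, ~350 tokens each) are illustrative only.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k draws (without replacement)
    from n samples, c of which pass, is a passing sample."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_rate_vs_tokens(n, c, mean_tokens, ks=(1, 2, 5, 10, 25)):
    """Pair each sample budget k with (mean tokens spent, estimated pass rate)."""
    return [(k * mean_tokens, pass_at_k(n, c, k)) for k in ks]

curve = pass_rate_vs_tokens(50, 10, 350)
print(curve[0][0], round(curve[0][1], 3))  # -> 350 0.2
```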
<details>
<summary>x15.png Details</summary>

### Visual Description

## Heatmap: Feedback-Repairs vs. Initial Programs

Heatmap of values by number of feedback-repairs (n<sub>fr</sub>, y-axis: 1, 3, 5, 10) and number of initial programs (n<sub>p</sub>, x-axis: 1, 2, 5, 10, 25); "O.O.B." marks cells without a valid value.

| n<sub>fr</sub> \ n<sub>p</sub> | 1 | 2 | 5 | 10 | 25 |
|---|---|---|---|---|---|
| 1 | 1.05 | 1.04 | 1.04 | 1.04 | 1.04 |
| 3 | 1.02 | 1.03 | 1.04 | 1.04 | 1.04 |
| 5 | 1.00 | 1.02 | 1.03 | 1.03 | O.O.B. |
| 10 | 0.98 | 1.01 | 1.02 | 1.03 | O.O.B. |

Values cluster tightly around 1.0 (0.98–1.05), increasing mildly with n<sub>p</sub> and decreasing mildly with n<sub>fr</sub>; only the combinations n<sub>fr</sub> ∈ {5, 10}, n<sub>p</sub> = 25 are O.O.B.
</details>
<details>
<summary>x16.png Details</summary>

### Visual Description

## Line Chart: Mean Pass Rate vs. Mean Number of Tokens Generated

Line chart of mean pass rate (y-axis, 0.0–1.0) against mean number of tokens generated (x-axis, 0–10000). Colored lines correspond to n<sub>p</sub> ∈ {1, 2, 5, 10, 25}, with shaded bands indicating uncertainty around each mean; point colors correspond to n<sub>fr</sub> ∈ {1, 3, 5, 10}. All curves rise quickly and then flatten; larger n<sub>p</sub> is consistently higher, running from roughly 0.42 at the origin and 0.72 at 8000 tokens for n<sub>p</sub> = 1 up to roughly 0.60 and 0.78 for n<sub>p</sub> = 25.
</details>
<details>
<summary>x17.png Details</summary>

### Visual Description

## Heatmap: Feedback-Repairs vs. Initial Programs

Heatmap of values by number of feedback-repairs (n<sub>fr</sub>, y-axis: 1, 3, 5, 10) and number of initial programs (n<sub>p</sub>, x-axis: 1, 2, 5, 10, 25), shaded from orange (low) to dark green (high); "O.O.B." marks cells without a valid value.

| n<sub>fr</sub> \ n<sub>p</sub> | 1 | 2 | 5 | 10 | 25 |
|---|---|---|---|---|---|
| 1 | 0.96 | 0.99 | 1.03 | 1.05 | 1.09 |
| 3 | 0.91 | 0.97 | 1.04 | 1.08 | O.O.B. |
| 5 | 0.89 | 0.96 | 1.04 | 1.09 | O.O.B. |
| 10 | 0.88 | 0.97 | 1.06 | O.O.B. | O.O.B. |

Values range from 0.88 to 1.09, increasing with n<sub>p</sub> at every fixed n<sub>fr</sub>; the largest combinations of both parameters are O.O.B.
</details>
<details>
<summary>x18.png Details</summary>

### Visual Description

## Line Chart: Mean Pass Rate vs. Mean Number of Tokens Generated

Line chart of mean pass rate (y-axis, 0.0–1.0) against mean number of tokens generated (x-axis, 0–10000). Colored lines correspond to n<sub>p</sub> ∈ {1, 2, 5, 10, 25}, marker shapes to n<sub>fr</sub> ∈ {1, 3, 5, 10}, with a shaded confidence band around the n<sub>p</sub> = 25 line. All curves trend upward with diminishing returns, running from roughly 0.18 at the origin to 0.38 at 8000 tokens for n<sub>p</sub> = 1, and from roughly 0.25 to 0.50 for n<sub>p</sub> = 25, which is highest throughout.
</details>
<details>
<summary>x19.png Details</summary>

### Visual Description

## Heatmap: Feedback-Repairs vs. Initial Programs

Heatmap of values by number of feedback-repairs (n<sub>fr</sub>, y-axis: 1, 3, 5, 10) and number of initial programs (n<sub>p</sub>, x-axis: 1, 2, 5, 10, 25), shaded from red/orange (low) to green (high); "O.O.B." marks cells without a valid value.

| n<sub>fr</sub> \ n<sub>p</sub> | 1 | 2 | 5 | 10 | 25 |
|---|---|---|---|---|---|
| 1 | 1.01 | 1.05 | 1.10 | 1.13 | 1.19 |
| 3 | 0.85 | 0.97 | 1.10 | 1.16 | O.O.B. |
| 5 | 0.83 | 0.97 | 1.10 | O.O.B. | O.O.B. |
| 10 | 0.81 | 0.97 | O.O.B. | O.O.B. | O.O.B. |

Values range from 0.81 (n<sub>fr</sub> = 10, n<sub>p</sub> = 1) to 1.19 (n<sub>fr</sub> = 1, n<sub>p</sub> = 25), increasing with n<sub>p</sub> at fixed n<sub>fr</sub> and decreasing with n<sub>fr</sub> at small n<sub>p</sub>; the upper-right combinations are O.O.B.
</details>
Figure 8: GPT-4 results from Figure 4 (Section 4.1) per difficulty (row), from top to bottom: introductory, interview, and competition.
<details>
<summary>x20.png Details</summary>

### Visual Description
\n
## Line Chart: Pass Rate vs. Tokens Generated
### Overview
This line chart depicts the relationship between the mean number of tokens generated and the mean pass rate for different configurations of language models (GPT-3.5 and GPT-4) with and without a "repair" step (denoted by *M<sub>P</sub>* and *M<sub>F</sub>*). The chart aims to illustrate how the performance (pass rate) of these models changes as they generate more text.
### Components/Axes
* **X-axis:** "Mean number of tokens generated". Scale ranges from 0 to 10000, with tick marks at 0, 2000, 4000, 6000, 8000, and 10000.
* **Y-axis:** "Mean pass rate". Scale ranges from 0.0 to 1.0, with tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **Legend:** Located in the bottom-left corner of the chart. Contains the following labels and corresponding line colors:
* Black: *M<sub>P</sub>* = GPT-4 (no repair)
* Light Blue: *M<sub>P</sub>* = GPT-4; *M<sub>F</sub>* = GPT-4
* Gray: *M<sub>P</sub>* = GPT-3.5 (no repair)
* Orange: *M<sub>P</sub>* = GPT-3.5; *M<sub>F</sub>* = GPT-3.5
* Purple: *M<sub>P</sub>* = GPT-3.5; *M<sub>F</sub>* = GPT-4
### Detailed Analysis
The chart displays five distinct lines, each representing a different model configuration.
* **Black Line (*M<sub>P</sub>* = GPT-4 (no repair)):** This line starts at approximately 0.72 and rises rapidly to around 0.85 by 2000 tokens. It then plateaus, reaching approximately 0.88 by 4000 tokens and remaining relatively stable up to 10000 tokens.
* **Light Blue Line (*M<sub>P</sub>* = GPT-4; *M<sub>F</sub>* = GPT-4):** This line begins at approximately 0.78 and exhibits a similar upward trend to the black line, but starts at a higher pass rate. It reaches approximately 0.90 by 2000 tokens and plateaus around 0.92-0.93 from 4000 to 10000 tokens.
* **Gray Line (*M<sub>P</sub>* = GPT-3.5 (no repair)):** This line starts at approximately 0.65 and increases more gradually than the GPT-4 lines. It reaches around 0.80 by 4000 tokens and continues to rise, reaching approximately 0.84 by 8000 tokens.
* **Orange Line (*M<sub>P</sub>* = GPT-3.5; *M<sub>F</sub>* = GPT-3.5):** This line begins at approximately 0.70 and shows a faster initial increase than the gray line. It reaches around 0.85 by 4000 tokens and plateaus around 0.87-0.88 from 6000 to 10000 tokens.
* **Purple Line (*M<sub>P</sub>* = GPT-3.5; *M<sub>F</sub>* = GPT-4):** This line starts at approximately 0.72 and exhibits a similar trend to the orange line, but with a slightly higher initial pass rate. It reaches approximately 0.88 by 4000 tokens and plateaus around 0.90-0.91 from 6000 to 10000 tokens.
### Key Observations
* GPT-4 consistently outperforms GPT-3.5 across all token ranges, regardless of whether a repair step is used.
* The "repair" step (using a different model for finalization, *M<sub>F</sub>*) generally improves the pass rate, particularly for GPT-3.5.
* All lines exhibit diminishing returns in pass rate as the number of tokens generated increases. The most significant improvements occur within the first 2000-4000 tokens.
* The GPT-4 model with repair (*M<sub>P</sub>* = GPT-4; *M<sub>F</sub>* = GPT-4) achieves the highest pass rates, consistently above 0.90 after 2000 tokens.
### Interpretation
The data suggests that GPT-4 generates programs that pass the unit tests more reliably than GPT-3.5. The inclusion of a repair step, in which failing programs are explained and then fixed, further enhances performance, especially for GPT-3.5. The diminishing returns observed at higher token counts indicate that additional samples are increasingly spent on the hardest remaining tasks, so each extra token buys less improvement.
The *M<sub>P</sub>* and *M<sub>F</sub>* notation marks a two-stage process: *M<sub>P</sub>* is the model that generates (and repairs) the programs, while *M<sub>F</sub>* is the model that provides feedback on why a failing program is wrong. The fact that using GPT-4 for both stages (*M<sub>P</sub>* = GPT-4; *M<sub>F</sub>* = GPT-4) yields the best results indicates that GPT-4 is both the stronger generator and the stronger source of feedback. The improvement seen when GPT-4 gives feedback on GPT-3.5's programs (*M<sub>P</sub>* = GPT-3.5; *M<sub>F</sub>* = GPT-4) highlights the benefit of pairing a weaker code model with a stronger feedback model.
</details>
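The two-stage generate-and-repair workflow measured by these *M<sub>P</sub>*/*M<sub>F</sub>* configurations can be sketched as one round of generate → test → feedback → repair. This is a hedged illustration only: `generate`, `run_tests`, `feedback`, and `repair` are hypothetical stand-ins for calls to the program model, the test harness, and the feedback model, not the paper's actual implementation.

```python
def self_repair(spec, unit_tests, generate, run_tests, feedback, repair):
    """One round of the generate -> test -> feedback -> repair loop.

    `generate` and `repair` stand in for calls to the program model (M_P),
    `feedback` for the feedback model (M_F), and `run_tests` for executing
    the candidate program against the provided unit tests.
    """
    program = generate(spec)                      # sample an initial program
    ok, error = run_tests(program, unit_tests)
    if ok:
        return program                            # passed; no repair needed
    explanation = feedback(spec, program, error)  # explain why it failed
    return repair(spec, program, explanation)     # produce a fixed version
```

In the paper's experiments the repair stage and the initial generation stage are served by the program model, while the explanation comes from the (possibly different) feedback model.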
<details>
<summary>x21.png Details</summary>

### Visual Description
## Line Chart: Pass Rate vs. Tokens Generated for Different Models
### Overview
This line chart depicts the relationship between the mean number of tokens generated and the mean pass rate for several configurations of GPT models (GPT-4 and GPT-3.5), with and without a repair step; *M<sub>P</sub>* denotes the program model and *M<sub>F</sub>* the feedback model. The chart shows how the pass rate of each configuration changes as more tokens are sampled.
### Components/Axes
* **X-axis:** "Mean number of tokens generated". Scale ranges from 0 to 10000, with tick marks at 0, 2000, 4000, 6000, 8000, and 10000.
* **Y-axis:** "Mean pass rate". Scale ranges from 0.0 to 1.0, with tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **Legend:** Located in the top-right corner of the chart. Contains the following entries:
* Black line: *M<sub>P</sub>* = GPT-4 (no repair)
* Light blue line: *M<sub>P</sub>* = GPT-4; *M<sub>F</sub>* = GPT-4
* Gray line: *M<sub>P</sub>* = GPT-3.5 (no repair)
* Orange line: *M<sub>P</sub>* = GPT-3.5; *M<sub>F</sub>* = GPT-3.5
* Teal line: *M<sub>P</sub>* = GPT-3.5; *M<sub>F</sub>* = GPT-4
### Detailed Analysis
The chart displays five distinct lines, each representing a different model configuration.
* **GPT-4 (no repair):** (Black line) Starts at approximately 0.15 pass rate at 0 tokens, and increases rapidly to approximately 0.6 pass rate at 2000 tokens. The line then plateaus, reaching approximately 0.63 pass rate at 10000 tokens.
* **GPT-4; GPT-4:** (Light blue line) Starts at approximately 0.2 pass rate at 0 tokens, and increases steadily to approximately 0.58 pass rate at 2000 tokens. It continues to increase, reaching approximately 0.6 pass rate at 10000 tokens.
* **GPT-3.5 (no repair):** (Gray line) Starts at approximately 0.1 pass rate at 0 tokens, and increases to approximately 0.45 pass rate at 2000 tokens. The line then plateaus, reaching approximately 0.48 pass rate at 10000 tokens.
* **GPT-3.5; GPT-3.5:** (Orange line) Starts at approximately 0.1 pass rate at 0 tokens, and increases to approximately 0.5 pass rate at 2000 tokens. It continues to increase, reaching approximately 0.53 pass rate at 10000 tokens.
* **GPT-3.5; GPT-4:** (Teal line) Starts at approximately 0.15 pass rate at 0 tokens, and increases to approximately 0.5 pass rate at 2000 tokens. It continues to increase, reaching approximately 0.55 pass rate at 10000 tokens.
All lines exhibit an initial steep increase in pass rate as the number of tokens generated increases, followed by a flattening of the curve as the number of tokens grows larger.
### Key Observations
* GPT-4 consistently outperforms GPT-3.5 across all configurations, as indicated by its higher pass rates.
* The "no repair" configurations for both GPT-4 and GPT-3.5 have lower initial pass rates compared to configurations with a repair step.
* The feedback step (*M<sub>F</sub>*) appears to improve the pass rate, particularly at lower token counts.
* Using GPT-4 to provide feedback (*M<sub>F</sub>* = GPT-4) on programs from GPT-3.5 (*M<sub>P</sub>* = GPT-3.5) results in a higher pass rate than using GPT-3.5 for both roles.
### Interpretation
The data suggests that GPT-4 generates programs that pass the unit tests more reliably than GPT-3.5. The repair step, driven by feedback from *M<sub>F</sub>*, improves the quality of the generated code. The combination of GPT-3.5 for program generation and GPT-4 for feedback is particularly effective, demonstrating that pairing models of different strengths can improve performance. The flattening of the curves at higher token counts shows diminishing returns: early samples solve the easier tasks quickly, while the remaining tasks are hard enough that further sampling helps little. In this notation, *M<sub>P</sub>* is the model that generates the programs and *M<sub>F</sub>* is the model that provides feedback used to repair them.
</details>
<details>
<summary>x22.png Details</summary>

### Visual Description
## Line Chart: Pass Rate vs. Tokens Generated for Different GPT Models
### Overview
This line chart depicts the relationship between the mean number of tokens generated and the mean pass rate for several configurations of GPT models. The chart compares GPT-4 and GPT-3.5, with and without a repair mechanism; *M<sub>P</sub>* denotes the program model and *M<sub>F</sub>* the feedback model. Shaded areas around each line represent uncertainty, likely standard deviation or confidence intervals.
### Components/Axes
* **X-axis:** "Mean number of tokens generated" - Scale ranges from 0 to 10000, with tick marks at 0, 2000, 4000, 6000, 8000, and 10000.
* **Y-axis:** "Mean pass rate" - Scale ranges from 0.0 to 1.0, with tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **Legend:** Located in the top-right corner of the chart. Contains the following labels and corresponding colors:
* Black: *M<sub>P</sub>* = GPT-4 (no repair)
* Gray: *M<sub>P</sub>* = GPT-4; *M<sub>F</sub>* = GPT-4
* Light Blue: *M<sub>P</sub>* = GPT-3.5 (no repair)
* Orange: *M<sub>P</sub>* = GPT-3.5; *M<sub>F</sub>* = GPT-3.5
* Blue: *M<sub>P</sub>* = GPT-3.5; *M<sub>F</sub>* = GPT-4
### Detailed Analysis
Here's a breakdown of each line's trend and approximate data points, verified against the legend colors:
* **GPT-4 (no repair) - Black Line:** The line starts at approximately (0, 0.05) and slopes upward, leveling off around (10000, 0.42). The shaded area indicates uncertainty, ranging from approximately 0.35 to 0.48 at 10000 tokens.
* **GPT-4; GPT-4 - Gray Line:** This line begins at approximately (0, 0.03) and exhibits a similar upward trend to GPT-4 (no repair), but remains consistently lower. It plateaus around (10000, 0.38), with a shaded uncertainty range of approximately 0.32 to 0.44.
* **GPT-3.5 (no repair) - Light Blue Line:** Starting at approximately (0, 0.02), this line shows a moderate upward slope, but is significantly lower than the GPT-4 lines. It reaches approximately (10000, 0.22), with a shaded uncertainty range of approximately 0.18 to 0.26.
* **GPT-3.5; GPT-3.5 - Orange Line:** This line begins at approximately (0, 0.01) and increases slowly, remaining the lowest of all lines. It plateaus around (10000, 0.18), with a shaded uncertainty range of approximately 0.14 to 0.22.
* **GPT-3.5; GPT-4 - Blue Line:** Starting at approximately (0, 0.04), this line shows a moderate upward slope, and is higher than the GPT-3.5; GPT-3.5 line, but lower than the GPT-4 lines. It plateaus around (10000, 0.35), with a shaded uncertainty range of approximately 0.30 to 0.40.
### Key Observations
* GPT-4 consistently outperforms GPT-3.5 across all token generation ranges.
* The impact of the feedback model (*M<sub>F</sub>*) varies with the pairing: GPT-4 feedback on GPT-3.5 programs (*M<sub>P</sub>* = GPT-3.5; *M<sub>F</sub>* = GPT-4) significantly improves the pass rate compared to GPT-3.5 alone.
* The uncertainty bands suggest that the pass rates are not highly predictable, and there is considerable variation in performance.
* The pass rate appears to plateau as the number of tokens generated increases, indicating diminishing returns.
### Interpretation
The data suggests that GPT-4 generates passing programs more reliably than GPT-3.5. Feedback from *M<sub>F</sub>* is most valuable when GPT-4 critiques GPT-3.5's programs (*M<sub>P</sub>* = GPT-3.5; *M<sub>F</sub>* = GPT-4), which substantially narrows the gap to GPT-4 alone. The plateauing of the pass rate with increasing tokens indicates diminishing returns from additional sampling, and the uncertainty bands highlight the inherent stochasticity of these models: even with identical inputs and settings, output quality varies.
In this notation, *M<sub>P</sub>* is the model that generates the programs and *M<sub>F</sub>* is the model that provides feedback on failing programs. The results suggest that using a more powerful model (GPT-4) for the feedback stage can significantly improve the overall pass rate, even when the programs come from a less powerful model (GPT-3.5).
</details>
Figure 9: Results from Figure 6 (Section 4.2) per difficulty (row), from top to bottom: introductory, interview, and competition.
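The pass-rate-versus-tokens curves in the panels above can be reproduced in spirit from per-task sampling logs. A minimal sketch, assuming each task records `(tokens, passed)` pairs in the order samples were drawn; the function names are illustrative, not taken from the paper:

```python
def tokens_to_solve(samples):
    """Cumulative tokens spent on one task until its first passing sample,
    or None if no sample in the log passes."""
    spent = 0
    for tokens, passed in samples:
        spent += tokens
        if passed:
            return spent
    return None

def pass_rate_at_budget(task_logs, budget):
    """Fraction of tasks solved within a per-task token budget: the y-value
    of a pass-rate curve at x = budget."""
    costs = [tokens_to_solve(log) for log in task_logs]
    return sum(c is not None and c <= budget for c in costs) / len(task_logs)
```

Sweeping `budget` over a grid of token counts yields one curve per model configuration, which can then be compared as in the figure.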
Appendix B Human Experiment: Study Instructions
For our study on human data, participants were given a slide deck with instructions. The following ten images show the instructions, which include an example of a task shown to a participant:
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/1.jpg Details</summary>

### Visual Description
## Document: Task Instructions
### Overview
The image presents a set of instructions for a task involving debugging Python programs. The instructions are formatted as a bulleted list with nested sub-points. The document outlines setup requirements, the task itself, and the format for submitting answers.
### Components/Axes
The document consists of a title "Tasks" and three main sections: "Setup", "Task", and "Task format". Each section is further broken down into sub-points. There are no axes or charts present.
### Detailed Analysis or Content Details
Here's a transcription of the text content:
* **Tasks**
* **Setup:**
* Use a laptop or desktop computer, not a phone.
* **Task: Debug five incorrect Python programs**
* Each program is an incorrect attempt to solve a coding challenge.
* Your answer should explain **what the program is doing wrong** (text is underlined and in blue).
* Expect ~10 minutes per task.
* **Task format**
* Each task is in a separate website.
* Submit your answer using the Google form embedded in each page.
* No other data is being collected (text is underlined and in blue).
### Key Observations
The instructions emphasize the use of a computer (laptop or desktop) and the need to explain *why* a program is incorrect, not just *how* to fix it. The estimated time per task is approximately 10 minutes. The use of Google Forms for submission is specified, and data collection is limited. The phrases "what the program is doing wrong" and "No other data is being collected" are highlighted in blue and underlined.
### Interpretation
The document is a straightforward set of instructions for a debugging exercise. The emphasis on explaining the *reason* for the error suggests that the task is designed to assess understanding of programming concepts rather than just the ability to find and fix bugs. The limited data collection statement is likely included to address privacy concerns and reassure participants that their responses will be used solely for the purpose of the exercise. The use of a separate website for each task likely aims to provide a focused environment for each debugging challenge. The instructions are clear and concise, making them easy to follow.
</details>
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/2.jpg Details</summary>

### Visual Description
## Text Block: Instructions for Program Explanation
### Overview
The image contains a text block providing instructions for explaining a program, including guidance on identifying errors and suggesting improvements. It also includes example answers demonstrating the expected level of detail.
### Components/Axes
There are no axes or components in the traditional sense. The content is structured using bullet points and indentation to indicate hierarchy.
### Detailed Analysis or Content Details
The text block can be transcribed as follows:
**Your Answer**
* Your answer should briefly explain what the program is doing wrong. If it helps you explain your thoughts, you can also say what you would do differently.
* Can be precise: “the formula used to calculate X on line 5 is wrong, it should be…”
* Or high-level: “the program is treating the task as a min-cut graph problem, but it is actually shortest-path… it could be rewritten using Dijkstra’s algorithm…”
* **Example answers:**
* The problem description states that numbers which start or end with zeros (such as ‘010’ and ‘00’) are NOT considered valid numerical palindromes. However, the code above does not take this into account and therefore returns ‘00’ as a valid palindrome.
* The main issue with the provided code is that it only considers direct subordinates when trying to find the k-th officer in the command spreading sequence. However, the problem asks for the order in which officers receive the command, including indirect subordinates. To fix this, we need to traverse the tree of officers and construct the command spreading sequence before finding the k-th element.
* We are not collecting any data about how you use the website. Only your submitted answer is recorded.
* Feel free to use external tools: pen and paper, a Python IDE, etc!
### Key Observations
The text emphasizes the importance of both precise and high-level explanations when identifying program errors. The example answers demonstrate a focus on aligning the code's behavior with the problem description's constraints.
### Interpretation
The text is a set of instructions designed to elicit thoughtful and detailed responses regarding program analysis. It encourages a multi-faceted approach to debugging, encompassing both pinpointing specific errors (e.g., incorrect formulas) and identifying broader conceptual misunderstandings (e.g., applying the wrong algorithm). The inclusion of example answers serves as a benchmark for the expected quality and depth of explanation. The final two bullet points address privacy and encourage the use of external resources, suggesting a learning-focused environment. The overall message is to provide a clear, concise, and well-reasoned explanation of a program's shortcomings and potential improvements.
</details>
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/3.jpg Details</summary>

### Visual Description
## Text Display: "Example"
### Overview
The image displays a single word, "Example", centered on a white background. There are no charts, diagrams, or other visual elements present. The image consists solely of text.
### Components/Axes
There are no axes, legends, or other components beyond the text itself.
### Detailed Analysis or Content Details
The word "Example" is written in a sans-serif font, colored a light blue (approximately RGB 173, 216, 230). The text is positioned centrally within the image frame. The font size is relatively large, occupying a significant portion of the image.
### Key Observations
The image is extremely simple, containing only a single word. There are no data points, trends, or distributions to analyze.
### Interpretation
The image is a divider slide from the instruction deck: the word "Example" introduces the worked example task shown on the following slides. Its simplicity indicates it conveys no data of its own; it is purely a transition element.
</details>
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/4.jpg Details</summary>

### Visual Description
## Document: Problem Specification
### Overview
This document outlines the specification for a program designed to count votes from a list of names. It details the input format, expected output, and provides example inputs and outputs. The document is formatted as a technical specification, likely for a programming assignment.
### Components/Axes
The document is structured with a numbered heading ("1. Problem Specification") and a "Specification" section. There are also sections labeled "---Input---", "---Output---", and "---Examples---". A side note with a call to action is present on the right side of the page.
### Detailed Analysis or Content Details
The document contains the following textual information:
**Specification:**
"The recent school board elections were hotly contested: a proposal to swap school start times for elementary and high school students, a controversial new dress code proposal that bans athletic clothes in school, a proposal to raise real-estate taxes to pay for a new football practice facility, and the list goes on and on. It is now hours after the polls have closed and a winner has yet to emerge!"
"In their desperation, the election officials turn to you and ask you to write a program to count the vote!"
**---Input---:**
"The input consists of a single test case, which is a list of votes cast. Each line in the input contains the name of a candidate for whom a vote was cast. A name may consist of multiple words, separated by spaces. Words contain letters or hyphens, but no other punctuation characters. There will be at least $2$ and at most $100000$ valid votes."
**---Output---:**
"If a candidate obtained a simple or absolute majority of all votes cast (that is, more than any other candidate), output the name of this candidate! If no candidate obtained a simple majority, output "Runoff!" (don't forget to include the exclamation mark!)"
**---Examples---:**
*Sample Input:*
Penny Franklin
Marti Graham
Connie Froggatt
Joseph Ivans
Connie Froggatt
Penny Franklin
Connie Froggatt
Milton Berger
Connie Froggatt
Barbare Skinner
Barbare Skinner
***
*Sample Output:*
Connie Froggatt
"Remember: all input/output of the program should be handled through stdin and stdout."
**Side Note:**
"Each page starts with a specification of what the program should do."
"Begin by carefully reading the problem specification."
### Key Observations
The document is a clear and concise problem statement. The use of "---" delimiters and bold text emphasizes key sections. The inclusion of example input and output is helpful for understanding the expected program behavior. The note on the side reinforces the importance of understanding the specification before coding. The use of mathematical notation ($2$, $100000$) indicates a level of formality.
### Interpretation
This document serves as a formal specification for a vote-counting program. It outlines the requirements, input/output format, and provides a test case to guide development. The problem is designed to test a programmer's ability to parse input, count occurrences, and determine a winner (or indicate a runoff situation). The specification is well-structured and easy to understand, suggesting a focus on clarity and precision. The constraints on the number of votes (at least 2, at most 100000) keep the input small enough that a simple counting approach suffices, so the focus is on correctness rather than efficiency. The use of "stdin" and "stdout" indicates that the program is expected to be a command-line application.
</details>
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/5.jpg Details</summary>

### Visual Description
## Screenshot: Incorrect Program Code
### Overview
The image presents a screenshot of a slide titled "2. Incorrect Program". The slide is divided into two main sections: a code block labeled "Incorrect Program" and a text block providing context and instructions. The code appears to be Python, intended to calculate a winner based on votes.
### Components/Axes
The slide contains the following elements:
* **Title:** "2. Incorrect Program" (top-center)
* **Left Section:** Code block labeled "Incorrect Program" (left-center)
* **Right Section:** Text block with instructions and notes (right-center)
* **Text within Right Section:**
* "Next, you will be shown the incorrect program."
* "Tip: If you are struggling with debugging the program, try running it on your machine!"
* "Note: the programs handle inputs through “input()”, and outputs through “print()”."
### Content Details
The Python code block contains the following:
```python
from collections import defaultdict
votes = defaultdict(int)
candidate = input()
while candidate != "***":
    votes[candidate] += 1
    candidate = input()
total_votes = sum(votes.values())
max_votes = max(votes.values())
winner = [name for name, count in votes.items() if count == max_votes]
if len(winner) == 1 and max_votes > total_votes // 2:
    print(winner[0])
else:
    print("Runoff!")
```
### Key Observations
The code is designed to take candidate names as input until "***" is entered. It then calculates the total votes, the maximum number of votes, and identifies the winner(s). The code checks if there is a single winner with a majority of the votes (more than half). If so, it prints the winner's name. Otherwise, it prints "Runoff!".
### Interpretation
The slide presents a flawed program intended for a voting system. The purpose is likely to demonstrate a debugging exercise or illustrate potential errors in program logic. The note about input/output methods ("input()" and "print()") suggests the user is expected to interact with the program directly. The "Tip" encourages hands-on debugging. The code itself appears logically sound at first glance, but could contain subtle errors that lead to incorrect results in specific scenarios. The slide is part of a learning module focused on program correctness and debugging.
</details>
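For contrast with the flawed program on the slide above, here is a minimal corrected sketch under the specification's reading that the candidate with more votes than any other wins outright; the `count_votes` helper is illustrative and not part of the study materials:

```python
from collections import defaultdict

def count_votes(names):
    """Return the unique candidate with the most votes, or "Runoff!" if
    the top vote count is shared by several candidates."""
    votes = defaultdict(int)
    for name in names:
        votes[name] += 1
    max_votes = max(votes.values())
    winners = [n for n, c in votes.items() if c == max_votes]
    # Unlike the slide's program, no majority threshold is imposed:
    # a unique plurality winner is enough.
    return winners[0] if len(winners) == 1 else "Runoff!"
```

On the sample input from the specification, where Connie Froggatt holds a unique plurality, this returns "Connie Froggatt" rather than "Runoff!".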
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/6.jpg Details</summary>

### Visual Description
## Document: Error Message Example
### Overview
The image presents a document section titled "Error Message", illustrating an example of an error message encountered during program execution. It details the input provided to the program, the program's actual output, and the expected output. The document also provides instructions on how to reproduce the error.
### Components/Axes
The document is structured into the following sections:
* **Header:** "3. Error Message" (top-left)
* **Error Message Explanation:** A paragraph explaining the purpose of the error message.
* **Input:** A list of names.
* **Program Output:** The output generated by the program.
* **Expected Output:** The correct output the program should have produced.
* **Tip:** Instructions for reproducing the error.
### Detailed Analysis or Content Details
The document contains the following textual information:
**Input:**
* Penny Franklin
* Marti Graham
* Connie Froggatt
* Joseph Ivers
* Connie Froggatt
* Penny Franklin
* Connie Froggatt
* Bruce Stanger
* Connie Froggatt
* Barbara Skinner
* Barbara Skinner
* ... (indicated by ellipsis)
**Program Output:**
"Runoff!"
**Expected Output:**
"Connie Froggatt"
**Error Message Explanation:**
"The error message shows you the test that the program failed on."
**It contains:**
* An example input
* The program's incorrect output (in blue)
* The expected output
**Tip:**
"If you're executing the code on your machine, copy-paste the input into a file and pipe it to the program with 'python3 program.py < input.txt'."
### Key Observations
The error message indicates that the program incorrectly identified the output for the given input. The program outputted "Runoff!", while the expected output was "Connie Froggatt". The input data consists of a list of names, with "Connie Froggatt" appearing multiple times. The ellipsis suggests the input list is longer than what is shown.
### Interpretation
This document serves as an instructional example for debugging a program. It demonstrates how error messages can pinpoint the discrepancy between the program's actual behavior and the desired behavior. The error message provides the necessary information – input, actual output, and expected output – to diagnose and fix the problem. The tip suggests a common method for testing programs with large inputs: redirecting input from a file. The repeated name "Connie Froggatt" in the input might be a key element in understanding the program's logic and the source of the error. The error likely stems from incorrect logic in identifying the correct output based on the input list of names.
</details>
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/7.jpg Details</summary>

### Visual Description
## Document: Model Explanations
### Overview
The image presents a document section titled "Model Explanations". It details two example explanations generated by a model to aid in debugging. The document emphasizes that these explanations are not necessarily correct and should be treated as suggestions.
### Components/Axes
The document is structured into three main columns:
* **Left Column:** Contains the heading "Model Explanations" and two explanations labeled "Explanation 1" and "Explanation 2".
* **Right Column:** Provides contextual information about the explanations, stating they are generated by the model and may be incorrect. It also includes a comparison to "CoPilot suggestions".
### Detailed Analysis or Content Details
**Heading:** "4. Model Explanations" - positioned at the top-left of the image.
**Left Column - Explanation 1:**
"The concise explanation of the issue is: The code checks if the maximum votes are strictly greater than half of the total votes to determine if there's a majority. However, it should check if the maximum votes are greater than or equal to half of the total votes."
**Left Column - Explanation 2:**
"The following is a concise explanation of the issue: The code checks for a simple majority by comparing 'max_votes' to 'total_votes // 2', which will return the floor division result. This means that if 'max_votes' is equal to half of the 'total_votes', it will not be considered a simple majority. Instead, the condition should check if 'max_votes' is strictly greater than half of the 'total_votes'."
**Right Column:**
"To help you get started with the debugging, each page lists two example explanations."
"These explanations are generated by the model itself. They might be completely wrong. You don't have to use them." - This text is in **blue** font.
"Think of these like CoPilot suggestions."
### Key Observations
* The document focuses on a potential error in code logic related to determining a majority vote.
* Both explanations highlight the difference between "strictly greater than" and "greater than or equal to" when calculating a majority.
* The document explicitly warns the user that the model-generated explanations are not guaranteed to be accurate.
* The use of code snippets ('max_votes', 'total_votes') suggests a programming context.
### Interpretation
The document serves as a cautionary note accompanying model-generated debugging assistance. It acknowledges the potential for inaccuracies in the model's output and encourages users to exercise critical thinking and independent verification. The explanations themselves point to a common error in majority vote calculations – failing to account for the case where the number of votes is exactly half the total. The comparison to "CoPilot suggestions" frames the model's output as a starting point for investigation rather than a definitive solution. The document suggests a system where a model attempts to assist in debugging, but the user remains ultimately responsible for validating the proposed solutions.
</details>
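The disagreement between the two model explanations above turns on a floor-division edge case. A small illustration, assuming the slide's `max_votes`/`total_votes` variables; it only evaluates the two conditions and does not settle which reading matches the specification:

```python
# With an even total, exactly half the votes sits on the boundary:
total_votes = 10
max_votes = 5

print(max_votes > total_votes // 2)   # strict check, as in the code
print(max_votes >= total_votes // 2)  # relaxed check from Explanation 1

# With an odd total, floor division makes the strict check behave as
# "more than half": 6 of 11 votes satisfies 6 > 11 // 2 == 5.
print(6 > 11 // 2)
```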
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/8.jpg Details</summary>

### Visual Description
## Screenshot: Answer Form
### Overview
The image is a screenshot of a Google Form titled "Answer Form". It displays a text input field for a user's explanation, along with accompanying instructions. The form is designed to collect explanations of what a program is doing wrong.
### Components/Axes
The screenshot contains the following elements:
* **Title:** "5. Answer Form" (top-left)
* **Section Header:** "Your Explanation"
* **Placeholder Text:** "REDACTED FOR ANONYMITY" (within a black rectangle)
* **Sign-in Prompt:** "Sign in to Google to save your progress. Learn more"
* **Required Field Indicator:** "\* Indicates required question"
* **Input Field Label:** "Your Explanation \*"
* **Input Field Placeholder:** "Your answer"
* **Submit Button:** "Submit" (purple)
* **Clear Form Button:** "Clear form"
* **Instructional Text:** "Finally, each page contains an embedded Google Form. No login is required. Submit your explanation of what the program is doing wrong. Your answer must be self-contained; it should not be of the form “Just like the first model explanation describes, the issue with the code is that …”"
### Detailed Analysis or Content Details
The form is structured as a single required text input field. The placeholder text within the input field has been redacted for anonymity. The form explicitly requests a self-contained explanation of a program's error, discouraging referencing previous explanations.
### Key Observations
The form is simple and straightforward, focusing on collecting a textual explanation. The redaction suggests a concern for user privacy. The instructions emphasize the need for a complete and independent explanation.
### Interpretation
The image depicts a mechanism for gathering feedback or debugging information related to a program. The form's design suggests that the developers are seeking detailed, self-contained explanations of errors, rather than simple confirmations or references to previous reports. The redaction of the placeholder text indicates a commitment to user anonymity. The form is likely part of a larger process involving multiple models or iterations, where each explanation is expected to stand on its own. The emphasis on avoiding references to previous explanations suggests a desire to understand each user's independent perspective on the problem.
</details>
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/9.jpg Details</summary>

### Visual Description
## Document: Study Tips
### Overview
The image presents a document titled "Study Tips" containing advice for completing tasks, likely related to a study or evaluation process. It's a text-based document with bullet points offering guidance.
### Components/Axes
The document consists of a title ("Study Tips") followed by a series of bullet points providing instructions and encouragement. There are also two emoji faces: a smiling face and a slightly smiling face.
### Detailed Analysis or Content Details
The document contains the following text:
"We are very grateful for your help! 😊"
* "Make sure you understand the task first! The programs have subtle logic errors, not just simple compiler errors."
* "Try to write clear and concise explanations, with proper grammar and punctuation."
* "Feel free to use (or not use) the model explanations when writing your answers; but make sure your answer is self-contained!"
* "The tasks vary in difficulty. Feel free to allocate your time as you see fit; we are not measuring how quickly you complete the tasks or anything like that!"
* "Feel free to use external tools:"
* "Use pen and paper or a whiteboard to help you reason about the task at hand."
* "Use a Python IDE to execute and debug the code."
* "Search online for help."
* "Have a question? Ask [REDACTED] before moving on with the study! 😃"
The word "[REDACTED]" appears in the last bullet point, obscuring a name or contact information.
### Key Observations
The document emphasizes understanding the task, clear communication, independent problem-solving, and the use of available resources. The redaction suggests sensitivity around who to contact for questions. The use of emojis adds a friendly tone.
### Interpretation
This document appears to be part of instructions given to participants in a study or evaluation involving programming tasks. The emphasis on logic errors, clear explanations, and self-contained answers suggests the evaluation focuses on reasoning and problem-solving skills, not just correct code output. The encouragement to use external tools indicates a focus on practical problem-solving approaches. The redaction of the contact person's name likely protects their privacy or prevents direct contact outside of the study's designated channels. The overall tone is supportive and aims to create a comfortable environment for participants.
</details>
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/10.jpg Details</summary>

The page shows a document titled "FAQ" containing three questions and answers:

* **Are you collecting data as I visit the website?**
  * **No** - none at all. Only your final answers are recorded.
* **What is the point of the study?**
  * To investigate how much better the models are at fixing code when given human feedback, instead of having to debug the code themselves.
* **Are you evaluating how useful the model explanations were to me?**
  * **No** - they are just there to help you get started with the debugging. We only care about your final answer.
</details>
Figure 10:
Appendix C Human Experiment (Quantitative Analysis): Results Per Task
In the table below, we give a complete breakdown of the quantitative results presented in Section 4.3. Note that each program is associated with four different pieces of feedback: two sampled from GPT-4, and two given by our human participants. Each cell is the number of repair candidates (out of 25) that passed all the unit tests. See Section 4.3 for details, as well as Appendix B for the instructions given to participants.
| Task | Difficulty | Program | GPT-4 #1 | GPT-4 #2 | Human #1 | Human #2 |
| --- | --- | --- | --- | --- | --- | --- |
| 2106 | interview | A | 7 | 10 | 10 | 0 |
| | | B | 0 | 2 | 20 | 16 |
| 2673 | interview | A | 4 | 7 | 17 | 24 |
| | | B | 3 | 25 | 25 | 25 |
| 2923 | interview | A | 0 | 0 | 0 | 0 |
| | | B | 0 | 0 | 0 | 0 |
| 3070 | competition | A | 0 | 0 | 0 | 0 |
| | | B | 3 | 0 | 5 | 0 |
| 3286 | competition | A | 2 | 6 | 10 | 25 |
| | | B | 0 | 0 | 0 | 4 |
| 3754 | competition | A | 0 | 0 | 0 | 0 |
| | | B | 0 | 0 | 0 | 0 |
| 4182 | introductory | A | 25 | 25 | 25 | 24 |
| | | B | 25 | 0 | 25 | 25 |
| 4195 | introductory | A | 25 | 3 | 24 | 23 |
| | | B | 23 | 25 | 25 | 25 |
| 4281 | introductory | A | 0 | 4 | 0 | 0 |
| | | B | 0 | 0 | 0 | 0 |
| 4333 | introductory | A | 25 | 0 | 25 | 0 |
| | | B | 23 | 24 | 24 | 25 |
| 4347 | introductory | A | 0 | 0 | 7 | 25 |
| | | B | 0 | 0 | 25 | 25 |
| 4426 | introductory | A | 25 | 25 | 25 | 25 |
| | | B | 25 | 25 | 25 | 25 |
| 4450 | introductory | A | 0 | 0 | 0 | 0 |
| | | B | 24 | 0 | 22 | 24 |
| 4507 | introductory | A | 0 | 0 | 0 | 0 |
| | | B | 0 | 0 | 1 | 0 |
| 4514 | introductory | A | 15 | 21 | 1 | 16 |
| | | B | 0 | 0 | 25 | 0 |
| 4704 | introductory | A | 0 | 25 | 0 | 25 |
| | | B | 25 | 25 | 24 | 23 |
| 4741 | introductory | A | 25 | 25 | 25 | 25 |
| | | B | 25 | 25 | 25 | 25 |
| 4855 | introductory | A | 0 | 1 | 17 | 25 |
| | | B | 0 | 2 | 3 | 23 |
| 4873 | introductory | A | 0 | 0 | 0 | 0 |
| | | B | 0 | 0 | 0 | 18 |
| 4952 | introductory | A | 0 | 0 | 2 | 25 |
| | | B | 24 | 8 | 24 | 21 |
Appendix D Human Experiment (Qualitative Analysis): Examples
In this appendix, we give examples of each category used to classify the responses in the qualitative analysis of Section 4.3. For each category, we give one example from the human participants and one from GPT-4 (when possible). Each example contains: the textual specification of the task; the incorrect program; the error message (in the format it was shown to participants); the feedback in question; and a short comment motivating its classification.
D.1 Category (a): obviously incorrect feedback
D.1.1 GPT-4: Example
### SPECIFICATION
The recent schoolboard elections were hotly contested: a proposal to swap school start times for elementary and high school students, a controversial new dress code proposal that bans athletic clothes in school, and a proposal to raise real-estate taxes to pay for a new football practice facility, and the list goes on and on. It is now hours after the polls have closed and a winner has yet to emerge!
In their desperation, the election officials turn to you and ask you to write a program to count the vote!
----- Input -----
The input consists of a single test case, which is a list of votes cast. Each line in the input contains the name of a candidate for whom a vote was cast. A name may consist of multiple words, separated by spaces. Words contain letters or hyphens, but no other punctuation characters. There will be at least votes on the list. The list of votes ends with a single line containing the characters ***. This line should not be counted. There can be up to valid votes.
----- Output -----
If a candidate obtained a simple or absolute majority of all votes cast (that is, more than any other candidate), output the name of this candidate! If no candidate obtained a simple majority, output: "Runoff!" (don't forget to include the exclamation mark!)
----- Examples -----
Sample Input:
Penny Franklin
Marti Graham
Connie Froggatt
Joseph Ivers
Connie Froggatt
Penny Franklin
Connie Froggatt
Bruce Stanger
Connie Froggatt
Barbara Skinner
Barbara Skinner
***
Sample Output:
Connie Froggatt
### INCORRECT PROGRAM
from collections import defaultdict
votes = defaultdict(int)
candidate = input()
while candidate != "***":
    votes[candidate] += 1
    candidate = input()
total_votes = sum(votes.values())
max_votes = max(votes.values())
winner = [name for name, count in votes.items() if count == max_votes]
if len(winner) == 1 and max_votes > total_votes // 2:
    print(winner[0])
else:
    print("Runoff!")
### ERROR MESSAGE
=== Input ===
Penny Franklin
Marti Graham
Connie Froggatt
Joseph Ivers
Connie Froggatt
Penny Franklin
Connie Froggatt
Bruce Stanger
Connie Froggatt
Barbara Skinner
Barbara Skinner
***
=== Program Output ===
Runoff!
=== Expected Output ===
Connie Froggatt
### FEEDBACK
The concise explanation of the issue is: The code checks if the maximum votes are strictly greater than half of the total votes to determine if there's a majority. However, it should check if the maximum votes are greater than or equal to half of the total votes.
### COMMENT
The issue with the code is not that the ">" should have been a ">=", but rather that it is not allowing for a simple majority.
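For concreteness, the fix this comment points toward can be sketched as follows (our own illustration, not part of the study materials; the helper name `count_votes` is ours): since the specification defines a majority as simply receiving more votes than any other candidate, it suffices to drop the half-of-total check and require only that the maximum is unique.

```python
from collections import Counter

def count_votes(ballots):
    # The spec's "simple or absolute majority" just means strictly more
    # votes than any other candidate; no 50% threshold is involved.
    votes = Counter(ballots)
    max_votes = max(votes.values())
    winners = [name for name, count in votes.items() if count == max_votes]
    return winners[0] if len(winners) == 1 else "Runoff!"
```

On the sample input above, Connie Froggatt holds a unique maximum of four votes, so this version would print her name rather than "Runoff!".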
D.1.2 Human: Example
### SPECIFICATION
Snuke and Raccoon have a heap of N cards. The i-th card from the top has the integer a_i written on it.
They will share these cards.
First, Snuke will take some number of cards from the top of the heap, then Raccoon will take all the remaining cards.
Here, both Snuke and Raccoon have to take at least one card.
Let the sum of the integers on Snuke's cards and Raccoon's cards be x and y, respectively.
They would like to minimize |x - y|.
Find the minimum possible value of |x - y|.
----- Constraints -----
- 2 \leq N \leq 2 \times 10^5
- -10^{9} \leq a_i \leq 10^{9}
- a_i is an integer.
----- Input -----
Input is given from Standard Input in the following format:
N
a_1 a_2 ... a_{N}
----- Output -----
Print the answer.
----- Sample Input -----
6
1 2 3 4 5 6
----- Sample Output -----
1
If Snuke takes four cards from the top, and Raccoon takes the remaining two cards, x = 10, y = 11, and thus |x - y| = 1. This is the minimum possible value.
### INCORRECT PROGRAM
def main():
    n = int(input())
    a = list(map(int, input().split()))
    Snuke_sum = 0
    Raccoon_sum = sum(a)
    min_diff = abs(Raccoon_sum)
    for i in range(n - 1):
        Snuke_sum += a[i]
        Raccoon_sum -= a[i]
        diff = abs(Snuke_sum - Raccoon_sum)
        min_diff = min(min_diff, diff)
    print(min_diff)

main()
### ERROR MESSAGE
=== Input ===
2
10 -10
=== Program Output ===
0
=== Expected Output ===
20
### FEEDBACK
The code initializes the result `min_diff` to `abs(Raccoon_sum)` corresponding to the case where Raccoon takes all the cards, however this is incorrect because of the constraint "Snuke and Raccoon have to take at least one card". Instead just setting `min_diff` to negative infinity initially would fix it.
### COMMENT
This human participant was on the right track, but note that setting `min_diff` to *negative* infinity would cause it to never be updated inside of the for loop, and so the program would in fact always return negative infinity.
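To illustrate (our own sketch, not shown to participants; the function name `min_split_difference` is ours): the repair the participant was reaching for is to initialize `min_diff` to *positive* infinity, so that the first of the N-1 valid splits is guaranteed to update it.

```python
def min_split_difference(a):
    # Both players must take at least one card, so only the n-1 proper
    # prefixes are valid splits; +inf guarantees the first one is taken.
    snuke, raccoon = 0, sum(a)
    best = float("inf")
    for v in a[:-1]:  # the last card always stays with Raccoon
        snuke += v
        raccoon -= v
        best = min(best, abs(snuke - raccoon))
    return best
```

On the failing test above, `min_split_difference([10, -10])` yields 20, as expected.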
D.2 Category (b): feedback explicitly suggests a minor change to the code
D.2.1 GPT-4: Example
### SPECIFICATION
There is a square in the xy-plane. The coordinates of its four vertices are (x_1, y_1), (x_2, y_2), (x_3, y_3) and (x_4, y_4) in counter-clockwise order.
(Assume that the positive x-axis points right, and the positive y-axis points up.)
Takahashi remembers (x_1, y_1) and (x_2, y_2), but he has forgot (x_3, y_3) and (x_4, y_4).
Given x_1, x_2, y_1, y_2, restore x_3, y_3, x_4, y_4. It can be shown that x_3, y_3, x_4 and y_4 uniquely exist and have integer values.
----- Constraints -----
- |x_1|, |y_1|, |x_2|, |y_2| \leq 100
- (x_1, y_1) != (x_2, y_2)
- All values in input are integers.
----- Input -----
Input is given from Standard Input in the following format:
x_1 y_1 x_2 y_2
----- Output -----
Print x_3, y_3, x_4 and y_4 as integers, in this order.
----- Sample Input -----
0 0 0 1
----- Sample Output -----
-1 1 -1 0
(0, 0), (0, 1), (-1, 1), (-1, 0) is the four vertices of a square in counter-clockwise order.
Note that (x_3, y_3) = (1, 1), (x_4, y_4) = (1, 0) is not accepted, as the vertices are in clockwise order.
### INCORRECT PROGRAM
x1, y1, x2, y2 = map(int, input().split())
x3 = x2 + (y2 - y1)
y3 = y2 - (x2 - x1)
x4 = x1 + (y2 - y1)
y4 = y1 - (x2 - x1)
print(x3, y3, x4, y4)
### ERROR MESSAGE
=== Input ===
0 0 0 1
=== Program Output ===
1 1 1 0
=== Expected Output ===
-1 1 -1 0
### FEEDBACK
The issue with the code is that it doesn't consider the counter-clockwise order of the vertices. To fix this, we need to swap the signs of (y2 - y1) and (x2 - x1) in the calculations.
### COMMENT
Swapping the listed signs is a concrete suggestion of a way to fix the code. Since the overall structure of the code would remain the same, we count this as a minor change.
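As an illustration of how small the change is (our own sketch; the function name `restore_square` is ours): rotating the edge vector by +90 degrees instead of -90 degrees, i.e. mapping (dx, dy) to (-dy, dx), yields the counter-clockwise ordering.

```python
def restore_square(x1, y1, x2, y2):
    dx, dy = x2 - x1, y2 - y1
    # Counter-clockwise rotation: (dx, dy) -> (-dy, dx). The incorrect
    # program effectively rotated clockwise, i.e. (dx, dy) -> (dy, -dx).
    x3, y3 = x2 - dy, y2 + dx
    x4, y4 = x1 - dy, y1 + dx
    return x3, y3, x4, y4
```

On the sample input `0 0 0 1` this returns `(-1, 1, -1, 0)`, matching the expected output.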
D.2.2 Human: Example
### SPECIFICATION
Given is a string S representing the day of the week today.
S is SUN, MON, TUE, WED, THU, FRI, or SAT, for Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, and Saturday, respectively.
After how many days is the next Sunday (tomorrow or later)?
----- Constraints -----
- S is SUN, MON, TUE, WED, THU, FRI, or SAT.
----- Input -----
Input is given from Standard Input in the following format:
S
----- Output -----
Print the number of days before the next Sunday.
----- Sample Input -----
SAT
----- Sample Output -----
1
It is Saturday today, and tomorrow will be Sunday.
### INCORRECT PROGRAM
S = input()
days = ['SUN', 'MON', 'TUE', 'WED', 'THU', 'FRI', 'SAT']
print((7 - days.index(S)) % 7)
### ERROR MESSAGE
=== Input ===
SUN
=== Program Output ===
0
=== Expected Output ===
7
### FEEDBACK
The logic is incorrect. There should be no modulo operator.
### COMMENT
Suggesting that the modulo operator should be removed is a relatively small, but explicit, change.
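Indeed (our own sketch, not from the study; the helper name `days_until_sunday` is ours): since `days.index(S)` lies in 0..6, the quantity `7 - index` already lies in 1..7, so removing the modulo suffices.

```python
DAYS = ["SUN", "MON", "TUE", "WED", "THU", "FRI", "SAT"]

def days_until_sunday(s):
    # index is in 0..6, so 7 - index is in 1..7: Sunday itself maps to 7,
    # matching the "tomorrow or later" requirement.
    return 7 - DAYS.index(s)
```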
D.3 Category (c): feedback explicitly suggests a large change to the code
D.3.1 GPT-4: Example
### SPECIFICATION
Takahashi is about to assemble a character figure, consisting of N parts called Part 1, Part 2, ..., Part N and N-1 connecting components. Parts are distinguishable, but connecting components are not.
Part i has d_i holes, called Hole 1, Hole 2, ..., Hole d_i, into which a connecting component can be inserted. These holes in the parts are distinguishable.
Each connecting component will be inserted into two holes in different parts, connecting these two parts. It is impossible to insert multiple connecting components into a hole.
The character figure is said to be complete when it has the following properties:
- All of the N-1 components are used to connect parts.
- Consider a graph with N vertices corresponding to the parts and N-1 undirected edges corresponding to the pairs of vertices connected by a connecting component. Then, this graph is connected.
Two ways A and B to make the figure complete are considered the same when the following is satisfied: for every pair of holes, A uses a connecting component to connect these holes if and only if B uses one to connect them.
Find the number of ways to make the figure complete. Since the answer can be enormous, find the count modulo 998244353.
----- Constraints -----
- All values in input are integers.
- 2 \leq N \leq 2 \times 10^5
- 1 \leq d_i < 998244353
----- Input -----
Input is given from Standard Input in the following format:
N
d_1 d_2 \cdots d_N
----- Output -----
Print the answer.
----- Sample Input -----
3
1 1 3
----- Sample Output -----
6
One way to make the figure complete is to connect Hole 1 in Part 1 and Hole 3 in Part 3 and then connect Hole 1 in Part 2 and Hole 1 in Part 3.
### INCORRECT PROGRAM
def main():
    N = int(input())
    d = list(map(int, input().split()))
    MOD = 998244353
    fact = [1] * (N + 1)
    for i in range(1, N + 1):
        fact[i] = fact[i - 1] * i % MOD
    ans = 0
    for ki in range(N):
        ans += d[ki] * fact[ki] % MOD * fact[N - ki - 1] % MOD
        ans %= MOD
    ans = ans * fact[N - 1] % MOD
    print(ans)

main()
### ERROR MESSAGE
=== Input ===
3
1 1 3
=== Program Output ===
18
=== Expected Output ===
6
### FEEDBACK
The following is a concise explanation of the issue: The code above calculates the total number of ways to connect the parts, but it does not ensure that each connecting component is used only once. This results in overcounting the number of ways to make the figure complete. To fix the issue, we need to ensure that each connecting component is used exactly once.
### COMMENT
The feedback suggests changing the code so that each connecting component is used exactly once, which does not seem to be taken into account in the current approach.
Changing the code to satisfy this part of the specification seems to require a major rewrite.
D.3.2 Human: Example
### SPECIFICATION
Oh, New Year. The time to gather all your friends and reflect on the heartwarming events of the past year...
$n$ friends live in a city which can be represented as a number line. The $i$-th friend lives in a house with an integer coordinate $x_i$. The $i$-th friend can come celebrate the New Year to the house with coordinate $x_i - 1$, $x_i + 1$ or stay at $x_i$. Each friend is allowed to move no more than once.
For all friends $1 \le x_i \le n$ holds, however, they can come to houses with coordinates $0$ and $n + 1$ (if their houses are at $1$ or $n$, respectively).
For example, let the initial positions be $x = [1, 2, 4, 4]$. The final ones then can be $[1, 3, 3, 4]$, $[0, 2, 3, 3]$, $[2, 2, 5, 5]$, $[2, 1, 3, 5]$ and so on. The number of occupied houses is the number of distinct positions among the final ones.
So all friends choose the moves they want to perform. After that the number of occupied houses is calculated. What is the minimum and the maximum number of occupied houses can there be?
----- Input -----
The first line contains a single integer $n$ ($1 \le n \le 2 \cdot 10^5$) - the number of friends.
The second line contains $n$ integers $x_1, x_2, \dots, x_n$ ($1 \le x_i \le n$) - the coordinates of the houses of the friends.
----- Output -----
Print two integers - the minimum and the maximum possible number of occupied houses after all moves are performed.
----- Examples -----
Input
4
1 2 4 4
Output
2 4
Input
9
1 1 8 8 8 4 4 4 4
Output
3 8
Input
7
4 3 7 1 4 3 3
Output
3 6
----- Note -----
In the first example friends can go to $[2, 2, 3, 3]$. So friend $1$ goes to $x_1 + 1$, friend $2$ stays at his house $x_2$, friend $3$ goes to $x_3 - 1$ and friend $4$ goes to $x_4 - 1$. $[1, 1, 3, 3]$, $[2, 2, 3, 3]$ or $[2, 2, 4, 4]$ are also all valid options to obtain $2$ occupied houses.
For the maximum number of occupied houses friends can go to $[1, 2, 3, 4]$ or to $[0, 2, 4, 5]$, for example.
### INCORRECT PROGRAM
from collections import Counter

def main():
    n = int(input())
    x = list(map(int, input().split()))
    cnt = Counter(x)
    max_houses = 0
    for pos in range(1, n + 1):
        if cnt[pos]:
            max_houses += 1
    occupied = [0] * (n + 2)
    for pos in x:
        if not occupied[pos - 1]:
            occupied[pos - 1] = 1
        elif not occupied[pos]:
            occupied[pos] = 1
        elif not occupied[pos + 1]:
            occupied[pos + 1] = 1
    min_houses = sum(occupied)
    print(min_houses, max_houses)

main()
### ERROR MESSAGE
=== Input ===
4
1 2 4 4
=== Program Output ===
4 3
=== Expected Output ===
2 4
### FEEDBACK
The code calculates both the minimum and maximum count of houses incorrectly. The correct solution requires conducting a tree search to determine the true maximum and minimum. This can be further optimized by first sorting the list of occupied houses and using a sliding window to process independent segments of the number line.
### COMMENT
The feedback suggests conducting a tree search to determine the max and min, as well as a way to optimize this search.
These constitute major changes to the program as it is currently written.
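As an aside (our own sketch, not part of the study materials; the function name `occupied_houses` is ours): a full tree search is not strictly necessary here. After sorting, a greedy pass appears to suffice for both quantities, in the spirit of the sliding-window remark in the feedback.

```python
def occupied_houses(x):
    xs = sorted(x)
    n = len(xs)
    # Minimum: everyone in a window [p, p + 2] can meet at house p + 1,
    # so greedily cover the sorted positions with width-3 windows.
    min_houses, i = 0, 0
    while i < n:
        min_houses += 1
        reach = xs[i] + 2
        while i < n and xs[i] <= reach:
            i += 1
    # Maximum: give each friend the smallest of x-1, x, x+1 that is
    # strictly greater than the last house already claimed.
    max_houses, last = 0, float("-inf")
    for p in xs:
        for target in (p - 1, p, p + 1):
            if target > last:
                last = target
                max_houses += 1
                break
    return min_houses, max_houses
```

This reproduces all three sample outputs from the specification, e.g. `occupied_houses([1, 2, 4, 4])` gives `(2, 4)`.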
D.4 Category (d): Feedback contains blocks of pseudocode or Python code
D.4.1 GPT-4: Example
Per our experiment design, GPT-4's feedback never contains blocks of Python code, since any code blocks are stripped out (Section 4.3). We find that GPT-4 never uses blocks of pseudocode, either.
D.4.2 Human: Example
### SPECIFICATION
Given is a string S representing the day of the week today.
S is SUN, MON, TUE, WED, THU, FRI, or SAT, for Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, and Saturday, respectively.
After how many days is the next Sunday (tomorrow or later)?
----- Constraints -----
- S is SUN, MON, TUE, WED, THU, FRI, or SAT.
----- Input -----
Input is given from Standard Input in the following format:
S
----- Output -----
Print the number of days before the next Sunday.
----- Sample Input -----
SAT
----- Sample Output -----
1
It is Saturday today, and tomorrow will be Sunday.
### INCORRECT PROGRAM
day = input()
days = ['SUN', 'MON', 'TUE', 'WED', 'THU', 'FRI', 'SAT']
index = days.index(day)
result = (7 - index) % 7
print(result)
### ERROR MESSAGE
=== Input ===
SUN
=== Program Output ===
0
=== Expected Output ===
7
### FEEDBACK
before returning the result, we should just check to see if result == 0.
so we just need to add:
if result == 0:
    result = 7
### COMMENT
The feedback contains a block of Python code that the model can simply copy in order to fix the program.
D.5 Category (e): feedback which expresses uncertainty
D.5.1 GPT-4: Example
In the $20 \cdot 2 \cdot 2 = 80$ examples we study, we do not find that GPT-4 ever expresses uncertainty.
D.5.2 Human: Example
### SPECIFICATION
Two kingdoms had been at war for a long time, until the emperor intervened to bring an end to the conflict. The territory in question comprises an $M$ by $N$ rectangular grid. At the emperor's insistence, the two kings have withdrawn their troops until no two opposing troops are in adjacent squares of the map (adjacent being horizontal or vertical - diagonal is not considered).
The emperor proposes to designate certain squares of the map as neutral territory. Neither king will be allowed to move troops into those squares, and the emperor's own forces will patrol them to be sure that both kings observe these rules.
The emperor is frugal and does not want to commit more soldiers to this effort than absolutely necessary. His generals have marked each square of the map with the number of soldiers required to secure that square. What remains is to choose which of those squares should be patrolled.
Write a program to determine the minimum number of soldiers that the emperor will need to deploy to guarantee that the troops of one kingdom cannot move, in one or more steps, into squares occupied by the troops of the second kingdom (moving horizontally or vertically) without encountering the emperor's own soldiers.
----- Input -----
Input begins with a line containing $2$ integers, $w$ and $h$, denoting the width and height of the map, where $1 \leq w, h \leq 40$.
This is followed by $h$ lines. Each line contains $w$ characters, left justified. These characters will be 'A' or 'B', designating a position held by king A or king B, or a single numeric digit, designating a currently unoccupied position that can be secured by the use of that number of soldiers. For example, a '2' would indicate that two soldiers must be deployed to that square to secure it against passage of other troops. A '0' indicates terrain that is impassible - the emperor need not commit soldiers there because the kingdom troops cannot pass through that square.
No 'A' will be adjacent, horizontally or vertically, to any 'B'.
There will be at least one 'A' and one 'B' in the input.
----- Output -----
Print a single line containing an integer denoting the minimum number of soldiers that the emperor must deploy to guarantee that there is no open path between any 'A' position and any 'B' position, using any combination of horizontal or vertical moves.
----- Examples -----
Sample Input:
8 5
A11111AA
AA7B111A
111 BB111
11 BBB111
11 BBB11B
Sample Output:
13
### INCORRECT PROGRAM
def main():
    w, h = map(int, input().split())
    grid = [input() for _ in range(h)]
    def bfs(x, y, z):
        dist = [[float("inf")] * w for _ in range(h)]
        dist[x][y] = 0
        q = [(x, y)]
        for x, y in q:
            for dx, dy in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                nx, ny = x + dx, y + dy
                if 0 <= nx < h and 0 <= ny < w and dist[nx][ny] == float("inf"):
                    if grid[nx][ny] == z:
                        dist[nx][ny] = dist[x][y]
                        q.append((nx, ny))
                    if "0" <= grid[nx][ny] <= "9":
                        dist[nx][ny] = dist[x][y] + int(grid[nx][ny])
                        q.append((nx, ny))
        return dist
    ans = float("inf")
    for i in range(h):
        for j in range(w):
            if grid[i][j] == "A":
                ans = min(ans, min(x for row in bfs(i, j, "B") for x in row))
    print(ans)

main()
### ERROR MESSAGE
=== Input ===
8 5
A11111AA
AA7B111A
111 BB111
11 BBB111
11 BBB11B
=== Program Output ===
0
=== Expected Output ===
13
### FEEDBACK
This code appears to be finding the minimum "distance" between an A position and a B position, where distance is measured in the number of soldiers required to secure the intermediate positions. This isn't really addressing the question at all, because even occupying the positions along some path from a particular A -> B won't necessarily block the kings from reaching each other from those positions through a different, longer path. You probably need some sort of min-cut algorithm to divide the graph into two connected components, one each for A and B nodes.
### COMMENT
The feedback expresses mild uncertainty in two aspects. First, the user indicates that they are not completely certain in their understanding of the code's current behaviour by using the phrase "appears to be". Then, they express uncertainty in their suggestion for what to do instead, saying that one *probably* needs some sort of min-cut algorithm.
Appendix E Prompts
In this appendix, we describe the prompting structure used for our experiments. All of our experiments use one-shot prompting, in which a single example is given in the prompt before the desired task.
For initial code generation (the first sample from $M_{P}$ ), we use different prompts for the two types of tasks in APPS: call-based tasks, in which the desired program should take the input as a parameter to a function and return the output in the function’s return statement; and stdio-based tasks, in which inputs should be read from stdin and outputs should be written to stdout. These prompts are shown in Listing 1 and 2, respectively. The example tasks and programs were taken from APPS’ training set.
For feedback samples (i.e., samples from $M_{F}$ ), we use the prompt in Listing 3. This prompt contains an example in which the user provides the textual specification, the incorrect program and the error message, and the assistant generates feedback. Similarly, for repair samples (i.e., samples from $M_{P}$ which follow $M_{F}$ ) we use the prompt in Listing 4, in which the user also supplies the feedback, and the assistant returns only the fixed version of the program. Finally, for joint feedback-repair samples (i.e., when sampling $(f,r)\sim M_{P}$ ), we use the prompt in Listing 5. This prompt combines the prompts from Listings 3 and 4 into one prompt, in which the assistant returns both the feedback and the fixed program. In all of these prompts, the specification used was taken from APPS' training set, while the programs and the feedback were constructed manually.
Listing 1: Code generation prompt for call-based tasks.
===== system =====
You are an expert Python programmer. You will be given a question (problem specification) and will generate a correct Python program that matches the specification and passes all tests. You will NOT return anything except for the program. Put your fixed program within code delimiters, for example: ``` python
# YOUR CODE HERE
```.
===== user =====
### QUESTION
A palindrome is a word, phrase, number, or other sequence of characters which reads the same backward as forward. Examples of numerical palindromes are: `2332, 110011, 54322345`
For a given number `num`, write a function which returns an array of all the numerical palindromes contained within each number. The array should be sorted in ascending order and any duplicates should be removed.
In this kata, single digit numbers and numbers which start or end with zeros (such as `010` and `00`) are **NOT** considered valid numerical palindromes.
If `num` contains no valid palindromes, return `"No palindromes found"`.
Otherwise, return `"Not valid"` if the input is not an integer or is less than `0`.
## Examples
```
palindrome(1221) => [22, 1221]
palindrome(34322122) => [22, 212, 343, 22122]
palindrome(1001331) => [33, 1001, 1331]
palindrome(1294) => "No palindromes found"
palindrome("1221") => "Not valid"
```
---
### Other Kata in this Series:
Numerical Palindrome #1
Numerical Palindrome #1.5
Numerical Palindrome #2
Numerical Palindrome #3
Numerical Palindrome #3.5
Numerical Palindrome #4
Numerical Palindrome #5
### Use Call-Based Format
===== assistant =====
``` python
def is_pal(s):
    return s == s[::-1]

def palindrome(num):
    if not isinstance(num, int) or num < 0:
        return 'Not valid'
    s = str(num)
    pals = set()
    for i, ch in enumerate(s):
        if ch == '0':
            continue
        for j in range(i + 2, len(s) + 1):
            test = s[i:j]
            if is_pal(test):
                pals.add(test)
    return sorted(int(x) for x in pals) or 'No palindromes found'
```
Listing 2: Code generation prompt for stdio-based tasks.
===== system =====
You are an expert Python programmer. You will be given a question (problem specification) and will generate a correct Python program that matches the specification and passes all tests. You will NOT return anything except for the program. Put your fixed program within code delimiters, for example: ``` python
# YOUR CODE HERE
```.
===== user =====
### QUESTION
Polycarp has $n$ different binary words. A word called binary if it contains only characters ’0’ and ’1’. For example, these words are binary: "0001", "11", "0" and "0011100".
Polycarp wants to offer his set of $n$ binary words to play a game " words ". In this game, players name words and each next word (starting from the second) must start with the last character of the previous word. The first word can be any. For example, these sequence of words can be named during the game: "0101", "1", "10", "00", "00001".
Word reversal is the operation of reversing the order of the characters. For example, the word "0111" after the reversal becomes "1110", the word "11010" after the reversal becomes "01011".
Probably, Polycarp has such a set of words that there is no way to put them in the order correspondent to the game rules. In this situation, he wants to reverse some words from his set so that: the final set of $n$ words still contains different words (i. e. all words are unique); there is a way to put all words of the final set of words in the order so that the final sequence of $n$ words is consistent with the game rules.
Polycarp wants to reverse minimal number of words. Please, help him.
----- Input -----
The first line of the input contains one integer $t$ ($1 \ le t \ le 10\^4 $) - the number of test cases in the input. Then $t$ test cases follow.
The first line of a test case contains one integer $n$ ($1 \ le n \ le 2\ cdot10 ^5 $) - the number of words in the Polycarp ’ s set. Next $n$ lines contain these words. All of $n$ words aren ’ t empty and contains only characters ’0’ and ’1’. The sum of word lengths doesn ’ t exceed $4 \ cdot10 ^6 $. All words are different.
Guaranteed, that the sum of $n$ for all test cases in the input doesn ’ t exceed $2 \ cdot10 ^5 $. Also, guaranteed that the sum of word lengths for all test cases in the input doesn ’ t exceed $4 \ cdot10 ^6 $.
----- Output -----
Print the answer for all $t$ test cases in the order they appear.
If there is no answer for a test case, print -1. Otherwise, the first line of the output should contain $k$ ($0 \le k \le n$) - the minimal number of words in the set which should be reversed. The second line of the output should contain $k$ distinct integers - the indexes of the words in the set which should be reversed. Words are numbered from $1$ to $n$ in the order they appear. If $k = 0$ you can skip this line (or you can print an empty line). If there are many answers you can print any of them.
----- Example -----
Input
4
4
0001
1000
0011
0111
3
010
101
0
2
00000
00001
4
01
001
0001
00001
Output
1
3
-1
0
2
1 2
### Use Standard Input Format (read inputs with `input()`, write results with `print()`)
===== assistant =====
```python
for _ in range(int(input())):
    n = int(input())
    mass = []
    zo = 0
    oz = 0
    zz = 0
    oo = 0
    ozs = []
    zos = []
    ozss = set()
    zoss = set()
    for j in range(n):
        k = input()
        mass.append(k)
        if k[0] == '0' and k[-1] == '1':
            zoss.add(k)
            zos.append(j + 1)
            zo += 1
        elif k[0] == '1' and k[-1] == '0':
            ozss.add(k)
            ozs.append(j + 1)
            oz += 1
        elif k[0] == '0' and k[-1] == '0':
            zz += 1
        else:
            oo += 1
    if zz and oo and not oz and not zo:
        print(-1)
        continue
    else:
        if zo > oz:
            print((zo - oz) // 2)
            ans = []
            need = (zo - oz) // 2
            i = 0
            while need:
                zzz = mass[zos[i] - 1][len(mass[zos[i] - 1]) - 1::-1]
                if zzz not in ozss:
                    ans.append(zos[i])
                    need -= 1
                i += 1
            print(*ans)
        else:
            print((oz - zo) // 2)
            ans = []
            need = (oz - zo) // 2
            i = 0
            while need:
                zzz = mass[ozs[i] - 1][len(mass[ozs[i] - 1]) - 1::-1]
                if zzz not in zoss:
                    ans.append(ozs[i])
                    need -= 1
                i += 1
            print(*ans)
```
Listing 3: Feedback-only prompt.
===== system =====
You are a helpful programming assistant and an expert Python programmer. You are helping a user debug a program. The user has written some code, but it has some errors and is not passing the tests. You will help the user by giving a concise (at most 2-3 sentences) textual explanation of what is wrong with the code. You will *not* generate any code, because the user wants to fix the code themselves.
===== user =====
### QUESTION
A palindrome is a word, phrase, number, or other sequence of characters which reads the same backward as forward. Examples of numerical palindromes are: `2332, 110011, 54322345`
For a given number `num`, write a function which returns an array of all the numerical palindromes contained within each number. The array should be sorted in ascending order and any duplicates should be removed.
In this kata, single digit numbers and numbers which start or end with zeros (such as `010` and `00`) are **NOT** considered valid numerical palindromes.
If `num` contains no valid palindromes, return `"No palindromes found"`.
Otherwise, return `"Not valid"` if the input is not an integer or is less than `0`.
## Examples
```
palindrome(1221) => [22, 1221]
palindrome(34322122) => [22, 212, 343, 22122]
palindrome(1001331) => [33, 1001, 1331]
palindrome(1294) => "No palindromes found"
palindrome("1221") => "Not valid"
```
---
### Other Kata in this Series:
Numerical Palindrome #1
Numerical Palindrome #1.5
Numerical Palindrome #2
Numerical Palindrome #3
Numerical Palindrome #3.5
Numerical Palindrome #4
Numerical Palindrome #5
### Use Call-Based Format
### INCORRECT PYTHON CODE
```python
def is_pal(s):
    return s == s[::-1]

def palindrome(num):
    if not isinstance(num, int) or num < 0:
        return 'Not valid'
    s = str(num)
    pals = set()
    for i, ch in enumerate(s):
        for j in range(i + 2, len(s) + 1):
            test = s[i:j]
            if is_pal(test):
                pals.add(test)
    return sorted(int(x) for x in pals) or 'No palindromes found'
```
The code above is wrong and contains a bug. Given input "1001331" the output was "['0', '33', '1001', '1331']" but the expected output was "['33', '1001', '1331']".
===== assistant =====
The following is a concise explanation of the issue: The problem description states that numbers which start or end with zeros (such as `010` and `00`) are NOT considered valid numerical palindromes. However, the code above does not take this into account and therefore returns `00` as a valid palindrome. This can be fixed by checking if the first or last character is `0` before adding the string to the set of palindromes.
Listing 4: Repair-only prompt.
===== system =====
You are a helpful programming assistant and an expert Python programmer. You are helping a user write a program to solve a problem. The user has written some code, but it has some errors and is not passing the tests. The user has spent some time debugging the program and will provide you with a concise textual explanation of what is wrong with the code. You will use this explanation to generate a fixed version of the program. Put your fixed program within code delimiters, for example:
```python
# YOUR CODE HERE
```
===== user =====
### QUESTION
A palindrome is a word, phrase, number, or other sequence of characters which reads the same backward as forward. Examples of numerical palindromes are: `2332, 110011, 54322345`
For a given number `num`, write a function which returns an array of all the numerical palindromes contained within each number. The array should be sorted in ascending order and any duplicates should be removed.
In this kata, single digit numbers and numbers which start or end with zeros (such as `010` and `00`) are **NOT** considered valid numerical palindromes.
If `num` contains no valid palindromes, return `"No palindromes found"`.
Otherwise, return `"Not valid"` if the input is not an integer or is less than `0`.
## Examples
```
palindrome(1221) => [22, 1221]
palindrome(34322122) => [22, 212, 343, 22122]
palindrome(1001331) => [33, 1001, 1331]
palindrome(1294) => "No palindromes found"
palindrome("1221") => "Not valid"
```
---
### Other Kata in this Series:
Numerical Palindrome #1
Numerical Palindrome #1.5
Numerical Palindrome #2
Numerical Palindrome #3
Numerical Palindrome #3.5
Numerical Palindrome #4
Numerical Palindrome #5
### Use Call-Based Format
### INCORRECT PYTHON CODE
```python
def is_pal(s):
    return s == s[::-1]

def palindrome(num):
    if not isinstance(num, int) or num < 0:
        return 'Not valid'
    s = str(num)
    pals = set()
    for i, ch in enumerate(s):
        for j in range(i + 2, len(s) + 1):
            test = s[i:j]
            if is_pal(test):
                pals.add(test)
    return sorted(int(x) for x in pals) or 'No palindromes found'
```
The code above is wrong and contains a bug. Given input "1001331" the output was "['0', '33', '1001', '1331']" but the expected output was "['33', '1001', '1331']".
The following is a concise explanation of the issue: The problem description states that numbers which start or end with zeros (such as `010` and `00`) are NOT considered valid numerical palindromes. However, the code above does not take this into account and therefore returns `00` as a valid palindrome. This can be fixed by checking if the first or last character is `0` before adding the string to the set of palindromes.
===== assistant =====
### FIXED PYTHON CODE
```python
def is_pal(s):
    return s == s[::-1]

def palindrome(num):
    if not isinstance(num, int) or num < 0:
        return 'Not valid'
    s = str(num)
    pals = set()
    for i, ch in enumerate(s):
        if ch == '0':
            continue
        for j in range(i + 2, len(s) + 1):
            test = s[i:j]
            if is_pal(test):
                pals.add(test)
    return sorted(int(x) for x in pals) or 'No palindromes found'
```
Listing 5: Joint feedback-repair prompt.
===== system =====
You are a helpful programming assistant and an expert Python programmer. You are helping a user write a program to solve a problem. The user has written some code, but it has some errors and is not passing the tests. You will help the user by first giving a concise (at most 2-3 sentences) textual explanation of what is wrong with the code. After you have pointed out what is wrong with the code, you will then generate a fixed version of the program. Put your fixed program within code delimiters, for example:
```python
# YOUR CODE HERE
```
===== user =====
### QUESTION
A palindrome is a word, phrase, number, or other sequence of characters which reads the same backward as forward. Examples of numerical palindromes are: `2332, 110011, 54322345`
For a given number `num`, write a function which returns an array of all the numerical palindromes contained within each number. The array should be sorted in ascending order and any duplicates should be removed.
In this kata, single digit numbers and numbers which start or end with zeros (such as `010` and `00`) are **NOT** considered valid numerical palindromes.
If `num` contains no valid palindromes, return `"No palindromes found"`.
Otherwise, return `"Not valid"` if the input is not an integer or is less than `0`.
## Examples
```
palindrome(1221) => [22, 1221]
palindrome(34322122) => [22, 212, 343, 22122]
palindrome(1001331) => [33, 1001, 1331]
palindrome(1294) => "No palindromes found"
palindrome("1221") => "Not valid"
```
---
### Other Kata in this Series:
Numerical Palindrome #1
Numerical Palindrome #1.5
Numerical Palindrome #2
Numerical Palindrome #3
Numerical Palindrome #3.5
Numerical Palindrome #4
Numerical Palindrome #5
### Use Call-Based Format
### INCORRECT PYTHON CODE
```python
def is_pal(s):
    return s == s[::-1]

def palindrome(num):
    if not isinstance(num, int) or num < 0:
        return 'Not valid'
    s = str(num)
    pals = set()
    for i, ch in enumerate(s):
        for j in range(i + 2, len(s) + 1):
            test = s[i:j]
            if is_pal(test):
                pals.add(test)
    return sorted(int(x) for x in pals) or 'No palindromes found'
```
The code above is wrong and contains a bug. Given input "1001331" the output was "['0', '33', '1001', '1331']" but the expected output was "['33', '1001', '1331']".
===== assistant =====
The following is a concise explanation of the issue: The problem description states that numbers which start or end with zeros (such as `010` and `00`) are NOT considered valid numerical palindromes. However, the code above does not take this into account and therefore returns `00` as a valid palindrome. This can be fixed by checking if the first or last character is `0` before adding the string to the set of palindromes.
# The code below is the correct version of the code above, where the issue has been fixed:
### FIXED PYTHON CODE
```python
def is_pal(s):
    return s == s[::-1]

def palindrome(num):
    if not isinstance(num, int) or num < 0:
        return 'Not valid'
    s = str(num)
    pals = set()
    for i, ch in enumerate(s):
        if ch == '0':
            continue
        for j in range(i + 2, len(s) + 1):
            test = s[i:j]
            if is_pal(test):
                pals.add(test)
    return sorted(int(x) for x in pals) or 'No palindromes found'
```
Appendix F Dataset: APPS Tasks Used For Our Evaluations
These tasks were randomly sampled from APPS’ test set. To avoid distribution shift, we sampled according to the relative frequency of difficulties in the full dataset. We report the resulting list of tasks to aid reproducibility.
| introductory | ’4004’, ’4058’, ’4063’, ’4065’, ’4100’, ’4108’, ’4117’, ’4155’, ’4164’, ’4182’, ’4193’, ’4195’, ’4211’, ’4217’, ’4241’, ’4249’, ’4270’, ’4275’, ’4281’, ’4293’, ’4333’, ’4347’, ’4350’, ’4356’, ’4409’, ’4426’, ’4431’, ’4450’, ’4465’, ’4484’, ’4498’, ’4505’, ’4507’, ’4514’, ’4544’, ’4553’, ’4586’, ’4610’, ’4662’, ’4663’, ’4667’, ’4677’, ’4681’, ’4704’, ’4716’, ’4741’, ’4750’, ’4786’, ’4787’, ’4801’, ’4855’, ’4862’, ’4864’, ’4870’, ’4873’, ’4890’, ’4897’, ’4952’, ’4966’, ’4984’ |
| --- | --- |
| interview | ’0004’, ’0013’, ’0033’, ’0056’, ’0073’, ’0074’, ’0089’, ’0091’, ’0124’, ’0131’, ’0139’, ’0162’, ’0166’, ’0183’, ’0186’, ’0191’, ’0199’, ’0205’, ’0249’, ’0253’, ’0268’, ’0274’, ’0300’, ’0304’, ’0341’, ’0342’, ’0413’, ’0427’, ’0434’, ’0466’, ’0467’, ’0496’, ’0501’, ’0511’, ’0537’, ’0564’, ’0571’, ’0575’, ’0579’, ’0592’, ’0597’, ’0626’, ’0637’, ’0676’, ’0704’, ’0728’, ’0757’, ’0765’, ’0788’, ’0794’, ’0804’, ’0805’, ’0811’, ’0829’, ’0879’, ’0904’, ’0915’, ’0925’, ’0937’, ’0948’, ’0954’, ’0955’, ’0972’, ’0985’, ’0989’, ’1018’, ’1019’, ’1033’, ’1046’, ’1076’, ’1133’, ’1140’, ’1141’, ’1145’, ’1146’, ’1149’, ’1168’, ’1185’, ’1221’, ’1232’, ’1256’, ’1257’, ’1280’, ’1285’, ’1299’, ’1317’, ’1347’, ’1380’, ’1392’, ’1393’, ’1418’, ’1444’, ’1448’, ’1458’, ’1489’, ’1517’, ’1533’, ’1573’, ’1635’, ’1653’, ’1668’, ’1672’, ’1721’, ’1736’, ’1748’, ’1756’, ’1759’, ’1775’, ’1777’, ’1825’, ’1850’, ’1863’, ’1865’, ’1870’, ’1875’, ’1906’, ’1917’, ’1956’, ’1962’, ’1967’, ’1976’, ’2024’, ’2049’, ’2062’, ’2092’, ’2093’, ’2097’, ’2106’, ’2172’, ’2176’, ’2203’, ’2231’, ’2246’, ’2264’, ’2266’, ’2295’, ’2326’, ’2328’, ’2332’, ’2342’, ’2361’, ’2369’, ’2407’, ’2408’, ’2418’, ’2455’, ’2463’, ’2511’, ’2515’, ’2516’, ’2535’, ’2585’, ’2623’, ’2629’, ’2642’, ’2651’, ’2662’, ’2668’, ’2673’, ’2698’, ’2701’, ’2709’, ’2735’, ’2742’, ’2752’, ’2759’, ’2765’, ’2787’, ’2802’, ’2832’, ’2835’, ’2844’, ’2858’, ’2885’, ’2897’, ’2923’, ’2932’, ’2945’, ’2973’, ’2980’ |
| competition | ’3017’, ’3019’, ’3054’, ’3062’, ’3063’, ’3066’, ’3070’, ’3077’, ’3083’, ’3097’, ’3117’, ’3135’, ’3161’, ’3186’, ’3209’, ’3220’, ’3286’, ’3287’, ’3323’, ’3335’, ’3353’, ’3355’, ’3371’, ’3375’, ’3376’, ’3388’, ’3404’, ’3411’, ’3433’, ’3441’, ’3445’, ’3470’, ’3481’, ’3484’, ’3548’, ’3557’, ’3605’, ’3609’, ’3634’, ’3635’, ’3671’, ’3679’, ’3709’, ’3754’, ’3769’, ’3792’, ’3798’, ’3799’, ’3804’, ’3810’, ’3819’, ’3823’, ’3836’, ’3843’, ’3849’, ’3876’, ’3913’, ’3934’, ’3972’, ’3974’ |
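The difficulty-stratified sampling described above can be sketched as follows. This is an illustrative reconstruction, not the authors' actual script: `stratified_sample` is a hypothetical helper, and the pool ID ranges are assumptions based on the APPS test split's difficulty bands (introductory 4000-4999, interview 0000-2999, competition 3000-3999).

```python
import random

def stratified_sample(task_ids_by_difficulty, total, seed=0):
    """Sample `total` tasks, preserving each difficulty's share of the full set."""
    rng = random.Random(seed)
    full_size = sum(len(ids) for ids in task_ids_by_difficulty.values())
    sample = {}
    for difficulty, ids in task_ids_by_difficulty.items():
        # Allocate this stratum proportionally to its frequency in the dataset.
        k = round(total * len(ids) / full_size)
        sample[difficulty] = sorted(rng.sample(ids, k))
    return sample

# Assumed APPS test-set difficulty bands (1000 / 3000 / 1000 tasks).
pools = {
    "introductory": [f"{i:04d}" for i in range(4000, 5000)],
    "interview": [f"{i:04d}" for i in range(0, 3000)],
    "competition": [f"{i:04d}" for i in range(3000, 4000)],
}
picked = stratified_sample(pools, total=300)
print({d: len(ids) for d, ids in picked.items()})
```

Under these assumptions the proportional allocation yields 60 introductory, 180 interview, and 60 competition tasks, matching the sizes of the lists above.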
Appendix G More Examples of Feedback from GPT-4 and our Human Participants
In this appendix, we give five more examples from the study in Section 4.3, so that the keen reader may compare and contrast the feedback generated by GPT-4 with that written by our human participants. Each listing (6-10) contains a textual specification, an incorrect program, an error message, both examples of GPT-4 feedback which were shown to the participants assigned to this task, and the responses from the participants. For each piece of feedback, we also list the number of resulting repaired programs which passed all the tests (cf. the full results table in Appendix C).
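The per-feedback pass counts reported below (e.g. "7/25") correspond to sampling a fixed number of repairs conditioned on each piece of feedback and counting how many pass all unit tests. The sketch below illustrates that procedure only; `toy_repair_model` and `toy_passes_all_tests` are hypothetical stand-ins for the actual repair model and the APPS test harness.

```python
import random

def count_passing_repairs(feedback, program, repair_model, passes_all_tests,
                          n_samples=25, seed=0):
    """Sample n_samples repairs conditioned on `feedback`; count how many pass."""
    rng = random.Random(seed)
    passing = 0
    for _ in range(n_samples):
        candidate = repair_model(program, feedback, rng)
        if passes_all_tests(candidate):
            passing += 1
    return passing

# Toy stand-ins: a "repair model" that succeeds with probability 0.4,
# and a "test harness" that checks whether a repair was produced at all.
def toy_repair_model(program, feedback, rng):
    return program if rng.random() < 0.4 else None

def toy_passes_all_tests(candidate):
    return candidate is not None

k = count_passing_repairs("fix the zero check", "print(42)",
                          toy_repair_model, toy_passes_all_tests)
print(f"{k}/25")
```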
Listing 6: Task 2106, Program A.
### SPECIFICATION
There are n cities in the country where the Old Peykan lives. These cities are located on a straight line; we'll denote them from left to right as c_1, c_2, ..., c_n. The Old Peykan wants to travel from city c_1 to c_n using roads. There are (n - 1) one-way roads; the i-th road goes from city c_i to city c_{i+1} and is d_i kilometers long.
The Old Peykan travels 1 kilometer in 1 hour and consumes 1 liter of fuel during this time.
Each city c_i (except for the last city c_n) has a supply of s_i liters of fuel which immediately transfers to the Old Peykan if it passes the city or stays in it. This supply refreshes instantly k hours after it transfers. The Old Peykan can stay in a city for a while and fill its fuel tank many times.
Initially (at time zero) the Old Peykan is at city c_1 and s_1 liters of fuel is transferred to its empty tank from c_1's supply. The Old Peykan's fuel tank capacity is unlimited. The Old Peykan cannot continue its travel if its tank is emptied strictly between two cities.
Find the minimum time the Old Peykan needs to reach city c_n.
----- Input -----
The first line of the input contains two space-separated integers m and k (1 <= m, k <= 1000). The value m specifies the number of roads between cities, which is equal to n - 1.
The next line contains m space-separated integers d_1, d_2, ..., d_m (1 <= d_i <= 1000) and the following line contains m space-separated integers s_1, s_2, ..., s_m (1 <= s_i <= 1000).
----- Output -----
In the only line of the output print a single integer - the minimum time required for the Old Peykan to reach city c_n from city c_1.
----- Examples -----
Input
4 6
1 2 5 2
2 3 3 4
Output
10
Input
2 3
5 6
5 5
Output
14
### INCORRECT PROGRAM
def main():
    m, k = map(int, input().split())
    ds = list(map(int, input().split()))
    ss = list(map(int, input().split()))
    time = 0
    fuel = ss[0]
    min_time = 0
    max_s = ss[0]
    for i in range(m):
        fuel -= ds[i]
        max_s = max(max_s, ss[i])
        while fuel < 0:
            time += k
            fuel += max_s
        time += ds[i]
    print(time)

main()
### ERROR MESSAGE
=== Input ===
4 6
1 2 5 2
2 3 3 4
=== Program Output ===
28
=== Expected Output ===
10
### GPT -4 FEEDBACK #1
The issue with the code is that it does not consider the time spent in the current city before moving to the next city, resulting in an overestimation of the minimum time required. This can be fixed by updating the time spent in the current city before calculating the fuel used and accounting for the time it takes to travel to the next city.
# Number of passing repairs: 7/25
### GPT -4 FEEDBACK #2
The following is a concise explanation of the issue: The code incorrectly calculates the waiting time at each city. It should minimize the waiting time by considering the maximum fuel supply that can be collected during the waiting time. The code needs to be updated to account for this.
# Number of passing repairs: 10/25
### PARTICIPANT FEEDBACK #1
The program doesn't add the fuel for passing through cities, only if it needs to actually stay in the city to refuel. Whenever it passes through a city, it should add ss[i] to the fuel.
# Number of passing repairs: 10/25
### PARTICIPANT FEEDBACK #2
The code is incorrect because it doesn't take into account the fuel at each city that is immediately transferred after entering the city (aside from the fuel received at the first city). As a result, this program overestimates the time needed, because it assumes that the Old Peykan always has to wait for a refill (besides the fuel it is given in the first city).
# Number of passing repairs: 0/25
Listing 7: Task 3070, Program B.
### SPECIFICATION
In the free-market, ruthlessly capitalist world of train fares, only one thing matters: incentives.
Train companies are incentivised with bonuses for high throughput, successful journeys, and customer satisfaction. Conversely, the companies are disincentivised from failure via mandatory refunds for customers delayed by $30$ minutes or more.
Being a ruthless capitalist yourself, you have decided to take advantage of this generous delay compensation provision.
The refund is awarded provided that, no matter the combination of trains you had taken (provided they followed the same route of stations as planned), you would still be unable to reach your destination within strictly less than $30$ minutes (or $1800$ seconds) of the time you would have arrived assuming your booked journey was exactly on time.
Armed with your printout of the day's delays, and the original timetable, you must ask yourself only one question: what is the earliest time you can book a train for from station $1$, in order to earn this restitutive reward?
----- Input -----
- One line containing two integers: $N$ ($1 \le N \le 100$), the number of stations, and $M$ ($1 \le M \le 10^5$), the number of scheduled trains.
- The next $M$ lines each contain 4 integers:
- $X$, the starting station ($1 \le X \le N-1$),
- $S$ and $T$ ($0 \le S \le T < 86400$), the planned departure and arrival times in seconds,
- and $L$ ($0 \le L < 86400$), the duration by which the train's departure and arrival times are delayed.
Stations are numbered from $1$ to $N$ in the order you will visit them. Each train goes between stations $X$ and $X+1$. It is possible to change between trains instantaneously.
----- Output -----
- One line containing one integer: the start time of the earliest train journey you could book in order to earn your compensation, or impossible if no such journey is possible.
----- Examples -----
Sample Input 1:
2 3
1 1800 9000 1800
1 2000 9200 1600
1 2200 9400 1400
Sample Output 1:
1800
Sample Input 2:
2 2
1 1800 3600 1800
1 1900 3600 1600
Sample Output 2:
impossible
### INCORRECT PROGRAM
def main():
    N, M = map(int, input().split())
    train_data = [tuple(map(int, input().split())) for _ in range(M)]
    train_data.sort(key=lambda x: (x[0], x[1]))
    times = [1e9] * (N + 1)
    times[1] = 0
    for X, S, T, L in train_data:
        if times[X] <= S:
            times[X + 1] = min(times[X + 1], T + L)
    if times[N] >= 1800:
        print(int(times[1]))
    else:
        print("impossible")

main()
### ERROR MESSAGE
=== Input ===
2 3
1 1800 9000 1800
1 2000 9200 1600
1 2200 9400 1400
=== Program Output ===
0
=== Expected Output ===
1800
### GPT -4 FEEDBACK #1
The issue with the code is that it prints `times[1]` instead of the earliest possible train start time. The code needs to get the minimum start time from the `train_data` list that has a valid delay for compensation.
# Number of passing repairs: 3/25
### GPT -4 FEEDBACK #2
The issue with the code is that it directly prints the value of `times[1]`, which is initialized to `0` and never updated. The correct output should be the starting time of the earliest train journey that satisfies the condition. To fix this, we need to track the earliest possible start time and update it if a valid train journey is found.
# Number of passing repairs: 0/25
### PARTICIPANT FEEDBACK #1
This code is clearly wrong, as times[1] is initialized to 0, then never updated, and it is returned as the answer. Also, in the sorting line `int(x[0])` should be used in place of `x[0]`, and likewise `int(x[1])` in place of `x[1]`, because we want numerical-order sorting, not string-order sorting. The goal in this problem is to calculate, for each station, what the earliest-arriving undelayed path to the station is and what the earliest-arriving delayed path to the station is, take the difference between the two, and see if it results in a total delay of >= 30 minutes (and then to return the earliest departure from station 1 for this path). Given this, first the current code based around `times` can remain - this is the delayed time. However, the if-statement within the for-loop should be `<= S + L` since the departure time is also delayed. Then `undelayed_times` should be tracked as well, updated in the for loop when `undelayed_times[X] <= S` and set to a min() between the current undelayed time and `T` (not T + L). Finally, the if-statement for time printing can use the difference between the delayed and undelayed times and see if any() are >= 1800, and we need to print the earliest-arriving train for that path.
# Number of passing repairs: 5/25
### PARTICIPANT FEEDBACK #2
One of the main issues is that the code prints `times[1]`, which is set to 0 and never updated. However, there are also some larger issues with the code. First, instead of checking if you reach your destination within 30 minutes of the expected time, it checks if you reach your destination in 30 minutes. To fix this, the program would need to keep track of the time you'd reach your destination if there were no delays, which it currently doesn't. Second, instead of printing times[1], it should be keeping track of the start time of the train that leads to a delay and printing this.
# Number of passing repairs: 0/25
Listing 8: Task 4333, Program A.
### SPECIFICATION
There is a square in the xy-plane. The coordinates of its four vertices are (x_1, y_1), (x_2, y_2), (x_3, y_3) and (x_4, y_4) in counter-clockwise order.
(Assume that the positive x-axis points right, and the positive y-axis points up.)
Takahashi remembers (x_1, y_1) and (x_2, y_2), but he has forgotten (x_3, y_3) and (x_4, y_4).
Given x_1, x_2, y_1, y_2, restore x_3, y_3, x_4, y_4. It can be shown that x_3, y_3, x_4 and y_4 uniquely exist and have integer values.
----- Constraints -----
- |x_1|, |y_1|, |x_2|, |y_2| \leq 100
- (x_1, y_1) != (x_2, y_2)
- All values in input are integers.
----- Input -----
Input is given from Standard Input in the following format:
x_1 y_1 x_2 y_2
----- Output -----
Print x_3, y_3, x_4 and y_4 as integers, in this order.
----- Sample Input -----
0 0 0 1
----- Sample Output -----
-1 1 -1 0
(0,0), (0,1), (-1,1), (-1,0) are the four vertices of a square in counter-clockwise order.
Note that (x_3, y_3) = (1,1), (x_4, y_4) = (1,0) is not accepted, as those vertices are in clockwise order.
### INCORRECT PROGRAM
x1, y1, x2, y2 = map(int, input().split())
x3 = x2 + (y2 - y1)
y3 = y2 - (x2 - x1)
x4 = x1 + (y2 - y1)
y4 = y1 - (x2 - x1)
print(x3, y3, x4, y4)
### ERROR MESSAGE
=== Input ===
0 0 0 1
=== Program Output ===
1 1 1 0
=== Expected Output ===
-1 1 -1 0
### GPT -4 FEEDBACK #1
The issue with the code is that it doesn't consider the counter-clockwise order of the vertices. To fix this, we need to swap the signs of (y2 - y1) and (x2 - x1) in the calculations.
# Number of passing repairs: 25/25
### GPT -4 FEEDBACK #2
The following is a concise explanation of the issue: The code above does not ensure that the vertices are in counter-clockwise order. It calculates x3, y3, x4, and y4 in the wrong order, which results in the coordinates being in clockwise order instead. To fix this, we need to swap the calculations for x3 and x4, and y3 and y4.
# Number of passing repairs: 0/25
### PARTICIPANT FEEDBACK #1
The program calculates vertices for a clockwise-oriented square, not counter-clockwise. To fix this, (y2 - y1) and (x2 - x1) should be negated in the equations.
# Number of passing repairs: 25/25
### PARTICIPANT FEEDBACK #2
The calculations for y3 and y4 incorrectly subtract the x-direction vector from y0 and y1 respectively. In each of these cases, the calculation can be fixed by changing the first subtraction to addition.
# Number of passing repairs: 0/25
Listing 9: Task 4347, Program B.
### SPECIFICATION
One day, $n$ people ($n$ is an even number) met on a plaza and made two round dances, each round dance consisting of exactly $\frac{n}{2}$ people. Your task is to find the number of ways $n$ people can make two round dances if each round dance consists of exactly $\frac{n}{2}$ people. Each person should belong to exactly one of these two round dances.
A round dance is a dance circle consisting of $1$ or more people. Two round dances are indistinguishable (equal) if one can be transformed to another by choosing the first participant. For example, round dances $[1, 3, 4, 2]$, $[4, 2, 1, 3]$ and $[2, 1, 3, 4]$ are indistinguishable.
For example, if $n = 2$ then the number of ways is $1$: one round dance consists of the first person and the second one of the second person.
For example, if $n = 4$ then the number of ways is $3$. Possible options: one round dance - $[1,2]$, another - $[3,4]$; one round dance - $[2,4]$, another - $[3,1]$; one round dance - $[4,1]$, another - $[3,2]$.
Your task is to find the number of ways $n$ people can make two round dances if each round dance consists of exactly $\frac{n}{2}$ people.
----- Input -----
The input contains one integer $n$ ($2 \le n \le 20$); $n$ is an even number.
----- Output -----
Print one integer - the number of ways to make two round dances. It is guaranteed that the answer fits in the $64$-bit integer data type.
----- Examples -----
Input
2
Output
1
Input
4
Output
3
Input
8
Output
1260
Input
20
Output
12164510040883200
### INCORRECT PROGRAM
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)

n = int(input())
ans = (factorial(n) // (n // 2)**2) // (2 * n)
print(ans)
### ERROR MESSAGE
=== Input ===
2
=== Program Output ===
0
=== Expected Output ===
1
### GPT -4 FEEDBACK #1
The issue with the code is that it uses integer division when calculating the final answer, which may cause the result to be rounded down incorrectly. To fix this, we can use float division and then convert the result to an integer.
# Number of passing repairs: 0/25
### GPT -4 FEEDBACK #2
The following is a concise explanation of the issue: The code attempts to calculate the number of ways using integer division, which may result in a loss of precision and incorrect results. To fix the issue, change integer division to regular division, then round the result to the nearest integer using `round()`.
# Number of passing repairs: 0/25
### PARTICIPANT FEEDBACK #1
The problem is that the formula used is incorrect. For n <= 4, n! is less than n^3/2 and so this is why we see a 0. There is an erroneous additional division by n which is the true reason for the wrong answer.
# Number of passing repairs: 25/25
### PARTICIPANT FEEDBACK #2
On line 11, the program calculates a term with a correct numerator, but an incorrect denominator. It divides by (2*n), when it should just divide by 2.
# Number of passing repairs: 25/25
Listing 10: Task 4855, Program A.
### SPECIFICATION
The impact factor of a scientific journal is a measure reflecting the average number of citations to articles published in science journals. For this task we are using a simplified formula for calculating the impact factor: \[\frac{\text{Total count of citations received by articles published in the journal}}{\text{Total number of articles published}}\]
Rounding is always performed upwards. For example, the impact factor of the "Journal for ore research and time wasting", which published $38$ articles cited $894$ times, is $894 / 38 = 23.53$, rounding up to $24$.
You are the editor of one scientific journal. You know how many articles you are going to publish, and the owners are pushing you to reach a specific impact factor. You are wondering how many scientists you will have to bribe to cite your articles to meet the owners' demands. Since money is tight, you want to bribe the minimal number of scientists. Each bribed scientist buys you a single citation.
----- Input -----
The first and only line of input will contain $2$ integers, $A$ ($1 \leq A \leq 100$), the number of articles you plan to publish, and $I$ ($1 \leq I \leq 100$), the impact factor the owners require.
----- Output -----
The first and only line of output should contain one integer, the minimal number of scientists you need to bribe.
----- Examples -----
Sample Input 1:
38 24
Sample Output 1:
875
Sample Input 2:
1 100
Sample Output 2:
100
### INCORRECT PROGRAM
import math

A, I = map(int, input().split())
needed_citations = math.ceil(A * (I - 0.5))
scientists_to_bribe = needed_citations - A * (I - 1)
print(scientists_to_bribe)
### ERROR MESSAGE
=== Input ===
38 24
=== Program Output ===
19
=== Expected Output ===
875
### GPT -4 FEEDBACK #1
The following is a concise explanation of the issue: The calculation for the number of scientists to bribe is incorrect. It should be calculated by first finding the minimum number of citations needed: `A * I`, and then subtracting the current number of citations: `A * (I - 1)`.
# Number of passing repairs: 0/25
### GPT -4 FEEDBACK #2
The following is a concise explanation of the issue: The code calculates the required number of scientists to bribe by subtracting the current number of citations from the number of needed citations. However, it incorrectly calculates the current number of citations.
# Number of passing repairs: 1/25
### PARTICIPANT FEEDBACK #1
The scientists_to_bribe variable on line 3 is unnecessary. It's sufficient to just do A * (I - 1) + 1.
# Number of passing repairs: 17/25
### PARTICIPANT FEEDBACK #2
The program seems to have a conceptual misunderstanding. Specifically, the number of scientists needed to bribe is equal to the number of required citations (`citations_needed`). However, the calculation of `citations_needed` is also incorrect -- it implies rounding the impact factor conventionally (e.g. 3.49 goes to 3, 3.51 goes to 4), whereas rounding here is a ceiling function on the impact factor. As such, the actual formula for `citations_needed` should be `A * (I - 1) + 1`
# Number of passing repairs: 25/25