# Demystifying GPT Self-Repair for Code Generation
**Authors**:
- Theo X. Olausson (MIT EECS & CSAIL)
- Jeevana Priya Inala
> Correspondence to . Work partially done while T.X.O. was at Microsoft Research.
## Abstract
Large Language Models (LLMs) have shown remarkable aptitude in code generation but still struggle on challenging programming tasks. Self-repair—in which the model debugs and fixes mistakes in its own code—has recently become a popular way to boost performance in these settings. However, the literature offers only very limited study of how and when self-repair works effectively, and one might wonder to what extent a model is really capable of providing accurate feedback on why the code is wrong when that code was generated by the same model. In this paper, we analyze GPT-3.5 and GPT-4’s ability to perform self-repair on APPS, a challenging dataset consisting of diverse coding challenges. To do so, we first establish a new evaluation strategy dubbed pass@t that measures the pass rate of the tasks against the total number of tokens sampled from the model, enabling a fair comparison to purely sampling-based approaches. With this evaluation strategy, we find that self-repair is effective only with GPT-4. We also observe that self-repair is bottlenecked by the feedback stage; using GPT-4 to give feedback on programs generated by GPT-3.5 and using expert human programmers to give feedback on programs generated by GPT-4, we unlock significant performance gains.
## 1 Introduction
Large language models (LLMs) have proven capable of generating code snippets from natural language specifications, but still struggle on complex coding challenges such as those found in competitions and professional software engineering interviews. Recent work has sought to improve performance by leveraging self-repair [Gupta et al., 2020, Le et al., 2022, Chen et al., 2023b, Zhang et al., 2023], in which the model introspects and corrects mistakes in its own code. Figure 1 shows a typical workflow of a self-repair based approach. First, given a specification, a program is sampled from a code generation model; this program is then executed on a suite of unit tests provided as part of the specification; if the program fails on any unit test, then the error message and the faulty program are given to a feedback generation model, which outputs a short explanation of why the code failed; finally, the feedback is passed to a repair model, which generates a fixed version of the program. On the surface, this is a very attractive idea. It allows the system to overcome mistakes caused by unfortunate samples during decoding; easily incorporates feedback during the repair phase from symbolic systems such as compilers, static analysis tools, and execution engines; and mimics the trial-and-error way in which human software engineers write code.
<details>
<summary>x1.png Details</summary>

Flowchart of the self-repair pipeline: (1) a user provides a task to a code model (blue); (2) the model generates a program `f`, which is executed against three unit tests (`assert f(x1) == y1`, `assert f(x2) == y2`, `assert f(x3) == y3`), passing the first two and failing the third; (3) the execution results are passed to a feedback model (yellow); (4) the feedback informs a second invocation of the code model; (5) the refined program is executed again and all three assertions pass.
</details>
**(1) Specification.** Given is a string $s$ representing the day of the week today. $s$ is one of SUN, MON, TUE, WED, THU, FRI, or SAT. After how many days is the next Sunday (tomorrow or later)?

```python
# UNIT TESTS (EXECUTABLE)
assert f('MON') == 6
assert f('WED') == 4
assert f('SUN') == 7
```

**(2) Initial program.**

```python
def f(s):
    return (7 - ['SUN', ..., 'FRI', 'SAT'].index(s)) % 7
```

**(3) Error message.** Given input ’SUN’, the program returned 0, but the expected output was 7.

**(4) Feedback.** The code does not account for the case where the input is ’SUN’ and the output should be 7. This can be fixed by removing the modulo operation.

**(5) Repaired program.**

```python
def f(s):
    return (7 - ['SUN', ..., 'FRI', 'SAT'].index(s))  # % 7
```
Figure 1: Self-repair with separate code and feedback models. First, a user gives a specification in the form of text and a suite of unit tests (1). Then, a code model (blue) generates a program (2). The program is checked against the unit tests using a symbolic execution engine, and an error message is returned (3). In order to provide more signal to the code model, textual feedback as to why this happened is provided by a feedback model (yellow; 4). Finally, this feedback is used by the code model to repair the program (5).
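Expanding the elided portion of the day list using the specification's own enumeration, the repaired program from step (5) can be checked directly against the unit tests; a minimal sketch:

```python
def f(s):
    # 'SUN' is placed first so that 7 - index(s) maps 'SUN' to 7
    # (tomorrow or later) rather than 0; the buggy version applied % 7 here.
    days = ['SUN', 'MON', 'TUE', 'WED', 'THU', 'FRI', 'SAT']
    return 7 - days.index(s)

assert f('MON') == 6
assert f('WED') == 4
assert f('SUN') == 7
```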
However, it is important to remember that self-repair requires more invocations of the model, thus increasing the computational cost. In particular, whether self-repair is a winning strategy or not ultimately boils down to whether you would—at an equivalent compute budget—have had a greater chance of success if you had simply drawn more code samples i.i.d. from the model and checked them against the suite of unit tests provided as part of the task. Crucially, the effectiveness of self-repair depends not only on the model’s ability to generate code, which has been studied extensively in the literature, but also on its ability to identify how the code (generated by the model itself) is wrong with respect to the task specification. As far as we are aware, no previous or contemporary work has attempted to study the effect of this stage in detail.
In this paper, we study the effectiveness of self-repair with GPT-3.5 [Ouyang et al., 2022, OpenAI, 2022] and GPT-4 [OpenAI, 2023] when solving competition-level code generation tasks. We begin by proposing a new evaluation strategy dubbed pass@t, in which the likelihood of obtaining a correct program (with respect to the given unit tests) is weighed against the total number of tokens sampled from the model. Using this instead of the traditional pass@k [Chen et al., 2021, Kulal et al., 2019] metric (which weighs pass rate against the number of trials), we are able to accurately compare performance gained through self-repair against any additional work done by the model when generating the feedback and carrying out the repair. Using this new evaluation strategy, we then carefully study the dynamics of the self-repair process under a range of hyper-parameters. Finally, given our primary objective of gaining insight into the state-of-the-art code generation models’ ability to reflect upon and debug their own code, we carry out a set of experiments in which we investigate the impact of improving the feedback stage alone. We do so by analyzing the impact of using a stronger feedback generation model than the code generation model (using GPT-4 to generate feedback for GPT-3.5 code model), as well as by carrying out a study in which human participants provide feedback on incorrect programs, in order to compare model-generated self-feedback to that provided by human programmers.
From our experiments, we find that:
1. When taking the cost of doing inspection and repair into account, performance gains from self-repair can only be seen with GPT-4; for GPT-3.5, the pass rate with repair is lower than or equal to that of the baseline, no-repair approach at all budgets.
1. Even for the GPT-4 model, performance gains are modest at best (66% pass rate with a budget of 7000 tokens, $\approx$ the cost of 45 i.i.d. GPT-4 samples) and depend on having sufficient diversity in the initial programs.
1. Replacing GPT-3.5’s explanations of what is wrong with feedback produced by GPT-4 leads to better self-repair performance, even beating the baseline, no-repair GPT-3.5 approach (50% at 7000 tokens).
1. Replacing GPT-4’s own explanations with those of a human programmer improves repair significantly, leading to a 57% increase in the number of repaired programs which pass the tests.
## 2 Related work
Program synthesis with large language models. The use of large language models for program synthesis has been studied extensively in the literature [Li et al., 2022, Austin et al., 2021, Chen et al., 2021, Le et al., 2022, Fried et al., 2023, Nijkamp et al., 2023, Chowdhery et al., 2022, Touvron et al., 2023, Li et al., 2023]. This literature has predominantly focused on evaluating models in terms of either raw accuracy or the pass@k metric [Kulal et al., 2019, Chen et al., 2021], often leveraging filtering techniques based on execution [Li et al., 2022, Shi et al., 2022] or ranking [Chen et al., 2021, Inala et al., 2022, Zhang et al., 2022] to reduce the number of samples which are considered for the final answer. In contrast, our work focuses on evaluating the models from the point of view of minimizing the number of samples that need to be drawn from the model in the first place. Our work is also different in that we assume access to the full collection of input-output examples, as is typically done in inductive synthesis [Kitzelmann, 2010, Polozov and Gulwani, 2015, Gulwani et al., 2017, Chen et al., 2019a, Ellis et al., 2021]. In particular, unlike some prior work [Li et al., 2022, Shi et al., 2022], we do not make a distinction between public tests used for filtering and private tests used to determine correctness, since our method does not involve filtering the outputs.
Code repair. Statistical and learning-based techniques for code repair have a rich history in both the programming languages and machine learning communities, although they have traditionally been used predominantly to repair human-written code [Long and Rinard, 2016, Bader et al., 2019, Le Goues et al., 2021, Yasunaga and Liang, 2021, Chen et al., 2019b, Mesbah et al., 2019, Wang et al., 2018]. More recently, using repair as a post-processing step to improve code which was itself automatically synthesised has been used in the synthesis of both domain-specific languages [Gupta et al., 2020] and general-purpose code [Le et al., 2022, Yasunaga and Liang, 2021, 2020]. Our contribution differs from most prior work in this literature in the use of textual feedback for repair, which is possible thanks to the above mentioned rise in the use of LLMs for program synthesis.
Contemporary work on LLM self-repair. Recognizing that there is much contemporary work seeking to self-repair with LLMs, we now briefly highlight a few such papers which are particularly close to our work. Zhang et al. [2023] explore self-repair without natural language feedback on APPS [Hendrycks et al., 2021] using a diverse range of fine-tuned models. They also experiment with prompt-based repair using Codex [Chen et al., 2021], InCoder [Fried et al., 2023], and CodeGen [Nijkamp et al., 2023]. Notably, their framework does not consider the cost associated with feedback and repair, which presents a significantly different perspective on self-repair. Similarly, Chen et al. [2023b] assess Codex’s ability to self-repair across a variety of tasks, in a framework that closely resembles that which we study in this work. However, their study differs from ours in terms of the models considered, the evaluation strategy, and, most importantly, the research goal, as we specifically aim to investigate the significance of the textual feedback stage. Self-repair, or frameworks with other names that are conceptually very similar to it, has also been used in contexts outside of code generation. Peng et al. [2023] use self-repair to mitigate hallucinations and improve factual grounding in a ChatGPT-based web search assistant, in which the model revises its initial response based on self-generated feedback. Similarly, Madaan et al. [2023] present a framework in which a model iteratively provides feedback on and revises its output until a stopping criterion is reached; they apply this framework to a range of tasks, including dialogue and code optimization. Ultimately, we see our work, in which we use the novel evaluation metric pass@t to investigate the significance of the textual feedback stage in competition-level self-repair, as being complementary to contemporary research which uses traditional metrics to evaluate self-repair in a broader context. 
We are eager to see what the implications of our results will be in these other domains.
## 3 Methodology
### 3.1 Self-Repair Overview
As shown in Figure 1, our self-repair approach involves 4 stages: code generation, code execution, feedback generation, and code repair. We now formally define these four stages.
Code generation. Given a specification $\psi$ , a programming model $M_{P}$ first generates $n_{p}$ samples i.i.d., which we denote
$$
\{p_{i}\}_{i=1}^{n_{p}}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}M_{P}(\psi)
$$
Code execution. These $n_{p}$ code samples are then executed against a test bed. Recall from Section 2 that we assume that we have access to the full set of tests in executable form (see Section 5 for a brief discussion on the validity of this assumption in software engineering domains). Thus, we stop if any sample passes all of the tests, since a satisfying program has then been found. Otherwise, we collect the error messages $\{e_{i}\}_{i}$ returned by the execution environment. These error messages either contain the compile/runtime error information or an example input on which the program’s output differs from the expected one. An example is shown in Figure 1 (component 3).
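As a concrete sketch of this stage (the helper name `run_tests` and its interface are our own illustration, not the paper's implementation), a candidate is executed against input-output pairs and either passes or yields one of the two kinds of error message described above:

```python
def run_tests(f, tests):
    """Execute a candidate program against (input, expected output) pairs.

    Returns None on success, or an error message: either runtime error
    information, or an example input on which the output differs.
    """
    for x, y in tests:
        try:
            out = f(x)
        except Exception as exc:
            return f"Runtime error on input {x!r}: {exc}"
        if out != y:
            return f"Given input {x!r}, the program returned {out!r}, but the expected output was {y!r}."
    return None  # all tests passed: a satisfying program has been found

# The buggy program from Figure 1 reproduces the figure's error message.
buggy = lambda s: (7 - ['SUN', 'MON', 'TUE', 'WED', 'THU', 'FRI', 'SAT'].index(s)) % 7
print(run_tests(buggy, [('MON', 6), ('WED', 4), ('SUN', 7)]))
# -> Given input 'SUN', the program returned 0, but the expected output was 7.
```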
Feedback generation. Since the error messages from the execution environment are usually very high-level, they provide little signal for repair. Therefore, as an intermediate step, we use a feedback model to produce a more detailed explanation of what went wrong; Figure 1 (component 4) shows an example. Formally, in this stage, we generate $n_{f}$ feedback strings, $\{f_{ij}\}_{j}$ , for each wrong program, $p_{i}$ , as follows:
$$
\{f_{ij}\}_{j=1}^{n_{f}}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}M_{F}(\psi;p_{i};e_{i})
$$
Having an explicit feedback generation step allows us to ablate this component so that we can study its significance in isolation.
Code repair. In the final step, for each initial program $p_{i}$ and feedback string $f_{ij}$ , $n_{r}$ candidate repaired programs are sampled from $M_{P}$ (we use the same model for both the initial code generation and the code repair, since these are fundamentally similar tasks):
$$
\{r_{ijk}\}_{k=1}^{n_{r}}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}M_{P}(\psi;p_{i};e_{i};f_{ij})
$$
Repair tree. We call the tree of interleaved text and programs produced by this procedure—rooted in the specification $\psi$ , then branching into initial programs $p_{i}$ , each of which branches into feedback $f_{ij}$ and then repairs $r_{ijk}$ —a repair tree, $T$ (Figure 2).
Caveat: jointly sampling feedback and repair. The general framework presented above does not require the programming model and feedback model to be the same, thus allowing for the use of specialized models in the system. However, when $M_{P}=M_{F}$ we jointly generate both the feedback and the repaired program in a single API call, since both GPT-3.5 and GPT-4 have a natural tendency to interleave text and code in their responses. See Appendix E for a detailed look at how the prompt differs between this and the previous setting. Formally, we denote this as
$$
\{(f_{ij},r_{ij})\}_{j=1}^{n_{fr}}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}M_{P}(\psi;p_{i};e_{i})
$$
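Putting the four stages together, one way to grow a repair tree can be sketched as follows; the wrapper functions (`sample_program`, `sample_feedback`, `sample_repair`, `run_tests`) are hypothetical stand-ins for the model calls and the execution environment, not part of the paper's implementation:

```python
def build_repair_tree(spec, tests, n_p, n_f, n_r,
                      sample_program, sample_feedback, sample_repair, run_tests):
    """Sketch of one repair tree rooted at `spec`.

    Returns (passed, tree): `passed` becomes True as soon as any program
    satisfies the tests; `tree` records (program, feedback, repair) triples.
    """
    tree = []
    programs = [sample_program(spec) for _ in range(n_p)]        # code generation
    for p in programs:
        e = run_tests(p, tests)                                  # code execution
        if e is None:
            return True, tree                                    # stop: program found
        for f in [sample_feedback(spec, p, e) for _ in range(n_f)]:   # feedback
            for r in [sample_repair(spec, p, e, f) for _ in range(n_r)]:  # repair
                tree.append((p, f, r))
                if run_tests(r, tests) is None:
                    return True, tree
    return False, tree
```

For joint sampling of feedback and repair ($M_{P}=M_{F}$), the two inner calls would be replaced by a single call returning an $(f_{ij}, r_{ij})$ pair.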
<details>
<summary>x2.png Details</summary>

Tree diagram of a repair tree: the root $\psi$ branches into initial programs $p_{1},\dots,p_{n_{p}}$; each $p_{i}$ branches into feedback strings $f_{i1},\dots,f_{in_{f}}$; and each $f_{ij}$ branches into repairs $r_{ij1},\dots,r_{ijn_{r}}$. A vertical flow to the right labels the three levels Code Gen (blue), Feedback (orange), and Repair (blue).
</details>
Figure 2: A repair tree begins with a specification $\psi$ (root node), then grows into initial programs, feedback, and repairs.
### 3.2 pass@t : pass rate vs. token count
Since self-repair requires several dependent model invocations of non-uniform cost, this is a setting in which pass@$k$—the likelihood of obtaining a correct program in $k$ i.i.d. samples—is not a suitable metric for comparing and evaluating various hyper-parameter choices of self-repair. Instead, we measure the pass rate as a function of the total number of tokens sampled from the model, a metric which we call pass@t.
Formally, suppose that you are given a dataset $D=\{\psi_{d}\}_{d}$ and a chosen set of values for the hyper-parameters $(M_{P},M_{F},n_{p},n_{f},n_{r})$ . Let $T_{d}^{i}\sim M(\psi_{d})$ denote a repair tree that is sampled as described in Section 3.1 for the task $\psi_{d}$ ; let $\text{size}(T_{d}^{i})$ denote the total number of program and feedback tokens in the repair tree; and say that $T_{d}^{i}\models\psi_{d}$ is true if, and only if, $T_{d}^{i}$ has at least one leaf program that satisfies the unit tests in the specification $\psi_{d}$ . Then the pass@t metric of this choice of hyper-parameters is defined as the expected pass rate at the number of tokens which you would expect to generate with this choice of hyper-parameters:
$$
\texttt{pass@t}\triangleq\mathop{\mathbb{E}}_{\substack{\psi_{d}\sim D\\T_{d}^{i}\sim M(\psi_{d})}}\left[T_{d}^{i}\models\psi_{d}\right]\quad\textbf{at}\quad t=\mathop{\mathbb{E}}_{\substack{\psi_{d}\sim D\\T_{d}^{i}\sim M(\psi_{d})}}\left[\text{size}(T_{d}^{i})\right]
$$
In our experiments, we plot bootstrapped estimates of these two quantities. To obtain these, we first generate a very large repair tree for each task specification, with: $N_{p}\geq n_{p}$ initial program samples; $N_{f}\geq n_{f}$ feedback strings per wrong program; and $N_{r}\geq n_{r}$ repair candidates per feedback string. Given a setting of $(n_{p},n_{f},n_{r})$ , we then sub-sample (with replacement) $N_{t}$ different repair trees from this frozen dataset. Finally, we compute the sample mean and standard deviation of the pass rate and the tree size over these $N_{t}$ trees. Estimating the pass@t in this way greatly reduces the computational cost of our experiments, since we can reuse the same initial dataset to compute the estimates for all of the various choices of $n_{p},n_{f}$ , and $n_{r}$ .
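A simplified sketch of this bootstrap follows. The data layout is our assumption: it subsamples only over the $n_{p}$ initial-program branches, each of which is collapsed to a (passed, token count) pair, whereas the paper's procedure also subsamples feedback strings and repairs within each branch.

```python
import random
import statistics

def bootstrap_pass_at_t(frozen, n_p, n_trials=1000, seed=0):
    """Bootstrap estimate of (pass rate, token count) and their deviations.

    `frozen` maps each task to a list of (passed, n_tokens) pairs, one per
    initial-program branch of the large frozen repair tree.  A sub-sampled
    repair tree draws n_p branches with replacement; it passes if any
    branch passes, and its size is the sum of the branch sizes.
    """
    rng = random.Random(seed)
    pass_rates, sizes = [], []
    for _ in range(n_trials):
        passed, tokens = [], []
        for branches in frozen.values():
            sample = [rng.choice(branches) for _ in range(n_p)]
            passed.append(any(p for p, _ in sample))
            tokens.append(sum(t for _, t in sample))
        pass_rates.append(statistics.mean(passed))   # pass rate over tasks
        sizes.append(statistics.mean(tokens))        # mean tree size over tasks
    return (statistics.mean(pass_rates), statistics.stdev(pass_rates),
            statistics.mean(sizes), statistics.stdev(sizes))
```

Because the frozen trees are sampled once and reused, estimates for every setting of $(n_{p},n_{f},n_{r})$ come from the same underlying data, as described above.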
We use $N_{p}=50$ for all experiments, and consider $n_{p}\leq 25$ for the self-repair approaches and $n_{p}\leq 50$ for the baseline, no-repair approach. Similarly, for the feedback strings, we use $N_{f}=25$ and $n_{f}\leq 10$ (except for Section 4.2, in which we only consider $n_{f}=1$ and therefore settle for $N_{f}=10$ instead). For the repair candidates, since we do joint sampling of feedback and repair in most of our experiments, we set $N_{r}=n_{r}=1$ . Finally, we use $N_{t}=1000$ for all settings.
## 4 Experiments
In this section, we carry out experiments to answer the following research questions: (a) In the context of challenging programming puzzles, is self-repair better than i.i.d. sampling without repair for the models we consider? If so, under what hyper-parameters is self-repair most effective? (b) Would a stronger feedback model boost the model’s repair performance? (c) Would keeping a human in the loop to provide feedback unlock better repair performance even for the strongest model?
We evaluate these hypotheses on Python programming challenges from the APPS dataset [Hendrycks et al., 2021]. The APPS dataset contains a diverse range of programming challenges paired with a suite of tests, making it a perfect (and challenging) setting to study self-repair in. To keep our experiments tractable, we evaluate on a subset of the APPS test set, consisting of 300 tasks. These tasks are proportionally sampled in accordance with the frequency of the different difficulty levels in the test set: 180 interview-level questions, 60 competition-level questions, and 60 introductory-level questions (listed in Appendix F). We use GPT-3.5 [Ouyang et al., 2022, OpenAI, 2022] and GPT-4 [OpenAI, 2023] as our models of choice, and implement self-repair using templated string concatenation with one-shot prompting; our prompts are given in Appendix E. When appropriate, we compare against a baseline without repair. This baseline, shown with a black line in the plots, is simply i.i.d. sampling from the corresponding model (e.g., GPT-4 when we explore whether GPT-4 is capable of self-repair). Based on preliminary experiments, we set the decoding temperature to $0.8$ for all the models to encourage diverse samples.
![Figure 3a: scatter plot of mean pass rate vs. mean number of tokens generated for GPT-3.5 self-repair; color encodes $n_{p}\in\{1,2,5,10,25\}$, marker shape encodes $n_{fr}\in\{1,3,5,10\}$](x3.png)
(a) Mean pass rate vs. number of tokens generated. Black line is i.i.d. sampling without repair from GPT-3.5. Note that the error bars are often smaller than the markers; all settings have a standard deviation of less than 1.5 absolute points on the y-axis. Results truncated at $t=10,000$ .
![Figure 3b: heatmap of normalized mean pass rate for GPT-3.5 over $n_{p}$ (x-axis) and $n_{fr}$ (y-axis)](x4.png)

Approximate cell values, as read from the figure:

| $n_{fr}$ \ $n_{p}$ | 1 | 2 | 5 | 10 | 25 |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **10** | 0.78 | 0.86 | O.O.B. | O.O.B. | O.O.B. |
| **5** | 0.80 | 0.86 | 0.95 | O.O.B. | O.O.B. |
| **3** | 0.81 | 0.87 | 0.94 | 1.00 | O.O.B. |
| **1** | 0.87 | 0.92 | 0.96 | 0.99 | O.O.B. |
(b) Normalized mean pass rate relative to the (interpolated) baseline at an equivalent budget (number of tokens). Cells for which the number of tokens generated exceeds that of 50 samples from the GPT-3.5 baseline are marked O.O.B. (out of bounds).
Figure 3: Pass rate versus number of tokens generated for various settings of $n_{p}$ (number of initial programs) and $n_{fr}$ (number of repairs sampled per program). GPT-3.5 is used for all samples, including the baseline.
### 4.1 Self-repair requires strong models and diverse initial samples
In this subsection, we consider the setup where $M_{P}=M_{F}\in\{\text{GPT-3.5, GPT-4}\}$ , i.e., where a single model is used for both code/repair generation and feedback generation. To evaluate whether self-repair leads to better pass@t than a no-repair, i.i.d. sampling-based baseline approach, we vary $n_{p}$ and $n_{fr}$ (the number of initial i.i.d. base samples and the number of joint feedback–repair samples drawn from $M_{P}$ , respectively) over the range $(n_{p},n_{fr})\in\{1,2,5,10,25\}\times\{1,3,5,10\}$ . Recall that when $M_{P}=M_{F}$ , we jointly sample $n_{fr}$ pairs of feedback strings and repair programs instead of sampling them one after another (Section 3.1).
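One round of this procedure can be sketched as follows. Here `generate`, `run_tests`, and `extract` are placeholders for the model, the unit-test harness, and the completion parser, not our actual implementation:

```python
def self_repair(spec, n_p, n_fr, generate, run_tests, extract=lambda s: s):
    """One self-repair round with joint feedback+repair sampling (M_P = M_F).

    `generate(prompt)` samples one completion from the model;
    `run_tests(program)` returns (passed, error_message);
    `extract(completion)` pulls the program out of a completion that
    contains both the feedback string and the repaired code.
    All three are stand-ins for the real model and harness."""
    for _ in range(n_p):                      # diverse initial samples
        program = generate(spec)
        passed, error = run_tests(program)
        if passed:
            return program
        for _ in range(n_fr):                 # joint (feedback, repair) samples
            # One completion carries the feedback followed by the fix,
            # so feedback and repair are sampled together as n_fr pairs.
            completion = generate(
                f"{spec}\n{program}\n{error}\n"
                "Explain why the code is wrong, then fix it:")
            fixed = extract(completion)
            if run_tests(fixed)[0]:
                return fixed
    return None  # no passing program within this budget
```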
The results are shown in Figure 3 for GPT-3.5 and Figure 4 for GPT-4. In the left-hand subplots, the color of each dot indicates the number of initial samples ( $n_{p}$ ), while its shape indicates the number of feedback-repair samples ( $n_{fr}$ ). In the right-hand subplots, we show a heat map with the two hyper-parameters along the axes; the value in each cell is the mean pass rate with self-repair normalized by the mean pass rate of the baseline, no-repair approach at the same token budget (i.e., pass@t at the same value of $t$ ). A normalized value of 1 means that self-repair achieves the same pass rate as the no-repair baseline at that token budget; a value greater than 1 means that self-repair outperforms the baseline.
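This normalization can be sketched in a few lines. The helper below uses plain piecewise-linear interpolation of the baseline curve, and the baseline data points are hypothetical:

```python
def interp(x, xs, ys):
    """Piecewise-linear interpolation of (xs, ys), clamped at the ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for (x0, y0), (x1, y1) in zip(zip(xs, ys), zip(xs[1:], ys[1:])):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def normalized_pass_rate(tokens, pass_rate, baseline_tokens, baseline_pass):
    """Divide a self-repair pass rate by the baseline's pass@t at the
    same token budget, interpolating the baseline curve."""
    return pass_rate / interp(tokens, baseline_tokens, baseline_pass)

# Hypothetical baseline curve (token budget, pass@t) from i.i.d. sampling.
base_t = [0, 2000, 4000, 8000]
base_p = [0.22, 0.42, 0.47, 0.50]

# A setting that spends ~3000 tokens for a 0.40 pass rate normalizes
# to a value below 1, i.e. it underperforms the baseline at that budget.
print(normalized_pass_rate(3000, 0.40, base_t, base_p))
```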
From the plots, we can see that for the GPT-3.5 model, the pass@t is lower than or equal to the corresponding baseline (black line) for all settings of $n_{p},n_{fr}$ , clearly showing that self-repair is not an effective strategy for GPT-3.5. On the other hand, for GPT-4, there are several values of $n_{p},n_{fr}$ for which the pass rate with self-repair is significantly better than that of the baseline. For example, with $n_{p}=10,n_{fr}=3$ the pass rate increases from 65% to 70%, and with $n_{p}=25,n_{fr}=1$ it increases from 65% to 71%.
Our experiments also show a clear trend with respect to the relationship between the hyper-parameters. Given a fixed number of feedback-repairs ( $n_{fr}$ ), increasing the number of initial programs ( $n_{p}$ ) (i.e., moving right along the x-axis on the heat maps) consistently leads to relative performance gains for both models. On the other hand, fixing $n_{p}$ and increasing $n_{fr}$ (i.e., moving up along the y-axis on the heat maps) does not appear to be worth the additional cost incurred, giving very marginal gains at higher budgets and even decreasing performance at lower budgets. This suggests that, given a fixed budget, the most important factor determining whether self-repair will lead to a correct program or not is the diversity of the base samples that are generated up-front, rather than the diversity of the repairs sampled. Having more initial samples increases the likelihood of there being at least one program which is close to the ideal program and, hence, can be successfully repaired.
Since $n_{fr}=1$ is the best choice for the hyper-parameter $n_{fr}$ , we next isolate the effect of the number of initial programs, $n_{p}$ , by exploring a denser set of possible values: $(n_{p},n_{fr})\in\{1,2,\ldots,25\}\times\{1\}$ . The plots are shown in Figure 5 for both $M_{P}=M_{F}\in\{\text{GPT-3.5},\text{GPT-4}\}$ and the baseline, no-repair approaches. Note that since $n_{fr}$ is fixed, in these plots there is a direct correlation between $n_{p}$ and the total number of tokens, $t$ . Again, we see that self-repair is not an effective strategy for the GPT-3.5 model, but that it is effective for GPT-4—especially at larger token budgets ( $t\geq 5000$ ), where it increases the pass rate by over 5 percentage points.
![Figure 4a: scatter plot of mean pass rate vs. mean number of tokens generated for GPT-4 self-repair; color encodes $n_{p}\in\{1,2,5,10,25\}$, marker shape encodes $n_{fr}\in\{1,3,5,10\}$](x5.png)
(a) Mean pass rate vs. number of tokens generated. Black line is i.i.d. sampling without repair from GPT-4. Note that the error bars are often smaller than the markers; all settings have a standard deviation of less than 1.5 absolute points on the y-axis. Results truncated at $t=10,000$ .
![Figure 4b: heatmap of normalized mean pass rate for GPT-4 over $n_{p}$ (x-axis) and $n_{fr}$ (y-axis)](x6.png)

Approximate cell values, as read from the figure:

| $n_{fr}$ \ $n_{p}$ | 1 | 2 | 5 | 10 | 25 |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **10** | 0.90 | 0.98 | 1.05 | O.O.B. | O.O.B. |
| **5** | 0.91 | 0.98 | 1.04 | 1.08 | O.O.B. |
| **3** | 0.93 | 0.99 | 1.04 | 1.08 | O.O.B. |
| **1** | 0.98 | 1.01 | 1.04 | 1.06 | 1.09 |
(b) Normalized mean pass rate relative to the (interpolated) baseline at an equivalent budget (number of tokens). Cells for which the number of tokens generated exceeds that of 50 samples from the GPT-4 baseline are marked O.O.B. (out of bounds).
Figure 4: Pass rate versus number of tokens generated for various settings of $n_{p}$ (number of initial programs) and $n_{fr}$ (number of repairs per failing program). GPT-4 is used for all samples, including the baseline.
### 4.2 GPT-4 feedback improves GPT-3.5 repair
Next, we conduct an experiment in which we evaluate the impact of using a separate, stronger model to generate the feedback. This is to test the hypothesis that self-repair is held back (especially for GPT-3.5) by the model’s inability to introspect and debug its own code.
For this experiment, we set $M_{P}$ = GPT-3.5 and $M_{F}$ = GPT-4 and vary the hyper-parameters as $(n_{p},n_{f},n_{r})\in\{1,2,\ldots,25\}\times\{1\}\times\{1\}$ , similarly to the previous experiment. Note that since we are now operating in a setting in which the feedback and repair stages must be separated, we have three hyper-parameters ( $n_{p}$ , $n_{f}$ , $n_{r}$ ) instead of two ( $n_{p}$ , $n_{fr}$ ; see Section 3.1). To keep the computational budget tractable, and since the variance was seen to be very low in the previous experiment, we use $N_{f}=10$ instead of $N_{f}=25$ for this experiment (see Section 3.2).
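When the feedback and repair models differ, the two stages can no longer be jointly sampled. A minimal sketch of the decoupled pipeline follows, with `feedback_model` and `repair_model` standing in for calls to $M_{F}$ (e.g., GPT-4) and $M_{P}$ (e.g., GPT-3.5); the prompt wording is illustrative, not our actual prompt:

```python
def feedback_then_repair(spec, program, error, feedback_model, repair_model,
                         n_f=1, n_r=1):
    """Decoupled repair: sample n_f feedback strings from the (stronger)
    feedback model M_F, then n_r repairs per feedback string from the
    program model M_P, yielding n_f * n_r repair candidates in total."""
    candidates = []
    for _ in range(n_f):
        feedback = feedback_model(
            f"{spec}\nFaulty program:\n{program}\nError:\n{error}\n"
            "Explain concisely why the code is wrong:")
        for _ in range(n_r):
            candidates.append(repair_model(
                f"{spec}\nFaulty program:\n{program}\n"
                f"Feedback:\n{feedback}\nFixed program:"))
    return candidates
```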
The results for this experiment are shown in Figure 5 (light blue line). We observe that, in terms of absolute performance, $M_{P}=$ GPT-3.5, $M_{F}=$ GPT-4 does break through the performance barrier, becoming marginally more efficient than i.i.d. sampling from GPT-3.5. This suggests that the textual feedback stage itself is of crucial importance, and that improving it relieves the bottleneck in GPT-3.5 self-repair.
Figure 5: Mean pass rate for each model when $n_{fr}$ (or $n_{f}$ and $n_{r}$ ) = 1. Shaded region is $\pm 1$ standard deviation. Complete breakdown per difficulty in Appendix A.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Line Chart: Mean Pass Rate vs. Mean Number of Tokens Generated
### Overview
This is a line chart plotting the "Mean pass rate" against the "Mean number of tokens generated" for five different configurations of large language models (LLMs). The chart compares the performance of GPT-4 and GPT-3.5, both in standalone modes and with a "repair" mechanism ($M_F$) applied. The data is presented as smooth curves with shaded confidence intervals.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "Mean number of tokens generated". The scale runs from 0 to 10,000, with major tick marks at 0, 2000, 4000, 6000, 8000, and 10000. The labels are rotated approximately 45 degrees.
* **Y-Axis (Vertical):** Labeled "Mean pass rate". The scale runs from 0.0 to 1.0, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **Legend:** Positioned in the bottom-right quadrant of the chart area. It contains five entries, each with a colored line sample and a corresponding label:
1. **Dark Blue Line:** $M_P =$ GPT-4 (no repair)
2. **Teal/Green Line:** $M_P =$ GPT-4; $M_F =$ GPT-4
3. **Gray Line:** $M_P =$ GPT-3.5 (no repair)
4. **Orange Line:** $M_P =$ GPT-3.5; $M_F =$ GPT-3.5
5. **Light Blue Line:** $M_P =$ GPT-3.5; $M_F =$ GPT-4
* **Grid:** A light gray grid is present, aligned with the major tick marks on both axes.
### Detailed Analysis
The chart displays five distinct data series, each showing a logarithmic-like growth trend where the mean pass rate increases rapidly with initial token generation and then plateaus.
**Trend Verification & Data Points (Approximate):**
1. **$M_P =$ GPT-4 (no repair) [Dark Blue]:**
* **Trend:** Starts at a pass rate of ~0.4 at 0 tokens. Rises steeply, crossing 0.6 at ~2000 tokens, and continues a gradual ascent.
* **Endpoint:** At ~8000 tokens, the pass rate is approximately 0.66. This is the second-highest performing series.
2. **$M_P =$ GPT-4; $M_F =$ GPT-4 [Teal/Green]:**
* **Trend:** Follows a nearly identical path to the dark blue line but is consistently slightly higher. Starts at ~0.4, crosses 0.6 at ~1500 tokens.
* **Endpoint:** At its visible end (~6500 tokens), the pass rate is approximately 0.71. This is the highest-performing series overall.
3. **$M_P =$ GPT-3.5 (no repair) [Gray]:**
* **Trend:** Starts lower, at a pass rate of ~0.22 at 0 tokens. Rises steadily, crossing 0.4 at ~2000 tokens.
* **Endpoint:** At ~9000 tokens, the pass rate is approximately 0.51. This is the lowest-performing series.
4. **$M_P =$ GPT-3.5; $M_F =$ GPT-3.5 [Orange]:**
* **Trend:** Starts at a similar point to the gray line (~0.22). Initially tracks closely with the gray line but begins to slightly outperform it after ~2000 tokens.
* **Endpoint:** At 10,000 tokens, the pass rate is approximately 0.53.
5. **$M_P =$ GPT-3.5; $M_F =$ GPT-4 [Light Blue]:**
* **Trend:** Starts at ~0.22, similar to the other GPT-3.5 series. Shows a more pronounced improvement over the gray and orange lines, pulling ahead after ~1000 tokens.
* **Endpoint:** At 10,000 tokens, the pass rate is approximately 0.57. This is the highest-performing configuration using GPT-3.5 as the primary model ($M_P$).
### Key Observations
* **Performance Hierarchy:** The two GPT-4 primary model ($M_P$) configurations (dark blue, teal) significantly outperform all three GPT-3.5 primary model configurations across the entire token range.
* **Impact of Repair ($M_F$):** For both GPT-4 and GPT-3.5, adding a repair model ($M_F$) improves the mean pass rate compared to the "no repair" baseline. The improvement is more pronounced for GPT-3.5.
* **Cross-Model Repair Benefit:** The most effective repair configuration for a GPT-3.5 primary model is using GPT-4 as the repair model (light blue line), which yields a higher pass rate than using GPT-3.5 for repair (orange line).
* **Diminishing Returns:** All curves show strong diminishing returns; the gain in pass rate per additional token generated decreases substantially after the first 2000-4000 tokens.
* **Confidence Intervals:** The shaded areas around each line indicate uncertainty or variance. The intervals appear relatively narrow and consistent across the series, suggesting stable measurements.
### Interpretation
This chart demonstrates the relationship between computational cost (measured by tokens generated) and task success rate (pass rate) for different LLM setups. The data suggests several key insights:
1. **Model Capability is Primary:** The base capability of the primary model ($M_P$) is the strongest determinant of performance. GPT-4's advantage over GPT-3.5 is clear: even its "no repair" configuration outperforms all GPT-3.5 variants.
2. **Feedback-Driven Repair is Effective:** Adding a feedback and repair stage ($M_F$) consistently improves outcomes, indicating that corrective iteration adds value beyond a single generation pass.
3. **Strategic Resource Allocation:** Using a more powerful model (GPT-4) to provide feedback for a weaker primary model (GPT-3.5) is a viable strategy to boost performance, potentially offering a cost-benefit trade-off compared to using the powerful model for the entire task.
4. **Efficiency Plateau:** The logarithmic trend implies that for any given model configuration, there is a point (around 4000-6000 tokens) where expending significantly more tokens yields only marginal improvements in pass rate. This has implications for optimizing token usage and cost in practical applications.
In summary, the chart provides a quantitative basis for choosing model configurations and token budgets, highlighting the trade-offs between model power, repair strategies, and computational efficiency.
</details>
| Difficulty | GPT-4 feedback | Human feedback |
| --- | --- | --- |
| Introductory | 42.64% | 62.21% |
| Interview | 19.33% | 45.67% |
| Competition | 3.67% | 14.67% |
| Overall | 33.30% | 52.60% |
Figure 5: Mean pass rate for each model when $n_{fr}$ (or $n_{f}$ and $n_{r}$ ) = 1. Shaded region is $\pm 1$ standard deviation. Complete breakdown per difficulty in Appendix A.
Figure 6: Success rate of repair with GPT-4’s explanations vs. with those of our human participants.
### 4.3 Human feedback significantly improves the success rate of GPT-4 repair
For our final experiment, we consider the effect of using an expert human programmer's feedback when performing repair with stronger models such as GPT-4. The goal of this study is not a direct comparison between a human-in-the-loop approach and self-repair, since a human-in-the-loop approach imposes an additional cognitive burden that we do not attempt to quantify. Instead, our goal is to understand how the model's ability to identify mistakes in the code compares to that of a human, and how this affects downstream performance in self-repair. We thus conduct both qualitative and quantitative analyses of the impact of human feedback on self-repair.
Data collection methodology. We first sample 20 tasks $\{\psi_{i}\}_{i=1}^{20}$ from the APPS test set; to make the data collection process less time-consuming for the participants of the study, we skew the distribution towards easier tasks (14 introductory; 3 interview; 3 competition). For each task $\psi_{i}$, we then sample two failing GPT-4 completions $p_{i,1},p_{i,2}$, making for a total of $20\cdot 2=40$ programs to refine. We recruit 16 participants, consisting of 15 graduate students and one professional machine learning engineer. Each participant is assigned five base programs according to their level of experience with Python and competitive programming. Each program is taken from a distinct task; participants are never shown two different programs belonging to the same task. Participants are then asked to explain, in their own words, what the program is doing wrong. To reduce the cognitive load for participants, each program $p_{i,j}$ is accompanied by the error message $e_{i,j}$ and two feedback strings $f_{i,j,1},f_{i,j,2}$ sampled from GPT-4. We obtain these feedback strings by randomly sampling from the feedback-repair pairs used in the previous experiments and removing the code block. Note that each of the 40 programs is shown to two different participants, to reduce variance caused by participants' skill levels and writing styles. Participants were told to spend approximately one hour on the study overall, and were compensated with a $15 gift card. This human data collection was approved by our Institutional Review Board (IRB) and carried out exclusively through an online survey. See Appendix B for a complete copy of the instructions given to our participants.
Quantitative Analysis. Having obtained two human-written pieces of feedback $h_{i,j,1},h_{i,j,2}$ for each program $p_{i,j}$ , we sample 25 repaired programs
$$
\{r_{l}\}_{l=1}^{25} \overset{\text{i.i.d.}}{\sim} \text{GPT-4}(\psi_{i};p_{i,j};e_{i,j};f)
$$
for $f\in\{h_{i,j,1},h_{i,j,2},f_{i,j,1},f_{i,j,2}\}$ . That is: we ask GPT-4 to generate 25 candidate repairs for each program, conditioned on the specification, the initial program, and a feedback string which is either set to one of GPT-4’s own feedback strings or to one provided by a participant. Finally, we execute all of these candidate repairs against the test bed, and take note of how often they pass.
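This procedure can be sketched in code (a toy illustration, not the authors' implementation; `toy_model` and `toy_tests` are stand-ins for GPT-4 and the APPS test harness, with feedback abstracted to a quality score):

```python
import random

def sample_repairs(model, spec, program, error, feedback, n=25):
    """Draw n candidate repairs i.i.d., each conditioned on the task spec,
    the failing program, its error message, and one feedback string."""
    return [model(spec, program, error, feedback) for _ in range(n)]

def repair_success_rate(model, run_tests, spec, program, error, feedbacks, n=25):
    """Fraction of candidate repairs that pass the full test suite,
    pooled over all feedback strings for this program."""
    candidates = [r for f in feedbacks
                  for r in sample_repairs(model, spec, program, error, f, n)]
    return sum(run_tests(r) for r in candidates) / len(candidates)

# Toy stand-ins so the sketch runs: feedback is abstracted to a quality
# score in [0, 1], and the "model" produces a fix with that probability.
random.seed(0)
toy_model = lambda spec, prog, err, fb: "fixed" if random.random() < fb else "broken"
toy_tests = lambda prog: prog == "fixed"

rate = repair_success_rate(toy_model, toy_tests,
                           spec="psi_i", program="p_ij", error="e_ij",
                           feedbacks=[0.6, 0.4])  # two feedback strings
print(rate)
```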
The results are summarized in Figure 6, with a complete task-by-task breakdown in Appendix C. We note first of all that the overall success rate increases by a factor of more than $1.57$ when we replace GPT-4's own debugging with that of our human participants. Perhaps unsurprisingly, the relative difference grows as the problems get harder, indicating that GPT-4's ability to produce accurate and useful feedback falls further behind our human participants' as the task (and code) becomes more complex.
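These relative gains can be checked directly against the per-difficulty success rates reported in the table above (a quick sanity check on the transcribed percentages, not new data):

```python
# Success rates of GPT-4 repair, conditioned on each feedback source.
gpt4_fb = {"Introductory": 0.4264, "Interview": 0.1933,
           "Competition": 0.0367, "Overall": 0.3330}
human_fb = {"Introductory": 0.6221, "Interview": 0.4567,
            "Competition": 0.1467, "Overall": 0.5260}

# Relative gain from human feedback; it grows with task difficulty.
ratios = {k: round(human_fb[k] / gpt4_fb[k], 2) for k in gpt4_fb}
print(ratios)
```

The overall ratio comes out just under 1.58, consistent with the "over 1.57×" figure, and the per-difficulty ratios grow monotonically from introductory to competition problems.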
Qualitative Analysis. In this section, we qualitatively analyze the difference between the feedback provided by the human participants and the feedback provided by GPT-4. We manually go through all of GPT-4’s and the participants’ feedback and note down whether the feedback: (a) seems, at a cursory glance, to be correct, or if it is obviously inaccurate; (b) explicitly suggests a small change to the code (e.g. "change the condition on line X"); (c) explicitly suggests a large change to the code (e.g. "frame the problem as min-cut instead of shortest-path"); (d) contains blocks of pseudocode or Python (which GPT-4’s feedback never does, per our experiment design); or (e) expresses uncertainty (using phrases such as "unsure", "it appears", etc.). We do not count individual single-line statements/expressions such as “ $x=5$ ” as pseudocode or Python. Examples of each category are shown in Appendix D. We find that
- Only 2/80 human-contributed feedback strings include pseudocode or explicit Python; that is, almost all human feedback we obtain is natural language interleaved with occasional single-statement math/code expressions.
- GPT-4’s feedback is much more likely to be obviously inaccurate (32/80 vs. 7/80 for human feedback).
- GPT-4 is more likely to explicitly suggest small changes (54/80 vs 42/80; 28/48 vs. 38/73 when seemingly correct), while our human participants show a slightly greater tendency to suggest high-level changes (23/80 vs. 18/80 for GPT-4; 21/73 vs. 13/48 when seemingly correct).
- Our human participants sometimes express uncertainty (7/80); GPT-4 never does (0/80).
This further analysis suggests that the results in Figure 6 are not due to artefacts such as our participants providing explicit code blocks which the model simply copies. Instead, the difference in performance appears to be caused by a combination of more accurate feedback, a greater ability to suggest high-level, large-scale changes to the code when needed, and our participants’ ability to express their uncertainty (instead of confidently giving potentially inaccurate feedback).
## 5 Limitations
Firstly, to reduce computational cost, we pre-populate and then sub-sample from large repair trees, which introduces statistical bias. We mitigate this by being conservative in our uncertainty estimates for pass@t, using the maximum standard deviation across all points. We also note that this standard deviation, obtained at values of $(n_{p},n_{f},n_{r})$ small enough that our pre-populated repair trees contain very many samples, is low ($<2\%$ pass rate for all models). While these measures do not completely eliminate the risk of bias in our results, forgoing this amortization would have required significantly more compute.
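The amortization can be sketched as follows (a simplified illustration of the sub-sampling estimator, assuming a `tree` of pre-computed pass/fail outcomes; not the authors' implementation):

```python
import random

def estimate_pass_rate(tree, n_p, n_fr, trials=2000, seed=0):
    """Monte Carlo pass-rate estimate for a given (n_p, n_fr) budget.

    `tree` is a pre-populated repair tree: one entry per pre-sampled
    initial program, holding whether it passed outright and the pass/fail
    outcomes of its pre-generated feedback-repair pairs.  Each trial
    sub-samples n_p initial programs and n_fr repairs per program, and
    counts a success if anything sampled passes -- avoiding any fresh
    model calls at evaluation time.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        for base_ok, repair_outcomes in rng.sample(tree, n_p):
            if base_ok or any(rng.sample(repair_outcomes, n_fr)):
                hits += 1
                break
    return hits / trials

# Toy tree: 50 initial programs, each with 10 pre-generated repairs.
gen = random.Random(1)
tree = [(gen.random() < 0.2, [gen.random() < 0.3 for _ in range(10)])
        for _ in range(50)]
print(estimate_pass_rate(tree, n_p=5, n_fr=3))
```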
Secondly, we assume access to an executable suite of unit tests for each task. We do not, for example, require the model to extract tests from textual specifications. While this assumption may seem out of place in the era of chat-style assistants like ChatGPT [OpenAI, 2022], it does align well with established software engineering practices like Test-Driven Development [Astels, 2003]. Furthermore, techniques which automatically synthesize test cases given a specification [Li et al., 2022, Chen et al., 2023a] may relieve some of the user burden.
Finally, our study on human data did not track how much time the participants took to debug the programs. As a result, we can only evaluate the quality of the feedback (and the impact this has on repair). Further research at the intersection of HCI, AI, and program synthesis is needed to explore when and how human intervention should be leveraged, as well as how programming assistants should be designed to facilitate this style of interaction.
## 6 Broader Impact
Any tool that improves the productivity of people writing software will necessarily also increase the productivity of people writing software with malicious intent. It is also important to remember that research on LLMs comes at a very high environmental cost. Although we exclusively use publicly available pre-trained models in this work, and so do not train any models of our own, even inference comes with a significant carbon footprint at scale. At the same time, this work—in which we weigh model performance against the computational cost of obtaining it, and through which we learn more about when and how these models do and do not work—is a step towards more sample-efficient usage paradigms.
## 7 Conclusion
In this paper, we investigated the role of textual feedback in self-repair. We presented pass@t, a new evaluation strategy which takes the cost of carrying out repair into account, and then used this metric to show that (1) GPT-3.5 is not capable of carrying out self-repair on challenging coding tasks, and (2) while performance gains are seen in GPT-4, they are modest and rely on achieving sufficient diversity in the initial programs. Furthermore, by ablating the feedback stage we found that (3) substituting GPT-3.5’s feedback with GPT-4’s improved performance, even surpassing GPT-3.5’s baseline. Finally, we carried out an experiment with human participants, in which we found that (4) replacing GPT-4’s self-generated feedback with feedback provided by an experienced programmer increased the number of repaired programs which pass all unit tests by 57%.
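One plausible reading of the pass@t idea can be sketched in code (a hypothetical estimator for illustration only, assuming per-task records of cumulative tokens spent and success; not the paper's exact formulation):

```python
def pass_at_t(records, budgets):
    """pass@t sketch: mean pass rate as a function of token budget.

    `records` holds, per task, a list of (cumulative_tokens, passed)
    checkpoints from one sampling-and-repair run; a task counts as
    solved at budget t if any checkpoint within budget passed.
    """
    rates = {}
    for t in budgets:
        solved = sum(any(p for tok, p in run if tok <= t) for run in records)
        rates[t] = solved / len(records)
    return rates

runs = [[(800, False), (2400, True)],   # solved after one repair round
        [(600, True)],                  # solved by the initial sample
        [(900, False), (3000, False)]]  # never solved within budget
print(pass_at_t(runs, budgets=[1000, 2500, 4000]))
```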
## Acknowledgments and Disclosure of Funding
T.X. Olausson is supported by the Defense Advanced Research Projects Agency (DARPA) under the ASKEM program, award HR00112220042. T.X. Olausson was also supported through a position at Microsoft Research for part of the time period during which this work was carried out. A. Solar-Lezama is supported by the National Science Foundation (NSF) and Intel Corporation through NSF Grant CCF:2217064. This work benefited greatly from discussion with several colleagues at Microsoft Research. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, the Defense Advanced Research Projects Agency, Intel Corporation, or Microsoft Research.
## References
- Astels [2003] D. Astels. Test Driven Development: A Practical Guide. Prentice Hall Professional Technical Reference, 2003. ISBN 0131016490.
- Austin et al. [2021] J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program Synthesis with Large Language Models, 2021. arXiv preprint arXiv:2108.07732. https://arxiv.org/abs/2108.07732.
- Bader et al. [2019] J. Bader, A. Scott, M. Pradel, and S. Chandra. Getafix: Learning to fix bugs automatically. Proc. ACM Program. Lang., 3(OOPSLA), Oct 2019. doi: 10.1145/3360585.
- Chen et al. [2023a] B. Chen, F. Zhang, A. Nguyen, D. Zan, Z. Lin, J.-G. Lou, and W. Chen. CodeT: Code generation with generated tests. In International Conference on Learning Representations, 2023a.
- Chen et al. [2021] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al. Evaluating Large Language Models Trained on Code, 2021. arXiv preprint arXiv:2107.03374. https://arxiv.org/abs/2107.03374.
- Chen et al. [2019a] X. Chen, C. Liu, and D. Song. Execution-Guided Neural Program Synthesis. In International Conference on Learning Representations, 2019a.
- Chen et al. [2023b] X. Chen, M. Lin, N. Schärli, and D. Zhou. Teaching Large Language Models to Self-Debug, 2023b. arXiv preprint arXiv:2304.05128. https://arxiv.org/abs/2304.05128.
- Chen et al. [2019b] Z. Chen, S. Kommrusch, M. Tufano, L.-N. Pouchet, D. Poshyvanyk, and M. Monperrus. SequenceR: Sequence-to-Sequence Learning for End-to-End Program Repair. IEEE Transactions on Software Engineering, 2019b.
- Chowdhery et al. [2022] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann, et al. PaLM: Scaling Language Modeling with Pathways, 2022. arXiv preprint arXiv:2204.02311. https://arxiv.org/abs/2204.02311.
- Ellis et al. [2021] K. Ellis, C. Wong, M. Nye, M. Sablé-Meyer, L. Morales, L. Hewitt, L. Cary, A. Solar-Lezama, and J. B. Tenenbaum. DreamCoder: Bootstrapping Inductive Program Synthesis with Wake-Sleep Library Learning. In The International Conference on Programming Language Design and Implementation, 2021.
- Fried et al. [2023] D. Fried, A. Aghajanyan, J. Lin, S. Wang, E. Wallace, F. Shi, R. Zhong, W.-t. Yih, L. Zettlemoyer, and M. Lewis. InCoder: A generative model for code infilling and synthesis. In International Conference on Learning Representations, 2023.
- Gulwani et al. [2017] S. Gulwani, O. Polozov, and R. Singh. Program Synthesis. Foundations and Trends® in Programming Languages Series. Now Publishers, 2017. ISBN 9781680832921.
- Gupta et al. [2020] K. Gupta, P. E. Christensen, X. Chen, and D. Song. Synthesize, Execute and Debug: Learning to Repair for Neural Program Synthesis. In Advances in Neural Information Processing Systems, 2020.
- Hendrycks et al. [2021] D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt. Measuring Coding Challenge Competence With APPS. In Advances in Neural Information Processing Systems, 2021.
- Inala et al. [2022] J. P. Inala, C. Wang, M. Yang, A. Codas, M. Encarnación, S. Lahiri, M. Musuvathi, and J. Gao. Fault-Aware Neural Code Rankers. In Advances in Neural Information Processing Systems, 2022.
- Kitzelmann [2010] E. Kitzelmann. Inductive Programming: A Survey of Program Synthesis Techniques. In Approaches and Applications of Inductive Programming: Third International Workshop, 2010.
- Kulal et al. [2019] S. Kulal, P. Pasupat, K. Chandra, M. Lee, O. Padon, A. Aiken, and P. S. Liang. SPoC: Search-based Pseudocode to Code. In Advances in Neural Information Processing Systems, 2019.
- Le et al. [2022] H. Le, Y. Wang, A. D. Gotmare, S. Savarese, and S. C. H. Hoi. CodeRL: Mastering Code Generation through Pretrained Models and Deep Reinforcement Learning. In Advances in Neural Information Processing Systems, 2022.
- Le Goues et al. [2021] C. Le Goues, M. Pradel, A. Roychoudhury, and S. Chandra. Automatic Program Repair. IEEE Softw., 38(4):22–27, jul 2021. ISSN 0740-7459. doi: 10.1109/MS.2021.3072577.
- Li et al. [2023] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone, C. Akiki, J. Li, J. Chim, et al. StarCoder: may the source be with you!, 2023. arXiv preprint arXiv:2305.06161. https://arxiv.org/abs/2305.06161.
- Li et al. [2022] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago, et al. Competition-level code generation with AlphaCode. Science, 378(6624):1092–1097, 2022. doi: 10.1126/science.abq1158.
- Long and Rinard [2016] F. Long and M. Rinard. Automatic Patch Generation by Learning Correct Code. In ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2016.
- Madaan et al. [2023] A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, et al. Self-Refine: Iterative Refinement with Self-Feedback, 2023. arXiv preprint arXiv:2303.17651. https://arxiv.org/abs/2303.17651.
- Mesbah et al. [2019] A. Mesbah, A. Rice, E. Johnston, N. Glorioso, and E. Aftandilian. DeepDelta: Learning to Repair Compilation Errors. In Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2019.
- Nijkamp et al. [2023] E. Nijkamp, B. Pang, H. Hayashi, L. Tu, H. Wang, Y. Zhou, S. Savarese, and C. Xiong. CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis. In International Conference on Learning Representations, 2023.
- OpenAI [2022] OpenAI. Introducing ChatGPT, 2022. Blog post. https://openai.com/blog/chatgpt [Accessed 5/17/2023].
- OpenAI [2023] OpenAI. GPT-4 Technical Report, 2023. arXiv preprint arXiv:2303.08774. https://arxiv.org/abs/2303.08774.
- Ouyang et al. [2022] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022.
- Peng et al. [2023] B. Peng, M. Galley, P. He, H. Cheng, Y. Xie, Y. Hu, Q. Huang, L. Liden, Z. Yu, W. Chen, and J. Gao. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813, 2023.
- Polozov and Gulwani [2015] O. Polozov and S. Gulwani. FlashMeta: A Framework for Inductive Program Synthesis. In ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, 2015.
- Shi et al. [2022] F. Shi, D. Fried, M. Ghazvininejad, L. Zettlemoyer, and S. I. Wang. Natural Language to Code Translation with Execution. In Empirical Methods in Natural Language Processing, 2022.
- Touvron et al. [2023] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al. Llama: Open and efficient foundation language models, 2023. arXiv preprint arXiv:2302.13971. https://arxiv.org/abs/2302.13971.
- Wang et al. [2018] K. Wang, R. Singh, and Z. Su. Dynamic Neural Program Embedding for Program Repair. In International Conference on Learning Representations, 2018.
- Yasunaga and Liang [2020] M. Yasunaga and P. Liang. Graph-based, Self-supervised Program Repair from Diagnostic Feedback. In International Conference on Machine Learning, 2020.
- Yasunaga and Liang [2021] M. Yasunaga and P. Liang. Break-It-Fix-It: Unsupervised Learning for Program Repair. In International Conference on Machine Learning, 2021.
- Zhang et al. [2023] K. Zhang, Z. Li, J. Li, G. Li, and Z. Jin. Self-Edit: Fault-Aware Code Editor for Code Generation, 2023. arXiv preprint arXiv:2305.04087. https://arxiv.org/abs/2305.04087.
- Zhang et al. [2022] T. Zhang, T. Yu, T. B. Hashimoto, M. Lewis, W.-t. Yih, D. Fried, and S. I. Wang. Coder Reviewer Reranking for Code Generation, 2022. arXiv preprint arXiv:2211.16490. https://arxiv.org/abs/2211.16490.
## Appendix A Self-Repair Results Per Difficulty
<details>
<summary>x8.png Details</summary>

### Visual Description
## Scatter Plot with Fitted Curve: Mean Pass Rate vs. Mean Number of Tokens Generated
### Overview
The image is a technical scatter plot displaying the relationship between the "Mean number of tokens generated" (x-axis) and the "Mean pass rate" (y-axis). It includes multiple data series differentiated by color and marker shape, each with vertical error bars. A dark gray fitted curve with a light gray shaded confidence interval is overlaid on the data, suggesting a modeled trend. The plot is contained within a grid.
### Components/Axes
* **X-Axis:** Labeled "Mean number of tokens generated". The scale is linear, ranging from 0 to 10000, with major tick marks and labels at 0, 2000, 4000, 6000, 8000, and 10000. The labels are rotated approximately 45 degrees.
* **Y-Axis:** Labeled "Mean pass rate". The scale is linear, ranging from 0.0 to 1.0, with major tick marks and labels at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **Legend:** Positioned in the bottom-left quadrant of the plot area. It is divided into two columns:
* **Left Column (Color Key):** Defines the parameter `n_p` by line color.
* Brown line: `n_p = 1`
* Gold/Yellow line: `n_p = 2`
* Teal line: `n_p = 5`
* Light Blue line: `n_p = 10`
* Dark Blue line: `n_p = 25`
* **Right Column (Marker Key):** Defines the parameter `n_fr` by marker shape.
* Circle: `n_fr = 1`
* Inverted Triangle: `n_fr = 3`
* Square: `n_fr = 5`
* Triangle: `n_fr = 10`
* **Data Series:** Each data point is a combination of a color (from `n_p`) and a marker shape (from `n_fr`). All points have vertical error bars indicating variability.
* **Fitted Curve:** A dark gray line that follows the general trend of the data, rising steeply from the origin and plateauing. It is surrounded by a light gray shaded region representing the confidence interval.
### Detailed Analysis
Data points are identified by their approximate coordinates (x, y), color (`n_p`), and marker shape (`n_fr`). Values are approximate due to visual estimation.
**Data Points (Approximate):**
* **Brown (`n_p=1`):**
* Circle (`n_fr=1`): (~100, 0.58)
* Inverted Triangle (`n_fr=3`): (~300, 0.63)
* Square (`n_fr=5`): (~600, 0.66)
* Triangle (`n_fr=10`): (~1200, 0.70)
* **Gold (`n_p=2`):**
* Circle (`n_fr=1`): (~200, 0.68)
* Inverted Triangle (`n_fr=3`): (~800, 0.74)
* Square (`n_fr=5`): (~1000, 0.75)
* Triangle (`n_fr=10`): (~2500, 0.76)
* **Teal (`n_p=5`):**
* Circle (`n_fr=1`): (~500, 0.77)
* Inverted Triangle (`n_fr=3`): (~1800, 0.81)
* Square (`n_fr=5`): (~2800, 0.81)
* Triangle (`n_fr=10`): (~5200, 0.82)
* **Light Blue (`n_p=10`):**
* Circle (`n_fr=1`): (~1000, 0.81)
* Inverted Triangle (`n_fr=3`): (~3500, 0.83)
* Square (`n_fr=5`): (~4800, 0.84)
* Triangle (`n_fr=10`): (~8800, 0.85)
* **Dark Blue (`n_p=25`):**
* Circle (`n_fr=1`): (~4500, 0.84)
* Inverted Triangle (`n_fr=3`): (~7500, 0.86)
**Fitted Curve Trend:** The curve starts near (0, 0.5), rises sharply until approximately x=2000 (y≈0.8), and then gradually flattens, approaching a horizontal asymptote near y=0.85 as x increases towards 6000 and beyond.
### Key Observations
1. **Primary Trend:** There is a strong, positive, non-linear relationship between the mean number of tokens generated and the mean pass rate. Performance improves rapidly with initial increases in token count and then experiences diminishing returns.
2. **Effect of `n_p` (Color):** For a given `n_fr` (marker shape), series with higher `n_p` values (cooler colors: teal, light blue, dark blue) consistently achieve higher mean pass rates at similar or higher token counts compared to lower `n_p` values (warmer colors: brown, gold).
3. **Effect of `n_fr` (Marker):** Within a single color series (`n_p`), increasing `n_fr` (moving from circle to triangle) generally corresponds to both a higher mean number of tokens generated and a slightly higher mean pass rate.
4. **Convergence:** All data series appear to converge towards a similar maximum pass rate (approximately 0.85-0.87) as the mean number of tokens generated becomes very large (>6000).
5. **Error Bars:** The vertical error bars indicate variability in the pass rate for each condition. The size of the error bars appears relatively consistent across the plot, without a clear trend of increasing or decreasing with x or y.
### Interpretation
This chart evaluates the pass rate of a code-generation system as a function of the mean token budget consumed, where `n_p` is the number of initial programs sampled and `n_fr` the number of feedback-repair pairs drawn per failing program.
The data suggests a fundamental trade-off: generating more tokens is associated with a higher probability of success, but the benefit saturates. The initial steep climb indicates that a modest increase in budget yields significant reliability gains; beyond roughly 2000-4000 tokens, additional tokens provide minimal further benefit.
The separation by `n_p` indicates that the initial-sample budget is the key driver of efficiency. Higher-`n_p` conditions achieve better pass rates at comparable token counts and reach a higher performance ceiling, while increasing `n_fr` consumes more tokens for only a marginal gain within a given `n_p` setting.
The fitted curve models this law of diminishing returns. The convergence of all configurations at high token counts implies that, with a large enough budget, most allocations approach near-peak performance, but allocating samples to many initial programs rather than many repairs gets there much faster.
</details>
<details>
<summary>x9.png Details</summary>

### Visual Description
## Heatmap: Relationship Between Initial Programs and Feedback-Repairs
### Overview
The image is a heatmap chart visualizing a numerical metric (likely a performance score, success rate, or efficiency ratio) as a function of two variables: the "Number of initial programs (n_p)" on the x-axis and the "Number of feedback-repairs (n_fr)" on the y-axis. The data is presented in a 5x4 grid. The color of each cell corresponds to its value, with a gradient from dark orange (lower values) to bright yellow (higher values). A distinct dark gray color is used for cells labeled "O.O.B.".
### Components/Axes
* **X-Axis (Horizontal):**
* **Label:** "Number of initial programs (n_p)"
* **Categories/Markers (from left to right):** 1, 2, 5, 10, 25
* **Y-Axis (Vertical):**
* **Label:** "Number of feedback-repairs (n_fr)"
* **Categories/Markers (from bottom to top):** 1, 3, 5, 10
* **Legend/Color Scale:** There is no separate legend box. The color gradient is implicit:
* **Dark Orange:** Represents lower numerical values (e.g., ~0.87-0.92).
* **Bright Yellow:** Represents higher numerical values (e.g., ~0.97-1.01).
* **Dark Gray:** Represents the categorical label "O.O.B." (likely meaning "Out Of Bounds" or a similar status indicating the condition was not met or data is unavailable).
* **Spatial Layout:** The axes form the left and bottom borders of the chart. The main data grid occupies the central area. The "O.O.B." cells are clustered in the top-right region of the grid.
### Detailed Analysis
The following table reconstructs the data from the heatmap grid. Values are read directly from the text within each cell.
| n_fr (y-axis) \ n_p (x-axis) | **1** | **2** | **5** | **10** | **25** |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **10** | 0.87 | 0.93 | 0.97 | O.O.B. | O.O.B. |
| **5** | 0.87 | 0.94 | 0.98 | 1.00 | O.O.B. |
| **3** | 0.88 | 0.94 | 0.99 | 1.00 | O.O.B. |
| **1** | 0.92 | 0.97 | 1.00 | 1.01 | 1.01 |
**Trend Verification:**
* **Horizontal Trend (Increasing n_p):** For a fixed number of feedback-repairs (n_fr), moving from left to right, the numerical values generally increase. For example, at n_fr=1, the value rises from 0.92 (n_p=1) to 1.01 (n_p=25). This positive trend holds until the "O.O.B." condition is triggered for higher n_fr values.
* **Vertical Trend (Increasing n_fr):** For a fixed number of initial programs (n_p), moving from bottom to top, the numerical values generally decrease slightly. For example, at n_p=1, the value decreases from 0.92 (n_fr=1) to 0.87 (n_fr=10). Furthermore, the likelihood of encountering "O.O.B." increases significantly as n_fr increases.
### Key Observations
1. **Peak Performance:** The highest recorded value is **1.01**, occurring at the combination of **n_p=10, n_fr=1** and **n_p=25, n_fr=1**.
2. **"O.O.B." Boundary:** The "O.O.B." status appears in a clear pattern. It occurs when:
* n_p is high (10 or 25) **and** n_fr is high (5 or 10).
* Specifically, for n_p=10, O.O.B. occurs at n_fr=10.
* For n_p=25, O.O.B. occurs at n_fr=3, 5, and 10.
3. **Stability at Low n_fr:** When the number of feedback-repairs is low (n_fr=1), all combinations of n_p yield a valid numerical score, and performance improves with more initial programs.
4. **Color-Value Correlation:** The color gradient accurately reflects the numerical values. The darkest orange cells correspond to the lowest values (0.87, 0.88), while the brightest yellow cells correspond to the highest values (1.00, 1.01).
### Interpretation
This heatmap summarizes the outcome of the self-repair system as a joint function of the number of initial programs (`n_p`) and the number of feedback-repairs (`n_fr`). The cell values (~0.87 to 1.01) are plausibly pass rates normalized against a matched-budget baseline, in which case values at or below 1 indicate configurations that do not beat pure sampling.
* **Meaning of "O.O.B.":** The label on high-`n_p`, high-`n_fr` combinations most likely marks configurations that are out of budget, i.e. whose expected token cost exceeds the range evaluated, rather than a failure of the system itself.
* **Optimal Zone:** The highest scores (1.00-1.01) occur at a moderate-to-high number of initial programs (`n_p` = 5 to 25) combined with a single feedback-repair pass (`n_fr` = 1).
* **Diminishing Returns:** For valid cells, the score decreases slightly as `n_fr` grows at fixed `n_p`, suggesting that, at equal cost, extra repair cycles are less valuable than extra initial samples.
In summary, the chart indicates that spending the token budget on a diverse pool of initial programs is more effective than spending it on repeated repair of the same programs.
</details>
<details>
<summary>x10.png Details</summary>

### Visual Description
## Scatter Plot with Fitted Curve: Mean Pass Rate vs. Mean Number of Tokens Generated
### Overview
This image is a technical scatter plot chart displaying the relationship between the mean number of tokens generated (x-axis) and the mean pass rate (y-axis). The chart includes multiple data series differentiated by color (representing a parameter \( n_p \)) and marker shape (representing a parameter \( n_{fr} \)). A fitted curve with a confidence interval is overlaid on the data points.
### Components/Axes
* **Chart Type:** Scatter plot with error bars and a fitted trend line.
* **X-Axis:**
* **Title:** "Mean number of tokens generated"
* **Scale:** Linear, ranging from 0 to 10000.
* **Major Tick Marks:** 0, 2000, 4000, 6000, 8000, 10000.
* **Y-Axis:**
* **Title:** "Mean pass rate"
* **Scale:** Linear, ranging from 0.0 to 1.0.
* **Major Tick Marks:** 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
* **Legend (Top-Left Corner):** The legend is split into two columns.
* **Left Column (Line Color - \( n_p \)):**
* Orange line: \( n_p = 1 \)
* Yellow line: \( n_p = 2 \)
* Teal line: \( n_p = 5 \)
* Light Blue line: \( n_p = 10 \)
* Dark Blue line: \( n_p = 25 \)
* **Right Column (Marker Shape - \( n_{fr} \)):**
* Circle (●): \( n_{fr} = 1 \)
* Inverted Triangle (▼): \( n_{fr} = 3 \)
* Square (■): \( n_{fr} = 5 \)
* Triangle (▲): \( n_{fr} = 10 \)
* **Fitted Curve:** A dark grey line with a light grey shaded area representing the confidence interval. The curve shows a logarithmic-like trend, rising steeply initially and then plateauing.
### Detailed Analysis
**Data Series and Points (Approximate Values):**
The data points are plotted using a combination of color (for \( n_p \)) and shape (for \( n_{fr} \)). Each point includes vertical error bars.
1. **\( n_p = 1 \) (Orange):**
* Circle (\( n_{fr}=1 \)): x ≈ 500, y ≈ 0.23
* Inverted Triangle (\( n_{fr}=3 \)): x ≈ 1000, y ≈ 0.27
* Square (\( n_{fr}=5 \)): x ≈ 1500, y ≈ 0.29
* Triangle (\( n_{fr}=10 \)): x ≈ 3000, y ≈ 0.32
2. **\( n_p = 2 \) (Yellow):**
* Circle (\( n_{fr}=1 \)): x ≈ 1000, y ≈ 0.30
* Inverted Triangle (\( n_{fr}=3 \)): x ≈ 2000, y ≈ 0.35
* Square (\( n_{fr}=5 \)): x ≈ 3000, y ≈ 0.36
* Triangle (\( n_{fr}=10 \)): x ≈ 5800, y ≈ 0.40
3. **\( n_p = 5 \) (Teal):**
* Circle (\( n_{fr}=1 \)): x ≈ 2000, y ≈ 0.39
* Inverted Triangle (\( n_{fr}=3 \)): x ≈ 4500, y ≈ 0.43
* Square (\( n_{fr}=5 \)): x ≈ 7000, y ≈ 0.45
4. **\( n_p = 10 \) (Light Blue):**
* Circle (\( n_{fr}=1 \)): x ≈ 4000, y ≈ 0.46
* Inverted Triangle (\( n_{fr}=3 \)): x ≈ 8500, y ≈ 0.50
5. **\( n_p = 25 \) (Dark Blue):**
* Circle (\( n_{fr}=1 \)): x ≈ 9800, y ≈ 0.53
**Fitted Curve Trend:** The dark grey fitted line starts near (0, 0.2) and rises sharply until approximately x=2000, y=0.4. The slope then decreases, and the curve gradually flattens, approaching a value just above 0.5 as x approaches 10000. The light grey confidence interval is narrowest at lower x-values and widens slightly as x increases.
### Key Observations
1. **Positive Correlation:** There is a clear positive relationship between the mean number of tokens generated and the mean pass rate. As token count increases, the pass rate increases.
2. **Diminishing Returns:** The relationship is non-linear. The most significant gains in pass rate occur with the initial increase in tokens (0 to ~2000). After this point, each additional token yields a smaller improvement in pass rate.
3. **Parameter Influence:**
* **\( n_p \):** For a given approximate x-value, data points with higher \( n_p \) (cooler colors: teal, blue) tend to have a higher y-value (pass rate) than those with lower \( n_p \) (warmer colors: orange, yellow).
* **\( n_{fr} \):** For a given color series (\( n_p \)), points with higher \( n_{fr} \) (e.g., triangle) are generally positioned further to the right (higher token count) than points with lower \( n_{fr} \) (e.g., circle).
4. **Data Spread:** The data points are distributed around the fitted curve, with some scatter. The error bars indicate variability in the mean pass rate measurement for each condition.
### Interpretation
This chart likely illustrates the performance of a generative model (e.g., for code or text) where "pass rate" is a success metric. The data suggests that allocating more computational resources (allowing the model to generate more tokens) improves the probability of success, but with clear diminishing marginal returns.
The parameters \( n_p \) and \( n_{fr} \) represent different experimental conditions or model configurations. The trend indicates that configurations with a higher \( n_p \) value are more efficient, achieving a higher pass rate for a similar or even lower token budget. The parameter \( n_{fr} \) appears to be correlated with the token count itself, possibly representing a retry or refinement step that consumes more tokens.
The fitted curve provides a generalized model of the token-pass rate trade-off, useful for predicting performance or optimizing resource allocation. The plateau suggests there may be an upper bound to the pass rate achievable simply by increasing token generation, hinting at other limiting factors in the model's capability or the task's difficulty.
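The shape of this fitted curve can be reproduced with a simple saturating fit. As a minimal sketch (assumptions: the data pairs below are the approximate values read off the plot above, and the log-linear form is an illustrative choice, not the authors' actual fitting procedure):

```python
import numpy as np

# Approximate (mean tokens, mean pass rate) pairs read off the scatter plot.
tokens = np.array([500, 1000, 1500, 3000, 1000, 2000, 3000, 5800,
                   2000, 4500, 7000, 4000, 8500, 9800], dtype=float)
pass_rate = np.array([0.23, 0.27, 0.29, 0.32, 0.30, 0.35, 0.36, 0.40,
                      0.39, 0.43, 0.45, 0.46, 0.50, 0.53])

# A straight line in log-token space captures the "steep rise, then
# plateau" shape: pass_rate ≈ a + b * log(tokens).
b, a = np.polyfit(np.log(tokens), pass_rate, 1)

def predicted_pass_rate(t: float) -> float:
    return a + b * np.log(t)
```

Evaluating `predicted_pass_rate` at increasing token counts makes the diminishing returns explicit: under a log-linear fit, each doubling of the token budget adds the same fixed increment `b * log(2)` to the predicted pass rate, regardless of where you start.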
</details>
<details>
<summary>x11.png Details</summary>

### Visual Description
## Heatmap: Performance vs. Initial Programs and Feedback Repairs
### Overview
This image is a heatmap visualizing a performance metric (likely a success rate or score, ranging from 0 to 1) as a function of two parameters: the "Number of initial programs (n_p)" on the horizontal axis and the "Number of feedback-repairs (n_fr)" on the vertical axis. The color intensity represents the metric's value, with darker browns indicating lower scores and brighter oranges/yellows indicating higher scores. Cells marked "O.O.B." (likely "Out Of Bounds") are colored dark gray and represent conditions where the metric could not be computed or was invalid.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "Number of initial programs (n_p)". The axis markers are at discrete values: 1, 2, 5, 10, and 25.
* **Y-Axis (Vertical):** Labeled "Number of feedback-repairs (n_fr)". The axis markers are at discrete values: 1, 3, 5, and 10.
* **Data Grid:** A 4x5 grid of cells, where each cell corresponds to a unique combination of (n_p, n_fr).
* **Color Scale:** A gradient from dark brown (lowest numerical values) through orange to bright yellow-orange (highest numerical values). A separate dark gray color is used for "O.O.B." cells.
* **Cell Labels:** Each cell contains either a numerical value (e.g., "0.73") or the text "O.O.B.".
### Detailed Analysis
The following table reconstructs the data from the heatmap. Values are read directly from the cell labels. The color description is based on visual inspection of the gradient.
| n_fr \ n_p | **1** | **2** | **5** | **10** | **25** |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **10** | 0.73 (Dark Brown) | 0.82 (Medium Brown) | O.O.B. (Dark Gray) | O.O.B. (Dark Gray) | O.O.B. (Dark Gray) |
| **5** | 0.75 (Dark Brown) | 0.82 (Medium Brown) | 0.91 (Orange) | O.O.B. (Dark Gray) | O.O.B. (Dark Gray) |
| **3** | 0.77 (Dark Brown) | 0.84 (Medium Brown) | 0.91 (Orange) | 0.98 (Bright Yellow-Orange) | O.O.B. (Dark Gray) |
| **1** | 0.84 (Medium Brown) | 0.89 (Orange-Brown) | 0.93 (Orange) | 0.97 (Bright Yellow-Orange) | O.O.B. (Dark Gray) |
**Trend Verification:**
* **Horizontal Trend (Increasing n_p for fixed n_fr):** For a fixed number of feedback repairs (n_fr), the performance metric generally **increases** as the number of initial programs (n_p) increases, until it hits the "O.O.B." condition. For example, at n_fr=1, the value rises from 0.84 (n_p=1) to 0.97 (n_p=10).
* **Vertical Trend (Increasing n_fr for fixed n_p):** For a fixed number of initial programs (n_p), the relationship is less uniform. At low n_p (1 or 2), performance slightly **decreases** as n_fr increases (e.g., at n_p=1: 0.84 at n_fr=1 vs. 0.73 at n_fr=10). At n_p=5, performance is stable (0.91) for n_fr=3 and 5. The highest performance (0.98) occurs at a moderate combination (n_p=10, n_fr=3).
### Key Observations
1. **"O.O.B." Region:** The "Out Of Bounds" condition forms a clear region in the top-right of the heatmap. It occurs for all tested values when n_p=25, and for combinations where both n_p and n_fr are relatively high (e.g., n_p=10 with n_fr≥5, n_p=5 with n_fr=10).
2. **Peak Performance:** The highest observed value is **0.98**, located at the cell (n_p=10, n_fr=3).
3. **Performance Gradient:** The lowest numerical value is **0.73** at (n_p=1, n_fr=10). The general gradient shows performance improving from the top-left of the grid (low n_p, high n_fr) towards the bottom-right of the populated region (higher n_p, low n_fr).
4. **Color-Value Correlation:** The color scale is consistent. The darkest brown corresponds to the lowest value (0.73), and the brightest yellow-orange corresponds to the highest value (0.98). The "O.O.B." cells are uniformly dark gray, distinct from the performance gradient.
### Interpretation
This heatmap likely illustrates the results of an experiment in automated program repair or synthesis, where an initial set of programs (n_p) is refined through a number of feedback-repair cycles (n_fr). The metric is probably a success rate (e.g., percentage of problems solved).
* **Optimal Operating Range:** The data suggests an optimal "sweet spot" for resource allocation. Using a moderate number of initial programs (around 5-10) with a low-to-moderate number of feedback repairs (1-3) yields the highest success rates (0.91-0.98).
* **Diminishing Returns & Failure Modes:** Simply increasing resources does not guarantee better results. Beyond a certain point (the "O.O.B." boundary), the process likely fails entirely. This could be due to computational limits, search space explosion, or the repair mechanism becoming unstable with too many initial candidates or repair iterations.
* **Trade-off at Low Resources:** When starting with very few initial programs (n_p=1 or 2), applying more feedback repairs (higher n_fr) appears slightly detrimental. This might indicate that with a poor starting point, repeated repairs are ineffective or even degrade the solution.
* **Asymmetry of Parameters:** The number of initial programs (n_p) has a stronger positive influence on performance than the number of feedback repairs (n_fr), up to the failure boundary. The system benefits more from a diverse initial pool than from extensive iterative repair on a small pool.
In summary, the chart demonstrates that more resources (programs and repairs) can improve outcomes, but only within a specific operational envelope. Exceeding this envelope leads to complete process failure ("O.O.B."), highlighting the importance of parameter tuning in such systems.
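One plausible reading of the O.O.B. cells is a fixed token budget that the combination of n_p and n_fr exhausts before the experiment completes. A minimal sketch of that bookkeeping (the budget and per-stage token costs below are invented for illustration; the figure does not state the actual accounting):

```python
# Hypothetical constants chosen so the O.O.B. pattern matches this
# heatmap; the real per-stage token counts are not given in the figure.
BUDGET = 9_000           # max mean tokens allowed per task
T_PROGRAM = 200          # tokens per initial program sample
T_FEEDBACK_REPAIR = 200  # tokens per feedback + repair round

def estimated_cost(n_p: int, n_fr: int) -> int:
    """n_p initial samples, each followed by n_fr feedback-repair rounds."""
    return n_p * T_PROGRAM + n_p * n_fr * T_FEEDBACK_REPAIR

for n_fr in (10, 5, 3, 1):          # rows, top to bottom as in the heatmap
    row = []
    for n_p in (1, 2, 5, 10, 25):   # columns, left to right
        cost = estimated_cost(n_p, n_fr)
        row.append("O.O.B." if cost > BUDGET else f"{cost:>5}")
    print(f"n_fr={n_fr:>2}: " + "  ".join(row))
```

With these illustrative constants, the out-of-budget region reproduces the one in the heatmap: a cell goes O.O.B. exactly when `n_p * (1 + n_fr) * 200` exceeds the budget, which is why the boundary slides toward lower n_p as n_fr grows.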
</details>
<details>
<summary>x12.png Details</summary>

### Visual Description
## Scatter Plot with Trend Line: Mean Pass Rate vs. Mean Tokens Generated
### Overview
The image is a technical scatter plot chart displaying the relationship between the "Mean number of tokens generated" (x-axis) and the "Mean pass rate" (y-axis). The chart includes multiple data series differentiated by two parameters: `n_p` (represented by color) and `n_fr` (represented by marker shape). A dark gray trend line with a light gray shaded confidence interval runs through the data points.
### Components/Axes
* **X-Axis:** Labeled "Mean number of tokens generated". The scale runs from 0 to 10,000, with major tick marks at 0, 2000, 4000, 6000, 8000, and 10000. The labels are rotated approximately 45 degrees.
* **Y-Axis:** Labeled "Mean pass rate". The scale runs from 0.0 to 1.0, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **Legend (Top-Right Quadrant):** A two-column legend box defines the data series.
* **Left Column (Color for `n_p`):**
* Burnt Orange line: `n_p = 1`
* Golden Yellow line: `n_p = 2`
* Teal line: `n_p = 5`
* Sky Blue line: `n_p = 10`
* Dark Navy Blue line: `n_p = 25`
* **Right Column (Marker Shape for `n_fr`):**
* Circle (●): `n_fr = 1`
* Inverted Triangle (▼): `n_fr = 3`
* Square (■): `n_fr = 5`
* Triangle (▲): `n_fr = 10`
* **Data Series:** Each data point is a combination of a color (from `n_p`) and a marker shape (from `n_fr`), with vertical error bars.
* **Trend Line:** A solid dark gray line with a surrounding light gray shaded area (likely representing a confidence interval or standard error band).
### Detailed Analysis
**Data Points (Approximate Values, reading left to right):**
1. **~500 tokens, ~0.04 pass rate:** Burnt Orange Circle (`n_p=1`, `n_fr=1`).
2. **~1200 tokens, ~0.06 pass rate:** Golden Yellow Circle (`n_p=2`, `n_fr=1`).
3. **~2200 tokens, ~0.07 pass rate:** Burnt Orange Square (`n_p=1`, `n_fr=5`).
4. **~2800 tokens, ~0.09 pass rate:** Golden Yellow Inverted Triangle (`n_p=2`, `n_fr=3`).
5. **~3200 tokens, ~0.11 pass rate:** Teal Circle (`n_p=5`, `n_fr=1`).
6. **~4200 tokens, ~0.08 pass rate:** Burnt Orange Triangle (`n_p=1`, `n_fr=10`).
7. **~4500 tokens, ~0.11 pass rate:** Golden Yellow Square (`n_p=2`, `n_fr=5`).
8. **~6300 tokens, ~0.15 pass rate:** Sky Blue Circle (`n_p=10`, `n_fr=1`).
9. **~7000 tokens, ~0.15 pass rate:** Teal Inverted Triangle (`n_p=5`, `n_fr=3`).
10. **~8200 tokens, ~0.13 pass rate:** Golden Yellow Triangle (`n_p=2`, `n_fr=10`).
**Trend Line:** The dark gray trend line starts near (0, 0.03), rises steeply until approximately 2000-3000 tokens, and then continues to rise at a much slower, almost linear rate, reaching approximately 0.15 at 10,000 tokens. The shaded confidence band is narrowest at the start and widens slightly as the token count increases.
### Key Observations
1. **Overall Trend:** There is a positive correlation between the mean number of tokens generated and the mean pass rate. The relationship appears logarithmic or asymptotic, showing strong initial gains that diminish as token count increases.
2. **Parameter Influence:** The data points for higher `n_p` values (Teal `n_p=5`, Sky Blue `n_p=10`) are positioned further to the right (higher token counts) and generally higher on the y-axis (higher pass rates) than points for lower `n_p` values (Burnt Orange `n_p=1`, Golden Yellow `n_p=2`).
3. **Marker (`n_fr`) Distribution:** For a given `n_p` color, different `n_fr` markers are spread along the x-axis. For example, the Golden Yellow (`n_p=2`) series includes an inverted triangle at ~1200 tokens, a square at ~4500 tokens, and a triangle at ~8200 tokens.
4. **Plateau Effect:** The trend line suggests the mean pass rate begins to plateau around 0.15-0.16 after approximately 6000 tokens, indicating diminishing returns for generating additional tokens beyond this point.
5. **Variability:** The error bars on individual data points indicate variability in the mean pass rate measurement for each specific parameter combination.
### Interpretation
This chart likely evaluates the performance of a text generation or code synthesis system, where "pass rate" is a metric of success (e.g., passing unit tests). The "mean number of tokens generated" is a measure of output length or computational cost.
The data suggests a fundamental trade-off: allowing the system to generate more tokens (a longer output) increases its probability of success, but with sharply diminishing returns. The initial increase in pass rate is significant, but after a certain point (~6000 tokens in this context), additional tokens yield minimal improvement.
The parameters `n_p` and `n_fr` appear to be control variables for the generation process. The trend that higher `n_p` values are associated with both longer generations and higher pass rates implies that `n_p` might be a parameter like "number of proposals" or "parallel attempts," where allocating more resources (`n_p`) leads to better but more verbose solutions. The `n_fr` parameter, varied within each `n_p` color group, might represent a different aspect of the process, like "number of refinement steps," whose effect on token count and pass rate is less straightforward from this visualization alone.
The plateau in the trend line is a critical insight for system optimization. It indicates an operational sweet spot: generating beyond ~6000-8000 tokens may waste computational resources for negligible gain in success rate. The chart provides empirical evidence for setting efficient generation limits.
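The roles the two parameters play match the self-repair workflow described in the introduction: draw n_p initial programs, and for each one that fails its unit tests, attempt n_fr feedback-repair continuations. A minimal sketch of that loop (the `generate`, `feedback`, and `repair` callables are placeholders for model calls, not a real API):

```python
from typing import Callable, List

def self_repair_tree(
    spec: str,
    unit_tests: Callable[[str], bool],
    generate: Callable[[str], str],          # placeholder: sample a program
    feedback: Callable[[str, str], str],     # placeholder: explain the failure
    repair: Callable[[str, str, str], str],  # placeholder: produce a fix
    n_p: int = 5,
    n_fr: int = 1,
) -> List[str]:
    """Collect every program produced by the n_p x n_fr self-repair tree."""
    programs: List[str] = []
    for _ in range(n_p):                 # n_p independent initial samples
        prog = generate(spec)
        programs.append(prog)
        if unit_tests(prog):
            continue                     # a passing program needs no repair
        for _ in range(n_fr):            # n_fr feedback-repair attempts
            fb = feedback(spec, prog)
            programs.append(repair(spec, prog, fb))
    return programs
```

Under this sketch the mean token count grows roughly linearly in both n_p and n_p × n_fr, which is consistent with the higher-n_fr markers sitting further right on the x-axis for each color series.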
</details>
<details>
<summary>x13.png Details</summary>

### Visual Description
## Heatmap: Performance Metric vs. Initial Programs and Feedback-Repairs
### Overview
The image is a heatmap chart displaying a performance metric (unlabeled, but likely a score or ratio) as a function of two variables: the "Number of initial programs (n_p)" on the x-axis and the "Number of feedback-repairs (n_fr)" on the y-axis. The chart uses a color gradient to represent the metric's value, with specific numerical values or the label "O.O.B." (likely "Out of Bounds" or "Out of Budget") printed in each cell.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "Number of initial programs (n_p)". The axis markers are categorical, not linear, with the following values from left to right: `1`, `2`, `5`, `10`, `25`.
* **Y-Axis (Vertical):** Labeled "Number of feedback-repairs (n_fr)". The axis markers are also categorical, with the following values from bottom to top: `1`, `3`, `5`, `10`.
* **Data Grid:** A 4-row by 5-column grid. Each cell's background color corresponds to its value, following a gradient from dark brown (lowest values) through orange and yellow-green to teal (highest values). Cells labeled "O.O.B." have a dark gray background.
* **Legend/Color Scale:** There is no explicit legend provided. The color-to-value mapping must be inferred from the data within the cells. The gradient suggests:
* Dark Brown: ~0.78 - 0.87
* Orange: ~0.91 - 0.93
* Yellow-Green: ~1.05 - 1.09
* Teal: ~1.13
* Dark Gray: "O.O.B."
### Detailed Analysis
The following table reconstructs the data from the heatmap. Note that it is transposed relative to the heatmap: each table row corresponds to one heatmap column (n_p, left to right), and each table column to one heatmap row (n_fr, bottom to top).
| n_p (row) \ n_fr (column) | **1** (bottom) | **3** | **5** | **10** (top) |
| :--- | :--- | :--- | :--- | :--- |
| **1 (Left)** | 0.87 | 0.81 | 0.79 | 0.78 |
| **2** | 0.93 | 0.91 | 0.91 | 0.93 |
| **5** | 1.05 | 1.08 | 1.09 | O.O.B. |
| **10** | 1.13 | O.O.B. | O.O.B. | O.O.B. |
| **25 (Right)** | O.O.B. | O.O.B. | O.O.B. | O.O.B. |
**Trend Verification:**
* **Column 1 (n_p=1):** The line of values slopes slightly downward as n_fr increases (0.87 -> 0.81 -> 0.79 -> 0.78).
* **Column 2 (n_p=2):** The values are relatively stable, with a slight dip in the middle (0.93 -> 0.91 -> 0.91 -> 0.93).
* **Column 3 (n_p=5):** The values slope upward as n_fr increases from 1 to 5 (1.05 -> 1.08 -> 1.09), then the result becomes O.O.B. at n_fr=10.
* **Column 4 (n_p=10):** Only the bottom cell (n_fr=1) has a value (1.13). All cells above it are O.O.B.
* **Column 5 (n_p=25):** All cells are O.O.B., regardless of n_fr.
### Key Observations
1. **Performance Peak:** The highest recorded metric value is **1.13**, found at the combination of n_p=10 and n_fr=1 (bottom-right of the populated area).
2. **"O.O.B." Boundary:** A clear boundary exists where results become "O.O.B.". This boundary moves leftward (towards lower n_p) as n_fr increases.
* At n_fr=1, O.O.B. starts at n_p=25.
* At n_fr=3 & 5, O.O.B. starts at n_p=10.
* At n_fr=10, O.O.B. starts at n_p=5.
3. **Inverse Relationship at Low n_p:** For the smallest number of initial programs (n_p=1), increasing feedback-repairs (n_fr) leads to a slight *decrease* in the metric.
4. **Optimal Zone:** The "sweet spot" for achieving a high metric (>1.0) appears to be with a moderate number of initial programs (n_p=5 or 10) combined with a low number of feedback-repairs (n_fr=1 or 3).
### Interpretation
This heatmap likely illustrates the results of an experiment or simulation in fields like automated program repair, machine learning model tuning, or iterative software development. The metric could represent a success rate, efficiency score, or quality improvement ratio.
The data suggests a **trade-off between initial exploration (n_p) and iterative refinement (n_fr)**. Simply increasing both parameters does not guarantee better results and quickly leads to an "Out of Bounds" condition, which could signify computational cost, time limits, or diminishing returns.
The most effective strategy indicated by this data is to start with a moderate pool of initial candidates (5-10 programs) and apply a limited, focused number of repair iterations (1-3). Starting with too few initial programs (n_p=1) limits the ceiling for improvement, while starting with too many (n_p=25) or applying too many repairs (high n_fr) is inefficient or infeasible. The slight performance dip at n_p=1 as repairs increase might indicate overfitting or wasted effort on a poor initial candidate.
</details>
Figure 7: GPT-3.5 results from Figure 3 (Section 4.1) per difficulty (row), from top to bottom: introductory, interview, and competition.
<details>
<summary>x14.png Details</summary>

### Visual Description
## Scatter Plot with Error Bars and Trend Line: Mean Pass Rate vs. Mean Number of Tokens Generated
### Overview
This image is a scatter plot chart displaying the relationship between the "Mean number of tokens generated" (x-axis) and the "Mean pass rate" (y-axis). The chart includes multiple data series differentiated by two parameters: `n_p` (represented by colored lines) and `n_fr` (represented by different marker shapes). A gray trend line with a shaded confidence interval is overlaid on the data points.
### Components/Axes
* **X-Axis:**
* **Title:** "Mean number of tokens generated"
* **Scale:** Linear scale from 0 to 10,000.
* **Major Tick Marks:** 0, 2000, 4000, 6000, 8000, 10000.
* **Y-Axis:**
* **Title:** "Mean pass rate"
* **Scale:** Linear scale from 0.0 to 1.0.
* **Major Tick Marks:** 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
* **Legend (Position: Bottom-Left Quadrant):**
* **Left Column (Parameter `n_p` - Colored Lines):**
* Brown line: `n_p = 1`
* Gold/Yellow line: `n_p = 2`
* Teal line: `n_p = 5`
* Light Blue line: `n_p = 10`
* Dark Blue line: `n_p = 25`
* **Right Column (Parameter `n_fr` - Marker Shapes):**
* Circle marker: `n_fr = 1`
* Downward-pointing triangle marker: `n_fr = 3`
* Square marker: `n_fr = 5`
* Upward-pointing triangle marker: `n_fr = 10`
* **Data Series:** Each data point on the plot is a combination of a color (from the `n_p` legend) and a marker shape (from the `n_fr` legend). For example, a brown circle represents `n_p=1, n_fr=1`.
* **Trend Line:** A solid gray line with a light gray shaded area around it, representing a fitted curve and its confidence interval.
### Detailed Analysis
* **Overall Trend:** The data shows a strong, positive, logarithmic-like relationship. The mean pass rate increases rapidly as the mean number of tokens generated increases from 0 to approximately 2000, after which the rate of increase slows significantly, approaching a plateau near a pass rate of 1.0.
* **Data Series by `n_p` (Color):**
* **`n_p = 1` (Brown):** Data points are clustered at the lowest token counts (approx. 0-500 tokens) and have the lowest pass rates (approx. 0.75-0.85).
* **`n_p = 2` (Gold):** Points are located at slightly higher token counts (approx. 500-1500) with pass rates approx. 0.85-0.90.
* **`n_p = 5` (Teal):** Points span a mid-range of token counts (approx. 1000-3000) with pass rates approx. 0.90-0.93.
* **`n_p = 10` (Light Blue):** Points are found at higher token counts (approx. 2000-4500) with pass rates approx. 0.93-0.95.
* **`n_p = 25` (Dark Blue):** Points are at the highest token counts (approx. 3000-5500) with pass rates approx. 0.94-0.96.
* **Data Series by `n_fr` (Marker Shape):** The marker shapes are distributed across the different `n_p` color series. For instance, the upward-pointing triangle (`n_fr=10`) appears in brown (`n_p=1`), gold (`n_p=2`), and dark blue (`n_p=25`) series.
* **Trend Line:** The gray fitted curve starts near (0, 0.7) and rises steeply, passing through the main cluster of data points. It begins to flatten around x=2000, y=0.9, and continues to rise very gradually towards the right edge of the plotted data (approx. x=5500, y=0.95). The shaded confidence interval is narrow, suggesting a good fit.
* **Notable Outlier:** There is a single, isolated data point in the top-right corner of the plot. It is a dark blue upward-pointing triangle (`n_p=25, n_fr=10`) located at approximately x=8200, y=0.96. This point has a much higher mean token count than the rest of the data but maintains a high pass rate consistent with the plateau trend.
### Key Observations
1. **Diminishing Returns:** The most significant gains in pass rate occur with the initial increase in generated tokens. Beyond ~2000 tokens, additional tokens yield progressively smaller improvements in pass rate.
2. **Parameter Correlation:** Higher values of `n_p` (darker blue colors) are associated with both higher mean token counts and higher mean pass rates. The relationship between `n_fr` (marker shape) and the other variables is less visually clear from this plot alone.
3. **Plateau Behavior:** The data suggests a performance ceiling or saturation point for the "pass rate" metric, approaching but not quite reaching 1.0 within the observed token range.
4. **High-Token Outlier:** The single point at ~8200 tokens demonstrates that very high token generation can still achieve a top-tier pass rate, though it is an exception to the main data cluster.
### Interpretation
This chart likely illustrates the performance (pass rate) of a generative model or algorithm as a function of its output length (tokens generated). The key insight is that **longer outputs are strongly correlated with higher success rates, but only up to a point.** The steep initial curve implies that very short outputs are often insufficient to complete the task successfully. The plateau indicates that beyond a certain length (~2000-4000 tokens), simply generating more text does not significantly increase the likelihood of success; other factors (captured by parameters `n_p` and `n_fr`, which might represent prompts, attempts, or refinement steps) become more critical for achieving the final few percentage points of pass rate.
The parameter `n_p` appears to be a primary driver of both output length and success. The outlier suggests that under specific configurations (high `n_p` and high `n_fr`), it's possible to generate very long outputs while maintaining high quality, but this is not the typical behavior. For system design, this data argues for targeting an optimal output length range to balance resource usage (tokens) against performance (pass rate), rather than indefinitely increasing generation length.
</details>
<details>
<summary>x15.png Details</summary>

### Visual Description
## Heatmap: Relationship Between Initial Programs and Feedback-Repairs
### Overview
The image is a heatmap visualizing a numerical metric (values ranging from 0.98 to 1.05, plus "O.O.B.") across a grid defined by two variables: the "Number of initial programs (n_p)" on the horizontal axis and the "Number of feedback-repairs (n_fr)" on the vertical axis. The color of each cell corresponds to its value, creating a gradient from greenish-yellow (higher values) to orange (lower values), with dark gray indicating "O.O.B." (Out Of Bounds).
### Components/Axes
* **X-Axis (Horizontal):** Labeled "Number of initial programs (n_p)". The axis markers are categorical, not linear, with the following values from left to right: **1, 2, 5, 10, 25**.
* **Y-Axis (Vertical):** Labeled "Number of feedback-repairs (n_fr)". The axis markers are also categorical, with the following values from bottom to top: **1, 3, 5, 10**.
* **Data Grid:** A 4-row by 5-column grid (20 cells total). Each cell contains a numerical value or the text "O.O.B.".
* **Color Scale (Implied Legend):** The color of each cell represents its value. The gradient suggests:
* **Greenish-Yellow:** Highest values (~1.04-1.05).
* **Yellow-Gold:** Mid-range values (~1.02-1.03).
* **Orange:** Lower values (~0.98-1.01).
* **Dark Gray:** "O.O.B." cells, indicating an out-of-bounds or invalid condition.
### Detailed Analysis
The following table reconstructs the data from the heatmap. The cell value is listed for each combination of `n_p` (column) and `n_fr` (row).
| n_fr \ n_p | **1** | **2** | **5** | **10** | **25** |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **10** | 0.98 | 1.01 | 1.02 | 1.03 | **O.O.B.** |
| **5** | 1.00 | 1.02 | 1.03 | 1.03 | **O.O.B.** |
| **3** | 1.02 | 1.03 | 1.04 | 1.04 | 1.04 |
| **1** | 1.05 | 1.04 | 1.04 | 1.04 | 1.04 |
**Trend Verification:**
* **Horizontal Trend (Increasing n_p):** For a fixed `n_fr`, the behavior depends on the row. At `n_fr=1`, the value dips slightly from 1.05 (n_p=1) to 1.04 (n_p=2,5,10,25) and then stabilizes; at `n_fr=10`, it instead increases from 0.98 to 1.03 before hitting O.O.B.
* **Vertical Trend (Increasing n_fr):** For a fixed `n_p`, the value generally **decreases** as `n_fr` increases. For example, at `n_p=1`, the value drops from 1.05 (n_fr=1) to 0.98 (n_fr=10). This downward trend is consistent across all columns until the O.O.B. threshold.
### Key Observations
1. **Maximum Value:** The highest recorded value is **1.05**, located at the bottom-left corner (`n_p=1`, `n_fr=1`).
2. **Minimum Value:** The lowest numerical value is **0.98**, located at the top-left corner (`n_p=1`, `n_fr=10`).
3. **Plateau Region:** The bottom-right quadrant of the heatmap (where `n_p >= 5` and `n_fr <= 3`) shows a plateau of stable values at **1.04**.
4. **Out-of-Bounds (O.O.B.) Zone:** The two cells in the top-right corner (`n_p=25` for `n_fr=5` and `n_fr=10`) are marked "O.O.B.". This indicates that the combination of a high number of initial programs (25) with a high number of feedback-repairs (5 or 10) results in an invalid, undefined, or failed state for the measured metric.
5. **Color-Value Correlation:** The color gradient aligns with the numerical trend: cells with higher values (e.g., 1.05) are a cooler greenish-yellow, while cells with lower values (e.g., 0.98) are a warmer orange.
### Interpretation
This heatmap likely illustrates the performance or efficiency of a system (e.g., a program repair tool, an optimization algorithm) under different resource configurations. The metric (values ~1.0) could represent a ratio, such as speedup, success rate, or quality score relative to a baseline.
* **Trade-off Relationship:** The data suggests a trade-off. Increasing the initial pool of programs (`n_p`) provides diminishing returns and can slightly lower the metric. Conversely, increasing the number of feedback-repairs (`n_fr`) consistently reduces the metric. The most favorable outcome (1.05) occurs with minimal resources on both axes (`n_p=1`, `n_fr=1`).
* **System Limits:** The "O.O.B." zone defines a clear operational boundary. The system cannot handle or produces invalid results for configurations where both the initial program count and repair iterations are high. This could be due to computational limits, error accumulation, or a breakdown in the underlying model's assumptions.
* **Optimal Operating Region:** The plateau at 1.04 suggests a stable, sub-optimal operating region where increasing resources (`n_p` up to 25) does not degrade performance further, provided the number of repairs (`n_fr`) is kept low (≤3). This might represent a "safe" configuration space for the system.
In essence, the chart demonstrates that "more" is not always better. The system performs best with a focused, minimal initial set and few repair cycles, and it fails when pushed to extremes on both parameters simultaneously.
</details>
<details>
<summary>x16.png Details</summary>

### Visual Description
## Scatter Plot: Mean Pass Rate vs. Mean Number of Tokens Generated
### Overview
This is a scatter plot with a fitted trend line, illustrating the relationship between the mean number of tokens generated (x-axis) and the mean pass rate (y-axis). The plot compares multiple experimental conditions defined by two parameters: `n_p` (represented by color) and `n_fr` (represented by marker shape). The data suggests a logarithmic or diminishing returns relationship, where pass rate increases rapidly with initial token generation and then plateaus.
### Components/Axes
* **X-Axis:** "Mean number of tokens generated". Scale is linear, ranging from 0 to 10,000, with major tick marks at 0, 2000, 4000, 6000, 8000, and 10000.
* **Y-Axis:** "Mean pass rate". Scale is linear, ranging from 0.0 to 1.0, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **Legend (Bottom-Right Quadrant):** A two-column legend box.
* **Left Column (Color Key for `n_p`):**
* Brown line: `n_p = 1`
* Yellow line: `n_p = 2`
* Teal line: `n_p = 5`
* Light Blue line: `n_p = 10`
* Dark Blue line: `n_p = 25`
* **Right Column (Marker Key for `n_fr`):**
* Circle (●): `n_fr = 1`
* Inverted Triangle (▼): `n_fr = 3`
* Square (■): `n_fr = 5`
* Triangle (▲): `n_fr = 10`
* **Trend Line:** A solid gray curve is fitted through the aggregate data, showing the overall trend.
### Detailed Analysis
Data points are plotted as combinations of the above colors and markers. Each point includes vertical error bars (standard deviation or confidence interval). Below is an approximate reconstruction of the data series, grouped by `n_p` (color).
**Series: `n_p = 1` (Brown)**
* **Trend:** This series occupies the lowest pass rate region for a given token count.
* **Data Points (Approximate):**
* `n_fr=1` (●): (~500 tokens, ~0.44 pass rate)
* `n_fr=3` (▼): (~800 tokens, ~0.48 pass rate)
* `n_fr=5` (■): (~1200 tokens, ~0.50 pass rate)
* `n_fr=10` (▲): (~1800 tokens, ~0.52 pass rate)
**Series: `n_p = 2` (Yellow)**
* **Trend:** Slightly higher pass rates than `n_p=1` for similar token counts.
* **Data Points (Approximate):**
* `n_fr=1` (●): (~600 tokens, ~0.52 pass rate)
* `n_fr=3` (▼): (~1400 tokens, ~0.57 pass rate)
* `n_fr=5` (■): (~1800 tokens, ~0.58 pass rate)
* `n_fr=10` (▲): (~3500 tokens, ~0.61 pass rate)
**Series: `n_p = 5` (Teal)**
* **Trend:** Shows a clear jump in pass rate compared to lower `n_p` values.
* **Data Points (Approximate):**
* `n_fr=1` (●): (~1500 tokens, ~0.60 pass rate)
* `n_fr=3` (▼): (~2800 tokens, ~0.65 pass rate)
* `n_fr=5` (■): (~4000 tokens, ~0.67 pass rate)
* `n_fr=10` (▲): (~7500 tokens, ~0.69 pass rate)
**Series: `n_p = 10` (Light Blue)**
* **Trend:** Achieves high pass rates, clustering near the top of the chart.
* **Data Points (Approximate):**
* `n_fr=3` (▼): (~5000 tokens, ~0.69 pass rate)
* `n_fr=5` (■): (~7500 tokens, ~0.71 pass rate)
**Series: `n_p = 25` (Dark Blue)**
* **Trend:** The highest `n_p` value shown, with a data point at the highest observed pass rate.
* **Data Points (Approximate):**
* `n_fr=1` (●): (~6500 tokens, ~0.71 pass rate)
### Key Observations
1. **Diminishing Returns:** All series follow a logarithmic-like curve. The initial increase in tokens generated yields a steep rise in pass rate, which then flattens significantly after approximately 2000-4000 tokens.
2. **Parameter Hierarchy:** The parameter `n_p` (color) has a stronger effect on the pass rate ceiling than `n_fr` (marker). For any given `n_fr`, a higher `n_p` results in a higher pass rate.
3. **Clustering by `n_p`:** Data points group vertically by color. For example, all brown points (`n_p=1`) are below all yellow points (`n_p=2`), which are below the teal points (`n_p=5`), and so on.
4. **Token Cost of Higher `n_p`:** Achieving the higher pass rates associated with larger `n_p` values generally requires generating a larger mean number of tokens.
5. **Plateau Region:** The fitted gray trend line suggests the overall system performance approaches a maximum mean pass rate of approximately 0.65-0.70 as the mean number of tokens generated exceeds ~4000.
### Interpretation
This chart likely evaluates the performance of a generative AI system (e.g., for code, text, or reasoning tasks) under different configurations. The "pass rate" is a metric of success or correctness.
* **Trade-off Demonstrated:** The plot visualizes a fundamental trade-off between **efficiency** (mean tokens generated, a proxy for cost/latency) and **effectiveness** (mean pass rate). Spending more computational resources (more tokens) improves results, but with sharply diminishing returns.
* **Role of Parameters:** `n_p` and `n_fr` are likely hyperparameters controlling aspects like the number of parallel attempts, refinement steps, or sampling diversity. The data suggests `n_p` is a more powerful lever for improving quality than `n_fr`, but it comes at a higher token cost.
* **Practical Implication:** For resource-constrained applications, operating in the "knee" of the curve (around 1000-3000 tokens) may offer the best balance. For applications where maximum quality is paramount and cost is secondary, using a high `n_p` (e.g., 10 or 25) is justified, as it pushes the pass rate towards the observed ceiling of ~0.71.
* **System Limit:** The plateau of the trend line around 0.65-0.70 may indicate an inherent limitation of the base model or task difficulty, beyond which simply generating more tokens provides minimal benefit.
</details>
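Each marker in the scatter plot above is an aggregate over per-task results for one `(n_p, n_fr)` configuration. A minimal sketch of that aggregation, assuming a hypothetical per-task log with `tokens` and `passed` fields (the field names are illustrative, not from the paper's code):

```python
from statistics import mean

def scatter_point(task_results):
    """Collapse per-task results for one (n_p, n_fr) configuration into a
    single (x, y) point: mean tokens generated vs. mean pass rate.

    Each entry holds (hypothetical schema):
      'tokens' - total tokens sampled for the task (programs, feedback, repairs)
      'passed' - whether any candidate program passed all unit tests
    """
    x = mean(r["tokens"] for r in task_results)
    y = mean(1.0 if r["passed"] else 0.0 for r in task_results)
    return x, y

# Toy example: three tasks under some hypothetical n_p=2, n_fr=1 configuration.
point = scatter_point([
    {"tokens": 900, "passed": True},
    {"tokens": 1400, "passed": False},
    {"tokens": 1100, "passed": True},
])
```

Plotting such points for every configuration, with error bars from per-task variance, reproduces the chart's layout.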
<details>
<summary>x17.png Details</summary>

### Visual Description
## Heatmap: Performance Metric vs. Initial Programs and Feedback-Repairs
### Overview
The image is a heatmap displaying a numerical performance metric (values ranging from 0.88 to 1.09) across a grid defined by two parameters: the "Number of initial programs (n_p)" on the x-axis and the "Number of feedback-repairs (n_fr)" on the y-axis. Cells where the metric could not be computed or is invalid are labeled "O.O.B." (likely "Out Of Bounds"). The color of each cell corresponds to its numerical value, following a gradient from dark orange (lowest values) through yellow and yellow-green to light green (highest values). Cells labeled "O.O.B." are colored black.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "Number of initial programs (n_p)". The axis has five discrete, non-linearly spaced tick marks at values: **1, 2, 5, 10, 25**.
* **Y-Axis (Vertical):** Labeled "Number of feedback-repairs (n_fr)". The axis has four discrete tick marks at values: **1, 3, 5, 10**.
* **Data Grid:** A 4-row by 5-column grid. Each cell contains a centered, white-text numerical value or the label "O.O.B.".
* **Color Legend (Implicit):** There is no separate legend box. The color scale is embedded in the data cells themselves, mapping value to color:
* **Dark Orange:** ~0.88 - 0.91
* **Orange/Yellow:** ~0.96 - 0.99
* **Yellow-Green:** ~1.03 - 1.06
* **Light Green:** ~1.08 - 1.09
* **Black:** "O.O.B."
### Detailed Analysis
The following table reconstructs the heatmap's data, with rows corresponding to `n_fr` (y-axis, listed from the top of the heatmap down) and columns to `n_p` (x-axis, from left to right).
| n_fr \ n_p | **1** | **2** | **5** | **10** | **25** |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **10** | 0.88 | 0.97 | 1.06 | O.O.B. | O.O.B. |
| **5** | 0.89 | 0.96 | 1.04 | 1.09 | O.O.B. |
| **3** | 0.91 | 0.97 | 1.04 | 1.08 | O.O.B. |
| **1** | 0.96 | 0.99 | 1.03 | 1.05 | 1.09 |
**Trend Verification:**
* **Horizontal Trend (Fixed n_fr):** For a fixed number of feedback-repairs, the metric generally **increases** as the number of initial programs (`n_p`) increases. This is visually represented by the color shifting from orange towards green as you move right along any row. The trend is consistent until the "O.O.B." region is reached.
* **Vertical Trend (Fixed n_p):** For a fixed number of initial programs, the relationship with feedback-repairs (`n_fr`) is less uniform.
* At `n_p = 1`, the value **decreases** slightly as `n_fr` increases (0.96 -> 0.91 -> 0.89 -> 0.88).
* At `n_p = 2` and `n_p = 5`, the values are relatively stable across different `n_fr`.
* At `n_p = 10`, the value **increases** as `n_fr` increases (1.05 at n_fr=1 vs. 1.09 at n_fr=5).
### Key Observations
1. **Performance Ceiling & O.O.B. Zone:** The highest recorded metric value is **1.09**, achieved at two points: (`n_p=10, n_fr=5`) and (`n_p=25, n_fr=1`). A clear "Out Of Bounds" (O.O.B.) region exists in the top-right corner of the heatmap, corresponding to high values of both parameters (`n_p >= 10` with `n_fr = 10`, and `n_p = 25` with `n_fr >= 3`).
2. **Optimal Region:** The highest performance (values >= 1.08) sits at `n_p=10` with `n_fr=3` or `n_fr=5`, and at `n_p=25` with `n_fr=1`.
3. **Lowest Performance:** The lowest metric value (**0.88**) occurs at the combination of the fewest initial programs (`n_p=1`) and the most feedback-repairs (`n_fr=10`).
4. **Color-Value Correlation:** The color gradient is consistent. Dark orange cells contain the lowest values (0.88-0.91), yellow-green cells contain mid-range values (1.03-1.06), and light green cells contain the highest valid values (1.08-1.09).
### Interpretation
This heatmap likely visualizes the performance (e.g., success rate, efficiency score, or quality metric) of a program repair or synthesis system that uses an initial set of programs (`n_p`) and a feedback-driven repair loop (`n_fr`). The data suggests several key insights:
* **More Initial Programs Generally Help:** Increasing the initial pool of candidate programs (`n_p`) tends to improve the final outcome, as seen by the rightward increase in values across rows.
* **Feedback Repairs Have Diminishing/Complex Returns:** The role of feedback repairs (`n_fr`) is nuanced. While some repair iterations are beneficial (e.g., moving from `n_fr=1` to `n_fr=5` at `n_p=10` improves the score from 1.05 to 1.09), excessive repairs combined with a large initial pool (`n_p=25, n_fr>=3`) lead to failure (O.O.B.). This could indicate system instability, overfitting, or resource exhaustion under high computational load.
* **Trade-off and System Limits:** The "O.O.B." zone defines the operational limits of the system. The most robust high-performance configuration appears to be a moderate-to-large initial program set (`n_p=10`) with a moderate number of repair cycles (`n_fr=5`). Pushing both parameters to their maximum (`n_p=25, n_fr=10`) is not viable.
* **The "Less is More" Anomaly:** At the smallest initial pool (`n_p=1`), adding more feedback repairs actually *degrades* performance. This could imply that with very few starting options, aggressive repair attempts are counterproductive, perhaps by corrupting the only viable candidate.
In summary, the chart demonstrates a non-linear interaction between initial diversity (`n_p`) and iterative refinement (`n_fr`). Optimal performance requires balancing these two factors, as maximizing both leads to system failure. The sweet spot lies in providing sufficient initial candidates while applying a controlled, not maximal, amount of repair effort.
</details>
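The heatmap above does not name its metric, but values straddling 1.0 with an "O.O.B." cutoff are consistent with a pass rate normalized by a no-repair baseline evaluated at the same token budget, where budgets beyond the baseline's measured range are marked out of bounds. A sketch under that assumption only (the function, its arguments, and the baseline curve are all hypothetical):

```python
def heatmap_cell(config_pass_rate, config_tokens, baseline_pass_rate_at, max_tokens):
    """Hypothetical reconstruction of one heatmap cell: the pass rate of a
    (n_p, n_fr) self-repair configuration divided by a no-repair baseline's
    pass rate at the same mean token budget.

    Returns None when the configuration's budget exceeds the range over which
    the baseline was measured -- rendered as the black "O.O.B." cells.
    """
    if config_tokens > max_tokens:
        return None
    return config_pass_rate / baseline_pass_rate_at(config_tokens)

# Toy baseline curve: pass rate rises with budget and saturates at 0.7.
baseline = lambda t: min(0.7, 0.4 + t / 20000)

cell = heatmap_cell(0.66, 4000, baseline, max_tokens=10000)   # above-baseline cell
oob = heatmap_cell(0.70, 12000, baseline, max_tokens=10000)   # budget too large
```

Under this reading, cells above 1.0 mean self-repair beats the baseline at equal cost, and cells below 1.0 mean the budget would have been better spent on independent samples.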
<details>
<summary>x18.png Details</summary>

### Visual Description
## Scatter Plot with Trend Line: Mean Pass Rate vs. Mean Number of Tokens Generated
### Overview
The image is a technical scatter plot chart displaying the relationship between the "Mean number of tokens generated" (x-axis) and the "Mean pass rate" (y-axis). The chart includes multiple data series differentiated by color and marker shape, representing different values of two parameters: \( n_p \) and \( n_{fr} \). A dark gray trend line with a shaded confidence interval runs through the data points.
### Components/Axes
* **X-Axis:** Labeled "Mean number of tokens generated". The scale runs from 0 to 10000, with major tick marks at 0, 2000, 4000, 6000, 8000, and 10000. The labels are rotated approximately 45 degrees.
* **Y-Axis:** Labeled "Mean pass rate". The scale runs from 0.0 to 1.0, with major tick marks at 0.0, 0.2, 0.4, 0.6, 0.8, and 1.0.
* **Legend (Top-Right Quadrant):** The legend is contained within a light gray box and is split into two columns.
* **Left Column (Color Key for \( n_p \)):**
* Orange line: \( n_p = 1 \)
* Yellow line: \( n_p = 2 \)
* Teal line: \( n_p = 5 \)
* Light Blue line: \( n_p = 10 \)
* Dark Blue line: \( n_p = 25 \)
* **Right Column (Marker Key for \( n_{fr} \)):**
* Circle (●): \( n_{fr} = 1 \)
* Downward Triangle (▼): \( n_{fr} = 3 \)
* Square (■): \( n_{fr} = 5 \)
* Upward Triangle (▲): \( n_{fr} = 10 \)
* **Trend Line:** A dark gray, solid line representing a fitted model or average trend. It is surrounded by a light gray shaded region, likely indicating a confidence interval or standard deviation band.
* **Data Points:** Each point is a colored marker (corresponding to an \( n_p \) value) with a specific shape (corresponding to an \( n_{fr} \) value). Each point includes vertical error bars.
### Detailed Analysis
**Data Point Extraction (Approximate Values):**
The following table lists each visible data point, identified by its color (\( n_p \)) and marker shape (\( n_{fr} \)), with approximate coordinates and error bar ranges.
| \( n_p \) (Color) | \( n_{fr} \) (Marker) | Approx. Mean Tokens (X) | Approx. Mean Pass Rate (Y) | Approx. Error Bar Range (Y) |
| :--- | :--- | :--- | :--- | :--- |
| 1 (Orange) | 1 (Circle) | 500 | 0.19 | 0.16 - 0.22 |
| 1 (Orange) | 3 (Down Triangle) | 1000 | 0.23 | 0.19 - 0.27 |
| 1 (Orange) | 5 (Square) | 1500 | 0.25 | 0.22 - 0.28 |
| 1 (Orange) | 10 (Up Triangle) | 3000 | 0.27 | 0.24 - 0.30 |
| 2 (Yellow) | 1 (Circle) | 1000 | 0.27 | 0.25 - 0.29 |
| 2 (Yellow) | 3 (Down Triangle) | 2000 | 0.32 | 0.28 - 0.36 |
| 2 (Yellow) | 5 (Square) | 3000 | 0.34 | 0.31 - 0.37 |
| 2 (Yellow) | 10 (Up Triangle) | 5500 | 0.37 | 0.34 - 0.40 |
| 5 (Teal) | 1 (Circle) | 2000 | 0.36 | 0.33 - 0.39 |
| 5 (Teal) | 3 (Down Triangle) | 4500 | 0.41 | 0.38 - 0.44 |
| 5 (Teal) | 5 (Square) | 6500 | 0.42 | 0.40 - 0.44 |
| 10 (Light Blue) | 1 (Circle) | 4000 | 0.42 | 0.39 - 0.45 |
| 10 (Light Blue) | 3 (Down Triangle) | 8000 | 0.47 | 0.44 - 0.50 |
| 25 (Dark Blue) | 1 (Circle) | 9800 | 0.49 | 0.47 - 0.51 |
**Trend Line Description:**
The dark gray trend line shows a logarithmic or diminishing returns relationship. It rises steeply from near (0, 0.1) and begins to plateau around x=4000, y=0.38. The slope becomes very shallow after x=6000, approaching a pass rate of approximately 0.41 at x=10000. The shaded confidence interval is narrowest in the middle range (x=2000 to 6000) and widens slightly at the extremes.
### Key Observations
1. **Positive Correlation:** There is a clear positive correlation between the mean number of tokens generated and the mean pass rate. As token count increases, pass rate increases.
2. **Diminishing Returns:** The rate of improvement in pass rate slows significantly as token count increases. The largest gains occur below 4000 tokens.
3. **Parameter Influence:**
* **Effect of \( n_p \):** For a given token count, higher \( n_p \) values (cooler colors: teal, blue) are associated with higher pass rates. For example, at ~4000 tokens, the light blue point (\( n_p=10 \)) is higher than the teal point (\( n_p=5 \)), which is higher than the yellow point (\( n_p=2 \)).
* **Effect of \( n_{fr} \):** For a given \( n_p \) color, points with higher \( n_{fr} \) values (different markers) tend to be located further to the right (higher token count) and often slightly higher on the y-axis (higher pass rate). This suggests that increasing \( n_{fr} \) also increases token generation.
4. **Data Distribution:** Data points are clustered more densely in the lower-left region (lower tokens, lower pass rate) and become sparser towards the upper-right. The highest pass rate observed is just below 0.5.
### Interpretation
This chart likely illustrates the performance of a generative model (e.g., for code, text, or reasoning) where "pass rate" is a success metric. The data suggests a fundamental trade-off or resource relationship: achieving a higher success rate requires generating more tokens (i.e., using more computational effort or producing longer outputs).
The parameters \( n_p \) and \( n_{fr} \) appear to be hyperparameters controlling the generation process. The trends indicate that:
* Increasing \( n_p \) improves the efficiency of the model, yielding a higher pass rate for the same token budget.
* Increasing \( n_{fr} \) leads to longer generations (more tokens) and a modest boost in pass rate, but its primary effect seems to be on output length.
The plateau in the trend line implies a performance ceiling; beyond a certain point (~6000 tokens), simply generating more text yields minimal improvement in success rate. This could indicate a limit of the model's capability or that the evaluation task has an inherent difficulty that cannot be overcome by longer generation alone. The widening confidence interval at high token counts suggests greater variability or uncertainty in outcomes when the model generates very long sequences.
</details>
<details>
<summary>x19.png Details</summary>

### Visual Description
## Heatmap: Relationship Between Initial Programs and Feedback-Repairs
### Overview
The image is a heatmap chart visualizing a numerical metric (unspecified, but likely a performance ratio or efficiency score) across a grid defined by two variables: the "Number of initial programs (n_p)" on the horizontal axis and the "Number of feedback-repairs (n_fr)" on the vertical axis. The chart uses a color gradient from yellow to teal to dark gray to represent the metric's value, with specific numerical values annotated in each cell. Cells marked "O.O.B." indicate values that are "Out Of Bounds" or undefined for that combination.
### Components/Axes
* **X-Axis (Horizontal):** Labeled "Number of initial programs (n_p)". The axis has five discrete, non-linearly spaced tick marks at values: **1, 2, 5, 10, 25**.
* **Y-Axis (Vertical):** Labeled "Number of feedback-repairs (n_fr)". The axis has four discrete tick marks at values: **1, 3, 5, 10**.
* **Legend/Color Scale:** There is no separate legend. The color of each cell corresponds to the numerical value within it, following a gradient:
* **Dark Brown/Orange:** Represents values below 1.0 (approx. 0.81 to 0.97).
* **Yellow/Gold:** Represents lower values at or just above 1.0 (approx. 1.01 to 1.05).
* **Light Green/Teal:** Represents the highest values (approx. 1.10 to 1.19).
* **Dark Gray/Black:** Represents "O.O.B." (Out Of Bounds) cells.
* **Data Grid:** A 4-row (n_fr) by 5-column (n_p) matrix of colored cells, each containing a white text annotation of its value.
### Detailed Analysis
The heatmap data is extracted below, organized by the number of feedback-repairs (n_fr, rows from bottom to top) and the number of initial programs (n_p, columns from left to right).
| n_fr \ n_p | 1 | 2 | 5 | 10 | 25 |
| :--- | :--- | :--- | :--- | :--- | :--- |
| **1** | 1.01 | 1.05 | 1.10 | 1.13 | 1.19 |
| **3** | 0.85 | 0.97 | 1.10 | 1.16 | O.O.B. |
| **5** | 0.83 | 0.97 | 1.10 | O.O.B. | O.O.B. |
| **10** | 0.81 | 0.97 | O.O.B. | O.O.B. | O.O.B. |
**Row 1: n_fr = 1 (Bottom Row)**
* n_p = 1: Color = Yellow. Value = **1.01**
* n_p = 2: Color = Light Yellow-Green. Value = **1.05**
* n_p = 5: Color = Light Green. Value = **1.10**
* n_p = 10: Color = Teal-Green. Value = **1.13**
* n_p = 25: Color = Teal. Value = **1.19**
* *Trend:* Values increase steadily from left to right (1.01 to 1.19).
**Row 2: n_fr = 3**
* n_p = 1: Color = Dark Orange. Value = **0.85**
* n_p = 2: Color = Gold. Value = **0.97**
* n_p = 5: Color = Light Green. Value = **1.10**
* n_p = 10: Color = Teal. Value = **1.16**
* n_p = 25: Color = Dark Gray. Value = **O.O.B.**
* *Trend:* Values increase from left to right until n_p=10, then become O.O.B. at n_p=25.
**Row 3: n_fr = 5**
* n_p = 1: Color = Dark Orange-Brown. Value = **0.83**
* n_p = 2: Color = Gold. Value = **0.97**
* n_p = 5: Color = Light Green. Value = **1.10**
* n_p = 10: Color = Dark Gray. Value = **O.O.B.**
* n_p = 25: Color = Dark Gray. Value = **O.O.B.**
* *Trend:* Values increase from left to right until n_p=5, then become O.O.B. for n_p=10 and 25.
**Row 4: n_fr = 10 (Top Row)**
* n_p = 1: Color = Dark Brown. Value = **0.81**
* n_p = 2: Color = Gold. Value = **0.97**
* n_p = 5: Color = Dark Gray. Value = **O.O.B.**
* n_p = 10: Color = Dark Gray. Value = **O.O.B.**
* n_p = 25: Color = Dark Gray. Value = **O.O.B.**
* *Trend:* Values increase from n_p=1 to n_p=2, then become O.O.B. for all n_p ≥ 5.
### Key Observations
1. **Diagonal Boundary of O.O.B.:** There is a clear diagonal frontier separating defined values from O.O.B. regions. As both n_p and n_fr increase, the combination is more likely to be O.O.B. The boundary runs roughly from the top-middle (n_p=5, n_fr=10) to the bottom-right (n_p=25, n_fr=3).
2. **Value Plateau at n_p=2:** For n_fr ≥ 3, the value at n_p=2 is consistently **0.97**, regardless of the number of feedback-repairs.
3. **Value Plateau at n_p=5:** For n_fr ≤ 5, the value at n_p=5 is consistently **1.10**.
4. **Lowest Values:** The lowest defined values (0.81, 0.83, 0.85) occur at the lowest n_p (1) combined with higher n_fr (10, 5, 3 respectively).
5. **Highest Value:** The highest defined value (**1.19**) occurs at the highest n_p (25) combined with the lowest n_fr (1).
### Interpretation
The heatmap suggests a complex, non-linear relationship between the number of initial programs (n_p) and feedback-repairs (n_fr) on the measured metric. The data implies:
* **Trade-off and Feasibility Frontier:** The O.O.B. region likely represents combinations of parameters that are computationally infeasible, unstable, or yield no valid result. The diagonal boundary indicates that increasing both parameters simultaneously quickly leads to this infeasible region.
* **Diminishing Returns or Saturation:** The consistent values at n_p=2 (0.97) and n_p=5 (1.10) across multiple n_fr levels suggest a saturation point where adding more initial programs (up to that point) yields a predictable outcome regardless of repair attempts. Beyond these points (e.g., n_p=10, 25), the outcome becomes highly dependent on n_fr.
* **Optimal Configuration:** The highest metric value (1.19) is achieved with a high number of initial programs (25) but only a single feedback-repair cycle. This could indicate that for this system, investing heavily in the initial program set is more effective than engaging in extensive iterative repair, provided the initial set is large enough.
* **Performance Degradation with More Repairs:** Counter-intuitively, for a fixed, low number of initial programs (n_p=1), increasing the number of feedback-repairs (n_fr) leads to a *decrease* in the metric (from 1.01 at n_fr=1 down to 0.81 at n_fr=10). This suggests that excessive repair cycles on a poor initial program may be detrimental.
In summary, the chart maps a performance or efficiency landscape, highlighting a clear infeasibility boundary and suggesting that system optimization favors either a large initial investment (high n_p) with minimal repair, or a moderate initial investment with a specific, limited number of repair cycles.
</details>
Figure 8: GPT-4 results from Figure 4 (Section 4.1) per difficulty (row), from top to bottom: introductory, interview, and competition.
<details>
<summary>x20.png Details</summary>

### Visual Description
## Line Chart: Mean Pass Rate vs. Mean Number of Tokens Generated
### Overview
This is a line chart comparing the performance of different Large Language Model (LLM) configurations. The chart plots the "Mean pass rate" (y-axis) against the "Mean number of tokens generated" (x-axis) for five distinct model setups, showing how performance scales with increased token generation. The data suggests an evaluation of code generation or problem-solving tasks where "pass rate" is the success metric.
### Components/Axes
* **Chart Type:** Line chart with shaded confidence intervals or variance bands around each line.
* **X-Axis:**
* **Label:** "Mean number of tokens generated"
* **Scale:** Linear scale from 0 to 10,000.
* **Major Tick Marks:** 0, 2000, 4000, 6000, 8000, 10000.
* **Y-Axis:**
* **Label:** "Mean pass rate"
* **Scale:** Linear scale from 0.0 to 1.0.
* **Major Tick Marks:** 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
* **Legend:** Located in the bottom-left quadrant of the plot area. It contains five entries, each associating a colored line with a model configuration.
* **Dark Blue Line:** `M_P = GPT-4 (no repair)`
* **Teal Line:** `M_P = GPT-4; M_F = GPT-4`
* **Gray Line:** `M_P = GPT-3.5 (no repair)`
* **Orange Line:** `M_P = GPT-3.5; M_F = GPT-3.5`
* **Light Blue Line:** `M_P = GPT-3.5; M_F = GPT-4`
* *Notation:* `M_P` likely denotes the primary (program-generating) model, and `M_F` the model that provides feedback for the repair step.
### Detailed Analysis
The chart displays five data series, each showing a logarithmic-like growth trend where the mean pass rate increases rapidly with initial token generation and then plateaus.
1. **`M_P = GPT-4 (no repair)` (Dark Blue Line):**
* **Trend:** Starts highest among all series and maintains the lead throughout. Shows a steep initial rise followed by a very gradual plateau.
* **Approximate Data Points:** Starts at ~0.7 pass rate for very low tokens. Reaches ~0.85 by 1000 tokens, ~0.9 by 2000 tokens, and plateaus near ~0.92 by 5000+ tokens.
2. **`M_P = GPT-4; M_F = GPT-4` (Teal Line):**
* **Trend:** Starts slightly below the dark blue line but follows a nearly parallel trajectory, consistently performing second-best.
* **Approximate Data Points:** Tracks just below the dark blue line throughout its range, plateauing marginally below it by 3000+ tokens.
3. **`M_P = GPT-3.5 (no repair)` (Gray Line):**
* **Trend:** Starts the lowest but shows significant improvement, converging with the orange line at higher token counts.
* **Approximate Data Points:** Starts at ~0.5. Reaches ~0.75 by 1000 tokens, ~0.82 by 2000 tokens, and plateaus near ~0.85 by 6000+ tokens.
4. **`M_P = GPT-3.5; M_F = GPT-3.5` (Orange Line):**
* **Trend:** Follows a path very similar to the gray line (`GPT-3.5 no repair`), often overlapping or sitting slightly below it.
* **Approximate Data Points:** Starts at ~0.55. Reaches ~0.78 by 1000 tokens, ~0.83 by 2000 tokens, and plateaus near ~0.85 by 4500+ tokens.
5. **`M_P = GPT-3.5; M_F = GPT-4` (Light Blue Line):**
* **Trend:** Starts higher than the other GPT-3.5 configurations but lower than the GPT-4 lines. It maintains a clear gap above the gray and orange lines.
* **Approximate Data Points:** Starts at ~0.6. Reaches ~0.82 by 1000 tokens, ~0.86 by 2000 tokens, and plateaus near ~0.88 by 5000+ tokens.
### Key Observations
1. **Performance Hierarchy:** A clear performance hierarchy is established: GPT-4 based configurations (dark blue, teal) significantly outperform all GPT-3.5 based configurations (gray, orange, light blue).
2. **Impact of Repair (`M_F`):**
* For GPT-4, adding a GPT-4 repair step (teal line) results in a slight but consistent performance *decrease* compared to no repair (dark blue line).
* For GPT-3.5, adding a same-model repair step (orange line) shows negligible difference from no repair (gray line).
* Using a stronger model (GPT-4) for repair on a GPT-3.5 primary model (light blue line) provides a clear performance boost over GPT-3.5 alone.
3. **Diminishing Returns:** All configurations exhibit strong diminishing returns. The majority of performance gain occurs within the first 2000 generated tokens. Beyond 4000-6000 tokens, the pass rate curves flatten considerably.
4. **Convergence:** The two GPT-3.5-only configurations (gray and orange) converge to nearly the same plateau (~0.85), suggesting a performance ceiling for that model in this task, regardless of a same-model repair step.
### Interpretation
This chart likely visualizes results from a study on iterative code generation or "self-repair" in LLMs. The data suggests several key insights:
* **Model Capability is Primary:** The base model's capability (`M_P`) is the dominant factor in performance. GPT-4's inherent superiority is evident across all token budgets.
* **Repair Dynamics are Nuanced:** The effect of a repair model (`M_F`) is not universally beneficial. Using the same powerful model (GPT-4) for both generation and repair may introduce redundancy or over-optimization, slightly harming performance. Conversely, using a stronger model to repair outputs from a weaker one (GPT-3.5 -> GPT-4) is an effective strategy to boost performance, acting as a "force multiplier."
* **Token Budget Efficiency:** The steep initial curves indicate that most problems solvable by these models are solved with relatively short solutions (under 2000 tokens). Generating beyond 6000 tokens yields minimal accuracy gains, suggesting a point where additional computation is inefficient for improving pass rate.
* **Performance Ceilings:** The plateauing of all lines indicates task-specific performance ceilings. GPT-3.5 appears to hit a ceiling around 85% pass rate, while GPT-4 approaches but does not reach 100%, suggesting some problems in the evaluation set are inherently difficult or unsolvable for these models within the given constraints.
**Language Note:** All text in the image is in English.
</details>
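The legend's configurations correspond to the generate-test-feedback-repair loop described in the paper's introduction. A schematic sketch of that loop, with `generate`, `run_tests`, `feedback`, and `repair` standing in for the model calls (`M_P`, `M_F`) and the execution harness; the names and signatures are illustrative, not the paper's actual code:

```python
def self_repair(spec, generate, run_tests, feedback, repair, n_p, n_fr):
    """Schematic self-repair search: sample n_p initial programs from the
    primary model M_P; for each failing candidate, ask the feedback model M_F
    to explain the failure and M_P to fix it, up to n_fr times.
    Returns the first program passing all unit tests, or None."""
    for _ in range(n_p):
        program = generate(spec)                 # M_P proposes a candidate
        for _ in range(n_fr):
            errors = run_tests(spec, program)    # execute the provided unit tests
            if not errors:
                return program                   # candidate already passes
            explanation = feedback(spec, program, errors)  # M_F explains the bug
            program = repair(spec, program, explanation)   # M_P emits a fix
        if not run_tests(spec, program):         # check the final repair attempt
            return program
    return None

# Toy demo: "programs" are integers and a program passes when it equals 42.
candidates = iter([7, 40])
result = self_repair(
    spec=None,
    generate=lambda s: next(candidates),
    run_tests=lambda s, p: [] if p == 42 else ["wrong answer"],
    feedback=lambda s, p, e: "off by a bit",
    repair=lambda s, p, fb: p + 1,
    n_p=2, n_fr=2,
)
```

Every call to `generate`, `feedback`, and `repair` consumes tokens, which is why larger `n_p` and `n_fr` shift a configuration rightward along the x-axis of these charts.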
<details>
<summary>x21.png Details</summary>

### Visual Description
## Line Chart: Mean Pass Rate vs. Mean Number of Tokens Generated
### Overview
The image is a line chart comparing the performance of different Large Language Model (LLM) configurations. It plots the "Mean pass rate" (y-axis) against the "Mean number of tokens generated" (x-axis). The chart shows five distinct data series, each representing a different model setup, with shaded regions likely indicating confidence intervals or standard deviation around the mean.
### Components/Axes
* **Chart Type:** Line chart with shaded error bands.
* **X-Axis:**
* **Label:** "Mean number of tokens generated"
* **Scale:** Linear, ranging from 0 to 10,000.
* **Major Tick Marks:** 0, 2000, 4000, 6000, 8000, 10000.
* **Y-Axis:**
* **Label:** "Mean pass rate"
* **Scale:** Linear, ranging from 0.0 to 1.0.
* **Major Tick Marks:** 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
* **Legend:** Located in the bottom-right quadrant of the chart area. It contains five entries, each with a colored line sample and a text label.
1. **Dark Blue Line:** `M_P = GPT-4 (no repair)`
2. **Teal/Green Line:** `M_P = GPT-4; M_F = GPT-4`
3. **Gray Line:** `M_P = GPT-3.5 (no repair)`
4. **Orange/Red Line:** `M_P = GPT-3.5; M_F = GPT-3.5`
5. **Light Blue Line:** `M_P = GPT-3.5; M_F = GPT-4`
* **Notation:** `M_P` likely denotes the primary (program-generating) model, and `M_F` the model that provides feedback for the repair step.
### Detailed Analysis
**Trend Verification & Data Point Extraction:**
All five lines exhibit a similar logarithmic growth trend: a steep initial increase in pass rate as token count rises from 0, followed by a gradual plateau. The lines are ordered vertically, indicating clear performance tiers.
1. **`M_P = GPT-4; M_F = GPT-4` (Teal/Green Line):**
* **Trend:** Highest-performing series. Slopes upward sharply and maintains the highest position throughout.
* **Approximate Values:** Starts near 0.4 at ~0 tokens. Reaches ~0.6 at 2000 tokens, ~0.68 at 4000 tokens, and plateaus near 0.71 by 6000-7000 tokens (the line ends before 8000).
2. **`M_P = GPT-4 (no repair)` (Dark Blue Line):**
* **Trend:** Second-highest series. Follows a path very close to but consistently below the teal line.
* **Approximate Values:** Starts near 0.4. Reaches ~0.6 at 2000 tokens, ~0.64 at 4000 tokens, and plateaus near 0.66 by 8000 tokens.
3. **`M_P = GPT-3.5; M_F = GPT-4` (Light Blue Line):**
* **Trend:** Middle series. Shows a more gradual ascent compared to the GPT-4 lines.
* **Approximate Values:** Starts near 0.2. Reaches ~0.4 at 2000 tokens, ~0.48 at 4000 tokens, ~0.54 at 6000 tokens, and ~0.57 at 10000 tokens.
4. **`M_P = GPT-3.5; M_F = GPT-3.5` (Orange/Red Line):**
* **Trend:** Fourth series, closely tracking the light blue line but slightly below it.
* **Approximate Values:** Starts near 0.2. Reaches ~0.38 at 2000 tokens, ~0.45 at 4000 tokens, ~0.50 at 6000 tokens, and ~0.53 at 10000 tokens.
5. **`M_P = GPT-3.5 (no repair)` (Gray Line):**
* **Trend:** Lowest-performing series. Follows a path very close to but consistently below the orange line.
* **Approximate Values:** Starts near 0.2. Reaches ~0.37 at 2000 tokens, ~0.44 at 4000 tokens, ~0.49 at 6000 tokens, and ~0.51 at 10000 tokens.
### Key Observations
1. **Clear Performance Hierarchy:** There is a significant and consistent gap between the two GPT-4 based models (top two lines) and the three GPT-3.5 based models (bottom three lines). The GPT-4 models achieve a mean pass rate approximately 0.15-0.20 points higher than their GPT-3.5 counterparts at equivalent token counts.
2. **Impact of Repair (`M_F`):** For both GPT-4 and GPT-3.5, adding a repair model (`M_F`) improves performance over the "no repair" baseline. The teal line (GPT-4 repair) is above the dark blue line (GPT-4 no repair). The light blue and orange lines (GPT-3.5 with repair) are above the gray line (GPT-3.5 no repair).
3. **Cross-Model Repair Benefit:** The most notable observation is that using GPT-4 as a repair model for GPT-3.5 (`M_P = GPT-3.5; M_F = GPT-4`, light blue line) yields better performance than using GPT-3.5 to repair itself (orange line). This suggests a stronger repair model can elevate the performance of a weaker primary model.
4. **Diminishing Returns:** All curves show strong diminishing returns. The majority of the pass rate gain occurs within the first 2000-4000 tokens generated. Beyond 6000 tokens, the curves flatten considerably.
### Interpretation
This chart demonstrates the relationship between computational effort (measured in generated tokens) and task success rate (pass rate) for different LLM agent configurations. The data suggests several key insights:
* **Model Capability is Primary:** The base capability of the primary model (`M_P`) is the strongest determinant of performance. GPT-4 fundamentally outperforms GPT-3.5, regardless of the repair strategy.
* **Feedback-Driven Repair is Effective:** Adding a feedback-and-repair step (with `M_F` as the feedback model) consistently improves outcomes. This supports the architectural pattern of using a secondary model to critique the output of a primary model before repair.
* **Asymmetric Feedback is Powerful:** The benefit is not symmetric: a more capable model (GPT-4) provides more useful feedback on a less capable model's (GPT-3.5's) programs than the less capable model can provide for itself. This implies that in a multi-model system, assigning the most capable model to the "critic" or feedback role can be a highly efficient strategy.
* **Efficiency Plateau:** There is a clear point of diminishing returns. Investing tokens beyond approximately 6000 yields minimal improvement in pass rate for these tasks. This has practical implications for cost and latency optimization in deployed systems, suggesting that capping token generation could be beneficial without significant performance loss.
The chart effectively argues for the value of both using advanced base models and implementing multi-model repair architectures, with a particular emphasis on the strategic advantage of using a strong model in the repair role.
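For readers who want to reproduce such curves, the bookkeeping is straightforward. The sketch below is a hypothetical illustration (the `mean_pass_rate` helper and its checkpoint format are our own, not the paper's evaluation code): a pass@t-style curve maps a token budget to the fraction of tasks solved within that budget.

```python
def mean_pass_rate(tasks, budgets):
    """tasks: one list of (tokens_so_far, passed) checkpoints per task.
    Returns the fraction of tasks solved within each token budget."""
    rates = []
    for t in budgets:
        solved = sum(
            1 for checkpoints in tasks
            # A task counts as solved at budget t if some program sampled
            # (or repaired) within t tokens passed all unit tests.
            if any(passed for tokens, passed in checkpoints if tokens <= t)
        )
        rates.append(solved / len(tasks))
    return rates

# Toy traces: task 1 is solved after 500 tokens, task 2 never passes.
traces = [
    [(100, False), (500, True)],
    [(300, False), (900, False)],
]
print(mean_pass_rate(traces, [200, 600, 1000]))  # [0.0, 0.5, 0.5]
```

Plotting such rates against the budgets directly yields curves of the shape described above.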
</details>
<details>
<summary>x22.png Details</summary>

### Visual Description
## Line Chart: Mean Pass Rate vs. Mean Number of Tokens Generated
### Overview
The image is a line chart displaying the relationship between the mean number of tokens generated (x-axis) and the mean pass rate (y-axis) for five different model configurations. The chart compares the performance of GPT-4 and GPT-3.5 models, both with and without a "repair" mechanism. Each line is accompanied by a shaded region representing a confidence interval or variability.
### Components/Axes
* **Chart Type:** Line chart with shaded confidence bands.
* **X-Axis:**
* **Label:** "Mean number of tokens generated"
* **Scale:** Linear scale from 0 to 10,000.
* **Major Ticks:** 0, 2000, 4000, 6000, 8000, 10000.
* **Y-Axis:**
* **Label:** "Mean pass rate"
* **Scale:** Linear scale from 0.0 to 1.0.
* **Major Ticks:** 0.0, 0.2, 0.4, 0.6, 0.8, 1.0.
* **Legend:** Located in the top-right quadrant of the chart area. It contains five entries, each with a colored line sample and a text label.
1. **Dark Blue Line:** `M_P = GPT-4 (no repair)`
2. **Teal Line:** `M_P = GPT-4; M_F = GPT-4`
3. **Gray Line:** `M_P = GPT-3.5 (no repair)`
4. **Orange Line:** `M_P = GPT-3.5; M_F = GPT-3.5`
5. **Light Blue Line:** `M_P = GPT-3.5; M_F = GPT-4`
* **Grid:** A light gray grid is present, aligned with the major ticks on both axes.
### Detailed Analysis
The chart plots five data series. All series show a logarithmic-like growth trend: a rapid initial increase in mean pass rate as the mean number of tokens generated increases, followed by a gradual flattening (diminishing returns).
**Trend Verification & Data Point Extraction (Approximate Values):**
1. **`M_P = GPT-4 (no repair)` (Dark Blue Line):**
* **Trend:** Starts low, rises steeply until ~2000 tokens, then continues to rise at a decreasing rate.
* **Approximate Points:** At 0 tokens: ~0.10. At 2000 tokens: ~0.32. At 10000 tokens: ~0.41.
2. **`M_P = GPT-4; M_F = GPT-4` (Teal Line):**
* **Trend:** Follows a similar shape to the dark blue line but is consistently higher. This is the top-performing series.
* **Approximate Points:** At 0 tokens: ~0.12. At 2000 tokens: ~0.35. At 10000 tokens: ~0.49.
3. **`M_P = GPT-3.5 (no repair)` (Gray Line):**
* **Trend:** The lowest-performing series. Shows a very shallow, almost linear increase.
* **Approximate Points:** At 0 tokens: ~0.03. At 2000 tokens: ~0.08. At 10000 tokens: ~0.15.
4. **`M_P = GPT-3.5; M_F = GPT-3.5` (Orange Line):**
* **Trend:** Performs better than the GPT-3.5 (no repair) baseline but worse than all GPT-4 configurations.
* **Approximate Points:** At 0 tokens: ~0.03. At 2000 tokens: ~0.09. At 10000 tokens: ~0.18.
5. **`M_P = GPT-3.5; M_F = GPT-4` (Light Blue Line):**
* **Trend:** The best-performing configuration using GPT-3.5 as the primary model (`M_P`). It shows a more pronounced upward curve than the other GPT-3.5 lines.
* **Approximate Points:** At 0 tokens: ~0.03. At 2000 tokens: ~0.11. At 10000 tokens: ~0.23.
**Spatial Grounding & Confidence Intervals:**
* The shaded bands around each line indicate uncertainty. The bands for the top two (GPT-4) lines are wider, suggesting greater variance in their performance.
* The legend is positioned in the top-right, overlapping the upper portion of the chart but not obscuring critical data trends.
### Key Observations
1. **Model Hierarchy:** There is a clear performance hierarchy: GPT-4 based configurations significantly outperform all GPT-3.5 based configurations across the entire range of token counts.
2. **Impact of Feedback (`M_F`):** For both GPT-4 and GPT-3.5, self-repair with a feedback model (`M_F`) improves the mean pass rate compared to the "no repair" baseline. The improvement is more dramatic for GPT-4.
3. **Cross-Model Feedback:** Using a stronger feedback model (`M_F = GPT-4`) provides a notable boost to the weaker program model (`M_P = GPT-3.5`), lifting its performance curve above the `M_P = GPT-3.5; M_F = GPT-3.5` configuration.
4. **Diminishing Returns:** All curves exhibit diminishing returns; the gain in pass rate per additional token generated decreases as the total token count increases. The curves begin to plateau after approximately 4000-6000 tokens.
### Interpretation
This chart demonstrates the efficacy of self-repair, in which a feedback model (`M_F`) critiques the generated code before it is repaired, in improving the pass rate. The data suggests two key insights:
1. **Base Model Capability is Paramount:** The primary model (`M_P`) sets the fundamental performance ceiling. No amount of repair with GPT-3.5 can bring a GPT-3.5-based system to the level of a GPT-4 system, even one without repair.
2. **Repair is a Force Multiplier:** The repair mechanism acts as a multiplier on the base model's capability. It is most effective when applied to a strong base model (GPT-4 + GPT-4 repair yields the highest absolute performance). However, it can also be used strategically to elevate a weaker model's output, as seen by `M_P = GPT-3.5; M_F = GPT-4` outperforming `M_P = GPT-3.5; M_F = GPT-3.5`.
The chart implies that for tasks where generating more tokens (e.g., through iterative refinement or longer solutions) is acceptable, investing in both a strong base model and a strong repair model yields the best results. The plateauing curves also suggest there may be an optimal token budget beyond which further generation provides minimal quality improvement.
</details>
Figure 9: Results from Figure 6 (Section 4.2) per difficulty (row), from top to bottom: introductory, interview, and competition.
## Appendix B Human Experiment: Study Instructions
For our study on human data, participants were given a slide deck with instructions. The following ten images show the instructions, which include an example of a task shown to a participant:
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/1.jpg Details</summary>

### Visual Description
## Screenshot: Task Instructions List
### Overview
The image is a screenshot of a text-based instructional list titled "Tasks." It outlines the requirements for a coding-related assignment, presented in a clean, digital document format with a white background and dark text. The content is structured as a hierarchical bulleted list.
### Content Structure
The content is organized under a main heading and three primary bullet points, each with nested sub-bullets.
**Main Heading:**
* **Title:** "Tasks" (displayed in a large, bold font at the top-left).
**Primary Bullet Points & Nested Content:**
1. **Setup:**
* Sub-bullet: "Use a laptop or desktop computer, not a phone"
2. **Task: Debug five incorrect Python programs**
* The word "five" is highlighted in blue.
* Sub-bullet 1: "Each program is an incorrect attempt to solve a coding challenge"
* Sub-bullet 2: "Your answer should explain what the program is doing wrong"
* The phrase "explain what the program is doing wrong" is highlighted in blue.
* Sub-bullet 3: "Expect ~10 minutes per task"
3. **Task format**
* Sub-bullet 1: "Each task is in a separate website"
* Sub-bullet 2: "Submit your answer using the Google form embedded in each page"
* Sub-bullet 3: "No other data is being collected"
* This entire line is highlighted in blue and appears in italics.
### Detailed Analysis
The text provides specific, actionable instructions:
* **Hardware Requirement:** The task must be performed on a laptop or desktop computer.
* **Core Activity:** The user must debug five separate, incorrect Python programs. Each program represents a flawed solution to a coding problem.
* **Expected Output:** For each program, the user must provide an explanation of its errors.
* **Time Estimate:** Approximately 10 minutes should be allocated per debugging task.
* **Submission Process:** Each of the five tasks is hosted on its own website. Answers are to be submitted via an embedded Google Form on each respective page.
* **Data Privacy Note:** An explicit assurance is given that "No other data is being collected" beyond the form submission.
### Key Observations
* **Visual Emphasis:** Key pieces of information are emphasized using blue text color: the quantity ("five"), the required action ("explain what the program is doing wrong"), and the data privacy statement.
* **Clear Hierarchy:** The use of main bullets and nested sub-bullets creates a clear, easy-to-follow structure for the instructions.
* **Concise Language:** The instructions are direct and use imperative sentences (e.g., "Use," "Debug," "Submit").
### Interpretation
This image documents the instructions for a structured technical assessment or learning exercise. The design prioritizes clarity and sets clear expectations for the participant.
* **Purpose:** The primary goal is to evaluate or practice Python debugging skills. The format suggests it could be part of a coding interview, a course assignment, or a skills test.
* **Process Flow:** The instructions define a linear workflow: 1) Ensure correct hardware, 2) Access five separate web-based tasks, 3) For each, analyze a broken Python script, 4) Document the errors in an explanation, 5) Submit the explanation via a form.
* **Notable Design Choices:** The highlighting draws immediate attention to the scope (five tasks), the critical deliverable (the explanation), and a reassurance about privacy. The privacy note is particularly significant, as it addresses potential user concerns about data collection in an online assessment context, likely to encourage honest participation.
* **Underlying Assumptions:** The instructions assume the user has foundational Python knowledge and access to the internet. The time estimate (~10 minutes per task) implies the problems are of moderate complexity, designed to be solvable within a short, focused session.
</details>
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/2.jpg Details</summary>

### Visual Description
## Screenshot: Technical Answer Instructions
### Overview
The image is a screenshot of a text-based instructional section, likely from a website or document, titled "Your Answer." It provides guidelines for users on how to formulate a response to a programming or technical problem. The content is presented as a structured list with explanatory text and examples.
### Components/Axes
The image contains a single, continuous block of text organized with a main heading and a nested bulleted list. There are no charts, diagrams, or data tables. The layout is simple, left-aligned text on a plain background.
**Text Elements & Formatting:**
* **Main Heading:** "Your Answer" (Large font size).
* **Primary Bullet Points:** Four main items, indicated by solid black circles (•).
* **Sub-bullets:** Nested under the first two primary bullets, indicated by hollow circles (◦).
* **Highlighted Text:** Certain phrases are rendered in a blue color, suggesting they are key instructions or possibly hyperlinks. These are:
* "explain what the program is doing wrong"
* "Example answers:"
* "We are not collecting any data about how you use the website."
* "Feel free to use external tools:"
### Content Details (Full Transcription)
The complete textual content of the image is as follows:
**Your Answer**
* Your answer should briefly **explain what the program is doing wrong**. If it helps you explain your thoughts, you can also say what you would do differently.
* Can be precise: "the formula used to calculate X on line 5 is wrong, it should be..."
* Or high-level: "the program is treating the task as a min-cut graph problem, but it is actually shortest-path... it could be rewritten using Dijkstra's algorithm..."
* **Example answers:**
* The problem description states that numbers which start or end with zeros (such as `010` and `00`) are NOT considered valid numerical palindromes. However, the code above does not take this into account and therefore returns `00` as a valid palindrome.
* The main issue with the provided code is that it only considers direct subordinates when trying to find the k-th officer in the command spreading sequence. However, the problem asks for the order in which officers receive the command, including indirect subordinates. To fix this, we need to traverse the tree of officers and construct the command spreading sequence before finding the k-th element.
* **We are not collecting any data about how you use the website.** Only your submitted answer is recorded.
* **Feel free to use external tools:** pen and paper, a Python IDE, etc!
### Key Observations
1. **Instructional Focus:** The text is purely directive, aiming to standardize and clarify the expected format for a technical answer.
2. **Emphasis on Clarity:** The instructions explicitly ask for a brief explanation of the error, offering both precise (line-specific) and high-level (algorithmic) approaches as valid models.
3. **Use of Examples:** Two detailed "Example answers" are provided. These are not generic; they describe specific, plausible bugs in programming problems (one concerning input validation for palindromes, another concerning tree traversal for a command sequence).
4. **Privacy Reassurance:** A dedicated point clarifies that only the final answer is recorded, not behavioral data from the website interaction.
5. **Tool Encouragement:** The final point explicitly permits and encourages the use of external aids like IDEs or pen and paper.
### Interpretation
This image captures the meta-instructions for a technical assessment or coding challenge platform. Its primary purpose is to reduce ambiguity for the respondent and ensure the submitted answers are useful for evaluation.
* **Pedagogical Intent:** By providing concrete examples of good answers, the instructions teach the user *how* to debug and communicate their reasoning effectively. The examples move beyond "this code is wrong" to explain *why* it's wrong in the context of the problem specification.
* **Problem-Solving Framework:** The guidelines implicitly outline a problem-solving methodology: 1) Identify the discrepancy between the problem statement and the code's behavior, 2) Articulate the nature of the error (precise or conceptual), and 3) Propose a corrective approach.
* **Trust and Transparency:** The statements about data collection and tool usage are designed to build trust with the user. They address potential concerns about surveillance and signal that the assessment values the correctness of the final answer and the reasoning process over the purity of the working environment.
* **Contextual Clues:** The specific examples given (palindromes with leading zeros, tree traversal for command spreading) suggest the platform hosts algorithmic or data structure problems, likely for educational purposes, interview preparation, or competitive programming. The instructions aim to elicit answers that demonstrate true understanding rather than just corrected code.
</details>
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/3.jpg Details</summary>

### Visual Description
## Text Element: Centered Word
### Overview
The image is a simple, minimalist visual containing a single word of text centered on a plain white background. There are no charts, diagrams, data tables, or complex graphical elements present. The image serves as a basic example of text placement.
### Components/Axes
* **Background:** Solid white (#FFFFFF).
* **Text Element:** A single word, "Example".
* **Text Color:** A medium blue shade (approximate hex: #4285F4).
* **Text Position:** Centered both horizontally and vertically within the image frame.
* **Font Style:** Appears to be a standard sans-serif typeface (e.g., Arial, Helvetica).
### Detailed Analysis
* **Text Content:** The only textual information is the word "Example".
* **Language:** English.
* **Transcription:** "Example"
* **Visual Properties:** The text is rendered in a clear, legible font at a moderate size. There are no other labels, titles, legends, or annotations.
### Key Observations
* The image contains no quantitative data, trends, or comparative information.
* The composition is extremely sparse, with the text as the sole focal point.
* There is no additional context provided by surrounding elements.
### Interpretation
This image does not present factual data or complex information for extraction. Its primary function is demonstrative. It visually represents the concept of an "example" through its own content and layout. The centered placement and neutral background focus all attention on the word itself, making it a clear, unambiguous placeholder or label. The choice of blue for the text is a common, neutral color often used in default settings or templates, reinforcing its role as a generic illustrative element rather than a carrier of specific meaning.
</details>
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/4.jpg Details</summary>

### Visual Description
## Technical Document: Problem Specification
### Overview
The image displays a single page from a technical document, specifically a programming problem specification. The page is titled "1. Problem Specification" and contains a detailed description of a programming task related to counting votes in a contested school board election. The document is structured with a main heading, a bordered specification box, and explanatory annotations on the right side.
### Components/Axes
The document is composed of the following visual and textual components:
1. **Main Heading:** "1. Problem Specification" in a large, bold, sans-serif font at the top-left.
2. **Specification Box:** A rectangular box with a thin black border containing the core problem description. It has its own internal heading "Specification" in bold.
3. **Explanatory Annotations:** Text placed to the right of the specification box, providing meta-instructions for the reader.
* "Each page starts with a specification of what the program should do."
* "Begin by carefully reading the problem specification." (This line is in blue text).
4. **Footer Note:** A line of italicized text at the very bottom of the specification box.
### Detailed Analysis / Content Details
All text within the image is in English. Below is a precise transcription of the textual content, preserving structure and formatting.
**From the Specification Box:**
**Specification**
The recent schoolboard elections were hotly contested: a proposal to swap school start times for elementary and high school students, a controversial new dress code proposal that bans athletic clothes in school, and a proposal to raise real-estate taxes to pay for a new football practice facility, and the list goes on and on. It is now hours after the polls have closed and a winner has yet to emerge!
In their desperation, the election officials turn to you and ask you to write a program to count the vote!
-----Input-----
The input consists of a single test case, which is a list of votes cast. Each line in the input contains the name of a candidate for whom a vote was cast. A name may consist of multiple words separated by spaces. Words contain only alphanumeric characters and/or punctuation characters. There will be at least 2 votes on the list. The list of votes ends with a single line containing the characters "***". This line should not be counted. There can be up to $100000$ valid votes.
-----Output-----
If a candidate obtained a simple or absolute majority of all votes cast (that is, more than any other candidate), output the name of this candidate. If no candidate obtained a simple majority, output: "Runoff!" (don't forget to include the exclamation mark).
-----Examples-----
Sample Input:
Penny Franklin
Barbara Skinner
Connie Froggatt
Joseph Ivers
Connie Froggatt
Penny Franklin
Connie Froggatt
Connie Froggatt
Connie Froggatt
Barbara Skinner
Barbara Skinner
***
Sample Output:
Connie Froggatt
*Remember: all input/output of the program should be handled through stdin and stdout.*
**From the Right-Side Annotations:**
* Each page starts with a specification of what the program should do.
* Begin by carefully reading the problem specification. (in blue)
### Key Observations
1. **Document Structure:** The page uses a clear two-column layout for presentation: the formal specification on the left and pedagogical guidance on the right.
2. **Problem Type:** This is a classic "majority vote" or "plurality" algorithm problem, common in introductory programming courses or competitions.
3. **Input Format:** The input is a stream of names (strings) terminated by a sentinel value (`***`). The constraint is up to 100,000 votes.
4. **Output Logic:** Although the specification uses the word "majority", it defines the winning condition parenthetically as "more than any other candidate", i.e. a plurality. The program must print the name of the unique top vote-getter, or output "Runoff!" (including the exclamation mark) when no candidate has strictly more votes than every other.
5. **Sample Case:** The provided example has 11 valid votes (excluding the `***` line). "Connie Froggatt" receives 5 votes, which is 5/11 ≈ 45.5%: not an absolute majority (>50%), but more than any other candidate. The sample output "Connie Froggatt" therefore confirms the plurality reading, a critical detail for implementation.
6. **Technical Requirement:** The footer note emphasizes that standard input/output (`stdin`/`stdout`) must be used, which is typical for automated judging systems.
### Interpretation
This document defines a programming challenge that tests a candidate's ability to process streaming text data, count frequencies, and implement conditional logic based on a specific voting rule.
The core task is to parse a list of names, tally the votes for each unique candidate, and then apply the winning condition. The key investigative point is the precise definition of "majority." The specification clarifies it as "more than any other candidate," which is the definition of a **plurality** (or "first-past-the-post") system, not an absolute majority (>50%). The sample output confirms this interpretation, as Connie Froggatt wins with a plurality (5 votes) despite not having over 50% of the total (11 votes).
The problem includes realistic constraints (handling multi-word names, a large number of votes up to 100,000) and a clear termination condition (`***`). The "Runoff!" output case would occur in a scenario where the top two or more candidates are tied for the highest number of votes, or if the voting rule were changed to require >50%. Given the stated rule, a tie for first place would likely also result in "Runoff!" as no single candidate has "more than any other."
The annotations on the right suggest this page is part of a tutorial or guided assignment, instructing the learner on how to approach such problem statements methodically. The blue, emphasized text highlights the first and most crucial step: careful reading to avoid misinterpretation of the requirements—a lesson underscored by the potential confusion between "plurality" and "majority."
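To make the plurality reading concrete, here is a minimal sketch of a program consistent with the specification and sample above; the `count_votes` helper is our own framing, and a real submission would read the same lines from stdin via `input()` and print the result, per the footer note.

```python
from collections import Counter

def count_votes(lines):
    tally = Counter()
    for name in lines:
        if name == '***':  # end-of-input sentinel; not counted
            break
        tally[name] += 1
    top = max(tally.values())
    leaders = [name for name, count in tally.items() if count == top]
    # Plurality rule: a unique top vote-getter wins; a tie means a runoff.
    return leaders[0] if len(leaders) == 1 else 'Runoff!'

# The sample input from the specification box.
sample = [
    "Penny Franklin", "Barbara Skinner", "Connie Froggatt", "Joseph Ivers",
    "Connie Froggatt", "Penny Franklin", "Connie Froggatt", "Connie Froggatt",
    "Connie Froggatt", "Barbara Skinner", "Barbara Skinner", "***",
]
print(count_votes(sample))  # Connie Froggatt
```

On the sample, Connie Froggatt's 5 votes are the unique maximum, so her name is printed even though 5/11 is below 50%.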
</details>
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/5.jpg Details</summary>

### Visual Description
## Screenshot: Incorrect Program Code Snippet
### Overview
The image displays a presentation slide or document page titled "2. Incorrect Program." It features a code snippet within a light gray box on the left, accompanied by explanatory text on the right. The content appears to be part of an instructional or debugging exercise, presenting a Python program labeled as "incorrect" for the viewer to analyze.
### Components/Axes
1. **Main Title (Top Left):** "2. Incorrect Program"
2. **Code Block Container:**
* **Header:** "Incorrect Program" (bold, within the gray box).
* **Content:** A block of Python code with syntax highlighting (keywords like `from`, `import`, `while`, `if`, `else` are in a different color, likely blue or purple, though exact color is not specified).
3. **Explanatory Text (Right Side):** Three separate paragraphs of instructional text.
4. **Layout:** The code block is positioned on the left side of the frame. The explanatory text is aligned to the right of the code block, starting roughly at the vertical midpoint of the code box.
### Detailed Analysis
**1. Code Transcription (Python):**
```python
from collections import defaultdict
votes = defaultdict(int)
candidate = input()
while candidate != '***':
    votes[candidate] += 1
    candidate = input()
total_votes = sum(votes.values())
max_votes = max(votes.values())
winner = [name for name, count in votes.items() if count == max_votes]
if len(winner) == 1 and max_votes > total_votes // 2:
    print(winner[0])
else:
    print('Runoff!')
```
**2. Explanatory Text Transcription:**
* **Paragraph 1:** "Next, you will be shown the incorrect program."
* **Paragraph 2 (Tip):** "Tip: If you are struggling with debugging the program, **try running it on your machine!**" (The phrase "try running it on your machine!" is highlighted in blue).
* **Paragraph 3 (Note):** "Note: the programs handle inputs through “input()”, and outputs through “print()”."
### Key Observations
* **Purpose:** The code implements a simple voting system. It reads candidate names via `input()` until the sentinel value `'***'` is entered, tallies votes, and then determines a winner.
* **Stated Incorrectness:** The program is explicitly labeled as "incorrect," implying it contains a logical error or does not meet specified requirements, though the specific flaw is not stated in the image.
* **Potential Logic:** The code only declares a winner when there is a unique top candidate (`len(winner) == 1`) who also holds an absolute majority (`max_votes > total_votes // 2`); otherwise it prints "Runoff!". If the task defines the winner by plurality (most votes, not necessarily more than half), this absolute-majority check is the most likely flaw.
* **User Guidance:** The text actively encourages the viewer to engage with the code by running it, suggesting a hands-on debugging exercise.
### Interpretation
This image is a pedagogical tool. It presents a concrete, self-contained programming problem. The "incorrect" label frames the code not as a solution, but as a puzzle to be solved. The accompanying text guides the learner's approach: first, understand the code's intended function (a voting tally), then use practical tools (running the program) to discover its flaw.
The code's structure suggests it aims to implement a "majority wins" system. Tie handling is sound: if several candidates share the highest vote count, `len(winner) == 1` fails and "Runoff!" is printed, which is reasonable under either reading of the rules. The suspect condition is `max_votes > total_votes // 2`, which demands an absolute majority (more than half of all votes cast). Under a plurality rule, where the candidate with the most votes wins outright, any input whose leader has the most votes but no more than half of the total would incorrectly produce "Runoff!". The image successfully sets up a debugging scenario by providing the code and prompting investigation.
</details>
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/6.jpg Details</summary>

### Visual Description
## Screenshot: Programming Error Message
### Overview
The image is a screenshot displaying a programming error message, likely from an automated testing or grading system. It shows a failed test case where a program's output does not match the expected output. The layout consists of a main error box on the left and explanatory text on the right.
### Components
The image is divided into two primary regions:
1. **Left Region (Error Box):** A bordered box containing the raw error details.
2. **Right Region (Explanatory Text):** A text block explaining the components of the error message.
**Left Region - Error Box Components:**
* **Header:** The word "Error" in bold.
* **Tip Section:** A line of italicized text providing execution advice.
* **Input Section:** A labeled list of names provided as input to the program.
* **Program Output Section:** A labeled section showing the output generated by the user's program.
* **Expected Output Section:** A labeled section showing the correct, expected output.
**Right Region - Explanatory Text Components:**
* A descriptive sentence.
* A bulleted list titled "It contains:".
* A final tip.
### Content Details
**1. Text from the Left Region (Error Box):**
* **Header:** `Error`
* **Tip:** `Tip: If you're executing the code on your machine, copy-paste the input into a file and pipe it to the program with 'python3 program.py < input.txt'`
* **Input Section:**
* Label: `Input`
* Data (List of names):
* Penny Franklin
* Marti Graham
* Connie Froggatt
* Joseph Ivers
* Connie Froggatt
* Penny Franklin
* Connie Froggatt
* Bruce Stanger
* Connie Froggatt
* Barbara Skinner
* Barbara Skinner
* `***` (This appears to be a delimiter or end-of-input marker)
* **Program Output Section:**
* Label: `Program Output`
* Value: `Runoff!`
* **Expected Output Section:**
* Label: `Expected Output`
* Value: `Connie Froggatt`
**2. Text from the Right Region (Explanatory Text):**
* `The error message shows you the test that the program failed on.`
* `It contains:`
* `An example input`
* `The program's incorrect output`
* `The expected output`
* `Tip: try copy-pasting the input to a file and piping it to the program.`
### Key Observations
1. **Clear Failure:** The program output (`Runoff!`) is completely different from the expected output (`Connie Froggatt`), indicating a fundamental logic error, not a minor formatting issue.
2. **Input Pattern:** "Connie Froggatt" appears 4 times, "Penny Franklin" and "Barbara Skinner" each appear twice, and the remaining names appear once. The expected output is the name that appears most frequently.
3. **Semantic Mismatch:** "Runoff!" is itself a legitimate output under the specification (printed when no candidate wins outright), so the failure is not a formatting issue but a logic error in the winning condition: the program declares a runoff on an input that has a clear winner.
4. **Instructional Design:** The error message is designed to be self-contained and helpful, providing the exact input that caused the failure and a tip for local debugging.
### Interpretation
This error message is from a programming exercise, likely related to an algorithm problem such as finding the most frequent element in a list or implementing a specific voting system (e.g., a "runoff" election). The test case checks if the program can correctly identify "Connie Froggatt" as the winner or most frequent name from the provided input list.
The program's incorrect output of "Runoff!" indicates one of several possible issues:
* The program is not reading or processing the input list correctly.
* The program is printing a debug statement or the name of the intended algorithm instead of performing the required computation.
* There is a control flow error where the program exits or prints prematurely.
The provided tip about using input redirection (`<`) is a standard method for testing command-line programs with file-based input, confirming this is a console-based programming task. The structure of the error message—showing input, actual output, and expected output—is a classic format used in automated code testing platforms to help developers diagnose failures efficiently.
</details>
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/7.jpg Details</summary>

### Visual Description
## Screenshot: Model Explanations Interface
### Overview
The image is a screenshot of a digital document or application interface, likely from a debugging or code analysis tool. It presents a section titled "4. Model Explanations" which contains a bordered content box with two example explanations generated by an AI model, alongside explanatory text for the user.
### Components/Areas
The image is divided into two primary regions:
1. **Left Region (Main Content Box):** A white box with a light gray border containing the core content.
* **Box Title:** "Model Explanations" (bold, top-left inside the box).
* **Content:** Two numbered explanations ("Explanation 1" and "Explanation 2") describing a potential issue in code logic related to determining a majority vote.
2. **Right Region (User Guidance):** Text placed to the right of the main box, providing context and warnings about the model-generated explanations.
### Detailed Analysis / Content Details
**1. Main Heading (Top of Image):**
* Text: `4. Model Explanations`
**2. Content Box (Left Region):**
* **Title:** `Model Explanations`
* **Explanation 1:**
* **Label:** `Explanation 1`
* **Transcribed Text:** `The concise explanation of the issue is: The code checks if the maximum votes are strictly greater than half of the total votes to determine if there's a majority. However, it should check if the maximum votes are greater than or equal to half of the total votes.`
* **Explanation 2:**
* **Label:** `Explanation 2`
* **Transcribed Text:** "The following is a concise explanation of the issue: The code checks for a simple majority by comparing `max_votes` to `total_votes / 2`, which will return the floor division result. This means that if `max_votes` is equal to half of the `total_votes`, it will not be considered a simple majority. Instead, the condition should check if `max_votes` is strictly greater than half of the `total_votes`."
**3. User Guidance Text (Right Region):**
* **First Paragraph:** `To help you get started with the debugging, each page lists two example explanations.`
* **Highlighted Warning (Blue Text):** `These explanations are generated by the model itself. They might be completely wrong. You don't have to use them.`
* **Final Analogy:** `Think of these like CoPilot suggestions.`
### Key Observations
* **Purpose:** The interface is designed to assist a user in debugging code by providing AI-generated hypotheses about the nature of a bug.
* **Content Focus:** Both explanations address the same logical flaw in code: an incorrect comparison operator (`>` vs. `>=`) when evaluating a majority condition.
* **Critical Disclaimer:** The most visually prominent text (in blue) is a strong disclaimer about the potential inaccuracy of the model's output, explicitly stating the user is not obligated to use them.
* **Analogy:** The guidance frames the model's role as a suggestive tool, similar to GitHub's CoPilot, rather than an authoritative source.
### Interpretation
This screenshot depicts a human-AI collaborative debugging environment. The system's design acknowledges the fallibility of its AI component by providing clear warnings and framing its output as optional "suggestions." The two explanations, while slightly different in phrasing and technical detail (Explanation 2 mentions specific code variables like `max_votes` and `total_votes / 2`), converge on the same core logical error. This suggests the underlying model is consistent in its identification of this specific type of off-by-one or boundary condition bug. The interface encourages critical evaluation—the user is expected to read, understand, and verify the model's explanations against their own code and logic, using them as a starting point rather than a final answer. The presence of "each page lists two example explanations" implies this is part of a larger, paginated debugging report or tutorial.
</details>
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/8.jpg Details</summary>

### Visual Description
## Screenshot: Technical Assessment Answer Form
### Overview
The image is a screenshot of a digital document or webpage section, specifically step "5. Answer Form." It displays an embedded Google Form for user submission alongside explanatory text. The primary purpose is to instruct a user to submit a self-contained explanation of a program's error.
### Components/Axes
The image is divided into two main spatial regions:
1. **Left Region (Approx. 60% width):** Contains the embedded Google Form.
2. **Right Region (Approx. 40% width):** Contains instructional text.
**Left Region - Google Form Elements:**
* **Form Header:** "Your Explanation" (bold, top-left of form frame).
* **Redacted Block:** A black rectangle with white text: "REDACTED FOR ANONYMITY".
* **Sign-in Prompt:** "Sign in to Google to save your progress. Learn more" (blue text, below redacted block).
* **Required Question Indicator:** "* Indicates required question" (red text).
* **Form Field Label:** "Your Explanation *" (asterisk denotes required field).
* **Text Input Field:** Contains placeholder text "Your answer".
* **Action Buttons:**
* "Submit" (purple button, bottom-left).
* "Clear form" (text link, bottom-right).
**Right Region - Instructional Text:**
* **Paragraph 1:** "Finally, each page contains an embedded Google Form. No login is required."
* **Instruction (Blue Text):** "Submit *your explanation* of what the *program is doing wrong*."
* **Paragraph 2:** "Your answer must be self-contained; it should *not* be of the form 'Just like the first model explanation describes, the issue with the code is that …'"
### Content Details
* **Primary Language:** English.
* **Text Transcription (Complete):**
* **Title:** "5. Answer Form"
* **Form Header:** "Your Explanation"
* **Redacted Text:** "REDACTED FOR ANONYMITY"
* **Form UI Text:** "Sign in to Google to save your progress. Learn more", "* Indicates required question", "Your Explanation *", "Your answer", "Submit", "Clear form"
* **Instructional Text:** "Finally, each page contains an embedded Google Form. No login is required.", "Submit *your explanation* of what the *program is doing wrong*.", "Your answer must be self-contained; it should *not* be of the form 'Just like the first model explanation describes, the issue with the code is that …'"
### Key Observations
1. **Anonymity Measure:** A significant portion of the form's header is explicitly "REDACTED FOR ANONYMITY," suggesting the form or its context is part of a process where participant or content identification is deliberately obscured.
2. **Explicit Submission Instructions:** The instructions are highly specific. The answer must be **self-contained** and must **not** reference or defer to a prior "model explanation." This indicates the assessment values independent analysis.
3. **Low Barrier to Entry:** The form explicitly states "No login is required," lowering the friction for submission.
4. **Visual Hierarchy:** The key instruction ("Submit your explanation...") is highlighted in blue and uses italics for emphasis on "your explanation" and "program is doing wrong."
### Interpretation
This image captures a component of a technical evaluation or exercise, likely related to code analysis or debugging. The structure suggests a multi-step process ("5. Answer Form") where the participant has reviewed material (a program, code snippet, or model output) and is now required to articulate their independent diagnosis of its flaw.
The **redaction for anonymity** is a critical detail. It implies the form submissions might be reviewed by multiple parties or used in a context where blinding is necessary to prevent bias—common in research studies, peer reviews, or competitive assessments.
The **strict instruction against referencing a "first model explanation"** is the most revealing element. It frames the task as a test of the user's own reasoning capability, not their ability to summarize or agree with a provided analysis. This is a common design in assessments meant to gauge genuine understanding versus mimicry. The requirement for a "self-contained" answer ensures the evaluator receives the participant's raw, unfiltered thought process.
In essence, this is not just a form; it is the final step in a structured analytical task designed to elicit and capture an independent technical judgment.
</details>
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/9.jpg Details</summary>

### Visual Description
## Text Document: Study Tips
### Overview
The image displays a text-based document titled "Study Tips." It consists of a thank you message followed by a bulleted list of six main recommendations for individuals engaged in a study or task-based activity, likely related to programming or technical problem-solving. The text is primarily in English, with some phrases highlighted in blue. Two smiling face emojis (😊) are present.
### Components/Axes
This is not a chart or diagram with axes. The components are textual elements arranged in a hierarchical list:
- **Title**: "Study Tips"
- **Introductory Sentence**: "We are very grateful for your help! 😊"
- **Bulleted List**: Six main points, with the fifth point containing three sub-bullets.
- **Formatting**: Certain key phrases within the bullet points are displayed in a blue color, indicating emphasis or possibly hyperlinks in the original context.
- **Visual Elements**: Two 😊 emojis and one black rectangular redaction box obscuring a word or name in the final bullet point.
### Detailed Analysis / Content Details
**Full Text Transcription:**
**Title:** Study Tips
**Introductory Line:** We are very grateful for your help! 😊
**Bullet Points:**
1. **Make sure you understand the task first!** The programs have subtle logic errors, not just simple compiler errors.
2. Try to write **clear and concise** explanations, with proper grammar and punctuation.
3. Feel free to **use (or not use) the model explanations** when writing your answers; but make sure your answer is self-contained!
4. The tasks vary in difficulty. Feel free to **allocate your time as you see fit**; we are not measuring how quickly you complete the tasks or anything like that!
5. Feel free to use **external tools**:
* Use pen and paper or a whiteboard to help you reason about the task at hand.
* Use a Python IDE to execute and debug the code.
* Search online for help.
6. **Have a question? Ask [REDACTED] before moving on with the study! 😊**
**Language:** All text is in English.
**Redaction:** A black box obscures a word or name in the sixth bullet point, following the word "Ask".
### Key Observations
1. **Emphasis on Understanding:** The first and most prominent tip stresses comprehending the task's logic before attempting to solve it, highlighting that errors are subtle and logical, not just syntactic.
2. **Communication Quality:** There is a clear directive to write explanations that are both clear and concise, with attention to grammatical correctness.
3. **Autonomy and Self-Pacing:** The document explicitly grants autonomy to the learner regarding the use of provided model explanations and the allocation of time, stating that speed is not being measured.
4. **Encouragement of Resource Use:** The tips actively encourage the use of external tools and resources, including physical tools (pen/paper), software tools (Python IDE), and online research.
5. **Support Channel:** The final point directs the user to ask questions to a specific entity (whose name is redacted) before proceeding, indicating a supported learning environment.
6. **Tone:** The tone is supportive and grateful, using positive language ("grateful," "feel free") and friendly emojis.
### Interpretation
This document serves as a guideline for participants in a structured learning or evaluation exercise, most likely involving debugging or analyzing code. A close reading reveals several underlying principles:
* **The Primacy of Comprehension:** The instruction to "understand the task first" is a foundational rule. It suggests that the core challenge is analytical and logical, not merely technical. The mention of "subtle logic errors" implies the tasks are designed to test deep understanding, not surface-level knowledge.
* **Process Over Speed:** By decoupling task completion from time measurement and encouraging deliberate resource use, the guidelines promote a quality-focused, methodical approach. The goal is correct reasoning and solution-finding, not rapid output.
* **Integrated Learning:** The permission to use (or ignore) model explanations and external tools frames the activity as an open-book, real-world problem-solving scenario. It values the ability to synthesize information from various sources (provided materials, personal notes, IDEs, the internet) into a coherent, self-contained answer.
* **Structured Support:** The redacted name in the final point is crucial. It indicates that while autonomy is encouraged, there is a designated point of contact for clarification. This creates a safety net, ensuring learners do not proceed based on fundamental misunderstandings. The redaction itself is an interesting artifact, likely done to anonymize the document for broader distribution.
In essence, the text outlines a pedagogical philosophy that values deep understanding, clear communication, independent problem-solving with resourcefulness, and a supported, non-pressured learning environment.
</details>
<details>
<summary>extracted/2306.09896v1/figures/appendix/instructions/10.jpg Details</summary>

### Visual Description
## FAQ Section: User Study Information
### Overview
The image displays a Frequently Asked Questions (FAQ) section from what appears to be a website or digital document related to a user study. The section is designed to inform participants about data collection practices, the study's purpose, and the role of model-generated explanations. The text is presented in a clean, minimalist style with a white background and dark gray text. Certain phrases are highlighted in blue, likely indicating hyperlinks or key terms.
### Components/Axes
The content is structured as a hierarchical list:
* **Main Header:** "FAQ"
* **List Items:** Three primary questions, each followed by a sub-bullet containing the answer.
* **Text Formatting:** Standard bullet points (•) for main questions and hollow circle sub-bullets (◦) for answers. Key phrases within the answers are highlighted in blue.
### Detailed Analysis
The FAQ contains three question-and-answer pairs:
1. **Question:** "Are you collecting data as I visit the website?"
* **Answer:** "No - none at all. Only your final answers are recorded."
* **Note:** The word "No" is highlighted in blue.
2. **Question:** "What is the point of the study?"
* **Answer:** "To investigate how much better the models are at fixing code when given human feedback, instead of having to debug the code themselves."
* **Note:** The phrase "fixing code when given human feedback" is highlighted in blue.
3. **Question:** "Are you evaluating how useful the model explanations were to me?"
* **Answer:** "No - they are just there to help you get started with the debugging. We only care about your final answer."
* **Note:** The word "No" is highlighted in blue.
### Key Observations
* **Consistent Structure:** Each FAQ item follows a clear Question -> Answer format using nested bullet points.
* **Emphasis on Privacy:** The first and third answers begin with a highlighted "No," strongly emphasizing what data is *not* being collected (browsing behavior, usefulness ratings of explanations).
* **Study Focus:** The core research objective is explicitly stated: comparing model performance in code-fixing scenarios with and without human feedback.
* **Role of Explanations:** Model-generated explanations are framed solely as a tool to aid the participant's debugging process, not as an object of evaluation themselves.
* **Data Collection Scope:** The only data explicitly mentioned as being collected are the participants' "final answers."
### Interpretation
This FAQ section serves as a crucial informed consent and transparency tool for a human-computer interaction (HCI) or AI evaluation study. Its primary purpose is to manage participant expectations and address common concerns about surveillance and data usage.
The content reveals several key aspects of the study's design:
1. **Controlled Variable:** The independent variable being manipulated is the presence or absence of "human feedback" during a code-fixing task.
2. **Dependent Variable:** The primary metric for success is the quality or correctness of the participant's "final answer" (the fixed code).
3. **Methodology:** The study likely involves a platform where participants debug code, with some receiving AI-generated suggestions or explanations and others not. The FAQ assures participants that their interaction data (clicks, time spent) is not being analyzed, only their output.
4. **Ethical Considerations:** The explicit statements about data collection are designed to build trust and comply with ethical research standards, reassuring participants that their privacy is protected and the study's scope is limited.
The highlighted blue text draws immediate attention to the most critical takeaways: the denial of broad data collection and the specific research question about "fixing code when given human feedback." This design efficiently guides the reader to the most important information.
</details>
Figure 10: The instructions shown to participants in the human study.
## Appendix C Human Experiment (Quantitative Analysis): Results Per Task
In the table below, we give a complete breakdown of the quantitative results presented in Section 4.3. Note that each program is associated with four different pieces of feedback: two sampled from GPT-4, and two given by our human participants. Each cell is the number of repair candidates (out of 25) that passed all the unit tests. See Section 4.3 for details, as well as Appendix B for the instructions given to participants.
| Task | Difficulty | Program | GPT-4 feedback 1 | GPT-4 feedback 2 | Human feedback 1 | Human feedback 2 |
| --- | --- | --- | --- | --- | --- | --- |
| 2106 | interview | A | 7 | 10 | 10 | 0 |
| 2106 | interview | B | 0 | 2 | 20 | 16 |
| 2673 | interview | A | 4 | 7 | 17 | 24 |
| 2673 | interview | B | 3 | 25 | 25 | 25 |
| 2923 | interview | A | 0 | 0 | 0 | 0 |
| 2923 | interview | B | 0 | 0 | 0 | 0 |
| 3070 | competition | A | 0 | 0 | 0 | 0 |
| 3070 | competition | B | 3 | 0 | 5 | 0 |
| 3286 | competition | A | 2 | 6 | 10 | 25 |
| 3286 | competition | B | 0 | 0 | 0 | 4 |
| 3754 | competition | A | 0 | 0 | 0 | 0 |
| 3754 | competition | B | 0 | 0 | 0 | 0 |
| 4182 | introductory | A | 25 | 25 | 25 | 24 |
| 4182 | introductory | B | 25 | 0 | 25 | 25 |
| 4195 | introductory | A | 25 | 3 | 24 | 23 |
| 4195 | introductory | B | 23 | 25 | 25 | 25 |
| 4281 | introductory | A | 0 | 4 | 0 | 0 |
| 4281 | introductory | B | 0 | 0 | 0 | 0 |
| 4333 | introductory | A | 25 | 0 | 25 | 0 |
| 4333 | introductory | B | 23 | 24 | 24 | 25 |
| 4347 | introductory | A | 0 | 0 | 7 | 25 |
| 4347 | introductory | B | 0 | 0 | 25 | 25 |
| 4426 | introductory | A | 25 | 25 | 25 | 25 |
| 4426 | introductory | B | 25 | 25 | 25 | 25 |
| 4450 | introductory | A | 0 | 0 | 0 | 0 |
| 4450 | introductory | B | 24 | 0 | 22 | 24 |
| 4507 | introductory | A | 0 | 0 | 0 | 0 |
| 4507 | introductory | B | 0 | 0 | 1 | 0 |
| 4514 | introductory | A | 15 | 21 | 1 | 16 |
| 4514 | introductory | B | 0 | 0 | 25 | 0 |
| 4704 | introductory | A | 0 | 25 | 0 | 25 |
| 4704 | introductory | B | 25 | 25 | 24 | 23 |
| 4741 | introductory | A | 25 | 25 | 25 | 25 |
| 4741 | introductory | B | 25 | 25 | 25 | 25 |
| 4855 | introductory | A | 0 | 1 | 17 | 25 |
| 4855 | introductory | B | 0 | 2 | 3 | 23 |
| 4873 | introductory | A | 0 | 0 | 0 | 0 |
| 4873 | introductory | B | 0 | 0 | 0 | 18 |
| 4952 | introductory | A | 0 | 0 | 2 | 25 |
| 4952 | introductory | B | 24 | 8 | 24 | 21 |
## Appendix D Human Experiment (Qualitative Analysis): Examples
In this appendix, we give examples of each category used to classify the responses in the qualitative analysis of Section 4.3. For each category, we give one example from the human participants and one from GPT-4 (when possible). Each example contains: the textual specification of the task; the incorrect program; the error message (in the format it was shown to participants); the feedback in question; and a short comment motivating its classification.
### D.1 Category (a): obviously incorrect feedback
#### D.1.1 GPT-4: Example
### SPECIFICATION
The recent schoolboard elections were hotly contested: a proposal to swap school start times for elementary and high school students, a controversial new dress code proposal that bans athletic clothes in school, and a proposal to raise real-estate taxes to pay for a new football practice facility, and the list goes on and on. It is now hours after the polls have closed and a winner has yet to emerge!
In their desperation, the election officials turn to you and ask you to write a program to count the vote!
----- Input -----
The input consists of a single test case, which is a list of votes cast. Each line in the input contains the name of a candidate for whom a vote was cast. A name may consist of multiple words, separated by spaces. Words contain letters or hyphens, but no other punctuation characters. There will be at least votes on the list. The list of votes ends with a single line containing the characters ***. This line should not be counted. There can be up to valid votes.
----- Output -----
If a candidate obtained a simple or absolute majority of all votes cast (that is, more than any other candidate), output the name of this candidate! If no candidate obtained a simple majority, output: "Runoff!" (don't forget to include the exclamation mark!)
----- Examples -----
Sample Input:
Penny Franklin
Marti Graham
Connie Froggatt
Joseph Ivers
Connie Froggatt
Penny Franklin
Connie Froggatt
Bruce Stanger
Connie Froggatt
Barbara Skinner
Barbara Skinner
***
Sample Output:
Connie Froggatt
### INCORRECT PROGRAM
from collections import defaultdict

votes = defaultdict(int)
candidate = input()
while candidate != "***":
    votes[candidate] += 1
    candidate = input()
total_votes = sum(votes.values())
max_votes = max(votes.values())
winner = [name for name, count in votes.items() if count == max_votes]
if len(winner) == 1 and max_votes > total_votes // 2:
    print(winner[0])
else:
    print("Runoff!")
### ERROR MESSAGE
=== Input ===
Penny Franklin
Marti Graham
Connie Froggatt
Joseph Ivers
Connie Froggatt
Penny Franklin
Connie Froggatt
Bruce Stanger
Connie Froggatt
Barbara Skinner
Barbara Skinner
***
=== Program Output ===
Runoff!
=== Expected Output ===
Connie Froggatt
### FEEDBACK
The concise explanation of the issue is: The code checks if the maximum votes are strictly greater than half of the total votes to determine if there's a majority. However, it should check if the maximum votes are greater than or equal to half of the total votes.
### COMMENT
The issue with the code is not that the ">" should have been a ">=", but rather that the program demands an absolute majority, whereas the specification defines a simple majority as merely receiving more votes than any other candidate.
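To make the comment concrete, the following is a minimal sketch of the plurality logic the specification actually asks for (our illustration, not taken from the paper; `count_votes` and its list-of-lines input are illustrative names):

```python
from collections import defaultdict

def count_votes(lines):
    # Tally names until the "***" end-of-input marker.
    votes = defaultdict(int)
    for name in lines:
        if name == "***":
            break
        votes[name] += 1
    max_votes = max(votes.values())
    winners = [n for n, c in votes.items() if c == max_votes]
    # A unique plurality winner suffices; no absolute-majority check is needed.
    return winners[0] if len(winners) == 1 else "Runoff!"
```

On the failing test above, this returns "Connie Froggatt" (4 of 11 votes, more than any other candidate) rather than "Runoff!".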
#### D.1.2 Human: Example
### SPECIFICATION
Snuke and Raccoon have a heap of N cards. The i-th card from the top has the integer a_i written on it.
They will share these cards.
First, Snuke will take some number of cards from the top of the heap, then Raccoon will take all the remaining cards.
Here, both Snuke and Raccoon have to take at least one card.
Let the sum of the integers on Snuke's cards and Raccoon's cards be x and y, respectively.
They would like to minimize |x - y|.
Find the minimum possible value of |x - y|.
----- Constraints -----
- 2 \leq N \leq 2 \times 10^5
- -10^{9} \leq a_i \leq 10^{9}
- a_i is an integer.
----- Input -----
Input is given from Standard Input in the following format:
N
a_1 a_2 ... a_{N}
----- Output -----
Print the answer.
----- Sample Input -----
6
1 2 3 4 5 6
----- Sample Output -----
1
If Snuke takes four cards from the top, and Raccoon takes the remaining two cards, x = 10, y = 11, and thus |x - y| = 1. This is the minimum possible value.
### INCORRECT PROGRAM
def main():
    n = int(input())
    a = list(map(int, input().split()))
    Snuke_sum = 0
    Raccoon_sum = sum(a)
    min_diff = abs(Raccoon_sum)
    for i in range(n - 1):
        Snuke_sum += a[i]
        Raccoon_sum -= a[i]
        diff = abs(Snuke_sum - Raccoon_sum)
        min_diff = min(min_diff, diff)
    print(min_diff)

main()
### ERROR MESSAGE
=== Input ===
2
10 -10
=== Program Output ===
0
=== Expected Output ===
20
### FEEDBACK
The code initializes the result `min_diff` to `abs(Raccoon_sum)`, corresponding to the case where Raccoon takes all the cards; however, this is incorrect because of the constraint "Snuke and Raccoon have to take at least one card". Instead, just setting `min_diff` to negative infinity initially would fix it.
### COMMENT
This human participant was on the right track, but note that setting `min_diff` to *negative* infinity would cause it to never be updated inside the for loop, and so the program would in fact always return negative infinity.
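For contrast, here is a sketch of an initialization that does behave correctly (our illustration, not the participant's answer, with an illustrative function name): starting the running minimum at *positive* infinity guarantees the first valid split updates it.

```python
def min_abs_diff(a):
    # Sweep the split point: Snuke takes a[:i+1], Raccoon takes the rest;
    # both parts are nonempty because i stops at len(a) - 2.
    total = sum(a)
    snuke = 0
    best = float("inf")  # positive infinity: the first split always lowers it
    for i in range(len(a) - 1):
        snuke += a[i]
        best = min(best, abs(snuke - (total - snuke)))
    return best
```

On the failing test above (`a = [10, -10]`), the only split gives |10 - (-10)| = 20, as expected.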
### D.2 Category (b): feedback explicitly suggests a minor change to the code
#### D.2.1 GPT-4: Example
### SPECIFICATION
There is a square in the xy-plane. The coordinates of its four vertices are (x_1, y_1), (x_2, y_2), (x_3, y_3) and (x_4, y_4) in counter-clockwise order.
(Assume that the positive x-axis points right, and the positive y-axis points up.)
Takahashi remembers (x_1, y_1) and (x_2, y_2), but he has forgotten (x_3, y_3) and (x_4, y_4).
Given x_1, x_2, y_1, y_2, restore x_3, y_3, x_4, y_4. It can be shown that x_3, y_3, x_4 and y_4 uniquely exist and have integer values.
----- Constraints -----
- |x_1|, |y_1|, |x_2|, |y_2| \leq 100
- (x_1, y_1) != (x_2, y_2)
- All values in input are integers.
----- Input -----
Input is given from Standard Input in the following format:
x_1 y_1 x_2 y_2
----- Output -----
Print x_3, y_3, x_4 and y_4 as integers, in this order.
----- Sample Input -----
0 0 0 1
----- Sample Output -----
-1 1 -1 0
(0,0), (0,1), (-1,1), (-1,0) are the four vertices of a square in counter-clockwise order.
Note that (x_3, y_3) = (1,1), (x_4, y_4) = (1,0) is not accepted, as the vertices are in clockwise order.
### INCORRECT PROGRAM
x1, y1, x2, y2 = map(int, input().split())
x3 = x2 + (y2 - y1)
y3 = y2 - (x2 - x1)
x4 = x1 + (y2 - y1)
y4 = y1 - (x2 - x1)
print(x3, y3, x4, y4)
### ERROR MESSAGE
=== Input ===
0 0 0 1
=== Program Output ===
1 1 1 0
=== Expected Output ===
-1 1 -1 0
### FEEDBACK
The issue with the code is that it doesn't consider the counter-clockwise order of the vertices. To fix this, we need to swap the signs of (y2 - y1) and (x2 - x1) in the calculations.
### COMMENT
Swapping the listed signs is a concrete suggestion of a way to fix the code. Since the overall structure of the code would remain the same, we count this as a minor change.
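For reference, the sign swap the feedback proposes amounts to rotating the edge vector by a quarter turn counter-clockwise instead of clockwise; a minimal sketch (our own, with an illustrative function name):

```python
def restore_square(x1, y1, x2, y2):
    # Counter-clockwise quarter turn of the edge vector: (dx, dy) -> (-dy, dx).
    dx, dy = x2 - x1, y2 - y1
    x3, y3 = x2 - dy, y2 + dx
    x4, y4 = x1 - dy, y1 + dx
    return x3, y3, x4, y4
```

On the sample input (0, 0, 0, 1), this yields (-1, 1, -1, 0), matching the expected output.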
#### D.2.2 Human: Example
### SPECIFICATION
Given is a string S representing the day of the week today.
S is SUN, MON, TUE, WED, THU, FRI, or SAT, for Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, and Saturday, respectively.
After how many days is the next Sunday (tomorrow or later)?
----- Constraints -----
- S is SUN, MON, TUE, WED, THU, FRI, or SAT.
----- Input -----
Input is given from Standard Input in the following format:
S
----- Output -----
Print the number of days before the next Sunday.
----- Sample Input -----
SAT
----- Sample Output -----
1
It is Saturday today, and tomorrow will be Sunday.
### INCORRECT PROGRAM
S = input()
days = ['SUN', 'MON', 'TUE', 'WED', 'THU', 'FRI', 'SAT']
print((7 - days.index(S)) % 7)
### ERROR MESSAGE
=== Input ===
SUN
=== Program Output ===
0
=== Expected Output ===
7
### FEEDBACK
The logic is incorrect. There should be no modulo operator.
### COMMENT
Suggesting that the modulo operator should be removed is a relatively small, but explicit, change.
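As a sanity check on this feedback (our sketch, with an illustrative function name): dropping the modulo makes SUN map to 7 while every other day keeps its correct positive offset.

```python
def days_until_sunday(s):
    days = ['SUN', 'MON', 'TUE', 'WED', 'THU', 'FRI', 'SAT']
    # 7 - index: SUN (index 0) -> 7, MON (1) -> 6, ..., SAT (6) -> 1.
    return 7 - days.index(s)
```

Only the SUN case was wrong in the original program; the `% 7` collapsed its answer from 7 to 0.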
### D.3 Category (c): feedback explicitly suggests a large change to the code
#### D.3.1 GPT-4: Example
### SPECIFICATION
Takahashi is about to assemble a character figure, consisting of N parts called Part 1, Part 2, ..., Part N and N-1 connecting components. Parts are distinguishable, but connecting components are not.
Part i has d_i holes, called Hole 1, Hole 2, ..., Hole d_i, into which a connecting component can be inserted. These holes in the parts are distinguishable.
Each connecting component will be inserted into two holes in different parts, connecting these two parts. It is impossible to insert multiple connecting components into a hole.
The character figure is said to be complete when it has the following properties:
- All of the N-1 components are used to connect parts.
- Consider a graph with N vertices corresponding to the parts and N-1 undirected edges corresponding to the pairs of vertices connected by a connecting component. Then, this graph is connected.
Two ways A and B to make the figure complete are considered the same when the following is satisfied: for every pair of holes, A uses a connecting component to connect these holes if and only if B uses one to connect them.
Find the number of ways to make the figure complete. Since the answer can be enormous, find the count modulo 998244353.
----- Constraints -----
- All values in input are integers.
- 2 \leq N \leq 2 \times 10^5
- 1 \leq d_i < 998244353
----- Input -----
Input is given from Standard Input in the following format:
N
d_1 d_2 \cdots d_N
----- Output -----
Print the answer.
----- Sample Input -----
3
1 1 3
----- Sample Output -----
6
One way to make the figure complete is to connect Hole 1 in Part 1 and Hole 3 in Part 3 and then connect Hole 1 in Part 2 and Hole 1 in Part 3.
### INCORRECT PROGRAM
def main():
    N = int(input())
    d = list(map(int, input().split()))
    MOD = 998244353
    fact = [1] * (N + 1)
    for i in range(1, N + 1):
        fact[i] = fact[i - 1] * i % MOD
    ans = 0
    for ki in range(N):
        ans += d[ki] * fact[ki] % MOD * fact[N - ki - 1] % MOD
        ans %= MOD
    ans = ans * fact[N - 1] % MOD
    print(ans)

main()
### ERROR MESSAGE
=== Input ===
3
1 1 3
=== Program Output ===
18
=== Expected Output ===
6
### FEEDBACK
The following is a concise explanation of the issue: The code above calculates the total number of ways to connect the parts, but it does not ensure that each connecting component is used only once. This results in overcounting the number of ways to make the figure complete. To fix the issue, we need to ensure that each connecting component is used exactly once.
### COMMENT
The feedback suggests changing the code so that each connecting component is used exactly once, which does not seem to be taken into account in the current approach.
Changing the code to satisfy this part of the specification seems to require a major rewrite.
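For completeness, one such rewrite (our own sketch, not from the paper, and checked only against the sample) uses a Prüfer-sequence argument: summing over tree degree sequences collapses the count to (prod_i d_i) multiplied by the falling factorial (D - N)(D - N - 1)...(D - 2N + 3), taken modulo 998244353, where D = sum_i d_i. The function name is illustrative.

```python
def count_figures(d, mod=998244353):
    # Prüfer-style count of hole-labeled spanning trees:
    # answer = (prod d_i) * (D - n)(D - n - 1)...(D - 2n + 3)  (mod p),
    # where n = len(d) and D = sum(d); the product has n - 2 factors.
    n = len(d)
    D = sum(d)
    ans = 1
    for di in d:
        ans = ans * (di % mod) % mod
    for j in range(n - 2):
        ans = ans * ((D - n - j) % mod) % mod
    return ans
```

On the sample (d = [1, 1, 3]), this gives 1 * 1 * 3 * (5 - 3) = 6, matching the expected output; for N = 2 the product is empty and the answer is simply d_1 * d_2.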
#### D.3.2 Human: Example
### SPECIFICATION
Oh, New Year. The time to gather all your friends and reflect on the heartwarming events of the past year...
$n$ friends live in a city which can be represented as a number line. The $i$-th friend lives in a house with an integer coordinate $x_i$. The $i$-th friend can come celebrate the New Year to the house with coordinate $x_i-1$, $x_i+1$ or stay at $x_i$. Each friend is allowed to move no more than once.
For all friends $1 \le x_i \le n$ holds, however, they can come to houses with coordinates $0$ and $n+1$ (if their houses are at $1$ or $n$, respectively).
For example, let the initial positions be $x = [1, 2, 4, 4]$. The final ones then can be $[1, 3, 3, 4]$, $[0, 2, 3, 3]$, $[2, 2, 5, 5]$, $[2, 1, 3, 5]$ and so on. The number of occupied houses is the number of distinct positions among the final ones.
So all friends choose the moves they want to perform. After that the number of occupied houses is calculated. What is the minimum and the maximum number of occupied houses can there be?
----- Input -----
The first line contains a single integer $n$ ($1 \le n \le 2 \cdot 10^5$) - the number of friends.
The second line contains $n$ integers $x_1, x_2, \dots, x_n$ ($1 \le x_i \le n$) - the coordinates of the houses of the friends.
----- Output -----
Print two integers - the minimum and the maximum possible number of occupied houses after all moves are performed.
----- Examples -----
Input
4
1 2 4 4
Output
2 4
Input
9
1 1 8 8 8 4 4 4 4
Output
3 8
Input
7
4 3 7 1 4 3 3
Output
3 6
----- Note -----
In the first example friends can go to $[2, 2, 3, 3]$. So friend $1$ goes to $x_1+1$, friend $2$ stays at his house $x_2$, friend $3$ goes to $x_3-1$ and friend $4$ goes to $x_4-1$. $[1, 1, 3, 3]$, $[2, 2, 3, 3]$ or $[2, 2, 4, 4]$ are also all valid options to obtain $2$ occupied houses.
For the maximum number of occupied houses friends can go to $[1, 2, 3, 4]$ or to $[0, 2, 4, 5]$, for example.
### INCORRECT PROGRAM
from collections import Counter

def main():
    n = int(input())
    x = list(map(int, input().split()))
    cnt = Counter(x)
    max_houses = 0
    for pos in range(1, n + 1):
        if cnt[pos]:
            max_houses += 1
    occupied = [0] * (n + 2)
    for pos in x:
        if not occupied[pos - 1]:
            occupied[pos - 1] = 1
        elif not occupied[pos]:
            occupied[pos] = 1
        elif not occupied[pos + 1]:
            occupied[pos + 1] = 1
    min_houses = sum(occupied)
    print(min_houses, max_houses)

main()
### ERROR MESSAGE
=== Input ===
4
1 2 4 4
=== Program Output ===
4 3
=== Expected Output ===
2 4
### FEEDBACK
The code calculates both the minimum and maximum count of houses incorrectly. The correct solution requires conducting a tree search to determine the true maximum and minimum. This can be further optimized by first sorting the list of occupied houses and using a sliding window to process independent segments of the number line.
### COMMENT
The feedback suggests conducting a tree search to determine the max and min, as well as a way to optimize this search.
These constitute major changes to the program as it is currently written.
### D.4 Category (d): Feedback contains blocks of pseudocode or Python code
#### D.4.1 GPT-4: Example
Per our experimental design, in which code blocks are stripped from the feedback (Section 4.3), GPT-4's feedback never contains blocks of Python. We find that GPT-4 never uses blocks of pseudocode, either.
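For reference, stripping fenced code blocks from a feedback string can be done with a short regular expression. The helper name below is ours, and this is a sketch rather than the paper's implementation:

```python
import re

# Matches a fenced code block: ``` plus an optional language tag,
# then anything (non-greedy, across lines) up to the closing ```.
FENCED_BLOCK = re.compile(r"```[a-zA-Z]*\n?.*?```", re.DOTALL)

def strip_code_blocks(feedback: str) -> str:
    """Remove fenced code blocks, keeping only the textual feedback."""
    return FENCED_BLOCK.sub("", feedback).strip()

text = (
    "The loop is off by one.\n"
    "```python\nfor i in range(n + 1):\n    pass\n```\n"
    "Fix the bound."
)
# Prints the feedback with the fenced block removed.
print(strip_code_blocks(text))
```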
#### D.4.2 Human: Example
### SPECIFICATION
Given is a string S representing the day of the week today.
S is SUN, MON, TUE, WED, THU, FRI, or SAT, for Sunday, Monday, Tuesday, Wednesday, Thursday, Friday, and Saturday, respectively.
After how many days is the next Sunday (tomorrow or later)?
----- Constraints -----
- S is SUN, MON, TUE, WED, THU, FRI, or SAT.
----- Input -----
Input is given from Standard Input in the following format:
S
----- Output -----
Print the number of days before the next Sunday.
----- Sample Input -----
SAT
----- Sample Output -----
1
It is Saturday today, and tomorrow will be Sunday.
### INCORRECT PROGRAM
day = input()
days = ['SUN', 'MON', 'TUE', 'WED', 'THU', 'FRI', 'SAT']
index = days.index(day)
result = (7 - index) % 7
print(result)
### ERROR MESSAGE
=== Input ===
SUN
=== Program Output ===
0
=== Expected Output ===
7
### FEEDBACK
before returning the result, we should just check to see if result == 0.
so we just need to add:
if result == 0:
    result = 7
### COMMENT
The feedback contains a block of Python code that the model can simply copy in order to fix the program.
### D.5 Category (e): Feedback which expresses uncertainty
#### D.5.1 GPT-4: Example
In the $20\cdot 2\cdot 2=80$ examples we study, we do not find that GPT-4 ever expresses uncertainty.
#### D.5.2 Human: Example
### SPECIFICATION
Two kingdoms had been at war for a long time, until the emperor intervened to bring an end to the conflict. The territory in question comprises an $M$ by $N$ rectangular grid. At the emperor’s insistence, the two kings have withdrawn their troops until no two opposing troops are in adjacent squares of the map (adjacent being horizontal or vertical - diagonal is not considered).
The emperor proposes to designate certain squares of the map as neutral territory. Neither king will be allowed to move troops into those squares, and the emperor’s own forces will patrol them to be sure that both kings observe these rules.
The emperor is frugal and does not want to commit more soldiers to this effort than absolutely necessary. His generals have marked each square of the map with the number of soldiers required to secure that square. What remains is to choose which of those squares should be patrolled.
Write a program to determine the minimum number of soldiers that the emperor will need to deploy to guarantee that the troops of one kingdom cannot move, in one or more steps, into squares occupied by the troops of the second kingdom (moving horizontally or vertically) without encountering the emperor’s own soldiers.
----- Input -----
Input begins with a line containing $2$ integers, $w$ and $h$, denoting the width and height of the map, where $1 \leq w, h \leq 40$.
This is followed by $h$ lines. Each line contains $w$ characters, left justified. These characters will be ’A’ or ’B’, designating a position held by king A or king B, or a single numeric digit, designating a currently unoccupied position that can be secured by the use of that number of soldiers. For example, a ’2’ would indicate that two soldiers must be deployed to that square to secure it against passage of other troops. A ’0’ indicates terrain that is impassible - the emperor need not commit soldiers there because the kingdom troops cannot pass through that square.
No ’A’ will be adjacent, horizontally or vertically, to any ’B’.
There will be at least one ’A’ and one ’B’ in the input.
----- Output -----
Print a single line containing an integer denoting the minimum number of soldiers that the emperor must deploy to guarantee that there is no open path between any ’A’ position and any ’B’ position, using any combination of horizontal or vertical moves.
----- Examples -----
Sample Input:
8 5
A11111AA
AA7B111A
111BB111
11BBB111
11BBB11B
Sample Output:
13
### INCORRECT PROGRAM
def main():
    w, h = map(int, input().split())
    grid = [input() for _ in range(h)]

    def bfs(x, y, z):
        dist = [[float("inf")] * w for _ in range(h)]
        dist[x][y] = 0
        q = [(x, y)]
        for x, y in q:
            for dx, dy in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
                nx, ny = x + dx, y + dy
                if 0 <= nx < h and 0 <= ny < w and dist[nx][ny] == float("inf"):
                    if grid[nx][ny] == z:
                        dist[nx][ny] = dist[x][y]
                        q.append((nx, ny))
                    if "0" <= grid[nx][ny] <= "9":
                        dist[nx][ny] = dist[x][y] + int(grid[nx][ny])
                        q.append((nx, ny))
        return dist

    ans = float("inf")
    for i in range(h):
        for j in range(w):
            if grid[i][j] == "A":
                ans = min(ans, min(x for row in bfs(i, j, "B") for x in row))
    print(ans)

main()
### ERROR MESSAGE
=== Input ===
8 5
A11111AA
AA7B111A
111BB111
11BBB111
11BBB11B
=== Program Output ===
0
=== Expected Output ===
13
### FEEDBACK
This code appears to be finding the minimum "distance" between an A position and a B position, where distance is measured in the number of soldiers required to secure the intermediate positions. This isn’t really addressing the question at all, because even occupying the positions along some path from a particular A -> B won’t necessarily block the kings from reaching each other from those positions through a different, longer path. You probably need some sort of min-cut algorithm to divide the graph into two connected components, one each for A and B nodes.
### COMMENT
The feedback expresses mild uncertainty in two aspects. First, the user indicates that they are not completely certain in their understanding of the code’s current behaviour by using the phrase "appears to be". Then, they express uncertainty in their suggestion for what to do instead, saying that one *probably* needs some sort of min-cut algorithm.
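Hedging phrases like these can also be flagged automatically. The sketch below is a hypothetical keyword heuristic (not used in the study, which relied on manual annotation); the phrase list is illustrative only:

```python
# Hypothetical hedging detector; the phrase list is illustrative only.
HEDGES = ("appears to", "seems to", "probably", "might", "perhaps", "i think")

def expresses_uncertainty(feedback: str) -> bool:
    """Return True if the feedback contains a common hedging phrase."""
    lowered = feedback.lower()
    return any(phrase in lowered for phrase in HEDGES)

print(expresses_uncertainty("You probably need some sort of min-cut algorithm."))  # True
print(expresses_uncertainty("The loop bound is off by one."))  # False
```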
## Appendix E Prompts
In this appendix, we describe the prompting structure used for our experiments. All of our experiments use one-shot prompting, in which a single example is given in the prompt before the desired task.
For initial code generation (the first sample from $M_{P}$ ), we use different prompts for the two types of tasks in APPS: call-based tasks, in which the desired program should take the input as a parameter to a function and return the output in the function’s return statement; and stdio-based tasks, in which inputs should be read from stdin and outputs should be written to stdout. These prompts are shown in Listings 1 and 2, respectively. The example tasks and programs were taken from APPS’ training set.
For feedback samples (i.e., samples from $M_{F}$ ), we use the prompt in Listing 3. This prompt contains an example in which the user provides the textual specification, the incorrect program and the error message, and the assistant generates feedback. Similarly, for repair samples (i.e., samples from $M_{P}$ which follow $M_{F}$ ) we use the prompt in Listing 4, in which the user also supplies the feedback, and the assistant returns only the fixed version of the program. Finally, for joint feedback-repair samples (i.e., when sampling $(f,r)\sim M_{P}$ ), we use the prompt in Listing 5. This prompt combines the prompts from Listings 3 and 4 into one prompt, in which the assistant returns both the feedback and the fixed program. In all of these prompts, the specification used was taken from APPS’ training set, while the programs and the feedback were constructed manually.
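As a concrete illustration, one-shot prompts like these can be assembled as a list of chat messages: a system message, a single worked example (a user turn plus an assistant turn), then the actual task. The sketch below is ours; the helper name and message texts are illustrative, not the paper's exact prompts.

```python
def build_repair_messages(system, shot_user, shot_assistant, spec, program, error):
    """Assemble a one-shot chat prompt: system message, one worked
    example (user turn + assistant turn), then the actual task."""
    task = (
        "### QUESTION\n" + spec +
        "\n### INCORRECT PYTHON CODE\n```python\n" + program + "\n```\n" +
        error
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": shot_user},
        {"role": "assistant", "content": shot_assistant},
        {"role": "user", "content": task},
    ]

msgs = build_repair_messages(
    "You are a helpful programming assistant and an expert Python programmer.",
    "### QUESTION\n<example spec, incorrect code, and error message>",
    "<example feedback and fixed code>",
    "Print the sum of two integers.",
    "print(int(input()) - int(input()))",
    "Given input '1 2' the output was '-1' but the expected output was '3'.",
)
print(len(msgs))  # 4
```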
Listing 1: Code generation prompt for call-based tasks.
===== system =====
You are an expert Python programmer. You will be given a question (problem specification) and will generate a correct Python program that matches the specification and passes all tests. You will NOT return anything except for the program. Put your fixed program within code delimiters, for example: ‘‘‘ python
# YOUR CODE HERE
‘‘‘.
===== user =====
### QUESTION
A palindrome is a word, phrase, number, or other sequence of characters which reads the same backward as forward. Examples of numerical palindromes are: ‘2332, 110011, 54322345‘
For a given number ‘‘‘num‘‘‘, write a function which returns an array of all the numerical palindromes contained within each number. The array should be sorted in ascending order and any duplicates should be removed.
In this kata, single digit numbers and numbers which start or end with zeros (such as ‘010‘ and ‘00‘) are **NOT** considered valid numerical palindromes.
If ‘num‘ contains no valid palindromes, return ‘"No palindromes found"‘.
Otherwise, return ‘"Not valid"‘ if the input is not an integer or is less than ‘0‘.
## Examples
‘‘‘
palindrome (1221) => [22, 1221]
palindrome (34322122) => [22, 212, 343, 22122]
palindrome (1001331) => [33, 1001, 1331]
palindrome (1294) => " No palindromes found "
palindrome ("1221") => " Not valid "
‘‘‘
---
### Other Kata in this Series:
Numerical Palindrome #1
Numerical Palindrome #1.5
Numerical Palindrome #2
Numerical Palindrome #3
Numerical Palindrome #3.5
Numerical Palindrome #4
Numerical Palindrome #5
### Use Call-Based Format
===== assistant =====
```python
def is_pal(s):
    return s == s[::-1]

def palindrome(num):
    if not isinstance(num, int) or num < 0:
        return 'Not valid'
    s = str(num)
    pals = set()
    for i, ch in enumerate(s):
        if ch == '0':
            continue
        for j in range(i + 2, len(s) + 1):
            test = s[i:j]
            if is_pal(test):
                pals.add(test)
    return sorted(int(x) for x in pals) or 'No palindromes found'
```
Listing 2: Code generation prompt for stdio-based tasks.
===== system =====
You are an expert Python programmer. You will be given a question (problem specification) and will generate a correct Python program that matches the specification and passes all tests. You will NOT return anything except for the program. Put your fixed program within code delimiters, for example: ‘‘‘ python
# YOUR CODE HERE
‘‘‘.
===== user =====
### QUESTION
Polycarp has $n$ different binary words. A word is called binary if it contains only characters ’0’ and ’1’. For example, these words are binary: "0001", "11", "0" and "0011100".
Polycarp wants to offer his set of $n$ binary words to play a game "words". In this game, players name words and each next word (starting from the second) must start with the last character of the previous word. The first word can be any. For example, this sequence of words can be named during the game: "0101", "1", "10", "00", "00001".
Word reversal is the operation of reversing the order of the characters. For example, the word "0111" after the reversal becomes "1110", the word "11010" after the reversal becomes "01011".
Probably, Polycarp has such a set of words that there is no way to put them in the order corresponding to the game rules. In this situation, he wants to reverse some words from his set so that: the final set of $n$ words still contains different words (i.e. all words are unique); there is a way to put all words of the final set of words in the order so that the final sequence of $n$ words is consistent with the game rules.
Polycarp wants to reverse minimal number of words. Please, help him.
----- Input -----
The first line of the input contains one integer $t$ ($1 \le t \le 10^4$) - the number of test cases in the input. Then $t$ test cases follow.
The first line of a test case contains one integer $n$ ($1 \le n \le 2 \cdot 10^5$) - the number of words in the Polycarp’s set. Next $n$ lines contain these words. All of $n$ words aren’t empty and contains only characters ’0’ and ’1’. The sum of word lengths doesn’t exceed $4 \cdot 10^6$. All words are different.
Guaranteed, that the sum of $n$ for all test cases in the input doesn’t exceed $2 \cdot 10^5$. Also, guaranteed that the sum of word lengths for all test cases in the input doesn’t exceed $4 \cdot 10^6$.
----- Output -----
Print answer for all of $t$ test cases in the order they appear.
If there is no answer for the test case, print -1. Otherwise, the first line of the output should contain $k$ ($0 \le k \le n$) - the minimal number of words in the set which should be reversed. The second line of the output should contain $k$ distinct integers - the indexes of the words in the set which should be reversed. Words are numerated from $1$ to $n$ in the order they appear. If $k = 0$ you can skip this line (or you can print an empty line). If there are many answers you can print any of them.
----- Example -----
Input
4
4
0001
1000
0011
0111
3
010
101
0
2
00000
00001
4
01
001
0001
00001
Output
1
3
-1
0
2
1 2
### Use Standard Input Format (read inputs with ‘input()‘, write results with ‘print()‘)
===== assistant =====
```python
for _ in range(int(input())):
    n = int(input())
    mass = []
    zo = 0
    oz = 0
    zz = 0
    oo = 0
    ozs = []
    zos = []
    ozss = set()
    zoss = set()
    for j in range(n):
        k = input()
        mass.append(k)
        if k[0] == '0' and k[-1] == '1':
            zoss.add(k)
            zos.append(j + 1)
            zo += 1
        elif k[0] == '1' and k[-1] == '0':
            ozss.add(k)
            ozs.append(j + 1)
            oz += 1
        elif k[0] == '0' and k[-1] == '0':
            zz += 1
        else:
            oo += 1
    if zz and oo and not oz and not zo:
        print(-1)
        continue
    else:
        if zo > oz:
            print((zo - oz) // 2)
            ans = []
            need = (zo - oz) // 2
            i = 0
            while need:
                zzz = mass[zos[i] - 1][len(mass[zos[i] - 1]) - 1::-1]
                if zzz not in ozss:
                    ans.append(zos[i])
                    need -= 1
                i += 1
            print(*ans)
        else:
            print((oz - zo) // 2)
            ans = []
            need = (oz - zo) // 2
            i = 0
            while need:
                zzz = mass[ozs[i] - 1][len(mass[ozs[i] - 1]) - 1::-1]
                if zzz not in zoss:
                    ans.append(ozs[i])
                    need -= 1
                i += 1
            print(*ans)
```
Listing 3: Feedback-only prompt.
===== system =====
You are a helpful programming assistant and an expert Python programmer. You are helping a user debug a program. The user has written some code, but it has some errors and is not passing the tests. You will help the user by giving a concise (at most 2-3 sentences) textual explanation of what is wrong with the code. You will *not* generate any code, because the user wants to fix the code themselves.
===== user =====
### QUESTION
A palindrome is a word, phrase, number, or other sequence of characters which reads the same backward as forward. Examples of numerical palindromes are: ‘2332, 110011, 54322345‘
For a given number ‘‘‘num‘‘‘, write a function which returns an array of all the numerical palindromes contained within each number. The array should be sorted in ascending order and any duplicates should be removed.
In this kata, single digit numbers and numbers which start or end with zeros (such as ‘010‘ and ‘00‘) are **NOT** considered valid numerical palindromes.
If ‘num‘ contains no valid palindromes, return ‘"No palindromes found"‘.
Otherwise, return ‘"Not valid"‘ if the input is not an integer or is less than ‘0‘.
## Examples
‘‘‘
palindrome (1221) => [22, 1221]
palindrome (34322122) => [22, 212, 343, 22122]
palindrome (1001331) => [33, 1001, 1331]
palindrome (1294) => " No palindromes found "
palindrome ("1221") => " Not valid "
‘‘‘
---
### Other Kata in this Series:
Numerical Palindrome #1
Numerical Palindrome #1.5
Numerical Palindrome #2
Numerical Palindrome #3
Numerical Palindrome #3.5
Numerical Palindrome #4
Numerical Palindrome #5
### Use Call-Based Format
### INCORRECT PYTHON CODE
```python
def is_pal(s):
    return s == s[::-1]

def palindrome(num):
    if not isinstance(num, int) or num < 0:
        return 'Not valid'
    s = str(num)
    pals = set()
    for i, ch in enumerate(s):
        for j in range(i + 2, len(s) + 1):
            test = s[i:j]
            if is_pal(test):
                pals.add(test)
    return sorted(int(x) for x in pals) or 'No palindromes found'
```
The code above is wrong and contains a bug. Given input "1001331" the output was "[’0’, ’33’, ’1001’, ’1331’]" but the expected output was "[’33’, ’1001’, ’1331’]".
===== assistant =====
The following is a concise explanation of the issue: The problem description states that numbers which start or end with zeros (such as ‘010‘ and ‘00‘) are NOT considered valid numerical palindromes. However, the code above does not take this into account and therefore returns ‘00‘ as a valid palindrome. This can be fixed by checking if the first or last character is ‘0‘ before adding the string to the set of palindromes.
Listing 4: Repair-only prompt.
===== system =====
You are a helpful programming assistant and an expert Python programmer. You are helping a user write a program to solve a problem. The user has written some code, but it has some errors and is not passing the tests. The user has spent some time debugging the program and will provide you with a concise textual explanation of what is wrong with the code. You will use this explanation to generate a fixed version of the program. Put your fixed program within code delimiters, for example: ‘‘‘ python
# YOUR CODE HERE
‘‘‘.
===== user =====
### QUESTION
A palindrome is a word, phrase, number, or other sequence of characters which reads the same backward as forward. Examples of numerical palindromes are: ‘2332, 110011, 54322345‘
For a given number ‘‘‘num‘‘‘, write a function which returns an array of all the numerical palindromes contained within each number. The array should be sorted in ascending order and any duplicates should be removed.
In this kata, single digit numbers and numbers which start or end with zeros (such as ‘010‘ and ‘00‘) are **NOT** considered valid numerical palindromes.
If ‘num‘ contains no valid palindromes, return ‘"No palindromes found"‘.
Otherwise, return ‘"Not valid"‘ if the input is not an integer or is less than ‘0‘.
## Examples
‘‘‘
palindrome (1221) => [22, 1221]
palindrome (34322122) => [22, 212, 343, 22122]
palindrome (1001331) => [33, 1001, 1331]
palindrome (1294) => " No palindromes found "
palindrome ("1221") => " Not valid "
‘‘‘
---
### Other Kata in this Series:
Numerical Palindrome #1
Numerical Palindrome #1.5
Numerical Palindrome #2
Numerical Palindrome #3
Numerical Palindrome #3.5
Numerical Palindrome #4
Numerical Palindrome #5
### Use Call-Based Format
### INCORRECT PYTHON CODE
```python
def is_pal(s):
    return s == s[::-1]

def palindrome(num):
    if not isinstance(num, int) or num < 0:
        return 'Not valid'
    s = str(num)
    pals = set()
    for i, ch in enumerate(s):
        for j in range(i + 2, len(s) + 1):
            test = s[i:j]
            if is_pal(test):
                pals.add(test)
    return sorted(int(x) for x in pals) or 'No palindromes found'
```
The code above is wrong and contains a bug. Given input "1001331" the output was "[’0’, ’33’, ’1001’, ’1331’]" but the expected output was "[’33’, ’1001’, ’1331’]".
The following is a concise explanation of the issue: The problem description states that numbers which start or end with zeros (such as ‘010‘ and ‘00‘) are NOT considered valid numerical palindromes. However, the code above does not take this into account and therefore returns ‘00‘ as a valid palindrome. This can be fixed by checking if the first or last character is ‘0‘ before adding the string to the set of palindromes.
===== assistant =====
### FIXED PYTHON CODE
```python
def is_pal(s):
    return s == s[::-1]

def palindrome(num):
    if not isinstance(num, int) or num < 0:
        return 'Not valid'
    s = str(num)
    pals = set()
    for i, ch in enumerate(s):
        if ch == '0':
            continue
        for j in range(i + 2, len(s) + 1):
            test = s[i:j]
            if is_pal(test):
                pals.add(test)
    return sorted(int(x) for x in pals) or 'No palindromes found'
```
Listing 5: Joint feedback-repair prompt.
===== system =====
You are a helpful programming assistant and an expert Python programmer. You are helping a user write a program to solve a problem. The user has written some code, but it has some errors and is not passing the tests. You will help the user by first giving a concise (at most 2-3 sentences) textual explanation of what is wrong with the code. After you have pointed out what is wrong with the code, you will then generate a fixed version of the program. Put your fixed program within code delimiters, for example: ‘‘‘ python
# YOUR CODE HERE
‘‘‘.
===== user =====
### QUESTION
A palindrome is a word, phrase, number, or other sequence of characters which reads the same backward as forward. Examples of numerical palindromes are: ‘2332, 110011, 54322345‘
For a given number ‘‘‘num‘‘‘, write a function which returns an array of all the numerical palindromes contained within each number. The array should be sorted in ascending order and any duplicates should be removed.
In this kata, single digit numbers and numbers which start or end with zeros (such as ‘010‘ and ‘00‘) are **NOT** considered valid numerical palindromes.
If ‘num‘ contains no valid palindromes, return ‘"No palindromes found"‘.
Otherwise, return ‘"Not valid"‘ if the input is not an integer or is less than ‘0‘.
## Examples
‘‘‘
palindrome (1221) => [22, 1221]
palindrome (34322122) => [22, 212, 343, 22122]
palindrome (1001331) => [33, 1001, 1331]
palindrome (1294) => " No palindromes found "
palindrome ("1221") => " Not valid "
‘‘‘
---
### Other Kata in this Series:
Numerical Palindrome #1
Numerical Palindrome #1.5
Numerical Palindrome #2
Numerical Palindrome #3
Numerical Palindrome #3.5
Numerical Palindrome #4
Numerical Palindrome #5
### Use Call-Based Format
### INCORRECT PYTHON CODE
```python
def is_pal(s):
    return s == s[::-1]

def palindrome(num):
    if not isinstance(num, int) or num < 0:
        return 'Not valid'
    s = str(num)
    pals = set()
    for i, ch in enumerate(s):
        for j in range(i + 2, len(s) + 1):
            test = s[i:j]
            if is_pal(test):
                pals.add(test)
    return sorted(int(x) for x in pals) or 'No palindromes found'
```
The code above is wrong and contains a bug. Given input "1001331" the output was "[’0’, ’33’, ’1001’, ’1331’]" but the expected output was "[’33’, ’1001’, ’1331’]".
===== assistant =====
The following is a concise explanation of the issue: The problem description states that numbers which start or end with zeros (such as ‘010‘ and ‘00‘) are NOT considered valid numerical palindromes. However, the code above does not take this into account and therefore returns ‘00‘ as a valid palindrome. This can be fixed by checking if the first or last character is ‘0‘ before adding the string to the set of palindromes.
# The code below is the correct version of the code above, where the issue has been fixed:
### FIXED PYTHON CODE
```python
def is_pal(s):
    return s == s[::-1]

def palindrome(num):
    if not isinstance(num, int) or num < 0:
        return 'Not valid'
    s = str(num)
    pals = set()
    for i, ch in enumerate(s):
        if ch == '0':
            continue
        for j in range(i + 2, len(s) + 1):
            test = s[i:j]
            if is_pal(test):
                pals.add(test)
    return sorted(int(x) for x in pals) or 'No palindromes found'
```
## Appendix F Dataset: APPS Tasks Used For Our Evaluations
These tasks were randomly sampled from APPS’ test set. To avoid distribution shift, we sampled according to the relative frequency of difficulties in the full dataset. We report the resulting list of tasks to aid reproducibility.
| introductory | ’4004’, ’4058’, ’4063’, ’4065’, ’4100’, ’4108’, ’4117’, ’4155’, ’4164’, ’4182’, ’4193’, ’4195’, ’4211’, ’4217’, ’4241’, ’4249’, ’4270’, ’4275’, ’4281’, ’4293’, ’4333’, ’4347’, ’4350’, ’4356’, ’4409’, ’4426’, ’4431’, ’4450’, ’4465’, ’4484’, ’4498’, ’4505’, ’4507’, ’4514’, ’4544’, ’4553’, ’4586’, ’4610’, ’4662’, ’4663’, ’4667’, ’4677’, ’4681’, ’4704’, ’4716’, ’4741’, ’4750’, ’4786’, ’4787’, ’4801’, ’4855’, ’4862’, ’4864’, ’4870’, ’4873’, ’4890’, ’4897’, ’4952’, ’4966’, ’4984’ |
| --- | --- |
| interview | ’0004’, ’0013’, ’0033’, ’0056’, ’0073’, ’0074’, ’0089’, ’0091’, ’0124’, ’0131’, ’0139’, ’0162’, ’0166’, ’0183’, ’0186’, ’0191’, ’0199’, ’0205’, ’0249’, ’0253’, ’0268’, ’0274’, ’0300’, ’0304’, ’0341’, ’0342’, ’0413’, ’0427’, ’0434’, ’0466’, ’0467’, ’0496’, ’0501’, ’0511’, ’0537’, ’0564’, ’0571’, ’0575’, ’0579’, ’0592’, ’0597’, ’0626’, ’0637’, ’0676’, ’0704’, ’0728’, ’0757’, ’0765’, ’0788’, ’0794’, ’0804’, ’0805’, ’0811’, ’0829’, ’0879’, ’0904’, ’0915’, ’0925’, ’0937’, ’0948’, ’0954’, ’0955’, ’0972’, ’0985’, ’0989’, ’1018’, ’1019’, ’1033’, ’1046’, ’1076’, ’1133’, ’1140’, ’1141’, ’1145’, ’1146’, ’1149’, ’1168’, ’1185’, ’1221’, ’1232’, ’1256’, ’1257’, ’1280’, ’1285’, ’1299’, ’1317’, ’1347’, ’1380’, ’1392’, ’1393’, ’1418’, ’1444’, ’1448’, ’1458’, ’1489’, ’1517’, ’1533’, ’1573’, ’1635’, ’1653’, ’1668’, ’1672’, ’1721’, ’1736’, ’1748’, ’1756’, ’1759’, ’1775’, ’1777’, ’1825’, ’1850’, ’1863’, ’1865’, ’1870’, ’1875’, ’1906’, ’1917’, ’1956’, ’1962’, ’1967’, ’1976’, ’2024’, ’2049’, ’2062’, ’2092’, ’2093’, ’2097’, ’2106’, ’2172’, ’2176’, ’2203’, ’2231’, ’2246’, ’2264’, ’2266’, ’2295’, ’2326’, ’2328’, ’2332’, ’2342’, ’2361’, ’2369’, ’2407’, ’2408’, ’2418’, ’2455’, ’2463’, ’2511’, ’2515’, ’2516’, ’2535’, ’2585’, ’2623’, ’2629’, ’2642’, ’2651’, ’2662’, ’2668’, ’2673’, ’2698’, ’2701’, ’2709’, ’2735’, ’2742’, ’2752’, ’2759’, ’2765’, ’2787’, ’2802’, ’2832’, ’2835’, ’2844’, ’2858’, ’2885’, ’2897’, ’2923’, ’2932’, ’2945’, ’2973’, ’2980’ |
| competition | ’3017’, ’3019’, ’3054’, ’3062’, ’3063’, ’3066’, ’3070’, ’3077’, ’3083’, ’3097’, ’3117’, ’3135’, ’3161’, ’3186’, ’3209’, ’3220’, ’3286’, ’3287’, ’3323’, ’3335’, ’3353’, ’3355’, ’3371’, ’3375’, ’3376’, ’3388’, ’3404’, ’3411’, ’3433’, ’3441’, ’3445’, ’3470’, ’3481’, ’3484’, ’3548’, ’3557’, ’3605’, ’3609’, ’3634’, ’3635’, ’3671’, ’3679’, ’3709’, ’3754’, ’3769’, ’3792’, ’3798’, ’3799’, ’3804’, ’3810’, ’3819’, ’3823’, ’3836’, ’3843’, ’3849’, ’3876’, ’3913’, ’3934’, ’3972’, ’3974’ |
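The difficulty-stratified sampling described above can be sketched as follows. This is a hypothetical reconstruction (the function and variable names are ours), since the paper does not include the sampling code:

```python
import random
from collections import Counter

def stratified_sample(task_difficulties, total, seed=0):
    """Sample `total` task ids, allocating slots to each difficulty in
    proportion to its frequency in the full dataset. Rounding may shift
    the totals slightly for other inputs; here it divides evenly."""
    rng = random.Random(seed)
    counts = Counter(task_difficulties.values())
    n = len(task_difficulties)
    chosen = []
    for diff, c in counts.items():
        pool = [t for t, d in task_difficulties.items() if d == diff]
        k = round(total * c / n)
        chosen.extend(rng.sample(pool, k))
    return chosen

# Toy example: 6 introductory, 18 interview, 6 competition tasks; draw 10.
diffs = {f"i{j}": "introductory" for j in range(6)}
diffs.update({f"v{j}": "interview" for j in range(18)})
diffs.update({f"c{j}": "competition" for j in range(6)})
sample = stratified_sample(diffs, 10)
print(len(sample))  # 10
```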
## Appendix G More Examples of Feedback from GPT-4 and our Human Participants
In this appendix, we give five more examples from the study in Section 4.3, so that the keen reader may compare and contrast the feedback generated by GPT-4 and by our human participants. Each listing (6-10) contains a textual specification, an incorrect program, an error message, both examples of GPT-4 feedback which were shown to the participants assigned to this task, and the responses from the participants. For each piece of feedback, we also list the number of resulting repaired programs which passed all the tests (cf. the full results table in Appendix C).
Listing 6: Task 2106, Program A.
### SPECIFICATION
There are n cities in the country where the Old Peykan lives. These cities are located on a straight line, we’ll denote them from left to right as c_1, c_2, ..., c_{n}. The Old Peykan wants to travel from city c_1 to c_{n} using roads. There are (n - 1) one way roads, the i-th road goes from city c_{i} to city c_{i+1} and is d_{i} kilometers long.
The Old Peykan travels 1 kilometer in 1 hour and consumes 1 liter of fuel during this time.
Each city c_{i} (except for the last city c_{n}) has a supply of s_{i} liters of fuel which immediately transfers to the Old Peykan if it passes the city or stays in it. This supply refreshes instantly k hours after it transfers. The Old Peykan can stay in a city for a while and fill its fuel tank many times.
Initially (at time zero) the Old Peykan is at city c_1 and s_1 liters of fuel is transferred to its empty tank from c_1’s supply. The Old Peykan’s fuel tank capacity is unlimited. Old Peykan can not continue its travel if its tank is emptied strictly between two cities.
Find the minimum time the Old Peykan needs to reach city c_{n}.
----- Input -----
The first line of the input contains two space-separated integers m and k (1 <= m, k <= 1000). The value m specifies the number of roads between cities which is equal to n - 1.
The next line contains m space-separated integers d_1, d_2, ..., d_{m} (1 <= d_{i} <= 1000) and the following line contains m space-separated integers s_1, s_2, ..., s_{m} (1 <= s_{i} <= 1000).
----- Output -----
In the only line of the output print a single integer - the minimum time required for The Old Peykan to reach city c_{n} from city c_1.
----- Examples -----
Input
4 6
1 2 5 2
2 3 3 4
Output
10
Input
2 3
5 6
5 5
Output
14
### INCORRECT PROGRAM
def main():
    m, k = map(int, input().split())
    ds = list(map(int, input().split()))
    ss = list(map(int, input().split()))
    time = 0
    fuel = ss[0]
    min_time = 0
    max_s = ss[0]
    for i in range(m):
        fuel -= ds[i]
        max_s = max(max_s, ss[i])
        while fuel < 0:
            time += k
            fuel += max_s
        time += ds[i]
    print(time)

main()
### ERROR MESSAGE
=== Input ===
4 6
1 2 5 2
2 3 3 4
=== Program Output ===
28
=== Expected Output ===
10
### GPT -4 FEEDBACK #1
The issue with the code is that it does not consider the time spent in the current city before moving to the next city, resulting in an overestimation of the minimum time required. This can be fixed by updating the time spent in the current city before calculating the fuel used and accounting for the time it takes to travel to the next city.
# Number of passing repairs: 7/25
### GPT -4 FEEDBACK #2
The following is a concise explanation of the issue: The code incorrectly calculates the waiting time at each city. It should minimize the waiting time by considering the maximum fuel supply that can be collected during the waiting time. The code needs to be updated to account for this.
# Number of passing repairs: 10/25
### PARTICIPANT FEEDBACK #1
The program doesn't add the fuel for passing through cities, only if it needs to actually stay in the city to refuel. Whenever it passes through a city, it should add ss[i] to the fuel.
# Number of passing repairs: 10/25
### PARTICIPANT FEEDBACK #2
The code is incorrect because it doesn't take into account the fuel at each city that is immediately transferred after entering the city (aside from the fuel received at the first city). As a result, this program overestimates the time needed, because it assumes that the Old Peykan always has to wait for a refill (besides the fuel it is given in the first city).
# Number of passing repairs: 0/25
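For reference, the repair described in Participant Feedback #1 can be sketched as follows. This is our own illustrative fix, not one of the repairs sampled in the study, and the helper name `min_travel_time` is hypothetical: it credits each city's fuel supply on arrival and otherwise keeps the incorrect program's waiting logic.

```python
def min_travel_time(m, k, ds, ss):
    # Fix per Participant Feedback #1: add ss[i] whenever a city is
    # passed through, not just at the first city.
    time = 0
    fuel = 0
    max_s = 0
    for i in range(m):
        fuel += ss[i]              # supply transferred on entering city i
        max_s = max(max_s, ss[i])
        fuel -= ds[i]              # fuel consumed by the next road
        while fuel < 0:            # wait k hours per extra refill at the best city so far
            time += k
            fuel += max_s
        time += ds[i]
    return time
```

On the two sample inputs this returns 10 and 14, matching the expected outputs.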
Listing 7: Task 3070, Program B.
### SPECIFICATION
In the free-market, ruthlessly capitalist world of train fares, only one thing matters: incentives.
Train companies are incentivised with bonuses for high throughput, successful journeys, and customer satisfaction. Conversely, the companies are disincentivised from failure via mandatory refunds for customers delayed by $30$ minutes or more.
Being a ruthless capitalist yourself, you have decided to take advantage of this generous delay compensation provision.
The refund is awarded provided that, no matter the combination of trains you had taken (provided they followed the same route of stations as planned), you would still be unable to reach your destination in strictly less than $30$ minutes (or $1800$ seconds) of the time you would have arrived assuming your booked journey was exactly on time.
Armed with your printout of the day's delays, and the original timetable, you must ask yourself only one question: what is the earliest time you can book a train for from station $1$, in order to earn this restitutive reward?
----- Input -----
- One line containing two integers: $N$ ($1 \le N \le 100$), the number of stations, and $M$ ($1 \le M \le 10^5$), the number of scheduled trains.
- The next $M$ lines each contain 4 integers:
- $X$, the starting station ($1 \le X \le N-1$),
- $S$ and $T$ ($0 \le S \le T < 86400$), the planned departure and arrival times in seconds,
- and $L$ ($0 \le L < 86400$), the duration by which the train's departure and arrival times are delayed.
Stations are numbered from $1$ to $N$ in the order you will visit them. Each train goes between stations $X$ and $X+1$. It is possible to change between trains instantaneously.
----- Output -----
- One line containing one integer: the start time of the earliest train journey you could book in order to earn your compensation, or impossible if no such journey is possible.
----- Examples -----
Sample Input 1:
2 3
1 1800 9000 1800
1 2000 9200 1600
1 2200 9400 1400
Sample Output 1:
1800
Sample Input 2:
2 2
1 1800 3600 1800
1 1900 3600 1600
Sample Output 2:
impossible
### INCORRECT PROGRAM
def main():
    N, M = map(int, input().split())
    train_data = [tuple(map(int, input().split())) for _ in range(M)]
    train_data.sort(key=lambda x: (x[0], x[1]))
    times = [1e9] * (N + 1)
    times[1] = 0
    for X, S, T, L in train_data:
        if times[X] <= S:
            times[X + 1] = min(times[X + 1], T + L)
    if times[N] >= 1800:
        print(int(times[1]))
    else:
        print("impossible")

main()
### ERROR MESSAGE
=== Input ===
2 3
1 1800 9000 1800
1 2000 9200 1600
1 2200 9400 1400
=== Program Output ===
0
=== Expected Output ===
1800
### GPT-4 FEEDBACK #1
The issue with the code is that it prints `times[1]` instead of the earliest possible train start time. The code needs to get the minimum start time from the train_data list that has a valid delay for compensation.
# Number of passing repairs: 3/25
### GPT-4 FEEDBACK #2
The issue with the code is that it directly prints the value of `times[1]`, which is initialized to `0` and never updated. The correct output should be the starting time of the earliest train journey that satisfies the condition. To fix this, we need to track the earliest possible start time and update it if a valid train journey is found.
# Number of passing repairs: 0/25
### PARTICIPANT FEEDBACK #1
This code is clearly wrong, as times[1] is initialized to 0, never updated, and then returned as the answer. Also, in the sorting line, `int(x[0])` should be used in place of `x[0]`, and likewise `int(x[1])` in place of `x[1]`, because we want numerical-order sorting, not string-order sorting. The goal in this problem is to calculate, for each station, the earliest-arriving undelayed path to the station and the earliest-arriving delayed path to the station, take the difference between the two, and see if it results in a total delay of >= 1800 seconds (and then to return the earliest departure from station 1 for this path). Given this, the current code based around `times` can remain; this is the delayed time. However, the if-statement within the for-loop should be `<= S + L`, since the departure time is also delayed. Then `undelayed_times` should be tracked as well, updated in the for-loop when `undelayed_times[X] <= S` and set to a min() between the current undelayed time and `T` (not T + L). Finally, the if-statement for printing the time can use the difference between the delayed and undelayed times and see if any() are >= 1800, and we need to print the earliest-departing train for that path.
# Number of passing repairs: 5/25
### PARTICIPANT FEEDBACK #2
One of the main issues is that the code prints `times[1]`, which is set to 0 and never updated. However, there are also some larger issues with the code. First, instead of checking if you reach your destination within 30 minutes of the expected time, it checks if you reach your destination in 30 minutes. To fix this, the program would need to keep track of the time you'd reach your destination if there were no delays, which it currently doesn't. Second, instead of printing times[1], it should keep track of the start time of the train that leads to a delay and print this.
# Number of passing repairs: 0/25
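Since none of the feedback strings above led to a reliable repair, it may help to sketch one possible correct approach in the spirit of Participant Feedback #1, tracking both scheduled and delayed journeys. This is our own solution under one reading of the task (the traveller may take any combination of delayed trains, and the refund compares the best achievable delayed arrival against the booked scheduled arrival); the function name `earliest_refund_booking` is hypothetical. It agrees with both provided samples.

```python
import bisect

def earliest_refund_booking(N, trains):
    # trains: list of (X, S, T, L) tuples as in the input format.
    INF = float('inf')
    by_station = [[] for _ in range(N + 1)]
    for X, S, T, L in trains:
        by_station[X].append((S, T, L))

    # Forward pass: earliest *actual* (delayed) arrival at each station.
    actual = [INF] * (N + 1)
    actual[1] = 0
    for X in range(1, N):
        for S, T, L in by_station[X]:
            if S + L >= actual[X]:
                actual[X + 1] = min(actual[X + 1], T + L)

    # Backward pass: minimal *scheduled* arrival at station N for a
    # booking starting with each train. sched[X] holds the station's
    # departures sorted ascending plus suffix minima of those arrivals.
    sched = [None] * (N + 1)
    first_leg = []  # (S, best scheduled arrival) for station-1 trains
    for X in range(N - 1, 0, -1):
        entries = []
        for S, T, L in by_station[X]:
            if X + 1 == N:
                best = T
            else:
                deps, suff = sched[X + 1]
                i = bisect.bisect_left(deps, T)  # first connection with S' >= T
                best = suff[i]
            entries.append((S, best))
            if X == 1:
                first_leg.append((S, best))
        entries.sort()
        deps = [e[0] for e in entries]
        suff = [INF] * (len(entries) + 1)
        for i in range(len(entries) - 1, -1, -1):
            suff[i] = min(suff[i + 1], entries[i][1])
        sched[X] = (deps, suff)

    # Refund earned iff even the best delayed arrival is >= 1800 s late.
    candidates = [S for S, best in first_leg if best <= actual[N] - 1800]
    return min(candidates) if candidates else None  # None means "impossible"
```

On the first sample this returns 1800, and on the second it returns None (printed as "impossible").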
Listing 8: Task 4333, Program A.
### SPECIFICATION
There is a square in the xy-plane. The coordinates of its four vertices are (x_1, y_1), (x_2, y_2), (x_3, y_3) and (x_4, y_4) in counter-clockwise order.
(Assume that the positive x-axis points right, and the positive y-axis points up.)
Takahashi remembers (x_1, y_1) and (x_2, y_2), but he has forgotten (x_3, y_3) and (x_4, y_4).
Given x_1, x_2, y_1, y_2, restore x_3, y_3, x_4, y_4. It can be shown that x_3, y_3, x_4 and y_4 uniquely exist and have integer values.
----- Constraints -----
- |x_1|, |y_1|, |x_2|, |y_2| \leq 100
- (x_1, y_1) != (x_2, y_2)
- All values in input are integers.
----- Input -----
Input is given from Standard Input in the following format:
x_1 y_1 x_2 y_2
----- Output -----
Print x_3, y_3, x_4 and y_4 as integers, in this order.
----- Sample Input -----
0 0 0 1
----- Sample Output -----
-1 1 -1 0
(0,0), (0,1), (-1,1), (-1,0) are the four vertices of a square in counter-clockwise order.
Note that (x_3, y_3) = (1,1), (x_4, y_4) = (1,0) is not accepted, as the vertices are in clockwise order.
### INCORRECT PROGRAM
x1, y1, x2, y2 = map(int, input().split())
x3 = x2 + (y2 - y1)
y3 = y2 - (x2 - x1)
x4 = x1 + (y2 - y1)
y4 = y1 - (x2 - x1)
print(x3, y3, x4, y4)
### ERROR MESSAGE
=== Input ===
0 0 0 1
=== Program Output ===
1 1 1 0
=== Expected Output ===
-1 1 -1 0
### GPT-4 FEEDBACK #1
The issue with the code is that it doesn't consider the counter-clockwise order of the vertices. To fix this, we need to swap the signs of (y2 - y1) and (x2 - x1) in the calculations.
# Number of passing repairs: 25/25
### GPT-4 FEEDBACK #2
The following is a concise explanation of the issue: The code above does not ensure that the vertices are in counter-clockwise order. It calculates x3, y3, x4, and y4 in the wrong order, which results in the coordinates being in clockwise order instead. To fix this, we need to swap the calculations for x3 and x4, and y3 and y4.
# Number of passing repairs: 0/25
### PARTICIPANT FEEDBACK #1
The program calculates vertices for a clockwise-oriented square, not counter-clockwise. To fix this, (y2 - y1) and (x2 - x1) should be negated in the equations.
# Number of passing repairs: 25/25
### PARTICIPANT FEEDBACK #2
The calculation for y3 and y4 incorrectly subtracts the x-direction vector from y0 and y1, respectively. In each of these cases, the calculation can be fixed by changing the first subtraction to addition.
# Number of passing repairs: 0/25
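Both 25/25 feedback strings above point at the same one-line idea: the edge vector must be rotated counter-clockwise rather than clockwise. A minimal repaired sketch (the function name `restore_square` is ours):

```python
def restore_square(x1, y1, x2, y2):
    # Rotating (dx, dy) by +90 degrees (counter-clockwise) gives (-dy, dx),
    # which is the negation the feedback calls for.
    dx, dy = x2 - x1, y2 - y1
    x3, y3 = x2 - dy, y2 + dx
    x4, y4 = x1 - dy, y1 + dx
    return x3, y3, x4, y4
```

For the sample input 0 0 0 1 this yields (-1, 1, -1, 0), as expected.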
Listing 9: Task 4347, Program B.
### SPECIFICATION
One day, $n$ people ($n$ is an even number) met on a plaza and made two round dances, each round dance consists of exactly $\frac{n}{2}$ people. Your task is to find the number of ways $n$ people can make two round dances if each round dance consists of exactly $\frac{n}{2}$ people. Each person should belong to exactly one of these two round dances.
A round dance is a dance circle consisting of $1$ or more people. Two round dances are indistinguishable (equal) if one can be transformed to another by choosing the first participant. For example, round dances $[1, 3, 4, 2]$, $[4, 2, 1, 3]$ and $[2, 1, 3, 4]$ are indistinguishable.
For example, if $n = 2$ then the number of ways is $1$: one round dance consists of the first person and the second one of the second person.
For example, if $n = 4$ then the number of ways is $3$. Possible options: one round dance - $[1,2]$, another - $[3,4]$; one round dance - $[2,4]$, another - $[3,1]$; one round dance - $[4,1]$, another - $[3,2]$.
Your task is to find the number of ways $n$ people can make two round dances if each round dance consists of exactly $\frac{n}{2}$ people.
----- Input -----
The input contains one integer $n$ ($2 \le n \le 20$), where $n$ is an even number.
----- Output -----
Print one integer, the number of ways to make two round dances. It is guaranteed that the answer fits in the $64$-bit integer data type.
----- Examples -----
Input
2
Output
1
Input
4
Output
3
Input
8
Output
1260
Input
20
Output
12164510040883200
### INCORRECT PROGRAM
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n - 1)

n = int(input())
ans = (factorial(n) // (n // 2)**2) // (2 * n)
print(ans)
### ERROR MESSAGE
=== Input ===
2
=== Program Output ===
0
=== Expected Output ===
1
### GPT-4 FEEDBACK #1
The issue with the code is that it uses integer division when calculating the final answer, which may cause the result to be rounded down incorrectly. To fix this, we can use float division and then convert the result to an integer.
# Number of passing repairs: 0/25
### GPT-4 FEEDBACK #2
The following is a concise explanation of the issue: The code attempts to calculate the number of ways using integer division, which may result in a loss of precision and incorrect results. To fix the issue, change integer division to regular division, then round the result to the nearest integer using `round()`.
# Number of passing repairs: 0/25
### PARTICIPANT FEEDBACK #1
The problem is that the formula used is incorrect. For n <= 4, n! is less than n^3 / 2, and so this is why we see a 0. There is an erroneous additional division by n, which is the true reason for the wrong answer.
# Number of passing repairs: 25/25
### PARTICIPANT FEEDBACK #2
On line 11, the program calculates a term with a correct numerator, but an incorrect denominator. It divides by (2 * n), when it should just divide by 2.
# Number of passing repairs: 25/25
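As both participants observe, the numerator n! and the division by (n/2)^2 are correct; the only bug is dividing by 2n instead of 2. A repaired sketch (the function name `two_round_dances` is ours):

```python
from math import factorial

def two_round_dances(n):
    # n! orderings, divided by the n/2 rotations of each circle
    # and by 2 for swapping the two (unlabeled) circles.
    return factorial(n) // (n // 2) ** 2 // 2
```

This reproduces all four sample outputs (1, 3, 1260 and 12164510040883200).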
Listing 10: Task 4855, Program A.
### SPECIFICATION
The impact factor of a scientific journal is a measure reflecting the average number of citations to articles published in science journals. For this task we are using a simplified formula for calculating the impact factor: \[ \frac{\text{Total count of citations received by articles published in the journal}}{\text{Total number of articles published}} \]
Rounding is always performed upwards. For example, the impact factor of the "Journal for ore research and time wasting", which published $38$ articles cited $894$ times, is $894/38 = 23.53$, rounding up to $24$.
You are the editor of one scientific journal. You know how many articles you are going to publish and the owners are pushing you to reach a specific impact factor. You are wondering how many scientists you will have to bribe to cite your articles to meet the owners' demands. Since money is tight, you want to bribe the minimal number of scientists. Each bribed scientist buys you a single citation.
----- Input -----
First and only line of input will contain $2$ integers, $A$ ($1 \leq A \leq 100$), the number of articles you plan to publish, and $I$ ($1 \leq I \leq 100$), the impact factor the owners require.
----- Output -----
The first and only line of output should contain one integer, the minimal number of scientists you need to bribe.
----- Examples -----
Sample Input 1:
38 24
Sample Output 1:
875
Sample Input 2:
1 100
Sample Output 2:
100
### INCORRECT PROGRAM
import math

A, I = map(int, input().split())
needed_citations = math.ceil(A * (I - 0.5))
scientists_to_bribe = needed_citations - A * (I - 1)
print(scientists_to_bribe)
### ERROR MESSAGE
=== Input ===
38 24
=== Program Output ===
19
=== Expected Output ===
875
### GPT-4 FEEDBACK #1
The following is a concise explanation of the issue: The calculation for the number of scientists to bribe is incorrect. It should be calculated by first finding the minimum number of citations needed, `A * I`, and then subtracting the current number of citations, `A * (I - 1)`.
# Number of passing repairs: 0/25
### GPT-4 FEEDBACK #2
The following is a concise explanation of the issue: The code calculates the required number of scientists to bribe by subtracting the current number of citations from the number of needed citations. However, it incorrectly calculates the current number of citations.
# Number of passing repairs: 1/25
### PARTICIPANT FEEDBACK #1
The scientists_to_bribe variable on line 3 is unnecessary. It's sufficient to just do A * (I - 1) + 1.
# Number of passing repairs: 17/25
### PARTICIPANT FEEDBACK #2
The program seems to have a conceptual misunderstanding. Specifically, the number of scientists needed to bribe is equal to the number of required citations (`citations_needed`). However, the calculation of `citations_needed` is also incorrect -- it implies rounding the impact factor conventionally (e.g. 3.49 goes to 3, 3.51 goes to 4), whereas rounding here is a ceiling function on the impact factor. As such, the actual formula for `citations_needed` should be `A * (I - 1) + 1`.
# Number of passing repairs: 25/25
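Participant Feedback #2's closed-form fix can be stated directly: since the impact factor is the citation count divided by A and rounded up, the smallest citation total whose ceiling-average reaches I is A * (I - 1) + 1. A minimal sketch (the function name `min_bribes` is ours):

```python
def min_bribes(A, I):
    # ceil(c / A) >= I first holds at c = A * (I - 1) + 1,
    # and every bribed scientist buys exactly one citation.
    return A * (I - 1) + 1
```

This gives 875 and 100 on the two samples.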