# Chain of Code: Reasoning with a Language Model-Augmented Code Emulator
**Authors**: Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, Brian Ichter
## Abstract
Code provides a general syntactic structure to build complex programs and perform precise computations when paired with a code interpreter; we hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks (Chen et al., 2022; Nye et al., 2021; Austin et al., 2021), but also for semantic ones (and in particular, those that are a mix of both). For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may struggle to write an implementation for detect_sarcasm(string) that can be executed by the interpreter (handling the edge cases would be insurmountable). However, LMs may still produce a valid solution if they not only write code, but also selectively "emulate" the interpreter by generating the expected output of detect_sarcasm(string). In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode, in which the interpreter can explicitly catch undefined behaviors and hand them off to an LM to simulate (as an "LMulator"). Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. In a nutshell, CoC broadens the scope of reasoning questions that LMs can answer by "thinking in code".
https://chain-of-code.github.io/
Direct answer only
Q: How many countries have I been to? I've been to Mumbai, London, Washington, Grand Canyon, ...
A: 32 (20%, ✗), 29 (10%, ✗), 52 (10%, ✓), ...
Chain of Thought
Q: Let's think step by step. How many countries have I been to? I've been to Mumbai, London, ... We'll group by country and count: 1. India: Mumbai, Delhi, Agra 2. UK: London, Dover, Edinburgh, Skye 3. USA: Washington, Grand Canyon, ...
A: 61 (20%, ✗), 60 (20%, ✗), 52 (10%, ✓), ...
Chain of Code (Ours)
Q: How many countries have I been to? I've been to Mumbai, London, Washington, Grand Canyon, Baltimore, ...
places, countries = ["Mumbai", ...], set()    delta state: {places = ["Mumbai", ...], countries = set()}
for place in places:    delta state: {place = "Mumbai"}
    country = get_country(place)    delta state: {country = "India"}
    countries.add(country)    delta state: {countries = {"India"}}
answer = len(countries)    delta state: {answer = 52}
A: 52 (100%, ✓)
Figure 1: Chain of Code generates code and reasons through an LM-augmented code emulator. Lines evaluated with Python are in red and with an LM are in purple. The full query is in Fig. LABEL:fig:intro_query.
[Image: extracted/5762267/fig/all_tasks_direct.png — per-task Δ w.r.t. average human rater (%) for Direct answer prompting.]
(a) Direct answer only
[Image: extracted/5762267/fig/all_tasks_cot.png — per-task Δ w.r.t. average human rater (%) for Chain of Thought.]
(b) Chain of Thought
[Image: extracted/5762267/fig/all_tasks_coc_interweave_no_title.png — per-task Δ w.r.t. average human rater (%) for Chain of Code.]
(c) Chain of Code (Ours)
Figure 2: Overall results on BIG-Bench Hard compared to human performance (Srivastava et al., 2022).
## 1 Introduction
Language models (LMs) at a certain scale exhibit the profound ability to solve complex reasoning questions (Brown et al., 2020; Wei et al., 2022a), from writing math programs (Drori et al., 2022) to solving science problems (Lewkowycz et al., 2022). Notably, these capabilities have been shown to improve with Chain of Thought (CoT) prompting (Wei et al., 2022b), whereby complex problems are decomposed into a sequence of intermediate reasoning steps. CoT excels at semantic reasoning tasks, but tends to struggle with questions that involve numeric or symbolic reasoning (Suzgun et al., 2022; Mirchandani et al., 2023). Subsequent work addresses this by prompting LMs (e.g., trained on GitHub (Chen et al., 2021)) to write and execute code (Chen et al., 2022; Nye et al., 2021; Austin et al., 2021). Code in particular is advantageous because it provides both (i) a general syntactic structure to build and encode complex programs (Liang et al., 2023) (e.g., logic structures, functional vocabularies; in ways that are Turing complete), and (ii) an interface by which existing APIs, paired with an interpreter, can be used to perform precise algorithmic computations (e.g., from multiplication of large numbers to sorting an array of size 10,000) that a language model trained only to mimic the statistically most likely next token would otherwise struggle to produce.
While writing and executing code may improve LM reasoning performance across a wide range of arithmetic tasks, this particular approach contends with the fact that many semantic tasks are rather difficult (and at times, nearly impossible) to express in code. For example, it remains unclear how to write a function that returns a boolean when it detects sarcasm in a string (Suzgun et al., 2022) (handling the edge cases would be insurmountable). Perhaps fundamentally, using LMs to write programs in lieu of multi-step textual reasoning inherently assumes that the intermediate reasoning traces (expressed in lines of code) all need to be executable by an interpreter. Is it possible to lift these restrictions to get the best of both reasoning in code and reasoning in language?
In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension to improve LM code-driven reasoning, in which the LM not only writes a program, but also selectively "simulates" the interpreter by generating the expected output of certain lines of code (those the interpreter could not execute). The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode that at runtime can be explicitly caught and handed off to an LM to emulate; we term this an LMulator (a portmanteau of LM and emulator). For example, given the task "in the above paragraph, count how many times the person was sarcastic," we can in-context prompt the LM to write a program that may call helper functions such as is_sarcastic(sentence), to which the LM makes a linguistic prediction and returns the result as a boolean output, which then gets processed with the rest of the program. Specifically, we formulate LM reasoning as the following process (illustrated in Figure 1): the LM writes code, the interpreter steps through and executes each line of code (in red), or, if execution fails, simulates the result with the LM (in purple) and updates the program state (in green). CoC inherits the benefits of both (i) writing executable code (where precise algorithmic computations are left to an interpreter), and (ii) writing pseudocode for semantic problems and generating its outputs (which can be thought of as a simple formatting change, to which LMs are robust (Min et al., 2022)), enabling the LM to "think in code".
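To make this concrete, below is a minimal sketch (our illustration in Python, with a hypothetical paragraph and a placeholder heuristic) of the kind of program CoC might generate for the sarcasm-counting example. In actual CoC generation, is_sarcastic would be left as undefined pseudocode and its return values supplied by the LMulator; a stub is included here only so the sketch runs on its own.

```python
def is_sarcastic(sentence: str) -> bool:
    # Stand-in heuristic, not the method: in CoC this helper stays undefined
    # (semantic pseudocode) and the LMulator predicts its boolean output.
    return "you don't say" in sentence.lower()

paragraph = [
    "The weather was lovely.",
    "Oh great, another meeting. You don't say.",
    "We left at noon.",
]
answer = sum(int(is_sarcastic(sentence)) for sentence in paragraph)
print(answer)  # 1 for this toy paragraph
```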
Extensive experiments demonstrate that CoC is applicable to a wide variety of challenging numerical and semantic reasoning questions, and outperforms a number of popular baselines. In particular, we find that it achieves high performance on BIG-Bench Hard tasks (Suzgun et al., 2022), outperforming average human raters overall and even the best human raters on an algorithmic subset of tasks, and, to the best of our knowledge, setting a new state of the art. We further show that both code interpreter execution and language model execution simulation are necessary for this performance, and that the approach scales well with large and small models alike, contrary to prompting techniques like Chain of Thought whose benefits only emerge at scale. We then demonstrate how Chain of Code can serve as a general-purpose reasoner via a cross-task prompting benchmark, which, in contrast to prior work, uses prompts from different families of problems as context, providing only the structure of the response (as opposed to the solution itself). Finally, we show that CoC is complementary to more advanced instruction-tuned chat models, robust against prompt variation, and applicable to domains beyond language reasoning, such as robotics. This work underscores how one may leverage the structure and computational power of code together with the reasoning abilities of language models to enable a "best of both worlds" reasoner.
## 2 Chain of Code: Reasoning with an LMulator
In this section, we describe Chain of Code (CoC), an approach that leverages the ability of language models to write code, to reason, and to use an LM-augmented code emulator (an LMulator) to simulate running code. We start with background in Section 2.1, then overview the method in Section 2.2, its implementation in Section 2.3, and finally its capabilities in Section 2.4.
### 2.1 Preliminaries
Briefly, we overview some background on LM reasoning. Many of these reasoning techniques have been enabled by in-context learning (Brown et al., 2020), which provides the model with a few demonstrative examples at inference time, rather than updating any weights with gradients. These examples serve to provide context and format for the setting, enabling the model to emulate them while adapting to a new query. This property has been instrumental in easily applying LMs to new tasks, as it allows rapid adaptation and requires minimal data.
Through in-context learning, approaches have been developed to leverage human thought processes and to use tools to improve the performance of language models. We outline three such approaches that provide the foundations for Chain of Code. Chain of Thought (CoT) (Wei et al., 2022b), ScratchPad (Nye et al., 2021), and Program of Thoughts (Chen et al., 2022) demonstrated the efficacy of breaking problems down into substeps. For CoT these substeps are in natural language, mirroring one's thought process when stepping through a complicated problem. ScratchPad, on the other hand, maintains a program state of intermediate steps when simulating the output of code, resulting in an LM acting as a code interpreter. Program of Thoughts (Chen et al., 2022) focused on generating the code itself, which is then executed by a code interpreter to solve reasoning problems. Each of these is visualized in Figure 3(a-c).
(a) Chain of Thought
(b) Program of Thoughts
(c) ScratchPad
Figure 3: Previous reasoning methods: To solve advanced problems, (a) Chain of Thought prompting breaks the problem down into intermediate steps, (b) Program of Thoughts prompting writes and executes code, and (c) ScratchPad prompting simulates running already written code by tracking intermediate steps through a program state. Our reasoning method: Chain of Code first (d) generates code or pseudocode to solve the question and then (e) executes the code with a code interpreter if possible, and with an LMulator (language model emulating code) otherwise. Blue highlight indicates LM generation, red highlight indicates LM-generated code being executed, and purple highlight indicates the LMulator simulating the code via a program state in green.
(d) Chain of Code Generation (Ours)
(e) Chain of Code Execution (Ours)
### 2.2 Chain of Code
Inspired by how a human may reason through a particularly complex problem with a mix of natural language, pseudocode, and runnable code, or how a researcher may develop a new general algorithm through a code-based formalism and then apply it to a problem, Chain of Code proceeds in two steps: (1) Generation, in which, given the question to solve, an LM generates code to reason through the problem, and (2) Execution, in which the code is executed with a code interpreter when possible and with an LM when not. See Section 2.3 for more details on the specific implementation.
Chain of Code Generation Given a problem to solve, CoC generates reasoning substeps in the structure of code. This code provides the framework for reasoning through the problem, and may be in the form of explicit code, pseudocode, or natural language. Figure 3(d) walks through a potential generation to solve an object counting problem from BIG-Bench.
Chain of Code Execution A core contribution of CoC is not just the generation of reasoning code, but the manner in which it is executed. Once the code is written, an attempt is made to run it with a code interpreter; in this work we consider Python, but the approach is general to any interpreter. If the code executes successfully, the program state is updated and execution continues. If the code is not executable or raises an exception, the language model is instead used to simulate the execution. The program state is subsequently updated by the language model's outputs and execution continues. Herein, we refer to this as an LMulator, a portmanteau of LM and code emulator. This relatively simple change enables a variety of new applications for code that mix semantics and numerics. Figure 3(e) shows how the generated code is run, maintaining the program state and switching between the Python executor and the LMulator.
### 2.3 Chain of Code Implementation
While the generation implementation is straightforward prompting and language model generation, the execution implementation is slightly more complex. Our implementation is based on using Python's try and except and maintaining a program state. CoC steps through the code line by line. If the line is executable by a code interpreter, it is executed, the program state is updated, and the program continues. If it is not executable by a code interpreter, a language model is given the context of the program (the question, the prior lines, and the history of the program state) and generates the next program state. This emulation can also leverage chain of thought to determine how to respond. That generated program state is then shared with the code interpreter as well. This sharing of program state interweaves the code interpreter and the language model simulator in a manner that supports arbitrary interweaving, even control flow like for-loops and if-statements. This continues until the entire code is run, and the answer is retrieved as the value of the variable named answer, or, in case of irrecoverable errors, with the language model outputting A: answer.
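To ground this description, the following is a minimal, straight-line-only sketch of such an interweaved executor (our illustration, not the authors' released implementation); it omits the control-flow handling described above, and query_lm stands in for whatever LM completion call is used, returning the next program state as a Python dict literal.

```python
import ast
from typing import Callable


def run_chain_of_code(
    question: str,
    code_lines: list[str],
    query_lm: Callable[[str], str],
) -> dict:
    """Step through code line by line: run each line with Python when possible,
    otherwise ask the LM (the LMulator) to predict the next program state."""
    state: dict = {}
    history: list[str] = []
    for line in code_lines:
        try:
            exec(line, {}, state)  # try the Python interpreter first
        except Exception:
            # LMulator fallback: give the LM the question, the prior lines, and
            # the program-state history, and let it predict the next state.
            prompt = (
                f"Q: {question}\n" + "\n".join(history) + f"\n{line}\ndelta state:"
            )
            state = ast.literal_eval(query_lm(prompt))  # assumes a dict literal reply
        history.append(f"{line}  # state: {state}")
    return state
```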
To illustrate with a brief example, the code answer = 0; answer += is_sarcastic("you don't say"); answer += 1; would be executed as follows: (1) Python would execute the first line answer = 0; and update the program state to {answer = 0}; (2) Python would attempt to execute the second line and fail, and thus the LMulator would simulate the code answer += is_sarcastic("you don't say"); by generating the program state {answer = 1}, which would be updated in the program; (3) Python would execute the last line answer += 1; and update the program state to {answer = 2}; (4) the answer would be retrieved as 2.
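Run through the sketch above with a canned response standing in for the LMulator (a real system would call an actual LM here), the same example plays out as described:

```python
def fake_lm(prompt: str) -> str:
    # Stand-in for the LMulator; per the walkthrough above, it returns the
    # program state after the sarcasm line.
    return "{'answer': 1}"


lines = [
    "answer = 0",
    'answer += is_sarcastic("you don\'t say")',  # not executable by Python
    "answer += 1",
]
final_state = run_chain_of_code(
    "In the above paragraph, count how many times the person was sarcastic.",
    lines,
    query_lm=fake_lm,
)
print(final_state["answer"])  # 2
```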
### 2.4 Chain of Code Abilities
Chain of Code has several attractive properties:
1. It enables code use in entirely new regimes, by combining the advantages of code with the powerful semantic and commonsense knowledge of language models, which can easily express rules that are challenging to express in code (e.g., which foods are fruits?). Such an ability may have benefits beyond reasoning problems and its flexibility enables executing expressive language, such as pseudocode.
2. It leverages the ability of language models to write code, a particular strength of recent language models due to the high-quality code data available.
3. It inherits many of the benefits of reasoning with code, both the formal yet expressive structure of code (e.g., Turing completeness) and the powerful computational tools available to code (whether simply multiplying two numbers, calculating $\sqrt[5]{12121}$, or simulating physics); see the snippet after this list.
4. It inherits many of the benefits of techniques that reason via intermediate steps, such as Chain of Thought. These techniques enable the language model to use more computation when necessary to solve a problem, as well as provide more interpretability.
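As a trivial illustration of the computational-tools point in item 3 (a one-off example, not from the paper), exact numeric results that an LM may fumble token by token are a single line for the interpreter:

```python
# Exact computation handed to the interpreter rather than predicted token-by-token.
print(2_718 * 31_415)    # large-number multiplication
print(12121 ** (1 / 5))  # fifth root of 12121, approximately 6.557
```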
Empirically, we observe in Section 3 that these benefits result in significant improvements in reasoning performance over a variety of challenging tasks.
## 3 Experimental Evaluation
We select challenging problems requiring varied types of reasoning, whether arithmetic, commonsense, or symbolic, to answer the following questions:
1. How well does CoC perform across a variety of tasks?
2. On which types of problems does CoC perform best?
3. How does each aspect of CoC affect performance?
4. How does CoC scale with model size?
5. How does CoC perform as a general-purpose reasoner, with prompt examples from different problems rather than the same problem (which we term cross-task prompting)?
6. How can CoC be used with instruction-tuned chat models?
7. How robust is CoC against prompt variation?
8. Can CoC be applied beyond language reasoning tasks?
We first discuss the approaches, ablations, and baselines considered in Section 3.1, then the tasks considered in Section 3.2, and finally the results in Section 3.3.
### 3.1 Baselines and Ablations
We consider our main method to be CoC (Interweave), also referred to as CoC (Ours), though we also propose two variants with simpler implementations and modestly lower performance: CoC (try Python except LM) and CoC (try Python except LM state). These two variants attempt to run the entire generated code with Python (rather than line by line) and, if it fails, simulate the code execution with the LMulator, outputting a final answer or an intermediate state trace, respectively. We also perform the following ablations, some of which are comparable to previous work as noted. In CoC (Python), Python is used to run the entire generated code, and if the code is not executable it is marked as a failure; this can be thought of as a comparison to Program of Thoughts (Chen et al., 2022) or Program-aided language models (Gao et al., 2023). We note that in many cases this baseline is particularly challenged, as writing executable code for some of the reasoning problems becomes nearly impossible (e.g., writing code to judge if a phrase is sarcastic), but one may focus on the results for Algorithmic-only tasks for a fairer comparison. In CoC (LM) the code is interpreted by an LMulator outputting the final answer, and in CoC (LM state) the code is interpreted by an LMulator outputting a state trace of intermediate steps; this can be thought of as ScratchPad prompting for reasoning (Nye et al., 2021). Note that the last two ablations do not leverage the Python interpreter.
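For contrast with the interweaved executor sketched in Section 2.3, a sketch of the simpler CoC (try Python except LM) variant (again our illustration, not released code) might look as follows; the LM-state variant would instead ask the LM for an intermediate state trace before the final answer.

```python
from typing import Callable


def run_try_python_except_lm(
    question: str, code: str, query_lm: Callable[[str], str]
) -> str:
    """Try to run the entire generated program with Python; if anything fails,
    fall back to simulating the whole execution with the LM."""
    state: dict = {}
    try:
        exec(code, {}, state)  # run the full program at once
        return str(state["answer"])
    except Exception:
        prompt = (
            f"Q: {question}\n{code}\n"
            "Simulate running the code above and give the final answer.\nA:"
        )
        return query_lm(prompt)
```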
We also compare against the following baselines. In Direct question answering the LM simply responds to the question with a final answer. In Chain of Thought prompting (CoT) the LM uses intermediate steps to solve the task; we use CoT as the representative of the broader family of substep-prompting techniques (Kojima et al., 2022; Zhou et al., 2022a), as its prompts are readily available.
### 3.2 Tasks
We consider BIG-Bench Hard (BBH) (Suzgun et al., 2022), a subset of the most challenging tasks from BIG-Bench (Srivastava et al., 2022). These tasks were specifically selected for their difficulty for language models, and the dataset provides human-rater baselines and a set of Chain of Thought prompts. The 23 tasks require semantic reasoning (e.g., "Movie Recommendation"), numerical reasoning (e.g., "Multi-Step Arithmetic"), and a combination of both (e.g., "Object Counting"). As such, they enable us to study the efficacy of CoC across varied problems, not just those for which coding is a natural fit. Several prompts are shown in Figure A1. We also show results for the grade-school math (GSM8K) benchmark (Cobbe et al., 2021) in Section A.2, although we find that these problems are primarily solvable algorithmically through code alone.
These tasks are evaluated with few-shot prompting, whereby three examples from the same problem family are provided as context. We also introduce a new evaluation setting, cross-task prompting, whereby three examples of different problems are provided as context. As such, the language model has in-context examples of the format of reasoning, but isn't provided explicit instructions on how to reason. We see this as an indicative signal for a general-purpose reasoner, which in many real-world applications (e.g., chatbots) would be asked to reason across a wide variety of tasks.
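A sketch of how the two prompting settings might be assembled (our reading of the setup, with placeholder exemplar strings; the task names follow BBH tasks mentioned above):

```python
# Each exemplar string would be a full question plus its CoC reasoning trace.
EXAMPLES = {
    "object_counting": ["<exemplar 1>", "<exemplar 2>", "<exemplar 3>"],
    "multistep_arithmetic": ["<exemplar 1>", "<exemplar 2>", "<exemplar 3>"],
    "movie_recommendation": ["<exemplar 1>", "<exemplar 2>", "<exemplar 3>"],
}


def build_prompt(task: str, query: str, cross_task: bool) -> str:
    if cross_task:
        # Cross-task: one exemplar from each other task family, so the model
        # sees the reasoning format but no demonstrations of this problem type.
        exemplars = [EXAMPLES[t][0] for t in EXAMPLES if t != task]
    else:
        # Few-shot: three exemplars from the same task family.
        exemplars = EXAMPLES[task][:3]
    return "\n\n".join(exemplars + [f"Q: {query}\nA:"])
```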
The models used herein include the OpenAI family of models: text-ada-001, text-babbage-001, text-curie-001, and text-davinci-003 (in plots we denote these as a-1, b-1, c-1, and d-3). We also consider PaLM-2's code finetuned variant (Chowdhery et al., 2022; Google et al., 2023). For instruction-tuned models, we compare to recent variants of GPT (gpt-3.5-turbo and gpt-4) with the chat completion mode, run in October 2023 and January 2024. The results below use the text-davinci-003 model unless otherwise stated.
### 3.3 Results
Question 1: Overall Performance. The overall performance of CoC is shown in Figure 2 and Table 1 (with full results in Table A1). We see that CoC outperforms other approaches, both in the number of tasks on which it exceeds the human baseline and in the margin by which it does so. Indeed, CoC's 84% is, to the best of our knowledge, the state of the art (Gemini Team, 2023). In fact, when combined with gpt-4, CoC achieves 91% (see Table A4). In several tasks CoC vastly outperforms the human baseline and other methods, achieving nearly 100%; generally for these tasks the result is complicated to express in language but trivial in code (e.g., a multi-step arithmetic task such as Q: $((-3+5\times 8\times-4)-(9-8\times-7))=$). We also observe that CoT outperforms the human baseline on a number of tasks, while the Direct answer fares poorly.
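Indeed, the multi-step arithmetic expression quoted above reduces to one executable line for the interpreter:

```python
# The quoted multi-step arithmetic example, evaluated exactly by Python.
print((-3 + 5 * 8 * -4) - (9 - 8 * -7))  # -228
```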
Table 1: Overall performance (%) on BIG-Bench Hard with few-shot prompting, both single-task and cross-task. The delta compared to direct prompting is shown in parentheses. PaLM 2-S* denotes the code finetuned variant (Google et al., 2023).
| Prompt | Human | text-davinci-003: Direct | text-davinci-003: CoT | text-davinci-003: CoC (Ours) | PaLM 2-S*: Direct | PaLM 2-S*: CoT | PaLM 2-S*: CoC (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Single task | 68 | 55 | 72 (+17) | 84 (+29) | 49 | 61 (+12) | 78 (+29) |
| Cross task | - | 50 | 55 (+5) | 61 (+11) | 45 | 47 (+2) | 47 (+2) |
Question 2: Problem Type. Figure 4 breaks the results down by problem type; the task labels are shown in Table A1. First, we isolate problems that are primarily algorithmic or primarily natural language (these categories were identified by Suzgun et al. (2022)). We see that on algorithmic tasks CoC performs particularly well, while on natural language tasks CoC performs on par with CoT. This is particularly encouraging, because one may expect these language-oriented tasks to be a worse fit for code. The key is that our method offers the flexibility of using an LMulator to simulate the output of code execution, retaining the semantic reasoning capabilities of LMs for natural language problems.
Figure 4 additionally breaks the tasks down into categories that capture how much each question's response varies and whether the code can be fully executed by Python (denoted Python only vs. Python + LM). For some tasks within the benchmark, each question has the same code or Chain of Thought, with the only variation being the inputs; in this case we label the task (repeated code), and otherwise (new code). As expected, we see that when the code is repeated and run by Python, CoC gets nearly 100%, though these tasks (e.g., multi-step arithmetic) seem to be among the most challenging for the other baselines, including human raters. The other categories are more challenging for CoC; however, in each we still see a benefit over the baselines.
[Image: extracted/5762267/fig/by_task_type.png — average performance (%) of Human (Avg.), Direct, CoT, and CoC (ours) across task groupings: All, NLP, Alg, Python only (repeated code), Python only (new code), Python + LM (repeated code), Python + LM (new code).]
Figure 4: Average performance across different baselines grouped by task type, indicating the problem type and how CoC is generated & executed.
Question 3: Ablations. Figures 5 and 6 and Table 2 show the ablations performed to motivate each aspect of Chain of Code prompting. As one may expect, the approaches that execute Python (CoC (Interweave, Python, try Python except LM, try Python except LM state)) achieve 100% performance on several tasks: if the code is correct, then the model will be correct every time. However, the approach that relies on Python alone (CoC (Python)) performs poorly when applied to non-algorithmic tasks, failing on almost all of them. The CoC (Python) ablation is similar to recent works (Gao et al., 2023; Chen et al., 2022), which show that code reasoning performs well when applied to numerical problems. CoC without the Python interpreter (CoC (LM, LM state)) also fares poorly, though we see that maintaining a step-by-step state trace, as proposed in ScratchPad prompting (Nye et al., 2021), improves performance on each task.
We also show that the ablations CoC (try Python except LM, try Python except LM state), in which CoC first tries to run the entire code with Python and, if it fails, simulates the code with an LM, perform quite well. Again we see that maintaining a program state provides an improvement in performance. Given the minor degradation in performance, they are reasonable, simpler alternatives to the fully interweaved CoC. We note, though, that these ablations' performance would be much worse in cases where interweaving code and semantics is truly necessary; for example, if code is necessary to parse image inputs or to access an external database, but language is necessary to parse the results (see the robotics applications in Section A.6).
[Image: extracted/5762267/fig/by_task_type_ablation.png — average performance (%) of the CoC ablations (Interweave, Python, LM, LM state, try Python except LM, try Python except LM state) across the same task groupings as Figure 4.]
Figure 5: Chain of Code ablations on average performance grouped by task type.
[Image: extracted/5762267/fig/all_tasks_coc_interweave.png — per-task Δ w.r.t. average human rater (%) for CoC (Interweave).]
[Image: extracted/5762267/fig/all_tasks_coc_try_except_llm_state.png — per-task Δ w.r.t. average human rater (%) for CoC (try Python except LM state).]
[Image: extracted/5762267/fig/all_tasks_coc_try_except_llm.png — per-task Δ w.r.t. average human rater (%) for CoC (try Python except LM).]
[Image: extracted/5762267/fig/all_tasks_coc_python.png — per-task Δ w.r.t. average human rater (%) for CoC (Python).]
[Image: extracted/5762267/fig/all_tasks_coc_llm_state.png — per-task Δ w.r.t. average human rater (%) for CoC (LM state).]
[Image: extracted/5762267/fig/all_tasks_coc_llm.png — per-task Δ w.r.t. average human rater (%) for CoC (LM).]
Figure 6: Results across all BIG-Bench Hard tasks compared to human baseline (Srivastava et al., 2022). The tasks (x-axis) in each plot are sorted individually by performance. See Table A1 and Figure 5 for a breakdown by task type.
Table 2: Ablation overall performance (%) with few-shot prompting, both single-task and cross-task. The delta compared to the full method (Interweave) is shown in parentheses.
| Prompt | CoC (Interweave) | CoC (try Python except LM state) | CoC (try Python except LM) | CoC (Python) | CoC (LM state) | CoC (LM) |
| --- | --- | --- | --- | --- | --- | --- |
| Single task | 84 | 82 (-2) | 80 (-4) | 48 (-36) | 63 (-21) | 57 (-27) |
| Cross task | 61 | 57 (-4) | 60 (-1) | 35 (-26) | 49 (-12) | 50 (-11) |
Question 4: Scaling. Figure 7 shows the performance of CoC across various model sizes. We observe that, similar to Chain of Thought prompting, the improvements of CoC increase as model size increases. In fact, for some of the algorithmic tasks, Chain of Code even outperforms the best human raters (who admittedly did not have access to code). Unlike Chain of Thought prompting, however, which only brings performance benefits for the largest model (d-3), CoC outperforms the direct question answering baseline for the smaller models (a-1, b-1, c-1) as well, suggesting that it is easier for smaller models to output structured code as intermediate steps than natural language.
Question 5: Cross-task Prompting. For cross-task prompting, we prompt the language models with a few examples from different problems. We see that performance drops for all methods in Figure 7 and Table 1. Despite this drop, CoC outperforms CoT and direct prompting at scale, nearly reaching average human performance. This is a promising indication for general-purpose reasoning, in which a model cannot expect to receive examples of similar problems in its prompt.
<details>
<summary>extracted/5762267/fig/by_size_all.png Details</summary>

### Visual Description
## Line Graphs: Model Performance Across Task Types
### Overview
The image contains four line graphs comparing model performance across different task types (All Tasks, NLP Tasks, Alg Tasks, Cross-task Prompted) as model scale increases. Each graph shares consistent axes and legend elements, with performance measured in percentage on the y-axis and model scale (parameters) on the x-axis.
### Components/Axes
- **X-axis**: Model scale (parameters) with categories: a-1, b-1, c-1, d-3
- **Y-axis**: Average performance (%) ranging from 0 to 100
- **Legends**:
- **Direct**: Gray line
- **CoT**: Blue line
- **Co (ours)**: Purple line
- **Co (Python)**: Red line
- **Human Best**: Dashed green line (100%)
- **Human Average**: Dashed yellow line (~70%)
- **Random**: Orange dashed line (~20%)
### Detailed Analysis
#### All Tasks
- **Co (ours)** (purple): Steep upward trend from ~30% (a-1) to ~85% (d-3)
- **Direct** (gray): Gradual increase from ~25% to ~55%
- **CoT** (blue): Moderate rise from ~25% to ~70%
- **Co (Python)** (red): Slow growth from ~15% to ~45%
- **Human Best**: Horizontal dashed green line at 100%
- **Human Average**: Horizontal dashed yellow line at ~70%
- **Random**: Horizontal orange dashed line at ~20%
#### NLP Tasks
- **Co (ours)** (purple): Sharp rise from ~30% to ~75%
- **Direct** (gray): Steady increase from ~30% to ~70%
- **CoT** (blue): Flat at ~35% then jumps to ~70%
- **Co (Python)** (red): Minimal growth from ~5% to ~15%
- **Human Best**: 100% dashed green line
- **Human Average**: ~70% dashed yellow line
- **Random**: ~20% orange dashed line
#### Alg Tasks
- **Co (ours)** (purple): Steep ascent from ~30% to ~95%
- **Direct** (gray): Gradual rise from ~25% to ~40%
- **CoT** (blue): Moderate increase from ~20% to ~70%
- **Co (Python)** (red): Strong upward trend from ~25% to ~80%
- **Human Best**: 100% dashed green line
- **Human Average**: ~60% dashed yellow line
- **Random**: ~20% orange dashed line
#### Cross-task Prompted
- **Co (ours)** (purple): Rapid growth from ~25% to ~60%
- **Direct** (gray): Steady climb from ~25% to ~50%
- **CoT** (blue): Moderate rise from ~20% to ~55%
- **Co (Python)** (red): Slow increase from ~10% to ~35%
- **Human Best**: 100% dashed green line
- **Human Average**: ~50% dashed yellow line
- **Random**: ~20% orange dashed line
### Key Observations
1. **Co (ours)** consistently outperforms other methods across all task types, especially at larger model scales (d-3)
2. **Human Best** remains an unattainable upper bound (100%) in all cases
3. **Co (Python)** shows variable performance but generally lags behind Co (ours)
4. **Random** baseline remains constant at ~20% across all tasks
5. **Direct** method shows moderate improvement but underperforms Co approaches
### Interpretation
The data demonstrates that Co (ours) methodology achieves the highest performance gains as model scale increases, particularly in algorithmic and cross-task scenarios. This suggests that the Co (ours) approach effectively leverages larger models for complex reasoning tasks. The persistent gap between model performance and Human Best indicates significant room for improvement in AI reasoning capabilities. The Random baseline's consistency highlights the importance of structured reasoning methods over chance. Notably, Co (Python) underperforms in NLP tasks despite similar parameter scales, suggesting task-specific limitations in its implementation.
</details>
Figure 7: Average performance with model scaling, from text-ada-001 (smallest) to text-davinci-003 (largest).
Question 6: Instruction Tuned Models. We chose text-davinci-003, a completion model, as our primary evaluation model over more advanced instruction tuned models (gpt-3.5-turbo and gpt-4) because it is more amenable to few-shot prompting with examples, which is the main evaluation paradigm for BIG-Bench Hard. However, we still made our best attempt to evaluate our method with the instruction tuned models using two different setups. The first is zero-shot prompting, where we directly prompt the models via the system message to output direct answers, chains of thought, or pseudocode/code (which we optionally execute with the Python interpreter, feeding the results back). The second is few-shot prompting, where we coerce the models to behave like completion models via the system message and feed the few-shot examples as usual. In both cases, we demonstrate that CoC brings noticeable benefits with little modification needed. See Sec. A.4 for more details.
Question 7: Robustness of Chain of Code. We showed that CoC is generally robust to prompt variation by evaluating with different prompts independently written by three annotators on the same set of problems. Specifically, we select four representative tasks from BIG-Bench Hard that require generation of new code (as opposed to repeated code). While the performance of individual tasks has some variance, the average performance across the four tasks only varies within a few percentage points. See Sec. A.5 for more details.
Question 8: Beyond Language Reasoning. We showed that CoC is well-suited for tasks that require both semantic and algorithmic reasoning beyond language reasoning, such as robotics. The unique advantage of CoC in robotics is that it interacts seamlessly with the robot perception and control APIs via Python code, such as running object detectors or invoking parameterized robot skills, while performing semantic sub-tasks in an "inline" fashion (e.g., classifying whether a piece of trash is compostable before picking it up). When equipped with the necessary robot APIs and a single example in the prompt to teach LMs the format, CoC can solve seven different robot manipulation tasks in the real world, showcasing generalization to new objects, languages, and task domains. See Sec. A.6 for more details.
## 4 Related Work
Language Model Reasoning The abilities and applications of language models have seen significant progress, due to their overall performance (Chowdhery et al., 2022; Touvron et al., 2023; Radford et al., 2019; Gemini Team, 2023) and emergent capabilities (Wei et al., 2022a), such as few-shot prompting (Brown et al., 2020) and abstract reasoning (Wei et al., 2022b). Perhaps most related to this work, a number of works have leveraged prompting to improve reasoning (Dohan et al., 2022): Chain of Thought (Wei et al., 2022b) proposes to break a task down into intermediate reasoning steps, least-to-most (Zhou et al., 2022a) proposes a series of increasingly simpler problems, and ScratchPad (Nye et al., 2021) proposes to maintain a trace of intermediate results for interpreting code (this first demonstrated the code simulation ability of LMs required for our LMulator). Along these lines, "let's think step by step" (Kojima et al., 2022) uses a few key words to elicit such breakdowns (words that were later refined to "Take a deep breath and work on this problem step-by-step" in (Yang et al., 2023)). Beyond these, other approaches structure such step-by-step solutions into graphical structures (Yao et al., 2023; Besta et al., 2023), plans (Wang et al., 2023b; Ning et al., 2023), or mixture of expert-based sampling (Wang et al., 2022; Zhou et al., 2022b). CoC builds upon the intuition of these works, with the observation that code is a formal, structured approach to breaking a problem down into sub-steps with many advantages beyond natural language alone.
Language Model Tool Use Many recent works have proposed techniques for language models to use tools to respond to queries (Mialon et al., 2023). These tools have often been provided to the language model through prompting (Cobbe et al., 2021; Khot et al., 2022; Chowdhery et al., 2022; Drori et al., 2022; Yao et al., 2022), enabling tools like calculators for math problems, code interpreters, databases, and more. These tools can also provide feedback on novel modalities (Surís et al., 2023; Zeng et al., 2022). To expand the range of tools available, others have used external tool databases or finetuned language models (Schick et al., 2023; Qin et al., 2023; Parisi et al., 2022; Paranjape et al., 2023). As tool interfaces vary, feedback from the tool can also improve performance (Gou et al., 2023; Zhou et al., 2023). In this work we leverage the expressibility and generality of full code as well as its structure, by treating it both as a tool and as a framework.
Language Model Program Synthesis The ability of language models to write code is well known, and they have been applied as programming assistants (Chen et al., 2021) and shown to be capable programmers on their own (Austin et al., 2021; Li et al., 2022; Nijkamp et al., 2022). This ability has been applied to a variety of tasks outside of language alone, leveraging their ability to reason through code in new settings, such as robotics (Liang et al., 2023; Singh et al., 2023), embodied agents (Wang et al., 2023a), or vision (Surís et al., 2023). Others have specifically done so for reasoning, such as Program of Thoughts (Chen et al., 2022) and Program-aided Language Models (Gao et al., 2023), which generate code to solve numerical reasoning problems. Herein, we focus on the interplay between writing code, running code, and language models simulating code, thus enabling new regimes of language model code applications, such as semantic reasoning.
## 5 Conclusions, Limitations, and Future Work
We have proposed Chain of Code, an approach towards reasoning with language models through writing code, and executing code either with an interpreter or with a language model that simulates the execution (termed herein an LMulator) if the code is not executable. As such, CoC can leverage both the expressive structure of code and the powerful tools available to it. Beyond this, by simulating the execution of non-executable code, CoC can apply to problems nominally outside the scope of code (e.g., semantic reasoning problems). We have demonstrated that this approach outperforms baselines, and for some tasks even the best human raters, in a range of challenging language and numeric reasoning problems.
This work is not without its limitations. First, generating and executing in two steps, as well as interweaving code and language execution, requires additional context length and computation time. Second, though we have not seen any loss of performance for semantic tasks in aggregate, there are a few tasks in which code doesn't help, e.g., Ruin Names, which asks whether an edit to a name is humorous. Finally, our implementation for interweaving LM and code execution is quite simple: it tracks the program state in strings and parses those strings into Python's built-in data types (e.g., dict, tuple). As the method stands, the LM cannot modify custom Python objects while simulating code execution. In theory, this is doable as long as each such Python object has a serialization and deserialization method, e.g., using techniques like Protocol Buffers.
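To make the last point concrete, below is a minimal sketch of the kind of string-based state tracking described above; it is illustrative rather than our exact implementation, and `lmulator` is a hypothetical stand-in for the language model call.

```python
import ast

def run_line(line, state, lmulator):
    """Execute one program line, falling back to the LM when Python fails.

    `state` is a dict mapping variable names to values (built-in types only),
    mirroring the string-tracked program state described above. `lmulator` is
    any callable that maps (line, state) to a delta-state string such as
    "{'country': 'India'}"; here it is a placeholder for the language model.
    """
    try:
        exec(line, {}, state)          # try the Python interpreter first
    except Exception:
        delta = lmulator(line, state)  # hand the line off to the LM
        # Parse the LM's delta state back into built-in Python types.
        state.update(ast.literal_eval(delta))
    return state
```

For example, a line like `country = get_country(place)` raises a `NameError` under the interpreter, so it would be handed to the LM, whose returned delta state (e.g., `{'country': 'India'}`) is parsed back into built-in types; this is also why custom objects without a string serialization cannot currently be modified by the simulated execution.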
There are many avenues for future work with CoC. First, we believe that a unified code and language interpreter combines the commonsense of language models well with the analytical abilities, structure, and interpretability of code. Such a technology can thus enable applications of code and code-like reasoning to novel problem regimes, beyond simple reasoning. Second, we are interested in investigating the degree to which finetuning a language model to be an LMulator can benefit semantic code reasoning. Third, we see evidence that reasoning through many pathways yields improvements, which is a promising step forward. Finally, we believe this integration with code enables access to external modalities, such as vision or databases, and represents an interesting path for new applications (e.g., robotics, augmented reality).
## Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, most of which are related to the usage of large language models (LLMs). One aspect of Chain of Code that warrants further discussion is that CoC executes the output of LLMs with the Python interpreter as if it were always benign code. If deployed in the wild, Chain of Code will need additional safeguards against potentially harmful code from maliciously prompted LLMs before running that code.
## References
- Austin et al. (2021) Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- Besta et al. (2023) Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Gianinazzi, L., Gajda, J., Lehmann, T., Podstawski, M., Niewiadomski, H., Nyczyk, P., et al. Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv:2308.09687, 2023.
- Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Chen et al. (2021) Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Chen et al. (2022) Chen, W., Ma, X., Wang, X., and Cohen, W. W. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022.
- Chowdhery et al. (2022) Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- Cobbe et al. (2021) Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Dohan et al. (2022) Dohan, D., Xu, W., Lewkowycz, A., Austin, J., Bieber, D., Lopes, R. G., Wu, Y., Michalewski, H., Saurous, R. A., Sohl-Dickstein, J., et al. Language model cascades. arXiv preprint arXiv:2207.10342, 2022.
- Drori et al. (2022) Drori, I., Zhang, S., Shuttleworth, R., Tang, L., Lu, A., Ke, E., Liu, K., Chen, L., Tran, S., Cheng, N., et al. A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level. Proceedings of the National Academy of Sciences, 119(32):e2123433119, 2022.
- Gao et al. (2023) Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., and Neubig, G. Pal: Program-aided language models. In International Conference on Machine Learning, pp. 10764–10799. PMLR, 2023.
- Gemini Team (2023) Gemini Team, G. Gemini: A family of highly capable multimodal models. Technical report, Google, 2023. URL https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf.
- Google et al. (2023) Google, Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
- Gou et al. (2023) Gou, Z., Shao, Z., Gong, Y., Shen, Y., Yang, Y., Duan, N., and Chen, W. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738, 2023.
- Khot et al. (2022) Khot, T., Trivedi, H., Finlayson, M., Fu, Y., Richardson, K., Clark, P., and Sabharwal, A. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022.
- Kirillov et al. (2023) Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., and Girshick, R. Segment anything. arXiv:2304.02643, 2023.
- Kojima et al. (2022) Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
- Lewkowycz et al. (2022) Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., et al. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858, 2022. URL https://arxiv.org/abs/2206.14858.
- Li et al. (2022) Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022.
- Liang et al. (2023) Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., and Zeng, A. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 9493–9500. IEEE, 2023.
- Liu et al. (2023) Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
- Mialon et al. (2023) Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R., Rozière, B., Schick, T., Dwivedi-Yu, J., Celikyilmaz, A., et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023.
- Min et al. (2022) Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022.
- Mirchandani et al. (2023) Mirchandani, S., Xia, F., Florence, P., Ichter, B., Driess, D., Arenas, M. G., Rao, K., Sadigh, D., and Zeng, A. Large language models as general pattern machines. arXiv preprint arXiv:2307.04721, 2023.
- Nijkamp et al. (2022) Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., Savarese, S., and Xiong, C. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.
- Ning et al. (2023) Ning, X., Lin, Z., Zhou, Z., Yang, H., and Wang, Y. Skeleton-of-thought: Large language models can do parallel decoding. arXiv preprint arXiv:2307.15337, 2023.
- Nye et al. (2021) Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.
- OpenAI (2023) OpenAI. Gpt-4 technical report, 2023.
- Paranjape et al. (2023) Paranjape, B., Lundberg, S., Singh, S., Hajishirzi, H., Zettlemoyer, L., and Ribeiro, M. T. Art: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014, 2023.
- Parisi et al. (2022) Parisi, A., Zhao, Y., and Fiedel, N. Talm: Tool augmented language models. arXiv preprint arXiv:2205.12255, 2022.
- Qin et al. (2023) Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023.
- Radford et al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Schick et al. (2023) Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.
- Singh et al. (2023) Singh, I., Blukis, V., Mousavian, A., Goyal, A., Xu, D., Tremblay, J., Fox, D., Thomason, J., and Garg, A. Progprompt: Generating situated robot task plans using large language models. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 11523–11530. IEEE, 2023.
- Srivastava et al. (2022) Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
- Surís et al. (2023) Surís, D., Menon, S., and Vondrick, C. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023.
- Suzgun et al. (2022) Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E. H., Zhou, D., et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
- Touvron et al. (2023) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Wang et al. (2023a) Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023a.
- Wang et al. (2023b) Wang, L., Xu, W., Lan, Y., Hu, Z., Lan, Y., Lee, R. K.-W., and Lim, E.-P. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint arXiv:2305.04091, 2023b.
- Wang et al. (2022) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- Wei et al. (2022a) Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a.
- Wei et al. (2022b) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022b.
- Yang et al. (2023) Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q. V., Zhou, D., and Chen, X. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023.
- Yao et al. (2022) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
- Yao et al. (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
- Zeng et al. (2022) Zeng, A., Attarian, M., Ichter, B., Choromanski, K., Wong, A., Welker, S., Tombari, F., Purohit, A., Ryoo, M., Sindhwani, V., et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022.
- Zhou et al. (2023) Zhou, A., Wang, K., Lu, Z., Shi, W., Luo, S., Qin, Z., Lu, S., Jia, A., Song, L., Zhan, M., et al. Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921, 2023.
- Zhou et al. (2022a) Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022a.
- Zhou et al. (2022b) Zhou, Y., Lei, T., Liu, H., Du, N., Huang, Y., Zhao, V., Dai, A. M., Le, Q. V., Laudon, J., et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35:7103–7114, 2022b.
## Appendix A Appendix
### A.1 Quantitative results on language reasoning tasks
Table A1 shows the full per-task results across ablations on BIG-Bench Hard (BBH) tasks, as well as broken down by task type and execution type.
Random and human baselines are from Srivastava et al. (2022), task types follow Suzgun et al. (2022), and the remaining columns are Chain of Code (Interweave) and its ablations.

| BIG-Bench Hard Task | Type | Rand. | Human (Avg.) | Human (Max) | Direct | CoT | Interweave | try Python-except LM state | try Python-except LM | Python | LM state | LM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Boolean Expressions | $\lambda+$ | 50 | 79 | 100 | 88 | 89 | 100 | 100 | 100 | 100 | 95 | 90 |
| Causal Judgement | $\kappa*$ | 50 | 70 | 100 | 64 | 64 | 56 | 57 | 63 | 0 | 57 | 60 |
| Date Understanding | $\kappa-$ | 17 | 77 | 100 | 61 | 84 | 75 | 72 | 74 | 59 | 66 | 57 |
| Disambiguation QA | $\kappa/$ | 33 | 67 | 93 | 70 | 68 | 71 | 67 | 68 | 0 | 67 | 68 |
| Dyck Languages | $\lambda+$ | 1 | 48 | 100 | 6 | 50 | 100 | 100 | 99 | 99 | 1 | 7 |
| Formal Fallacies | $\kappa*$ | 25 | 91 | 100 | 56 | 56 | 55 | 54 | 55 | 0 | 54 | 56 |
| Geometric Shapes | $\lambda+$ | 12 | 54 | 100 | 48 | 66 | 100 | 100 | 100 | 100 | 13 | 44 |
| Hyperbaton | $\kappa/$ | 50 | 75 | 100 | 63 | 64 | 98 | 62 | 55 | 0 | 62 | 55 |
| Logical Deduction | $\lambda*$ | 23 | 40 | 89 | 49 | 66 | 68 | 79 | 57 | 0 | 79 | 58 |
| Movie Recommendation | $\kappa/$ | 25 | 61 | 90 | 85 | 81 | 80 | 83 | 80 | 0 | 83 | 79 |
| Multi-Step Arithmetic | $\lambda+$ | 0 | 10 | 25 | 0 | 48 | 100 | 100 | 100 | 100 | 0 | 1 |
| Navigate | $\lambda*$ | 50 | 82 | 100 | 58 | 94 | 86 | 84 | 68 | 0 | 84 | 68 |
| Object Counting | $\lambda-$ | 0 | 86 | 100 | 30 | 82 | 96 | 98 | 98 | 98 | 57 | 50 |
| Penguins in a Table | $\kappa-$ | 0 | 78 | 100 | 62 | 82 | 90 | 88 | 90 | 88 | 71 | 59 |
| Reasoning about Colored Objects | $\kappa-$ | 12 | 75 | 100 | 64 | 87 | 78 | 74 | 78 | 64 | 64 | 70 |
| Ruin Names | $\kappa/$ | 25 | 78 | 100 | 76 | 70 | 55 | 56 | 46 | 0 | 56 | 47 |
| Salient Translation Error Detection | $\kappa/$ | 17 | 37 | 80 | 66 | 61 | 58 | 63 | 64 | 0 | 63 | 64 |
| Snarks | $\kappa/$ | 50 | 77 | 100 | 70 | 71 | 76 | 76 | 66 | 0 | 76 | 66 |
| Sports Understanding | $\kappa/$ | 50 | 71 | 100 | 72 | 96 | 91 | 93 | 75 | 0 | 93 | 74 |
| Temporal Sequences | $\lambda*$ | 25 | 91 | 100 | 38 | 60 | 98 | 93 | 99 | 93 | 93 | 99 |
| Tracking Shuffled Objects | $\lambda-$ | 23 | 65 | 100 | 25 | 72 | 100 | 96 | 96 | 96 | 71 | 24 |
| Web of Lies | $\lambda-$ | 50 | 81 | 100 | 54 | 100 | 97 | 96 | 96 | 97 | 96 | 50 |
| Word Sorting | $\lambda+$ | 0 | 63 | 100 | 51 | 50 | 99 | 100 | 99 | 100 | 54 | 54 |
| NLP Task (avg) | $\kappa$ | 30 | 71 | 97 | 67 | 74 | 74 | 70 | 68 | 18 | 68 | 63 |
| Algorithmic Task (avg) | $\lambda$ | 21 | 64 | 92 | 41 | 71 | 95 | 95 | 92 | 80 | 58 | 50 |
| All Tasks (avg) | | 26 | 68 | 95 | 55 | 72 | 84 | 82 | 80 | 48 | 63 | 57 |
| Python exec (same program) | $+$ | 13 | 51 | 85 | 38 | 61 | 100 | 100 | 100 | 100 | 33 | 39 |
| Python exec (different program) | $-$ | 17 | 77 | 100 | 49 | 84 | 89 | 87 | 89 | 84 | 71 | 52 |
| LM exec (same program) | $/$ | 36 | 66 | 95 | 72 | 73 | 76 | 71 | 65 | 0 | 71 | 65 |
| LM exec (different program) | $*$ | 35 | 75 | 98 | 53 | 68 | 72 | 73 | 68 | 19 | 73 | 68 |

$\lambda$ denotes an algorithmic task and $\kappa$ denotes an NLP task (with categories outlined in Suzgun et al. (2022)). $+$ denotes a task where the code between prompts is repeated and can be executed by Python, $-$ denotes a task where the code between prompts must change and can be executed by Python, $/$ denotes a task where the code between prompts is repeated and must be executed by the LM, and $*$ denotes a task where the code between prompts must change and must be executed by the LM.
Table A1: Full results across ablations on BIG-Bench Hard (BBH) tasks.
### A.2 Quantitative results on the GSM8K Benchmark
Table A2 shows results on the grade-school math benchmark (GSM8K) (Cobbe et al., 2021) with direct prompting, Chain of Thought, and Chain of Code. We find that CoC generally outperforms CoT and direct prompting. Since these tasks are primarily algorithmic and are solved by Python alone, all Chain of Code variants that use Python achieve the same performance, which also matches the performance reported by Program of Thoughts (Chen et al., 2022).
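To illustrate why Python alone suffices on this benchmark, below is a CoC-style program for a made-up, GSM8K-style word problem (not taken from the benchmark): every line is executable arithmetic, so no LMulator step is needed.

```python
# Q: A baker bakes 12 loaves each day and sells 9 of them.
#    How many loaves are left over after 5 days?
loaves_per_day = 12
sold_per_day = 9
days = 5
leftover_per_day = loaves_per_day - sold_per_day  # 3
answer = leftover_per_day * days                  # 15
```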
Table A2: GSM8K (Cobbe et al., 2021) performance (%) with few-shot prompting, both single-task and cross-task. The delta compared to direct prompting is shown in parentheses.
### A.3 Qualitative results on language reasoning tasks
Figure A1 shows the model outputs for a few reasoning tasks from BIG-Bench Hard (BBH) and Figure A2 shows a demonstrative example of date reasoning. These examples are selected to highlight the interweaving execution of the Python interpreter and the LMulator.
(a) Movie Recommendation
(b) Hyperbaton
(c) Logical Deduction
(d) Disambiguation QA
Figure A1: Model outputs for a few reasoning tasks from BIG-Bench Hard (BBH). We observe that CoC can apply to a wide variety of complex reasoning tasks that involve both semantic and numeric reasoning. Red highlight indicates LM generated code being executed by the Python interpreter, and purple highlight indicates LM simulating the code execution.
Direct answer only
Chain of Code
Figure A2: A demonstrative example of how Chain of Code generates code and reasons through an LM-augmented code emulator. Lines evaluated with Python are in red and with an LM are in purple. The chain of thought and direct answers were evaluated with gpt-4 in October 2023, and we note the current model (as of December 2023) writes code to solve this problem and gets the same solution as Chain of Code.
### A.4 Instruction Tuned Models
Since most of the results presented in our main paper are using text-davinci-003, a completion model that is particularly amenable to few-shot prompting, one may wonder how CoC can be used with instruction tuned models, like gpt-4 (OpenAI, 2023). Figuring out ways to elicit the desired behavior of CoC from these instruction tuned models (i.e. writing code/pseudocode to solve problems) is non-trivial. We conduct two additional experiments below as our best effort to shed some light on this subject.
Zero-shot prompting. We directly prompt the models with instructions to elicit the desired reasoning approaches. Note that we do not provide few-shot examples in the prompt (hence "zero-shot"). For the baselines, we ask the model to "directly answer" (Direct) or "think step by step" (CoT). For CoC variants, we ask the model to "write python code to help solve the problem, if it's helpful". If a program is written, we either run the code with a Python interpreter and then feed the result (or the error message if execution fails) back to the model to determine a final answer (CoC (Python)), or ask the model to simulate the output of code execution as an LMulator (CoC (LM)). The CoC (Python) baseline can be thought of as a comparison to an LM with Python tool use.
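A minimal sketch of the CoC (Python) loop just described is shown below. `chat` is a placeholder for a chat-model call that takes a list of `{"role", "content"}` messages and returns the assistant's reply text; the exact API and follow-up wording are assumptions, not part of our evaluation harness.

```python
import re
import subprocess
import sys

def coc_python_zero_shot(question, chat):
    """Sketch of zero-shot CoC (Python): ask for code, run it, feed results back."""
    system = "Write python code to help solve the problem, if it's helpful."
    messages = [{"role": "system", "content": system},
                {"role": "user", "content": question}]
    reply = chat(messages)

    # If the reply contains a fenced code block, execute it and return the
    # stdout (or the error message) to the model for a final answer.
    fence = "`" * 3
    match = re.search(fence + r"(?:python)?\n(.*?)" + fence, reply, re.DOTALL)
    if match:
        proc = subprocess.run([sys.executable, "-c", match.group(1)],
                              capture_output=True, text=True, timeout=30)
        feedback = proc.stdout if proc.returncode == 0 else proc.stderr
        messages += [{"role": "assistant", "content": reply},
                     {"role": "user",
                      "content": f"Execution result:\n{feedback}\nNow give the final answer."}]
        reply = chat(messages)
    return reply
```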
Table A3 shows the performance of each approach. With gpt-3.5-turbo, both CoT and CoC (Python) show benefits over direct prompting, although both are strongly outperformed by CoC (Interweave). With gpt-4, despite the considerable model strength advantage over text-davinci-003, CoC (Interweave) still outperforms, though the gap is narrower. Admittedly, CoC (Interweave) is prompted with three examples whereas the other two models are not.
Table A3: Comparisons with instruction tuned models in the chat interface, with and without tool use, denoted as CoC (Python) and CoC (LM) respectively. The delta compared to CoC with text-davinci-003 is shown in parentheses. In this experiment, the prompts for gpt-3.5-turbo and gpt-4 only contain a generic, shared system message and do not contain few-shot examples.
| Model | CoC (Interweave) w/ text-davinci-003 | Direct | CoT | CoC (Python) | CoC (LM) |
| --- | --- | --- | --- | --- | --- |
| gpt-3.5-turbo (zero-shot) | 84 | 51 (-33) | 56 (-28) | 56 (-28) | 45 (-39) |
| gpt-4 (zero-shot) | 84 | 70 (-14) | 78 (-6) | 82 (-2) | 75 (-9) |
Few-shot prompting. We attempt to coerce the instruction tuned models to behave like completion models by using the following system message: "Pretend you are a completion model that is prompted with three examples. You should follow the pattern of these examples strictly. At the end of your reply, you should always output an answer". In the user message, we include three examples from the same (single task) or different (cross task) task domains, as well as the query question, exactly the same as how we evaluated the completion models.
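Concretely, the chat payload for this setup looks roughly like the sketch below; the `examples` and `query` strings are placeholders for the three in-context CoC examples and the test question, and the message layout is an assumption about one reasonable way to assemble it.

```python
# Placeholders for the three few-shot CoC examples and the test question.
examples = ["Q: <example 1>\nA: <worked CoC solution 1>",
            "Q: <example 2>\nA: <worked CoC solution 2>",
            "Q: <example 3>\nA: <worked CoC solution 3>"]
query = "Q: <test question>"

system = ("Pretend you are a completion model that is prompted with three "
          "examples. You should follow the pattern of these examples strictly. "
          "At the end of your reply, you should always output an answer.")
messages = [{"role": "system", "content": system},
            {"role": "user", "content": "\n\n".join(examples + [query])}]
```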
Table A4 shows that CoC still brings a sizable performance gain over the Direct and CoT baselines. With gpt-4, the gap is again narrower mainly because the base model already performs quite well and leaves little room for improvement. This experiment suggests a viable way to combine the strength of CoC with that of more advanced instruction tuned models like gpt-4.
Table A4: Applying CoC to instruction tuned models in the chat interface, while coercing them to behave like completion models. The delta compared to direct prompting is shown in parentheses. In this experiment, the prompts contain a generic, shared system message that asks the LMs to behave like completion models, as well as three examples from the same or different task domains at the beginning of the user message, as before.
| Prompt | gpt-3.5-turbo: Direct | CoT | CoC | gpt-4: Direct | CoT | CoC |
| --- | --- | --- | --- | --- | --- | --- |
| Single task | 47 | 73 (+26) | 79 (+32) | 69 | 88 (+19) | 91 (+22) |
| Cross task | 47 | 60 (+13) | 61 (+14) | 67 | 81 (+14) | 84 (+17) |
### A.5 Robustness of Chain of Code
Similar to Chain of Thought prompts, Chain of Code prompts can come in various forms: e.g., different ways of decomposing functions, coding styles, variable names, reasoning pathways, and so on. In this section, we analyze whether CoC is robust to variation across prompts.
We invite three annotators who are familiar with Python to write CoC prompts for four representative tasks in BIG-Bench Hard. We select these four tasks because they all require generation of new code (as opposed to repeated code) at test time. As before, for single task evaluation, we prompt LMs with three examples from the same task domain, whereas for cross task evaluation, we prompt LMs with three examples from different task domains (one from each of the other three tasks).
Results in Table A5 show that our method is robust to prompt variation and does not require extensive prompt engineering.
Table A5: Performance variation across prompts written independently by different annotators for four representative tasks in BIG-Bench Hard. Our results show that CoC is generally robust to prompt variation, allowing for different coding styles and reasoning logic in the few-shot prompts.
| Prompt | Annotator | Date Understanding | Logical Deduction | Object Counting | Penguins in a Table | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Single task | A | 73 | 64 | 92 | 78 | 77 |
| | B | 68 | 54 | 95 | 88 | 76 |
| | C | 69 | 43 | 90 | 89 | 73 |
| Cross task | A | 41 | 33 | 67 | 76 | 54 |
| | B | 48 | 29 | 78 | 88 | 61 |
| | C | 60 | 30 | 76 | 64 | 57 |
### A.6 Robotics Applications
Downstream applications such as robotics are a good fit for CoC, as robotics tasks require semantic and algorithmic reasoning, as well as interfacing with other APIs through code (such as control or perception APIs (Liang et al., 2023)) and with users through natural language. For example, given a task like "sort the fruits by size", the robot must reason over which items are fruits, sort them by size, and then connect those decisions to actions executable on the robot. CoC (Interweave) is able to solve these challenges with the Python interpreter and the LMulator at runtime, while allowing for more interpretability and fine-grained control of the robot policies.
Environment and Robot Setup. Our environment is a tabletop with small objects (containers, toys, etc.) and a UR5 robot arm equipped with a vacuum gripper and a wrist-mounted RGB-D camera. For the purpose of our experiments, the available perception API is detect_objects(), which returns a list of detected objects (probabilities, labels, bounding boxes, and segmentation masks) from the wrist camera. This API is implemented by first querying GPT-4V (OpenAI, 2023) for a list of objects, and then using Grounding-SAM (Kirillov et al., 2023; Liu et al., 2023) to localize them. The available control API is pick_place(obj1, obj2), which is a scripted primitive skill that picks up obj1 and places it on top of obj2. There is also a text-to-speech API say(sentence) that allows the robot to communicate with the user.
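As an illustration of how these APIs compose, below is a sketch of the kind of program CoC might generate for the sorting task (the actual generated code is shown in Fig. A3). detect_objects(), pick_place(), and say() are the robot APIs above and run on the Python side; is_compostable() is intentionally undefined in Python and would be simulated by the LMulator at runtime. The per-object attribute access (e.g., obj.label) is illustrative, since the exact object representation returned by detect_objects() is not specified here.

```python
# Sketch for: "sort the objects on the table into the compost bin and the recycle bin"
objects = detect_objects()                      # perception API (Python interpreter)
for obj in objects:
    if obj.label in ["compost bin", "recycle bin"]:
        continue                                # skip the bins themselves
    if is_compostable(obj.label):               # semantic check (simulated by the LMulator)
        pick_place(obj.label, "compost bin")    # control API (Python interpreter)
    else:
        pick_place(obj.label, "recycle bin")
say("I have sorted the objects on the table.")  # text-to-speech API
```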
Experimental Setup. We evaluate with a number of tabletop pick-and-place robotics tasks that involve semantic reasoning. For few-shot prompting, we include a single example: "Serve a meal that follows the user's dietary restrictions", so that the language model understands the expected structure as well as the available robot APIs. During test time, we query the model with each of the following instructions.
1. "Pack a lunch box for someone who is on a vegan diet."
2. "Assemble a sandwich for someone who is vegetarian."
3. "Gather ingredients for a peanut butter sandwich in a plate."
4. "Prepare 西红柿炒蛋 in the pot." (interleaving English and Chinese on purpose; the Chinese names a tomato-and-egg stir-fry)
5. "Place all paper-made objects in the grass-colored container."
6. "Sort the objects on the table into the compost bin and the recycle bin."
7. "My steak is too bland. Can you help?"
Results. With a single example in our prompt, we see that our model is able to generalize to new objects, languages, and task domains (see Figure A3 and an example trajectory in Figure A4). Note that for these robotics tasks, unlike the previous language reasoning tasks, our main method CoC (Interweave) is the only capable approach, as the code requires line-by-line interplay between the Python interpreter execution (robot APIs) and the LMulator (commonsense QA like is_compostable).
Figure A3 shows the one-shot prompt as well as the model outputs and how they are executed for a few test instructions.
(a) Given Prompt
(b) Novel Objects
(c) Novel Languages
(d) Novel Tasks
Figure A3: The one-shot prompt as well as the model outputs for a few test instructions for the robotics tasks. When given a single example in the prompt (a), our method can generalize (b-d) to new objects, languages, and task domains. Red highlight indicates LM generated code being executed by the Python interpreter, and purple highlight indicates the LM simulating the code execution. Gray text is for illustration purposes only and is not provided to our model. Note that code in the form of robot.<func_name> invokes robot APIs.
<details>
<summary>extracted/5762267/fig/robot_traj_1.jpg Details</summary>

### Visual Description
## Photograph: Automated Sorting System Test Setup
### Overview
The image depicts a robotic arm positioned over two labeled bins ("Recycle Bin" and "Compost Bin") on a wooden surface. The setup includes various objects (e.g., banana, carrot, bread, milk carton) arranged in the background, suggesting a simulated waste-sorting environment. The robotic arm, equipped with a gripper, appears to be interacting with a yellow sticky note on the wooden board.
### Components/Axes
- **Robotic Arm**: Gray and light blue, with a blue cable extending from its base. Mounted on a metallic base with a black gripper.
- **Bins**:
- **Recycle Bin**: Blue, labeled in black text.
- **Compost Bin**: Green, labeled in orange text.
- **Objects**:
- Banana (top-right), carrot (center-right), bread slice (right-middle), milk carton (center), and a red cash register (top-left).
- **Surface**: Light brown wooden board with a grid-like pattern.
### Detailed Analysis
- **Robotic Arm**: Positioned centrally, oriented toward the bins. The gripper is lowered toward a yellow sticky note on the wooden board.
- **Bins**: Placed side-by-side on the wooden surface. The "Recycle Bin" is to the left of the "Compost Bin."
- **Background Objects**: Scattered across a dark perforated metal surface. Includes a banana (top-right), carrot (center-right), bread slice (right-middle), and a milk carton (center). A red cash register with a blue base is partially visible in the top-left corner.
### Key Observations
1. The robotic arm's gripper is aligned with the yellow sticky note, suggesting a task involving object manipulation or placement.
2. The bins are clearly labeled, indicating a focus on waste categorization.
3. The background objects (banana, carrot, bread, milk carton) are common household items, likely used to simulate recyclable/compostable materials.
4. The cash register in the background may imply a retail or commercial context for the test.
### Interpretation
This setup likely demonstrates a proof-of-concept for an automated waste-sorting system. The robotic arm's interaction with the sticky note could represent a step in object recognition or placement accuracy testing. The labeled bins and diverse objects suggest the system is designed to categorize items into recyclable or compostable streams. The presence of the cash register hints at a broader application in retail environments, where automated sorting could streamline waste management.
No numerical data, charts, or diagrams are present. The image focuses on physical components and their arrangement, with no textual content beyond the bin labels.
</details>
<details>
<summary>extracted/5762267/fig/robot_traj_2.jpg Details</summary>

### Visual Description
## Photograph: Play Kitchen Setup with Robotic Arm
### Overview
The image depicts a staged play kitchen environment with a wooden surface, a red cash register, a robotic arm, and various play food items. The scene includes labeled objects, a robotic arm interacting with a sponge, and a "Compost Bin" label.
### Components/Axes
- **Textual Labels**:
- **"Compost Bin"**: Written in red text on a green square plate (bottom-right of the wooden surface).
- **"MILK"**: Printed on a white and blue milk carton (top-center of the wooden surface).
- **Cash Register Keypad**: Numeric buttons (0-9) and function keys (e.g., "=", "+", "-") visible on the red cash register.
- **Robotic Arm**: No explicit label, but its position suggests interaction with the sponge.
- **Objects**:
- **Play Food**: Banana (top-right), carrot (top-center), slice of bread (top-right), and a blue plate (top-center).
- **Other Items**: A yellow sponge (center), a white cup (top-left), and a blue tray (top-left).
### Detailed Analysis
- **Textual Elements**:
- **"Compost Bin"**: Located on the green plate, positioned at the bottom-right of the wooden surface.
- **"MILK"**: On the milk carton, centered on the wooden surface.
- **Cash Register Keypad**: Numeric buttons (0-9) and function keys (e.g., "=", "+", "-") are visible on the red cash register, which is located in the top-left corner of the image.
- **Robotic Arm**:
- A metallic robotic arm with a black gripper is positioned above the blue plate, holding a yellow sponge. The arm is centered in the upper portion of the image.
### Key Observations
- The "Compost Bin" label is the only explicit textual instruction in the scene, suggesting a simulated waste-sorting activity.
- The robotic arm's interaction with the sponge implies a task related to cleaning or sorting.
- No numerical data or charts are present; the focus is on textual labels and object placement.
### Interpretation
This setup appears to simulate a child's play kitchen with a robotic arm for interactive learning. The "Compost Bin" label indicates a pedagogical element, teaching waste segregation. The robotic arm's presence suggests a blend of traditional play and technology, possibly for educational or experimental purposes. The absence of numerical data or complex diagrams implies the image is not a technical chart but a staged scene for demonstration or storytelling.
**Note**: No numerical data, charts, or diagrams are present. The image focuses on textual labels and object placement in a play environment.
</details>
<details>
<summary>extracted/5762267/fig/robot_traj_3.jpg Details</summary>

### Visual Description
## Photograph: Robotic Arm Training Setup
### Overview
The image depicts a robotic arm interacting with objects on a wooden surface, likely part of a machine learning or robotics training environment. The scene includes labeled bins, a toy cash register, and play food items, suggesting tasks related to object recognition, sorting, or manipulation.
### Components/Axes
- **Robotic Arm**: Central focus, with a black gripper and metallic components. A blue cable connects the arm to an off-frame control unit.
- **Bins**:
- **Green Bin**: Labeled "Compost Bin" in orange text (bottom right).
- **Blue Bin**: Contains a yellow sticky note (top right).
- **Background Objects**:
- **Toy Cash Register**: Red and blue, with a numeric keypad (top left).
- **Play Food Items**: Includes a banana, carrot, bread, milk carton (labeled "MILK"), and a blue plate (scattered across the perforated metal surface).
- **Surface**: Light brown wooden board with visible grain, placed on a dark perforated metal table.
### Detailed Analysis
- **Robotic Arm**: Positioned centrally, the arm's gripper is lowered toward a small white cube with a green label (possibly a training object). The arm's base is mounted on the wooden surface.
- **Bins**:
- The green "Compost Bin" is empty but labeled, suggesting it may be a target for sorting tasks.
- The blue bin holds a yellow sticky note, though no text is visible on the note.
- **Background Items**:
- The cash register and play food items (banana, carrot, bread, milk carton) are arranged haphazardly, likely serving as training data for object recognition.
- The milk carton's label ("MILK") is clearly visible in black text on a blue background.
### Key Observations
- The robotic arm's gripper is in contact with a white cube, indicating an active manipulation task.
- The "Compost Bin" label is the only explicit textual instruction in the scene.
- Play food items are disproportionately sized compared to real-world objects, consistent with toy models.
- No numerical data, charts, or graphs are present.
### Interpretation
This setup appears designed for training a robotic system to recognize and interact with objects. The labeled bins ("Compost Bin") and play food items suggest tasks such as:
1. **Object Classification**: Identifying items (e.g., food vs. non-food).
2. **Sorting**: Placing items into designated bins (e.g., compostable vs. non-compostable).
3. **Manipulation**: Using the robotic arm to grasp and move objects.
The presence of a toy cash register and play food items implies a simulated environment for testing real-world scenarios in a controlled setting. The absence of numerical data or explicit instructions beyond the "Compost Bin" label suggests the focus is on visual and spatial reasoning rather than quantitative analysis.
</details>
<details>
<summary>extracted/5762267/fig/robot_traj_4.jpg Details</summary>

### Visual Description
## Photograph: Robotic Arm Interaction Setup
### Overview
The image depicts a robotic arm interacting with objects on a wooden surface. The scene includes a green plate with a yellow viscous substance (possibly food), a blue plate with a yellow sticky note, a milk carton, a banana, a toy cash register, and a wooden cutting board. The robotic arm has a blue tube connected to its base, suggesting a fluid or pneumatic system. The background features a pegboard with scattered items, including a banana, bread, and a toy cash register.
### Components/Axes
- **Robotic Arm**: Central focus, with a metallic gripper and blue tube.
- **Plates**:
- Green plate (foreground) with yellow substance.
- Blue plate (background) with yellow sticky note.
- **Objects**:
- Milk carton (left side, labeled "MILK" in English).
- Banana (top-right corner).
- Toy cash register (top-left corner, red and blue).
- Wooden cutting board (base surface).
- **Background**: Black pegboard with holes, scattered with additional items (bread, banana, toy cash register).
### Detailed Analysis
- **Robotic Arm**: Positioned centrally, gripping the green plate. The blue tube connects to the arm's base, possibly for fluid transfer or stabilization.
- **Green Plate**: Contains a yellow substance (e.g., sauce, paint) with a small black mark or stain near the robotic armâs gripper.
- **Blue Plate**: Holds a yellow sticky note with illegible text.
- **Milk Carton**: Positioned to the left of the robotic arm, upright with a label facing the viewer.
- **Banana**: Located in the top-right corner, partially visible.
- **Toy Cash Register**: Red and blue, with a drawer containing dollar bills.
- **Wooden Surface**: Light brown, with visible grain patterns and a grid-like cutting board.
### Key Observations
- No textual information (labels, axis titles, legends) is visible in the image.
- The robotic arm's interaction with the green plate suggests a task involving precision or manipulation of viscous materials.
- The scattered objects on the pegboard imply a simulated environment (e.g., kitchen, store).
### Interpretation
The setup appears to be a robotics experiment or demonstration, testing the armâs ability to handle objects in a cluttered environment. The presence of food items (banana, bread) and a cash register may indicate a scenario simulating real-world tasks, such as food preparation or retail automation. The yellow substance on the green plate could represent a challenge for the robotic gripper, testing its ability to manage liquids or stains. The absence of visible text or data points suggests the image focuses on hardware interaction rather than data visualization.
### Notable Anomalies
- No discernible numerical data or trends.
- The robotic arm's exact task (e.g., picking up, dispensing) is unclear without motion context.
- The yellow substance's purpose (e.g., test material, food) is ambiguous.
</details>
Figure A4: Robot trajectory visualization for the task "sort the objects on the table into the compost bin and the recycle bin". CoC first generates code to solve the problem, and then executes the code with Python where possible (e.g., robot APIs like detect_objects and pick_place), and with the LMulator where not (e.g., commonsense QA like is_compostable). The robot successfully picks and places the Post-it note into the recycle bin and the orange peel into the compost bin. See the full code in Fig. A3.