# Chain of Code: Reasoning with a Language Model-Augmented Code Emulator
**Authors**: Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, Brian Ichter
## Abstract
Code provides a general syntactic structure to build complex programs and perform precise computations when paired with a code interpreter. We hypothesize that language models (LMs) can leverage code-writing to improve Chain of Thought reasoning not only for logic and arithmetic tasks (Chen et al., 2022; Nye et al., 2021; Austin et al., 2021), but also for semantic ones (and in particular, those that are a mix of both). For example, consider prompting an LM to write code that counts the number of times it detects sarcasm in an essay: the LM may struggle to write an implementation of `detect_sarcasm(string)` that can be executed by the interpreter (handling the edge cases would be insurmountable). However, LMs may still produce a valid solution if they not only write code, but also selectively "emulate" the interpreter by generating the expected output of `detect_sarcasm(string)`. In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension that improves LM code-driven reasoning. The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode, so that the interpreter can explicitly catch undefined behaviors and hand them off to an LM to simulate (as an "LMulator"). Experiments demonstrate that Chain of Code outperforms Chain of Thought and other baselines across a variety of benchmarks; on BIG-Bench Hard, Chain of Code achieves 84%, a gain of 12% over Chain of Thought. In a nutshell, CoC broadens the scope of reasoning questions that LMs can answer by "thinking in code".
Machine Learning, ICML
https://chain-of-code.github.io/
Direct answer only
Q: How many countries have I been to? I've been to Mumbai, London, Washington, Grand Canyon, ... A: 32 (20%, ✗), 29 (10%, ✗), 52 (10%, ✓), ...
Chain of Thought
Q: Let's think step by step. How many countries have I been to? I've been to Mumbai, London, ... We'll group by countries and count: 1. India: Mumbai, Delhi, Agra 2. UK: London, Dover, Edinburgh, Skye 3. USA: Washington, Grand Canyon, ... A: 61 (20%, ✗), 60 (20%, ✗), 52 (10%, ✓), ...
Chain of Code (Ours)
Q: How many countries have I been to? I've been to Mumbai, London, Washington, Grand Canyon, Baltimore, ... 1 places, countries = ["Mumbai", ...], set() delta state: {places = ["Mumbai", ...], countries = set()} 2 for place in places: delta state: {place = "Mumbai"} 3 country = get_country(place) delta state: {country = "India"} 4 countries.add(country) delta state: {countries = {"India"}} 5 answer = len(countries) delta state: {answer = 52} A: 52 (100%, ✓)
Figure 1: Chain of Code generates code and reasons through an LM-augmented code emulator. Lines evaluated with Python are in red and with an LM are in purple. The full query is in Fig. LABEL:fig:intro_query.
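The program in Figure 1 can be sketched as follows; a small lookup table stands in for the LM-simulated `get_country` helper (the table entries and answer are illustrative, using only the four listed places):

```python
# Sketch of Figure 1's Chain of Code program. get_country is pseudocode a
# plain interpreter cannot run; the LMulator would supply its value, which
# we mimic here with a lookup table (illustrative only).

def get_country(place):
    lm_knowledge = {"Mumbai": "India", "London": "UK",
                    "Washington": "USA", "Grand Canyon": "USA"}
    return lm_knowledge[place]  # stand-in for an LM prediction

places = ["Mumbai", "London", "Washington", "Grand Canyon"]
countries = set()
for place in places:
    countries.add(get_country(place))
answer = len(countries)
print(answer)  # 3
```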
<details>
<summary>extracted/5762267/fig/all_tasks_direct.png Details</summary>

Bar chart of Δ w.r.t. the average human rater (%) for each BIG-Bench Hard task under direct prompting: most tasks fall below the human baseline (down to roughly -55%), with only a few of the right-most tasks above it (up to roughly +30%).
</details>
(a) Direct answer only
<details>
<summary>extracted/5762267/fig/all_tasks_cot.png Details</summary>

Bar chart of Δ w.r.t. the average human rater (%) for each BIG-Bench Hard task under Chain of Thought prompting: deltas range from roughly -20% on the left-most tasks to roughly +60% on the right-most, crossing the human baseline early in the sequence.
</details>
(b) Chain of Thought
<details>
<summary>extracted/5762267/fig/all_tasks_coc_interweave_no_title.png Details</summary>

Bar chart of Δ w.r.t. the average human rater (%) for each BIG-Bench Hard task under Chain of Code: only the first few tasks fall below the human baseline (down to roughly -20%), while the remaining tasks rise well above it (up to roughly +90%).
</details>
(c) Chain of Code (Ours)
Figure 2: Overall results on BIG-Bench Hard compared to human performance (Srivastava et al., 2022).
## 1 Introduction
Language models (LMs) at a certain scale exhibit the profound ability to solve complex reasoning questions (Brown et al., 2020; Wei et al., 2022a), from writing math programs (Drori et al., 2022) to solving science problems (Lewkowycz et al., 2022). Notably, these capabilities have been shown to improve with Chain of Thought (CoT) prompting (Wei et al., 2022b), whereby complex problems are decomposed into a sequence of intermediate reasoning steps. CoT excels at semantic reasoning tasks, but tends to struggle with questions that involve numeric or symbolic reasoning (Suzgun et al., 2022; Mirchandani et al., 2023). Subsequent work addresses this by prompting LMs (e.g., trained on GitHub (Chen et al., 2021)) to write and execute code (Chen et al., 2022; Nye et al., 2021; Austin et al., 2021). Code in particular is advantageous because it provides both (i) a general syntactic structure to build and encode complex programs (Liang et al., 2023) (e.g., logic structures, functional vocabularies) in ways that are Turing complete, and (ii) an interface by which existing APIs paired with an interpreter can be used to perform precise algorithmic computations (e.g., from multiplication of large numbers to sorting an array of size 10,000) that a language model trained only to mimic the statistically most likely next token would otherwise struggle to produce.
While writing and executing code may improve LM reasoning performance across a wide range of arithmetic tasks, this particular approach contends with the fact that many semantic tasks are rather difficult (and at times, nearly impossible) to express in code. For example, it remains unclear how to write a function that returns a boolean when it detects sarcasm in a string (Suzgun et al., 2022) (handling the edge cases would be insurmountable). Perhaps fundamentally, using LMs to write programs in lieu of multi-step textual reasoning inherently assumes that the intermediate reasoning traces (expressed in lines of code) all need to be executable by an interpreter. Is it possible to lift these restrictions to get the best of both reasoning in code and reasoning in language?
In this work, we propose Chain of Code (CoC), a simple yet surprisingly effective extension to improve LM code-driven reasoning, in which the LM not only writes a program, but also selectively "simulates" the interpreter by generating the expected output of certain lines of code (that the interpreter could not execute). The key idea is to encourage LMs to format semantic sub-tasks in a program as flexible pseudocode that at runtime can be explicitly caught and handed off to an LM to emulate; we term this an LMulator (a portmanteau of LM and emulator). For example, given the task "in the above paragraph, count how many times the person was sarcastic," we can in-context prompt the LM to write a program that may call helper functions such as is_sarcastic(sentence), for which the LM makes a linguistic prediction and returns the result as a boolean output that is then processed with the rest of the program. Specifically, we formulate LM reasoning as the following process (illustrated in Figure 1): the LM writes code, and the interpreter steps through each line, either executing it (in red) or, if execution fails, simulating the result with the LM (in purple) and updating the program state (in green). CoC inherits the benefits of both (i) writing executable code (where precise algorithmic computations are left to an interpreter), and (ii) writing pseudocode for semantic problems and generating their outputs (which can be thought of as a simple formatting change, to which LMs are robust (Min et al., 2022)), enabling the LM to "think in code".
Extensive experiments demonstrate that CoC is applicable to a wide variety of challenging numerical and semantic reasoning questions, and outperforms a number of popular baselines. In particular, we find that it achieves high performance on BIG-Bench Hard tasks (Suzgun et al., 2022), outperforming average human raters overall and outperforming even the best human raters on an algorithmic subset of tasks, and, to the best of our knowledge, setting a new state of the art. We further show that both code interpreter execution and language model execution simulation are necessary for this performance, and that the approach scales well with large and small models alike, contrary to prompting techniques like Chain of Thought that only emerge at scale. We then demonstrate how Chain of Code can serve as a general-purpose reasoner via a cross-task prompting benchmark, which in contrast to prior work uses prompts from different families of problems as context, providing only the structure of the response (as opposed to the solution itself). Finally, we show CoC is complementary to more advanced instruction-tuned chat models, robust against prompt variation, and applicable beyond language reasoning to domains like robotics. This work underscores how one may leverage the structure and computational power of code and the reasoning abilities of language models to enable a "best of both worlds" reasoner.
## 2 Chain of Code: Reasoning with an LMulator
In this section, we describe Chain of Code (CoC), an approach that leverages the ability of language models to code, to reason, and to leverage an LM-augmented code emulator (an LMulator) to simulate running code. We start with background in Section 2.1, then overview the method in Section 2.2, its implementation in Section 2.3, and finally its capabilities in Section 2.4.
### 2.1 Preliminaries
Briefly, we overview some background on LM reasoning. Many of these reasoning techniques have been enabled by in-context learning (Brown et al., 2020), which provides the model with a few demonstrative examples at inference time, rather than updating any weights with gradients. These examples provide context and a format for the setting, enabling the model to emulate them while adapting to a new query. This property has been instrumental in easily applying LMs to new tasks, as they can be rapidly adapted with minimal data.
Through in-context learning, approaches have been developed that leverage human thought processes and use tools to improve the performance of language models. We outline three such approaches that provide the foundations for Chain of Code. Chain of Thought (CoT) (Wei et al., 2022b), ScratchPad (Nye et al., 2021), and Program of Thoughts (Chen et al., 2022) demonstrated the efficacy of breaking problems down into substeps. For CoT these substeps are in natural language, mirroring one's thought process when stepping through a complicated problem. ScratchPad, on the other hand, maintains a program state of intermediate steps when simulating the output of code, resulting in an LM acting as a code interpreter. Program of Thoughts (Chen et al., 2022) focused on generating the code itself, which is then executed by a code interpreter to solve reasoning problems. Each of these is visualized in Figure 3.
(a) Chain of Thought
(b) Program of Thoughts
(c) ScratchPad
Figure 3: Previous reasoning methods: To solve advanced problems, (LABEL:fig:prelim-cot) Chain of Thought prompting breaks the problem down into intermediate steps, (LABEL:fig:prelim-pot) Program of Thoughts prompting writes and executes code, and (LABEL:fig:prelim-scratchpad) ScratchPad prompting simulates running already written code by tracking intermediate steps through a program state. Our reasoning method: Chain of Code first (LABEL:fig:method_generation) generates code or pseudocode to solve the question and then (LABEL:fig:method_execution) executes the code with a code interpreter if possible, and with an LMulator (language model emulating code) otherwise. Blue highlight indicates LM generation, red highlight indicates LM-generated code being executed, and purple highlight indicates the LMulator simulating the code via a program state in green.
(d) Chain of Code Generation (Ours)
(e) Chain of Code Execution (Ours)
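The Program of Thoughts setup described above can be illustrated in miniature (the generated string below is a hypothetical LM output, not from the paper):

```python
# Program-of-Thoughts-style execution in miniature: the LM emits a code
# string, a Python interpreter runs it, and the answer is read back.
generated = "answer = sum(x * x for x in range(1, 6))"  # hypothetical LM output
namespace = {}
exec(generated, namespace)
print(namespace["answer"])  # 55
```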
### 2.2 Chain of Code
Inspired by how a human may reason through a particularly complex problem with a mix of natural language, pseudocode, and runnable code, or how a researcher may develop a new general algorithm through a code-based formalism and then apply it to a problem, Chain of Code proceeds in two steps: (1) Generation, in which, given a question to solve, an LM generates code to reason through the problem, and (2) Execution, in which the code is executed via a code interpreter when possible and via an LM when not. See Section 2.3 for more details on the specific implementation.
Chain of Code Generation Given a problem to solve, CoC generates reasoning substeps in the structure of code. This code provides the framework of reasoning through the problem, and may be in the form of explicit code, pseudocode, or natural language. Figure LABEL:fig:method_generation walks through a potential generation to solve an object counting problem from BIG-Bench.
Chain of Code Execution A core contribution of CoC is not just the generation of reasoning code, but the manner in which it is executed. Once the code is written, we attempt to run it with a code interpreter; in this work we consider Python, but the approach is general to any interpreter. If the code executes successfully, the program state is updated and execution continues. If the code is not executable or raises an exception, the language model is instead used to simulate the execution. The program state is subsequently updated by the language model's outputs and execution continues. Herein, we refer to this as an LMulator, a portmanteau of LM and code emulator. This relatively simple change enables a variety of new applications for code which mix semantics and numerics. Figure LABEL:fig:method_execution shows how the generated code is run, maintaining the program state and switching between the Python executor and the LMulator.
### 2.3 Chain of Code Implementation
While the generation implementation is straightforward prompting and language model generation, the execution implementation is slightly more complex. Our implementation is based on Python's try and except and on maintaining a program state. CoC steps through the code line by line. If the line is executable by a code interpreter, it is executed, the program state is updated, and the program continues. If it is not, a language model is given the context of the program (the question, the prior lines, and the history of the program state) and generates the next program state. This emulation can also leverage chain of thought to determine how to respond. The generated program state is then updated for the code interpreter as well. This sharing of program state interweaves the code interpreter and the language model simulator in a manner that supports arbitrary interleaving, even control flow like for-loops and if-statements. This continues until the entire code is run, and the answer is retrieved as the value of the variable named answer, or, in case of irrecoverable errors, with the language model outputting A: answer.
To illustrate with a brief example, the code `answer = 0; answer += is_sarcastic("you don't say"); answer += 1;` would be executed as follows: (1) Python would execute the first line `answer = 0;` and update the program state to {answer = 0}; (2) Python would attempt to execute the second line and fail, and thus the LMulator would simulate the code `answer += is_sarcastic("you don't say");` by generating the program state {answer = 1}, which would be updated in the program; (3) Python would execute the last line `answer += 1;` and update the program state to {answer = 2}; (4) the answer would be retrieved as 2.
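The walkthrough above can be reproduced with a minimal line-by-line sketch (function names and the hard-coded LMulator stub are ours for illustration, not the paper's implementation):

```python
# A minimal sketch of interweaved execution. Each line is tried with
# Python's exec(); on failure, a stub "LMulator" supplies the next
# program state instead.

def lmulator(line, state):
    """Stand-in for an LM call simulating one inexecutable line.
    Hard-coded for the sarcasm example; a real system would prompt the LM
    with the question, the code so far, and the program state history."""
    if "is_sarcastic" in line:
        # Assumed LM judgment: the phrase is sarcastic, so it contributes 1.
        state["answer"] = state.get("answer", 0) + 1
    return state

def chain_of_code_execute(code):
    state = {}
    for line in code.strip().splitlines():
        try:
            exec(line, {}, state)          # executable: let Python run it
        except Exception:
            state = lmulator(line, state)  # otherwise: the LM simulates it
    return state.get("answer")

code = """
answer = 0
answer += is_sarcastic("you don't say")
answer += 1
"""
print(chain_of_code_execute(code))  # 2
```

This sketch handles only single-line statements; the paper's implementation shares the program state across arbitrary interleaving, including multi-line control flow such as for-loops and if-statements.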
### 2.4 Chain of Code Abilities
Chain of Code has several attractive properties:
1. It enables code use in entirely new regimes, by combining the advantages of code with the powerful semantic and commonsense knowledge of language models, which can easily express rules that are challenging to express in code (e.g., which foods are fruits?). Such an ability may have benefits beyond reasoning problems and its flexibility enables executing expressive language, such as pseudocode.
1. It leverages the ability of language models to code, a particular strength of recent language models due to the high quality data available.
1. It inherits many of the benefits of reasoning code, both the formal yet expressive structure of code (e.g., Turing completeness) and powerful computational tools available to code (whether simply multiplying two numbers, calculating $\sqrt[5]{12121}$ , or simulating physics).
1. It inherits many of the benefits of techniques that reason via intermediate steps, such as Chain of Thought. These techniques enable the language model to use more computation when necessary to solve a problem as well as provide more interpretability.
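As a concrete instance of the computational tools in point 3, a couple of computations the interpreter performs exactly while a next-token predictor often fumbles (the large product is our illustrative example; the fifth root is from the list above):

```python
# Precise computation delegated to the interpreter rather than the LM.
product = 123456789 * 987654321   # exact big-integer arithmetic
root = 12121 ** (1 / 5)           # the fifth root of 12121 from the list above
print(product)
print(root)  # roughly 6.56
```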
Empirically, we observe in Section 3 that these benefits result in significant improvements in reasoning performance across a variety of challenging tasks.
## 3 Experimental Evaluation
We select challenging problems requiring varied types of reasoning, whether arithmetic, commonsense, or symbolic reasoning tasks, to answer the following questions:
1. How well does CoC perform across a variety of tasks?
1. On which types of problems does CoC perform best?
1. How does each aspect of CoC affect performance?
1. How does CoC scale with model size?
1. How does CoC perform as a general-purpose reasoner, with prompt examples from different problems rather than the same problem (which we term cross-task prompting)?
1. How can CoC be used with instruction tuned chat models?
1. How robust is CoC to prompt variation?
1. Can CoC be applied beyond language reasoning tasks?
We first discuss the approaches, ablations, and baselines considered in Section 3.1, then the tasks considered in Section 3.2, and finally the results in Section 3.3.
### 3.1 Baselines and Ablations
We consider our main method to be CoC (Interweave), also referred to as CoC (Ours), though we also propose two variants with simpler implementations and modestly lower performance: CoC (try Python except LM) and CoC (try Python except LM state). These two variants attempt to run the entire generated code with Python (rather than line by line) and, if it fails, simulate the code execution with the LMulator, outputting a final answer or an intermediate state trace, respectively. We also perform the following ablations, some of which are comparable to previous work as noted. In CoC (Python), Python is used to run the entire generated code, and if the code is not executable it is marked as a failure; this can be thought of as a comparison to Program of Thoughts (Chen et al., 2022) or Program-aided language models (Gao et al., 2023). We note that in many cases this baseline is particularly challenged, as writing executable code for some of the reasoning problems becomes nearly impossible (e.g., writing code to judge if a phrase is sarcastic), but one may focus on the results for Algorithmic-only tasks for a fairer comparison. In CoC (LM), the code is interpreted by an LMulator outputting the final answer, and in CoC (LM state), the code is interpreted by an LMulator outputting a state trace of intermediate steps; this can be thought of as ScratchPad prompting for reasoning (Nye et al., 2021). Note that the last two ablations do not leverage the Python interpreter.
We also compare against the following baselines. In Direct question answering the LM simply responds to the question with a final answer. In Chain of Thought prompting (CoT) the LM uses intermediate steps to solve the task; we use CoT as our standard prompt technique for the field of substep prompting (Kojima et al., 2022; Zhou et al., 2022a) as prompts are readily available.
### 3.2 Tasks
We consider a subset of challenging tasks from BIG-Bench (Srivastava et al., 2022) called BIG-Bench Hard (BBH) (Suzgun et al., 2022) to ensure we are solving the most challenging tasks. These tasks were specifically selected for their difficulty for language models, and the dataset provides human-rater baselines and a set of Chain of Thought prompts. The 23 tasks require semantic reasoning (e.g., "Movie Recommendation"), numerical reasoning (e.g., "Multi-Step Arithmetic"), and a combination of both (e.g., "Object Counting"). As such they enable us to study the efficacy of CoC across varied problems, not just those for which coding is a natural fit. Several prompts are shown in Figure A1. We also show results for the grade-school math (GSM8K) benchmark (Cobbe et al., 2021) in Section A.2, although we find that these problems are primarily solved algorithmically through code alone.
These tasks are evaluated with few-shot prompting, whereby three examples from the same problem family are provided as context. We also introduce a new evaluation setting, cross-task prompting, whereby three examples of different problems are provided as context. As such, the language model has in-context examples of the format of reasoning, but isn't provided explicit instructions on how to reason. We see this as an indicative signal for a general-purpose reasoner, which in many real-world applications (e.g., chatbots) would be asked to reason across a wide variety of tasks.
The models used herein include the OpenAI family of models: text-ada-001, text-babbage-001, text-curie-001, and text-davinci-003 (in plots we denote these as a-1, b-1, c-1, and d-3). We also consider PaLM 2's code finetuned variant (Chowdhery et al., 2022; Google et al., 2023). For instruction tuned models, we compare to recent variants of GPT (gpt-3.5-turbo and gpt-4) with the chat completion mode run in October 2023 and January 2024. The results below use the text-davinci-003 model unless otherwise stated.
### 3.3 Results
Question 1: Overall Performance. The overall performance of CoC is shown in Figure 2 and Table 1 (with full results in Table A1). We see that CoC outperforms other approaches, both in the number of tasks on which it exceeds the human baseline and in the overall amount by which it exceeds that baseline. Indeed, CoC's 84% is state of the art to the best of our knowledge (Gemini Team, 2023). In fact, when combined with gpt-4, CoC achieves 91% (see Table A4). On several tasks CoC vastly outperforms the human baseline and other methods, achieving nearly 100%; generally for these tasks the result is complicated in language but trivial in code (e.g., a task from multi-step arithmetic, Q: $((-3+5\times 8\times-4)-(9-8\times-7))=$). We also observe that CoT outperforms the human baseline on a number of tasks, while Direct answer fares poorly.
Table 1: Overall performance (%) on BIG-Bench Hard with few-shot prompting, both single-task and cross-task. The delta compared to direct prompting is shown in parentheses. PaLM 2-S* denotes the code variant (Google et al., 2023).

| Prompt | Human | Direct (text-davinci-003) | CoT (text-davinci-003) | CoC (Ours, text-davinci-003) | Direct (PaLM 2-S*) | CoT (PaLM 2-S*) | CoC (Ours, PaLM 2-S*) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Single task | 68 | 55 | 72 (+17) | 84 (+29) | 49 | 61 (+12) | 78 (+29) |
| Cross task | - | 50 | 55 (+5) | 61 (+11) | 45 | 47 (+2) | 47 (+2) |
Question 2: Problem Type. Figure 4 breaks the results down by problem type; the task labels are shown in Table A1. First, we isolate problems that are primarily algorithmic or primarily natural language (these categories were identified by Suzgun et al. (2022)). We see that CoC performs particularly well on algorithmic tasks, while on natural language tasks it performs on par with CoT. This is particularly encouraging, because one might expect these language-oriented tasks to be a worse fit for code. The key is that our method offers the flexibility of using an LMulator to simulate the output of code execution, retaining the semantic reasoning capabilities of LMs for natural language problems.
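The LMulator hand-off can be illustrated with a minimal sketch: each line is first tried in the Python interpreter, and a call to an undefined semantic function raises a `NameError` that is caught and handed off for simulation, with the result written back into the program state. The helper names here (`run_line`, `simulate_with_lm`) are illustrative, and a hard-coded stand-in replaces the real LM query:

```python
# Minimal sketch of the LMulator hand-off (hypothetical helper names).
def run_line(line, state):
    try:
        exec(line, state)  # try the Python interpreter first
    except NameError:
        # Undefined semantic function: ask the "LM" to simulate the
        # line and merge its answer back into the program state.
        state.update(simulate_with_lm(line, state))

def simulate_with_lm(line, state):
    # Stand-in for an LM call; here one semantic sub-task is hard-coded.
    if "detect_sarcasm" in line:
        return {"is_sarcastic": True}
    raise RuntimeError(f"cannot simulate: {line}")

state = {}
run_line("x = 1 + 1", state)                            # run by Python
run_line("is_sarcastic = detect_sarcasm(text)", state)  # run by the "LM"
```

The point of the sketch is the control flow: deterministic lines are executed exactly, and only lines the interpreter cannot handle are delegated to semantic simulation.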
Figure 4 additionally breaks the tasks down into categories that capture how much each question's response varies and whether the code can be fully executed by Python (denoted Python only vs. Python + LM). For some tasks within the benchmark, each question shares the same code or Chain of Thought, with only the inputs varying; in this case we denote the task as (repeated code), and otherwise as (new code). As expected, when the code is repeated and run by Python, CoC achieves nearly 100%, though these tasks (e.g., multi-step arithmetic) seem to be among the most challenging for the other baselines, including human raters. The other categories are more challenging for CoC; however, in each we still see a benefit over the baselines.
Figure 4: Average performance across different baselines grouped by task type, indicating the problem type and how CoC is generated & executed.
Question 3: Ablations. Figures 5 and 6, and Table 2, show the ablations performed to motivate each aspect of Chain of Code prompting. As one might expect, the approaches that execute Python (CoC (Interweave, Python, try Python except LM, try Python except LM state)) achieve 100% performance on several tasks: if the code is correct, then the model is correct every time. However, the approach that relies on Python alone (CoC (Python)) performs poorly when applied to non-algorithmic tasks, failing on almost all of them. The CoC (Python) ablation is similar to recent works (Gao et al., 2023; Chen et al., 2022), which show that code reasoning performs well when applied to numerical problems. CoC without the Python interpreter (CoC (LM, LM state)) also fares poorly, though we see that the step-by-step approach proposed in ScratchPad prompting (Nye et al., 2021) improves performance on each task.
We also show that the ablations CoC (try Python except LM, try Python except LM state), in which CoC first tries to run the entire code with Python and, if that fails, simulates the code with an LM, perform quite well. Again we see that maintaining a program state provides an improvement in performance. Given the only minor degradations in performance observed, these are reasonable alternatives to the fully interweaved CoC for their simplicity. We note, though, that their performance would be much worse in cases where interweaving code and semantics is truly necessary; for example, if code is necessary to parse image inputs or to access an external database, but language is necessary to parse the results (see the robotics applications in Section A.6).
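This whole-program fallback can be sketched in a few lines, under the assumption of a hypothetical `lm` callable standing in for the language model: the full program is first handed to the Python interpreter, and only if execution fails is the entire program simulated.

```python
# Sketch of the "try Python except LM" ablation (hypothetical lm() helper):
# first attempt to execute the full program with the Python interpreter;
# only on failure is the whole program simulated by the LM.
def run_program(code, lm):
    scope = {}
    try:
        exec(code, scope)   # full Python execution
        return scope.get("answer")
    except Exception:
        return lm(code)     # fall back: the LM simulates everything

# Purely algorithmic program: Python succeeds, the LM is never called.
algorithmic = "answer = ((-3 + 5 * 8 * -4) - (9 - 8 * -7))"
print(run_program(algorithmic, lm=lambda code: "LM guess"))  # -228

# Semantic program: Python raises NameError, so the LM takes over.
semantic = "answer = detect_sarcasm(essay)"
print(run_program(semantic, lm=lambda code: "yes"))  # yes
```

Unlike the fully interweaved CoC, this variant cannot mix interpreter results and LM simulation within a single program, which is why it degrades on tasks where the two must alternate.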
Figure 5: Chain of Code ablations on average performance grouped by task type.
Figure 6: Results across all BIG-Bench Hard tasks compared to human baseline (Srivastava et al., 2022). The tasks (x-axis) in each plot are sorted individually by performance. See Table A1 and Figure 5 for a breakdown by task type.
Table 2: Ablation overall performance (%) with few-shot prompting both within a single task and across tasks (cross-task). The delta compared to the full model (Interweave) is shown in parentheses.
| Prompt | CoC (Interweave) | try Python except LM state | try Python except LM | Python | LM state | LM |
| --- | --- | --- | --- | --- | --- | --- |
| Single task | 84 | 82 (-2) | 80 (-4) | 48 (-36) | 63 (-21) | 57 (-27) |
| Cross task | 61 | 57 (-4) | 60 (-1) | 35 (-26) | 49 (-12) | 50 (-11) |
Question 4: Scaling. Figure 7 shows the performance of CoC across model sizes. We observe that, as with Chain of Thought prompting, the improvements of CoC increase with model size. In fact, for some of the algorithmic tasks, Chain of Code even outperforms the best human raters (who admittedly did not have access to code). Unlike Chain of Thought prompting, however, which only brings performance benefits for the largest model (d-3), CoC outperforms the direct question answering baseline for smaller models as well (a-1, b-1, c-1), suggesting that it is easier for smaller models to output structured code, rather than natural language, as intermediate steps.
Question 5: Cross-task Prompting. For cross-task prompting, we prompt the language models with a few examples from different problems. Figure 7 and Table 1 show a performance drop for all methods. Despite this drop, CoC outperforms CoT and direct prompting at scale, nearly achieving average human performance. This is a promising indication towards general-purpose reasoning, in which a model is not expected to receive examples of similar problems in its prompt.
Figure 7: Average performance with model scaling, from text-ada-001 (smallest) to text-davinci-003 (largest).
Question 6: Instruction Tuned Models. We chose text-davinci-003, a completion model, as our primary evaluation model over more advanced instruction tuned models (gpt-3.5-turbo and gpt-4) because the former is more amenable to few-shot prompting with examples, the main evaluation paradigm for BIG-Bench Hard. However, we still made our best attempt to evaluate our method with the instruction tuned models using two setups. The first is zero-shot prompting, where we directly prompt the models via the system message to output direct answers, chains of thought, or pseudocode/code (which we optionally execute with the Python interpreter, feeding the results back). The second is few-shot prompting, where we coerce the models to behave like completion models via the system message and feed the few-shot examples as usual. In both cases, we demonstrate that CoC brings noticeable benefits with little modification needed. See Sec. A.4 for more details.
Question 7: Robustness of Chain of Code. We show that CoC is generally robust to prompt variation by evaluating with different prompts independently written by three annotators on the same set of problems. Specifically, we select four representative tasks from BIG-Bench Hard that require generation of new code (as opposed to repeated code). While the performance of individual tasks has some variance, the average performance across the four tasks varies only within a few percentage points. See Sec. A.5 for more details.
Question 8: Beyond Language Reasoning. We show that CoC is well suited for tasks that require both semantic and algorithmic reasoning beyond language, such as robotics. The unique advantage of CoC in robotics is that it interacts seamlessly with robot perception and control APIs via Python code, such as running object detectors or invoking parameterized robot skills, while performing semantic subtasks in an "inline" fashion (e.g., classifying whether a piece of trash is compostable before picking it up). When equipped with the necessary robot APIs and a single example in the prompt to teach LMs the format, CoC can solve seven different robot manipulation tasks in the real world, showcasing generalization to new objects, languages, and task domains. See Sec. A.6 for more details.
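A robotics program of this kind can be sketched as follows. All function names here are illustrative stand-ins, not the paper's actual APIs: `detect_objects` and `pick_and_place` stand for the robot's perception and control interfaces, and `lm_is_compostable` stands for a semantic subtask the LMulator would answer inline.

```python
# Hypothetical robot APIs and LM hook; names are illustrative, not the paper's.
def detect_objects(scene):
    """Stand-in for a perception API (e.g., an open-vocabulary detector)."""
    return ["banana peel", "soda can"]

def pick_and_place(obj, bin_name):
    """Stand-in for a parameterized robot skill."""
    return f"moved {obj} to {bin_name}"

def lm_is_compostable(obj):
    """Stand-in for a semantic subtask the LM would simulate inline."""
    return obj in {"banana peel", "apple core"}

# CoC-style program: executable API calls interweave with LM-simulated semantics.
actions = []
for obj in detect_objects("tabletop"):
    bin_name = "compost" if lm_is_compostable(obj) else "recycling"
    actions.append(pick_and_place(obj, bin_name))
```

In a real deployment, the control-flow code above runs in the interpreter while only the semantic predicate is routed to the LM, which is the interweaving pattern the paper describes.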
## 4 Related Work
Language Model Reasoning The abilities and applications of language models have seen significant progress, due to their overall performance (Chowdhery et al., 2022; Touvron et al., 2023; Radford et al., 2019; Gemini Team, 2023) and emergent capabilities (Wei et al., 2022a), such as few-shot prompting (Brown et al., 2020) and abstract reasoning (Wei et al., 2022b). Perhaps most related to this work, a number of works have leveraged prompting to improve reasoning (Dohan et al., 2022): Chain of Thought (Wei et al., 2022b) proposes to break a task down into intermediate reasoning steps, least-to-most (Zhou et al., 2022a) proposes a series of increasingly simpler problems, and ScratchPad (Nye et al., 2021) proposes to maintain a trace of intermediate results for interpreting code (this first demonstrated the code simulation ability of LMs required for our LMulator). Along these lines, "let's think step by step" (Kojima et al., 2022) uses a few key words to elicit such breakdowns (words later refined to "Take a deep breath and work on this problem step-by-step" in (Yang et al., 2023)). Beyond these, other approaches structure such step-by-step solutions into graphical structures (Yao et al., 2023; Besta et al., 2023), plans (Wang et al., 2023b; Ning et al., 2023), or mixture-of-experts-based sampling (Wang et al., 2022; Zhou et al., 2022b). CoC builds upon the intuition of these works, with the observation that code is a formal, structured approach to breaking a problem down into sub-steps, with many advantages beyond natural language alone.
Language Model Tool Use Many recent works have proposed techniques for language models to use tools to respond to queries (Mialon et al., 2023). These tools have often been provided to the language model through prompting (Cobbe et al., 2021; Khot et al., 2022; Chowdhery et al., 2022; Drori et al., 2022; Yao et al., 2022), enabling tools such as calculators for math problems, code interpreters, and databases. Such tools can also provide feedback on novel modalities (Surís et al., 2023; Zeng et al., 2022). To expand the range of tools available, others have used external tool databases or finetuned language models (Schick et al., 2023; Qin et al., 2023; Parisi et al., 2022; Paranjape et al., 2023). As tool interfaces vary, feedback from the tool can likewise improve performance (Gou et al., 2023; Zhou et al., 2023). In this work we leverage the expressiveness and generality of full code as well as its structure, by treating it both as a tool and as a framework.
Language Model Program Synthesis The ability of language models to code is well known and they have been applied as programming assistants (Chen et al., 2021) and shown to be capable programmers on their own (Austin et al., 2021; Li et al., 2022; Nijkamp et al., 2022). This ability has been applied to a variety of tasks outside of language alone, leveraging their ability to reason through code in new settings, such as robotics (Liang et al., 2023; Singh et al., 2023), embodied agents (Wang et al., 2023a), or vision (SurĂs et al., 2023). Others have specifically done so for reasoning, such as Program of Thoughts (Chen et al., 2022) and Program-aided Language Models (Gao et al., 2023), which generate code to solve numerical reasoning problems. Herein, we focus on the interplay between writing code, running code, and language models simulating code, thus enabling new regimes of language model code applications, such as semantic reasoning.
## 5 Conclusions, Limitations, and Future Work
We have proposed Chain of Code, an approach towards reasoning with language models through writing code, and executing code either with an interpreter or with a language model that simulates the execution (termed herein an LMulator) if the code is not executable. As such, CoC can leverage both the expressive structure of code and the powerful tools available to it. Beyond this, by simulating the execution of non-executable code, CoC can apply to problems nominally outside the scope of code (e.g., semantic reasoning problems). We have demonstrated that this approach outperforms baselines, and for some tasks even the best human raters, in a range of challenging language and numeric reasoning problems.
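The dispatch described above, running each line with the interpreter when possible and otherwise handing it to the LM, can be sketched minimally as follows. This is a simplification under stated assumptions: `query_lm` is a hypothetical stand-in for the LMulator call, and the toy `fake_lm` below fakes its answer; the paper's actual implementation tracks and interweaves program state more carefully.

```python
def lmulator_run(code_lines, query_lm):
    """Execute code line by line; fall back to an LM for lines Python cannot run.

    `query_lm` is a hypothetical callable that, given the current program state
    and the offending line, returns the updated state the LM predicts.
    """
    state = {}
    for line in code_lines:
        try:
            exec(line, {}, state)       # try the real interpreter first
        except Exception:
            state = query_lm(state, line)  # hand off to the "LMulator"
    return state


# Toy usage: the semantic call is undefined, so it is routed to the fake "LM".
def fake_lm(state, line):
    # Pretend the LM simulated `is_sarcastic("sure, great idea")` -> True.
    state["sarcastic"] = True
    return state

final = lmulator_run(
    ['n = 0', 'sarcastic = is_sarcastic("sure, great idea")'],
    fake_lm,
)
```

Here the first line executes normally, while the second raises a `NameError` and is simulated instead, leaving both variables in the shared program state.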
This work is not without its limitations. First, generating and executing code in two steps, as well as interweaving code and language execution, requires additional context length and computation time. Second, though we have not seen any loss of performance for semantic tasks in aggregate, there are a few tasks for which code doesn't help, e.g., the task Ruin Names, which asks whether an edit to a name is humorous. Finally, our implementation for interweaving LM and code execution is quite simple: it tracks the program state in strings and parses those strings into Python's built-in data types (e.g., dict, tuple). As our method stands now, the LM cannot modify custom Python objects while simulating code execution. In theory, however, this is doable as long as each such Python object has a serialization and deserialization method, e.g., using techniques like Protocol Buffers.
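Why the limitation to built-in types arises can be illustrated with a minimal sketch of the string-based state tracking, assuming (hypothetically) that the LM emits state as `name = literal` lines. Only values parseable as literals of built-in types survive the round trip; a custom class instance would not.

```python
import ast

def parse_lm_state(state_str):
    """Parse an LM-emitted program state (as text) back into Python values.

    Only literals of built-in types (dict, tuple, list, str, int, ...) can be
    recovered by ast.literal_eval, which is exactly the limitation discussed
    above: a custom object rendered as text would fail to parse.
    """
    return {name: ast.literal_eval(value)
            for name, value in (line.split(" = ", 1)
                                for line in state_str.strip().splitlines())}

state = parse_lm_state("counts = {'a': 2, 'b': 1}\nbest = ('a', 2)")
```

The recovered `counts` is a real dict and `best` a real tuple; serializing richer objects would require the per-class serialization methods mentioned above.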
There are many avenues for future work with CoC. First, we believe that a unified code-and-language interpreter combines the commonsense of language models with the analytical abilities, structure, and interpretability of code. Such a technology can thus enable applications of code and code-like reasoning to novel problem regimes beyond simple reasoning. Second, we are interested in investigating the degree to which finetuning a language model to be an LMulator can benefit semantic code reasoning. Third, we see evidence that reasoning through many pathways yields improvements, which is a promising step forward. Finally, we believe this integration with code enables access to external modalities, such as vision or databases, and represents an interesting path for new applications (e.g., robotics, augmented reality).
## Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, most of which relate to the usage of large language models (LLMs). One aspect of Chain of Code that warrants further discussion is that CoC executes LLM output with the Python interpreter as if it were always benign code. If deployed in the wild, Chain of Code will need additional safeguards against potentially harmful code from maliciously prompted LLMs before running that code.
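One simple (and deliberately incomplete) safeguard along these lines is to execute generated code with a stripped-down builtins table so that file, network, and OS access is unavailable. This is only an illustration: CPython offers no secure in-process sandbox, so a real deployment would need OS-level isolation (containers, seccomp, etc.).

```python
# A minimal illustration of one safeguard: execute LM-generated code with a
# restricted set of builtins. NOT a real sandbox; see the caveat above.
SAFE_BUILTINS = {"len": len, "range": range, "sum": sum, "min": min, "max": max}

def run_guarded(code, state):
    """Run code with only the whitelisted builtins available."""
    exec(code, {"__builtins__": SAFE_BUILTINS}, state)
    return state

state = run_guarded("total = sum(range(5))", {})  # allowed: uses sum/range

try:
    run_guarded("open('/etc/passwd')", {})        # blocked: `open` not exposed
    leaked = True
except NameError:
    leaked = False
```

The benign arithmetic succeeds while the file access fails with a `NameError`, since `open` is simply absent from the builtins the guarded environment exposes.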
## References
- Austin et al. (2021) Austin, J., Odena, A., Nye, M., Bosma, M., Michalewski, H., Dohan, D., Jiang, E., Cai, C., Terry, M., Le, Q., et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- Besta et al. (2023) Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Gianinazzi, L., Gajda, J., Lehmann, T., Podstawski, M., Niewiadomski, H., Nyczyk, P., et al. Graph of thoughts: Solving elaborate problems with large language models. arXiv preprint arXiv:2308.09687, 2023.
- Brown et al. (2020) Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Chen et al. (2021) Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Chen et al. (2022) Chen, W., Ma, X., Wang, X., and Cohen, W. W. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588, 2022.
- Chowdhery et al. (2022) Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- Cobbe et al. (2021) Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Dohan et al. (2022) Dohan, D., Xu, W., Lewkowycz, A., Austin, J., Bieber, D., Lopes, R. G., Wu, Y., Michalewski, H., Saurous, R. A., Sohl-Dickstein, J., et al. Language model cascades. arXiv preprint arXiv:2207.10342, 2022.
- Drori et al. (2022) Drori, I., Zhang, S., Shuttleworth, R., Tang, L., Lu, A., Ke, E., Liu, K., Chen, L., Tran, S., Cheng, N., et al. A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level. Proceedings of the National Academy of Sciences, 119(32):e2123433119, 2022.
- Gao et al. (2023) Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Callan, J., and Neubig, G. Pal: Program-aided language models. In International Conference on Machine Learning, pp. 10764–10799. PMLR, 2023.
- Gemini Team (2023) Gemini Team, G. Gemini: A family of highly capable multimodal models. Technical report, Google, 2023. URL https://storage.googleapis.com/deepmind-media/gemini/gemini_1_report.pdf.
- Google et al. (2023) Google, Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., Shakeri, S., Taropa, E., Bailey, P., Chen, Z., et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
- Gou et al. (2023) Gou, Z., Shao, Z., Gong, Y., Shen, Y., Yang, Y., Duan, N., and Chen, W. Critic: Large language models can self-correct with tool-interactive critiquing. arXiv preprint arXiv:2305.11738, 2023.
- Khot et al. (2022) Khot, T., Trivedi, H., Finlayson, M., Fu, Y., Richardson, K., Clark, P., and Sabharwal, A. Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022.
- Kirillov et al. (2023) Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A. C., Lo, W.-Y., Dollár, P., and Girshick, R. Segment anything. arXiv:2304.02643, 2023.
- Kojima et al. (2022) Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
- Lewkowycz et al. (2022) Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., et al. Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858, 2022. URL https://arxiv.org/abs/2206.14858.
- Li et al. (2022) Li, Y., Choi, D., Chung, J., Kushman, N., Schrittwieser, J., Leblond, R., Eccles, T., Keeling, J., Gimeno, F., Dal Lago, A., et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022.
- Liang et al. (2023) Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Ichter, B., Florence, P., and Zeng, A. Code as policies: Language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 9493–9500. IEEE, 2023.
- Liu et al. (2023) Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Li, C., Yang, J., Su, H., Zhu, J., et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
- Mialon et al. (2023) Mialon, G., Dessì, R., Lomeli, M., Nalmpantis, C., Pasunuru, R., Raileanu, R., Rozière, B., Schick, T., Dwivedi-Yu, J., Celikyilmaz, A., et al. Augmented language models: a survey. arXiv preprint arXiv:2302.07842, 2023.
- Min et al. (2022) Min, S., Lyu, X., Holtzman, A., Artetxe, M., Lewis, M., Hajishirzi, H., and Zettlemoyer, L. Rethinking the role of demonstrations: What makes in-context learning work? arXiv preprint arXiv:2202.12837, 2022.
- Mirchandani et al. (2023) Mirchandani, S., Xia, F., Florence, P., Ichter, B., Driess, D., Arenas, M. G., Rao, K., Sadigh, D., and Zeng, A. Large language models as general pattern machines. arXiv preprint arXiv:2307.04721, 2023.
- Nijkamp et al. (2022) Nijkamp, E., Pang, B., Hayashi, H., Tu, L., Wang, H., Zhou, Y., Savarese, S., and Xiong, C. Codegen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.
- Ning et al. (2023) Ning, X., Lin, Z., Zhou, Z., Yang, H., and Wang, Y. Skeleton-of-thought: Large language models can do parallel decoding. arXiv preprint arXiv:2307.15337, 2023.
- Nye et al. (2021) Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.
- OpenAI (2023) OpenAI. Gpt-4 technical report, 2023.
- Paranjape et al. (2023) Paranjape, B., Lundberg, S., Singh, S., Hajishirzi, H., Zettlemoyer, L., and Ribeiro, M. T. Art: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014, 2023.
- Parisi et al. (2022) Parisi, A., Zhao, Y., and Fiedel, N. Talm: Tool augmented language models. arXiv preprint arXiv:2205.12255, 2022.
- Qin et al. (2023) Qin, Y., Liang, S., Ye, Y., Zhu, K., Yan, L., Lu, Y., Lin, Y., Cong, X., Tang, X., Qian, B., et al. Toolllm: Facilitating large language models to master 16000+ real-world apis. arXiv preprint arXiv:2307.16789, 2023.
- Radford et al. (2019) Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
- Schick et al. (2023) Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., and Scialom, T. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023.
- Singh et al. (2023) Singh, I., Blukis, V., Mousavian, A., Goyal, A., Xu, D., Tremblay, J., Fox, D., Thomason, J., and Garg, A. Progprompt: Generating situated robot task plans using large language models. In 2023 IEEE International Conference on Robotics and Automation (ICRA), pp. 11523–11530. IEEE, 2023.
- Srivastava et al. (2022) Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
- SurĂs et al. (2023) SurĂs, D., Menon, S., and Vondrick, C. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023.
- Suzgun et al. (2022) Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E. H., Zhou, D., et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
- Touvron et al. (2023) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Wang et al. (2023a) Wang, G., Xie, Y., Jiang, Y., Mandlekar, A., Xiao, C., Zhu, Y., Fan, L., and Anandkumar, A. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023a.
- Wang et al. (2023b) Wang, L., Xu, W., Lan, Y., Hu, Z., Lan, Y., Lee, R. K.-W., and Lim, E.-P. Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models. arXiv preprint arXiv:2305.04091, 2023b.
- Wang et al. (2022) Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., Narang, S., Chowdhery, A., and Zhou, D. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- Wei et al. (2022a) Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a.
- Wei et al. (2022b) Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., Zhou, D., et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022b.
- Yang et al. (2023) Yang, C., Wang, X., Lu, Y., Liu, H., Le, Q. V., Zhou, D., and Chen, X. Large language models as optimizers. arXiv preprint arXiv:2309.03409, 2023.
- Yao et al. (2022) Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629, 2022.
- Yao et al. (2023) Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.
- Zeng et al. (2022) Zeng, A., Attarian, M., Ichter, B., Choromanski, K., Wong, A., Welker, S., Tombari, F., Purohit, A., Ryoo, M., Sindhwani, V., et al. Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598, 2022.
- Zhou et al. (2023) Zhou, A., Wang, K., Lu, Z., Shi, W., Luo, S., Qin, Z., Lu, S., Jia, A., Song, L., Zhan, M., et al. Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification. arXiv preprint arXiv:2308.07921, 2023.
- Zhou et al. (2022a) Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Cui, C., Bousquet, O., Le, Q., et al. Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022a.
- Zhou et al. (2022b) Zhou, Y., Lei, T., Liu, H., Du, N., Huang, Y., Zhao, V., Dai, A. M., Le, Q. V., Laudon, J., et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35:7103–7114, 2022b.
## Appendix A Appendix
### A.1 Quantitative results on language reasoning tasks
Table A1 shows the full per-task results across ablations on BIG-Bench Hard (BBH) tasks, as well as broken down by task type and execution type.
| BIG-Bench Hard Task | Type | Rand. | Human (Avg.) | Human (Max) | Direct | CoT | Interweave | try Python except LM state | try Python except LM | Python | LM state | LM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Boolean Expressions | λ+ | 50 | 79 | 100 | 88 | 89 | 100 | 100 | 100 | 100 | 95 | 90 |
| Causal Judgement | κ* | 50 | 70 | 100 | 64 | 64 | 56 | 57 | 63 | 0 | 57 | 60 |
| Date Understanding | κ- | 17 | 77 | 100 | 61 | 84 | 75 | 72 | 74 | 59 | 66 | 57 |
| Disambiguation QA | κ/ | 33 | 67 | 93 | 70 | 68 | 71 | 67 | 68 | 0 | 67 | 68 |
| Dyck Languages | λ+ | 1 | 48 | 100 | 6 | 50 | 100 | 100 | 99 | 99 | 1 | 7 |
| Formal Fallacies | κ* | 25 | 91 | 100 | 56 | 56 | 55 | 54 | 55 | 0 | 54 | 56 |
| Geometric Shapes | λ+ | 12 | 54 | 100 | 48 | 66 | 100 | 100 | 100 | 100 | 13 | 44 |
| Hyperbaton | κ/ | 50 | 75 | 100 | 63 | 64 | 98 | 62 | 55 | 0 | 62 | 55 |
| Logical Deduction | λ* | 23 | 40 | 89 | 49 | 66 | 68 | 79 | 57 | 0 | 79 | 58 |
| Movie Recommendation | κ/ | 25 | 61 | 90 | 85 | 81 | 80 | 83 | 80 | 0 | 83 | 79 |
| Multi-Step Arithmetic | λ+ | 0 | 10 | 25 | 0 | 48 | 100 | 100 | 100 | 100 | 0 | 1 |
| Navigate | λ* | 50 | 82 | 100 | 58 | 94 | 86 | 84 | 68 | 0 | 84 | 68 |
| Object Counting | λ- | 0 | 86 | 100 | 30 | 82 | 96 | 98 | 98 | 98 | 57 | 50 |
| Penguins in a Table | κ- | 0 | 78 | 100 | 62 | 82 | 90 | 88 | 90 | 88 | 71 | 59 |
| Reasoning about Colored Objects | κ- | 12 | 75 | 100 | 64 | 87 | 78 | 74 | 78 | 64 | 64 | 70 |
| Ruin Names | κ/ | 25 | 78 | 100 | 76 | 70 | 55 | 56 | 46 | 0 | 56 | 47 |
| Salient Translation Error Detection | κ/ | 17 | 37 | 80 | 66 | 61 | 58 | 63 | 64 | 0 | 63 | 64 |
| Snarks | κ/ | 50 | 77 | 100 | 70 | 71 | 76 | 76 | 66 | 0 | 76 | 66 |
| Sports Understanding | κ/ | 50 | 71 | 100 | 72 | 96 | 91 | 93 | 75 | 0 | 93 | 74 |
| Temporal Sequences | λ* | 25 | 91 | 100 | 38 | 60 | 98 | 93 | 99 | 93 | 93 | 99 |
| Tracking Shuffled Objects | λ- | 23 | 65 | 100 | 25 | 72 | 100 | 96 | 96 | 96 | 71 | 24 |
| Web of Lies | λ- | 50 | 81 | 100 | 54 | 100 | 97 | 96 | 96 | 97 | 96 | 50 |
| Word Sorting | λ+ | 0 | 63 | 100 | 51 | 50 | 99 | 100 | 99 | 100 | 54 | 54 |
| NLP Task (avg) | κ | 30 | 71 | 97 | 67 | 74 | 74 | 70 | 68 | 18 | 68 | 63 |
| Algorithmic Task (avg) | λ | 21 | 64 | 92 | 41 | 71 | 95 | 95 | 92 | 80 | 58 | 50 |
| All Tasks (avg) | | 26 | 68 | 95 | 55 | 72 | 84 | 82 | 80 | 48 | 63 | 57 |
| Python exec (same program) | + | 13 | 51 | 85 | 38 | 61 | 100 | 100 | 100 | 100 | 33 | 39 |
| Python exec (different program) | - | 17 | 77 | 100 | 49 | 84 | 89 | 87 | 89 | 84 | 71 | 52 |
| LM exec (same program) | / | 36 | 66 | 95 | 72 | 73 | 76 | 71 | 65 | 0 | 71 | 65 |
| LM exec (different program) | * | 35 | 75 | 98 | 53 | 68 | 72 | 73 | 68 | 19 | 73 | 68 |

$\lambda$ denotes an algorithmic task and $\kappa$ denotes an NLP task (with categories outlined in Suzgun et al. (2022)). $+$ denotes a task where the code between prompts is repeated and can be executed by Python, $-$ denotes a task where the code between prompts must change and can be executed by Python, $/$ denotes a task where the code between prompts is repeated and must be executed by the LM, and $*$ denotes a task where the code between prompts must change and must be executed by the LM. The Rand., Human (Avg.), and Human (Max) columns are from Srivastava et al. (2022).
Table A1: Full results across ablations on BIG-Bench Hard (BBH) tasks.
### A.2 Quantitative results on the GSM8K Benchmark
Table A2 shows results on the grade-school math benchmark (GSM8K) (Cobbe et al., 2021) with direct prompting, Chain of Thought, and Chain of Code. We find that CoC generally outperforms CoT and direct prompting. Since these tasks are primarily algorithmic and solvable by Python alone, all Chain of Code variants that use Python achieve the same performance, matching the performance shown in Program of Thoughts (Chen et al., 2022).
Table A2: GSM8K (Cobbe et al., 2021) performance (%) with few-shot prompting both within a single task and across tasks (cross-task). The delta compared to direct prompting is shown in parentheses.
### A.3 Qualitative results on language reasoning tasks
Figure A1 shows the model outputs for a few reasoning tasks from BIG-Bench Hard (BBH) and Figure A2 shows a demonstrative example of date reasoning. These examples are selected to highlight the interweaving execution of the Python interpreter and the LMulator.
(a) Movie Recommendation
(b) Hyperbaton
(c) Logical Deduction
(d) Disambiguation QA
Figure A1: Model outputs for a few reasoning tasks from BIG-Bench Hard (BBH). We observe that CoC can apply to a wide variety of complex reasoning tasks that involve both semantic and numeric reasoning. Red highlight indicates LM generated code being executed by the Python interpreter, and purple highlight indicates LM simulating the code execution.
Direct answer only
Chain of Code
Figure A2: A demonstrative example of how Chain of Code generates code and reasons through an LM-augmented code emulator. Lines evaluated with Python are in red and with an LM are in purple. The chain of thought and direct answers were evaluated with gpt-4 in October 2023, and we note the current model (as of December 2023) writes code to solve this problem and gets the same solution as Chain of Code.
### A.4 Instruction Tuned Models
Since most of the results presented in our main paper use text-davinci-003, a completion model that is particularly amenable to few-shot prompting, one may wonder how CoC can be used with instruction tuned models, like gpt-4 (OpenAI, 2023). Figuring out how to elicit the desired behavior of CoC from these instruction tuned models (i.e., writing code or pseudocode to solve problems) is non-trivial. We conduct two additional experiments below as our best effort to shed some light on this subject.
Zero-shot prompting. We directly prompt the models with instructions to elicit the desired reasoning approaches. Note that we do not provide few-shot examples in the prompt (hence "zero-shot"). For the baselines, we ask the model to "directly answer" (Direct) or "think step by step" (CoT). For CoC variants, we ask the model to "write python code to help solve the problem, if it's helpful". If a program is written, we either run the code with a Python interpreter and then feed the result (or the error message if execution fails) back to the model to determine a final answer (CoC (Python)), or ask the model to simulate the output of code execution as an LMulator (CoC (LM)). The CoC (Python) baseline can be thought of as a comparison to an LM with Python tool use.
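A minimal sketch of this zero-shot CoC (Python) loop might look as follows, where `chat` is a hypothetical stand-in for a call to the instruction tuned model (stubbed here with a canned reply):

```python
# Sketch of the zero-shot CoC (Python) loop: prompt the model, extract any
# code it writes, run it, and return the result (or error) for the final
# answer. `chat` is a placeholder for an LM API call, stubbed below.
import contextlib
import io
import traceback

FENCE = "`" * 3  # markdown code fence delimiter

def run_python(code):
    """Execute model-written code, returning stdout or the error traceback."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return buf.getvalue()
    except Exception:
        return traceback.format_exc()

def chat(messages):
    # Placeholder LM: canned-answers a counting question with Python code.
    return FENCE + "python\nprint(len(['Mumbai', 'London']))\n" + FENCE

def coc_python(question):
    reply = chat([{"role": "user",
                   "content": question + "\nWrite python code to help "
                                         "solve the problem, if it's helpful."}])
    if FENCE + "python" in reply:
        code = reply.split(FENCE + "python")[1].split(FENCE)[0]
        # Feed the execution result (or error message) back for a final
        # answer; here we simply return it for illustration.
        return run_python(code).strip()
    return reply

print(coc_python("How many cities are listed?"))  # -> 2
```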
Table A3 shows the performance of each. With gpt-3.5-turbo, both CoT and CoC (Python) show benefits over direct prompting, although both are strongly outperformed by CoC (Interweave). With gpt-4, despite the considerable model strength advantage over text-davinci-003, CoC (Interweave) still outperforms, though the gap is narrower. Admittedly, CoC (Interweave) is prompted with three examples whereas the other two are not.
Table A3: Comparisons with instruction tuned models in the chat interface, with and without tool use, denoted as CoC (Python) and CoC (LM) respectively. The delta compared to CoC with text-davinci-003 is shown in parentheses. In this experiment, the prompts for gpt-3.5-turbo and gpt-4 only contain a generic, shared system message and do not contain few-shot examples.
| Model | Direct | CoT | CoC (Python) | CoC (LM) |
| --- | --- | --- | --- | --- |
| gpt-3.5-turbo | 51 (-33) | 56 (-28) | 56 (-28) | 45 (-39) |
| gpt-4 | 70 (-14) | 78 (-6) | 82 (-2) | 75 (-9) |

For reference, CoC (Interweave) with text-davinci-003 achieves 84.
Few-shot prompting. We attempt to coerce the instruction tuned models to behave like completion models by using the following system message: "Pretend you are a completion model that is prompted with three examples. You should follow the pattern of these examples strictly. At the end of your reply, you should always output an answer". In the user message, we include three examples from the same (single task) or different (cross task) task domains, as well as the query question, exactly the same as how we evaluated the completion models.
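Assembling the chat request described above could look roughly like this sketch, where the few-shot example strings are placeholders and the system message is quoted from the text:

```python
# Sketch of assembling the few-shot chat request; `build_messages` and the
# "<example n>" strings are illustrative placeholders, not the paper's code.
SYSTEM = ("Pretend you are a completion model that is prompted with three "
          "examples. You should follow the pattern of these examples "
          "strictly. At the end of your reply, you should always output "
          "an answer.")

def build_messages(few_shot_examples, query):
    # Single task: three examples from the same task domain.
    # Cross task: three examples from different task domains.
    assert len(few_shot_examples) == 3
    user = "\n\n".join(few_shot_examples + [query])
    return [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": user}]

msgs = build_messages(["<example 1>", "<example 2>", "<example 3>"],
                      "Q: <query question>")
```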
Table A4 shows that CoC still brings a sizable performance gain over the Direct and CoT baselines. With gpt-4, the gap is again narrower mainly because the base model already performs quite well and leaves little room for improvement. This experiment suggests a viable way to combine the strength of CoC with that of more advanced instruction tuned models like gpt-4.
Table A4: Applying CoC to instruction tuned models in the chat interface, while coercing them to behave like completion models. The delta compared to direct prompting is shown in parentheses. In this experiment, the prompts contain a generic, shared system message that asks LMs to behave like completion models, as well as three examples from the same or different task domains at the beginning of the user message, as before.
| Prompt | gpt-3.5-turbo Direct | gpt-3.5-turbo CoT | gpt-3.5-turbo CoC | gpt-4 Direct | gpt-4 CoT | gpt-4 CoC |
| --- | --- | --- | --- | --- | --- | --- |
| Single task | 47 | 73 (+26) | 79 (+32) | 69 | 88 (+19) | 91 (+22) |
| Cross task | 47 | 60 (+13) | 61 (+14) | 67 | 81 (+14) | 84 (+17) |
### A.5 Robustness of Chain of Code
Similar to Chain of Thought prompts, Chain of Code prompts can take various forms: e.g., different ways of function decomposition, coding styles, variable names, reasoning pathways, and so on. In this section, we analyze whether CoC is robust to such variation across prompts.
We invite three annotators that are familiar with Python to write CoC prompts for four representative tasks in BIG-Bench Hard. We select these four tasks because they all require generation of new code (as opposed to repeated code) during test time. As before, for single task evaluation, we prompt LMs with three examples from the same task domain, whereas for cross task evaluation, we prompt LMs with three examples from different task domains (one from each of the other three tasks).
Results in Table A5 show that our method is robust to prompt variation and doesn't require extensive prompt engineering.
Table A5: Performance variation across prompts written independently by different authors for four representative tasks in BIG-Bench Hard. Our results show that CoC is generally robust to prompt variation, allowing for different coding styles and reasoning logic in the few-shot prompts.
| Prompt | Annotator | Date Understanding | Logical Deduction | Object Counting | Penguins in a Table | Average |
| --- | --- | --- | --- | --- | --- | --- |
| Single task | A | 73 | 64 | 92 | 78 | 77 |
| | B | 68 | 54 | 95 | 88 | 76 |
| | C | 69 | 43 | 90 | 89 | 73 |
| Cross task | A | 41 | 33 | 67 | 76 | 54 |
| | B | 48 | 29 | 78 | 88 | 61 |
| | C | 60 | 30 | 76 | 64 | 57 |
### A.6 Robotics Applications
Downstream applications such as robotics are well suited to CoC, as robotics tasks require both semantic and algorithmic reasoning, as well as interfacing with other APIs through code (such as control or perception APIs (Liang et al., 2023)) and with users through natural language. For example, given a task like "sort the fruits by size", the robot must reason over which items are fruits, sort them by size, and then connect those decisions to actions executable on the robot. CoC (Interweave) is able to solve these challenges with the Python interpreter and the LMulator at runtime, while allowing for more interpretability and fine-grained control of the robot policies.
Environment and Robot Setup. Our environment is a tabletop with small objects (containers, toys, etc.) and a UR5 robot arm equipped with a vacuum gripper and a wrist-mounted RGB-D camera. For the purpose of our experiments, the available perception API is detect_objects(), which returns a list of detected objects (probabilities, labels, bounding boxes and segmentation masks) from the wrist camera. This API is implemented by first querying GPT-4V (OpenAI, 2023) for a list of objects, and then using Grounding-SAM (Kirillov et al., 2023; Liu et al., 2023) to localize them. The available control API is pick_place(obj1, obj2), which is a scripted primitive skill that picks up obj1 and places it on top of obj2. There is also a text-to-speech API say(sentence) that allows the robot to communicate with the user.
Experimental Setup. We evaluate with a number of tabletop pick-and-place robotics tasks that involve semantic reasoning. For few-shot prompting, we include a single example: "Serve a meal that follows the user's dietary restrictions", so that the language model understands the expected structure as well as the available robot APIs. During test time, we query the model with each of the following instructions.
1. "Pack a lunch box for someone who is on a vegan diet."
2. "Assemble a sandwich for someone who is vegetarian."
3. "Gather ingredients for a peanut butter sandwich in a plate."
4. "Prepare 西红柿炒鸡蛋 in the pot." (interleaving English and Chinese on purpose)
5. "Place all paper-made objects in the grass-colored container."
6. "Sort the objects on the table into the compost bin and the recycle bin."
7. "My steak is too bland. Can you help?"
Results. With a single example in our prompt, we see that our model is able to generalize to new objects, languages, and task domains (see Figure A3 and an example trajectory in Figure A4). Note that for these robotics tasks, unlike the previous language reasoning tasks, our main method CoC (Interweave) is the only capable approach, as the code requires line-by-line interplay between the Python interpreter execution (robot APIs) and the LMulator (commonsense QA like is_compostable).
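For concreteness, a program of the kind CoC might generate for the sorting task could look like the following sketch. The robot APIs (detect_objects, pick_place) come from the setup described above but are stubbed here so the sketch runs, and is_compostable, which the LMulator would simulate at runtime as undefined Python, is replaced with a stub encoding the commonsense answers the LM would be expected to give:

```python
# Sketch of a CoC-style program for "sort the objects on the table into the
# compost bin and the recycle bin". All three functions below are stubs: the
# real detect_objects uses GPT-4V + Grounding-SAM, pick_place is a scripted
# robot skill, and is_compostable is undefined code handed to the LMulator.

def detect_objects():
    # Stub for the perception API; returns detected object labels.
    return ["post-it note", "orange peel"]

placed = []
def pick_place(obj1, obj2):
    placed.append((obj1, obj2))  # stub for the pick-and-place skill

def is_compostable(obj):
    # Stub for the LMulator's commonsense judgment.
    return obj in {"orange peel", "banana"}

# The generated program: interleaves robot APIs with semantic QA.
for obj in detect_objects():
    if is_compostable(obj):
        pick_place(obj, "compost bin")
    else:
        pick_place(obj, "recycle bin")
```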
Figure A3 shows the one-shot prompt as well as the model outputs and how they are executed for a few test instructions.
(a) Given Prompt
(b) Novel Objects
(c) Novel Languages
(d) Novel Tasks
Figure A3: The one-shot prompt as well as the model outputs for a few test instructions for the robotics tasks. When given a single example in the prompt (a), our method can generalize (b-d) to new objects, languages, and task domains. Red highlight indicates LM-generated code being executed by the Python interpreter, and purple highlight indicates the LM simulating the code execution. Gray text is for illustration purposes only, and is not provided to our model. Note that code in the form of robot.<func_name> invokes robot APIs.
(Figure A4 image frames: extracted/5762267/fig/robot_traj_1.jpg through robot_traj_4.jpg, photographs of the tabletop setup as the robot arm sorts items into the compost bin and the recycle bin.)
Figure A4: Robot trajectory visualization for the task "sort the objects on the table into the compost bin and the recycle bin". CoC first generates code to solve the problem, and then executes the code with Python if possible (e.g., robot APIs like detect_objects and pick_place), and with the LMulator if not (e.g., commonsense QA like is_compostable). The robot successfully picks and places the Post-it note into the recycle bin and the orange peel into the compost bin. See the full code in Fig. A3.