## Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting
Rylan Schaeffer * 1 Kateryna Pistunova * 2 Samar Khanna * 1 Sarthak Consul * 1 Sanmi Koyejo 1
## Abstract
Language models can be prompted to reason through problems in a manner that significantly improves performance. However, why such prompting improves performance is unclear. Recent work showed that using logically invalid Chain-of-Thought (CoT) prompting improves performance almost as much as logically valid CoT prompting, and that editing CoT prompts to replace problem-specific information with abstract information or out-of-distribution information typically doesn't harm performance. Critics have responded that these findings are based on too few and too easy tasks to draw meaningful conclusions. To resolve this dispute, we test whether logically invalid CoT prompts offer the same level of performance gains as logically valid prompts on the hardest tasks in the BIG-Bench benchmark, termed BIG-Bench Hard (BBH). We find that the logically invalid reasoning prompts do indeed achieve similar performance gains on BBH tasks as logically valid reasoning prompts. We also discover that some CoT prompts used by previous works contain logical errors. This suggests that covariates beyond logically valid reasoning are responsible for performance improvements.
## 1. Introduction
Language models can perform significantly better when prompted in particular ways. For example, prompts that recommend or guide language models through step-by-step processing have been shown to significantly improve performance on question answering, conversational response generation and other tasks (Nye et al., 2021; Wei et al., 2022b; Jung et al., 2022; Kojima et al., 2022; Yao et al.,
* Equal contribution 1 Department of Computer Science, Stanford University 2 Department of Physics, Stanford University. Correspondence to: Rylan Schaeffer < rschaef@cs.stanford.edu > , Sanmi Koyejo < sanmi@cs.stanford.edu > .
2023). These prompting techniques are especially effective on the hardest tasks (Suzgun et al., 2022) in the BIG-Bench benchmark (Srivastava et al., 2022), leading many to conclude that such techniques unlock emergent¹ human-like reasoning abilities in large language models (Wei et al., 2022a). However, why such prompting strategies improve performance is unclear. Madaan & Yazdanbakhsh (2022) showed that replacing problem-specific information in Chain-of-Thought (CoT) prompts with either abstract information or out-of-distribution information typically doesn't harm CoT's performance gains, and Wang et al. (2022) showed that logically invalid CoT prompts (i.e., prompts that contain logically invalid chains of reasoning) achieve approximately the same gains as logically valid CoT prompts. These discoveries are puzzling, but critics warned against drawing confident or far-ranging conclusions from these results because (1) only a small number of tasks were considered (specifically GSM8K (Cobbe et al., 2021), Bamboogle (Press et al., 2022), and two commonsense tasks (Srivastava et al., 2022)) and (2) the tasks are relatively easy to solve.
In this work, we aim to resolve this dispute by asking whether logically invalid CoT prompts indeed offer performance gains comparable to those of logically valid CoT prompts on more, and harder, tasks. To answer this question, we evaluate the performance of logically invalid CoT prompts on the hardest tasks in BIG-Bench (Srivastava et al., 2022), the so-called BIG-Bench Hard (BBH) tasks (Suzgun et al., 2022). We find that logically invalid CoT prompts induce performance gains approaching those of logically valid CoT prompts on BBH tasks. We also discover that CoT prompts from previous work contain logical errors, a discovery that we confirmed with the original authors. Our findings add to growing evidence that while prompting techniques can, without doubt, significantly improve language model performance on complex tasks, the evidence that these performance gains are attributable to logical reasoning skills is questionable. Covariates other than the logical reasoning in the prompts' exemplars may be responsible for the performance improvements, pointing to the need for continued investigation.
¹ Although see Schaeffer et al. (2023).
## 2. Methods
## 2.1. Tasks: BIG-Bench Hard ⊂ BIG-Bench
Beyond-the-Imitation Game Benchmark (BIG-Bench) is a benchmark of over 200 natural language tasks created to evaluate the capabilities and limitations of language models (Srivastava et al., 2022). BIG-Bench Hard (BBH) is a subset of BIG-Bench consisting of 23 tasks specifically identified as extremely difficult for language models (Suzgun et al., 2022). BBH was constructed by excluding BIG-Bench tasks according to the following criteria: tasks with more than three subtasks, tasks with fewer than 103 examples, tasks without human-rater baselines, tasks that do not use multiple choice or exact match as the evaluation metric, and tasks on which the best reported model score beats the average reported human-rater score; the remaining 23 BIG-Bench tasks comprise the BBH tasks. The BBH tasks cover a variety of categories, including traditional NLP, mathematics, commonsense reasoning, and question answering. In general, BBH tasks fall into one of two categories: more traditional NLP tasks (e.g., Date Understanding) or more algorithmic tasks (e.g., Boolean Expressions). Many state-of-the-art language models, including GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022), as well as internal dense and sparse Google models, were shown to be incapable of exceeding average human-rater performance on BBH tasks when asked directly.
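The exclusion criteria above can be read as a simple filter over task metadata. The sketch below illustrates this; the field names (`num_subtasks`, `has_human_baseline`, etc.) and the example scores are our own illustrative assumptions, not the actual BIG-Bench task schema or reported numbers.

```python
# Hypothetical sketch of the BBH filtering criteria; field names and
# example values are illustrative assumptions, not the real task schema.

def is_bbh_candidate(task: dict) -> bool:
    """A task survives the BBH filters if it violates none of the exclusion criteria."""
    return (
        task["num_subtasks"] <= 3                               # not more than three subtasks
        and task["num_examples"] >= 103                         # not fewer than 103 examples
        and task["has_human_baseline"]                          # has a human-rater baseline
        and task["metric"] in ("multiple_choice", "exact_match")  # allowed evaluation metric
        and task["best_model_score"] < task["avg_human_score"]  # best model still below humans
    )

tasks = [
    {"name": "boolean_expressions", "num_subtasks": 1, "num_examples": 250,
     "has_human_baseline": True, "metric": "exact_match",
     "best_model_score": 50.0, "avg_human_score": 79.0},
    {"name": "already_solved_task", "num_subtasks": 1, "num_examples": 250,
     "has_human_baseline": True, "metric": "exact_match",
     "best_model_score": 90.0, "avg_human_score": 80.0},
]
bbh = [t["name"] for t in tasks if is_bbh_candidate(t)]
print(bbh)  # → ['boolean_expressions']
```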
## 2.2. Prompt Types
In this work, we compare three prompt types applied to the BBH tasks. Examples of task problems and prompts are shown in Table 1.
Chain-of-Thought (CoT) Prompting In (logically valid) Chain-of-Thought (CoT) prompting, each question in each task is prepended with three high-quality, hand-written, topical examples of Question → Reasoning → Answer. The reasoning is a sequence of natural language steps that derive the answer from information in the question (Nye et al., 2021; Wei et al., 2022b). For each BBH task, Suzgun et al. (2022) released three examples of Question → Reasoning → Answer that we adapt. We do not use the examples directly because we want CoT prompts to contain logically valid chains of reasoning, and, as we discovered (Sec. 3.1), some of the examples contain logical errors.
Logically Invalid Chain-of-Thought (Invalid CoT) Prompting In logically invalid Chain-of-Thought (Invalid CoT) prompting, we edit the reasoning in each task's three examples to become logically invalid. To do this, we modify the reasoning to contain nonsensical or invalid steps. For instance, on the Boolean Expressions task, we provided the reasoning, 'Because English does not permit multiple negatives, the expression '(not not True)' evaluates to '( not True )',' and on the Date Understanding task, we provided the reasoning, 'If tomorrow is 11/12/2019, then today is 11/12/2018. The date one year ago from today is 11/11/2018.' In all cases, the modified BBH exemplars were logically invalid but reached the correct answer.
Answer-Only (AO) Prompting To establish baseline performance, we follow previous work and prompt language models to answer each question in each task directly. We call this 'Answer Only' (AO) prompting.
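The three prompt types above differ only in what precedes the target question: nothing but question-answer pairs (AO), or question-reasoning-answer exemplars whose reasoning is valid (CoT) or deliberately invalid (Invalid CoT). A minimal sketch of the assembly, assuming a simple `(question, reasoning, answer)` exemplar format (the exact formatting used by Suzgun et al. (2022) differs):

```python
# Minimal sketch of prompt assembly for the three prompt types.
# The exemplar format and separators are illustrative assumptions.

def build_prompt(question, exemplars, prompt_type):
    """Prepend exemplars to a target question; CoT and Invalid CoT differ
    only in the content of the exemplars' reasoning, not in structure."""
    blocks = []
    for q, reasoning, a in exemplars:
        if prompt_type == "AO":  # Answer Only: exemplars contain no reasoning
            blocks.append(f"Q: {q}\nA: {a}")
        else:                    # CoT or Invalid CoT: reasoning precedes the answer
            blocks.append(f"Q: {q}\nA: {reasoning} So the answer is {a}.")
    blocks.append(f"Q: {question}\nA:")  # target question, answer left blank
    return "\n\n".join(blocks)

exemplars = [("not ( True ) is",
              "Let's think step by step. 'not True' is False.",
              "False")]
print(build_prompt("not ( False ) is", exemplars, "CoT"))
```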
Table 1. Examples of different prompt types.
| Prompt Type | Example Query |
|---|---|
| | Evaluate the result of a random Boolean expression. |
| | Q: not ( ( not not True ) ) is |
| Answer Only | False |
| Chain of Thought (CoT) | A: Let's think step by step. Remember that (i) expressions inside brackets are always evaluated first and that (ii) the order of operations from highest priority to lowest priority is 'not', 'and', 'or', respectively. We first simplify this expression 'Z' as follows: 'Z = not ( ( not not True ) ) = not ( ( A ) )' where 'A = not not True'. Let's evaluate A: A = not not True = not (not True) = not False = True. Plugging in A, we get: Z = not ( ( A ) ) = not ( ( True ) ) = not True = False. So the answer is False. |
| Logically Invalid Chain of Thought (Invalid CoT) | A: Let's think step by step. Remember that (i) expressions inside brackets are always evaluated first and that (ii) the order of operations from highest priority to lowest priority is 'not', 'and', 'or', respectively. Because English does not permit multiple negatives, the expression '(not not True)' evaluates to '( not True )'. The expression 'not ( ( not not True ) )' therefore evaluates to 'not ( ( not True ) )'. By the same logic, the expression 'not ( ( not True ) )' simplifies to 'not True'. In Boolean logic, 'not True' is False. So the answer is False. |
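Both chains in Table 1, one valid and one invalid, reach the same final answer. As a quick sanity check (ours, not part of the original experiments), the expression can be evaluated mechanically:

```python
# The Boolean expression from Table 1, evaluated directly in Python.
# Both the valid and the invalid chain of thought arrive at this value.
result = not ((not (not True)))  # not not True = True; not (True) = False
print(result)  # → False
```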
Figure 1. Codex benefits more from Chain-of-Thought Prompting than InstructGPT or PaLM 540B on BIG-Bench Hard Tasks. Dashed vertical lines indicate average model performance across all BIG-Bench Hard Tasks. Data from Suzgun et al. (2022).
<details>
<summary>Image 1 Details</summary>



Scatter plot: per-task difference in accuracy (Chain-of-Thought minus Answer-Only, x-axis spanning roughly -20 to 60) for InstructGPT (blue), Codex (orange), and PaLM 540B (green) across the BIG-Bench Hard tasks (y-axis), plus NLP, algorithmic, and overall averages. Dashed vertical lines mark each model's average difference; the black dashed line at 0 marks no benefit from CoT.
</details>
## 2.3. Metrics
To evaluate the performance of the different prompting types, we used accuracy (i.e., exact string match) on model outputs. We acknowledge that recent work showed such a choice of metric may lead to misleading scientific conclusions (Schaeffer et al., 2023), but we made this choice to ensure fair comparison with the relevant preceding work.
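Exact string match is a simple metric. A minimal sketch is below; the whitespace stripping is our assumption about normalization, not necessarily the normalization used in the original evaluation harness.

```python
# Exact-string-match accuracy: a prediction is correct only if it equals
# the target string exactly (here, after stripping surrounding whitespace,
# which is an assumed normalization step).

def exact_match_accuracy(predictions, targets):
    correct = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return 100.0 * correct / len(targets)

acc = exact_match_accuracy(["False", "True", "7"], ["False", "False", "7"])
print(acc)  # → 66.66666666666667 (2 of 3 exact matches)
```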
## 2.4. Choice of Language Model
Due to limited resources, we could evaluate only one model. Only Codex (Chen et al., 2021) and InstructGPT (Ouyang et al., 2022) were publicly queryable, ruling out PaLM 540B (Chowdhery et al., 2022). We chose to evaluate Codex because Suzgun et al. (2022) found that Codex outperformed both InstructGPT and PaLM 540B; this point was not made by the authors, but visualizing their data anew reveals it (Fig. 1). Although this finding may seem surprising, independent work has found that language models reason better if pretrained primarily on code (Madaan et al., 2022). Perhaps unsurprisingly, we found that Codex's high average performance with CoT prompting is driven by large improvements on BBH algorithmic tasks, outweighing mediocre improvements on natural language tasks.
## 2.5. Choice of BBH Tasks
We evaluate AO, CoT, and Invalid CoT prompting strategies on all BBH tasks.
## 3. Experimental Results
## 3.1. Previous CoT Prompts Contain Logical Errors
While converting the logically valid CoT prompts written for BBH tasks by Suzgun et al. (2022) to logically invalid CoT prompts, we discovered that at least three of the BBH tasks' CoT prompts were already logically invalid. On
Figure 2. Sanity checks. Top: All prompting strategies almost always produced an answer across tasks. Bottom: Our accuracies for Answer Only (AO) and Chain-of-Thought (CoT) prompting closely matched published values from Suzgun et al. (2022) except on one task: word sorting.
<details>
<summary>Image 2 Details</summary>



Top: heatmap of "Answer Found (%)" per BBH task under each prompt type (AO, CoT, CoT (Invalid)); nearly all entries are at or near 100%, with the main exceptions being tracking_shuffled_objects_seven_objects and word_sorting under AO and CoT. Bottom: scatter plots comparing our Answer-Only and Chain-of-Thought accuracies against previously published values; points cluster near the diagonal.
</details>
the Multistep Arithmetic task, a human hand-written CoT prompt reads: 'Then, the final equation is A * B = -41 * -3 = (-61) * (-3) = 123. So the answer is 123.' To clarify: -41 was substituted with -61 without justification, and (-61) * (-3) is 183, not 123, even though 123 is the correct answer to the original expression. On the Navigate task, a human hand-written CoT prompt reads: '(4) Take 7 steps right: (0, 7), facing the positive y-axis.' which should be '(4) Take 7 steps right: (0, 0), facing the positive y-axis.' On the Web of Lies task, '(3) Vina says Jerry tells the truth. Since we know from (2) that Jerry tells the truth, if Vine says Jerry tells the truth, then Vina tells the truth.' should be '(3) Vina says Jerry tells the truth. Since we know from (2) that Jerry tells the truth, if Vina says Jerry tells the truth, then Vina tells the truth.' All three errors were raised to the authors, who confirmed our discoveries and accepted our suggested corrections. This discovery already hints that CoT prompts need not be logically correct for the model to output the correct answer. It also hints that CoT prompts from earlier papers, e.g., Wei et al. (2022b), may need to be investigated further.
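The Multistep Arithmetic error can be checked directly: the unjustified substitution of -61 for -41 also breaks the arithmetic, even though the stated final answer of 123 is correct for the original operands.

```python
# Verifying the numbers in the flawed Multistep Arithmetic exemplar.
assert (-41) * (-3) == 123   # the original, correct computation
assert (-61) * (-3) == 183   # the substituted expression does NOT equal 123
print("both checks pass")  # → both checks pass
```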
## 3.2. Sanity Checks
We found that Codex produced an answer under all prompting strategies on almost all tasks (Fig. 2 top), and the accuracies under Answer Only (AO) and Chain-of-Thought (CoT)
Figure 3. Accuracy per task per prompt type on BIG-Bench Hard. Prompt Types: AO = Answer Only, CoT = Chain-of-Thought, CoT (Invalid) = Logically Invalid Chain-of-Thought. Approximately 200-250 questions per BIG-Bench Hard task.
<details>
<summary>Image 3 Details</summary>



Heatmap of accuracy (%) per BBH task (rows) under each prompt type (columns: AO, CoT, CoT (Invalid)). CoT generally improves over AO; CoT (Invalid) often tracks CoT closely and occasionally exceeds it (e.g., tracking_shuffled_objects_seven_objects: 0.0% under both AO and CoT vs. 76.0% under CoT (Invalid); word_sorting: 0.0% vs. 20.8%).
</details>
prompting closely matched published values (Suzgun et al., 2022). The one exception was word sorting.
## 3.3. Accuracy by Prompt Type by Task
The accuracy of Codex on each of the BIG-Bench Hard (BBH) tasks under each of the three prompt types (Answer Only, Chain-of-Thought, Logically Invalid Chain-of-Thought) is displayed in Fig. 3. Despite using the same model and the same decoding hyperparameters (e.g., temperature), we notice tremendous variation in the accuracies. To better compare performance across the three prompt types, we visualized each prompt type's average accuracy over all BBH tasks. We found that Chain-of-Thought prompting beats Answer Only prompting, but Logically Invalid Chain-of-Thought is close behind Chain-of-Thought and better than Answer Only (Fig. 4 top). Pairwise comparisons between prompt types across different BBH tasks reveal that different prompt types do not dominate one another (Fig. 4 bottom), but rather display rich structure.
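The per-prompt-type aggregation is a plain average over tasks. A sketch of the computation follows, using a small subset of illustrative per-task accuracies (not our full measured results):

```python
# Sketch of the aggregation behind Fig. 4 (top): each prompt type's
# accuracy averaged over BBH tasks. The per-task numbers here are a
# small illustrative subset, not the complete measured results.

per_task = {
    "navigate":           {"AO": 42.8, "CoT": 96.0, "CoT (Invalid)": 90.8},
    "web_of_lies":        {"AO": 51.2, "CoT": 94.4, "CoT (Invalid)": 56.0},
    "temporal_sequences": {"AO": 80.4, "CoT": 97.2, "CoT (Invalid)": 92.8},
}

mean_acc = {
    ptype: sum(task[ptype] for task in per_task.values()) / len(per_task)
    for ptype in ("AO", "CoT", "CoT (Invalid)")
}
print(mean_acc)
```

On this subset the ordering matches the paper-level finding: CoT first, Invalid CoT close behind, Answer Only last.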
## 4. Conclusion
On the diverse and challenging BIG-Bench Hard tasks, we find that Chain-of-Thought prompting performs best on average, but logically invalid Chain-of-Thought prompting is close behind and outperforms Answer Only prompting. This demonstrates that completely illogical reasoning in CoT prompts does not significantly harm the performance of the language model. Our findings suggest that valid reasoning in prompting is not the chief driver of performance gains, raising the question of what is. We note that there are complementary approaches towards achieving reasoning in language models, such as enforcing valid reasoning in
Figure 4. Logically invalid Chain-of-Thought outperforms Answer Only and nearly matches Chain-of-Thought. Top: Accuracy per prompt type, averaged over all BBH tasks. Bottom: Pairwise accuracy plots comparing the three prompt types.
the model architecture (Creswell et al., 2022; Creswell & Shanahan, 2022) or via autoformalization (Wu et al., 2022; Azerbayev et al., 2023).
Our work raises important questions for future work. Why are models robust to invalid CoT prompts? What features of the data or prompts result in the model outputting inconsistent or invalid outputs? Does increasing the degree of 'incorrectness' or the number of incorrect prompts affect the model's sensitivity to invalid CoT? What other properties of the valid prompts is the model sensitive to? Answering these questions can yield useful insights into prompt engineering for LLMs, as well as a deeper understanding of when models output inconsistent or 'hallucinated' answers.
## References
- Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E. W., Radev, D., and Avigad, J. Proofnet: Autoformalizing and formally proving undergraduate-level mathematics. arXiv preprint arXiv:2302.12433 , 2023.
- Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in neural information processing systems , 33: 1877-1901, 2020.
- Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374 , 2021.
- Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 , 2022.
- Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168 , 2021.
- Creswell, A. and Shanahan, M. Faithful reasoning using large language models. arXiv preprint arXiv:2208.14271 , 2022.
- Creswell, A., Shanahan, M., and Higgins, I. Selectioninference: Exploiting large language models for interpretable logical reasoning. arXiv preprint arXiv:2205.09712 , 2022.
- Jung, J., Qin, L., Welleck, S., Brahman, F., Bhagavatula, C., Bras, R. L., and Choi, Y. Maieutic prompting: Logically consistent reasoning with recursive explanations. arXiv preprint arXiv:2205.11822 , 2022.
- Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916 , 2022.
- Madaan, A. and Yazdanbakhsh, A. Text and patterns: For effective chain of thought, it takes two to tango. arXiv preprint arXiv:2209.07686 , 2022.
- Madaan, A., Zhou, S., Alon, U., Yang, Y., and Neubig, G. Language models of code are few-shot commonsense learners. arXiv preprint arXiv:2210.07128 , 2022.
- Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114 , 2021.
- Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35:27730-27744, 2022.
- Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N. A., and Lewis, M. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350 , 2022.
- Schaeffer, R., Miranda, B., and Koyejo, S. Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004 , 2023.
- Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615 , 2022.
- Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E. H., Zhou, D., et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261 , 2022.
- Wang, B., Min, S., Deng, X., Shen, J., Wu, Y., Zettlemoyer, L., and Sun, H. Towards understanding chain-of-thought prompting: An empirical study of what matters. arXiv preprint arXiv:2212.10001 , 2022.
- Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682 , 2022a.
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903 , 2022b.
- Wu, Y., Jiang, A. Q., Li, W., Rabe, M. N., Staats, C., Jamnik, M., and Szegedy, C. Autoformalization with large language models. arXiv preprint arXiv:2205.12615 , 2022.
- Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601 , 2023.