## Invalid Logic, Equivalent Gains: The Bizarreness of Reasoning in Language Model Prompting
Rylan Schaeffer*¹ Kateryna Pistunova*² Samar Khanna*¹ Sarthak Consul*¹ Sanmi Koyejo¹

*Equal contribution. ¹Department of Computer Science, Stanford University. ²Department of Physics, Stanford University. Correspondence to: Rylan Schaeffer <rschaef@cs.stanford.edu>, Sanmi Koyejo <sanmi@cs.stanford.edu>.
## Abstract
Language models can be prompted to reason through problems in a manner that significantly improves performance. However, why such prompting improves performance is unclear. Recent work showed that logically invalid Chain-of-Thought (CoT) prompting improves performance almost as much as logically valid CoT prompting, and that editing CoT prompts to replace problem-specific information with abstract or out-of-distribution information typically does not harm performance. Critics have responded that these findings are based on too few and too easy tasks to draw meaningful conclusions. To resolve this dispute, we test whether logically invalid CoT prompts offer the same level of performance gains as logically valid prompts on the hardest tasks in the BIG-Bench benchmark, termed BIG-Bench Hard (BBH). We find that logically invalid reasoning prompts do indeed achieve performance gains on BBH tasks similar to those of logically valid reasoning prompts. We also discover that some CoT prompts used by previous works contain logical errors. This suggests that covariates beyond logically valid reasoning are responsible for the performance improvements.
## 1. Introduction
Language models can perform significantly better when prompted in particular ways. For example, prompts that instruct or guide language models through step-by-step processing have been shown to significantly improve performance on question answering, conversational response generation, and other tasks (Nye et al., 2021; Wei et al., 2022b; Jung et al., 2022; Kojima et al., 2022; Yao et al., 2023). These prompting techniques are especially effective on the hardest tasks (Suzgun et al., 2022) in the BIG-Bench benchmark (Srivastava et al., 2022), leading many to conclude that such techniques unlock emergent¹ human-like reasoning abilities in large language models (Wei et al., 2022a). However, why such prompting strategies improve performance is unclear. Madaan & Yazdanbakhsh (2022) showed that replacing problem-specific information in Chain-of-Thought (CoT) prompts with either abstract information or out-of-distribution information typically does not harm CoT's performance gains, and Wang et al. (2022) showed that logically invalid CoT prompts (i.e., prompts which contain logically invalid chains of reasoning) achieve approximately the same gains as logically valid CoT prompts. These discoveries are puzzling, but critics warned against drawing confident or far-ranging conclusions from these results because (1) only a small number of tasks were considered (specifically GSM8K (Cobbe et al., 2021), Bamboogle (Press et al., 2022), and two commonsense tasks (Srivastava et al., 2022)) and (2) the tasks are relatively easy to solve.
In this work, we aim to clarify this dispute by asking whether logically invalid CoT prompts indeed offer performance gains comparable to those of logically valid CoT prompts on more numerous and harder tasks. To answer this question, we evaluate the performance of logically invalid CoT prompts on the hardest tasks in BIG-Bench (Srivastava et al., 2022), the so-called BIG-Bench Hard (BBH) tasks (Suzgun et al., 2022). We find that logically invalid CoT prompts induce performance gains approaching those of logically valid CoT prompts on BBH tasks. We also discover that CoT prompts from previous work contain logical errors, a discovery that we confirmed with the original authors. Our findings add to growing evidence that while prompting techniques can, without doubt, significantly improve language model performance on complex tasks, the evidence that these performance gains are attributable to logical reasoning skills is questionable. Covariates other than the logical reasoning in the prompts' exemplars may be responsible for the performance improvements, pointing to the need for continued investigation.
¹ Although see Schaeffer et al. (2023).
## 2. Methods
## 2.1. Tasks: BIG-Bench Hard ⊂ BIG-Bench
The Beyond-the-Imitation-Game Benchmark (BIG-Bench) is a benchmark of over 200 natural language tasks created to evaluate the capabilities and limitations of language models (Srivastava et al., 2022). BIG-Bench Hard (BBH) is a subset of BIG-Bench consisting of 23 tasks specifically identified as extremely difficult for language models (Suzgun et al., 2022). BBH was constructed by excluding BIG-Bench tasks that meet any of the following criteria:

- tasks with more than three subtasks,
- tasks with fewer than 103 examples,
- tasks without human-rater baselines,
- tasks that do not use multiple-choice or exact match as the evaluation metric, and
- tasks on which the best reported model beats the average reported human-rater score.

The remaining 23 BIG-Bench tasks comprise the BBH tasks. The BBH tasks cover a variety of categories, including traditional NLP, mathematics, commonsense reasoning, and question answering. In general, BBH tasks fall into one of two categories: more traditional NLP tasks (e.g., Date Understanding) or more algorithmic tasks (e.g., Boolean Expressions). Many state-of-the-art language models, including GPT-3 (Brown et al., 2020) and PaLM (Chowdhery et al., 2022), as well as internal dense and sparse Google models, were shown to be unable to exceed average human-rater performance on BBH tasks when asked to answer directly.
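For concreteness, this selection procedure can be read as a single filter over task metadata. The following is a minimal sketch of our own; the dataclass fields are hypothetical names, not the schema of the BBH release:

```python
# A minimal sketch of the BBH selection criteria described above; the
# field names are hypothetical, introduced only for illustration.
from dataclasses import dataclass

@dataclass
class BigBenchTask:
    name: str
    num_subtasks: int
    num_examples: int
    has_human_baseline: bool
    metric: str               # e.g., "multiple_choice" or "exact_match"
    best_model_score: float   # best reported model score
    avg_human_score: float    # average reported human-rater score

def survives_bbh_filters(task: BigBenchTask) -> bool:
    """True iff a BIG-Bench task avoids every BBH exclusion criterion."""
    return (
        task.num_subtasks <= 3
        and task.num_examples >= 103
        and task.has_human_baseline
        and task.metric in {"multiple_choice", "exact_match"}
        and task.best_model_score <= task.avg_human_score
    )

# bbh = [t for t in big_bench_tasks if survives_bbh_filters(t)]  # 23 tasks remain
```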
## 2.2. Prompt Types
In this work, we compare three prompt types applied to the BBH tasks. Examples of task problems and prompts are shown in Table 1.
Chain-of-Thought (CoT) Prompting In (logically valid) Chain-of-Thought (CoT) prompting, each question in each task is prepended with three high-quality, human hand-written, topical examples of Question → Reasoning → Answer. The reasoning is a sequence of natural language steps that derive the answer from information in the question (Nye et al., 2021; Wei et al., 2022b). For each BBH task, Suzgun et al. (2022) released three examples of Question → Reasoning → Answer that we adapt. We do not use the examples directly because we want CoT prompts to contain logically valid chains of reasoning, and as we discovered (Sec. 3.1), some of the released examples contain logical errors.
Logically Invalid Chain-of-Thought (Invalid CoT) Prompting In logically invalid Chain-of-Thought (Invalid CoT) prompting, we edit the reasoning in each task's three exemplars to become logically invalid. To do this, we modify the reasoning to contain nonsensical or invalid steps. For instance, on the Boolean Expressions task, we provided the reasoning "Because English does not permit multiple negatives, the expression '(not not True)' evaluates to '( not True )'", and on the Date Understanding task, we provided the reasoning "If tomorrow is 11/12/2019, then today is 11/12/2018. The date one year ago from today is 11/11/2018." In all cases, the modified BBH exemplars were logically invalid but reached the correct answer.
Answer-Only (AO) Prompting To establish baseline performance, we follow previous work and prompt language models to answer each question in each task directly. We call this 'Answer Only' (AO) prompting.
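Across all three prompt types, the only difference is the text placed between each exemplar's question and answer. Below is a minimal sketch of how such prompts could be assembled; it is not the released BBH harness, and the exemplar fields (`question`, `answer`, `valid_reasoning`, `invalid_reasoning`) are hypothetical names we introduce for illustration:

```python
# A minimal sketch of prompt assembly for the three prompt types; the
# exemplar dictionary format is our own assumption, not the BBH release's.
def build_prompt(query: str, exemplars: list[dict], prompt_type: str) -> str:
    """Prepend a task's three exemplars to a query, formatted per prompt type."""
    blocks = []
    for ex in exemplars:
        if prompt_type == "answer_only":
            answer = f"A: {ex['answer']}"
        elif prompt_type == "cot":
            answer = (f"A: Let's think step by step. {ex['valid_reasoning']} "
                      f"So the answer is {ex['answer']}.")
        elif prompt_type == "invalid_cot":
            # Same exemplar, but with reasoning hand-edited to be logically
            # invalid while still ending at the correct answer.
            answer = (f"A: Let's think step by step. {ex['invalid_reasoning']} "
                      f"So the answer is {ex['answer']}.")
        else:
            raise ValueError(f"unknown prompt type: {prompt_type}")
        blocks.append(f"Q: {ex['question']}\n{answer}")
    blocks.append(f"Q: {query}\nA:")  # the model completes this final block
    return "\n\n".join(blocks)
```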
Table 1. Examples of different prompt types.
| Prompt Type | Example Query |
|---|---|
| | Evaluate the result of a random Boolean expression. |
| | Q: not ( ( not not True ) ) is |
| Answer Only | False |
| Chain of Thought (CoT) | A: Let's think step by step. Remember that (i) expressions inside brackets are always evaluated first and that (ii) the order of operations from highest priority to lowest priority is 'not', 'and', 'or', respectively. We first simplify this expression 'Z' as follows: 'Z = not ( ( not not True ) ) = not ( ( A ) )' where 'A = not not True'. Let's evaluate A: A = not not True = not (not True) = not False = True. Plugging in A, we get: Z = not ( ( A ) ) = not ( ( True ) ) = not True = False. So the answer is False. |
| Logically Invalid Chain of Thought (Invalid CoT) | A: Let's think step by step. Remember that (i) expressions inside brackets are always evaluated first and that (ii) the order of operations from highest priority to lowest priority is 'not', 'and', 'or', respectively. Because English does not permit multiple negatives, the expression '(not not True)' evaluates to '( not True )'. The expression 'not ( ( not not True ) )' therefore evaluates to 'not ( ( not True ) )'. By the same logic, the expression 'not ( ( not True ) )' simplifies to 'not True'. In Boolean logic, 'not True' is False. So the answer is False. |
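As a sanity check on the Table 1 exemplar, the Boolean expression can be evaluated directly in Python, whose Boolean operators obey the same precedence rules cited in the prompt:

```python
# Brackets are evaluated first, and 'not' binds tighter than 'and' and 'or';
# Python's operators follow the same precedence rules stated in the prompt.
assert (not not True) is True           # the double negation cancels
assert (not ((not not True))) is False  # so the full expression is False
```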
Figure 1. Codex benefits more from Chain-of-Thought Prompting than InstructGPT or PaLM 540B on BIG-Bench Hard Tasks. Dashed vertical lines indicate average model performance across all BIG-Bench Hard Tasks. Data from Suzgun et al. (2022).
## 2.3. Metrics
To evaluate the performance of the different prompt types, we score model outputs with accuracy (i.e., exact string match). We acknowledge that recent work showed such a choice of metric may lead to misleading scientific conclusions (Schaeffer et al., 2023), but we made this choice to ensure fair comparison with the relevant preceding work.
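A minimal sketch of this metric follows; the whitespace-stripping normalization is our assumption, added only for illustration:

```python
# Exact-string-match accuracy: fraction of model outputs that exactly
# match the reference answer (modulo surrounding whitespace, our assumption).
def exact_match_accuracy(predictions: list[str], targets: list[str]) -> float:
    matches = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return matches / len(targets)

# Example: exact_match_accuracy(["False", "True"], ["False", "False"]) -> 0.5
```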
## 2.4. Choice of Language Model
Due to limited resources, we could evaluate only one model. Only Codex (Chen et al., 2021) and InstructGPT (Ouyang et al., 2022) were publicly queryable, ruling out PaLM 540B (Chowdhery et al., 2022). We chose to evaluate Codex because Suzgun et al. (2022) found that Codex outperformed both InstructGPT and PaLM 540B; the authors did not make this point explicitly, but re-visualizing their data reveals it (Fig. 1). Although this finding may seem surprising, independent work has found that language models reason better if pretrained primarily on code (Madaan et al., 2022). Perhaps unsurprisingly, we found that Codex's high average performance with CoT prompting is driven by large improvements on BBH algorithmic tasks, offsetting mediocre improvements on natural language tasks.
## 2.5. Choice of BBH Tasks
We evaluate AO, CoT, and Invalid CoT prompting strategies on all BBH tasks.
## 3. Experimental Results
## 3.1. Previous CoT Prompts Contain Logical Errors
While converting the logically valid CoT prompts written for BBH tasks by Suzgun et al. (2022) into logically invalid CoT prompts, we discovered that at least three of the BBH tasks' CoT prompts were already logically invalid.
Figure 2. Sanity checks. Top: All prompting strategies almost always produced an answer across tasks. Bottom: Our accuracies for Answer Only (AO) and Chain-of-Thought (CoT) prompting closely matched published values from Suzgun et al. (2022) except on one task: word sorting.
On the Multistep Arithmetic task, a hand-written CoT prompt reads: 'Then, the final equation is A * B = -41 * -3 = (-61) * (-3) = 123. So the answer is 123.' Here, -41 was substituted with -61 without justification, and -61 * -3 is not 123, even though 123 is the correct answer. On the Navigate task, a hand-written CoT prompt reads '(4) Take 7 steps right: (0, 7), facing the positive y-axis.' where it should read '(4) Take 7 steps right: (0, 0), facing the positive y-axis.' On the Web of Lies task, '(3) Vina says Jerry tells the truth. Since we know from (2) that Jerry tells the truth, if Vine says Jerry tells the truth, then Vina tells the truth.' should be '(3) Vina says Jerry tells the truth. Since we know from (2) that Jerry tells the truth, if Vina says Jerry tells the truth, then Vina tells the truth.' We raised all three errors with the original authors, who confirmed our discoveries and accepted our suggested corrections. This discovery already hints that CoT prompts need not be logically correct for the model to output the correct answer. It also suggests that CoT prompts from earlier papers, e.g., Wei et al. (2022b), may need to be investigated further.
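The arithmetic error is easy to verify directly:

```python
# Verifying the Multistep Arithmetic exemplar's final step:
assert -41 * -3 == 123   # the original (correct) final step
assert -61 * -3 == 183   # the unjustified substitution does not yield 123
```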
## 3.2. Sanity Checks
We found that Codex produced an answer under all prompting strategies on almost all tasks (Fig. 2, top).
Figure 3. Accuracy per task per prompt type on BIG-Bench Hard. Prompt types: AO = Answer Only, CoT = Chain-of-Thought, CoT (Invalid) = Logically Invalid Chain-of-Thought. Approximately 200-250 questions per BIG-Bench Hard task.
We also found that the accuracies under Answer Only (AO) and Chain-of-Thought (CoT) prompting closely matched published values (Suzgun et al., 2022), with one exception: the word sorting task (Fig. 2, bottom).
## 3.3. Accuracy by Prompt Type by Task
The accuracy of Codex on each of the BIG-Bench Hard (BBH) tasks under each of the three prompt types (Answer Only, Chain-of-Thought, Logically Invalid Chain-of-Thought) is displayed in Fig. 3. Despite using the same model and the same decoding hyperparameters (e.g., temperature), we observe tremendous variation in accuracy across tasks. To better compare performance across the three prompt types, we visualized each prompt type's average accuracy over all BBH tasks. We found that Chain-of-Thought prompting beats Answer Only prompting, but Logically Invalid Chain-of-Thought is close behind Chain-of-Thought and better than Answer Only (Fig. 4, top). Pairwise comparisons between prompt types across different BBH tasks reveal that the prompt types do not dominate one another (Fig. 4, bottom), but rather display rich structure.
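The averages in Fig. 4 (top) are macro-averages: each prompt type's accuracy is first computed per task and then averaged over tasks. A minimal sketch, assuming a hypothetical per-task accuracy dictionary rather than our actual experiment code:

```python
# Macro-average one prompt type's accuracy over all BBH tasks; the
# per_task_acc layout (task -> prompt type -> accuracy) is hypothetical.
def mean_accuracy(per_task_acc: dict[str, dict[str, float]],
                  prompt_type: str) -> float:
    scores = [task_acc[prompt_type] for task_acc in per_task_acc.values()]
    return sum(scores) / len(scores)

# for pt in ("AO", "CoT", "CoT (Invalid)"):
#     print(pt, mean_accuracy(per_task_acc, pt))
```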
## 4. Conclusion
On the diverse and challenging BIG-Bench Hard tasks, we find that Chain-of-Thought prompting performs best on average, but logically invalid Chain-of-Thought prompting is close behind and outperforms Answer Only prompting. This demonstrates that completely illogical reasoning in CoT prompts does not significantly harm the performance of the language model. Our findings suggest that valid reasoning in prompting is not the chief driver of performance gains, raising the question of what is. We note that there are complementary approaches to achieving reasoning in language models, such as enforcing valid reasoning in the model architecture (Creswell et al., 2022; Creswell & Shanahan, 2022) or via autoformalization (Wu et al., 2022; Azerbayev et al., 2023).
Figure 4. Logically invalid Chain-of-Thought outperforms Answer Only and nearly matches Chain-of-Thought. Top: Accuracy per BBH task, averaged over all BBH tasks. Bottom: Pairwise accuracy plots comparing the three different prompt types.
Our work raises important questions for future work. Why are models robust to invalid CoT prompts? What features of the data or prompts result in the model outputting inconsistent or invalid outputs? Does increasing the degree of 'incorrectness' or the number of incorrect prompts affect the model's sensitivity to invalid CoT? What other properties of the valid prompts is the model sensitive to? Answering these questions can yield useful insights into prompt engineering for LLMs, as well as a deeper understanding of when models output inconsistent or 'hallucinated' answers.
## References
- Azerbayev, Z., Piotrowski, B., Schoelkopf, H., Ayers, E. W., Radev, D., and Avigad, J. ProofNet: Autoformalizing and formally proving undergraduate-level mathematics. arXiv preprint arXiv:2302.12433, 2023.
- Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877-1901, 2020.
- Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H. W., Sutton, C., Gehrmann, S., et al. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- Creswell, A. and Shanahan, M. Faithful reasoning using large language models. arXiv preprint arXiv:2208.14271, 2022.
- Creswell, A., Shanahan, M., and Higgins, I. Selection-inference: Exploiting large language models for interpretable logical reasoning. arXiv preprint arXiv:2205.09712, 2022.
- Jung, J., Qin, L., Welleck, S., Brahman, F., Bhagavatula, C., Bras, R. L., and Choi, Y. Maieutic prompting: Logically consistent reasoning with recursive explanations. arXiv preprint arXiv:2205.11822, 2022.
- Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. Large language models are zero-shot reasoners. arXiv preprint arXiv:2205.11916, 2022.
- Madaan, A. and Yazdanbakhsh, A. Text and patterns: For effective chain of thought, it takes two to tango. arXiv preprint arXiv:2209.07686, 2022.
- Madaan, A., Zhou, S., Alon, U., Yang, Y., and Neubig, G. Language models of code are few-shot commonsense learners. arXiv preprint arXiv:2210.07128, 2022.
- Nye, M., Andreassen, A. J., Gur-Ari, G., Michalewski, H., Austin, J., Bieber, D., Dohan, D., Lewkowycz, A., Bosma, M., Luan, D., et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114, 2021.
- Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744, 2022.
- Press, O., Zhang, M., Min, S., Schmidt, L., Smith, N. A., and Lewis, M. Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022.
- Schaeffer, R., Miranda, B., and Koyejo, S. Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004, 2023.
- Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
- Suzgun, M., Scales, N., Schärli, N., Gehrmann, S., Tay, Y., Chung, H. W., Chowdhery, A., Le, Q. V., Chi, E. H., Zhou, D., et al. Challenging BIG-Bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
- Wang, B., Min, S., Deng, X., Shen, J., Wu, Y., Zettlemoyer, L., and Sun, H. Towards understanding chain-of-thought prompting: An empirical study of what matters. arXiv preprint arXiv:2212.10001, 2022.
- Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a.
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., and Zhou, D. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022b.
- Wu, Y., Jiang, A. Q., Li, W., Rabe, M. N., Staats, C., Jamnik, M., and Szegedy, C. Autoformalization with large language models. arXiv preprint arXiv:2205.12615, 2022.
- Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., and Narasimhan, K. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023.