## Response: Emergent analogical reasoning in large language models
Damian Hodel 1 and Jevin West 1
1 Center for an Informed Public, Information School, University of Washington hodeld@uw.edu, jevinw@uw.edu
May 2, 2024
## 1 Introduction
In their recent Nature Human Behaviour paper, 'Emergent analogical reasoning in large language models' (Webb, Holyoak, and Lu, 2023), the authors argue that 'GPT-3 exhibits a very general capacity to identify and generalize, in zero-shot fashion, relational patterns found within both formal problems and meaningful texts.' This conclusion arises from their comparison of GPT-3 with human performance across four analogical reasoning domains, where they find comparable results. In this response, we argue that this approach is unsuitable for evaluating general, zero-shot reasoning in large language models (LLMs). Two primary reasons underlie our objection. First, the term 'zero-shot' implies problem sets entirely novel to GPT-3. However, the chosen approach cannot conclusively eliminate the possibility of these problems residing in the LLM's training data, as acknowledged by the authors themselves in the review file 1. Second, the approach assumes that tests designed for humans can accurately measure LLM capabilities. This assumption is prevalent but remains unverified. We also provide empirical results to support our claims; see the appendix (Section 7.1). Our counterexamples show that GPT-3 fails to solve the simplest variations of the original tasks, whereas human performance remains consistently high across all modified versions.
Given the hype surrounding LLM capability and this paper in particular 2,3,4,

1 https://www.nature.com/articles/s41562-023-01659-w#peer-review

2 https://www.tagesanzeiger.ch/beherrscht-die-kuenstliche-intelligenz-analogien-768020720454

3 https://www.news-medical.net/news/20230731/AI-language-model-GPT-3-performs-about-as-well-as-college-undergraduates-in-analogical-reasoning.aspx

4 https://www.sciencemediacentre.org/expert-reaction-to-study-looking-at-gpt3-large-language-model-and-ability-to-reason-by-analogy/
contrasted with the many findings of LLM brittleness 5, we felt it was important to respond and to illustrate the insufficiency of the methods employed in addressing GPT-3's supposed general, zero-shot reasoning. It is important that we interpret LLM results with caution and refrain from anthropomorphizing LLMs. Tests designed to assess the general capabilities of humans may not inherently serve the same purpose when applied to LLMs.
Others have commented on this paper, and we want to note these contributions. Mitchell (2023) discusses this paper, focusing on the letter string and digit matrix analogy problems. Mitchell disagrees that 'the digit matrix problems are essentially equivalent in complexity and difficulty to Ravens Progressive Matrix problems.' Further, Mitchell presents individual counterexamples of the letter string problems in which GPT-3 makes nonhuman-like errors, as evidence against the claimed robustness of GPT-3 in analogical reasoning. We conduct a similar but more systematic analysis and include human behavioral experiments, detailed in the appendix, that concur with Mitchell's conclusion. Mitchell also points out that the term 'accuracy' implies that there was only one correct answer to each problem, which is not the case for these problems but is an assumption implicitly made by the authors. For comparison purposes, we adopt the assumption of Webb, Holyoak, and Lu (2023) in our paper and use the same terms, i.e., 'accuracy' and 'performance', but recognize this limitation.
## 2 Criticism of the Methods Employed in the Original Paper
To assess general, zero-shot reasoning capacity of LLMs, Webb, Holyoak, and Lu (2023) compare GPT-3 with humans and find similar or even better performance across a range of analogical reasoning tests adapted from existing cognitive tests designed for humans. However, we believe that this approach is not sufficient for testing the general, zero-shot reasoning capacity of large language models (LLMs). Here is why:
First, 'zero-shot' implies analogical problem sets that are entirely novel to GPT-3, encompassing both specific examples and variants of those examples. However, this condition is not met by some of the letter string problems used in the original paper, as noted by the authors themselves in the review file 6 : 'It is possible that GPT-3 has been trained on other letter string analogy problems, as these problems are discussed on a number of webpages.' Without ruling out the possibility of data memorization, one cannot claim zero-shot reasoning. As the first author notes in a recent MIT Technology Review article 7 , [if the test
5 https://www.technologyreview.com/2023/08/30/1078670/large-language-models-arent-people-lets-stop-testing-them-like-they-were
6 https://www.nature.com/articles/s41562-023-01659-w#peer-review
7 https://www.technologyreview.com/2023/08/30/1078670/large-language-models-arent-people-lets-stop-testing-them-like-they-were
examples exist in the training data], 'I think we really can't conclude much of anything.'
In the aforementioned peer-review file, the authors further note that they asked GPT-3 about these problems as a way of testing for their existence in the training data. It makes sense to at least try this, but we find it to be weak evidence, given the large number of possible answers to this question and the ambiguity of the answers given. To investigate further, we asked ChatGPT to provide examples of letter string problems. Examples were given, suggesting that it has seen such examples in its training data. We include our question and ChatGPT's answer in the appendix. It is important to note that ChatGPT was trained on more data than GPT-3, so this result provides only circumstantial evidence.
Zero-shot reasoning is an extraordinary claim that requires extraordinary evidence. At the very least, it necessitates demonstrating that the problems, as well as their variations, do not already exist within the training data, as previously mentioned. The original paper fails to offer such evidence for any of the four task domains. We do recognize that obtaining such evidence can be exceptionally challenging. Many researchers lack access to GPT-3's training data, and even if they did, confirming the absence of examples or derivations from the training data is nearly impossible. However, the difficulty of providing evidence of zero-shot reasoning should not be a reason to claim it.
Second, Webb, Holyoak, and Lu (2023) claim that the presented problem types test GPT-3's human-like reasoning capacity in a 'very general' way. This assumption is based on the premise that LLMs behave similarly to humans, thus implying that a test designed for humans can adequately assess LLMs in a broader capacity beyond the tasks included in the test. However, this assumption has not been substantiated.
On the contrary, generalized findings across the literature on LLM brittleness tend to contradict it. In Appendix 7.1, we present counterexamples involving the letter string analogy problems, which demonstrate the brittleness of the assessment approach employed. In these tests, GPT-3 fails to solve simple variants of the letter string analogies presented in the original paper, while human performance remains high.
In addition to the finding that GPT-3 matches or even outperforms human performance, Webb, Holyoak, and Lu (2023) further show that GPT-3 exhibits human-like characteristics in analogical reasoning, i.e., decreasing performance with increasing problem complexity. Based on this result, the authors propose that GPT-3 may have developed mechanisms similar to those underlying human intelligence. This is one possible interpretation. However, an alternative explanation could be that the training data contains a scarcity of solutions to complex problems, possibly reflecting the challenges humans encounter with such problems, a notion supported by our experiments involving human subjects.
It is important to note that our intention is not to discredit the use of such tests for studying LLMs but to point out the limitations of these methods for making claims about the reasoning capacity of LLMs.
Before conducting the human behavior experiments, we shared our counterexamples on GPT-3 with the first author of the original paper, and greatly appreciate their engagement in this discussion. One of their main objections was the expectation that our modified problems would also be significantly more difficult for human subjects. The human behavioral studies we carried out definitively contradict the predictions of the primary author. Despite a notable decrease in GPT-3's performance on our adapted tasks, humans consistently demonstrate strong performance. Nevertheless, it is important to note that comparing performance to humans, whether better or worse, is not evidence of the claimed capacity. For example, if one ran this comparison only among humans and two groups emerged from the sampling, one with adults and one with children 8 , we would likely find that adults outperform children on these reasoning tasks. According to the authors' logic (Webb, Holyoak, and Lu, 2023), this would be evidence against zero-shot reasoning in children. But we know that children have this ability. Hence, performance compared to humans cannot be used to support or refute zero-shot reasoning.
## 3 Conclusion
Based on their analysis, Webb, Holyoak, and Lu (2023) argue that LLMs have acquired a general ability for zero-shot reasoning. With full respect to the authors and their work, we disagree with this interpretation. As we show and argue in our response, the methods are insufficient to evaluate a capacity for true, zero-shot reasoning. Given the current hype surrounding LLMs, we hope this can be used to spur further tests and evaluations of what LLMs can and cannot do.
## 4 Code and data availability
Code and data can be downloaded from: https://github.com/hodeld/emergent_analogies_LLM_fork
## References
Mitchell, Melanie (Jan. 2023). On analogy-making in large language models. URL: https://aiguide.substack.com/p/on-analogy-making-in-large-language (visited on 08/09/2023).
8 This analogy presumes that LLMs' performances can be assessed against human standards. It is important to clarify that we don't endorse this assumption until it is substantiated, but for the sake of our argument, we will adopt it from the original paper.
Webb, Taylor, Keith J. Holyoak, and Hongjing Lu (July 2023). 'Emergent analogical reasoning in large language models'. In: Nature Human Behaviour. ISSN: 2397-3374. DOI: 10.1038/s41562-023-01659-w. URL: https://www.nature.com/articles/s41562-023-01659-w.
Wu, Zhaofeng et al. (Aug. 2023). Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks. arXiv:2307.02477 [cs]. URL: http://arxiv.org/abs/2307.02477 (visited on 03/14/2024).
## 5 Author contributions
D.H. conducted the experiments. D.H. and J.W. drafted the manuscript.
## 6 Competing interests
The authors declare no competing interests.
## 7 Appendix
## 7.1 Counterexamples
To investigate whether the problems presented in the original paper truly assess analogical reasoning in GPT-3 or primarily its capability to recite training data, we create non-standard variants of the original tasks that are less likely to be found in training data. Our focus is on the letter string analogies, a subset of the four problem domains examined, and we conduct tests with both human subjects and GPT-3. In our experiments, GPT-3's performance declines significantly when presented with these counterexamples, while human performance remains consistently high across all tests (Figure 2). This suggests that the claims made in the original paper regarding GPT-3's zero-shot reasoning may not be substantiated.
## 7.1.1 Methods
In order to test GPT-3's generality in zero-shot analogical reasoning, we extend the letter string analogies with two modifications and compare GPT-3's and humans' performance analogous to the original approach. The modifications involve using a synthetic alphabet and increasing the size of the interval from one to two letters, see Figure 1. If the claim regarding GPT-3's zero-shot reasoning capability is true, we can expect similar performance across modifications, in particular, independent of the alphabet. Unlike the original study, we view the comparison with human performance not as evidence for or against GPT-3's analogical reasoning abilities, but rather as a confirmation of the validity of our set of problems.
We create the synthetic alphabet by randomly changing the order of the letters in the real alphabet. For both humans and GPT-3, we incorporate the synthetic alphabet in the tasks by preceding the original prompt with the sentence 'Use this fictional alphabet: [ x y l k w b f z t n j r q a h v g m u o p d i c s e ] .'
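As a minimal sketch of this construction (our own illustration, not the authors' exact script; the seed value is arbitrary), the synthetic alphabet and the prompt prefix can be generated as follows:

```python
import random

def make_synthetic_alphabet(seed=0):
    """Randomly permute the 26 letters of the real alphabet."""
    letters = list("abcdefghijklmnopqrstuvwxyz")
    random.Random(seed).shuffle(letters)  # deterministic for a fixed seed
    return letters

def alphabet_prompt(letters):
    """Build the sentence that precedes each modified problem."""
    return "Use this fictional alphabet: [ " + " ".join(letters) + " ] ."

synthetic = make_synthetic_alphabet()
prefix = alphabet_prompt(synthetic)
```

Fixing the seed keeps the permutation reproducible, so humans and GPT-3 can be tested on the same fictional alphabet.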
The increase in the size of the interval from one to two letters aims to rule out the possibility that GPT-3 merely replicates the letter sequence it was fed. We achieve this in two ways. For the problem types 'extend sequence', 'successor', and 'predecessor', we increase the interval size for the letter to change from one to two. For the problem types 'remove redundant letter', 'fix alphabetic sequence', and 'sort', we increase the interval size of the complete letter sequence from one to two 9.
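As an illustration of the first kind of modification (a sketch under our own naming, not the original problem generator), an 'extend sequence' problem with an interval size of two can be built from any alphabet:

```python
def extend_sequence_problem(alphabet, start, length=4, interval=2):
    """Build the source row of an 'extend sequence' problem.

    The row of `length` consecutive letters is extended by the letter
    `interval` steps after its last letter; the correct continuation is
    returned alongside the formatted row.
    """
    idx = alphabet.index(start)
    seq = [alphabet[idx + i] for i in range(length)]
    answer = alphabet[idx + length - 1 + interval]
    source = f"[{' '.join(seq)}] [{' '.join(seq + [answer])}]"
    return source, answer

real = list("abcdefghijklmnopqrstuvwxyz")
source, answer = extend_sequence_problem(real, "a")
# With interval 2, 'a b c d' is extended with 'f' (d -> e -> f),
# matching the modified example in Figure 1.
```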
We compare GPT-3's and human performance for the following three settings: the original tasks as reported in (Webb, Holyoak, and Lu, 2023), counterexamples that involve the interval size modification, and counterexamples
9 It is worth noting that we apply this modification to both the source (the first row for each example in Figure 1) and the target (the second row for each example in Figure 1), minimizing the difficulty of the modified problems and allowing us to compare our tests to the zero-generalization problems given in the original paper.
Figure 1: Letter string analogies along their transformations of both the original paper and our counterexamples. We introduce a synthetic alphabet into the task and apply two types of letter sequence modifications, both based on increasing the interval from one to two letters. For the transformation types 'extend sequence', 'successor', and 'predecessor', the modification only affects the letter to change (last or first letter). For 'remove redundant letter', 'fix alphabetic sequence', and 'sort', the interval is increased for the complete letter sequence. We apply the same modifications to the problems generated with the synthetic alphabet.
<details>
<summary>Image 1 Details</summary>

### Visual Description
## Transformation Types: Original, Modified, and Synthetic
### Overview
The image presents a comparison of string transformation types, categorized into "Original," "Modified," and "Modified with Synthetic Alphabet." Each category demonstrates four transformation operations: "Extend sequence," "Remove redundant letter," "Fix alphabetic sequence," and "Sort." The transformations are shown with example strings, illustrating the input and output of each operation. The output strings are highlighted in blue.
### Components/Axes
The image is divided into three main sections, each representing a different type of transformation:
1. **Original transformation types:** Located at the top.
2. **Modified transformation types:** Located in the middle.
3. **Modified transformation types with synthetic alphabet:** Located at the bottom.
Each section is further divided into four transformation operations:
* **Extend sequence:** Shows how a string is extended.
* **Remove redundant letter:** Shows how a redundant letter is removed from a string.
* **Fix alphabetic sequence:** Shows how a string's alphabetic sequence is corrected.
* **Sort:** Shows how a string is sorted alphabetically.
Within each operation, there are three columns:
* **Left Column:** Shows the original string before transformation.
* **Middle Column:** Shows the "Successor" transformation.
* **Right Column:** Shows the "Predecessor" transformation.
The "Modified transformation types with synthetic alphabet" section includes a "Synthetic alphabet" which is: `xylkwbfztnjrqahvgmuopdicse`
### Detailed Analysis or ### Content Details
**1. Original transformation types:**
* **Extend sequence:**
* `abcd` -> `abcde` (Successor)
* `ijkl` -> `ijklm` (Successor)
* `bcde` -> `acde` (Predecessor)
* `ijkl` -> `hjkl` (Predecessor)
* **Remove redundant letter:**
* `abbcde` -> `abcde`
* `ijkklm` -> `ijklm`
* **Fix alphabetic sequence:**
* `abcwe` -> `abcde`
* `ijkxm` -> `ijklm`
* **Sort:**
* `adcbe` -> `abcde`
* `kjmli` -> `ijklm`
**2. Modified transformation types:**
* **Extend sequence:**
* `abcd` -> `abcdf` (Successor)
* `ijkl` -> `ijklm` (Successor)
* `cdef` -> `adef` (Predecessor)
* `jklm` -> `hklm` (Predecessor)
* **Remove redundant letter:**
* `acegii` -> `acegi`
* `ikkmoq` -> `ikmoq`
* **Fix alphabetic sequence:**
* `acego` -> `acegi`
* `ikxoq` -> `ikmoq`
* **Sort:**
* `kfapu` -> `afkpu`
* `imkoq` -> `ikmoq`
**3. Modified transformation types with synthetic alphabet:**
* **Synthetic alphabet:** `xylkwbfztnjrqahvgmuopdicse`
* **Extend sequence:**
* `xylk` -> `xylkb` (Successor)
* `tnjr` -> `tnjra` (Successor)
* `lkwb` -> `xkwb` (Predecessor)
* `njrq` -> `zjrq` (Predecessor)
* **Remove redundant letter:**
* `xlwwft` -> `xlwft`
* `ttjqhg` -> `tjqhg`
* **Fix alphabetic sequence:**
* `xlwrt` -> `xlwft`
* `tjphg` -> `tjqhg`
* **Sort:**
* `xlfwt` -> `xlwft`
* `jtqhg` -> `tjqhg`
### Key Observations
* The "Original transformation types" section uses standard alphabetical order.
* The "Modified transformation types" section also uses standard alphabetical order but with different examples.
* The "Modified transformation types with synthetic alphabet" section uses a custom alphabet, which significantly alters the sorting and sequence-related transformations.
* The "Remove redundant letter" transformation simply removes a repeated character.
* The "Fix alphabetic sequence" transformation aims to correct the order of characters within the string based on the alphabet being used.
* The "Sort" transformation arranges the characters in the string according to the defined alphabet.
### Interpretation
The image illustrates different approaches to string transformation, highlighting the impact of using a custom alphabet. The "Original" and "Modified" types demonstrate transformations based on the standard alphabet, while the "Synthetic alphabet" type shows how a different alphabet can drastically change the outcome of these transformations. This comparison is useful for understanding how algorithms can be adapted to work with different character sets or customized ordering rules. The use of "Successor" and "Predecessor" transformations suggests a focus on generating variations of strings, potentially for tasks like data augmentation or code generation.
</details>
that include both the interval size modification and the synthetic alphabet. To ensure that GPT-3 is capable of processing the introduced modifications (Wu et al., 2023, 'counterfactual comprehension check'), we additionally test GPT-3 in two further settings: original examples on the real alphabet but including the modified prompt, i.e., 'Use this fictional alphabet: [ a b c d e f g h i j k l m n o p q r s t u v w x y z ] .', and counterexamples involving the synthetic alphabet but without the increased interval size.
GPT-3 evaluation. Our code for reproducing Figure 2 is available on GitHub 10. For each problem type, we create 50 instances to mirror the original paper. The settings are as follows: model variant=text-davinci-003, temperature=0, maximum length=20. Using the original code, we mirror the evaluation and analysis
10 https://github.com/hodeld/emergent_analogies_LLM_fork
approach of the original paper. The following example illustrates the prompt pattern including the synthetic alphabet.
```
Use this fictional alphabet: [x y l k w b f z t n j r q a h v g m u o p d i c s e]. Let's try to complete the pattern:
[x y l k] [x y l k b]
[t n j r] [
```
h
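Under this fictional alphabet the expected completion can be computed mechanically; the following sketch (our assumption of how such a check can be scripted, not the original evaluation code) derives the correct answer for the example above:

```python
SYNTH = "x y l k w b f z t n j r q a h v g m u o p d i c s e".split()

def expected_extension(alphabet, row, interval=2):
    """Return the letter that extends `row` by `interval` alphabet steps."""
    return alphabet[alphabet.index(row[-1]) + interval]

# The source row [x y l k] is extended with 'b' (k -> w -> b), so the
# target row [t n j r] should analogously be extended with 'a' (r -> q -> a).
assert expected_extension(SYNTH, ["x", "y", "l", "k"]) == "b"
assert expected_extension(SYNTH, ["t", "n", "j", "r"]) == "a"
```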
Human behavioral experiment. We conducted human behavioral experiments through an online study with University of Washington (UW) undergraduates, analogous to the experiments of the original paper. All participants provided their informed consent prior to the study, and the data collection process was approved by the UW Institutional Review Board (IRB ID STUDY00019080, approved on 6 November 2023). A total of 121 participants completed the study. They were compensated with extra course credit for their participation.
The first author of the original study generously provided participant instructions, which we adapted for our experiments. In particular, we presented the participants an additional example problem to introduce the synthetic alphabet.
Use this fictional alphabet: [ x y l k w b f z t n j r q a h v g m u o p d i c s e ] .

```
[x x x] [y y y]
[l l l] [ ? ]
```
Each participant completed a total of 18 zero-generalization tasks, consisting of six problems for each setting (one problem for each transformation type).
## 7.1.2 Results
In our experiments, humans achieved consistently higher accuracy than GPT-3, in particular on modified letter string tasks involving both the synthetic alphabet and the increased letter interval size; see Figure 2. Human performance remains at a similar level across modifications (Figure 4), while GPT-3's performance declines significantly for modified problem types (Figure 3). The generative accuracy of GPT-3 for the synthetic alphabet is close to zero (<0.1) on the modified tasks 'extend sequence', 'successor', 'predecessor', and 'fix alphabetic sequence'. Only for 'remove redundant letter' and 'sort' does GPT-3 achieve accuracy in a range similar to that reported in the original paper (Webb, Holyoak, and Lu, 2023).
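The error bars reported in Figure 2 are 95% binomial confidence intervals; a minimal sketch of such an interval under the normal approximation (the original analysis may use a different method, and the counts below are purely illustrative):

```python
import math

def binomial_ci95(successes, n):
    """Normal-approximation 95% confidence interval for a proportion."""
    p = successes / n
    half = 1.96 * math.sqrt(p * (1 - p) / n)  # z = 1.96 for 95% coverage
    return max(0.0, p - half), min(1.0, p + half)

# Hypothetical example: a model solving 38 of 50 instances of one task.
lo, hi = binomial_ci95(38, 50)
```

With only 50 instances per problem type, these intervals are fairly wide, which is worth keeping in mind when comparing accuracies between conditions.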
Figure 5 shows the accuracy of GPT-3 in the two counterfactual comprehension checks (Wu et al., 2023). For all but the 'predecessor' task on the synthetic alphabet, we obtain a GPT-3 accuracy of at least 30% of the original level, demonstrating GPT-3's ability to process the introduced modifications.
Lastly, Figure 6 illustrates the comparison of human performance in the original tasks between the participants of the original study and those in our
Figure 2: Comparison between GPT-3's (blue) and human (orange) performances on modified letter string problems involving a synthetic alphabet and a larger interval size. The transformation types and their order correspond to Figure 6b in the original paper. Humans demonstrate significantly higher accuracy compared to GPT-3. Human results represent the average performance of 121 participants (UW undergraduates). Each participant received one randomly selected instance of each problem subtype. GPT-3 results reflect the average performance across all 50 instances. Gray error bars indicate 95% binomial confidence intervals for the average performance across multiple problems.
<details>
<summary>Image 2 Details</summary>

### Visual Description
## Bar Chart: Generative Accuracy Comparison
### Overview
The image is a bar chart comparing the generative accuracy of GPT-3 and humans across different transformation types. The chart displays the accuracy on the y-axis and the transformation type on the x-axis. Error bars are included on each bar.
### Components/Axes
* **Y-axis:** "Generative accuracy", ranging from 0 to 1 in increments of 0.2.
* **X-axis:** "Transformation type", with the following categories: "Extend sequence", "Successor", "Predecessor", "Remove redundant letter", "Fix alphabetic sequence", and "Sort".
* **Legend:** Located at the top-right of the chart.
* Blue bars represent "GPT-3".
* Orange bars represent "Human".
### Detailed Analysis
Here's a breakdown of the data for each transformation type, including approximate values and error bar considerations:
* **Extend sequence:**
* GPT-3 (Blue): Accuracy is approximately 0.02, with an error bar extending to approximately 0.04.
* Human (Orange): Accuracy is approximately 0.78, with an error bar extending from approximately 0.7 to 0.85.
* **Successor:**
* GPT-3 (Blue): Accuracy is approximately 0.06, with an error bar extending to approximately 0.1.
* Human (Orange): Accuracy is approximately 0.74, with an error bar extending from approximately 0.68 to 0.8.
* **Predecessor:**
* GPT-3 (Blue): Accuracy is approximately 0.02, with an error bar extending to approximately 0.04.
* Human (Orange): Accuracy is approximately 0.79, with an error bar extending from approximately 0.72 to 0.86.
* **Remove redundant letter:**
* GPT-3 (Blue): Accuracy is approximately 0.77, with an error bar extending from approximately 0.7 to 0.84.
* Human (Orange): Accuracy is approximately 0.86, with an error bar extending from approximately 0.79 to 0.93.
* **Fix alphabetic sequence:**
* GPT-3 (Blue): Accuracy is approximately 0.02, with an error bar extending to approximately 0.04.
* Human (Orange): Accuracy is approximately 0.32, with an error bar extending from approximately 0.25 to 0.39.
* **Sort:**
* GPT-3 (Blue): Accuracy is approximately 0.15, with an error bar extending from approximately 0.08 to 0.22.
* Human (Orange): Accuracy is approximately 0.30, with an error bar extending from approximately 0.23 to 0.37.
### Key Observations
* Humans consistently outperform GPT-3 in "Extend sequence", "Successor", "Predecessor", "Fix alphabetic sequence", and "Sort" transformation types.
* GPT-3 and humans have relatively similar performance in the "Remove redundant letter" transformation type, although humans still perform slightly better.
* The error bars indicate the variability in the data.
### Interpretation
The data suggests that humans are generally better at generative tasks involving sequence manipulation and alphabetization compared to GPT-3, except for removing redundant letters where GPT-3's performance is closer to human performance. The significant differences in accuracy for tasks like "Extend sequence", "Successor", and "Predecessor" highlight the areas where GPT-3 needs improvement to match human-level performance. The error bars provide a measure of confidence in these observations, indicating the range within which the true accuracy likely falls.
</details>
study. Although the subjects in our study marginally outperform those in the previous study, the similarity in performance is evidence that our experimental setup and execution align with those of the original study at UCLA.
## 7.1.3 Discussion
The recent paper, 'Emergent analogical reasoning in large language models' (Webb, Holyoak, and Lu, 2023), and subsequent news articles argue that LLMs may have acquired the emergent ability for zero-shot analogical reasoning. We are less certain of these conclusions, given our own follow-up experiments. Our results show low success of GPT-3 in solving letter string problems with simple modifications and with a synthetic alphabet, while human performance remains high.
Only in two out of six problem types ('remove redundant letter' and 'sort') does GPT-3 achieve similar generative accuracy on our counterexamples compared to the original problems involving the real alphabet, as well as in comparison to human performance on the same modified problems. For these two problem
Figure 3: GPT-3 performance for zero-generalization letter string problems for the original experiment (blue) and with the larger interval size (green), and larger interval size with synthetic alphabet (orange). Except for 'remove redundant letter,' GPT3's accuracy declines significantly for the modified problems. The results reflect an average performance for N=50 instances.
<details>
<summary>Image 3 Details</summary>

### Visual Description
## Bar Chart: Generative Accuracy by Transformation Type
### Overview
The image is a bar chart comparing the generative accuracy of three different models ("Original", "Interval", and "Interval & synthetic alphabet") across six different transformation types: "Extend sequence", "Successor", "Predecessor", "Remove redundant letter", "Fix alphabetic sequence", and "Sort". The chart includes error bars, presumably indicating standard deviation or confidence intervals.
### Components/Axes
* **Y-axis:** "Generative accuracy", ranging from 0 to 1 in increments of 0.2.
* **X-axis:** "Transformation type", with six categories: "Extend sequence", "Successor", "Predecessor", "Remove redundant letter", "Fix alphabetic sequence", and "Sort".
* **Legend:** Located at the top-right of the chart.
* Blue: "Original"
* Green: "Interval"
* Orange: "Interval & synthetic alphabet"
### Detailed Analysis
Here's a breakdown of the generative accuracy for each transformation type and model, including trend descriptions:
* **Extend sequence:**
* Original (Blue): Accuracy is approximately 0.97, with an error bar extending down to approximately 0.45.
* Interval (Green): Accuracy is approximately 0.32, with an error bar extending down to approximately 0.05.
* Interval & synthetic alphabet (Orange): Not present for this transformation.
* **Successor:**
* Original (Blue): Accuracy is approximately 0.95, with an error bar extending down to approximately 0.72.
* Interval (Green): Accuracy is approximately 0.60, with an error bar extending down to approximately 0.35.
* Interval & synthetic alphabet (Orange): Not present for this transformation.
* **Predecessor:**
* Original (Blue): Accuracy is approximately 0.78, with an error bar extending down to approximately 0.5.
* Interval (Green): Accuracy is approximately 0.16, with an error bar extending down to approximately 0.0.
* Interval & synthetic alphabet (Orange): Not present for this transformation.
* **Remove redundant letter:**
* Original (Blue): Accuracy is approximately 0.86, with an error bar extending down to approximately 0.7.
* Interval (Green): Accuracy is approximately 0.78, with an error bar extending down to approximately 0.65.
* Interval & synthetic alphabet (Orange): Accuracy is approximately 0.76, with an error bar extending down to approximately 0.6.
* **Fix alphabetic sequence:**
* Original (Blue): Accuracy is approximately 0.52, with an error bar extending down to approximately 0.05.
* Interval (Green): Accuracy is approximately 0.26, with an error bar extending down to approximately 0.0.
* Interval & synthetic alphabet (Orange): Not present for this transformation.
* **Sort:**
* Original (Blue): Accuracy is approximately 0.22, with an error bar extending down to approximately 0.0.
* Interval (Green): Accuracy is approximately 0.08, with an error bar extending down to approximately 0.0.
* Interval & synthetic alphabet (Orange): Accuracy is approximately 0.14, with an error bar extending down to approximately 0.0.
### Key Observations
* The "Original" model consistently outperforms the "Interval" model across all transformation types.
* The "Interval & synthetic alphabet" model is only present for "Remove redundant letter" and "Sort" transformations.
* The "Extend sequence" transformation has the highest accuracy for the "Original" model.
* The "Sort" transformation has the lowest accuracy for all models.
* The error bars are relatively large, indicating substantial variability in the data.
### Interpretation
The data suggests that the "Original" model is better at generalizing across different types of sequence transformations compared to the "Interval" model. The "Interval & synthetic alphabet" model shows mixed results, performing similarly to the other models for "Remove redundant letter" but slightly better than "Interval" for "Sort". The large error bars indicate that the performance of each model can vary significantly depending on the specific input sequence. The "Sort" transformation appears to be the most challenging for all models, possibly due to the complexity of rearranging sequences. The absence of "Interval & synthetic alphabet" for most transformations suggests it might be specifically designed or optimized for certain types of tasks.
</details>
subtypes, GPT-3 does not need to generate a letter from the full alphabet but only to remove the duplicate letter or to rearrange given letters, which may explain the higher performance. The results of these two tasks, together with GPT-3's accuracy under the only marginally modified conditions shown in Figure 5, serve as an additional counterfactual comprehension check (Wu et al., 2023). They demonstrate that GPT-3 is capable of processing synthetic alphabets, which validates our approach.
What, then, explains GPT-3's success in solving the problems posed on the real alphabet (as used in the original paper) but its failure, for most of the letter string problems, with the synthetic alphabet and with the modified interval size, while human performance remains consistently high?
Our results suggest that the answer resides in the training data, confirming the analysis of the methods in Section 2. Unlike humans, GPT-3 performs well only on simple analogy problems over the standard English alphabet, which are likely to be present in its training data. These findings contradict two of the main claims of the original paper (Webb, Holyoak, and Lu, 2023): GPT-3's capacity for general, zero-shot reasoning and its human-like characteristics in analogical reasoning. Consequently, we reject the proposition made in the original paper that GPT-3 may have developed mechanisms similar to those underlying human intelligence.
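To make the task variations concrete, the following sketch (our own illustrative Python, not the authors' code; the synthetic alphabet shown and all helper names are assumptions) generates a zero-generalization "successor" letter string problem over an arbitrary alphabet and interval size:

```python
# Illustrative sketch (not the original paper's code): generating a
# zero-generalization "successor" letter string problem over an
# arbitrary alphabet and interval size. The synthetic alphabet below
# is an arbitrary permutation chosen for illustration only.

SYNTHETIC_ALPHABET = "xyltvwbfzcdrgjhsqpomneiuak"  # assumption, not the paper's

def successor_problem(alphabet, src_start, tgt_start, length=4, interval=1):
    """Build the prompt and expected answer for one problem.

    The transformation replaces the last letter of the string with the
    next letter of the sequence, i.e. the letter `interval` positions
    further along `alphabet`.
    """
    def seq(start):
        return [alphabet[start + i * interval] for i in range(length)]

    def apply_successor(start):
        letters = seq(start)
        letters[-1] = alphabet[start + length * interval]
        return letters

    prompt = (f"[{' '.join(seq(src_start))}] [{' '.join(apply_successor(src_start))}]\n"
              f"[{' '.join(seq(tgt_start))}] [?]")
    return prompt, " ".join(apply_successor(tgt_start))

# Real alphabet, interval 1 (as in the original study):
#   [a b c d] [a b c e]  /  [i j k l] [?]  ->  "i j k m"
```

Varying only `alphabet` and `interval` leaves the abstract relation unchanged, which is why consistently high human performance but degraded GPT-3 performance on these variants points to training-data familiarity rather than abstract reasoning.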
The GPT-3 failure to solve simple variations of the original problems demonstrates the brittleness of the presented approach when assessing human-like reasoning in language models.
Figure 4: Human performance for zero-generalization letter string problems for the original experiment (blue), with the larger interval size (green), and with the larger interval size and synthetic alphabet (orange). Human accuracy in the modified problems is comparable to that in the original problems (blue). The results reflect the average performance of N = 121 participants (UW undergraduates).
<details>
<summary>Image 4 Details</summary>

### Visual Description
## Bar Chart: Generative Accuracy by Transformation Type
### Overview
The image is a bar chart comparing generative accuracy under three task conditions ("Original", "Interval", and "Interval & synthetic alphabet") across six transformation types: "Extend sequence", "Successor", "Predecessor", "Remove redundant letter", "Fix alphabetic sequence", and "Sort". The chart includes error bars, indicating the variability in the data.
### Components/Axes
* **Y-axis:** "Generative accuracy", ranging from 0 to 1 in increments of 0.2.
* **X-axis:** "Transformation type", with six categories: "Extend sequence", "Successor", "Predecessor", "Remove redundant letter", "Fix alphabetic sequence", and "Sort".
* **Legend:** Located at the top-right of the chart.
* Blue: "Original"
* Green: "Interval"
* Orange: "Interval & synthetic alphabet"
### Detailed Analysis
Here's a breakdown of the generative accuracy for each transformation type and condition, including the approximate values and observed trends:
* **Extend sequence:**
* Original (Blue): Approximately 0.86, with error bars extending from approximately 0.8 to 0.92.
* Interval (Green): Approximately 0.68, with error bars extending from approximately 0.6 to 0.76.
* Interval & synthetic alphabet (Orange): Approximately 0.78, with error bars extending from approximately 0.7 to 0.86.
* **Successor:**
* Original (Blue): Approximately 0.87, with error bars extending from approximately 0.8 to 0.94.
* Interval (Green): Approximately 0.64, with error bars extending from approximately 0.56 to 0.72.
* Interval & synthetic alphabet (Orange): Approximately 0.74, with error bars extending from approximately 0.66 to 0.82.
* **Predecessor:**
* Original (Blue): Approximately 0.82, with error bars extending from approximately 0.74 to 0.90.
* Interval (Green): Approximately 0.70, with error bars extending from approximately 0.62 to 0.78.
* Interval & synthetic alphabet (Orange): Approximately 0.78, with error bars extending from approximately 0.7 to 0.86.
* **Remove redundant letter:**
* Original (Blue): Approximately 0.88, with error bars extending from approximately 0.8 to 0.96.
* Interval (Green): Approximately 0.83, with error bars extending from approximately 0.75 to 0.91.
* Interval & synthetic alphabet (Orange): Approximately 0.87, with error bars extending from approximately 0.79 to 0.95.
* **Fix alphabetic sequence:**
* Original (Blue): Approximately 0.43, with error bars extending from approximately 0.35 to 0.51.
* Interval (Green): Approximately 0.22, with error bars extending from approximately 0.14 to 0.30.
* Interval & synthetic alphabet (Orange): Approximately 0.32, with error bars extending from approximately 0.24 to 0.40.
* **Sort:**
* Original (Blue): Approximately 0.36, with error bars extending from approximately 0.28 to 0.44.
* Interval (Green): Approximately 0.26, with error bars extending from approximately 0.18 to 0.34.
* Interval & synthetic alphabet (Orange): Approximately 0.30, with error bars extending from approximately 0.22 to 0.38.
### Key Observations
* The "Original" condition generally has the highest generative accuracy across all transformation types, except for "Remove redundant letter", where it is comparable to "Interval & synthetic alphabet".
* The "Fix alphabetic sequence" and "Sort" transformation types have markedly lower generative accuracy than the other transformation types under all three conditions.
* The error bars suggest some variability in the data, but the general trends are consistent across all transformation types.
### Interpretation
Accuracy is generally somewhat higher under the "Original" condition than under the "Interval" and "Interval & synthetic alphabet" conditions. The lower accuracy for "Fix alphabetic sequence" and "Sort" indicates that these transformation types are more challenging under every condition, while the high accuracy for "Remove redundant letter" across conditions suggests it is a comparatively easy task. The error bars provide a measure of the uncertainty in the data, which should be considered when interpreting the results.
</details>
## 7.2 ChatGPT's answer to our question: 'Could you give an example of a copycat problem?'
Figure 5: Counterfactual comprehension check. Comparison of GPT-3 performance on zero-generalization letter string problems between the original tasks (blue) and the only marginally modified tasks, involving a synthetic alphabet without modification of the interval size (green) and a modified prompt without a modified string sequence (orange). The accuracy on the modified tasks is lower than on the original ones but greater than 0.2, except for 'remove redundant letter' and 'sort' involving the synthetic alphabet. The figure and the order of the transformation types correspond to Figure 6b in the original paper. These results reflect the average performance over N = 50 instances.
<details>
<summary>Image 5 Details</summary>

### Visual Description
## Bar Chart: Generative Accuracy by Transformation Type
### Overview
The image is a bar chart comparing the generative accuracy of GPT-3 under three task conditions ("Original", "Original & synthetic alphabet", and "Original & prompt") across six transformation types: "Extend sequence", "Successor", "Predecessor", "Remove redundant letter", "Fix alphabetic sequence", and "Sort". The chart includes error bars, presumably indicating standard deviation or confidence intervals.
### Components/Axes
* **Y-axis:** "Generative accuracy", ranging from 0 to 1 in increments of 0.2.
* **X-axis:** "Transformation type", with the following categories:
* Extend sequence
* Successor
* Predecessor
* Remove redundant letter
* Fix alphabetic sequence
* Sort
* **Legend:** Located at the top-right of the chart.
* Blue: "Original"
* Green: "Original & synthetic alphabet"
* Orange: "Original & prompt"
### Detailed Analysis
**1. Extend sequence:**
* Original (Blue): Accuracy ~0.97, error bar extends to ~1.0
* Original & synthetic alphabet (Green): Accuracy ~0.93, error bar extends to ~0.98
* Original & prompt (Orange): Accuracy ~0.34, error bar extends to ~0.4
**2. Successor:**
* Original (Blue): Accuracy ~0.93, error bar extends to ~1.0
* Original & synthetic alphabet (Green): Accuracy ~0.70, error bar extends to ~0.82
* Original & prompt (Orange): Accuracy ~0.54, error bar extends to ~0.6
**3. Predecessor:**
* Original (Blue): Accuracy ~0.78, error bar extends to ~0.85
* Original & synthetic alphabet (Green): Accuracy ~0.09, error bar extends to ~0.15
* Original & prompt (Orange): Accuracy ~0.40, error bar extends to ~0.45
**4. Remove redundant letter:**
* Original (Blue): Accuracy ~0.87, error bar extends to ~0.95
* Original & synthetic alphabet (Green): Accuracy ~0.64, error bar extends to ~0.7
* Original & prompt (Orange): Accuracy ~0.50, error bar extends to ~0.55
**5. Fix alphabetic sequence:**
* Original (Blue): Accuracy ~0.52, error bar extends to ~0.6
* Original & synthetic alphabet (Green): Accuracy ~0.52, error bar extends to ~0.58
* Original & prompt (Orange): Accuracy ~0.27, error bar extends to ~0.3
**6. Sort:**
* Original (Blue): Accuracy ~0.23, error bar extends to ~0.3
* Original & synthetic alphabet (Green): Accuracy ~0.18, error bar extends to ~0.25
* Original & prompt (Orange): Accuracy ~0.25, error bar extends to ~0.3
### Key Observations
* The "Original" condition generally yields the highest accuracy across the transformation types, except for "Fix alphabetic sequence", where "Original" and "Original & synthetic alphabet" are approximately equal, and "Sort", where all three conditions are close.
* The "Predecessor" transformation shows a pronounced drop under the "Original & synthetic alphabet" condition compared to the "Original" condition.
* The "Sort" transformation has the lowest accuracy under every condition.
* The error bars indicate variability in the results, with some transformation types showing more consistent performance than others.
### Interpretation
The data suggest that GPT-3's accuracy is generally highest on the unmodified tasks and degrades under the synthetic-alphabet and modified-prompt conditions. The pronounced drop for "Predecessor" under "Original & synthetic alphabet" indicates that the synthetic alphabet particularly affects this transformation. The low accuracy for "Sort" under every condition suggests it is an especially challenging task. Larger error bars indicate more variability and correspondingly less confidence in the observed differences between conditions.
</details>
Figure 6: Comparison of human performance on the original letter string tasks between the outcomes reported in the original study (blue) and the findings presented in this paper (green). UW undergraduate students exhibit marginally higher accuracies. The transformation types and their order correspond to Figure 6b in the original paper.
<details>
<summary>Image 6 Details</summary>

### Visual Description
## Bar Chart: Generative Accuracy by Transformation Type
### Overview
The image is a bar chart comparing the generative accuracy of two participant groups, UCLA (original study) and UW (this paper), across six different transformation types. The chart displays the mean accuracy for each transformation type, with error bars indicating the variability.
### Components/Axes
* **X-axis:** Transformation type, with categories: "Extend sequence", "Successor", "Predecessor", "Remove redundant letter", "Fix alphabetic sequence", and "Sort".
* **Y-axis:** Generative accuracy, ranging from 0 to 1.0, with tick marks at intervals of 0.2.
* **Legend:** Located at the top-right of the chart.
* UCLA: Represented by light blue bars.
* UW: Represented by light green bars.
### Detailed Analysis
The chart presents a comparison of generative accuracy between UCLA (light blue) and UW (light green) for different transformation types. Error bars are present on each bar, indicating the standard deviation or standard error.
* **Extend sequence:** UCLA has an accuracy of approximately 0.84 +/- 0.05, while UW has an accuracy of approximately 0.85 +/- 0.05.
* **Successor:** UCLA has an accuracy of approximately 0.79 +/- 0.10, while UW has an accuracy of approximately 0.86 +/- 0.05.
* **Predecessor:** UCLA has an accuracy of approximately 0.73 +/- 0.10, while UW has an accuracy of approximately 0.82 +/- 0.05.
* **Remove redundant letter:** UCLA has an accuracy of approximately 0.69 +/- 0.10, while UW has an accuracy of approximately 0.86 +/- 0.05.
* **Fix alphabetic sequence:** UCLA has an accuracy of approximately 0.25 +/- 0.05, while UW has an accuracy of approximately 0.42 +/- 0.10.
* **Sort:** UCLA has an accuracy of approximately 0.23 +/- 0.10, while UW has an accuracy of approximately 0.36 +/- 0.10.
### Key Observations
* UW generally outperforms UCLA across all transformation types.
* The "Fix alphabetic sequence" and "Sort" transformation types have the lowest generative accuracy for both UCLA and UW.
* The "Extend sequence", "Successor", "Predecessor", and "Remove redundant letter" transformation types have the highest generative accuracy for both UCLA and UW.
* The error bars overlap for several transformation types, so not every difference between the two groups should be read as meaningful.
### Interpretation
The bar chart compares the generative accuracy of the UCLA and UW participant groups across the transformation types. The data suggest that the UW group generally performs somewhat better, consistent with the marginally higher accuracies noted in the caption. The lower accuracy for "Fix alphabetic sequence" and "Sort" indicates that these transformations are more challenging for both groups, and the error bars provide a measure of variability to keep in mind when assessing the differences.
</details>
Figure 7: ChatGPT's answer to our question: 'Could you give an example of a copycat problem?'. The letter string tasks presented in the original paper are known as Copycat problems, a name that refers to a computer program developed to model such letter string analogy problems.
<details>
<summary>Image 7 Details</summary>

### Visual Description
## Text Conversation: Copycat Problem Example
### Overview
The image shows a text conversation where a user asks for an example of a copycat problem, and the response provides a detailed explanation and example of such a problem.
### Components/Axes
* **User Query:** "Could you give an example of a copycat problem?"
* **Response:** A detailed explanation of a copycat problem, including an example with original and new data sets, and the solution.
### Detailed Analysis or Content Details
The response includes the following information:
* **Definition:** A copycat problem is a cognitive challenge where one must identify a pattern or transformation in a set of data and apply it to a new set of data.
* **Original Pattern:**
* AB -> CD
* EF -> GH
* **Task:** Decipher the pattern and apply it to a new set of data.
* **New Data:**
* IJ -> ??
* KL -> ??
* **Pattern Identification:** The pattern involves shifting each letter two positions forward in the alphabet.
* **Applying Pattern:**
* IJ -> KL
* KL -> NO
* **Solution:** The correct answers are "KL" and "NO" for the new data set.
### Key Observations
The response clearly explains the concept of a copycat problem and provides a step-by-step example to illustrate how to solve it. The pattern is simple (shifting letters), making it easy to understand.
### Interpretation
The conversation demonstrates a clear and concise explanation of a copycat problem. The example provided is well-structured and easy to follow, making it effective for understanding the concept. The response effectively breaks down the problem into smaller, manageable steps, guiding the user through the process of identifying the pattern and applying it to new data.
</details>
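The alphabet-shift pattern in ChatGPT's example can be checked mechanically. The following minimal sketch (our own illustrative code; the function name is hypothetical) infers a constant shift from one example pair and applies it to a probe string:

```python
def apply_shift(src, tgt, probe, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Infer a constant alphabet shift mapping src -> tgt and apply it to probe."""
    idx = {c: i for i, c in enumerate(alphabet)}
    shift = (idx[tgt[0]] - idx[src[0]]) % len(alphabet)
    # Verify the same shift explains every position of the example pair.
    if any((idx[s] + shift) % len(alphabet) != idx[t] for s, t in zip(src, tgt)):
        raise ValueError("example pair is not a constant shift")
    return "".join(alphabet[(idx[c] + shift) % len(alphabet)] for c in probe)

print(apply_shift("ab", "cd", "ij"))  # -> kl
print(apply_shift("ab", "cd", "kl"))  # -> mn
```

Note that a constant shift of two maps 'kl' to 'mn', not the 'NO' shown in the figure, which illustrates how easily such answers can be verified.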