## Response: Emergent analogical reasoning in large language models
Damian Hodel 1 and Jevin West 1
1 Center for an Informed Public, Information School, University of Washington hodeld@uw.edu, jevinw@uw.edu
May 2, 2024
## 1 Introduction
In their recent Nature Human Behaviour paper, 'Emergent analogical reasoning in large language models' (Webb, Holyoak, and Lu, 2023), the authors argue that 'GPT-3 exhibits a very general capacity to identify and generalize - in zero-shot fashion - relational patterns found within both formal problems and meaningful texts.' This conclusion arises from their comparison of GPT-3 with human performance across four analogical reasoning domains, where they find comparable results. In this response, we argue that this approach is unsuitable for evaluating general, zero-shot reasoning in large language models (LLMs). Two primary reasons underlie our objection. First, the term 'zero-shot' implies problem sets entirely novel to GPT-3. However, the chosen approach cannot conclusively eliminate the possibility of these problems residing in the LLM's training data, as acknowledged by the authors themselves in the review file 1 . Second, the approach assumes that tests designed for humans can accurately measure LLM capabilities. This assumption is prevalent but remains unverified. We also provide empirical results to support our claims; see the appendix (Section 7.1). Our counterexamples show that GPT-3 fails to solve the simplest variations of the original tasks, whereas human performance remains consistently high across all modified versions.
Given the hype surrounding LLM capability and this paper in particular 2 3 4 , contrasted by the many findings of LLM brittleness 5 , we felt it was important to respond and illustrate the insufficiency of the methods employed in addressing GPT-3's supposed general, zero-shot reasoning. It is important that we interpret LLM results with caution and refrain from anthropomorphizing LLMs. Tests designed to assess the general capabilities of humans may not inherently serve the same purpose when applied to LLMs.
1 https://www.nature.com/articles/s41562-023-01659-w#peer-review
2 https://www.tagesanzeiger.ch/beherrscht-die-kuenstliche-intelligenz-analogien-768020720454
3 https://www.news-medical.net/news/20230731/AI-language-model-GPT-3-performs-about-as-well-as-college-undergraduates-in-analogical-reasoning.aspx
4 https://www.sciencemediacentre.org/expert-reaction-to-study-looking-at-gpt3-large-language-model-and-ability-to-reason-by-analogy/
Others have commented on this paper, and we want to note these contributions. Mitchell (2023) discusses this paper, focusing on the letter string and digit matrix analogy problems. Mitchell disagrees that 'the digit matrix problems are essentially equivalent in complexity and difficulty to Raven's Progressive Matrix problems.' Further, Mitchell presents individual counterexamples of the letter string problems in which GPT-3 makes nonhuman-like errors, as evidence against the claimed robustness of GPT-3 in analogical reasoning. We conduct a similar but more systematic analysis and include human behavioral experiments, detailed in the appendix, that concur with Mitchell's conclusion. Mitchell also points out that the term 'accuracy' implies that there was only one correct answer to each problem, which is not the case with these problems but is an assumption implicitly made by the authors. For comparison purposes, we adopt this assumption from Webb, Holyoak, and Lu (2023) in our paper and use the same terms, i.e., 'accuracy' and 'performance', but recognize this limitation.
## 2 Criticism of the Methods Employed in the Original Paper
To assess general, zero-shot reasoning capacity of LLMs, Webb, Holyoak, and Lu (2023) compare GPT-3 with humans and find similar or even better performance across a range of analogical reasoning tests adapted from existing cognitive tests designed for humans. However, we believe that this approach is not sufficient for testing the general, zero-shot reasoning capacity of large language models (LLMs). Here is why:
First, 'zero-shot' implies analogical problem sets that are entirely novel to GPT-3, encompassing both specific examples and variants of those examples. However, this condition is not met by some of the letter string problems used in the original paper, as noted by the authors themselves in the review file 6 : 'It is possible that GPT-3 has been trained on other letter string analogy problems, as these problems are discussed on a number of webpages.' Without ruling out the possibility of data memorization, one cannot claim zero-shot reasoning. As the first author notes in a recent MIT Technology Review article 7 , [if the test examples exist in the training data], 'I think we really can't conclude much of anything.'
5 https://www.technologyreview.com/2023/08/30/1078670/large-language-models-arent-people-lets-stop-testing-them-like-they-were
6 https://www.nature.com/articles/s41562-023-01659-w#peer-review
7 https://www.technologyreview.com/2023/08/30/1078670/large-language-models-arent-people-lets-stop-testing-them-like-they-were
In the aforementioned peer review file, the authors further note that they asked GPT-3 about these problems as a way of testing for their existence in the training data. It makes sense to at least try this, but we find it to be weak evidence, given the large number of possible answers to this question and the ambiguity of the answers given. To investigate this further, we asked ChatGPT to provide examples of letter string problems. Examples were given, suggesting that it has seen such examples in the training data. We include our question and ChatGPT's answer in the appendix. It is important to note that ChatGPT was trained on more data than GPT-3, so this result provides only circumstantial evidence.
Zero-shot reasoning is an extraordinary claim that requires extraordinary evidence. At the very least, it necessitates demonstrating that the problems, as well as their variations, do not already exist within the training data, as previously mentioned. The original paper fails to offer such evidence for any of the four task domains. We do recognize that obtaining such evidence can be exceptionally challenging. Many researchers lack access to GPT-3's training data, and even if they had it, confirming the absence of examples or derivations of them is nearly impossible. However, the difficulty of providing evidence for zero-shot reasoning should not be a reason to claim it.
Second, Webb, Holyoak, and Lu (2023) claim that the presented problem types test GPT-3's human-like reasoning capacity in a 'very general' way. This assumption is based on the premise that LLMs behave similarly to humans, thus implying that a test designed for humans can adequately assess LLMs in a broader capacity beyond the tasks included in the test. However, this assumption has not been substantiated.
On the contrary, findings across the literature on LLM brittleness tend to contradict it. In Appendix 7.1, we present counterexamples involving the letter string analogy problems, which demonstrate the brittleness of the assessment approach employed. In these tests, GPT-3 fails to solve simple variants of the letter string analogies presented in the original paper, while human performance remains at a high level.
In addition to the finding that GPT-3 matches or even outperforms human performance, Webb, Holyoak, and Lu (2023) further show that GPT-3 exhibits human-like characteristics in analogical reasoning, i.e., decreasing performance with increasing problem complexity. Based on this result, the authors propose that GPT-3 may have developed mechanisms similar to those underlying human intelligence. This is one possible interpretation. However, an alternative explanation could be that the training data contains a scarcity of solutions to complex problems, possibly reflecting the challenges humans encounter with such problems, a notion supported by our experiments involving human subjects.
It is important to note that our intention is not to discredit the use of such tests for studying LLMs but to point out the limitations of these methods for making claims about the reasoning capacity of LLMs.
Before conducting the human behavior experiments, we shared our counterexamples on GPT-3 with the first author of the original paper, and greatly appreciate their engagement in this discussion. One of their main objections was the expectation that our modified problems would also be significantly more difficult for human subjects. The human behavioral studies we carried out definitively contradict the predictions of the primary author. Despite a notable decrease in GPT-3's performance on our adapted tasks, humans consistently demonstrate strong performance. Nevertheless, it is important to note that comparing performance to humans, whether better or worse, is not evidence of the claimed capacity. For example, if one ran this comparison only among humans and two groups emerged from the sampling, one with adults and one with children 8 , we would likely find that adults outperform children on these reasoning tasks. According to the authors' logic (Webb, Holyoak, and Lu, 2023), this would be evidence against zero-shot reasoning in children. But we know that children have this ability. Hence, performance compared to humans cannot be used to support or refute zero-shot reasoning.
## 3 Conclusion
Based on their analysis, Webb, Holyoak, and Lu (2023) argue that LLMs have acquired a general ability for zero-shot reasoning. With full respect to the authors and their work, we disagree with this interpretation. As we show and argue in our response, the methods are insufficient to evaluate a capacity for true, zero-shot reasoning. Given the current hype surrounding LLMs, we hope this can be used to spur further tests and evaluations of what LLMs can and cannot do.
## 4 Code and data availability
Code and data can be downloaded from: https://github.com/hodeld/emergent_analogies_LLM_fork
## References
Mitchell, Melanie (Jan. 2023). On analogy-making in large language models . url : https://aiguide.substack.com/p/on-analogy-making-in-large-language (visited on 08/09/2023).
8 This analogy presumes that LLMs' performances can be assessed against human standards. It is important to clarify that we don't endorse this assumption until it is substantiated, but for the sake of our argument, we will adopt it from the original paper.
- Webb, Taylor, Keith J. Holyoak, and Hongjing Lu (July 2023). 'Emergent analogical reasoning in large language models'. In: Nature Human Behaviour . issn : 2397-3374. doi : 10.1038/s41562-023-01659-w . url : https://www.nature.com/articles/s41562-023-01659-w .
- Wu, Zhaofeng et al. (Aug. 2023). Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks . arXiv:2307.02477 [cs]. url : http://arxiv.org/abs/2307.02477 (visited on 03/14/2024).
## 5 Author contributions
D.H. conducted the experiments. D.H. and J.W. drafted the manuscript.
## 6 Competing interests
The authors declare no competing interests.
## 7 Appendix
## 7.1 Counterexamples
To investigate whether the problems presented in the original paper truly assess analogical reasoning in GPT-3 or primarily its capability to recite training data, we create non-standard variants of the original tasks that are less likely to be found in training data. Our focus is on the letter string analogies, a subset of the four problem domains examined, and we conduct tests with both human subjects and GPT-3. In our experiments, GPT-3's performance significantly declines when presented with these counterexamples, while human performance remains consistently high across all tests (Figure 2). This suggests that the claims made in the original paper regarding GPT-3's zero-shot reasoning may not be substantiated.
## 7.1.1 Methods
In order to test GPT-3's generality in zero-shot analogical reasoning, we extend the letter string analogies with two modifications and compare GPT-3's and humans' performance analogous to the original approach. The modifications involve using a synthetic alphabet and increasing the size of the interval from one to two letters, see Figure 1. If the claim regarding GPT-3's zero-shot reasoning capability is true, we can expect similar performance across modifications, in particular, independent of the alphabet. Unlike the original study, we view the comparison with human performance not as evidence for or against GPT-3's analogical reasoning abilities, but rather as a confirmation of the validity of our set of problems.
We create the synthetic alphabet by randomly changing the order of the letters in the real alphabet. For both humans and GPT-3, we incorporate the synthetic alphabet in the tasks by preceding the original prompt with the sentence 'Use this fictional alphabet: [ x y l k w b f z t n j r q a h v g m u o p d i c s e ] .'
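As an illustration, the alphabet construction and prompt prefix described above can be sketched as follows. The helper names and the fixed seed are our own illustration; only the 'Use this fictional alphabet' wording follows the paper.

```python
import random

# Sketch of the synthetic-alphabet construction: randomly permute the
# real alphabet, then prepend the fictional-alphabet instruction to a prompt.
REAL_ALPHABET = list("abcdefghijklmnopqrstuvwxyz")

def make_synthetic_alphabet(seed=0):
    """Randomly permute the real alphabet (deterministic for a given seed)."""
    rng = random.Random(seed)
    letters = REAL_ALPHABET[:]
    rng.shuffle(letters)
    return letters

def with_synthetic_prefix(prompt, alphabet):
    """Precede an original task prompt with the fictional-alphabet sentence."""
    return "Use this fictional alphabet: [ {} ] . {}".format(
        " ".join(alphabet), prompt)

print(with_synthetic_prefix("Let's try to complete the pattern: ...",
                            make_synthetic_alphabet()))
```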
The increase in the size of the interval from one to two letters aims to rule out the possibility that GPT-3 merely replicates the fed sequence of letters. We achieve this in two ways. For the problem types 'extend sequence', 'successor', and 'predecessor', we increase the interval size for the letter to change from one to two. For the problem types 'remove redundant letter', 'fix alphabetic sequence', and 'sort', we increase the interval size of the complete letter sequence from one to two 9 .
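The two interval modifications can be sketched as follows (a minimal illustration with our own helper names: the first helper covers the 'letter to change' family, the second the 'complete letter sequence' family):

```python
# Sketch of the two interval modifications. With interval=1 these reduce
# to the original problem style; with interval=2 they give our variants.

def successor_problem(alphabet, start, length, interval):
    """Source sequence plus target where only the last letter advances
    by `interval` positions ('extend sequence'/'successor' family)."""
    idx = alphabet.index(start)
    src = [alphabet[idx + i] for i in range(length)]
    tgt = src[:-1] + [alphabet[idx + length - 1 + interval]]
    return src, tgt

def spaced_sequence(alphabet, start, length, interval):
    """A full sequence whose letters are `interval` apart
    ('remove redundant letter'/'fix alphabetic sequence'/'sort' family)."""
    idx = alphabet.index(start)
    return [alphabet[idx + i * interval] for i in range(length)]

abc = list("abcdefghijklmnopqrstuvwxyz")
print(successor_problem(abc, "a", 4, 2))  # abcd -> abcf
print(spaced_sequence(abc, "a", 5, 2))    # a c e g i
```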
We compare GPT-3's and human performance for the following three settings: the original tasks as reported in (Webb, Holyoak, and Lu, 2023), counterexamples that involve the interval size modification, and counterexamples
9 It is worth noting that we apply this modification to both the source (the first row for each example in Figure 1) and the target (the second row for each example in Figure 1), minimizing the difficulty of the modified problems and allowing us to compare our tests to the zero-generalization problems given in the original paper.
Figure 1: Letter string analogies along their transformations of both the original paper and our counterexamples. We introduce a synthetic alphabet into the task and apply two types of letter sequence modifications, both based on increasing the interval from one to two letters. For the transformation types 'extend sequence', 'successor', and 'predecessor', the modification only affects the letter to change (last or first letter). For 'remove redundant letter', 'fix alphabetic sequence', and 'sort', the interval is increased for the complete letter sequence. We apply the same modifications to the problems generated with the synthetic alphabet.
that include both the interval size modification and the synthetic alphabet. To ensure that GPT-3 is capable of processing the introduced modifications (Wu et al., 2023, 'counterfactual comprehension check'), we additionally include tests on GPT-3 for two additional settings: original examples on the real alphabet but including the modified prompt, i.e. 'Use this fictional alphabet: [ a b c d e f g h i j k l m n o p q r s t u v w x y z ] .', and counterexamples involving the synthetic alphabet but without increasing the interval size.
GPT-3 evaluation. Our code for reproducing Figure 2 is available on Github 10 . For each problem type, we create 50 instances to mirror the original paper. The settings are as follows: model variant=text-davinci-003, temperature=0, maximum length=20. Using the original code, we mirror the evaluation and analysis approach of the original paper. The prompt pattern including the synthetic alphabet is illustrated by the following example.
10 https://github.com/hodeld/emergent_analogies_LLM_fork
```
Use this fictional alphabet: [x y l k w b f z t n j r q a h v g m u o p d i c s e]. Let's try to complete the pattern:

[x y l k] [x y l k b]
[t n j r] [
```
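The settings above can be bundled as follows. Parameter names follow the (legacy) OpenAI completions API; the request-building helper and its name are our own illustration, and the actual API call is omitted.

```python
# Sketch of the fixed sampling settings used in our evaluation
# ('maximum length' in the text corresponds to max_tokens here).
PARAMS = {
    "model": "text-davinci-003",
    "temperature": 0,
    "max_tokens": 20,
}

def build_request(prompt):
    """Combine one letter-string prompt with the fixed settings."""
    return dict(PARAMS, prompt=prompt)

req = build_request("[x y l k] [x y l k b]\n[t n j r] [")
print(req["model"], req["temperature"], req["max_tokens"])
```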
Human behavioral experiment. We conducted human behavioral experiments through an online study with University of Washington (UW) undergraduates, analogous to the experiments of the original paper. All participants provided their informed consent prior to the study, and the data collection process was approved by the UW Institutional Review Board (IRB ID STUDY00019080, approved on 6 November 2023). A total of 121 participants completed the study. They were compensated with extra course credits for their participation.
The first author of the original study generously provided the participant instructions, which we adapted for our experiments. In particular, we presented the participants with an additional example problem to introduce the synthetic alphabet.
Use this fictional alphabet: [ x y l k w b f z t n j r q a h v g m u o p d i c s e ] .
```
[x x x] [y y y]
[l l l] [ ? ]
```
Each participant completed a total of 18 zero-generalization tasks, consisting of six problems for each setting (one problem for each transformation type).
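The task allocation described above can be sketched as follows (a hypothetical reconstruction with our own pool structure and helper names; the per-participant random selection of one instance per problem subtype follows the Figure 2 caption):

```python
import random

# Sketch: 3 settings x 6 transformation types = 18 tasks per participant,
# one randomly selected instance per (setting, type) pair.
SETTINGS = ["original", "interval", "interval+synthetic"]
TYPES = ["extend sequence", "successor", "predecessor",
         "remove redundant letter", "fix alphabetic sequence", "sort"]

def tasks_for_participant(pool, rng):
    """pool[(setting, ttype)] is a list of problem instances; pick one each."""
    tasks = [(s, t, rng.choice(pool[(s, t)]))
             for s in SETTINGS for t in TYPES]
    rng.shuffle(tasks)  # presentation order (our assumption)
    return tasks
```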
## 7.1.2 Results
In our experiments, humans achieved consistently higher accuracy than GPT-3, in particular on the modified letter string tasks involving both the synthetic alphabet and the increased letter interval size; see Figure 2. Human performance remains at a similar level across modifications (Figure 4), while GPT-3's performance declines significantly for the modified problem types (Figure 3). The generative accuracy of GPT-3 with the synthetic alphabet is close to zero (< 0.1) on the modified tasks 'extend sequence', 'successor', 'predecessor', and 'fix alphabetic sequence'. Only for 'remove redundant letter' and 'sort' does GPT-3 achieve accuracy in a range similar to that reported in the original paper (Webb, Holyoak, and Lu, 2023).
Figure 5 shows the accuracy of GPT-3 in the two counterfactual comprehension checks (Wu et al., 2023). For all but the 'predecessor' task on the synthetic alphabet, we obtain a GPT-3 accuracy of at least 30% of the original level, demonstrating GPT-3's ability to process the introduced modifications.
Lastly, Figure 6 illustrates the comparison of human performance in the original tasks between the participants of the original study and those in our
Figure 2: Comparison between GPT-3's (blue) and human (orange) performances on modified letter string problems involving a synthetic alphabet and a larger interval size. The transformation types and their order correspond to Figure 6b in the original paper. Humans demonstrate significantly higher accuracy compared to GPT-3. Human results represent the average performance of 121 participants (UW undergraduates). Each participant received one randomly selected instance of each problem subtype. GPT-3 results reflect the average performance across all 50 instances. Gray error bars indicate 95% binomial confidence intervals for the average performance across multiple problems.
study. Although the subjects in our study marginally outperform those in the previous study, the similarity in performances is evidence that our experimental setup and execution align with the original study at UCLA.
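For reference, the 95% binomial confidence intervals shown as error bars in the figures can be computed, for instance, with the normal approximation (a sketch; the original analysis may use a different binomial interval):

```python
import math

# Normal-approximation 95% CI for an accuracy estimate from n trials,
# clipped to [0, 1]. 1.96 is the two-sided 95% normal quantile.
def binomial_ci_95(successes, n):
    p = successes / n
    half = 1.96 * math.sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)

lo, hi = binomial_ci_95(40, 50)  # e.g. 40 of 50 instances correct
print(round(lo, 3), round(hi, 3))
```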
## 7.1.3 Discussion
The recent paper, 'Emergent analogical reasoning in large language models' (Webb, Holyoak, and Lu, 2023), and subsequent news articles argue that LLMs may have acquired the emergent ability for zero-shot analogical reasoning. We are less certain of these conclusions, given our own follow-up experiments. Our results show low success of GPT-3 in solving letter string problems with simple modifications and with a synthetic alphabet, while human performance remains high.
Only in two out of six problem types ('remove redundant letter' and 'sort') does GPT-3 achieve generative accuracy on our counterexamples similar to that on the original problems involving the real alphabet, as well as to human performance on the same modified problems. For these two problem
Figure 3: GPT-3 performance for zero-generalization letter string problems for the original experiment (blue) and with the larger interval size (green), and larger interval size with synthetic alphabet (orange). Except for 'remove redundant letter,' GPT3's accuracy declines significantly for the modified problems. The results reflect an average performance for N=50 instances.
subtypes, GPT-3 does not need to generate a letter from the full alphabet, but only to remove the duplicate letter or to rearrange the given letters, which may explain the higher performance. The results of these two tasks serve as a further counterfactual comprehension check (Wu et al., 2023), in addition to the accuracy of GPT-3 under the only marginally modified conditions shown in Figure 5. The results demonstrate that GPT-3 is capable of processing synthetic alphabets, which validates our approach.
So what explains the high success of GPT-3 in solving the problems on the real alphabet (as used in the original paper) but failure with the synthetic alphabet and with the modified interval size for most of the letter string problems while human performance remains consistently high?
Our results suggest that the answer resides in the training data, confirming the analysis of the methods in Section 2. Unlike humans, GPT-3 performs well only on simple analogy problems with the standard English alphabet, which are likely to be present in the training data. These findings contradict two of the main claims in the original paper (Webb, Holyoak, and Lu, 2023): GPT-3's capacity for general, zero-shot reasoning and its human-like characteristics in analogical reasoning. Consequently, we reject the proposition made in the original paper that GPT-3 may have developed mechanisms similar to those underlying human intelligence.
GPT-3's failure to solve simple variations of the original problems demonstrates the brittleness of the presented approach when assessing human-like reasoning in language models.
Figure 4: Human performance on zero-generalization letter string problems for the original experiment (blue), with the larger interval size (green), and with the larger interval size and a synthetic alphabet (orange). Human accuracy on the modified problems is comparable to that on the original problems (blue). The results reflect the average performance of N = 121 participants (UW undergraduates).
<details>
<summary>Image 4 Details</summary>

Bar chart of human generative accuracy (y-axis, 0–1) across six transformation types (x-axis): extend sequence, successor, predecessor, remove redundant letter, fix alphabetic sequence, and sort. Legend: original (blue), interval & synthetic alphabet (orange), interval (green). Approximate values (with error bars):

| Transformation type | Original | Interval & synthetic alphabet | Interval |
| --- | --- | --- | --- |
| Extend sequence | ~0.85 | ~0.78 | ~0.68 |
| Successor | ~0.87 | ~0.74 | ~0.64 |
| Predecessor | ~0.82 | ~0.79 | ~0.69 |
| Remove redundant letter | ~0.88 | ~0.85 | ~0.82 |
| Fix alphabetic sequence | ~0.42 | ~0.31 | ~0.21 |
| Sort | ~0.36 | ~0.29 | ~0.25 |

Accuracy in the modified conditions remains broadly comparable to the original condition, with all conditions dropping sharply for "fix alphabetic sequence" and "sort".
</details>
## 7.2 ChatGPT's answer to our question: 'Could you give an example of a copycat problem?'
Figure 5: Counterfactual comprehension check. Comparison of GPT-3 performance on zero-generalization letter string problems between the original tasks (blue) and only marginally modified tasks: a synthetic alphabet without modification of the interval size (green) and a modified prompt without a modified string sequence (orange). Accuracy on the modified tasks is lower than on the original ones but greater than 0.2, except for 'remove redundant letter' and 'sort' involving the synthetic alphabet. The figure and the order of the transformation types correspond to Figure 6b in the original paper. The results reflect average performance over N = 50 instances.
<details>
<summary>Image 5 Details</summary>

Bar chart of GPT-3 generative accuracy (y-axis, 0–1) across six transformation types (x-axis): extend sequence, successor, predecessor, remove redundant letter, fix alphabetic sequence, and sort. Legend: original (blue), original & synthetic alphabet (green), original & prompt (orange). Approximate values (with error bars):

| Transformation type | Original | Original & synthetic alphabet | Original & prompt |
| --- | --- | --- | --- |
| Extend sequence | ~0.95 | ~0.92 | ~0.35 |
| Successor | ~0.93 | ~0.70 | ~0.55 |
| Predecessor | ~0.78 | ~0.08 | ~0.40 |
| Remove redundant letter | ~0.85 | ~0.65 | ~0.50 |
| Fix alphabetic sequence | ~0.52 | ~0.52 | ~0.25 |
| Sort | ~0.22 | ~0.18 | ~0.24 |

Accuracy drops under both modified conditions for most transformation types, with the lowest values for "predecessor" under the synthetic alphabet and for "sort" across all conditions.
</details>
Figure 6: Comparison of human performance on the original letter string tasks between the outcomes reported in the original study (blue) and the findings presented in this paper (green). UW undergraduate students exhibit marginally higher accuracies. The transformation types and their order correspond to Figure 6b in the original paper.
<details>
<summary>Image 6 Details</summary>

Bar chart comparing human generative accuracy (y-axis, 0–1) of UCLA participants from the original study (blue) and UW undergraduates from this paper (green) across six transformation types. Approximate values (with error bars):

| Transformation type | UCLA | UW |
| --- | --- | --- |
| Extend sequence | ~0.85 | ~0.86 |
| Successor | ~0.80 | ~0.84 |
| Predecessor | ~0.74 | ~0.82 |
| Remove redundant letter | ~0.70 | ~0.87 |
| Fix alphabetic sequence | ~0.27 | ~0.42 |
| Sort | ~0.24 | ~0.37 |

UW participants score marginally to moderately higher on all transformation types; both groups drop sharply for "fix alphabetic sequence" and "sort".
</details>
Figure 7: ChatGPT's answer to our question: 'Could you give an example of a copycat problem?' The tasks presented in the original paper are called Copycat problems, named after a computer program that models such letter string analogy problems.
<details>
<summary>Image 7 Details</summary>

Screenshot of a chat exchange. The user asks: "Could you give an example of a copycat problem?" ChatGPT defines a copycat problem as a cognitive challenge in which one identifies a pattern or transformation in a set of data and then applies it to new data. It gives an example: from `AB -> CD` and `EF -> GH`, the pattern is shifting each letter two positions forward in the alphabet; applying the same rule to the new data yields `U -> W` and `KL -> MN`.
</details>
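The shift-by-two pattern ChatGPT describes in Figure 7 can be written down directly. The sketch below is a minimal illustration of that rule, not part of the original experiments:

```python
def shift(s: str, n: int = 2) -> str:
    """Shift each uppercase letter n positions forward, wrapping at 'Z'."""
    return "".join(chr((ord(c) - ord("A") + n) % 26 + ord("A")) for c in s)

print(shift("AB"))  # CD
print(shift("EF"))  # GH
print(shift("U"))   # W
print(shift("KL"))  # MN
```

Note that such a uniform shift differs from the transformation types tested above (successor, predecessor, sort, etc.), which is part of why ChatGPT's self-description of Copycat problems is only loosely related to the tasks in the original paper.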