# Semantic Structure-Mapping in LLM and Human Analogical Reasoning
**Authors**: Sam Musker, Alex Duchnowski, Raphaël Millière, Ellie Pavlick
Abstract
Analogical reasoning is considered core to human learning and cognition. Recent studies have compared the analogical reasoning abilities of human subjects and Large Language Models (LLMs) on abstract symbol manipulation tasks, such as letter string analogies. However, these studies largely neglect analogical reasoning over semantically meaningful symbols, such as natural language words. This ability to draw analogies that link language to non-linguistic domains, which we term semantic structure-mapping, is thought to play a crucial role in language acquisition and broader cognitive development. We test human subjects and LLMs on analogical reasoning tasks that require the transfer of semantic structure and content from one domain to another. Advanced LLMs match human performance across many task variations. However, humans and LLMs respond differently to certain task variations and semantic distractors. Overall, our data suggest that LLMs are approaching human-level performance on these important cognitive tasks, but are not yet entirely human-like.
**Keywords**: language models, analogies, structure-mapping
Brown University
Macquarie University
1 Introduction
The recent advances of large language models (LLMs) have raised the question of whether LLMs can serve as useful cognitive models in the study of various aspects of human learning, cognition, and behavior [1, 2, 3]. One such recent debate has focused on whether LLMs acquire the ability to perform analogical reasoning as a by-product of their self-supervised learning objective [4, 5, 6, 7]. Analogical reasoning—the ability to align abstract structures between a source and target domain—is posited to play a central role in human learning and generalization, for example, our ability to reason efficiently in unfamiliar domains [8, 9]. Thus, the question of whether LLMs can reason analogically in a human-like way directly bears on their ability to serve as computational models of human behavior beyond just next-word prediction.
Recent work has focused on the ability of advanced LLMs to match human analogical reasoning performance on tasks that involve recognition of spatial and logical transformations in matrices [4] or detecting patterns in strings of letters or numbers [7]. For example, Mitchell [7] uses analogy tasks such as abcd:abce::ijkl:?? in order to test the extent to which LLMs and humans can recognize and generalize abstract structures and operations (in this example, ordered sequences and successor functions). Such studies have produced mixed results, with evidence suggesting that advanced LLMs achieve the same performance and even produce similar error patterns to those observed in humans [4, 6], but with doubts remaining about the robustness of LLMs’ abilities, particularly with respect to increasingly abstract and challenging domains [10].
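For concreteness, the transformation underlying this example can be captured by a short sketch. The names `infer_rule` and `solve_analogy` are our own illustrative choices, and this toy solver handles only the single-changed-letter case, not the full space of letter-string analogies studied in the cited work:

```python
def infer_rule(source, transformed):
    """Find the single position where the string changed and the letter offset applied."""
    assert len(source) == len(transformed)
    diffs = [i for i, (a, b) in enumerate(zip(source, transformed)) if a != b]
    assert len(diffs) == 1, "this toy solver handles exactly one changed letter"
    i = diffs[0]
    offset = ord(transformed[i]) - ord(source[i])  # e.g. +1 for a successor function
    return i, offset

def solve_analogy(source, transformed, target):
    """Apply the inferred (position, offset) rule to the target string."""
    i, offset = infer_rule(source, transformed)
    letters = list(target)
    letters[i] = chr(ord(letters[i]) + offset)
    return "".join(letters)
```

For the example above, `solve_analogy("abcd", "abce", "ijkl")` returns `"ijkm"`, i.e., the successor function applied at the same position in the target sequence.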
Previous work has focused almost exclusively on analogies using abstract and arbitrary symbols, where structures are derived from symbols’ spatial positions in the text prompt, but the symbols themselves are unimportant. This leaves open questions about analogical reasoning over semantically meaningful symbols, such as words in natural language. This type of analogical reasoning, which we call semantic structure-mapping, requires mapping between semantic structure in one domain (e.g., the relationship between a dog and a puppy, or the fact that a dog has four legs) and non-semantic (arbitrary) structure in the other domain (e.g., spatial position in the text prompt). This type of mapping is thought to play a crucial role in human cognition and development, such as in the language-analogical reasoning feedback loop proposed by Structure-Mapping Theory (SMT) [11]. Moreover, if LLMs are to provide insight into how humans perform certain cognitive functions, it will likely involve the role of distributional semantic learning [12, 13, 14] in the acquisition or representation of those functions. Therefore, we focus on investigating how humans and LLMs compare on tasks requiring semantic structure-mapping and assessing whether patterns differ from those observed on tasks involving only arbitrary symbols.
We design two experiments, focused respectively on the mapping of semantic structure (i.e., semantic relationships between symbols, such as relating the symbol dog to the symbol puppy) and semantic content (i.e., information attached to a symbol, such as the knowledge that a dog has four legs). In each experiment, the subject (human or LLM) is presented with a set of left-hand terms (the source domain) and a corresponding set of right-hand terms (the target domain), with the final right-hand term omitted. The subject is asked to fill in this blank. An exact copy of our prompt and an example question is shown in Figure 1. We create multiple variants of these questions to probe structure-mapping involving semantic structure and semantic content, respectively. We additionally design a series of control and distractor conditions—e.g., interleaving informative mappings (square => C C C) with uninformative ones (lime => X X X)—in order to expose differences in the underlying mechanism.
Overall, the most advanced LLMs we tested match human performance across our primary conditions, even producing human-like error patterns. However, significant differences emerge in several control settings. Even the most advanced LLMs show more sensitivity than humans to information presentation order and struggle to ignore irrelevant semantic information that humans readily dismiss. Thus, our results contribute to the ongoing debate about analogical reasoning, corroborating both work arguing for impressive LLM performance [4, 15] and work highlighting important mechanistic differences between humans and LLMs [10, 16]. Code and data are available at https://github.com/AnonymousReview123/Semantic_Structure_Mapping_Anon. By presenting data on the unique role of semantic structure and content in analogical reasoning, we suggest differences remain in how LLMs and humans represent and map semantic structure, although this gap may be closing as models increase in size and incorporate more diverse training signals. We argue that this has important implications for studying cognitive development and the role of LLMs in this research going forward.
2 Methods
2.1 Experiment Details
2.1.1 Semantic Structure
Each subject was presented with a quiz: a sequence of four questions generated using four sets of base domains and four sets of target domains, selected such that a participant sees each base and target domain exactly once. Eight variants of the task were devised to investigate the influence of the task variations described above.
Questions are introduced with the prompt “We are conducting an experiment on general reasoning abilities. Below we will show you various words and drawings of each, after which you will need to complete the last drawing. Respond as concisely as possible with only the last drawing.” We use the term “drawings” to describe the elements in the target domain because it loosely encapsulates the idea of mapping between the source and target domains. In a similar way to how drawings serve as partial structurally isomorphic representations that depict a subject with varying degrees of abstraction [17], the elements in our target domains establish a space of relations that are isomorphic to those in the source domain. In some cases the term “drawing” is straightforwardly applicable, as when the capitalization of characters corresponds to the term for a mature animal. In other cases the use is strained, as when capitalization instead corresponds to a shape being symmetrical. This transparently liberal use of the term “drawing” is intended to prime subjects to reason creatively while attending to the correspondence between source and target domains. The prompt’s lack of reference to analogical reasoning elicits pre-theoretic responses to the extent possible. For the same purpose the experiment is introduced to human subjects and LLMs as studying “general reasoning abilities.”
2.1.2 Semantic Content
Each condition (described in Table 4) contains two quizzes, with four questions per quiz. Unless otherwise stated, methodological details of the Semantic Content experiment match those of the Semantic Structure experiment.
The four conditions are divided into those that require numeric reasoning and those that do not. Within the numeric and non-numeric conditions respectively, one condition utilizes only one dimension of variation (referred to as “single-attribute”) whereas another adds a second dimension of variation (“multi-attribute”). This allows for comparing the relative performance of human subjects and models when the task is made to require compositional reasoning over layered transformations.
Questions were formatted like the following example:
| horse => * * * * |
| --- |
| cat => * * * * |
| ant => ! ! ! ! ! ! |
| bee => ! ! ! ! ! ! |
| chicken => ! ! |
| spider => ! ! ! ! ! ! ! ! |
| dog => * * * * |
| human => |
In this example, the number of symbols corresponds to a number-of-legs feature, and the usage of exclamation marks and asterisks corresponds to an egg-laying feature (or, alternatively, a mammal feature). The right-hand sequences of characters thereby encode properties of the entities denoted by the left-hand words. Given that humans are two-legged mammals, the correct answer here would be * *. In order to solve this task, the participant must understand both aspects of the information encoded in the right-hand terms and then construct the answer by generalizing to a new example.
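The construction of the correct answer can be made concrete with a small sketch. The attribute table and rendering rule below are our own reconstruction from this example, not the authors' question-generation code:

```python
# (number of legs, lays eggs) for each left-hand word in the example question
ATTRIBUTES = {
    "horse": (4, False), "cat": (4, False), "ant": (6, True),
    "bee": (6, True), "chicken": (2, True), "spider": (8, True),
    "dog": (4, False), "human": (2, False),
}

def render(word):
    """Encode an entity as a right-hand 'drawing': the symbol marks the
    egg-laying feature, and the repetition count marks the number of legs."""
    legs, lays_eggs = ATTRIBUTES[word]
    symbol = "!" if lays_eggs else "*"
    return " ".join([symbol] * legs)
```

Under this encoding, `render("human")` yields `* *`, matching the expected answer above, while `render("spider")` reproduces the eight exclamation marks in the question.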
2.2 Participants
2.2.1 LLMs
We run our experiments on the following LLMs: GPT-3 [18], GPT-4 [19], Pythia-12B [20], Claude 2 [21], Claude 3 Opus [22], and Falcon-40B [23]. All of the above are transformer-based LLMs trained primarily on a next word prediction objective.
GPT-3 consists of a 175B parameter model trained on text completion and finetuned to produce more coherent answers. The details of GPT-4 are not publicly known, but it is considered by some sources to be a mixture-of-experts (MoE) model consisting of numerous GPT-3-scale language models [24]. GPT-4, unlike GPT-3, supplements text-completion pretraining and finetuning with reinforcement learning from human feedback (RLHF) in order to better align model outputs with the expectations of a human user. The training of Claude 2 also includes RLHF, but its performance falls short of GPT-4's. The more recent Claude 3 (in our case, the most advanced Opus version) is considered to approximately match GPT-4 performance in general. GPT-3 and -4 are developed by OpenAI, whereas Claude 2 and 3 are developed by Anthropic. Pythia-12B and Falcon-40B are open-weights LLMs trained on a text-completion objective and consist of 12B and 40B parameters respectively. Neither undergoes RLHF. Pythia-12B is developed by EleutherAI, and Falcon-40B is developed by the Technology Innovation Institute.
2.2.2 Human Subjects
We also test human participants on our experiments. Reported in the main text are results obtained from 194 (mostly undergraduate) University-Name University students (132 in the Semantic Structure experiment, and 62 in the Semantic Content experiment). The split of participants between experiments approximately matches the 9:4 ratio of experiment conditions. The number of participants by condition is as follows: Defaults 18, Distracted 18, Only RHS 18, Permuted Pairs 17, Permuted Questions 17, Random Finals 15, Random Permuted Pairs 6, Randoms 8, Relational 15, Categorial 16, Multi Attribute 16, Numeric 16, Numeric Multi Attribute 14. The Relational, Categorial, Multi Attribute, Numeric, and Numeric Multi Attribute conditions each have two quizzes, while the remaining conditions each have four quizzes per condition. Subjects were assigned randomly to a single quiz from one condition, without the re-use of subjects. Roughly the same number of participants were assigned to each condition, with the exception of the Randoms and Random Permuted Pairs conditions. Due to their similarity, these two were together assigned roughly the number of subjects expected for a single condition.
The subjects were recruited through email advertisements and offered $10 in compensation. Earlier results obtained for the Semantic Structure experiment from an online sample of participants recruited through Prolific are reported in Figure 11 of the Appendix.
We ensure that humans and LLMs are given comparable information in our prompting design. A given human participant sees one quiz with four questions, with questions revealed one at a time and the answer shown following each response. LLMs are prompted with the first question of a quiz, then the second question with the first question and its (correct) answer accumulating in the prompt, and so forth for the four questions in a quiz. This prompt accumulation mimics human subjects' memory of the previous questions and answers within a quiz.
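A minimal sketch of this accumulation scheme follows; the function names and the `ask_model` callback are our own abstractions over model-specific APIs:

```python
def run_quiz(questions, answers, ask_model):
    """Present a quiz to an LLM, accumulating each question and its
    correct answer into the prompt, mirroring the information available
    in a human participant's memory."""
    transcript = ""
    responses = []
    for question, correct in zip(questions, answers):
        transcript += question + "\n"
        responses.append(ask_model(transcript))  # model sees all prior Q&A
        transcript += correct + "\n"             # accumulate the *correct* answer
    return responses
```

After question *k*, the transcript contains questions 1 through *k* and the correct answers to questions 1 through *k - 1*, regardless of what the model itself answered.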
2.3 Statistical testing
In each experiment, we are interested in the relative performance of human subjects and the best-performing models and how this depends on the particular experiment conditions. Differences between most models and human subjects are large and do not require statistical analysis, and so we focus our statistical analysis on the performance of GPT-4 relative to human subjects and Claude 3 relative to human subjects.
For each experiment and pair of subjects (human subjects and GPT-4, or human subjects and Claude 3) we fit a logistic model to the data with and without interactions between the subject type and the experiment condition. In all cases, the outcome variable is the un-aggregated per-question score achieved by a subject (either a 0 or 1), and the predictor variables are experiment condition (e.g. “Defaults” or “Permuted Pairs”) and subject type (e.g. “human subjects” or “GPT-4”). We use four likelihood ratio tests to assess whether the interaction between subject type and experiment condition is significant for a given pair of subjects within a particular experiment, as motivated by Glover [25]. In all four cases the interaction is significant, and so we use simple effects analysis to investigate the direction and significance of the effect of subject type within particular conditions.
For the semantic content experiment, we additionally perform a logistic simple effects analysis comparing the performance of a single subject type (human, GPT-4, or Claude 3) in compositional versus non-compositional conditions, for the numeric and non-numeric cases respectively, with the non-compositional condition as the reference. For example, we assess the effect of the condition being Multi Attribute with Categorial as the reference condition for only the subject type Claude 3 (and likewise for the other two examined subject types).
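The shape of this interaction analysis can be sketched on synthetic data. The following from-scratch illustration (our own code, not the analysis pipeline used in the paper) fits logistic models with and without a condition-by-subject-type interaction via Newton's method and compares them with a likelihood-ratio test, assuming a single binary condition contrast (hence 1 degree of freedom):

```python
import math
import numpy as np

def fit_logistic(X, y, iters=50):
    """Fit a logistic regression by Newton's method; return the maximized log-likelihood."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = np.clip(1.0 / (1.0 + np.exp(-X @ beta)), 1e-10, 1 - 1e-10)
        W = p * (1.0 - p)
        hessian = X.T @ (W[:, None] * X) + 1e-9 * np.eye(X.shape[1])  # small ridge for stability
        beta += np.linalg.solve(hessian, X.T @ (y - p))
    p = np.clip(1.0 / (1.0 + np.exp(-X @ beta)), 1e-10, 1 - 1e-10)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def lr_test_interaction(cond, subj, y):
    """Likelihood-ratio test for a (binary) condition x subject-type interaction."""
    base = np.column_stack([np.ones(len(y)), cond, subj])               # main effects only
    full = np.column_stack([np.ones(len(y)), cond, subj, cond * subj])  # + interaction
    stat = 2.0 * (fit_logistic(full, y) - fit_logistic(base, y))
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))                # chi-square sf, 1 df
    return stat, p_value
```

A significant `p_value` would indicate that the effect of the condition differs between the two subject types, which is then followed up with simple effects analysis within each condition.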
Further details are provided in Sections A.1 and A.2 of the Appendix.
3 Results
3.1 Mapping Semantic Structure
We first design a set of experiments investigating the ability of LLMs and human subjects to map semantic structure in the source domain onto arbitrary, non-semantic structure in the target domain. In this set of experiments, our source domain (left-hand side) is a set of words which are assumed to possess some relational structure, and our target domain (right-hand side) is a set of strings related via non-linguistic string operations.
| We are conducting an experiment on general reasoning abilities. Below we will show you various words and drawings of each, after which you will need to complete the last drawing. Respond as concisely as possible with only the last drawing. |
| --- |
| Question 1: |
| square => C C C |
| rectangle => c c c |
| circle => C C |
| oval => |
[Diagram: square ↦ C C C, rectangle ↦ c c c, circle ↦ C C, oval ↦ c c (inferred)]
Figure 1: An example question (from the Defaults condition of the Semantic Structure experiment) with a representation of the structure-mapping solution below. The source domain is in blue and the target domain is in orange (for the provided elements) and yellow (for the inferred element).
3.1.1 Overall Performance
Human subjects perform well overall, obtaining accuracy between 0.4 and 0.9 across the various conditions. The most advanced LLMs that we test attain accuracies between 0.1 and 0.95 across conditions. This performance range is comparable to prior work on analogical reasoning over arbitrary symbols. For example, the results of human subjects on the “zero-generalization setting” studied by both Webb et al. [4] and Mitchell et al. [10] range from 0.2 to 0.8 in the former study and from 0.5 to 1.0 in the latter. Similarly, results for LLMs (GPT-3, GPT-3.5, and GPT-4) across those conditions range from 0.1 to 1.0 in the two studies. Thus, our data suggest that analogies involving semantic structure-mapping are not inherently easier or harder than those which make use of arbitrary symbols.
Our Defaults condition consists of lexical items as a source domain and one of several string operation relations as a target domain. To investigate the robustness of performance metrics, we introduce three control conditions: (1) Permuted Questions, in which we present unaltered versions of the core task with varied question ordering; (2) Permuted Pairs, in which we alter the order in which the lines of the analogy are presented; and (3) Distracted, in which we interleave unrelated mappings between the lines of the target analogy. These conditions are shown in Table 1. We do not expect Permuted Questions to materially alter the task, but might see some effect of the Permuted Pairs and Distracted conditions, as they could make the relevant relations less transparent: see, for example, work on the blocking advantage in humans [26] and in LLMs [27].
| Defaults | Basic test of semantic structure-mapping | square => C C C rectangle => c c c circle => C C oval => |
| --- | --- | --- |
| Permuted Pairs | Like Defaults, but with row order permuted | rectangle => c c c circle => C C square => C C C oval => |
| Distracted | Like Defaults, but with a distractor row added | square => C C C rectangle => c c c pillow => A P circle => C C oval => |
Table 1: Defaults and control conditions used to measure ability of humans and LLMs to perform analogical reasoning tasks that involve semantic structure-mapping. The Permuted Questions condition (not shown) is identical to Defaults, but with question order permuted.
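These manipulations are mechanical list operations on a Defaults question, and can be sketched as follows. This is our own illustration: the fixed seed, distractor row, and insertion position are arbitrary assumptions, not the paper's generation procedure:

```python
import random

def permute_pairs(question, seed=0):
    """Permuted Pairs: shuffle the completed rows, keeping the
    incomplete final row in place."""
    *rows, final = question
    rng = random.Random(seed)  # seeded for reproducibility
    rng.shuffle(rows)
    return rows + [final]

def add_distractor(question, distractor="pillow => A P", position=2):
    """Distracted: interleave an unrelated mapping among the rows."""
    return question[:position] + [distractor] + question[position:]

defaults = ["square => C C C", "rectangle => c c c", "circle => C C", "oval =>"]
```

Both transformations preserve the set of informative mappings; only their arrangement (Permuted Pairs) or their surroundings (Distracted) change.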
Figure 2 shows the performance of humans and LLMs in the Defaults condition as a function of their performance on MMLU, a widely-used language competency benchmark. (MMLU scores are few-shot for GPT-4 and 5-shot for the other models. The reported human baseline is the estimate for human experts given by Hendrycks et al. [28]. No score for Pythia-12B could be found, so we use the reported value for Pythia 6.9B Tulu.) Increasing MMLU score is associated with higher accuracy in the Defaults condition. Smaller models do not perform competitively (Pythia-12B obtains an accuracy of 0.0, Falcon-40B 0.1, GPT-3 0.5, and Claude 2 0.6). This steadily increasing performance presumably reflects the scale of model parameters and training data [29]. We focus our remaining analysis on comparing human subjects to GPT-4 and Claude 3. In the Defaults condition, neither GPT-4 (coef=-0.7696, z=-1.659, p=0.097) nor Claude 3 (coef=-0.6131, z=-1.299, p=0.194) performs significantly worse than human subjects.
Figure 2: Human and LLM accuracy in the Defaults condition, relative to performance on the MMLU benchmark. Models in blue are not instruction-tuned while models in orange are. Error bars show standard errors.
Figure 3 compares humans to high-performing LLMs in the Defaults and Permuted Pairs conditions. LLM performance drops in the Permuted Pairs condition, while humans seem equally able to infer the mapping regardless of word presentation order. This effect is significant for both Claude 3 (coef = -1.7802, z = -4.217, p < 0.001) and GPT-4 (coef = -1.6796, z = -3.975, p < 0.001). This suggests that, while overall performance is comparable, there are likely meaningful mechanistic differences in how the analogy is processed by humans versus LLMs. The remaining control conditions and data for all tested models are shown in Figure 15 of the Appendix. In these conditions, we find that humans and models are roughly equally affected. For example, accuracy in the Distracted condition drops by approximately 0.25 for all three subject types.
Figure 3: Human and LLM accuracy in the Defaults and Permuted Pairs conditions. Error bars show standard errors.
3.1.2 Effect of Semantic Structure on Reasoning
We next investigate more directly the extent to which humans and LLMs leverage semantic structure in order to complete our analogy tasks. To do this, we design three variants of our Defaults analogy task (see Table 2). First, the Only RHS condition removes the source domain entirely. High performance in this condition thus indicates that a subject is able to complete the questions based only on the evident pattern in the target domain. We then introduce two variants that make the semantic structure in the source domain less coherent: the Randoms condition uses unrelated words, while the Random Finals condition uses three related words followed by one random word. We thus take the performance difference between the Only RHS condition and either the Randoms or Random Finals condition to be a measure of the subject’s bias toward using the semantic structure of the source domain. That is, if the subject is capable of solving the task by simply ignoring the left hand side (the Only RHS condition), then poor performance in the other conditions indicates that the subject was misled by the presence of the altered left hand side.
| Only RHS | Test of how well the answer can be inferred without using any structure-mapping | C C C c c c C C |
| --- | --- | --- |
| Randoms | Variant of Defaults in which there is no semantic structure relating the words on the left hand side | banana => C C C fireplace => c c c bean => C C plug => |
| Random Finals | Variant of Defaults in which the final term is not semantically related to the preceding terms | square => C C C rectangle => c c c circle => C C lime => |
Table 2: Conditions involving alteration or omission of the source domain. The Random Permuted Pairs condition (not shown) is identical to Randoms, but with the order of elements within questions permuted.
Both humans and models competently complete the Only RHS condition (see Figure 4). Accuracy is approximately 0.8 for Claude 3, with human subjects and GPT-4 slightly higher at 0.9. GPT-4 is not significantly different from humans in this condition (coef = 0.1178, z = 0.223, p = 0.824), and Claude 3 is worse than humans by a marginally significant amount (coef = -0.9130, z = -1.994, p = 0.046). Thus, both humans and LLMs are able to complete the task without the guidance of the left hand side. With this established, we examine the performance degradation associated with encountering incoherent semantic structure on the left hand side. Humans exhibit a modest decrease in accuracy of about 0.15 in the Randoms and Random Permuted Pairs conditions relative to Defaults. Claude 3 and GPT-4, however, exhibit much larger drops: Claude 3 decreases by approximately 0.5 relative to Defaults, while GPT-4 decreases by 0.6 and 0.4 in the Randoms and Random Permuted Pairs conditions respectively. Across these two conditions, both GPT-4 (coef = -2.1972, z = -5.211, p < 0.001) and Claude 3 (coef = -2.0680, z = -4.960, p < 0.001) perform significantly worse than humans.
Figure 4: Human and LLM accuracy in Only RHS, Randoms, and Random finals conditions. Data from the Random Permuted Pairs condition is shown in Figure 15 of the Appendix . Error bars show standard errors.
From this we conclude that human subjects can readily identify when the left hand side contains no useful semantic structure to leverage, and in that case fall back on a strategy that relies only on the right hand side. By contrast, models do not seem able to recognize the uninformativeness of the left hand side in these conditions: they fail to adopt the right-hand-side-only strategy, even though they demonstrate the capability to use it when no left hand side is present. This suggests mechanistic differences in how human subjects and models process this task.
Although the performance of human subjects does not drop notably in the Random condition compared to the Only RHS condition, it drops by a wide margin in the Random Finals condition, where accuracy is approximately 0.5 lower than in the Only RHS condition. This further suggests that the semantic relatedness of the left-hand side affects human subjects' strategy: when the left-hand side is clearly unrelated, the information it provides is discarded, but when much of the left-hand side appears related, it is not, and the random final word of the source domain prompts an incorrect answer. Models also show a large drop in performance in the Random Finals condition relative to Only RHS, with Claude 3 dropping by 0.5 and GPT-4 by 0.8. Simple effects analysis shows that both Claude 3 (coef = -1.0464, z = -2.799, p = 0.005) and GPT-4 (coef = -2.7850, z = -5.168, p < 0.001) are significantly worse than humans in the Random Finals condition. However, we consider this difference less informative than the fact that both models drop in performance across all random conditions relative to their own Only RHS performance.
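The simple effects comparisons above come from a regression-based analysis on trial-level accuracy. As a rough illustration of the same kind of between-group contrast, a two-proportion z-test can be sketched; this is our simplification, not the paper's exact model, and the counts below are hypothetical:

```python
import math

# Compare accuracy in two conditions (or two subject groups) with a
# two-proportion z-test. Illustrative only: the paper reports coefficients
# from a regression-based simple effects analysis, not this test.
def two_proportion_z(correct_a: int, n_a: int, correct_b: int, n_b: int) -> float:
    p_a, p_b = correct_a / n_a, correct_b / n_b
    p_pool = (correct_a + correct_b) / (n_a + n_b)  # pooled accuracy
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se

# Hypothetical counts: 70% vs 40% accuracy over 60 trials each.
z = two_proportion_z(42, 60, 24, 60)
print(round(z, 2))  # ≈ 3.3: a large, significant gap at conventional thresholds
```

A large |z| (roughly above 1.96) corresponds to p < 0.05, the same criterion applied to the z-values reported in the text.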
3.1.3 Other Observations
We additionally analyze the extent to which human subjects and models improve by question (Figure 14 of the Appendix), and the extent to which the errors made by humans and models follow the same distribution across questions grouped by target domain and across qualitative error types (Figure 13 and Table 5 of the Appendix). We find that humans and models alike improve over subsequent questions, adding to a body of evidence about in-context learning [30, 31, 32]. Humans and models show similar error distributions by target domain, but qualitative error types reveal a closer correspondence between human and GPT-4 errors than Claude 3.
3.1.4 Diagnosing the Use of an RHS-Only Heuristic
To clarify whether subjects actually make use of left-right relations or only complete right-side patterns in the Semantic Structure experiment, we design the Relational condition, a $2 \times n$ variant of the Defaults condition which cannot be solved (consistently) using only the right-hand terms (see the example in Table 3).
| pants => H # H |
| --- |
| glove => X # X |
| torso => V |
| foot => Z |
| head => M |
| shirt => V # V |
| hat => |
Table 3: An example from the Relational variant of the Defaults task, used to diagnose subjects’ tendency to rely on RHS-only heuristics to solve the task.
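To make the diagnosis concrete, the Table 3 example can be rendered as a minimal sketch. The `worn_on` relation is our assumption about the intended garment-to-body-part mapping, which is world knowledge not stated in the prompt:

```python
# Why the Relational variant defeats an RHS-only heuristic (Table 3 example).
garment_rows = {"pants": "H # H", "glove": "X # X", "shirt": "V # V"}
body_part_rows = {"torso": "V", "foot": "Z", "head": "M"}
worn_on = {"shirt": "torso", "hat": "head"}  # assumed world knowledge

def relational_answer(garment: str) -> str:
    """Use the left-hand relation: find the garment's body part, take that
    row's symbol, and apply the doubling pattern from the completed rows."""
    symbol = body_part_rows[worn_on[garment]]
    return f"{symbol} # {symbol}"

def rhs_only_candidates() -> set[str]:
    """An RHS-only strategy sees the doubling pattern but has no basis for
    choosing among the three single-symbol rows."""
    return {f"{s} # {s}" for s in body_part_rows.values()}

print(relational_answer("hat"))    # "M # M": determined by the LHS relation
print(len(rhs_only_candidates()))  # 3: without the LHS, the answer is ambiguous
```

Because three completions are equally consistent with the right-hand patterns alone, above-chance accuracy on this variant requires mapping from the source domain.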
Results are shown in Figure 5. Human subjects and Claude 3 exhibit similar performance, with accuracies of approximately 0.7. GPT-4, however, attains much lower accuracy of approximately 0.35. Simple effects analysis shows that GPT-4 obtains significantly worse accuracy than human subjects (coef = -1.3669, z = -3.065, p = 0.002), while the accuracy of Claude 3 does not differ significantly from human subjects (coef = 0.2111, z = 0.467, p = 0.640).
Figure 5: Human and LLM accuracy in the Relational condition followup, with Defaults condition performance for reference. Error bars show standard errors.
3.1.5 Takeaways
Despite weak performance from many models on our analogical reasoning tasks, GPT-4 and Claude 3 perform well, showing patterns similar to humans in leveraging the semantic structure of corresponding domains to solve analogies. However, differences remain in how they handle semantic structure in the source domain. Humans prefer to leverage semantic structure when a clear pattern exists (evidenced by the Defaults and Random Finals conditions) but can ignore words when structure is lacking (Randoms condition). Models show the former bias but not the latter ability, appearing distracted by random lexical items. Nevertheless, model results increasingly resemble those of human subjects, suggesting larger models may close this gap.
Furthermore, qualitative differences exist even between the best models. GPT-4 and Claude 3 match human performance in the Defaults condition, but when the structure is generalized from $2 \times 2$ to $2 \times n$ in the Relational followup, making a right-hand-only strategy unworkable, Claude 3 maintains human-level performance while GPT-4 drops significantly. Despite limited public information, it is notable that models produced using presumably similar approaches can exhibit meaningfully different behavioral patterns.
3.2 Mapping Semantic Content
The Semantic Structure experiment, which presented subjects with source and target domains with corresponding semantic structure (i.e., with corresponding relations between terms), provides insight into the relative bias of human subjects and models to transfer this structure across domains. The Semantic Content experiment modifies the tasks to investigate the extent to which human subjects and models can transfer elements of the linguistic meaning of terms from one domain to another.
To achieve this, we ensure that elements of the target domain directly depend on properties of corresponding source domain elements, requiring knowledge of the source domain terms’ meaning for perfect performance. As in the Semantic Structure experiment, source and target domains are paired such that patterns in the target domain mirror those in the source domain. Together, these experiments compare the subject’s ability and tendency to use a structure-mapping approach. Four tasks are generated, encoding either one or two dimensions of variation and either involving or not involving numeric reasoning (see Table 4).
| Categorial: Right-hand terms are single characters corresponding to a Categorial property of the left-hand terms. | chicken => ! spider => ! cat => * horse => * ant => ! dog => * bee => ! human => |
| --- | --- |
| Multi-Attribute: Right-hand terms are a sequence of several characters that vary according to two properties of the left-hand terms. | grandfather => ! grandmother => * mother => * * father => ! ! brother => ! ! ! sister => |
| Numeric: Right-hand terms are a sequence of a single repeated character, with the number of repetitions corresponding to a numeric property of the left-hand terms. | chicken => * * human => * * dog => * * * * spider => * * * * * * * * cat => * * * * horse => * * * * bee => |
| Numeric Multi-Attribute: Right-hand terms are a sequence of a repeated character, with the number of repetitions corresponding to a numeric property of the left-hand terms and the character corresponding to a Categorial property. | horse => * * * * cat => * * * * ant => ! ! ! ! ! ! bee => ! ! ! ! ! ! chicken => ! ! spider => ! ! ! ! ! ! ! ! dog => * * * * human => |
Table 4: The conditions of the Semantic Content experiment.
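As an illustration, the Numeric condition of Table 4 can be sketched as a small generator. Here the numeric property is the animal's leg count (consistent with the values in the table, though we state this as our reading); solving the item requires recovering this mapping from the source terms' meanings:

```python
# Sketch of a Numeric-condition item (Table 4): each right-hand side repeats
# a symbol once per leg. Leg counts are common knowledge, not stated in the
# prompt; the dictionary below is our assumed encoding.
legs = {"chicken": 2, "human": 2, "dog": 4, "cat": 4,
        "horse": 4, "bee": 6, "spider": 8}

def numeric_rhs(animal: str, symbol: str = "*") -> str:
    """Render the right-hand side as space-separated repeated symbols."""
    return " ".join([symbol] * legs[animal])

assert numeric_rhs("spider") == "* * * * * * * *"  # matches the Table 4 row
print(numeric_rhs("bee"))  # the held-out completion: six symbols
```

The Numeric Multi-Attribute condition composes this count with a second, categorial property that selects the symbol itself.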
Figure 6: Human and model accuracy by condition in the Semantic Content experiment. Error bars show standard errors.
Results for human subjects, GPT-4, and Claude 3 are shown in Figure 6 (other tested models attain much lower accuracy as before).
3.2.1 Human Performance Continues to be Robust
Human subjects perform robustly and consistently, as in the previous experiment. Human accuracy ranges from 0.4 to 0.8 across conditions, comparable to the earlier Semantic Structure experiment. As expected, subjects generally describe their strategy as relating properties of the left-hand terms to their representations on the right-hand side.
3.2.2 Claude 3 Matches Human Performance Stably Across Conditions
Claude 3 matches human performance stably across the different conditions of the Semantic Content experiment, with its accuracy falling into a comparable range of 0.4 to 0.7. The model performs marginally better in the Multi-Attribute condition and marginally worse in the remaining three. None of these differences is significant: Categorial (coef = -0.8109, z = -1.879, p = 0.060), Multi-Attribute (coef = 0.6206, z = 1.478, p = 0.140), Numeric (coef = -0.1788, z = -0.439, p = 0.661), and Numeric Multi-Attribute (coef = -0.2009, z = -0.484, p = 0.629). Therefore, Claude 3 performs as well as human subjects across all conditions of this experiment.
3.2.3 GPT-4 Lags Human Subjects on Numeric Reasoning
GPT-4 achieves good results in the Categorial and Multi-Attribute conditions, with mean accuracies of approximately 0.7 in both (compared to 0.7 and 0.4 respectively for human subjects). GPT-4 is not significantly worse than humans in the Categorial condition (coef = -0.3927, z = -0.889, p = 0.374) and significantly outperforms human subjects in the Multi-Attribute condition (coef = 1.0624, z = 2.429, p = 0.015). However, its accuracy drops to 0.2–0.3 in the remaining conditions, and we find that GPT-4 is significantly worse than humans in both the Numeric (coef = -1.4781, z = -3.321, p = 0.001) and Numeric Multi-Attribute conditions (coef = -0.9694, z = -2.185, p = 0.029).
In these conditions, GPT-4 fails to correctly relate the number of characters in a response to the numeric property of the object (see Table 7 for an illustrative example). GPT-4’s failure to reason about the number of characters in the expected way is further observed in the sanity check shown in Table 8 of the Appendix, even when the model is not required to relate a property of a word to its representation.
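The counting step at issue can be isolated in a few lines. The following is an illustrative grading check, not the authors' scoring code:

```python
# Sketch of the counting check GPT-4 appears to fail: a numeric-condition
# response is correct only if it repeats the expected symbol the expected
# number of times.
def count_matches(response: str, expected_count: int, symbol: str = "*") -> bool:
    tokens = response.strip().split()
    return len(tokens) == expected_count and all(t == symbol for t in tokens)

assert count_matches("* * * * * *", 6)    # six symbols, e.g. for a six-legged bee
assert not count_matches("* * * * *", 6)  # an off-by-one repetition fails
```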
3.2.4 Human Performance Drops in Compositional Conditions, But Models Remain Constant
When comparing the performance of a subject in a non-compositional (single-attribute) condition to the corresponding compositional (multi-attribute) version, we observe some decrease in performance for human subjects but not for models (note that this surprising result is subject to alternative explanations, addressed in the discussion below). The accuracy of human subjects drops from approximately 0.7 to approximately 0.4 when comparing the Categorial condition to the corresponding compositional version (the Multi-Attribute condition). A simple effects analysis confirms that this decline is significant (coef = -1.2267, z = -3.091, p = 0.002). We see a non-significant decrease in accuracy for human subjects when comparing the Numeric condition to its compositional counterpart, with performance dropping from approximately 0.6 to approximately 0.5 (coef = -0.3795, z = -1.028, p = 0.304).
By contrast, we do not find either model to be significantly worse in compositional conditions than non-compositional ones. In fact, GPT-4 exhibits a slight improvement in the compositional conditions, though this change is statistically insignificant for both the Multi-Attribute condition relative to the Categorial condition (coef = 0.2281, z = 0.477, p = 0.634) and for the Numeric Multi-Attribute condition relative to the Numeric condition (coef = 0.1292, z = 0.254, p = 0.799). For Claude 3 we similarly find the differences to be insignificant for the Multi-Attribute condition relative to the Categorial condition (coef = 0.2049, z = 0.452, p = 0.651) and for the Numeric Multi-Attribute condition relative to the Numeric condition (coef = -0.4013, z = -0.893, p = 0.372).
3.2.5 Takeaways
The Semantic Content experiment confirms that human subjects perform robustly and flexibly across diverse task variations. Claude 3 matches human performance in all conditions, indicating it shares humans’ tendency to use the source domain’s semantic content when completing target domains. While GPT-4’s poor performance in numeric conditions is notable, it reflects a failure in numeric reasoning rather than a difference in analogical reasoning.
We find evidence of decreased human performance, but not model performance, in compositional conditions, contrasting with some existing research [33]. However, other factors may be at play. Models’ negative compositionality effect may be masked by a positive effect, such as increased available information: when the target domain represents two source domain properties, models may more easily recognize the encoding of source domain properties. Human subjects may benefit less from this competing effect if they do not struggle to observe this information encoding.
4 Discussion
Our results show that the best-performing LLMs are able to successfully complete many analogical reasoning tasks with human-level accuracy using novel stimuli not present in their training data. They also show that there remain meaningful differences in how such analogies are processed, evidenced by differences in how humans and models respond to distracting or misleading information. However, we observe a clear trend: more recent models come increasingly close to matching human performance across our tasks. In particular, Claude 3, the most recently-released model we test, exhibits impressively robust performance across most task variations, even closing the gap with humans in some test conditions in which its predecessor (GPT-4) exhibited limitations (such as the Relational task version in which mapping from the source domain must be used for success). Together, these results raise questions about the ability of LLMs and similar models to serve as candidate cognitive models, which we discuss briefly below.
4.1 Evaluating the Competence of LLMs
The breadth of Claude 3’s success in our tasks is noteworthy. It suggests that state-of-the-art LLMs can broadly match human performance not only in formal analogical reasoning tasks, as suggested by Webb et al. [4], but also in tasks that require mapping semantic information across linguistic and non-linguistic domains. As such, our results weigh against a long-standing view in cognitive science, according to which connectionist models without a built-in symbolic component are constitutively limited in their ability to robustly handle analogical reasoning tasks [7]. They also inform discussions of whether LLMs possess “functional” linguistic competence, in addition to “formal” linguistic competence [3]. Further work is needed to characterize the precise mechanism that LLMs are using to solve these tasks; it is possible–though increasingly unlikely given the robustness of the behavioral results–that success is due to a myriad of heuristics rather than a systematic analogical reasoning process. Even so, evidence of LLMs completing analogical reasoning tasks in domains designed to involve linguistic structure-mapping, in addition to tasks over abstract symbols, runs counter to the claim that LLMs are capable of formal but not functional linguistic competence.
There remain examples of LLMs performing much worse than humans on analogical reasoning tasks [10], which must be reconciled with our results. Here the competence-performance distinction, originally introduced by Noam Chomsky [34], can be usefully applied to the evaluation of LLMs [35, 2, 36]. This distinction allows researchers to theorize about the abstract computational principles governing cognition separately from the “noise” introduced by performance factors. In humans, it is generally assumed that there is a double dissociation between performance and competence: neither success nor failure on a task designed to measure a particular capacity can always be taken as conclusive evidence that subjects have or lack that capacity, due to auxiliary factors affecting task performance. When it comes to LLMs, by contrast, the distinction is typically applied in a single direction: human-like performance on benchmarks is often explained away by reliance on shallow heuristics [37] and/or lack of construct validity [38], while sub-human performance is often taken as reliable evidence of lack of competence. However, LLM performance can also be negatively affected by strong auxiliary task demands [39] and mismatched conditions in comparisons with human subjects [40]. These are compelling reasons to apply the dissociation in both directions to LLMs as well.
From this perspective, our results offer evidence to support both sides of the present debate about whether LLMs possess human-level analogical reasoning (see Webb et al. [15], Mitchell et al. [10], and Hodel et al. [16]). Supporting the argument of Webb et al. [15] that deficiencies in capabilities other than analogical reasoning can explain poor model performance in some tasks, we find that GPT-4’s failure in the numeric conditions of our Semantic Content experiment may be due to a deficiency in counting ability. However, contrary to Webb et al. [4], who report impressive analogical reasoning in both GPT-3 and GPT-4, we do find a qualitative difference in the performance of these two models, with GPT-3 performing quite poorly on our tasks. Among the models tested, only GPT-4 and Claude 3 produce results that merit detailed comparison with human subjects. This suggests that claims of human-level performance of LLMs on analogical reasoning tasks may have been premature and might have relied on insufficiently challenging tasks.
However, other differences we observe between human subjects and LLMs across task variations are not subject to an auxiliary task demand explanation and suggest that the underlying mechanisms of analogical reasoning in these systems may differ from those in humans. Importantly, these differences persist even in our best performing model, Claude 3. For instance, Claude 3 responds differently from human subjects when some or all words in the source domain are replaced with random words, indicating that it may use a distinct strategy for identifying and leveraging relational similarities between source and target domains. Furthermore, Claude 3 remains more sensitive than human subjects to the ordering of elements within domains, which is difficult to explain if LLMs are using a generalizable symbolic working memory approach.
Collectively, these patterns bear on the larger question of how we should arbitrate disputes about competence in machine-human comparisons. On the one hand, it seems reasonable to assume that any system that can reliably achieve success at or above human level on experiments like ours–without relying on memorization and other confounds–should be considered competent at analogical reasoning through structure-mapping. On the other hand, we should be open to the possibility that such competence may be implemented differently in LLMs and humans.
The question of whether we require human-likeness of the mechanism to declare human-level “competence” is ultimately not empirical, but rather demands philosophical consensus among the scientific community around our ultimate goals and metrics for achieving them.
4.2 Analogy in Human(-like) Learning and Bootstrapping
Unlike previous research comparing analogical reasoning in human subjects and LLMs, our tasks involve transferring semantic structure and content from source to target domains, rather than reasoning over abstract symbols. Our experiments thus investigate whether LLMs’ analogical reasoning resembles that of human subjects in a manner pertinent to its purportedly central role in broader cognition. Following Gentner [11], emphasis has been placed on relational similarity, rather than just feature similarity, in mapping from a familiar source to a foreign target domain during analogical reasoning to allow for the flexible transfer of knowledge [41, 42, 43]. This conception allows analogical reasoning to play a fundamental role in human cognition, supporting the emergence of diverse cognitive abilities via “bootstrapping” [44, 45, 46]. In bootstrapping, two cognitive processes mutually support each other’s development. In Gentner’s Structure-Mapping Theory (SMT), language development and structure-mapping-based analogical reasoning are hypothesized to co-develop, with structure-mapping developing the necessary relational reasoning to model language-world relations, and language acquisition in turn developing symbolic reasoning capacities that amplify structure-mapping abilities. Consequently, analogical reasoning is seen as a central cognitive phenomenon of interest.
The success of some LLMs in many of our tasks suggests that the most advanced models may be capable of employing a structure-mapping based approach to analogical reasoning, in which relations in the source domain are used to constrain and guide reasoning about relations in the target domain. This raises the possibility that a bootstrapping cycle between language development and analogical reasoning in humans, as proposed by Gentner [44], may be paralleled in language models. The emergence of such competence from training primarily on text prediction would yield new hypotheses about the emergence of analogical reasoning as a central cognitive faculty from generic learning mechanisms (possibly combined with the unique pressures of language acquisition). However, the mixed success of LLMs and the significant differences from humans in certain conditions underscore the need for continued research to test the robustness of any conclusion that analogical reasoning in LLMs closely matches that of human subjects. As LLM outputs continue to converge toward human responses–an expected product of the language modelling objective–it is crucial to develop novel tasks that examine analogical reasoning ability and are not attested in the training data. While our task allows for clear discrimination between human performance and that of most models prior to Claude 3, further differences in analogical reasoning patterns between humans and Claude 3 likely exist beyond those revealed by our tests. More granular testing would help clarify the extent of the remaining discrepancies between humans and the most advanced LLMs, and much further work is required to verify the hypothesis that language models parallel the bootstrapping cycle between language development and analogical reasoning in humans.
The proprietary nature of leading LLMs like Claude 3 unfortunately limits our ability to directly investigate the features that may explain the emergence of a response pattern largely mirroring that of human subjects. However, increasingly sophisticated open-weights models are being released, which may allow for interpretability work to analyze the internal mechanisms of a model and shed light on the underlying mechanisms that enable advanced LLMs to exhibit impressive analogical reasoning abilities in many tasks.
5 Acknowledgments
This work was supported in part by NIH NIGMS COBRE grant #5P20GM10364510.
References
- [1] A. Srivastava, A. Rastogi, A. Rao, et al., Beyond the imitation game: Quantifying and extrapolating the capabilities of language models (2023). arXiv:2206.04615.
- [2] E. Pavlick, Symbols and grounding in large language models, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 381 (2251) (Jun. 2023). doi:10.1098/rsta.2022.0041. URL http://dx.doi.org/10.1098/rsta.2022.0041
- [3] K. Mahowald, A. A. Ivanova, I. A. Blank, N. Kanwisher, J. B. Tenenbaum, E. Fedorenko, Dissociating language and thought in large language models (2023). arXiv:2301.06627.
- [4] T. Webb, K. J. Holyoak, H. Lu, Emergent analogical reasoning in large language models, Nature Human Behaviour 7 (9) (2023) 1526–1541.
- [5] S. J. Han, K. Ransom, A. Perfors, C. Kemp, Inductive reasoning in humans and large language models (2023). arXiv:2306.06548.
- [6] X. Hu, S. Storks, R. L. Lewis, J. Chai, In-context analogical reasoning with pre-trained language models (2023). arXiv:2305.17626.
- [7] M. Mitchell, Abstraction and analogy-making in artificial intelligence, Annals of the New York Academy of Sciences 1505 (1) (2021) 79–101. doi:10.1111/nyas.14619.
- [8] K. J. Holyoak, D. Gentner, B. N. Kokinov, Introduction: The Place of Analogy in Cognition, in: The Analogical Mind: Perspectives from Cognitive Science, The MIT Press, 2001. doi:10.7551/mitpress/1251.003.0003.
- [9] D. R. Hofstadter, Epilogue: Analogy as the Core of Cognition, in: The Analogical Mind: Perspectives from Cognitive Science, The MIT Press, 2001. doi:10.7551/mitpress/1251.003.0020.
- [10] M. Lewis, M. Mitchell, Using counterfactual tasks to evaluate the generality of analogical reasoning in large language models (2024). arXiv:2402.08955.
- [11] D. Gentner, Structure-mapping: A theoretical framework for analogy, Cognitive Science 7 (2) (1983) 155–170. doi:10.1207/s15516709cog0702_3.
- [12] K. Erk, Towards a semantics for distributional representations, in: A. Koller, K. Erk (Eds.), Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013) – Long Papers, Association for Computational Linguistics, Potsdam, Germany, 2013, pp. 95–106. URL https://aclanthology.org/W13-0109
- [13] G. Boleda, Distributional semantics and linguistic theory, CoRR abs/1905.01896 (2019). arXiv:1905.01896. URL http://arxiv.org/abs/1905.01896
- [14] L. Gleitman, C. Fisher, 6 universal aspects of word learning, in: J. A. McGilvray (Ed.), The Cambridge Companion to Chomsky, Cambridge University Press, 2005, p. 123.
- [15] T. Webb, K. J. Holyoak, H. Lu, Evidence from counterfactual tasks supports emergent analogical reasoning in large language models (2024). arXiv:2404.13070.
- [16] D. Hodel, J. West, Response: Emergent analogical reasoning in large language models (2024). arXiv:2308.16118.
- [17] S. French, A model-theoretic account of representation (or, i don’t know much about art…but i know it involves isomorphism), Philosophy of Science 70 (5) (2003) 1472–1483. doi:10.1086/377423.
- [18] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners (2020). arXiv:2005.14165.
- [19] OpenAI, GPT-4 technical report (2024). arXiv:2303.08774.
- [20] S. Biderman, H. Schoelkopf, Q. Anthony, H. Bradley, K. O’Brien, E. Hallahan, M. A. Khan, S. Purohit, U. S. Prashanth, E. Raff, A. Skowron, L. Sutawika, O. van der Wal, Pythia: A suite for analyzing large language models across training and scaling (2023). arXiv:2304.01373.
- [21] Anthropic, Claude-2 language model, https://www.anthropic.com/index/claude-2, accessed: 2023-11-06 (2023).
- [22] Anthropic, The Claude 3 Model Family: Opus, Sonnet, Haiku, https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf, accessed: 2024-05-01 (2024).
- [23] Technology Innovation Institute, Falcon language model, https://falconllm.tii.ae/falcon.html, accessed: 2023-10-31 (2023).
- [24] B. Liu, L. Ding, L. Shen, K. Peng, Y. Cao, D. Cheng, D. Tao, Diversifying the mixture-of-experts representation for language models with orthogonal optimizer (2023). arXiv:2310.09762.
- [25] S. Glover, P. Dixon, Likelihood ratios: A simple and flexible statistic for empirical psychologists, Psychonomic Bulletin & Review 11 (5) (2004) 791–806. doi:10.3758/BF03196706. URL https://doi.org/10.3758/BF03196706
- [26] P. Carvalho, R. Goldstone, Category structure modulates interleaving and blocking advantage in inductive category acquisition, in: Proceedings of the 34th Annual Conference of the Cognitive Science Society, 2012, pp. 186–191.
- [27] J. Russin, E. Pavlick, M. J. Frank, Human curriculum effects emerge with in-context learning in neural networks (2024). arXiv:2402.08674.
- [28] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding (2021). arXiv:2009.03300.
- [29] J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, D. Amodei, Scaling laws for neural language models (2020). arXiv:2001.08361.
- [30] S. M. Xie, A. Raghunathan, P. Liang, T. Ma, An explanation of in-context learning as implicit bayesian inference (2022). arXiv:2111.02080.
- [31] Y. Zhang, F. Zhang, Z. Yang, Z. Wang, What and how does in-context learning learn? bayesian model averaging, parameterization, and generalization (2023). arXiv:2305.19420.
- [32] A. Raventós, M. Paul, F. Chen, S. Ganguli, Pretraining task diversity and the emergence of non-bayesian in-context learning for regression, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, Vol. 36, Curran Associates, Inc., 2023, pp. 14228–14246. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/2e10b2c2e1aa4f8083c37dfe269873f8-Paper-Conference.pdf
- [33] O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, M. Lewis, Measuring and narrowing the compositionality gap in language models (2023). arXiv:2210.03350.
- [34] N. Chomsky, Aspects of the Theory of Syntax, 50th Edition, The MIT Press, 1965. URL http://www.jstor.org/stable/j.ctt17kk81z
- [35] C. Firestone, Performance vs. competence in human-machine comparisons, Proc Natl Acad Sci U S A 117 (43) (2020) 26562–26571.
- [36] E. Pavlick, Semantic structure in deep learning (january 2022), Annual Review of Linguistics 8 (2022) 447–471. URL http://dx.doi.org/10.1146/annurev-linguistics-031120-122924
- [37] R. T. McCoy, S. Yao, D. Friedman, M. Hardy, T. L. Griffiths, Embers of autoregression: Understanding large language models through the problem they are trained to solve (2023). arXiv:2309.13638.
- [38] T. Ullman, Large language models fail on trivial alterations to theory-of-mind tasks (2023). arXiv:2302.08399.
- [39] J. Hu, M. C. Frank, Auxiliary task demands mask the capabilities of smaller language models (2024). arXiv:2404.02418.
- [40] A. K. Lampinen, Can language models handle recursively nested grammatical structures? a case study on comparing models and humans (2023). arXiv:2210.15303.
- [41] H. Gust, U. Krumnack, K.-U. Kühnberger, A. Schwering, Analogical reasoning: A core of cognition., KI 22 (2008) 8–12.
- [42] G. S. Halford, W. H. Wilson, S. Phillips, Relational knowledge: the foundation of higher cognition, Trends in Cognitive Sciences 14 (11) (2010) 497–505. doi:https://doi.org/10.1016/j.tics.2010.08.005. URL https://www.sciencedirect.com/science/article/pii/S1364661310002020
- [43] K. J. Holyoak, 234 Analogy and Relational Reasoning, in: The Oxford Handbook of Thinking and Reasoning, Oxford University Press, 2012. arXiv:https://academic.oup.com/book/0/chapter/293248246/chapter-ag-pdf/44513038/book\_34559\_section\_293248246.ag.pdf, doi:10.1093/oxfordhb/9780199734689.013.0013. URL https://doi.org/10.1093/oxfordhb/9780199734689.013.0013
- [44] D. Gentner, Bootstrapping the mind: Analogical processes and symbol systems, Cognitive Science 34 (5) (2010) 752–775. arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1111/j.1551-6709.2010.01114.x, doi:https://doi.org/10.1111/j.1551-6709.2010.01114.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1551-6709.2010.01114.x
- [45] S. Carey, Bootstrapping & the origin of concepts, Daedalus 133 (1) (2004) 59–68. arXiv:https://direct.mit.edu/daed/article-pdf/133/1/59/1828762/001152604772746701.pdf, doi:10.1162/001152604772746701. URL https://doi.org/10.1162/001152604772746701
- [46] S. Carey, The Origin of Concepts, Oxford University Press, 2009. doi:10.1093/acprof:oso/9780195367638.001.0001. URL https://doi.org/10.1093/acprof:oso/9780195367638.001.0001
Appendix A Statistical outputs and supplementary figures
A.1 Regression results, Semantic Structure experiment
We perform a logistic regression with the outcome variable being the raw score (a 0 or 1 for each question). The predictor variables are condition and subject type (restricted to human subjects and GPT-4 only, or human subjects and Claude 3 only). The regression is performed with and without interactions:
Without interactions:
`smf.logit(formula="respondent_scores ~ C(subject_type, Treatment(reference='human')) + C(quiz_class, Treatment(reference='permuted_questions'))", data=all_subjects_df).fit(maxiter=1000, method="bfgs")`
With interactions:
`smf.logit(formula="respondent_scores ~ C(subject_type, Treatment(reference='human')) * C(quiz_class, Treatment(reference='permuted_questions'))", data=all_subjects_df).fit(maxiter=1000, method="bfgs")`
The significance of including the interaction between predictors is assessed with a likelihood ratio test with the associated p-value calculated as follows:
`p = chi2.sf(lik_ratio, degfree)`, with 7 degrees of freedom.
The likelihood ratio in the above formula is calculated as follows:
`lik_ratio = degfree * (res_subjXclass.llf - res_subjplusclass.llf)`
In the above, res_subjXclass and res_subjplusclass are the regression outputs with and without interactions respectively.
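The two model fits and the likelihood ratio test can be sketched end-to-end as follows. This is a minimal, self-contained illustration on synthetic placeholder data (the column names follow the formulas above, but the data are random, not the paper's), and it uses the standard form of the likelihood-ratio statistic, which doubles the log-likelihood difference between the nested models and derives the degrees of freedom from the fitted models.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Synthetic placeholder data with the column names used in the text.
rng = np.random.default_rng(0)
n = 400
all_subjects_df = pd.DataFrame({
    "subject_type": rng.choice(["human", "model"], size=n),
    "quiz_class": rng.choice(
        ["permuted_questions", "defaults", "distracted"], size=n
    ),
    "respondent_scores": rng.integers(0, 2, size=n),
})

# Additive model: subject type and condition entered separately.
res_subjplusclass = smf.logit(
    "respondent_scores"
    " ~ C(subject_type, Treatment(reference='human'))"
    " + C(quiz_class, Treatment(reference='permuted_questions'))",
    data=all_subjects_df,
).fit(maxiter=1000, method="bfgs", disp=0)

# Full model: adds subject type by condition interactions.
res_subjXclass = smf.logit(
    "respondent_scores"
    " ~ C(subject_type, Treatment(reference='human'))"
    " * C(quiz_class, Treatment(reference='permuted_questions'))",
    data=all_subjects_df,
).fit(maxiter=1000, method="bfgs", disp=0)

# Likelihood-ratio test: twice the log-likelihood difference, referred
# to a chi-squared distribution whose degrees of freedom equal the
# number of extra (interaction) parameters in the full model.
degfree = int(res_subjXclass.df_model - res_subjplusclass.df_model)
lik_ratio = 2 * (res_subjXclass.llf - res_subjplusclass.llf)
p = chi2.sf(lik_ratio, degfree)
```

In the paper's analysis the test has 7 degrees of freedom, reflecting the larger number of experiment conditions; with the three placeholder conditions here, the interaction adds two parameters instead.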
For both comparisons (human subjects compared to GPT-4 and human subjects compared to Claude 3), we find a significant improvement in model fit when interactions between subject type and experiment condition are included. For the comparison between GPT-4 and human subjects, a likelihood ratio test shows that including the interactions leads to a significantly better fit of the model ($\chi^{2}(7)=115.1871$, $p<0.001$). For the comparison between Claude 3 and human subjects, we again find a significant negative effect of the subject type being Claude 3 when interactions are not included (coef = -0.8706, z = -5.608, p < 0.001), and find that the subject type by condition interactions are significant ($\chi^{2}(7)=173.6511$, $p<0.001$). These results are consistent with the observation that the two models exhibit variable performance across conditions, and indicate that the overall performance gap relative to human subjects is driven by low model accuracy in certain conditions. Simple effects analysis is used below to assess the effect of subject type in particular conditions and groups thereof.
Regression outputs are shown in Figures 7 and 8.
Figure 7: Regression outputs, GPT-4 compared to human subjects in the Semantic Structure experiment.
Figure 8: Regression outputs, Claude 3 compared to human subjects in the Semantic Structure experiment.
A.2 Regression results, Semantic Content experiment
Regressions are performed in the same manner as for the Semantic Structure experiment, described in Section A.1. Here, the reference condition is Categorial and the likelihood ratio test uses 4 degrees of freedom.
As in the Semantic Structure experiment, the performance of GPT-4 in the Semantic Content experiment is comparable to that of humans in some conditions but notably lower in others. For the comparison between GPT-4 and human subjects, a likelihood ratio test shows that a logistic model including interactions between subject type and experiment condition fits the data significantly better than one using the two predictors separately ($\chi^{2}(4)=39.6565$, $p<0.001$). The same holds for the comparison between Claude 3 and human subjects ($\chi^{2}(4)=11.6002$, $p=0.021$).
Regression outputs are shown in Figures 9 and 10.
Figure 9: Regression outputs, GPT-4 compared to human subjects in the Semantic Content experiment.
Figure 10: Regression outputs, Claude 3 compared to human subjects in the Semantic Content experiment.
A.3 Further details of human performance
Figure 11 shows the difference in performance between online subjects recruited through Prolific and in-person University-Name University students in the Semantic Structure experiment conditions. Prolific subjects were paid $1.50 for the task, with Prolific charging an additional $0.50 per subject. This equated to an effective rate of approximately $22 per hour, well above relevant minimum wages. In-person subjects were each paid $10 to reflect the increased time and effort cost. We expected better performance from the in-person subjects for several reasons. First, they were more highly remunerated. Second, the presence of a member of the research team may have created social pressure to perform well. Third, the in-person subjects were likely less affected by the attentional fatigue experienced by Prolific subjects, who may complete many unrelated and potentially demotivating tasks in a day. Fourth, admission to our university implicitly selects for academic performance, which is related to the competencies involved in completing the tasks in the experiment. Indeed, we observe that accuracy increases by approximately 0.1-0.2 for the in-person subjects in all but one condition. The exception is the Random Finals condition, in which mean accuracy decreases slightly. However, it is not clear that a decrease in “accuracy” in this condition reflects objectively worse performance: the condition asks for the drawing corresponding to a final unrelated term, whereas all previous left-hand terms within the question are related. It is thus not unreasonable to give an answer that differs from the one we expect, except insofar as subjects learn in context from previous questions in the quiz that the final unrelated word should be regarded as irrelevant.
Logistic regression analysis confirms that the in-person subjects outperform the online subjects. With the in-person subjects as the reference class, the effect of being an online subject is significantly negative whether the predictors are subject type and quiz class jointly (coef = -1.1738, p = 0.015) or subject type alone (coef = -0.6462, p < 0.001).
(Figure image: grouped bar chart of accuracy for Prolific subjects versus University students across the eight Semantic Structure conditions (Defaults, Distracted, Permuted Pairs, Permuted Questions, Random Permuted Pairs, Randoms, Only Rhs, and Random Finals), with error bars and pairwise significance markers.)
Figure 11: Accuracy comparison of online subjects recruited through Prolific and in-person University-Name University students in the Semantic Structure experiment conditions.
Figure 12 below shows the variation in performance among human subjects completing different quizzes. Some conditions aggregate over quizzes in which mean performance is quite stable (for example, Random Finals), while others aggregate over quizzes with larger variation in performance (for example, Randoms). In the first quiz of the Randoms condition, all respondents score 100%. We identified no particular features of this quiz that would explain this occurrence. However, given that the scores of successive quiz-takers are independent, that a significant proportion of all quiz-takers score 100%, and that we have 28 quizzes that each sample a number of respondents, it is not unlikely for one quiz to have all perfect scores by chance.
(Figure image: grouped bar chart of human accuracy for Quizzes 1-4 across the eight Semantic Structure conditions, with error bars.)
Figure 12: Human accuracy by quiz across conditions. Error bars show standard errors.
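The plausibility claim above admits a quick back-of-the-envelope check. The sketch below is illustrative: the rate of perfect scores and the number of respondents per quiz are assumed values, not figures from the paper; only the count of 28 quizzes comes from the text.

```python
def p_any_all_perfect(q: float, k: int, n_quizzes: int) -> float:
    """Probability that at least one of n_quizzes quizzes, each taken by
    k independent respondents who score 100% with probability q, ends up
    with all perfect scores."""
    p_one_quiz = q ** k
    return 1 - (1 - p_one_quiz) ** n_quizzes

# Assumed values for illustration: a 30% rate of perfect scores and
# 5 respondents per quiz, across the 28 quizzes in the experiment.
p = p_any_all_perfect(q=0.30, k=5, n_quizzes=28)
```

Under these assumptions the probability is on the order of a few percent, so a single all-perfect quiz among 28 is indeed unremarkable.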
One highly specific failure mode is present in the human data and deserves special consideration. In the “*” grounding only, participants quite often introduced separator characters into their responses (either “>” alone or both separator characters, “=>”). This occurred in 11 instances, affecting approximately 10% of the responses in that grounding scheme, and was not observed in any other grounding scheme. The reason is not entirely clear, but may be related to the short length of the groundings in this scheme (a grounding term is either 1 or 3 characters, shorter than in the other versions). The short grounding terms could lead subjects to perceive the separator characters as part of the grounding term, although it is unclear why this would happen even for subjects who successfully ignored the separator characters in three prior responses (which applies to 7 of the 11 such errors).
The error rate of the human subjects by grounding type is shown in Figure 13. The error rates in the “*”, “C K E”, and “Q Z I” grounding schemes were essentially equal, while the rate in the “c c” grounding scheme was approximately half as high. This is not entirely intuitive, but some possible explanations can be offered. First, the “c c” groundings have the fewest distinct characters, thus limiting the option space when answering (two distinct characters, compared to 3 or 4 for the other groundings). Second, the specific transformations involved (capitalization and adding/removing a letter) are common operations, encountered more frequently than, say, inserting a special character between existing characters.
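The error rates in Figure 13 are proportions, so one standard way to attach error bars of the kind used throughout this appendix is the normal-approximation standard error of a proportion. A minimal sketch, using hypothetical counts rather than the study's actual data:

```python
import math

def error_rate_and_se(n_errors, n_responses):
    """Proportion of incorrect responses and its standard error,
    using the normal approximation SE = sqrt(p * (1 - p) / n)."""
    p = n_errors / n_responses
    se = math.sqrt(p * (1 - p) / n_responses)
    return p, se

# Hypothetical counts for two grounding schemes (not the study's data):
# the "c c" scheme shows roughly half the error rate of the "*" scheme.
p_star, se_star = error_rate_and_se(30, 100)
p_cc, se_cc = error_rate_and_se(15, 100)
```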
As can be seen in Table 5, about a fifth of the human participants’ incorrect responses were simply copies of one of the three right-hand terms presented in the question. Very few participants made a mistake that merely reordered the correct answer, whereas about half of all incorrect answers were a wrong combination of characters drawn from the right-hand terms of the task. Note that in all of our target domains, any individual right-hand term uses only characters found in at least one of the other three. The remaining incorrect responses from human subjects, which did not fall into any of the previous categories, thus included characters that had not appeared in any of the three preceding right-hand terms. In some cases this can be explained by a distractor that confused a participant into including characters from a right-hand term that used other characters; in others, by typos or some other confusion.
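The four error categories in Table 5 (Copy Context, Scrambled, Wrong Combination, Other) can be operationalized as simple string checks. The sketch below is an illustrative reimplementation of the definitions given in the text, not the authors' classification code, and the example terms are hypothetical:

```python
from collections import Counter

def classify_error(response, context_rhs, correct_answer):
    """Assign an incorrect response to one of the four error categories
    used in Table 5 (illustrative reimplementation, not the paper's code)."""
    if response in context_rhs:
        return "copy context"
    if Counter(response) == Counter(correct_answer):
        return "scrambled"  # right characters, wrong order
    allowed = set("".join(context_rhs))
    if set(response) <= allowed:
        return "wrong combination"  # only characters seen in the context terms
    return "other"  # includes characters never shown in the question

# Hypothetical question: three context right-hand terms, correct answer "E C".
context = ["C K E", "C K", "K E"]
assert classify_error("C K", context, "E C") == "copy context"
assert classify_error("C E", context, "E C") == "scrambled"
```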
In addition to the task questions, subjects were presented with three follow-up questions that asked them to rate their confidence that their answers were correct, describe what they thought the task involved, and describe their strategy for answering the questions.
Subjects employ a mix of strategies in answering the questions. Some subjects explicitly attend to the analogy structure of the left-hand terms. For example, in the distracted condition, one participant reports that “I tried to find the one that resembled the blank one…ie, red/pink, cat/kitten”. By contrast, others focused more on completing the pattern in the right-hand terms. Most subjects robustly ignored the distractor terms in that condition.
Some subjects who report a detailed, correct strategy nevertheless fail to attain high accuracy, demonstrating that the task is not trivial even for those who fully grasp what it involves. For example, one subject attains a below-average accuracy of 50% in the distracted condition despite stating that the task involves “Looking at other comparable entries to figure out what the answer to the last entry was (dog:puppy::cat:kitten)” and reporting a strategy in which “I tried to find similar pairs of entries and looked at their meanings.” By contrast, another subject attains 100% accuracy in the distracted condition while responding to what the task involved with “I thought it was fun” and reporting the strategy “I just compared answers and tried my best to understand what they were and then tried to guess based on my interpretation of the other answers.”
Table 5: The distributions of types of errors made by top-performing participants in the Semantic Structure experiment.
| | Copy Context | Scrambled | Wrong Combination | Other |
| --- | --- | --- | --- | --- |
| Human | 0.192 | 0.020 | 0.556 | 0.232 |
| GPT-4 | 0.239 | 0.031 | 0.502 | 0.228 |
| Claude 3 Opus | 0.036 | 0.045 | 0.276 | 0.643 |
<details>
<summary>extracted/5679376/Images/incorrect_answers_by_grounding.png Details</summary>

### Visual Description
## Bar Chart: Percent of Incorrect Answers by Grounding Type
### Overview
The chart compares the percentage of incorrect answers across three groups (Human, Claude 3 Opus, GPT-4) for four grounding types: *, CKE, QZI, and c c. Values are represented as vertical bars with approximate percentages on the y-axis (0.00–0.30).
### Components/Axes
- **X-axis (Grounding Type)**: Four categories: `*`, `CKE`, `QZI`, `c c` (left to right).
- **Y-axis (Percentage of All Incorrect Answers)**: Scale from 0.00 to 0.30 in increments of 0.05.
- **Legend**: Located in the top-right corner, mapping colors to groups:
- Blue: Human
- Orange: Claude 3 Opus
- Green: GPT-4
### Detailed Analysis
1. **Grounding Type `*`**:
- Human: ~0.27
- Claude 3 Opus: ~0.25
- GPT-4: ~0.32 (highest)
2. **Grounding Type `CKE`**:
- Human: ~0.26
- Claude 3 Opus: ~0.30 (highest)
- GPT-4: ~0.24 (lowest)
3. **Grounding Type `QZI`**:
- Human: ~0.31 (highest)
- Claude 3 Opus: ~0.28
- GPT-4: ~0.26
4. **Grounding Type `c c`**:
- Human: ~0.15
- Claude 3 Opus: ~0.16
- GPT-4: ~0.17 (highest)
### Key Observations
- **GPT-4** shows the highest share of errors among the three groups in `*` and `c c`, while **Claude 3 Opus** peaks in `CKE`.
- **Human** errors are most concentrated in `QZI` (~0.31), modestly above the other groups.
- All three groups have their smallest share of errors in `c c` (~0.15–0.17).
### Interpretation
The distribution of incorrect answers over grounding types is broadly similar for humans and both models: errors are roughly evenly spread across the `*`, `CKE`, and `QZI` groundings, with `c c` accounting for the smallest share for all three groups. No group's error distribution stands out sharply, consistent with the comparable error patterns discussed in the main text.
</details>
Figure 13: Percentage of incorrect answers in the Semantic Structure experiment by target domain type.
In addition to the comparable mean performance, we find similar patterns in the errors made by humans and GPT-4. In Table 5 and Figure 13, one can see that the distribution of errors is comparable both when broken down by target domain type and when broken down by several error classifications we design.
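The similarity of the error distributions in Table 5 can be quantified, for instance, with total variation distance (half the L1 distance between two discrete distributions). This metric is an illustrative choice of ours, not one the paper reports:

```python
def total_variation(p, q):
    """Half the L1 distance between two discrete distributions
    over the same ordered categories (0 = identical, 1 = disjoint)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Error-type distributions from Table 5, ordered as
# (Copy Context, Scrambled, Wrong Combination, Other).
human = [0.192, 0.020, 0.556, 0.232]
gpt4 = [0.239, 0.031, 0.502, 0.228]
opus = [0.036, 0.045, 0.276, 0.643]

# GPT-4's error distribution is much closer to the human one than Claude 3 Opus's.
assert total_variation(human, gpt4) < total_variation(human, opus)
```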
<details>
<summary>extracted/5679376/Images/top_performers_best_fit_and_points.png Details</summary>

### Visual Description
## Line Chart Grid: Accuracy vs. Question Number by Subject Type and Experiment Condition
### Overview
The image displays a 3x8 grid of line charts comparing accuracy trends across four question sets (Q1-Q4) for three subject types (Human, Claude 3 Opus, GPT-4) under eight experimental conditions. Each chart includes error bars representing measurement uncertainty. The legend in the top-left corner maps colors to subject types: blue (Human), green (Claude 3 Opus), and purple (GPT-4).
### Components/Axes
- **X-axis**: Question Number (Q1, Q2, Q3, Q4)
- **Y-axis**: Accuracy (0.0 to 1.0 scale)
- **Legend**:
- Blue: Human
- Green: Claude 3 Opus
- Purple: GPT-4
- **Experiment Conditions** (column labels):
1. defaults
2. distracted
3. permuted_pairs
4. permuted_questions
5. random_permuted_pairs
6. randoms
7. only_rhs
8. random_finals
### Detailed Analysis
#### Human Row
1. **defaults**: Starts at ~0.9 (Q1) with gradual decline to ~0.85 (Q4). Error bars range ±0.05.
2. **distracted**: Peaks at ~0.75 (Q1), drops to ~0.65 (Q4). Error bars ±0.1.
3. **permuted_pairs**: Rises from ~0.7 (Q1) to ~0.9 (Q4). Error bars ±0.07.
4. **permuted_questions**: Sharp increase from ~0.6 (Q1) to ~0.95 (Q4). Error bars ±0.08.
5. **random_permuted_pairs**: Starts at ~0.5 (Q1), peaks at ~0.8 (Q3), drops to ~0.7 (Q4). Error bars ±0.12.
6. **randoms**: Gradual rise from ~0.6 (Q1) to ~0.9 (Q4). Error bars ±0.09.
7. **only_rhs**: Declines from ~0.9 (Q1) to ~0.8 (Q4). Error bars ±0.06.
8. **random_finals**: Starts at ~0.5 (Q1), rises to ~0.7 (Q4). Error bars ±0.1.
#### Claude 3 Opus Row
1. **defaults**: Flat line at ~0.95 across all Qs. Error bars ±0.03.
2. **distracted**: Drops to ~0.8 (Q1), stabilizes at ~0.75 (Q4). Error bars ±0.05.
3. **permuted_pairs**: Declines from ~0.9 (Q1) to ~0.6 (Q4). Error bars ±0.1.
4. **permuted_questions**: Flat at ~0.95 (Q1-Q3), drops to ~0.8 (Q4). Error bars ±0.04.
5. **random_permuted_pairs**: Flat at ~0.7 (Q1-Q2), drops to ~0.5 (Q4). Error bars ±0.15.
6. **randoms**: Flat at ~0.85 (Q1-Q3), drops to ~0.7 (Q4). Error bars ±0.06.
7. **only_rhs**: Declines from ~0.95 (Q1) to ~0.85 (Q4). Error bars ±0.05.
8. **random_finals**: Flat at ~0.6 (Q1-Q3), rises to ~0.7 (Q4). Error bars ±0.1.
#### GPT-4 Row
1. **defaults**: Flat at ~0.98 across all Qs. Error bars ±0.02.
2. **distracted**: Drops to ~0.9 (Q1), stabilizes at ~0.85 (Q4). Error bars ±0.03.
3. **permuted_pairs**: Rises from ~0.8 (Q1) to ~0.95 (Q4). Error bars ±0.05.
4. **permuted_questions**: Flat at ~0.98 (Q1-Q3), drops to ~0.9 (Q4). Error bars ±0.02.
5. **random_permuted_pairs**: Flat at ~0.9 (Q1-Q2), drops to ~0.7 (Q4). Error bars ±0.04.
6. **randoms**: Flat at ~0.95 (Q1-Q3), drops to ~0.8 (Q4). Error bars ±0.03.
7. **only_rhs**: Declines from ~0.98 (Q1) to ~0.9 (Q4). Error bars ±0.02.
8. **random_finals**: Flat at ~0.7 (Q1-Q3), rises to ~0.8 (Q4). Error bars ±0.05.
### Key Observations
1. **Human Performance**:
- Struggles most in "distracted" and "random_permuted_pairs" conditions.
- Shows improvement in "permuted_questions" and "randoms" conditions.
- Highest variability in "random_permuted_pairs" (±0.12 error bars).
2. **Claude 3 Opus**:
- Maintains high accuracy in "defaults" and "permuted_questions" but declines sharply in "permuted_pairs" and "random_permuted_pairs".
- "random_finals" shows late improvement despite low initial performance.
3. **GPT-4**:
- Consistently high accuracy across all conditions, with minimal drops in "only_rhs" and "randoms".
- "random_finals" shows late-stage improvement despite low overall accuracy.
### Interpretation
The data reveal distinct performance patterns between human and AI subjects:
- **Humans** exhibit condition-dependent variability, with accuracy influenced by task structure (e.g., some permutations reduce performance).
- **Claude 3 Opus** is strong in structured conditions but weaker in the randomized ones.
- **GPT-4** maintains high accuracy across most conditions, with declines concentrated in the randomized tasks.
Notably, the "random_finals" condition shows late improvement for all subjects, consistent with within-quiz learning. Human error bars are larger than the models' (±0.1 vs. ±0.02 for GPT-4), reflecting the smaller number of human responses per cell.
</details>
Figure 14: Improvement in human and LLM accuracy by question number across different conditions. Error bars show standard errors.
Further, humans and GPT-4 both improve as they see more questions over the course of a quiz. As seen in Figure 14, humans display a positive learning trend in 5 out of 8 conditions, and GPT-4 in a comparable 6 out of 8; in one of the two conditions where GPT-4 shows no improvement (Only RHS), this is because its accuracy is near-perfect from start to finish.
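A learning trend of the kind shown in Figure 14 can be summarized by the slope of a least-squares fit of accuracy against question position, with a positive slope indicating improvement over the quiz. The exact trend statistic used in the figure is not specified here, so the following is only a sketch with hypothetical per-question accuracies:

```python
def learning_trend(accuracy_by_question):
    """Least-squares slope of accuracy over question position (Q1..Qn).
    A positive slope indicates improvement over the course of a quiz.
    Illustrative sketch, not the paper's exact trend statistic."""
    n = len(accuracy_by_question)
    xs = range(1, n + 1)
    x_mean = sum(xs) / n
    y_mean = sum(accuracy_by_question) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, accuracy_by_question))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

# Hypothetical per-question accuracies in two conditions.
assert learning_trend([0.6, 0.7, 0.8, 0.95]) > 0   # positive learning trend
assert abs(learning_trend([0.98] * 4)) < 1e-9      # flat, e.g. near-ceiling
```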
A.4 Further details of LLM performance
Figure 15 shows the performance of all tested models in the Semantic Structure experiment.
<details>
<summary>extracted/5679376/Images/Aggregate_Accuracy_Comparison.png Details</summary>

### Visual Description
## Bar Chart: Accuracy by Condition
### Overview
The chart compares the accuracy of various AI models and human performance across eight experimental conditions: Defaults, Distracted, Permuted Pairs, Permuted Questions, Random Permuted Pairs, Randoms, Only RHS, and Random Finals. Accuracy values range from 0 to 1, with error bars indicating variability. The legend identifies seven entities: Human, GPT-4, GPT-3, Claude 3 Opus, Claude 2, Falcon-40B, and Pythia-12B-Deduped.
### Components/Axes
- **X-axis (Condition)**: Categorical labels for experimental conditions (e.g., Defaults, Distracted, Permuted Pairs).
- **Y-axis (Accuracy)**: Numerical scale from 0 to 1, representing accuracy percentages.
- **Legend**: Positioned in the top-right corner, mapping colors to entities:
- Blue: Human
- Orange: GPT-4
- Green: GPT-3
- Red: Claude 3 Opus
- Purple: Claude 2
- Brown: Falcon-40B
- Pink: Pythia-12B-Deduped
### Detailed Analysis
1. **Defaults**:
- Human: ~0.88 (±0.05)
- GPT-4: ~0.78 (±0.04)
- Claude 3 Opus: ~0.81 (±0.03)
- Claude 2: ~0.58 (±0.04)
- GPT-3: ~0.48 (±0.03)
- Falcon-40B: ~0.10 (±0.02)
- Pythia-12B-Deduped: Not visible (assumed near 0).
2. **Distracted**:
- Human: ~0.68 (±0.06)
- GPT-4: ~0.48 (±0.05)
- Claude 3 Opus: ~0.60 (±0.04)
- Claude 2: ~0.40 (±0.05)
- GPT-3: ~0.30 (±0.04)
- Falcon-40B: ~0.10 (±0.03)
- Pythia-12B-Deduped: Not visible.
3. **Permuted Pairs**:
- Human: ~0.84 (±0.04)
- GPT-4: ~0.56 (±0.05)
- Claude 3 Opus: ~0.52 (±0.04)
- Claude 2: ~0.48 (±0.05)
- GPT-3: ~0.34 (±0.04)
- Falcon-40B: ~0.05 (±0.02)
- Pythia-12B-Deduped: Not visible.
4. **Permuted Questions**:
- Human: ~0.80 (±0.05)
- GPT-4: ~0.74 (±0.04)
- Claude 3 Opus: ~0.99 (±0.02)
- Claude 2: ~0.64 (±0.05)
- GPT-3: ~0.40 (±0.04)
- Falcon-40B: ~0.05 (±0.02)
- Pythia-12B-Deduped: Not visible.
5. **Random Permuted Pairs**:
- Human: ~0.70 (±0.06)
- GPT-4: ~0.32 (±0.05)
- Claude 3 Opus: ~0.20 (±0.04)
- Claude 2: ~0.20 (±0.05)
- GPT-3: ~0.05 (±0.02)
- Falcon-40B: ~0.05 (±0.02)
- Pythia-12B-Deduped: Not visible.
6. **Randoms**:
- Human: ~0.79 (±0.05)
- GPT-4: ~0.16 (±0.04)
- Claude 3 Opus: ~0.36 (±0.05)
- Claude 2: ~0.20 (±0.05)
- GPT-3: ~0.05 (±0.02)
- Falcon-40B: ~0.05 (±0.02)
- Pythia-12B-Deduped: Not visible.
7. **Only RHS**:
- Human: ~0.88 (±0.04)
- GPT-4: ~0.90 (±0.03)
- Claude 3 Opus: ~0.76 (±0.04)
- Claude 2: ~0.24 (±0.05)
- GPT-3: ~0.39 (±0.04)
- Falcon-40B: ~0.07 (±0.02)
- Pythia-12B-Deduped: Not visible.
8. **Random Finals**:
- Human: ~0.48 (±0.06)
- GPT-4: ~0.06 (±0.03)
- Claude 3 Opus: ~0.28 (±0.05)
- Claude 2: ~0.36 (±0.05)
- GPT-3: ~0.09 (±0.03)
- Falcon-40B: ~0.02 (±0.01)
- Pythia-12B-Deduped: Not visible.
### Key Observations
- **Human Performance**: Highest in most conditions, with notable drops in Distracted (~0.68) and Random Finals (~0.48).
- **Top Models**: GPT-4 and Claude 3 Opus outperform others in most conditions, with Claude 3 Opus achieving near-perfect accuracy (~0.99) in Permuted Questions.
- **Low Performers**: Falcon-40B and Pythia-12B-Deduped show minimal accuracy, often below 0.10, except in Defaults (~0.10 for Falcon-40B).
- **Error Bars**: Largest variability in Distracted (~±0.06 for Human) and Randoms (~±0.05 for GPT-4), suggesting sensitivity to noise.
### Interpretation
The data indicate that humans and the strongest models (GPT-4, Claude 3 Opus) maintain robust accuracy across many conditions, while all models lose accuracy sharply in the randomized conditions (Random Permuted Pairs, Randoms, Random Finals). Claude 3 Opus performs near ceiling (~0.99) in Permuted Questions. The weaker models (Falcon-40B, Pythia-12B-Deduped) rarely exceed 0.10 in any condition, suggesting limited ability to perform the task at all. Error bars tend to be larger in the more difficult conditions.
</details>
Figure 15: Performance of all tested models in the Semantic Structure experiment. Error bars show standard errors.
Figure 16 shows the variation in the performance of GPT-4 in two conditions (Only RHS and Random Finals) across various small differences in prompting strategy. With small changes to the prompt, performance in the Only RHS condition varies between approximately 20% and approximately 100%, and performance in the Random Finals condition varies between near-zero and close to 40%. This variation is substantial, but it appears largely explicable.

In the Only RHS condition, most of the variation comes from whether the prompt makes clear that a final term is desired next, as opposed to a new question. In the other conditions, an arrow separator that divides left- and right-hand terms is the final element of the prompt, suggesting that a right-hand term is the appropriate continuation. In the Only RHS condition this trailing separator was initially absent, and the models often responded by beginning a new question rather than completing the one presented. Re-introducing arrow separators, and making other small changes designed to indicate more clearly when a question has not yet been completed, eliminates these errors and drastically increases performance.

In the Random Finals condition, a significant improvement comes from changing the instruction sentence from one specifying that a drawing of the left-hand side is requested to one specifying that various patterns will be shown, after which the last should be completed. This is reasonable: in this condition the final left-hand term is misleading, so an instruction focusing attention on it is expected to reduce performance. As expected, no performance improvement is observed when replacing the set of random final words with different ones.
Finally, a performance boost is observed when adding additional newlines (instead of only a blank line between questions, we also include a blank line between each line of a question). It is not clear why this should improve performance.
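One plausible reading of the "Only Rows" versus "Rows + Arrows" contrast in Figure 16 can be sketched as follows. The helper and its row format are hypothetical reconstructions, not the authors' prompt code, but they illustrate how re-introducing the arrow separator leaves the prompt ending in "=>", signaling that a completion rather than a new question is expected:

```python
def build_only_rhs_prompt(rhs_terms, with_arrows=True):
    """Format an Only RHS question either as bare rows of right-hand terms
    ('Only Rows') or with an arrow separator on each row plus a trailing
    '=>' ('Rows + Arrows'), so the prompt visibly asks for one more term.
    Hypothetical reconstruction of the two Figure 16 prompt variants."""
    if with_arrows:
        rows = [f"=> {term}" for term in rhs_terms] + ["=>"]
    else:
        rows = list(rhs_terms)
    return "\n".join(rows)

# With arrows, the prompt ends in '=>'; without, it ends on a complete row,
# which may invite the model to start a new question instead.
assert build_only_rhs_prompt(["* *", "* * *"]).endswith("=>")
assert not build_only_rhs_prompt(["* *", "* * *"], with_arrows=False).endswith("=>")
```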
<details>
<summary>extracted/5679376/Images/gpt4_tweak_dependance.png Details</summary>

### Visual Description
## Bar Chart: GPT-4 Accuracy by Presentation Style
### Overview
The chart compares the accuracy of GPT-4 across six presentation styles, with error bars indicating variability. The y-axis represents accuracy (0.0–1.0), and the x-axis lists presentation styles. All bars are blue, with no additional legend elements beyond the color coding.
### Components/Axes
- **Title**: "GPT-4 Accuracy by Presentation Style" (top-center).
- **X-Axis**: Labeled "Presentation Style" with categories:
- "Only Rows"
- "Rows + Arrows"
- "Random Finals Drawing"
- "Random Finals Pattern"
- "Random Finals Randomer"
- "Random Finals Newlines"
- **Y-Axis**: Labeled "Accuracy" (0.0–1.0 in increments of 0.2).
- **Legend**: Not explicitly visible; all bars use a single blue color.
### Detailed Analysis
1. **"Rows + Arrows"**: Tallest bar at ~0.83 accuracy (error bar ±0.05).
2. **"Only Rows"**: ~0.25 accuracy (error bar ±0.03).
3. **"Random Finals Drawing"**: ~0.07 accuracy (error bar ±0.02).
4. **"Random Finals Pattern"**: ~0.22 accuracy (error bar ±0.03).
5. **"Random Finals Randomer"**: ~0.22 accuracy (error bar ±0.03).
6. **"Random Finals Newlines"**: ~0.31 accuracy (error bar ±0.04).
### Key Observations
- "Rows + Arrows" achieves the highest accuracy (~83%), significantly outperforming all other styles.
- Random presentation styles cluster between ~7% and ~31% accuracy, with "Newlines" being the most effective among them.
- Error bars suggest moderate variability, particularly for "Rows + Arrows" (±5%) and "Newlines" (±4%).
### Interpretation
The chart compares prompt variants within two conditions rather than across a single task: "Only Rows" and "Rows + Arrows" are variants of the Only RHS condition, while the four "Random Finals" bars are variants of the Random Finals condition. Within Only RHS, adding arrow separators raises accuracy from ~0.25 to ~0.83, indicating that making the expected completion explicit matters greatly. Within Random Finals, accuracy stays low across variants (~0.07–0.31), with the "Newlines" variant performing best. Overall, small formatting choices have a large effect on GPT-4's measured accuracy.
</details>
Figure 16: Dependence of GPT-4 accuracy on prompt variations. Error bars show standard errors.
| Prompt | We are conducting an experiment on general reasoning abilities. Below we will show you various words and drawings of each, after which you will need to complete the last drawing. Respond as concisely as possible with only the last drawing.<br>Question 1:<br>chicken => !<br>spider => !<br>cat => *<br>horse => *<br>ant => !<br>dog => *<br>bee => !<br>human => *<br>Question 2:<br>car => *<br>tricycle => !<br>motorcycle => *<br>skateboard => !<br>bicycle => !<br>unicycle =><br>——— SAMPLED RESPONSE SET ———<br>! [Continuation omitted]<br>—————————————-<br>! [Continuation omitted]<br>—————————————-<br>* [Continuation omitted]<br>—————————————-<br>! [Continuation omitted]<br>—————————————-<br>* [Continuation omitted] |
| --- | --- |
Table 6: An illustrative response from Falcon-40B in the Categorical condition of the Semantic Content experiment. Observe that the model provides both incorrect and correct responses to the question, seeming to recognize the form of a correct response but not to reason further about correctness.
| Prompt | We are conducting an experiment on general reasoning abilities. Below we will show you various words and drawings of each, after which you will need to complete the last drawing. Respond as concisely as possible with only the last drawing.<br>Question 1:<br>spider => * * * * * * * *<br>human => * *<br>cat => * * * *<br>chicken => * *<br>dog => * * * *<br>horse => * * * *<br>bee => * * * * * *<br>Question 2:<br>motorcycle => * *<br>tricycle => * * *<br>bicycle => * *<br>unicycle => *<br>car =><br>——— SAMPLED RESPONSE SET ———<br>* * *<br>—————————————-<br>* *<br>—————————————-<br>* *<br>—————————————-<br>* *<br>—————————————-<br>* * |
| --- | --- |
Table 7: An illustrative response from GPT-4 in the Numerical condition of the Semantic Content experiment. Observe that the model fails to correctly relate the number of characters to the numerical property of the object, in this case the number of wheels a car has.
| Prompt | * * * + * = * * * *<br>* * * * * – * * = * * *<br>* * * * * * – * * * = |
| --- | --- |
| Responses | * * * * * * (first response)<br>* * * * * * (second response)<br>* * * * * * * (third response) |
| Expected result | * * * |
Table 8: A sanity check showing that GPT-4 fails to reason about the number of characters in the expected way. Settings: temperature 1, maximum length 256, top P 1.