## In-Context Analogical Reasoning with Pre-Trained Language Models
Xiaoyang Hu 1,2 ∗ Shane Storks 1 ∗ Richard L. Lewis 2 † Joyce Chai 1 †
1 Computer Science and Engineering Division, University of Michigan
2 Department of Psychology, University of Michigan
{nickhu, sstorks, rickl, chaijy}@umich.edu
## Abstract
Analogical reasoning is a fundamental capacity of human cognition that allows us to reason abstractly about novel situations by relating them to past experiences. While it is thought to be essential for robust reasoning in AI systems, conventional approaches require significant training and/or hard-coding of domain knowledge to be applied to benchmark tasks. Inspired by cognitive science research that has found connections between human language and analogy-making, we explore the use of intuitive language-based abstractions to support analogy in AI systems. Specifically, we apply large pre-trained language models (PLMs) to visual Raven's Progressive Matrices (RPM), a common relational reasoning test. By simply encoding the perceptual features of the problem into language form, we find that PLMs exhibit a striking capacity for zero-shot relational reasoning, exceeding human performance and nearing supervised vision-based methods. We explore different encodings that vary the level of abstraction over task features, finding that higher-level abstractions further strengthen PLMs' analogical reasoning. Our detailed analysis reveals insights on the role of model complexity, in-context learning, and prior knowledge in solving RPM tasks.
## 1 Introduction
Humans are constantly presented with novel problems and circumstances. Rather than understand them in isolation, we try to connect them with past experiences. With any luck, we might find an analogy: a mapping between relevant aspects of this new situation and a past situation, which helps form abstractions that allow us to reason more effectively in the future (Holyoak, 1984). Analogy is thought to underpin humans' robust reasoning and problem-solving capabilities (Hofstadter and
∗ Authors contributed equally to this work.
† Equal advising contribution.
Figure 1: Raven's Progressive Matrices (Raven and Court, 1938; Zhang et al., 2019a) are an analogy-making task where one must infer the missing matrix item based on abstract rules instantiated in the first two rows. To demonstrate the potential analogical reasoning skills of pre-trained language models, we develop language-based abstractions over their key perceptual features, then prompt them to select the completion of the matrix.
Sander, 2013), and thus it is believed to be a prerequisite for enabling the same in AI systems. However, conventional approaches struggle with analogy-making, and must be trained on thousands of examples to achieve any success on benchmark tasks. This is unsatisfying, as humans are capable of analogy-making without explicit training, and such analogy-making should enable zero-shot generalization to new situations (Mitchell, 2021).
Interestingly, a body of work in cognitive science suggests that analogy-making and relational reasoning are connected to humans' symbol system and language capabilities (Gentner, 2010). For example, Gordon (2004) finds that members of an Amazonian tribe that count only with words for 'one,' 'two,' and 'many' struggle to make analogies with higher numbers. Further, Gentner et al. (2013) find that deaf children whose sign language does not involve spatial relations are outperformed by hearing children on a spatial relational reasoning task, while Christie and Gentner (2014) find that assigning even nonsensical names to relations enhances children's relational reasoning. All of this demonstrates that language serves as a powerful way for humans to abstract and better reason about the overwhelming and complex percepts we encounter in the world.
In this work, we explore whether language may serve a similar purpose in AI systems. Specifically, we apply contemporary autoregressive pre-trained language models (PLMs) to Raven's Progressive Matrices (RPM), an example of which is shown in Figure 1. RPM is a widely used psychometric test for relational reasoning that requires inducing an abstract rule from just two examples of short sequences of groups of shapes, and then applying the rule to complete a new partial sequence (Raven and Court, 1938). This task makes minimal assumptions about the test taker's prior knowledge, and is thus thought to provide a good estimate for general intelligence (Holyoak, 2012). On the RAVEN dataset (Zhang et al., 2019a), we find that given the ability to perceive key features of RPMs, large PLMs exhibit a surprising capacity for zero-shot relational reasoning, approaching that of supervised vision-based deep learning approaches and even humans. We propose three levels of abstraction over the language features of the task using name assignment and task decomposition, and find that each abstraction further strengthens PLMs' relational reasoning. Our results and detailed analysis offer insights on PLM performance, including the role of models' complexity, in-context learning, and prior knowledge in emergent relational reasoning, and suggest that they could play an important role in future cognitive architectures for analogy-making. 2
## 2 Related Work
Past work has studied analogy in AI across various domains. Mitchell (2021) provides a comprehensive overview of these efforts, especially those applied in idealized symbolic domains. Here, symbolic and probabilistic methods have traditionally been applied (Gentner, 1983; Hofstadter and Mitchell, 1994; Lake et al., 2015). However, these
2 Experiment code is available at https://github.com/hxiaoyang/lm-raven.
approaches typically require hard-coding domain-specific concepts, and require substantial search through domain knowledge to operate on their target problems, making them unscalable. The creation of large-scale image datasets for such analogy tasks (Zhang et al., 2019a; Hu et al., 2021; Odouard and Mitchell, 2022) has enabled further research with deep learning and neuro-symbolic methods (Hill et al., 2019; Spratley et al., 2020; Kim et al., 2020; Zhang et al., 2021), which bring the advantage of requiring less ad-hoc encoding of domain knowledge, but require thousands of training examples to learn the tasks, still limiting their generalization capability.
Other work has explored AI systems' analogy-making in real-world domains, including in natural images (Teney et al., 2020; Bitton et al., 2022) and language (Li et al., 2020; Chen et al., 2022; Sultan and Shahaf, 2022), especially lexical analogies (Turney et al., 2003; Turney, 2008; Speer et al., 2008; Mikolov et al., 2013b,a; Linzen, 2016; Lu et al., 2019). However, these domains make it difficult to control the prior knowledge required to solve tasks (Mitchell, 2021), and in the context of recent generative foundation models that are extensively pre-trained on natural data, it becomes difficult to separate analogy learning from distributional patterns that can be overfit. Unlike prior work, we apply such foundation models for language to analogical reasoning in a zero-shot setting, bypassing the requirement of hard-coding domain knowledge or training models on task-specific data. Furthermore, while contemporaneous work has applied PLMs to a variety of simpler relational reasoning tasks in language (Webb et al., 2022), we systematically explore the advantage of using language to abstract over complex visual features of the task, opening questions about how the powerful symbol systems learned in PLMs may support robust, perception-driven reasoning in future AI systems.
## 3 Raven's Progressive Matrices
Raven's Progressive Matrices (RPM) are abstract relational reasoning tasks used in cognitive psychology to test humans' analogy-making (Raven and Court, 1938). Each instance of RPM is a matrix consisting of 9 items arranged in a square, the last of which must be selected from a set of choices. Each item consists of several perceptual attributes, such as shape, color, or more abstract features. Within each row of the matrix, a relation is applied
Figure 2: Illustration of the compositional nature of entities, layouts, and component structures in RAVEN, and their unique attributes. We provide example items from the sub-tasks each item type appears in.
over these attributes, such as progression of numerical values associated with these attributes. Given the first two rows of the matrix, the challenge of the task is to identify the relations being applied to items, and apply them analogously in the third row to infer the missing ninth item. Successfully solving an RPM requires tackling two sub-problems: perception of each item's attributes, and reasoning over multiple items' attributes to infer and apply relations.
## 3.1 RAVEN Dataset
We focus our study on RAVEN (Zhang et al., 2019a), which provides a large-scale benchmark for RPM tasks for training and evaluation of AI systems. Each RPM has 8 possible candidate items to complete it. As shown in Figure 2, each item may consist of compositional entities, layouts, and/or component structures, and RAVEN provides a suite of increasingly complex sub-tasks built from these elements. We introduce their unique attributes below, as well as relations that may occur over them across items in the matrix.
Entities. A single entity has a type (i.e., shape), size, and color selected from a small number of classes. Each of these attributes is associated with a number: type with the number of sides in the entity's shape, size with its diameter, and color with the darkness of its shading. The simplest sub-task of RAVEN is Center, where each item only consists of a single entity.
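As a concrete illustration, the numeric associations above can be sketched as a small lookup. The shape inventory and value ranges here are assumptions for illustration, not RAVEN's exact class lists:

```python
# Illustrative ordinal values for entity attributes; the shape inventory
# here is an assumption, not RAVEN's exact class list.
TYPE_SIDES = {"triangle": 3, "square": 4, "pentagon": 5, "hexagon": 6}

def entity_values(shape, diameter, shade):
    # type -> number of sides, size -> diameter, color -> darkness of shade
    return (TYPE_SIDES[shape], diameter, shade)
```

Each attribute thus maps to a number whose ordering is meaningful, which is what allows relations like Progression and Arithmetic to operate on it.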
Layouts. Layouts of entities bring additional higher-level attributes to items, specifically the number (i.e., count) and position of entities within a layout. In the 2x2Grid and 3x3Grid sub-tasks of RAVEN, each item consists of multiple entities arranged in a grid.
Component structures. Items may also be composed of multiple sub-items or components; RAVEN includes four sub-tasks that introduce this even higher-level challenge: L-R, U-D, and O-IC, each of which consists of two single entities in different configurations, and O-IG, which consists of a 2-by-2 grid inside of a larger entity.
Relations. Following prior work on this task, RAVEN applies four different relations to item attributes across rows of the matrix. These are Constant, which does not modify an attribute; Progression, which increases or decreases the value of an attribute by 1 or 2; Arithmetic, which performs addition or subtraction on the first two attributes of the row to create the third; and Distribute Three, which distributes three consistent values of an attribute across each row.
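To make the rule semantics concrete, the four relations can be sketched as predicates over the three values an attribute takes across one row. This is a simplified sketch; RAVEN's actual rule application differs in detail:

```python
def constant(row):
    # Constant: the attribute value is unchanged across the row.
    return row[0] == row[1] == row[2]

def progression(row):
    # Progression: the value increases or decreases by 1 or 2 each step.
    step = row[1] - row[0]
    return step in (-2, -1, 1, 2) and row[2] - row[1] == step

def arithmetic(row):
    # Arithmetic: the third value is the sum or difference of the first two.
    return row[2] == row[0] + row[1] or row[2] == row[0] - row[1]

def distribute_three(row):
    # Distribute Three: three distinct values are spread across each row.
    return len(set(row)) == 3
```

A rule holds for an attribute when it is satisfied by both complete rows; a candidate completes the matrix when the same rule also holds in the third row.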
## 4 Methods
In order to apply PLMs to RAVEN, we abstract the visual features of the task into language. Our abstractions are intentionally applied on a per-item basis to tackle the perception problem of the task without giving the PLM explicit hints toward the reasoning problem (which requires capturing patterns over multiple items). This allows us to focus on evaluating the reasoning capabilities of PLMs. 3
First, we introduce our multi-level abstractions for the RAVEN dataset. 4 Then we formally define the interface between PLMs and the RPM task.
## 4.1 Abstractions in RAVEN
We define abstractions for entity-level attributes, layout-level attributes, and component structures which convert the RPM task into one or more text prompts. We apply two kinds of abstractions: naming and decomposition. As discussed in Section 1, assigning names to perceptual features strengthens humans' analogy-making skills over them. Inspired by this, naming abstractions abstract over attributes or combinations of attributes in the RPM by assigning a unique name to describe them.
3 As the important features of RAVEN are simple, the perception of an individual item is better performed by computer vision models, and can already be done to fairly high accuracy (Zhang et al., 2021). For more general-purpose analogy-making beyond idealized domains, the robust perception of key features that allow previous (source) experiences to be mapped to novel (target) experiences is a challenging unsolved problem (Mitchell, 2021).
4 Some example PLM prompts using these abstractions are shown in this section, while more examples are provided in Appendix C.
Figure 3: Example generated prompts for a complete RPM under entity attribute naming (left) and decomposition (right) abstractions in the Center sub-task.
Figure 4: Example of generated entity layout encodings when abstracting position and number , and summarizing redundant entity attributes within the layout.
Meanwhile, jointly understanding and tracking the complex features of the task can become a burden even for humans. Inspired by humans' capability to decompose complex tasks into independent sub-tasks (Lee and Anderson, 2001), decomposition abstractions split the RPM into multiple sub-matrices by its independent features, then generate a separate prompt for each one. We can then prompt a PLM once for each sub-matrix, and aggregate PLM outputs to choose a candidate matrix completion. 5
## 4.1.1 Entity-Level Abstractions
As shown in Figure 3, we can abstract perceptual entity attributes into language by assigning them names, then generating prompts to represent the full RPM using these names. As each of an entity's attributes is numerical by nature, we assign each attribute an ordinal numerical name; type is named by the number of sides of the associated shape (e.g., '3' for triangle), size is named by a decimal representing its diameter, and color is named based on the darkness of the entity's shade. As each of an entity's attributes is independent, i.e., a relation over one attribute has no connection to relations over other attributes, we can decompose the RPM task by these attributes into three separate sub-tasks with their own prompts.
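A minimal sketch of these two entity-level abstractions, assuming each item is a (type, size, color) tuple of numeric names. The exact prompt formatting here is illustrative, not the paper's verbatim format:

```python
def naming_prompt(items, candidate):
    # Naming abstraction: render all 9 items (the 8 given matrix items plus
    # one candidate completion) as (type, size, color) tuples, row by row.
    rows = items + [candidate]
    lines = [", ".join(f"({t}, {s}, {c})" for t, s, c in rows[i:i + 3])
             for i in range(0, 9, 3)]
    return "\n".join(lines)

def decomposed_prompts(items, candidate):
    # Decomposition abstraction: each attribute is independent, so emit one
    # sub-prompt per attribute (type, size, color) and score them separately.
    rows = items + [candidate]
    prompts = []
    for attr in range(3):
        vals = [str(rows[i][attr]) for i in range(9)]
        lines = [", ".join(vals[i:i + 3]) for i in range(0, 9, 3)]
        prompts.append("\n".join(lines))
    return prompts
```

Under decomposition, each candidate completion yields three short sub-prompts whose PLM scores are later aggregated, rather than one prompt mixing all attributes.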
5 A more formal definition for decomposition is provided in Section 4.2.
## 4.1.2 Layout-Level Abstractions
As shown in Figure 4, we next propose abstractions for layouts of entities (e.g., in grid-based sub-tasks of RAVEN). First, the number attribute of a layout corresponds to the count of entities in it. Recognizing number requires implicitly counting entities within a layout, which may be difficult to disentangle from other attributes. As such, we directly expose this attribute by extracting this count and encoding it in text. Since this layout attribute is independent from other attributes, we can again decompose the task and consider it separately from entity attributes.
The position attribute encodes even more complex information about a layout, and relations over it may move entities around within the layout. However, an occupancy map serves as a strong naming abstraction for position, which omits distracting details of specific entities while exposing key information for detecting relations over it. We generate the occupancy map as an array of text representing the occupancy of the layout, and decompose this from other attributes. Notably, this abstraction provides a unique language description for each possible global configuration of entities within a layout, allowing the PLM to disentangle global and local patterns in the problem, a helpful capability of humans (Robertson and Lamb, 1991). 6
In RAVEN, relations are applied to specific attributes consistently across all entities in a layout. As our layout-level abstractions make explicit the key features of layouts, we no longer need to track entity-level attributes for specific entities within them. Specifically, rather than supply a PLM with a separate grid-like prompt for each entity-level attribute, we simply provide a list of unique attribute values. This reduces the complexity added by layouts of multiple entities.
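The layout-level abstractions above (the number count, the position occupancy map, and the unique-value summary of entity attributes) can be sketched as follows, assuming a layout is given as a set of occupied grid cells. The textual formats are illustrative, not the paper's verbatim encodings:

```python
def number_encoding(occupied):
    # "number" abstraction: expose the entity count directly as text.
    return str(len(occupied))

def occupancy_map(occupied, size=3):
    # "position" abstraction: a textual occupancy map that uniquely names
    # each global configuration (1 = occupied cell, 0 = empty cell).
    return str([[1 if (r, c) in occupied else 0 for c in range(size)]
                for r in range(size)])

def summarize_attribute(values):
    # Relations apply uniformly to all entities in a layout, so a list of
    # unique attribute values can replace a per-entity grid of values.
    return str(sorted(set(values)))

cells = {(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)}  # an 'L'-shaped layout
```

Each function yields a separate, short sub-prompt, so the PLM never has to track individual entities within the grid.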
## 4.1.3 Structural Decomposition Abstractions
In cases with multiple components in each item, we may find that prompts become long and complicated with earlier approaches. Since each component's attributes and relations are independent, we can alternatively decompose the task by its components. For each component, we can generate a prompt through entity attribute naming abstractions as shown in Figure 3 (left), or we can apply
6 For example, we may recognize the grid of entities in Figure 2 to be in an 'L' shape at the global level, while also recognizing that it is locally composed of triangles.
the higher-level abstractions over entity and layout attributes shown in Figure 4, thus decomposing each component's prompts into prompts for each attribute. As this structural decomposition converts multi-component problems into several simpler single-component, single-attribute problems, the complexity added by multiple components is abstracted away.
## 4.2 Problem Definition
Formally, a complete RPM $M$ consists of 9 matrix items $m_{ij}$ where row and column indices $i, j \in \{1, 2, 3\}$. As discussed in Section 3.1, an individual item $m_{ij}$ in the RAVEN dataset is formalized by high-level components consisting of layout-level attributes and entity-level attributes. Given all items in $M$ except for $m_{33}$, the task is to identify $m_{33}$ from a set $Y$ of 8 choices by identifying abstract rules over the attributes within the first 2 rows of $M$, and selecting the candidate $m_{33}$ that correctly applies these rules in the third row.
Applying PLMs. We apply PLMs to RAVEN in a zero-shot setting. In the absence of decomposition abstractions, we define $\mathbb{L}$ as the mapping of a complete RPM to a text prompt. The PLM's choice for m 33 is given by

$$\arg \max _ { y \in Y } \frac { 1 } { \left | \mathbb { L } \left ( m _ { 1 1 \colon 3 2 } , y \right ) \right | } \log \Pr \left ( \mathbb { L } \left ( m _ { 1 1 \colon 3 2 } , y \right ) \right )$$

where $| \mathbb{L} ( \cdot ) |$ denotes the number of tokens in the prompt. When decomposition is introduced, $\mathbb{L}$ instead returns multiple prompts, and the (token-length normalized) log-probabilities of all sub-prompts are summed. 7
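This selection rule can be sketched as follows, with `log_prob` and `num_tokens` serving as hypothetical stand-ins for a real PLM's summed token log-probability and token count:

```python
def score(prompts, log_prob, num_tokens):
    """Sum of token-length-normalized log-probabilities over sub-prompts;
    with a single prompt this reduces to the un-decomposed rule."""
    return sum(log_prob(p) / num_tokens(p) for p in prompts)

def choose(candidates, make_prompts, log_prob, num_tokens):
    """Select the candidate answer whose prompt(s) the model scores highest."""
    return max(candidates,
               key=lambda y: score(make_prompts(y), log_prob, num_tokens))

# Toy stand-ins: shorter prompts get higher (less negative) log-probability.
best = choose(
    candidates=["aaa", "b"],
    make_prompts=lambda y: [y],   # one prompt per candidate (no decomposition)
    log_prob=lambda p: -len(p),
    num_tokens=lambda p: 1,
)
print(best)  # "b"
```

With decomposition, `make_prompts` would return one prompt per sub-problem, and the normalized scores are simply summed as described above.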
## 5 Experimental Results
Now, we can examine the impact each of these language-based abstractions has on the performance of transformer-based, autoregressive PLMs in relational reasoning on RAVEN. To further understand their impact with respect to model complexity, we evaluate a range of model sizes: 8 OPT 125M, 1.3B, and 13B (Zhang et al., 2022), along with GPT-3 (Brown et al., 2020). 9 Models are evaluated on a random subset of 500 testing examples from each sub-task of RAVEN.
7 See Appendix C for examples of decomposing prompts.
8 Results on additional model sizes in Appendix A.
9 Specifically, we use the text-davinci-002 variant of InstructGPT (Ouyang et al., 2022) through a Microsoft Azure OpenAI deployment.
After introducing some comparison approaches, we present the experimental results from our applied abstractions on PLMs' entity-level, layout-level, and component-level relational reasoning. Afterward, we dive deeper with an analysis of how both our abstractions and in-context learning contribute to model performance.
## 5.1 Comparison Approaches
To contextualize our findings, we provide results from the human study in Zhang et al. (2019a), as well as two supervised baselines from prior work. 10 Additionally, to specifically evaluate the advantage of the way we mapped the RPM task into language, we include two simpler abstraction methods that encode task information less explicitly.
Supervised baselines. While our goal is not to achieve the state of the art on RAVEN, we include results from two state-of-the-art supervised baselines for reference. Specifically, we select the two approaches with the top mean accuracy on RAVEN, as outlined in the survey by Małkiński and Mańdziuk (2022): Rel-AIR (Spratley et al., 2020) and CoPINet + ACL (Kim et al., 2020). Rel-AIR combines a simple vision model with an unsupervised scene decomposition module, enabling more generalizable reasoning over entities in RAVEN. CoPINet + ACL applies an analogy-centric contrastive learning paradigm to CoPINet (Zhang et al., 2019b), a prior architecture proposed for perceptual inference through contrastive learning. Both baselines were trained on thousands of examples from the RAVEN dataset and incorporate task-specific inductive biases in their architectures. Meanwhile, we evaluate PLMs on RAVEN in a zero-shot setting with no supervised learning.
Quasi-image abstraction. To evaluate the helpfulness of naming abstractions over entity attributes, we compare against an approach without such abstraction. However, some mapping from the visual features of the RPM task into language is needed for a PLM to interface with it. While the limited context window of PLMs prevents us from incorporating raw pixels directly into our prompts, PLMs have recently been shown to capture spatial patterns in similar inputs: text-based matrices (Patel and Pavlick,
10 Since our approach is not evaluated on the exact same subset of RAVEN data, these results from prior work are not directly comparable, but can be helpful reference points.
Figure 5: Quasi-image abstractions for a triangle and pentagon of different size and color .
<details>
<summary>Image 5 Details</summary>

### Visual Description
## Diagram: Numerical Pattern Representation
### Overview
The image contains two distinct diagrams side-by-side. The left diagram features a triangle with a list of numbers in square brackets, while the right diagram shows a pentagon with a list of numbers. Both lists exhibit incremental growth in the number of elements, with the triangle's numbers being "2" and the pentagon's numbers being "9".
### Components/Axes
- **Left Diagram (Triangle)**:
- **Structure**: A triangle shape with four rows of numerical lists.
- **Labels**: Each row contains a list of "2"s, increasing in count from 1 to 4 elements.
- Row 1: `[2]`
- Row 2: `[2, 2]`
- Row 3: `[2, 2, 2]`
- Row 4: `[2, 2, 2, 2]`
- **Positioning**: Centered within a square boundary on the left side of the image.
- **Right Diagram (Pentagon)**:
- **Structure**: A pentagon shape with ten rows of numerical lists.
- **Labels**: Each row contains a list of "9"s, increasing in count from 1 to 10 elements.
- Row 1: `[9]`
- Row 2: `[9, 9]`
- Row 3: `[9, 9, 9]`
- ...
- Row 10: `[9, 9, 9, 9, 9, 9, 9, 9, 9, 9]`
- **Positioning**: Centered within a square boundary on the right side of the image.
### Detailed Analysis
- **Triangle Pattern**:
- The lists follow a strict incremental pattern, with each row adding one additional "2".
- The final row contains four "2"s, suggesting a progression from 1 to 4 elements.
- **Pentagon Pattern**:
- The lists follow a similar incremental pattern but extend to 10 elements.
- The final row contains ten "9"s, indicating a longer sequence compared to the triangle.
### Key Observations
1. **Symmetry in Structure**: Both diagrams use geometric shapes (triangle and pentagon) to frame their respective lists, possibly indicating categorical differentiation.
2. **Numerical Consistency**: The triangle uses only "2"s, while the pentagon uses only "9"s, suggesting distinct data sets or symbolic meanings.
3. **Growth Progression**: Both diagrams emphasize incremental growth, with the pentagon's sequence (10 rows) extending further than the triangle's (4 rows).
### Interpretation
- **Purpose**: The diagrams likely represent abstract data sequences, where the geometric shapes (triangle/pentagon) symbolize categories or types of data. The incremental lists could model processes like accumulation, iteration, or hierarchical growth.
- **Symbolism**:
- The triangle (3 sides) might represent a base-3 or triangular number system, though the use of "2"s complicates this.
- The pentagon (5 sides) paired with "9"s could imply a countdown, maximum value, or a reference to a 9-based system (e.g., nonary).
- **Anomalies**: The absence of axis labels or legends leaves the exact context ambiguous. The choice of "2" and "9" may hold symbolic or algorithmic significance not immediately apparent.
### Final Notes
The image prioritizes visual representation of numerical growth over explicit contextual clues. Further analysis would require additional metadata to clarify the purpose of the shapes and numbers.
</details>
Figure 6: Results on the RAVEN Center sub-task under entity abstractions, compared to naïve and supervised baselines described in Section 5.1, and humans.
<details>
<summary>Image 6 Details</summary>

### Visual Description
## Line Graph: Center Accuracy vs. Model Size
### Overview
The graph illustrates the relationship between model size (in billion parameters) and center accuracy across various models. It compares human performance, state-of-the-art models (Rel-AIR, CoPINet + ACL), and baseline models (Random, Quasi-Image, Random Naming, Entity Naming, Entity Decomp.). The y-axis represents accuracy (0–1), while the x-axis spans model sizes from 0.1B to 100B parameters.
### Components/Axes
- **X-axis**: Model Size (Billion Parameters) – Logarithmic scale (10⁻¹ to 10²).
- **Y-axis**: Center Accuracy – Linear scale (0 to 1).
- **Legend**:
- Human (dashed green line)
- Rel-AIR (dotted purple line)
- CoPINet + ACL (dotted blue line)
- Random (black dotted line)
- Quasi-Image (gray solid line)
- Random Naming (gray dashed line)
- Entity Naming (blue solid line)
- Entity Decomp. (orange solid line)
### Detailed Analysis
1. **Human Performance**:
- Dashed green line remains flat at ~0.95 accuracy across all model sizes.
2. **Rel-AIR**:
- Dotted purple line stays flat at ~0.9 accuracy.
3. **CoPINet + ACL**:
- Dotted blue line remains flat at ~0.85 accuracy.
4. **Random Baseline**:
- Black dotted line stays flat at ~0.1 accuracy.
5. **Quasi-Image**:
- Gray solid line starts at ~0.1 (0.1B) and increases to ~0.3 (100B).
6. **Random Naming**:
- Gray dashed line starts at ~0.15 (0.1B) and rises to ~0.5 (100B).
7. **Entity Naming**:
- Blue solid line starts at ~0.2 (0.1B), rises to ~0.7 (10B), and plateaus at ~0.8 (100B).
8. **Entity Decomp.**:
- Orange solid line starts at ~0.45 (0.1B), increases to ~0.8 (10B), and reaches ~0.85 (100B).
### Key Observations
- **Scaling Benefits**: Entity Naming and Entity Decomp. show significant accuracy improvements with larger models, approaching human-level performance (~0.95).
- **Baseline Limitations**: Random and Random Naming models perform poorly, with no improvement despite scaling.
- **Architectural Impact**: CoPINet + ACL achieves ~0.85 accuracy without scaling, outperforming Random but trailing human benchmarks.
- **Convergence**: Entity Decomp. closes the gap to human accuracy most effectively (~0.85 vs. 0.95), while Entity Naming lags slightly (~0.8).
### Interpretation
The data suggests that **model architecture** (e.g., CoPINet + ACL) and **scaling** (e.g., Entity Decomp., Entity Naming) are critical for improving center accuracy. Entity Decomp. demonstrates the strongest synergy between scaling and architectural design, achieving ~90% of human performance at 100B parameters. In contrast, Random models highlight the necessity of structured training, as they fail to improve even at 100B parameters. The flat lines for Human, Rel-AIR, and CoPINet + ACL indicate that these represent fixed benchmarks, with CoPINet + ACL serving as a strong baseline for comparison. The trends underscore the importance of both model capacity and algorithmic innovation in achieving human-like performance.
</details>
2021). As such, we propose a quasi-image abstraction which converts the visual RPM task into a matrix of ASCII characters. As shown in Figure 5, an entity's type can be expressed through a matrix of characters; size can be expressed through the height and width of the matrix; and color can be expressed through the actual characters making up the matrix. By converting instances of RAVEN's Center sub-task into this pixel-like form, we have a lower-level abstraction of the task's visual features that can be compared to the higher-level abstraction of naming entity attributes.
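To illustrate this encoding, the sketch below renders a triangle entity like the one in Figure 5, with color as the repeated character and size as the number of rows; the paper's exact rendering may differ, and the function name is ours:

```python
def quasi_triangle(height, color):
    """Render a triangle as a quasi-image: row i holds i+1 copies of the
    color digit, so size sets the dimensions and color sets the characters."""
    return "\n".join(
        "[" + ", ".join([str(color)] * (i + 1)) + "]" for i in range(height)
    )

print(quasi_triangle(4, 2))
# [2]
# [2, 2]
# [2, 2, 2]
# [2, 2, 2, 2]
```

A larger or differently colored entity changes the matrix's dimensions or characters accordingly, as with the pentagon of 9s in Figure 5.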
Random naming abstraction. We would also like to understand the advantage of the specific names we chose for entity attributes compared to other possible choices. As such, we propose a second baseline where, instead of using ordinal labels to describe entities' type , size , and color , we choose random words from a large corpus. This removes numerical dependencies that may be utilized to recognize some relations, and can help us understand whether PLMs take advantage of this information when it is available.
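A toy version of this baseline might map each distinct ordinal value to a distinct random word; the corpus and seed below are illustrative stand-ins, not the paper's:

```python
import random

# Tiny stand-in corpus; the paper draws from a large corpus instead.
CORPUS = ["apple", "river", "stone", "cloud", "ember", "lattice"]

def random_naming(values, seed=0):
    """Replace ordinal attribute values with random words, consistently:
    the same value always maps to the same word within a problem."""
    rng = random.Random(seed)  # fixed seed keeps the mapping reproducible
    mapping = {}
    for v in values:
        if v not in mapping:
            unused = [w for w in CORPUS if w not in mapping.values()]
            mapping[v] = rng.choice(unused)
    return [mapping[v] for v in values]

print(random_naming([1, 1, 2]))  # e.g. two equal words, then a distinct one
```

The mapping is consistent within a problem, so relations like Constant remain recognizable, but the ordering and arithmetic structure of the original numbers is destroyed.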
## 5.2 Entity-Level Reasoning
We first evaluate PLMs under our lowest level abstractions over entity attributes. To isolate the improvements from such abstraction, we focus on the Center sub-task of RAVEN which only includes a single entity per item in the RPM, and thus only tests understanding of relations over entity attributes. The results are shown in Figure 6.
Impact of naming. Under the simplest abstraction of naming the entity-level attributes, we see impressive zero-shot accuracies that monotonically increase with model size, up to 77.2% from GPT-3 175B on Center , nearing human performance. Further, we find that our choice to map attributes into numerical symbols is consistently advantageous over the quasi-image and random-naming abstractions, which reach respective accuracies of only 28.2% and 51.8%. Meanwhile, as model size increases, our ordinal naming approach outperforms the random naming baseline by a growing margin, exceeding 20% for the largest models. This suggests that larger PLMs can better capture and exploit implicit numerical relations in their vocabulary.
Impact of decomposition. When applying decomposition over entity attributes, we observe a further improvement of 2.8% accuracy in GPT-3 175B. Interestingly, we see a much sharper improvement from this abstraction in smaller models, with OPT 125M's accuracy doubling from 22.2% to 45.6%, and OPT 1.3B's accuracy rising from 47.2% to 72.0%. This may suggest that PLMs have a limited working memory related to their parameter count: larger PLMs can thus handle more complex reasoning tasks, while smaller PLMs benefit from decomposing tasks into more manageable parts.
## 5.3 Layout-Level Reasoning
In Figure 7, we evaluate PLMs' capability to capture relations over layout attributes, under our abstractions, on the 2x2Grid and 3x3Grid sub-tasks. Without any decomposition abstraction, model performance reaches up to 78.0% and 86.4% accuracy on 2x2Grid and 3x3Grid respectively. When adding naming for layout-level attributes and decomposing all attributes into separate prompts, we see further improvements across the board, with accuracies reaching 87.8% on 2x2Grid and 93.2% on 3x3Grid . The PLM exceeds human performance on both sub-tasks, despite their being arguably among the most complex in RAVEN, the latter comprising more entities than any other sub-task. This suggests that our strong layout-level abstractions enable the PLM to tease apart the numerous attributes in grids of entities and capture obscure patterns, whereas humans may struggle as the task becomes more complex.
Figure 7: Results on grid-based sub-tasks of RAVEN without and with decomposition abstractions. Compared to humans and supervised baselines.
<details>
<summary>Image 7 Details</summary>

### Visual Description
## Line Graph: Model Size vs. Accuracy for Different Architectures
### Overview
The image contains two line graphs comparing the accuracy of different AI models (2x2Grid and 3x3Grid) across varying model sizes (in billion parameters). The graphs include multiple data series with distinct line styles and colors, alongside benchmark lines for human performance and random guessing.
### Components/Axes
- **X-axis**: "Model Size (Billion Parameters)" with logarithmic scale markers at 10⁻¹, 10⁰, 10¹, and 10².
- **Y-axis (Left)**: "2x2Grid Accuracy" (0–1) for the left graph; "3x3Grid Accuracy" (0–1) for the right graph.
- **Legend**:
- **Human**: Dashed green line (~0.8 accuracy).
- **Rel-AIR**: Dotted purple line (~0.9 accuracy).
- **Random**: Dotted black line (~0.2 accuracy).
- **Entity Naming**: Solid blue line (2x2Grid: 0.4→0.8; 3x3Grid: 0.6→0.9).
- **Entity & Layout Decomp.**: Solid orange line (2x2Grid: 0.6→0.9; 3x3Grid: 0.7→0.95).
### Detailed Analysis
#### 2x2Grid Accuracy vs. Model Size
- **Human**: Flat dashed green line at ~0.8.
- **Rel-AIR**: Flat dotted purple line at ~0.9.
- **Random**: Flat dotted black line at ~0.2.
- **Entity Naming**: Solid blue line starts at ~0.4 (10⁻¹) and rises to ~0.8 (10²).
- **Entity & Layout Decomp.**: Solid orange line starts at ~0.6 (10⁻¹) and rises to ~0.9 (10²).
#### 3x3Grid Accuracy vs. Model Size
- **Human**: Flat dashed green line at ~0.8.
- **Rel-AIR**: Flat dotted purple line at ~0.9.
- **Random**: Flat dotted black line at ~0.2.
- **Entity Naming**: Solid blue line starts at ~0.6 (10⁻¹) and rises to ~0.9 (10²).
- **Entity & Layout Decomp.**: Solid orange line starts at ~0.7 (10⁻¹) and rises to ~0.95 (10²).
### Key Observations
1. **Model Size Impact**: Both Entity Naming and Entity & Layout Decomp. show significant accuracy improvements as model size increases (e.g., Entity & Layout Decomp. jumps from ~0.6 to ~0.9 in 2x2Grid).
2. **Benchmark Comparison**:
- Rel-AIR consistently outperforms Human (0.9 vs. 0.8).
- Random guessing remains far below all other methods (~0.2).
3. **Architecture Differences**:
- 3x3Grid models achieve higher accuracy than 2x2Grid for the same methods (e.g., Entity & Layout Decomp. reaches 0.95 vs. 0.9 in 2x2Grid).
- Entity & Layout Decomp. outperforms Entity Naming in both grid sizes.
### Interpretation
The data demonstrates that:
- **Larger models** (higher parameter counts) improve performance for decomposition-based methods (Entity & Layout Decomp.), suggesting scalability benefits.
- **Decomposition methods** (Entity & Layout Decomp.) outperform simpler approaches (Entity Naming), indicating architectural complexity matters.
- **Human and Rel-AIR benchmarks** set high accuracy thresholds (~0.8–0.9), implying current models are nearing human-level performance in some tasks.
- The **logarithmic x-axis** highlights that even small increases in model size (e.g., 10⁰ to 10¹) yield disproportionate accuracy gains.
### Spatial Grounding & Trend Verification
- **Legend Position**: Top-left corner, aligned with graph titles.
- **Line Colors**: Confirmed matches (e.g., Entity & Layout Decomp. is orange, Entity Naming is blue).
- **Trend Logic**: Solid lines (Entity methods) show upward slopes, while dashed/dotted lines (benchmarks) remain flat, validating the visual interpretation.
### Content Details
- **2x2Grid Data Points**:
- Entity Naming: 0.4 (10⁻¹), 0.6 (10⁰), 0.8 (10²).
- Entity & Layout Decomp.: 0.6 (10⁻¹), 0.8 (10¹), 0.9 (10²).
- **3x3Grid Data Points**:
- Entity Naming: 0.6 (10⁻¹), 0.8 (10¹), 0.9 (10²).
- Entity & Layout Decomp.: 0.7 (10⁻¹), 0.85 (10¹), 0.95 (10²).
### Notable Anomalies
- **Random Guessing**: Consistently flat at ~0.2, serving as a baseline for comparison.
- **Human Performance**: Slightly lower than Rel-AIR, suggesting Rel-AIR may incorporate human-like reasoning with added efficiency.
### Final Notes
The graphs emphasize the importance of model architecture and size in achieving high accuracy. Entity & Layout Decomp. appears most effective, particularly in larger models, while Random Guessing underscores the value of structured approaches. The data aligns with trends in AI research, where decomposition and scalability are critical for complex tasks.
</details>
## 5.4 Component-Level Reasoning
Lastly, we apply our structural decomposition-based abstractions on RAVEN sub-tasks which have multiple components, i.e., L-R , U-D , O-IC , and O-IG . The results are shown in Figure 8. First, just decomposing the task by its components improves the maximum accuracy on each task by about 20% on average. Additionally decomposing each component by its entity and layout attributes brings further gains, with GPT-3 175B reaching up to 77.6%, 78.0%, 82.8%, and 92.6% on L-R , U-D , O-IC , and O-IG respectively, exceeding humans and nearing supervised baselines on the latter. The performance gain from this decomposition is again even more pronounced for smaller PLMs. Most significantly, OPT 1.3B improves from 20-30% accuracy to over 70% accuracy, nearing human performance. This demonstrates that not only is GPT-3 capable of very complex analogical reasoning tasks, but even PLMs less than one-hundredth its size can perform quite well here with the proper abstractions.
## 5.5 Fine-Grained Analysis
Finally, we analyze how model performance varies across different attributes and relations, as we introduce distracting attributes, and as we introduce rows into the matrix. In our analysis, we compare three representative levels of abstraction: entity attribute naming only (no decomposition into multiple prompts), decomposition of components , and full decomposition of entity and layout attributes and components .
## 5.5.1 Analysis of Attributes and Relations
We measure the impact of abstractions in capturing each attribute and relation in RAVEN. In Figure 9,
Table 1: GPT-3 accuracy on Center sub-task with distracting orientation attribute in language prompts, under the naming and decomposition abstractions. orientation values are taken directly from RAVEN or randomly selected.
| Distractor Values | Naming | Decomposition |
|---------------------|----------|-----------------|
| RAVEN | 76.0% | 80.0% |
| Random | 72.6% | 77.8% |
we present GPT-3 175B's accuracy over each attribute and relation. We find that number is the best captured attribute even without any decomposition abstractions, while the model struggles with position until we introduce decomposition of attributes, suggesting the occupancy map encoding used here indeed helped capture it. Meanwhile, Arithmetic is the most difficult relation, with consistently lower accuracy than other relations.
## 5.5.2 Robustness to Distracting Attributes
Since our mappings from RAVEN attributes into language provide the key features over which relations occur, we may wonder how robust PLMs are to distracting or unimportant attributes. In fact, the RAVEN dataset includes one noise attribute that we excluded from our mapping to avoid unnecessarily increasing prompt lengths: orientation , i.e., the rotation of entities in the RPM. To begin exploring this issue, we incorporate orientation into the problem as a fourth entity-level attribute in addition to type , size , and color . For the best model (i.e., GPT-3) on the Center sub-task, we compare two possible injections of orientation values: using the values provided in RAVEN (which are mostly constant within each matrix row), and randomly selected values (which could be more distracting).
As shown in Table 1, compared to GPT-3's Center accuracies of 77.2% and 80.0% with respective naming and decomposition abstractions, the injection of orientation as a distraction feature does not degrade the model performance much, achieving accuracies of 76.0% and 80.0% when using values from RAVEN, and 72.6% and 77.8% when using random values. This shows that PLMs exhibit some robustness to distracting attributes in language context, and have the capability to ignore them in analogical reasoning. Future work may consider more in-depth analysis to discover the extent of model robustness to distraction features, and how it varies by model complexity.
Figure 8: PLM accuracy on multi-component RAVEN sub-tasks with attribute naming only, component decomposition, and full component and attribute decomposition, compared to supervised baselines and humans.
<details>
<summary>Image 8 Details</summary>

### Visual Description
## Line Graphs: Model Size vs. Accuracy Across Metrics
### Overview
The image contains four line graphs arranged in a 2x2 grid, comparing the performance of different model architectures across four accuracy metrics (L-R, U-D, O-IC, O-IG) as model size increases from 10⁻¹ to 10² billion parameters. Each graph includes multiple data series with distinct visual styles and legends.
---
### Components/Axes
#### Common Elements Across All Graphs:
- **X-axis**: Model Size (Billion Parameters)
- Logarithmic scale with ticks at 10⁻¹, 10⁰, 10¹, 10²
- Label: "Model Size (Billion Parameters)"
- **Y-axes**:
- Top-left: L-R Accuracy (0–1 scale)
- Top-right: U-D Accuracy (0–1 scale)
- Bottom-left: O-IC Accuracy (0–1 scale)
- Bottom-right: O-IG Accuracy (0–1 scale)
- **Legends**: Positioned at the top of each graph, with the following entries:
- **Human**: Green dashed line (flat across all graphs)
- **Rel-AIR**: Purple dotted line (flat across all graphs)
- **CoPINet + ACL**: Cyan dotted line (flat across all graphs)
- **Random**: Black dotted line (flat across all graphs)
- **Attr. Naming**: Blue solid line
- **Comp. Decomp.**: Red solid line
- **Comp. & Attr. Decomp.**: Yellow solid line
#### Spatial Grounding:
- Legends are aligned at the top-center of each graph.
- X-axis labels are centered at the bottom of each graph.
- Y-axis labels are rotated 90° on the left side of each graph.
---
### Detailed Analysis
#### 1. **L-R Accuracy (Top-left Graph)**
- **Human**: Flat green dashed line at ~0.8 accuracy.
- **Rel-AIR**: Flat purple dotted line at ~0.6 accuracy.
- **CoPINet + ACL**: Flat cyan dotted line at ~0.4 accuracy.
- **Random**: Flat black dotted line at ~0.2 accuracy.
- **Trends**:
- **Attr. Naming** (blue): Starts at ~0.1 (10⁻¹ params), rises to ~0.6 (10² params).
- **Comp. Decomp.** (red): Starts at ~0.2 (10⁻¹ params), rises to ~0.7 (10² params).
- **Comp. & Attr. Decomp.** (yellow): Starts at ~0.3 (10⁻¹ params), rises to ~0.75 (10² params).
#### 2. **U-D Accuracy (Top-right Graph)**
- **Human**: Flat green dashed line at ~0.8 accuracy.
- **Rel-AIR**: Flat purple dotted line at ~0.6 accuracy.
- **CoPINet + ACL**: Flat cyan dotted line at ~0.4 accuracy.
- **Random**: Flat black dotted line at ~0.2 accuracy.
- **Trends**:
- **Attr. Naming** (blue): Starts at ~0.1 (10⁻¹ params), rises to ~0.5 (10² params).
- **Comp. Decomp.** (red): Starts at ~0.2 (10⁻¹ params), rises to ~0.65 (10² params).
- **Comp. & Attr. Decomp.** (yellow): Starts at ~0.3 (10⁻¹ params), rises to ~0.7 (10² params).
#### 3. **O-IC Accuracy (Bottom-left Graph)**
- **Human**: Flat green dashed line at ~0.8 accuracy.
- **Rel-AIR**: Flat purple dotted line at ~0.6 accuracy.
- **CoPINet + ACL**: Flat cyan dotted line at ~0.4 accuracy.
- **Random**: Flat black dotted line at ~0.2 accuracy.
- **Trends**:
- **Attr. Naming** (blue): Starts at ~0.1 (10⁻¹ params), rises to ~0.55 (10² params).
- **Comp. Decomp.** (red): Starts at ~0.2 (10⁻¹ params), rises to ~0.6 (10² params).
- **Comp. & Attr. Decomp.** (yellow): Starts at ~0.3 (10⁻¹ params), rises to ~0.72 (10² params).
#### 4. **O-IG Accuracy (Bottom-right Graph)**
- **Human**: Flat green dashed line at ~0.8 accuracy.
- **Rel-AIR**: Flat purple dotted line at ~0.6 accuracy.
- **CoPINet + ACL**: Flat cyan dotted line at ~0.4 accuracy.
- **Random**: Flat black dotted line at ~0.2 accuracy.
- **Trends**:
- **Attr. Naming** (blue): Starts at ~0.1 (10⁻¹ params), rises to ~0.6 (10² params).
- **Comp. Decomp.** (red): Starts at ~0.2 (10⁻¹ params), rises to ~0.68 (10² params).
- **Comp. & Attr. Decomp.** (yellow): Starts at ~0.3 (10⁻¹ params), rises to ~0.75 (10² params).
---
### Key Observations
1. **Human Performance**: All graphs show a flat green dashed line at ~0.8 accuracy, suggesting a baseline human-level performance benchmark.
2. **Random Baseline**: The black dotted line (Random) remains consistently at ~0.2 accuracy across all metrics, indicating minimal performance without structured modeling.
3. **Model Size Correlation**: All non-baseline models (Attr. Naming, Comp. Decomp., Comp. & Attr. Decomp.) show **monotonic improvement** in accuracy as model size increases.
4. **Performance Gaps**:
- **Comp. & Attr. Decomp.** (yellow) consistently outperforms other methods across all metrics.
- **Attr. Naming** (blue) underperforms compared to decomposition-based methods.
5. **Flat Baselines**: Rel-AIR, CoPINet + ACL, and Random lines remain flat, suggesting these methods are either size-invariant or inherently limited.
---
### Interpretation
The data demonstrates that **larger model sizes correlate with improved accuracy** across all metrics, with decomposition-based methods (Comp. Decomp. and Comp. & Attr. Decomp.) achieving the highest gains. The flat lines for Human, Rel-AIR, and CoPINet + ACL imply these methods either:
- Reached a performance ceiling (Human/Rel-AIR), or
- Are not sensitive to model size changes (CoPINet + ACL).
The **Comp. & Attr. Decomp.** method (yellow) appears most effective, suggesting that combining compositional decomposition with attribute-level modeling yields superior results. The absence of overlap between data series indicates clear hierarchical performance differences, with decomposition-based approaches outperforming attribute-only methods.
**Critical Insight**: While model size drives performance gains, the choice of architectural strategy (e.g., decomposition vs. attribute naming) determines the ceiling of achievable accuracy. Human-level performance (~0.8) remains unattained by all tested methods, highlighting a potential gap in current modeling paradigms.
</details>
Figure 9: Comparison of accuracy on examples from all sub-tasks, broken down by the types of attributes and relations they require capturing.
<details>
<summary>Image 9 Details</summary>

### Visual Description
## Bar Chart: Model Accuracy Comparison Across Task Categories
### Overview
The chart compares the accuracy of three model configurations across seven task categories. Three methods are evaluated: "Attr. Naming Only" (blue), "Component Decomp." (red), and "Component + Attr. Decomp." (yellow). Accuracy values range from 0.0 to 1.0 on the y-axis.
### Components/Axes
- **X-axis (Categories)**: Entity Attr., Number, Position, Constant, Progression, Arithmetic, Distribute Three
- **Y-axis (Accuracy)**: 0.0 to 1.0 in increments of 0.2
- **Legend**: Located at the top-right corner, mapping colors to methods:
- Blue = Attr. Naming Only
- Red = Component Decomp.
- Yellow = Component + Attr. Decomp.
### Detailed Analysis
1. **Entity Attr.**
- Blue: ~0.70
- Red: ~0.80
- Yellow: ~0.85
2. **Number**
- Blue: ~0.85
- Red: ~0.90
- Yellow: ~0.95
3. **Position**
- Blue: ~0.70
- Red: ~0.75
- Yellow: ~0.85
4. **Constant**
- Blue: ~0.70
- Red: ~0.80
- Yellow: ~0.85
5. **Progression**
- Blue: ~0.70
- Red: ~0.80
- Yellow: ~0.85
6. **Arithmetic**
- Blue: ~0.60
- Red: ~0.70
- Yellow: ~0.75
7. **Distribute Three**
- Blue: ~0.70
- Red: ~0.80
- Yellow: ~0.85
### Key Observations
- **Yellow bars (Component + Attr. Decomp.)** consistently show the highest accuracy across all categories, with values ranging from 0.75 to 0.95.
- **Blue bars (Attr. Naming Only)** are the lowest performers, particularly in "Arithmetic" (~0.60) and "Number" (~0.85).
- **Red bars (Component Decomp.)** perform intermediately, with values between 0.70 and 0.90.
- The largest accuracy gap occurs in the "Number" category, where yellow exceeds blue by ~0.10.
### Interpretation
The data demonstrates that combining **Component Decomposition** with **Attribute Decomposition** (yellow) yields the most robust performance across all task categories. This suggests that attribute-level reasoning enhances the base component decomposition method. The "Attr. Naming Only" approach (blue) underperforms significantly, indicating that attribute naming alone is insufficient for accurate task resolution. The "Arithmetic" category shows the most pronounced performance disparity, highlighting the importance of decomposition strategies for complex numerical reasoning tasks. The consistent superiority of the combined method implies that both component and attribute-level reasoning are critical for high-accuracy modeling in this context.
</details>
## 5.5.3 In-Context Learning Over Rows
By design, RPM tasks are meant to require minimal background knowledge. They should be impossible to solve without the first two rows of the matrix, which provide essential context to complete the third row of the matrix. To understand whether PLMs capture relations specifically from in-context learning over the first two rows of the matrix (as opposed to using prior knowledge from pre-training), we measure the model performance as we introduce rows to the matrices.
As shown in Figure 10, the average model performance increases across all sizes and abstractions as rows are added to the matrix. This suggests that in-context learning indeed contributes significantly to performance, even for smaller models. Larger model sizes see the most significant improvements, suggesting that larger PLMs are stronger in-context learners than smaller ones. Further, larger PLMs achieve nearly the same accuracy with only two rows of the matrix provided rather than all three, suggesting that they pick up the task quite quickly from in-context learning.
We also observe that in many cases, models achieve accuracies above chance (12.5% accuracy) without being provided any complete rows of the
Table 2: GPT-3 accuracy on RAVEN sub-tasks as rows are added to the RPM, under only naming abstractions.
| Sub-Task | 1 Row | 2 Rows | 3 Rows | Human |
|------------|---------|----------|----------|---------|
| Center | 36.8% | 69.2% | 77.2% | 95.6% |
| 2x2Grid | 54.0% | 71.0% | 78.0% | 81.8% |
| 3x3Grid | 73.0% | 85.2% | 86.4% | 79.6% |
| L-R | 14.0% | 38.2% | 54.2% | 86.4% |
| U-D | 12.4% | 42.0% | 53.6% | 81.8% |
| O-IC | 19.6% | 53.6% | 64.8% | 86.4% |
| O-IG | 32.0% | 62.2% | 74.8% | 81.8% |
matrix (only the third, incomplete row). This may suggest the PLM has a useful prior for this problem, despite it being a visual problem and thus impossible to observe directly in pre-training. This raises questions about the objectivity of RAVEN and possibly the RPM task. 11 Further, when decomposition abstractions are applied, models achieve higher accuracies than when not, suggesting that decomposition encodes some of this prior knowledge for the task. In Table 2, we take a closer look at GPT-3 175B's performance within sub-tasks. Surprisingly, we find the highest accuracies on the grid-based sub-tasks, despite them being the most difficult tasks for humans.
This motivates future work to compare human and PLM performance on ablated analogy-making tasks like these to further evaluate their objectivity and identify commonalities. Future work in AI and analogy may also consider building diagnostic datasets to tease apart attribute and relation types to better understand how they contribute to model performance and identify areas for improvement.
## In-context learning of attributes and relations.
11 In Appendix B, we further explore this hypothesis on the Impartial-RAVEN dataset (Hu et al., 2021) that removes some superficial correlations in matrix completion choices, and still see comparable results.
Figure 10: Macro average accuracy over all RAVEN sub-tasks as we introduce rows to the matrix during in-context learning, under naming abstractions only (left) and all naming and decomposition abstractions (right). In 1 Row, we include only the incomplete third row.
<details>
<summary>Image 10 Details</summary>

### Visual Description
## Line Graphs: Naming Accuracy vs. Decomposition Accuracy Across Model Sizes
### Overview
The image contains two line graphs comparing accuracy metrics (Naming Accuracy and Decomposition Accuracy) across different model sizes (10⁻¹ to 10² billion parameters). The graphs include human performance baselines, random chance baselines, and three model configurations (1 Row, 2 Rows, 3 Rows). Key trends show performance improvements with increasing model size and row configurations.
---
### Components/Axes
- **X-axis**: Model Size (Billion Parameters)
- Logarithmic scale: 10⁻¹, 10⁰, 10¹, 10²
- **Y-axis (Left Graph)**: Naming Accuracy (0–1)
- **Y-axis (Right Graph)**: Decomposition Accuracy (0–1)
- **Legends**:
- **Human**: Green dashed line (Naming: ~0.85, Decomp: ~0.85)
- **Rel-AIR**: Purple dotted line (Naming: ~0.9, Decomp: ~0.9)
- **CoPINet + ACL**: Blue dotted line (Naming: ~0.95, Decomp: ~0.95)
- **Random**: Black dotted line (Naming: ~0.125, Decomp: ~0.125)
- **1 Row**: Pink line (Naming: 0.2→0.35, Decomp: 0.25→0.45)
- **2 Rows**: Orange line (Naming: 0.25→0.6, Decomp: 0.4→0.75)
- **3 Rows**: Teal line (Naming: 0.25→0.7, Decomp: 0.45→0.85)
---
### Detailed Analysis
#### Left Graph (Naming Accuracy)
- **Human**: Flat green dashed line at ~0.85.
- **Rel-AIR**: Flat purple dotted line at ~0.9.
- **CoPINet + ACL**: Flat blue dotted line at ~0.95.
- **Random**: Flat black dotted line at ~0.125 (chance for 8 choices).
- **1 Row**: Starts at 0.2 (10⁻¹) and rises to 0.35 (10²).
- **2 Rows**: Starts at 0.25 (10⁻¹) and rises to 0.6 (10²).
- **3 Rows**: Starts at 0.25 (10⁻¹) and rises to 0.7 (10²).
#### Right Graph (Decomposition Accuracy)
- **Human**: Flat green dashed line at ~0.85.
- **Rel-AIR**: Flat purple dotted line at ~0.9.
- **CoPINet + ACL**: Flat blue dotted line at ~0.95.
- **Random**: Flat black dotted line at ~0.125 (chance for 8 choices).
- **1 Row**: Starts at 0.25 (10⁻¹) and rises to 0.45 (10²).
- **2 Rows**: Starts at 0.4 (10⁻¹) and rises to 0.75 (10²).
- **3 Rows**: Starts at 0.45 (10⁻¹) and rises to 0.85 (10²).
---
### Key Observations
1. **Human Performance**: Both graphs show human accuracy (~0.85) as a consistent baseline.
2. **Model Performance**:
- **CoPINet + ACL** posts the highest accuracy (~0.95), exceeding the human baseline.
- **Rel-AIR** also performs slightly better than human in Naming Accuracy (~0.9 vs. 0.85).
3. **Row Configurations**:
- Increasing rows improves accuracy: 1 Row < 2 Rows < 3 Rows.
- 3 Rows achieves ~0.85 Decomposition Accuracy (matching human) at 10² parameters.
4. **Random Baseline**: Both graphs show random chance at ~0.125 (1 of 8 choices), serving as a lower bound.
---
### Interpretation
- **Model Size Impact**: Larger models (10² billion parameters) consistently outperform smaller ones, suggesting clear scaling benefits.
- **Context Rows**: Accuracy rises directly with the number of complete rows provided in the prompt, indicating that in-context evidence, not architectural changes, drives the gains.
- **CoPINet + ACL**: This supervised vision baseline exceeds human performance, setting the strongest reference point.
- **Rel-AIR**: Also slightly exceeds human Naming Accuracy (~0.9 vs. 0.85), plausibly a result of task-specific supervised training.
- **Practical Implications**: Supplying full context rows and scaling model size together narrow the gap between zero-shot PLMs and supervised baselines.
---
### Spatial Grounding & Validation
- **Legend Placement**: Right-aligned, with clear color-coded labels.
- **Line Consistency**: Colors match legend entries (e.g., teal = 3 Rows).
- **Trend Verification**: All lines slope upward with increasing model size, confirming the stated trends.
</details>
Figure 11: Comparison of accuracy on examples from all RAVEN sub-tasks as rows are introduced to the matrix, with only entity attribute naming abstractions.
<details>
<summary>Image 11 Details</summary>

### Visual Description
## Bar Chart: Accuracy Comparison Across Different Data Representation Strategies
### Overview
The chart compares the accuracy of three data representation strategies (1 Row, 2 Rows, 3 Rows) across seven categories: Entity Attr., Number, Position, Constant, Progression, Arithmetic, and Distribute Three. Accuracy is measured on a scale from 0.0 to 1.0.
### Components/Axes
- **X-Axis (Categories)**: Entity Attr., Number, Position, Constant, Progression, Arithmetic, Distribute Three.
- **Y-Axis (Accuracy)**: Scale from 0.0 to 1.0.
- **Legend**:
- Pink: 1 Row
- Orange: 2 Rows
- Teal: 3 Rows
- **Bar Colors**:
- 1 Row (pink), 2 Rows (orange), 3 Rows (teal).
### Detailed Analysis
1. **Entity Attr.**
- 1 Row: ~0.35
- 2 Rows: ~0.60
- 3 Rows: ~0.70
2. **Number**
- 1 Row: ~0.60
- 2 Rows: ~0.75
- 3 Rows: ~0.85
3. **Position**
- 1 Row: ~0.40
- 2 Rows: ~0.65
- 3 Rows: ~0.70
4. **Constant**
- 1 Row: ~0.30
- 2 Rows: ~0.60
- 3 Rows: ~0.70
5. **Progression**
- 1 Row: ~0.35
- 2 Rows: ~0.60
- 3 Rows: ~0.70
6. **Arithmetic**
- 1 Row: ~0.30
- 2 Rows: ~0.50
- 3 Rows: ~0.60
7. **Distribute Three**
- 1 Row: ~0.25
- 2 Rows: ~0.55
- 3 Rows: ~0.70
### Key Observations
- **Trend Verification**:
- All categories show increasing accuracy with more rows (1 → 2 → 3 Rows).
- The "Number" category has the highest absolute accuracy (0.85 for 3 Rows).
- The "Arithmetic" category has the lowest accuracy across all strategies.
- **Outliers**:
- "Distribute Three" (1 Row) has the lowest accuracy (~0.25).
- "Number" (3 Rows) has the highest accuracy (~0.85).
### Interpretation
The data demonstrates that increasing the number of context rows consistently improves accuracy across all categories. The "Number" category benefits most from additional rows, reaching the highest observed accuracy (0.85) with 3 Rows. Conversely, the "Arithmetic" category remains the weakest, with accuracy plateauing at ~0.60 even with 3 Rows. This suggests that arithmetic relations may require specialized handling or additional context to solve reliably. The "Distribute Three" category shows the largest gap between conditions, indicating it is particularly sensitive to the amount of available context.
</details>
We may wonder whether specific relations or attributes are easier to understand than others with less context. For example, the Progression or Constant relations may be possible to recognize only from the first two items of the third row in an RPM, as we can easily observe patterns in attribute values here, e.g., that entity size is increasing or color remains constant. In Figures 11 and 12, we surprisingly observe only marginal differences here, except for the number attribute, which seems significantly better captured than other attributes in this no-context setting.
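As a toy illustration of why these rules can leak from the incomplete row alone (our own sketch, under the simplifying assumption of integer-coded attribute values), note that Constant and Progression already pin down the missing third value given only the first two cells:

```python
def predict_third(a, b):
    """Guess the missing attribute value from the first two cells of a row,
    assuming the row is governed by either Constant or Progression."""
    if a == b:
        return a           # Constant: the value simply repeats
    return b + (b - a)     # Progression: a fixed increment continues
```

Rules like Distribute Three or Arithmetic, by contrast, cannot be pinned down from two cells without the context rows.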
Figure 12: Comparison of accuracy on examples from all RAVEN sub-tasks as rows are introduced to the matrix, with all decomposition abstractions.
<details>
<summary>Image 12 Details</summary>

### Visual Description
## Bar Chart: Accuracy by Category and Row Count
### Overview
The chart compares accuracy metrics across seven categories (Entity Attr., Number, Position, Constant, Progression, Arithmetic, Distribute Three) for three row configurations (1 Row, 2 Rows, 3 Rows). Accuracy values range from 0.0 to 1.0 on the y-axis, with categories listed on the x-axis.
### Components/Axes
- **X-axis**: Categories (Entity Attr., Number, Position, Constant, Progression, Arithmetic, Distribute Three)
- **Y-axis**: Accuracy (0.0 to 1.0)
- **Legend**:
- Pink = 1 Row
- Yellow = 2 Rows
- Teal = 3 Rows
- **Title**: Implied by context (Accuracy by Category and Row Count)
### Detailed Analysis
#### Entity Attr.
- 1 Row: ~0.45 (pink)
- 2 Rows: ~0.80 (yellow)
- 3 Rows: ~0.85 (teal)
#### Number
- 1 Row: ~0.60 (pink)
- 2 Rows: ~0.92 (yellow)
- 3 Rows: ~0.95 (teal)
#### Position
- 1 Row: ~0.47 (pink)
- 2 Rows: ~0.79 (yellow)
- 3 Rows: ~0.87 (teal)
#### Constant
- 1 Row: ~0.44 (pink)
- 2 Rows: ~0.80 (yellow)
- 3 Rows: ~0.84 (teal)
#### Progression
- 1 Row: ~0.48 (pink)
- 2 Rows: ~0.82 (yellow)
- 3 Rows: ~0.86 (teal)
#### Arithmetic
- 1 Row: ~0.40 (pink)
- 2 Rows: ~0.70 (yellow)
- 3 Rows: ~0.75 (teal)
#### Distribute Three
- 1 Row: ~0.40 (pink)
- 2 Rows: ~0.81 (yellow)
- 3 Rows: ~0.85 (teal)
### Key Observations
1. **Row Count Impact**:
- 3 Rows consistently outperform 2 Rows and 1 Row across all categories.
- 2 Rows generally outperform 1 Row, though the gap narrows in Arithmetic (~0.30 difference vs. ~0.35 in other categories).
2. **Category Variability**:
- **Number** and **Distribute Three** show the largest accuracy gains with additional rows (+0.35 and +0.45 respectively from 1 to 3 Rows).
- **Arithmetic** has the smallest improvement (+0.05 from 2 to 3 Rows).
3. **Anomalies**:
- In **Number**, 2 Rows (0.92) slightly underperform 3 Rows (0.95), but the difference is minimal (~3%).
- **Arithmetic** shows a plateau in gains between 2 and 3 Rows.
### Interpretation
The data suggests that increasing row count improves accuracy, likely because additional complete rows supply more in-context evidence for the governing rules. However, diminishing returns are evident in categories like Arithmetic, where adding a third row yields minimal improvement. The **Number** category's near-parity between 2 and 3 Rows may indicate saturation. The consistent trend across categories implies that the amount of in-context evidence is a critical factor in model performance, though the optimal amount may vary by rule type.
</details>
## 6 Conclusion
In this work, we explored the ability of large PLMs to perform zero-shot analogical reasoning on visual Raven's Progressive Matrices (RPM). With even the simplest mapping to language, they achieve striking results, while applying higher-level naming and decomposition abstractions over the task features further raises performance, in some cases to the level of humans and supervised approaches. We find that while ordinal naming abstractions are a powerful way to enable analogical reasoning in larger PLMs, decomposition abstractions that break the task down into atomic parts conserve working memory such that even smaller PLMs under 1B parameters can achieve competitive performance on this challenging problem.
Our detailed analysis revealed insights about which features of the task PLMs best capture, their robustness to distracting features, and the role of in-context learning and prior knowledge in picking up this complex task. Surprisingly, we find that even without two complete rows of prior context from the matrix, GPT-3 175B and smaller models can achieve above-chance performance on the task, raising questions about the objectivity and true role of prior knowledge in RPM tasks, which are assumed to require minimal prior knowledge.
These results also raise some questions about the role PLMs may play in future AI systems capable of analogy. While previously thought to be a difficult problem for AI systems, PLMs can solve the reasoning step of analogy easily given strong abstractions over visual perception. Many of these abstractions are intuitive and commonly researched in computer vision, including the detection of object types, sizes, colors, counts, and global arrangements. As such, future work may dive deeper into the challenging problem of generalized perception across domains, where we must robustly tease apart the key features of tasks and experiences that may facilitate analogy-making, e.g., in recognizing the commonalities between a physical bridge and the bridge of a song (Mitchell, 2021). Recent efforts toward understanding how humans describe abstract visual features in language by mapping them to natural concepts 12 are a promising direction toward this goal (Lachmy et al., 2022; Ji et al., 2022).
12 For example, when communicating about abstract shapes, we may make an analogy to refer to them as looking like more familiar natural concepts like flowers or dog bones.
## Acknowledgements
This work was supported in part by DARPA PTG program HR00112220003. We would like to thank the anonymous reviewers for their valuable comments and suggestions.
## Limitations
## Perception and reasoning in text-based RAVEN.
In this work, one limitation is that we do not attempt to solve the perception problem of analogy-making in RPM; rather, we apply perfect perception to the reasoning part, assuming the perception problem is simple. By doing so, we find that PLMs may be a strong solution to the reasoning problem here, which may better direct future efforts toward AI and analogy. Of course, perception in idealized domains is quite different from perception in more natural domains, and identifying key features across many domains that can facilitate a mapping remains a challenging, unsolved problem. We hope that our work sparks more interest in this problem.
Meanwhile, one may argue that our decomposition abstractions are too strong and actually contribute to the reasoning problem in RPM, as they make an independence assumption about which features of the task can be teased apart. Making such an assumption requires an understanding of the problem that cannot be inferred from a single instance. However, we decomposed the task based on very intuitive and common attributes, e.g., the shapes, colors, sizes, and counts of items. We believe that the strength of such an abstraction, which could be applied in many problems, should not be understated. Nonetheless, we include decomposition-free results wherever possible throughout the paper to help compare the contributions of decomposition versus naming abstractions, the latter of which more clearly provide only perceptual information. In fact, we find that without any decomposition, PLMs still achieve very strong performance in many cases, and performance gains from decomposition are not always large.
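The independence assumption discussed above can be made concrete with a short sketch (our own reconstruction; the function and parameter names are hypothetical, not from the paper's code): each attribute is serialized and scored as its own small prompt, and per-attribute scores are summed for each answer choice.

```python
def decomposed_score(attr_prompts, choice_attrs, score_fn):
    """Score one answer choice under decomposition.
    attr_prompts: dict mapping attribute name -> context string serializing
      only that attribute's values across the matrix (e.g., just the sizes).
    choice_attrs: dict mapping attribute name -> this choice's value.
    Summing per-attribute scores assumes the attributes are independent."""
    return sum(score_fn(attr_prompts[a] + " " + str(choice_attrs[a]))
               for a in attr_prompts)
```

Each per-attribute prompt is much shorter than the full matrix serialization, which is one way to see why decomposition conserves a smaller model's effective working memory.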
Human performance. Lastly, we note some limitations in the human performance measurements used as reference points. In Zhang et al. (2019a), human performance on RAVEN was measured by giving subjects some task-specific training, then evaluating them on the original visual form of the task. This differs from our results in two ways. First, PLMs had no task-specific training for RAVEN, given that experiments were zero-shot and the text data we generate is new and thus impossible to appear directly in PLM pre-training. This may give humans an advantage. Second, the task is presented to PLMs in text form, not visually. While the essential information from the task is preserved by our conversion, it is possible that this conversion would affect the difficulty of the task for humans (making it easier or harder). As such, it becomes unclear how to contextualize our results with these past human results. Future work may carry out systematic human studies to compare the analogical reasoning capabilities of humans and PLMs in different settings.
## Ethical Considerations
This work does not use any human subjects or human-generated data. Our work deals with abstract visual features that are described with numerical symbols, thus not strongly targeting any language. A possible ethical concern for this work is the amount of computational resources used in evaluating PLMs. To reduce unnecessary computation in our study, we chose to apply PLMs to only a subset of 500 testing examples from each sub-task of the RAVEN dataset, while the full testing set is four times as large.
## References
Yonatan Bitton, Ron Yosef, Eli Strugo, Dafna Shahaf, Roy Schwartz, and Gabriel Stanovsky. 2022. VASR: Visual analogies of situation recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) .
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems , 33:1877-1901.
Jiangjie Chen, Rui Xu, Ziquan Fu, Wei Shi, Zhongqiao Li, Xinbo Zhang, Changzhi Sun, Lei Li, Yanghua Xiao, and Hao Zhou. 2022. E-KAR: A benchmark for rationalizing natural language analogical reasoning. In Findings of the Association for Computational Linguistics: ACL 2022 , pages 3941-3955, Dublin, Ireland. Association for Computational Linguistics.
Stella Christie and Dedre Gentner. 2014. Language helps children succeed on a classic analogy task. Cognitive Science , 38(2):383-397.
- Dedre Gentner. 1983. Structure-mapping: A theoretical framework for analogy. Cognitive Science , 7(2):155-170.
- Dedre Gentner. 2010. Bootstrapping the mind: Analogical processes and symbol systems. Cognitive Science , 34(5):752-775.
- Dedre Gentner, Asli Özyürek, Özge Gürcanli, and Susan Goldin-Meadow. 2013. Spatial language facilitates spatial cognition: Evidence from children who lack language input. Cognition , 127(3):318-330.
- Peter Gordon. 2004. Numerical cognition without words: Evidence from Amazonia. Science , 306(5695):496-499.
- Felix Hill, Adam Santoro, David GT Barrett, Ari S Morcos, and Timothy Lillicrap. 2019. Learning to make analogies by contrasting abstract relational structure. In 7th International Conference on Learning Representations (ICLR) .
- Douglas R Hofstadter and Melanie Mitchell. 1994. The Copycat project: A model of mental fluidity and analogy-making , pages 31-112. Ablex Publishing.
- Douglas R Hofstadter and Emmanuel Sander. 2013. Surfaces and essences: Analogy as the fuel and fire of thinking . Basic Books.
- Keith J Holyoak. 1984. Analogical thinking and human intelligence. Advances in the psychology of human intelligence , 2:199-230.
- Keith J Holyoak. 2012. Analogy and relational reasoning. The Oxford Handbook of Thinking and Reasoning .
- Sheng Hu, Yuqing Ma, Xianglong Liu, Yanlu Wei, and Shihao Bai. 2021. Stratified rule-aware network for abstract visual reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI) , volume 35, pages 1567-1574.
- Anya Ji, Noriyuki Kojima, Noah Rush, Alane Suhr, Wai Keen Vong, Robert Hawkins, and Yoav Artzi. 2022. Abstract visual reasoning with tangram shapes. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Youngsung Kim, Jinwoo Shin, Eunho Yang, and Sung Ju Hwang. 2020. Few-shot visual reasoning with meta-analogical contrastive learning. In Advances in Neural Information Processing Systems , volume 33, pages 16846-16856. Curran Associates, Inc.
- Royi Lachmy, Valentina Pyatkin, Avshalom Manevich, and Reut Tsarfaty. 2022. Draw Me a Flower: Processing and Grounding Abstraction in Natural Language. Transactions of the Association for Computational Linguistics , 10:1341-1356.
- Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. 2015. Human-level concept learning through probabilistic program induction. Science , 350(6266):1332-1338.
- Frank J Lee and John R Anderson. 2001. Does learning a complex task have to be complex?: A study in learning decomposition. Cognitive Psychology , 42(3):267-316.
- Peng-Hsuan Li, Tsan-Yu Yang, and Wei-Yun Ma. 2020. CA-EHN: Commonsense analogy from E-HowNet. In Proceedings of the Twelfth Language Resources and Evaluation Conference , pages 2984-2990, Marseille, France. European Language Resources Association.
- Tal Linzen. 2016. Issues in evaluating semantic spaces using word analogies. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP , pages 13-18, Berlin, Germany. Association for Computational Linguistics.
- Hongjing Lu, Ying Nian Wu, and Keith J Holyoak. 2019. Emergence of analogy from relation learning. Proceedings of the National Academy of Sciences , 116(10):4176-4181.
- Mikołaj Małkiński and Jacek Mańdziuk. 2022. Deep learning methods for abstract visual reasoning: A survey on Raven's Progressive Matrices. arXiv preprint arXiv:2201.12382 .
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013a. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems , 26.
- Tomas Mikolov, Wen-tau Yih, and Geoffrey Zweig. 2013b. Linguistic regularities in continuous space word representations. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies , pages 746-751, Atlanta, Georgia. Association for Computational Linguistics.
- Melanie Mitchell. 2021. Abstraction and analogymaking in artificial intelligence. Annals of the New York Academy of Sciences , 1505(1):79-101.
- Victor Vikram Odouard and Melanie Mitchell. 2022. Evaluating understanding on conceptual abstraction benchmarks. In Proceedings of the AI Evaluation Beyond Metrics at IJCAI-ECAI 2022 , Vienna, Austria.
- Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155 .
- Roma Patel and Ellie Pavlick. 2021. Mapping language models to grounded conceptual spaces. In International Conference on Learning Representations .
- John C Raven and JH Court. 1938. Raven's progressive matrices . Western Psychological Services Los Angeles.
- Lynn C Robertson and Marvin R Lamb. 1991. Neuropsychological contributions to theories of part/whole organization. Cognitive Psychology , 23(2):299-330.
- Robyn Speer, Catherine Havasi, and Henry Lieberman. 2008. Analogyspace: Reducing the dimensionality of common sense knowledge. In AAAI , volume 8, pages 548-553.
- Steven Spratley, Krista Ehinger, and Tim Miller. 2020. A closer look at generalisation in RAVEN. In Computer Vision - ECCV 2020: 16th European Conference, Glasgow, UK, August 23-28, 2020, Proceedings, Part XXVII , pages 601-616, Berlin, Heidelberg. Springer-Verlag.
- Oren Sultan and Dafna Shahaf. 2022. Life is a circus and we are the clowns: Automatically finding analogies between situations and processes. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing , Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
- Damien Teney, Peng Wang, Jiewei Cao, Lingqiao Liu, Chunhua Shen, and Anton van den Hengel. 2020. Vprom: A benchmark for visual reasoning using visual progressive matrices. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 34, pages 12071-12078.
- Peter D Turney. 2008. The latent relation mapping engine: Algorithm and experiments. Journal of Artificial Intelligence Research , 33:615-655.
- Peter D Turney, Michael L Littman, Jeffrey Bigham, and Victor Shnayder. 2003. Combining independent modules in lexical multiple-choice problems. Recent Advances in Natural Language Processing III: Selected Papers from RANLP , 2003:101-110.
- Taylor Webb, Keith J Holyoak, and Hongjing Lu. 2022. Emergent analogical reasoning in large language models. arXiv preprint arXiv:2212.09196 .
- Chi Zhang, Feng Gao, Baoxiong Jia, Yixin Zhu, and Song-Chun Zhu. 2019a. RAVEN: A dataset for relational and analogical visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) .
- Chi Zhang, Baoxiong Jia, Feng Gao, Yixin Zhu, HongJing Lu, and Song-Chun Zhu. 2019b. Learning perceptual inference by contrasting. In Advances in Neural Information Processing Systems , volume 32. Curran Associates, Inc.
- Chi Zhang, Baoxiong Jia, Song-Chun Zhu, and Yixin Zhu. 2021. Abstract spatial-temporal reasoning via probabilistic abduction and execution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 9736-9746.
- Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068 .
## A Expanded Results
In Table 3, we present additional results with a wider range of OPT model sizes (Zhang et al., 2022). We observe similar mostly monotonic increases of accuracy with model size.
## B Results and Analysis with I-RAVEN
As the generation strategy for the negative choices in RAVEN can introduce distributional bias that is problematic for supervised learning and leads to artificially high performance (Hu et al., 2021), this could be one reason for PLMs' strong performance on the task even without any complete rows of context. As such, in Table 4 and Figure 13, we include some supplementary analysis on the Impartial-RAVEN (I-RAVEN) dataset from Hu et al., which introduces more variation in negative choices. However, we observe similar performance trends on I-RAVEN: performance mostly increases monotonically with model size and level of abstraction. Further, PLMs again achieve above-chance performance without any rows of context provided, even with no decomposition abstractions. This provides further evidence that RPM, at least formulated in this way, is in part addressed by PLMs' prior knowledge, despite the task's assumption of minimal background knowledge.
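The bias Hu et al. identified can be illustrated with a toy context-blind solver (our construction, not their code): because original RAVEN's distractors are single-attribute perturbations of the answer, the correct choice tends to hold the modal value of each attribute across the choice set, so a majority vote over attribute values beats chance without looking at the matrix at all.

```python
from collections import Counter

def context_blind_guess(choices):
    """choices: list of dicts mapping attribute name -> value.
    Return the index of the choice whose attribute values are most
    frequent across all choices, ignoring the matrix entirely."""
    tallies = {attr: Counter(c[attr] for c in choices) for attr in choices[0]}
    score = lambda c: sum(tallies[attr][val] for attr, val in c.items())
    return max(range(len(choices)), key=lambda i: score(choices[i]))
```

I-RAVEN's regenerated choice sets are designed to break exactly this majority signal.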
<details>
<summary>Image 13 Details</summary>

### Visual Description
## Line Graphs: Model Size vs. Accuracy Across Different Configurations
### Overview
The image contains two line graphs comparing model accuracy against model size (in billion parameters). The left graph focuses on "Naming" accuracy, while the right graph measures "All" accuracy. Both graphs show trends across model sizes (10⁻¹ to 10²) for four configurations: Random (dotted line), 1 Row (pink), 2 Rows (orange), and 3 Rows (teal).
---
### Components/Axes
- **X-axis (Horizontal)**: Model Size (Billion Parameters)
- Logarithmic scale: 10⁻¹, 10⁰, 10¹, 10²
- **Y-axis (Vertical)**: Accuracy (0 to 1)
- **Legends**:
- Top-left placement
- Labels:
- `Random` (dotted black line)
- `1 Row` (pink line with circles)
- `2 Rows` (orange line with circles)
- `3 Rows` (teal line with circles)
---
### Detailed Analysis
#### Left Graph: Accuracy - Naming
- **Trends**:
- All configurations show upward trends as model size increases.
- `3 Rows` (teal) consistently outperforms others.
- `Random` (dotted) remains flat at ~0.1.
- **Data Points (approximate)**:
- **10⁻¹**:
- `Random`: 0.1
- `1 Row`: 0.2
- `2 Rows`: 0.25
- `3 Rows`: 0.25
- **10⁰**:
- `Random`: 0.1
- `1 Row`: 0.22
- `2 Rows`: 0.35
- `3 Rows`: 0.38
- **10¹**:
- `Random`: 0.1
- `1 Row`: 0.28
- `2 Rows`: 0.45
- `3 Rows`: 0.52
- **10²**:
- `Random`: 0.1
- `1 Row`: 0.32
- `2 Rows`: 0.58
- `3 Rows`: 0.68
#### Right Graph: Accuracy - All
- **Trends**:
- All configurations improve with model size, but gaps narrow at larger scales.
- `3 Rows` (teal) maintains the highest accuracy.
- `Random` (dotted) remains flat at ~0.2.
- **Data Points (approximate)**:
- **10⁻¹**:
- `Random`: 0.2
- `1 Row`: 0.4
- `2 Rows`: 0.5
- `3 Rows`: 0.6
- **10⁰**:
- `Random`: 0.2
- `1 Row`: 0.42
- `2 Rows`: 0.7
- `3 Rows`: 0.78
- **10¹**:
- `Random`: 0.2
- `1 Row`: 0.45
- `2 Rows`: 0.75
- `3 Rows`: 0.8
- **10²**:
- `Random`: 0.2
- `1 Row`: 0.48
- `2 Rows`: 0.8
- `3 Rows`: 0.82
---
### Key Observations
1. **Model Size Impact**:
- Larger models (10²) outperform smaller ones across all configurations.
- `3 Rows` achieves the highest accuracy in both graphs (~0.68 for Naming, ~0.82 for All).
2. **Configuration Differences**:
- `3 Rows` consistently outperforms `2 Rows` and `1 Row` by ~10–20% in both metrics.
- The `Random` baseline stays flat (~0.1–0.2 accuracy), confirming the tasks are not trivially guessable.
3. **Diminishing Returns**:
- At 10² model size, accuracy gains between configurations shrink (e.g., `3 Rows` vs. `2 Rows` gap narrows from 0.16 at 10⁰ to 0.02 at 10²).
---
### Interpretation
- **Technical Insight**:
- Increasing model size improves accuracy, but the benefit diminishes at larger scales.
- The `3 Rows` condition supplies the most in-context evidence (two complete rows plus the incomplete third), explaining its superior accuracy.
- **Practical Implications**:
- In resource-constrained scenarios, smaller models with less context (e.g., `1 Row`) may suffice if accuracy requirements are modest.
- The flat `Random` trend confirms that gains come from model scale and prompt content rather than chance.
- **Anomalies**:
- The `Random` line is essentially flat (~0.2) in the "All" graph, consistent with a fixed chance baseline rather than any learning effect.
---
### Language Note
All text in the image is in English. No non-English content is present.
</details>
Model Size (Billion Parameters)
Figure 13: Macro average accuracy over all Impartial-RAVEN sub-tasks as we introduce rows to the matrix during in-context learning, under naming abstractions only (left) and all naming and decomposition abstractions (right). In 1 Row, we include only the incomplete third row.
## C Example Prompts
In Figure 14, we include example prompts for the 2x2Grid , 3x3Grid , L-R , and O-IG sub-tasks under different abstractions. Note that U-D and O-IC are isomorphic to L-R , and therefore share the same prompt format.
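For concreteness, a naming-abstraction prompt can be serialized roughly as follows. This is our illustrative formatting, not the exact template shown in Figure 14:

```python
def serialize_matrix(rows):
    """rows: list of matrix rows, each a list of cell strings; the last
    row is incomplete and the model is asked to continue it."""
    return "\n".join(", ".join(cells) for cells in rows)

prompt = serialize_matrix([
    ["small gray triangle", "medium gray triangle", "large gray triangle"],
    ["small black square", "medium black square", "large black square"],
    ["small white circle", "medium white circle"],  # missing final cell
])
```

Each of the eight answer choices is appended in turn and scored by the PLM; sub-tasks with multiple components (e.g., L-R) serialize each component's entities within the cell.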
<details>
<summary>Image 14 Details</summary>

### Visual Description
## Grid of Shapes: Visual Composition Analysis
### Overview
The image displays a 3x3 grid of squares, each containing distinct geometric shapes and color patterns. No textual labels, numerical data, or legends are present. The arrangement suggests a categorical or symbolic system, with variations in shape, color, and quantity across cells.
### Components/Axes
- **Grid Structure**:
- 3 rows × 3 columns of equal-sized squares.
- No axis titles, scales, or legends.
- **Shape Categories**:
- Circles (solid and outlined), pentagons, triangles, and a single diamond.
- **Color Variations**:
- Gray, light gray, black, and white (background).
### Detailed Analysis
1. **Top Row**:
- **Top-Left**: Three gray circles (solid).
- **Top-Middle**: Two light gray pentagons (outlined).
- **Top-Right**: Empty.
2. **Middle Row**:
- **Middle-Left**: Three circles (two light gray, one gray).
- **Middle-Middle**: Two gray pentagons (outlined).
- **Middle-Right**: Empty.
3. **Bottom Row**:
- **Bottom-Left**: Three large gray circles (solid).
- **Bottom-Middle**: Two small light gray pentagons (outlined).
- **Bottom-Right**: Three black triangles (outlined).
### Key Observations
- **Symmetry**: The grid exhibits partial symmetry in shape repetition (e.g., pentagons in top-middle and middle-middle).
- **Color Coding**: Gray dominates, with light gray and black used for contrast.
- **Progressive Complexity**: Bottom row shapes are larger and bolder than upper rows.
- **Empty Cells**: Top-right and middle-right cells are blank, creating visual imbalance.
### Interpretation
The grid likely represents a symbolic or instructional system, possibly for:
1. **Categorization**: Shapes and colors may denote hierarchical or functional groups.
2. **Puzzle/Exercise**: The empty cells and varying quantities suggest an interactive task (e.g., pattern completion).
3. **Educational Tool**: Could teach geometry, color theory, or spatial reasoning.
No numerical data or explicit relationships are provided. The absence of labels implies the system relies on visual recognition rather than textual explanation.
</details>
Figure 14: Example prompts for 2x2Grid, 3x3Grid, L-R, and O-IG sub-tasks under different abstractions.
Table 3: Performance on RAVEN sub-tasks under our abstractions across a wider set of model sizes. 175B refers to text-davinci-002 while the rest are corresponding OPT models.
| Abstractions | Center | 2x2 | 3x3 | L-R | U-D | O-IC | O-IG | Avg. |
|-----------------------|----------|-------|-------|-------|-------|--------|--------|--------|
| Attr. Naming Only | 0.222 | 0.42 | 0.606 | 0.076 | 0.098 | 0.122 | 0.194 | 0.248 |
| Comp. Decomp. | 0.222 | 0.42 | 0.606 | 0.136 | 0.154 | 0.162 | 0.222 | 0.275 |
| Comp. + Attr. Decomp. | 0.456 | 0.62 | 0.724 | 0.378 | 0.408 | 0.374 | 0.52 | 0.497 |
| Attr. Naming Only | 0.302 | 0.51 | 0.684 | 0.104 | 0.134 | 0.12 | 0.25 | 0.301 |
| Comp. Decomp. | 0.302 | 0.51 | 0.684 | 0.186 | 0.232 | 0.254 | 0.344 | 0.359 |
| Comp. + Attr. Decomp. | 0.436 | 0.588 | 0.788 | 0.28 | 0.346 | 0.29 | 0.408 | 0.448 |
| Attr. Naming Only | 0.472 | 0.584 | 0.71 | 0.146 | 0.158 | 0.2 | 0.322 | 0.37 |
| Comp. Decomp. | 0.472 | 0.584 | 0.71 | 0.41 | 0.426 | 0.434 | 0.494 | 0.504 |
| Comp. + Attr. Decomp. | 0.72 | 0.714 | 0.794 | 0.672 | 0.68 | 0.744 | 0.744 | 0.724 |
| Attr. Naming Only | 0.534 | 0.572 | 0.746 | 0.216 | 0.2 | 0.268 | 0.336 | 0.41 |
| Comp. Decomp. | 0.534 | 0.572 | 0.746 | 0.42 | 0.468 | 0.484 | 0.532 | 0.537 |
| Comp. + Attr. Decomp. | 0.706 | 0.738 | 0.826 | 0.658 | 0.664 | 0.704 | 0.784 | 0.726 |
| Attr. Naming Only | 0.618 | 0.59 | 0.752 | 0.196 | 0.228 | 0.284 | 0.396 | 0.438 |
| Comp. Decomp. | 0.618 | 0.59 | 0.752 | 0.492 | 0.528 | 0.548 | 0.584 | 0.587 |
| Comp. + Attr. Decomp. | 0.704 | 0.75 | 0.826 | 0.682 | 0.69 | 0.748 | 0.834 | 0.748 |
| Attr. Naming Only | 0.644 | 0.61 | 0.754 | 0.22 | 0.268 | 0.358 | 0.452 | 0.472 |
| Comp. Decomp. | 0.644 | 0.61 | 0.754 | 0.566 | 0.602 | 0.586 | 0.576 | 0.62 |
| Comp. + Attr. Decomp. | 0.746 | 0.794 | 0.83 | 0.71 | 0.702 | 0.77 | 0.84 | 0.77 |
| Attr. Naming Only | 0.68 | 0.596 | 0.748 | 0.264 | 0.328 | 0.42 | 0.482 | 0.503 |
| Comp. Decomp. | 0.68 | 0.596 | 0.748 | 0.582 | 0.618 | 0.664 | 0.638 | 0.647 |
| Comp. + Attr. Decomp. | 0.762 | 0.818 | 0.828 | 0.738 | 0.714 | 0.786 | 0.86 | 0.787 |
| Attr. Naming Only | 0.772 | 0.78 | 0.864 | 0.542 | 0.536 | 0.648 | 0.748 | 0.699 |
| Comp. Decomp. | 0.772 | 0.78 | 0.864 | 0.738 | 0.732 | 0.78 | 0.84 | 0.787 |
| Comp. + Attr. Decomp. | 0.8 | 0.878 | 0.932 | 0.776 | 0.78 | 0.828 | 0.926 | 0.846 |
Table 4: Performance on I-RAVEN sub-tasks under our abstractions across different model sizes. 175B refers to text-davinci-002 while the rest are corresponding OPT models.
| Model Size | Abstractions | Center | 2x2 | 3x3 | L-R | U-D | O-IC | O-IG | Avg. |
|------------|-----------------------|--------|-------|-------|-------|-------|-------|-------|-------|
| 125M | Attr. Naming Only | 0.376 | 0.172 | 0.208 | 0.246 | 0.23 | 0.262 | 0.202 | 0.242 |
| | Comp. Decomp. | 0.376 | 0.172 | 0.208 | 0.336 | 0.344 | 0.354 | 0.224 | 0.288 |
| | Comp. + Attr. Decomp. | 0.608 | 0.514 | 0.602 | 0.612 | 0.624 | 0.638 | 0.594 | 0.6 |
| 1.3B | Attr. Naming Only | 0.594 | 0.29 | 0.31 | 0.348 | 0.37 | 0.388 | 0.334 | 0.376 |
| | Comp. Decomp. | 0.594 | 0.29 | 0.31 | 0.586 | 0.574 | 0.618 | 0.466 | 0.491 |
| | Comp. + Attr. Decomp. | 0.81 | 0.676 | 0.73 | 0.822 | 0.802 | 0.882 | 0.818 | 0.791 |
| 13B | Attr. Naming Only | 0.756 | 0.384 | 0.382 | 0.456 | 0.498 | 0.538 | 0.432 | 0.492 |
| | Comp. Decomp. | 0.756 | 0.384 | 0.382 | 0.75 | 0.74 | 0.766 | 0.564 | 0.62 |
| | Comp. + Attr. Decomp. | 0.836 | 0.748 | 0.728 | 0.824 | 0.826 | 0.906 | 0.868 | 0.819 |
| 175B | Attr. Naming Only | 0.808 | 0.564 | 0.566 | 0.656 | 0.676 | 0.818 | 0.714 | 0.686 |
| | Comp. Decomp. | 0.808 | 0.564 | 0.566 | 0.822 | 0.812 | 0.896 | 0.742 | 0.744 |
| | Comp. + Attr. Decomp. | 0.864 | 0.832 | 0.818 | 0.834 | 0.846 | 0.928 | 0.93 | 0.865 |
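The Avg. column in Tables 3 and 4 is the unweighted (macro) mean over the seven sub-tasks. As a quick check, the snippet below reproduces the first Avg. entry of Table 4 (125M, Attr. Naming Only) from its per-sub-task accuracies.

```python
# Macro average: unweighted mean over the seven sub-task accuracies.
# Values taken from the first data row of Table 4 (125M, Attr. Naming Only).
scores = {
    "Center": 0.376, "2x2": 0.172, "3x3": 0.208,
    "L-R": 0.246, "U-D": 0.230, "O-IC": 0.262, "O-IG": 0.202,
}
macro_avg = sum(scores.values()) / len(scores)
print(round(macro_avg, 3))  # 0.242, matching the Avg. column
```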