# From Reasoning to Generalization: Knowledge-Augmented LLMs for ARC Benchmark
## Abstract
Recent reasoning-oriented LLMs have demonstrated strong performance on challenging tasks such as mathematics and science examinations. However, whether they possess core cognitive faculties of human intelligence, such as abstract reasoning and generalization, remains underexplored. To address this, we evaluate recent reasoning-oriented LLMs on the Abstraction and Reasoning Corpus (ARC) benchmark, which explicitly demands both faculties. We formulate ARC as a program synthesis task and propose nine candidate solvers. Experimental results show that repeated-sampling planning-aided code generation (RSPC) achieves the highest test accuracy and demonstrates consistent generalization across most LLMs. To further improve performance, we introduce an ARC solver, Knowledge Augmentation for Abstract Reasoning (KAAR), which encodes core knowledge priors within an ontology that classifies priors into three hierarchical levels based on their dependencies. KAAR progressively expands LLM reasoning capacity by gradually augmenting priors at each level, and invokes RSPC to generate candidate solutions after each augmentation stage. This stage-wise reasoning reduces interference from irrelevant priors and improves LLM performance. Empirical results show that KAAR maintains strong generalization and consistently outperforms non-augmented RSPC across all evaluated LLMs, achieving around 5% absolute gains and up to 64.52% relative improvement. Despite these achievements, ARC remains a challenging benchmark for reasoning-oriented LLMs, highlighting avenues for future progress in LLMs.
## 1 Introduction
Learning from extensive training data has achieved remarkable success in major AI fields such as computer vision, natural language processing, and autonomous driving [1, 2, 3]. However, achieving human-like intelligence goes beyond learning purely from large-scale data; it requires rapid reasoning and generalizing from prior knowledge to novel tasks and situations [4]. Chollet [5] introduced the Abstraction and Reasoning Corpus (ARC) to assess the generalization and abstract reasoning capabilities of AI systems. In each ARC task, the solver is required to infer generalized rules or procedures from a small set of training instances, typically fewer than five input-output image pairs, and apply them to generate output images for the input images provided in the test instances (Figure 1 (a)). Each image in ARC is a pixel grid represented as a 2D matrix, where each value denotes a pixel color (Figure 1 (b)). ARC evaluates broad generalization, encompassing reasoning over individual input-output pairs and inferring generalized solutions via high-level abstraction, akin to inductive reasoning [6].
ARC is grounded in core knowledge priors, which serve as foundational cognitive faculties of human intelligence, enabling equitable comparisons between AI systems and human cognitive abilities [7]. These priors include: (1) objectness – aggregating elements into coherent, persistent objects; (2) geometry and topology – recognizing and manipulating shapes, symmetries, spatial transformations, and structural patterns (e.g., containment, repetition, projection); (3) numbers and counting – counting, sorting, comparing quantities, performing basic arithmetic, and identifying numerical patterns; and (4) goal-directedness – inferring purposeful transformations between initial and final states without explicit temporal cues. Incorporating these priors allows ARC solvers to replicate human cognitive processes, produce behavior aligned with human expectations, address human-relevant problems, and demonstrate human-like intelligence through generalization and abstract reasoning [5]. These features highlight ARC as a crucial benchmark for assessing progress toward general intelligence.
Chollet [5] suggested approaching ARC tasks as instances of program synthesis, which studies automatically generating a program that satisfies a high-level specification [8]. Following this proposal, recent studies [9, 10] have successfully solved a subset of ARC tasks by searching for program solutions encoded within object-centric domain-specific languages (DSLs). Reasoning-oriented LLMs, which integrate chain-of-thought (CoT) reasoning [11] and are often trained via reinforcement learning, have further advanced program synthesis performance. Common approaches using LLMs for code generation include repeated sampling, where multiple candidate programs are generated [12] and the best program is then selected [13, 14, 15, 16], and code refinement, where initial LLM-generated code is iteratively improved using error feedback from execution results [17, 18] or LLM-generated explanations [17, 19, 18]. We note that ARC presents greater challenges than existing program synthesis benchmarks such as HumanEval [12], MBPP [20], and LiveCode [21], due to its stronger emphasis on generalization and abstract reasoning grounded in core knowledge priors, capabilities that remain underexplored in LLMs. This gap motivates our evaluation of recent reasoning-oriented LLMs on the ARC benchmark, and our proposed knowledge augmentation approach to improve their performance.
Figure 1: An ARC problem example (25ff71a9) with image visualizations (a), including three input-output pairs in the training instances, and one input image in the test instance, along with their corresponding 2D matrix representations (b). The ground-truth test output is enclosed in a red box.
We systematically assess how reasoning-oriented LLMs approach ARC tasks within the program synthesis framework. For each ARC problem, we begin by providing 2D matrices as input. We adopt three established program generation strategies: direct generation, repeated sampling, and refinement. Each strategy is evaluated under two solution representations: a text-based solution plan and Python code. When generating code solutions, we further examine two modalities: standalone and planning-aided, where a plan is generated to guide subsequent code development, following recent advances [18, 22, 23]. In total, nine ARC solvers are considered. We evaluate several reasoning-oriented LLMs, including proprietary models, GPT-o3-mini [24, 25], and Gemini-2.0-Flash-Thinking (Gemini-2.0) [26], and open-source models, DeepSeek-R1-Distill-Llama-70B (DeepSeek-R1-70B) [27] and QwQ-32B [28]. Accuracy on test instances is reported as the primary metric. When evaluated on the ARC public evaluation set (400 problems), repeated-sampling planning-aided code generation (RSPC) demonstrates consistent generalization and achieves the highest test accuracy across most LLMs, 30.75% with GPT-o3-mini, 16.75% with Gemini-2.0, 14.25% with QwQ-32B, and 7.75% with DeepSeek-R1-70B. We treat the most competitive ARC solver, RSPC, as the solver backbone.
Motivated by the success of manually defined priors in ARC solvers [9, 10], we propose Knowledge Augmentation for Abstract Reasoning (KAAR) for solving ARC tasks using reasoning-oriented LLMs. KAAR formalizes manually defined priors through a lightweight ontology that organizes priors into hierarchical levels based on their dependencies. It progressively augments LLMs with priors at each level via structured prompting. Specifically, core knowledge priors are introduced in stages: beginning with objectness, followed by geometry and topology, then numbers and counting, and concluding with goal-directedness. After each stage, KAAR applies the ARC solver backbone (RSPC) to generate the solution. This progressive augmentation enables LLMs to gradually expand their reasoning capabilities and facilitates stage-wise reasoning, aligning with human cognitive development [29]. Empirical results show that KAAR improves accuracy on test instances across all evaluated LLMs, achieving the largest absolute gain of 6.75% with QwQ-32B and the highest relative improvement of 64.52% with DeepSeek-R1-70B over non-augmented RSPC.
We outline our contributions as follows:
- We evaluate the abstract reasoning and generalization capabilities of reasoning-oriented LLMs on ARC using nine solvers that differ in generation strategies, modalities, and solution representations.
- We introduce KAAR, a knowledge augmentation approach for solving ARC problems using LLMs. KAAR progressively augments LLMs with core knowledge priors structured via an ontology and applies the best-performing ARC solver after augmenting the priors at each level, further improving performance.
- We conduct a comprehensive performance analysis of the proposed ARC solvers, highlighting failure cases and remaining challenges on the ARC benchmark.
Figure 2: An illustration of the three ARC solution generation approaches, (1) direct generation, (2) repeated sampling, and (3) refinement, with the GPT-o3-mini input and response fragments (a–c) for solving task 25ff71a9 (Figure 1). For each approach, when the solution $s$ is code, $s:=c$, a plan $p$ is either generated from the problem description $Q$ to guide code generation (planning-aided) or omitted (standalone). Otherwise, when $s:=p$, the plan $p$ serves as the final solution instead.
## 2 Problem Formulation
We formulate each ARC task as a tuple $P=⟨I_r,I_t⟩$, where $I_r$ and $I_t$ are sets of training and test instances. Each instance consists of an input-output image pair $(i^i,i^o)$, represented as 2D matrices. The goal is to leverage the LLM $M$ to generate a solution $s$ based on training instances $I_r$ and test input images $\{i^i | (i^i,i^o)∈ I_t\}$, where $s$ maps each test input $i^i$ to its output $i^o$, i.e., $s(i^i)=i^o$ for $(i^i,i^o)∈ I_t$. We note that the test input images are visible during the generation of solution $s$, whereas test output images become accessible only after $s$ is produced, to validate its correctness. We encode the solution $s$ in different forms: as a solution plan $p$, or as Python code $c$, optionally guided by $p$. We denote each ARC problem description, comprising $I_r$ and $\{i^i | (i^i,i^o)∈ I_t\}$, as $Q$.
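This formulation can be sketched in code. The type aliases and `is_solution` helper below are illustrative, not the paper's implementation; the example rule is the shift-down transformation that solves task 25ff71a9 (Figure 2).

```python
# Illustrative sketch of the formulation: a task P = <I_r, I_t> holds training
# and test instances, and a solution s is correct iff s(i^i) == i^o for every
# test instance. All names here are ours, not the paper's.
from typing import Callable, List, Tuple

Image = List[List[int]]            # 2D matrix; each value is a pixel color
Instance = Tuple[Image, Image]     # an input-output image pair (i^i, i^o)

def is_solution(s: Callable[[Image], Image], I_t: List[Instance]) -> bool:
    """Check s(i^i) == i^o for every (i^i, i^o) in I_t."""
    return all(s(i_in) == i_out for i_in, i_out in I_t)

# Example solution for task 25ff71a9 (Figure 2): shift every row down by one,
# filling the top row with the background color 0.
def shift_down(img: Image) -> Image:
    cols = len(img[0])
    return [[0] * cols] + [row.copy() for row in img[:-1]]
```

In practice the same check is first run against $I_r$, since test outputs stay hidden until a solution is produced.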
## 3 ARC Solver Backbone
LLMs have shown promise in solving tasks that rely on ARC-relevant priors [30, 31, 32, 33]. We initially assume that reasoning-oriented LLMs implicitly encode sufficient core knowledge priors to solve ARC tasks. We cast each ARC task as a program synthesis problem, which involves generating a solution $s$ from a problem description $Q$ without explicitly prompting for priors. We consider established LLM-based code generation approaches [17, 18, 19, 23] as candidate ARC solution generation strategies, illustrated at the top of Figure 2. These include: (1) direct generation, where the LLM produces the solution $s$ in a single attempt, and then validates it on test instances $I_t$ ; (2) repeated sampling, where the LLM samples solutions until one passes training instances $I_r$ , and then evaluates it on $I_t$ ; and (3) refinement, where the LLM iteratively refines an initial solution $s$ based on failures on $I_r$ until it succeeds, followed by evaluation on $I_t$ . In addition, we extend the solution representation beyond code to include text-based solution plans. Given the problem description $Q$ as input (Figure 2, block (a)), all strategies prompt the LLM to generate a solution $s$ , represented either as a natural language plan $p$ (block (b)), $s:=p$ , or as a Python code $c$ (block (c)), $s:=c$ . For $s:=p$ , the solution is derived directly from $Q$ . For $s:=c$ , we explore two modalities: the LLM either generates $c$ directly from $Q$ (standalone), or first generates a plan $p$ for $Q$ , which is then concatenated with $Q$ to guide subsequent code development (planning-aided), a strategy widely adopted in recent work [18, 22, 23].
Repeated sampling and refinement iteratively produce new solutions based on the correctness of $s$ on training instances $I_r$ , and validate $s$ on test instances $I_t$ once it passes $I_r$ or the iteration limit is reached. When $s:=p$ , its correctness is evaluated by prompting the LLM to generate each output image $i^o$ given its corresponding input $i^i$ and the solution plan $p$ , where $(i^i,i^o)∈ I_r$ or $(i^i,i^o)∈ I_t$ . Alternatively, when $s:=c$ , its correctness is assessed by executing $c$ on $I_r$ or $I_t$ . In repeated sampling, the LLM iteratively generates a new plan $p$ and code $c$ from the problem description $Q$ without additional feedback. In contrast, refinement revises $p$ and $c$ by prompting the LLM with the previously incorrect $p$ and $c$ , concatenated with failed training instances. In total, nine ARC solvers are employed to evaluate the performance of reasoning-oriented LLMs on the ARC benchmark.
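The repeated-sampling strategy above can be sketched as a short loop. Here `sample_solution` stands in for one LLM call that yields an executable solution (plan-guided code under RSPC); the names and the iteration limit are illustrative, not the paper's API.

```python
# Minimal sketch of repeated sampling: draw fresh candidate solutions until one
# passes the training instances I_r (or the iteration limit is reached), then
# validate the surviving candidate on the test instances I_t.
from typing import Callable, List, Tuple

Image = List[List[int]]
Instance = Tuple[Image, Image]

def passes(s: Callable[[Image], Image], instances: List[Instance]) -> bool:
    return all(s(i_in) == i_out for i_in, i_out in instances)

def repeated_sampling(sample_solution: Callable[[], Callable[[Image], Image]],
                      I_r: List[Instance], I_t: List[Instance],
                      limit: int = 10) -> bool:
    s = None
    for _ in range(limit):
        s = sample_solution()      # one independent sample; no feedback is reused
        if passes(s, I_r):
            break                  # first candidate consistent with training
    return s is not None and passes(s, I_t)   # final check on test instances
```

Refinement differs only in that each new sample is conditioned on the previous incorrect solution and its failed training instances, rather than being drawn independently.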
## 4 Knowledge Augmentation
Xu et al. [34] improved LLM performance on the ARC benchmark by prompting object-based representations for each task derived from graph-based object abstractions. Building on this insight, we propose KAAR, a knowledge augmentation approach for solving ARC tasks using reasoning-oriented LLMs. KAAR leverages Generalized Planning for Abstract Reasoning (GPAR) [10], a state-of-the-art object-centric ARC solver, to extract core knowledge priors. GPAR encodes priors as abstraction-defined nodes enriched with attributes and inter-node relations, which are extracted using standard image processing algorithms. To align with the four knowledge dimensions in ARC, KAAR maps GPAR-derived priors onto these four categories. In detail, KAAR adopts fundamental abstraction methods from GPAR to enable objectness. Objects are typically defined as components based on adjacency rules and color consistency (e.g., 4-connected or 8-connected components), while also including the entire image as a component. KAAR further introduces additional abstractions: (1) middle-vertical, which vertically splits the image into two equal parts and treats each as a distinct component; (2) middle-horizontal, which applies the same principle along the horizontal axis; (3) multi-lines, which segments the image using full-length rows or columns of uniform color and treats each resulting part as a distinct component; and (4) no abstraction, which considers only raw 2D matrices. Under no abstraction, KAAR degrades to the ARC solver backbone without incorporating any priors. KAAR inherits GPAR’s geometric and topological priors, including component attributes (size, color, shape) and relations (spatial, congruent, inclusive). It further extends the attribute set with symmetry, bounding box, nearest boundary, and hole count, and augments the relation set with touching.
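As a concrete sketch of the objectness abstraction just described, the following extracts 4-connected, color-consistent components from a 2D grid via flood fill. Treating color 0 as background and the dictionary layout are our assumptions for illustration, not GPAR's actual representation.

```python
# Hedged sketch of the 4-connected objectness abstraction: BFS flood fill that
# groups adjacent same-color, non-background cells into components.
from collections import deque
from typing import List, Set, Tuple

def components_4connected(grid: List[List[int]], background: int = 0):
    """Return a list of {"color", "cells"} components (background excluded)."""
    rows, cols = len(grid), len(grid[0])
    seen: Set[Tuple[int, int]] = set()
    comps = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == background or (r, c) in seen:
                continue
            color, comp, queue = grid[r][c], [], deque([(r, c)])
            seen.add((r, c))
            while queue:                      # BFS over same-color neighbors
                y, x = queue.popleft()
                comp.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen and grid[ny][nx] == color):
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            comps.append({"color": color, "cells": comp})
    return comps
```

The 8-connected variant simply extends the neighbor offsets with the four diagonals.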
For numeric and counting priors, KAAR follows GPAR, incorporating the largest/smallest component sizes, and the most/least frequent component colors, while extending them with statistical analysis of hole counts and symmetry, as well as the most/least frequent sizes and shapes.
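KAAR's stage-wise augmentation over these prior categories can be sketched as a loop that enriches the prompt level by level and re-invokes the RSPC backbone after each stage. `describe_priors` and `rspc_solve` are hypothetical stand-ins for the prompting and solver components, and stopping at the first candidate that passes the training instances is an assumption of this sketch.

```python
# Hedged sketch of KAAR's progressive augmentation: priors are appended level
# by level, and the RSPC backbone runs after each stage. `describe_priors` and
# `rspc_solve` are illustrative stand-ins, not the paper's API.
PRIOR_LEVELS = [
    "objectness",
    "geometry_and_topology",
    "numbers_and_counting",
    "goal_directedness",
]

def kaar(problem, describe_priors, rspc_solve):
    context = problem                          # start from the bare description Q
    for level in PRIOR_LEVELS:
        context += "\n" + describe_priors(problem, level)  # augment this level
        solution = rspc_solve(context)         # re-run the backbone per stage
        if solution is not None:               # assumed stop: first passing candidate
            return solution
    return None
```

Introducing only one level of priors per stage is what limits interference from priors irrelevant to the task at hand.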
<details>
<summary>x3.png Details</summary>

The figure (x3.png) illustrates KAAR's goal-directedness prompting in three sequential stages: (a) action(s) selection, which determines the task category (e.g., "This task involves color change."); (b) component(s) selection, which identifies the target components (e.g., the color-0 components with the minimum and maximum sizes); and (c) the color change rule, which maps source to target colors (e.g., recolor the minimum-size component to 7 and the maximum-size component to 8). Stages (b) and (c) together constitute the action schema.
* **Notable absence:** The diagram is a template or example. It does not show the actual "predefined categories" from step (a) or the context that defines "minimum" and "maximum" size. It illustrates the *process* of rule generation, not the application to a specific dataset.
</details>
Figure 3: An example of goal-directedness priors augmentation in KAAR, with input and response fragments from GPT-o3-mini.
GPAR approaches goal-directedness priors by searching for a sequence of program instructions [35] defined in a DSL, where each instruction supports conditionals, branching, looping, and action statements. KAAR incorporates the condition and action concepts from GPAR and enables goal-directedness priors by augmenting LLM knowledge in two steps: 1) it prompts the LLM to identify the most relevant actions for solving the given ARC problem from ten predefined action categories (Figure 3, block (a)), such as color change, movement, and extension, which are partially derived from GPAR and extended based on the training set; 2) for each selected action, KAAR prompts the LLM with the associated schema to resolve implementation details. For example, for a color change action, KAAR first prompts the LLM to identify the target components (Figure 3, block (b)), and then to specify the source and target colors for modification based on those components (Figure 3, block (c)). KAAR also prompts the LLM to incorporate condition-aware reasoning when determining action implementation details, using knowledge derived from the geometry, topology, and numbers and counting priors. This enables fine-grained control, for example, applying color changes to black components conditioned on their size: from black (value 0) to blue (value 8) if largest, or to orange (value 7) if smallest. Figure 3 shows fragments of the goal-directedness priors augmentation. See Appendix A.2 for the full set of priors in KAAR.
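The two-step prompting above can be sketched as follows. The schema questions, and the action categories beyond the three named in the text (color change, movement, extension), are illustrative placeholders, not the paper's actual prompts.

```python
# Sketch of KAAR's two-step goal-directedness augmentation.
# Categories after the first three are hypothetical fillers.
ACTION_CATEGORIES = [
    "color change", "movement", "extension",        # named in the text
    "rotation", "flipping", "scaling", "addition",
    "removal", "completion", "copying",             # illustrative only
]

# Per-action schema questions (wording is an assumption, not the paper's).
SCHEMAS = {
    "color change": [
        "Which components require color change?",
        "Which source color maps to which target color, and under what condition?",
    ],
    "movement": [
        "Which components move?",
        "In which direction and by how many pixels, and under what condition?",
    ],
}

def goal_directedness_prompts(selected_actions):
    """Step 2: for each action the LLM selected in step 1, emit its
    schema questions so the LLM can resolve implementation details."""
    return [(action, question)
            for action in selected_actions
            for question in SCHEMAS.get(action, [])]

# Example: if step 1 selected only "color change", step 2 issues its
# two schema questions in order.
prompts = goal_directedness_prompts(["color change"])
```

The condition-aware part of the schema (e.g., "under what condition?") is where knowledge from the geometry, topology, and counting priors feeds back into the action details.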
<details>
<summary>x4.png Details</summary>

### Visual Description
## Composite Diagram: KAAR Augmentation Process for ARC Tasks
### Overview
The image is a composite technical figure illustrating the KAAR (Knowledge Augmentation for Abstract Reasoning) process for solving ARC (Abstraction and Reasoning Corpus) tasks. It consists of five labeled sub-figures: (a) an example ARC task, (b) a flowchart of the KAAR augmentation process, and three explanatory text boxes (c, d, e) detailing specific reasoning components. The overall purpose is to demonstrate how an AI system decomposes and analyzes visual reasoning problems.
### Components/Axes
The image is segmented into distinct regions:
1. **Top-Left (a) ARC example:** Shows a visual reasoning problem.
* **Input Grid (Top-Left):** A 10x10 grid with black (value 0) and gray (value 5) pixels forming a pattern.
* **Output Grid (Top-Right):** The same grid with modifications. Some gray pixels are changed to light blue, and one pixel is changed to orange.
* **Test Input (Bottom-Left):** A new 10x10 grid with a different black and gray pattern.
* **Question Mark (Bottom-Right):** A box with a "?", indicating the goal is to predict the correct output for the test input.
2. **Top-Right (b) Augmentation process in KAAR:** A flowchart diagram.
* **Starting Point:** A pink circle labeled "Q" (Query).
* **Reasoning Modules (Top Ovals):** Four blue ovals connected to the process flow, representing different reasoning skills:
* "Objectness"
* "Geometry and Topology"
* "Numbers and Counting"
* "Goal-directedness"
* **Process Flow:** The query "Q" feeds into a series of "ARC solver backbone" blocks (yellow rectangles). The flow is sequential.
* **Decision Points:** After each "ARC solver backbone," there is a decision diamond.
* **Input:** "fail on Iᵣ" (where Iᵣ denotes the training instances).
* **Output Paths:**
* "Pass Iᵣ" leads to a green diamond labeled "Iₜ" (the test instances).
* The "fail" path continues to the next "ARC solver backbone."
* **Spatial Layout:** The flowchart progresses from left to right. The reasoning ovals are positioned above the main flow, connected by arrows pointing downward to the solver backbones.
3. **Bottom Row (c, d, e):** Three light blue text boxes with dashed borders, each explaining a reasoning component from the flowchart.
* **(c) Objectness:** Text describing component analysis based on 4-connected black pixels.
* **(d) Geometry and Topology:** Text describing spatial relationships and shape properties of components.
* **(e) Numbers and Counting:** Text describing statistical analysis of component sizes and frequencies.
### Detailed Analysis
**Sub-figure (a) - ARC Example:**
* The input grid contains a complex, non-uniform pattern of black and gray pixels.
* The output grid shows a transformation where a contiguous region of gray pixels in the bottom-right quadrant is changed to light blue. Additionally, a single pixel near the top-left is changed from gray to orange.
* The test input presents a new pattern, and the system must infer the transformation rule to produce the correct output.
**Sub-figure (b) - KAAR Augmentation Process Flowchart:**
* The process is iterative. A query (Q) is processed by an initial ARC solver backbone.
* If this solver fails on the reference input (Iᵣ), the process passes to a second backbone, and then potentially a third.
* Each backbone is augmented or guided by one of the four reasoning modules (Objectness, Geometry and Topology, Numbers and Counting, Goal-directedness), as indicated by the arrows from the ovals.
* The goal at each stage is to "Pass Iᵣ" and produce the target output Iₜ.
**Text Box (c) - Objectness:**
* **Language:** English.
* **Transcription:** "When we consider 4-connected black pixels (value 0) as components, the components in each input and output image are as follows: For Training Pair 1 input image: Component 1: Locations=[(0,0), (0,1)] ... Component 8: Locations=[(4, 14)] ..."
* **Key Detail:** It defines "components" as groups of 4-connected black pixels and lists their specific grid coordinates. The text "4-connected black pixels (value 0)" and the coordinate lists are highlighted in red.
**Text Box (d) - Geometry and Topology:**
* **Language:** English.
* **Transcription:** "For Training Pair 1 input image: For component 1: Shape: horizontal line. Different/Identical: Component 1 is different from ALL OTHERS! ... Component 1 is not touching with Component 2. Component 1 is at top-left of Component 2, and Component 2 is at bottom-right of Component 1."
* **Key Detail:** It analyzes the shape ("horizontal line") and spatial relationships ("not touching," "top-left," "bottom-right") between components. The terms "Different/Identical," "different from ALL OTHERS!," "not touching," "top-left," and "bottom-right" are highlighted in red.
**Text Box (e) - Numbers and Counting:**
* **Language:** English.
* **Transcription:** "For Training Pair 1 input image: component 5, with the maximum size 10. component 8, with the minimum size 1. ... There are two components, 4 and 6, each of size 7, which appear most frequently (twice)."
* **Key Detail:** It performs statistical analysis on component sizes, identifying the maximum size (10), minimum size (1), and the most frequent size (7, appearing twice). The phrases "maximum size 10," "minimum size 1," and "most frequently (twice)" are highlighted in red.
### Key Observations
1. **Modular Reasoning:** The KAAR process explicitly breaks down the complex ARC reasoning task into four distinct, interpretable modules (Objectness, Geometry, Numbers, Goal-directedness).
2. **Iterative Refinement:** The flowchart shows a cascade of solver backbones, suggesting a fallback or refinement strategy where failure at one stage triggers a more specialized analysis.
3. **Component-Centric Analysis:** The detailed text boxes reveal that the system's core strategy is to first identify discrete "components" (connected groups of pixels) and then analyze their properties (location, shape, size, relationships) rather than processing the grid as a whole.
4. **Emphasis on Contrast:** The red-highlighted text in the explanations focuses on comparative and relational properties: "different from," "not touching," "top-left of," "maximum," "minimum," "most frequently." This suggests the system learns by contrasting elements within the input.
### Interpretation
This diagram illustrates a knowledge-augmented approach to visual reasoning. The "ARC solver backbone" is the RSPC code-generation solver driven by a reasoning LLM, while the four reasoning modules (Objectness, Geometry and Topology, Numbers and Counting, Goal-directedness) represent structured core-knowledge priors that progressively augment the LLM's context.
The data suggests that solving ARC-like tasks requires more than pattern recognition; it requires **explicit decomposition** of the visual scene into objects and the **systematic analysis** of their attributes and relationships. The KAAR framework operationalizes this by:
1. **Parsing** the input into components (Objectness).
2. **Characterizing** each component's intrinsic properties (Geometry - shape) and extrinsic properties (Topology - spatial relations).
3. **Quantifying** the scene through statistics (Numbers and Counting).
4. **Directing** the process toward a solution (Goal-directedness).
The red highlights act as a "paper trail" for the system's reasoning, showing which specific comparative facts it extracted to inform its decision. The overall process moves from raw pixels to components, then to relational and statistical facts, and finally to a transformed output, mimicking a human-like analytical approach to abstract problem-solving. The presence of multiple solver backbones implies that different reasoning strategies may be needed for different types of ARC problems, and the system attempts them in sequence.
</details>
Figure 4: Augmentation process in KAAR (block (b)) and the corresponding knowledge augmentation fragments (blocks (c-e)) for ARC problem 62ab2642 (block (a)).
KAAR encodes the full set of core knowledge priors assumed in ARC into an ontology, where priors are organized into three hierarchical levels based on their dependencies. KAAR prompts LLMs with priors at each level to enable incremental augmentation. This reduces context interference and supports stage-wise reasoning aligned with human cognitive development [29]. Figure 4, block (b), illustrates the augmentation process in KAAR alongside the augmented prior fragments used to solve the problem shown in block (a). KAAR begins augmentation with objectness priors, encoding images into components with detailed coordinates based on a specific abstraction method (block (c)). KAAR then prompts geometry and topology priors (block (d)), followed by numbers and counting priors (block (e)). These priors are ordered by dependency while residing at the same ontological level, as they all build upon objectness. Finally, KAAR augments goal-directedness priors, as shown in Figure 3, where target components are derived from objectness analysis and conditions are inferred from geometric, topological, and numerical analyses. After augmenting each level of priors, KAAR invokes the ARC solver backbone to generate solutions. If any solution passes training instances $I_r$ , it is validated on the test instances $I_t$ ; otherwise, augmentation proceeds to the next level of priors.
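The level-by-level augmentation loop can be sketched as follows, assuming a `backbone` callable (a stand-in for RSPC) and a `passes` predicate that checks a candidate against the training instances; both are hypothetical placeholders for the paper's actual components.

```python
def kaar_solve(task, backbone, passes, prior_levels, iters_per_level=4):
    """Stage-wise prior augmentation (a sketch): add one ontology level of
    priors at a time, re-invoking the solver backbone after each
    augmentation, and stop at the first candidate that passes all
    training instances I_r."""
    context = [task]
    last = None
    for priors in prior_levels:
        context.append(priors)                 # incremental augmentation
        for _ in range(iters_per_level):
            last = backbone(tuple(context))    # generate a candidate solution
            if passes(last, task):             # check against I_r
                return last                    # validated on I_t downstream
    return last                                # fallback: last candidate

# Toy usage with stand-in components (hypothetical, for illustration):
levels = ["objectness", "geometry+topology+counting", "goal-directedness"]
backbone = lambda ctx: "solution using " + ctx[-1]
passes = lambda sol, task: "goal-directedness" in sol
result = kaar_solve("task", backbone, passes, levels)
# result == "solution using goal-directedness"
```

The early return mirrors the text: once a solution passes $I_r$, KAAR stops augmenting and validates on $I_t$; otherwise it proceeds to the next level of priors.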
While the ontology provides a hierarchical representation of priors, it may also introduce hallucinations, such as duplicate abstractions, irrelevant component attributes or relations, and inapplicable actions. To address this, KAAR integrates restrictions from GPAR to filter out inapplicable priors. KAAR adopts GPAR’s duplicate-checking strategy, retaining only abstractions that yield distinct components by size, color, or shape, in at least one training instance. In KAAR, each abstraction is associated with a set of applicable priors. For instance, when the entire image is treated as a component, relation priors are excluded, and actions such as movement and color change are omitted, whereas symmetry and size attributes are retained and actions such as flipping and rotation are considered. In contrast, 4-connected and 8-connected abstractions include all component attributes and relations, and the full set of ten action priors. See Appendix A.3 for detailed restrictions.
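A minimal sketch of this filtering step, assuming components are summarized by hypothetical `size`/`color`/`shape` attributes:

```python
def filter_abstractions(train_inputs, abstractions, applicable_priors):
    """Sketch of the restriction step: keep an abstraction only if it yields
    components distinguishable by size, color, or shape in at least one
    training input, then pair each survivor with its applicable priors."""
    kept = {}
    for name, abstract in abstractions.items():
        for grid in train_inputs:
            signatures = {(c["size"], c["color"], c["shape"])
                          for c in abstract(grid)}
            if len(signatures) > 1:      # components are not all duplicates
                kept[name] = applicable_priors[name]
                break
    return kept

# Toy components (hypothetical): 8-connected merging yields duplicates here,
# so that abstraction is dropped by the duplicate check.
comps_4 = [{"size": 2, "color": 1, "shape": "line"},
           {"size": 5, "color": 3, "shape": "square"}]
comps_8 = [{"size": 2, "color": 1, "shape": "line"},
           {"size": 2, "color": 1, "shape": "line"}]
kept = filter_abstractions(
    [None],
    {"4-connected": lambda g: comps_4, "8-connected": lambda g: comps_8},
    {"4-connected": {"relations", "movement"}, "8-connected": {"relations"}},
)
```

Associating each surviving abstraction with its own prior set is what lets KAAR exclude, say, movement actions when the whole image is treated as a single component.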
| LLM | Metric | Direct P | Direct C | Direct PC | Sampling P | Sampling C | Sampling PC | Refinement P | Refinement C | Refinement PC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-o3-mini | $I_r$ | - | - | - | 35.50 | **52.50** | 35.50 | 31.00 | 47.25 | 32.00 |
| | $I_t$ | 20.50 | 24.50 | 22.25 | 23.75 | **32.50** | 30.75 | 24.75 | 29.25 | 25.75 |
| | $I_r\&I_t$ | - | - | - | 22.00 | **31.75** | 29.25 | 21.75 | 28.50 | 25.00 |
| Gemini-2.0 | $I_r$ | - | - | - | 36.50 | **39.50** | 21.50 | 15.50 | 25.50 | 15.50 |
| | $I_t$ | 7.00 | 6.75 | 6.25 | 10.00 | 14.75 | **16.75** | 8.75 | 12.00 | 11.75 |
| | $I_r\&I_t$ | - | - | - | 9.50 | 14.25 | **16.50** | 8.00 | 10.50 | 10.75 |
| QwQ-32B | $I_r$ | - | - | - | **19.25** | 13.50 | 15.25 | 16.75 | 15.00 | 14.25 |
| | $I_t$ | 9.50 | 7.25 | 5.75 | 11.25 | 13.50 | **14.25** | 11.00 | **14.25** | 14.00 |
| | $I_r\&I_t$ | - | - | - | 9.25 | 12.75 | **13.00** | 8.75 | **13.00** | 11.75 |
| DeepSeek-R1-70B | $I_r$ | - | - | - | **8.75** | 6.75 | 7.75 | 6.25 | 5.75 | 7.75 |
| | $I_t$ | 4.25 | 4.75 | 4.50 | 4.25 | 7.25 | **7.75** | 4.75 | 5.75 | **7.75** |
| | $I_r\&I_t$ | - | - | - | 3.50 | 6.50 | **7.25** | 4.25 | 5.25 | 7.00 |

Table 1: Performance of nine ARC solvers measured by accuracy on $I_r$, $I_t$, and $I_r\&I_t$ using four reasoning-oriented LLMs. For each LLM, the highest accuracy on each metric is in bold (ties are both bolded). Accuracy is reported as a percentage. P denotes the solution plan; C and PC refer to standalone and planning-aided code generation, respectively; the column groups correspond to direct generation, repeated sampling, and refinement.
## 5 Experiments
In ARC, each task is unique and solvable using only core knowledge priors [5]. We begin by comparing nine candidate solvers on the full ARC public evaluation set of 400 tasks. This offers broader insights than previous studies limited to subsets of the 400 training tasks [10, 9, 36], given the greater difficulty of the evaluation set [37]. We experiment with recent reasoning-oriented LLMs, including the proprietary models GPT-o3-mini and Gemini 2.0 Flash-Thinking (Gemini-2.0), and the open-source models DeepSeek-R1-Distill-Llama-70B (DeepSeek-R1-70B) and QwQ-32B. We compute accuracy on test instances $I_t$ as the primary evaluation metric: the proportion of problems whose first solution to pass the training instances $I_r$ also solves $I_t$; if no solution passes $I_r$ within 12 iterations, the last solution is evaluated on $I_t$. This protocol applies to both repeated sampling and refinement. We also report accuracy on $I_r$ and $I_r\&I_t$, measuring the percentage of problems whose solutions solve $I_r$ and both $I_r$ and $I_t$, respectively. See Appendix A.4 for parameter settings.
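The evaluation protocol above can be sketched as follows, representing each task's candidates as hypothetical `(passes_Ir, passes_It)` pairs in generation order:

```python
def accuracy_on_test(runs):
    """Sketch of the evaluation protocol: for each task, score the first
    candidate that passes the training instances I_r on the test
    instances I_t; if none passes within the iteration budget, score the
    final candidate on I_t instead."""
    solved = 0
    for candidates in runs:  # one list of (passes_Ir, passes_It) per task
        chosen = next((c for c in candidates if c[0]), candidates[-1])
        solved += chosen[1]
    return 100.0 * solved / len(runs)

# Toy run: task 1 is solved on the 2nd attempt; task 2 never passes I_r
# and its last candidate also fails I_t.
runs = [[(False, False), (True, True)],
        [(False, False), (False, False)]]
# accuracy_on_test(runs) -> 50.0
```

Accuracy on $I_r$ and $I_r\&I_t$ would be computed analogously by scoring `chosen[0]` and `chosen[0] and chosen[1]`.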
Table 1 reports the performance of nine ARC solvers across four reasoning-oriented LLMs. For direct generation methods, accuracy on $I_r$ and $I_r\&I_t$ is omitted, as solutions are evaluated directly on $I_t$ . GPT-o3-mini consistently outperforms all other LLMs, achieving the highest accuracy on $I_r$ (52.50%), $I_t$ (32.50%), and $I_r\&I_t$ (31.75%) under repeated sampling with standalone code generation (C), highlighting its strong abstract reasoning and generalization capabilities. Notably, QwQ-32B, the smallest model, outperforms DeepSeek-R1-70B across all solvers and surpasses Gemini-2.0 under refinement. Among the nine ARC solvers, repeated sampling-based methods generally outperform those based on direct generation or refinement. This diverges from previous findings where refinement dominated conventional code generation tasks that lack abstract reasoning and generalization demands [10, 17, 19]. Within repeated sampling, planning-aided code generation (PC) yields the highest accuracy on $I_t$ across most LLMs. It also demonstrates the strongest generalization with GPT-o3-mini and Gemini-2.0, as evidenced by the smallest accuracy gap between $I_r$ and $I_r\&I_t$ , compared to solution plan (P) and standalone code generation (C). A similar trend is observed for QwQ-32B and DeepSeek-R1-70B, where both C and PC generalize effectively across repeated sampling and refinement. Overall, repeated sampling with planning-aided code generation, denoted as RSPC, shows the best performance and thus serves as the ARC solver backbone.
| LLM | Solver | $I_r$ Acc | $Δ$ | $γ$ | $I_t$ Acc | $Δ$ | $γ$ | $I_r\&I_t$ Acc | $Δ$ | $γ$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-o3-mini | RSPC | 35.50 | - | - | 30.75 | - | - | 29.25 | - | - |
| | KAAR | **40.00** | 4.50 | 12.68 | **35.00** | 4.25 | 13.82 | **33.00** | 3.75 | 12.82 |
| Gemini-2.0 | RSPC | 21.50 | - | - | 16.75 | - | - | 16.50 | - | - |
| | KAAR | 25.75 | 4.25 | 19.77 | 21.75 | 5.00 | 29.85 | 20.50 | 4.00 | 24.24 |
| QwQ-32B | RSPC | 15.25 | - | - | 14.25 | - | - | 13.00 | - | - |
| | KAAR | 22.25 | 7.00 | 45.90 | 21.00 | 6.75 | 47.37 | 19.25 | 6.25 | 48.08 |
| DeepSeek-R1-70B | RSPC | 7.75 | - | - | 7.75 | - | - | 7.25 | - | - |
| | KAAR | 12.25 | 4.50 | 58.06 | 12.75 | 5.00 | 64.52 | 11.50 | 4.25 | 58.62 |

Table 2: Comparison of RSPC (repeated-sampling planning-aided code generation) and its knowledge-augmented variant, KAAR, in terms of accuracy (Acc) on $I_r$, $I_t$, and $I_r\&I_t$. $Δ$ and $γ$ denote the absolute and relative improvements over RSPC, respectively. All values are reported as percentages. The best result on each metric is in bold.
We further compare the performance of RSPC with its knowledge-augmented variant, KAAR. For each task, KAAR begins with simpler abstractions, i.e., no abstraction and the whole image treated as a single component, and progresses to the more complex 4-connected and 8-connected abstractions, consistent with GPAR. KAAR reports the accuracy on test instances $I_t$ based on the first abstraction whose solution solves all training instances $I_r$; otherwise, it records the final solution from each abstraction and selects the one that passes the most $I_r$ to evaluate on $I_t$. KAAR allows the solver backbone (RSPC) up to 4 iterations per invocation, totaling 12 iterations, consistent with the non-augmented setting. See Appendix A.5 for KAAR execution details. As shown in Table 2, KAAR consistently outperforms non-augmented RSPC across all LLMs, yielding around 5% absolute gains on $I_r$, $I_t$, and $I_r\&I_t$. This highlights the effectiveness and model-agnostic nature of the augmented priors. KAAR achieves the highest accuracy using GPT-o3-mini, with 40% on $I_r$, 35% on $I_t$, and 33% on $I_r\&I_t$. KAAR shows the greatest absolute improvements ($Δ$) using QwQ-32B and the largest relative gains ($γ$) using DeepSeek-R1-70B across all evaluated metrics. Moreover, KAAR maintains generalization comparable to RSPC across all LLMs, indicating that the augmented priors are sufficiently abstract and expressive to serve as basis functions for reasoning, in line with ARC assumptions.
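KAAR's fallback selection across abstractions can be sketched as follows; the abstraction names and pass counts are illustrative, and the insertion order of the dict is assumed to follow the simple-to-complex abstraction order.

```python
def select_solution(per_abstraction):
    """Sketch of KAAR's selection rule: return the first abstraction whose
    final solution passes all training instances; otherwise pick the
    abstraction whose final solution passes the most training instances.
    Relies on dict insertion order (Python 3.7+) matching the
    simple-to-complex abstraction order."""
    # per_abstraction: {name: (n_train_passed, n_train_total, solution)}
    for name, (passed, total, sol) in per_abstraction.items():
        if passed == total:
            return name, sol             # first full pass wins
    best = max(per_abstraction, key=lambda n: per_abstraction[n][0])
    return best, per_abstraction[best][2]

# Toy results (hypothetical): no abstraction passes all 3 training
# instances, so the best partial pass (4-connected, 2/3) is selected.
results = {"no-abstraction": (1, 3, "sol0"),
           "whole-image": (0, 3, "sol1"),
           "4-connected": (2, 3, "sol2")}
# select_solution(results) -> ("4-connected", "sol2")
```

Only the selected solution is then evaluated on the test instances $I_t$, matching the protocol described above.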
<details>
<summary>x5.png Details</summary>

### Visual Description
## Heatmap Comparison: RSPC vs. KAAR Model Coverage
### Overview
The image displays two side-by-side heatmaps comparing the "Coverage" metric between four different AI models. The heatmaps are labeled (a) RSPC and (b) KAAR. Each heatmap is a 4x4 matrix where the rows and columns represent the same set of models, and the cell values indicate a coverage score between 0.0 and 1.0. A color legend on the right maps the numerical values to a color gradient from light beige (0.0) to dark red (1.0).
### Components/Axes
* **Chart Type:** Two comparative heatmaps.
* **Titles/Labels:**
* Left heatmap label: `(a) RSPC`
* Right heatmap label: `(b) KAAR`
* Color scale legend (positioned vertically on the far right): Labeled `Coverage` with markers at `0.0`, `0.5`, and `1.0`.
* **Axes (Identical for both heatmaps):**
* **X-axis (Top):** Model names, listed left to right: `GPT-o3-mini`, `Gemini-2.0`, `QwQ-32B`, `DeepSeek-R1-70B`.
* **Y-axis (Left):** Model names, listed top to bottom: `GPT-o3-mini`, `Gemini-2.0`, `QwQ-32B`, `DeepSeek-R1-70B`.
* **Data Structure:** Each cell contains a numerical value representing the coverage score of the row model with respect to the column model.
### Detailed Analysis
**Matrix (a) RSPC - Coverage Values:**
| Row \ Column | GPT-o3-mini | Gemini-2.0 | QwQ-32B | DeepSeek-R1-70B |
| :--- | :--- | :--- | :--- | :--- |
| **GPT-o3-mini** | 1.00 | 0.50 | 0.40 | 0.22 |
| **Gemini-2.0** | 0.91 | 1.00 | 0.60 | 0.40 |
| **QwQ-32B** | 0.86 | 0.70 | 1.00 | 0.44 |
| **DeepSeek-R1-70B** | 0.87 | 0.87 | 0.81 | 1.00 |
**Matrix (b) KAAR - Coverage Values:**
| Row \ Column | GPT-o3-mini | Gemini-2.0 | QwQ-32B | DeepSeek-R1-70B |
| :--- | :--- | :--- | :--- | :--- |
| **GPT-o3-mini** | 1.00 | 0.55 | 0.54 | 0.34 |
| **Gemini-2.0** | 0.89 | 1.00 | 0.72 | 0.48 |
| **QwQ-32B** | 0.88 | 0.74 | 1.00 | 0.53 |
| **DeepSeek-R1-70B** | 0.92 | 0.82 | 0.88 | 1.00 |
**Trend Verification:**
* **Diagonal Trend:** In both matrices, the diagonal cells (where row and column model are identical) have a value of `1.00`, indicated by the darkest red. This represents perfect self-coverage.
* **Asymmetry Trend:** The matrices are not symmetric. For example, in RSPC, Gemini-2.0 covers `0.50` of the problems solved by GPT-o3-mini, while GPT-o3-mini covers `0.91` of the problems solved by Gemini-2.0.
* **Cross-Model Trend:** Values generally decrease as models become more dissimilar (e.g., GPT-o3-mini vs. DeepSeek-R1-70B has the lowest scores in both charts).
* **Comparison Trend (RSPC vs. KAAR):** For nearly every off-diagonal cell, the value in the KAAR matrix is higher than its counterpart in the RSPC matrix. This indicates a systematic increase in coverage scores under the KAAR metric.
### Key Observations
1. **Highest Asymmetry:** The largest disparity between reciprocal scores is between GPT-o3-mini and DeepSeek-R1-70B. In RSPC, DeepSeek-R1-70B covers only `0.22` of the problems solved by GPT-o3-mini, while GPT-o3-mini covers `0.87` of those solved by DeepSeek-R1-70B.
2. **Most Improved (KAAR vs. RSPC):** The coverage of DeepSeek-R1-70B's solved problems by QwQ-32B shows a significant increase from `0.81` (RSPC) to `0.88` (KAAR). The coverage of GPT-o3-mini's solved problems by QwQ-32B increases from `0.40` to `0.54`.
3. **Most Covered Model:** DeepSeek-R1-70B's row values are consistently high, never dropping below `0.81` in RSPC and `0.82` in KAAR, meaning nearly all problems it solves are also solved by the other models.
4. **Most Distinctive Model:** GPT-o3-mini's row contains the lowest off-diagonal values, particularly against DeepSeek-R1-70B (`0.22` and `0.34`), meaning many problems it solves are not solved by the others.
### Interpretation
This visualization compares two solver configurations (RSPC and its knowledge-augmented variant, KAAR) in terms of how well the set of ARC problems solved by one model is covered by another. The data suggests the following:
* **KAAR Increases Overlap:** The systematic increase in scores from (a) to (b) indicates that knowledge augmentation leads the models to solve more shared problems, raising cross-model coverage.
* **Model Relationships are Asymmetric:** The non-identical off-diagonal values are a critical finding. They demonstrate that the relationship between models is not mutual. One model may be very good at replicating or covering the outputs of another (high score), while the reverse is not true (low score). This has implications for model benchmarking and understanding hierarchical capabilities.
* **GPT-o3-mini is the Broadest Solver:** Its consistently high column values indicate that it solves most of the problems solved by the other models, while its own solved set is only partially covered in return. Conversely, DeepSeek-R1-70B's solved problems are almost entirely subsumed by the other models, reflecting its smaller solved set.
* **The Metric Quantifies Model Similarity/Dissimilarity:** The heatmap acts as a similarity matrix. The low scores between GPT-o3-mini and DeepSeek-R1-70B suggest they are the most dissimilar pair in this set, while higher scores (e.g., between QwQ-32B and DeepSeek-R1-70B) suggest greater overlap in their output distributions or capabilities as measured by these metrics.
</details>
Figure 5: Asymmetric relative coverage matrices for RSPC (a) and KAAR (b), showing the proportion of problems whose test instances are solved by the row model that are also solved by the column model, across four LLMs.
We compare relative problem coverage across evaluated LLMs under RSPC and KAAR based on successful solutions on test instances. As shown in Figure 5, each cell $(i,j)$ represents the proportion of problems solved by the row LLM that are also solved by the column LLM. This is computed as $\frac{|A_i∩ A_j|}{|A_i|}$ , where $A_i$ and $A_j$ are the sets of problems solved by the row and column LLMs, respectively. Values near 1 indicate that the column LLM covers most problems solved by the row LLM. Under RSPC (Figure 5 (a)), GPT-o3-mini exhibits broad coverage, with column values consistently above 0.85. Gemini-2.0 and QwQ-32B also show substantial alignment, with mutual coverage exceeding 0.6. In contrast, DeepSeek-R1-70B shows lower alignment, with column values below 0.45 due to fewer solved problems. Figure 5 (b) illustrates that KAAR generally improves or maintains inter-model overlap compared to RSPC. Notably, KAAR raises the minimum coverage between GPT-o3-mini and DeepSeek-R1-70B from 0.22 under RSPC to 0.34 under KAAR. These results highlight the effectiveness of KAAR in improving cross-model generalization, with all evaluated LLMs solving additional shared problems. In particular, it enables smaller models such as QwQ-32B and DeepSeek-R1-70B to better align with stronger LLMs on the ARC benchmark.
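The coverage computation follows directly from the formula $\frac{|A_i \cap A_j|}{|A_i|}$ and can be sketched with toy solved-problem sets (not the paper's data):

```python
def coverage_matrix(solved):
    """Relative coverage: cell (i, j) is |A_i ∩ A_j| / |A_i|, the fraction
    of problems solved by row model i that are also solved by column
    model j. Note the matrix is generally asymmetric."""
    models = list(solved)
    return {
        i: {j: (len(solved[i] & solved[j]) / len(solved[i])
                if solved[i] else 0.0)
            for j in models}
        for i in models
    }

# Illustrative solved-problem sets: model B's solutions are a strict
# subset of model A's, so B is fully covered by A but not vice versa.
solved = {
    "A": {"t1", "t2", "t3", "t4"},
    "B": {"t2", "t3"},
}
cov = coverage_matrix(solved)
# cov["B"]["A"] == 1.0 (A covers all of B); cov["A"]["B"] == 0.5
```

The diagonal is always 1.0 (self-coverage), matching the darkest cells in Figure 5.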
<details>
<summary>x6.png Details</summary>

### Visual Description
## Stacked Bar Chart: Model Accuracy under RSPC and KAAR by Task Category
### Overview
This image is a stacked bar chart comparing the performance of four AI models across four task categories. Each bar shows the accuracy achieved by the RSPC solver (solid segment) plus the additional accuracy contributed by its knowledge-augmented variant, KAAR (lighter segment), so the total bar height corresponds to KAAR's accuracy. The total number of tasks in each category is noted below the category label.
### Components/Axes
* **Chart Type:** Stacked bar chart, grouped by task category.
* **Y-Axis:** Labeled "Accuracy on $I_t$ (%)". The scale runs from 0 to 40, with major gridlines at intervals of 10.
* **X-Axis:** Four categorical task groups:
1. Movement (Total: 55)
2. Extension (Total: 129)
3. Recolor (Total: 115)
4. Others (Total: 101)
* **Legend:** Located in the top-right corner. It defines eight data series, pairing four models with two metrics each. The color coding is as follows:
* **GPT-o3-mini: RSPC** - Solid medium blue.
* **GPT-o3-mini: KAAR** - Light blue (top segment of the blue bar).
* **Gemini-2.0: RSPC** - Solid olive green.
* **Gemini-2.0: KAAR** - Light green (top segment of the green bar).
* **QwQ-32B: RSPC** - Solid purple.
* **QwQ-32B: KAAR** - Light purple (top segment of the purple bar).
* **DeepSeek-R1-70B: RSPC** - Solid orange.
* **DeepSeek-R1-70B: KAAR** - Light orange/peach (top segment of the orange bar).
* **Bar Structure:** For each model within a category, the bar is stacked. The lower, solid-colored segment represents the RSPC accuracy. The upper, lighter-colored segment represents the additional accuracy gained by KAAR over RSPC, so the full bar height is KAAR's total accuracy. The numerical value for each segment is printed directly on it.
### Detailed Analysis
**1. Movement (Total: 55 tasks)**
* **GPT-o3-mini:** RSPC = 41.8%, KAAR gain = 3.6% (total ≈ 45.4%).
* **Gemini-2.0:** RSPC = 20.0%, KAAR gain = 12.7% (total = 32.7%).
* **QwQ-32B:** RSPC = 18.2%, KAAR gain = 14.5% (total = 32.7%).
* **DeepSeek-R1-70B:** RSPC = 10.9%, KAAR gain = 9.1% (total = 20.0%).
**2. Extension (Total: 129 tasks)**
* **GPT-o3-mini:** RSPC = 38.0%, KAAR gain = 0.8% (total = 38.8%).
* **Gemini-2.0:** RSPC = 19.4%, KAAR gain = 1.6% (total = 21.0%).
* **QwQ-32B:** RSPC = 17.8%, KAAR gain = 2.3% (total = 20.1%).
* **DeepSeek-R1-70B:** RSPC = 7.8%, KAAR gain = 1.6% (total = 9.4%).
**3. Recolor (Total: 115 tasks)**
* **GPT-o3-mini:** RSPC = 24.3%, KAAR gain = 7.8% (total = 32.1%).
* **Gemini-2.0:** RSPC = 13.9%, KAAR gain = 6.1% (total = 20.0%).
* **QwQ-32B:** RSPC = 10.4%, KAAR gain = 7.8% (total = 18.2%).
* **DeepSeek-R1-70B:** RSPC = 4.3%, KAAR gain = 7.0% (total = 11.3%).
**4. Others (Total: 101 tasks)**
* **GPT-o3-mini:** RSPC = 21.8%, KAAR gain = 5.0% (total = 26.8%).
* **Gemini-2.0:** RSPC = 14.9%, KAAR gain = 4.0% (total = 18.9%).
* **QwQ-32B:** RSPC = 11.9%, KAAR gain = 7.9% (total = 19.8%).
* **DeepSeek-R1-70B:** RSPC = 9.9%, KAAR gain = 5.0% (total = 14.9%).
### Key Observations
1. **Model Performance Hierarchy:** GPT-o3-mini consistently achieves the highest RSPC accuracy across all four task categories. Its lead is most pronounced in "Movement" (41.8% vs. next best 20.0%) and "Extension" (38.0% vs. 19.4%).
2. **Incremental Gains (KAAR over RSPC):** The lighter KAAR segments represent the improvement over the RSPC baseline, so KAAR's total accuracy is the full bar height. The increments are generally modest, often in single digits, with the largest being 14.5% (QwQ-32B on Movement).
3. **Task Difficulty:** The "Movement" category appears to be the easiest for the models, yielding the highest overall accuracy scores. "Extension" and "Recolor" show moderate performance, while "Others" generally has lower scores, suggesting it may be a more heterogeneous or difficult category.
4. **Model Comparison:** Gemini-2.0 and QwQ-32B perform similarly to each other, often within a few percentage points. DeepSeek-R1-70B consistently shows the lowest RSPC accuracy across all categories.
5. **Largest Relative Gain:** In the "Recolor" category, DeepSeek-R1-70B's KAAR increment (7.0%) exceeds its RSPC baseline (4.3%), the only instance in the chart where the lighter segment is larger than the solid segment for a given model.
### Interpretation
The data suggests a clear performance gap between the evaluated models on the RSPC metric, with GPT-o3-mini demonstrating a substantial advantage. The consistently low KAAR scores across all models indicate that this metric represents a much more challenging task or evaluation criterion than RSPC. The fact that RSPC accuracy is always higher implies that the skills or knowledge measured by RSPC are more readily accessible to these large language models than those measured by KAAR.
The variation across task categories ("Movement", "Extension", etc.) shows that model capability is not uniform; performance is task-dependent. The "Movement" task seems to be the most solvable for current models, while the "Others" category, likely a catch-all for miscellaneous tasks, proves more difficult. The anomaly in the "Recolor" category for DeepSeek-R1-70B, where KAAR outperforms RSPC, could indicate a specific strength in that model for the type of reasoning required by KAAR in that context, or it could be a statistical artifact due to the sample size (Total: 115). This chart effectively highlights both the relative strengths of the models and the significant challenge that the KAAR evaluation presents.
</details>
Figure 6: Accuracy on test instances $I_t$ for RSPC and KAAR across the movement, extension, recolor, and others categories using four LLMs. Each stacked bar shows RSPC accuracy (darker segment) and the additional improvement from KAAR (lighter segment).
Following prior work [9, 10], we categorize the 400 problems in the ARC public evaluation set into four classes based on their primary transformations: (1) movement (55 problems), (2) extension (129 problems), (3) recolor (115 problems), and (4) others (101 problems). The others category comprises infrequent tasks such as noise removal, selection, counting, resizing, and problems with implicit patterns that hinder systematic classification into the aforementioned categories. See Appendix A.7 for examples of each category. Figure 6 illustrates the accuracy on test instances $I_t$ for RSPC and KAAR across the four categories for the evaluated LLMs. Each stacked bar represents RSPC accuracy and the additional improvement achieved by KAAR. KAAR consistently outperforms RSPC, with the largest accuracy gain in movement (14.5% with QwQ-32B). In contrast, KAAR shows limited improvement in extension, since several problems involve pixel-level extension, which reduces the reliance on component-level recognition. Moreover, extension requires accurate spatial inference across multiple components and poses greater difficulty than movement, which mainly requires direction identification. Although KAAR augments spatial priors, LLMs still struggle to accurately infer positional relations among multiple components, consistent with prior findings [38, 39, 40]. Overlaps from component extensions further complicate reasoning, as LLMs often fail to recognize truncated components as unified wholes, contrary to human perceptual intuition.
<details>
<summary>x7.png Details</summary>

Grouped bar chart with error bars: accuracy on $I_t$ (%) for GPT-o3-mini and QwQ-32B under RSPC and KAAR, across six average image size intervals (width × height): (0,25] (19 problems), (25,100] (139), (100,225] (129), (225,400] (51), (400,625] (39), (625,900] (23). Accuracy declines steeply as image size grows: GPT-o3-mini RSPC falls from 73.7% in (0,25] to 4.3% in (625,900], and QwQ-32B RSPC from 42.1% to near zero.
</details>
Figure 7: Accuracy on test instances $I_t$ for RSPC and KAAR across average image size intervals, evaluated using GPT-o3-mini and QwQ-32B. See Figure 12 in Appendix for the results with the other LLMs.
A notable feature of ARC is the variation in image size both within and across problems. We categorize tasks by the average image size per problem, computed over both training and test image pairs. We report the accuracy on $I_t$ for RSPC and KAAR across average image size intervals using GPT-o3-mini and QwQ-32B, the strongest proprietary and open-source models in Tables 1 and 2, respectively. As shown in Figure 7, both LLMs experience performance degradation as image size increases. When the average image size exceeds 400 (20×20), GPT-o3-mini solves only three problems, while QwQ-32B solves none. In ARC, isolating relevant pixels in larger images, represented as 2D matrices, requires effective attention mechanisms in LLMs, which remains an open challenge noted in recent work [41, 34]. KAAR consistently outperforms RSPC on problems with average image sizes below 400, benefiting from object-centric representations. By abstracting each image into components, KAAR reduces interference from irrelevant pixels, directs attention to salient components, and facilitates component-level transformation analysis. However, larger images often produce both oversized and numerous components after abstraction, which continue to challenge LLMs during reasoning: oversized components hinder transformation execution, and numerous components complicate the identification of target components.
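The per-problem size statistic above can be computed directly from the task data. The sketch below assumes the standard ARC JSON task format (`train`/`test` lists of `input`/`output` grids); the function names and bin boundaries are illustrative, matching the intervals in Figure 7.

```python
# Sketch: bin ARC problems by average image size (width x height),
# averaged over all input and output grids in both training and test
# pairs. Assumes the standard ARC JSON task format; names illustrative.

BINS = [(0, 25), (25, 100), (100, 225), (225, 400), (400, 625), (625, 900)]

def average_image_size(task: dict) -> float:
    """Mean width*height over every input and output grid in the task."""
    grids = [g for pair in task["train"] + task["test"]
             for g in (pair["input"], pair["output"])]
    return sum(len(g) * len(g[0]) for g in grids) / len(grids)

def size_interval(task: dict):
    """Return the (lo, hi] bin containing the task's average image size."""
    avg = average_image_size(task)
    for lo, hi in BINS:
        if lo < avg <= hi:
            return (lo, hi)
    return None
```

For example, a task with two 3×3 training grids and two 5×5 test grids has an average size of (9 + 9 + 25 + 25) / 4 = 17, which falls into the (0,25] bin.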
<details>
<summary>x8.png Details</summary>

Line chart: accuracy on $I_r\&I_t$ (%) versus iteration count (1–12) for GPT-o3-mini and QwQ-32B under RSPC and KAAR. The x-axis is divided into three phases matching the prior levels: objectness (iterations 1–4); geometry, topology, numbers and counting (5–8); goal-directedness (9–12). KAAR lies above RSPC for both models throughout, with GPT-o3-mini KAAR plateauing near 33%, GPT-o3-mini RSPC near 29%, QwQ-32B KAAR near 19%, and QwQ-32B RSPC near 13%.
</details>
Figure 8: Variation in accuracy on $I_r\&I_t$ with increasing iterations for RSPC and KAAR using GPT-o3-mini and QwQ-32B. See Figure 13 in Appendix for the results with the other LLMs.
Figure 8 presents the variation in accuracy on $I_r\&I_t$ for RSPC and KAAR as the iteration count increases, using GPT-o3-mini and QwQ-32B. For each task under KAAR, we include only iterations from the abstraction that solves both $I_r$ and $I_t$. For KAAR, performance improvements across each 4-iteration block are driven by the solver backbone invocation after augmenting an additional level of priors: iterations 1–4 introduce objectness; 5–8 incorporate geometry, topology, numbers, and counting; 9–12 further involve goal-directedness. RSPC shows rapid improvement in the first 4 iterations and plateaus around iteration 8. At each iteration, the accuracy gap between KAAR and RSPC reflects the contribution of the priors accumulated via augmentation. KAAR consistently outperforms RSPC, with the performance gap progressively widening after new priors are augmented and peaking after the integration of goal-directedness. We note that objectness priors alone yield marginal gains with GPT-o3-mini. However, the inclusion of object attributes and relational priors (iterations 5–8) leads to improvements in KAAR over RSPC, and this advantage is further amplified after the augmentation of goal-directedness priors (iterations 9–12). These results highlight the benefits of KAAR: representing core knowledge priors through a hierarchical, dependency-aware ontology enables it to incrementally augment LLMs, perform stage-wise reasoning, and improve solution accuracy. Compared to augmenting all priors at once without stage-wise reasoning, KAAR consistently yields superior accuracy, as detailed in Appendix A.6.
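The stage-wise schedule described above can be summarized as a simple control loop. The sketch below is our reading of the procedure, not the authors' implementation: `augment` and `rspc_attempt` are hypothetical stand-ins for prior augmentation and a single solver-backbone sampling round.

```python
# Sketch of KAAR's stage-wise control loop (our reading of the paper;
# `augment` and `rspc_attempt` are hypothetical stand-ins, not the
# authors' code).

PRIOR_LEVELS = [
    "objectness",
    "geometry, topology, numbers and counting",
    "goal-directedness",
]
ITERATIONS_PER_LEVEL = 4  # iterations 1-4, 5-8, 9-12 in Figure 8

def kaar(task, augment, rspc_attempt):
    """Augment one prior level at a time, invoking the RSPC backbone
    after each augmentation stage; return the first candidate that
    solves the training pairs, or None if all iterations fail."""
    context = dict(task)
    for level in PRIOR_LEVELS:
        context = augment(context, level)   # add this level's priors
        for _ in range(ITERATIONS_PER_LEVEL):
            solution = rspc_attempt(context)
            if solution is not None:        # solves the training pairs
                return solution
    return None
```

A task solved during the second block thus consumes between 5 and 8 backbone invocations, consistent with the iteration counts plotted in Figure 8.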
## 6 Discussion
ARC and KAAR. ARC serves as a visual abstract reasoning benchmark, requiring models to infer transformations from few examples for each unique task, rather than fitting to a closed rule space as in RAVEN [42] and PGM [43]. ARC assumes tasks are solvable using core knowledge priors. However, the problems are intentionally left undefined to preclude encoding complete solution rules [5]. This pushes models beyond closed-form rule fitting and toward truly domain-general capabilities. While some of the knowledge in KAAR is tailored to ARC, its central contribution lies in representing knowledge through a hierarchical, dependency-aware ontology that enables progressive augmentation. This allows LLMs to gradually expand their reasoning scope and perform stage-wise inference, improving performance on ARC without relying on an exhaustive rule set. Moreover, the ontology of KAAR is transferable to other domains requiring hierarchical reasoning, such as robotic task planning [44], image captioning [45], and visual question answering [46], where similar knowledge priors and dependencies from ARC are applicable. In KAAR, knowledge augmentation increases token consumption, while the additional tokens remain relatively constant since all priors, except goal-directedness, are generated via image processing algorithms from GPAR. On GPT-o3-mini, augmentation tokens constitute around 60% of solver backbone token usage, while on QwQ-32B, this overhead decreases to about 20%, as the solver backbone consumes more tokens. See Appendix A.8 for a detailed discussion. Incorrect abstraction selection in KAAR also leads to wasted tokens. However, accurate abstraction inference often requires validation through viable solutions, bringing the challenge back to solution generation.
<details>
<summary>x9.png Details</summary>

Two training input–output grid pairs and one test input from ARC problem e7dd8335: each input grid contains a blue shape, and in the outputs the lower half of the shape is recolored red while the upper half stays blue; the test output is shown as a question mark to be inferred.
</details>
Figure 9: Fragment of ARC problem e7dd8335.
Solution Analysis. RSPC achieves over 30% accuracy across evaluated metrics using GPT-o3-mini, even without knowledge augmentation. To assess its alignment with core knowledge priors, we manually reviewed RSPC-generated solution plans and code that successfully solve $I_t$ with GPT-o3-mini. RSPC tends to solve problems without object-centric reasoning. For instance, in Figure 1, it shifts each row downward by one and pads the top with zeros, rather than reasoning over objectness to move each 4-connected component down by one step. Even when applying objectness, RSPC typically defaults to 4-connected abstraction, failing on the problem in Figure 9, where the test input clearly requires 8-connected abstraction. We note that object recognition in ARC involves grouping pixels into task-specific components based on clustering rules, differing from feature extraction approaches [47] in conventional computer vision tasks. Recent work seeks to bridge this gap by incorporating 2D positional encodings and object indices into Vision Transformers [41]. However, its reliance on data-driven learning weakens generalization, undermining ARC’s core objective. In contrast, KAAR enables objectness through explicitly defined abstractions, implemented via standard image processing algorithms, thus ensuring both accuracy and generalization.
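The two solution styles contrasted above can be made concrete. The sketch below is illustrative, not the authors' code: `shift_rows_down` mirrors the pixel-level solution RSPC tends to produce for Figure 1, while `components` performs the flood-fill abstraction underlying object-centric reasoning, with connectivity as an explicit parameter (the choice that RSPC gets wrong on Figure 9).

```python
# Illustrative sketch (not the authors' implementation): a pixel-level
# row shift versus extracting same-colored components at a chosen
# connectivity before moving them.

def shift_rows_down(grid):
    """Pixel-level solution for Figure 1: shift every row down by one
    and pad the top with zeros."""
    width = len(grid[0])
    return [[0] * width] + grid[:-1]

def components(grid, connectivity=4):
    """Group same-colored non-zero pixels via flood fill.
    connectivity=4 uses edge neighbors only; 8 adds diagonals."""
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if connectivity == 8:
        offsets += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    h, w = len(grid), len(grid[0])
    seen, comps = set(), []
    for r in range(h):
        for c in range(w):
            if grid[r][c] == 0 or (r, c) in seen:
                continue
            stack, comp = [(r, c)], []
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                comp.append((y, x))
                for dy, dx in offsets:
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and (ny, nx) not in seen
                            and grid[ny][nx] == grid[r][c]):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            comps.append(comp)
    return comps
```

Two diagonally touching pixels of the same color form two components under 4-connectivity but a single component under 8-connectivity, which is exactly why defaulting to 4-connected abstraction fails on problems like Figure 9.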
Generalization. For all evaluated ARC solvers, accuracy on $I_r$ consistently exceeds that on $I_r\&I_t$, revealing a generalization gap. Planning-aided code generation methods, such as RSPC and KAAR, exhibit smaller gaps than other solvers, though the issue persists. One reason is that solutions encode low-level logic specific to the training pairs, thus failing to generalize. See Appendix A.9 for examples. Another reason is the use of incorrect abstractions. For example, reliance solely on 4-connected abstraction leads RSPC to solve only $I_r$ in Figure 9. KAAR similarly fails to generalize in this case: it selects the 4-connected abstraction, the first that solves $I_r$, to report accuracy on $I_t$, instead of the correct 8-connected abstraction, because the former is considered simpler. Table 1 also reveals that LLMs differ in their generalization across ARC solvers. While a detailed analysis of these variations is beyond the scope of this study, investigating the underlying causes could offer insights into LLM inference and alignment with intended behaviors, presenting a promising direction for future work.
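The gap between accuracy on $I_r$ and on $I_r\&I_t$ can be measured mechanically. A minimal sketch, assuming the standard ARC task format; `solve` stands in for a generated candidate program and the names are illustrative.

```python
# Sketch: score a candidate program on training pairs only (I_r) versus
# training and test pairs jointly (I_r & I_t). `solve` is a hypothetical
# stand-in for an LLM-generated candidate.

def solves_all(solve, pairs):
    """True iff the candidate reproduces every output grid exactly."""
    return all(solve(p["input"]) == p["output"] for p in pairs)

def evaluate(solve, task):
    """Return (solves I_r, solves I_r and I_t) for one candidate."""
    on_train = solves_all(solve, task["train"])
    on_both = on_train and solves_all(solve, task["test"])
    return on_train, on_both
```

A candidate for which `evaluate` returns `(True, False)` is exactly the overfitting case discussed above: it fits the training pairs but fails to generalize to the test instance.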
## 7 Conclusion
We explored the generalization and abstract reasoning capabilities of recent reasoning-oriented LLMs on the ARC benchmark using nine candidate solvers. Experimental results show that repeated-sampling planning-aided code generation (RSPC) achieves the highest test accuracy and demonstrates consistent generalization across most evaluated LLMs. To further improve performance, we propose KAAR, which progressively augments LLMs with core knowledge priors organized into hierarchical levels based on their dependencies, and applies RSPC after augmenting each level of priors to enable stage-wise reasoning. KAAR improves LLM performance on the ARC benchmark while maintaining strong generalization compared to non-augmented RSPC. However, ARC remains challenging even for the most capable reasoning-oriented LLMs, given its emphasis on abstract reasoning and generalization, highlighting current limitations and motivating future research.
## References
- Khan et al. [2021] Abdullah Ayub Khan, Asif Ali Laghari, and Shafique Ahmed Awan. Machine learning in computer vision: A review. EAI Endorsed Transactions on Scalable Information Systems, 8(32), 2021.
- Otter et al. [2020] Daniel W Otter, Julian R Medina, and Jugal K Kalita. A survey of the usages of deep learning for natural language processing. IEEE Transactions on Neural Networks and Learning Systems, 32(2):604–624, 2020.
- Grigorescu et al. [2020] Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. A survey of deep learning techniques for autonomous driving. Journal of Field Robotics, 37(3):362–386, 2020.
- Lake et al. [2017] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40:e253, 2017.
- Chollet [2019] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
- Peirce [1868] Charles S Peirce. Questions concerning certain faculties claimed for man. The Journal of Speculative Philosophy, 2(2):103–114, 1868.
- Spelke and Kinzler [2007] Elizabeth S Spelke and Katherine D Kinzler. Core knowledge. Developmental Science, 10(1):89–96, 2007.
- Gulwani et al. [2017] Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, et al. Program synthesis. Foundations and Trends® in Programming Languages, 4:1–119, 2017.
- Xu et al. [2023a] Yudong Xu, Elias B Khalil, and Scott Sanner. Graphs, constraints, and search for the abstraction and reasoning corpus. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI, pages 4115–4122, 2023a.
- Lei et al. [2024a] Chao Lei, Nir Lipovetzky, and Krista A Ehinger. Generalized planning for the abstraction and reasoning corpus. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, AAAI, pages 20168–20175, 2024a.
- Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th Advances in Neural Information Processing Systems, NeurIPS, pages 24824–24837, 2022.
- Chen et al. [2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Li et al. [2022] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with AlphaCode. Science, 378:1092–1097, 2022.
- Chen et al. [2023] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. CodeT: Code generation with generated tests. In Proceedings of the 11th International Conference on Learning Representations, ICLR, pages 1–19, 2023.
- Zhang et al. [2023] Tianyi Zhang, Tao Yu, Tatsunori Hashimoto, Mike Lewis, Wen-tau Yih, Daniel Fried, and Sida Wang. Coder reviewer reranking for code generation. In Proceedings of the 40th International Conference on Machine Learning, ICML, pages 41832–41846, 2023.
- Ni et al. [2023] Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. Lever: Learning to verify language-to-code generation with execution. In Proceedings of the 40th International Conference on Machine Learning, ICML, pages 26106–26128, 2023.
- Zhong et al. [2024a] Li Zhong, Zilong Wang, and Jingbo Shang. Debug like a human: A large language model debugger via verifying runtime execution step by step. In Findings of the Association for Computational Linguistics: ACL 2024, pages 851–870, 2024a.
- Lei et al. [2024b] Chao Lei, Yanchuan Chang, Nir Lipovetzky, and Krista A Ehinger. Planning-driven programming: A large language model programming workflow. arXiv preprint arXiv:2411.14503, 2024b.
- Chen et al. [2024] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug. In Proceedings of the 12th International Conference on Learning Representations, ICLR, 2024.
- Austin et al. [2021] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- Jain et al. [2025] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In Proceedings of the 13th International Conference on Learning Representations, ICLR, 2025.
- Jiang et al. [2023] Xue Jiang, Yihong Dong, Lecheng Wang, Fang Zheng, Qiwei Shang, Ge Li, Zhi Jin, and Wenpin Jiao. Self-planning code generation with large language models. ACM Transactions on Software Engineering and Methodology, 33(7):1–28, 2023.
- Islam et al. [2024] Md. Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. MapCoder: Multi-agent code generation for competitive problem solving. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL, pages 4912–4944, 2024.
- Zhong et al. [2024b] Tianyang Zhong, Zhengliang Liu, Yi Pan, Yutong Zhang, Yifan Zhou, Shizhe Liang, Zihao Wu, Yanjun Lyu, Peng Shu, Xiaowei Yu, et al. Evaluation of openai o1: Opportunities and challenges of agi. arXiv preprint arXiv:2409.18486, 2024b.
- OpenAI [2025] OpenAI. Openai o3-mini. OpenAI, 2025. URL https://openai.com/index/openai-o3-mini/. Accessed: 2025-03-22.
- DeepMind [2024] Google DeepMind. Gemini 2.0 flash thinking. Google DeepMind, 2024. URL https://deepmind.google/technologies/gemini/flash-thinking/. Accessed: 2025-03-22.
- Guo et al. [2025] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- Cloud [2025] Alibaba Cloud. Alibaba cloud unveils qwq-32b: A compact reasoning model with cutting-edge performance. Alibaba Cloud, 2025. URL https://www.alibabacloud.com/blog/alibaba-cloud-unveils-qwq-32b-a-compact-reasoning-model-with-cutting-edge-performance_602039. Accessed: 2025-03-22.
- Babakr et al. [2019] Zana H Babakr, Pakstan Mohamedamin, and Karwan Kakamad. Piaget’s cognitive developmental theory: Critical review. Education Quarterly Reviews, 2(3):517–524, 2019.
- Deng et al. [2024] Hourui Deng, Hongjie Zhang, Jie Ou, and Chaosheng Feng. Can llm be a good path planner based on prompt engineering? mitigating the hallucination for path planning. arXiv preprint arXiv:2408.13184, 2024.
- Meng et al. [2024] Silin Meng, Yiwei Wang, Cheng-Fu Yang, Nanyun Peng, and Kai-Wei Chang. LLM-a*: Large language model enhanced incremental heuristic search on path planning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1087–1102, 2024.
- Ahn et al. [2024] Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, EACL, pages 225–237, 2024.
- Zang et al. [2025] Yuhang Zang, Wei Li, Jun Han, Kaiyang Zhou, and Chen Change Loy. Contextual object detection with multimodal large language models. International Journal of Computer Vision, 133(2):825–843, 2025.
- Xu et al. [2023b] Yudong Xu, Wenhao Li, Pashootan Vaezipoor, Scott Sanner, and Elias B Khalil. Llms and the abstraction and reasoning corpus: Successes, failures, and the importance of object-based representations. arXiv preprint arXiv:2305.18354, 2023b.
- Lei et al. [2023] Chao Lei, Nir Lipovetzky, and Krista A Ehinger. Novelty and lifted helpful actions in generalized planning. In Proceedings of the International Symposium on Combinatorial Search, SoCS, pages 148–152, 2023.
- Wang et al. [2024] Ruocheng Wang, Eric Zelikman, Gabriel Poesia, Yewen Pu, Nick Haber, and Noah Goodman. Hypothesis search: Inductive reasoning with language models. In Proceedings of the 12th International Conference on Learning Representations, ICLR, 2024.
- LeGris et al. [2024] Solim LeGris, Wai Keen Vong, Brenden M Lake, and Todd M Gureckis. H-arc: A robust estimate of human performance on the abstraction and reasoning corpus benchmark. arXiv preprint arXiv:2409.01374, 2024.
- Yamada et al. [2024] Yutaro Yamada, Yihan Bao, Andrew Kyle Lampinen, Jungo Kasai, and Ilker Yildirim. Evaluating spatial understanding of large language models. Transactions on Machine Learning Research, 2024.
- Cohn and Hernandez-Orallo [2023] Anthony G Cohn and Jose Hernandez-Orallo. Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of llms. arXiv preprint arXiv:2304.11164, 2023.
- Bang et al. [2023] Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, IJCNLP-AACL, pages 675–718, 2023.
- Li et al. [2024a] Wenhao Li, Yudong Xu, Scott Sanner, and Elias Boutros Khalil. Tackling the abstraction and reasoning corpus with vision transformers: the importance of 2d representation, positions, and objects. arXiv preprint arXiv:2410.06405, 2024a.
- Raven [2000] John Raven. The raven’s progressive matrices: change and stability over culture and time. Cognitive Psychology, 41(1):1–48, 2000.
- Barrett et al. [2018] David Barrett, Felix Hill, Adam Santoro, Ari Morcos, and Timothy Lillicrap. Measuring abstract reasoning in neural networks. In Proceedings of the 37th International Conference on Machine Learning, ICML, pages 511–520, 2018.
- Cui et al. [2025] Yongcheng Cui, Ying Zhang, Cui-Hua Zhang, and Simon X Yang. Task cognition and planning for service robots. Intelligence & Robotics, (1):119–142, 2025.
- Stefanini et al. [2022] Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, and Rita Cucchiara. From show to tell: A survey on deep learning-based image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, (1):539–559, 2022.
- Huynh et al. [2025] Ngoc Dung Huynh, Mohamed Reda Bouadjenek, Sunil Aryal, Imran Razzak, and Hakim Hacid. Visual question answering: from early developments to recent advances–a survey. arXiv preprint arXiv:2501.03939, 2025.
- Zhao et al. [2019] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, and Xindong Wu. Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems, 30(11):3212–3232, 2019.
- Mialon et al. [2023] Grégoire Mialon, Roberto Dessi, Maria Lomeli, Christoforos Nalmpantis, Ramakanth Pasunuru, Roberta Raileanu, Baptiste Roziere, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, and Thomas Scialom. Augmented language models: a survey. Transactions on Machine Learning Research, 2023. ISSN 2835-8856.
- Zhu et al. [2025] Yuqi Zhu, Shuofei Qiao, Yixin Ou, Shumin Deng, Shiwei Lyu, Yue Shen, Lei Liang, Jinjie Gu, Huajun Chen, and Ningyu Zhang. KnowAgent: Knowledge-augmented planning for LLM-based agents. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3709–3732, 2025.
- Vu et al. [2024] Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, and Thang Luong. FreshLLMs: Refreshing large language models with search engine augmentation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 13697–13720, 2024.
- Li et al. [2024b] Xingxuan Li, Ruochen Zhao, Yew Ken Chia, Bosheng Ding, Shafiq Joty, Soujanya Poria, and Lidong Bing. Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources. In Proceedings of the 12th International Conference on Learning Representations, ICLR, 2024b.
- Trivedi et al. [2023] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, ACL, pages 10014–10037, 2023.
- Qiao et al. [2024] Shuofei Qiao, Honghao Gui, Chengfei Lv, Qianghuai Jia, Huajun Chen, and Ningyu Zhang. Making language models better tool learners with execution feedback. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL, pages 3550–3568, 2024.
- Wind [2020] J S Wind. 1st place solution + code and official documentation. https://www.kaggle.com/competitions/abstraction-and-reasoning-challenge/discussion/154597, 2020. Accessed: 2025-03-22.
- Camposampiero et al. [2023] Giacomo Camposampiero, Loic Houmard, Benjamin Estermann, Joël Mathys, and Roger Wattenhofer. Abstract visual reasoning enabled by language. arXiv preprint arXiv:2306.04091, 2023.
- Min [2023] Tan John Chong Min. An approach to solving the abstraction and reasoning corpus (arc) challenge. arXiv preprint arXiv:2306.03553, 2023.
- Tan and Motani [2024] John Chong Min Tan and Mehul Motani. Llms as a system of multiple expert agents: An approach to solve the abstraction and reasoning corpus (arc) challenge. In Proceedings of the 2024 IEEE Conference on Artificial Intelligence, CAI, pages 782–787, 2024.
- Bikov et al. [2024] Kiril Bikov, Mikel Bober-Irizar, and Soumya Banerjee. Reflection system for the abstraction and reasoning corpus. In Proceedings of the 2nd AI4Research Workshop: Towards a Knowledge-grounded Scientific Research Lifecycle, 2024.
- Franzen et al. [2024] Daniel Franzen, Jan Disselhoff, and David Hartmann. The llm architect: Solving arc-agi is a matter of perspective. https://github.com/da-fr/arc-prize-2024/blob/main/the_architects.pdf, 2024. Accessed: 2025-03-22.
- Hodel [2024] Michael Hodel. Addressing the abstraction and reasoning corpus via procedural example generation. arXiv preprint arXiv:2404.07353, 2024.
- Moskvichev et al. [2023] Arseny Moskvichev, Victor Vikram Odouard, and Melanie Mitchell. The conceptarc benchmark: Evaluating understanding and generalization in the arc domain. arXiv preprint arXiv:2305.07141, 2023.
- Li et al. [2025] Wen-Ding Li, Keya Hu, Carter Larsen, Yuqing Wu, Simon Alford, Caleb Woo, Spencer M. Dunn, Hao Tang, Wei-Long Zheng, Yewen Pu, and Kevin Ellis. Combining induction and transduction for abstract reasoning. In Proceedings of the 13th International Conference on Learning Representations, ICLR, 2025.
- Barke et al. [2024] Shraddha Barke, Emmanuel Anaya Gonzalez, Saketh Ram Kasibatla, Taylor Berg-Kirkpatrick, and Nadia Polikarpova. Hysynth: Context-free llm approximation for guiding program synthesis. In Proceedings of the 38th Advances in Neural Information Processing Systems, NeurIPS, pages 15612–15645, 2024.
## Appendix A Appendix
### A.1 Related Work
Knowledge-Augmented LLMs. Augmenting LLMs with external knowledge can improve reasoning capabilities and mitigate hallucination in text generation [48]. Previous studies achieve this by incorporating domain-specific knowledge, designed by human experts [49], retrieved via search engines [50], or extracted from Wikipedia documents [51]. Trivedi et al. [52] demonstrated that interleaving knowledge augmentation within reasoning steps further reduces model hallucination, resulting in more accurate multi-step reasoning. Additionally, augmenting LLMs with execution feedback improves performance on both question answering [53] and program synthesis tasks [10, 17, 19].
Search in DSL. An abstract, expressive, and compositional representation of core knowledge priors is essential for solving ARC tasks [5]. Previous studies have manually encoded these priors into domain-specific languages (DSLs) with lifted relational representations [9, 10, 54]. Various program synthesis methods have been proposed to search for valid solution programs within their DSLs, including DAG-based search [54], graph-based constraint-guided search [9], and generalized planning [10]. Hand-crafted DSLs encode core knowledge priors with high precision and interpretability, enabling structured program synthesis. However, comprehensive DSLs induce large search spaces, limiting synthesis efficiency.
LLMs for ARC. Recent studies have explored using LLMs as ARC solvers that directly generate test output matrices, prompting them with different problem descriptions to improve output accuracy. Camposampiero et al. [55] employed LLMs to generate output grids from textual task descriptions derived from a vision module designed to capture human-like visual priors. Min [56] prompted LLMs with the raw 2D matrices of each task, along with transformation and abstraction examples. Xu et al. [34] demonstrated that object representations derived from predefined abstractions can improve LLM performance on ARC tasks. Recent advances in LLM code generation [18, 17, 14] highlight the potential of LLMs to replace search-based program synthesis and address its efficiency limitations. Tan and Motani [57] evaluated LLM performance on the ARC benchmark by generating Python program solutions. Additionally, Wang et al. [36] approached ARC as an inductive reasoning problem and introduced hypothesis search, which generates program solutions by selecting LLM-generated hypotheses encoded as functions.
Training-Based Methods. To further improve LLM performance, Bikov et al. [58] fine-tuned LLMs on augmented ARC tasks using standard techniques such as rotation, flipping, and permutation. Beyond these methods, Franzen et al. [59] fine-tuned LLMs on large-scale synthetic ARC tasks [60] and ARC-related datasets such as Concept-ARC [61] and ARC-Heavy [62], achieving a state-of-the-art 56% accuracy on the private evaluation set of 200 tasks. Instead of fine-tuning LLMs, Barke et al. [63] trained a probabilistic context-free grammar (PCFG) using LLM-generated plausible solutions to learn weighted functions. This enables the synthesizer to efficiently generate final program solutions. However, this approach requires a dedicated synthesizer for each DSL, limiting its generalization.
When leveraging LLMs as ARC solvers, existing studies tend to emphasize accuracy on subsets of the training problems and overlook the core principle of ARC: solutions should be constructed from core knowledge priors [5]. LLMs still lack these priors, such as objectness, as evidenced by RSPC-generated solutions. Although fine-tuning approaches achieve state-of-the-art performance, their failure to incorporate core knowledge priors remains a fundamental limitation. KAAR addresses this gap by progressively augmenting LLMs with the structured core knowledge priors introduced by GPAR, along with implementations of goal-directedness priors that are exclusive to KAAR. It interleaves augmentation with the reasoning process by applying an advanced LLM-based program synthesis solver, tailored to the ARC benchmark, after augmenting the priors at each level. KAAR achieves strong performance (32.5% test accuracy on the full evaluation set of 400 problems using GPT-o3-mini), demonstrates substantial generalization, and produces solutions aligned with core knowledge priors.
### A.2 Core Knowledge Priors in KAAR
KAAR incorporates abstractions to enable objectness priors; component attributes, relations, and statistical analysis of component attributes to encode geometry, topology, numbers, and counting priors; and predefined actions to support goal-directedness priors. Table 5 presents all abstractions used in KAAR, organized by their priority order. KAAR adopts fundamental abstractions, such as 4-connected and 8-connected components, from GPAR, and extends them with additional abstractions unique to KAAR, highlighted in red. Table 6 introduces the geometry, topology, numbers, and counting priors, and the ten predefined actions used in KAAR. For each action, KAAR augments the LLM with its corresponding schema to resolve implementation details. The actions and their schemas are detailed in Table A.12. Most actions can be specified within three steps, keeping them tractable for LLMs.
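As a concrete illustration of the objectness level, the sketch below (our own minimal example, not KAAR's actual implementation, and covering only single-color components) extracts 4-connected or 8-connected components from an ARC grid via breadth-first flood fill:

```python
# Toy sketch of the objectness abstraction: extract same-color connected
# components from an ARC grid (a 2D list of color indices), treating one
# color as background. Names and signature are illustrative only.
from collections import deque

def connected_components(grid, background=0, diagonal=False):
    """Return a list of components; each component is a set of (row, col)
    cells that share one color and are 4- (or 8-) connected."""
    rows, cols = len(grid), len(grid[0])
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if diagonal:  # 8-connectivity adds the four diagonal neighbours
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    seen, components = set(), []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == background or (r, c) in seen:
                continue
            color, queue, cells = grid[r][c], deque([(r, c)]), set()
            seen.add((r, c))
            while queue:  # breadth-first flood fill over equal-color cells
                cr, cc = queue.popleft()
                cells.add((cr, cc))
                for dr, dc in steps:
                    nr, nc = cr + dr, cc + dc
                    if (0 <= nr < rows and 0 <= nc < cols
                            and (nr, nc) not in seen
                            and grid[nr][nc] == color):
                        seen.add((nr, nc))
                        queue.append((nr, nc))
            components.append(cells)
    return components
```

Attribute and relation priors (size, color, adjacency, containment, and so on) can then be computed over the returned cell sets.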
<details>
<summary>x10.png Details</summary>

### Visual Description
The figure shows ARC problem 0520fde7 as a horizontal sequence: three training input-output pairs, a vertical dotted separator, and a test input followed by a box containing a question mark for the unknown test output. Each input grid contains blue and black pixels divided by a gray middle-vertical column; each output grid is black except for red pixels marking the positions where the two halves of the corresponding input both contain blue.
</details>
Figure 10: ARC problem 0520fde7
### A.3 Restrictions in KAAR
For certain abstractions, some priors are either inapplicable or exclusive. The specific priors assigned to some abstractions are detailed in Table 8. For the whole image abstraction, few priors apply as only a single component is present. In contrast, the 4/8-connected-multi-color-non-background abstractions retain most priors. The highlighted priors that capture per-component color diversity are used exclusively for 4/8-connected-multi-color-non-background abstractions, while priors tailored to a single-color component, such as components with same color, components with most frequent color, and components with least frequent color, are excluded. For the middle-vertical and middle-horizontal abstractions, where the image is evenly divided into two components, flipping and movement actions are enabled to facilitate reasoning over overlapping components. For instance, in the problem shown in Figure 10, the solution involves splitting the image along a middle-vertical grid line and moving one component to overlap the other. In the resulting component, a pixel is colored red if the overlapping pixels in both components are blue; otherwise, it is colored black.
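The rule for this problem can be written as a short Python program of the kind the solver backbone generates (the function name and the example grid below are ours; colors follow the ARC palette with 0 = black, 1 = blue, 2 = red, 5 = gray):

```python
def solve_0520fde7(grid):
    """Split the input along the gray (5) middle-vertical column, overlap
    the two halves, and color a cell red (2) where both halves are blue
    (1), black (0) otherwise. Sketch of the rule described in the text."""
    split = grid[0].index(5)                  # locate the gray separator
    left = [row[:split] for row in grid]      # component left of the line
    right = [row[split + 1:] for row in grid] # component right of the line
    return [
        [2 if l == r == 1 else 0 for l, r in zip(lrow, rrow)]
        for lrow, rrow in zip(left, right)
    ]
```

For instance, an illustrative 3x7 input with blue pixels overlapping only at the center yields a 3x3 output with a single red center pixel.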
### A.4 Parameter Settings
KAAR operates on all LLMs through API access with the full conversational history. For proprietary models, GPT-o3-mini and Gemini-2.0 Flash-Thinking (Gemini-2.0), we use default parameter settings. For open-source models, DeepSeek-R1-Distill-Llama-70B (DeepSeek-R1-70B) and QwQ-32B, we set temperature to 0.6, top-p to 0.95, and top-k to 40 to reduce repetitive outputs and filter rare tokens while preserving generation diversity. We conduct experiments on a virtual machine with 4 NVIDIA A100 80GB GPUs.
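For intuition on how these sampling parameters interact, the following toy sketch (ours, independent of any particular inference API) applies temperature scaling, then top-k truncation, then top-p (nucleus) truncation to a vector of next-token logits:

```python
# Toy illustration of temperature / top-k / top-p sampling filters.
# This is not any model's actual decoding code; it only shows what the
# parameters in A.4 do to a next-token distribution.
import math

def filter_logits(logits, temperature=0.6, top_k=40, top_p=0.95):
    """Return the renormalized sampling distribution (token index -> prob)
    after temperature scaling, top-k truncation, and top-p truncation."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                      # subtract max for numerical safety
    probs = [math.exp(l - m) for l in scaled]
    total = sum(probs)
    probs = [p / total for p in probs]
    # Keep only the top_k most probable tokens.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = order[:top_k]
    # Within those, keep the smallest prefix whose cumulative mass >= top_p.
    cumulative, nucleus = 0.0, []
    for i in kept:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in nucleus)
    return {i: probs[i] / mass for i in nucleus}
```

Lower temperatures sharpen the distribution before truncation, while top-k and top-p bound the candidate set by rank and by cumulative probability mass, respectively.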
### A.5 KAAR
Algorithm 1 presents the pseudocode of KAAR. For each abstraction, KAAR incrementally augments the LLM with core knowledge priors, structured into three dependency-aware levels: beginning with objectness (Line 5), followed by geometry and topology (Lines 10 and 12) and numbers and counting (Line 14), and concluding with goal-directedness priors (Line 18). We note that KAAR encodes geometry and topology priors through component attributes (Line 9) and relations (Line 11). The full set of priors is detailed in Tables 5, 6, and A.12. After augmenting each level of priors, KAAR invokes the solver backbone (RSPC) at Lines 6, 15, and 19 to generate code solutions guided by text-based plans, allowing up to 4 iterations (Lines 25–37). In each iteration, the solver backbone first validates the generated code on the training instances $I_r$; if successful, it then evaluates the solution on the test instances $I_t$. The solver backbone returns solve if the generated solution solves $I_t$ after passing $I_r$; pass if only $I_r$ is solved; or continues to the next iteration if the solution fails on $I_r$. If the solver backbone fails to solve $I_r$ within the allotted 4 iterations at Lines 6 and 15, KAAR augments the next level of priors. KAAR proceeds to the next abstraction when the solver backbone fails to solve $I_r$ at Line 19 after the 4-iteration limit. KAAR terminates the abstraction iteration upon receiving either pass or solve from the solver backbone and reports accuracy on $I_r$, $I_t$, and $I_r\&I_t$ accordingly. If no abstraction fully solves $I_r$, KAAR records the final code solution for each abstraction (Line 22), selects the one that passes the most training instances (Line 23), and evaluates it on $I_t$ to determine additional accuracy gains (Line 24).
KAAR generates priors offline using image processing algorithms introduced in GPAR (Lines 4, 9, 11, and 13). In contrast, KAAR enables goal-directedness priors at Line 18 by prompting the LLM to select the most suitable actions and identify their implementation details, as described in Table A.12. KAAR iterates over abstractions from simpler to more complex, following the order specified in Table 5. We note that the highest-priority abstraction is no abstraction, in which case KAAR reduces to the solver backbone (RSPC), as no priors are applied.
Input: LLM $M$; ARC problem $P=(I_r,I_t)$; description $Q=(I_r,\{i^i \mid (i^i,i^o)∈ I_t\})$; abstraction list $A$; max iterations $t=4$
1 Function KnowledgeAugmentation($M$, $Q$, $P$, $A$, $t$):
2   solutionList $←[]$;
3   foreach abstraction $abs$ in $A$ do
4     objectnessPriors $←$ GenerateObjectnessPriors($Q$, $abs$);
5     AugmentKnowledge($M$, objectnessPriors);
6     result, code, passedCount $←$ SolverBackbone($M$, $P$, $Q$, $t$);
7     if result $≠$ failure then
8       return result;
9     attributePriors $←$ GenerateAttributePriors($Q$, $abs$);
10    AugmentKnowledge($M$, attributePriors);
11    relationPriors $←$ GenerateRelationPriors($Q$, $abs$);
12    AugmentKnowledge($M$, relationPriors);
13    numberPriors $←$ GenerateNumbersCountingPriors($Q$, $abs$);
14    AugmentKnowledge($M$, numberPriors);
15    result, code, passedCount $←$ SolverBackbone($M$, $P$, $Q$, $t$);
16    if result $≠$ failure then
17      return result;
18    AugmentGoalPriors($M$, $Q$, $abs$);
19    result, code, passedCount $←$ SolverBackbone($M$, $P$, $Q$, $t$);
20    if result $≠$ failure then
21      return result;
22    solutionList.append((code, passedCount));
23  bestCode $←$ SelectMostPassed(solutionList);
24  return EvaluateOnTest(bestCode, $I_t$);
25 Function SolverBackbone($M$, $P$, $Q$, $t$):
26   i $← 0$;
27   while i < t do
28     plan $← M$.generatePlan($Q$);
29     code $← M$.generateCode($Q$, plan);
30     passedCount $←$ EvaluateOnTrain(code, $I_r$);
31     if passedCount == $|I_r|$ then
32       if EvaluateOnTest(code, $I_t$) then
33         return solve, code, passedCount;
34       else
35         return pass, code, passedCount;
36     i $←$ i + 1;
37   return failure, code, passedCount;
Algorithm 1 KAAR
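The control flow of Algorithm 1 can also be viewed as the minimal executable Python sketch below (our simplification, not KAAR's actual implementation): the prior generators and the backbone's internal 4-iteration plan-and-code loop are folded into a caller-supplied `backbone` callable that receives the accumulated augmentation context and returns a `(result, code, passedCount)` triple.

```python
def kaar(backbone, abstractions, stages):
    """Stage-wise control flow of Algorithm 1 (simplified). After each
    level of prior augmentation, invoke the solver backbone; if no stage
    succeeds for any abstraction, fall back to the candidate that passed
    the most training instances."""
    solutions = []
    for abstraction in abstractions:
        context = []  # priors augmented so far for this abstraction
        for stage in stages:
            context.append((abstraction, stage))
            result, code, passed = backbone(context)
            if result != "failure":  # backbone returned 'solve' or 'pass'
                return result, code
        # Record the final candidate for this abstraction (Line 22).
        solutions.append((code, passed))
    # Select the candidate passing the most training instances (Line 23).
    best_code, _ = max(solutions, key=lambda s: s[1])
    return "failure", best_code
```

A mock `backbone` that succeeds only once goal-directedness priors are present for a particular abstraction reproduces the early-exit behaviour; a backbone that always fails exercises the fallback selection.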
| Model | Metric | KAAR | KAAR* | $Δ$ |
| --- | --- | --- | --- | --- |
| Gemini-2.0 | $I_r$ | 25.75 | 23.00 | -2.75 |
| | $I_t$ | 21.75 | 19.00 | -2.75 |
| | $I_r\&I_t$ | 20.50 | 18.00 | -2.50 |
| QwQ-32B | $I_r$ | 22.25 | 18.50 | -3.75 |
| | $I_t$ | 21.00 | 17.75 | -3.25 |
| | $I_r\&I_t$ | 19.25 | 16.25 | -3.00 |
| DeepSeek-R1-70B | $I_r$ | 12.25 | 9.00 | -3.25 |
| | $I_t$ | 12.75 | 9.00 | -3.75 |
| | $I_r\&I_t$ | 11.50 | 8.50 | -3.00 |

Table 3: Accuracy on $I_r$, $I_t$, and $I_r\&I_t$ for KAAR and KAAR* across three LLMs. KAAR* invokes the solver backbone (RSPC) only after all knowledge priors are augmented. $Δ$ denotes the performance drop relative to KAAR. All values are reported as percentages.
### A.6 Ablation Study
Table 3 reports the accuracy decrease resulting from removing incremental knowledge augmentation and stage-wise reasoning in KAAR, denoted as KAAR*. Unlike KAAR, which invokes the solver backbone (RSPC) after augmenting each level of priors to enable stage-wise reasoning, KAAR* uses RSPC to solve the problem within 12 iterations after augmenting all priors at once. We evaluate KAAR* using the same reasoning-oriented LLMs as in Tables 1 and 2, excluding GPT-o3-mini due to its computational cost. KAAR* shows decreased accuracy on all metrics, $I_r$, $I_t$, and $I_r\&I_t$, for all evaluated LLMs. These results underscore the effectiveness of progressive augmentation and stage-wise reasoning: presenting all knowledge priors simultaneously introduces superfluous information, which may obscure viable solutions and impair LLM reasoning accuracy. We note that we construct the ontology of core knowledge priors based on their dependencies, thereby establishing a fixed augmentation order.
<details>
<summary>x11.png Details</summary>

### Visual Description
## Visual Pattern Recognition Tasks: Four Transformation Categories
### Overview
Four columns of example ARC tasks, one per category, each showing three input → output training pairs followed by a test input marked “?”, separated from the examples by a dotted line:

* **Task f3e62deb (Movement):** a single solid-colored square is relocated within the grid.
* **Task b15fca0b (Extension):** yellow pixels fill the connected background space around blue bars and red pixels.
* **Task 6ea4a07e (Recolor):** all pixels of one color are substituted with another (e.g., light blue → red, green → blue, gray → yellow).
* **Task 3b4c2228 (Others):** a pattern of blue pixels is derived from a mix of green and red pixels by a more complex rule.
</details>
Figure 11: Example ARC tasks for movement, extension, recolor, and others categories.
### A.7 Example Tasks by Category in the ARC Evaluation Set
ARC comprises 1000 unique tasks, with 400 allocated to the training set and 600 to the evaluation set. The evaluation set is further divided into a public subset (400 tasks) and a private subset (200 tasks). Figure 11 illustrates example ARC tasks for the movement, extension, recolor, and others categories in the public evaluation set. In the movement example, components are shifted to the image boundary in directions determined by their colors. The extension example is more complex, requiring LLMs to find the shortest path between two red pixels while avoiding obstacles, which presents challenges for current reasoning-oriented models. Additionally, reliance on pixel-level recognition weakens the effectiveness of KAAR, which is designed to facilitate component identification. The recolor example involves changing non-black components to black and updating black components based on original non-black colors. The others example requires generating a blue diagonal line whose length depends on the number of 4-connected components in the input image that are green and have a size greater than one. The combination of numerical reasoning and structural pattern generation makes this task difficult to classify within the other three categories.
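To make the others example concrete, the sketch below shows the kind of program a solver must synthesize: count the 4-connected green components with more than one pixel and draw a blue diagonal of that length. It assumes the standard ARC color encoding (0 = black, 1 = blue, 3 = green) and a caller-supplied output size; the actual task may fix the output dimensions differently.

```python
from collections import deque

GREEN, BLUE = 3, 1  # standard ARC color codes (an assumption here)

def count_green_components(grid, min_size=2):
    """Count 4-connected green components with more than one pixel."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if grid[r][c] != GREEN or seen[r][c]:
                continue
            size, queue = 0, deque([(r, c)])  # BFS flood fill
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                size += 1
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and \
                            grid[ny][nx] == GREEN and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if size >= min_size:
                count += 1
    return count

def solve(grid, out_size):
    """Draw a blue diagonal whose length equals the component count."""
    n = count_green_components(grid)
    out = [[0] * out_size for _ in range(out_size)]
    for i in range(min(n, out_size)):
        out[i][i] = BLUE
    return out
```

The combination of component counting and diagonal generation illustrates why this task mixes numerical reasoning with structural pattern generation.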
| Model | Knowledge Augmentation | Solver Backbone (RSPC) |
| --- | --- | --- |
| GPT-o3-mini | 66K | 106K |
| Gemini-2.0 | 58K | 110K |
| QwQ-32B | 79K | 427K |
| DeepSeek-R1-70B | 66K | 252K |
Table 4: Average token cost for knowledge augmentation and solver backbone (RSPC) in KAAR across four evaluated LLMs. K is $10^3$ .
### A.8 Cost Analysis
Table 4 reports the average token cost, covering both prompts and LLM responses, of knowledge augmentation and the solver backbone (RSPC) when KAAR is used as the ARC solver. For each ARC task, we report the abstraction whose solution solves $I_t$; if none succeeds, the abstraction whose solution passes $I_r$; otherwise, the abstraction with the lowest token usage. Except for goal-directedness priors, all core knowledge priors in KAAR are generated offline using image processing algorithms from GPAR, resulting in comparable augmentation costs across all evaluated models. In contrast, token usage by the solver backbone varies substantially due to differences in the LLMs’ abstract reasoning and generalization capabilities. GPT-o3-mini solves most tasks efficiently and has the lowest solver-backbone token consumption; for this model, tokens used for knowledge augmentation amount to approximately 62% of the solver backbone’s token usage. The solver backbone consumes far more tokens with QwQ-32B, which consistently generates longer reasoning traces; in this case, knowledge augmentation accounts for only 19% of the solver backbone’s token usage. Figure 14 illustrates the average token cost for augmenting priors at each level in KAAR.
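The per-task abstraction-selection rule used for this cost report can be sketched as a short priority filter. The record fields (`solves_test`, `solves_train`, `tokens`) are hypothetical names, and ties within a priority tier are broken by abstraction order purely for illustration:

```python
def select_abstraction(results):
    """Pick one abstraction's record per task for cost reporting.

    `results` is a list of per-abstraction records, assumed to be in
    KAAR's abstraction order. Priority: a solution that solves the test
    instance, else one that passes the training instances, else the
    record with the lowest token usage.
    """
    test_ok = [r for r in results if r["solves_test"]]
    if test_ok:
        return test_ok[0]
    train_ok = [r for r in results if r["solves_train"]]
    if train_ok:
        return train_ok[0]
    return min(results, key=lambda r: r["tokens"])
```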
### A.9 Generalization
Figures 15 and 16 illustrate two ARC problems, 695367ec and b1fc8b8e, where both RSPC and KAAR successfully solve the training instances $I_r$ but fail on the test instances $I_t$ when using GPT-o3-mini. For problem 695367ec, the correct solution involves generating a fixed 15×15 output image by repeatedly copying the input image, changing its color to black, and adding internal horizontal and vertical lines colored with the original input image’s color. However, the RSPC-generated code applies a distinct rule to each input image size without considering generalization. For problem b1fc8b8e, the solution requires accurate object recognition despite component contact, followed by correctly placing each component into one of the four corners. However, RSPC fails to capture objectness, and its solution deviates from human intuition, overfitting to $I_r$. On both problems, KAAR exhibits the same limitations even though it adopts abstractions to enable objectness. KAAR begins with the simplest abstraction, no abstraction, under which it degenerates to RSPC. Consequently, it generates the same solution as RSPC and, because that solution already solves $I_r$, terminates without attempting other abstractions; the solution is then evaluated on $I_t$, where the overfitting is exposed.
### A.10 Problem Coverage across ARC Solvers
We report the relative problem coverage across nine ARC solvers based on successful test instance solutions using GPT-o3-mini (Figure 17), Gemini-2.0 (Figure 18), QwQ-32B (Figure 19), and DeepSeek-R1-70B (Figure 20). Each cell $(i,j)$ indicates the proportion of problems solved by the row solver that are also solved by the column solver. This is computed as $\frac{|A_i \cap A_j|}{|A_i|}$, where $A_i$ and $A_j$ are the sets of problems solved by the row and column solvers, respectively, following the same method used in Figure 5. Values close to 1 indicate that the column solver covers most problems solved by the row solver. GPT-o3-mini demonstrates the strongest overall coverage, with pairwise overlap consistently exceeding 0.55. Among all solvers, repeated sampling with standalone (P) and planning-aided code generation (PC) show the highest coverage, with column values consistently above 0.8 for GPT-o3-mini. This trend persists across Gemini-2.0, QwQ-32B, and DeepSeek-R1-70B. Under these models, repeated sampling with planning-aided code generation exhibits better alignment than its standalone code generation counterpart, generally yielding higher coverage values. However, planning-aided code generation under the direct generation setting shows weaker alignment, with column values around 0.40 for Gemini-2.0 and 0.35 for QwQ-32B. Among the four evaluated LLMs, DeepSeek-R1-70B demonstrates the lowest average off-diagonal coverage (i.e., $i \neq j$) of 0.603, suggesting potential output instability and variation attributable to solver choice.
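The coverage matrix can be reproduced in a few lines; here `solved` is a hypothetical mapping from each solver name to the set of task identifiers it solves:

```python
def coverage_matrix(solved):
    """Pairwise coverage: cell (i, j) = |A_i ∩ A_j| / |A_i|,
    the fraction of row solver i's problems also solved by column solver j."""
    names = list(solved)
    return {
        (i, j): (len(solved[i] & solved[j]) / len(solved[i])
                 if solved[i] else 0.0)
        for i in names for j in names
    }
```

Note that the matrix is asymmetric: a small solver's problems may be fully covered by a large one while the reverse coverage stays low.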
### A.11 Performance Analysis
Table 1 highlights performance variations across reasoning-oriented LLMs and ARC solvers with respect to both accuracy and generalization. Notably, the ARC solver, repeated sampling with standalone code generation, exhibits a substantial accuracy gap between $I_r$ and $I_r\&I_t$ , indicating limited generalization capability when using GPT-o3-mini and Gemini-2.0. In contrast, repeated sampling with planning-aided code generation demonstrates markedly improved generalization by preventing solutions from directly replicating the output matrices of training instances, as illustrated in Figure 21. This output copying, observed under repeated sampling with standalone code generation, accounts for approximately 24% and 95% of 83 and 101 overfitting problems with GPT-o3-mini and Gemini-2.0, respectively. When planning is incorporated, output copying is reduced to around 8% and 35% of 25 and 20 overfitting problems with GPT-o3-mini and Gemini-2.0, respectively. Additionally, the incorporation of planning facilitates accurate code generation. For example, in Figure 22, repeated sampling with planning-aided code generation produces a correct solution using GPT-o3-mini by replicating the input image horizontally or vertically based on the presence of a uniform row or column, as specified in the plan and implemented accordingly in code. In contrast, without planning assistance, standalone code generation produces incomplete logic, considering only whether the first column is uniform to determine the replication direction, which leads to failure on the test instance.
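The plan-guided logic from the Figure 22 example can be sketched as follows. The mapping from a uniform row or column to a replication direction is a hypothetical choice for illustration; the actual correspondence depends on the task:

```python
def any_uniform_row(grid):
    """True if some row consists of a single color."""
    return any(len(set(row)) == 1 for row in grid)

def any_uniform_col(grid):
    """True if some column consists of a single color."""
    return any(len(set(col)) == 1 for col in zip(*grid))

def replicate(grid):
    """Tile the image based on uniform rows/columns (assumed mapping:
    uniform row -> horizontal tiling, uniform column -> vertical tiling)."""
    if any_uniform_row(grid):
        return [row + row for row in grid]  # duplicate side by side
    if any_uniform_col(grid):
        return grid + grid                  # duplicate top to bottom
    return grid
```

By contrast, the flawed standalone solution described above checks only whether the *first* column is uniform, so inputs whose uniform line lies elsewhere take the wrong branch.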
For the ARC benchmark, repeated sampling–based methods achieve higher accuracy on $I_r$ , $I_t$ , and $I_r\&I_t$ compared to refinement-based approaches when using GPT-o3-mini and Gemini-2.0. Figure 23 presents an ARC problem where repeated sampling with planning-aided code generation yields a correct solution, whereas its refinement variant fails to correct the initial erroneous code, and the flawed logic persists across subsequent refinements when using GPT-o3-mini. Previous studies have shown that refinement can benefit from control flow graph information [17] and verified plans [18], which assist LLMs in locating and correcting bugs. However, these methods typically incur substantial token consumption, making them difficult to scale affordably.
### A.12 Limitations
KAAR improves the performance of reasoning-oriented LLMs on ARC tasks by progressively prompting with core knowledge priors. Although this inevitably increases token usage, the trade-off can be justified, as the exploration of LLM generalization remains in its early stages. KAAR integrates diverse abstraction methods to enable objectness and iteratively applies abstractions in order of increasing complexity. In contrast, humans typically infer appropriate abstractions directly from training instances, rather than leveraging exhaustive search. To address this, we prompt different LLMs with raw 2D matrices of each ARC problem to select one or three relevant abstractions, but the results are unsatisfactory. As previously discussed, accurate abstraction inference often depends on validation through viable solutions, thereby shifting the challenge back to solution generation. Additionally, KAAR augments core knowledge priors through prompting but lacks mechanisms to enforce LLM adherence to these priors during reasoning. While the KAAR-generated solutions generally conform to core knowledge priors, the intermediate reasoning processes may deviate from the intended patterns. Future work could explore fine-tuning or reinforcement learning to better align model behavior with the desired reasoning patterns.
| Abstraction | Description |
| --- | --- |
| No Abstraction | - |
| Whole Image | We consider the whole image as a component. |
| Middle-Vertical | We vertically split the image into two equal parts, treating each as a distinct component. |
| Middle-Horizontal | We horizontally split the image into two equal parts, treating each as a distinct component. |
| Multi-Lines | We use rows or columns with a uniform color to divide the input image into multiple components. |
| 4-Connected ∗ | We consider the 4-adjacent pixels of the same color as a component. |
| 4-Connected-Non-Background ∗ | We consider the 4-adjacent pixels of the same color as a component, excluding components with the background color. |
| 4-Connected-Non-Background-Edge ∗ | We consider the 4-adjacent pixels of the same color as a component, containing components with the background color when they are not attached to the edges of the image. |
| 4-Connected-Multi-Color-Non-Background ∗ | We consider 4-adjacent pixels as a component, which may contain different colors, while excluding components with the background color. |
| 4-Connected-Bounding-Box ∗ | We consider 4-adjacent pixels of the same color, and treat all pixels within their bounding box as a component, which may include different colors. |
| 4-Connected-With-Black ∗ | We consider the 4-adjacent pixels of black color, represented by the value 0, as a component, excluding components with other colors. |
| Same-Color | We consider pixels of the same color as a component, excluding components with the background color. |
Table 5: Abstractions in KAAR. The superscript “∗” denotes that an 8-connected version is also considered. The background color is black if black is present; otherwise, it is the most frequent color in the image. Abstractions are listed according to their prioritization in KAAR, from top to bottom, with the 8-connected abstractions appended at the end of the sequence in the same order as their 4-connected counterparts. Abstractions highlighted in red are exclusive to KAAR.
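As a concrete instance of the component-based abstractions above, the following is a minimal sketch of the 4-Connected-Non-Background abstraction, including the background-color rule from the caption (black if present, otherwise the most frequent color); function names are illustrative:

```python
from collections import Counter, deque

def background_color(grid):
    """Black (0) if present; otherwise the most frequent color."""
    counts = Counter(v for row in grid for v in row)
    return 0 if 0 in counts else counts.most_common(1)[0][0]

def components_4_non_background(grid):
    """4-Connected-Non-Background: 4-adjacent pixels of the same color
    form a component; background-colored components are excluded."""
    bg = background_color(grid)
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for r in range(h):
        for c in range(w):
            if seen[r][c] or grid[r][c] == bg:
                continue
            color, comp = grid[r][c], []
            queue = deque([(r, c)])  # BFS flood fill over one color
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] \
                            and grid[ny][nx] == color:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            comps.append((color, comp))
    return comps
```

The 8-connected variants differ only in the neighbor offsets, adding the four diagonal directions.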
| Category | Priors |
| --- | --- |
| Geometry and Topology | Size (Width and Height); Color; Shape (One Pixel; Horizontal Line; Vertical Line; Diagonal Line; Square; Rectangle; Cross; Irregular Shape); Symmetry (Horizontal Symmetry; Vertical Symmetry; Diagonal Symmetry; Anti-Diagonal Symmetry; Central Symmetry); Bounding Box; Hole Count; Nearest Boundary; Different/Identical with Other Components; Touching; Inclusive; Spatial (Horizontally Aligned to the Right; Horizontally Aligned to the Left; Vertically Aligned Below; Vertically Aligned Above; Top-Left; Top-Right; Bottom-Left; Bottom-Right; Same Position) |
| Numbers and Counting | Component Size Counting; Components with Same Size; Components with Most Frequent Size; Components with Least Frequent Size; Components with Maximum Size; Components with Minimum Size; Component Color Counting; Components with Same Color; Components with Same Number of Colors; Components with Most Frequent Color; Components with Least Frequent Color; Component with Most Distinct Colors; Component with Fewest Distinct Colors; Component Shape Counting; Components with Same Shape; Components with Most Frequent Shape; Components with Least Frequent Shape; Component Hole Number Counting; Components with Same Number of Holes; Components with Maximum Number of Holes; Components with Minimum Number of Holes; Component Symmetry Counting |
| Goal-directedness | Color Change (modifying component value); Movement (shifting component’s position); Extension (expanding component’s area); Completing (filling in missing parts of a component); Resizing (altering component size); Selecting (isolating a component); Copying (duplicating a component); Flipping (mirroring a component); Rotation (rotating a component); Cropping (cutting part of a component) |
Table 6: KAAR priors classified into geometry and topology, numbers and counting, and goal-directedness. For goal-directedness, we incorporate ten predefined actions, with their corresponding action schemas detailed in Table 7.
| Action | Schemas (prompting order, left to right) | | | | | |
| --- | --- | --- | --- | --- | --- | --- |
| Color Change | Targets | Source and Target Colors | | | | |
| Movement | Targets | Direction | Start and End Locations | Pattern | Order | Overlapping |
| Extension | Targets | Direction | Start and End Locations | Pattern | Order | Intersection |
| Completing | Targets | Pattern | | | | |
| Resizing | Targets | Source and Target Sizes | | | | |
| Selecting | Targets | | | | | |
| Copying | Targets | Locations | Overlapping | | | |
| Flipping | Targets | Flipping Axis | Overlapping | | | |
| Rotation | Targets | Degrees | | | | |
| Cropping | Targets | Subsets | | | | |
Table 7: Actions in KAAR and their schemas (implementation details). Each action schema is presented according to its prompting order in KAAR (left to right). Some actions include a pattern schema that prompts the LLM to identify underlying logic rules, such as repeating every two steps in movement or extension, or completing based on three-color repetition. Targets denote the target components.
| Abstraction | Geometry and Topology | Numbers and Counting | Goal-directedness |
| --- | --- | --- | --- |
| whole image | Symmetry; Size | - | Flipping; Rotation; Extension; Completing; Cropping |
| middle-vertical | Size | - | Flipping; Movement |
| middle-horizontal | Size | - | Flipping; Movement |
| multi-lines | Size; Color; Shape; Symmetry; Bounding Box; Hole Count | ALL | ALL |
| 4-connected-multi-color-non-background ∗ | ALL | … Component Color Counting; Components with Same Number of Colors; Component with Most Distinct Colors; Component with Fewest Distinct Colors … | ALL |
Table 8: Abstractions with their assigned knowledge priors. “–” denotes no priors, while “ALL” indicates all priors in the corresponding category, as defined in Table 6. The superscript “ ∗ ” indicates that the 8-connected version is also applicable. The highlighted priors apply exclusively to their corresponding abstractions. For the 4/8-connected-multi-color-non-background abstractions, we present color-counting priors specific to multi-colored components, while all other non-color-counting priors follow those in Table 6.
<details>
<summary>x12.png Details</summary>

### Visual Description
Grouped stacked bar chart of accuracy on $I_t$ (%) versus average image size interval (width × height) for Gemini-2.0 and DeepSeek-R1-70B, with each bar split into RSPC and KAAR contributions. Approximate values read from the chart, with task counts per interval in parentheses:

| Interval (tasks) | Gemini-2.0 RSPC | Gemini-2.0 KAAR | DeepSeek-R1-70B RSPC | DeepSeek-R1-70B KAAR |
| --- | --- | --- | --- | --- |
| (0,25] (19) | 63.2 | 15.8 | 47.4 | 5.3 |
| (25,100] (139) | 28.8 | 7.9 | 15.1 | 6.5 |
| (100,225] (129) | 9.3 | 4.7 | 0.8 | 7.0 |
| (225,400] (51) | 5.9 | 2.0 | 0.0 | 0.0 |
| (400,625] (39) | 0.0 | 0.0 | 0.0 | 0.0 |
| (625,900] (23) | 0.0 | 0.0 | 0.0 | 0.0 |

Accuracy declines sharply as image size grows; Gemini-2.0 outperforms DeepSeek-R1-70B in every non-zero interval, and both models drop to zero beyond (225,400].
</details>
Figure 12: Accuracy on test instances $I_t$ for RSPC and KAAR across average image size intervals, evaluated with Gemini-2.0 and DeepSeek-R1-70B.
<details>
<summary>x13.png Details</summary>

### Visual Description
Line chart of accuracy on $I_r\&I_t$ (%) over 12 iterations for RSPC and KAAR with Gemini-2.0 and DeepSeek-R1-70B. The x-axis is segmented into three prior levels: objectness (iterations 1–4), geometry, topology, numbers and counting (4–8), and goal-directedness (8–12). Approximate values read from the chart:

| Series | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini-2.0: KAAR | 9.5 | 13.25 | 14.75 | 15 | 15 | 16.25 | 16.5 | 16.5 | 19.75 | 20.5 | 20.5 | 20.5 |
| Gemini-2.0: RSPC | 7.5 | 11.75 | 13.25 | 13.5 | 13.5 | 15 | 15.25 | 15.25 | 15.75 | 16.5 | 16.5 | 16.5 |
| DeepSeek-R1-70B: KAAR | 3.75 | 4 | 5.5 | 6.5 | 7 | 8.25 | 8.5 | 8.75 | 10.75 | 11.25 | 11.25 | 11.5 |
| DeepSeek-R1-70B: RSPC | 3 | 3.25 | 4.5 | 5.5 | 5.5 | 6.75 | 7 | 7.25 | 7.25 | 7.25 | 7.25 | 7.25 |

KAAR stays above RSPC for both models throughout; the largest gains occur during the objectness phase, and Gemini-2.0: KAAR shows a further jump when goal-directedness priors are introduced (iteration 9).
</details>
Figure 13: Variance in accuracy on $I_r\&I_t$ with increasing iterations for RSPC and KAAR using Gemini-2.0 and DeepSeek-R1-70B.
<details>
<summary>x14.png Details</summary>

### Visual Description
Grouped bar chart of average token usage (in thousands) for augmenting priors at each level in KAAR, broken down by model:

| Prior Level | GPT-o3-mini | Gemini-2.0 | QwQ-32B | DeepSeek-R1-70B |
| --- | --- | --- | --- | --- |
| Objectness | 11K | 12K | 20K | 15K |
| Geometry, Topology, Numbers and Counting | 40K | 24K | 29K | 37K |
| Goal-directedness | 19K | 31K | 43K | 18K |

The geometry, topology, numbers and counting level is the most expensive on average, while QwQ-32B's 43K tokens for goal-directedness is the largest single value in the chart.
### Interpretation
This chart provides a comparative analysis of computational cost (token usage) for different AI models on specific cognitive benchmarks. The data suggests that:
* **Task-Dependent Efficiency:** No single model is the most token-efficient across all task types. Model performance in terms of resource usage is highly dependent on the nature of the problem. A model efficient at object recognition may be inefficient at geometric reasoning.
* **Potential Correlation with Model Size/Architecture:** The model names hint at different scales (e.g., "32B", "70B" parameters). The high token usage of QwQ-32B and DeepSeek-R1-70B on certain tasks could correlate with larger model sizes engaging in more extensive internal reasoning chains, though this is an inference beyond the explicit data.
* **Benchmarking Insight:** For researchers or engineers, this data is crucial for understanding the operational costs of deploying these models. Choosing a model for a "Goal-directedness" task would involve a trade-off: QwQ-32B might offer higher capability (implied by its high token usage, suggesting deeper processing) at a much higher computational cost than GPT-o3-mini or DeepSeek-R1-70B.
* **The "Geometry..." Task as a Differentiator:** This category acts as a strong differentiator, separating models into high-usage (GPT-o3-mini, DeepSeek-R1-70B) and moderate-usage (Gemini-2.0, QwQ-32B) groups, which may reflect different underlying approaches to mathematical and spatial reasoning.
In essence, the chart moves beyond simple accuracy scores to reveal the hidden "cost" of AI reasoning in terms of token consumption, highlighting that model selection must consider both performance and efficiency for specific application domains.
</details>
Figure 14: Average token cost for augmenting priors at each level across four LLMs. K is $10^3$ .
<details>
<summary>x15.png Details</summary>

### Visual Description
## Diagram: Algorithm for Generating Grid Patterns from Input Images
### Overview
The image presents a technical diagram illustrating a programming task (labeled "Task 695367ec") and its corresponding "Code Solution." It visually demonstrates a transformation rule where simple, solid-colored input grids are converted into complex, patterned output grids. The right side contains the Python code that implements this transformation. The final example (a yellow 5x5 grid) has a question mark, indicating the output is to be determined by applying the provided algorithm.
### Components/Axes
The image is divided into two primary sections:
1. **Left Section (Visual Examples):**
* **Structure:** Four rows, each showing an input grid (left), an arrow, and the resulting output grid (right).
* **Input Grids:** Solid-colored squares with visible grid lines indicating their dimensions.
* Row 1: Green 3x3 grid.
* Row 2: Blue 2x2 grid.
* Row 3: Red 4x4 grid.
* Row 4: Yellow 5x5 grid (output is a white box with a black question mark).
* **Output Grids:** 15x15 pixel grids with a black background. Colored pixels (matching the input color) form a pattern of intersecting horizontal and vertical lines, creating a "grid of grids" effect. The pattern density varies with the input size.
2. **Right Section (Code Solution):**
* **Title:** "Code Solution" in a rounded rectangle.
* **Content:** A Python function definition `def generate_output_image(input_image):` with extensive comments.
* **Highlighted Region:** A red dashed box surrounds the core logic for determining grid-line positions, with an attached note: "Generate a rule for each input size."
### Detailed Analysis
**Code Transcription and Logic:**
The Python function `generate_output_image` takes a 2D list (`input_image`) representing the input grid. The algorithm proceeds as follows:
1. **Extract Color:** `v = input_image[0][0]` – The color value (e.g., green, blue) is taken from the top-left pixel of the input.
2. **Determine Input Size:** `n = len(input_image)` – Assumes the input is square.
3. **Set Output Size:** `out_size = 15` – The output is always a 15x15 grid.
4. **Define Grid Indices (Core Rule):** This is the critical step, highlighted in the diagram. The code sets a list `grid_indices` that defines which rows and columns in the 15x15 output will be painted with the color `v`.
* For `n` (input size) of 2 or 5: `grid_indices = {2, 5, 8, 11, 14}`
* For `n = 3`: `grid_indices = {3, 7, 11}`
* For `n = 4`: `grid_indices = {4, 9, 14}`
* **Fallback Rule (for other `n`):** `block_size = out_size // (n + 1)`, then `grid_indices = {(i + 1) * block_size - 1 for i in range(n)}`. This evenly spaces `n` grid lines within the 15-element dimension.
5. **Generate Output:** The code iterates through each row `r` of the 15x15 output.
* If `r` is in `grid_indices`, the entire row is filled with color `v` (creating a horizontal separator line).
* Otherwise, it creates a row where only the pixels at column indices `c` that are in `grid_indices` are painted with `v` (creating the vertical segments of the pattern).
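The algorithm described above can be reconstructed as a self-contained Python function (a sketch assembled from the transcription; names follow the code shown in the figure):

```python
def generate_output_image(input_image):
    # Color is taken from the top-left pixel; the input is assumed square.
    v = input_image[0][0]
    n = len(input_image)
    out_size = 15  # the output is always a 15x15 grid

    # Core rule: a lookup table mapping input size to grid-line indices.
    if n in (2, 5):
        grid_indices = {2, 5, 8, 11, 14}
    elif n == 3:
        grid_indices = {3, 7, 11}
    elif n == 4:
        grid_indices = {4, 9, 14}
    else:
        # Fallback: evenly space n grid lines within the 15-cell dimension.
        block_size = out_size // (n + 1)
        grid_indices = {(i + 1) * block_size - 1 for i in range(n)}

    output = []
    for r in range(out_size):
        if r in grid_indices:
            # Horizontal separator line: the whole row takes color v.
            output.append([v] * out_size)
        else:
            # Vertical segments: only columns in grid_indices take color v.
            output.append([v if c in grid_indices else 0
                           for c in range(out_size)])
    return output
```

For a 3x3 input the function paints lines at rows and columns 3, 7, and 11, matching the green example in the figure.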
**Visual Pattern Verification:**
* **3x3 Input (Green):** Output shows green lines at rows/columns 3, 7, and 11, creating a 2x2 arrangement of large black squares within the 15x15 grid.
* **2x2 Input (Blue):** Output shows blue lines at rows/columns 2, 5, 8, 11, and 14, creating a denser 5x5 arrangement of small black squares.
* **4x4 Input (Red):** Output shows red lines at rows/columns 4, 9, and 14, creating a 3x3 arrangement of black squares.
* **5x5 Input (Yellow):** According to the code rule for `n=5`, the output should have yellow lines at indices {2, 5, 8, 11, 14}, identical to the pattern for the 2x2 input but in yellow.
### Key Observations
1. **Fixed Output Dimension:** Regardless of input size (2x2 to 5x5), the output is always rendered in a 15x15 pixel space.
2. **Discrete Rule Set:** The algorithm uses a lookup table for common input sizes (2,3,4,5) rather than a single mathematical formula, suggesting these were the specific cases defined for the task.
3. **Pattern Inversion:** The output is essentially a "wireframe" or "grid overlay" of the input's structure, but scaled and positioned within a fixed canvas. The black squares in the output represent the "cells" of the original input grid.
4. **Size-to-Pattern Mapping:** The relationship is not linear. Both 2x2 and 5x5 inputs produce the same spatial frequency of grid lines (indices {2,5,8,11,14}), while 3x3 and 4x4 produce different, coarser patterns.
### Interpretation
This diagram explains a **visual programming puzzle or algorithm challenge**. The core task is to infer a transformation rule from examples and implement it. The "rule" is the mapping from an input grid's dimension (`n`) to a specific set of coordinates (`grid_indices`) in a fixed-size output canvas.
The provided solution reveals that the rule is **empirical and case-based** for the given examples, with a general fallback. The interesting anomaly is that the 2x2 and 5x5 inputs share the same output pattern geometry. This suggests the rule might be designed to create a visually distinct pattern for each input size within the constraints of a 15x15 output, but the mapping for 2 and 5 converges. The question mark for the 5x5 yellow grid tests the viewer's ability to apply the stated rule: since `n=5` is explicitly handled, its output should be a yellow version of the blue 2x2 pattern.
The diagram serves as both a problem statement (left) and its solution (right), emphasizing the importance of precise rule extraction and implementation in computational thinking. The highlighted code section underscores that the heart of the problem is defining the `grid_indices` for each input size.
</details>
Figure 15: ARC problem 695367ec, where RSPC and KAAR generate the same code solution that passes the training instances but fails on the test instance using GPT-o3-mini.
<details>
<summary>x16.png Details</summary>

### Visual Description
## Programming Puzzle: Task b1fc8b8e
### Overview
The image presents a visual programming puzzle. On the left, it displays a series of input-output grid transformations (three complete examples and one test case). On the right, it shows a Python code solution attempting to implement the transformation logic, accompanied by a critical annotation. The puzzle involves transforming a 6x6 binary grid (black/light blue) into a 5x5 binary grid based on a pattern derived from the input's first row.
### Components/Axes
**Left Panel (Visual Examples):**
* **Header:** "Task b1fc8b8e" (top-left).
* **Structure:** Four rows. Each row contains:
1. A 6x6 input grid (left).
2. A right-pointing arrow.
3. A 5x5 output grid (right). The fourth output grid is a placeholder with a "?".
* **Grid Content:** Cells are either black (representing 0 or inactive) or light blue (representing 8 or active, as inferred from the code).
**Right Panel (Code Solution):**
* **Header:** "Code Solution" (top-right).
* **Code Language:** Python.
* **Key Code Elements:**
* Function definition: `def generate_output_image(input_image):`
* **Critical Logic Block (within a red dashed box):**
* Comment: `# Count how many 8's appear in the first row.`
* Line: `count_eights = sum(1 for pixel in input_image[0] if pixel == 8)`
* Conditional: `if count_eights >= 2:` to choose between two patterns (`active_pattern` or `softer-border`).
* Pattern Definitions:
* `active_pattern = [8, 8, 0, 8, 8]`
* `softer-border` patterns: `top_active = [0, 8, 0, 0, 8]`, `second_active = [8, 8, 0, 8, 8]`
* **Output Construction:** Builds a 5x5 list named `output_image` with the structure: `[top_active, second_active, blank, top_active, second_active]`, where `blank = [0, 0, 0, 0, 0]`.
* **Annotation (Red Box, right side):**
* Text: "No objective-centric reasoning. Rules are only applied to training instances."
### Detailed Analysis
**Visual Pattern Analysis (Left Panel):**
1. **Example 1 (Top Row):**
* **Input (6x6):** First row has 3 light blue cells (positions 2, 4, 6). The overall input has a scattered pattern.
* **Output (5x5):** A symmetric pattern. Rows 1 & 4 are `[0, 8, 0, 8, 0]`. Rows 2 & 5 are `[8, 0, 8, 0, 8]`. Row 3 (middle) is all black `[0,0,0,0,0]`.
2. **Example 2 (Second Row):**
* **Input (6x6):** First row has 4 light blue cells (positions 1, 2, 3, 6).
* **Output (5x5):** A different symmetric pattern. Rows 1 & 4 are `[8, 8, 0, 8, 8]`. Rows 2 & 5 are `[8, 8, 0, 8, 8]`. Row 3 is all black.
3. **Example 3 (Third Row):**
* **Input (6x6):** First row has 2 light blue cells (positions 3, 5).
* **Output (5x5):** Identical to Example 1's output pattern.
4. **Test Case (Bottom Row):**
* **Input (6x6):** First row has 2 light blue cells (positions 2, 5). The overall input pattern is distinct from the previous examples.
* **Output (5x5):** Unknown (marked with "?").
**Code Logic Analysis (Right Panel):**
The code attempts to formalize the observed rule:
* It counts the number of `8`s (light blue) in the **first row only** of the 6x6 input.
* **Rule 1 (>=2 eights):** Uses `active_pattern = [8, 8, 0, 8, 8]` for both `top_active` and `second_active`. This would produce an output where all four active rows are `[8, 8, 0, 8, 8]`.
* **Rule 2 (<2 eights):** Uses a "softer-border" pattern: `top_active = [0, 8, 0, 0, 8]` and `second_active = [8, 8, 0, 8, 8]`. This would produce an output where rows 1&4 are `[0, 8, 0, 0, 8]` and rows 2&5 are `[8, 8, 0, 8, 8]`.
* The final output is always a 5x5 grid with a blank middle row, constructed by mirroring the two active rows vertically.
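The logic transcribed above can be reconstructed as a short Python sketch (the "softer-border" label from the transcription is kept as a comment; the flawed, example-fitting rule is preserved deliberately):

```python
def generate_output_image(input_image):
    # Reconstruction of the generated (overfitted) solution described above.
    blank = [0, 0, 0, 0, 0]
    # Count how many 8's appear in the first row of the 6x6 input.
    count_eights = sum(1 for pixel in input_image[0] if pixel == 8)
    if count_eights >= 2:
        # "Block" style: all four active rows are identical.
        top_active = second_active = [8, 8, 0, 8, 8]
    else:
        # "Softer-border" variant.
        top_active = [0, 8, 0, 0, 8]
        second_active = [8, 8, 0, 8, 8]
    # 5x5 output: two active rows mirrored around a blank middle row.
    return [top_active, second_active, blank, top_active, second_active]
```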
**Discrepancy Check:**
* The code's Rule 1 output (`[8,8,0,8,8]` for all four active rows) **matches** the output of Example 2, whose first row contains 4 eights: rows 1&4 and rows 2&5 are all `[8,8,0,8,8]`.
* The code's Rule 2 output does not exactly match the output of Examples 1 or 3. Examples 1 & 3 output is `[0,8,0,8,0]` and `[8,0,8,0,8]`. The code's "softer-border" pattern is `[0,8,0,0,8]` and `[8,8,0,8,8]`. This is a clear mismatch, indicating the code's logic is flawed or incomplete.
### Key Observations
1. **Symmetry is Key:** All output grids exhibit vertical and horizontal symmetry around the central blank row and column.
2. **Input-First-Row Dependency:** The transformation rule appears to be triggered solely by the count of active cells in the **first row** of the input grid, not the overall input pattern.
3. **Two Distinct Output Patterns:** The three examples show only two unique output patterns:
* **Pattern A (Examples 1 & 3):** A "checkerboard" style with alternating active cells.
* **Pattern B (Example 2):** A "block" style with solid active rows.
4. **Code-Example Mismatch:** The provided code solution does not correctly generate the output for Examples 1 and 3. Its "softer-border" pattern is incorrect.
5. **Critical Annotation:** The red-boxed text provides a meta-critique of the code's logic, stating it lacks "objective-centric reasoning" and only works for the given training instances (the examples), implying it won't generalize to the test case or other unseen inputs.
### Interpretation
This image documents a **failed or incomplete attempt at solving an ARC (Abstraction and Reasoning Corpus)-like puzzle**. The puzzle tests the ability to infer a generalizable rule from a few examples.
* **What the Data Suggests:** The true underlying rule likely involves more than just counting the first row. It may consider the *position* of active cells in the first row, or a combination of the first row with another feature of the input grid. The symmetry in the output suggests the rule constructs a mirrored pattern.
* **Relationship Between Elements:** The code is a direct, but erroneous, translation of an observed hypothesis. The annotation serves as a peer review, pointing out the hypothesis's weakness—it's a "memorization" of the training examples rather than a deduction of the core objective or principle.
* **Notable Anomaly:** The most significant anomaly is the discrepancy between the code's "softer-border" pattern and the actual outputs of Examples 1 and 3. This is the concrete evidence supporting the annotation's claim of flawed reasoning.
* **Purpose:** The image likely serves an educational or analytical purpose, illustrating a common pitfall in program synthesis or inductive reasoning: creating a solution that fits the provided examples perfectly but fails because it captures superficial correlations rather than the deep structure of the problem. The test case with the "?" is the challenge to see if a corrected, objective-centric rule can be formulated.
</details>
Figure 16: ARC problem b1fc8b8e, where RSPC and KAAR generate the same code solution that passes the training instances but fails on the test instance using GPT-o3-mini.
<details>
<summary>x17.png Details</summary>

### Visual Description
## Heatmap: Coverage Matrix for gpt-o3-mini
### Overview
The image is a 9x9 heatmap displaying numerical "Coverage" values between nine ARC-solving approaches. The chart is titled "gpt-o3-mini" at the bottom center. The data is presented as an asymmetric matrix where each cell's color intensity (from light orange to dark red) corresponds to a coverage value between 0.0 and 1.0, as indicated by a vertical color bar legend on the right side.
### Components/Axes
* **Chart Type:** Heatmap (asymmetric coverage matrix).
* **Title/Label:** "gpt-o3-mini" (located at the bottom center of the chart).
* **X-Axis (Top):** Labels are rotated 45 degrees. From left to right:
1. Direct Generation P
2. Direct Generation C
3. Direct Generation PC
4. Repeated Sampling P
5. Repeated Sampling C
6. Repeated Sampling PC
7. Refinement P
8. Refinement C
9. Refinement PC
* **Y-Axis (Left):** Labels are rotated 45 degrees. From top to bottom:
1. Direct Generation P
2. Direct Generation C
3. Direct Generation PC
4. Repeated Sampling P
5. Repeated Sampling C
6. Repeated Sampling PC
7. Refinement P
8. Refinement C
9. Refinement PC
* **Legend (Right Side):** A vertical color bar labeled "Coverage". The scale runs from 0.0 (lightest orange/cream) at the bottom to 1.0 (darkest red) at the top, with a midpoint marker at 0.5.
* **Data Cells:** Each cell in the 9x9 grid contains a numerical value (to two decimal places) representing the coverage between the row method and the column method. The diagonal cells (where row and column are identical) all have a value of 1.00 and are the darkest red.
### Detailed Analysis
The matrix is asymmetric (e.g., the value at [Row: Direct Generation C, Column: Direct Generation P] is 0.61, while the value at [Row: Direct Generation P, Column: Direct Generation C] is 0.74), so row and column order matters. The following table reconstructs the full data matrix. Values are read directly from the cells.
| Row \ Column | Direct Generation P | Direct Generation C | Direct Generation PC | Repeated Sampling P | Repeated Sampling C | Repeated Sampling PC | Refinement P | Refinement C | Refinement PC |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Direct Generation P** | **1.00** | 0.74 | 0.75 | 0.80 | 0.89 | 0.84 | 0.74 | 0.81 | 0.75 |
| **Direct Generation C** | 0.61 | **1.00** | 0.71 | 0.68 | 0.89 | 0.83 | 0.76 | 0.80 | 0.70 |
| **Direct Generation PC** | 0.69 | 0.79 | **1.00** | 0.73 | 0.91 | 0.84 | 0.73 | 0.81 | 0.80 |
| **Repeated Sampling P** | 0.68 | 0.71 | 0.68 | **1.00** | 0.87 | 0.88 | 0.80 | 0.75 | 0.69 |
| **Repeated Sampling C** | 0.55 | 0.67 | 0.62 | 0.64 | **1.00** | 0.80 | 0.65 | 0.78 | 0.69 |
| **Repeated Sampling PC** | 0.55 | 0.66 | 0.61 | 0.68 | 0.85 | **1.00** | 0.67 | 0.76 | 0.67 |
| **Refinement P** | 0.61 | 0.75 | 0.66 | 0.77 | 0.85 | 0.84 | **1.00** | 0.75 | 0.71 |
| **Refinement C** | 0.56 | 0.67 | 0.62 | 0.61 | 0.87 | 0.79 | 0.63 | **1.00** | 0.73 |
| **Refinement PC** | 0.59 | 0.67 | 0.69 | 0.64 | 0.87 | 0.81 | 0.68 | 0.83 | **1.00** |
### Key Observations
1. **Diagonal Perfection:** All diagonal cells (self-comparison) have a coverage of 1.00, forming a dark red line from top-left to bottom-right.
2. **Highest Off-Diagonal Values:** The highest coverage values outside the diagonal are found in the "Repeated Sampling C" column, particularly when intersected with "Direct Generation PC" (0.91) and "Direct Generation P" (0.89). This suggests strong coverage overlap between these specific method pairs.
3. **Lowest Values:** The lowest coverage values (0.55) occur in the "Repeated Sampling C" and "Repeated Sampling PC" rows when compared against "Direct Generation P". This indicates the weakest coverage relationship in the matrix.
4. **Method Grouping Patterns:**
* Methods within the same family (e.g., all "Direct Generation" variants) generally show moderate to high coverage with each other (values typically >0.70).
* The "Repeated Sampling" methods show particularly high internal coverage (e.g., Repeated Sampling C vs. Repeated Sampling PC = 0.80).
* Coverage between "Refinement" methods and "Direct Generation" methods tends to be lower (often in the 0.55-0.75 range) compared to coverage within the "Repeated Sampling" family.
5. **Color-Value Consistency:** The color gradient accurately reflects the numerical values. Cells with values near 1.0 are dark red, values near 0.5 are medium orange, and values near 0.0 would be light cream (though no values below 0.55 are present).
### Interpretation
This heatmap quantifies the "Coverage" relationship between nine ARC-solving approaches for the `gpt-o3-mini` model. Coverage here is the proportion of problems solved by the row method whose test instances are also solved by the column method.
* **What the data suggests:** The matrix reveals a hierarchy of methodological overlap. The "Repeated Sampling" family (especially the 'C' variant) appears to be a central, highly interconnected hub, showing strong coverage with most other methods. "Direct Generation" methods form another cluster. "Refinement" methods, while internally consistent, show weaker coverage links to the "Direct Generation" approaches.
* **How elements relate:** The asymmetry of the matrix reflects that coverage is directional: one method's solved problems may be largely subsumed by another method without the reverse holding. The high value between "Direct Generation PC" and "Repeated Sampling C" (0.91) is a key finding, suggesting these two distinct approaches yield highly overlapping solved sets. Conversely, the low values between the "Repeated Sampling" methods and "Direct Generation P" (0.55) highlight a significant divergence in what these methods cover.
* **Notable anomalies/trends:** The most striking trend is the central role of "Repeated Sampling C," which covers a large fraction of the problems solved by every other method. An anomaly is the relatively low coverage (0.61) between two of the most basic methods, "Direct Generation C" and "Direct Generation P," suggesting their solved sets are quite distinct despite sharing a common high-level approach. The "PC" variants (planning-aided code generation) generally show higher coverage with other methods than their "P" (plan-only) or "C" (standalone code) counterparts, indicating a more comprehensive overlap profile.
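Reading "coverage" as the relative, direction-dependent metric defined in the figure caption (the fraction of the row solver's solved problems that the column solver also solves), the matrix can be computed with a minimal sketch (solver names and solved sets below are illustrative):

```python
def relative_coverage(solved):
    """solved maps solver name -> set of problem ids it solves.

    Returns cov[row][col] = |solved[row] & solved[col]| / |solved[row]|.
    """
    return {
        r: {c: (len(rs & solved[c]) / len(rs)) if rs else 0.0
            for c in solved}
        for r, rs in solved.items()
    }

# The metric is asymmetric: a solver with a small solved set can be fully
# covered by a larger one without the reverse holding.
cov = relative_coverage({"A": {1, 2, 3, 4}, "B": {1, 2}})
```

Here `cov["B"]["A"]` is 1.0 (everything B solves, A also solves) while `cov["A"]["B"]` is only 0.5, which is exactly the kind of asymmetry visible in the matrix.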
</details>
Figure 17: Asymmetric relative coverage matrix of nine ARC solvers using GPT-o3-mini, showing the proportion of problems whose test instances are solved by the row solver that are also solved by the column solver. P denotes the solution plan; C and PC refer to standalone and planning-aided code generation, respectively.
<details>
<summary>x18.png Details</summary>

### Visual Description
## Heatmap: Coverage Matrix for Gemini-2.0 Generation & Refinement Strategies
### Overview
The image is a 9x9 heatmap titled "Gemini-2.0" at the bottom center. It visualizes an asymmetric matrix of "Coverage" values between nine different strategies or methods. The color scale, located on the right, ranges from 0.0 (light orange) to 1.0 (dark red), indicating the degree of coverage overlap. Each cell contains a precise numerical value.
### Components/Axes
* **Chart Type:** Heatmap (symmetric matrix).
* **Title:** "Gemini-2.0" (bottom center).
* **Color Scale/Legend:** A vertical bar on the right side labeled "Coverage". It shows a gradient from light orange (0.0) to dark red (1.0), with tick marks at 0.0, 0.5, and 1.0.
* **X-Axis Labels (Top, rotated ~45 degrees):**
1. Direct Generation P
2. Direct Generation C
3. Direct Generation PC
4. Repeated Sampling P
5. Repeated Sampling C
6. Repeated Sampling PC
7. Refinement P
8. Refinement C
9. Refinement PC
* **Y-Axis Labels (Left, rotated ~45 degrees):** Identical to the X-axis labels, in the same order from top to bottom.
* **Data Cells:** A 9x9 grid of colored squares. Each square contains a two-decimal numerical value representing the coverage metric between the row and column strategy. The diagonal cells (where row and column are identical) are all 1.00 and are the darkest red.
### Detailed Analysis
The matrix is asymmetric (A vs. B need not equal B vs. A). Below is the full data extraction, presented as a table for clarity. The row and column headers correspond to the axis labels.
| | Dir. Gen. P | Dir. Gen. C | Dir. Gen. PC | Rep. Samp. P | Rep. Samp. C | Rep. Samp. PC | Refinement P | Refinement C | Refinement PC |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Dir. Gen. P** | **1.00** | 0.54 | 0.46 | 0.64 | 0.79 | 0.82 | 0.57 | 0.75 | 0.79 |
| **Dir. Gen. C** | 0.56 | **1.00** | 0.48 | 0.78 | 0.89 | 0.89 | 0.63 | 0.81 | 0.74 |
| **Dir. Gen. PC** | 0.52 | 0.52 | **1.00** | 0.72 | 0.84 | 0.88 | 0.56 | 0.72 | 0.84 |
| **Rep. Samp. P** | 0.45 | 0.53 | 0.45 | **1.00** | 0.85 | 0.88 | 0.57 | 0.70 | 0.72 |
| **Rep. Samp. C** | 0.37 | 0.41 | 0.36 | 0.58 | **1.00** | 0.86 | 0.49 | 0.63 | 0.68 |
| **Rep. Samp. PC** | 0.34 | 0.36 | 0.33 | 0.52 | 0.76 | **1.00** | 0.45 | 0.58 | 0.61 |
| **Refinement P** | 0.46 | 0.49 | 0.40 | 0.66 | 0.83 | 0.86 | **1.00** | 0.66 | 0.80 |
| **Refinement C** | 0.45 | 0.47 | 0.38 | 0.60 | 0.79 | 0.83 | 0.49 | **1.00** | 0.70 |
| **Refinement PC** | 0.46 | 0.42 | 0.44 | 0.60 | 0.83 | 0.85 | 0.58 | 0.69 | **1.00** |
**Trend Verification by Row:**
* **Direct Generation P:** Its solved problems are covered to a moderate-to-high degree by the Repeated Sampling and Refinement methods (0.57-0.82), but less so by the other Direct Generation methods (0.46-0.54).
* **Direct Generation C:** Exhibits very high coverage by Repeated Sampling C & PC (0.89) and Refinement C (0.81).
* **Repeated Sampling PC:** Has the lowest overall coverage by other methods, particularly by the Direct Generation strategies (0.33-0.36).
* **Within-Group Coverage:** Methods within the same family (e.g., Repeated Sampling C vs. Repeated Sampling PC) generally show high mutual coverage (0.76-0.89).
### Key Observations
1. **Highest Values (Dark Red):** The highest off-diagonal values are 0.89 (Direct Generation C vs. Repeated Sampling C and PC) and 0.88 (Direct Generation PC vs. Repeated Sampling PC, and Repeated Sampling P vs. PC).
2. **Lowest Values (Light Orange):** The lowest values are concentrated in the "Repeated Sampling PC" row/column when compared to "Direct Generation" methods, with a minimum of 0.33 (Repeated Sampling PC vs. Direct Generation PC).
3. **Asymmetry:** The matrix is genuinely asymmetric (e.g., Dir. Gen. P vs. C is 0.54, but C vs. P is 0.56). This is expected: relative coverage normalizes by the row strategy's solved set, so the metric is non-commutative.
4. **Pattern by Strategy Type:** "Repeated Sampling" and "Refinement" strategies tend to have higher similarity with each other than they do with "Direct Generation" strategies. "Direct Generation" methods show more internal variation.
### Interpretation
This heatmap quantifies the overlap in "Coverage" between different ARC-solving strategies for the Gemini-2.0 model. Coverage here is the proportion of problems solved by the row strategy that are also solved by the column strategy.
* **What the data suggests:** The strategies are not independent. There is significant overlap in the coverage provided by methods within the same family (Repeated Sampling, Refinement). The high coverage of "Direct Generation C" by the "Repeated Sampling" methods suggests that problems solved by standalone code generation are largely also solved by repeated-sampling approaches.
* **Relationships:** The "PC" (planning-aided code generation) variants often act as bridges. For example, "Repeated Sampling PC" has high overlap with its family members, but its solved set is poorly covered by "Direct Generation," indicating it solves a distinct region of the problem space.
* **Notable Anomaly:** The consistently low coverage of "Repeated Sampling PC" by the "Direct Generation" methods (0.33-0.36) is striking. This implies that repeated sampling with the PC strategy solves problems that a single direct generation with the same strategy cannot, suggesting that repeated sampling unlocks a qualitatively broader mode of exploration.
* **Practical Implication:** For a user of Gemini-2.0, this matrix is a guide for strategy selection. If broad problem coverage is the goal, combining a "Direct Generation" method with a "Repeated Sampling" method (especially of a different variant, e.g., Dir. Gen. P with Rep. Samp. C) would yield more complementary results than using two methods from the same family. The matrix helps avoid redundancy when chaining or ensembling these techniques.
</details>
Figure 18: Asymmetric relative coverage matrix of nine ARC solvers using Gemini-2.0, showing the proportion of problems whose test instances are solved by the row solver that are also solved by the column solver. P denotes the solution plan; C and PC refer to standalone and planning-aided code generation, respectively.
<details>
<summary>x19.png Details</summary>

### Visual Description
## Heatmap: Coverage Between Generation Methods (QwQ-32B)
### Overview
This image is a 9x9 heatmap visualizing the "Coverage" overlap between nine different ARC-solving methods. The methods are grouped into three categories: Direct Generation, Repeated Sampling, and Refinement, each with three variants (P, C, PC). The heatmap uses a color gradient from light beige (0.0) to dark red (1.0) to represent the coverage value, which is also printed numerically in each cell. The diagonal cells, representing a method compared to itself, all have a value of 1.00.
### Components/Axes
* **Chart Type:** Heatmap (asymmetric matrix).
* **X-Axis (Top):** Labels are rotated 45 degrees. From left to right:
1. Direct Generation P
2. Direct Generation C
3. Direct Generation PC
4. Repeated Sampling P
5. Repeated Sampling C
6. Repeated Sampling PC
7. Refinement P
8. Refinement C
9. Refinement PC
* **Y-Axis (Left):** Labels are rotated 45 degrees. From top to bottom, the same nine categories as the X-axis.
* **Legend/Color Scale:** Located on the right side of the chart. It is a vertical bar labeled "Coverage" with a gradient from light beige at the bottom (value `0.0`) to dark red at the top (value `1.0`). A midpoint marker indicates `0.5`.
* **Footer Label:** The text "QwQ-32B" is centered at the very bottom of the image, likely indicating the model or dataset used.
### Detailed Analysis
The matrix is asymmetric (e.g., the value at [Row: Direct Generation C, Column: Repeated Sampling P] is 0.79, while the value at [Row: Repeated Sampling P, Column: Direct Generation C] is 0.51). Below is the reconstructed data table. Values are read directly from the cells.
| Method (Row \ Column) | Direct Gen P | Direct Gen C | Direct Gen PC | Repeated Samp P | Repeated Samp C | Repeated Samp PC | Refinement P | Refinement C | Refinement PC |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Direct Generation P** | **1.00** | 0.53 | 0.37 | 0.68 | 0.71 | 0.74 | 0.68 | 0.71 | 0.74 |
| **Direct Generation C** | 0.69 | **1.00** | 0.45 | 0.79 | 0.93 | 0.86 | 0.72 | 0.86 | 0.86 |
| **Direct Generation PC** | 0.61 | 0.57 | **1.00** | 0.70 | 0.78 | 0.91 | 0.61 | 0.83 | 0.78 |
| **Repeated Sampling P** | 0.58 | 0.51 | 0.36 | **1.00** | 0.73 | 0.76 | 0.64 | 0.69 | 0.69 |
| **Repeated Sampling C** | 0.50 | 0.50 | 0.33 | 0.61 | **1.00** | 0.80 | 0.65 | 0.72 | 0.70 |
| **Repeated Sampling PC** | 0.49 | 0.44 | 0.37 | 0.60 | 0.75 | **1.00** | 0.54 | 0.70 | 0.63 |
| **Refinement P** | 0.59 | 0.48 | 0.32 | 0.66 | 0.80 | 0.70 | **1.00** | 0.73 | 0.73 |
| **Refinement C** | 0.47 | 0.44 | 0.33 | 0.54 | 0.68 | 0.70 | 0.56 | **1.00** | 0.72 |
| **Refinement PC** | 0.50 | 0.45 | 0.32 | 0.55 | 0.68 | 0.64 | 0.57 | 0.73 | **1.00** |
**Trend Verification & Color Correlation:**
* **Diagonal Trend:** All diagonal cells are the darkest red, corresponding to the maximum value of 1.00, confirming perfect self-similarity.
* **High-Value Clusters:** The darkest red off-diagonal cells indicate high coverage. Notable examples:
* `Direct Generation C` vs. `Repeated Sampling C` (0.93) - Very high similarity.
* `Direct Generation PC` vs. `Repeated Sampling PC` (0.91) - Very high similarity.
* `Direct Generation C` vs. `Repeated Sampling PC` (0.86) and `Refinement C` (0.86).
* **Low-Value Clusters:** The lightest beige/orange cells indicate low coverage. Notable examples:
* `Direct Generation PC` vs. `Refinement P` (0.32) and `Refinement PC` (0.32).
* `Direct Generation P` vs. `Direct Generation PC` (0.37).
* `Repeated Sampling PC` vs. `Direct Generation PC` (0.37).
* **General Pattern:** Methods within the same category (e.g., Direct Generation C vs. Direct Generation PC) often have moderate to high similarity. Cross-category comparisons (e.g., Direct Generation vs. Refinement) show more variability.
### Key Observations
1. **Highest Inter-Method Similarity:** The pair `Direct Generation C` and `Repeated Sampling C` has the highest off-diagonal coverage (0.93), suggesting these two methods produce highly overlapping outputs.
2. **Lowest Inter-Method Similarity:** The pair `Direct Generation PC` and `Refinement P` (also `Refinement PC`) has the lowest coverage (0.32), indicating these methods have the least overlap in their outputs.
3. **"PC" Variant Behavior:** The "PC" variants (Direct Generation PC, Repeated Sampling PC, Refinement PC) tend to have lower coverage scores when compared to "P" or "C" variants from other categories, suggesting they may be more distinct strategies.
4. **Refinement Category Consistency:** The Refinement methods (P, C, PC) show relatively consistent, moderate coverage scores when compared to each other (0.72-0.73), but lower scores when compared to Direct Generation methods.
### Interpretation
This heatmap quantifies the overlap in coverage between different solver configurations for the QwQ-32B model, where each entry is the proportion of problems solved by the row solver whose test instances are also solved by the column solver.
* **What it demonstrates:** The data suggests that **methodology significantly impacts output characteristics**. Strategies sharing a core approach (e.g., all "C" variants) have high mutual coverage, meaning they tend to succeed on similar tasks. The very high similarity between `Direct Generation C` and `Repeated Sampling C` implies that for the "C" condition, simply repeating the sampling process doesn't fundamentally change *what* the model can produce compared to a single direct generation.
* **Relationships between elements:** The matrix reveals a structure where the "C" and "PC" variants form stronger clusters with their counterparts in other categories than the "P" variants do. This could indicate that the "C" (standalone code generation) and "PC" (planning-aided code generation) conditions impose a stronger, more consistent behavioral signature on the model than the "P" (solution plan) condition.
* **Notable Anomaly/Insight:** The consistently low coverage between `Direct Generation PC` and the `Refinement` methods is striking. It suggests that the "Plan-and-Code" direct generation approach explores a solution space that is largely disjoint from the space explored by refinement-based methods. This could be because refinement methods start from an initial draft and improve it, while direct PC generation attempts to produce a complete, structured solution in one pass, leading to fundamentally different types of outputs or error profiles.
In essence, this chart is a map of strategic similarity. It shows that not all generation methods are created equal; they occupy different "niches" in terms of the problems they can solve, with some being nearly interchangeable (high coverage) and others being complementary (low coverage). This information is crucial for designing ensemble methods or choosing the right strategy for a specific task.
</details>
Figure 19: Asymmetric relative coverage matrix of nine ARC solvers using QwQ-32B, showing the proportion of problems whose test instances are solved by the row solver that are also solved by the column solver. P denotes the solution plan; C and PC refer to standalone and planning-aided code generation, respectively.
<details>
<summary>x20.png Details</summary>

### Visual Description
## Heatmap: Coverage Similarity Matrix for DeepSeek-R1-70B
### Overview
This image is a heatmap visualization displaying an asymmetric matrix of "Coverage" values between nine different methods or configurations. The methods are grouped into three categories: Direct Generation, Repeated Sampling, and Refinement, each with three variants (P, C, PC). The heatmap uses a color gradient from light beige (0.0) to dark red (1.0) to represent the coverage value, which is numerically annotated in each cell. The title "DeepSeek-R1-70B" is centered at the bottom.
### Components/Axes
* **Chart Type:** Heatmap (asymmetric relative coverage matrix).
* **Title/Label:** "DeepSeek-R1-70B" (bottom center).
* **Color Scale/Legend:** Located on the right side. It is a vertical bar labeled "Coverage" with a gradient from 0.0 (light beige) to 1.0 (dark red). Key markers are at 0.0, 0.5, and 1.0.
* **X-Axis (Top):** Labels are rotated 45 degrees. From left to right:
1. Direct Generation P
2. Direct Generation C
3. Direct Generation PC
4. Repeated Sampling P
5. Repeated Sampling C
6. Repeated Sampling PC
7. Refinement P
8. Refinement C
9. Refinement PC
* **Y-Axis (Left):** Labels are rotated 45 degrees. From top to bottom, they are identical to the X-axis labels in the same order.
* **Data Grid:** A 9x9 grid of colored cells. Each cell contains a numerical value (coverage) printed in white or black text for contrast against the background color.
### Detailed Analysis
The matrix is asymmetric: the value at (Row i, Column j) is the fraction of row method i's solved problems that column method j also solves, so it generally differs from the value at (Row j, Column i). The diagonal values are all 1.00, since each method trivially covers itself.
**Row-by-Row Data Extraction (Coverage Values):**
1. **Direct Generation P:**
* vs. Direct Generation P: **1.00**
* vs. Direct Generation C: 0.76
* vs. Direct Generation PC: 0.65
* vs. Repeated Sampling P: 0.65
* vs. Repeated Sampling C: 0.71
* vs. Repeated Sampling PC: **0.82**
* vs. Refinement P: 0.53
* vs. Refinement C: 0.71
* vs. Refinement PC: 0.76
2. **Direct Generation C:**
* vs. Direct Generation P: 0.68
* vs. Direct Generation C: **1.00**
* vs. Direct Generation PC: 0.58
* vs. Repeated Sampling P: 0.58
* vs. Repeated Sampling C: **0.84**
* vs. Repeated Sampling PC: **0.89**
* vs. Refinement P: 0.53
* vs. Refinement C: 0.79
* vs. Refinement PC: 0.68
3. **Direct Generation PC:**
* vs. Direct Generation P: 0.61
* vs. Direct Generation C: 0.61
* vs. Direct Generation PC: **1.00**
* vs. Repeated Sampling P: 0.56
* vs. Repeated Sampling C: 0.72
* vs. Repeated Sampling PC: 0.72
* vs. Refinement P: **0.44**
* vs. Refinement C: 0.72
* vs. Refinement PC: 0.56
4. **Repeated Sampling P:**
* vs. Direct Generation P: 0.65
* vs. Direct Generation C: 0.65
* vs. Direct Generation PC: 0.59
* vs. Repeated Sampling P: **1.00**
* vs. Repeated Sampling C: 0.76
* vs. Repeated Sampling PC: 0.76
* vs. Refinement P: 0.59
* vs. Refinement C: 0.71
* vs. Refinement PC: 0.65
5. **Repeated Sampling C:**
* vs. Direct Generation P: **0.41**
* vs. Direct Generation C: 0.55
* vs. Direct Generation PC: 0.45
* vs. Repeated Sampling P: 0.45
* vs. Repeated Sampling C: **1.00**
* vs. Repeated Sampling PC: 0.66
* vs. Refinement P: **0.41**
* vs. Refinement C: 0.62
* vs. Refinement PC: 0.62
6. **Repeated Sampling PC:**
* vs. Direct Generation P: 0.45
* vs. Direct Generation C: 0.55
* vs. Direct Generation PC: 0.42
* vs. Repeated Sampling P: 0.42
* vs. Repeated Sampling C: 0.61
* vs. Repeated Sampling PC: **1.00**
* vs. Refinement P: **0.39**
* vs. Refinement C: **0.48**
* vs. Refinement PC: 0.65
7. **Refinement P:**
* vs. Direct Generation P: 0.64
* vs. Direct Generation C: 0.71
* vs. Direct Generation PC: 0.57
* vs. Repeated Sampling P: 0.71
* vs. Repeated Sampling C: **0.86**
* vs. Repeated Sampling PC: **0.86**
* vs. Refinement P: **1.00**
* vs. Refinement C: 0.64
* vs. Refinement PC: 0.64
8. **Refinement C:**
* vs. Direct Generation P: 0.52
* vs. Direct Generation C: 0.65
* vs. Direct Generation PC: 0.57
* vs. Repeated Sampling P: 0.52
* vs. Repeated Sampling C: 0.78
* vs. Repeated Sampling PC: 0.65
* vs. Refinement P: **0.39**
* vs. Refinement C: **1.00**
* vs. Refinement PC: 0.61
9. **Refinement PC:**
* vs. Direct Generation P: 0.42
* vs. Direct Generation C: 0.42
* vs. Direct Generation PC: **0.32**
* vs. Repeated Sampling P: 0.35
* vs. Repeated Sampling C: 0.58
* vs. Repeated Sampling PC: 0.65
* vs. Refinement P: **0.29**
* vs. Refinement C: 0.45
* vs. Refinement PC: **1.00**
### Key Observations
1. **Diagonal Perfection:** All diagonal cells are 1.00 (dark red), confirming each method has perfect coverage overlap with itself.
2. **Highest Off-Diagonal Similarities:** The highest coverage values outside the diagonal are **0.89** (Direct Generation C vs. Repeated Sampling PC) and **0.86** (Refinement P vs. Repeated Sampling C and PC). This suggests strong similarity between these specific method pairs.
3. **Lowest Similarities:** The lowest coverage values are **0.29** (row Refinement PC vs. column Refinement P) and **0.32** (row Refinement PC vs. column Direct Generation PC). This indicates these method pairs have the least overlap in coverage.
4. **Pattern by Method Category:**
* Methods within the **"Repeated Sampling"** category (P, C, PC) show relatively high internal similarity (0.61 to 0.76).
* The **"Refinement PC"** method appears to be an outlier, showing generally lower coverage values when compared to most other methods, especially those in the "Direct Generation" and "Refinement P" categories.
* **"Repeated Sampling C"** and **"Repeated Sampling PC"** show very high similarity to **"Refinement P"** (0.86).
5. **Color-Value Correlation:** The color gradient accurately reflects the numerical values. Cells with values ≥0.80 are dark red, values around 0.50 are medium orange, and values ≤0.40 are light beige.
### Interpretation
This heatmap quantifies the overlap in "Coverage" between different generation and refinement strategies for the DeepSeek-R1-70B model, where each entry is the proportion of problems solved by the row solver whose test instances are also solved by the column solver.
* **High Similarity Clusters:** The strong links between "Repeated Sampling" methods and "Refinement P" suggest that the refinement process (P variant) produces results highly consistent with those found through repeated sampling. This could imply that refinement effectively converges on solutions that are also discoverable via brute-force sampling.
* **Distinct Strategies:** The low similarity involving "Refinement PC" indicates it explores a substantially different part of the solution space or uses a fundamentally different approach compared to the other methods, particularly "Direct Generation PC" and "Refinement P".
* **Methodological Insights:** The matrix allows researchers to understand which methods are redundant (high coverage similarity) and which are complementary (low coverage similarity). For instance, combining a method from a high-similarity pair might yield diminishing returns, while combining methods from low-similarity pairs (e.g., "Refinement PC" with "Direct Generation PC") could lead to more comprehensive coverage of the problem space.
* **Asymmetry Note:** The matrix is genuinely asymmetric (e.g., Direct Generation P vs. C is 0.76, while C vs. P is 0.68). This follows from the definition of relative coverage: each row is normalized by the number of problems the row solver solves, so the coverage of solver A by solver B generally differs from the coverage of B by A.
</details>
Figure 20: Asymmetric relative coverage matrix of nine ARC solvers using DeepSeek-R1-70B, showing the proportion of problems whose test instances are solved by the row solver that are also solved by the column solver. P denotes the solution plan; C and PC refer to standalone and planning-aided code generation, respectively.
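The relative coverage metric shown in Figures 18-20 can be sketched as follows. This is a minimal illustration assuming each solver is represented by the set of task IDs it solves; the solver names and solved sets below are invented for the example, not the paper's data:

```python
def coverage_matrix(solved):
    """Asymmetric relative coverage: entry [row][col] is the fraction of
    problems solved by the row solver that the column solver also solves."""
    names = list(solved)
    return {
        a: {b: len(solved[a] & solved[b]) / len(solved[a]) if solved[a] else 0.0
            for b in names}
        for a in names
    }

# Toy example with invented task IDs.
solved = {
    "Repeated Sampling PC": {"t1", "t2", "t3", "t4"},
    "Direct Generation C": {"t2", "t3"},
}
m = coverage_matrix(solved)
# Asymmetry: half of PC's solves are covered by C, but all of C's are covered by PC.
print(m["Repeated Sampling PC"]["Direct Generation C"])  # 0.5
print(m["Direct Generation C"]["Repeated Sampling PC"])  # 1.0
```

Because each row is normalized by the row solver's own solve count, the matrix is asymmetric by construction, which is why the row and column readings of the same pair differ in the figures.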
<details>
<summary>x21.png Details</summary>

### Visual Description
## Diagram: Programming Task and Code Solution
### Overview
The image displays a technical diagram illustrating a programming task (labeled "Task 358ba94e") and its corresponding Python code solution. The diagram is split into two primary sections: a left panel showing visual input-output examples of a grid transformation task, and a right panel containing the Python function that solves the task. The overall purpose is to demonstrate a method for recognizing a digit from a noisy input grid and outputting a clean, standardized 5x5 pixel pattern for that digit.
### Components/Axes
**Left Panel (Task Visualization):**
* **Header:** "Task 358ba94e" in a rounded rectangle at the top-left.
* **Input-Output Pairs:** Five vertically stacked examples. Each example consists of:
* An **Input Grid**: A 5x5 grid with a black background. Colored pixels (orange, light blue, magenta, red, blue) form scattered, noisy patterns.
* A **Right Arrow**: A white arrow pointing from the input grid to the output grid.
* An **Output Grid**: A 5x5 grid with a solid colored background (matching the input's color) and black pixels forming a clean, recognizable pattern.
* **Test Case:** The fifth and final pair is separated by a dotted line. Its output grid contains a large black question mark on a white background, indicating this is the unknown to be solved.
**Right Panel (Code Solution):**
* **Header:** "Code Solution" in a rounded rectangle at the top-right.
* **Code Block:** A Python function named `generate_output_image(input_image)`. The code is syntax-highlighted (blue keywords, green comments, black text).
* **Highlighted Section:** A red dashed box surrounds the core logic of the function, which consists of a series of `if/elif` statements returning fixed 5x5 matrices for specific digits. A red text annotation next to this box reads: "Copy the output matrices."
### Detailed Analysis
**Task Logic (Inferred from Visuals and Code):**
1. **Input:** A 5x5 grid containing colored pixels on a black background. The colored pixels appear to be a noisy representation of a digit.
2. **Transformation:** The code analyzes the input to determine the most frequent non-zero pixel value (the "digit").
3. **Output:** A 5x5 grid with a solid background of that digit's color and black pixels arranged in a predefined, clean pattern representing the digit.
**Code Transcription:**
```python
def generate_output_image(input_image):
    freq = {}
    for row in input_image:
        for pix in row:
            if pix != 0:
                freq[pix] = freq.get(pix, 0) + 1
    # If there is no non-zero pixel, return a 5x5 zero matrix.
    if not freq:
        return [[0, 0, 0, 0, 0] for _ in range(5)]
    # Determine the "digit" (nonzero value) that occurs most often.
    digit = max(freq, key=freq.get)
    # Return the fixed normalized 5x5 pattern corresponding to the detected digit.
    if digit == 7:
        return [
            [7, 7, 7, 7, 7],
            [7, 0, 7, 7, 7],
            [7, 7, 7, 0, 7],
            [7, 0, 7, 7, 7],
            [7, 7, 7, 7, 7]
        ]
    elif digit == 8:
        return [
            [8, 8, 8, 8, 8],
            [8, 0, 8, 0, 8],
            [8, 8, 8, 8, 8],
            [8, 0, 8, 0, 8],
            [8, 8, 8, 8, 8]
        ]
    elif digit == 6:
        return [
            [6, 6, 6, 6, 6],
            [6, 0, 6, 0, 6],
            [6, 6, 6, 6, 6],
            [6, 0, 6, 6, 6],
            [6, 6, 6, 6, 6]
        ]
    elif digit == 2:
        return [
            [2, 2, 2, 2, 2],
            [2, 2, 2, 2, 2],
            [2, 0, 2, 2, 2],
            [2, 2, 2, 2, 2],
            [2, 2, 2, 2, 2]
        ]
    elif digit == 1:
        # For the test image the drawn digit is 1.
        # We choose a pattern that follows a similar style to the others.
        return [
            [1, 1, 1, 1, 1],
            [1, 0, 1, 0, 1],
            [1, 1, 1, 1, 1],
            [1, 0, 1, 1, 1],
            [1, 1, 1, 1, 1]
        ]
    else:
        # For any other digit, simply fill a 5x5 matrix with that digit.
        return [[digit] * 5 for _ in range(5)]
```
**Visual Pattern to Code Mapping:**
* **Orange Example (Top):** Input has scattered orange pixels. Output is a 5x5 orange grid with black pixels forming a pattern. This corresponds to `digit == 7` in the code.
* **Light Blue Example:** Output pattern matches the `digit == 8` matrix.
* **Magenta Example:** Output pattern matches the `digit == 6` matrix.
* **Red Example:** Output pattern matches the `digit == 2` matrix.
* **Blue Test Case (Bottom):** The input has blue pixels. Based on the code's comment ("For the test image the drawn digit is 1") and the `digit == 1` pattern, the expected output would be a 5x5 blue grid with the black pixel pattern defined for digit 1.
### Key Observations
1. **Handcrafted Patterns:** The solution does not use machine learning. It relies on a hardcoded lookup table of "ideal" digit patterns (7, 8, 6, 2, 1).
2. **Frequency-Based Detection:** The digit is identified by finding the most common non-zero pixel value in the input, assuming the digit's color is the dominant non-black color.
3. **Pattern Style:** The output patterns are not standard digital representations. They are stylized, using the digit's value as the background color and 0 (black) to draw the shape. For example, the pattern for '1' is not a simple vertical line but a more complex, blocky shape.
4. **Test Case Inference:** The diagram explicitly links the blue test input to the `digit == 1` case in the code via a comment, providing the solution.
5. **Spatial Layout:** The red dashed box and annotation "Copy the output matrices" highlight that the generated solution simply copies the training output matrices as hardcoded patterns rather than inferring a general transformation rule.
### Interpretation
This diagram is a concise specification for a programming challenge. It defines a **digit recognition and normalization task** with a specific, rule-based solution.
* **What the data suggests:** The task is to clean up noisy, low-resolution (5x5) images of digits. The "noise" is extra pixels of the same color scattered around the true digit shape. The solution assumes the correct digit is the color that appears most frequently.
* **How elements relate:** The visual examples on the left define the problem and expected outputs. The code on the right provides the exact algorithm to achieve those outputs. The test case bridges the two, showing how the code should be applied to a new input.
* **Notable anomalies/assumptions:**
* The system is brittle. It would fail if multiple colors (digits) were present or if the background wasn't pure black (0).
* The hardcoded patterns for digits 3, 4, 5, 9, and 0 are missing, though the `else` clause provides a fallback (a solid block of the digit's color).
* The pattern for '1' is notably complex, suggesting the task may be part of a larger set with a specific, non-standard visual language for digits.
* **Purpose:** In the context of this paper, the figure illustrates a failure mode of standalone code generation: the model memorizes the training outputs as a hardcoded lookup table instead of abstracting a general transformation, so its guessed pattern for the held-out test digit is incorrect.
</details>
Figure 21: ARC problem 358ba94e, where repeated sampling with standalone code generation produces an incorrect solution using GPT-o3-mini.
<details>
<summary>x22.png Details</summary>

### Visual Description
## Diagram: Comparison of Code Generation Approaches for a Visual Pattern Task
### Overview
This image is a technical diagram comparing two methods for generating code to solve a visual pattern recognition task (Task 15696249). The task involves transforming a 3x3 input grid of colored squares into a 9x9 output grid based on specific rules. The diagram contrasts a flawed "standalone code generation" approach with a successful "planning-aided code generation" approach, using visual examples and annotated code snippets.
### Components/Axes
The diagram is divided into two primary vertical sections:
1. **Left Column (Visual Examples):**
* **Header:** "Task 15696249"
* **Content:** Five rows of input-output pairs. Each row shows:
* A 3x3 input grid (left).
* An arrow pointing to a 9x9 output grid (right).
* The grids use solid colors: yellow, magenta (pink), green, red, and blue.
* **Purpose:** Demonstrates the transformation rule. The output pattern (horizontal or vertical replication) depends on whether the input has a uniform row or column.
2. **Right Column (Code & Explanation):**
* **Top Section:** Titled "Repeated sampling with standalone code generation".
* Contains a Python function `generate_output_image` in a code block.
* A red dashed box highlights a specific condition: `if is_uniform(first_col):`.
* A red annotation points to this line with the text: "Incorrect condition."
* **Bottom Section:** Titled "Repeated sampling with planning-aided code generation".
* Contains a "Plan" text block explaining the logic.
* Contains a corrected "Code Solution" Python function.
* A green dashed box surrounds the core logic of the corrected code.
* A green annotation states: "The plan identifies correct conditions and implements code accordingly, yielding the correct solution."
### Detailed Analysis
**Visual Examples (Left Column):**
* **Example 1 (Top):** Input has a uniform middle row (all green). Output shows that row replicated horizontally three times to form a 3x9 band, placed in the middle of the 9x9 grid.
* **Example 2:** Input has a uniform first column (all red). Output shows that column replicated vertically three times to form a 9x3 band, placed on the left of the 9x9 grid.
* **Example 3:** Input has a uniform top row (all blue). Output shows horizontal replication of that row.
* **Example 4:** Input has a uniform first column (all magenta). Output shows vertical replication of that column.
* **Example 5 (Bottom):** Input has a uniform middle column (all green). The output is a question mark, indicating this is the test case for the final solution.
**Code Transcription & Analysis (Right Column):**
**1. Standalone Code Generation (Flawed):**
```python
def generate_output_image(input_image):
    output = [[0 for _ in range(9)] for _ in range(9)]

    def is_uniform(lst):
        return all(x == lst[0] for x in lst)

    # Check vertical condition: if the first column is uniform.
    first_col = [row[0] for row in input_image]
    if is_uniform(first_col):  # <-- HIGHLIGHTED AS "Incorrect condition."
        for r in range(9):
            in_row = input_image[r % 3]
            for j in range(3):
                output[r][j] = in_row[j]
    else:
        # Horizontal replication:
        # Decide which horizontal band to fill based on the uniformity of the first row.
        block_row_index = 0 if is_uniform(input_image[0]) else 1
        for i in range(3):
            for block in range(3):
                for j in range(3):
                    output[block_row_index * 3 + i][block * 3 + j] = input_image[i][j]
    return output
```
* **Error:** The logic is flawed. It checks only whether the *first column* is uniform to trigger vertical replication; otherwise it falls back to horizontal replication, choosing the band solely from whether the *first row* is uniform. This fails whenever the uniform column is not the first column (e.g., the test case, Example 5, whose uniform column is the middle one), and it can place a uniform row only in the top or middle band, never the bottom.
**2. Planning-Aided Code Generation (Corrected):**
* **Plan Text:**
> Look at the input's rows and columns to see whether any entire row is "uniform" (all the same number) or any entire column is uniform...
> Check the examples:
> Example 1: Input row 2 is [3, 3, 3] (uniform). So the entire input is copied horizontally ...
> Example 2: Input column 1 is [2, 2, 2] (uniform). So the input is "stacked" ...
> Example 3: Input row 1 is [1, 1, 1] (uniform). So the input is copied horizontally ...
> Example 4: Input column 1 is uniform ([3, 3, 3]), so the input is stacked vertically ...
* **Corrected Code Solution:**
```python
def generate_output_image(input_image):
    n = len(input_image)
    m = len(input_image[0])
    uniform_row = None
    for i in range(n):
        if all(pixel == input_image[i][0] for pixel in input_image[i]):
            uniform_row = i
            break
    uniform_col = None
    if uniform_row is None:
        for j in range(m):
            col_pixels = [input_image[i][j] for i in range(n)]
            if all(pixel == col_pixels[0] for pixel in col_pixels):
                uniform_col = j
                break
    output_size = 9
    output = [[0 for _ in range(output_size)] for _ in range(output_size)]
    if uniform_row is not None:
        block_row = uniform_row  # 0-indexed: 0 for first, 1 for middle, 2 for last.
        start_row = 3 * block_row
        for i in range(3):
            new_row = input_image[i] * 3  # Repeat side-by-side three times.
            output[start_row + i] = new_row
    elif uniform_col is not None:
        block_col = uniform_col  # 0-indexed column position.
        start_col = 3 * block_col
        for r in range(9):
            in_row = input_image[r % 3]
            for c in range(3):
                output[r][start_col + c] = in_row[c]
    return output
```
* **Key Correction:** The plan correctly identifies that the code must search *all* rows and *all* columns for uniformity, not just the first of each. The code implements this by first searching for any uniform row. If none is found, it then searches for any uniform column. The replication logic is then applied based on the index of the found uniform row/column.
### Key Observations
1. **Logical Flaw vs. Robust Plan:** The standalone code fails because it makes a premature, incorrect assumption (only checking the first row/column). The planning-aided approach succeeds by first analyzing the problem requirements from the examples.
2. **Visual-Code Correlation:** The green output grids in the examples directly correspond to the logic in the corrected code. For instance, the vertical red band in Example 2's output is generated by the `elif uniform_col is not None:` block.
3. **Spatial Layout of Annotations:** The red "Incorrect condition" annotation is placed in the top-right, directly pointing to the flawed line. The green explanatory annotation is placed in the center-right, summarizing the success of the planned approach.
4. **Task Ambiguity Resolution:** The plan text explicitly resolves the ambiguity in the task description by inferring the rule from the provided examples: "if any row is uniform, replicate horizontally; if any column is uniform, replicate vertically."
### Interpretation
This diagram serves as a pedagogical comparison between two paradigms in automated code generation. It argues that **explicit planning and problem analysis** (the "planning-aided" approach) lead to correct and generalizable solutions, whereas **direct code generation** (the "standalone" approach) is prone to logical errors from making oversimplified assumptions.
The underlying message is that for tasks involving pattern recognition and rule deduction from examples (common in AI benchmarks like ARC), a model must first "understand" the pattern by forming a plan before writing code. The incorrect condition in the first code block is a symptom of jumping to implementation without this understanding. The successful second approach demonstrates that decomposing the problem—first identifying the transformation rule, then implementing checks for that rule—yields robust results. This highlights the importance of intermediate reasoning steps in complex problem-solving for AI systems.
</details>
Figure 22: ARC problem 15696249, where repeated sampling with standalone code generation produces an incorrect solution, whereas repeated sampling with planning-aided code generation yields the correct solution using GPT-o3-mini.
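As a minimal sketch of the selection step that repeated sampling relies on in Figures 21 and 22: each sampled candidate program is kept only if it reproduces every training output. The `validates` helper and the toy mirror task below are illustrative assumptions, not the paper's actual implementation:

```python
def validates(program, train_pairs):
    """Accept a candidate program only if it reproduces every training output."""
    try:
        return all(program(inp) == out for inp, out in train_pairs)
    except Exception:
        return False  # Crashing candidates are rejected outright.

# Toy task: the output mirrors each input row horizontally.
train_pairs = [([[1, 0]], [[0, 1]]), ([[2, 3]], [[3, 2]])]

good = lambda img: [row[::-1] for row in img]  # Mirrors rows: passes.
bad = lambda img: img                          # Identity: fails the check.

print(validates(good, train_pairs))  # True
print(validates(bad, train_pairs))   # False
```

Note that this training-instance check is what lets a planning-aided candidate like the one in Figure 22 be distinguished from the flawed standalone candidate: the flawed program fails on at least one training pair, while the planned one passes all of them.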
<details>
<summary>x23.png Details</summary>

### Visual Description
## Diagram: Iterative Code Generation and Refinement Process
### Overview
The image is a technical diagram illustrating an iterative process of generating and refining code to solve a specific image transformation task (labeled "Task d19f7514"). It contrasts incorrect approaches with a final correct solution, highlighting the evolution of logic through two main phases: "Refinement with planning-aided code generation" and "Repeated sampling with planning-aided code generation." The diagram includes visual examples of the task, multiple Python code blocks with annotations, and explanatory labels.
### Components and Layout
The diagram is organized into three primary vertical sections:
1. **Left Column (Visual Task Examples):**
* A series of five pixel grid examples, each showing an input grid (left) and an output grid (right) connected by an arrow.
* The grids are 6 rows tall. Input grids contain black, grey, and green pixels. Output grids contain black and yellow pixels.
* The first four examples show completed transformations. The fifth example has a question mark (`?`) in the output box, indicating the unsolved problem the code aims to address.
* A label at the top reads: `Task d19f7514`.
2. **Top-Right Section (Refinement Phase):**
* **Title:** `Refinement with planning-aided code generation` (in bold, black text).
* This section contains two Python code blocks, each enclosed in a red dashed box.
* **First Code Block:** Labeled with red text: `Initial incorrect code`. It defines a function `generate_output_image(input_image)` with three steps: cropping, pixel value replacement, and vertical propagation.
* **Second Code Block:** Labeled with red text: `Incorrect code after refinements`. It shows a more structured but still flawed version of the function. A red annotation box points to a specific logic section with the text: `Incorrect logic persists after refinements.`
3. **Bottom-Right Section (Repeated Sampling Phase):**
* **Title:** `Repeated sampling with planning-aided code generation` (in bold, black text).
* A central label reads: `Code Solution`.
* A single, larger Python code block is enclosed in a red dashed box. A green dashed box highlights a specific section of this code, accompanied by a green annotation: `Correct logic.`
### Detailed Analysis: Code Transcription and Logic
#### 1. Initial Incorrect Code (Top Block)
```python
def generate_output_image(input_image):
    # Step 1: Discard the lower 6 rows to work only with the top 6 rows.
    cropped = [row[:] for row in input_image[:6]]
    # Step 2: Change every nonzero pixel to 4 (convert 3's or any nonzero values to 4).
    for r in range(len(cropped)):
        for c in range(len(cropped[r])):
            if cropped[r][c] != 0:
                cropped[r][c] = 4
    # Step 3: Vertical propagation.
    # Process from the second row (index 1) downwards.
    for r in range(1, len(cropped)):
        for c in range(len(cropped[r])):
            # If this pixel is 0 and the pixel immediately above is 4, fill it with 4.
            if cropped[r][c] == 0 and cropped[r - 1][c] == 4:
                cropped[r][c] = 4
    return cropped
```
* **Logic:** Crops the input to the first 6 rows, replaces all non-zero pixels with the value 4, then performs a simple vertical fill: any zero pixel directly below a 4 becomes a 4.
#### 2. Incorrect Code After Refinements (Middle Block)
```python
def generate_output_image(input_image):
    cropped = input_image[:6]
    rows = len(cropped)
    cols = len(cropped[0])
    # Step 2. Build the base image: every nonzero becomes 4.
    base = [[4 if pixel != 0 else 0 for pixel in row] for row in cropped]
    # The output image starts with the first (top) row of the base.
    output = [base[0][:]]

    def is_active(row):
        return any(pixel != 0 for pixel in row)

    # Process each subsequent row.
    for r in range(1, rows):
        # Start with the base row.
        new_row = base[r][:]
        if is_active(cropped[r]):
            for c in range(cols):
                # Only consider blank cells that were not drawn originally.
                if new_row[c] == 0 and cropped[r][c] == 0:
                    # Propagate only if the cell directly above is 4.
                    if output[r - 1][c] == 4:
                        # For a right border cell, propagate automatically.
                        if c == cols - 1:
                            new_row[c] = 4
                        else:
                            left_drawn = (c - 1 >= 0 and cropped[r][c - 1] != 0)
                            right_drawn = (c + 1 < cols and cropped[r][c + 1] != 0)
                            if left_drawn or right_drawn:
                                new_row[c] = 4
        output.append(new_row)
    return output
```
* **Logic:** This version introduces more complex conditions. It creates a `base` grid (non-zero -> 4). It then builds the `output` row by row. For each row, if the original input row was "active" (had any non-zero pixel), it attempts to fill zeros in the `new_row` based on: 1) the pixel above in the `output` being 4, and 2) additional border or neighbor-drawn conditions. The annotation indicates this logic is still flawed.
#### 3. Correct Code Solution (Bottom Block)
```python
def generate_output_image(input_image):
    # Split the input image into two halves
    num_rows = len(input_image)
    mid = num_rows // 2
    top_half = []
    bottom_half = []
    # Replace nonzero pixels with 4 in both halves
    for i in range(mid):
        top_half.append([4 if pixel != 0 else 0 for pixel in input_image[i]])
    for i in range(mid, num_rows):
        bottom_half.append([4 if pixel != 0 else 0 for pixel in input_image[i]])
    # Combine the corresponding rows of the top and bottom halves using pixelwise OR (nonzero becomes 4)
    output_image = []
    for i in range(mid):
        combined_row = []
        for j in range(len(top_half[i])):
            # If either pixel in the corresponding column is nonzero (i.e., equals 4), output 4; otherwise 0.
            if top_half[i][j] or bottom_half[i][j]:
                combined_row.append(4)
            else:
                combined_row.append(0)
        output_image.append(combined_row)
    return output_image
```
* **Logic (Correct):** This solution takes a fundamentally different approach. It splits the input image into top and bottom halves. It processes each half independently, converting non-zero pixels to 4. It then creates the output by combining corresponding rows from the two halves using a pixel-wise logical OR operation. If a pixel is 4 (non-zero) in *either* the top-half row or the bottom-half row at the same column, the output pixel is 4.
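To make the half-combination rule concrete, the final function can be exercised on a small hypothetical grid. The 4x2 input below is illustrative only and is not taken from task d19f7514; the function itself is a condensed restatement of the correct solution above.

```python
def generate_output_image(input_image):
    # Split the input image into top and bottom halves.
    num_rows = len(input_image)
    mid = num_rows // 2
    # Replace nonzero pixels with 4 in both halves.
    top_half = [[4 if p != 0 else 0 for p in input_image[i]] for i in range(mid)]
    bottom_half = [[4 if p != 0 else 0 for p in input_image[i]] for i in range(mid, num_rows)]
    # Combine corresponding rows with a pixel-wise OR: 4 if either pixel is nonzero.
    return [
        [4 if top_half[i][j] or bottom_half[i][j] else 0 for j in range(len(top_half[i]))]
        for i in range(mid)
    ]

# Hypothetical 4x2 input: rows 0-1 form the top half, rows 2-3 the bottom half.
example = [[3, 0],
           [0, 0],
           [0, 3],
           [3, 0]]
print(generate_output_image(example))  # [[4, 4], [4, 0]]
```

The overlay behavior is visible directly: a 4 appears wherever either half has a non-zero pixel in that position.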
### Key Observations
1. **Task Pattern:** The visual examples suggest the transformation rule: the output is a 6-row grid where a pixel is yellow (value 4) if the corresponding pixel in the *top half* (first 6 rows) OR the *bottom half* (last 6 rows) of the 12-row input is non-zero. The incorrect codes tried to model this as a vertical propagation or neighbor-based fill, which is more complex and error-prone.
2. **Evolution of Approach:** The process moves from a simple, incorrect vertical fill model to a more complex conditional model, and finally to a correct, simpler model based on splitting and combining halves.
3. **Annotation Strategy:** Red is used consistently to highlight incorrect code and logic. Green is used to highlight the correct logic, creating a clear visual distinction between failure and success.
4. **Spatial Grounding:** The incorrect logic annotations are placed directly adjacent to the specific code lines they describe. The "Correct logic" label is placed within the green dashed box that surrounds the core combining loop of the final solution.
### Interpretation
This diagram serves as a case study in iterative problem-solving for algorithmic tasks. It demonstrates that:
* **Initial Intuition Can Be Misleading:** The first attempts modeled the transformation as a local propagation effect (like a flood fill), which aligns with how one might visually trace patterns but does not match the actual global rule.
* **Refinement Without Re-evaluation is Insufficient:** The second code block added complexity (border checks, neighbor checks) but was built upon the same flawed foundational assumption, so the core error persisted.
* **Correct Solution Requires Re-framing the Problem:** The breakthrough came from abandoning the propagation model and re-framing the task as a simple composition of two independent sub-images (top and bottom halves). This "planning-aided" shift in perspective led to a correct and more elegant solution.
* **The Value of Visual Examples:** The pixel grid examples on the left are crucial. They provide the ground truth that the code must match. The final unsolved example (`?`) represents the test case that likely prompted the re-evaluation and eventual correct solution.
The diagram effectively communicates that successful code generation for visual tasks often depends less on intricate conditional logic and more on correctly identifying the underlying compositional structure of the transformation.
</details>
Figure 23: ARC problem d19f7514, where repeated sampling with planning-aided code generation produces a correct solution, whereas the refinement variant fails to fix the initial erroneous code, and the incorrect logic persists across subsequent refinements when using GPT-o3-mini.
### A.13 Prompts for LLMs
We include all prompts used by KAAR and the nine ARC solvers described in Section 3. We adopt a bash-like notation for input arguments within the prompts; for example, ${test_inputs} denotes the test input 2D matrices. A brief description of the prompts used for each solver is provided below.
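This ${...} placeholder notation happens to coincide with Python's `string.Template` syntax, which the sketch below uses to show how such a prompt fragment can be filled in. The fragment and value are illustrative; the actual prompts may be filled by any templating mechanism.

```python
from string import Template

# Hypothetical prompt fragment using the same ${...} placeholder convention.
fragment = Template("The test input image(s): ${test_inputs}")
filled = fragment.substitute(test_inputs="[[0, 1], [1, 0]]")
print(filled)  # The test input image(s): [[0, 1], [1, 0]]
```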
- Direct generation with solution plan: Prompt 1 describes how to generate the solution plan, and Prompt 2 uses the generated plan to produce the output images.
- Direct generation with standalone code: Prompt 3 describes how to generate the code to produce the output images.
- Direct generation with planning-aided code: It first generates a solution plan using Prompt 1, then uses Prompt 4 to produce code based on the generated plan.
- Repeated sampling with solution plan: It can be regarded as an iterative version of direct generation with solution plan, and thus also uses Prompts 1 and 2.
- Repeated sampling with standalone code: It can be regarded as an iterative version of direct generation with standalone code, and thus also uses Prompt 3.
- Repeated sampling with planning-aided code: It can be regarded as an iterative version of direct generation with planning-aided code, and thus also uses Prompts 1 and 4.
- Refinement with solution plan: Prompt 5 describes the process of refining the generated solution plan with the validation samples. It uses Prompts 1 and 2 to generate the initial plan and the result image.
- Refinement with standalone code: Prompt 6 describes the process of refining the generated code with the validation samples. It uses Prompt 3 to produce the initial code solution.
- Refinement with planning-aided code: Prompt 7 describes the process of refining the generated plan and code with the validation samples. It uses Prompts 1 and 4 to generate the initial plan and produce the initial code guided by the plan, respectively.
- KAAR: Prompt 8 describes the augmentation of objectness priors. Prompts 9 and 10 introduce the augmentation of geometry and topology priors, encoded as component attributes and relations, respectively. Prompt 11 outlines the augmentation of numbers and counting priors. Prompts 12 and 13 describe action selection and target component identification in the process of augmenting goal-directedness priors. For the prompts implementing each action's details, please refer to our code.
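The stage-wise flow implied by Prompts 8-13 can be sketched as follows. All function names here (`augment_stage`, `rspc_generate`, `passes_training_pairs`) are illustrative stubs standing in for the real prompt construction, LLM sampling, and training-pair validation; they are not our actual implementation, for which please refer to our code.

```python
# Illustrative sketch of KAAR's stage-wise prior augmentation loop.

PRIOR_STAGES = [
    "objectness",              # Prompt 8
    "geometry_and_topology",   # Prompts 9-10
    "numbers_and_counting",    # Prompt 11
    "goal_directedness",       # Prompts 12-13
]

def augment_stage(context, stage):
    """Stub: append this stage's prior descriptions to the prompt context."""
    return context + [stage]

def rspc_generate(context, num_samples=3):
    """Stub: repeated sampling with planning-aided code generation (RSPC)."""
    return [f"candidate_{len(context)}_{i}" for i in range(num_samples)]

def passes_training_pairs(candidate):
    """Stub validator: in practice, run the candidate on all training pairs."""
    return candidate == "candidate_4_0"  # pretend the final stage succeeds

def kaar_solve(task_context):
    """Augment priors stage by stage, invoking RSPC after each augmentation."""
    context = list(task_context)
    for stage in PRIOR_STAGES:
        context = augment_stage(context, stage)
        for candidate in rspc_generate(context):
            if passes_training_pairs(candidate):
                return candidate
    return None

print(kaar_solve([]))  # candidate_4_0
```

The key design point this sketch captures is that candidate generation is attempted after every augmentation stage, so a task solvable with only low-level priors never sees the higher-level (potentially distracting) ones.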
Prompt 1: Direct generation with solution plan - solution plan generation.
================================ System ================================
You are an expert in analyzing grid-based image processing tasks. Your objective is to derive a text transformation plan (not Python code) from each given input-output image pair (both represented as 2D matrices), and then apply this plan to generate output image(s), represented as a 2D matrix, based on the given test input image(s) (2D matrix). Ensure that the derived plan generalizes across different cases while preserving consistency with the observed transformations.
================================= User =================================
The input data consists of a few pairs of input and output images, where the left image in each pair represents the input, and the right image represents the corresponding output. Each image can be represented as a 2D matrix: ${matrix}
Please note that each number in the matrix corresponds to a pixel, and its value represents the color.
Derive a text transformation plan (not Python code) that maps each given input image (2D matrix) to its corresponding output image (2D matrix). Ensure that the plan generalizes across different cases and the test input image(s) (2D matrix) while maintaining consistency with the observed transformations.
The test input image(s): ${test_inputs}
Prompt 2: Direct generation with solution plan - output image(s) generation from the plan.
================================ System ================================
You are an expert in analyzing grid-based image processing tasks. Your objective is to generate output image(s), represented as a 2D matrix, based on the given input images (2D matrix) and a derived text transformation plan.
================================= User =================================
Please generate the output image(s) as a 2D matrix (not Python code) based on the given input image(s) (2D matrix) and the text transformation plan. Output only the test output image(s) in 2D matrix format (not Python code). For each test input image, start with [Start Output Image] and end with [End Output Image].
For example, if there is one test input image, the output image should be:
[Start Output Image]
[[0,0,0], [0,0,0], [0,0,0]]
[End Output Image]
If there are multiple (2) test input images, the output images should be produced as:
[Start Output Image]
[[0,0,0], [0,0,0], [0,0,0]]
[End Output Image]
[Start Output Image]
[[1,1,1], [1,1,1], [1,1,1]]
[End Output Image]
The test input image(s): ${test_inputs}
Prompt 3: Direct generation with standalone code.
================================ System ================================
You are an expert in analyzing grid-based image processing tasks. Your goal is to generate Python code that produces output image(s), represented as a 2D matrix, based on the given input image(s) (2D matrix).
================================= User =================================
The input data consists of a few pairs of input and output images, where the left image in each pair represents the input and the right image represents the corresponding output.
Each image can be represented as a 2D matrix: ${matrix}
The test input image(s): ${test_inputs}
Please note that each number in the matrix corresponds to a pixel, and its value represents the color.
Generate a Python script to map each input image (2D matrix) to the corresponding output image (2D matrix).
Ensure that the Python script generalizes across different cases and test input image(s) while maintaining consistency with the observed input-output image pairs.
Please output the Python program, starting with [Start Program] and ending with [End Program].
Include an assert statement with the function signature to verify that the generated output matches the expected result, starting with [Assert Statement].
Use placeholders like input_image and output_image for the variables representing the input and output images.
For example:
[Start Program]
def generate_output_image(input_image):
    rows = len(input_image)
    cols = len(input_image[0])

    def dfs(r, c):
        """Depth-first search to mark all 4-connected '1's as '2's."""
        if r < 0 or r >= rows or c < 0 or c >= cols or input_image[r][c] != 1:
            return
        # Change the current component from 1 to 2
        input_image[r][c] = 2
        # Explore neighbors (up, down, left, right)
        dfs(r - 1, c)  # Up
        dfs(r + 1, c)  # Down
        dfs(r, c - 1)  # Left
        dfs(r, c + 1)  # Right

    # Traverse the image to find all components with '1'
    for r in range(rows):
        for c in range(cols):
            if input_image[r][c] == 1:
                dfs(r, c)
    return input_image
[End Program]
[Assert Statement]
assert generate_output_image(input_image) == output_image
Please note, the assert statement should strictly follow the provided format, and the output image should be represented in list format!
Please note, the script should not include an if __name__ == "__main__": block.
Prompt 4: Direct generation with planning-aided code - code generation based on the generated plan.
================================ System ================================
You are an expert in analyzing grid-based image processing tasks. Your goal is to generate Python code that produces output image(s) represented as a 2D matrix, based on the given input image(s) (2D matrix). This code should be generated using a text transformation plan inferred from a set of input-output image pairs (both represented as 2D matrices).
================================= User =================================
Generate a Python script based on your text transformation plan to map the input image (2D matrix) to the output image (2D matrix). Please output the Python program, starting with [Start Program] and ending with [End Program]. Include an assert statement with the function signature to verify that the generated output matches the expected result, starting with [Assert Statement]. Use placeholders like input_image and output_image for the variables representing the input and output images.
For example:
[Start Program]
def generate_output_image(input_image):
    rows = len(input_image)
    cols = len(input_image[0])

    def dfs(r, c):
        """Depth-first search to mark all 4-connected '1's as '2's."""
        if r < 0 or r >= rows or c < 0 or c >= cols or input_image[r][c] != 1:
            return
        # Change the current component from 1 to 2
        input_image[r][c] = 2
        # Explore neighbors (up, down, left, right)
        dfs(r - 1, c)  # Up
        dfs(r + 1, c)  # Down
        dfs(r, c - 1)  # Left
        dfs(r, c + 1)  # Right

    # Traverse the image to find all components with '1'
    for r in range(rows):
        for c in range(cols):
            if input_image[r][c] == 1:
                dfs(r, c)
    return input_image
[End Program]
[Assert Statement]
assert generate_output_image(input_image) == output_image
Please note, the assert statement should strictly follow the provided format, and the output image should be represented in list format!
Please note, the script should not include an if __name__ == "__main__": block.
Prompt 5: Refinement with solution plan - plan refinement.
================================ System ================================
As an expert in analyzing grid-based image processing tasks, your objective is to refine your solution plan based on the provided feedback.
================================= User =================================
The problem description:
[start problem description]
The input data consists of a few pairs of input and output images, where the left image in each pair represents the input, and the right image represents the corresponding output. Each image can be represented as a 2D matrix: ${matrix}
Please note that each number in the matrix corresponds to a pixel, and its value represents the color.
[end problem description]
The INCORRECT text transformation plan fails to solve some example training input and output pairs in the above problem!
[start incorrect transformation plan]
${plan}
[end incorrect transformation plan]
The incorrect output(s) generated by the incorrect plan:
[start incorrect output]
${incorrect_output}
[end incorrect output]
The generated correct output(s):
[start correct output]
${correct_output}
[end correct output]
Please analyze the incorrect reasoning step-by-step, and then generate the revised correct transformation plan (text only), starting with [Start Revised Transformation Plan] and ending with [End Revised Transformation Plan]. Ensure that the revised transformation plan generalizes across different cases and the test input image(s), while maintaining consistency with the observed transformations.
Prompt 6: Refinement with standalone code - code refinement.
================================ System ================================
As an expert in analyzing grid-based image processing tasks, your objective is to refine your program based on the provided feedback.
================================= User =================================
The problem description:
[start problem description]
The input data consists of a few pairs of input and output images, where the left image in each pair represents the input, and the right image represents the corresponding output. Each image can be represented as a 2D matrix: ${matrix}
Please note that each number in the matrix corresponds to a pixel, and its value represents the color.
[end problem description]
The generated incorrect program fails to solve some example training input and output pairs in the above problem!
[start incorrect program]
${code}
[end incorrect program]
The incorrect output(s) generated by the incorrect program:
[start incorrect output]
${incorrect_output}
[end incorrect output]
The generated correct output(s):
[start correct output]
${correct_output}
[end correct output]
Please analyze the incorrect reasoning step-by-step, and then generate the revised program (Python program only), starting with [Start Revised Program] and ending with [End Revised Program]. Ensure that the revised program generalizes across different cases and the test input image(s), while maintaining consistency with the observed input and output image pairs.
Please include an assert statement with the function signature to verify that the generated output matches the expected result, starting with [Assert Statement]. Use placeholders like input_image and output_image for the variables representing the input and output images.
For example:
[Start Revised Program]
def generate_output_image(input_image):
    rows = len(input_image)
    cols = len(input_image[0])

    def dfs(r, c):
        """Depth-first search to mark all 4-connected '1's as '2's."""
        if r < 0 or r >= rows or c < 0 or c >= cols or input_image[r][c] != 1:
            return
        # Change the current component from 1 to 2
        input_image[r][c] = 2
        # Explore neighbors (up, down, left, right)
        dfs(r - 1, c)  # Up
        dfs(r + 1, c)  # Down
        dfs(r, c - 1)  # Left
        dfs(r, c + 1)  # Right

    # Traverse the image to find all components with '1'
    for r in range(rows):
        for c in range(cols):
            if input_image[r][c] == 1:
                dfs(r, c)
    return input_image
[End Revised Program]
[Assert Statement]
assert generate_output_image(input_image) == output_image
Please note, the assert statement should strictly follow the provided format, and the output image should be represented in list format!
Please note, the script should not include an if __name__ == "__main__": block.
Prompt 7: Refinement with planning-aided code - refinement on both generated plan and code.
================================ System ================================
As an expert in analyzing grid-based image processing tasks, your objective is to refine your transformation plan and program based on the provided feedback.
================================= User =================================
The problem description:
[start problem description]
The input data consists of a few pairs of input and output images, where the left image in each pair represents the input, and the right image represents the corresponding output. Each image can be represented as a 2D matrix: ${matrix}
Please note that each number in the matrix corresponds to a pixel, and its value represents the color.
[end problem description]
The generated incorrect transformation plan and program fail to solve some example training input and output pairs in the above problem!
[start incorrect transformation plan]
${plan}
[end incorrect transformation plan]
[start incorrect program]
${code}
[end incorrect program]
The incorrect output(s) generated by the incorrect transformation plan and program:
[start incorrect output]
${incorrect_output}
[end incorrect output]
The generated correct output(s):
[start correct output]
${correct_output}
[end correct output]
Please analyze the incorrect reasoning step-by-step, and then generate the revised transformation plan (text only) and program (Python program only).
For the revised transformation plan, start with [Start Revised Transformation Plan] and end with [End Revised Transformation Plan]. Ensure that the revised transformation plan generalizes across different cases and the test input image(s), while maintaining consistency with the observed transformations.
For the revised Python program, start with [Start Revised Program] and end with [End Revised Program]. Ensure that the revised program generalizes across different cases and the test input image(s), while maintaining consistency with the observed input and output image pairs.
For the revised Python program, please include an assert statement with the function signature to verify that the generated output matches the expected result, starting with [Assert Statement]. Use placeholders like input_image and output_image for the variables representing the input and output images.
For example:
[Start Revised Program]
def generate_output_image(input_image):
    rows = len(input_image)
    cols = len(input_image[0])

    def dfs(r, c):
        """Depth-first search to mark all 4-connected '1's as '2's."""
        if r < 0 or r >= rows or c < 0 or c >= cols or input_image[r][c] != 1:
            return
        # Change the current component from 1 to 2
        input_image[r][c] = 2
        # Explore neighbors (up, down, left, right)
        dfs(r - 1, c)  # Up
        dfs(r + 1, c)  # Down
        dfs(r, c - 1)  # Left
        dfs(r, c + 1)  # Right

    # Traverse the image to find all components with '1'
    for r in range(rows):
        for c in range(cols):
            if input_image[r][c] == 1:
                dfs(r, c)
    return input_image
[End Revised Program]
[Assert Statement]
assert generate_output_image(input_image) == output_image
Please note, the assert statement should strictly follow the provided format, and the output image should be represented in list format!
Please note, the script should not include an if __name__ == "__main__": block.
Prompt 8: Objectness priors augmentation
================================ System ================================
You are an expert in grid-based image analysis.
================================= User =================================
The training instances consist of several pairs of input and output images, where the left image in each pair represents the input and the right image represents the corresponding output.
Please note that the test instance(s) only contain input image(s).
Each image is represented as a 2D matrix:
${matrix}
Please note that each number in the matrix corresponds to a pixel and its value represents the color.
We treat the color represented by the number ${background_color} as the background color.
${abstraction_rule}
The components in each input and output image pair are as follows:
${component_description}
Prompt 9: Geometry and topology priors augmentation - component attributes
================================ System ================================
You are an expert in geometry and topology analysis. Below is a summary of component attributes, including:
Size (Width and Height); Color; Shape; Symmetry; Bounding Box; Hole Count; Nearest Boundary.
================================= User =================================
${geometry_and_topology_priors_attributes}
Prompt 10: Geometry and topology priors augmentation - component relations
================================ System ================================
You are an expert in geometry and topology analysis. Below is a summary of component relations, including:
Different from / identical to other components; inclusion; touching or not touching other components; spatial relations.
================================= User =================================
${geometry_and_topology_priors_relations}
Prompt 11: Numbers and counting priors augmentation
================================ System ================================
You are an expert in numbers and counting analysis. Below is a summary of component statistics, including:
Symmetry numerical summary; Size numerical summary; Color numerical summary; Shape numerical summary; Hole counting summary.
================================= User =================================
${numbers_and_couting_priors}
Prompt 12: Goal-directedness priors augmentation - action selection
================================ System ================================
You are an expert in analyzing and categorizing grid-based image tasks.
================================= User =================================
Please determine which category or categories this task belongs to. Please select from the following:
1. color change: color change involves modifying the value of a component, while the component size and position do not change.
2. movement: movement involves shifting the position of a component to a new location within the image, while the component size does not change.
3. extension: extending involves expanding the boundaries of a component to increase its size or reach within the image, so the component size always changes.
4. completing: completing an image involves filling in missing or incomplete parts of a component to achieve a coherent and fully formed image.
5. resizing: resizing involves altering the dimensions of a component by expanding or shrinking its size within the image.
6. selecting: selecting involves identifying and isolating a specific component within the image as the output component, while the component size and color do not change.
7. copying: copying involves duplicating a component and either placing the duplicate in a new location or replacing the existing component within the image.
8. flipping: flipping involves mirroring a component along a specified axis to reverse its orientation within the image.
9. rotation: rotation involves turning a component around a fixed point or center by a specified angle within the image.
10. cropping: cropping involves cutting out a specific portion of a component.
Please select the one or multiple categories from the provided list that best describe the task.
Format your response by starting with [start category] and ending with [end category], numbering each category selected.
For example, if the task belongs only to "color change", your response should be:
[start category]
1. color change
[end category]
If the task belongs to both "selecting" and "extension", your response should be:
[start category]
1. selecting
2. extension
[end category]
Prompt 13: Goal-directedness priors augmentation - target component identification
================================ System ================================
You are an expert in analyzing grid-based image tasks, specifically in ${action} components.
================================= User =================================
If this task involves ${action}:
1. Begin by identifying WHICH COMPONENTS are to be ${action} in all input images (training and test pairs).
- Refer to these components as TARGET components (e.g., component 1 in the first input image, component 2 and component 3 in the second input image, etc.).
- List ALL target components in each training and test input image.
- For EACH target component, provide:
- Attribute Analysis result
- Relation analysis result
- Numerical analysis result
2. Determine the CONDITIONS used to select these TARGET components for ${action} from each training and test input image.
- These conditions must be based on common priorities across all targeted components and must differ from the unselected components.
- For example: the size of all target components might be equal to 3 while the size of the unselected components is not 3.
2.1. Analyze whether these conditions are EMPTY or not.
2.2. Evaluate if these conditions are derived from attribute analysis, including:
2.2.1. Color
2.2.2. Size
2.2.3. Shape
2.2.4. Width
2.2.5. Height
2.2.6. The number of holes
2.2.7. Bounding box
2.2.8. Symmetry
2.2.9. Nearest boundary
2.3. Evaluate if these conditions are derived from relation analysis, including:
2.3.1. Relative position with other components
2.3.2. Touching with other components
2.3.3. Whether they differ from or are identical with other components
2.3.4. Enclosure of other components
2.4. Evaluate if these conditions are derived from numerical analysis, including:
2.4.1. Symmetry numerical analysis
2.4.2. Size numerical analysis
2.4.3. Color numerical analysis
2.4.4. Shape numerical analysis
2.4.5. Hole counting analysis
You must evaluate each condition ONE by ONE and determine the best conditions.
Note:
- The conditions MUST work for ALL training and test input and output image pairs.
- Conditions CANNOT come from the output images!
- A condition can be EMPTY.
- If a condition is based on numerical features (e.g., size (width and height), or the number of holes), you may use the operators =, <, >, >=, or <=.
- For cropping or selecting tasks, consider using a bounding box to extract each component.