# From Reasoning to Generalization: Knowledge-Augmented LLMs for ARC Benchmark
## Abstract
Recent reasoning-oriented LLMs have demonstrated strong performance on challenging tasks such as mathematics and science examinations. However, core cognitive faculties of human intelligence, such as abstract reasoning and generalization, remain underexplored. To address this, we evaluate recent reasoning-oriented LLMs on the Abstraction and Reasoning Corpus (ARC) benchmark, which explicitly demands both faculties. We formulate ARC as a program synthesis task and propose nine candidate solvers. Experimental results show that repeated-sampling planning-aided code generation (RSPC) achieves the highest test accuracy and demonstrates consistent generalization across most LLMs. To further improve performance, we introduce an ARC solver, Knowledge Augmentation for Abstract Reasoning (KAAR), which encodes core knowledge priors within an ontology that classifies priors into three hierarchical levels based on their dependencies. KAAR progressively expands LLM reasoning capacity by gradually augmenting priors at each level, and invokes RSPC to generate candidate solutions after each augmentation stage. This stage-wise reasoning reduces interference from irrelevant priors and improves LLM performance. Empirical results show that KAAR maintains strong generalization and consistently outperforms non-augmented RSPC across all evaluated LLMs, achieving around 5% absolute gains and up to 64.52% relative improvement. Despite these achievements, ARC remains a challenging benchmark for reasoning-oriented LLMs, highlighting avenues for future progress in LLMs.
## 1 Introduction
Learning from extensive training data has achieved remarkable success in major AI fields such as computer vision, natural language processing, and autonomous driving [1, 2, 3]. However, achieving human-like intelligence goes beyond learning purely from large-scale data; it requires rapid reasoning and generalization from prior knowledge to novel tasks and situations [4]. Chollet [5] introduced the Abstraction and Reasoning Corpus (ARC) to assess the generalization and abstract reasoning capabilities of AI systems. In each ARC task, the solver must infer generalized rules or procedures from a small set of training instances, typically fewer than five input-output image pairs, and apply them to the input images of the test instances to generate the corresponding outputs (Figure 1 (a)). Each image in ARC is a pixel grid represented as a 2D matrix, where each value denotes a pixel color (Figure 1 (b)). ARC evaluates broad generalization, encompassing reasoning over individual input-output pairs and inferring generalized solutions via high-level abstraction, akin to inductive reasoning [6].
ARC is grounded in core knowledge priors, which serve as foundational cognitive faculties of human intelligence, enabling equitable comparisons between AI systems and human cognitive abilities [7]. These priors include: (1) objectness – aggregating elements into coherent, persistent objects; (2) geometry and topology – recognizing and manipulating shapes, symmetries, spatial transformations, and structural patterns (e.g., containment, repetition, projection); (3) numbers and counting – counting, sorting, comparing quantities, performing basic arithmetic, and identifying numerical patterns; and (4) goal-directedness – inferring purposeful transformations between initial and final states without explicit temporal cues. Incorporating these priors allows ARC solvers to replicate human cognitive processes, produce behavior aligned with human expectations, address human-relevant problems, and demonstrate human-like intelligence through generalization and abstract reasoning [5]. These features highlight ARC as a crucial benchmark for assessing progress toward general intelligence.
Chollet [5] suggested approaching ARC tasks as instances of program synthesis, which studies automatically generating a program that satisfies a high-level specification [8]. Following this proposal, recent studies [9, 10] have successfully solved a subset of ARC tasks by searching for program solutions encoded within object-centric domain-specific languages (DSLs). Reasoning-oriented LLMs, which integrate chain-of-thought (CoT) reasoning [11] and are often trained via reinforcement learning, have further advanced program synthesis performance. Common approaches using LLMs for code generation include repeated sampling, where multiple candidate programs are generated [12] and the best is selected via best-program selection strategies [13, 14, 15, 16], and code refinement, where initial LLM-generated code is iteratively improved using error feedback from execution results [17, 18] or LLM-generated explanations [17, 19, 18]. We note that ARC presents greater challenges than existing program synthesis benchmarks such as HumanEval [12], MBPP [20], and LiveCode [21], due to its stronger emphasis on generalization and abstract reasoning grounded in core knowledge priors, capabilities that remain underexplored in LLMs. This gap motivates our evaluation of recent reasoning-oriented LLMs on the ARC benchmark, and our proposed knowledge augmentation approach to improve their performance.
<details>
<summary>x1.png Details</summary>

### Visual Description
A two-part illustration of ARC task 25ff71a9. Part (a), Image Visualization, shows the training input-output pairs and the test input as 3x3 pixel grids of blue, black, and red cells, with the unknown test output marked by a question mark. Part (b), Matrix Representation, shows the corresponding 2D matrices beneath each image, where black, blue, and red pixels map to the values 0, 1, and 2, respectively (e.g., one training input is `[[1, 1, 1], [0, 0, 0], [0, 0, 0]]` and the test input is `[[0, 0, 0], [2, 0, 0], [0, 0, 0]]`).
</details>
Figure 1: An ARC problem example (25ff71a9) with image visualizations (a), including three input-output pairs in the training instances, and one input image in the test instance, along with their corresponding 2D matrix representations (b). The ground-truth test output is enclosed in a red box.
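For concreteness, an ARC task of this shape can be held in memory as below. This is a minimal sketch assuming the field layout of the public ARC JSON files ("train"/"test" lists of "input"/"output" grids), with 3x3 matrices in the style of Figure 1; the helper `grid_shape` is purely illustrative.

```python
# Sketch of one ARC task: a few training input-output pairs plus a test
# instance whose output is withheld from the solver.
task = {
    "train": [
        {"input":  [[1, 1, 1], [0, 0, 0], [0, 0, 0]],
         "output": [[0, 0, 0], [1, 1, 1], [0, 0, 0]]},
        # ... typically fewer than five input-output pairs ...
    ],
    "test": [
        {"input": [[0, 0, 0], [2, 0, 0], [0, 0, 0]]},  # output withheld
    ],
}

def grid_shape(grid):
    """Height and width of a pixel grid (a 2D matrix of color values)."""
    return len(grid), len(grid[0]) if grid else 0

print(grid_shape(task["train"][0]["input"]))  # (3, 3)
```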
We systematically assess how reasoning-oriented LLMs approach ARC tasks within the program synthesis framework. For each ARC problem, we begin by providing 2D matrices as input. We adopt three established program generation strategies: direct generation, repeated sampling, and refinement. Each strategy is evaluated under two solution representations: a text-based solution plan and Python code. When generating code solutions, we further examine two modalities: standalone and planning-aided, where a plan is generated to guide subsequent code development, following recent advances [18, 22, 23]. In total, nine ARC solvers are considered. We evaluate several reasoning-oriented LLMs, including proprietary models, GPT-o3-mini [24, 25], and Gemini-2.0-Flash-Thinking (Gemini-2.0) [26], and open-source models, DeepSeek-R1-Distill-Llama-70B (DeepSeek-R1-70B) [27] and QwQ-32B [28]. Accuracy on test instances is reported as the primary metric. When evaluated on the ARC public evaluation set (400 problems), repeated-sampling planning-aided code generation (RSPC) demonstrates consistent generalization and achieves the highest test accuracy across most LLMs, 30.75% with GPT-o3-mini, 16.75% with Gemini-2.0, 14.25% with QwQ-32B, and 7.75% with DeepSeek-R1-70B. We treat the most competitive ARC solver, RSPC, as the solver backbone.
Motivated by the success of manually defined priors in ARC solvers [9, 10], we propose Knowledge Augmentation for Abstract Reasoning (KAAR) for solving ARC tasks using reasoning-oriented LLMs. KAAR formalizes manually defined priors through a lightweight ontology that organizes priors into hierarchical levels based on their dependencies. It progressively augments LLMs with priors at each level via structured prompting. Specifically, core knowledge priors are introduced in stages: beginning with objectness, followed by geometry, topology, numbers, and counting, and concluding with goal-directedness. After each stage, KAAR applies the ARC solver backbone (RSPC) to generate the solution. This progressive augmentation enables LLMs to gradually expand their reasoning capabilities and facilitates stage-wise reasoning, aligning with human cognitive development [29]. Empirical results show that KAAR improves accuracy on test instances across all evaluated LLMs, achieving the largest absolute gain of 6.75% with QwQ-32B and the highest relative improvement of 64.52% with DeepSeek-R1-70B over non-augmented RSPC.
We outline our contributions as follows:
- We evaluate the abstract reasoning and generalization capabilities of reasoning-oriented LLMs on ARC using nine solvers that differ in generation strategies, modalities, and solution representations.
- We introduce KAAR, a knowledge augmentation approach for solving ARC problems using LLMs. KAAR progressively augments LLMs with core knowledge priors structured via an ontology and applies the best ARC solver after augmenting same-level priors, further improving performance.
- We conduct a comprehensive performance analysis of the proposed ARC solvers, highlighting failure cases and remaining challenges on the ARC benchmark.
<details>
<summary>x2.png Details</summary>

### Visual Description
A three-part illustration of the solution generation approaches, (1) direct generation, (2) repeated sampling, and (3) refinement. Each part is drawn as a flow diagram over the problem description Q, plan p, and code c, with transitions labeled "s := p" or "s := c", "pass"/"fail" outcomes, and "standalone" versus "planning-aided" modalities. Accompanying text blocks show GPT-o3-mini input and response fragments:
* **(a) Problem Description Q:** "The training example(s):" input `[[1,1],[1,0],[0,0],[0,0]]`, output `[[0,0],[1,1],[1,1],[0,0]]`; "The test input image(s):" input `[[2,0],[2,0],[0,0],[0,0]]`.
* **(b) Solution Plan P:** "...for each cell in row i of the output (where i > 0), set its value equal to the value from row (i - 1) in the same column of the input." "For the top row of the output (row 0), fill every cell with 0 (the background color)..."
* **(c) Python Code C:**
```python
def generate_output_image(input_image):
    rows = len(input_image)
    if rows == 0:
        return []
    cols = len(input_image[0])
    output_image = []
    output_image.append([0 for _ in range(cols)])
    for i in range(1, rows):
        output_image.append(input_image[i - 1].copy())
    return output_image
```
</details>
Figure 2: An illustration of the three ARC solution generation approaches, (1) direct generation, (2) repeated sampling, and (3) refinement, with the GPT-o3-mini input and response fragments (a–c) for solving task 25ff71a9 (Figure 1). For each approach, when the solution $s$ is code, $s:=c$ , a plan $p$ is either generated from the problem description $Q$ to guide code generation (planning-aided) or omitted (standalone). Otherwise, when $s:=p$ , the plan $p$ serves as the final solution instead.
## 2 Problem Formulation
We formulate each ARC task as a tuple $\mathcal{P}=\langle I_{r},I_{t}\rangle$ , where $I_{r}$ and $I_{t}$ are sets of training and test instances. Each instance consists of an input-output image pair $(i^{i},i^{o})$ , represented as 2D matrices. The goal is to leverage the LLM $\mathcal{M}$ to generate a solution $s$ based on training instances $I_{r}$ and test input images $\{i^{i}\ |\ (i^{i},i^{o})\in I_{t}\}$ , where $s$ maps each test input $i^{i}$ to its output $i^{o}$ , i.e., $s(i^{i})=i^{o}$ , for $(i^{i},i^{o})\in I_{t}$ . We note that the test input images are visible during the generation of solution $s$ , whereas test output images become accessible only after $s$ is produced to validate the correctness of $s$ . We encode the solution $s$ in different forms, as a solution plan $p$ , or as Python code $c$ , optionally guided by $p$ . We denote each ARC problem description, comprising $I_{r}$ and $\{i^{i}\ |\ (i^{i},i^{o})\in I_{t}\}$ , as $Q$ .
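The correctness condition $s(i^{i})=i^{o}$ amounts to a simple check over an instance set, sketched below; the toy instance set and the horizontal-flip rule are hypothetical, purely for illustration.

```python
def is_correct(s, instances):
    """True iff the solution maps every input image to its output image,
    i.e. s(i_in) == i_out for all (i_in, i_out) pairs in the set."""
    return all(s(i_in) == i_out for i_in, i_out in instances)

# Hypothetical toy instance set whose underlying rule is a horizontal flip.
I_r = [([[1, 0]], [[0, 1]]),
       ([[2, 2, 0]], [[0, 2, 2]])]
flip = lambda img: [row[::-1] for row in img]
print(is_correct(flip, I_r))              # True
print(is_correct(lambda img: img, I_r))   # False
```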
## 3 ARC Solver Backbone
LLMs have shown promise in solving tasks that rely on ARC-relevant priors [30, 31, 32, 33]. We initially assume that reasoning-oriented LLMs implicitly encode sufficient core knowledge priors to solve ARC tasks. We cast each ARC task as a program synthesis problem, which involves generating a solution $s$ from a problem description $Q$ without explicitly prompting for priors. We consider established LLM-based code generation approaches [17, 18, 19, 23] as candidate ARC solution generation strategies, illustrated at the top of Figure 2. These include: (1) direct generation, where the LLM produces the solution $s$ in a single attempt, and then validates it on test instances $I_{t}$ ; (2) repeated sampling, where the LLM samples solutions until one passes training instances $I_{r}$ , and then evaluates it on $I_{t}$ ; and (3) refinement, where the LLM iteratively refines an initial solution $s$ based on failures on $I_{r}$ until it succeeds, followed by evaluation on $I_{t}$ . In addition, we extend the solution representation beyond code to include text-based solution plans. Given the problem description $Q$ as input (Figure 2, block (a)), all strategies prompt the LLM to generate a solution $s$ , represented either as a natural language plan $p$ (block (b)), $s:=p$ , or as a Python code $c$ (block (c)), $s:=c$ . For $s:=p$ , the solution is derived directly from $Q$ . For $s:=c$ , we explore two modalities: the LLM either generates $c$ directly from $Q$ (standalone), or first generates a plan $p$ for $Q$ , which is then concatenated with $Q$ to guide subsequent code development (planning-aided), a strategy widely adopted in recent work [18, 22, 23].
Repeated sampling and refinement iteratively produce new solutions based on the correctness of $s$ on training instances $I_{r}$ , and validate $s$ on test instances $I_{t}$ once it passes $I_{r}$ or the iteration limit is reached. When $s:=p$ , its correctness is evaluated by prompting the LLM to generate each output image $i^{o}$ given its corresponding input $i^{i}$ and the solution plan $p$ , where $(i^{i},i^{o})\in I_{r}$ or $(i^{i},i^{o})\in I_{t}$ . Alternatively, when $s:=c$ , its correctness is assessed by executing $c$ on $I_{r}$ or $I_{t}$ . In repeated sampling, the LLM iteratively generates a new plan $p$ and code $c$ from the problem description $Q$ without additional feedback. In contrast, refinement revises $p$ and $c$ by prompting the LLM with the previously incorrect $p$ and $c$ , concatenated with failed training instances. In total, nine ARC solvers are employed to evaluate the performance of reasoning-oriented LLMs on the ARC benchmark.
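The repeated-sampling planning-aided variant for $s:=c$ can be sketched as follows; `sample_plan`, `sample_code`, and the sampling `budget` are hypothetical stand-ins for the LLM calls and the iteration limit, not the paper's actual interface.

```python
def passes(program, instances):
    """Execute a candidate program on instances; exceptions count as failure."""
    try:
        return all(program(i_in) == i_out for i_in, i_out in instances)
    except Exception:
        return False

def rspc(sample_plan, sample_code, Q, I_r, I_t, budget=8):
    """Repeated-sampling planning-aided code generation: keep sampling a
    plan p and plan-guided code c from Q (without feedback) until c passes
    the training instances I_r, then validate that candidate on I_t."""
    for _ in range(budget):
        p = sample_plan(Q)       # natural-language plan for Q
        c = sample_code(Q, p)    # Python code guided by p (s := c)
        if c is not None and passes(c, I_r):
            return c, passes(c, I_t)
    return None, False           # budget exhausted without passing I_r
```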
## 4 Knowledge Augmentation
Xu et al. [34] improved LLM performance on the ARC benchmark by prompting with object-based representations of each task derived from graph-based object abstractions. Building on this insight, we propose KAAR, a knowledge augmentation approach for solving ARC tasks using reasoning-oriented LLMs. KAAR leverages Generalized Planning for Abstract Reasoning (GPAR) [10], a state-of-the-art object-centric ARC solver, to generate the core knowledge priors. GPAR encodes priors as abstraction-defined nodes enriched with attributes and inter-node relations, which are extracted using standard image processing algorithms. To align with the four knowledge dimensions in ARC, KAAR maps GPAR-derived priors onto these four categories. In detail, KAAR adopts fundamental abstraction methods from GPAR to enable objectness. Objects are typically defined as components based on adjacency rules and color consistency (e.g., 4-connected or 8-connected components), while also including the entire image as a component. KAAR further introduces additional abstractions: (1) middle-vertical, which vertically splits the image into two equal parts and treats each as a distinct component; (2) middle-horizontal, which applies the same principle along the horizontal axis; (3) multi-lines, which segments the image using full-length rows or columns of uniform color and treats each resulting part as a distinct component; and (4) no abstraction, which considers only the raw 2D matrices. Under no abstraction, KAAR reduces to the ARC solver backbone without incorporating any priors. KAAR inherits GPAR's geometric and topological priors, including component attributes (size, color, shape) and relations (spatial, congruent, inclusive). It further extends the attribute set with symmetry, bounding box, nearest boundary, and hole count, and augments the relation set with touching.
For numeric and counting priors, KAAR follows GPAR, incorporating the largest/smallest component sizes, and the most/least frequent component colors, while extending them with statistical analysis of hole counts and symmetry, as well as the most/least frequent sizes and shapes.
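The objectness abstraction underlying these attribute and counting computations can be sketched as a standard connected-components pass; this is a minimal illustration (a breadth-first flood fill), and the `background` parameter is an assumption to reflect that some abstractions ignore a background color while others treat every pixel as part of a component.

```python
from collections import deque

def components_4(grid, background=None):
    """Group 4-connected pixels of equal color into components, each a dict
    of color and sorted (row, col) cells; pixels of `background` color (if
    given) are ignored."""
    h, w = len(grid), len(grid[0])
    seen, comps = set(), []
    for r in range(h):
        for c in range(w):
            if (r, c) in seen or grid[r][c] == background:
                continue
            color, comp, queue = grid[r][c], [], deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w \
                            and (ny, nx) not in seen and grid[ny][nx] == color:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
            comps.append({"color": color, "cells": sorted(comp)})
    return comps
```

Attributes such as size follow directly (e.g., `len(comp["cells"])`), and counting priors such as the largest or smallest component reduce to `max`/`min` over these sizes.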
<details>
<summary>x3.png Details</summary>

### Visual Description
A flowchart of goal-directedness priors augmentation, with prompt and response fragments and a vertical "action schema" label on the right:
* **(a) Action(s) Selection:** "Please determine which category or categories this task belongs to. Please select from the following predefined categories..." Response: "This task involves color change."
* **(b) Component(s) Selection:** "If this task involves color change: 1. Which components require color change? 2. Determine the conditions used to select these target components..." Response: "Components: (color 0) with the minimum and maximum sizes."
* **(c) Color Change Rule:** "If this task involves color change, please determine which source color maps to which target color for the target components. 2. Determine the conditions used to dictate this color change..." Response: "minimum-size component (from color 0) to 7; maximum-size component (from color 0) to 8."
</details>
Figure 3: An example of goal-directedness priors augmentation in KAAR, with input and response fragments from GPT-o3-mini.
GPAR approaches goal-directedness priors by searching for a sequence of program instructions [35] defined in a DSL. Each instruction supports conditionals, branching, looping, and action statements. KAAR incorporates the condition and action concepts from GPAR, and enables goal-directedness priors by augmenting LLM knowledge in two steps: 1) it prompts the LLM to identify the most relevant actions for solving the given ARC problem from ten predefined action categories (Figure 3, block (a)), partially derived from GPAR and extended based on the training set, such as color change, movement, and extension; 2) for each selected action, KAAR prompts the LLM with the associated schema to resolve implementation details. For example, for a color change action, KAAR first prompts the LLM to identify the target components (Figure 3, block (b)), and then to specify the source and target colors for modification based on the target components (Figure 3, block (c)). We note that KAAR also prompts the LLM to incorporate condition-aware reasoning when determining action implementation details, using knowledge derived from geometry, topology, numbers, and counting priors. This enables fine-grained control, for example, applying color changes only to black components conditioned on the maximum or minimum size: from black (value 0) to blue (value 8) if largest, or to orange (value 7) if smallest. Figure 3 shows fragments of the goal-directedness priors augmentation. See Appendix A.2 for the full set of priors in KAAR.
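The size-conditioned color change in this example can be sketched as below. The function name and the component-list representation (lists of (row, col) cells from the objectness stage) are illustrative assumptions, not the paper's actual implementation.

```python
def color_change_by_size(grid, components, src=0, to_large=8, to_small=7):
    """Recolor the largest src-colored component to `to_large` and the
    smallest to `to_small`, mirroring the example rule in the text
    (black -> blue if largest, black -> orange if smallest)."""
    out = [row[:] for row in grid]
    targets = [comp for comp in components
               if all(grid[r][c] == src for r, c in comp)]
    if len(targets) < 2:      # rule needs a distinct largest and smallest
        return out
    for r, c in max(targets, key=len):
        out[r][c] = to_large
    for r, c in min(targets, key=len):
        out[r][c] = to_small
    return out
```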
<details>
<summary>x4.png Details</summary>

### Visual Description
Five labeled panels for ARC problem 62ab2642:
* **(a) ARC example:** the task's pixel grids, with the unknown output marked by a question mark.
* **(b) Augmentation process in KAAR:** a flow diagram in which priors are introduced stage by stage (Objectness; then Geometry and Topology and Numbers and Counting; then Goal-directedness), with the ARC solver backbone invoked after each stage; a failed candidate triggers augmentation with the next level of priors.
* **(c) Objectness:** "When we consider 4-connected black pixels (value 0) as components, the components in each input and output image are as follows:", e.g., for the Training Pair 1 input image, Component 1 at locations [(0,0), (0,1)] and Component 8 at [(4,14)].
* **(d) Geometry and Topology:** e.g., Component 1 has a horizontal-line shape, is different from all others, is not touching Component 2, and is at the top-left of Component 2.
* **(e) Numbers and Counting:** e.g., component 5 has the maximum size 10, component 8 the minimum size 1, and components 4 and 6 (each of size 7) appear most frequently (twice).
</details>
Figure 4: Augmentation process in KAAR (block (b)) and the corresponding knowledge augmentation fragments (blocks (c-e)) for ARC problem 62ab2642 (block (a)).
KAAR encodes the full set of core knowledge priors assumed in ARC into an ontology, where priors are organized into three hierarchical levels based on their dependencies. KAAR prompts LLMs with priors at each level to enable incremental augmentation. This reduces context interference and supports stage-wise reasoning aligned with human cognitive development [29]. Figure 4, block (b), illustrates the augmentation process in KAAR alongside the augmented prior fragments used to solve the problem shown in block (a). KAAR begins augmentation with objectness priors, encoding images into components with detailed coordinates based on a specific abstraction method (block (c)). KAAR then prompts geometry and topology priors (block (d)), followed by numbers and counting priors (block (e)). These priors are ordered by dependency while residing at the same ontological level, as they all build upon objectness. Finally, KAAR augments goal-directedness priors, as shown in Figure 3, where target components are derived from objectness analysis and conditions are inferred from geometric, topological, and numerical analyses. After augmenting each level of priors, KAAR invokes the ARC solver backbone to generate solutions. If any solution passes training instances $I_{r}$ , it is validated on the test instances $I_{t}$ ; otherwise, augmentation proceeds to the next level of priors.
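The staged loop described above can be sketched as follows. This is an illustrative Python sketch, not the paper's implementation: `augment`, `rspc`, and `validate` are hypothetical stand-ins for KAAR's prior-generation step, the RSPC solver backbone, and solution checking.

```python
# Illustrative sketch of KAAR's stage-wise augmentation (hypothetical helpers).
PRIOR_LEVELS = [
    ["objectness"],                              # level 1
    ["geometry_topology", "numbers_counting"],   # level 2, builds on objectness
    ["goal_directedness"],                       # level 3
]

def kaar_solve(task, llm, augment, rspc, validate):
    """Augment priors level by level, invoking the solver backbone after each."""
    context = []  # accumulated prior fragments added to the prompt
    for level in PRIOR_LEVELS:
        for prior in level:
            context.append(augment(task, prior))     # e.g. fragments (c)-(e)
        for candidate in rspc(llm, task, context):   # candidate solutions
            if validate(candidate, task.train):      # passes all of I_r?
                return candidate, validate(candidate, task.test)
    return None, False  # no level yielded a solution passing I_r
```

The key design point is that solver invocation happens after every level, so a task solvable with objectness alone never pays the context cost of the later priors.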
While the ontology provides a hierarchical representation of priors, it may also introduce hallucinations, such as duplicate abstractions, irrelevant component attributes or relations, and inapplicable actions. To address this, KAAR integrates restrictions from GPAR to filter out inapplicable priors. KAAR adopts GPAR’s duplicate-checking strategy, retaining only abstractions that yield distinct components by size, color, or shape, in at least one training instance. In KAAR, each abstraction is associated with a set of applicable priors. For instance, when the entire image is treated as a component, relation priors are excluded, and actions such as movement and color change are omitted, whereas symmetry and size attributes are retained and actions such as flipping and rotation are considered. In contrast, 4-connected and 8-connected abstractions include all component attributes and relations, and the full set of ten action priors. See Appendix A.3 for detailed restrictions.
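The duplicate check can be sketched as below; the `extract` callback and the dict-based component representation are assumptions for illustration, not GPAR's actual interface.

```python
# Sketch of the GPAR-style duplicate check adopted by KAAR: an abstraction is
# kept only if it yields distinct components (by size, color, or shape) on at
# least one training instance relative to the abstractions already kept.

def component_signature(components):
    """Order-insensitive summary of components by size, color, and shape."""
    return frozenset((c["size"], c["color"], c["shape"]) for c in components)

def filter_abstractions(abstractions, train_inputs, extract):
    """Keep abstractions whose components differ on >= 1 training input.

    `extract(abstraction, image)` returns a list of component dicts; two
    abstractions are duplicates if their signatures coincide on every
    training input, in which case only the first one is retained.
    """
    kept, seen = [], set()
    for abstraction in abstractions:
        sigs = tuple(component_signature(extract(abstraction, img))
                     for img in train_inputs)
        if sigs not in seen:   # distinct on at least one instance
            seen.add(sigs)
            kept.append(abstraction)
    return kept
```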
| LLM | Metric | DG: P | DG: C | DG: PC | RS: P | RS: C | RS: PC | RF: P | RF: C | RF: PC |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-o3-mini | $I_{r}$ | - | - | - | 35.50 | **52.50** | 35.50 | 31.00 | 47.25 | 32.00 |
| GPT-o3-mini | $I_{t}$ | 20.50 | 24.50 | 22.25 | 23.75 | **32.50** | 30.75 | 24.75 | 29.25 | 25.75 |
| GPT-o3-mini | $I_{r}\&I_{t}$ | - | - | - | 22.00 | **31.75** | 29.25 | 21.75 | 28.50 | 25.00 |
| Gemini-2.0 | $I_{r}$ | - | - | - | 36.50 | **39.50** | 21.50 | 15.50 | 25.50 | 15.50 |
| Gemini-2.0 | $I_{t}$ | 7.00 | 6.75 | 6.25 | 10.00 | 14.75 | **16.75** | 8.75 | 12.00 | 11.75 |
| Gemini-2.0 | $I_{r}\&I_{t}$ | - | - | - | 9.50 | 14.25 | **16.50** | 8.00 | 10.50 | 10.75 |
| QwQ-32B | $I_{r}$ | - | - | - | **19.25** | 13.50 | 15.25 | 16.75 | 15.00 | 14.25 |
| QwQ-32B | $I_{t}$ | 9.50 | 7.25 | 5.75 | 11.25 | 13.50 | **14.25** | 11.00 | **14.25** | 14.00 |
| QwQ-32B | $I_{r}\&I_{t}$ | - | - | - | 9.25 | 12.75 | **13.00** | 8.75 | **13.00** | 11.75 |
| DeepSeek-R1-70B | $I_{r}$ | - | - | - | **8.75** | 6.75 | 7.75 | 6.25 | 5.75 | 7.75 |
| DeepSeek-R1-70B | $I_{t}$ | 4.25 | 4.75 | 4.50 | 4.25 | 7.25 | **7.75** | 4.75 | 5.75 | **7.75** |
| DeepSeek-R1-70B | $I_{r}\&I_{t}$ | - | - | - | 3.50 | 6.50 | **7.25** | 4.25 | 5.25 | 7.00 |
Table 1: Performance of nine ARC solvers measured by accuracy on $I_{r}$ , $I_{t}$ , and $I_{r}\&I_{t}$ using four reasoning-oriented LLMs. For each LLM, the highest accuracy on each metric is in bold. Accuracy is reported as a percentage. Column groups DG, RS, and RF denote direct generation, repeated sampling, and refinement; P denotes the solution plan, and C and PC refer to standalone and planning-aided code generation, respectively.
## 5 Experiments
In ARC, each task is unique and solvable using only core knowledge priors [5]. We begin by comparing nine candidate solvers on the full ARC public evaluation set of 400 tasks. This offers broader insights than previous studies limited to subsets of the 400 training tasks [10, 9, 36], given the greater difficulty of the evaluation set [37]. We experiment with recent reasoning-oriented LLMs, including proprietary models, GPT-o3-mini and Gemini 2.0 Flash-Thinking (Gemini-2.0), and open-source models, DeepSeek-R1-Distill-Llama-70B (DeepSeek-R1-70B) and QwQ-32B. We compute accuracy on test instances $I_{t}$ as the primary evaluation metric: a problem counts as solved if the first solution that passes the training instances $I_{r}$ also solves $I_{t}$; if no solution passes $I_{r}$ within 12 iterations, the last solution is evaluated on $I_{t}$ instead. This rule applies to both repeated sampling and refinement. We also report accuracy on $I_{r}$ and $I_{r}\&I_{t}$, measuring the percentage of problems whose solutions solve $I_{r}$ alone and both $I_{r}$ and $I_{t}$, respectively. See Appendix A.4 for parameter settings.
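A minimal sketch of this evaluation protocol, assuming hypothetical `generate` and `passes` helpers in place of the actual solver and grader:

```python
# Sketch of the per-task evaluation rule described above (illustrative only).
MAX_ITERS = 12

def evaluate_task(generate, passes, task):
    """Return (solved_I_r, solved_I_t) for one task.

    `generate(i)` produces the i-th candidate solution (by sampling or
    refinement); `passes(solution, instances)` checks it on I_r or I_t.
    """
    last = None
    for i in range(MAX_ITERS):
        last = generate(i)
        if passes(last, task["train"]):             # first solution passing I_r
            return True, passes(last, task["test"])
    # nothing passed I_r within the budget: score the last solution on I_t
    return False, last is not None and passes(last, task["test"])
```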
Table 1 reports the performance of nine ARC solvers across four reasoning-oriented LLMs. For direct generation methods, accuracy on $I_{r}$ and $I_{r}\&I_{t}$ is omitted, as solutions are evaluated directly on $I_{t}$ . GPT-o3-mini consistently outperforms all other LLMs, achieving the highest accuracy on $I_{r}$ (52.50%), $I_{t}$ (32.50%), and $I_{r}\&I_{t}$ (31.75%) under repeated sampling with standalone code generation (C), highlighting its strong abstract reasoning and generalization capabilities. Notably, QwQ-32B, the smallest model, outperforms DeepSeek-R1-70B across all solvers and surpasses Gemini-2.0 under refinement. Among the nine ARC solvers, repeated sampling-based methods generally outperform those based on direct generation or refinement. This diverges from previous findings where refinement dominated conventional code generation tasks that lack abstract reasoning and generalization demands [10, 17, 19]. Within repeated sampling, planning-aided code generation (PC) yields the highest accuracy on $I_{t}$ across most LLMs. It also demonstrates the strongest generalization with GPT-o3-mini and Gemini-2.0, as evidenced by the smallest accuracy gap between $I_{r}$ and $I_{r}\&I_{t}$ , compared to solution plan (P) and standalone code generation (C). A similar trend is observed for QwQ-32B and DeepSeek-R1-70B, where both C and PC generalize effectively across repeated sampling and refinement. Overall, repeated sampling with planning-aided code generation, denoted as RSPC, shows the best performance and thus serves as the ARC solver backbone.
| LLM | Solver | $I_{r}$ Acc | $\Delta$ | $\gamma$ | $I_{t}$ Acc | $\Delta$ | $\gamma$ | $I_{r}\&I_{t}$ Acc | $\Delta$ | $\gamma$ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-o3-mini | RSPC | 35.50 | - | - | 30.75 | - | - | 29.25 | - | - |
| GPT-o3-mini | KAAR | **40.00** | 4.50 | 12.68 | **35.00** | 4.25 | 13.82 | **33.00** | 3.75 | 12.82 |
| Gemini-2.0 | RSPC | 21.50 | - | - | 16.75 | - | - | 16.50 | - | - |
| Gemini-2.0 | KAAR | 25.75 | 4.25 | 19.77 | 21.75 | 5.00 | 29.85 | 20.50 | 4.00 | 24.24 |
| QwQ-32B | RSPC | 15.25 | - | - | 14.25 | - | - | 13.00 | - | - |
| QwQ-32B | KAAR | 22.25 | 7.00 | 45.90 | 21.00 | 6.75 | 47.37 | 19.25 | 6.25 | 48.08 |
| DeepSeek-R1-70B | RSPC | 7.75 | - | - | 7.75 | - | - | 7.25 | - | - |
| DeepSeek-R1-70B | KAAR | 12.25 | 4.50 | 58.06 | 12.75 | 5.00 | 64.52 | 11.50 | 4.25 | 58.62 |
Table 2: Comparison of RSPC (repeated-sampling planning-aided code generation) and its knowledge-augmented variant, KAAR, in terms of accuracy (Acc) on $I_{r}$ , $I_{t}$ , and $I_{r}\&I_{t}$ . $\Delta$ and $\gamma$ denote the absolute and relative improvements over RSPC, respectively. All values are reported as percentages. The best results on each metric are in bold.
We further compare the performance of RSPC with its knowledge-augmented variant, KAAR. For each task, KAAR begins with simpler abstractions, i.e., no abstraction and whole image, and progresses to the more complex 4-connected and 8-connected abstractions, consistent with GPAR. KAAR reports accuracy on test instances $I_{t}$ based on the first abstraction whose solution solves all training instances $I_{r}$; otherwise, it records the final solution from each abstraction and selects the one that passes the most training instances for evaluation on $I_{t}$. KAAR allows the solver backbone (RSPC) up to 4 iterations per invocation, totaling 12 iterations, consistent with the non-augmented setting. See Appendix A.5 for KAAR execution details. As shown in Table 2, KAAR consistently outperforms non-augmented RSPC across all LLMs, yielding around 5% absolute gains on $I_{r}$ , $I_{t}$ , and $I_{r}\&I_{t}$ . This highlights the effectiveness and model-agnostic nature of the augmented priors. KAAR achieves the highest accuracy using GPT-o3-mini, with 40% on $I_{r}$ , 35% on $I_{t}$ , and 33% on $I_{r}\&I_{t}$ . KAAR shows the greatest absolute improvements ( $\Delta$ ) using QwQ-32B and the largest relative gains ( $\gamma$ ) using DeepSeek-R1-70B across all evaluated metrics. Moreover, KAAR maintains generalization comparable to RSPC across all LLMs, indicating that the augmented priors are sufficiently abstract and expressive to serve as basis functions for reasoning, in line with ARC assumptions.
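The abstraction loop and fallback selection can be sketched as follows; the function names are illustrative stand-ins, not the released code.

```python
# Sketch of KAAR's per-task abstraction loop: try abstractions from simple to
# complex, return the first solution solving all of I_r, otherwise fall back
# to the final solution that passes the most training instances.
ABSTRACTIONS = ["no_abstraction", "whole_image", "4-connected", "8-connected"]

def select_solution(task, solve_with, score_on_train):
    """`solve_with(abstraction)` returns the final candidate produced under
    that abstraction; `score_on_train(sol)` counts solved instances of I_r."""
    best, best_score = None, -1
    total = len(task["train"])
    for abstraction in ABSTRACTIONS:
        sol = solve_with(abstraction)
        score = score_on_train(sol)
        if score == total:           # solves all of I_r: evaluate on I_t directly
            return sol
        if score > best_score:       # otherwise keep the best partial solution
            best, best_score = sol, score
    return best
```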
<details>
<summary>x5.png Details</summary>

### Visual Description
## Heatmap: Model Coverage Comparison - RSPC & KAAR
### Overview
The image presents two heatmaps, labeled (a) RSPC and (b) KAAR, comparing the coverage between four models: GPT-o3-mini, Gemini-2.0, QwQ-32B, and DeepSeek-R1-70B. The color intensity represents the coverage value, with darker shades indicating higher coverage.
### Components/Axes
* **X-axis:** Models - GPT-o3-mini, Gemini-2.0, QwQ-32B, DeepSeek-R1-70B.
* **Y-axis:** Models - GPT-o3-mini, Gemini-2.0, QwQ-32B, DeepSeek-R1-70B.
* **Color Scale (Legend):** Located on the right side of the image. Ranges from approximately 0.0 (lightest color) to 1.0 (darkest color), representing Coverage. The color gradient transitions from light yellow to dark red.
* **Labels:** Each cell in the heatmap displays a numerical value representing the coverage between the corresponding row and column models.
* **Titles:** "(a) RSPC" and "(b) KAAR" indicate the type of coverage being measured in each heatmap.
### Detailed Analysis or Content Details
**Heatmap (a) - RSPC**
* **GPT-o3-mini vs. GPT-o3-mini:** 1.00
* **GPT-o3-mini vs. Gemini-2.0:** 0.50
* **GPT-o3-mini vs. QwQ-32B:** 0.40
* **GPT-o3-mini vs. DeepSeek-R1-70B:** 0.22
* **Gemini-2.0 vs. GPT-o3-mini:** 0.91
* **Gemini-2.0 vs. Gemini-2.0:** 1.00
* **Gemini-2.0 vs. QwQ-32B:** 0.60
* **Gemini-2.0 vs. DeepSeek-R1-70B:** 0.40
* **QwQ-32B vs. GPT-o3-mini:** 0.86
* **QwQ-32B vs. Gemini-2.0:** 0.70
* **QwQ-32B vs. QwQ-32B:** 1.00
* **QwQ-32B vs. DeepSeek-R1-70B:** 0.44
* **DeepSeek-R1-70B vs. GPT-o3-mini:** 0.87
* **DeepSeek-R1-70B vs. Gemini-2.0:** 0.87
* **DeepSeek-R1-70B vs. QwQ-32B:** 0.81
* **DeepSeek-R1-70B vs. DeepSeek-R1-70B:** 1.00
**Heatmap (b) - KAAR**
* **GPT-o3-mini vs. GPT-o3-mini:** 1.00
* **GPT-o3-mini vs. Gemini-2.0:** 0.55
* **GPT-o3-mini vs. QwQ-32B:** 0.54
* **GPT-o3-mini vs. DeepSeek-R1-70B:** 0.34
* **Gemini-2.0 vs. GPT-o3-mini:** 0.89
* **Gemini-2.0 vs. Gemini-2.0:** 1.00
* **Gemini-2.0 vs. QwQ-32B:** 0.72
* **Gemini-2.0 vs. DeepSeek-R1-70B:** 0.48
* **QwQ-32B vs. GPT-o3-mini:** 0.88
* **QwQ-32B vs. Gemini-2.0:** 0.74
* **QwQ-32B vs. QwQ-32B:** 1.00
* **QwQ-32B vs. DeepSeek-R1-70B:** 0.53
* **DeepSeek-R1-70B vs. GPT-o3-mini:** 0.92
* **DeepSeek-R1-70B vs. Gemini-2.0:** 0.82
* **DeepSeek-R1-70B vs. QwQ-32B:** 0.88
* **DeepSeek-R1-70B vs. DeepSeek-R1-70B:** 1.00
### Key Observations
* In both heatmaps, the diagonal elements (a model compared to itself) are 1.00, as expected.
* Reading GPT-o3-mini's row, the other models cover only a fraction of the problems it solves (as low as 0.22 under RSPC); reading its column, it covers most problems solved by the others (values above 0.85).
* DeepSeek-R1-70B's row values are high because its comparatively small solved set is largely contained in the other models' solved sets.
* Coverage values generally increase from RSPC (a) to KAAR (b); for example, the GPT-o3-mini/DeepSeek-R1-70B cell rises from 0.22 to 0.34.
### Interpretation
The heatmaps show, for each ordered pair of models, the proportion of problems solved by the row model that are also solved by the column model under the RSPC and KAAR solvers. The matrices are asymmetric by construction, since each cell is normalized by the row model's solved set. The broad column coverage of GPT-o3-mini indicates that it subsumes most problems the other models solve, while the generally higher values under KAAR indicate that knowledge augmentation enlarges the set of problems solved in common across models.
</details>
Figure 5: Asymmetric relative coverage matrices for RSPC (a) and KAAR (b), showing the proportion of problems whose test instances are solved by the row model that are also solved by the column model, across four LLMs.
We compare relative problem coverage across evaluated LLMs under RSPC and KAAR based on successful solutions on test instances. As shown in Figure 5, each cell $(i,j)$ represents the proportion of problems solved by the row LLM that are also solved by the column LLM. This is computed as $\frac{|A_{i}\cap A_{j}|}{|A_{i}|}$ , where $A_{i}$ and $A_{j}$ are the sets of problems solved by the row and column LLMs, respectively. Values near 1 indicate that the column LLM covers most problems solved by the row LLM. Under RSPC (Figure 5 (a)), GPT-o3-mini exhibits broad coverage, with column values consistently above 0.85. Gemini-2.0 and QwQ-32B also show substantial alignment, with mutual coverage exceeding 0.6. In contrast, DeepSeek-R1-70B shows lower alignment, with column values below 0.45 due to fewer solved problems. Figure 5 (b) illustrates that KAAR generally improves or maintains inter-model overlap compared to RSPC. Notably, KAAR raises the minimum coverage between GPT-o3-mini and DeepSeek-R1-70B from 0.22 under RSPC to 0.34 under KAAR. These results highlight the effectiveness of KAAR in improving cross-model generalization, with all evaluated LLMs solving additional shared problems. In particular, it enables smaller models such as QwQ-32B and DeepSeek-R1-70B to better align with stronger LLMs on the ARC benchmark.
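The coverage computation is straightforward to reproduce; a minimal sketch:

```python
# Coverage cell (i, j) = |A_i ∩ A_j| / |A_i|: the fraction of problems solved
# by row model i that column model j also solves (asymmetric by construction).

def coverage_matrix(solved):
    """`solved` maps model name -> set of ids of problems it solved on I_t."""
    return {
        (i, j): (len(s_i & solved[j]) / len(s_i)) if s_i else 0.0
        for i, s_i in solved.items()
        for j in solved
    }
```

Normalizing by the row model's solved set is why the matrix is not symmetric: a weak model's few solved problems can be fully covered by a strong model while the converse coverage stays low.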
<details>
<summary>x6.png Details</summary>

### Visual Description
## Bar Chart: Accuracy on $I_t$ for Different Models and Tasks
### Overview
This bar chart compares the accuracy on test instances $I_t$ (as a percentage) of four large language models (LLMs), GPT-o3-mini, Gemini-2.0, QwQ-32B, and DeepSeek-R1-70B, across four task categories: Movement, Extension, Recolour, and Others. Each model is evaluated with two solvers, RSPC and KAAR; each stacked bar shows RSPC accuracy (darker segment) with KAAR's additional improvement on top (lighter segment). The y-axis gives the accuracy percentage, the x-axis lists the task categories, and total problem counts for each category are provided below the x-axis labels.
### Components/Axes
* **X-axis:** Tasks - Movement, Extension, Recolour, Others.
* **Y-axis:** Accuracy on $I_t$ (%) - Scale ranges from 0 to 50, with increments of 10.
* **Legend:** Located in the top-right corner, identifies the color-coding for each model and prompting method:
* Blue: GPT-o3-mini: RSPC
* Dark Blue: GPT-o3-mini: KAAR
* Green: Gemini-2.0: RSPC
* Light Green: Gemini-2.0: KAAR
* Purple: QwQ-32B: RSPC
* Dark Purple: QwQ-32B: KAAR
* Orange: DeepSeek-R1-70B: RSPC
* Yellow: DeepSeek-R1-70B: KAAR
* **Task Totals:** Below each task label, the total number of samples for that task is indicated (Movement: 55, Extension: 129, Recolour: 115, Others: 101).
* **Data Labels:** Numerical values are displayed on top of each segment of the bar, representing the accuracy percentage.
### Detailed Analysis
Here's a breakdown of the accuracy values for each task and model/prompting method combination:
**Movement (Total: 55)**
* GPT-o3-mini: RSPC - 41.8%
* GPT-o3-mini: KAAR - 20.0%
* QwQ-32B: RSPC - 12.7%
* QwQ-32B: KAAR - 18.2%
* Gemini-2.0: RSPC - 3.6%
* Gemini-2.0: KAAR - 9.1%
* DeepSeek-R1-70B: RSPC - 10.9%
* DeepSeek-R1-70B: KAAR - 14.5%
**Extension (Total: 129)**
* GPT-o3-mini: RSPC - 38.8%
* GPT-o3-mini: KAAR - 19.4%
* QwQ-32B: RSPC - 1.6%
* QwQ-32B: KAAR - 7.8%
* Gemini-2.0: RSPC - 0.8%
* Gemini-2.0: KAAR - 2.3%
* DeepSeek-R1-70B: RSPC - 17.8%
* DeepSeek-R1-70B: KAAR - 1.6%
**Recolour (Total: 115)**
* GPT-o3-mini: RSPC - 24.3%
* GPT-o3-mini: KAAR - 13.9%
* QwQ-32B: RSPC - 6.1%
* QwQ-32B: KAAR - 7.8%
* Gemini-2.0: RSPC - 7.0%
* Gemini-2.0: KAAR - 4.3%
* DeepSeek-R1-70B: RSPC - 10.4%
* DeepSeek-R1-70B: KAAR - 7.8%
**Others (Total: 101)**
* GPT-o3-mini: RSPC - 21.8%
* GPT-o3-mini: KAAR - 14.9%
* QwQ-32B: RSPC - 4.0%
* QwQ-32B: KAAR - 11.9%
* Gemini-2.0: RSPC - 5.0%
* Gemini-2.0: KAAR - 9.9%
* DeepSeek-R1-70B: RSPC - 7.9%
* DeepSeek-R1-70B: KAAR - 5.0%
### Key Observations
* GPT-o3-mini attains the highest accuracy in every category, with its strongest results on Movement and Extension.
* Since each bar stacks KAAR's additional improvement (lighter segment) on top of RSPC accuracy (darker segment), the lighter segments show the gain contributed by knowledge augmentation.
* KAAR's gains are most pronounced in the Movement category and comparatively small in Extension.
* The category sizes vary substantially, from 55 problems (Movement) to 129 (Extension).
### Interpretation
Per-category accuracy is strongly model-dependent, with GPT-o3-mini leading throughout. KAAR's benefit also varies by transformation type: component-level movement profits most from object-centric augmentation, while extension tasks, which often operate at the pixel level and demand precise spatial inference across multiple components, see smaller gains. The low scores of all models on Recolour and Others indicate that these categories remain challenging regardless of augmentation, and the differing category sizes should be kept in mind when comparing percentages.
</details>
Figure 6: Accuracy on test instances $I_{t}$ for RSPC and KAAR across the movement, extension, recolor, and others categories using four LLMs. Each stacked bar shows RSPC accuracy (darker segment) and the additional improvement from KAAR (lighter segment).
Following prior work [9, 10], we categorize 400 problems in the ARC public evaluation set into four classes based on their primary transformations: (1) movement (55 problems), (2) extension (129 problems), (3) recolor (115 problems), and (4) others (101 problems). The others category comprises infrequent tasks such as noise removal, selection, counting, resizing, and problems with implicit patterns that hinder systematic classification into the aforementioned categories. See Appendix A.7 for examples of each category. Figure 6 illustrates the accuracy on test instances $I_{t}$ for RSPC and KAAR across four categories with evaluated LLMs. Each stacked bar represents RSPC accuracy and the additional improvement achieved by KAAR. KAAR consistently outperforms RSPC with the largest accuracy gain in movement (14.5% with QwQ-32B). In contrast, KAAR shows limited improvements in extension, since several problems involve pixel-level extension, which reduces the reliance on component-level recognition. Moreover, extension requires accurate spatial inference across multiple components and poses greater difficulty than movement, which requires mainly direction identification. Although KAAR augments spatial priors, LLMs still struggle to accurately infer positional relations among multiple components, consistent with prior findings [38, 39, 40]. Overlaps from component extensions further complicate reasoning, as LLMs often fail to recognize truncated components as unified wholes, contrary to human perceptual intuition.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Bar Chart: Accuracy on I_t vs. Average Image Size Interval
### Overview
This bar chart compares the accuracy on I_t (in percentage) for two models, GPT-o3-mini and QwQ-32B, using two different methods, RSPC and KAAR, across varying average image size intervals (width x height). The chart consists of grouped bar plots for each image size interval, with each group representing the accuracy of the four combinations of model and method. The total number of images used for each interval is also indicated.
### Components/Axes
* **X-axis:** Average Image Size Interval (width x height). The intervals are: (0, 25], (25, 100], (100, 225], (225, 400], (400, 625], (625, 900].
* **Y-axis:** Accuracy on I_t (%). The scale ranges from 0 to 80.
* **Legend:**
* Blue: GPT-o3-mini RSPC
* Light Blue: GPT-o3-mini KAAR
* Orange: QwQ-32B RSPC
* Pink: QwQ-32B KAAR
* **Total Count:** Below each interval on the x-axis, the total number of images used for that interval is displayed.
### Detailed Analysis
The chart presents six groups of bars, one for each image size interval. Within each group, there are four bars representing the accuracy of each model/method combination.
* **(0, 25]**:
* GPT-o3-mini RSPC: Approximately 73.7%
* GPT-o3-mini KAAR: Approximately 42.1%
* QwQ-32B RSPC: Approximately 15.8%
* QwQ-32B KAAR: Approximately 5.3%
* Total: 19
* **(25, 100]**:
* GPT-o3-mini RSPC: Approximately 48.9%
* GPT-o3-mini KAAR: Approximately 23.7%
* QwQ-32B RSPC: Approximately 11.5%
* QwQ-32B KAAR: Approximately 5.0%
* Total: 139
* **(100, 225]**:
* GPT-o3-mini RSPC: Approximately 24.8%
* GPT-o3-mini KAAR: Approximately 8.5%
* QwQ-32B RSPC: Approximately 4.7%
* QwQ-32B KAAR: Approximately 6.2%
* Total: 129
* **(225, 400]**:
* GPT-o3-mini RSPC: Approximately 5.9%
* GPT-o3-mini KAAR: Approximately 11.8%
* QwQ-32B RSPC: Approximately 2.0%
* QwQ-32B KAAR: Approximately 9.8%
* Total: 51
* **(400, 625]**:
* GPT-o3-mini RSPC: Approximately 5.1%
* GPT-o3-mini KAAR: Not visible, but likely low.
* QwQ-32B RSPC: Not visible, but likely low.
* QwQ-32B KAAR: Approximately 4.3%
* Total: 39
* **(625, 900]**:
* GPT-o3-mini RSPC: Not visible, but likely low.
* GPT-o3-mini KAAR: Not visible, but likely low.
* QwQ-32B RSPC: Not visible, but likely low.
* QwQ-32B KAAR: Approximately 4.3%
* Total: 23
**Trends:**
* For both models, the accuracy of RSPC generally decreases as the image size interval increases.
* GPT-o3-mini consistently outperforms QwQ-32B, especially in the smaller image size intervals.
* The difference in accuracy between RSPC and KAAR methods varies depending on the image size interval.
### Key Observations
* GPT-o3-mini achieves its highest accuracy (about 73.7%) in the smallest image size interval, (0, 25].
* Accuracy declines sharply for both models as average image size increases; beyond an average size of 400, very few problems are solved.
* The number of problems per interval varies considerably, peaking at 139 in the (25, 100] interval.
* Bars in the largest intervals, (400, 625] and (625, 900], are close to zero and difficult to read precisely from the chart.
### Interpretation
Both GPT-o3-mini and QwQ-32B handle small images far better than large ones, indicating that isolating the relevant pixels in large 2D grids remains difficult for current models. The uneven number of problems per interval also means the percentages for sparsely populated intervals should be read with caution. Overall, the chart highlights average image size as a dominant factor in ARC difficulty.
</details>
Figure 7: Accuracy on test instances $I_{t}$ for RSPC and KAAR across average image size intervals, evaluated using GPT-o3-mini and QwQ-32B. See Figure 12 in Appendix for the results with the other LLMs.
A notable feature of ARC is the variation in image size both within and across problems. We categorize tasks by averaging the image size per problem, computed over both training and test image pairs. We report the accuracy on $I_{t}$ for RSPC and KAAR across average image size intervals using GPT-o3-mini and QwQ-32B, the strongest proprietary and open-source models in Tables 1 and 2. As shown in Figure 7, both LLMs experience performance degradation as image size increases. When the average image size exceeds 400 (20×20), GPT-o3-mini solves only three problems, while QwQ-32B solves none. In ARC, isolating relevant pixels in larger images, represented as 2D matrices, requires effective attention mechanisms in LLMs, which remains an open challenge noted in recent work [41, 34]. KAAR consistently outperforms RSPC on problems with average image sizes below 400, benefiting from object-centric representations. By abstracting each image into components, KAAR reduces interference from irrelevant pixels, directs attention to salient components, and facilitates component-level transformation analysis. However, larger images often produce both oversized and numerous components after abstraction, which continue to challenge LLMs during reasoning. Oversized components hinder transformation execution, and numerous components complicate the identification of target components.
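The bucketing behind Figure 7 can be sketched as follows, assuming tasks are dicts of train/test pairs whose grids are 2D lists (an assumed representation, matching the public ARC JSON format):

```python
# Sketch of the size bucketing used for Figure 7: average image area per task
# over all training and test pairs, then assignment to an interval.
BINS = [(0, 25), (25, 100), (100, 225), (225, 400), (400, 625), (625, 900)]

def average_image_size(task):
    """Mean width * height over every input and output grid in the task."""
    images = [img for pair in task["train"] + task["test"]
              for img in (pair["input"], pair["output"])]
    return sum(len(g) * len(g[0]) for g in images) / len(images)

def size_bucket(avg):
    """Return the half-open interval (lo, hi] containing `avg`, if any."""
    for lo, hi in BINS:
        if lo < avg <= hi:
            return (lo, hi)
    return None
```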
<details>
<summary>x8.png Details</summary>

### Visual Description
## Line Chart: Accuracy on $I_{r}\&I_{t}$ vs. Iterations
### Overview
This line chart displays the accuracy on $I_{r}\&I_{t}$ (%) for four configurations (GPT-o3-mini and QwQ-32B, each with RSPC and KAAR) across iterations 1 to 12. The x-axis represents the number of iterations, and the y-axis represents the accuracy percentage. The chart is divided into three stages: Objectness (iterations 1-4); Geometry, Topology, Numbers and Counting (iterations 5-8); and Goal-directedness (iterations 9-12).
### Components/Axes
* **X-axis:** "# Iterations" - Scale from 1 to 12. Marked with vertical dashed lines at 1, 4, 8, and 12, corresponding to the stage boundaries.
* **Y-axis:** "Accuracy on $I_{r}\&I_{t}$ (%)" - Scale from 3.5 to 35.
* **Legend:** Located in the top-right corner. Contains the following labels and corresponding colors:
* GPT-o3-mini: RSPC (Blue)
* GPT-o3-mini: KAAR (Green)
* QwQ-32B: RSPC (Red)
* QwQ-32B: KAAR (Brown)
### Detailed Analysis
The chart shows four distinct lines, each representing a model's performance.
**GPT-o3-mini: RSPC (Blue)**
* Trend: The line rises quickly over the first iterations and largely plateaus after iteration 8.
* Data Points:
* Iteration 1: ~20.75%
* Iteration 4: ~26.25%
* Iteration 8: ~28.25%
* Iteration 12: ~29.25%
**GPT-o3-mini: KAAR (Green)**
* Trend: The line starts near the RSPC curve and continues to rise through the final iterations, pulling ahead as additional priors are augmented.
* Data Points:
* Iteration 1: ~21.25%
* Iteration 4: ~26.75%
* Iteration 8: ~29%
* Iteration 12: ~33%
**QwQ-32B: RSPC (Red)**
* Trend: The line shows an upward trend, but with more fluctuations than the GPT-o3-mini lines.
* Data Points:
* Iteration 1: ~4.5%
* Iteration 4: ~11.5%
* Iteration 8: ~15.5%
* Iteration 12: ~19%
**QwQ-32B: KAAR (Brown)**
* Trend: The line also slopes upward, but starts at the lowest accuracy and has the least steep slope.
* Data Points:
* Iteration 1: ~6.25%
* Iteration 4: ~11.5%
* Iteration 8: ~12.75%
* Iteration 12: ~19.25%
### Key Observations
* GPT-o3-mini consistently outperforms QwQ-32B across all iterations.
* RSPC improves rapidly over the first four iterations and largely plateaus by around iteration 8.
* For both models, KAAR tracks RSPC early on and pulls ahead as further priors are augmented, ending at higher final accuracy, consistent with Table 2.
* The KAAR-RSPC gap widens after each new level of priors and peaks after goal-directedness is incorporated (iterations 9-12).
### Interpretation
The curves trace cumulative accuracy on $I_{r}\&I_{t}$ as the iteration budget grows. Repeated sampling alone yields diminishing returns after the early iterations, whereas KAAR's stage-wise augmentation (objectness; geometry, topology, numbers, and counting; goal-directedness) continues to add solved problems at each stage. This suggests that the later gains come from the accumulated priors rather than from additional sampling iterations alone. The three stages also reflect increasing reasoning complexity, from identifying objects to inferring goal-directed transformations.
</details>
Figure 8: Variance in accuracy on $I_{r}\&I_{t}$ with increasing iterations for RSPC and KAAR using GPT-o3-mini and QwQ-32B. See Figure 13 in Appendix for the results with the other LLMs.
Figure 8 presents the variance in accuracy on $I_{r}\&I_{t}$ for RSPC and KAAR as the iteration count increases using GPT-o3-mini and QwQ-32B. For each task under KAAR, we include only iterations from the abstraction that solves both $I_{r}$ and $I_{t}$ . For KAAR, performance improvements across each 4-iteration block are driven by the solver backbone invocation after augmenting an additional level of priors: iterations 1–4 introduce objectness; 5–8 incorporate geometry, topology, numbers, and counting; 9–12 further involve goal-directedness. RSPC shows rapid improvement in the first 4 iterations and plateaus around iteration 8. At each iteration, the accuracy gap between KAAR and RSPC reflects the contribution of the accumulated priors. KAAR consistently outperforms RSPC, with the performance gap progressively increasing after new priors are augmented and peaking after the integration of goal-directedness. We note that objectness priors alone yield marginal gains with GPT-o3-mini. However, the inclusion of object attributes and relational priors (iterations 5–8) leads to improvements in KAAR over RSPC, and this advantage is further amplified after the augmentation of goal-directedness priors (iterations 9–12). These results highlight the benefits of KAAR: representing core knowledge priors through a hierarchical, dependency-aware ontology enables KAAR to incrementally augment LLMs, perform stage-wise reasoning, and improve solution accuracy. Compared to augmenting all priors at once without stage-wise reasoning, KAAR consistently yields superior accuracy, as detailed in Appendix A.6.
## 6 Discussion
ARC and KAAR. ARC serves as a visual abstract reasoning benchmark, requiring models to infer transformations from a few examples for each unique task, rather than fitting a closed rule space as in RAVEN [42] and PGM [43]. ARC assumes tasks are solvable using core knowledge priors; however, these priors are intentionally left informally specified to preclude encoding complete solution rules [5]. This pushes models beyond closed-form rule fitting toward genuinely domain-general capabilities. While some of the knowledge in KAAR is tailored to ARC, its central contribution lies in representing knowledge through a hierarchical, dependency-aware ontology that enables progressive augmentation. This allows LLMs to gradually expand their reasoning scope and perform stage-wise inference, improving performance on ARC without relying on an exhaustive rule set. Moreover, the ontology of KAAR is transferable to other domains requiring hierarchical reasoning, such as robotic task planning [44], image captioning [45], and visual question answering [46], where similar knowledge priors and dependencies apply. In KAAR, knowledge augmentation increases token consumption, though the added token count remains relatively constant because all priors, except goal-directedness, are generated via image processing algorithms from GPAR. On GPT-o3-mini, augmentation tokens constitute around 60% of solver backbone token usage; on QwQ-32B, this overhead drops to about 20%, as the solver backbone itself consumes more tokens. See Appendix A.8 for a detailed discussion. Incorrect abstraction selection in KAAR also wastes tokens. However, accurate abstraction inference often requires validation through viable solutions, bringing the challenge back to solution generation.
<details>
<summary>x9.png Details</summary>

### Visual Description
## Diagram: Pattern Recognition/Transformation
### Overview
The image presents a visual pattern recognition problem. It shows three pairs of 8x8 pixel grids. In the first two pairs, a blue shape is transformed into a red shape. The third pair shows a blue shape and a question mark, implying the task is to predict the corresponding red shape. The grids are arranged in three rows, with the initial blue shape on the left and the transformed red shape on the right, separated by a right-pointing arrow.
### Components/Axes
There are no explicit axes or scales. The components are:
* **Blue Shapes:** Represent the initial state.
* **Red Shapes:** Represent the transformed state.
* **Arrows:** Indicate the transformation process.
* **Question Mark:** Represents the unknown transformed state.
* **Grid:** Each shape is displayed on an 8x8 grid of squares.
### Detailed Analysis or Content Details
Let's analyze the transformations:
* **Row 1:** The initial blue shape resembles the letter "U". The transformed red shape also resembles a "U", but with the interior filled in. The transformation appears to be filling the enclosed space with red.
* **Row 2:** The initial blue shape is a hollow square. The transformed red shape is a solid square. The transformation appears to be filling the enclosed space with red.
* **Row 3:** The initial blue shape is a "C". The question mark indicates the expected transformation. Based on the previous two examples, we can infer that the transformed shape should be a solid "C", filled with red.
### Key Observations
The transformation consistently involves filling the enclosed space of the initial blue shape with red. The shapes are relatively simple geometric forms. The pattern is consistent across the first two examples.
### Interpretation
The diagram demonstrates a simple pattern recognition task. The underlying principle is to identify the enclosed space within a shape and fill it with a different color. The question mark suggests a test of the observer's ability to extrapolate this pattern to a new shape. The diagram is likely used to assess visual reasoning or pattern completion skills. The consistent application of the "fill enclosed space" rule suggests a deterministic transformation. The use of simple shapes and colors makes the pattern easily discernible. The diagram is a visual analogy for a logical operation or algorithm.
</details>
Figure 9: Fragment of ARC problem e7dd8335.
Solution Analysis. RSPC achieves over 30% accuracy across evaluated metrics using GPT-o3-mini, even without knowledge augmentation. To assess its alignment with core knowledge priors, we manually reviewed RSPC-generated solution plans and code that successfully solve $I_{t}$ with GPT-o3-mini. RSPC tends to solve problems without object-centric reasoning. For instance, in Figure 1, it shifts each row downward by one and pads the top with zeros, rather than reasoning over objectness to move each 4-connected component down by one step. Even when applying objectness, RSPC typically defaults to 4-connected abstraction, failing on the problem in Figure 9, where the test input clearly requires 8-connected abstraction. We note that object recognition in ARC involves grouping pixels into task-specific components based on clustering rules, differing from feature extraction approaches [47] in conventional computer vision tasks. Recent work seeks to bridge this gap by incorporating 2D positional encodings and object indices into Vision Transformers [41]. However, its reliance on data-driven learning weakens generalization, undermining ARC’s core objective. In contrast, KAAR enables objectness through explicitly defined abstractions, implemented via standard image processing algorithms, thus ensuring both accuracy and generalization.
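The contrast between the two solution styles for the Figure 1 task can be sketched as follows (a simplified illustration, assuming 0 denotes the background color and that no object touches the bottom row; the function names are ours):

```python
from collections import deque

def shift_rows_down(grid):
    """Non-object-centric: shift every row down by one, pad the top with 0s."""
    return [[0] * len(grid[0])] + [row[:] for row in grid[:-1]]

def components(grid, offsets):
    """Group non-background (non-zero) pixels into connected components via BFS."""
    h, w = len(grid), len(grid[0])
    seen, comps = set(), []
    for r in range(h):
        for c in range(w):
            if grid[r][c] and (r, c) not in seen:
                comp, queue = [], deque([(r, c)])
                seen.add((r, c))
                while queue:
                    cr, cc = queue.popleft()
                    comp.append((cr, cc))
                    for dr, dc in offsets:
                        nr, nc = cr + dr, cc + dc
                        if 0 <= nr < h and 0 <= nc < w and grid[nr][nc] and (nr, nc) not in seen:
                            seen.add((nr, nc))
                            queue.append((nr, nc))
                comps.append(comp)
    return comps

FOUR = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # 4-connectivity: no diagonals

def move_components_down(grid):
    """Object-centric: move each 4-connected component down one step
    (assumes no component occupies the bottom row)."""
    out = [[0] * len(grid[0]) for _ in grid]
    for comp in components(grid, FOUR):
        for r, c in comp:
            out[r + 1][c] = grid[r][c]
    return out
```

Both functions produce the same output grid on such tasks; the difference lies in the reasoning, since only the second groups pixels into 4-connected components before acting on them.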
Generalization. For all evaluated ARC solvers, accuracy on $I_{r}$ consistently exceeds that on $I_{r}\&I_{t}$ , revealing a generalization gap. Planning-aided code generation methods, such as RSPC and KAAR, exhibit smaller gaps than other solvers, though the issue persists. One reason is that solutions include low-level logic for the training pairs, thus failing to generalize. See Appendix A.9 for examples. Another reason is the usage of incorrect abstractions. For example, reliance solely on 4-connected abstraction leads RSPC to solve only $I_{r}$ in Figure 9. KAAR similarly fails to generalize in this case. It selects 4-connected abstraction, the first one that solves $I_{r}$ , to report accuracy on $I_{t}$ , instead of the correct 8-connected abstraction, as the former is considered simpler. Table 1 also reveals that LLMs differ in their generalization across ARC solvers. While a detailed analysis of these variations is beyond the scope of this study, investigating the underlying causes could offer insights into LLM inference and alignment with intended behaviors, presenting a promising direction for future work.
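The abstraction mismatch discussed above is easy to reproduce: the same pixel pattern decomposes into a different number of objects depending on the chosen connectivity (an illustrative sketch; the function name is ours):

```python
def count_components(grid, diagonals):
    """Count connected components of non-zero pixels, with diagonal
    adjacency (8-connectivity) or without it (4-connectivity)."""
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if diagonals:
        offsets += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    h, w = len(grid), len(grid[0])
    seen, count = set(), 0
    for r in range(h):
        for c in range(w):
            if grid[r][c] and (r, c) not in seen:
                count += 1
                stack = [(r, c)]
                seen.add((r, c))
                while stack:
                    cr, cc = stack.pop()
                    for dr, dc in offsets:
                        nr, nc = cr + dr, cc + dc
                        if 0 <= nr < h and 0 <= nc < w and grid[nr][nc] and (nr, nc) not in seen:
                            seen.add((nr, nc))
                            stack.append((nr, nc))
    return count

# Two diagonally touching pixels: one object under 8-connectivity,
# two objects under 4-connectivity.
diag = [[1, 0],
        [0, 1]]
```

A solver that defaults to 4-connectivity treats the diagonal pair as two objects, which is exactly the failure mode described for Figure 9.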
## 7 Conclusion
We explored the generalization and abstract reasoning capabilities of recent reasoning-oriented LLMs on the ARC benchmark using nine candidate solvers. Experimental results show that repeated-sampling planning-aided code generation (RSPC) achieves the highest test accuracy and demonstrates consistent generalization across most evaluated LLMs. To further improve performance, we propose KAAR, which progressively augments LLMs with core knowledge priors organized into hierarchical levels based on their dependencies, and applies RSPC after augmenting each level of priors to enable stage-wise reasoning. KAAR improves LLM performance on the ARC benchmark while maintaining strong generalization compared to non-augmented RSPC. However, ARC remains challenging even for the most capable reasoning-oriented LLMs, given its emphasis on abstract reasoning and generalization, highlighting current limitations and motivating future research.
## References
- Khan et al. [2021] Abdullah Ayub Khan, Asif Ali Laghari, and Shafique Ahmed Awan. Machine learning in computer vision: A review. EAI Endorsed Transactions on Scalable Information Systems, 8(32), 2021.
- Otter et al. [2020] Daniel W Otter, Julian R Medina, and Jugal K Kalita. A survey of the usages of deep learning for natural language processing. IEEE Transactions on Neural Networks and Learning Systems, 32(2):604–624, 2020.
- Grigorescu et al. [2020] Sorin Grigorescu, Bogdan Trasnea, Tiberiu Cocias, and Gigel Macesanu. A survey of deep learning techniques for autonomous driving. Journal of Field Robotics, 37(3):362–386, 2020.
- Lake et al. [2017] Brenden M Lake, Tomer D Ullman, Joshua B Tenenbaum, and Samuel J Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40:e253, 2017.
- Chollet [2019] François Chollet. On the measure of intelligence. arXiv preprint arXiv:1911.01547, 2019.
- Peirce [1868] Charles S Peirce. Questions concerning certain faculties claimed for man. The Journal of Speculative Philosophy, 2(2):103–114, 1868.
- Spelke and Kinzler [2007] Elizabeth S Spelke and Katherine D Kinzler. Core knowledge. Developmental Science, 10(1):89–96, 2007.
- Gulwani et al. [2017] Sumit Gulwani, Oleksandr Polozov, Rishabh Singh, et al. Program synthesis. Foundations and Trends® in Programming Languages, 4:1–119, 2017.
- Xu et al. [2023a] Yudong Xu, Elias B Khalil, and Scott Sanner. Graphs, constraints, and search for the abstraction and reasoning corpus. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI, pages 4115–4122, 2023a.
- Lei et al. [2024a] Chao Lei, Nir Lipovetzky, and Krista A Ehinger. Generalized planning for the abstraction and reasoning corpus. In Proceedings of the 38th AAAI Conference on Artificial Intelligence, AAAI, pages 20168–20175, 2024a.
- Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Proceedings of the 36th Advances in Neural Information Processing Systems, NeurIPS, pages 24824–24837, 2022.
- Chen et al. [2021] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Li et al. [2022] Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 378:1092–1097, 2022.
- Chen et al. [2023] Bei Chen, Fengji Zhang, Anh Nguyen, Daoguang Zan, Zeqi Lin, Jian-Guang Lou, and Weizhu Chen. Codet: Code generation with generated tests. In Proceedings of the 11th International Conference on Learning Representations, ICLR, pages 1–19, 2023.
- Zhang et al. [2023] Tianyi Zhang, Tao Yu, Tatsunori Hashimoto, Mike Lewis, Wen-tau Yih, Daniel Fried, and Sida Wang. Coder reviewer reranking for code generation. In Proceedings of the 40th International Conference on Machine Learning, ICML, pages 41832–41846, 2023.
- Ni et al. [2023] Ansong Ni, Srini Iyer, Dragomir Radev, Veselin Stoyanov, Wen-tau Yih, Sida Wang, and Xi Victoria Lin. Lever: Learning to verify language-to-code generation with execution. In Proceedings of the 40th International Conference on Machine Learning, ICML, pages 26106–26128, 2023.
- Zhong et al. [2024a] Li Zhong, Zilong Wang, and Jingbo Shang. Debug like a human: A large language model debugger via verifying runtime execution step by step. In Findings of the Association for Computational Linguistics: ACL 2024, pages 851–870, 2024a.
- Lei et al. [2024b] Chao Lei, Yanchuan Chang, Nir Lipovetzky, and Krista A Ehinger. Planning-driven programming: A large language model programming workflow. arXiv preprint arXiv:2411.14503, 2024b.
- Chen et al. [2024] Xinyun Chen, Maxwell Lin, Nathanael Schärli, and Denny Zhou. Teaching large language models to self-debug. In Proceedings of the 12th International Conference on Learning Representations, ICLR, 2024.
- Austin et al. [2021] Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models. arXiv preprint arXiv:2108.07732, 2021.
- Jain et al. [2025] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. In Proceedings of the 13th International Conference on Learning Representations, ICLR, 2025.
- Jiang et al. [2023] Xue Jiang, Yihong Dong, Lecheng Wang, Fang Zheng, Qiwei Shang, Ge Li, Zhi Jin, and Wenpin Jiao. Self-planning code generation with large language models. ACM Transactions on Software Engineering and Methodology, 33(7):1–28, 2023.
- Islam et al. [2024] Md. Ashraful Islam, Mohammed Eunus Ali, and Md Rizwan Parvez. MapCoder: Multi-agent code generation for competitive problem solving. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, ACL, pages 4912–4944, 2024.
- Zhong et al. [2024b] Tianyang Zhong, Zhengliang Liu, Yi Pan, Yutong Zhang, Yifan Zhou, Shizhe Liang, Zihao Wu, Yanjun Lyu, Peng Shu, Xiaowei Yu, et al. Evaluation of openai o1: Opportunities and challenges of agi. arXiv preprint arXiv:2409.18486, 2024b.
- OpenAI [2025] OpenAI. OpenAI o3-mini. OpenAI, 2025. URL https://openai.com/index/openai-o3-mini/. Accessed: 2025-03-22.
- DeepMind [2024] Google DeepMind. Gemini 2.0 flash thinking. Google DeepMind, 2024. URL https://deepmind.google/technologies/gemini/flash-thinking/. Accessed: 2025-03-22.
- Guo et al. [2025] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- Cloud [2025] Alibaba Cloud. Alibaba cloud unveils qwq-32b: A compact reasoning model with cutting-edge performance. Alibaba Cloud, 2025. URL https://www.alibabacloud.com/blog/alibaba-cloud-unveils-qwq-32b-a-compact-reasoning-model-with-cutting-edge-performance_602039. Accessed: 2025-03-22.
- Babakr et al. [2019] Zana H Babakr, Pakstan Mohamedamin, and Karwan Kakamad. Piaget’s cognitive developmental theory: Critical review. Education Quarterly Reviews, 2(3):517–524, 2019.
- Deng et al. [2024] Hourui Deng, Hongjie Zhang, Jie Ou, and Chaosheng Feng. Can llm be a good path planner based on prompt engineering? mitigating the hallucination for path planning. arXiv preprint arXiv:2408.13184, 2024.
- Meng et al. [2024] Silin Meng, Yiwei Wang, Cheng-Fu Yang, Nanyun Peng, and Kai-Wei Chang. LLM-a*: Large language model enhanced incremental heuristic search on path planning. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1087–1102, 2024.
- Ahn et al. [2024] Janice Ahn, Rishu Verma, Renze Lou, Di Liu, Rui Zhang, and Wenpeng Yin. Large language models for mathematical reasoning: Progresses and challenges. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop, EACL, pages 225–237, 2024.
- Zang et al. [2025] Yuhang Zang, Wei Li, Jun Han, Kaiyang Zhou, and Chen Change Loy. Contextual object detection with multimodal large language models. International Journal of Computer Vision, 133(2):825–843, 2025.
- Xu et al. [2023b] Yudong Xu, Wenhao Li, Pashootan Vaezipoor, Scott Sanner, and Elias B Khalil. Llms and the abstraction and reasoning corpus: Successes, failures, and the importance of object-based representations. arXiv preprint arXiv:2305.18354, 2023b.
- Lei et al. [2023] Chao Lei, Nir Lipovetzky, and Krista A Ehinger. Novelty and lifted helpful actions in generalized planning. In Proceedings of the International Symposium on Combinatorial Search, SoCS, pages 148–152, 2023.
- Wang et al. [2024] Ruocheng Wang, Eric Zelikman, Gabriel Poesia, Yewen Pu, Nick Haber, and Noah Goodman. Hypothesis search: Inductive reasoning with language models. In Proceedings of the 12th International Conference on Learning Representations, ICLR, 2024.
- LeGris et al. [2024] Solim LeGris, Wai Keen Vong, Brenden M Lake, and Todd M Gureckis. H-arc: A robust estimate of human performance on the abstraction and reasoning corpus benchmark. arXiv preprint arXiv:2409.01374, 2024.
- Yamada et al. [2024] Yutaro Yamada, Yihan Bao, Andrew Kyle Lampinen, Jungo Kasai, and Ilker Yildirim. Evaluating spatial understanding of large language models. Transactions on Machine Learning Research, 2024.
- Cohn and Hernandez-Orallo [2023] Anthony G Cohn and Jose Hernandez-Orallo. Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of llms. arXiv preprint arXiv:2304.11164, 2023.
- Bang et al. [2023] Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, Quyet V. Do, Yan Xu, and Pascale Fung. A multitask, multilingual, multimodal evaluation of ChatGPT on reasoning, hallucination, and interactivity. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, IJCNLP-AACL, pages 675–718, 2023.
- Li et al. [2024a] Wenhao Li, Yudong Xu, Scott Sanner, and Elias Boutros Khalil. Tackling the abstraction and reasoning corpus with vision transformers: the importance of 2d representation, positions, and objects. arXiv preprint arXiv:2410.06405, 2024a.
- Raven [2000] John Raven. The raven’s progressive matrices: change and stability over culture and time. Cognitive Psychology, 41(1):1–48, 2000.
- Barrett et al. [2018] David Barrett, Felix Hill, Adam Santoro, Ari Morcos, and Timothy Lillicrap. Measuring abstract reasoning in neural networks. In Proceedings of the International Conference on Machine Learning, ICML, pages 511–520, 2018.
- Cui et al. [2025] Yongcheng Cui, Ying Zhang, Cui-Hua Zhang, and Simon X Yang. Task cognition and planning for service robots. Intelligence & Robotics, (1):119–142, 2025.
- Stefanini et al. [2022] Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Silvia Cascianelli, Giuseppe Fiameni, and Rita Cucchiara. From show to tell: A survey on deep learning-based image captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence, (1):539–559, 2022.
- Huynh et al. [2025] Ngoc Dung Huynh, Mohamed Reda Bouadjenek, Sunil Aryal, Imran Razzak, and Hakim Hacid. Visual question answering: from early developments to recent advances–a survey. arXiv preprint arXiv:2501.03939, 2025.
- Zhao et al. [2019] Zhong-Qiu Zhao, Peng Zheng, Shou-tao Xu, and Xindong Wu. Object detection with deep learning: A review. IEEE Transactions on Neural Networks and Learning Systems, 30(11):3212–3232, 2019.
- Mialon et al. [2023] Grégoire Mialon, Roberto Dessi, Maria Lomeli, Christoforos Nalmpantis, Ramakanth Pasunuru, Roberta Raileanu, Baptiste Roziere, Timo Schick, Jane Dwivedi-Yu, Asli Celikyilmaz, Edouard Grave, Yann LeCun, and Thomas Scialom. Augmented language models: a survey. Transactions on Machine Learning Research, 2023. ISSN 2835-8856.
- Zhu et al. [2025] Yuqi Zhu, Shuofei Qiao, Yixin Ou, Shumin Deng, Shiwei Lyu, Yue Shen, Lei Liang, Jinjie Gu, Huajun Chen, and Ningyu Zhang. KnowAgent: Knowledge-augmented planning for LLM-based agents. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 3709–3732, 2025.
- Vu et al. [2024] Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, and Thang Luong. FreshLLMs: Refreshing large language models with search engine augmentation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 13697–13720, 2024.
- Li et al. [2024b] Xingxuan Li, Ruochen Zhao, Yew Ken Chia, Bosheng Ding, Shafiq Joty, Soujanya Poria, and Lidong Bing. Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources. In Proceedings of the 12th International Conference on Learning Representations, ICLR, 2024b.
- Trivedi et al. [2023] Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, ACL, pages 10014–10037, 2023.
- Qiao et al. [2024] Shuofei Qiao, Honghao Gui, Chengfei Lv, Qianghuai Jia, Huajun Chen, and Ningyu Zhang. Making language models better tool learners with execution feedback. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL, pages 3550–3568, 2024.
- Wind [2020] J S Wind. 1st place solution + code and official documentation. https://www.kaggle.com/competitions/abstraction-and-reasoning-challenge/discussion/154597, 2020. Accessed: 2025-03-22.
- Camposampiero et al. [2023] Giacomo Camposampiero, Loic Houmard, Benjamin Estermann, Joël Mathys, and Roger Wattenhofer. Abstract visual reasoning enabled by language. arXiv preprint arXiv:2306.04091, 2023.
- Min [2023] Tan John Chong Min. An approach to solving the abstraction and reasoning corpus (arc) challenge. arXiv preprint arXiv:2306.03553, 2023.
- Tan and Motani [2024] John Chong Min Tan and Mehul Motani. Llms as a system of multiple expert agents: An approach to solve the abstraction and reasoning corpus (arc) challenge. In Proceedings of the 2024 IEEE Conference on Artificial Intelligence, CAI, pages 782–787, 2024.
- Bikov et al. [2024] Kiril Bikov, Mikel Bober-Irizar, and Soumya Banerjee. Reflection system for the abstraction and reasoning corpus. In Proceedings of the 2nd AI4Research Workshop: Towards a Knowledge-grounded Scientific Research Lifecycle, 2024.
- Franzen et al. [2024] Daniel Franzen, Jan Disselhoff, and David Hartmann. The llm architect: Solving arc-agi is a matter of perspective. https://github.com/da-fr/arc-prize-2024/blob/main/the_architects.pdf, 2024. Accessed: 2025-03-22.
- Hodel [2024] Michael Hodel. Addressing the abstraction and reasoning corpus via procedural example generation. arXiv preprint arXiv:2404.07353, 2024.
- Moskvichev et al. [2023] Arseny Moskvichev, Victor Vikram Odouard, and Melanie Mitchell. The conceptarc benchmark: Evaluating understanding and generalization in the arc domain. arXiv preprint arXiv:2305.07141, 2023.
- Li et al. [2025] Wen-Ding Li, Keya Hu, Carter Larsen, Yuqing Wu, Simon Alford, Caleb Woo, Spencer M. Dunn, Hao Tang, Wei-Long Zheng, Yewen Pu, and Kevin Ellis. Combining induction and transduction for abstract reasoning. In Proceedings of the 13th International Conference on Learning Representations, ICLR, 2025.
- Barke et al. [2024] Shraddha Barke, Emmanuel Anaya Gonzalez, Saketh Ram Kasibatla, Taylor Berg-Kirkpatrick, and Nadia Polikarpova. Hysynth: Context-free llm approximation for guiding program synthesis. In Proceedings of the 38th Advances in Neural Information Processing Systems, NeurIPS, pages 15612–15645, 2024.
## Appendix A Appendix
### A.1 Related Work
Knowledge-Augmented LLMs. Augmenting LLMs with external knowledge can improve reasoning capabilities and mitigate hallucination in text generation [48]. Previous studies achieve this by incorporating domain-specific knowledge, designed by human experts [49], retrieved via search engines [50], or extracted from Wikipedia documents [51]. Trivedi et al. [52] demonstrated that interleaving knowledge augmentation within reasoning steps further reduces model hallucination, resulting in more accurate multi-step reasoning. Additionally, augmenting LLMs with execution feedback improves performance on both question answering [53] and program synthesis tasks [10, 17, 19].
Search in DSL. An abstract, expressive, and compositional representation of core knowledge priors is essential for solving ARC tasks [5]. Previous studies have manually encoded these priors into domain-specific languages (DSLs) with lifted relational representations [9, 10, 54]. Various program synthesis methods have been proposed to search for valid solution programs within their DSLs, including DAG-based search [54], graph-based constraint-guided search [9], and generalized planning [10]. Hand-crafted DSLs encode core knowledge priors with high precision and interpretability, enabling structured program synthesis. However, comprehensive DSLs induce large search spaces, limiting synthesis efficiency.
LLMs for ARC. Recent studies have explored using LLMs as ARC solvers to directly generate test output matrices and have prompted LLMs with different problem descriptions to improve output accuracy. Camposampiero et al. [55] employed LLMs to generate output grids from textual task descriptions, derived from a vision module which is designed to capture human-like visual priors. Min [56] prompted LLMs with the raw 2D matrices of each task, along with transformation and abstraction examples. Xu et al. [34] demonstrated that object representations derived from predefined abstractions can improve LLM performance on ARC tasks. Recent advances in code generation by LLMs [18, 17, 14] highlight their potential to replace search-based program synthesis, addressing efficiency limitations. Tan and Motani [57] evaluated LLM performance on the ARC benchmark by generating Python program solutions. Additionally, Wang et al. [36] approached ARC as an inductive reasoning problem and introduced hypothesis search, where program solutions are generated by selecting LLM-generated hypotheses encoded as functions.
Training-Based Methods. To further improve LLM performance, Bikov et al. [58] fine-tuned LLMs on augmented ARC tasks using standard techniques such as rotation, flipping, and permutation. Beyond these methods, Franzen et al. [59] fine-tuned LLMs on large-scale synthetic ARC tasks [60] and ARC-related datasets such as Concept-ARC [61] and ARC-Heavy [62], achieving a state-of-the-art 56% accuracy on the private evaluation set of 200 tasks. Instead of fine-tuning LLMs, Barke et al. [63] trained a probabilistic context-free grammar (PCFG) using LLM-generated plausible solutions to learn weighted functions. This enables the synthesizer to efficiently generate final program solutions. However, this approach requires a dedicated synthesizer for each DSL, limiting its generalization.
When leveraging LLMs as ARC solvers, existing studies tend to report accuracy on subsets of the training problems and overlook the core principle of ARC, where solutions should be constructed using core knowledge priors [5]. LLMs still lack these priors, such as objectness, as evidenced by RSPC-generated solutions. Although fine-tuning approaches have achieved state-of-the-art performance, their failure to incorporate core knowledge priors remains a fundamental limitation. KAAR addresses this gap by progressively augmenting LLMs with structured core knowledge priors introduced by GPAR, along with its own implementations of goal-directedness priors. It interleaves augmentation within the reasoning process by applying an advanced LLM-based program synthesis solver, tailored to the ARC benchmark, after augmenting priors at each level. KAAR achieves strong performance, 32.5% test accuracy on the full evaluation set of 400 problems using GPT-o3-mini, demonstrates substantial generalization, and produces solutions aligned with core knowledge priors.
### A.2 Core Knowledge Priors in KAAR
KAAR incorporates abstractions to enable objectness priors; component attributes, relations, and statistical analysis of component attributes to encode geometry, topology, numbers, and counting priors; and predefined actions to support goal-directedness priors. Table 5 presents all abstractions used in KAAR, ordered by priority. KAAR adopts fundamental abstractions, such as 4-connected and 8-connected components, from GPAR, and extends them with additional abstractions unique to KAAR, highlighted in red. Table 6 introduces the geometry, topology, numbers, and counting priors, and the ten predefined actions used in KAAR. For each action, KAAR augments the LLM with its corresponding schema to resolve implementation details. The actions and their schemas are detailed in Table A.12. Most actions can be specified within three steps, keeping them tractable for LLMs.
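As a rough illustration of how such priors can be derived mechanically from a grid (the function names are ours, not KAAR's actual implementation), counting-style priors reduce to simple statistics over pixel colors, and geometry priors such as bounding boxes reduce to coordinate extrema:

```python
from collections import Counter

def grid_color_stats(grid, background=0):
    """Illustrative numbers-and-counting priors over a grid: per-color
    pixel counts and the most/least frequent non-background colors."""
    counts = Counter(v for row in grid for v in row if v != background)
    ranked = counts.most_common()  # sorted by descending frequency
    return {
        "color_counts": dict(counts),
        "most_frequent_color": ranked[0][0] if ranked else None,
        "least_frequent_color": ranked[-1][0] if ranked else None,
    }

def bounding_box(pixels):
    """Illustrative geometry prior: the bounding box
    (top, left, bottom, right) of a component's pixel coordinates."""
    rows = [r for r, _ in pixels]
    cols = [c for _, c in pixels]
    return (min(rows), min(cols), max(rows), max(cols))
```

Such statistics are what image processing code can supply cheaply to the LLM at the geometry/topology/numbers/counting level of the ontology.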
<details>
<summary>x10.png Details</summary>

### Visual Description
## Diagram: ARC Input-Output Pairs
### Overview
The image shows ARC problem 0520fde7 as a series of input-output grid pairs, with the output of the final (test) pair replaced by a question mark. Each input grid contains blue and gray pixels, with a gray column running down the middle; each output grid is smaller and contains red and black pixels. Inputs and outputs are separated by right-pointing arrows.
### Components/Axes
There are no explicit axes or legends. The components are the colored pixels within each grid, the arrows separating each input from its output, and the question mark marking the unknown test output.
### Detailed Analysis or Content Details
Each input grid is divided by the gray middle-vertical column into a left half and a right half of equal size. Overlaying the two halves produces the output grid: a pixel is colored red where the corresponding pixels in both halves are blue, and black otherwise.
### Key Observations
The gray column acts purely as a separator rather than content. The output grid matches the size of one half of the input, and red pixels appear only at positions where both halves contain blue.
### Interpretation
The diagram illustrates an ARC task whose transformation is a pixel-wise conjunction over the two halves of the input: split along the middle-vertical grid line, overlap the halves, and mark blue-blue coincidences in red. The question mark tests whether the observer can apply this rule to the new input.
</details>
Figure 10: ARC problem 0520fde7
### A.3 Restrictions in KAAR
For certain abstractions, some priors are either inapplicable or exclusive. The specific priors assigned to some abstractions are detailed in Table 8. For the whole image abstraction, few priors apply as only a single component is present. In contrast, the 4/8-connected-multi-color-non-background abstractions retain most priors. The highlighted priors that capture per-component color diversity are used exclusively for 4/8-connected-multi-color-non-background abstractions, while priors tailored to a single-color component, such as components with same color, components with most frequent color, and components with least frequent color, are excluded. For the middle-vertical and middle-horizontal abstractions, where the image is evenly divided into two components, flipping and movement actions are enabled to facilitate reasoning over overlapping components. For instance, in the problem shown in Figure 10, the solution involves splitting the image along a middle-vertical grid line and moving one component to overlap the other. In the resulting component, a pixel is colored red if the overlapping pixels in both components are blue; otherwise, it is colored black.
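The rule described for Figure 10 can be sketched directly (an illustrative implementation; the color codes BLUE=1, RED=2, BLACK=0 and the assumption of a single middle separator column are ours):

```python
BLUE, RED, BLACK = 1, 2, 0  # assumed ARC color codes, for illustration only

def solve_0520fde7(grid):
    """Sketch of the rule for problem 0520fde7: split the grid at the
    middle-vertical separator column, overlap the two halves, and color
    a pixel red where both halves are blue, black otherwise.
    Assumes an odd width with a one-column separator in the middle."""
    mid = len(grid[0]) // 2  # index of the separator column
    return [
        [RED if row[c] == BLUE and row[mid + 1 + c] == BLUE else BLACK
         for c in range(mid)]
        for row in grid
    ]
```

This is exactly the kind of reasoning the middle-vertical abstraction enables: the flipping and movement actions let the LLM express overlap between the two split components.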
### A.4 Parameter Settings
KAAR operates on all LLMs through API access with the full conversational history. For proprietary models, GPT-o3-mini and Gemini-2.0 Flash-Thinking (Gemini-2.0), we use default parameter settings. For open-source models, DeepSeek-R1-Distill-Llama-70B (DeepSeek-R1-70B) and QwQ-32B, we set temperature to 0.6, top-p to 0.95, and top-k to 40 to reduce repetitive outputs and filter rare tokens while preserving generation diversity. We conduct experiments on a virtual machine with 4 NVIDIA A100 80GB GPUs.
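The decoding settings for the open-source models can be captured in a small configuration object. This is a hedged sketch: the request shape below follows common OpenAI-compatible serving APIs and is an assumption, not the exact client used in our experiments.

```python
# Sampling parameters described above for DeepSeek-R1-70B and QwQ-32B.
OPEN_SOURCE_SAMPLING = {
    "temperature": 0.6,  # dampen repetitive outputs
    "top_p": 0.95,       # nucleus sampling preserves generation diversity
    "top_k": 40,         # filter rare tokens
}

def build_request(model, messages, params=OPEN_SOURCE_SAMPLING):
    """Attach the full conversational history to every call (hypothetical
    request format modeled on OpenAI-compatible APIs)."""
    return {"model": model, "messages": messages, **params}
```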
### A.5 KAAR
Algorithm 1 presents the pseudocode of KAAR. For each abstraction, KAAR incrementally augments the LLM with core knowledge priors, structured into three dependency-aware levels: beginning with objectness (Line 5), followed by geometry and topology (Lines 10 and 12), numbers and counting (Line 14), and concluding with goal-directedness priors (Line 18). We note that KAAR encodes geometry and topology priors through component attributes (Line 9) and relations (Line 11). The full set of priors is detailed in Tables 5, 6, and A.12. After augmenting each level of priors, KAAR invokes the solver backbone (RSPC) at Lines 6, 15, and 19 to generate code solutions guided by text-based plans, allowing up to 4 iterations (Lines 25–37). In each iteration, the solver backbone first validates the generated code on the training instances $I_{r}$ ; if successful, it then evaluates the solution on the test instances $I_{t}$ . The solver backbone returns solve if the generated solution successfully solves $I_{t}$ after passing $I_{r}$ ; pass if only $I_{r}$ is solved; or continues to the next iteration if the solution fails on $I_{r}$ . If the solver backbone fails to solve $I_{r}$ within the allotted 4 iterations at Lines 6 and 15, KAAR augments the next level of priors. KAAR proceeds to the next abstraction when the solver backbone fails to solve $I_{r}$ at Line 19, after the 4-iteration limit. KAAR terminates abstraction iteration upon receiving either pass or solve from the solver backbone and reports accuracy on $I_{r}$ , $I_{t}$ , and $I_{r}\&I_{t}$ accordingly. If no abstraction fully solves $I_{r}$ , KAAR records the final code solution for each abstraction (Line 22), selects the one that passes the most training instances (Line 23), and evaluates it on $I_{t}$ to determine additional accuracy gains (Line 24).
KAAR generates priors offline using image processing algorithms introduced in GPAR at Lines 4, 9, 11 and 13. In contrast, KAAR enables goal-directedness priors at Line 18 by prompting the LLM to select the most suitable actions and identify their implementation details, as described in Table A.12. KAAR iterates over abstractions from simpler to more complex, following the order specified in Table 5. We note that the highest-priority abstraction is no abstraction, where KAAR degrades to the solver backbone (RSPC) as no priors are applied.
Input : LLM $\mathcal{M}$ ; ARC problem $\mathcal{P}=(I_{r},I_{t})$ ; description $Q=(I_{r},\{i^{i}\ |\ (i^{i},i^{o})\in I_{t}\})$ ; abstraction list $\mathcal{A}$ ; max iterations $t=4$
1 Function KnowledgeAugmentation($\mathcal{M}$, $Q$, $\mathcal{P}$, $\mathcal{A}$, $t$):
2 solutionList $\leftarrow[]$ ;
3 foreach abstraction $abs$ in $\mathcal{A}$ do
4 objectnessPriors $\leftarrow$ GenerateObjectnessPriors($Q$, $abs$);
5 AugmentKnowledge($\mathcal{M}$, objectnessPriors);
6 result, code, passedCount $\leftarrow$ SolverBackbone($\mathcal{M}$, $\mathcal{P}$, $Q$, $t$);
7 if result $\neq$ failure then
8 return result;
9 attributePriors $\leftarrow$ GenerateAttributePriors($Q$, $abs$);
10 AugmentKnowledge($\mathcal{M}$, attributePriors);
11 relationPriors $\leftarrow$ GenerateRelationPriors($Q$, $abs$);
12 AugmentKnowledge($\mathcal{M}$, relationPriors);
13 numberPriors $\leftarrow$ GenerateNumbersCountingPriors($Q$, $abs$);
14 AugmentKnowledge($\mathcal{M}$, numberPriors);
15 result, code, passedCount $\leftarrow$ SolverBackbone($\mathcal{M}$, $\mathcal{P}$, $Q$, $t$);
16 if result $\neq$ failure then
17 return result;
18 AugmentGoalPriors($\mathcal{M}$, $Q$, $abs$);
19 result, code, passedCount $\leftarrow$ SolverBackbone($\mathcal{M}$, $\mathcal{P}$, $Q$, $t$);
20 if result $\neq$ failure then
21 return result;
22 solutionList.append((code, passedCount));
23 bestCode $\leftarrow$ SelectMostPassed(solutionList);
24 return EvaluateOnTest(bestCode, $I_{t}$);
25 Function SolverBackbone($\mathcal{M}$, $\mathcal{P}$, $Q$, $t$):
26 i $\leftarrow 0$ ;
27 while i < t do
28 plan $\leftarrow\mathcal{M}$.generatePlan($Q$);
29 code $\leftarrow\mathcal{M}$.generateCode($Q$, plan);
30 passedCount $\leftarrow$ EvaluateOnTrain(code, $I_{r}$);
31 if passedCount == $|I_{r}|$ then
32 if EvaluateOnTest(code, $I_{t}$) then
33 return solve, code, passedCount;
34 else
35 return pass, code, passedCount;
36 i $\leftarrow$ i + 1;
37 return failure, code, passedCount;
Algorithm 1 KAAR
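The control flow of Algorithm 1 can be made runnable with the LLM, prior generators, and evaluators stubbed out. All names below are placeholders for the components described in the text; the sketch only illustrates the stage-wise augmentation loop, not a real solver.

```python
def solver_backbone(llm, problem, desc, t=4):
    """Repeated-sampling planning-aided code generation (RSPC) loop."""
    code, passed = None, 0
    for _ in range(t):
        plan = llm["plan"](desc)
        code = llm["code"](desc, plan)
        passed = problem["eval_train"](code)  # training instances I_r
        if passed == problem["n_train"]:
            status = "solve" if problem["eval_test"](code) else "pass"
            return status, code, passed
    return "failure", code, passed

def kaar(llm, problem, desc, abstractions, priors, t=4):
    """Stage-wise prior augmentation wrapped around the solver backbone."""
    solutions = []
    for abs_ in abstractions:
        # Level 1: objectness; Level 2: geometry/topology and counting;
        # Level 3: goal-directedness (actions selected by the LLM).
        for level in ("objectness", "geometry_topology_counting", "goal"):
            llm["augment"](priors(desc, abs_, level))
            result, code, passed = solver_backbone(llm, problem, desc, t)
            if result != "failure":
                return result
        solutions.append((code, passed))
    # Fall back to the candidate passing the most training instances.
    best_code = max(solutions, key=lambda s: s[1])[0]
    return "solve" if problem["eval_test"](best_code) else "failure"
```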
| Model | Metric | KAAR | KAAR ∗ | $\Delta$ |
| --- | --- | --- | --- | --- |
| Gemini-2.0 | $I_{r}$ | 25.75 | 23.00 | -2.75 |
| Gemini-2.0 | $I_{t}$ | 21.75 | 19.00 | -2.75 |
| Gemini-2.0 | $I_{r}\&I_{t}$ | 20.50 | 18.00 | -2.50 |
| QwQ-32B | $I_{r}$ | 22.25 | 18.50 | -3.75 |
| QwQ-32B | $I_{t}$ | 21.00 | 17.75 | -3.25 |
| QwQ-32B | $I_{r}\&I_{t}$ | 19.25 | 16.25 | -3.00 |
| DeepSeek-R1-70B | $I_{r}$ | 12.25 | 9.00 | -3.25 |
| DeepSeek-R1-70B | $I_{t}$ | 12.75 | 9.00 | -3.75 |
| DeepSeek-R1-70B | $I_{r}\&I_{t}$ | 11.50 | 8.50 | -3.00 |
Table 3: Accuracy on $I_{r}$ , $I_{t}$ , and $I_{r}\&I_{t}$ for KAAR and KAAR ∗ across three LLMs. KAAR ∗ invokes the solver backbone (RSPC) only after all knowledge priors are augmented. $\Delta$ denotes the performance drop relative to KAAR. All values are reported as percentages.
### A.6 Ablation Study
Table 3 reports the accuracy decrease resulting from removing incremental knowledge augmentation and stage-wise reasoning from KAAR, denoted as KAAR ∗. Unlike KAAR, which invokes the solver backbone (RSPC) after augmenting each level of priors to enable stage-wise reasoning, KAAR ∗ uses RSPC to solve the problem within 12 iterations after augmenting all priors at once. We evaluate KAAR ∗ using the same reasoning-oriented LLMs as in Tables 1 and 2, excluding GPT-o3-mini due to its computational cost. KAAR ∗ shows decreased accuracy on all metrics, $I_{r}$ , $I_{t}$ , and $I_{r}\&I_{t}$ , for all evaluated LLMs. These results underscore the effectiveness of progressive augmentation and stage-wise reasoning. Presenting all knowledge priors simultaneously introduces superfluous information, which may obscure viable solutions and impair the LLM's reasoning accuracy. We note that we construct the ontology of core knowledge priors based on their dependencies, thereby establishing a fixed augmentation order.
Figure 11: Example ARC tasks for movement, extension, recolor, and others categories.
### A.7 Example Tasks by Category in the ARC Evaluation Set
ARC comprises 1000 unique tasks, with 400 allocated to the training set and 600 to the evaluation set. The evaluation set is further divided into a public subset (400 tasks) and a private subset (200 tasks). Figure 11 illustrates example ARC tasks for the movement, extension, recolor, and others categories in the public evaluation set. In the movement example, components are shifted to the image boundary in directions determined by their colors. The extension example is more complex, requiring LLMs to find the shortest path between two red pixels while avoiding obstacles, which presents challenges for current reasoning-oriented models. Additionally, reliance on pixel-level recognition weakens the effectiveness of KAAR, which is designed to facilitate component identification. The recolor example involves changing non-black components to black and updating black components based on original non-black colors. The others example requires generating a blue diagonal line whose length depends on the number of 4-connected components in the input image that are green and have a size greater than one. The combination of numerical reasoning and structural pattern generation makes this task difficult to classify within the other three categories.
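The extension example's core subroutine, a shortest path between two cells that avoids obstacles, is a plain breadth-first search. The sketch below is illustrative only: the obstacle color and grid are our assumptions, not the actual task encoding.

```python
from collections import deque

def shortest_path(grid, start, goal, obstacle=5):
    """BFS shortest path on a grid, skipping obstacle-colored cells.
    Returns the list of cells from start to goal, or None if unreachable."""
    rows, cols = len(grid), len(grid[0])
    prev = {start: None}  # also serves as the visited set
    queue = deque([start])
    while queue:
        cell = queue.popleft()
        if cell == goal:
            path = []
            while cell is not None:  # walk predecessors back to start
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        r, c = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != obstacle
                    and (nr, nc) not in prev):
                prev[(nr, nc)] = cell
                queue.append((nr, nc))
    return None
```

Even with such a routine in hand, the task still demands that the model infer the need for pathfinding from a handful of examples, which is where current reasoning-oriented LLMs struggle.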
| Model | Knowledge Augmentation | Solver Backbone (RSPC) |
| --- | --- | --- |
| GPT-o3-mini | 66K | 106K |
| Gemini-2.0 | 58K | 110K |
| QwQ-32B | 79K | 427K |
| DeepSeek-R1-70B | 66K | 252K |
Table 4: Average token cost for knowledge augmentation and solver backbone (RSPC) in KAAR across four evaluated LLMs. K is $10^{3}$ .
### A.8 Cost Analysis
Table 4 reports the average token cost, including both prompts and LLM responses, for knowledge augmentation and the solver backbone (RSPC), when using KAAR as the ARC solver. For each ARC task, we consider the abstraction whose solution solves $I_{t}$ ; if none succeed, the one that passes $I_{r}$ ; otherwise, the abstraction with the lowest token usage is selected. Except for goal-directedness priors, all core knowledge priors in KAAR are generated offline using image processing algorithms from GPAR, resulting in comparable augmentation costs across all evaluated models. In contrast, token usage by the solver backbone varies substantially due to differences in the LLMs’ abstract reasoning and generalization capabilities. GPT-o3-mini solves most tasks efficiently, with the lowest token consumption by the solver backbone, where tokens used for knowledge augmentation account for approximately 62% of the solver backbone’s token usage. However, the solver backbone consumes more tokens with QwQ-32B, as QwQ-32B consistently generates longer reasoning traces. In this case, tokens used for knowledge augmentation constitute only 19% of the solver backbone’s token usage. Figure 14 illustrates the average token cost for augmenting priors at each level in KAAR.
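The two percentages quoted above follow directly from Table 4's averages; a minimal check (token costs in thousands, as reported):

```python
# Average token costs from Table 4, in thousands.
augmentation = {"GPT-o3-mini": 66, "Gemini-2.0": 58,
                "QwQ-32B": 79, "DeepSeek-R1-70B": 66}
solver = {"GPT-o3-mini": 106, "Gemini-2.0": 110,
          "QwQ-32B": 427, "DeepSeek-R1-70B": 252}
# Ratio of augmentation cost to solver-backbone cost per model:
ratio = {m: augmentation[m] / solver[m] for m in augmentation}
# 66/106 ≈ 0.62 for GPT-o3-mini and 79/427 ≈ 0.19 for QwQ-32B,
# matching the ~62% and ~19% figures in the text.
```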
### A.9 Generalization
Figures 15 and 16 illustrate two ARC problems, 695367ec and b1fc8b8e, where both RSPC and KAAR successfully solve the training instances $I_{r}$ but fail on the test instances $I_{t}$ when using GPT-o3-mini. For problem 695367ec, the correct solution involves generating a fixed 15×15 output image by repeatedly copying the input image, changing its color to black, and adding internal horizontal and vertical lines colored with the original input image’s color. However, the RSPC-generated code applies a distinct rule to each input image size without considering generalization. For problem b1fc8b8e, the solution requires accurate object recognition despite component contact, and correctly placing each component into one of the four corners. However, RSPC fails to recognize objectness, and its solution deviates from human intuition, being overfitted to $I_{r}$ . For problems 695367ec and b1fc8b8e, KAAR exhibits the same limitations, although it adopts abstractions to enable objectness. KAAR begins with the simplest abstraction, no abstraction, where KAAR degrades to RSPC. As a result, it generates the same solution as RSPC and terminates without attempting other abstractions, since the solution already solves $I_{r}$ and is then evaluated on $I_{t}$ , resulting in overfitting.
### A.10 Problem Coverage across ARC Solvers
We report the relative problem coverage across nine ARC solvers based on successful test instance solutions using GPT-o3-mini (Figure 17), Gemini-2.0 (Figure 18), QwQ-32B (Figure 19), and DeepSeek-R1-70B (Figure 20). Each cell $(i,j)$ indicates the proportion of problems solved by the row solver that are also solved by the column solver. This is computed as $\frac{|A_{i}\cap A_{j}|}{|A_{i}|}$ , where $A_{i}$ and $A_{j}$ are the sets of problems solved by the row and column solvers, respectively, following the same method used in Figure 5. Values close to 1 indicate that the column solver covers most problems solved by the row solver. GPT-o3-mini demonstrates the strongest overall coverage, with pairwise overlap consistently exceeding 0.55. Among all solvers, repeated sampling with standalone (P) and planning-aided code generation (PC) show the highest coverage, with column values consistently above 0.8 for GPT-o3-mini. This trend persists across Gemini-2.0, QwQ-32B, and DeepSeek-R1-70B. Under these models, repeated sampling with planning-aided code generation exhibits better alignment than its standalone code generation counterpart, generally yielding higher coverage values. However, planning-aided code generation under the direct generation setting shows weaker alignment, with column values around 0.40 for Gemini-2.0 and 0.35 for QwQ-32B. Among the four evaluated LLMs, DeepSeek-R1-70B demonstrates the lowest average off-diagonal coverage (i.e., $i\neq j$ ) of 0.603, suggesting potential output instability and variation attributable to solver choice.
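The coverage matrix above is straightforward to compute from the sets of solved problem IDs; the toy sets below are illustrative only, not paper data:

```python
def coverage(solved):
    """Pairwise coverage: cell (i, j) is |A_i ∩ A_j| / |A_i|, the fraction
    of problems solved by row solver i that column solver j also solves."""
    return {(i, j): len(solved[i] & solved[j]) / len(solved[i])
            for i in solved for j in solved}

# Two hypothetical solvers with overlapping solved-problem sets.
cov = coverage({"P": {1, 2, 3}, "PC": {2, 3, 4, 5}})
```

Note the matrix is asymmetric: a small solver can be fully covered by a large one without the converse holding.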
### A.11 Performance Analysis
Table 1 highlights performance variations across reasoning-oriented LLMs and ARC solvers with respect to both accuracy and generalization. Notably, the ARC solver, repeated sampling with standalone code generation, exhibits a substantial accuracy gap between $I_{r}$ and $I_{r}\&I_{t}$ , indicating limited generalization capability when using GPT-o3-mini and Gemini-2.0. In contrast, repeated sampling with planning-aided code generation demonstrates markedly improved generalization by preventing solutions from directly replicating the output matrices of training instances, as illustrated in Figure 21. This output copying, observed under repeated sampling with standalone code generation, accounts for approximately 24% and 95% of 83 and 101 overfitting problems with GPT-o3-mini and Gemini-2.0, respectively. When planning is incorporated, output copying is reduced to around 8% and 35% of 25 and 20 overfitting problems with GPT-o3-mini and Gemini-2.0, respectively. Additionally, the incorporation of planning facilitates accurate code generation. For example, in Figure 22, repeated sampling with planning-aided code generation produces a correct solution using GPT-o3-mini by replicating the input image horizontally or vertically based on the presence of a uniform row or column, as specified in the plan and implemented accordingly in code. In contrast, without planning assistance, standalone code generation produces incomplete logic, considering only whether the first column is uniform to determine the replication direction, which leads to failure on the test instance.
For the ARC benchmark, repeated sampling–based methods achieve higher accuracy on $I_{r}$ , $I_{t}$ , and $I_{r}\&I_{t}$ compared to refinement-based approaches when using GPT-o3-mini and Gemini-2.0. Figure 23 presents an ARC problem where repeated sampling with planning-aided code generation yields a correct solution, whereas its refinement variant fails to correct the initial erroneous code, and the flawed logic persists across subsequent refinements when using GPT-o3-mini. Previous studies have shown that refinement can benefit from control flow graph information [17] and verified plans [18], which assist LLMs in locating and correcting bugs. However, these methods typically incur substantial token consumption, making them difficult to scale affordably.
### A.12 Limitations
KAAR improves the performance of reasoning-oriented LLMs on ARC tasks by progressively prompting with core knowledge priors. Although this inevitably increases token usage, the trade-off can be justified, as the exploration of LLM generalization remains in its early stages. KAAR integrates diverse abstraction methods to enable objectness and iteratively applies abstractions in order of increasing complexity. In contrast, humans typically infer appropriate abstractions directly from training instances, rather than leveraging exhaustive search. To address this, we prompt different LLMs with raw 2D matrices of each ARC problem to select one to three relevant abstractions, but the results are unsatisfactory. As previously discussed, accurate abstraction inference often depends on validation through viable solutions, thereby shifting the challenge back to solution generation. Additionally, KAAR augments core knowledge priors through prompting but lacks mechanisms to enforce LLM adherence to these priors during reasoning. While the KAAR-generated solutions generally conform to core knowledge priors, the intermediate reasoning processes may deviate from the intended patterns. Future work could explore fine-tuning or reinforcement learning to better align model behavior with the desired reasoning patterns.
| No Abstraction | - |
| --- | --- |
| Whole Image | We consider the whole image as a component. |
| Middle-Vertical | We vertically split the image into two equal parts, treating each as a distinct component. |
| Middle-Horizontal | We horizontally split the image into two equal parts, treating each as a distinct component. |
| Multi-Lines | We use rows or columns with a uniform color to divide the input image into multiple components. |
| 4-Connected ∗ | We consider the 4-adjacent pixels of the same color as a component. |
| 4-Connected-Non-Background ∗ | We consider the 4-adjacent pixels of the same color as a component, excluding components with the background color. |
| 4-Connected-Non-Background-Edge ∗ | We consider the 4-adjacent pixels of the same color as a component, containing components with the background color when they are not attached to the edges of the image. |
| 4-Connected-Multi-Color-Non-Background ∗ | We consider 4-adjacent pixels as a component, which may contain different colors, while excluding components with the background color. |
| 4-Connected-Bounding-Box ∗ | We consider 4-adjacent pixels of the same color, and treat all pixels within their bounding box as a component, which may include different colors. |
| 4-Connected-With-Black ∗ | We consider the 4-adjacent pixels of black color, represented by the value 0, as a component, excluding components with other colors. |
| Same-Color | We consider pixels of the same color as a component, excluding components with the background color. |
Table 5: Abstractions in KAAR. The superscript “ ∗ ” denotes that the 8-connected version is also considered. The background color is black if black is present; otherwise, it is the most frequent color in the image. Abstractions are listed according to their prioritization in KAAR, ordered from top to bottom, with each 8-connected abstraction following its corresponding 4-connected abstraction at the end of the sequence. Abstractions highlighted in red are exclusive to KAAR.
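One abstraction family in Table 5, 4-connected same-color components, amounts to a flood fill over 4-adjacent equal-colored pixels. The sketch below is a minimal illustration; the function name and component representation are our choices, not GPAR's.

```python
def components_4(grid):
    """Group 4-adjacent pixels of the same color into components.
    Returns a list of (color, [cells]) pairs."""
    rows, cols = len(grid), len(grid[0])
    seen, comps = set(), []
    for sr in range(rows):
        for sc in range(cols):
            if (sr, sc) in seen:
                continue
            color, stack, comp = grid[sr][sc], [(sr, sc)], []
            seen.add((sr, sc))
            while stack:  # iterative flood fill
                r, c = stack.pop()
                comp.append((r, c))
                for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                    if (0 <= nr < rows and 0 <= nc < cols
                            and (nr, nc) not in seen
                            and grid[nr][nc] == color):
                        seen.add((nr, nc))
                        stack.append((nr, nc))
            comps.append((color, comp))
    return comps
```

The non-background and bounding-box variants in Table 5 filter or expand these components after extraction; the 8-connected versions simply add the four diagonal neighbors.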
| Geometry and Topology | Size (Width and Height); Color; Shape (One Pixel; Horizontal Line; Vertical Line; Diagonal Line; Square; Rectangle; Cross; Irregular Shape); Symmetry (Horizontal Symmetry; Vertical Symmetry; Diagonal Symmetry; Anti-Diagonal Symmetry; Central Symmetry); Bounding Box; Hole Count; Nearest Boundary; Different/Identical with Other Components; Touching; Inclusive; Spatial (Horizontally Aligned to the Right; Horizontally Aligned to the Left; Vertically Aligned Below; Vertically Aligned Above; Top-Left; Top-Right; Bottom-Left; Bottom-Right; Same Position) |
| --- | --- |
| Numbers and Counting | Component Size Counting; Components with Same Size; Components with Most Frequent Size; Components with Least Frequent Size; Components with Maximum Size; Components with Minimum Size; Component Color Counting; Components with Same Color; Components with Same Number of Colors; Components with Most Frequent Color; Components with Least Frequent Color; Component with Most Distinct Colors; Component with Fewest Distinct Colors; Component Shape Counting; Components with Same Shape; Components with Most Frequent Shape; Components with Least Frequent Shape; Component Hole Number Counting; Components with Same Number of Holes; Components with Maximum Number of Holes; Components with Minimum Number of Holes; Component Symmetry Counting |
| Goal-directedness | Color Change (modifying component value); Movement (shifting component’s position); Extension (expanding component’s area); Completing (filling in missing parts of a component); Resizing (altering component size); Selecting (isolating a component); Copying (duplicating a component); Flipping (mirroring a component); Rotation (rotating a component); Cropping (cutting part of a component) |
Table 6: KAAR priors classified into geometry and topology, numbers and counting, and goal-directedness. For goal-directedness, we incorporate ten predefined actions, with their corresponding action schemas detailed in Table A.12.
| Color Change | Targets | Source and Target Colors | | | | |
| --- | --- | --- | --- | --- | --- | --- |
| Movement | Targets | Direction | Start and End Locations | Pattern | Order | Overlapping |
| Extension | Targets | Direction | Start and End Locations | Pattern | Order | Intersection |
| Completing | Targets | Pattern | | | | |
| Resizing | Targets | Source and Target Sizes | | | | |
| Selecting | Targets | | | | | |
| Copying | Targets | Locations | Overlapping | | | |
| Flipping | Targets | Flipping Axis | Overlapping | | | |
| Rotation | Targets | Degrees | | | | |
| Cropping | Targets | Subsets | | | | |
Table 7: Actions in KAAR and their schemas (implementation details). Each action schema is presented according to its prompting order in KAAR (left to right). Some actions include a pattern schema that prompts the LLM to identify underlying logic rules, such as repeating every two steps in movement or extension, or completing based on three-color repetition. Targets denote the target components.
| whole image | Symmetry; Size | - | Flipping; Rotation; Extension; Completing; Cropping |
| --- | --- | --- | --- |
| middle-vertical | Size | - | Flipping; Movement |
| middle-horizontal | Size | - | Flipping; Movement |
| multi-lines | Size; Color; Shape; Symmetry; Bounding Box; Hole Count | ALL | ALL |
| 4-connected-multi-color-non-background ∗ | ALL | … Component Color Counting; Components with Same Number of Colors; Component with Most Distinct Colors; Component with Fewest Distinct Colors … | ALL |
Table 8: Abstractions with their assigned knowledge priors. “–” denotes no priors, while “ALL” indicates all priors in the corresponding category, as defined in Table 6. The superscript “ ∗ ” indicates that the 8-connected version is also applicable. The highlighted priors apply exclusively to their corresponding abstractions. For the 4/8-connected-multi-color-non-background abstractions, we present color-counting priors specific to multi-colored components, while all other non-color-counting priors follow those in Table 6.
Figure 12: Accuracy on test instances $I_{t}$ for RSPC and KAAR across average image size intervals, evaluated with Gemini-2.0 and DeepSeek-R1-70B.
<details>
<summary>x13.png Details</summary>

### Visual Description
## Line Chart: Accuracy vs. Iterations for Different Models
### Overview
This line chart displays the accuracy of four different models (Gemini-2.0 with RSPC, Gemini-2.0 with KAAR, DeepSeek-R1-70B with RSPC, and DeepSeek-R1-70B with KAAR) across 12 iterations, categorized by three distinct tasks: Objectness, Geometry/Topology/Numbers & Counting, and Goal-directedness. The y-axis represents accuracy on $I_{r}\&I_{t}$ (%), while the x-axis represents the number of iterations.
### Components/Axes
* **X-axis:** "# Iterations" - Scale from 1 to 12. Markers at 1, 4, 8, and 12.
* **Y-axis:** "Accuracy on $I_{r}\&I_{t}$ (%)" - Scale from 0 to 25. Markers at 0, 5, 10, 15, 20, and 25.
* **Legend:** Located in the bottom-right corner.
* Black Circle: Gemini-2.0: RSPC
* Gray Circle: Gemini-2.0: KAAR
* Brown Triangle: DeepSeek-R1-70B: RSPC
* Dark Teal Square: DeepSeek-R1-70B: KAAR
* **Prior Levels:** The x-axis is divided into three sections, visually separated by vertical dashed lines:
* "Objectness" (Iterations 1-4)
* "Geometry, Topology, Numbers and Counting" (Iterations 4-8)
* "Goal-directedness" (Iterations 8-12)
### Detailed Analysis
Here's a breakdown of the data for each model and task, with approximate values:
**1. Gemini-2.0: RSPC (Black Circle)**
* **Objectness (1-4 iterations):** Starts at approximately 9.5% at iteration 1, increases to 13.25% at iteration 2, 14.75% at iteration 3, and plateaus at 15% at iteration 4.
* **Geometry/Topology/Numbers & Counting (4-8 iterations):** Remains at 15% until iteration 6, then rises slightly to 15.25% at iteration 7 and 16.5% at iteration 8.
* **Goal-directedness (8-12 iterations):** Increases sharply to 19.75% at iteration 9, then plateaus at approximately 20.5% for iterations 10, 11, and 12.
**2. Gemini-2.0: KAAR (Gray Circle)**
* **Objectness (1-4 iterations):** Starts at approximately 3.75% at iteration 1, increases to 11.75% at iteration 2, 13.25% at iteration 3, and 13.5% at iteration 4.
* **Geometry/Topology/Numbers & Counting (4-8 iterations):** Increases to 15% at iteration 5, then remains relatively stable at 15.25% at iteration 6, 15.25% at iteration 7, and 15.25% at iteration 8.
* **Goal-directedness (8-12 iterations):** Increases to 15.75% at iteration 9, then plateaus at approximately 16.5% for iterations 10, 11, and 12.
**3. DeepSeek-R1-70B: RSPC (Brown Triangle)**
* **Objectness (1-4 iterations):** Starts at approximately 4% at iteration 1, increases to 5.5% at iteration 2, 6.5% at iteration 3, and 7% at iteration 4.
* **Geometry/Topology/Numbers & Counting (4-8 iterations):** Increases to 8.25% at iteration 5, 8.5% at iteration 6, 8.5% at iteration 7, and 8.75% at iteration 8.
* **Goal-directedness (8-12 iterations):** Increases to 10.75% at iteration 9, 11.25% at iteration 10, 11.25% at iteration 11, and 11.5% at iteration 12.
**4. DeepSeek-R1-70B: KAAR (Dark Teal Square)**
* **Objectness (1-4 iterations):** Starts at approximately 3.25% at iteration 1, increases to 4.5% at iteration 2, 5.5% at iteration 3, and 5.5% at iteration 4.
* **Geometry/Topology/Numbers & Counting (4-8 iterations):** Increases to 6.75% at iteration 5, 7% at iteration 6, 7.25% at iteration 7, and 7.25% at iteration 8.
* **Goal-directedness (8-12 iterations):** Increases to 7.25% at iteration 9, 7.25% at iteration 10, 7.25% at iteration 11, and 7.25% at iteration 12.
### Key Observations
* **Gemini-2.0: RSPC** consistently outperforms all other models across all tasks and iterations.
* **Gemini-2.0: KAAR** generally performs better than the DeepSeek models, but lags behind the RSPC version of Gemini-2.0.
* **DeepSeek-R1-70B: RSPC** performs better than **DeepSeek-R1-70B: KAAR**.
* The steepest within-phase gains occur during the early "Objectness" iterations, where every configuration improves rapidly from its starting accuracy.
* Gemini-2.0: RSPC also shows a sharp jump at the start of the "Goal-directedness" phase (16.5% to 19.75% at iteration 9).
* The accuracy curves for Gemini-2.0: RSPC and Gemini-2.0: KAAR flatten out after a certain number of iterations, suggesting diminishing returns.
### Interpretation
The data suggests that Gemini-2.0, particularly when paired with RSPC, is the most effective configuration on this subset. The sharp increase at the start of the "Goal-directedness" phase for Gemini-2.0: RSPC indicates that additional iterations still pay off at that stage. Note that KAAR completes its prior augmentation only in the final phase, so intermediate comparisons against RSPC, which applies the same solver at every iteration, should be read with care. The comparatively low accuracy of DeepSeek-R1-70B suggests its reasoning capacity is the limiting factor here. The flattening of the Gemini-2.0 curves implies diminishing returns from further iterations, and the consistent ranking of configurations across all three prior levels suggests the observed differences are not level-specific anomalies.
</details>
Figure 13: Variance in accuracy on $I_{r}\&I_{t}$ with increasing iterations for RSPC and KAAR using Gemini-2.0 and DeepSeek-R1-70B.
<details>
<summary>x14.png Details</summary>

### Visual Description
## Bar Chart: Model Performance Across Cognitive Dimensions
### Overview
This bar chart compares the average token cost incurred by four large language models (GPT-o3-mini, Gemini-2.0, QwQ-32B, and DeepSeek-R1-70B) when augmenting priors at three levels: Objectness; Geometry, Topology, Numbers and Counting; and Goal-directedness. Cost is measured in tokens.
### Components/Axes
* **X-axis:** Cognitive Dimensions - Objectness, Geometry, Topology, Numbers and Counting, Goal-directedness.
* **Y-axis:** Tokens - Scale ranges from 0K to 50K, with tick marks at 10K intervals.
* **Legend:** Located in the top-left corner.
* Blue: GPT-o3-mini
* Green: Gemini-2.0
* Purple: QwQ-32B
* Orange: DeepSeek-R1-70B
### Detailed Analysis
The chart consists of three groups of four bars, one for each model within each cognitive dimension.
**Objectness:**
* GPT-o3-mini: Approximately 11K tokens.
* Gemini-2.0: Approximately 12K tokens.
* QwQ-32B: Approximately 20K tokens.
* DeepSeek-R1-70B: Approximately 15K tokens.
**Geometry, Topology, Numbers and Counting:**
* GPT-o3-mini: Approximately 24K tokens.
* Gemini-2.0: Approximately 40K tokens.
* QwQ-32B: Approximately 29K tokens.
* DeepSeek-R1-70B: Approximately 37K tokens.
**Goal-directedness:**
* GPT-o3-mini: Approximately 19K tokens.
* Gemini-2.0: Approximately 31K tokens.
* QwQ-32B: Approximately 18K tokens.
* DeepSeek-R1-70B: Approximately 43K tokens.
### Key Observations
* QwQ-32B incurs the highest token cost for Objectness (approximately 20K).
* Gemini-2.0 incurs the highest cost for Geometry, Topology, Numbers and Counting (approximately 40K).
* DeepSeek-R1-70B incurs the highest cost for Goal-directedness (approximately 43K) and remains among the costliest at every level.
* GPT-o3-mini consistently uses the fewest tokens across all three levels.
### Interpretation
The chart measures the token cost of augmenting core knowledge priors at each level, not task accuracy; higher bars mean more tokens spent, which may reflect longer reasoning traces or simply a more verbose generation style. GPT-o3-mini is the most token-efficient model throughout, while Gemini-2.0 and DeepSeek-R1-70B spend substantially more on the later levels.
The three levels mirror KAAR's hierarchy of core knowledge priors: Objectness concerns identifying and categorizing objects; Geometry, Topology, Numbers and Counting covers spatial and numerical reasoning; and Goal-directedness captures planning toward an objective. The rising cost from Objectness to the later levels is consistent with the growing complexity of the priors being augmented.
</details>
Figure 14: Average token cost for augmenting priors at each level across four LLMs. K is $10^{3}$ .
<details>
<summary>x15.png Details</summary>

### Visual Description
## Code Snippet: Image Transformation Function
### Overview
The image presents a code snippet (Python) defining a function `generate_output_image` that transforms an input image (assumed square) into a 15x15 output image based on a grid-drawing rule. The code includes comments explaining the logic and a visual representation of the transformation process with input and output images.
### Components/Axes
The image is divided into two main sections:
1. **Code Solution:** Contains the Python code defining the `generate_output_image` function.
2. **Task 695367ec:** Contains example input and output images.
The code itself includes comments that act as labels, explaining the purpose of each section. The input and output images are visually presented, demonstrating the transformation.
### Detailed Analysis or Content Details
**Code Transcription:**
```python
def generate_output_image(input_image):
    # Determine the color value from the input image.
    v = input_image[0][0]
    # Get the input image dimension (assumed square).
    n = len(input_image)
    # Set the output image dimensions (always 15x15 as per the transformation rule).
    out_size = 15
    # Define grid-line indices for rows and columns based on the input dimension.
    # These mappings are inferred from the examples:
    #   For 2x2 and 5x5 inputs, grid-lines occur at indices: 2, 5, 8, 11, 14.
    #   For a 3x3 input, grid-lines occur at indices: 3, 7, 11.
    #   For a 4x4 input, grid-lines occur at indices: 4, 9, 14.
    if n in (2, 5):
        grid_indices = [2, 5, 8, 11, 14]
    elif n == 3:
        grid_indices = [3, 7, 11]
    elif n == 4:
        grid_indices = [4, 9, 14]
    else:
        # Default: evenly space grid-lines over a 15-element dimension.
        # This is a fallback if the input size is not one of the above.
        block_size = out_size // (n + 1)
        grid_indices = [(i + 1) * block_size - 1 for i in range(n)]
    # Create the 15x15 output image based on the grid drawing rule.
    output = []
    for r in range(out_size):
        row = []
        if r in grid_indices:
            # This is a separator (grid-line) row; paint the entire row with v.
            row.extend([v] * out_size)
        else:
            # For a pattern row, only the pixels at grid-line column positions are painted.
            for c in range(out_size):
                if c in grid_indices:
                    row.append(v)
                else:
                    row.append(0)  # Assuming 0 represents a default color (e.g., black).
        output.append(row)
    return output
```
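As a quick sanity check, the transcribed function can be exercised on a small input. The snippet below is a condensed restatement of the function so it runs standalone; the 2x2 case should yield full grid-line rows at indices 2, 5, 8, 11, and 14:

```python
def generate_output_image(input_image):
    # Condensed restatement of the transcribed function above.
    v = input_image[0][0]   # grid-line color from the top-left pixel
    n = len(input_image)    # input dimension (assumed square)
    out_size = 15
    if n in (2, 5):
        grid_indices = [2, 5, 8, 11, 14]
    elif n == 3:
        grid_indices = [3, 7, 11]
    elif n == 4:
        grid_indices = [4, 9, 14]
    else:
        # Fallback: evenly spaced grid-lines over the 15-wide output.
        block_size = out_size // (n + 1)
        grid_indices = [(i + 1) * block_size - 1 for i in range(n)]
    return [[v] * out_size if r in grid_indices
            else [v if c in grid_indices else 0 for c in range(out_size)]
            for r in range(out_size)]

# A hypothetical 2x2 input filled with color 3 (not one of the paper's examples).
out = generate_output_image([[3, 3], [3, 3]])
```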
**Input/Output Image Analysis:**
* **Input Image (Left):** A 4x4 grid. The color of the top-left pixel is visually identified as a dark shade (likely black or a dark gray).
* **Output Image (Right):** A 15x15 grid. Horizontal and vertical lines are present, forming a grid pattern. The lines are the same color as the top-left pixel of the input image. The rest of the pixels are a lighter color (likely white or a light gray). The grid lines occur at indices 4, 9, and 14.
### Key Observations
* The function transforms an input image into a 15x15 image with a grid pattern.
* The color of the grid lines is determined by the color of the top-left pixel of the input image.
* The grid line positions are determined by the input image's dimension.
* The code provides specific grid line indices for input sizes of 2x2, 3x3, 4x4, and 5x5.
* A default grid line calculation is used for input sizes not explicitly defined.
* The default color for non-grid pixels is assumed to be 0.
### Interpretation
The code implements a visual transformation that maps the input image's color onto a grid pattern in a 15x15 output. Rather than inferring a general rule relating grid-line spacing to input size, the function hard-codes the grid-line indices observed for each training input size (2x2, 3x3, 4x4, 5x5), with an evenly-spaced fallback for unseen sizes. This memorization of the training instances is consistent with a solution that passes the training pairs yet fails on the test instance. The question mark at the bottom right marks the test input whose output the solver must predict.
</details>
Figure 15: ARC problem 695367ec, where RSPC and KAAR generate the same code solution that passes the training instances but fails on the test instance using GPT-o3-mini.
<details>
<summary>x16.png Details</summary>

### Visual Description
## Diagram: Image Generation Logic
### Overview
The image presents a diagram illustrating the logic for generating a 5x5 output image based on a 6x6 input image. The logic centers around analyzing the first row of the input image to determine a border pattern, then constructing the output image using this pattern and a blank middle row. The right side of the image contains Python code implementing this logic, with annotations highlighting key decision points.
### Components/Axes
The diagram is divided into three main sections:
1. **Input Images (Left):** Shows two 6x6 input images, one with a pattern and one with a question mark indicating an unknown input.
2. **Arrows:** Illustrate the flow of data from the input image to the output image.
3. **Code Solution (Right):** Python code with comments explaining the image generation process.
The code includes the following key variables and concepts:
* `input_image`: The 6x6 input image.
* `count_eights`: The number of 8s in the first row of the input image.
* `active_pattern`: The border pattern to use when `count_eights` is greater than or equal to 2.
* `top_active`: The first active row of the output image.
* `second_active`: The second active row of the output image.
* `blank`: The middle row of the output image, always filled with zeros.
* `output_image`: The final 5x5 output image.
### Detailed Analysis or Content Details
The Python code defines a function `generate_output_image(input_image)` that performs the following steps:
1. **Count Eights:** Counts the number of pixels with the value 8 in the first row of the input image.
2. **Determine Border Pattern:**
* If `count_eights` is greater than or equal to 2, the `active_pattern` is set to `[8, 8, 8, 8, 8]`. `top_active` and `second_active` are both assigned this pattern.
* Otherwise, the `top_active` is set to `[0, 8, 0, 8, 0]` and `second_active` is set to `[8, 0, 8, 0, 8]`.
3. **Create Blank Row:** The `blank` row is defined as `[0, 0, 0, 0, 0]`.
4. **Construct Output Image:** The `output_image` is constructed as a list of lists, consisting of `top_active`, `second_active`, `blank`, `top_active`, and `second_active`.
5. **Return Output Image:** The function returns the `output_image`.
The first input image shows a 6x6 grid with a pattern of 8s and 0s. The first row contains five 8s. The second input image has a question mark, indicating an unknown input.
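Following the steps above, a minimal sketch of the described function (variable names taken from the diagram; a reconstruction, not the paper's exact code) could read:

```python
def generate_output_image(input_image):
    # Step 1: count pixels equal to 8 in the first row of the input.
    count_eights = input_image[0].count(8)
    # Step 2: choose the border pattern from that count.
    if count_eights >= 2:
        top_active = [8, 8, 8, 8, 8]      # "full-active" pattern
        second_active = [8, 8, 8, 8, 8]
    else:
        top_active = [0, 8, 0, 8, 0]      # "softer-border" pattern
        second_active = [8, 0, 8, 0, 8]
    # Step 3: the middle row is always blank.
    blank = [0, 0, 0, 0, 0]
    # Step 4: stack the rows into the fixed 5x5 output.
    return [top_active, second_active, blank, top_active, second_active]

# Hypothetical 6x6 input whose first row contains three 8s.
grid = [[8, 0, 8, 0, 8, 0]] + [[0] * 6 for _ in range(5)]
out = generate_output_image(grid)
```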
### Key Observations
* The logic is based on a simple rule: the presence of two or more 8s in the first row of the input image determines whether to use a "full-active" or "softer-border" pattern.
* The output image is always 5x5, regardless of the input image size.
* The middle row of the output image is always a row of zeros.
* The output image is constructed by vertically mirroring the active rows.
* The annotations "No objective-centric reasoning" and "Rules are only applied to training instances" suggest this logic is part of a larger system designed to mimic human-like pattern recognition without necessarily understanding the underlying meaning.
### Interpretation
The diagram illustrates a rule-based solution: the count of 8s in the first row selects between a full-active and a softer-border pattern, and the 5x5 output is assembled from these rows plus a blank middle row. As the annotations "No objective-centric reasoning" and "Rules are only applied to training instances" indicate, the rule is fit to surface statistics of the training pairs rather than to the objects in the grids, which explains why the solution generalizes poorly to the test input (marked by the question mark). The values 8 and 0 are ARC pixel colors, not a binary encoding.
</details>
Figure 16: ARC problem b1fc8b8e, where RSPC and KAAR generate the same code solution that passes the training instances but fails on the test instance using GPT-o3-mini.
<details>
<summary>x17.png Details</summary>

### Visual Description
## Heatmap: Coverage Comparison of ARC Solvers
### Overview
This image presents a heatmap visualizing the pairwise relative coverage of nine ARC solvers: Direct Generation, Repeated Sampling, and Refinement, each in three variants: P (solution plan), C (standalone code generation), and PC (planning-aided code generation). Coverage values range from 0 to 1, indicated by a color gradient.
### Components/Axes
* **X-axis:** Represents the generation methods: "Direct Generation\_P", "Direct Generation\_C", "Direct Generation\_PC", "Repeated Sampling\_P", "Repeated Sampling\_C", "Repeated Sampling\_PC", "Refinement\_P", "Refinement\_C", "Refinement\_PC".
* **Y-axis:** Represents the same generation methods as the X-axis, creating a matrix for pairwise comparison.
* **Color Scale (Right Side):** Indicates the "Coverage" value, ranging from approximately 0.0 (blue) to 1.0 (red), with most off-diagonal values concentrated between 0.5 and 1.0.
* **Cell Values:** Each cell in the heatmap displays a numerical value representing the coverage between the corresponding methods.
* **Title (Bottom):** "GPT-o3-mini"
### Detailed Analysis
The heatmap is a 9x9 matrix. Each cell's color corresponds to the coverage value. The diagonal cells (where a method is compared to itself) all have a value of 1.00, indicating perfect coverage.
Here's a breakdown of the coverage values, row by row:
* **Direct Generation\_P:**
* Direct Generation\_P: 1.00
* Direct Generation\_C: 0.74
* Direct Generation\_PC: 0.75
* Repeated Sampling\_P: 0.80
* Repeated Sampling\_C: 0.89
* Repeated Sampling\_PC: 0.84
* Refinement\_P: 0.74
* Refinement\_C: 0.81
* Refinement\_PC: 0.75
* **Direct Generation\_C:**
* Direct Generation\_C: 1.00
* Direct Generation\_P: 0.61
* Direct Generation\_PC: 0.71
* Repeated Sampling\_P: 0.68
* Repeated Sampling\_C: 0.89
* Repeated Sampling\_PC: 0.83
* Refinement\_P: 0.76
* Refinement\_C: 0.80
* Refinement\_PC: 0.70
* **Direct Generation\_PC:**
* Direct Generation\_PC: 1.00
* Direct Generation\_P: 0.69
* Direct Generation\_C: 0.79
* Repeated Sampling\_P: 0.73
* Repeated Sampling\_C: 0.91
* Repeated Sampling\_PC: 0.84
* Refinement\_P: 0.73
* Refinement\_C: 0.81
* Refinement\_PC: 0.80
* **Repeated Sampling\_P:**
* Repeated Sampling\_P: 1.00
* Direct Generation\_P: 0.68
* Direct Generation\_C: 0.71
* Direct Generation\_PC: 0.68
* Repeated Sampling\_C: 0.87
* Repeated Sampling\_PC: 0.88
* Refinement\_P: 0.80
* Refinement\_C: 0.75
* Refinement\_PC: 0.69
* **Repeated Sampling\_C:**
* Repeated Sampling\_C: 1.00
* Direct Generation\_P: 0.55
* Direct Generation\_C: 0.67
* Direct Generation\_PC: 0.62
* Repeated Sampling\_P: 0.64
* Repeated Sampling\_PC: 0.80
* Refinement\_P: 0.65
* Refinement\_C: 0.78
* Refinement\_PC: 0.69
* **Repeated Sampling\_PC:**
* Repeated Sampling\_PC: 1.00
* Direct Generation\_P: 0.55
* Direct Generation\_C: 0.66
* Direct Generation\_PC: 0.61
* Repeated Sampling\_P: 0.68
* Repeated Sampling\_C: 0.85
* Refinement\_P: 0.76
* Refinement\_C: 0.67
* Refinement\_PC: 0.67
* **Refinement\_P:**
* Refinement\_P: 1.00
* Direct Generation\_P: 0.61
* Direct Generation\_C: 0.75
* Direct Generation\_PC: 0.66
* Repeated Sampling\_P: 0.77
* Repeated Sampling\_C: 0.85
* Repeated Sampling\_PC: 0.84
* Refinement\_C: 0.75
* Refinement\_PC: 0.71
* **Refinement\_C:**
* Refinement\_C: 1.00
* Direct Generation\_P: 0.56
* Direct Generation\_C: 0.67
* Direct Generation\_PC: 0.62
* Repeated Sampling\_P: 0.61
* Repeated Sampling\_C: 0.87
* Repeated Sampling\_PC: 0.79
* Refinement\_P: 0.63
* Refinement\_PC: 0.73
* **Refinement\_PC:**
* Refinement\_PC: 1.00
* Direct Generation\_P: 0.59
* Direct Generation\_C: 0.69
* Direct Generation\_PC: 0.64
* Repeated Sampling\_P: 0.87
* Repeated Sampling\_C: 0.81
* Repeated Sampling\_PC: 0.68
* Refinement\_P: 0.83
### Key Observations
* The diagonal elements are all 1.00, as expected.
* "Repeated Sampling\_C" and "Refinement\_C" consistently show higher coverage values with other methods compared to their "P" and "PC" counterparts.
* "Direct Generation\_P" and "Refinement\_P" generally have lower coverage values with other methods.
* The highest off-diagonal value is 0.91, between "Direct Generation\_PC" and "Repeated Sampling\_C".
* The lowest off-diagonal value is 0.55, appearing twice between "Repeated Sampling\_C" and "Direct Generation\_P", and between "Repeated Sampling\_PC" and "Direct Generation\_P".
### Interpretation
This heatmap shows the degree of overlap between the sets of ARC problems whose test instances each solver passes: each cell reports the fraction of the row solver's solved problems that are also solved by the column solver.
The higher values in the "\_C" and "\_PC" columns suggest that the code-generation variants (standalone and planning-aided) solve most of what the other solvers solve. Conversely, the lower values in the "\_P" columns indicate that the plan-only variants cover a more distinctive subset of problems.
The relatively low values in the "Direct Generation\_P" column against the "Repeated Sampling" rows indicate that single-shot planning misses many problems that iterative sampling solves, plausibly because repeated sampling explores a wider space of candidate solutions.
These results are specific to GPT-o3-mini. The matrix helps identify which solvers are largely redundant and which contribute unique solutions: the repeated-sampling code-generation variants provide the broadest coverage, while the plan-only variants retain some unique solves.
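For reference, the asymmetric relative coverage described in the caption can be computed from per-solver solved sets; the sketch below uses made-up solver names and problem IDs, not the paper's data:

```python
def relative_coverage(solved):
    """coverage[row][col]: fraction of problems solved by the row solver
    that are also solved by the column solver (asymmetric by construction)."""
    return {
        row: {
            col: (len(solved[row] & solved[col]) / len(solved[row])
                  if solved[row] else 0.0)
            for col in solved
        }
        for row in solved
    }

# Illustrative solved-problem sets for two hypothetical solvers.
solved = {
    "RepeatedSampling_PC": {"t1", "t2", "t3", "t4"},
    "DirectGeneration_P": {"t1", "t2"},
}
cov = relative_coverage(solved)
```

Because the denominator is the row solver's solved count, a strong solver tends to have low values in its row and high values in its column.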
</details>
Figure 17: Asymmetric relative coverage matrix of nine ARC solvers using GPT-o3-mini, showing the proportion of problems whose test instances are solved by the row solver that are also solved by the column solver. P denotes the solution plan; C and PC refer to standalone and planning-aided code generation, respectively.
<details>
<summary>x18.png Details</summary>

### Visual Description
## Heatmap: Coverage Comparison of ARC Solvers
### Overview
This image presents a heatmap visualizing the pairwise relative coverage of nine ARC solvers: variations of Direct Generation, Repeated Sampling, and Refinement, each in three variants: P (solution plan), C (standalone code generation), and PC (planning-aided code generation). The heatmap cells are color-coded to represent coverage values, ranging from 0.0 to 1.0, with a gradient from blue (low coverage) to red (high coverage).
### Components/Axes
* **X-axis:** Represents the column solver. The categories are: "Direct Generation P", "Direct Generation C", "Direct Generation PC", "Repeated Sampling P", "Repeated Sampling C", "Repeated Sampling PC", "Refinement P", "Refinement C", "Refinement PC".
* **Y-axis:** Represents the row solver, mirroring the categories of the X-axis: "Direct Generation P", "Direct Generation C", "Direct Generation PC", "Repeated Sampling P", "Repeated Sampling C", "Repeated Sampling PC", "Refinement P", "Refinement C", "Refinement PC".
* **Color Scale:** A vertical color bar on the right side of the heatmap indicates the coverage values.
* Blue: ~0.0
* White: ~0.5
* Red: ~1.0
* **Cell Values:** Each cell in the heatmap displays a numerical value representing the coverage between the corresponding source and target methods.
### Detailed Analysis
The heatmap displays coverage values between all combinations of the nine generation methods. The diagonal elements (where source and target methods are the same) are all 1.00, indicating perfect coverage of a method with itself.
Here's a breakdown of the coverage values, organized by source method:
**1. Direct Generation P:**
* Direct Generation P - Direct Generation P: 1.00
* Direct Generation P - Direct Generation C: 0.54
* Direct Generation P - Direct Generation PC: 0.46
* Direct Generation P - Repeated Sampling P: 0.64
* Direct Generation P - Repeated Sampling C: 0.79
* Direct Generation P - Repeated Sampling PC: 0.82
* Direct Generation P - Refinement P: 0.57
* Direct Generation P - Refinement C: 0.75
* Direct Generation P - Refinement PC: 0.79
**2. Direct Generation C:**
* Direct Generation C - Direct Generation P: 0.56
* Direct Generation C - Direct Generation C: 1.00
* Direct Generation C - Direct Generation PC: 0.48
* Direct Generation C - Repeated Sampling P: 0.78
* Direct Generation C - Repeated Sampling C: 0.89
* Direct Generation C - Repeated Sampling PC: 0.89
* Direct Generation C - Refinement P: 0.63
* Direct Generation C - Refinement C: 0.81
* Direct Generation C - Refinement PC: 0.74
**3. Direct Generation PC:**
* Direct Generation PC - Direct Generation P: 0.52
* Direct Generation PC - Direct Generation C: 0.52
* Direct Generation PC - Direct Generation PC: 1.00
* Direct Generation PC - Repeated Sampling P: 0.72
* Direct Generation PC - Repeated Sampling C: 0.84
* Direct Generation PC - Repeated Sampling PC: 0.88
* Direct Generation PC - Refinement P: 0.56
* Direct Generation PC - Refinement C: 0.72
* Direct Generation PC - Refinement PC: 0.84
**4. Repeated Sampling P:**
* Repeated Sampling P - Direct Generation P: 0.45
* Repeated Sampling P - Direct Generation C: 0.53
* Repeated Sampling P - Direct Generation PC: 0.45
* Repeated Sampling P - Repeated Sampling P: 1.00
* Repeated Sampling P - Repeated Sampling C: 0.85
* Repeated Sampling P - Repeated Sampling PC: 0.88
* Repeated Sampling P - Refinement P: 0.57
* Repeated Sampling P - Refinement C: 0.70
* Repeated Sampling P - Refinement PC: 0.72
**5. Repeated Sampling C:**
* Repeated Sampling C - Direct Generation P: 0.37
* Repeated Sampling C - Direct Generation C: 0.41
* Repeated Sampling C - Direct Generation PC: 0.36
* Repeated Sampling C - Repeated Sampling P: 0.58
* Repeated Sampling C - Repeated Sampling C: 1.00
* Repeated Sampling C - Repeated Sampling PC: 0.86
* Repeated Sampling C - Refinement P: 0.49
* Repeated Sampling C - Refinement C: 0.63
* Repeated Sampling C - Refinement PC: 0.68
**6. Repeated Sampling PC:**
* Repeated Sampling PC - Direct Generation P: 0.34
* Repeated Sampling PC - Direct Generation C: 0.36
* Repeated Sampling PC - Direct Generation PC: 0.33
* Repeated Sampling PC - Repeated Sampling P: 0.52
* Repeated Sampling PC - Repeated Sampling C: 0.76
* Repeated Sampling PC - Repeated Sampling PC: 1.00
* Repeated Sampling PC - Refinement P: 0.45
* Repeated Sampling PC - Refinement C: 0.58
* Repeated Sampling PC - Refinement PC: 0.61
**7. Refinement P:**
* Refinement P - Direct Generation P: 0.46
* Refinement P - Direct Generation C: 0.49
* Refinement P - Direct Generation PC: 0.40
* Refinement P - Repeated Sampling P: 0.66
* Refinement P - Repeated Sampling C: 0.83
* Refinement P - Repeated Sampling PC: 0.86
* Refinement P - Refinement P: 1.00
* Refinement P - Refinement C: 0.66
* Refinement P - Refinement PC: 0.80
**8. Refinement C:**
* Refinement C - Direct Generation P: 0.45
* Refinement C - Direct Generation C: 0.47
* Refinement C - Direct Generation PC: 0.38
* Refinement C - Repeated Sampling P: 0.60
* Refinement C - Repeated Sampling C: 0.79
* Refinement C - Repeated Sampling PC: 0.83
* Refinement C - Refinement P: 0.49
* Refinement C - Refinement C: 1.00
* Refinement C - Refinement PC: 0.70
**9. Refinement PC:**
* Refinement PC - Direct Generation P: 0.46
* Refinement PC - Direct Generation C: 0.42
* Refinement PC - Direct Generation PC: 0.44
* Refinement PC - Repeated Sampling P: 0.60
* Refinement PC - Repeated Sampling C: 0.83
* Refinement PC - Repeated Sampling PC: 0.85
* Refinement PC - Refinement P: 0.58
* Refinement PC - Refinement C: 0.69
* Refinement PC - Refinement PC: 1.00
### Key Observations
* The diagonal elements are all 1.00, as expected.
* Coverage values are generally higher when the target method includes "PC" compared to "P" or "C" alone.
* "Repeated Sampling C" and "Repeated Sampling PC" consistently show high coverage values with other methods, particularly with "Direct Generation C" and "Refinement C/PC".
* "Direct Generation P" and "Direct Generation PC" have relatively lower coverage with "Repeated Sampling P/C/PC" and "Refinement P/C/PC".
* The lowest coverage values are generally found between "Repeated Sampling PC" and "Direct Generation P/C/PC".
### Interpretation
This heatmap shows the overlap between the problem sets solved by the nine solvers under Gemini-2.0: each cell is the fraction of the row solver's solved problems that are also solved by the column solver.
Coverage is generally higher toward the "PC" columns, suggesting that planning-aided code generation solves most of what the plan-only and standalone-code variants solve; combining a plan with code generation appears to subsume both ingredients.
The low values in the "Repeated Sampling PC" row against the "Direct Generation" columns indicate that repeated sampling with planning solves many problems that single-shot generation does not; since the matrix is asymmetric, a strong solver tends to have low values in its row and high values in its column.
The consistently high values in the "Repeated Sampling C" and "Repeated Sampling PC" columns suggest these solvers contribute the broadest coverage, consistent with RSPC being the strongest solver overall.
The matrix can also inform which solvers to combine: pairing solvers with low mutual coverage yields the largest union of solved problems.
</details>
Figure 18: Asymmetric relative coverage matrix of nine ARC solvers using Gemini-2.0, showing the proportion of problems whose test instances are solved by the row solver that are also solved by the column solver. P denotes the solution plan; C and PC refer to standalone and planning-aided code generation, respectively.
<details>
<summary>x19.png Details</summary>

### Visual Description
## Heatmap: Coverage Matrix of ARC Solvers
### Overview
This image presents a heatmap displaying the pairwise relative coverage of nine ARC solvers: variations of Direct Generation, Repeated Sampling, and Refinement, each in three variants: P (solution plan), C (standalone code generation), and PC (planning-aided code generation). The color intensity represents the coverage value, with a scale ranging from 0.0 to 1.0.
### Components/Axes
* **X-axis:** Lists the solvers: "Direct Generation\_P", "Direct Generation\_C", "Direct Generation\_PC", "Repeated Sampling\_P", "Repeated Sampling\_C", "Repeated Sampling\_PC", "Refinement\_P", "Refinement\_C", "Refinement\_PC".
* **Y-axis:** Lists the same solvers as the X-axis, creating a square matrix.
* **Color Scale (Right Side):** A vertical color bar indicates the coverage value.
* 0.0 is represented by a light color (almost white).
* 1.0 is represented by a dark red color.
* 0.5 is represented by a mid-tone color.
* **Cell Values:** Each cell in the matrix displays a numerical value representing the coverage between the corresponding X and Y axis solvers.
* **Label (Bottom):** "QwQ-32B", the model used for this evaluation.
### Detailed Analysis
The heatmap displays coverage values among the nine solvers, ranging from approximately 0.32 to 1.00. Here's a breakdown of the values, row by row, referencing the color intensity:
* **Direct Generation\_P:**
* Direct Generation\_P: 1.00
* Direct Generation\_C: 0.53
* Direct Generation\_PC: 0.37
* Repeated Sampling\_P: 0.68
* Repeated Sampling\_C: 0.71
* Repeated Sampling\_PC: 0.74
* Refinement\_P: 0.68
* Refinement\_C: 0.71
* Refinement\_PC: 0.74
* **Direct Generation\_C:**
* Direct Generation\_P: 0.69
* Direct Generation\_C: 1.00
* Direct Generation\_PC: 0.45
* Repeated Sampling\_P: 0.79
* Repeated Sampling\_C: 0.93
* Repeated Sampling\_PC: 0.86
* Refinement\_P: 0.72
* Refinement\_C: 0.86
* Refinement\_PC: 0.86
* **Direct Generation\_PC:**
* Direct Generation\_P: 0.61
* Direct Generation\_C: 0.57
* Direct Generation\_PC: 1.00
* Repeated Sampling\_P: 0.70
* Repeated Sampling\_C: 0.78
* Repeated Sampling\_PC: 0.91
* Refinement\_P: 0.61
* Refinement\_C: 0.83
* Refinement\_PC: 0.78
* **Repeated Sampling\_P:**
* Direct Generation\_P: 0.58
* Direct Generation\_C: 0.51
* Direct Generation\_PC: 0.36
* Repeated Sampling\_P: 1.00
* Repeated Sampling\_C: 0.73
* Repeated Sampling\_PC: 0.76
* Refinement\_P: 0.64
* Refinement\_C: 0.69
* Refinement\_PC: 0.69
* **Repeated Sampling\_C:**
* Direct Generation\_P: 0.50
* Direct Generation\_C: 0.50
* Direct Generation\_PC: 0.33
* Repeated Sampling\_P: 0.61
* Repeated Sampling\_C: 1.00
* Repeated Sampling\_PC: 0.80
* Refinement\_P: 0.65
* Refinement\_C: 0.72
* Refinement\_PC: 0.70
* **Repeated Sampling\_PC:**
* Direct Generation\_P: 0.49
* Direct Generation\_C: 0.44
* Direct Generation\_PC: 0.37
* Repeated Sampling\_P: 0.60
* Repeated Sampling\_C: 0.75
* Repeated Sampling\_PC: 1.00
* Refinement\_P: 0.54
* Refinement\_C: 0.70
* Refinement\_PC: 0.63
* **Refinement\_P:**
* Direct Generation\_P: 0.59
* Direct Generation\_C: 0.48
* Direct Generation\_PC: 0.32
* Repeated Sampling\_P: 0.66
* Repeated Sampling\_C: 0.80
* Repeated Sampling\_PC: 0.70
* Refinement\_P: 1.00
* Refinement\_C: 0.73
* Refinement\_PC: 0.73
* **Refinement\_C:**
* Direct Generation\_P: 0.47
* Direct Generation\_C: 0.44
* Direct Generation\_PC: 0.33
* Repeated Sampling\_P: 0.54
* Repeated Sampling\_C: 0.68
* Repeated Sampling\_PC: 0.70
* Refinement\_P: 0.56
* Refinement\_C: 1.00
* Refinement\_PC: 0.72
* **Refinement\_PC:**
* Direct Generation\_P: 0.50
* Direct Generation\_C: 0.45
* Direct Generation\_PC: 0.32
* Repeated Sampling\_P: 0.55
* Repeated Sampling\_C: 0.68
* Repeated Sampling\_PC: 0.64
* Refinement\_P: 0.57
* Refinement\_C: 0.73
* Refinement\_PC: 1.00
### Key Observations
* **Perfect Self-Coverage:** Every solver fully covers itself (diagonal entries of 1.00).
* **Strong Coverage within Method Type:** Within each solver family (Direct Generation, Repeated Sampling, Refinement), the 'P', 'C', and 'PC' variants cover one another at rates generally above 0.6.
* **Highest Values:** Excluding the diagonal, the highest entries show that 0.93 of the problems solved by Direct Generation\_C are also solved by Repeated Sampling\_C, and 0.91 of those solved by Direct Generation\_PC are also solved by Repeated Sampling\_PC.
* **Lowest Values:** The lowest entries generally fall in the Direct Generation\_PC column (e.g., only 0.36 of the problems solved by Repeated Sampling\_P are also solved by Direct Generation\_PC).
* **Asymmetry:** The matrix is asymmetric: 0.53 of the problems solved by Direct Generation\_P are also solved by Direct Generation\_C, whereas 0.69 of those solved by Direct Generation\_C are also solved by Direct Generation\_P.
### Interpretation
This coverage matrix suggests that the solvers overlap substantially but are not interchangeable. The high within-family values indicate that the 'P', 'C', and 'PC' variants of a family tend to solve similar problem sets, while the consistently high Repeated Sampling\_C and Repeated Sampling\_PC columns indicate that these two solvers subsume much of what the other solvers can solve.
The low values in the Direct Generation\_PC column indicate that Direct Generation\_PC recovers relatively few of the problems solved by the other solvers. The label "Qwq-32B" identifies the evaluated model, so the observed values are specific to that context.
The heatmap can inform solver selection: combining solvers with low mutual coverage maximizes the total number of solved problems, whereas solvers with high mutual coverage are largely redundant.
</details>
Figure 19: Asymmetric relative coverage matrix of nine ARC solvers using QwQ-32B, showing the proportion of problems whose test instances are solved by the row solver that are also solved by the column solver. P denotes the solution plan; C and PC refer to standalone and planning-aided code generation, respectively.
<details>
<summary>x20.png Details</summary>

### Visual Description
## Heatmap: Relative Coverage Matrix of DeepSeek-R1-70B Solvers
### Overview
This image presents a heatmap visualizing the asymmetric relative coverage matrix between the nine ARC solvers evaluated with DeepSeek-R1-70B: Direct Generation P, Direct Generation C, Direct Generation PC, Repeated Sampling P, Repeated Sampling C, Repeated Sampling PC, Refinement P, Refinement C, and Refinement PC (P denotes the solution plan, C standalone code generation, and PC planning-aided code generation). The color intensity represents the coverage proportion, with warmer colors (reds) indicating higher values and cooler colors (greens) indicating lower values. A colorbar on the right indicates the mapping between color and value.
### Components/Axes
* **X-axis:** Sampling Methods - Direct Generation P, Direct Generation C, Direct Generation PC, Repeated Sampling P, Repeated Sampling C, Repeated Sampling PC, Refinement P, Refinement C, Refinement PC.
* **Y-axis:** Sampling Methods - Direct Generation P, Direct Generation C, Direct Generation PC, Repeated Sampling P, Repeated Sampling C, Repeated Sampling PC, Refinement P, Refinement C, Refinement PC.
* **Colorbar:** Scale from 0.0 to 1.8, representing the coverage proportion. Green indicates lower values, red indicates higher values.
* **Title:** DeepSeek-R1-70B (located at the bottom center)
### Detailed Analysis
The heatmap displays a 9x9 matrix of coverage values. Per the caption, each cell gives the proportion of problems solved by the row solver that are also solved by the column solver. The diagonal elements are all 1.00, since every solver trivially covers itself.
Here's a breakdown of key values, with approximate magnitudes:
* **Direct Generation P vs. Direct Generation C:** approximately 0.76.
* **Direct Generation P vs. Direct Generation PC:** approximately 0.65.
* **Direct Generation P vs. Repeated Sampling P:** approximately 0.65.
* **Direct Generation P vs. Repeated Sampling C:** approximately 0.71.
* **Direct Generation P vs. Repeated Sampling PC:** approximately 0.82.
* **Direct Generation P vs. Refinement P:** approximately 0.53.
* **Direct Generation P vs. Refinement C:** approximately 0.71.
* **Direct Generation P vs. Refinement PC:** approximately 0.76.
* **Direct Generation C vs. Direct Generation PC:** approximately 0.58.
* **Direct Generation C vs. Repeated Sampling P:** approximately 0.58.
* **Direct Generation C vs. Repeated Sampling C:** approximately 0.84.
* **Direct Generation C vs. Repeated Sampling PC:** approximately 0.89.
* **Direct Generation C vs. Refinement P:** approximately 0.53.
* **Direct Generation C vs. Refinement C:** approximately 0.79.
* **Direct Generation C vs. Refinement PC:** approximately 0.68.
* **Direct Generation PC vs. Repeated Sampling P:** approximately 0.56.
* **Direct Generation PC vs. Repeated Sampling C:** approximately 0.72.
* **Direct Generation PC vs. Repeated Sampling PC:** approximately 0.72.
* **Direct Generation PC vs. Refinement P:** approximately 0.44.
* **Direct Generation PC vs. Refinement C:** approximately 0.72.
* **Direct Generation PC vs. Refinement PC:** approximately 0.56.
* **Repeated Sampling P vs. Repeated Sampling C:** approximately 0.41.
* **Repeated Sampling P vs. Repeated Sampling PC:** approximately 0.45.
* **Repeated Sampling P vs. Refinement P:** approximately 0.59.
* **Repeated Sampling P vs. Refinement C:** approximately 0.76.
* **Repeated Sampling P vs. Refinement PC:** approximately 0.76.
* **Repeated Sampling C vs. Repeated Sampling PC:** approximately 0.66.
* **Repeated Sampling C vs. Refinement P:** approximately 0.41.
* **Repeated Sampling C vs. Refinement C:** approximately 0.62.
* **Repeated Sampling C vs. Refinement PC:** approximately 0.62.
* **Repeated Sampling PC vs. Refinement P:** approximately 0.42.
* **Repeated Sampling PC vs. Refinement C:** approximately 0.61.
* **Repeated Sampling PC vs. Refinement PC:** approximately 1.00.
* **Refinement P vs. Refinement C:** approximately 0.86.
* **Refinement P vs. Refinement PC:** approximately 0.86.
* **Refinement C vs. Refinement PC:** approximately 0.65.
The lowest values (around 0.3-0.4) generally occur between methods that differ significantly in their approach (e.g., Direct Generation vs. Repeated Sampling, or PC vs. non-PC variants).
### Key Observations
* Solvers within the same family (e.g., Direct Generation P, C, PC) generally show higher mutual coverage than solvers from different families.
* The "Repeated Sampling" solvers show relatively low coverage with respect to the "Direct Generation" solvers.
* "Refinement P" and "Refinement C" overlap heavily (0.86).
* The values involving "Refinement PC" are generally moderate, suggesting it shares solved problems with both the refinement family and the other solvers.
### Interpretation
This coverage matrix shows how the problem sets solved by the different strategies overlap when using DeepSeek-R1-70B. High within-family coverage suggests that variants of a single approach (plan, code, or planning-aided code) tend to solve similar problems, while lower cross-family coverage indicates that the approaches succeed on different subsets of tasks.
The heavy overlap between Refinement P and Refinement C suggests these two refinement variants behave consistently. The moderate values involving Refinement PC suggest it combines aspects of refinement and the other sampling strategies.
This information is valuable for selecting solvers: combining solvers with low mutual coverage maximizes the union of solved problems, whereas solvers with high mutual coverage are largely redundant. The matrix also highlights areas for further investigation, such as understanding why certain solvers exhibit low overlap with others.
</details>
Figure 20: Asymmetric relative coverage matrix of nine ARC solvers using DeepSeek-R1-70B, showing the proportion of problems whose test instances are solved by the row solver that are also solved by the column solver. P denotes the solution plan; C and PC refer to standalone and planning-aided code generation, respectively.
<details>
<summary>x21.png Details</summary>

### Visual Description
## Diagram: Digit to Matrix Mapping
### Overview
The image presents a diagram illustrating a function `generate_output_image` that maps digits (0-9) to 5x5 matrices represented by numerical values. The diagram shows example matrices for digits 7, 8, 6, 2, 1, and a placeholder for other digits. Arrows point from the digit representations (visualized as 5x5 pixel grids) to the corresponding code block defining the matrix. A text annotation "Copy the output matrices." is present on the right side of the diagram. A question mark is present next to the matrix for digit 1.
### Components/Axes
The diagram consists of the following components:
* **Code Block:** A Python code snippet defining the `generate_output_image` function.
* **Digit Representations:** 5x5 pixel grids visually representing the digits 7, 8, 6, 2, and 1.
* **Arrows:** Arrows connecting each digit representation to its corresponding matrix definition in the code.
* **Annotation:** "Copy the output matrices."
* **Question Mark:** A question mark next to the matrix for digit 1.
### Detailed Analysis or Content Details
The code snippet defines a function `generate_output_image(input_image)` that appears to convert an input image (presumably representing a digit) into a 5x5 matrix of numerical values. The function calculates the frequency of each pixel value (presumably 0 or 1) and determines the most frequent digit. It then returns a predefined 5x5 matrix corresponding to that digit.
Here's a breakdown of the matrices defined for each digit:
* **Digit 7:**
```
[
[7, 7, 7, 7, 7],
[7, 0, 0, 0, 7],
[7, 0, 7, 0, 7],
[7, 0, 0, 0, 7],
[7, 7, 7, 7, 7]
]
```
* **Digit 8:**
```
[
[8, 8, 8, 8, 8],
[8, 0, 0, 0, 8],
[8, 0, 8, 0, 8],
[8, 0, 0, 0, 8],
[8, 8, 8, 8, 8]
]
```
* **Digit 6:**
```
[
[6, 6, 6, 6, 6],
[6, 0, 0, 0, 6],
[6, 0, 6, 0, 6],
[6, 0, 0, 0, 6],
[6, 6, 6, 6, 6]
]
```
* **Digit 2:**
```
[
[2, 2, 2, 2, 2],
[2, 0, 0, 0, 2],
[2, 2, 2, 2, 2],
[2, 0, 0, 0, 2],
[2, 2, 2, 2, 2]
]
```
* **Digit 1:**
```
[
[1, 1, 1, 1, 1],
[1, 0, 0, 0, 1],
[1, 0, 0, 0, 1],
[1, 0, 0, 0, 1],
[1, 1, 1, 1, 1]
]
```
* **Other Digits (0, 3, 4, 5, 9):**
The code returns a 5x5 matrix filled with the digit's value. For example, if the digit is 5, it returns a matrix filled with 5s.
### Key Observations
* The matrices for digits 7, 8, 6, and 2 share a similar structure: a border of the digit's value and zeros inside.
* The matrix for digit 1 is also similar, but with a slightly different arrangement of zeros.
* The question mark next to the digit 1 matrix suggests a potential issue or uncertainty with its definition.
* The annotation "Copy the output matrices." indicates that these matrices are intended to be used elsewhere.
### Interpretation
The diagram illustrates why this repeated-sampling solution fails: rather than inferring the underlying transformation rule, the generated function `generate_output_image` memorizes the training outputs. It tallies the frequency of pixel values in the input, takes the most frequent value as the dominant digit (color), and returns a predefined 5x5 matrix copied from the training examples, as the annotation "Copy the output matrices." indicates; unseen values fall back to a solid 5x5 fill. The question mark next to the matrix for digit 1 flags a pattern that breaks the structure shared by the other matrices. Because the program copies observed outputs instead of capturing the general rule, it does not generalize to the test input, which is why the solution is incorrect.
</details>
Figure 21: ARC problem 358ba94e, where repeated sampling with standalone code generation produces an incorrect solution using GPT-o3-mini.
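The incorrect program described above can be reconstructed as a runnable sketch. The 5x5 matrices are copied from the figure description; the helper `digit_matrix` and the majority-vote step are our approximation of the figure's code, not the verbatim model output:

```python
from collections import Counter

def digit_matrix(d):
    """Predefined 5x5 output matrices, copied from the figure."""
    if d in (6, 7, 8):  # hollow square with a center dot
        return [[d] * 5, [d, 0, 0, 0, d], [d, 0, d, 0, d], [d, 0, 0, 0, d], [d] * 5]
    if d == 2:          # horizontal stripes
        return [[d] * 5, [d, 0, 0, 0, d], [d] * 5, [d, 0, 0, 0, d], [d] * 5]
    if d == 1:          # plain hollow square
        return [[d] * 5, [d, 0, 0, 0, d], [d, 0, 0, 0, d], [d, 0, 0, 0, d], [d] * 5]
    return [[d] * 5 for _ in range(5)]  # other values: solid fill

def generate_output_image(input_image):
    # Majority vote over non-zero pixel values, then copy the memorized matrix
    counts = Counter(p for row in input_image for p in row if p != 0)
    dominant, _ = counts.most_common(1)[0]
    return digit_matrix(dominant)
```

The structure makes the failure mode visible: the function is a lookup table over training outputs, so any test input whose correct output is not in the table cannot be produced.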
<details>
<summary>x22.png Details</summary>

### Visual Description
## Diagram: Code Generation Comparison
### Overview
The image presents a comparison between two approaches to code generation: "Repeated sampling with standalone code generation" and "Repeated sampling with planning-aided code generation." It visually contrasts the code solutions produced by each method, highlighting a correction made in the planning-aided approach. The diagram uses flow arrows and text annotations to explain the process and identify an "Incorrect condition" in the standalone method.
### Components/Axes
The diagram is divided into two main sections, one for each code generation approach. Each section contains:
* **Task ID:** "15696249" (top-left)
* **Method Title:** "Repeated sampling with standalone code generation" or "Repeated sampling with planning-aided code generation"
* **Code Solution:** A block of Python code.
* **Plan:** A textual description of the planning process.
* **Arrows:** Green arrows indicating the flow of execution or reasoning.
* **Annotations:** Text labels like "Incorrect condition." and a question mark.
### Detailed Analysis or Content Details
**Section 1: Repeated sampling with standalone code generation**
* **Code Solution:**
```python
def generate_output_image(input_image):
    output = [[0 for _ in range(9)] for _ in range(9)]
    for i in range(9):
        for j in range(9):
            if input_image[i][j] == 1:
                output[i][j] = input_image[i][j]
    return output
```
* **Annotation:** "Incorrect condition." (positioned near the `if` statement)
**Section 2: Repeated sampling with planning-aided code generation**
* **Plan:**
```
Look at the input's rows and columns to see whether
any entire row is "uniform" (all the same number) or any entire column is uniform...
Example 1: Input row 2 is [3, 3, 3] (uniform). So the entire input is copied horizontally ...
Example 2: Input column 1 is [2, 2, 2] (uniform). So the input is "stacked" vertically ...
Example 3: Input row 1 is [1, 1, 1] (uniform). So the input is copied horizontally ...
Example 4: Input column 1 is uniform (3, 3, 3), so the input is stacked vertically ...
```
* **Code Solution:**
```python
def generate_output_image(input_image):
    n = len(input_image)
    m = len(input_image[0])
    uniform_row = None
    for i in range(n):
        if all(input_image[i][j] == input_image[i][0] for j in range(m)):
            uniform_row = i
            break
    uniform_col = None
    for j in range(m):
        if all(input_image[i][j] == input_image[0][j] for i in range(n)):
            uniform_col = j
            break
    if uniform_row is not None:
        output = [input_image[uniform_row] for _ in range(n)]
    elif uniform_col is not None:
        output = [[input_image[i][uniform_col] for i in range(n)] for _ in range(m)]
    else:
        output = [[0 for _ in range(m)] for _ in range(n)]
        for i in range(n):
            for j in range(m):
                if input_image[i][j] == 1:
                    output[i][j] = input_image[i][j]
    return output
```
* **Annotation:** "The plan identifies correct conditions and implements the correct solution." (positioned near the code solution)
* **Annotation:** A question mark ("?") is placed near the final `else` block in the code.
### Key Observations
* The standalone code generation approach produces a simple, but incorrect, solution. The annotation highlights a flaw in the conditional logic.
* The planning-aided approach first analyzes the input to identify uniform rows or columns. This allows it to implement more efficient and correct copying or stacking logic.
* The planning-aided code includes a fallback to the standalone approach if no uniform rows or columns are found.
* The question mark suggests a potential area for further refinement or consideration in the planning-aided approach.
### Interpretation
The diagram demonstrates the benefit of incorporating planning into code generation. The standalone approach, lacking a high-level understanding of the input data, resorts to a brute-force solution that is flagged as incorrect. The planning-aided approach, by first analyzing the input for patterns (uniform rows/columns), can generate more optimized and accurate code. The question mark indicates that even with planning, there may be edge cases or areas for improvement.
The diagram illustrates a key principle in AI: providing context and reasoning capabilities (through planning) can significantly improve the quality and efficiency of generated code. The comparison highlights the importance of not just generating code, but *understanding* the problem before attempting a solution. The use of examples in the "Plan" section is crucial for guiding the code generation process.
</details>
Figure 22: ARC problem 15696249, where repeated sampling with standalone code generation produces an incorrect solution, whereas repeated sampling with planning-aided code generation yields the correct solution using GPT-o3-mini.
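The uniform-row/column tests at the core of the planning-aided solution can be exercised in isolation; a minimal sketch extracted and simplified from the figure's code:

```python
def find_uniform_row(image):
    """Return the index of the first row whose pixels are all equal, else None."""
    for i, row in enumerate(image):
        if all(p == row[0] for p in row):
            return i
    return None

def find_uniform_col(image):
    """Return the index of the first column whose pixels are all equal, else None."""
    for j in range(len(image[0])):
        if all(row[j] == image[0][j] for row in image):
            return j
    return None
```

In Example 1 of the plan, the input's last row is [3, 3, 3] (uniform), so the row branch fires; in Example 2, the first column is [2, 2, 2], so the column branch fires instead.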
<details>
<summary>x23.png Details</summary>

### Visual Description
## Diagram: Refinement with Planning-Aided Code Generation
### Overview
This diagram illustrates the iterative refinement process of code generation, specifically focusing on a task (d19f7514) involving image manipulation. It visually depicts the evolution of code through multiple stages: initial incorrect code, repeated sampling with planning-aided code generation, and finally, a base solution. The diagram uses a flowchart-like structure with code snippets and visual cues (red boxes, question marks) to highlight the progression and identify areas of persistent errors.
### Components/Axes
The diagram is structured into three main sections, vertically arranged:
1. **Initial Incorrect Code:** Located at the top, this section presents the first attempt at code generation.
2. **Incorrect Code After Refinements:** The middle section shows the code after several refinement iterations, still containing errors.
3. **Base Solution:** The bottom section displays the final, corrected code.
Each section contains code snippets with comments explaining the logic. Arrows indicate the flow of the refinement process. Red boxes highlight areas where incorrect logic persists. A question mark appears in the middle section, indicating uncertainty or a problem area.
### Detailed Analysis or Content Details
**1. Initial Incorrect Code:**
* **Function Definition:** `def generate_output_image(input_image):`
* **Step 1:** `cropped = [row[:6] for row in input_image[:6]]` - Crops the input image to the top 6 rows and first 6 columns.
* **Step 2:** `for r in range(len(cropped)): for c in range(len(cropped[0])):` - Iterates through each pixel in the cropped image.
* **Logic:** `if cropped[r][c] != 0: cropped[r][c] = 4` - Sets any non-zero pixel value to 4.
* **Step 3:** `for c in range(len(cropped)): for r in range(len(cropped)):` - Iterates through columns and rows.
* **Logic:** `if cropped[r][c] == 0 and cropped[r-1][c] == 4: cropped[r][c] = 4` - Propagates the value 4 downwards if the pixel above is 4.
* **Return:** `return cropped`
**2. Incorrect Code After Refinements:**
* **Function Definition:** `def generate_output_image(input_image):`
* **Cropped Image:** `cropped = input_image[:6]`
* **Rows:** `rows = len(cropped)`
* **Cols:** `cols = len(cropped[0])`
* **Step 2:** `base = [[4 if pixel != 0 else 0 for pixel in row] for row in cropped]` - Creates a base image where non-zero pixels are set to 4.
* **Output:** `output = [base[0][:]]`
* **Function Definition:** `def is_active(row): return any(pixel != 0 for pixel in row)`
* **Process Each Subsequent Row:** `for r in range(1, rows): new_row = base[r][:]`
* **Logic:** `if is_active(cropped[r-1]): for c in range(cols): if new_row[c] == 0 and cropped[r-1][c] != 0: new_row[c] = 4` - Propagates 4 downwards if the pixel above is non-zero.
* **Logic:** `if c > 0 and new_row[c-1] == 4: new_row[c] = 4` - Propagates 4 horizontally if the pixel to the left is 4.
* **Output:** `output.append(new_row)`
* **Return:** `return output`
**3. Base Solution:**
* **Function Definition:** `def generate_output_image(input_image):`
* **Cropped Image:** `cropped = input_image[:6]`
* **Rows:** `rows = len(cropped)`
* **Cols:** `cols = len(cropped[0])`
* **Step 1:** `base = [[4 if pixel != 0 else 0 for pixel in row] for row in cropped]` - Creates a base image where non-zero pixels are set to 4.
* **Output:** `output = [base[0][:]]`
* **Function Definition:** `def is_active(row): return any(pixel != 0 for pixel in row)`
* **Process Each Subsequent Row:** `for r in range(1, rows): new_row = base[r][:]`
* **Logic:** `if is_active(cropped[r-1]): for c in range(cols): if new_row[c] == 0 and cropped[r-1][c] != 0: new_row[c] = 4` - Propagates 4 downwards if the pixel above is non-zero.
* **Logic:** `if c > 0 and new_row[c-1] == 4: new_row[c] = 4` - Propagates 4 horizontally if the pixel to the left is 4.
* **Output:** `output.append(new_row)`
* **Return:** `return output`
### Key Observations
* The initial code only propagates the value 4 downwards, ignoring horizontal propagation.
* The "Incorrect Code After Refinements" section attempts to address this by adding horizontal propagation, but still contains errors (indicated by the red box and question mark).
* The "Base Solution" appears to be identical to the "Incorrect Code After Refinements", suggesting the final correction was minor or not fully reflected in the code snippet.
* The iterative process involves building a base image and then propagating values based on neighboring pixels.
### Interpretation
The diagram demonstrates the challenges of code generation, particularly in tasks requiring spatial reasoning and propagation. The initial code fails to capture the full requirements of the task, leading to an incorrect output. Subsequent refinements attempt to address these shortcomings, but the persistence of errors (highlighted in red) suggests that the underlying logic is still flawed. The final "Base Solution" may represent a partial fix or a convergence towards a correct solution, but the diagram doesn't provide enough information to definitively assess its accuracy. The iterative nature of the process, with repeated sampling and planning-aided code generation, highlights the importance of feedback and refinement in achieving a desired outcome. The question mark in the middle section suggests that the refinement process may have stalled or encountered a difficult-to-resolve issue. The diagram illustrates a common pattern in AI-driven code generation: initial attempts often require significant refinement to achieve correctness.
</details>
Figure 23: ARC problem d19f7514, where repeated sampling with planning-aided code generation produces a correct solution, whereas its refinement variant fails to refine the initial erroneous code, and the incorrect logic persists across subsequent refinements when using GPT-o3-mini.
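The downward-propagation logic described in the figure text can be condensed into a short runnable sketch (our reconstruction from the description, not the verbatim figure code):

```python
def generate_output_image(input_image):
    cropped = input_image[:6]  # keep the top 6 rows, as in the figure
    # Recolor every non-zero pixel to 4
    base = [[4 if p != 0 else 0 for p in row] for row in cropped]
    output = [base[0][:]]
    for r in range(1, len(cropped)):
        new_row = base[r][:]
        # If the previous row is "active" (contains any non-zero pixel),
        # propagate the value 4 downward into empty cells below it
        if any(p != 0 for p in cropped[r - 1]):
            for c in range(len(new_row)):
                if new_row[c] == 0 and cropped[r - 1][c] != 0:
                    new_row[c] = 4
        output.append(new_row)
    return output
```

The sketch makes the persistent flaw concrete: propagation only looks one row upward, so longer vertical runs or horizontal structure are never captured, matching the error the refinement variant fails to repair.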
### A.13 Prompts for LLMs
We include all prompts used by KAAR and the nine ARC solvers described in Section 3. We adopt a bash-like notation for input arguments within the prompts; for example, ${test_inputs} denotes the test input 2D matrices. A brief description of the prompts used for each solver is provided below.
- Direct generation with solution plan: Prompt 1 describes how to generate the solution plan, and Prompt 2 uses the generated plan to produce the output images.
- Direct generation with standalone code: Prompt 3 describes how to generate the code to produce the output images.
- Direct generation with planning-aided code: It first generates a solution plan using Prompt 1, then uses Prompt 4 to produce code based on the generated plan.
- Repeated sampling with solution plan: It can be regarded as an iterative version of direct generation with solution plan, and thus also uses Prompts 1 and 2.
- Repeated sampling with standalone code: It can be regarded as an iterative version of direct generation with standalone code, and thus also uses Prompt 3.
- Repeated sampling with planning-aided code: It can be regarded as an iterative version of direct generation with planning-aided code, and thus also uses Prompts 1 and 4.
- Refinement with solution plan: Prompt 5 describes the process of refining the generated solution plan with the validation samples. It uses Prompts 1 and 2 to generate the initial plan and the result image.
- Refinement with standalone code: Prompt 6 describes the process of refining the generated code with the validation samples. It uses Prompt 3 to produce the initial code solution.
- Refinement with planning-aided code: Prompt 7 describes the process of refining the generated plan and code with the validation samples. It uses Prompts 1 and 4 to generate the initial plan and to produce the initial code guided by the plan, respectively.
- KAAR: Prompt 8 describes the augmentation of objectness priors. Prompts 9 and 10 introduce the augmentation of geometry and topology priors, encoded as component attributes and relations, respectively. Prompt 11 outlines the augmentation of numbers and counting priors. Prompts 12 and 13 describe action selection and target component identification in the process of augmenting goal-directedness priors. For the prompts implementing the details of each action, please refer to our code.
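The bash-like placeholders can be filled with ordinary template substitution; a minimal sketch using Python's `string.Template` (the prompt text here is abbreviated, not the full Prompt 1):

```python
from string import Template

# Abbreviated fragment of a user prompt; the real prompts contain more text.
prompt = Template("The test input image(s): ${test_inputs}")
filled = prompt.substitute(test_inputs="[[0, 0], [0, 0]]")
```

`string.Template` uses exactly the `${name}` syntax adopted in the prompts, which is why we sketch it here rather than f-strings.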
Prompt 1: Direct generation with solution plan - solution plan generation.
================================ System ================================
You are an expert in analyzing grid-based image processing tasks. Your objective is to derive a text transformation plan (not Python code) from each given input-output image pair (both represented as 2D matrices), and then apply this plan to generate output image(s), represented as a 2D matrix, based on the given test input image(s) (2D matrix). Ensure that the derived plan generalizes across different cases while preserving consistency with the observed transformations.
================================= User =================================
The input data consists of a few pairs of input and output images, where the left image in each pair represents the input, and the right image represents the corresponding output. Each image can be represented as a 2D matrix: ${matrix}
Please note that each number in the matrix corresponds to a pixel, and its value represents the color.
Derive a text transformation plan (not Python code) that maps each given input image (2D matrix) to its corresponding output image (2D matrix). Ensure that the plan generalizes across different cases and the test input image(s) (2D matrix) while maintaining consistency with the observed transformations.
The test input image(s): ${test_inputs}
Prompt 2: Direct generation with solution plan - output image(s) generation from the plan.
================================ System ================================
You are an expert in analyzing grid-based image processing tasks. Your objective is to generate output image(s), represented as a 2D matrix, based on the given input image(s) (2D matrix) and a derived text transformation plan.
================================= User =================================
Please generate the output image(s) as a 2D matrix (not Python code) based on the given input image(s) (2D matrix) and the text transformation plan. Output only the test output image(s) in 2D matrix format (not Python code). For each test input image, start with [Start Output Image] and end with [End Output Image].
For example, if there is one test input image, the output image should be:
[Start Output Image]
[[0,0,0], [0,0,0], [0,0,0]]
[End Output Image]
If there are multiple (e.g., two) test input images, the output images should be formatted as:
[Start Output Image]
[[0,0,0], [0,0,0], [0,0,0]]
[End Output Image]
[Start Output Image]
[[1,1,1], [1,1,1], [1,1,1]]
[End Output Image]
The test input image(s): ${test_inputs}
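Responses following this format can be parsed by extracting the text between the markers; a sketch (the marker strings come from the prompt, while the parsing code is our assumption about one possible implementation):

```python
import re
import ast

def parse_output_images(response):
    """Extract every 2D matrix between the [Start/End Output Image] markers."""
    blocks = re.findall(
        r"\[Start Output Image\]\s*(.*?)\s*\[End Output Image\]",
        response,
        flags=re.DOTALL,
    )
    return [ast.literal_eval(b) for b in blocks]

reply = """[Start Output Image]
[[0,0,0], [0,0,0], [0,0,0]]
[End Output Image]
[Start Output Image]
[[1,1,1], [1,1,1], [1,1,1]]
[End Output Image]"""
images = parse_output_images(reply)
```

Using `ast.literal_eval` rather than `eval` keeps parsing safe against arbitrary code in model output.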
Prompt 3: Direct generation with standalone code.
================================ System ================================
You are an expert in analyzing grid-based image processing tasks. Your goal is to generate Python code that produces output image(s), represented as a 2D matrix, based on the given input image(s) (2D matrix).
================================= User =================================
The input data consists of a few pairs of input and output images, where the left image in each pair represents the input and the right image represents the corresponding output.
Each image can be represented as a 2D matrix: ${matrix}
The test input image(s): ${test_inputs}
Please note that each number in the matrix corresponds to a pixel, and its value represents the color.
Generate a Python script to map each input image (2D matrix) to the corresponding output image (2D matrix).
Ensure that the Python script generalizes across different cases and test input image(s) while maintaining consistency with the observed input-output image pairs.
Please output the Python program, starting with [Start Program] and ending with [End Program].
Include an assert statement with the function signature to verify that the generated output matches the expected result, starting with [Assert Statement].
Use placeholders like input_image and output_image for the variables representing the input and output images.
For example:
[Start Program]
def generate_output_image(input_image):
    rows = len(input_image)
    cols = len(input_image[0])

    def dfs(r, c):
        """Depth-first search to mark all 4-connected '1's to '2's."""
        if r < 0 or r >= rows or c < 0 or c >= cols or input_image[r][c] != 1:
            return
        # Change the current component from 1 to 2
        input_image[r][c] = 2
        # Explore neighbors (up, down, left, right)
        dfs(r - 1, c)  # Up
        dfs(r + 1, c)  # Down
        dfs(r, c - 1)  # Left
        dfs(r, c + 1)  # Right

    # Traverse the image to find all components with '1'
    for r in range(rows):
        for c in range(cols):
            if input_image[r][c] == 1:
                dfs(r, c)
    return input_image
[End Program]
[Assert Statement]
assert generate_output_image(input_image) == output_image
Please note, the assert statement should strictly follow the provided format, and the output image should be represented in list format!
Please note, the script should not include an if __name__ == "__main__": block.
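The [Start Program]/[End Program] markers and the assert contract imply a simple verification loop over training pairs. A hypothetical sketch of such a harness (function names and the `exec`-based execution are our own assumptions; the paper's actual harness may differ):

```python
import re

def extract_program(response: str) -> str:
    """Pull the code between [Start Program] and [End Program]."""
    match = re.search(r"\[Start Program\](.*?)\[End Program\]", response, re.DOTALL)
    if match is None:
        raise ValueError("no [Start Program] block found")
    return match.group(1).strip()

def passes_training_pairs(code: str, pairs) -> bool:
    """Run the generated function on every training pair; any mismatch or error fails."""
    namespace = {}
    try:
        exec(code, namespace)  # defines generate_output_image
        fn = namespace["generate_output_image"]
        # Copy each row so a mutating solution (like the DFS example) cannot corrupt the pair.
        return all(fn([row[:] for row in inp]) == out for inp, out in pairs)
    except Exception:
        return False
```

In a repeated-sampling setting, only candidates for which `passes_training_pairs` holds on all training pairs would be retained and applied to the test inputs.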
Prompt 4: Direct generation with planning-aided code - code generation based on the generated plan.
================================ System ================================
You are an expert in analyzing grid-based image processing tasks. Your goal is to generate Python code that produces output image(s) represented as a 2D matrix, based on the given input image(s) (2D matrix). This code should be generated using a text transformation plan inferred from a set of input-output image pairs (both represented as 2D matrices).
================================= User =================================
Generate a Python script based on your text transformation plan to map the input image (2D matrix) to the output image (2D matrix). Please output the Python program, starting with [Start Program] and ending with [End Program]. Include an assert statement with the function signature to verify that the generated output matches the expected result, starting with [Assert Statement]. Use placeholders like input_image and output_image for the variables representing the input and output images.
For example:
[Start Program]
def generate_output_image(input_image):
    rows = len(input_image)
    cols = len(input_image[0])

    def dfs(r, c):
        """Depth-first search to mark all 4-connected '1's to '2's."""
        if r < 0 or r >= rows or c < 0 or c >= cols or input_image[r][c] != 1:
            return
        # Change the current component from 1 to 2
        input_image[r][c] = 2
        # Explore neighbors (up, down, left, right)
        dfs(r - 1, c)  # Up
        dfs(r + 1, c)  # Down
        dfs(r, c - 1)  # Left
        dfs(r, c + 1)  # Right

    # Traverse the image to find all components with '1'
    for r in range(rows):
        for c in range(cols):
            if input_image[r][c] == 1:
                dfs(r, c)
    return input_image
[End Program]
[Assert Statement]
assert generate_output_image(input_image) == output_image
Please note, the assert statement should strictly follow the provided format, and the output image should be represented in list format!
Please note, the script should not include an if __name__ == "__main__": block.
Prompt 5: Refinement with solution plan - plan refinement.
================================ System ================================
As an expert in analyzing grid-based image processing tasks, your objective is to refine your solution plan based on the provided feedback.
================================= User =================================
The problem description:
[start problem description]
The input data consists of a few pairs of input and output images, where the left image in each pair represents the input, and the right image represents the corresponding output. Each image can be represented as a 2D matrix: ${matrix}
Please note that each number in the matrix corresponds to a pixel, and its value represents the color.
[end problem description]
The INCORRECT text transformation plan fails to solve some example training input and output pairs in the above problem!
[start incorrect transformation plan]
${plan}
[end incorrect transformation plan]
The incorrect output (s) generated by the incorrect plan:
[start incorrect output]
${incorrect_output}
[end incorrect output]
The generated correct output (s):
[start correct output]
${correct_output}
[end correct output]
Please analyze the incorrect reasoning step-by-step, and then generate the revised correct transformation plan (text only), starting with [Start Revised Transformation Plan] and ending with [End Revised Transformation Plan]. Ensure that the revised transformation plan generalizes across different cases and the test input image(s), while maintaining consistency with the observed transformations.
Prompt 6: Refinement with standalone code - code refinement.
================================ System ================================
As an expert in analyzing grid-based image processing tasks, your objective is to refine your program based on the provided feedback.
================================= User =================================
The problem description:
[start problem description]
The input data consists of a few pairs of input and output images, where the left image in each pair represents the input, and the right image represents the corresponding output. Each image can be represented as a 2D matrix: ${matrix}
Please note that each number in the matrix corresponds to a pixel, and its value represents the color.
[end problem description]
The generated incorrect program fails to solve some example training input and output pairs in the above problem!
[start incorrect program]
${code}
[end incorrect program]
The incorrect output (s) generated by the incorrect program:
[start incorrect output]
${incorrect_output}
[end incorrect output]
The generated correct output (s):
[start correct output]
${correct_output}
[end correct output]
Please analyze the incorrect reasoning step-by-step, and then generate the revised program (Python program only), starting with [Start Revised Program] and ending with [End Revised Program]. Ensure that the revised program generalizes across different cases and the test input image(s), while maintaining consistency with the observed input and output image pairs.
Please include an assert statement with the function signature to verify that the generated output matches the expected result, starting with [Assert Statement]. Use placeholders like input_image and output_image for the variables representing the input and output images.
For example:
[Start Revised Program]
def generate_output_image(input_image):
    rows = len(input_image)
    cols = len(input_image[0])

    def dfs(r, c):
        """Depth-first search to mark all 4-connected '1's to '2's."""
        if r < 0 or r >= rows or c < 0 or c >= cols or input_image[r][c] != 1:
            return
        # Change the current component from 1 to 2
        input_image[r][c] = 2
        # Explore neighbors (up, down, left, right)
        dfs(r - 1, c)  # Up
        dfs(r + 1, c)  # Down
        dfs(r, c - 1)  # Left
        dfs(r, c + 1)  # Right

    # Traverse the image to find all components with '1'
    for r in range(rows):
        for c in range(cols):
            if input_image[r][c] == 1:
                dfs(r, c)
    return input_image
[End Revised Program]
[Assert Statement]
assert generate_output_image(input_image) == output_image
Please note, the assert statement should strictly follow the provided format, and the output image should be represented in list format!
Please note, the script should not include an if __name__ == "__main__": block.
Prompt 7: Refinement with planning-aided code - refinement on both generated plan and code.
================================ System ================================
As an expert in analyzing grid-based image processing tasks, your objective is to refine your transformation plan and program based on the provided feedback.
================================= User =================================
The problem description:
[start problem description]
The input data consists of a few pairs of input and output images, where the left image in each pair represents the input, and the right image represents the corresponding output. Each image can be represented as a 2D matrix: ${matrix}
Please note that each number in the matrix corresponds to a pixel, and its value represents the color.
[end problem description]
The generated incorrect transformation plan and program fail to solve some example training input and output pairs in the above problem!
[start incorrect transformation plan]
${plan}
[end incorrect transformation plan]
[start incorrect program]
${code}
[end incorrect program]
The incorrect output (s) generated by the incorrect transformation plan and program:
[start incorrect output]
${incorrect_output}
[end incorrect output]
The generated correct output (s):
[start correct output]
${correct_output}
[end correct output]
Please analyze the incorrect reasoning step-by-step, and then generate the revised transformation plan (text only) and program (Python program only).
For the revised transformation plan, start with [Start Revised Transformation Plan] and end with [End Revised Transformation Plan]. Ensure that the revised transformation plan generalizes across different cases and the test input image(s), while maintaining consistency with the observed transformations.
For the revised Python program, start with [Start Revised Program] and end with [End Revised Program]. Ensure that the revised program generalizes across different cases and the test input image(s), while maintaining consistency with the observed input and output image pairs.
For the revised Python program, please include an assert statement with the function signature to verify that the generated output matches the expected result, starting with [Assert Statement]. Use placeholders like input_image and output_image for the variables representing the input and output images.
For example:
[Start Revised Program]
def generate_output_image(input_image):
    rows = len(input_image)
    cols = len(input_image[0])

    def dfs(r, c):
        """Depth-first search to mark all 4-connected '1's to '2's."""
        if r < 0 or r >= rows or c < 0 or c >= cols or input_image[r][c] != 1:
            return
        # Change the current component from 1 to 2
        input_image[r][c] = 2
        # Explore neighbors (up, down, left, right)
        dfs(r - 1, c)  # Up
        dfs(r + 1, c)  # Down
        dfs(r, c - 1)  # Left
        dfs(r, c + 1)  # Right

    # Traverse the image to find all components with '1'
    for r in range(rows):
        for c in range(cols):
            if input_image[r][c] == 1:
                dfs(r, c)
    return input_image
[End Revised Program]
[Assert Statement]
assert generate_output_image(input_image) == output_image
Please note, the assert statement should strictly follow the provided format, and the output image should be represented in list format!
Please note, the script should not include an if __name__ == "__main__": block.
Prompt 8: Objectness priors augmentation
================================ System ================================
You are an expert in grid-based image analysis.
================================= User =================================
The training instances consist of several pairs of input and output images, where the left image in each pair represents the input and the right image represents the corresponding output.
Please note that the test instance(s) only contain input image(s).
Each image is represented as a 2D matrix:
${matrix}
Please note that each number in the matrix corresponds to a pixel and its value represents the color.
We treat the color represented by the number ${background_color} as the background color.
${abstraction_rule}
The components in each input and output image pair are as follows:
${component_description}
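The ${component_description} placeholder is filled by a component extraction step under the given abstraction rule. One plausible rule, 4-connected same-color regions over the background color, can be sketched as follows (the function name and output schema are illustrative assumptions, not the paper's implementation):

```python
from collections import deque

def extract_components(image, background_color=0):
    """Group non-background pixels into 4-connected, same-color components."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] == background_color:
                continue
            color, cells = image[r][c], []
            queue = deque([(r, c)])
            seen[r][c] = True
            # Breadth-first flood fill over 4-connected neighbors of the same color.
            while queue:
                cr, cc = queue.popleft()
                cells.append((cr, cc))
                for nr, nc in ((cr - 1, cc), (cr + 1, cc), (cr, cc - 1), (cr, cc + 1)):
                    if 0 <= nr < rows and 0 <= nc < cols \
                            and not seen[nr][nc] and image[nr][nc] == color:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            components.append({"color": color, "cells": cells})
    return components
```

Other abstraction rules (e.g., 8-connectivity, or grouping all non-background pixels regardless of color) would change which components are reported, which is why the prompt parameterizes the rule via ${abstraction_rule}.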
Prompt 9: Geometry and topology priors augmentation - component attributes
================================ System ================================
You are an expert in geometry and topology analysis. Below is a summary of component attributes, including:
Size (Width and Height); Color; Shape; Symmetry; Bounding Box; Hole Count; Nearest Boundary.
================================= User =================================
${geometry_and_topology_priors_attributes}
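Several of the listed attributes can be computed directly from a component's cell set. A minimal sketch covering bounding box, size, and a horizontal-symmetry check (an illustrative subset of the attributes above; the function name and return schema are our own assumptions):

```python
def component_attributes(cells):
    """Bounding box, dimensions, size, and horizontal symmetry for one component."""
    rows = [r for r, _ in cells]
    cols = [c for _, c in cells]
    top, bottom = min(rows), max(rows)
    left, right = min(cols), max(cols)
    cell_set = set(cells)
    # Horizontal symmetry: mirror each cell across the bounding box's vertical midline.
    h_symmetric = all((r, left + right - c) in cell_set for r, c in cells)
    return {
        "bounding_box": (top, left, bottom, right),
        "width": right - left + 1,
        "height": bottom - top + 1,
        "size": len(cells),
        "horizontal_symmetry": h_symmetric,
    }
```

Hole counting is more involved (e.g., flood-filling background regions not reachable from the image border) and is omitted here.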
Prompt 10: Geometry and topology priors augmentation - component relations
================================ System ================================
You are an expert in geometry and topology analysis. Below is a summary of component relations, including:
Different from / identical to other components; Inclusion; Touching or not touching other components; Spatial relations.
================================= User =================================
${geometry_and_topology_priors_relations}
Prompt 11: Numbers and counting priors augmentation
================================ System ================================
You are an expert in numbers and counting analysis. Below is a summary of component statistics, including:
Symmetry numerical summary; Size numerical summary; Color numerical summary; Shape numerical summary; Hole counting summary.
================================= User =================================
${numbers_and_couting_priors}
Prompt 12: Goal-directedness priors augmentation - action selection
================================ System ================================
You are an expert in analyzing and categorizing grid-based image tasks.
================================= User =================================
Please determine which category or categories this task belongs to. Please select from the following:
1. color change: color change involves modifying the color value of a component; the component's size and position never change.
2. movement: movement involves shifting the position of a component to a new location within the image; the component's size never changes.
3. extension: extending involves expanding the boundaries of a component to increase its size or reach within the image; the component's size always changes.
4. completing: completing an image involves filling in missing or incomplete parts of a component to achieve a coherent and fully formed image.
5. resizing: resizing involves altering the dimensions of a component by expanding or shrinking its size within the image.
6. selecting: selecting involves identifying and isolating a specific component within the image as the output component; the component's size and color never change.
7. copying: copying involves duplicating a component and either placing the duplicate in a new location or replacing the existing component within the image.
8. flipping: flipping involves mirroring a component along a specified axis to reverse its orientation within the image.
9. rotation: rotation involves turning a component around a fixed point or center by a specified angle within the image.
10. cropping: cropping involves cutting out a specific portion of a component.
Please select the one or more categories from the provided list that best describe the task.
Format your response by starting with [start category] and ending with [end category], numbering each category selected.
For example, if the task belongs only to " color change ", your response should be:
[start category]
1. color change
[end category]
If the task belongs to both " selecting " and " extension ", your response should be:
[start category]
1. selecting
2. extension
[end category]
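The [start category]/[end category] response format above can likewise be parsed mechanically. A minimal sketch (the function name `parse_categories` is an illustrative assumption):

```python
import re

def parse_categories(response: str) -> list:
    """Read the numbered category names between [start category] and [end category]."""
    match = re.search(r"\[start category\](.*?)\[end category\]", response, re.DOTALL)
    if match is None:
        return []
    lines = [line.strip() for line in match.group(1).splitlines() if line.strip()]
    # Strip the leading "N. " numbering from each selected category.
    return [re.sub(r"^\d+\.\s*", "", line) for line in lines]
```

The resulting category names could then be used to select the ${action}-specific follow-up prompt shown next.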
Prompt 13: Goal-directedness priors augmentation - target component identification
================================ System ================================
You are an expert in analyzing grid-based image tasks, specifically in ${action} components.
================================= User =================================
If this task involves ${action}:
1. Begin by identifying WHICH COMPONENTS are to be ${action} in all input images (training and test pairs).
- Refer to these components as TARGET components (e.g., component 1 in the first input image, component 2 and component 3 in the second input image, etc.).
- List ALL target components in each training and test input image.
- For EACH target component, provide:
- Attribute analysis result
- Relation analysis result
- Numerical analysis result
2. Determine the CONDITIONS used to select these TARGET components for ${action} from each training and test input image.
- These conditions must be based on properties common to all target components and must distinguish them from the unselected components.
- For example: the size of all target components might be equal to 3 while the size of the unselected components is not 3.
2.1. Analyze whether these conditions are EMPTY or not.
2.2. Evaluate if these conditions are derived from attribute analysis, including:
2.2.1. Color
2.2.2. Size
2.2.3. Shape
2.2.4. Width
2.2.5. Height
2.2.6. The number of holes
2.2.7. Bounding box
2.2.8. Symmetry
2.2.9. Nearest boundary
2.3. Evaluate if these conditions are derived from relation analysis, including:
2.3.1. Relative position with other components
2.3.2. Touching with other components
2.3.3. Whether they differ from or are identical with other components
2.3.4. Enclosure of other components
2.4. Evaluate if these conditions are derived from numerical analysis, including:
2.4.1. Symmetry numerical analysis
2.4.2. Size numerical analysis
2.4.3. Color numerical analysis
2.4.4. Shape numerical analysis
2.4.5. Hole counting analysis
You must evaluate each condition ONE by ONE and determine the best conditions.
Note:
- The conditions MUST work for ALL training and test input and output image pairs.
- Conditions CANNOT come from the output images!
- A condition can be EMPTY.
- If a condition is based on numerical features (e.g., size (width and height), or the number of holes), you may use the operators =, <, >, >=, or <=.
- For cropping or selecting tasks, consider using a bounding box to extract each component.