# Systematic Abductive Reasoning via Diverse Relation Representations in Vector-symbolic Architecture
**Authors**: Zhong-Hua Sun, Ru-Yuan Zhang, Zonglei Zhen, Da-Hui Wang, Yong-Jie Li, Xiaohong Wan, Hongzhi You
> Corresponding author: Hongzhi You. Zhong-Hua Sun, Yong-Jie Li and Hongzhi You are with the School of Life Science and Technology, University of Electronic Science and Technology of China (UESTC), Chengdu, China. Ru-Yuan Zhang is with the Brain Health Institute, National Center for Mental Disorders, Shanghai Mental Health Center, Shanghai Jiao Tong University School of Medicine and School of Psychology, Shanghai, China. Zonglei Zhen and Xiaohong Wan are with the State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University, Beijing, China. Da-Hui Wang is with the School of Systems Science, Beijing Normal University, Beijing, China.
## Abstract
In abstract visual reasoning, monolithic deep learning models suffer from limited interpretability and generalization, while existing neuro-symbolic approaches fall short in capturing the diversity and systematicity of attributes and relation representations. To address these challenges, we propose a Systematic Abductive Reasoning model with diverse relation representations (Rel-SAR) in Vector-symbolic Architecture (VSA) to solve Raven’s Progressive Matrices (RPM). To derive attribute representations with symbolic reasoning potential, we introduce not only various types of atomic vectors that represent numeric, periodic and logical semantics, but also the structured high-dimensional representation (SHDR) for the overall Grid component. For systematic reasoning, we propose novel numerical and logical relation functions and perform rule abduction and execution in a unified framework that integrates these relation representations. Experimental results demonstrate that Rel-SAR achieves significant improvement on RPM tasks and exhibits robust out-of-distribution generalization. Rel-SAR leverages the synergy between HD attribute representations and symbolic reasoning to achieve systematic abductive reasoning with both interpretable and computable semantics.
Index Terms: Abstract visual reasoning, relation representation, vector-symbolic architecture.
## I Introduction
Raven’s Progressive Matrices (RPM) are a family of psychological intelligence tests widely used for the assessment of abstract reasoning [1, 2]. From a cognitive psychology perspective, abstract visual reasoning in RPM tests involves constructing high-level representations from images and deriving potential relations from these representations [1, 3]. Endowing artificial intelligence with such capabilities is now regarded as a crucial step toward achieving human-level intelligence. However, many recent monolithic deep learning models, which do not explicitly separate perception and reasoning [4, 5, 6, 7, 8, 9], face inherent challenges, such as poor interpretability, limited robustness and generalization, and difficulties in module reuse [10]. Neuro-symbolic architecture, which combines neural visual perception with symbolic reasoning, offers a promising approach to overcoming these challenges and achieving human-level interpretability and generalization [11, 10, 12].
In neuro-symbolic architectures (NSA), Marcus argues that symbol manipulation in cognition involves representing relations between variables [11]. For RPM tests, object attributes serve as the variables, while potential rules involve the relations. Nevertheless, due to incomplete attribute and relation representations, achieving systematic abduction and execution remains a critical challenge for NSA when performing RPM tests. From the perspective of attributes, recent models such as PrAE [10], the ALANS learner [13], and NVSA (neuro-vector-symbolic architecture) [12] construct attribute representations through neural perception frontends. Notably, the NVSA model achieves hierarchically structured VSA representations of image panels, capturing multiple objects with multiple attributes [12]. Regarding relation representations, PrAE and NVSA achieve abstract reasoning through probabilistic abduction and execution [10] and distributed vector-symbolic architecture (VSA) [12], respectively. Both models rely on multiple predetermined rule templates, each specialized for a distinct RPM rule. To address the limitations in rule expressiveness, the ALANS learner utilizes learnable rule operators in an abstract algebraic structure, without manually defining every rule [13]. Additionally, the ARLC model adopts a more expressive VSA-based rule template, operating in the rule parameter space [14]. Both models offer improved interpretability and generalizability. Despite these advances, previous models fall short in capturing the diversity and systematicity of attribute and relation representations. In contrast, human cognition demonstrates rich and flexible internal representations [15, 16], including arithmetic and logic, and rule-based reasoning systems in cognition are productive and systematic [17]. Therefore, the abstract visual reasoning performance of these models remains open to further improvement.
Previous research indicates that Vector Symbolic Architecture (VSA), a form of high-dimensional (HD) distributed representation, possesses algebraic properties for mathematical operations and can also achieve structured symbolic representations of data [18, 19, 20]. In this work, to achieve comprehensive relation representations, we introduce various types of VSA-based atomic HD vectors with distinct semantic representations, including numeric values, periodic values, and logical values. Given that reasoning in RPM problems involves the overall attributes of multiple objects, we further introduce the structured HD representation (SHDR) for the nxn Grid. These representations serve as the attribute representations required for abductive reasoning. Meanwhile, we propose numerical and logical relation functions as relation representations, which take multiple HD attribute representations as input and define relations among them. Unlike rule templates designed for individual rules, the two proposed relation functions are tailored to the numerical and logical rule types, respectively, providing strong rule expressiveness.
Here, we propose a Systematic Abductive Reasoning model with diverse relation representations (Rel-SAR) for solving RPM, inspired by the original NVSA model [12]. In the Rel-SAR model, visual attribute extraction and rule inference are implemented within a fully unified computational framework in the VSA machinery. The model comprises a neuro-vector frontend for perceiving object attributes of all raw images in RPM problems and a generic vector-symbolic backend for achieving symbolic reasoning. The perception frontend operates on scene-based SHDR of each image panel, which contains multiple objects, each with various attributes, and predicts HD attribute representations by VSA-based symbolic manipulations. The reasoning backend implements the core idea of systematic abductive reasoning: if the given attributes in an RPM adhere to a specific numerical or logical rule, then the relation representations of all attribute pairs can be defined using the corresponding relation functions with identical parameters. These diverse relation representations are involved in both rule abduction and execution phases, enhancing interpretability and improving the capacity for systematic abductive reasoning.
## II Related Work
Figure 1: Illustrations for the RAVEN dataset. (a) An example of an RPM test from the RAVEN [21] dataset. In an RPM test, there are 8 context panels and 8 candidate panels. Participants are required to identify the underlying rules governing various attributes within the context panels. Subsequently, participants use these rules to infer the attributes of the missing panel (represented by "?") and choose the most appropriate option (highlighted with a red box) from the answer panels. (b) The RAVEN dataset includes seven configurations: Center, 2x2Grid, 3x3Grid, Left-Right (L-R), Up-Down (U-D), Out-InCenter (O-IC) and Out-InGrid (O-IG) [21]. Four types of rules, i.e., Constant, Progression, Arithmetic, and Distribute Three, are applied to five attributes, i.e., Position, Number, Type, Size, and Color, in a row-wise manner. The I-RAVEN dataset [8] is a variant of RAVEN, where answer sets are generated using an attribute bisection tree.
Raven's Progressive Matrices (RPM) are a widely used nonverbal intelligence test designed to assess abstract reasoning. To explore the limitations of current machine learning approaches in solving abstract reasoning tasks, two automatically generated RPM-based datasets, RAVEN [21] and I-RAVEN [8], have been introduced (Figure 1). Early efforts on RPM primarily employed the Relation Network (RN) [22] and its variants [4, 23, 7, 9] to extract relations between context panels. Concurrently, CoPINet [6], MLCL [24], and DCNet [25] integrate contrastive learning into their models. Approaches like MRNet [9] and DRNet [26] aim to enhance perception capabilities, while SRAN [8] and PredRNet [27] abstract relations using stratified models and prediction errors, respectively. In addition, several methods have focused on scene decomposition and feature disentanglement [28, 29, 30]. Although these monolithic deep learning models achieve high accuracy, they often suffer from limited interpretability and systematic generalization capabilities.
Another branch for solving RPM is based on neuro-symbolic architectures, which explicitly distinguish between perception and reasoning. PrAE [10] employs an object CNN to generate probabilistic scene representations and uses predetermined rule templates for probabilistic abduction and execution. Inspired by abstract algebra and representation theory, ALANS [13], which shares the same perception frontend as PrAE, transforms probabilistic scene distributions into matrix-based algebraic representations. The algebraic reasoning backend of ALANS induces potential rules through trainable operator matrices, eliminating the need for manual rule definitions. In abstract reasoning, Vector Symbolic Architectures (VSA) serve as a bridge between perception and reasoning modules by leveraging their structured distributed representations and algebraic properties. NVSA [12] projects each RPM panel into a high-dimensional vector using a trainable CNN and derives probability mass functions (PMFs) by querying an external codebook. Its reasoning backend embeds these PMFs into distributed VSA representations and performs rule abduction and execution using templates based on VSA algebraic operations. NVSA provides a differentiable and transparent implementation of probabilistic abductive reasoning by leveraging VSA representations and operators. However, its perception frontend requires searching a large external codebook, and its reasoning backend still relies on predetermined rule templates. In contrast, Learn-VRF [31] focuses on reasoning by learning VSA rule formulations, eliminating the need for predetermined templates. ARLC [14] further enhances reasoning by incorporating context augmentation and extending rule templates to accommodate more diverse rules. While ARLC and Learn-VRF implement systematic rule learning, they still struggle to process all RPM rules due to limitations in attribute representation. Recently, a class of methods known as relational bottlenecks has been proposed to enable efficient abstraction, but their capacity to handle complex relations remains uncertain [32, 33, 34, 35]. To address this limitation, Rel-SAR transforms perceptual inputs into high-dimensional attribute representations with symbolic reasoning potential and abduces both logical and numerical rules within a unified framework.
## III Preliminaries
### III-A VSA models utilized in this study
VSAs are a class of computational models that utilize high-dimensional distributed representations [20]. The VSA models used in this study are Holographic Reduced Representations (HRR) and their frequency-domain form, referred to as Fourier Holographic Reduced Representations (FHRR) [36]. A random FHRR atomic vector, denoted as $\boldsymbol{\theta}:=\left\{\theta_{i}\right\}_{i=1}^{d}$ , is composed of elements $\theta_{i}$ that are independently sampled from a uniform distribution, specifically $\theta_{i}\sim\mathcal{U}(-\pi,\pi)$ [36]. The corresponding HRR atomic vector, $\boldsymbol{x}$ , is obtained by applying the Inverse Fast Fourier Transform (IFFT) to $\boldsymbol{\theta}$ :
$$
\boldsymbol{x}=\mathcal{F}^{-1}\left(e^{j\boldsymbol{\theta}}\right) \tag{1}
$$
Here, $\mathcal{F}(\cdot)$ and $\mathcal{F}^{-1}(\cdot)$ denote the Fast Fourier Transform (FFT) and the Inverse FFT (IFFT), respectively. When the dimension $d$ is sufficiently large, these randomly generated vectors exhibit pseudo-orthogonality, making them suitable for representing distinct symbols or concepts.
The similarity between any two vectors is a crucial metric for evaluating the distributed representations in VSAs. In FHRR and HRR, cosine similarity is employed to measure the similarity between two vectors [20]:
$$
\begin{split}
sim(\boldsymbol{\theta},\boldsymbol{\phi})&=\frac{1}{d}\sum_{i=1}^{d}\cos\left(\theta_{i}-\phi_{i}\right)\\
sim(\boldsymbol{x},\boldsymbol{y})&=\frac{\boldsymbol{x}\cdot\boldsymbol{y}}{\left\|\boldsymbol{x}\right\|\,\left\|\boldsymbol{y}\right\|}
\end{split} \tag{2}
$$
where $\boldsymbol{\theta}$ and $\boldsymbol{\phi}$ denote two FHRR vectors, and $\boldsymbol{x}$ and $\boldsymbol{y}$ two HRR vectors. The similarity $sim(\cdot,\cdot)$ ranges from $-1$ to $+1$, and the two similarity measures above are equivalent. Pseudo-orthogonality refers to the case where $sim(\cdot,\cdot)\approx 0$.
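For concreteness, the following minimal NumPy sketch (our own illustration, not code released with the paper) samples FHRR atomic vectors and checks their pseudo-orthogonality using the similarity of Equation 2:

```python
import numpy as np

d = 2048                          # dimensionality; larger d tightens pseudo-orthogonality
rng = np.random.default_rng(0)

def random_fhrr():
    """FHRR atomic vector: i.i.d. phases sampled from U(-pi, pi)."""
    return rng.uniform(-np.pi, np.pi, size=d)

def sim(theta, phi):
    """FHRR cosine similarity (Equation 2)."""
    return np.mean(np.cos(theta - phi))

theta, phi = random_fhrr(), random_fhrr()
print(sim(theta, theta))          # 1.0: identical vectors
print(sim(theta, phi))            # ~0.0: random vectors are pseudo-orthogonal
```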
### III-B Basic operations and structured symbolic representations
All computations within VSAs are composed of several basic vector algebraic operations, the primary ones being binding ( $\circ$ ), bundling ( $+$ ) and unbinding ( $\oslash$ ) (Table I). The binding operation ( $\circ$ ) is employed to form a representation of an object that contains information about the context in which it was encountered [20]. The bundling operation ( $+$ ), also known as superposition, generates a composite high-dimensional vector that combines several lower-level representations. In expressions, binding takes precedence over bundling. The unbinding operation ( $\oslash$ ), the inverse of binding, extracts a constituent from a compound data structure. Binding and bundling are referred to as composition operations, while unbinding is a decomposition operation. None of these operations changes the vector dimensionality.
Through the combination of these operations, VSAs can effectively achieve structured symbolic representations [20]. For instance, consider a scene $\boldsymbol{s}$ in which a triangle $\boldsymbol{t}$ is positioned on the left $\boldsymbol{p}_{L}$ and a circle $\boldsymbol{c}$ on the right $\boldsymbol{p}_{R}$ . This scene can be represented as $\boldsymbol{s}=\boldsymbol{p}_{L}\circ\boldsymbol{t}+\boldsymbol{p}_{R}\circ \boldsymbol{c}$ by the role-filler pair [37]. By applying the inverse vector of the left position $\boldsymbol{p}_{L}^{-1}$ to unbind $\boldsymbol{s}$ , we can retrieve an approximate vector representing the content at the left position, i.e., $\boldsymbol{p}_{L}^{-1}\circ\boldsymbol{s}\approx\boldsymbol{t}$ . Moreover, the triangle $\boldsymbol{t}$ can itself be a compositional scene, where attributes such as color and size are combined into a triangle scene in a similar manner. This decomposable, structure-sensitive, high-dimensional distributed representation has the potential to disentangle complex scenes while maintaining the advantages of traditional connectionist approaches [12].
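This scene example can be reproduced in a few lines, using the FHRR implementations of binding, bundling, and inversion from Table I (a sketch under our own naming, not the authors' implementation):

```python
import numpy as np

d, rng = 2048, np.random.default_rng(1)
rand_vec = lambda: rng.uniform(-np.pi, np.pi, size=d)   # FHRR atomic vector

bind = lambda a, b: np.mod(a + b, 2 * np.pi)            # binding: phase addition
inverse = lambda a: np.mod(-a, 2 * np.pi)               # inverse: phase negation
bundle = lambda a, b: np.angle(np.exp(1j * a) + np.exp(1j * b))  # superposition
sim = lambda a, b: np.mean(np.cos(a - b))

p_L, p_R = rand_vec(), rand_vec()        # roles: left / right position
t, c = rand_vec(), rand_vec()            # fillers: triangle / circle

s = bundle(bind(p_L, t), bind(p_R, c))   # scene s = p_L o t + p_R o c

retrieved = bind(inverse(p_L), s)        # unbind the left position from the scene
print(sim(retrieved, t))                 # high (around 0.6): a noisy copy of the triangle
print(sim(retrieved, c))                 # ~0.0: unrelated to the circle
```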
### III-C The fractional power encoding method
In this study, the rules in RPM are primarily numerical. We introduce the VSA representation of numerical values using the fractional power encoding method (FPE-VSA) [18, 19]. Let $x\in\mathbb{R}$ be a real number and $X\in\mathbb{R}^{d}$ a randomly sampled base vector. The VSA representation $\boldsymbol{v}(x)\in\mathbb{R}^{d}$ of any value $x$ is obtained by binding the base vector $X$ with itself $x$ times, where the exponent $x$ may take fractional values:
$$
\boldsymbol{v}\left(x\right):=\left(X\right)^{\left(\circ x\right)} \tag{3}
$$
The FPE method maps arbitrary real numbers to corresponding HD vectors and has the following property:
$$
\boldsymbol{v}\left(x_{1}+x_{2}\right)=\boldsymbol{v}\left(x_{1}\right)\circ\boldsymbol{v}\left(x_{2}\right) \tag{4}
$$
This demonstrates that addition $+$ in the real number domain can be represented by the binding operation $\circ$ in the vector domain.
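In FHRR form, the fractional power $X^{(\circ x)}$ reduces to scaling the base phases by $x$, which makes Equation 4 easy to verify numerically (a minimal sketch of our own, with illustrative names):

```python
import numpy as np

d, rng = 2048, np.random.default_rng(2)
X = rng.uniform(-np.pi, np.pi, size=d)       # FPE base phase vector (FHRR)

v = lambda x: np.mod(x * X, 2 * np.pi)       # FPE: v(x) = X^(o x), phases scaled by x
bind = lambda a, b: np.mod(a + b, 2 * np.pi)
sim = lambda a, b: np.mean(np.cos(a - b))

# Equation 4: v(x1 + x2) = v(x1) o v(x2)
print(sim(v(1.5 + 2.5), bind(v(1.5), v(2.5))))   # 1.0 up to float error
print(sim(v(3.0), v(4.0)))                       # ~0.0: distinct values are pseudo-orthogonal
```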
TABLE I: Basic operations of FHRR and HRR.

| Operation | FHRR | HRR |
| --- | --- | --- |
| Binding ( $\boldsymbol{x}\circ\boldsymbol{y}$ ) | $\left(\boldsymbol{\theta}+\boldsymbol{\phi}\right)\bmod 2\pi$ | $\mathcal{F}^{-1}\left(\mathcal{F}\left(\boldsymbol{x}\right)\cdot\mathcal{F}\left(\boldsymbol{y}\right)\right)$ |
| Bundling ( $\boldsymbol{x}+\boldsymbol{y}$ ) | $\mathrm{angle}\left(e^{j\boldsymbol{\theta}}+e^{j\boldsymbol{\phi}}\right)$ | $\boldsymbol{x}+\boldsymbol{y}$ |
| Inverse ( $\boldsymbol{x}^{-1}$ ) | $\left(-\boldsymbol{\theta}\right)\bmod 2\pi$ | $\mathcal{F}^{-1}\left(1/\mathcal{F}\left(\boldsymbol{x}\right)\right)$ |
| Unbinding ( $\boldsymbol{x}\oslash\boldsymbol{y}$ ) | $\boldsymbol{\theta}^{-1}\circ\boldsymbol{\phi}$ | $\boldsymbol{x}^{-1}\circ\boldsymbol{y}$ |
## IV Methodology
### IV-A Atomic HD vectors with semantic representations
Figure 2: Atomic HD representations and relation functions. (a-d) The Rel-SAR model utilizes four types of atomic HD vectors. Random Vectors (RVs), sampled independently, are used to represent distinct and unrelated symbols or concepts. Numeric Vectors (NVs) are used to represent real numbers and support VSA-based addition-type arithmetic operations. Circular Vectors (CVs) represent periodic values and enable addition-type arithmetic operations with periodicity. Boolean Vectors (BVs), representing logical values of False and True, support VSA operations for logical reasoning. (e) In the RAVEN dataset, for a given attribute, the HD attribute representations in a row of three image panels involve binary or ternary relations. (f) Relation functions describe the numerical or logical relations between multiple HD vector representations $\boldsymbol{v}_{1:N}$ , where $N=2$ for binary and $N=3$ for ternary relations. These relations are governed by the operator powers $OP_{1:M}$ and the output $\boldsymbol{r}$ . (g) For a given relation defined by $OP_{1:M}$ and $\boldsymbol{r}$ , inverse relation functions infer the last HD vector representation $\boldsymbol{v}_{N}$ according to the first $N-1$ representations $\boldsymbol{v}_{1:N-1}$ .
In neuro-vector-symbolic systems, atomic HD vector representations with meaningful semantics are essential for perception and reasoning. We introduce four types of atomic HD vectors used in our model (Figure 2): Random Vectors (RVs), Numeric Vectors (NVs), Circular Vectors (CVs), and Boolean Vectors (BVs). The definitions and properties of these vectors are universal within the VSA framework.
#### IV-A 1 Random Vector
RVs are sampled from specific distributions according to the VSA models, as mentioned in the preliminary section. Due to the absence of numerical or logical relations among RVs and their pseudo-orthogonality in the HD vector space (Figure 2 a), they are often used to represent symbols and concepts assumed to be independent and dissimilar.
#### IV-A 2 Numeric Vector
NVs, generated using fractional power encoding (FPE-VSA, Equation 3) [18], are employed to represent real numbers (Figure 2 b). NVs $\boldsymbol{v}(x)\in\mathbb{R}^{d}$ support addition-type arithmetic operations through the binding operation (Equation 4) [19].
#### IV-A 3 Circular Vector
CVs are a special class of NVs used to represent periodic values (Figure 2 c). Given a base vector $P$ whose element phases $\rho_{i}$ are sampled from a discrete distribution (e.g., for FHRR, uniformly from the set $\left\{2\pi j/L:j=1,\cdots,L\right\}$, with $L$ an even number), CVs are defined as $\boldsymbol{p}(r):=(P)^{(\circ r)}$. These CVs are pseudo-orthogonal to one another and exhibit periodicity with period $L$ [19]:
$$
\boldsymbol{p}\left(r+L\right)=\boldsymbol{p}\left(r\right) \tag{5}
$$
If $L$ is odd, the corresponding CVs with period $L$ can be obtained by selecting every other CV from those with period $2L$ .
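Both properties can be checked numerically. The sketch below (ours, using an even $L=4$ as the direct construction requires, e.g., the four 2x2Grid positions) verifies periodicity and pseudo-orthogonality:

```python
import numpy as np

d, L = 2048, 4                               # L = 4, e.g. the 2x2Grid positions
rng = np.random.default_rng(3)

# base phases drawn uniformly from the discrete set {2*pi*j/L : j = 1..L}
P = 2 * np.pi * rng.integers(1, L + 1, size=d) / L
p = lambda r: np.mod(r * P, 2 * np.pi)       # CV p(r) = P^(o r)
sim = lambda a, b: np.mean(np.cos(a - b))

print(sim(p(2), p(2 + L)))                   # 1.0: periodicity p(r + L) = p(r)
print(sim(p(2), p(3)))                       # ~0.0: distinct positions pseudo-orthogonal
```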
#### IV-A 4 Boolean Vector
BVs are a specific type of CVs with a period of $L=2$ , used to represent Boolean values (Figure 2 d). Following a similar generation method as for CVs, we can generate vectors with a period of $L=2$ , $\boldsymbol{e}\left(0\right)$ and $\boldsymbol{e}\left(1\right)$ , to represent False and True, respectively. Basic logic operations using BVs are implemented as shown in Table II, where $\boldsymbol{a},\boldsymbol{b}\in\{\boldsymbol{e}(0),\boldsymbol{e}(1)\}$ represent arbitrary Boolean values.
TABLE II: Logic operations implemented by BVs

| Operation | Implementation with BVs |
| --- | --- |
| NOT | $\lnot\boldsymbol{a}=\boldsymbol{a}\circ\boldsymbol{e}\left(1\right)$ |
| XOR | $\boldsymbol{a}\oplus\boldsymbol{b}=\boldsymbol{a}\circ\boldsymbol{b}$ |
| AND | $\boldsymbol{a}\land\boldsymbol{b}=\boldsymbol{a}^{\left(\circ\,sim\left(\boldsymbol{a},\boldsymbol{b}\right)\right)}$ |
| OR | $\boldsymbol{a}\lor\boldsymbol{b}=\left(\boldsymbol{a}\oplus\boldsymbol{b}\right)\circ\left(\boldsymbol{a}\land\boldsymbol{b}\right)=\boldsymbol{a}\circ\boldsymbol{b}\circ\boldsymbol{a}^{\left(\circ\,sim\left(\boldsymbol{a},\boldsymbol{b}\right)\right)}$ |
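The truth tables in Table II can be verified numerically. The sketch below (our illustration) implements these operations on FHRR-style BVs, where the fractional power $\boldsymbol{a}^{(\circ s)}$ again amounts to scaling the phases by $s$:

```python
import numpy as np

d, rng = 2048, np.random.default_rng(4)
E = np.pi * rng.integers(1, 3, size=d)      # BV base: phases in {pi, 2*pi} (a CV with L = 2)

e = lambda r: np.mod(r * E, 2 * np.pi)      # e(0) = False, e(1) = True
bind = lambda a, b: np.mod(a + b, 2 * np.pi)
power = lambda a, s: np.mod(s * a, 2 * np.pi)   # fractional binding power a^(o s)
sim = lambda a, b: np.mean(np.cos(a - b))

NOT = lambda a: bind(a, e(1))
XOR = lambda a, b: bind(a, b)
AND = lambda a, b: power(a, sim(a, b))
OR = lambda a, b: bind(XOR(a, b), AND(a, b))

T, F = e(1), e(0)
print(sim(AND(T, T), T), sim(AND(T, F), F))   # both ~1.0
print(sim(OR(F, T), T), sim(NOT(T), F))       # both ~1.0
```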
### IV-B Relation functions based on atomic HD representations
The rules for abductive reasoning in RPM involve binary and ternary relations among the attributes of corresponding objects in each row of three panels (Figure 2 e and Figure 1 a), and these relations may be numerical or logical. In this work, we design general relation functions based on VSA algebra, utilizing the aforementioned atomic vector representations, to be used for rule abduction.
#### IV-B 1 Relation functions
Relation functions, which describe the relations between multiple HD vector representations, are categorized into two types: numerical and logical. Among the atomic HD representations, Numeric Vectors (NVs) and Circular Vectors (CVs) are involved in numerical relations, while Boolean Vectors (BVs) are involved in logical relations.
The numerical relation function, $R_{Num}$ , is defined as follows (Figure 2 f):
$$
\boldsymbol{r}_{Num}=R_{Num}\left(\boldsymbol{v}_{1:N},OP_{1:M}\right)=\circ_{i=1}^{N}\boldsymbol{v}_{i}^{\left(\circ\,op_{i}\right)} \tag{6}
$$
where $N$ represents the arity of the relation function, and $\boldsymbol{v}_{1:N}:=\left\{\boldsymbol{v}_{i}\right\}_{i=1}^{N}$ denotes the input set of HD vector representations. $M$ is the number of operator powers and $OP_{1:M}:=\left\{op_{i}\right\}_{i=1}^{M}$ represents the operator powers, which can be considered as parameters of the relation function. The notation $\circ_{i=1}^{N}$ denotes the sequential binding operation applied to the $N$ HD vector representations. $\boldsymbol{r}_{Num}$ is the output HD representation. For the binary numerical relation function, $N=2$ and $M=2$ , while for the ternary numerical relation function, $N=3$ and $M=3$ . Based on the arithmetic properties of NVs and CVs, $R_{Num}$ can describe the additive relations of these two types of HD vector representations. The combination of $OP_{1:M}$ and $\boldsymbol{r}_{Num}$ determines the specific numerical relation in this vector-symbolic method.
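As a concrete instance (a sketch of our own; the rule is the Arithmetic plus entry of Table IV), the following code evaluates $R_{Num}$ on a row that satisfies $v_{1}+v_{2}=v_{3}$ and on one that does not:

```python
import numpy as np

d, rng = 2048, np.random.default_rng(5)
X = rng.uniform(-np.pi, np.pi, size=d)              # FPE base for NVs

v = lambda x: np.mod(x * X, 2 * np.pi)              # NV for value x
power = lambda a, s: np.mod(s * a, 2 * np.pi)       # a^(o s)
sim = lambda a, b: np.mean(np.cos(a - b))

def R_num(vs, ops):
    """Numerical relation function (Equation 6): sequential binding of v_i^(o op_i)."""
    out = np.zeros(d)
    for vi, opi in zip(vs, ops):
        out = np.mod(out + power(vi, opi), 2 * np.pi)
    return out

# Arithmetic plus (Table IV): OP = (-1, -1, +1), r = v(0) iff v1 + v2 = v3
print(sim(R_num([v(2), v(3), v(5)], [-1, -1, 1]), v(0)))   # ~1.0: rule holds
print(sim(R_num([v(2), v(3), v(6)], [-1, -1, 1]), v(0)))   # ~0.0: rule violated
```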
Similarly, the simplified logical relation function, $R_{Lgc}$ , is defined as follows (Figure 2 f):
$$
\boldsymbol{r}_{Lgc}=R_{Lgc}\left(\boldsymbol{v}_{1:N},OP_{1:M}\right)=\left(op_{1}\boldsymbol{v}_{1}\land op_{2}\boldsymbol{v}_{2}\right)\circ op_{3}\boldsymbol{v}_{3} \tag{7}
$$
where $\boldsymbol{v}_{1:N}:=\left\{\boldsymbol{v}_{i}\right\}_{i=1}^{N}\in\{ \boldsymbol{e}(0),\boldsymbol{e}(1)\}$ denotes the input set of BVs. The full version of the logical relation function is described in Appendix A. Here, we consider only the ternary logical relation, so $N=3$ and $M=3$. The parameters $OP_{1:M}:=\left\{op_{i}\right\}_{i=1}^{M}$, where $op_{i}\in\{0,1\}$, determine whether to negate $\boldsymbol{v}_{i}$: negation ( $\lnot$ ) is applied when $op_{i}=1$ and omitted when $op_{i}=0$ (see Appendix A). The symbol $\land$ denotes the AND operation, as shown in Table II. Based on the computational properties of BVs detailed in Table II, $R_{Lgc}$ can describe the ternary logical relations involved in RPM. The combination of the operators $OP_{1:M}$ and the output $\boldsymbol{r}_{Lgc}$ determines the specific logical relation in this vector-symbolic method.
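A corresponding sketch for $R_{Lgc}$ (our illustration, restricted to the simplified form of Equation 7) checks the logical Arithmetic plus rule from Table IV, which holds when $v_{3}=v_{1}\lor v_{2}$:

```python
import numpy as np

d, rng = 2048, np.random.default_rng(9)
E = np.pi * rng.integers(1, 3, size=d)          # BV base (period L = 2)
e = lambda r: np.mod(r * E, 2 * np.pi)
bind = lambda a, b: np.mod(a + b, 2 * np.pi)
power = lambda a, s: np.mod(s * a, 2 * np.pi)
sim = lambda a, b: np.mean(np.cos(a - b))
NOT = lambda a: bind(a, e(1))
AND = lambda a, b: power(a, sim(a, b))
OR = lambda a, b: bind(bind(a, b), AND(a, b))

def R_lgc(vs, ops):
    """Simplified ternary logical relation (Equation 7); op_i = 1 negates v_i."""
    neg = lambda a, o: NOT(a) if o else a
    return bind(AND(neg(vs[0], ops[0]), neg(vs[1], ops[1])), neg(vs[2], ops[2]))

# Logical Arithmetic plus (Table IV): OP = (1,1,1), r = e(0),
# satisfied when v3 = v1 OR v2 (union of occupied positions).
v1, v2 = e(1), e(0)
print(sim(R_lgc([v1, v2, OR(v1, v2)], [1, 1, 1]), e(0)))   # ~1.0
```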
Figure 3: Structured HD representations (SHDR) for the image panel and the nxn Grid. (a) SHDR for the image panel (Equation 10). Taking the Out-InGrid configuration as an example, an image panel contains multiple objects, each with four entity attributes: type, size, color, and existence. Through the first layer of role-filler binding, these attributes are combined to form an SHDR for each object. Additionally, at the layout level, each object has a position attribute. By applying a second layer of role-filler binding, the SHDR for the entire image panel is constructed. (b) SHDR for the nxn Grid (Equation 12). Taking the 3x3Grid configuration as an example, the position vectors for all objects are represented using circular vectors (CVs) with a period of $3\times 3=9$ . The SHDR $\mathcal{C}^{3\mathrm{x}3}$ for this 3x3Grid is obtained by performing role-filler binding between the corresponding position vectors $\boldsymbol{p}_{j}$ and existence vectors $\boldsymbol{v}_{j}^{exist}$ . Due to the periodic nature of the position vectors, when all objects shift positions cyclically, the SHDR is simply bound with the position vector corresponding to the magnitude of the shift.
#### IV-B 2 Inverse relation functions
Rule execution in RPM requires inferring the third attribute value based on the first two attribute values in a row of panels, given a known relation. It represents an inverse problem of rule abduction. In the vector-symbolic method, given the operator power ${OP_{1:M}}$ and the output $\boldsymbol{r}$ , the last vector representation $\boldsymbol{v}_{N}$ can be inferred from the first $N-1$ inputs $\boldsymbol{v}_{1:N-1}$ using the inverse of the relation functions (Figure 2 g). According to Equation 6, the inverse numerical relation function is defined as follows:
$$
\boldsymbol{v}_{N}=R_{Num}^{-1}\left(\boldsymbol{v}_{1:N-1},OP_{1:M},\boldsymbol{r}\right)=\left(\circ_{i=1}^{N-1}\boldsymbol{v}_{i}^{\left(\circ\left(-op_{i}/op_{M}\right)\right)}\right)\circ\boldsymbol{r}^{\left(\circ\left(1/op_{M}\right)\right)} \tag{8}
$$
Similarly, according to Equation 7, the inverse logical relation function is defined as follows:
$$
\boldsymbol{v}_{N}=R_{Lgc}^{-1}\left(\boldsymbol{v}_{1:N-1},OP_{1:M},\boldsymbol{r}\right)=op_{3}\left(op_{1}\boldsymbol{v}_{1}\land op_{2}\boldsymbol{v}_{2}\right) \tag{9}
$$
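The inverse numerical relation function can be exercised directly. The sketch below (ours, consistent with Equation 8 above) recovers the missing third value of a row governed by the Arithmetic plus rule:

```python
import numpy as np

d, rng = 2048, np.random.default_rng(6)
X = rng.uniform(-np.pi, np.pi, size=d)
v = lambda x: np.mod(x * X, 2 * np.pi)
power = lambda a, s: np.mod(s * a, 2 * np.pi)
bind = lambda a, b: np.mod(a + b, 2 * np.pi)
sim = lambda a, b: np.mean(np.cos(a - b))

def R_num_inv(vs, ops, r):
    """Inverse numerical relation (Equation 8): solve for the last vector."""
    out = power(r, 1.0 / ops[-1])
    for vi, opi in zip(vs, ops[:-1]):
        out = bind(out, power(vi, -opi / ops[-1]))
    return out

# Arithmetic plus: given v1 = v(2), v2 = v(3), OP = (-1,-1,+1) and r = v(0),
# the missing third attribute should be v(5).
v3_hat = R_num_inv([v(2), v(3)], [-1, -1, 1], v(0))
print(sim(v3_hat, v(5)))   # ~1.0
```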
### IV-C Structured high-dimensional representation and its attribute decomposition
VSA can create structured symbolic representations using atomic HD vector representations and decouple them directly from these structures through algebraic operations [12]. This subsection presents the process of constructing a structured HD representation (SHDR) for an image panel and its decomposition to retrieve individual attribute representations. Additionally, an SHDR for the nxn Grid ( $n=2,3$ ) at the component level is also introduced.
#### IV-C 1 SHDR for the image panel
In the RAVEN dataset, each image panel consists of objects, each characterized by multiple attributes. Consequently, the structured HD representation (SHDR) for each image panel can be obtained through two layers of role-filler binding (Figure 3 a). First, the bundling operation is used to construct an SHDR for each object at the entity level by combining its attributes. Then, another bundling operation aggregates these object-level representations to construct an SHDR of the image panel at the scene level. Therefore, each image panel $\mathcal{X}\in\mathbb{R}^{r\times r}$ , with resolution $r\times r$ , can be represented by an SHDR $\mathcal{S}\in\mathbb{R}^{d}$ as follows:
$$
\mathcal{S}=\sum_{j=1}^{N_{pos}}{\boldsymbol{p}_{j}\circ\boldsymbol{O}_{j}}=\sum_{j=1}^{N_{pos}}{\boldsymbol{p}_{j}\circ\left(\sum_{attr\in ATTR}{\boldsymbol{k}_{attr}\circ\boldsymbol{v}_{j}^{attr}}\right)} \tag{10}
$$
Here, $\boldsymbol{O}_{j}$ represents the SHDR of the $j$-th object at the entity level, incorporating attributes such as type, size, color, and existence. The attribute set is $ATTR=\left\{type,size,color,exist\right\}$ . At the entity level, the key vector $\boldsymbol{k}_{attr}$ denotes the class of a specific attribute $attr\in ATTR$ , while the value vector $\boldsymbol{v}_{j}^{attr}$ indicates the attribute's value at position $j$ . At the scene level, the position vector $\boldsymbol{p}_{j}$ specifies the location of the $j$-th object.
#### IV-C 2 Representation decomposition
Given an estimated SHDR $\hat{\mathcal{S}}\in\mathbb{R}^{d}$ of an image panel, all SHDRs of objects $\hat{\boldsymbol{O}}_{j}$ at the entity level, along with the corresponding attribute representations $\hat{\boldsymbol{v}}_{j}^{attr}$ , can be derived through a series of unbinding operations [20]. The decomposition process is shown as follows:
$$
\begin{cases}\hat{\boldsymbol{O}}_{j}=\boldsymbol{p}_{j}\oslash\hat{\mathcal{S}}=\boldsymbol{p}_{j}^{-1}\circ\hat{\mathcal{S}}\\ \hat{\boldsymbol{v}}_{j}^{attr}=\boldsymbol{k}_{attr}\oslash\hat{\boldsymbol{O}}_{j}=\boldsymbol{k}_{attr}^{-1}\circ\hat{\boldsymbol{O}}_{j}\end{cases} \tag{11}
$$
It is important to note that due to inaccuracies in the estimated SHDR $\hat{\mathcal{S}}$ and the noise introduced by the unbinding operation, the estimated attribute representations $\hat{\boldsymbol{v}}_{j}^{attr}$ may not fully match the original $\boldsymbol{v}_{j}^{attr}$ used in Equation 10.
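Equations 10 and 11 can be traced end to end in a small sketch (ours; two objects with random attribute values for illustration), which also exhibits the crosstalk noise just described:

```python
import numpy as np

d, rng = 4096, np.random.default_rng(7)
rand_vec = lambda: rng.uniform(-np.pi, np.pi, size=d)
bind = lambda a, b: np.mod(a + b, 2 * np.pi)
inv = lambda a: np.mod(-a, 2 * np.pi)
sim = lambda a, b: np.mean(np.cos(a - b))
bundle = lambda vecs: np.angle(np.sum(np.exp(1j * np.asarray(vecs)), axis=0))

ATTR = ["type", "size", "color", "exist"]
k = {a: rand_vec() for a in ATTR}                         # key vectors k_attr
p = [rand_vec() for _ in range(2)]                        # position vectors p_j
vals = [{a: rand_vec() for a in ATTR} for _ in range(2)]  # value vectors v_j^attr

# Equation 10: entity-level, then scene-level role-filler binding
O = [bundle([bind(k[a], vals[j][a]) for a in ATTR]) for j in range(2)]
S = bundle([bind(p[j], O[j]) for j in range(2)])

# Equation 11: unbind the position, then the attribute key
O0_hat = bind(inv(p[0]), S)
v_hat = bind(inv(k["color"]), O0_hat)
print(sim(v_hat, vals[0]["color"]))   # clearly above chance, though noisy
print(sim(v_hat, vals[1]["color"]))   # ~0.0
```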
#### IV-C 3 SHDR for the nxn Grid component
In the RAVEN dataset, three figure configurations— 2x2Grid, 3x3Grid, and Out-InGrid —include components where objects are arranged in an nxn grid pattern at the layout level [21]. Since the positions in the nxn Grid involve component-level rule reasoning, the SHDR for the nxn Grid component ( $n=2,3$ ), focusing only on positions and object existence, is introduced as follows (Figure 3 b):
$$
\mathcal{C}^{\mathrm{nxn}}=\sum_{j=1}^{n\times n}{\boldsymbol{p}_{j}\circ\boldsymbol{v}_{j}^{exist}} \tag{12}
$$
### IV-D Rules from the perspective of relation functions
TABLE III: Attribute representations and the relation functions involved in rule abductions
| Level | Attributes | HD representations | Rules in RAVEN | Rule types | Relation functions |
| --- | --- | --- | --- | --- | --- |
| Entity | Type, Size, Color | $\boldsymbol{v}^{type}$, $\boldsymbol{v}^{size}$, $\boldsymbol{v}^{color}$: atomic (NVs) | Constant, Progression | Numerical rules | Binary + numerical |
| | | | Arithmetic, Distribute Three | Numerical rules | Ternary + numerical |
| Layout | Number | $\boldsymbol{v}^{num}$: atomic (NVs) | Constant, Progression | Numerical rules | Binary + numerical |
| | | | Arithmetic, Distribute Three | Numerical rules | Ternary + numerical |
| Layout | Position | $\mathcal{C}^{\mathrm{nxn}}=\sum_{j=1}^{n\times n}{\boldsymbol{p}_{j}\circ\boldsymbol{v}_{j}^{exist}}$: SHDR ( $\boldsymbol{p}_{j}$: CVs, $\boldsymbol{v}_{j}^{exist}$: RVs) | Constant, Progression | Numerical rules | Binary + numerical |
| | | | Distribute Three | Numerical rules | Ternary + numerical |
| Layout | Position | $\boldsymbol{v}^{exist}$: atomic (BVs) | Arithmetic | Logical rules | Ternary + logical |
TABLE IV: Rules and corresponding combinations of $OP_{1:M}$ and $\boldsymbol{r}$ in relation functions

| Rules in RAVEN | Type | $OP_{1:M}$ | $\boldsymbol{r}$ | Examples (Rule $\rightarrow$ Relation function) |
| --- | --- | --- | --- | --- |
| Constant | Numerical, binary | $(-1,+1)$ | $\boldsymbol{v}(0)$ | $v_{1}=v_{2}\rightarrow\boldsymbol{v}\left(0\right)=\boldsymbol{v}_{1}^{\left(\circ\left(-1\right)\right)}\circ\boldsymbol{v}_{2}^{\left(\circ 1\right)}$ |
| Progression ($+$) | Numerical, binary | $(-1,+1)$ | $\boldsymbol{v}(+1)$, $\boldsymbol{v}(+2)$ | $v_{1}+1=v_{2}\rightarrow\boldsymbol{v}\left(+1\right)=\boldsymbol{v}_{1}^{\left(\circ\left(-1\right)\right)}\circ\boldsymbol{v}_{2}^{\left(\circ 1\right)}$ |
| Progression ($-$) | Numerical, binary | $(-1,+1)$ | $\boldsymbol{v}(-1)$, $\boldsymbol{v}(-2)$ | $v_{1}-2=v_{2}\rightarrow\boldsymbol{v}\left(-2\right)=\boldsymbol{v}_{1}^{\left(\circ\left(-1\right)\right)}\circ\boldsymbol{v}_{2}^{\left(\circ 1\right)}$ |
| Arithmetic ($+$) | Numerical, ternary | $(-1,-1,+1)$ | $\boldsymbol{v}(0)$ | $v_{1}+v_{2}=v_{3}\rightarrow\boldsymbol{v}\left(0\right)=\boldsymbol{v}_{1}^{\left(\circ\left(-1\right)\right)}\circ\boldsymbol{v}_{2}^{\left(\circ\left(-1\right)\right)}\circ\boldsymbol{v}_{3}^{\left(\circ 1\right)}$ |
| Arithmetic ($-$) | Numerical, ternary | $(-1,+1,+1)$ | $\boldsymbol{v}(0)$ | $v_{1}-v_{2}=v_{3}\rightarrow\boldsymbol{v}\left(0\right)=\boldsymbol{v}_{1}^{\left(\circ\left(-1\right)\right)}\circ\boldsymbol{v}_{2}^{\left(\circ 1\right)}\circ\boldsymbol{v}_{3}^{\left(\circ 1\right)}$ |
| Distribute Three | Numerical, ternary | $(+1,+1,+1)$ | Any | $v_{1}+v_{2}+v_{3}=Any\rightarrow\boldsymbol{Any}=\boldsymbol{v}_{1}^{\left(\circ 1\right)}\circ\boldsymbol{v}_{2}^{\left(\circ 1\right)}\circ\boldsymbol{v}_{3}^{\left(\circ 1\right)}$ |
| Arithmetic ($+$) | Logical, ternary | $(1,1,1)$ | $\boldsymbol{e}(0)$ | $\boldsymbol{e}\left(0\right)=\left(\boldsymbol{e}\left(1\right)\circ\boldsymbol{v}_{1}\land\boldsymbol{e}\left(1\right)\circ\boldsymbol{v}_{2}\right)\circ\left(\boldsymbol{e}\left(1\right)\circ\boldsymbol{v}_{3}\right)$ |
| Arithmetic ($-$) | Logical, ternary | $(0,1,0)$ | $\boldsymbol{e}(0)$ | $\boldsymbol{e}\left(0\right)=\left(\boldsymbol{e}\left(0\right)\circ\boldsymbol{v}_{1}\land\boldsymbol{e}\left(1\right)\circ\boldsymbol{v}_{2}\right)\circ\left(\boldsymbol{e}\left(0\right)\circ\boldsymbol{v}_{3}\right)$ |
The RAVEN dataset contains $4$ rules, Constant, Progression, Arithmetic, and Distribute Three, which operate on $5$ rule-governing attributes [21]. These $5$ attributes include $3$ entity-level attributes, Type, Size, and Color, and $2$ layout-level attributes, Number and Position. Table III summarizes the HD representations of these attribute values during rule reasoning and the correspondence between rules and relation functions.
For the attributes Type, Size, Color, and Number, the four involved rules follow additive arithmetic operations, meaning the attribute values $\boldsymbol{v}^{attr}$ ( $attr\in\left\{type,size,color,number\right\}$ ) are represented using Numeric Vectors (NVs). Therefore, these rules can be defined using the numerical relation function (Equation 6): Constant and Progression correspond to binary relation functions, while Arithmetic and Distribute Three correspond to ternary relation functions. Each rule is associated with specific combinations of $OP_{1:M}$ and $\boldsymbol{r}_{Num}$ , and corresponding details are shown in Table IV.
For the attribute Position, the rules Constant, Progression, and Distribute Three primarily refer to an nxn Grid with multiple objects, which, in an overall sense, follow additive arithmetic operations. Therefore, we use the SHDR $\mathcal{C}^{\mathrm{nxn}}$ for the nxn Grid (Equation 12) to represent the attributes required by these three rules. Since Progression involves a cyclic left or right shift of all objects (Figure 3 b), the position vectors $\boldsymbol{p}_{j}$ in $\mathcal{C}^{\mathrm{nxn}}$ are represented by Circular Vectors (CVs) during rule reasoning, and the object existence vectors $\boldsymbol{v}_{j}^{exist}$ by Random Vectors (RVs). These three rules can also be described using numerical relation functions (Equation 6). Take the rule Progression (+1) as an example, where the positions of objects undergo a cyclic right shift. In Figure 3 b, the SHDRs of the 3x3 Grid across a row of three panels exhibit two numerical relations, $\mathcal{C}_{1}\circ\boldsymbol{p}\left(1\right)=\mathcal{C}_{2}$ and $\mathcal{C}_{2}\circ\boldsymbol{p}\left(1\right)=\mathcal{C}_{3}$ , which can be defined using a binary numerical relation function.
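This shift property can be verified numerically. The sketch below (ours) builds $\mathcal{C}^{3\mathrm{x}3}$ from period-9 CVs, derived from period-18 vectors as described in Section IV-A since $L=9$ is odd, and checks $\mathcal{C}_{1}\circ\boldsymbol{p}(1)=\mathcal{C}_{2}$:

```python
import numpy as np

d, n = 4096, 3
L = n * n                                    # 9 positions; odd, so derive CVs from period 2L
rng = np.random.default_rng(8)

P2 = 2 * np.pi * rng.integers(1, 2 * L + 1, size=d) / (2 * L)   # period-2L base phases
p = lambda r: np.mod(2 * r * P2, 2 * np.pi)                     # every other CV: period L

rand_vec = lambda: rng.uniform(-np.pi, np.pi, size=d)
bind = lambda a, b: np.mod(a + b, 2 * np.pi)
sim = lambda a, b: np.mean(np.cos(a - b))
bundle = lambda vecs: np.angle(np.sum(np.exp(1j * np.asarray(vecs)), axis=0))

v_exist = {j: rand_vec() for j in [0, 3, 5]}   # three occupied grid slots (RVs)

def C(offset):
    """SHDR of the 3x3 Grid with all objects cyclically shifted by `offset` (Equation 12)."""
    return bundle([bind(p((j + offset) % L), v_exist[j]) for j in v_exist])

# Progression (+1): shifting every object one slot equals binding the SHDR with p(1)
print(sim(bind(C(0), p(1)), C(1)))   # ~1.0
```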
In addition, the rule Arithmetic on the attribute Position belongs to the logical rules [21]. Here the attribute values $\boldsymbol{v}_{j}^{exist}$ are also represented using Boolean Vectors (BVs), which support the operations listed in Table II. The rule Arithmetic for Position therefore corresponds to the ternary logical relation function (Equation 7), and the corresponding combinations of $OP_{1:M}$ and $\boldsymbol{r}$ are shown in Table IV.
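As an illustration of how the ternary logical relation function operates on existence vectors, the sketch below adopts one simplified instantiation of the Boolean-Vector operations (an assumption on our part, since the actual operations are specified in Table II): $\boldsymbol{e}(1)$ as the all-ones vector, $\boldsymbol{e}(0)$ as the all-zeros vector, binding as XOR, and $\land$ as elementwise AND. Under these assumptions, the two logical Arithmetic rules of Table IV reduce exactly to set union and set difference over positions:

```python
import numpy as np

D = 16  # tiny dimensionality, for illustration only

# Simplified BV semantics (an assumption for this sketch; see Table II):
# e(1) = all ones, e(0) = all zeros, binding = XOR, logical-and = elementwise AND.
e = {0: np.zeros(D, dtype=int), 1: np.ones(D, dtype=int)}
bind = np.bitwise_xor
AND = np.bitwise_and

def rel_lgc(v1, v2, v3, op):
    """Ternary logical relation function of Equation 7 / Table IV:
    (e(OP1) bind v1 AND e(OP2) bind v2) bind (e(OP3) bind v3)."""
    return bind(AND(bind(e[op[0]], v1), bind(e[op[1]], v2)), bind(e[op[2]], v3))

rng = np.random.default_rng(0)
v1 = rng.integers(0, 2, D)  # panel-1 existence pattern
v2 = rng.integers(0, 2, D)  # panel-2 existence pattern

# Arithmetic '+' on Position: v3 = v1 OR v2 (set union), OP = (1, 1, 1).
v3 = np.bitwise_or(v1, v2)
assert np.array_equal(rel_lgc(v1, v2, v3, (1, 1, 1)), e[0])

# Arithmetic '-' on Position: v3 = v1 AND NOT v2 (set difference), OP = (0, 1, 0).
v3 = AND(v1, 1 - v2)
assert np.array_equal(rel_lgc(v1, v2, v3, (0, 1, 0)), e[0])
print("both logical rules reduce to e(0) as required")
```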
### IV-E The Systematic Abductive Reasoning model
In this section, we present the Systematic Abductive Reasoning model with diverse relation representations (Rel-SAR), inspired by the NVSA [12]. An overview of Rel-SAR is depicted in Figure 4 a. Similar to previous neuro-symbolic models for abstract visual reasoning, Rel-SAR combines a neural visual perception frontend with a symbolic reasoning backend, both utilizing VSA representations with meaningful semantics to facilitate systematic reasoning. The perception frontend employs a neural network to extract the SHDR $\mathcal{S}$ of each image panel $\mathcal{X}$ in the RPM test and achieves feature disentanglement from the SHDR using representation decomposition to obtain the HD representations of attributes ( $\boldsymbol{v}$ , $\boldsymbol{p}$ and $\mathcal{C}$ : Table III) required for reasoning in the backend. The reasoning backend consists of three main modules: the rule abduction module, the rule execution module, and the answer selection module. The rule abduction module extracts the corresponding rules ( $OP_{1:M}$ and $\boldsymbol{r}$ : Table IV) for each attribute representation according to the appropriate relation function (Equations 6 and 7, Table III). Subsequently, the rule execution module uses these rules to predict the representations of the missing panel’s attributes according to the corresponding inverse relation functions (Equations 8 and 9). Finally, the answer selection module compares the predicted attribute representations of the missing panel with the available options in the answer panels and selects the answer.
Figure 4: The Systematic Abductive Reasoning model with diverse relation representations (Rel-SAR). (a) Overall architecture of the Rel-SAR model. Our model consists of a visual perception frontend, which processes object attributes for $8$ context and $8$ candidate image panels in an RPM test, and a reasoning backend that performs symbolic arithmetic and logical reasoning. The perception frontend utilizes a neural network, $f_{\theta}$ , to obtain the SHDR of each image panel, and then perceives attributes in the form of HD representations required by the downstream reasoning. In the reasoning backend, the rule abduction module extracts rules for each attribute representation using relation functions. The rule execution module then predicts the missing panel’s attribute representations based on inverse relation functions. Finally, the answer selection module compares the predicted attributes of the missing panel with those in the candidate panels and selects the option with the highest similarity. (b) Given the predicted SHDR for each panel, the SHDR of all objects and their corresponding HD attribute representations can be obtained via representation decomposition. Subsequently, the estimated HD attribute representations are refined in two steps: querying the frontend codebook and applying attention based on the backend codebook. This process produces HD attribute representations suitable for backend reasoning, including attributes such as type, size, color, number, and position. (c) In the rule abduction module, the rule learners $f_{\phi}^{Num}$ and $f_{\varphi}^{Lgc}$ predict the operator powers $\widehat{OP}_{1:M}$ for numerical and logical relation functions based on attributes in the context panels. These predicted $\widehat{OP}_{1:M}$ ensure that all binary or ternary relation input pairs ( $\mathbb{V}^{N},N=2,3$ ) produce the same output $\boldsymbol{\hat{r}}$ when processed through their respective relation functions. Therefore, the rule defined by $\widehat{OP}_{1:M}$ and $\boldsymbol{\hat{r}}$ with the highest overall $\boldsymbol{\hat{r}}$ similarity, also viewed as an unnormalized probability, is considered the underlying rule. The rule execution module then predicts the attributes of the missing panel using inverse relation functions with the estimated rules.
#### IV-E 1 Perception frontend
The perception frontend operates independently on each of the 16 image panels to extract the HD representations of attributes required for abductive reasoning (Figure 4 a and Figure 4 b). For a given image panel $\mathcal{X}^{ind}\in\mathbb{R}^{r\times r}$ , where $ind\in\left\{\left(1,1\right),\left(1,2\right),\cdots,\left(3,2\right)\right\}$ for 8 contexts and $ind\in\left\{1,2,\cdots,8\right\}$ for 8 candidates, the frontend uses a trainable neural network (ResNet-50) to map the image panel to its estimated SHDR $\hat{\mathcal{S}}^{ind}\in\mathbb{R}^{d}$ : $f_{\theta}:\mathcal{X}\rightarrow\hat{\mathcal{S}}$ , where $\theta$ represents the trainable parameters of the network. Theoretically, the expected SHDR $\mathcal{S}$ for each panel should be organized from the corresponding attribute representations as described by Equation 10. Therefore, the learning objective of $f_{\theta}$ is to minimize the difference between its output $\hat{\mathcal{S}}$ and the theoretical SHDR $\mathcal{S}$ , formulated as:
$$
\underset{\theta}{\min}\left\|f_{\theta}\left(\mathcal{X};\theta\right)-
\mathcal{S}\right\| \tag{13}
$$
Subsequently, the estimated SHDR $\hat{\mathcal{S}}^{ind}$ for each panel undergoes representation decomposition (Figure 4 b), as described in Equation 11, to obtain the estimated HD attribute representations for each object, including Type ( $\hat{\boldsymbol{v}}^{type}_{j}$ ), Size ( $\hat{\boldsymbol{v}}^{size}_{j}$ ), Color ( $\hat{\boldsymbol{v}}^{color}_{j}$ ), and existence ( $\hat{\boldsymbol{v}}^{exist}_{j}$ ), where $j$ denotes the position index of the corresponding object.
The HD attribute representations are expected to be selected from a set of frontend codebooks covering the attributes of interest in the RAVEN dataset (Figure 4 b). These frontend codebooks include $C_{Num}^{Front}:=\left\{\boldsymbol{v}(r)\right\}_{r=0}^{9}\cup\left\{ \boldsymbol{v}_{null}\right\}$ and $C_{Lgc}^{Front}:=\left\{\boldsymbol{e}(r)\right\}_{r=0}^{1}$ , which represent numerical values and logical values, respectively. $\boldsymbol{v}_{null}$ represents the null attribute when no object is present. To improve the neural network’s performance in encoding the SHDR of an image panel, all hypervectors in these frontend codebooks are randomly and independently generated as RVs, rather than as NVs, CVs, or BVs.
However, the estimated HD attribute representations for each object, $\hat{\boldsymbol{v}}^{attr}_{j}$ ( $attr\in\{type,size,color,exist\}$ ), cannot be directly applied in the reasoning backend. First, these representations contain noise introduced by the bundling operation used to form the SHDR. Second, since they are derived from frontend codebooks of RVs, there are no intrinsic arithmetic or logical relations among the $\hat{\boldsymbol{v}}^{attr}_{j}$, which hinders effective reasoning.
To address these issues, we adopt an approach similar to the attention mechanism [38] to obtain the HD attribute representations suitable for the reasoning backend (Figure 4 b). In the query stage, we use the estimated HD attribute representations ( $\hat{\boldsymbol{v}}^{attr}_{j}$ ) as query vectors to compute their similarity with all possible vectors for the corresponding attributes in the frontend codebooks. In the attention stage, these similarity scores are then used as attention weights to perform a weighted summation of the corresponding vectors from the backend codebooks, in which all hypervectors are generated according to their attribute type as shown in Table III. The backend codebooks consist of $C_{Num}^{Back}$ (NVs), $C_{Lgc,BV}^{Back}$ (BVs), $C_{Lgc,RV}^{Back}$ (RVs), and $C_{Pos,\mathrm{nxn}}^{Back}:=\left\{\boldsymbol{p}_{r}\right\}_{r=1}^{n^{2}}$ (CVs for positions in nxn Grid). The updated HD attribute representations obtained after the weighted summation can be utilized in the reasoning backend. The details are provided below.
The query stage: For each attribute $attr\in\left\{type,size,color\right\}$ , we compute the cosine similarity between the estimated HD attribute representation $\hat{\boldsymbol{v}}^{attr}_{j}$ and all possible vectors of $C_{Num}^{Front}$ in the frontend codebooks.
$$
W_{j}^{attr}=sim\left(\hat{\boldsymbol{v}}_{j}^{attr},C_{Num}^{Front}\right) \tag{14}
$$
where $W_{j}^{attr}(r)$ ( $r\in\{0,1,...,9,null\}$ ) represents the attention weight corresponding to the value $r$ of attribute $attr$ at the $j$th position, based on the query similarity. Similarly, the attention weights for the attribute $attr\in\left\{exist\right\}$ can be obtained by querying the logic codebook $C_{Lgc}^{Front}$ as follows:
$$
W_{j}^{exist}=softmax\left(\beta\cdot sim\left(\hat{\boldsymbol{v}}_{j}^{exist
},C_{Lgc}^{Front}\right)\right) \tag{15}
$$
where $W_{j}^{exist}(r)$ ( $r\in\{0,1\}$ ) corresponds to the absence ( $r=0$ ) or presence ( $r=1$ ) of an object at the $j$th position. Here, we use the $softmax$ function to normalize the weights, and $\beta$ denotes the inverse softmax temperature.
The attention stage: The HD attribute representations required by the reasoning backend involve the entity-level attributes Type, Size, and Color, as well as the layout-level attributes Number and Position (Table III). For the numerical attributes $attr\in\left\{type,size,color\right\}$ , the corresponding updated HD representation $\boldsymbol{v}_{j}^{attr}$ can be obtained through a weighted summation over the numerical backend codebook $C_{Num}^{Back}$ as follows:
$$
\boldsymbol{v}_{j}^{attr}=\sum_{r\in\left\{0,...,9,null\right\}}{W_{j}^{attr}
\left(r\right)\cdot\boldsymbol{v}\left(r\right)},\boldsymbol{v}\left(r\right)
\in C_{Num}^{Back} \tag{16}
$$
For the logical existence attribute $attr\in\left\{exist\right\}$ , its updated HD representations $\boldsymbol{v}_{j}^{exist}$ are obtained through weighted summations over the backend codebooks $C_{Lgc,RV}^{Back}$ and $C_{Lgc,BV}^{Back}$ , respectively, as follows:
$$
\boldsymbol{v}_{j}^{exist,VT}=\sum_{r\in\left\{0,1\right\}}{W_{j}^{exist}\left
(r\right)\cdot\boldsymbol{e}\left(r\right)},\boldsymbol{e}\left(r\right)\in C_
{Lgc,VT}^{Back} \tag{17}
$$
where $VT\in\{RV,BV\}$ represents the type of atomic HD vectors.
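A compact sketch of this two-stage query-and-attention cleanup (our paraphrase in Python; codebook contents, shapes, and the temperature value are illustrative stand-ins) may help:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_vals = 3000, 11  # values 0..9 plus the null symbol

def cos_sim(q, codebook):
    """Cosine similarity between one query vector and every codebook row."""
    q = q / np.linalg.norm(q)
    cb = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    return cb @ q

def softmax(x, beta=10.0):
    """Normalized attention weights with inverse temperature beta."""
    z = np.exp(beta * (x - x.max()))
    return z / z.sum()

C_front = rng.standard_normal((n_vals, d))  # frontend codebook of RVs
C_back = rng.standard_normal((n_vals, d))   # stand-in for the NV backend codebook

# A noisy frontend estimate of, say, value 4:
v_hat = C_front[4] + 0.5 * rng.standard_normal(d)

# Query stage (Eq. 14): similarities to the frontend codebook act as weights.
W = cos_sim(v_hat, C_front)
# Attention stage (Eq. 16): weighted summation over the backend codebook.
v_back = W @ C_back

# For existence (Eqs. 15 and 17) the weights are additionally normalized;
# C_front[:2] stands in for the two-entry logic codebook C_Lgc^Front here.
W_exist = softmax(cos_sim(v_hat, C_front[:2]), beta=10.0)
```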
Additionally, we introduce overall HD attribute representations for the Type, Size, and Color attributes within the nxn Grid. These representations $\boldsymbol{v}_{\mathrm{nxn}}^{attr}$ ( $attr\in\left\{type,size,color\right\}$ ) are obtained by bundling the corresponding HD attribute representations of all objects in the Grid, weighted by their existence attention weights, as follows:
$$
\boldsymbol{v}_{\mathrm{nxn}}^{attr}=\sum_{j=1}^{n^{2}}{W_{j}^{exist}\left(1\right)\cdot\boldsymbol{v}_{j}^{attr}} \tag{18}
$$
For the layout-level attribute Number, its HD attribute representation $\boldsymbol{v}^{number}$ is obtained by projecting the sum of the presence attention weights to FPE-VSA as follows:
$$
\boldsymbol{v}^{number}=\boldsymbol{v}^{\left(\otimes\sum_{j=1}^{n^{2}}{W_{j}^{exist}\left(1\right)}\right)} \tag{19}
$$
where $\boldsymbol{v}$ is the base vector of the numerical backend codebook $C_{Num}^{Back}$ .
The layout-level attribute Position within the nxn Grid involves both numerical and logical rules (Table III). Therefore, its HD attribute representations take two distinct forms: the logical representation of each individual object, $\boldsymbol{v}_{j}^{position}$ , and the overall HD position representation $\mathcal{C}^{position}$ of the entire nxn Grid. The former is an HD existence representation with logical computational properties, that is, $\boldsymbol{v}_{j}^{position}=\boldsymbol{v}_{j}^{exist,BV}$ . Inspired by the SHDR for the nxn Grid in Equation 12, the overall HD position representation $\mathcal{C}^{position}$ is obtained as follows:
$$
\mathcal{C}^{position}_{\mathrm{nxn}}=\sum_{j=1}^{n^{2}}{W_{j}^{exist}\left(1\right)\cdot\boldsymbol{p}_{j}\circ\boldsymbol{v}_{j}^{exist,RV}}\,,\quad\boldsymbol{p}_{j}\in C_{Pos,\mathrm{nxn}}^{Back} \tag{20}
$$
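The following continuation of the earlier FHRR sketch assembles the three layout-level quantities of Equations 18 to 20 (illustrative only; `W1`, `v_attr`, and the other names are hypothetical stand-ins, and the helpers `random_fhrr`, `power`, `bind` plus `rng`, `base`, and `d` are reused from the sketch in the previous subsection):

```python
# A 3x3 Grid has n^2 = 9 slots.
n2 = 9
W1 = rng.uniform(0.0, 1.0, n2)                   # stand-in for W_j^exist(1)
v_attr = [random_fhrr(rng) for _ in range(n2)]   # per-object attribute vectors

# Circular position vectors: phases restricted to multiples of 2*pi/n^2,
# so the n^2-th binding power returns to the identity (a cyclic structure).
base_pos = np.exp(1j * (2 * np.pi / n2) * rng.integers(0, n2, d))
p = [power(base_pos, j) for j in range(1, n2 + 1)]
v_exist_rv = [random_fhrr(rng) for _ in range(n2)]

# Eq. 18: grid-level attribute representation (bundling = weighted summation).
v_grid = sum(w * v for w, v in zip(W1, v_attr))

# Eq. 19: Number as a fractional binding power of the numeric base vector.
v_number = power(base, W1.sum())

# Eq. 20: overall position SHDR, bundling position-existence bindings.
C_pos = sum(w * bind(pj, ve) for w, pj, ve in zip(W1, p, v_exist_rv))
```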
#### IV-E 2 Reasoning backend
The Rel-SAR model implements systematic abductive reasoning efficiently by leveraging HD attribute representations and VSA-based relation functions. The HD attribute representations from the frontend already reside in an HD vector space that supports VSA operations for both numerical and logical relation functions (Equations 6 to 9). Consequently, the reasoning backend can perform systematic rule abduction and execution based on these relation functions, without extensive use of explicit rule templates.
TABLE V: Attribute sets for n-ary relation.
| Set | Relation pairs |
| --- | --- |
| $\mathbb{V}^{2}$ | $\left(\boldsymbol{v}_{\left(1,1\right)},\boldsymbol{v}_{\left(1,2\right)}\right)$ , $\left(\boldsymbol{v}_{\left(1,2\right)},\boldsymbol{v}_{\left(1,3\right)}\right)$ , $\left(\boldsymbol{v}_{\left(2,1\right)},\boldsymbol{v}_{\left(2,2\right)}\right)$ , $\left(\boldsymbol{v}_{\left(2,2\right)},\boldsymbol{v}_{\left(2,3\right)}\right)$ , $\left(\boldsymbol{v}_{\left(3,1\right)},\boldsymbol{v}_{\left(3,2\right)}\right)$ ; $\left(\boldsymbol{v}_{\left(3,2\right)},\boldsymbol{v}_{\left(y\right)}\right)$ |
| $\mathbb{V}^{3}$ | $\left(\boldsymbol{v}_{\left(1,1\right)},\boldsymbol{v}_{\left(1,2\right)},\boldsymbol{v}_{\left(1,3\right)}\right)$ , $\left(\boldsymbol{v}_{\left(2,1\right)},\boldsymbol{v}_{\left(2,2\right)},\boldsymbol{v}_{\left(2,3\right)}\right)$ ; $\left(\boldsymbol{v}_{\left(3,1\right)},\boldsymbol{v}_{\left(3,2\right)},\boldsymbol{v}_{\left(y\right)}\right)$ |
Rule Abduction. Attributes in the RAVEN dataset follow row-major binary or ternary relations [21]. All possible binary $\mathbb{V}^{2}$ and ternary $\mathbb{V}^{3}$ relation pairs in the RPM test are presented in Table V, where $\boldsymbol{v}_{(i,j)}$ denotes the HD attribute representation for a given attribute in the context panel at row $i$ and column $j$ , and $\boldsymbol{v}_{\left(y\right)}$ represents the corresponding HD attribute representation of the target answer panel. Consequently, rule abduction can be formulated as an optimization problem: For both numerical and logical rules, the rule abduction module must identify a set of operator powers $OP_{1:M}$ such that all $N$ -ary ( $N=2,3$ ) relation pairs yield the same output $\boldsymbol{r}_{Num/Lgc}$ when processed through their respective relation functions $R_{Num/Lgc}$ . Formally:
$$
\underset{OP_{1:M}}{\max}s^{N}=\prod_{\mathbb{V}_{i}^{N},\mathbb{V}_{j}^{N}\in
\mathbb{V}^{N}}^{i\neq j}{sim\left(R\left(\mathbb{V}_{i}^{N},OP_{1:M}\right),R
\left(\mathbb{V}_{j}^{N},OP_{1:M}\right)\right)} \tag{21}
$$
where $sim$ denotes cosine similarity, and $R$ represents either the numerical relation function $R_{Num}$ or the logical relation function $R_{Lgc}$ . $\mathbb{V}_{i}^{N},\mathbb{V}_{j}^{N}\in\mathbb{V}^{N}$ ( $i\neq j$ ) refers to any two relation pairs selected from $\mathbb{V}^{N}$ ( $N=2,3$ ). $s^{N}_{Num/Lgc}$ represents the overall similarity between the outputs $r$ of all corresponding relation pairs and can be interpreted as an unnormalized probability of the corresponding rule.
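A minimal sketch of this scoring step (assuming the `rel_num` helper from the earlier sketch and relation pairs organized as in Table V; all names are ours):

```python
import itertools
import numpy as np

def cos(a, b):
    """Cosine similarity, valid for real or complex hypervectors."""
    return np.real(np.vdot(a, b)) / (np.linalg.norm(a) * np.linalg.norm(b))

def abduction_score(relation_pairs, op, rel_fn):
    """Unnormalized rule probability s^N of Eq. 21: the product of pairwise
    cosine similarities between the relation-function outputs."""
    outs = [rel_fn(V, op) for V in relation_pairs]
    return np.prod([cos(outs[i], outs[j])
                    for i, j in itertools.combinations(range(len(outs)), 2)])

# Selecting between binary and ternary hypotheses for one attribute:
# s2 = abduction_score(V2_pairs, op_hat_2, rel_num)   # pairs from Table V
# s3 = abduction_score(V3_pairs, op_hat_3, rel_num)
# The arity with the larger score, with its op_hat and averaged r-bar, wins.
```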
Based on the above idea, in the rule abduction module (Figure 4 c, left), the HD attribute representations of the $8$ context panels for a given numerical attribute are input into a trainable neural network $f_{\phi}^{Num}$ , the rule learner, to predict the operator powers $\widehat{OP}_{1:M}$ expected to optimize the objective in Equation 21. Subsequently, all attribute sets for binary and ternary relations (Table V) are passed through the corresponding numerical relation functions using the predicted $\widehat{OP}_{1:M}$ , yielding the outputs $\hat{\boldsymbol{r}}_{Num}$ . From these outputs, the unnormalized probabilities $s^{2}_{Num}$ and $s^{3}_{Num}$ for binary and ternary relations, respectively, can be computed (Equation 21). The operator powers $\widehat{OP}_{1:M}$ and the averaged output $\bar{\boldsymbol{r}}_{Num}$ associated with the larger $s^{N}_{Num}$ are then taken as the underlying numerical rule.
Similarly, another trainable rule learner, $f_{\varphi}^{Lgc}$ , is used to predict the operator powers $\widehat{OP}_{1:M}$ for the logical rules associated with the attribute Position (Figure 4 c, right). Since the logical rules in the RAVEN dataset involve only ternary relations, all logical representations are organized according to the attribute sets of ternary relations in Table V and are then input into the ternary logical relation functions with the predicted parameters $\widehat{OP}_{1:M}$ . Based on the outputs $\hat{\boldsymbol{r}}_{Lgc}$ from the relation functions, the unnormalized probability $s^{3}_{Lgc}$ for logical relations can be computed (Equation 21). Subsequently, $s^{3}_{Lgc}$ is compared with $s^{N}_{Num}$ , which corresponds to the numerical relations for the attribute Position. If $s^{3}_{Lgc}$ is larger, the operator powers $\widehat{OP}_{1:M}$ and the averaged output $\bar{\boldsymbol{r}}_{Lgc}$ for logical relations are interpreted as the underlying logical rule.
Rule Execution. After obtaining the predicted operator powers ${\widehat{OP}_{1:M}}$ and the outputs $\bar{\boldsymbol{r}}_{Num/Lgc}$ that represent the rules, we apply these rules to infer the HD attribute representations of the missing panel. For a given attribute, the corresponding attribute representations from the first two panels in the third row of the RPM test are input into the inverse numerical and logical relation functions using the predicted ${\widehat{OP}_{1:M}}$ and $\bar{\boldsymbol{r}}_{Num/Lgc}$ , resulting in the retrieval of the missing HD attribute representation $\hat{\boldsymbol{v}}_{\left(3,3\right)}$ :
$$
\hat{\boldsymbol{v}}_{\left(3,3\right)}=R_{Num/Lgc}^{-1}\left(\mathbb{V},
\widehat{OP}_{1:M},\bar{\boldsymbol{r}}_{Num/Lgc}\right) \tag{22}
$$
where $\mathbb{V}=(\boldsymbol{v}_{\left(3,2\right)})$ for binary relation rules and $(\boldsymbol{v}_{\left(3,1\right)},\boldsymbol{v}_{\left(3,2\right)})$ for ternary relation rules.
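For the numerical case, one consistent way to realize $R_{Num}^{-1}$ in the FHRR sketch introduced earlier is to unbind the known panels with negative operator powers and then undo the missing panel's own power (a sketch under our earlier assumptions, reusing `bind`, `power`, `numeric_vector`, `base`, and `cos`; it is not the paper's exact implementation of Equations 8 and 9):

```python
def rel_num_inverse(V, OP, r_bar):
    """Sketch of the inverse numerical relation function in Eq. 22: unbind the
    known panels from r-bar, then undo the missing panel's operator power."""
    acc = r_bar
    for v, op in zip(V, OP[:-1]):
        acc = bind(acc, power(v, -op))  # negative power = unbinding in FHRR
    return power(acc, 1.0 / OP[-1])

# Arithmetic '+' (OP = (-1, -1, +1), r-bar = v(0)): from panels v(2) and v(3)
# the missing panel is recovered as v(5), i.e. 2 + 3 = 5 under FPE.
v33_hat = rel_num_inverse([numeric_vector(base, 2), numeric_vector(base, 3)],
                          (-1, -1, +1), numeric_vector(base, 0))
print(cos(v33_hat, numeric_vector(base, 5)))  # ~1.0
```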
The final answer selection. Finally, we calculate the similarity between all HD attribute representations of the missing panel ( $\boldsymbol{v}^{position}_{\left(3,3\right)},\boldsymbol{v}^{number}_{\left(3,3\right)},\boldsymbol{v}^{type}_{\left(3,3\right)},\boldsymbol{v}^{size}_{\left(3,3\right)},\boldsymbol{v}^{color}_{\left(3,3\right)}$ ) and the corresponding attribute representations of each candidate panel $y$ . The predicted answer panel $\hat{y}$ is the one with the highest total similarity score.
### IV-F Model training
#### IV-F 1 End-to-end training
During end-to-end training, the Rel-SAR model utilizes the rule labels provided by the RAVEN dataset and the answer labels to optimize the objectives of visual perception (Equation 13) and rule abduction (Equation 21). Based on the rule labels, the corresponding ground-truth $OP_{1:M}^{gt}$ and $\boldsymbol{r}_{Num/Lgc}^{gt}$ , which represent the rules, can be obtained from Table IV. To facilitate the learning of $OP_{1:M}$ , we design the loss function $\mathcal{L}_{op}$ , which constrains the rule learners $f_{\phi}^{Num}$ and $f_{\varphi}^{Lgc}$ to optimize the objective described in Equation 21:
$$
\mathcal{L}_{op}=MSE\left(\widehat{OP}_{1:M},OP_{1:M}^{gt}\right) \tag{23}
$$
which is a mean squared error ( $MSE$ ) loss between the predicted operator powers $\widehat{OP}_{1:M}$ and the corresponding ground truth. Additionally, we introduce the loss function $\mathcal{L}_{\boldsymbol{r}}$ to ensure consistent outputs when the inputs to the relation function follow a given rule. This is formulated as follows:
$$
\mathcal{L}_{\boldsymbol{r}}=\sum_{i}{\left(1-sim\left(\hat{\boldsymbol{r}}_{i
},\boldsymbol{r}^{gt}\right)\right)} \tag{24}
$$
where $\widehat{\boldsymbol{r}}_{i}$ denotes the output of the relation function for the $i$ -th relation pair in $\mathbb{V}^{2}$ and $\mathbb{V}^{3}$ . The overall loss function $\mathcal{L}$ for end-to-end training is constructed as:
$$
\mathcal{L}=\sum{\mathcal{L}_{op}}+\sum{\mathcal{L}_{\boldsymbol{r}}} \tag{25}
$$
where $\sum$ represents the sum of the loss functions across all attributes, for both binary and ternary relation functions, and for both numerical and logical rule types. By minimizing $\mathcal{L}_{op}$ and $\mathcal{L}_{\boldsymbol{r}}$ simultaneously during training, the optimization objective of the perception network $f_{\theta}$ , as described in Equation 13, can be achieved. This is because, under the constraints of numerous ground-truth rules ( $OP_{1:M}^{gt}$ and $\boldsymbol{r}_{Num/Lgc}^{gt}$ ), the expected theoretical SHDR constructed from HD attribute representations in the codebooks will be a competitive representation, guiding the reasoning process toward optimality ( $\mathcal{L}_{op}\rightarrow 0$ and $\mathcal{L}_{\boldsymbol{r}}\rightarrow 0$ ).
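For concreteness, the two loss terms can be sketched as follows (reusing the `cos` helper from the abduction sketch; the aggregation over attributes, arities, and rule types is indicated only schematically):

```python
import numpy as np

def loss_op(op_hat, op_gt):
    """L_op (Eq. 23): mean squared error on the predicted operator powers."""
    return float(np.mean((np.asarray(op_hat) - np.asarray(op_gt)) ** 2))

def loss_r(r_hats, r_gt):
    """L_r (Eq. 24): sum of (1 - cosine similarity) between each relation-pair
    output and the ground-truth rule vector."""
    return float(sum(1.0 - cos(r, r_gt) for r in r_hats))

# Eq. 25: the total loss sums L_op and L_r over all attributes, both arities,
# and both rule types, e.g.
# L = sum(loss_op(h.op_hat, h.op_gt) + loss_r(h.r_hats, h.r_gt) for h in heads)
```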
#### IV-F 2 End-to-end training with auxiliary attribute labels
Following previous work, we assess the performance of the Rel-SAR model in end-to-end training using both auxiliary attribute labels and answer labels. Here, a cosine similarity loss is employed as the perception loss function to enhance the similarity between the estimated SHDR $\hat{\mathcal{S}}^{ind}$ of the perception frontend and the theoretical SHDR $\mathcal{S}$ (Equation 10) derived from attribute labels, thereby optimizing the trainable perception network $f_{\theta}$ (Equation 13). The perception loss function $\mathcal{L}_{p}$ is defined as follows:
$$
\mathcal{L}_{p}=1-sim\left(\hat{\mathcal{S}}^{ind},\mathcal{S}\right) \tag{26}
$$
Meanwhile, to achieve rule learning through the optimization objective described in Equation 21, we introduce the loss function $\mathcal{L}_{rs}$ to increase the overall similarity among the outputs $\boldsymbol{r}$ of all corresponding relation pairs in $\mathbb{V}^{2}$ and in $\mathbb{V}^{3}$ (Table V). This is formulated as follows:
$$
\mathcal{L}_{rs}=1-\prod_{i,j}^{i\neq j}{sim\left(\hat{\boldsymbol{r}}_{i},
\hat{\boldsymbol{r}}_{j}\right)} \tag{27}
$$
where $\hat{\boldsymbol{r}}_{i}$ and $\hat{\boldsymbol{r}}_{j}$ denote the outputs of the relation functions for any two distinct relation pairs in $\mathbb{V}^{2}$ and $\mathbb{V}^{3}$ , including attribute sets from the answer panel $y$ . Therefore, the overall loss function $\mathcal{L}$ for end-to-end training with auxiliary attribute labels is constructed as:
$$
\mathcal{L}=\sum{\mathcal{L}_{p}}+\sum{\mathcal{L}_{rs}} \tag{28}
$$
where the former $\sum$ represents the sum of the loss functions across all context panels and the answer panel, while the latter $\sum$ represents the sum of the loss functions across all attributes.
Additionally, similar to other neuro-symbolic approaches, the perception frontend and reasoning backend in Rel-SAR can also be trained independently using the loss functions $\mathcal{L}_{p}$ and $\mathcal{L}_{rs}$ , respectively.
## V Experiments
### V-A Datasets & Experiment setup
We evaluate our model on the RAVEN [21] and I-RAVEN [8] datasets. Each dataset contains 70,000 RPM samples, divided into training, validation, and test sets with a 6:2:2 ratio. We use a ResNet-50 encoder ( $f_{\theta}$ ) to map the image panels $\mathcal{X}$ to their SHDR $\mathcal{S}\in\mathbb{R}^{d}$ , where the dimension $d$ of all vectors is set to $3000$ . Two 5-layer fully connected networks serve as the rule learners ( $f_{\phi}^{Num}$ and $f_{\varphi}^{Lgc}$ ), extracting $OP_{1:M}$ from the updated representations of the numerical attributes (Number, Type, Size, and Color) and of Position, respectively. The existence vectors of the backend codebook are set to HRR vectors, because the normalization of FHRR in the superposition operation would invalidate the logical reasoning; all other vectors are FHRR. We use the AdamW optimizer with a learning rate of $1\times 10^{-4}$ and a weight decay of $1\times 10^{-4}$ .
TABLE VI: Test accuracy (%) on the RAVEN and I-RAVEN datasets
| | Method | RAVEN: Avg | Center | 2X2 | 3X3 | L-R | U-D | O-IC | O-IG | I-RAVEN: Avg | Center | 2X2 | 3X3 | L-R | U-D | O-IC | O-IG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Deep Learning Model | WReN [4] | 14.7 | 13.1 | 28.6 | 28.3 | 7.5 | 6.3 | 8.4 | 10.6 | 23.8 | 29.4 | 26.8 | 23.5 | 21.9 | 21.4 | 22.5 | 21.5 |
| | LEN [7] | 72.9 | 80.2 | 57.5 | 62.1 | 73.5 | 81.2 | 84.4 | 71.5 | 41.4 | 56.4 | 31.7 | 29.7 | 44.2 | 44.2 | 52.1 | 31.7 |
| | CoPINet [6] | 91.4 | 95.1 | 77.5 | 78.9 | 99.1 | 99.7 | 98.5 | 91.4 | 46.1 | 54.4 | 36.8 | 31.9 | 51.9 | 52.5 | 52.2 | 42.8 |
| | MXGNet [39] | 84.0 | 94.3 | 60.5 | 64.9 | 96.6 | 96.4 | 94.1 | 81.3 | 33.1 | 40.7 | 27.9 | 24.7 | 35.8 | 34.5 | 36.4 | 31.6 |
| | SCL [28] | 91.6 | 98.1 | 91.0 | 82.5 | 96.8 | 96.5 | 96.0 | 80.1 | 95.0 | 99.0 | 96.2 | 89.5 | 97.9 | 97.1 | 97.6 | 87.7 |
| | SRAN [8] | 54.3 | — | — | — | — | — | — | — | 60.8 | 78.2 | 50.1 | 42.4 | 70.1 | 70.3 | 68.2 | 46.3 |
| | Rel-Base [30] | 91.7 | 97.6 | 85.9 | 86.9 | 93.5 | 96.5 | 97.6 | 83.8 | 91.1 | — | — | — | — | — | — | — |
| | MRNet [9] | 74.4 | 96.2 | 49.1 | 45.9 | 93.7 | 94.2 | 92.5 | 51.3 | 75.0 | 96.8 | 45.6 | 39.6 | 95.7 | 95.9 | 95.6 | 55.5 |
| | DCNet [25] | 93.6 | 97.8 | 81.7 | 86.7 | 99.8 | 99.8 | 99.0 | 91.5 | 49.4 | 57.8 | 34.1 | 35.5 | 58.5 | 60.0 | 57.0 | 42.9 |
| | STSN [29] | 89.7 | — | — | — | — | — | — | — | 95.7 | 98.6 | 96.2 | 88.8 | 98.0 | 98.8 | 97.8 | 92.0 |
| | PredRNet [27] | 95.8 | — | — | — | — | — | — | — | 96.5 | — | — | — | — | — | — | — |
| | DRNet [26] | 96.9 | — | — | — | — | — | — | — | 97.6 | — | — | — | — | — | — | — |
| Neuro-Symbolic Model | PrAE [10] | 65.0 | 76.5 | 78.6 | 28.6 | 90.1 | 90.9 | 48.1 | 42.6 | 77.0 | 90.5 | 85.4 | 45.6 | 96.3 | 97.4 | 63.5 | 60.7 |
| | ALANS [13] | 74.4 | 69.1 | 80.2 | 75.0 | 72.2 | 73.3 | 76.3 | 74.9 | 78.5 | 72.3 | 79.5 | 72.9 | 79.2 | 79.6 | 85.9 | 79.9 |
| | NVSA [12] | 87.7 | 99.7 | 93.5 | 57.1 | 99.8 | 99.7 | 98.6 | 65.4 | 88.1 | 99.8 | 96.2 | 54.3 | 100 | 99.9 | 99.6 | 67.1 |
| | Rel-SAR (Ours) | 96.5 | 99.1 | 95.7 | 96.2 | 99.6 | 99.6 | 99.1 | 86.2 | 98.0 | 99.8 | 97.1 | 98.1 | 100 | 100 | 99.9 | 90.9 |
| Neuro-Symbolic Model (attribute labels) | NVSA [12] | 98.5 | 100 | 99.4 | 96.3 | 100 | 100 | 100 | 93.9 | 99.0 | 100 | 99.5 | 97.1 | 100 | 100 | 100 | 96.4 |
| | Xu et al. [40] | 92.9 | 98.8 | 91.9 | 93.1 | 99.2 | 99.1 | 98.2 | 70.1 | 93.2 | 99.5 | 89.6 | 89.7 | 99.7 | 99.5 | 99.6 | 74.7 |
| | ALANS-V [13] | 94.4 | 98.4 | 91.5 | 87.0 | 97.3 | 96.4 | 97.3 | 93.2 | 93.5 | 98.9 | 85.0 | 83.2 | 90.9 | 98.1 | 99.1 | 89.5 |
| | Rel-SAR (Ours) | 96.6 | 97.9 | 94.3 | 96.6 | 99.0 | 98.7 | 97.9 | 92.2 | 98.1 | 98.5 | 96.7 | 97.8 | 99.4 | 99.8 | 99.2 | 95.6 |
### V-B End-to-end learning results
Here we first evaluate the Rel-SAR model on both the RAVEN [21] and I-RAVEN [8] datasets using end-to-end learning. The results, summarized in Table VI, compare our model with both deep neural network methods and neuro-symbolic AI methods. Rel-SAR achieves an average accuracy of $96.5\%$ on RAVEN and $98.0\%$ on I-RAVEN, comparable to the previous best-performing deep network, DRNet ( $96.9\%$ on RAVEN and $97.6\%$ on I-RAVEN). Compared to the previous neuro-symbolic method NVSA, Rel-SAR demonstrates a significant accuracy improvement on configurations involving rules based on the position attribute, including 2x2Grid ( $+2.2\%$ ), 3x3Grid ( $+39.1\%$ ), and Out-InGrid ( $+20.8\%$ ). This results in an average accuracy improvement of $8.8\%$ on RAVEN and $9.9\%$ on I-RAVEN. These improvements are attributed to the effectiveness of our circular and logical HD representations for the position attribute, as well as the numerical and logical relation functions used in rule abduction and execution. Additionally, Rel-SAR utilizes only the answer panel during training, without leveraging information from other candidate panels. This ensures that Rel-SAR does not exploit potential defects in the RAVEN dataset for shortcut learning [8], resulting in similar accuracy on the RAVEN and I-RAVEN datasets ( $96.5\%$ vs. $98.0\%$ ).
In addition, since neuro-symbolic models are hybrid architectures that disentangle perception and reasoning, they are often trained with auxiliary attribute labels. We therefore evaluate the Rel-SAR model on both datasets when trained with additional attribute labels. The results, presented in the bottom section of Table VI, show that our model achieves an average accuracy of $96.6\%$ on RAVEN and $98.1\%$ on I-RAVEN. Among the neuro-symbolic models compared, the ALANS learner is the most similar to Rel-SAR, as both models incorporate learnable parameters in their reasoning backends. Compared to ALANS, our model demonstrates an average accuracy improvement of $2.2\%$ on RAVEN and $4.6\%$ on I-RAVEN. In contrast, the reasoning backends in NVSA and Xu’s model rely on pre-designed rule templates or algebraic invariance modules without learnable parameters. NVSA, with its comprehensive rule template library, achieves near-perfect accuracy when trained with auxiliary attribute labels. Although our model slightly underperforms NVSA when trained with auxiliary attribute labels, it exhibits an important advantage: when trained without attribute labels, both NVSA and the ALANS learner suffer significant performance drops, whereas our model maintains nearly identical performance. This consistency highlights the robust synergy between the perception and reasoning modules in our model.
### V-C Perception results learned with the attribute labels
TABLE VII: The visual perception accuracy on RAVEN dataset.
| Method | Avg | Center | 2X2 | 3X3 | L-R | U-D | O-IC | O-IG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PrAE | 85.27 | 88.65 | 93.56 | 73.95 | 100 | 100 | 94.23 | 46.25 |
| Xu et al. | 96.10 | 100 | 100 | 99.99 | 100 | 100 | 99.96 | 72.78 |
| NVSA | 99.76 | 100 | 99.83 | 98.61 | 99.97 | 99.96 | 99.97 | 99.95 |
| Rel-SAR | 99.99 | 100 | 100 | 99.99 | 100 | 100 | 100 | 99.94 |
Similar to other neuro-symbolic methods, the perception frontend in Rel-SAR can also be independently trained with attribute labels using the perception loss function (Equation 26). By querying the estimated HD attribute representations for the highest similarity with the attribute vectors in the frontend codebooks, we can retrieve the predicted entity-level attributes. A panel’s features are considered correctly extracted only if the attributes of all objects in the panel match those provided by the dataset. As shown in Table VII, the evaluation results demonstrate that the perception frontend in our model achieves an average panel accuracy of $99.99\%$ on the RAVEN dataset, maintaining nearly perfect performance, consistent with NVSA. Notably, NVSA reports a resolution issue in the RAVEN dataset, where some objects in the inner regions of the O-InGrid configuration have a different size attribute but the same image representation [12]. To address this, we adopt a similar solution to NVSA, merging classes with different sizes but identical panel representation.
To evaluate the generalization capability of the perception frontend of our model, we adopt the experimental settings outlined in [12] to test the model on unseen combinations of attribute values. Specifically, we focus on the single-object case in the 2x2Grid configuration. As shown in Table VIII, we select two attributes (e.g. Position and Type) along with partial value sets for each (e.g. Position $\in\{0,3\}$ and Type $\in\{0,2\}$ ) as the target attributes. Panels containing these target attribute values are included in the training set, while panels lacking them are designated for the test set. This approach ensures that the attribute sets of the training and test datasets are entirely disjoint, enabling a rigorous evaluation of the module’s ability to generalize beyond the observed data.
In this experiment, we observe that our perception frontend struggles to identify unseen combinations of attribute values when the value vectors $\boldsymbol{v}$ in Equation 10 are Random Vectors (RVs). We attribute this issue to the orthogonality of RVs, which hinders the model’s ability to recognize adjacency concepts. To address this, we replace the RVs with fractional power encoding (FPE) vectors based on a Gaussian kernel, which introduces progressive similarity between vectors [18, 19]. As shown in Table VIII, our model demonstrates better generalization on unseen attribute combinations compared to NVSA [12] when using FPE vectors with progressive similarity (NVs). However, the type-size combination still shows low accuracy, likely because there is no clear continuous progression between adjacent attribute values of type (e.g., triangle and square).
We also conduct an experiment to evaluate the generalization ability of our perception frontend when applied to an unseen number of objects [12]. The training set consists of panels with a limited number of objects (e.g., $k_{train}=1$ ), while the test set consists of panels with a larger number of objects (e.g., $k_{test}=2$ ). As shown in Table IX, when $k_{train}=1$ , our model fails to accurately predict the attributes of panels containing more objects. By analyzing the predicted attribute values, we observe that our model exhibits a consistent distribution of predicted attribute attention weights across different test sets, attributable to the limited diversity of the training panels (Figure 5 a). However, as the number of objects in the training panels increases, the model learns more complex patterns and can correctly allocate attention to different positions (Figure 5 b). Our model achieves perfect generalization performance when $k_{train}=2$ for 2x2Grid and when $k_{train}=3$ for 3x3Grid.
TABLE VIII: Accuracy of attribute-value generalization on 2x2Grid containing $k=1$ object
| Training set | Test set | # Train | # Test | NVSA [12] | Rel-SAR (RVs) | Rel-SAR (FPE) |
| --- | --- | --- | --- | --- | --- | --- |
| Position $\in\{0,3\}$ OR Type $\in\{0,2\}$ | Position $\notin\{0,3\}$ AND Type $\notin\{0,2\}$ | 6720 | 2880 | 0.0 | 0.0 | 26.8 |
| Position $\in\{0,3\}$ OR Size $\in\{1,5\}$ | Position $\notin\{0,3\}$ AND Size $\notin\{1,5\}$ | 6400 | 3200 | 15.1 | 0.0 | 31.4 |
| Position $\in\{0,3\}$ OR Color $\in\{0,3,6,8\}$ | Position $\notin\{0,3\}$ AND Color $\notin\{0,3,6,8\}$ | 6720 | 2880 | 34.8 | 0.0 | 73.5 |
| Type $\in\{0,2\}$ OR Size $\in\{1,5\}$ | Type $\notin\{0,2\}$ AND Size $\notin\{1,5\}$ | 5760 | 3840 | 0.0 | 0.0 | 13.9 |
| Type $\in\{0,2\}$ OR Color $\in\{0,3,6,8\}$ | Type $\notin\{0,2\}$ AND Color $\notin\{0,3,6,8\}$ | 6144 | 3456 | 72.0 | 0.0 | 89.5 |
| Size $\in\{1,5\}$ OR Color $\in\{0,3,6,8\}$ | Size $\notin\{1,5\}$ AND Color $\notin\{0,3,6,8\}$ | 5760 | 3840 | 29.3 | 0.0 | 82.6 |
TABLE IX: Generalization to a growing number of unseen objects in the RAVEN panel
| Config | # Train | $k_{train}$ | $k_{test}=2$ | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Avg | Weighted Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2x2Grid | 9600 | 1 | 0.3 | 0.0 | 0.0 | — | — | — | — | — | 0.1 | 0.1 |
| 2x2Grid | 19200 | 2 | 100.0 | 100.0 | 100.0 | — | — | — | — | — | 100.0 | 100.0 |
| 3x3Grid | 21600 | 1 | 0.2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3x3Grid | 43200 | 2 | 100.0 | 100.0 | 99.4 | 70.3 | 21.6 | 0.4 | 0.0 | 0.0 | 49.0 | 41.7 |
| 3x3Grid | 64800 | 3 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
| 3x3Grid | 86400 | 4 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 |
Figure 5: Attention weights for existence on 2X2Grid. For the existence attribute, the presence weights ( $W_{j}^{exist}\left(1\right)$ in Equation 15) represent the probability of an object being present at the $j$ th position. The sum of these weights provides an approximate indication of the total number of objects in a panel. (a) When $k_{train}=1$ , the model consistently predicts the presence of only one object in the panel, regardless of the actual test scenario. (b) When $k_{train}=2$ , the model learns to distinguish between panels with varying numbers of objects, effectively adapting its feature extraction process.
### V-D Reasoning results utilizing the attribute labels
We analyze the performance of the reasoning backend of our model on the I-RAVEN dataset. Following the assumption of perfect perception in prior works [14, 31], we also utilize the ground-truth attribute labels provided by the I-RAVEN dataset to generate HD attribute representations. The model is trained using the reasoning loss function (Equation 27), and the corresponding evaluation results are shown in Table X. We compare our model with the LLM-based GPT-3 [41], the deep neural network SCL [28], the neuro-symbolic methods PrAE [10] and NVSA [12], and the VSA-based methods Learn-VRF [31] and ARLC [14]. The performance of GPT-3 is reported in [31]. Our method achieves an accuracy of $99.2\%$ , representing a $6.8\%$ improvement over the previous state-of-the-art model, ARLC. Notably, our approach significantly outperforms the prior VSA-based method ARLC on the 2X2Grid ( $+14.9\%$ ), 3X3Grid ( $+17.7\%$ ), and O-InGrid ( $+9.4\%$ ) configurations, which involve rules on the Position attribute. This improvement demonstrates that the circular and logical HD representations of Position, along with the corresponding relation functions, effectively handle the reasoning of the Arithmetic and Progression rules on Position.
TABLE X: In-distribution accuracy on the I-RAVEN dataset
| Method | Avg | Center | 2X2 | 3X3 | L-R | U-D | O-IC | O-IG |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3 [41] | 86.5 | 86.4 | 83.2 | 81.8 | 83.4 | 84.6 | 92.8 | 93.0 |
| SCL [28] | 84.3 | 99.9 | 68.9 | 43.0 | 98.5 | 99.1 | 97.7 | 82.6 |
| PrAE [10] | 71.1 | 83.8 | 82.9 | 47.4 | 94.8 | 94.8 | 56.6 | 37.4 |
| NVSA [12] | 88.1 | 99.8 | 96.2 | 54.3 | 100 | 99.9 | 99.6 | 67.1 |
| Learn-VRF [31] | 79.5 | 97.7 | 56.3 | 49.9 | 94.0 | 95.6 | 98.3 | 64.8 |
| ARLC [14] | 92.4 | 98.4 | 83.4 | 80.0 | 98.7 | 98.4 | 98.8 | 89.4 |
| Rel-SAR | 99.2 | 99.9 | 98.3 | 97.7 | 100.0 | 99.9 | 100.0 | 98.8 |
TABLE XI: Out-of-distribution accuracy on unseen rule-attribute pairs on I-RAVEN
| Method | Type: Const. | Type: Progr. | Type: Dist.3 | Size: Const. | Size: Progr. | Size: Arth. | Size: Dist.3 | Color: Const. | Color: Progr. | Color: Arth. | Color: Dist.3 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-3 [41] | 88.5 | 86.0 | 88.6 | 93.6 | 93.2 | 92.6 | 71.6 | 94.2 | 94.7 | 94.3 | 65.8 |
| Learn-VRF [31] | 100 | 100 | 99.7 | 100 | 100 | 99.8 | 99.8 | 100 | 98.8 | 100 | 100 |
| ARLC [14] | 100 | 98.6 | 99.7 | 100 | 100 | 99.6 | 99.6 | 100 | 100 | 100 | 99.8 |
| Rel-SAR | 99.7 | 99.6 | 100 | 99.3 | 99.6 | 97.9 | 100 | 96.8 | 89.5 | 86.4 | 70.7 |
We also evaluate the out-of-distribution (OOD) generalization ability of our reasoning backend, following the experimental setup described in [31]. In this evaluation, a specific rule (e.g., Const.) on an attribute (e.g., Type) is designated as the target rule-attribute pair. The model is trained and validated on samples that exclude the target pair and is then tested on samples containing only the target pair. This experiment is conducted on the Center configuration, with the training, validation, and test sets filtered from the I-RAVEN dataset. The corresponding evaluation results are shown in Table XI. For the attributes Type and Size, our model, like Learn-VRF [31] and ARLC [14], demonstrates a near-perfect ability to generalize to unseen attribute rules. This capability arises from the use of unified HD vectors to represent values across different attributes, facilitating rule transfer between attributes. However, our model exhibits relatively limited performance on the Color attribute. This may be because Color has a broader range of attribute values, making it difficult to transfer rules learned from attributes with narrower value ranges, such as Type (0-4) and Size (0-5), to Color (0-9).
## VI Conclusion and Future Directions
In this work, we propose Rel-SAR, a novel model that leverages VSA algebra to facilitate systematic rule abduction and execution. Rel-SAR adopts a neuro-symbolic architecture, where the perception frontend extracts diverse high-dimensional attribute representations with intrinsic algebraic properties, and the reasoning backend systematically derives a variety of rules based on relation functions. Extensive experiments demonstrate that Rel-SAR achieves superior performance while offering better interpretability and transparency on RPM tasks.
The perception frontend of Rel-SAR effectively extracts object-level attributes while preserving the structural information of the image through the binding and bundling operations of VSA. By utilizing fractional power encoding vectors with progressive similarity, our model demonstrates its ability to generalize beyond the observed data. However, Rel-SAR exhibits relatively low accuracy on the Out-InGrid configuration, which may be attributed to the small object sizes and the resolution issue. As noted in [27], CNNs with large kernel sizes or more stacked blocks are less effective at extracting features from RPM images with smaller objects. NVSA further shows that reducing the stride from 2 to 1 in the first convolutional block and removing the max-pooling layer in ResNet-18 can improve accuracy [12]. We believe that using CNNs with smaller kernel sizes and strides in the shallow layers will enhance the accuracy on the O-IG configuration.
Consistent with other neuro-symbolic methods, we also introduce auxiliary rule labels during training. This is because, without precise rule-driven guidance, the model struggles to learn meaningful structured high-dimensional representations (SHDR) of attributes. We note that slot attention [42] enables unsupervised scene decomposition, while VQ-VAE [43] learns discrete latent representations, disentangling different concepts (i.e., attributes in RPM) from raw images. Therefore, we suggest employing a learnable frontend codebook, combined with slot attention, to enable unsupervised extraction of SHDRs from raw images in future work.
The reasoning backend of Rel-SAR implements systematic abductive reasoning based on diverse HD attribute representations and relation functions. Our model achieves significant improvements on configurations with rules on Position, demonstrating the effectiveness of the circular and logical representations of the Position attribute. However, the reasoning backend of Rel-SAR exhibits limited out-of-distribution (OOD) generalization ability. This limitation may stem from the rule learner, a multi-layer fully connected neural network, which lacks the capacity to generalize to OOD attribute values. A growing body of recent work emphasizes reasoning based on relations between perceptual inputs rather than the features of individual inputs [32, 33, 34]. This trend is encapsulated by the “relational bottleneck” principle [35], which aims to mitigate the influence of the absolute size of attribute values on relational reasoning. However, the relational bottleneck may struggle with more complex relations. Future studies could focus on combining relational bottlenecks with VSA algebra to handle complex relations while maintaining robust out-of-distribution generalization.
## References
- [1] P. A. Carpenter, M. A. Just, and P. Shell, “What one intelligence test measures: a theoretical account of the processing in the raven progressive matrices test.” Psychological review, vol. 97, no. 3, p. 404, 1990.
- [2] W. B. Bilker, J. A. Hansen, C. M. Brensinger, J. Richard, R. E. Gur, and R. C. Gur, “Development of abbreviated nine-item forms of the raven’s standard progressive matrices test,” Assessment, vol. 19, no. 3, pp. 354–369, 2012.
- [3] M. Mitchell, “Abstraction and analogy-making in artificial intelligence,” Annals of the New York Academy of Sciences, vol. 1505, no. 1, pp. 79–101, 2021.
- [4] D. Barrett, F. Hill, A. Santoro, A. Morcos, and T. Lillicrap, “Measuring abstract reasoning in neural networks,” in International conference on machine learning. PMLR, 2018, pp. 511–520.
- [5] F. Hill, A. Santoro, D. G. Barrett, A. S. Morcos, and T. Lillicrap, “Learning to make analogies by contrasting abstract relational structure,” arXiv preprint arXiv:1902.00120, 2019.
- [6] C. Zhang, B. Jia, F. Gao, Y. Zhu, H. Lu, and S.-C. Zhu, “Learning perceptual inference by contrasting,” Advances in neural information processing systems, vol. 32, 2019.
- [7] K. Zheng, Z.-J. Zha, and W. Wei, “Abstract reasoning with distracting features,” Advances in Neural Information Processing Systems, vol. 32, 2019.
- [8] S. Hu, Y. Ma, X. Liu, Y. Wei, and S. Bai, “Stratified rule-aware network for abstract visual reasoning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 2, 2021, pp. 1567–1574.
- [9] Y. Benny, N. Pekar, and L. Wolf, “Scale-localized abstract reasoning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12557–12565.
- [10] C. Zhang, B. Jia, S.-C. Zhu, and Y. Zhu, “Abstract spatial-temporal reasoning via probabilistic abduction and execution,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 9736–9746.
- [11] G. F. Marcus, The algebraic mind: Integrating connectionism and cognitive science. MIT press, 2003.
- [12] M. Hersche, M. Zeqiri, L. Benini, A. Sebastian, and A. Rahimi, “A neuro-vector-symbolic architecture for solving raven’s progressive matrices,” Nature Machine Intelligence, vol. 5, no. 4, pp. 363–375, 2023.
- [13] C. Zhang, S. Xie, B. Jia, Y. N. Wu, S.-C. Zhu, and Y. Zhu, “Learning algebraic representation for systematic generalization in abstract reasoning,” in European Conference on Computer Vision. Springer, 2022, pp. 692–709.
- [14] G. Camposampiero, M. Hersche, A. Terzić, R. Wattenhofer, A. Sebastian, and A. Rahimi, “Towards learning abductive reasoning using vsa distributed representations,” in International Conference on Neural-Symbolic Learning and Reasoning. Springer, 2024, pp. 370–385.
- [15] F. A. Mansouri, D. J. Freedman, and M. J. Buckley, “Emergence of abstract rules in the primate brain,” Nature Reviews Neuroscience, vol. 21, no. 11, pp. 595–610, 2020.
- [16] G. Marcus and E. Davis, “Insights for ai from the human mind,” Communications of the ACM, vol. 64, no. 1, pp. 38–41, 2020.
- [17] S. A. Sloman, “The empirical case for two systems of reasoning.” Psychological bulletin, vol. 119, no. 1, p. 3, 1996.
- [18] T. A. Plate, Distributed representations and nested compositional structure. Citeseer, 1994.
- [19] E. Frady, D. Kleyko, C. Kymn, B. Olshausen, and F. Sommer, “Computing on functions using randomized vector representations,” arXiv preprint, Sep. 2021.
- [20] D. Kleyko, D. A. Rachkovskij, E. Osipov, and A. Rahimi, “A survey on hyperdimensional computing aka vector symbolic architectures, part i: Models and data transformations,” ACM Computing Surveys, vol. 55, no. 6, pp. 1–40, 2022.
- [21] C. Zhang, F. Gao, B. Jia, Y. Zhu, and S.-C. Zhu, “Raven: A dataset for relational and analogical visual reasoning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5317–5327.
- [22] A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap, “A simple neural network module for relational reasoning,” Advances in neural information processing systems, vol. 30, 2017.
- [23] M. Jahrens and T. Martinetz, “Solving raven’s progressive matrices with multi-layer relation networks,” in 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 2020, pp. 1–6.
- [24] M. Małkiński and J. Mańdziuk, “Multi-label contrastive learning for abstract visual reasoning,” IEEE Transactions on Neural Networks and Learning Systems, vol. 35, no. 2, pp. 1941–1953, 2022.
- [25] T. Zhuo and M. Kankanhalli, “Effective abstract reasoning with dual-contrast network,” arXiv preprint arXiv:2205.13720, 2022.
- [26] K. Zhao, C. Xu, and B. Si, “Learning visual abstract reasoning through dual-stream networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 15, 2024, pp. 16979–16988.
- [27] L. Yang, H. You, Z. Zhen, D. Wang, X. Wan, X. Xie, and R.-Y. Zhang, “Neural prediction errors enable analogical visual reasoning in human standard intelligence tests,” in International Conference on Machine Learning. PMLR, 2023, pp. 39572–39583.
- [28] Y. Wu, H. Dong, R. Grosse, and J. Ba, “The scattering compositional learner: Discovering objects, attributes, relationships in analogical reasoning,” arXiv preprint arXiv:2007.04212, 2020.
- [29] S. S. Mondal, T. Webb, and J. D. Cohen, “Learning to reason over visual objects,” arXiv preprint arXiv:2303.02260, 2023.
- [30] S. Spratley, K. Ehinger, and T. Miller, “A closer look at generalisation in raven,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII. Springer, 2020, pp. 601–616.
- [31] M. Hersche, F. Di Stefano, T. Hofmann, A. Sebastian, and A. Rahimi, “Probabilistic abduction for visual abstract reasoning via learning rules in vector-symbolic architectures,” arXiv preprint arXiv:2401.16024, 2024.
- [32] T. W. Webb, I. Sinha, and J. D. Cohen, “Emergent symbols through binding in external memory,” arXiv preprint arXiv:2012.14601, 2020.
- [33] A. Altabaa, T. Webb, J. Cohen, and J. Lafferty, “Abstractors and relational cross-attention: An inductive bias for explicit relational reasoning in transformers,” arXiv preprint arXiv:2304.00195, 2023.
- [34] G. Kerg, S. Mittal, D. Rolnick, Y. Bengio, B. Richards, and G. Lajoie, “On neural architecture inductive biases for relational tasks,” arXiv preprint arXiv:2206.05056, 2022.
- [35] T. W. Webb, S. M. Frankland, A. Altabaa, S. Segert, K. Krishnamurthy, D. Campbell, J. Russin, T. Giallanza, R. O’Reilly, J. Lafferty et al., “The relational bottleneck as an inductive bias for efficient abstraction,” Trends in Cognitive Sciences, 2024.
- [36] T. A. Plate, “Holographic reduced representations,” IEEE Transactions on Neural networks, vol. 6, no. 3, pp. 623–641, 1995.
- [37] P. Kanerva, “Hyperdimensional computing: An introduction to computing in distributed representation with high-dimensional random vectors,” Cognitive computation, vol. 1, pp. 139–159, 2009.
- [38] A. Vaswani et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [39] D. Wang, M. Jamnik, and P. Lio, “Abstract diagrammatic reasoning with multiplex graph networks,” arXiv preprint arXiv:2006.11197, 2020.
- [40] J. Xu, T. Vaidya, Y. Wu, S. Chandra, Z. Lai, and K. F. E. Chong, “Abstract visual reasoning: An algebraic approach for solving raven’s progressive matrices,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 6715–6724.
- [41] T. B. Brown et al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020.
- [42] F. Locatello, D. Weissenborn, T. Unterthiner, A. Mahendran, G. Heigold, J. Uszkoreit, A. Dosovitskiy, and T. Kipf, “Object-centric learning with slot attention,” Advances in neural information processing systems, vol. 33, pp. 11525–11538, 2020.
- [43] A. Van Den Oord, O. Vinyals et al., “Neural discrete representation learning,” Advances in neural information processing systems, vol. 30, 2017.
## Appendix A The full version of the logical relation function
RPM-style logic rules state that the set of attribute values in the third panel of a row or column is the result of a logic operation applied to the first two panels. The simplified version ($M=3$) of the logical relation function $R_{Lgc}$ cannot handle all RPM-style logic rules, such as XOR. To address this limitation, we developed a full version of $R_{Lgc}$ with $OP_{1:M}$ ($M=5$) that describes all meaningful RPM-style logical rules, defined as follows:
$$
\boldsymbol{r}_{Lgc}=R_{Lgc}\left(\boldsymbol{v}_{1:N},OP_{1:M}\right)=\left(\left(op_{1}\boldsymbol{v}_{1}\land op_{2}\boldsymbol{v}_{2}\right)\lor\left(op_{3}\boldsymbol{v}_{1}\land op_{4}\boldsymbol{v}_{2}\right)\right)\circ op_{5}\boldsymbol{v}_{3} \tag{29}
$$
where $op_{i}\in\{0,1\}$ determines whether $\boldsymbol{v}_{i}$ is negated: negation ($\lnot$) is applied when $op_{i}=1$ and omitted when $op_{i}=0$. The operation $op_{i}\boldsymbol{v}_{i}$ is implemented as follows:
$$
op_{i}\boldsymbol{v}_{i}=\left(\boldsymbol{e}\left(1\right)\right)^{\left(\circ op_{i}\right)}\circ\boldsymbol{v}_{i} \tag{30}
$$
where $\boldsymbol{e}\left(1\right)$ is the BV representing the Boolean value True. When $op_{i}=1$, $\left(\boldsymbol{e}\left(1\right)\right)^{\left(\circ op_{i}\right)}$ simplifies to $\boldsymbol{e}\left(1\right)$, and binding with it negates the HD vector $\boldsymbol{v}_{i}$ (see Table II). Similarly, when $op_{i}=0$, $\left(\boldsymbol{e}\left(1\right)\right)^{\left(\circ op_{i}\right)}$ simplifies to $\boldsymbol{e}\left(0\right)$, which leaves $\boldsymbol{v}_{i}$ unchanged.
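As a concrete illustration (our own sketch, assuming the BVs of Table II are dense binary vectors whose binding operator $\circ$ is elementwise XOR), binding with $\boldsymbol{e}(1)$ negates a vector while binding with $\boldsymbol{e}(0)$ leaves it unchanged:

```python
import numpy as np

D = 8  # tiny dimensionality, just for display
v = np.array([0, 1, 1, 0, 1, 0, 0, 1], dtype=np.uint8)
e1 = np.ones(D, dtype=np.uint8)   # e(1): BV for True
e0 = np.zeros(D, dtype=np.uint8)  # e(0): BV for False

# Binding (the circle operator) is elementwise XOR for BVs:
print(v ^ e1)  # [1 0 0 1 0 1 1 0] -> NOT v   (op_i = 1)
print(v ^ e0)  # [0 1 1 0 1 0 0 1] -> v as-is (op_i = 0)
```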
The inverse logical relation function is defined as follows:

$$
\boldsymbol{v}_{N}=R_{Lgc}^{-1}\left(\boldsymbol{v}_{1:N-1},OP_{1:M},\boldsymbol{r}\right)=op_{5}\left(\left(\left(op_{1}\boldsymbol{v}_{1}\land op_{2}\boldsymbol{v}_{2}\right)\lor\left(op_{3}\boldsymbol{v}_{1}\land op_{4}\boldsymbol{v}_{2}\right)\right)\circ\boldsymbol{r}\right) \tag{31}
$$

Since BV binding is self-inverse, binding the left-hand expression of Eq. (29) with $\boldsymbol{r}$ recovers $op_{5}\boldsymbol{v}_{N}$, and applying $op_{5}$ once more yields $\boldsymbol{v}_{N}$.

TABLE XII: Logical rules and the corresponding combinations of $OP_{1:M}$ and $\boldsymbol{r}$ in the full logical relation function

| Rule | $op_{1}$ | $op_{2}$ | $op_{3}$ | $op_{4}$ | $op_{5}$ | $\boldsymbol{r}$ |
| --- | --- | --- | --- | --- | --- | --- |
| AND | 0 | 0 | 0 | 0 | 0 | $\boldsymbol{e}\left(0\right)$ |
| OR | 1 | 1 | 1 | 1 | 1 | $\boldsymbol{e}\left(0\right)$ |
| DIFF | 0 | 1 | 0 | 1 | 0 | $\boldsymbol{e}\left(0\right)$ |
| XOR | 0 | 1 | 1 | 0 | 0 | $\boldsymbol{e}\left(0\right)$ |
Several typical logical rules, along with the corresponding $OP_{1:M}$ and $\boldsymbol{r}$ in the full version of the logical relation function, are presented in Table XII. The DIFF logic rule, also known as the Arithmetic − rule for the Position attribute of the RAVEN dataset, specifies that the set of attribute values in the third panel is the difference of the sets of attribute values in the first two panels; likewise, the OR rule corresponds to the Arithmetic + rule for the Position attribute in RAVEN. For the OR, AND, and DIFF rules, $op_{3}$ and $op_{4}$ are equal to $op_{1}$ and $op_{2}$, respectively, so these rules can be represented with the simplified version of $R_{Lgc}$. For the XOR rule, however, $OP_{1:2}$ and $OP_{3:4}$ differ, making it impossible to express with the simplified version. Because the RAVEN dataset involves only the Arithmetic + and Arithmetic − logical rules, this work employs the simplified logical relation function.
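To make Eqs. (29)–(31) concrete, the following self-contained sketch (again our own, under the same binary-vector assumptions as above; the function names are illustrative) verifies for every $OP_{1:M}$ combination in Table XII that the forward function returns $\boldsymbol{r}=\boldsymbol{e}(0)$ and that the inverse function regenerates the third panel:

```python
import numpy as np

rng = np.random.default_rng(7)
D = 1024

E0 = np.zeros(D, dtype=np.uint8)  # e(0): all-zeros BV, identity under binding
E1 = np.ones(D, dtype=np.uint8)   # e(1): all-ones BV; binding with it negates

def bind(a, b):
    """Binding (the circle operator) for BVs: elementwise XOR."""
    return a ^ b

def op(o, v):
    """op_i v_i of Eq. (30): negate v_i iff o == 1."""
    return bind(E1 if o else E0, v)

def lhs(v1, v2, OP):
    """Shared left-hand expression of Eqs. (29) and (31)."""
    o1, o2, o3, o4, _ = OP
    return (op(o1, v1) & op(o2, v2)) | (op(o3, v1) & op(o4, v2))

def R_lgc(v1, v2, v3, OP):
    """Full logical relation function, Eq. (29)."""
    return bind(lhs(v1, v2, OP), op(OP[4], v3))

def R_lgc_inv(v1, v2, OP, r):
    """Inverse relation function, Eq. (31): regenerate the third panel."""
    return op(OP[4], bind(lhs(v1, v2, OP), r))

# OP_{1:5} combinations from Table XII, with the ground-truth set operation.
RULES = {
    "AND":  ((0, 0, 0, 0, 0), lambda a, b: a & b),
    "OR":   ((1, 1, 1, 1, 1), lambda a, b: a | b),
    "DIFF": ((0, 1, 0, 1, 0), lambda a, b: a & (1 - b)),  # difference set
    "XOR":  ((0, 1, 1, 0, 0), lambda a, b: a ^ b),
}

v1 = rng.integers(0, 2, D, dtype=np.uint8)
v2 = rng.integers(0, 2, D, dtype=np.uint8)
for name, (OP, true_rule) in RULES.items():
    v3 = true_rule(v1, v2)                               # panel obeying the rule
    ok_fwd = np.array_equal(R_lgc(v1, v2, v3, OP), E0)   # r equals e(0)
    ok_inv = np.array_equal(R_lgc_inv(v1, v2, OP, E0), v3)
    print(f"{name:4s} forward: {ok_fwd}, inverse: {ok_inv}")
```

Note that AND, OR, and DIFF use identical $OP_{1:2}$ and $OP_{3:4}$, so the two conjunctions in `lhs` coincide for them, whereas XOR genuinely needs both branches; this is exactly why XOR falls outside the simplified ($M=3$) version.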