# REASON: Accelerating Probabilistic Logical Reasoning for Scalable Neuro-Symbolic Intelligence
**Authors**: Zishen Wan, Che-Kai Liu, Jiayi Qian, Hanchen Yang, Arijit Raychowdhury, Tushar Krishna
## Abstract
Neuro-symbolic AI systems integrate neural perception with symbolic and probabilistic reasoning to enable data-efficient, interpretable, and robust intelligence beyond purely neural models. Although this compositional paradigm has shown superior performance in domains such as mathematical reasoning, planning, and verification, its deployment remains challenging due to severe inefficiencies in symbolic and probabilistic inference. Through systematic analysis of representative neuro-symbolic workloads, we identify probabilistic logical reasoning as the inefficiency bottleneck, characterized by irregular control flow, low arithmetic intensity, uncoalesced memory accesses, and poor hardware utilization on CPUs and GPUs.
This paper presents REASON, an integrated acceleration framework for probabilistic logical reasoning in neuro-symbolic AI. At the algorithm level, REASON introduces a unified directed acyclic graph representation that captures common structure across symbolic and probabilistic models, coupled with adaptive pruning and regularization. At the architecture level, REASON features a reconfigurable, tree-based processing fabric optimized for irregular traversal, symbolic deduction, and probabilistic aggregation. At the system level, REASON is tightly integrated with GPU streaming multiprocessors through a programmable interface and multi-level pipeline that efficiently orchestrates neural, symbolic, and probabilistic execution. Evaluated across six neuro-symbolic workloads, REASON achieves 12-50$\times$ speedup and 310-681$\times$ energy efficiency over desktop and edge GPUs in a TSMC 28 nm node. REASON enables real-time probabilistic logical reasoning, completing end-to-end tasks in 0.8 s within 6 mm$^2$ area and 2.12 W power, demonstrating that targeted acceleration of probabilistic logical reasoning is critical for practical and scalable neuro-symbolic AI and positioning REASON as a foundational system architecture for next-generation cognitive intelligence.
## I Introduction
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, image recognition, and complex pattern learning from vast datasets [23, 46, 42, 16]. However, despite their success, LLMs often struggle with factual accuracy, hallucinations, multi-step reasoning, and interpretability [35, 62, 2, 61]. These limitations have spurred the development of compositional AI systems, which integrate neural models with symbolic and probabilistic reasoning to create robust, transparent, and intelligent cognitive systems.
† Corresponding author
One promising compositional paradigm is neuro-symbolic AI, which integrates neural, symbolic, and probabilistic components into a unified cognitive architecture [60, 1, 72, 9, 75]. In this system, the neural module captures the statistical, pattern-matching behavior of learned models, performing rapid function approximation and token prediction for intuitive perception and feature extraction. The symbolic and probabilistic modules perform explicit, verifiable reasoning that is structured, interpretable, and robust under uncertainty, managing logic-based reasoning and probabilistic updates. This paradigm integrates intuitive generalization and deliberate reasoning.
Neuro-symbolic AI has demonstrated superior abstract deduction, complex question answering, mathematical reasoning, logical reasoning, and cognitive robotics [28, 66, 55, 81, 12, 38, 41, 71]. Its ability to learn efficiently from fewer data points, produce transparent and verifiable outputs, and robustly handle uncertainty and ambiguity makes it particularly advantageous compared to purely neural approaches. For example, Meta's LIPS [28] and Google's AlphaGeometry [66] recently leveraged compositional neuro-symbolic approaches to solve complex math problems at the level of human Olympiad gold medalists. R²-Guard [20] leverages LLMs and probabilistic models to improve robust reasoning capability and resilience against jailbreaks. These systems represent a paradigm shift toward AI with robust, verifiable, and explainable reasoning.
Despite impressive algorithmic advances in neuro-symbolic AI, often demonstrated on large-scale distributed GPU clusters, efficient deployment at the edge remains a fundamental challenge. Neuro-symbolic agents, particularly in robotics, planning, interactive cognition, and verification, require real-time logical inference to interact effectively with physical environments and multi-agent systems. For example, Ctrl-G, a text-infilling neuro-symbolic agent [83], must execute hundreds of reasoning steps per second to remain responsive, yet current implementations take over 5 minutes on a desktop GPU to complete a single task. This latency gap makes practical deployment of neuro-symbolic AI systems challenging.
To understand the root causes of this inefficiency, we systematically analyze a diverse set of neuro-symbolic workloads and uncover several system- and architecture-level challenges. Symbolic and probabilistic kernels frequently dominate end-to-end runtime and exhibit highly irregular execution characteristics, including heterogeneous compute patterns and memory-bound behavior with low ALU utilization. These kernels suffer from limited exploitable parallelism and irregular, uncoalesced memory accesses, leading to poor performance and efficiency on CPU and GPU architectures.
To address these challenges, we develop an integrated acceleration framework, REASON, which to the best of our knowledge, is the first to accelerate probabilistic logical reasoning-based neuro-symbolic AI systems. REASON is designed to close the efficiency gap of compositional AI by jointly optimizing algorithms, architecture, and system integration for the irregular and heterogeneous workloads inherent to neuro-symbolic reasoning.
At the algorithm level, REASON introduces a unified directed acyclic graph (DAG) representation that captures shared computational structure across symbolic and probabilistic kernels. An adaptive pruning and regularization technique further reduces model size and computational complexity while preserving task accuracy. At the architecture level, REASON features a flexible design optimized for various irregular symbolic and probabilistic computations, leveraging the unified DAG representation. The architecture comprises reconfigurable tree-based processing elements (PEs), compiler-driven workload mapping, and memory layout to enable highly parallel and energy-efficient symbolic and probabilistic computation. At the system level, REASON is tightly integrated with GPU streaming multiprocessors (SMs), forming a heterogeneous system with a programmable interface and multi-level execution pipeline that efficiently orchestrates neural, symbolic, and probabilistic kernels while maintaining high hardware utilization and scalability as neuro-symbolic models evolve. Notably, unlike conventional tree-like computing arrays optimized primarily for neural workloads, REASON provides reconfigurable support for neural, symbolic, and probabilistic kernels within a unified execution fabric, enabling efficient and scalable neuro-symbolic AI systems.
This paper, therefore, makes the following contributions:
- We conduct a systematic workload characterization of representative logical- and probabilistic-reasoning-based neuro-symbolic AI models, identifying key performance bottlenecks and architectural optimization opportunities (Sec. II, Sec. III).
- We propose REASON, an integrated co-design framework, to efficiently accelerate probabilistic logical reasoning in neuro-symbolic AI, enabling practical and scalable deployment of compositional intelligence (Fig. 4).
- REASON introduces cross-layer innovations spanning (i) a unified DAG representation with adaptive pruning at the algorithm level (Sec. IV), (ii) a reconfigurable symbolic/probabilistic architecture and compiler-driven dataflow and mapping at the hardware level (Sec. V), and (iii) a programmable system interface with a multi-level execution pipeline at the system level (Sec. VI) to improve neuro-symbolic efficiency.
- Evaluated across cognitive tasks, REASON enables flexible support for symbolic and probabilistic operations, achieving 12-50$\times$ speedup and 310-681$\times$ energy efficiency compared to desktop and edge GPUs. REASON enables fast and efficient logical and probabilistic reasoning in 0.8 s per task with 6 mm$^2$ area and 2.12 W power consumption (Sec. VII).
## II Neuro-Symbolic AI Systems
This section presents the preliminaries of neuro-symbolic AI with its algorithm flow (Sec. II-A), scaling performance analysis (Sec. II-B), and key computational primitives (Sec. II-C).
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Neuro-Symbolic AI Architecture and Applications
### Overview
This image is a technical diagram illustrating a hybrid neuro-symbolic artificial intelligence architecture. It visually explains how neural networks (fast thinking) integrate with symbolic reasoning systems (slow, logical, and probabilistic thinking) and provides concrete application examples of this combined approach.
### Components/Axes
The diagram is organized into three main horizontal sections:
1. **Top Section (Architecture Diagram):**
* **Left Block (Pink):** Labeled "Neuro". Contains an icon of a neural network and the text "DNN/LLM (Fast Thinking)".
* **Center Block (Green/Blue):** Labeled "Symbolic". This is a larger container split into two sub-blocks:
* **Top Sub-block (Green):** Labeled "Logical (Slow Thinking)" with a logic gate icon.
* **Bottom Sub-block (Blue):** Labeled "Probabilistic (Bayesian Thinking)" with a dice icon.
* **Right Section (Detailed Models):** This area expands on the "Symbolic" components with two shaded boxes:
* **Top Box (Light Green):** Titled "First-Order Logic (FOL) Boolean Satisfiability (SAT)". Contains a circuit-like diagram with inputs `X₁`, `X₂`, `X₃`, `X₄` connected through logical operators (∧, ∨, ¬) to outputs `Y₁`, `Y₂`.
* **Bottom Box (Light Blue):** Contains two models:
* **Left Model:** Titled "Probabilistic Circuit (PC)". Shows a network diagram with inputs `X₁`, `X₂`, `X₃`, `X₄` connected to nodes labeled `f` and `g` with multiplication (×) and addition (+) operations.
* **Right Model:** Titled "Hidden Markov Model (HMM)". Shows a state transition diagram with observed states `X₁`, `X₂`, `X₃` and hidden states `S₁`, `S₂`, `S₃` connected in a chain.
2. **Bottom Section (Application Examples):**
* **Left Column:** Header "Application Examples".
* **Main Grid:** A 5-row table mapping application domains to a three-stage pipeline. Each row has a domain name followed by three process steps connected by arrows (→). The steps are color-coded to correspond with the architecture above (pink for neural, green for logical, blue for probabilistic).
| Application Domain | Stage 1 (Neural - Pink) | Stage 2 (Logical - Green) | Stage 3 (Probabilistic - Blue) |
|--------------------|-------------------------|---------------------------|--------------------------------|
| Commonsense Reason | feature extraction → | rule logic → | uncertainty infer. |
| Cognitive Robotics | scene graph → | logic-based planning → | uncertainty infer. |
| Medical Diagnosis | feature extraction → | rule reasoning → | likelihood infer. |
| Question Answering | parsing → | symbolic query planning → | missing fact infer. |
| Math Solving | initial sol. gen. → | algebra solver → | uncertainty infer. |
### Detailed Analysis
**Architecture Flow:**
The diagram shows a directional flow from left to right. The "Neuro" (DNN/LLM) component, responsible for "Fast Thinking," feeds into the "Symbolic" component. The Symbolic component is bifurcated into "Logical (Slow Thinking)" and "Probabilistic (Bayesian Thinking)" modules, which interact bidirectionally (indicated by up/down arrows between them). Dashed lines connect these high-level symbolic modules to their specific formal implementations (FOL/SAT, PC, HMM) on the right.
**Application Pipeline Details:**
Each application follows a consistent three-stage pattern:
1. **Stage 1 (Pink - Neural):** Initial data processing or representation.
2. **Stage 2 (Green - Logical):** Structured reasoning or planning.
3. **Stage 3 (Blue - Probabilistic):** Inference under uncertainty.
The specific pipelines are as shown in the table above.
### Key Observations
1. **Clear Conceptual Mapping:** The color coding (pink, green, blue) is consistently applied across the high-level architecture and the application pipelines, creating a strong visual link between theory and practice.
2. **Bidirectional Interaction:** The arrows between the "Logical" and "Probabilistic" blocks indicate that these reasoning modes are not sequential but interactive and mutually informative within the symbolic system.
3. **Formal Grounding:** The diagram explicitly connects abstract concepts ("Logical Thinking") to concrete, established formal methods (First-Order Logic, SAT), providing technical specificity.
4. **Pipeline Consistency:** All five diverse applications (from robotics to math) are shown to follow the same fundamental neuro-symbolic processing paradigm, suggesting the framework's generality.
### Interpretation
This diagram argues for a unified AI architecture that marries the pattern recognition strengths of neural networks ("fast thinking") with the rigorous, explainable reasoning of symbolic systems ("slow thinking"). The "Neuro" component likely handles raw data perception and initial feature extraction, while the "Symbolic" component performs structured reasoning, planning, and inference.
The inclusion of both deterministic (Logical/FOL) and stochastic (Probabilistic/Bayesian) models within the symbolic half acknowledges that real-world reasoning requires handling both strict rules and inherent uncertainty. The application examples demonstrate the practical value of this hybrid approach: it enables systems that can not only perceive and parse the world (via neural nets) but also plan, explain decisions, and reason about missing or uncertain information (via symbolic logic and probability).
The overall message is that moving beyond pure neural or pure symbolic AI towards an integrated neuro-symbolic paradigm is a powerful direction for creating more robust, generalizable, and trustworthy AI systems capable of complex tasks like medical diagnosis and cognitive robotics.
</details>
Figure 1: Neuro-symbolic algorithmic flow and operations. The neural module serves as a perceptual and intuitive engine for representation learning, while the symbolic module performs structured logical reasoning with probabilistic inference. This compositional pipeline enables complex cognitive tasks across diverse scenarios.
### II-A Neuro-Symbolic Cognitive Intelligence
LLMs and DNNs excel at natural language understanding and image recognition. However, they remain prone to factual errors, hallucinations, challenges in complex multi-step reasoning, and vulnerability to out-of-distribution or adversarial inputs. Their black-box nature also impedes interpretability and formal verification, undermining trust in safety-critical domains. These limitations motivate the development of compositional systems that integrate neural models with symbolic and probabilistic reasoning to achieve greater robustness, transparency, and intelligence.
Neuro-symbolic AI represents a paradigm shift toward more integrated and trustworthy intelligence by combining neural, symbolic, and probabilistic techniques. This hybrid approach has shown superior performance in abstract deduction [81, 12], complex question answering [38, 41], and logical reasoning [66, 55]. By learning from limited data and producing transparent, verifiable outputs, neuro-symbolic systems provide a foundation for cognitive intelligence. Fig. 1 presents a unified neuro-symbolic pipeline, illustrating how its components collaborate to solve complex tasks.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Scaling Performance Analysis: Task Accuracy vs. Model Size & Scaling Efficiency Analysis: Task Runtime vs. Complexity
### Overview
The image is a composite figure containing four distinct charts, labeled (a), (b), (c), and (d). The first three charts (a, b, c) are scatter plots analyzing the relationship between **Model Size** and **Task Accuracy (%)** across different task categories. The fourth chart (d) is a line graph comparing the **Task runtime (min)** of two different model types against increasing problem complexity. The overall theme is the scaling behavior of AI models in terms of accuracy and computational efficiency.
### Components/Axes
**Common Elements (Charts a, b, c):**
* **X-Axis:** "Model Size" with categorical markers: `7B`, `8B`, `13B`, `70B`, `GPT`.
* **Y-Axis:** "Task Accuracy (%)" with a linear scale from 0 to 100, marked at intervals of 20.
* **Legends:** Each chart has a legend positioned in the top-left corner, listing specific tasks with unique color and marker shape combinations. The legend distinguishes between two variants for each task, denoted by `(C)` and `(M)`.
* **Data Points:** Each data point is a marker (circle, square, triangle, etc.) representing the accuracy of a specific task variant on a specific model size.
**Chart (d) Specifics:**
* **Title:** "Scaling Efficiency Analysis: Task Runtime vs. Complexity"
* **X-Axis:** "Inter. Math Olympiad reasoning (Year_Problem)" with categorical labels: `P01_P08`, `P06_P12`, `P04_P12`, `P12_P15`, `P20_P16`, `P19_P6`. These likely represent problem sets from different years.
* **Y-Axis:** "Task runtime (min)" with a linear scale from 0 to 30, marked at intervals of 10.
* **Legend:** Positioned in the top-left corner, identifying two data series:
* `Neuro-symbolic models (AlphaGeometry)`: Represented by a blue line with circular markers.
* `RL-based OST reasoning models`: Represented by a gray line with circular markers.
### Detailed Analysis
**Chart (a): Complex Reasoning Tasks**
* **Tasks & Legend:**
* `TextEdit (C)`: Blue circle
* `ACLUTR (C)`: Orange triangle (pointing up)
* `ProofWriter (C)`: Green triangle (pointing down)
* `TextEdit (M)`: Gray square
* `ACLUTR (M)`: Brown diamond
* `ProofWriter (M)`: Light blue circle
* **Trend & Data Points (Approximate):**
* **General Trend:** Accuracy for all tasks generally increases with model size. The `(C)` variants consistently outperform their `(M)` counterparts.
* `TextEdit (C)`: Starts high (~90% at 7B), approaches ~100% at GPT.
* `ACLUTR (C)`: Starts low (~20% at 7B), shows strong improvement to ~70% at GPT.
* `ProofWriter (C)`: Starts very low (~10% at 7B), improves to ~80% at GPT.
* `TextEdit (M)`: Starts around 50% at 7B, improves to ~90% at GPT.
* `ACLUTR (M)`: Starts around 20% at 7B, improves to ~60% at GPT.
* `ProofWriter (M)`: Starts near 0% at 7B, improves to ~40% at GPT.
**Chart (b): Math Reasoning Tasks**
* **Tasks & Legend:**
* `GSM8K (C)`: Blue circle
* `SVAMP (C)`: Orange triangle (pointing up)
* `TabMWP (C)`: Green triangle (pointing down)
* `In-Domain GSM8K (C)`: Gray square
* `In-Domain MATH (C)`: Brown diamond
* **Trend & Data Points (Approximate):**
* **General Trend:** Accuracy improves with model size, but performance is more varied and generally lower than in complex reasoning tasks. `In-Domain MATH (C)` is the most challenging.
* `GSM8K (C)`: Starts ~55% at 7B, improves to ~95% at GPT.
* `SVAMP (C)`: Starts ~60% at 7B, improves to ~90% at GPT.
* `TabMWP (C)`: Starts ~40% at 7B, improves to ~85% at GPT.
* `In-Domain GSM8K (C)`: Starts ~25% at 7B, improves to ~70% at GPT.
* `In-Domain MATH (C)`: Starts ~10% at 7B, improves to ~50% at GPT.
**Chart (c): Question-Answering Tasks**
* **Tasks & Legend:**
* `AmbigNQ (C)`: Blue circle
* `TriviaQA (C)`: Orange triangle (pointing up)
* `HotpotQA (C)`: Green triangle (pointing down)
* `AmbigNQ (M)`: Gray square
* `TriviaQA (M)`: Brown diamond
* `HotpotQA (M)`: Light blue circle
* **Trend & Data Points (Approximate):**
* **General Trend:** Strong positive correlation between model size and accuracy. `(C)` variants again outperform `(M)` variants.
* `AmbigNQ (C)`: Starts ~65% at 7B, improves to ~95% at GPT.
* `TriviaQA (C)`: Starts ~60% at 7B, improves to ~90% at GPT.
* `HotpotQA (C)`: Starts ~55% at 7B, improves to ~85% at GPT.
* `AmbigNQ (M)`: Starts ~55% at 7B, improves to ~80% at GPT.
* `TriviaQA (M)`: Starts ~40% at 7B, improves to ~75% at GPT.
* `HotpotQA (M)`: Starts ~30% at 7B, improves to ~60% at GPT.
**Chart (d): Scaling Efficiency Analysis**
* **Trend & Data Points (Approximate):**
* **Neuro-symbolic models (AlphaGeometry) - Blue Line:** Shows a steep, near-linear increase in runtime with problem complexity. Starts at ~8 min for `P01_P08`, rises to ~13 min for `P19_P6`.
* **RL-based OST reasoning models - Gray Line:** Shows a more gradual, slightly super-linear increase. Starts at ~12 min for `P01_P08`, rises to ~28 min for `P19_P6`.
* **Key Observation:** The RL-based models have a higher initial runtime but scale worse (steeper slope) than the neuro-symbolic models as problem complexity increases. The lines cross between `P04_P12` and `P12_P15`, after which the neuro-symbolic models become more efficient.
### Key Observations
1. **Consistent Scaling:** Across all task types (Complex Reasoning, Math, QA), increasing model size from 7B to GPT leads to significant accuracy gains.
2. **Task Difficulty Hierarchy:** Within each chart, certain tasks are consistently harder. For example, `ProofWriter (M)` in (a), `In-Domain MATH (C)` in (b), and `HotpotQA (M)` in (c) show the lowest accuracies.
3. **Variant Performance Gap:** The `(C)` variant of each task consistently achieves higher accuracy than the `(M)` variant across all model sizes, suggesting a fundamental difference in difficulty or evaluation setup.
4. **Efficiency Trade-off (Chart d):** There is a clear trade-off between model architecture and scaling efficiency. While RL-based models may have higher base runtime, neuro-symbolic models (AlphaGeometry) demonstrate superior scaling characteristics for this specific reasoning domain, becoming more efficient on more complex problems.
### Interpretation
The data presents a multi-faceted view of AI scaling. Charts (a-c) demonstrate the **"scaling law"** phenomenon: larger models are more capable, as measured by task accuracy. However, the gains are not uniform; they depend heavily on the specific task and its variant. The persistent gap between `(C)` and `(M)` variants suggests that model improvements alone may not close performance gaps on inherently more difficult problem formulations.
Chart (d) shifts the focus from capability (accuracy) to **efficiency (runtime)**. It reveals that scaling behavior is not monolithic. Different architectural paradigms (neuro-symbolic vs. reinforcement learning-based) exhibit fundamentally different computational cost profiles as problem complexity grows. The crossover point indicates that the "best" model depends on the operational context, specifically the expected complexity of the problems to be solved. For simpler problems, one architecture may be preferable, while for highly complex Olympiad-level reasoning, the other becomes more efficient.
In summary, the image argues that evaluating AI systems requires looking beyond a single metric. True understanding comes from analyzing **capability scaling** (accuracy vs. size) alongside **efficiency scaling** (runtime vs. complexity), and doing so across a diverse set of tasks that probe different facets of intelligence.
</details>
Figure 2: Scaling performance and efficiency. (a)-(c) Task accuracy of compositional LLM-symbolic models (C) and monolithic LLMs (M - shown in gray) across model sizes on complex reasoning, mathematical reasoning, and question-answering tasks. (d) Runtime efficiency comparison between LLM-symbolic models and RL-based CoT models on mathematical reasoning tasks [76].
Neural module. The neural module serves as the perceptual and intuitive engine, typically DNN or LLM, excelling at processing high-dimensional sensory inputs (e.g., images, audio, text) and converting them into feature representations. It handles perception, feature extraction, and associative learning, providing the abstractions needed for higher-level cognition.
Symbolic module. The symbolic module is the logical core operating on neural abstractions and includes symbolic and probabilistic operations. Logical components apply formal logic, rules, and ontologies for structured reasoning and planning, enabling logically sound solutions. Probabilistic components manage uncertainty by representing knowledge probabilistically, supporting belief updates and decision-making under ambiguity, reflecting a nuanced reasoning model.
Together, these modules form a complementary reasoning hierarchy. The neural module captures the statistical, pattern-matching behavior of learned models, producing rapid but non-verifiable outputs (Fast Thinking), while the symbolic module performs explicit, verifiable reasoning that is structured and reliable (Slow Thinking). The probabilistic module complements both and enables robust planning under ambiguity (Bayesian Thinking). This framework integrates intuitive generalization with deliberate reasoning.
### II-B Scaling Performance Analysis of Neuro-Symbolic Systems
Scaling performance analysis. Neuro-symbolic AI systems exhibit superior reasoning capability and scaling behavior compared to monolithic LLMs on complex tasks. We compare representative neuro-symbolic systems against monolithic LLMs across complex reasoning, mathematical reasoning, and question-answering benchmarks (Fig. 2 (a)-(c)). The results reveal two advantages. First, higher accuracy: compositional neuro-symbolic models consistently outperform monolithic LLMs of comparable size. Second, improved scaling efficiency: smaller neuro-symbolic models are sufficient to match or exceed the performance of significantly larger closed-source LLMs. Together, these results highlight the potential scaling limitations of monolithic LLMs and the efficiency benefits of compositional neuro-symbolic reasoning.
Comparison with RL-based reasoning models. Beyond monolithic LLMs, recent advancements in reinforcement learning (RL) and chain-of-thought (CoT) prompting improve LLM reasoning accuracy but incur significant computational and scalability overheads (Fig. 2 (d)). First, computational cost: RL-based reasoning often requires hundreds to thousands of LLM queries per decision step, resulting in prohibitively high inference latency and energy consumption. Second, scalability: task-specific fine-tuning constrains generality, whereas neuro-symbolic models use symbolic and probabilistic reasoning modules or tools without retraining. Fig. 2 (d) reveals that the neuro-symbolic model AlphaGeometry [66] achieves over $2\times$ efficiency gains and superior data efficiency compared to CoT-based LLMs on mathematical reasoning tasks.
### II-C Computational Primitives in Neuro-Symbolic AI
We identify the core computational primitives that are commonly used in neuro-symbolic AI systems (Fig. 1). While neural modules rely on DNNs or LLMs for perception and representation learning, the symbolic and probabilistic components implement structured reasoning. In particular, logical reasoning is typically realized through First-Order Logic (FOL) and Boolean Satisfiability (SAT), probabilistic reasoning through Probabilistic Circuits (PCs), and sequential reasoning through Hidden Markov Models (HMMs). Together, these primitives form the algorithmic foundation of neuro-symbolic systems that integrate learning, logic, and uncertainty-aware inference.
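To make the probabilistic primitive concrete, a PC is a DAG of sum and product nodes over leaf indicators, evaluated in one bottom-up pass. The following is a minimal illustrative sketch with a hypothetical two-variable mixture circuit; the node encoding and the example weights are our own illustration, not a circuit from any of the evaluated workloads:

```python
def eval_pc(nodes: list, evidence: dict[str, int]) -> float:
    """Evaluate a probabilistic circuit bottom-up given complete evidence.
    `nodes` is topologically ordered; each node is one of
      ("leaf", var, val)              -> indicator [evidence[var] == val]
      ("prod", child_indices)         -> product of child values
      ("sum", child_indices, weights) -> weighted sum of child values
    Returns the value of the root (the last node)."""
    values = []
    for node in nodes:
        kind = node[0]
        if kind == "leaf":
            _, var, val = node
            values.append(1.0 if evidence[var] == val else 0.0)
        elif kind == "prod":
            v = 1.0
            for c in node[1]:
                v *= values[c]
            values.append(v)
        else:  # "sum"
            values.append(sum(w * values[c] for c, w in zip(node[1], node[2])))
    return values[-1]

# Hypothetical mixture over two binary variables:
# p(X1, X2) = 0.6 * [X1=1][X2=1] + 0.4 * [X1=0][X2=1]
circuit = [
    ("leaf", "X1", 1),            # node 0
    ("leaf", "X1", 0),            # node 1
    ("leaf", "X2", 1),            # node 2
    ("prod", [0, 2]),             # node 3
    ("prod", [1, 2]),             # node 4
    ("sum", [3, 4], [0.6, 0.4]),  # node 5 (root)
]
p = eval_pc(circuit, {"X1": 1, "X2": 1})  # -> 0.6
```

The single linear pass over a topologically ordered node list is what makes PC inference tractable, and it is exactly the kind of DAG traversal that a unified graph representation can target.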
First-Order Logic (FOL) and Boolean Satisfiability (SAT). FOL provides a formal symbolic language for representing structured knowledge using predicates, functions, constants, variables, and quantifiers ($\forall,\exists$), combined with logical connectives. For instance, the statement "every student has a mentor" can be expressed as $\forall x\bigl(\mathrm{Student}(x)\to\exists y\,(\mathrm{Mentor}(y)\wedge\mathrm{hasMentor}(x,y))\bigr)$, where predicates encode properties and relations over domain elements. FOL semantics map symbols to domain objects and relations, enabling precise and interpretable logical reasoning. SAT operates over propositional logic and asks whether a conjunctive normal form (CNF) formula $\varphi=\bigwedge_{i=1}^{m}\Bigl(\bigvee_{j=1}^{k_{i}}l_{ij}\Bigr)$ admits a satisfying assignment, where each literal $l_{ij}$ is a Boolean variable or its negation. Modern SAT solvers extend the DPLL algorithm with conflict-driven clause learning (CDCL), incorporating non-chronological backtracking and clause learning to improve scalability [40, 33]. Cube-and-conquer further parallelizes search by splitting into "cube" DPLL subproblems and concurrent CDCL "conquer" solvers [13, 67]. Together, FOL's expressive representations and SAT's solving mechanisms form the logic backbone of neuro-symbolic systems, enabling exact logical inference alongside neural learning.
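The DPLL core that CDCL solvers extend fits in a few lines. The sketch below is a minimal illustrative solver (recursive simplification, unit propagation, and branching, with no clause learning or heuristics); it is not the optimized procedure used by production solvers:

```python
from typing import Optional

def dpll(clauses: list[frozenset[int]],
         assignment: dict[int, bool]) -> Optional[dict[int, bool]]:
    """Minimal DPLL over CNF. Literals are nonzero ints; -v negates variable v.
    Returns a satisfying (partial) assignment, or None if unsatisfiable."""
    simplified = []
    for clause in clauses:
        # Drop clauses already satisfied by the current assignment.
        if any(assignment.get(abs(l)) == (l > 0) for l in clause):
            continue
        # Remove falsified literals; an emptied clause is a conflict.
        remaining = frozenset(l for l in clause if abs(l) not in assignment)
        if not remaining:
            return None
        simplified.append(remaining)
    if not simplified:
        return assignment  # every clause satisfied
    # Unit propagation: a single-literal clause forces its assignment.
    for clause in simplified:
        if len(clause) == 1:
            (lit,) = clause
            return dpll(simplified, {**assignment, abs(lit): lit > 0})
    # Branch on an unassigned variable from the first open clause.
    lit = next(iter(simplified[0]))
    for value in (True, False):
        result = dpll(simplified, {**assignment, abs(lit): value})
        if result is not None:
            return result
    return None

# (x1 OR x2) AND (NOT x1 OR x3) AND (NOT x2 OR NOT x3)
phi = [frozenset({1, 2}), frozenset({-1, 3}), frozenset({-2, -3})]
model = dpll(phi, {})
```

The irregular, data-dependent recursion and backtracking visible even in this toy version are precisely the control-flow patterns that map poorly onto GPU SIMT execution.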
| Representative Neuro-Symbolic Workloads | AlphaGeometry [66] | R²-Guard [20] | GeLaTo [82] | Ctrl-G [83] | NeuroPC [6] | LINC [52] |
| --- | --- | --- | --- | --- | --- | --- |
| Application | Math theorem proving & reasoning | Unsafety detection | Constrained text generation | Interactive text editing, text infilling | Classification | Logical reasoning, deductive reasoning |
| Advantage vs. LLM | Higher deductive reasoning, higher generalization | Higher LLM resilience, higher data efficiency, effective adaptability | Guaranteed constraint satisfaction, higher generalization | Guaranteed constraint satisfaction, higher generalization | Enhanced interpretability, theoretical guarantee | Higher precision, reduced overconfidence, higher scalability |
| Neural kernel | LLM | LLM | LLM | LLM | DNN | LLM |
| Symbolic kernel | First-order logic, SAT solver, acyclic graph | First-order logic, probabilistic circuit, Hidden Markov model | First-order logic, SAT solver, Hidden Markov model | Hidden Markov model, probabilistic circuits | First-order logic, probabilistic circuit | First-order logic, solver |
TABLE I: Representative neuro-symbolic workloads. Selected neuro-symbolic workloads used in our analysis, spanning diverse application domains, deployment scenarios, and neural-symbolic computation patterns.
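Several workloads in Table I rely on Hidden Markov models for sequential reasoning. The forward algorithm, which computes the likelihood of an observation sequence, captures the recurrence these kernels evaluate; the sketch below uses hypothetical two-state parameters of our own choosing:

```python
def hmm_forward(pi: list[float], A: list[list[float]],
                B: list[list[float]], obs: list[int]) -> float:
    """Forward algorithm: P(observation sequence) under an HMM.
    pi[i]: initial prob of hidden state i; A[i][j]: transition i -> j;
    B[i][o]: emission prob of symbol o from state i."""
    n = len(pi)
    # Initialization: alpha_1(i) = pi_i * B_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Recursion: alpha_{t+1}(j) = (sum_i alpha_t(i) * A_ij) * B_j(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    # Termination: sum over final hidden states.
    return sum(alpha)

# Hypothetical two-state, two-symbol HMM.
pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
prob = hmm_forward(pi, A, B, [0, 1, 0])
```

Each step is a small matrix-vector product followed by an element-wise scaling, i.e., a chain-structured DAG of sums and products, which is why HMM inference fits naturally alongside PC evaluation in a unified graph representation.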
Figure 3: End-to-end neuro-symbolic workload characterization. (a) Benchmarking six neuro-symbolic workloads (AlphaGeometry, R²-Guard, GeLaTo, Ctrl-G, NeuroPC, LINC) on a CPU+GPU system, showing that symbolic and probabilistic kernels may serve as system bottlenecks. (b) Benchmarking neuro-symbolic workloads on tasks of different scales, indicating that real-time performance cannot be satisfied and pointing to potential efficiency issues. (c) Benchmarking on A6000 and Orin GPUs. (d) Roofline analysis, indicating severe memory-boundedness of symbolic and probabilistic kernels.
Probabilistic Circuits (PC). PCs represent tractable probabilistic models over variables $\mathbf{X}$ as directed acyclic graphs [30, 22, 32]. Each node $n$ performs a probabilistic computation: leaf nodes specify primitive distributions $f_{n}(x)$ , while interior nodes combine their children $ch(n)$ via
$$
p_{n}(x)=\begin{cases}f_{n}(x),&\text{if }n\text{ is a leaf node}\\
\prod_{c\in\mathrm{ch}(n)}p_{c}(x),&\text{if }n\text{ is a product node}\\
\sum_{c\in\mathrm{ch}(n)}\theta_{n,c}p_{c}(x),&\text{if }n\text{ is a sum node}\end{cases} \tag{1}
$$
where $\theta_{n,c}$ denotes the non-negative weight associated with child $c$. This recursive structure guarantees exact inference (e.g., marginals, conditionals) in time linear in circuit size. PCs' combination of expressiveness and tractable computation makes them an ideal probabilistic backbone for neuro-symbolic systems, where neural modules learn circuit parameters while symbolic engines perform probabilistic reasoning.
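The recursion in Eq. 1 can be sketched directly as a bottom-up circuit evaluation. The dict-based node layout and names below are illustrative, not REASON's actual representation:

```python
# Minimal sketch of bottom-up probabilistic-circuit evaluation (Eq. 1).
# Node encoding is illustrative: leaves hold f_n(x); interior nodes are
# product (factorization) or weighted sum (mixture) over their children.

def eval_pc(node, x):
    """Recursively evaluate p_n(x) for a PC node."""
    kind = node["type"]
    if kind == "leaf":
        return node["f"](x)                      # primitive distribution f_n(x)
    child_vals = [eval_pc(c, x) for c in node["children"]]
    if kind == "product":
        out = 1.0
        for v in child_vals:                     # product over ch(n)
            out *= v
        return out
    if kind == "sum":                            # weighted sum over ch(n)
        return sum(w * v for w, v in zip(node["weights"], child_vals))
    raise ValueError(f"unknown node type: {kind}")

# Tiny circuit: a mixture of two Bernoulli leaves over a binary variable x.
leaf_a = {"type": "leaf", "f": lambda x: 0.9 if x == 1 else 0.1}
leaf_b = {"type": "leaf", "f": lambda x: 0.2 if x == 1 else 0.8}
root = {"type": "sum", "weights": [0.6, 0.4], "children": [leaf_a, leaf_b]}
p = eval_pc(root, 1)   # 0.6*0.9 + 0.4*0.2 = 0.62
```

Because each node is visited once per query (with memoization for shared sub-circuits), the evaluation cost is linear in circuit size, matching the tractability claim above.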
Hidden Markov Model (HMM). HMMs are probabilistic models for sequential data [43], where a system evolves through hidden states governed by the first-order Markov property: the state at time step $t$ depends only on the state at time step $t-1$. Each hidden state emits observations according to a probability distribution. The joint distribution over a sequence of hidden states $z_{1:T}$ and observations $x_{1:T}$ is given by
$$
p(z_{1:T},x_{1:T})=p(z_{1})p(x_{1}\mid z_{1})\prod_{t=2}^{T}p(z_{t}\mid z_{t-1})p(x_{t}\mid z_{t}) \tag{2}
$$
where $p(z_{1})$ is the initial state distribution, $p(z_{t}\mid z_{t-1})$ the transition probability, and $p(x_{t}\mid z_{t})$ the emission probability. HMMs naturally support sequential inference tasks such as filtering, smoothing, and decoding, enabling temporal reasoning in neuro-symbolic pipelines.
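As a concrete instance of such sequential inference, the standard forward recursion evaluates Eq. 2 marginalized over hidden states, yielding $p(x_{1:T})$. The matrices below are toy values for illustration:

```python
# Forward algorithm: computes p(x_{1:T}) by summing Eq. 2 over all hidden
# state sequences in O(T * n^2) time instead of enumerating n^T paths.

def hmm_likelihood(pi, A, B, obs):
    """pi[i]=p(z_1=i); A[i][j]=p(z_t=j | z_{t-1}=i); B[i][o]=p(x_t=o | z_t=i)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]       # t = 1
    for x in obs[1:]:                                      # t = 2..T
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][x]
                 for j in range(n)]
    return sum(alpha)

# Toy two-state HMM with binary observations.
pi = [0.5, 0.5]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
p = hmm_likelihood(pi, A, B, [0, 1])   # p(x_1=0, x_2=1)
```

The same recursion pattern (a sweep of multiply-accumulate messages along the chain) underlies filtering, smoothing, and Viterbi decoding, which is why HMMs map onto the DAG traversal abstraction introduced later.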
## III Neuro-Symbolic Workload Characterization
This section characterizes the system behavior of various neuro-symbolic workloads (Sec. III-A - III-B) and provides workload insights for computer architects (Sec. III-C - III-D).
Profiling workloads. To conduct comprehensive profiling analysis, we select six state-of-the-art representative neuro-symbolic workloads, as listed in Tab. I, covering a diverse range of applications and underlying computational patterns.
Profiling setup. We profile and analyze the selected neuro-symbolic models in terms of runtime, memory, and compute operators, using cProfile for latency measurement and NVIDIA Nsight for kernel-level profiling and analysis. Experiments are conducted on a system with an NVIDIA A6000 GPU, Intel Sapphire Rapids CPUs, and DDR5 DRAM. Our software environment includes PyTorch 2.5 and JAX 0.4.6. We also conduct profiling on Jetson Orin [49] for edge deployment scenarios. We track control and data flow by analyzing the profiling results in trace-view and graph-execution formats.
### III-A Compute Latency Analysis
Figure 4: REASON overview. REASON is an integrated acceleration framework for probabilistic logical reasoning in neuro-symbolic AI, with the goal of achieving efficient and scalable agentic cognition. REASON addresses the challenges of irregular compute and memory access, symbolic and probabilistic latency bottlenecks, and hardware underutilization by proposing methodologies including a unified DAG representation, reconfigurable PEs, efficient dataflow and mapping, a scalable architecture, two-level parallelism, and a programming interface. REASON is deployed across cognitive tasks and consistently demonstrates performance and efficiency improvements for compositional neuro-symbolic systems.
Latency bottleneck. We characterize the latency of representative neuro-symbolic workloads (Fig. 3 (a)). Compared to neuro kernels, symbolic and probabilistic kernels are not negligible in latency and may become system bottlenecks. For instance, the neural (symbolic) components account for 36.2% (63.8%), 37.3% (62.7%), 63.4% (36.6%), 36.1% (63.9%), 49.5% (50.5%), and 65.2% (34.8%) of runtime in AlphaGeometry, R²-Guard, GeLaTo, Ctrl-G, NeuroPC, and LINC, respectively. Symbolic kernels dominate AlphaGeometry's runtime, and probabilistic kernels dominate R²-Guard's and Ctrl-G's, due to highly irregular memory access, warp divergence, thread underutilization, and limited execution parallelism. FLOPS and latency measurements further highlight this inefficiency. Notably, when using a smaller LLM (LLaMA-7B) for GeLaTo and LINC, overall accuracy remains stable, but the symbolic latency share rises to 69.0% and 65.5%, respectively. We observe consistent trends on the Orin NX-based platform (Fig. 3 (c)). Symbolic components account for 63.8% of AlphaGeometry runtime on the A6000 while contributing only 19.3% of its FLOPS, indicating inefficient hardware utilization.
Latency scalability. We evaluate runtime across reasoning tasks of varying difficulty and scale (Fig. 3 (b)). We observe that the relative runtime distribution between neural and symbolic components of a single workload remains consistent across task sizes, while total runtime increases with task complexity and scale. LLM kernels scale efficiently due to their tensor-based, GPU-friendly inference, but logical and probabilistic kernels scale poorly due to the exponential growth of the search space, making them slower compared to monolithic LLMs.
### III-B Roofline & Symbolic Operation & Inefficiency Analysis
Memory-bounded operation. Fig. 3 (d) presents a roofline analysis of GPU memory bandwidth versus compute efficiency. We observe that the symbolic and probabilistic components are typically memory-bound, limiting performance efficiency. For example, R²-Guard's probabilistic circuits use sparse, scattered accesses for marginalization, and Ctrl-G's HMM iteratively reads and writes state probabilities. Low compute per element makes these workloads constrained by memory access, underutilizing GPU compute resources.
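The roofline argument reduces to one formula: attainable performance is the minimum of peak compute and bandwidth times operational intensity. A back-of-envelope sketch, with illustrative (not measured) peak numbers standing in for an A6000-class GPU:

```python
# Roofline model: a kernel with operational intensity OI (FLOP/byte) on a
# machine with memory bandwidth BW (GB/s) and peak compute PEAK (TFLOP/s)
# attains at most min(PEAK, BW * OI). Kernels left of the ridge point
# (OI < PEAK / BW) are memory-bound.

def attainable_tflops(oi_flops_per_byte, peak_tflops, bw_gbs):
    mem_roof = bw_gbs * oi_flops_per_byte / 1000.0   # GB/s * FLOP/B -> TFLOP/s
    return min(peak_tflops, mem_roof)

PEAK, BW = 38.7, 768.0   # hypothetical GPU: TFLOP/s peak, GB/s bandwidth

dense_matmul = attainable_tflops(60.0, PEAK, BW)   # high OI: hits compute roof
pc_marginal = attainable_tflops(1.0, PEAK, BW)     # sparse aggregation: ~0.77
```

Under these assumed peaks, a probabilistic kernel at ~1 FLOP/byte is capped at under 2% of peak compute regardless of scheduling, which is consistent with the low-intensity cluster of symbolic points in Fig. 3 (d).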
TABLE II: Hardware inefficiency analysis. The compute, memory, and communication characteristics of representative neural, symbolic, and probabilistic kernels executed on CPU/GPU platform.
| | Neural Kernel | | Symbolic Kernel | | Probabilistic Kernel | |
| --- | --- | --- | --- | --- | --- | --- |
| | MatMul | Softmax | Sparse MatVec | Logic | Marginal | Bayesian |
| Compute Efficiency | | | | | | |
| Compute Throughput (%) | 96.8 | 62.2 | 32.5 | 14.7 | 35.0 | 31.1 |
| ALU Utilization (%) | 98.4 | 72.0 | 43.9 | 29.3 | 48.5 | 52.8 |
| Memory Behavior | | | | | | |
| L1 Cache Throughput (%) | 82.4 | 58.0 | 27.1 | 20.6 | 32.4 | 37.1 |
| L2 Cache Throughput (%) | 41.7 | 27.6 | 18.3 | 12.4 | 24.2 | 27.5 |
| L1 Cache Hit Rate (%) | 88.5 | 85.0 | 53.6 | 37.0 | 42.4 | 40.7 |
| L2 Cache Hit Rate (%) | 73.4 | 66.7 | 43.9 | 32.7 | 50.2 | 47.6 |
| DRAM BW Utilization (%) | 39.8 | 28.6 | 57.4 | 70.3 | 60.8 | 68.0 |
| Control Divergence and Scheduling | | | | | | |
| Warp Execution Efficiency (%) | 96.3 | 94.1 | 48.8 | 54.0 | 59.3 | 50.6 |
| Branch Efficiency (%) | 98.0 | 98.7 | 60.0 | 58.1 | 63.4 | 66.9 |
| Eligible Warps/Cycle (%) | 7.2 | 7.0 | 2.4 | 2.1 | 2.8 | 2.5 |
Hardware inefficiency analysis. We leverage Nsight Systems and Nsight Compute [51, 50] to analyze the computational, memory, and control irregularity of neural, symbolic, and probabilistic kernels, as listed in Tab. II. We observe that: First, compute throughput and ALU utilization: neural kernels achieve high throughput and ALU utilization, while symbolic/probabilistic kernels have low throughput and idle ALUs. Second, memory access and cache utilization: neural kernels see high L1 cache hit rates; symbolic kernels incur cache misses and stalls, and probabilistic kernels face high memory pressure. Third, DRAM bandwidth (BW) utilization and data movement overhead: neural workloads use on-chip caches with minimal DRAM usage, but symbolic/probabilistic workloads are DRAM-bound with heavy random-access overhead.
Sparsity analysis. We observe high, heterogeneous, irregular, and data-dependent sparsity across neuro-symbolic workloads. Symbolic and probabilistic kernels are often extremely sparse, exhibiting on average 82%, 87%, 75%, 83%, 89%, and 83% sparsity across the six representative neuro-symbolic workloads, respectively, with many sparse computational paths carrying low activation or probability mass. This observation motivates our adaptive DAG pruning (Sec. IV-B).
### III-C Unique Characteristics of Neuro-Symbolic vs LLMs
In summary, neuro-symbolic workloads exhibit distinct characteristics compared to monolithic LLMs in compute kernels, memory behavior, dataflow, and performance scaling.
Compute kernels. LLMs are dominated by regular, highly parallel tensor operations well suited to GPUs. In contrast, neuro-symbolic workloads comprise heterogeneous symbolic and probabilistic kernels with irregular control flow, low arithmetic intensity, and poor cache locality, leading to low GPU utilization and frequent performance bottlenecks.
Memory behavior. Symbolic and probabilistic kernels are primarily memory-bound, operating over large, sparse, and irregular data structures. Probabilistic reasoning further increases memory pressure through large intermediate state caching, creating challenging trade-offs between latency, bandwidth, and on-chip storage.
Dataflow and parallelism. Neuro-symbolic workloads exhibit dynamic and tightly coupled data dependencies. Symbolic and probabilistic computations often depend on neural outputs or require compilation into LLM-compatible structures, resulting in serialized execution, limited parallelism, and amplified end-to-end latency.
Performance scaling. LLMs scale efficiently across GPUs via optimized data and model parallelism. In contrast, symbolic workloads are difficult to parallelize due to recursive control dependencies, while probabilistic kernels incur substantial inter-node communication, limiting scalability on multi-GPU systems.
### III-D Identified Opportunities for Neuro-Symbolic Optimization
While neuro-symbolic systems show promise, improving their efficiency is critical for real-time and scalable deployment. Guided by the profiling insights above, we introduce REASON (Fig. 4), an algorithm-hardware co-design framework for accelerated probabilistic logical reasoning in neuro-symbolic AI. Algorithmically, a unified representation with adaptive pruning reduces memory footprint (Sec. IV). In hardware architecture, a flexible architecture and dataflow support various symbolic and probabilistic operations (Sec. V). REASON further provides adaptive scheduling and orchestration of heterogeneous LLM-symbolic agentic workloads through a programmable interface (Sec. VI). Across reasoning tasks, REASON consistently boosts performance, efficiency, and accuracy (Sec. VII).
## IV REASON: Algorithm Optimizations
This section introduces the algorithmic optimizations in REASON for symbolic and probabilistic reasoning kernels. We present a unified DAG-based computational representation (Sec. IV-A), followed by adaptive pruning (Sec. IV-B) and regularization techniques (Sec. IV-C) that jointly reduce model complexity and enable efficient neuro-symbolic systems.
### IV-A Stage 1: DAG Representation Unification
Motivation. Despite addressing different reasoning goals, symbolic and probabilistic reasoning kernels often share common underlying computational patterns. For instance, logical deduction in FOL, constraint propagation in SAT, and marginal inference in PCs all rely on iterative graph-based computations. Capturing this shared structure is essential to system acceleration. DAGs provide a natural abstraction to unify these diverse kernels under a flexible computational model.
| Kernel | DAG Nodes | DAG Edges | DAG Traversal |
| --- | --- | --- | --- |
| SAT/FOL | Literals and logical operators | Logical dependencies between literals, clauses, and formulas | Search and deduction via traversal (DPLL/CDCL) |
| PC | Primitive distributions, sum and product nodes | Weighted dependencies encoding probabilistic factorization | Bottom-up probability aggregation and top-down flow propagation |
| HMM | Hidden state variables at each time step | State transition and emission dependencies | Sequential message passing (forward-backward, decoding) |
Figure 5: Unified DAG representations of neuro-symbolic kernels. Logical (SAT/FOL), probabilistic (PC), and sequential (HMM) reasoning are expressed using DAG abstraction. Nodes represent atomic reasoning operations, edges encode dependency structure, and graph traversals implement inference procedures. This unification enables shared compilation, pruning, and hardware mapping in REASON.
Methodology. We unify symbolic and probabilistic reasoning kernels under a DAG abstraction, where each node represents an atomic reasoning operation and each directed edge encodes a data/control dependency (Fig. 5). This representation enables a uniform compilation flow (construction, transformation, and scheduling) across heterogeneous kernels (logical deduction, constraint solving, probabilistic aggregation, and sequential message passing), and serves as the algorithmic substrate for subsequent pruning and regularization.
#### For FOL and SAT solvers
DAG nodes represent variables and logical connectives, with edges indicating dependencies between literals and clauses. We represent a propositional CNF formula $\varphi=\bigwedge_{i=1}^{m}\Bigl(\bigvee_{j=1}^{k_{i}}l_{ij}\Bigr)$ as a DAG with three layers: literal nodes for each literal $l_{ij}$ , clause nodes implementing disjunction over literals in $\bigvee_{j}l_{ij}$ , and formula nodes implementing conjunction over clauses $\bigwedge_{i}$ . In SAT, the DAG captures the branching and conflict resolution structures in DPLL/CDCL procedures. In FOL, formulas are encoded as DAGs where inference rules act as graph transformation operators that derive contradictions through node and edge expansion. The compiler converts FOL and SAT inputs (clauses in CNF or quantifier-free predicates) into DAGs via: Step-1 Normalization: predicates are transformed to CNF, removing quantifiers and forming disjunctions of literals. Step-2 Node creation: each literal becomes a leaf node, each clause an OR node over its literals, and the formula an AND node over clauses. Step-3 Edge encoding: edges capture dependencies (literal $\rightarrow$ clause $\rightarrow$ formula), while watch-lists are stored as metadata.
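For illustration, the three-layer construction in Steps 1-3 can be sketched in a few lines of Python. The DIMACS-style signed-integer literal encoding and the node layout are our own simplifications, not the REASON compiler's internal format.

```python
# Sketch of Steps 1-3 (illustrative, not the REASON compiler): a CNF formula
# in DIMACS-style signed-integer form is compiled into the three-layer
# literal -> clause -> formula DAG and evaluated bottom-up.

def cnf_to_dag(clauses):
    """clauses: list of lists of signed ints (-2 means NOT x2).
    Returns (literal layer, OR layer, AND root)."""
    literal_nodes = sorted({lit for clause in clauses for lit in clause})
    clause_nodes = [("OR", clause) for clause in clauses]       # disjunctions
    formula_node = ("AND", list(range(len(clauses))))           # conjunction root
    return literal_nodes, clause_nodes, formula_node

def eval_dag(dag, assignment):
    """Bottom-up traversal: literals, then clauses, then the formula root."""
    _, clause_nodes, (_, children) = dag
    lit_val = lambda lit: assignment[abs(lit)] == (lit > 0)
    clause_vals = [any(lit_val(l) for l in lits) for _, lits in clause_nodes]
    return all(clause_vals[c] for c in children)

# (x1 OR NOT x2) AND (x2 OR x3)
dag = cnf_to_dag([[1, -2], [2, 3]])
print(eval_dag(dag, {1: True, 2: True, 3: False}))  # → True
```

Evaluating the root AND node under an assignment reproduces satisfiability checking of that assignment by bottom-up traversal.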
#### For PCs
DAG nodes correspond to sum (mixture) or product (factorization) operations $p_{n}(x)$ over input $x$ (an assignment to variables $\mathbf{X}$ ), with children $\mathrm{ch}(n)$ . Leaves represent primitive distributions $f_{n}(x)$ . Edges model conditional dependencies. The DAG structure facilitates efficient inference through bottom-up probability evaluation, exploiting structural independence and enabling effective pruning and memoization during probability queries (Eq. 1). The compiler converts PCs into DAGs through: Step-1 Graph extraction: nodes represent random variables, factors, or sub-circuits parsed from expressions such as $p_{n}(x)$ . Step-2 Node typing: arithmetic operators map to sum nodes for marginalization and product nodes for factor conjunction, while leaf nodes store constants or probabilities.
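As a sketch of this bottom-up evaluation (the node names, dictionary encoding, and toy circuit are our own, not the paper's IR), children appear before parents so a single insertion-order sweep computes every node value:

```python
# Sketch of bottom-up PC evaluation over a tiny sum-product DAG
# (illustrative encoding; children precede parents in the dict).

def eval_pc(nodes, x):
    """nodes: name -> ('leaf', f) | ('prod', [children]) | ('sum', [(w, child)])."""
    val = {}
    for name, (kind, arg) in nodes.items():
        if kind == "leaf":
            val[name] = arg(x)                            # primitive f_n(x)
        elif kind == "prod":
            v = 1.0
            for c in arg:                                 # product of children
                v *= val[c]
            val[name] = v
        else:
            val[name] = sum(w * val[c] for w, c in arg)   # weighted mixture
    return val

pc = {
    "x1":  ("leaf", lambda x: 0.8 if x[0] else 0.2),
    "x2":  ("leaf", lambda x: 0.3 if x[1] else 0.7),
    "nx2": ("leaf", lambda x: 0.7 if x[1] else 0.3),
    "p1":  ("prod", ["x1", "x2"]),
    "p2":  ("prod", ["x1", "nx2"]),
    "root": ("sum", [(0.6, "p1"), (0.4, "p2")]),
}
print(eval_pc(pc, (True, True))["root"])  # 0.6*0.24 + 0.4*0.56 = 0.368
```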
#### For HMMs
The unrolled DAG spans time steps, with nodes representing transition factors $p(z_{t}|z_{t-1})$ and emission factors $p(x_{t}|z_{t})$ (Eq. 2), and edges connecting factors across adjacent time steps to reflect Markov dependency. Sequential inference (filtering/smoothing/decoding) becomes structured message passing on this DAG: each step aggregates contributions from predecessor states through transition factors and then applies emission factors. The compiler converts HMMs into DAGs through: Step-1 Sequence unroll: Each time step becomes a DAG layer, representing states and transitions. Step-2 Node mapping: Product nodes combine transition and emission probabilities; sum nodes aggregate over prior states.
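The message-passing view above corresponds to the standard forward recursion, $\alpha_t(z) = p(x_t|z)\sum_{z'}\alpha_{t-1}(z')\,p(z|z')$; a minimal sketch on an illustrative 2-state HMM (parameters invented for the example):

```python
# Forward recursion as layered message passing on the unrolled HMM DAG
# (2-state model with invented parameters, for illustration only).

def forward(pi, A, B, obs):
    """pi[z]: initial, A[zp][z]: transition, B[z][x]: emission.
    Returns the sequence likelihood p(x_1:T)."""
    n = len(pi)
    alpha = [pi[z] * B[z][obs[0]] for z in range(n)]       # first DAG layer
    for x in obs[1:]:
        # sum nodes aggregate over prior states; product nodes apply
        # transition and emission factors
        alpha = [B[z][x] * sum(alpha[zp] * A[zp][z] for zp in range(n))
                 for z in range(n)]
    return sum(alpha)

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(forward(pi, A, B, [0, 1, 0]))  # → 0.10893
```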
The unified DAG abstraction lays the algorithmic foundation for subsequent pruning, regularization, and hardware mapping, supporting efficient acceleration of neuro-symbolic workloads.
Figure 6: Overview of the REASON hardware acceleration system. (a) Integration of REASON as a GPU co-processor. (b) REASON plug-in architecture with PEs, shared local memory, and global scheduling. (c) Tree-based PE architecture enabling broadcast, reduction, and irregular DAG execution. (d) Micro-architecture of a tree node supporting arithmetic and logical operations. (e) FIFO and memory layout supporting symbolic reasoning.
### IV-B Stage 2: Adaptive DAG Pruning
Motivation. While the unified DAG representation provides a common abstraction, it may contain significant redundancy, such as logically implied literals, inactive substructures, or low-probability paths, which inflates DAG size and degrades performance without improving inference quality.
Methodology. We propose adaptive DAG pruning, a semantics-preserving optimization that identifies and removes redundant paths in symbolic and probabilistic DAGs. For symbolic kernels, pruning targets literals and clauses that are logically redundant. For probabilistic kernels, pruning eliminates low-activation edges that minimally impact inference. This process significantly reduces model size and computational complexity while preserving correctness of logical and probabilistic inference.
#### Pruning of FOL and SAT via implication graph
For SAT solvers and FOL reasoning, we prune redundant literals using implication graphs. Given a CNF formula $\varphi=\bigwedge_{i}\left(\bigvee_{j}l_{ij}\right)$ , each binary clause $(l\lor l^{\prime})$ induces two directed implication edges: $\bar{l}\rightarrow l^{\prime}$ and $\bar{l^{\prime}}\rightarrow l$ . The resulting implication graph captures logical dependencies among literals. We perform a depth-first traversal to compute reachability relationships between literals. If a literal $l^{\prime}$ always implies another literal $l$ in the same clause (i.e., $l$ is reachable from $l^{\prime}$ in the implication graph), then $l^{\prime}$ is a hidden literal: clauses containing both $l$ and $l^{\prime}$ can safely drop $l^{\prime}$ , since any assignment satisfying $l^{\prime}$ also satisfies $l$ , reducing clause width without semantic change. For instance, a clause $C=(l\lor l^{\prime})$ is reduced to $C^{\prime}=(l)$ . This procedure removes redundant literals (e.g., hidden tautologies and failed literals), preserves satisfiability, and runs in time linear in the size of the implication graph.
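A minimal Python sketch of this procedure, with signed integers as literals. The edge construction and DFS reachability follow the description above; the pruning loop is a simplification (it re-runs DFS per literal pair and assumes no two clause literals are mutually implied, where a production implementation would precompute reachability and handle equivalent literals):

```python
# Sketch of implication-graph pruning (signed-int literals; DFS per literal
# pair for clarity -- a real implementation would precompute reachability).

def implication_graph(clauses):
    g = {}
    for c in clauses:
        if len(c) == 2:              # each binary clause (l OR l') yields
            l, lp = c                # edges  ~l -> l'  and  ~l' -> l
            g.setdefault(-l, set()).add(lp)
            g.setdefault(-lp, set()).add(l)
    return g

def reaches(g, src, dst):
    """Depth-first reachability src -> dst in the implication graph."""
    stack, seen = [src], set()
    while stack:
        n = stack.pop()
        if n == dst:
            return True
        if n not in seen:
            seen.add(n)
            stack.extend(g.get(n, ()))
    return False

def prune_hidden(clauses):
    """Drop literal l from a clause when l implies another literal of the
    same clause, so l's contribution is already covered."""
    g = implication_graph(clauses)
    return [[l for l in c
             if not any(l != lp and reaches(g, l, lp) for lp in c)]
            for c in clauses]

# (~x1 OR x2) AND (x1 OR x2): both clauses reduce to (x2)
print(prune_hidden([[-1, 2], [1, 2]]))  # → [[2], [2]]
```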
#### Pruning of PCs and HMMs via circuit flow
For probabilistic DAGs such as PCs and HMMs, we prune edges based on probability flow, which quantifies each edge's contribution to the overall likelihood.
In HMMs, the DAG is unrolled over time steps, with nodes representing transition factors $p(z_{t}\mid z_{t-1})$ and emission factors $p(x_{t}\mid z_{t})$ . We compute expected transition and emission usage via the forward-backward algorithm, yielding posterior state and transition probabilities. Edges corresponding to transitions or emissions with consistently low posterior probability are pruned, as their contribution to the joint likelihood $p(z_{1:T},x_{1:T})$ is negligible. This pruning preserves inference fidelity while reducing state-transition complexity.
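This posterior-usage criterion can be sketched with a plain forward-backward pass; accumulating $\xi[i][j] = \sum_t P(z_t{=}i, z_{t+1}{=}j \mid x)$ gives the expected usage of each transition edge. The 2-state parameters and the threshold interface are illustrative, not from the paper:

```python
# Sketch of posterior-usage pruning for HMM transition edges: forward-backward
# yields xi[i][j] = sum_t P(z_t=i, z_{t+1}=j | x); low-usage edges are flagged.

def fb_transition_usage(pi, A, B, obs):
    n, T = len(pi), len(obs)
    alpha = [[0.0] * n for _ in range(T)]
    for z in range(n):
        alpha[0][z] = pi[z] * B[z][obs[0]]
    for t in range(1, T):                              # forward pass
        for z in range(n):
            alpha[t][z] = B[z][obs[t]] * sum(
                alpha[t - 1][zp] * A[zp][z] for zp in range(n))
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):                     # backward pass
        for z in range(n):
            beta[t][z] = sum(A[z][zn] * B[zn][obs[t + 1]] * beta[t + 1][zn]
                             for zn in range(n))
    like = sum(alpha[T - 1])                           # p(x_1:T)
    xi = [[0.0] * n for _ in range(n)]                 # expected edge usage
    for t in range(T - 1):
        for i in range(n):
            for j in range(n):
                xi[i][j] += (alpha[t][i] * A[i][j] * B[j][obs[t + 1]]
                             * beta[t + 1][j]) / like
    return xi

def prunable_edges(xi, thresh):
    """Transition edges whose expected posterior usage falls below thresh."""
    return [(i, j) for i, row in enumerate(xi)
            for j, u in enumerate(row) if u < thresh]
```

As a sanity check, the entries of `xi` sum to $T-1$: each time slice contributes exactly one unit of posterior mass.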
In PCs, sum node $n$ computes $p_{n}(x)=\sum_{c\in\mathrm{ch}(n)}\theta_{n,c}\,p_{c}(x)$ , where $\theta_{n,c}\geq 0$ denotes the weight associated with child $c$ . For an input $x$ , we define the circuit flow through edge $(n,c)$ as $F_{n,c}(x)=\frac{\theta_{n,c}\,p_{c}(x)}{p_{n}(x)}\cdot F_{n}(x)$ , where $F_{n}(x)$ denotes the top-down flow reaching node $n$ . Intuitively, $F_{n,c}(x)$ measures the fraction of probability mass passing through edge $(n,c)$ for input $x$ . Given a dataset $\mathcal{D}$ , the cumulative flow for edge $(n,c)$ is $F_{n,c}(\mathcal{D})=\sum_{x\in\mathcal{D}}F_{n,c}(x)$ . Edges with the smallest cumulative flow are pruned, as they contribute least to the overall model likelihood. The resulting decrease in average log-likelihood is bounded by $\Delta\log\mathcal{L}\leq\frac{1}{|\mathcal{D}|}\sum_{x\in\mathcal{D}}F_{n,c}(x)$ , providing a theoretically grounded criterion for safe pruning.
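The two-pass flow computation (bottom-up values $p_n(x)$, then top-down flows with $F_{\text{root}}=1$) can be sketched as follows; the toy circuit and node encoding are our own:

```python
# Two-pass circuit-flow computation on a toy circuit: bottom-up values p_n(x),
# then top-down flows with F_root = 1; sum edges split flow in proportion to
# theta * p_c / p_n, product nodes pass it through unchanged.

def flows(nodes, order, x):
    p = {}
    for name in order:                                  # bottom-up values
        kind, arg = nodes[name]
        if kind == "leaf":
            p[name] = arg(x)
        elif kind == "prod":
            v = 1.0
            for c in arg:
                v *= p[c]
            p[name] = v
        else:
            p[name] = sum(w * p[c] for w, c in arg)
    F = {name: 0.0 for name in order}
    F[order[-1]] = 1.0                                  # root flow
    edge_flow = {}
    for name in reversed(order):                        # top-down flows
        kind, arg = nodes[name]
        if kind == "sum":
            for w, c in arg:
                f = (w * p[c] / p[name]) * F[name]      # F_{n,c}(x)
                edge_flow[(name, c)] = f
                F[c] += f
        elif kind == "prod":
            for c in arg:
                F[c] += F[name]
    return p, edge_flow

pc = {
    "x1":  ("leaf", lambda x: 0.8 if x[0] else 0.2),
    "x2":  ("leaf", lambda x: 0.3 if x[1] else 0.7),
    "nx2": ("leaf", lambda x: 0.7 if x[1] else 0.3),
    "p1":  ("prod", ["x1", "x2"]),
    "p2":  ("prod", ["x1", "nx2"]),
    "root": ("sum", [(0.6, "p1"), (0.4, "p2")]),
}
order = ["x1", "x2", "nx2", "p1", "p2", "root"]
p, ef = flows(pc, order, (True, True))
```

Accumulating `edge_flow` over a dataset yields $F_{n,c}(\mathcal{D})$; the sum edge with the smallest cumulative flow is the pruning candidate.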
### IV-C Stage 3: Two-Input DAG Regularization
Methodology. After pruning, the resulting DAGs may still have high fan-in or irregular branching, which hinders efficient hardware execution. To address this, we apply a regularization step that transforms DAGs into a canonical two-input form. Specifically, nodes with more than two inputs are recursively decomposed into balanced binary trees composed of two-input intermediate nodes, preserving the original computation semantics. This normalization promotes uniformity, enabling efficient parallel scheduling, pipelining, and mapping onto the REASON architecture without sacrificing model fidelity or expressive power.
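A minimal sketch of the balanced binary decomposition (for an associative operator; the tuple node encoding is illustrative):

```python
# Sketch of two-input regularization: an n-ary node over an associative
# operator is recursively split into a balanced tree of two-input nodes.

def binarize(op, children):
    """Nested (op, left, right) tuples over `children`; a single child is
    returned as-is, so only two-input internal nodes are created."""
    if len(children) == 1:
        return children[0]
    mid = len(children) // 2
    return (op, binarize(op, children[:mid]), binarize(op, children[mid:]))

def depth(t):
    return 1 + max(depth(t[1]), depth(t[2])) if isinstance(t, tuple) else 0

def leaves(t):
    return leaves(t[1]) + leaves(t[2]) if isinstance(t, tuple) else [t]

# A 5-input AND node becomes a depth-3 (= ceil(log2 5)) binary tree,
# preserving the original operand order.
tree = binarize("AND", ["c1", "c2", "c3", "c4", "c5"])
print(depth(tree), leaves(tree))
```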
For each symbolic or probabilistic kernel, the compiler generates an initial DAG, applies adaptive pruning, and then performs two-input regularization to produce a unified balanced representation. These DAGs are constructed offline and used to generate an execution binary that is programmed onto REASON hardware. This unification-pruning-regularization flow decouples algorithmic complexity from runtime execution and enables predictable performance.
## V REASON: Hardware Architecture
REASON features a flexible co-processor plug-in architecture (Sec. V-A), reconfigurable symbolic/probabilistic PEs (Sec. V-B), and flexible support for symbolic and probabilistic kernels (Secs. V-C and V-D). Sec. V-E presents a cycle-by-cycle execution pipeline analysis, and Sec. V-F discusses design space exploration and scalability.
### V-A Overall Architecture
Neuro-symbolic workloads exhibit heterogeneous compute and memory patterns with diverse sparsity, diverging from the GEMM-centric design of conventional hardware. Built on the unified DAG representation and optimizations (Sec. IV), REASON is a reconfigurable and flexible architecture designed to efficiently execute the irregular computations of symbolic and probabilistic reasoning stages in neuro-symbolic AI.
Overview. REASON operates as a programmable co-processor tightly integrated with GPU SMs, forming a heterogeneous system architecture (Fig. 6 (a)). In this system, REASON serves as an efficient and reconfigurable "slow thinking" engine, accelerating symbolic and probabilistic kernels that are poorly suited to GPU execution. As illustrated in Fig. 6 (b), REASON comprises an array of tree-based PE cores that act as the primary computation engines. A global controller and workload scheduler manage workload mapping. A shared local memory serves as a unified scratchpad for all cores. Communication between cores and shared memory is handled by a high-bandwidth global interconnect.
Tree architecture. Each PE core is organized as a tree-structured compute engine, as shown in Fig. 6 (c). Each tree node integrates a specialized datapath, memory subsystem, and control logic optimized for executing DAG-based symbolic and probabilistic operations.
Reconfigurable tree engine (RTE). At the core of each PE is a Reconfigurable Tree Engine (RTE), whose datapath forms a bidirectional tree of compute nodes (Fig. 6 (d)). The RTE supports both SAT-style symbolic broadcast patterns and probabilistic aggregation operations. A Benes network interconnect enables N-to-N routing, decoupling SRAM banking from DAG mapping and simplifying compilation of irregular graph structures (Sec. V-C). Forwarding logic routes intermediate and irregular outputs back to SRAM for subsequent batches.
Memory subsystem. To tackle the memory-bound nature of symbolic and probabilistic kernels, the RTE is backed by a set of dual-port, wide-bitline SRAM banks arranged as a banked L1 cache. A memory front-end with a prefetcher and high-throughput DMA engine moves data from shared scratchpad. A control/memory management unit (MMU) block handles address translation across the distributed memory system.
Core control and execution. A scalar PE acts as the core-level controller, fetching and issuing VLIW-like instructions that configure the RTE, memory subsystem, and specialized units. Outputs from the RTE are buffered before being consumed by the scalar PE or the SIMD Unit, which provides support for executing the parallelizable subset of symbolic solvers.
### V-B Reconfigurable Symbolic/Probabilistic PE
The PE architecture is designed to support a wide range of symbolic and probabilistic computation patterns via a VLIW-driven cycle-reconfigurable datapath. Each PE can switch among three operational modes to efficiently execute heterogeneous kernels mapped from the unified DAG representation.
Probabilistic mode. In probabilistic mode, the node executes irregular DAGs derived from unified probabilistic representations (Sec. V-C). The nodes are programmed by the VLIW instruction stream to perform arithmetic operations, either addition or multiplication, required by the DAG node mapped onto them. This mode supports probabilistic aggregation patterns such as sum-product computation and likelihood propagation, enabling efficient execution of PCs and HMMs.
Figure 7: Compiler-architecture co-design for probabilistic execution. A probabilistic DAG is decomposed, regularized, mapped onto tree-based PEs, and scheduled with pipeline awareness to enable efficient execution of irregular probabilistic kernels in REASON.
Symbolic mode. In symbolic mode, the datapath is repurposed for logical reasoning operations (Sec. V-D). Key hardware components are utilized as follows: (a) The comparator checks logical states for Boolean Constraint Propagation (BCP), identifying literals as TRUE, FALSE, or UNASSIGNED. (b) The adder performs two key functions: address computation by adding the Clause Base Address and Literal Index to locate the next literal in a clause; and clause evaluation by acting as a counter to track the number of FALSE literals. This enables fast detection of unit clauses and conflicts, accelerating SAT-style symbolic reasoning.
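A software analogue of this counter-based clause evaluation: the hardware tracks FALSE counts incrementally, while this sketch recomputes the clause status from scratch for clarity.

```python
# Counter-based clause evaluation during BCP (software analogue): a clause
# with all literals FALSE is a conflict; exactly one UNASSIGNED literal and
# the rest FALSE makes it a unit clause that forces that literal TRUE.

def clause_status(clause, assignment):
    """clause: signed-int literals; assignment: var -> bool (absent = UNASSIGNED).
    Returns SAT / CONFLICT / UNIT (with the forced literal) / UNRESOLVED."""
    false_cnt, unassigned = 0, []
    for lit in clause:
        var = abs(lit)
        if var not in assignment:
            unassigned.append(lit)
        elif assignment[var] == (lit > 0):
            return ("SAT", None)                 # a TRUE literal satisfies it
        else:
            false_cnt += 1                       # the counter the adder tracks
    if false_cnt == len(clause):
        return ("CONFLICT", None)                # all literals FALSE
    if len(unassigned) == 1:
        return ("UNIT", unassigned[0])           # propagate: force this literal
    return ("UNRESOLVED", None)

print(clause_status([1, -2, 3], {1: False, 2: True}))  # → ('UNIT', 3)
```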
SpMSpM mode. The tree-structured PE inherently supports sparse matrix-matrix multiplication (SpMSpM), a computation pattern widely studied in prior works [24, 45]. In this mode, the leaf nodes are configured as multipliers to compute partial products of the input matrix elements, while the internal nodes are configured as adders to perform hierarchical reductions. This execution pattern allows small-scale neural or neural-symbolic models to be efficiently mapped onto the REASON engine, extending its applicability beyond purely symbolic and probabilistic kernels.
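The leaf-multiplier/internal-adder mapping amounts to a pairwise reduction over matched nonzeros; a sketch computing one output element with dictionary-encoded sparse vectors (our own encoding, not the hardware's format):

```python
# One output element of SpMSpM in the tree mapping: leaf "multipliers" form
# partial products over matching nonzeros, internal "adders" reduce pairwise.

def tree_reduce(vals):
    """Pairwise (adder-tree-shaped) summation of the partial products."""
    if not vals:
        return 0.0
    while len(vals) > 1:
        vals = [vals[i] + vals[i + 1] if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
    return vals[0]

def spmspm_entry(row, col):
    """row of A and column of B as {index: value}; returns (A @ B)[i, j]."""
    partials = [v * col[k] for k, v in row.items() if k in col]  # leaf products
    return tree_reduce(partials)

print(spmspm_entry({0: 2.0, 3: 1.0, 5: 4.0}, {0: 0.5, 5: 2.0, 7: 9.0}))  # → 9.0
```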
### V-C Architectural Support for Probabilistic Reasoning
Probabilistic reasoning kernels are expressed as DAGs composed of arithmetic nodes (sum and product) connected by data-dependent edges. REASON exploits its pipelined, tree-structured datapath to efficiently map these DAGs onto parallel PEs. Key architectural features include: multi-tree PE mapping for arithmetic DAG execution, a banked register file with flexible crossbar interconnect to support irregular memory access, and compiler-assisted pipeline scheduling with automatic register management to reduce control overhead. Fig. 7 illustrates the overall workflow.
Datapath and pipelined execution. The datapath operates in a pipelined fashion, with each layer of nodes serving as pipeline stages. Each pipeline stage receives inputs from a banked register file, which consists of multiple parallel register banks. Each bank operates independently, providing flexible addressing that accommodates the irregular memory access patterns typical in probabilistic workloads (e.g., PCs, HMMs).
Flexible interconnect. To handle the irregularity in probabilistic DAGs, REASON employs an optimized interconnect. An input Benes crossbar connects the register file banks to inputs of PE trees, allowing flexible and conflict-free routing of operands into computation units. Output connections from PE to register banks are structured as one-bank-one-PE to minimize hardware complexity while preserving flexibility, balancing trade-offs between utilization and performance.
Figure 8: Scalability analysis of interconnect topologies. (a) Normalized latency breakdown as the number of leaf nodes $N$ increases. (b) Normalized broadcast-to-root cycle counts for different PE interconnect structures.
Register management. REASON adopts an automatic write-address generation policy. Data is written to the lowest available register address in each bank, eliminating the need to encode explicit write addresses in instructions. The compiler precisely predicts these write addresses at compile time due to the deterministic execution sequence, further reducing instruction size and energy overhead.
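Because the execution sequence is deterministic, the compiler can replay the allocation policy offline and recover every write address without encoding it in the instruction stream. The following Python sketch illustrates this idea; the names `RegBank` and `predict_write_addrs` are ours for illustration, not REASON's actual compiler code:

```python
class RegBank:
    """Model one register bank with lowest-free-address allocation."""
    def __init__(self, num_regs):
        self.free = sorted(range(num_regs))  # free addresses, ascending

    def alloc(self):
        # Writes always land at the lowest available address, so no
        # explicit write address needs to be encoded in the instruction.
        return self.free.pop(0)

    def release(self, addr):
        self.free.append(addr)
        self.free.sort()

def predict_write_addrs(bank, num_writes):
    """Replay the deterministic schedule 'at compile time' to recover
    the address each write will land at."""
    return [bank.alloc() for _ in range(num_writes)]
```

Running the same replay at compile time and in hardware yields identical addresses, which is what lets the instruction encoding omit them.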
Compiler-driven optimization. To efficiently translate unified DAGs into executable kernels and map them onto the hardware datapath, REASON adopts a four-step compiler pipeline (Fig. 7).
Step-1 Block decomposition: The compiler decomposes the unified DAG from Sec. IV into execution blocks through a greedy search that identifies schedulable subgraphs whose maximum depth does not exceed the hardware tree depth. This process maximizes PE utilization while minimizing inter-block dependencies that may cause read-after-write stalls. The resulting tree-like blocks form the basis for efficient mapping.
Step-2 PE mapping: For each block, the compiler jointly assigns nodes to PEs and operands to register banks, considering topological constraints and datapath connectivity. Nodes are mapped to preserve order, while operands are allocated to banks to avoid simultaneous conflicts. The compiler dynamically updates feasible mappings and prioritizes nodes with the fewest valid options. This conflict-aware strategy minimizes bank contention and balances data traffic across banks.
Step-3 Tree mapping: Once block and register mappings are fixed, the compiler constructs physical compute trees that maximize data reuse in the REASON datapath. Node fusion and selective replication enhance operand locality and reduce inter-block communication, allowing intermediate results to be consumed within the datapath and lowering memory traffic.
Step-4 Reordering: The compiler then schedules instructions with awareness of the multi-stage pipeline. Dependent operations are spaced by at least one full pipeline interval, while independent ones are interleaved. Lightweight load, store, and copy operations fill idle cycles without disturbing dependencies. Live-range analysis identifies register pressure and inserts minimal spill and reload instructions when needed.
The DAG-to-hardware mapping is an automated heuristic process that generates a compact VLIW program for REASON. Designers can interact with it for design-space exploration, tuning architectural parameters within the flexible hardware template.
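As one concrete illustration of Step-1, the depth-bounded decomposition can be approximated by slicing nodes into windows of at most the hardware tree depth. This is a simplification of the paper's greedy subgraph search, written as a minimal Python sketch with hypothetical names:

```python
from collections import defaultdict

def dag_depths(edges, nodes):
    """Longest-path depth of each node (edges: parent -> children).
    Assumes 'nodes' is given in topological order."""
    depth = {n: 0 for n in nodes}
    for n in nodes:
        for child in edges.get(n, []):
            depth[child] = max(depth[child], depth[n] + 1)
    return depth

def decompose(edges, nodes, tree_depth):
    """Layer slicing: nodes whose depth falls in the same window of
    'tree_depth' layers share an execution block, so no block exceeds
    the hardware tree depth."""
    depth = dag_depths(edges, nodes)
    blocks = defaultdict(list)
    for n in nodes:
        blocks[depth[n] // tree_depth].append(n)
    return [blocks[k] for k in sorted(blocks)]
```

The real compiler additionally minimizes inter-block dependencies to avoid read-after-write stalls; this sketch only enforces the depth bound.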
### V-D Architectural Support for Symbolic Logical Reasoning
To efficiently support symbolic logical reasoning kernels, REASON features a linked-list-based memory layout and a hardware-managed BCP FIFO mechanism (Fig. 6 (e)), enabling efficient and scalable support for the large-scale solver kernels fundamental to logical reasoning.
Watched literals (WLs) unit. The WLs unit acts as a distributed memory controller tightly integrated with $N$ SRAM banks, implementing the two-watched-literals indexing scheme in hardware. This design transforms the primary bottleneck in BCP from a sequential scan over the clause database into a selective parallel memory access problem. Crucially, it enables scalability to industrial-scale SAT problems [44], where only a small subset of clauses (those on the watch list) need to be accessed at any time. This design naturally aligns with a hierarchical memory system, allowing most clauses to reside in remote scratchpad memory or DRAM, with on-chip SRAM caching only the required clauses indexed by the WLs unit.
Figure 9: Two-level execution pipeline for symbolic reasoning. Top: task-level overlap between GPU neural execution and REASON symbolic execution. Bottom: detailed cycle-by-cycle timeline of CDCL SAT solving, illustrating pipelined broadcast/reduction, WLs traversal, latency hiding, and priority-based conflict handling. Color represents the causality of hardware events.
SRAM layout. The local SRAM is partitioned to support a linked-list-based organization of watch lists. A dedicated region stores a head pointer table indexed by literal IDs, each pointing to the start of a watch list, enabling $\mathcal{O}(1)$ access. The main data region stores clauses, each augmented with a next-watch pointer that links to other clauses watching the same literal, forming multiple linked lists within the linear address space. Upon a variable assignment, the WLs unit uses the literal ID to fetch the head pointer and traverses the list by following next-watch pointers, dispatching only the relevant clause to PEs. This hardware-managed indexing eliminates full-database scans and maps efficiently to the adder datapath.
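The head-pointer-plus-next-watch layout can be emulated in software. Below is a minimal Python model (class and method names are ours, not REASON's) showing the $\mathcal{O}(1)$ head access and a traversal that touches only clauses watching the given literal:

```python
NULL = -1

class WatchLists:
    """Flat-memory model of the linked-list watch-list layout: a head
    pointer table indexed by literal ID, and clause records that each
    carry a next-watch pointer to the next clause watching the same
    literal."""
    def __init__(self, num_literals):
        self.head = [NULL] * num_literals  # head pointer table
        self.mem = []                      # (clause, next_watch) records

    def watch(self, lit, clause):
        # Prepend at the list head: O(1) insertion.
        self.mem.append((clause, self.head[lit]))
        self.head[lit] = len(self.mem) - 1

    def clauses_watching(self, lit):
        # Follow next-watch pointers; never scan the full database.
        out, p = [], self.head[lit]
        while p != NULL:
            clause, p = self.mem[p]
            out.append(clause)
        return out
```

On a variable assignment, the hardware performs exactly this traversal, dispatching each visited clause to the PEs.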
BCP FIFO. The BCP FIFO sits atop the M:1 output interconnect (Fig. 6 (c)) and serializes multiple parallel implications generated by the leaf tree-node in a single cycle. While many implications can be discovered concurrently, BCP must propagate them sequentially to preserve the causality chain for conflict analysis. The controller immediately broadcasts one implication back into the pipeline, while the rest are queued in the FIFO and processed in a pipelined manner. Within a symbolic (DPLL) tree node, implications are causally independent and can be pipelined, but between assignments, sequential ordering is enforced to maintain correctness. Sec. V-E illustrates a detailed cycle-level execution example.
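The serialization behavior can be sketched as a queue that accepts several implications per cycle but emits one broadcast per cycle. This is an illustrative Python model of the ordering policy only, not of the hardware's timing:

```python
from collections import deque

def serialize_implications(cycles):
    """Each entry of 'cycles' is the list of implications discovered in
    parallel that cycle.  Only one implication is broadcast per cycle;
    the rest queue in the FIFO and drain in order, preserving the
    causality chain needed for conflict analysis."""
    fifo, order = deque(), []
    for discovered in cycles:
        fifo.extend(discovered)
        if fifo:
            order.append(fifo.popleft())  # broadcast exactly one
    while fifo:                           # drain after discovery stops
        order.append(fifo.popleft())
    return order
```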
Scalability advantages. A key advantage of the REASON architecture is that its tree-based inter-node topology does not become a bottleneck as the symbolic DPLL tree grows (Fig. 8 (a)). In contrast, all-to-one (or one-to-all) bus interconnects often fail to scale due to post-layout electrical constraints, including high fan-out and buffer insertion for hold-time fixes. Moreover, given that broadcasting is a dominant operation, the root-to-leaf traversal latency is critical. REASON's tree-based inter-node topology achieves exceptional scalability with an $\mathcal{O}(\log N)$ traversal latency, compared to $\mathcal{O}(\sqrt{N})$ for mesh-based designs and $\mathcal{O}(N)$ for bus-based interconnects (Fig. 8 (b)). This property enables robust scalability for large symbolic reasoning workloads.
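The asymptotic gap is easy to make concrete with a toy hop-count model. This is an illustration of the stated complexities only, not a timing model of the actual interconnect:

```python
import math

def traversal_hops(n, topology):
    """Broadcast-to-root hop counts for the three interconnect styles
    compared in Fig. 8: tree O(log N), mesh O(sqrt N), bus O(N)."""
    if topology == 'tree':
        return math.ceil(math.log2(n))
    if topology == 'mesh':
        return math.ceil(math.sqrt(n))
    if topology == 'bus':
        return n
    raise ValueError(topology)
```

For N = 64 leaf nodes this already gives 6 hops for the tree versus 8 for a mesh and 64 for a bus, and the gap widens with N.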
Listing 1: C++ Programming Interface of REASON
```cpp
// Trigger symbolic execution for a single inference
void REASON_execute(
    int         batch_id,        // batch identifier
    int         batch_size,      // number of objects in the batch
    const void *neural_buffer,   // neural results in shared memory
    const void *reasoning_mode,  // mode selection
    void       *symbolic_buffer  // write-back symbolic results
);

// Query current REASON status for a given object
int REASON_check_status(
    int  batch_id,  // batch identifier
    bool blocking   // wait till REASON is idle
);
```
### V-E Case Study: A Working Example of Symbolic Execution
Fig. 9 illustrates the dynamic, pipelined per-node execution of REASON during a cube-and-conquer SAT solving phase, which highlights several key hardware mechanisms, including inter-node pipelined broadcast/reduction, latency hiding via parallel WLs traversal, and priority-based conflict handling.
Execution begins with the controller issuing a Decision to assign $x_{1}$, which is broadcast through the distribution tree (T1–T4). At T5, leaf nodes concurrently discover two implications: $x_{2}=1$ and $x_{3}=0$. These implications are returned to the controller via the reduction tree in a pipelined manner, where $x_{2}=1$ arrives first, followed by $x_{3}=0$ at T10. Since the FIFO is occupied, $x_{3}=0$ is queued into the BCP FIFO at T11.
At T15, the FIFO pops a subsequent implication ($x_{12}$), which triggers a WLs lookup. A local SRAM miss prompts the L2/DMA to begin fetching the clause; meanwhile, the BCP FIFO continues servicing queued implications: $x_{99}$ is popped and broadcast during T17–T19 while the DMA fetch is still in progress.
At T22, the propagation of $x_{99}$ results in a Conflict, which immediately propagates up the reduction tree. Upon receiving the conflict signal at T23, the controller asserts priority control: it halts the ongoing DMA fetch, flushes the FIFO, and discards all pending implications (including $x_{3}$ = $0$ ) from the now-invalid search path. The cube-and-conquer phase terminates, and the parallelized DPLL node is forwarded to the scalar PE for CDCL conflict analysis, as discussed in Sec. II-C.
### V-F Design Space Exploration and Scalability
Design space exploration. To identify the optimal configuration of the REASON architecture, we perform a comprehensive design space exploration, systematically varying key architectural parameters: the depth of the tree (D), the number of parallel register banks (B), and the number of registers per bank (R). We evaluate each configuration on latency, energy consumption, and energy-delay product (EDP) across representative workloads. The selected configuration (D=3, B=64, R=32) offers the optimal trade-off between performance and energy efficiency.
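The sweep itself is a straightforward minimization of EDP over the (D, B, R) grid. The sketch below is illustrative: the cost models are caller-supplied placeholders (e.g., fit from simulation traces), not REASON's actual latency or energy models:

```python
from itertools import product

def best_config(latency_fn, energy_fn, depths, banks, regs):
    """Exhaustive sweep over (D, B, R) configurations, selecting the
    one with minimum energy-delay product (EDP = energy x delay)."""
    def edp(cfg):
        return latency_fn(*cfg) * energy_fn(*cfg)
    return min(product(depths, banks, regs), key=edp)
```

With latency and energy models plugged in, the same routine ranks configurations by latency or energy alone by swapping the key function.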
Scalability and variability support. Coupled with the reconfigurable array, pipelined scheduling, and memory layout optimizations, REASON provides flexible hardware support across symbolic and probabilistic kernels (e.g., SAT, FOL, PC, HMM), neuro-symbolic workloads, and cognitive tasks, enabling efficient neuro-symbolic processing at scale (Sec. VII).
Design choice discussion. We adopt a unified architecture for symbolic and probabilistic reasoning to maximize flexibility and efficiency, rather than decoupling them into separate engines. We identify that these kernels share common DAG patterns, enabling REASON to execute them on Tree-PEs through a unified representation. This approach achieves $>$ 90% utilization with 58% lower area/power than decoupled designs, while maintaining tight symbolic-probabilistic coupling. A flexible Benes network and compiler co-design handle routing and memory scheduling, ensuring efficient execution.
## VI REASON: System Integration and Pipeline
This section presents the system-level integration of REASON. We first present the integration principles and workload partitioning strategy between GPU and REASON (Sec. VI-A), then introduce the programming model that enables flexible invocation and synchronization (Sec. VI-B). Finally, we describe the two-level execution pipeline (Sec. VI-C).
### VI-A Integration with GPUs for End-to-End Reasoning
Integration principles. As illustrated in Fig. 6 (a), REASON is integrated as a co-processor within the GPU system to support efficient end-to-end symbolic and probabilistic reasoning. This integration follows two principles: (1) versatility, to ensure compatibility with a broad range of logical and probabilistic reasoning workloads in neuro-symbolic pipelines, and (2) efficiency, to achieve low-latency execution for real-time reasoning. These principles necessitate careful workload assignment between the GPU and REASON with pipelined execution.
Workload partitioning. To maximize performance while preserving compatibility with existing and emerging LLM-based neuro-symbolic agentic pipelines, we assign neural computation (e.g., LLM inference) to the GPU and offload symbolic reasoning and probabilistic inference to REASON. This partitioning exploits the GPU's high throughput and programmability for neural kernels, while leveraging REASON's reconfigurable architecture optimized for logical and probabilistic operations. It also minimizes data movement and enables pipelined execution: while REASON processes symbolic reasoning for the current batch, the GPU concurrently executes neural computation for the next batch.
### VI-B Programming Model
REASON's programming model (Listing 1) is designed to offer full flexibility and control, making it easy to utilize REASON for accelerating various neuro-symbolic applications. It exposes two core functions, for execution and status checking, enabling acceleration of logical and probabilistic kernels.
REASON_execute processes a single symbolic reasoning run. It is called after the GPU SMs complete the neural LLM computation. REASON then performs logical reasoning and probabilistic inference, and writes the symbolic results to shared memory, where the SMs consume them in the next iteration.
REASON_check_status reports the current execution status of REASON (IDLE or EXECUTION) and includes an optional blocking flag. This feature allows the host thread to wait for REASON to complete the current step of reasoning before starting the next, ensuring proper coordination across subtasks without relying on CUDA stream synchronization.
Synchronization. Coordination between SMs and REASON is handled through shared-memory flag buffers and the L2 cache. After executing the LLM kernel, the SMs write the output to shared memory and set the neural_ready flag. REASON polls this flag, fetches the data, and performs symbolic reasoning. It then writes the result back to shared memory and sets the symbolic_ready flag, which is retrieved for the final output. This tightly coupled design leverages the GPU's throughput for LLM kernels and REASON's efficiency for symbolic reasoning, minimizing overhead and maximizing performance.
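The flag handshake can be modeled with two events standing in for the shared-memory flags. The names neural_ready and symbolic_ready follow the text; the threading machinery and helper functions below are purely illustrative, not the actual GPU/REASON implementation:

```python
import threading

class SharedFlags:
    """Toy model of the shared-memory flag buffers."""
    def __init__(self):
        self.neural_ready = threading.Event()
        self.symbolic_ready = threading.Event()
        self.buf = None  # stands in for the shared data buffer

def sm_side(flags, neural_out):
    """GPU SMs: publish neural output, then wait for the symbolic result."""
    flags.buf = neural_out
    flags.neural_ready.set()
    flags.symbolic_ready.wait()
    return flags.buf

def reason_side(flags, reasoner):
    """REASON: wait for neural_ready, reason, publish symbolic_ready."""
    flags.neural_ready.wait()
    flags.buf = reasoner(flags.buf)
    flags.symbolic_ready.set()
```

Running the two sides on separate threads reproduces the producer-consumer ordering the flags enforce.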
TABLE III: Hardware baselines. Comparison of device specs.
| Device | Node | On-chip Mem. | Compute | Area (mm²) | Power (W) |
| --- | --- | --- | --- | --- | --- |
| Orin NX [18] | 8 nm | 4 MB | 512 CUDA Cores | 450 | 15 |
| RTX A6000 [48] | 8 nm | 16.5 MB | 10572 CUDA Cores | 628 | 300 |
| Xeon CPU [17] | 10 nm | 112.5 MB | 60 cores per socket | 1600 | 270 |
| TPU [19] | 7 nm | 170 MB | 8 128 $\times$ 128 PEs | 400 | 192 |
| DPU # [59] | 28 nm | 2.4 MB | 8 PEs / 56 Nodes | 3.20 | 1.10 |
| REASON | 28 nm | 1.25 MB | 12 PEs / 80 Nodes | 6.00 | 2.12 |
| REASON * | 12 nm | 1.25 MB | 12 PEs / 80 Nodes | 1.37 | 1.21 |
| REASON * | 8 nm | 1.25 MB | 12 PEs / 80 Nodes | 0.51 | 0.98 |
- \* The 12 nm and 8 nm data are scaled with DeepScaleTool [57] at a voltage of 0.8 V and a frequency of 500 MHz. # The terminology of the tree-based DPU architecture is renamed from tree to PE and from PE to node to align with the proposed REASON.
### VI-C Two-Level Execution Pipeline
Our system-level design employs a two-level pipelined execution model (Fig. 9, top left) to maximize concurrency across neural and symbolic kernels. The GPU-REASON pipeline overlaps the execution of symbolic kernels on REASON for step $N$ with neural kernels on the GPU for step $N$ +1, effectively hiding the latency of one stage and improving throughput. Within REASON, the Intra-REASON pipeline exploits inter-node pipelined broadcast and reduction to hide communication latency, using parallel WLs traversal and priority-based conflict handling to accelerate symbolic kernels (Sec. V-E). The compiler integrates pipeline-aware scheduling to reorder instructions, avoid read-after-write hazards, and insert no-operation instructions when necessary, ensuring each stage receives valid data without interruption.
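The throughput benefit of the task-level overlap can be quantified with a simple makespan model: symbolic step $i$ on REASON runs concurrently with neural step $i{+}1$ on the GPU, so each unit starts a stage as soon as both its input and the unit itself are free. This is an illustrative scheduling model, not measured REASON timing:

```python
def pipeline_makespan(neural, symbolic):
    """Completion time of the GPU-REASON pipeline, given per-step
    durations for the neural (GPU) and symbolic (REASON) stages."""
    gpu_free = reason_free = 0.0
    for n_t, s_t in zip(neural, symbolic):
        gpu_free += n_t                                 # neural step i
        reason_free = max(reason_free, gpu_free) + s_t  # symbolic step i
    return reason_free
```

For three steps of 2 units (neural) and 3 units (symbolic), the overlapped makespan is 11 units versus 15 for fully sequential execution, i.e., the shorter stage's latency is almost entirely hidden.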
(Figure: annotated chip layout showing control/decode/WLs logic and interconnects, a custom SIMD unit, shared local memory, N SRAM banks, input/output distribution, and tree-structured PEs, alongside specifications: 28 nm technology, 0.9 V core VDD, 2.12 W power, 1.25 MB SRAM, 12 PEs, 80 nodes, 104 GB/s DRAM bandwidth, 6 mm² area.)
Figure 10: REASON layout and specifications. The physical design and key operating specifications of our proposed REASON hardware.
## VII Evaluation
This section introduces REASON evaluation settings (Sec. VII-A), and benchmarks the proposed algorithm optimizations (Sec. VII-B) and hardware architecture (Sec. VII-C).
### VII-A Evaluation Setup
Datasets. We evaluate REASON on 10 commonly-used reasoning datasets: IMO [66], MiniF2F [86], TwinSafety [20], XSTest [56], CommonGen [31], News [85], CoAuthor [26], AwA2 [78], FOLIO [11], and ProofWriter [65]. Task performance is measured by reasoning and deductive accuracy.
Algorithm setup. We evaluate REASON on six state-of-the-art neuro-symbolic models: AlphaGeometry [66], R$^2$-Guard [20], GeLaTo [82], Ctrl-G [83], NeuroPC [6], and LINC [52]. Following the setup in the original literature, we determine hyperparameters based on end-to-end reasoning performance on the datasets. Our proposed REASON algorithm optimizations are general and work as a plug-and-play extension to existing neuro-symbolic algorithms.
Baselines. We consider several hardware baselines, including Orin NX [18] (since our goal is to enable real-time neuro-symbolic reasoning at the edge), RTX GPU [48], Xeon CPU [17], and ML accelerators (TPU [19], DPU [59]). Tab. III lists their configurations.
Hardware setup. We implement the REASON architecture with [59] as the baseline template, synthesize it with Synopsys DC [63], and place and route it using Cadence Innovus [5] at the TSMC 28nm node. Fig. 10 illustrates the layout and key specifications. The REASON hardware occupies 6 mm$^2$ and consumes an average power of 2.12 W across workloads, based on Synopsys PTPX [64] power-trace analysis (Fig. 12 (a)). Unlike conventional tree-based arrays that mainly target neural workloads, REASON provides unified, reconfigurable support for neural, symbolic, and probabilistic computation.
Simulation setup. To evaluate end-to-end performance of REASON when integrated with GPUs, we develop a cycle-accurate simulator based on Accel-Sim (built on GPGPU-Sim) [21]. The simulator is configured for Orin NX architecture. The on-chip GPU is modeled with 8 SMs, each supporting 32 threads per warp, 48 KB shared memory, and 128 KB L1 cache, with a unified 2 MB L2 cache shared across SMs. The off-chip memory uses a 128-bit LPDDR5 interface with 104 GB/s peak BW. DRAM latency and energy are modeled using LPDDR5 timing parameters.
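As a back-of-the-envelope companion to the simulation setup, the sketch below estimates off-chip transfer time from the modeled LPDDR5 interface. Only the 104 GB/s peak bandwidth comes from the text; the effective-bandwidth utilization factor is an assumption for illustration, not a parameter of the actual simulator configuration.

```python
# Illustrative off-chip transfer-time model (not Accel-Sim's actual config).
PEAK_BW_BYTES_PER_S = 104e9   # 128-bit LPDDR5 interface, peak BW from Sec. VII-A
UTILIZATION = 0.7             # assumed fraction of peak bandwidth achieved

def transfer_time_us(n_bytes: int) -> float:
    """Lower-bound DRAM transfer time in microseconds."""
    return n_bytes / (PEAK_BW_BYTES_PER_S * UTILIZATION) * 1e6

# e.g., moving a 2 MB neural result between the SMs and REASON:
t_us = transfer_time_us(2 * 1024 * 1024)   # ~28.8 us
```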
Simulator test trace derivation. We use GPGPU-Sim to model interactions between SMs and REASON, including transferring neural results from SMs to REASON and returning symbolic reasoning results from REASON to SMs. To simulate communication overhead, we extract memory access traces from neuro-symbolic model execution on Orin, capturing data volume and access patterns as inputs to GPGPU-Sim for accurate modeling. For GPU comparison baselines, we use real hardware measurements to get accurate ground-truth data.
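The trace-derivation step can be pictured as follows. The `(address, n_bytes, is_write)` record layout and the `summarize_trace` helper are hypothetical stand-ins, not the paper's actual trace schema; the point is simply that captured accesses are reduced to data-volume and access-pattern statistics before being fed to GPGPU-Sim.

```python
# Hypothetical sketch: reduce memory-access records captured on Orin to the
# aggregate statistics that parameterize the simulated SM<->REASON traffic.
def summarize_trace(records):
    total = sum(n for _, n, _ in records)
    write_bytes = sum(n for _, n, is_write in records if is_write)
    return {
        "total_bytes": total,
        "write_fraction": write_bytes / total if total else 0.0,
    }

trace = [(0x1000, 256, False), (0x2000, 256, True), (0x3000, 512, False)]
stats = summarize_trace(trace)   # {'total_bytes': 1024, 'write_fraction': 0.25}
```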
### VII-B REASON Algorithm Performance
TABLE IV: REASON algorithm optimization performance. REASON achieves comparable accuracy with reduced memory footprint via unified DAG representation, adaptive pruning, and regularization.
| Workload | Benchmark | Metric | Baseline Performance | After REASON Algo. Opt. | Memory $\downarrow$ |
| --- | --- | --- | --- | --- | --- |
| AlphaGeo | IMO | Accuracy ($\uparrow$) | 83% | 83% | 25% |
| AlphaGeo | MiniF2F | Accuracy ($\uparrow$) | 81% | 81% | 21% |
| R$^2$-Guard | TwinSafety | AUPRC ($\uparrow$) | 0.758 | 0.752 | 37% |
| R$^2$-Guard | XSTest | AUPRC ($\uparrow$) | 0.878 | 0.881 | 30% |
| GeLaTo | CommonGen | BLEU ($\uparrow$) | 30.3 | 30.2 | 41% |
| GeLaTo | News | BLEU ($\uparrow$) | 5.4 | 5.4 | 27% |
| Ctrl-G | CoAuthor | Success rate ($\uparrow$) | 87% | 86% | 29% |
| NeuroPC | AwA2 | Accuracy ($\uparrow$) | 87% | 87% | 43% |
| LINC | FOLIO | Accuracy ($\uparrow$) | 92% | 91% | 38% |
| LINC | ProofWriter | Accuracy ($\uparrow$) | 84% | 84% | 26% |
Reasoning accuracy. To evaluate the REASON algorithm optimization (Sec. IV), we benchmark it on ten reasoning tasks (Sec. VII-A). Tab. IV lists the task performance and DAG size reduction. REASON achieves comparable reasoning accuracy through unification and adaptive DAG pruning. Through pruning and regularization, REASON saves 31.7% of the memory footprint on average across the ten reasoning tasks and six neuro-symbolic workloads.
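The 31.7% average can be checked directly from the per-task memory reductions in Tab. IV:

```python
# Per-task DAG memory reductions (%) from Tab. IV; their mean matches the
# 31.7% average savings quoted in the text.
reductions = {
    "IMO": 25, "MiniF2F": 21, "TwinSafety": 37, "XSTest": 30, "CommonGen": 41,
    "News": 27, "CoAuthor": 29, "AwA2": 43, "FOLIO": 38, "ProofWriter": 26,
}
avg_saving = sum(reductions.values()) / len(reductions)   # 31.7
```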
### VII-C REASON Architecture Performance
Performance improvement. We benchmark the REASON accelerator against Orin NX, RTX GPU, and Xeon CPU for accelerating neuro-symbolic algorithms on 10 reasoning tasks (Fig. 11). For the GPU baselines, neural kernels use PyTorch, which leverages CUDA and the cuBLAS/cuDNN libraries; symbolic kernels use custom implementations optimized for logic and probabilistic operations. Workloads are tiled by cuDNN in PyTorch based on block sizes that fit well in GPU memory. REASON exhibits consistent speedup across datasets, e.g., 50.65$\times$ and 11.98$\times$ over Orin NX and RTX GPU, respectively. Furthermore, REASON achieves real-time performance ($<$1.0 s) on math and cognitive reasoning tasks, indicating that REASON enables real-time probabilistic logical reasoning-based neuro-symbolic systems with superior reasoning and generalization capability, a promising solution for future cognitive applications.
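The quoted 50.65$\times$ and 11.98$\times$ figures are the arithmetic means of the per-task normalized runtimes shown in Fig. 11:

```python
# Per-task normalized runtimes of Orin NX and RTX GPU relative to REASON
# (values read from Fig. 11); their means give the quoted average speedups.
orin_nx = [48.3, 51.5, 48.9, 50.3, 48.0, 50.2, 53.0, 51.7, 51.6, 53.0]
rtx_gpu = [12.4, 12.1, 11.5, 11.4, 13.8, 12.4, 10.6, 9.8, 12.7, 13.1]

def mean(xs):
    return sum(xs) / len(xs)

orin_speedup = mean(orin_nx)   # 50.65x over Orin NX
rtx_speedup = mean(rtx_gpu)    # 11.98x over RTX GPU
```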
<details>
<summary>x11.png Details</summary>

Grouped bar chart of normalized runtime (log-scale y-axis, $10^0$ to $10^2$) across ten benchmarks (IMO, MiniF2F, TwinS, XSTest, ComGen, News, CoAuthor, AwA2, FOLIO, Proof) for Xeon CPU, Orin NX, RTX GPU, and REASON (baseline, 1.0). Per-benchmark normalized runtimes (Xeon CPU / Orin NX / RTX GPU): IMO 97.9/48.3/12.4; MiniF2F 99.2/51.5/12.1; TwinS 96.5/48.9/11.5; XSTest 97.6/50.3/11.4; ComGen 98.5/48.0/13.8; News 95.6/50.2/12.4; CoAuthor 97.9/53.0/10.6; AwA2 100.4/51.7/9.8; FOLIO 98.2/51.6/12.7; Proof 96.9/53.0/13.1. The ordering REASON $<$ RTX GPU $<$ Orin NX $<$ Xeon CPU holds on every task.
</details>
Figure 11: End-to-end runtime improvement. REASON consistently outperforms Xeon CPU, Orin NX, and RTX GPU in end-to-end runtime latency evaluated on 10 logical and cognitive reasoning tasks.
<details>
<summary>x12.png Details</summary>

Two panels. (a) Bar chart of REASON power (W) per task: News ~2.1, AwA2 1.88, TwinSafety 2.51, XSTest ~2.45, ComGen ~1.95. (b) Grouped bar chart (log-scale y-axis, Energy in J) for Xeon CPU, Orin NX, RTX GPU, and REASON across Average, IMO, TwinS, and News; annotated Average values are 838 (Xeon CPU), 310 (Orin NX), 681 (RTX GPU), and 0.87 (REASON), with REASON two to three orders of magnitude lower on every task.
</details>
Figure 12: Energy efficiency improvement. (a) Power analysis of REASON across workloads. (b) Energy efficiency comparison between REASON and CPUs/GPUs, evaluated from 10 reasoning tasks.
<details>
<summary>x13.png Details</summary>

Grouped bar chart of normalized runtime (log-scale y-axis, $10^0$ to $10^2$) for TPU-like (systolic-based array), DPU-like (tree-based array), and REASON (baseline, 1.00) across six models (AlphaG, Guard, GeLaTo, Ctrl-G, NPC, LINC) in three categories. Neuro-Only (TPU-like / DPU-like): AlphaG 0.69/4.31, Guard 0.71/4.40, GeLaTo 0.68/4.29, Ctrl-G 0.66/4.49, NPC 0.73/4.32, LINC 0.68/4.30. Symbolic-Only (logical/probabilistic): AlphaG 81.35/25.13, Guard 76.10/4.84, GeLaTo 109.24/5.03, Ctrl-G 78.48/6.07, NPC 74.71/4.97, LINC 90.89/23.97. End-to-End Neuro+Symbolic: AlphaG 21.31/7.86, Guard 17.77/2.31, GeLaTo 10.54/2.15, Ctrl-G 18.02/2.90, NPC 9.76/2.33, LINC 8.59/6.10.
</details>
Figure 13: Improved efficiency over ML accelerators. Speedup comparison of neural, symbolic (logical and probabilistic), and end-to-end neuro-symbolic system over TPU-like systolic-based array and DPU-like tree-based array architecture.
Energy efficiency improvement. The REASON accelerator achieves two orders of magnitude higher energy efficiency than Orin NX, RTX GPU, and Xeon CPU consistently across workloads (Fig. 12 (b)). To further assess REASON energy efficiency in long-term deployment, we run consecutive tests on REASON using mixed workloads, incorporating both low-activity and high-demand periods, with 15 s idle intervals between scenarios. On average, REASON achieves 681$\times$ energy efficiency compared to the RTX GPU. Additionally, compared to V100 and A100, REASON shows 4.91$\times$ and 1.60$\times$ speedup, with 802$\times$ and 268$\times$ energy efficiency, respectively.
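As a quick consistency check, per-task energy follows from $E = P \times t$. The 0.41 s runtime below is back-computed for illustration (it is consistent with the sub-second end-to-end latencies in Sec. VII-C), not a number reported in the paper.

```python
# Energy sanity check: with REASON's 2.12 W average power (Fig. 12(a)),
# an assumed ~0.41 s per-task runtime yields the ~0.87 J average
# annotated in Fig. 12(b).
power_w = 2.12          # average REASON power across workloads
runtime_s = 0.41        # illustrative per-task runtime (back-computed)
energy_j = power_w * runtime_s   # ~0.87 J
```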
Compare with CPU+GPU. We compare the performance of REASON as a GPU plug-in against the CPU+GPU architecture across neuro-symbolic workloads. The CPU+GPU architecture is inefficient for neuro-symbolic computing due to (1) high latency of symbolic/probabilistic operations with poor locality and $<$5% CPU parallel efficiency, (2) $>$15% inter-device communication overhead from frequent neural-symbolic data transfers, and (3) fine-grained coupling between neural and symbolic modules that makes handoffs costly. REASON co-locates logical and probabilistic reasoning beside GPU SMs, sharing the L2 cache and inter-SM fabric to eliminate transfers and pipeline neural-symbolic execution.
Compare with ML accelerators. We benchmark the runtime of neural and symbolic operations on a TPU-like systolic array [19] and a DPU-like tree-based array [59] across different neuro-symbolic models and tasks (Fig. 13). For the TPU-like architecture, we use SCALE-Sim [54], configured with eight 128$\times$128 systolic arrays. For the DPU-like architecture, we use MAERI [24], configured with eight PEs in a 56-node tree-based array. Compared with these ML accelerators, REASON achieves similar performance on neural operations while exhibiting superior symbolic logic and probabilistic operation efficiency, and thus end-to-end speedup in neuro-symbolic systems.
TABLE V: Ablation study of necessity of co-design. The normalized runtime achieved by REASON framework w/o the proposed algorithm optimization or hardware techniques on different tasks.
| Algorithm @ Hardware | IMO [66] | MiniF [86] | TwinS [20] | XSTest [56] | ComG [31] |
| --- | --- | --- | --- | --- | --- |
| Baseline [66, 20, 82] @ Orin NX | 100% | 100% | 100% | 100% | 100% |
| REASON Algo. @ Orin NX | 84.2% | 87.0% | 78.3% | 82.9% | 86.6% |
| REASON Algo. @ REASON HW | 2.07% | 1.94% | 2.04% | 1.99% | 2.08% |
Ablation study on the proposed hardware techniques. REASON features a reconfigurable tree-based array architecture, efficient register bank mapping, and an adaptive scheduling strategy to reduce compute latency for neural, symbolic, and probabilistic kernels (Sec. V and Sec. VI). To demonstrate their effectiveness, we measure the runtime of REASON without the scheduling, the reconfigurable architecture, and the bank mapping. In particular, the proposed memory layout support alone trims the runtime by 22% on average. With the reconfigurable array and scheduling strategy added, the runtime reduction grows to 56% and 73%, respectively, indicating that these techniques are all necessary for REASON to achieve the desired efficient reasoning capability.
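A runtime reduction of $r$ corresponds to a $1/(1-r)$ speedup over the stripped-down baseline, so the ablation numbers translate as follows:

```python
# Converting the ablation's cumulative runtime reductions into speedups
# over the configuration without the proposed hardware techniques.
def speedup_from_reduction(r: float) -> float:
    """A runtime reduction of r (as a fraction) is a 1/(1 - r) speedup."""
    return 1.0 / (1.0 - r)

bank_mapping_only = speedup_from_reduction(0.22)   # ~1.28x (memory layout)
plus_reconfig = speedup_from_reduction(0.56)       # ~2.27x (+ reconfigurable array)
plus_scheduling = speedup_from_reduction(0.73)     # ~3.70x (+ adaptive scheduling)
```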
Ablation study of the necessity of co-design. To verify the necessity of the algorithm-hardware co-design strategy for efficient probabilistic logical reasoning-based neuro-symbolic systems, we measure the runtime latency of REASON without the proposed algorithm or hardware techniques in Tab. V. Specifically, with the proposed REASON algorithm optimization, we trim the runtime to 78.3% compared to R$^2$-Guard [20] on the same Orin NX hardware and TwinSafety task. Moreover, with both the REASON algorithm optimization and accelerator, the runtime is reduced to 2.04%, indicating the necessity of the REASON framework's co-design strategy.
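The normalized runtimes in Tab. V convert directly into speedups over the Orin NX baseline (TwinSafety column shown):

```python
# Tab. V normalized runtimes (TwinSafety column) expressed as speedups over
# the baseline algorithm running on Orin NX.
algo_only_speedup = 100 / 78.3   # ~1.28x from REASON algorithm alone
co_design_speedup = 100 / 2.04   # ~49x from algorithm + REASON hardware
```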
REASON neural optimization. REASON accelerates symbolic reasoning and enables seamless interaction with GPUs optimized for neural (NN/LLM) computation. To further optimize the neural module, we integrate standard LLM acceleration techniques: memory-efficient attention [25], chunked prefill [69], speculative decoding [27], FlashAttention-3 kernels [58], FP8 KV-cache quantization [70], and prefix caching [68]. These collectively yield 2.8-3.3$\times$ latency reduction for unique prompts and 4-5$\times$ when prefixes are reused. While REASON's reported gains stem from its hardware-software co-design, these LLM optimizations are orthogonal and can be applied in conjunction to unlock the full potential system speedup.
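If these LLM-side optimizations are indeed orthogonal to REASON's symbolic acceleration, their effects compose per component in Amdahl's-law fashion. The 60/40 neural/symbolic time split and the 50$\times$ symbolic factor below are illustrative assumptions, not measured values.

```python
# Amdahl-style composition of independent neural and symbolic accelerations.
def combined_speedup(neural_frac, symbolic_frac, neural_accel, symbolic_accel):
    """End-to-end speedup when each component is accelerated independently."""
    remaining = neural_frac / neural_accel + symbolic_frac / symbolic_accel
    return 1.0 / remaining

# e.g., 3x LLM-side latency reduction combined with 50x symbolic acceleration:
s = combined_speedup(0.6, 0.4, 3.0, 50.0)   # ~4.8x end-to-end
```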
## VIII Related Work
Neuro-symbolic AI. Neuro-symbolic AI has emerged as a promising paradigm for addressing limitations of purely neural models, including factual errors, limited interpretability, and weak multi-step reasoning [84, 3, 14, 8, 53, 37, 80]. Systems such as LIPS [28], AlphaGeometry [66], NSC [39], and NS3D [15] demonstrate strong performance across domains ranging from mathematical reasoning to embodied and cognitive robotics. However, most prior work focuses on algorithmic design and model integration. REASON systematically characterizes the architectural and system-level properties of probabilistic logical reasoning in neuro-symbolic AI, and proposes an integrated acceleration framework for scalable deployment.
System and architecture for neuro-symbolic AI. Early neuro-symbolic systems largely focused on software-level abstractions (training semantics and declarative languages that integrate neural computation with logical or probabilistic reasoning), e.g., DeepProbLog [36], DreamCoder [10], and Scallop [29]. Recent efforts have begun to address system-level challenges, such as heterogeneous mapping, batching control-heavy reasoning, and kernel specialization, including benchmarking [74], pruning [7], Lobster [4], Dolphin [47], and KLay [34]. At the architectural level, a growing body of work exposes the mismatch between compositional neuro-symbolic workloads and conventional hardware designs, motivating cognitive architectures such as CogSys [77], NVSA architectures [73], and NSFlow [79]. REASON advances this direction with the first flexible system-architecture co-design that supports probabilistic logical reasoning-based neuro-symbolic AI and integrates with GPUs, enabling efficient and scalable deployment of compositional neuro-symbolic and LLM+tools agentic systems.
## IX Conclusion
We present REASON, an integrated acceleration framework for efficiently executing probabilistic logical reasoning in neuro-symbolic AI. REASON introduces a unified DAG abstraction with adaptive pruning and a flexible reconfigurable architecture integrated with GPUs to enable efficient end-to-end execution. Our results show that system-architecture co-design is critical for making neuro-symbolic reasoning practical at scale, and position REASON as a potential foundation for future agentic AI and LLM+tools systems that require structured and interpretable reasoning alongside neural computation.
## Acknowledgements
This work was supported in part by CoCoSys, one of seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA. We thank Ananda Samajdar, Ritik Raj, Anand Raghunathan, Kaushik Roy, Ningyuan Cao, Katie Zhao, Alexey Tumanov, Shirui Zhao, Xiaoxuan Yang, Zhe Zeng, and the anonymous HPCA reviewers for insightful discussions and valuable feedback.
## References
- [1] K. Ahmed, S. Teso, K. Chang, G. Van den Broeck, and A. Vergari (2022) Semantic probabilistic layers for neuro-symbolic learning. Advances in Neural Information Processing Systems 35, pp. 29944-29959. Cited by: §I.
- [2] R. Aksitov, S. Miryoosefi, Z. Li, D. Li, S. Babayan, K. Kopparapu, Z. Fisher, R. Guo, S. Prakash, P. Srinivasan, et al. (2023) Rest meets react: self-improvement for multi-step reasoning llm agent. arXiv preprint arXiv:2312.10003. Cited by: §I.
- [3] S. Badreddine, A. d. Garcez, L. Serafini, and M. Spranger (2022) Logic tensor networks. Artificial Intelligence 303, pp. 103649. Cited by: §VIII.
- [4] P. Biberstein, Z. Li, J. Devietti, and M. Naik (2025) Lobster: a gpu-accelerated framework for neurosymbolic programming. arXiv preprint arXiv:2503.21937. Cited by: §VIII.
- [5] Cadence Innovus implementation system - cadence. Note: https://www.cadence.com/en_US/home/tools/digital-design-and-signoff/soc-implementation-and-floorplanning/innovus-implementation-system.html Cited by: § VII-A.
- [6] W. Chen, S. Yu, H. Shao, L. Sha, and H. Zhao (2025) Neural probabilistic circuits: enabling compositional and interpretable predictions through logical reasoning. arXiv preprint arXiv:2501.07021. Cited by: TABLE I, § VII-A.
- [7] M. Dang, A. Liu, and G. Van den Broeck (2022) Sparse probabilistic circuits via pruning and growing. Advances in Neural Information Processing Systems (NeurIPS) 35, pp. 28374-28385. Cited by: §VIII.
- [8] H. Dong, J. Mao, T. Lin, C. Wang, L. Li, and D. Zhou (2019) Neural logic machines. In International Conference on Learning Representations (ICLR), Cited by: §VIII.
- [9] S. Du, M. Ibrahim, Z. Wan, L. Zheng, B. Zhao, Z. Fan, C. Liu, T. Krishna, A. Raychowdhury, and H. Li (2025) Cross-layer design of vector-symbolic computing: bridging cognition and brain-inspired hardware acceleration. arXiv preprint arXiv:2508.14245. Cited by: §I.
- [10] K. Ellis, L. Wong, M. Nye, M. Sable-Meyer, L. Cary, L. Anaya Pozo, L. Hewitt, A. Solar-Lezama, and J. B. Tenenbaum (2023) Dreamcoder: growing generalizable, interpretable knowledge with wake-sleep bayesian program learning. Philosophical Transactions of the Royal Society A 381 (2251), pp. 20220050. Cited by: §VIII.
- [11] S. Han, H. Schoelkopf, Y. Zhao, Z. Qi, M. Riddell, W. Zhou, J. Coady, D. Peng, Y. Qiao, L. Benson, et al. (2022) Folio: natural language reasoning with first-order logic. arXiv preprint arXiv:2209.00840. Cited by: § VII-A.
- [12] M. Hersche, M. Zeqiri, L. Benini, A. Sebastian, and A. Rahimi (2023) A neuro-vector-symbolic architecture for solving raven's progressive matrices. Nature Machine Intelligence 5 (4), pp. 363–375. Cited by: §I, § II-A.
- [13] M. J. Heule, O. Kullmann, S. Wieringa, and A. Biere (2011) Cube and conquer: guiding cdcl sat solvers by lookaheads. In Haifa Verification Conference, pp. 50–65. Cited by: § II-C.
- [14] P. Hohenecker and T. Lukasiewicz (2020) Ontology reasoning with deep neural networks. Journal of Artificial Intelligence Research 68, pp. 503–540. Cited by: §VIII.
- [15] J. Hsu, J. Mao, and J. Wu (2023) Ns3d: neuro-symbolic grounding of 3d objects and relations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2614–2623. Cited by: §VIII.
- [16] M. Ibrahim, Z. Wan, H. Li, P. Panda, T. Krishna, P. Kanerva, Y. Chen, and A. Raychowdhury (2024) Special session: neuro-symbolic architecture meets large language models: a memory-centric perspective. In 2024 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ ISSS), pp. 11–20. Cited by: §I.
- [17] INTEL Corporation (2023) 4th gen intel xeon scalable processors. Note: https://www.intel.com/content/www/us/en/ark/products/series/228622/4th-gen-intel-xeon-scalable-processors.html Cited by: TABLE III, § VII-A.
- [18] Jetson orin for next-gen robotics – nvidia. Note: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/ (Accessed on 04/02/2024) Cited by: TABLE III, § VII-A.
- [19] N. P. Jouppi, D. H. Yoon, M. Ashcraft, M. Gottscho, T. B. Jablin, G. Kurian, J. Laudon, S. Li, P. Ma, X. Ma, et al. (2021) Ten lessons from three generations shaped google's tpuv4i: industrial product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pp. 1–14. Cited by: TABLE III, § VII-A, § VII-C.
- [20] M. Kang and B. Li (2025) R${}^{2}$-Guard: robust reasoning enabled llm guardrail via knowledge-enhanced logical reasoning. International Conference on Learning Representations (ICLR). Cited by: §I, TABLE I, § VII-A, § VII-A, § VII-C, TABLE V, TABLE V.
- [21] M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers (2020) Accel-sim: an extensible simulation framework for validated gpu modeling. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 473–486. Cited by: § VII-A.
- [22] P. Khosravi, Y. Choi, Y. Liang, A. Vergari, and G. Van den Broeck (2019) On tractable computation of expected predictions. Advances in Neural Information Processing Systems 32. Cited by: § II-C.
- [23] J. Kuang, Y. Shen, J. Xie, H. Luo, Z. Xu, R. Li, Y. Li, X. Cheng, X. Lin, and Y. Han (2025) Natural language understanding and inference with mllm in visual question answering: a survey. ACM Computing Surveys 57 (8), pp. 1–36. Cited by: §I.
- [24] H. Kwon, A. Samajdar, and T. Krishna (2018) Maeri: enabling flexible dataflow mapping over dnn accelerators via reconfigurable interconnects. ACM Sigplan Notices 53 (2), pp. 461–475. Cited by: § V-B, § VII-C.
- [25] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles (SOSP), pp. 611–626. Cited by: § VII-C.
- [26] M. Lee, P. Liang, and Q. Yang (2022) Coauthor: designing a human-ai collaborative writing dataset for exploring language model capabilities. In Proceedings of the 2022 CHI conference on human factors in computing systems, pp. 1–19. Cited by: § VII-A.
- [27] Y. Leviathan, M. Kalman, and Y. Matias (2023) Fast inference from transformers via speculative decoding. In International Conference on Machine Learning (ICML), pp. 19274–19286. Cited by: § VII-C.
- [28] Z. Li, Z. Li, W. Tang, X. Zhang, Y. Yao, X. Si, F. Yang, K. Yang, and X. Ma (2025) Proving olympiad inequalities by synergizing llms and symbolic reasoning. International Conference on Learning Representations (ICLR), pp. 1–26. Cited by: §I, §VIII.
- [29] Z. Li, J. Huang, and M. Naik (2023) Scallop: a language for neurosymbolic programming. Proceedings of the ACM on Programming Languages 7 (PLDI), pp. 1463–1487. Cited by: §VIII.
- [30] Y. Liang and G. Van den Broeck (2019) Learning logistic circuits. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4277–4286. Cited by: § II-C.
- [31] B. Y. Lin, W. Zhou, M. Shen, P. Zhou, C. Bhagavatula, Y. Choi, and X. Ren (2020) CommonGen: a constrained text generation challenge for generative commonsense reasoning. Findings of the Association for Computational Linguistics (EMNLP), pp. 1823–1840. Cited by: § VII-A, TABLE V.
- [32] A. Liu, K. Ahmed, and G. V. d. Broeck (2024) Scaling tractable probabilistic circuits: a systems perspective. International Conference on Machine Learning (ICML), pp. 30630–30646. Cited by: § II-C.
- [33] M. Lo, M. F. Chang, and J. Cong (2025) SAT-accel: a modern sat solver on a fpga. In Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 234–246. Cited by: § II-C.
- [34] J. Maene, V. Derkinderen, and P. Z. D. Martires (2024) Klay: accelerating arithmetic circuits for neurosymbolic ai. arXiv preprint arXiv:2410.11415. Cited by: §VIII.
- [35] M. Mahaut, L. Aina, P. Czarnowska, M. Hardalov, T. Müller, and L. Màrquez (2024) Factual confidence of llms: on reliability and robustness of current estimators. ACL. Cited by: §I.
- [36] R. Manhaeve, S. Dumancic, A. Kimmig, T. Demeester, and L. De Raedt (2018) Deepproblog: neural probabilistic logic programming. Advances in neural information processing systems (NeurIPS) 31. Cited by: §VIII.
- [37] R. Manhaeve, S. Dumančić, A. Kimmig, T. Demeester, and L. De Raedt (2021) Neural probabilistic logic programming in deepproblog. Artificial Intelligence 298, pp. 103504. Cited by: §VIII.
- [38] J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum, and J. Wu (2019) The neuro-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision. International Conference on Learning Representations (ICLR). Cited by: §I, § II-A.
- [39] J. Mao, J. B. Tenenbaum, and J. Wu (2025) Neuro-symbolic concepts. arXiv preprint arXiv:2505.06191. Cited by: §VIII.
- [40] J. Marques-Silva, I. Lynce, and S. Malik (2021) Conflict-driven clause learning sat solvers. In Handbook of satisfiability, pp. 133–182. Cited by: § II-C.
- [41] L. Mei, J. Mao, Z. Wang, C. Gan, and J. B. Tenenbaum (2022) FALCON: fast visual concept learning by integrating images, linguistic descriptions, and conceptual relations. International Conference on Learning Representations (ICLR). Cited by: §I, § II-A.
- [42] S. Mirchandani, F. Xia, P. Florence, B. Ichter, D. Driess, M. G. Arenas, K. Rao, D. Sadigh, and A. Zeng (2023) Large language models as general pattern machines. CoRL. Cited by: §I.
- [43] B. Mor, S. Garhwal, and A. Kumar (2021) A systematic review of hidden markov models and their applications. Archives of computational methods in engineering 28 (3), pp. 1429–1448. Cited by: § II-C.
- [44] M. W. Moskewicz, C. F. Madigan, Y. Zhao, L. Zhang, and S. Malik (2001) Chaff: engineering an efficient sat solver. In Proceedings of the 38th annual Design Automation Conference, pp. 530–535. Cited by: § V-D.
- [45] F. Muñoz-Martínez, R. Garg, M. Pellauer, J. L. Abellán, M. E. Acacio, and T. Krishna (2023) Flexagon: a multi-dataflow sparse-sparse matrix multiplication accelerator for efficient dnn processing. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pp. 252–265. Cited by: § V-B.
- [46] M. F. Naeem, M. G. Z. A. Khan, Y. Xian, M. Z. Afzal, D. Stricker, L. Van Gool, and F. Tombari (2023) I2mvformer: large language model generated multi-view document supervision for zero-shot image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15169–15179. Cited by: §I.
- [47] A. Naik, J. Liu, C. Wang, A. Sethi, S. Dutta, M. Naik, and E. Wong (2024) Dolphin: a programmable framework for scalable neurosymbolic learning. arXiv preprint arXiv:2410.03348. Cited by: §VIII.
- [48] NVIDIA Corporation (2020) NVIDIA rtx a6000 graphics card. Note: https://www.nvidia.com/en-us/products/workstations/rtx-a6000/ Cited by: TABLE III, § VII-A.
- [49] NVIDIA NVIDIA Jetson Orin. Note: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/ Cited by: §III.
- [50] NVIDIA NVIDIA Nsight Compute. Note: https://developer.nvidia.com/nsight-compute Cited by: § III-B.
- [51] NVIDIA NVIDIA Nsight Systems. Note: https://developer.nvidia.com/nsight-systems Cited by: § III-B.
- [52] T. X. Olausson, A. Gu, B. Lipkin, C. E. Zhang, A. Solar-Lezama, J. B. Tenenbaum, and R. Levy (2023) LINC: a neurosymbolic approach for logical reasoning by combining language models with first-order logic provers. Conference on Empirical Methods in Natural Language Processing (EMNLP). Cited by: TABLE I, § VII-A.
- [53] C. Pryor, C. Dickens, E. Augustine, A. Albalak, W. Wang, and L. Getoor (2022) NeuPSL: neural probabilistic soft logic. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI) 461, pp. 4145–4153. Cited by: §VIII.
- [54] R. Raj, S. Banerjee, N. Chandra, Z. Wan, J. Tong, A. Samajdar, and T. Krishna (2025) SCALE-sim v3: a modular cycle-accurate systolic accelerator simulator for end-to-end system analysis. In 2025 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 186–200. Cited by: § VII-C.
- [55] B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi, et al. (2024) Mathematical discoveries from program search with large language models. Nature 625 (7995), pp. 468–475. Cited by: §I, § II-A.
- [56] P. Röttger, H. R. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy (2023) Xstest: a test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263. Cited by: § VII-A, TABLE V.
- [57] S. Sarangi and B. Baas (2021) DeepScaleTool: a tool for the accurate estimation of technology scaling in the deep-submicron era. In 2021 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5. Cited by: item *.
- [58] J. Shah, G. Bikshandi, Y. Zhang, V. Thakkar, P. Ramani, and T. Dao (2024) Flashattention-3: fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems (NeurIPS) 37, pp. 68658–68685. Cited by: § VII-C.
- [59] N. Shah, W. Meert, and M. Verhelst (2023) DPU-v2: energy-efficient execution of irregular directed acyclic graphs. In 2023 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1288–1307. Cited by: TABLE III, § VII-A, § VII-A, § VII-C.
- [60] C. Shengyuan, Y. Cai, H. Fang, X. Huang, and M. Sun (2023) Differentiable neuro-symbolic reasoning on large-scale knowledge graphs. Advances in Neural Information Processing Systems 36, pp. 28139–28154. Cited by: §I.
- [61] C. Singh, J. P. Inala, M. Galley, R. Caruana, and J. Gao (2024) Rethinking interpretability in the era of large language models. arXiv preprint arXiv:2402.01761. Cited by: §I.
- [62] G. Sriramanan, S. Bharti, V. S. Sadasivan, S. Saha, P. Kattakinda, and S. Feizi (2024) Llm-check: investigating detection of hallucinations in large language models. Advances in Neural Information Processing Systems 37, pp. 34188–34216. Cited by: §I.
- [63] Synopsys Design compiler - synopsys. Note: https://www.synopsys.com/implementation-and-signoff/rtl-synthesis-test/dc-ultra.html Cited by: § VII-A.
- [64] Synopsys PrimeTime - synopsys. Note: https://www.synopsys.com/implementation-and-signoff/signoff/primetime.html Cited by: § VII-A.
- [65] O. Tafjord, B. D. Mishra, and P. Clark (2020) ProofWriter: generating implications, proofs, and abductive statements over natural language. arXiv preprint arXiv:2012.13048. Cited by: § VII-A.
- [66] T. H. Trinh, Y. Wu, Q. V. Le, H. He, and T. Luong (2024) Solving olympiad geometry without human demonstrations. Nature 625 (7995), pp. 476–482. Cited by: §I, § II-A, § II-B, TABLE I, § VII-A, § VII-A, TABLE V, TABLE V, §VIII.
- [67] P. Van Der Tak, M. J. Heule, and A. Biere (2012) Concurrent cube-and-conquer. In International Conference on Theory and Applications of Satisfiability Testing, pp. 475–476. Cited by: § II-C.
- [68] vLLM vLLM Automatic Prefix Caching. Note: https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html Cited by: § VII-C.
- [69] vLLM vLLM Performance and Tuning. Note: https://docs.vllm.ai/en/latest/configuration/optimization.html Cited by: § VII-C.
- [70] vLLM vLLM Quantized KV Cache. Note: https://docs.vllm.ai/en/stable/features/quantization/quantized_kvcache.html Cited by: § VII-C.
- [71] Z. Wan, Y. Du, M. Ibrahim, J. Qian, J. Jabbour, Y. Zhao, T. Krishna, A. Raychowdhury, and V. J. Reddi (2025) Reca: integrated acceleration for real-time and efficient cooperative embodied autonomous agents. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Volume 2, pp. 982–997. Cited by: §I.
- [72] Z. Wan, C. Liu, H. Yang, C. Li, H. You, Y. Fu, C. Wan, T. Krishna, Y. Lin, and A. Raychowdhury (2024) Towards cognitive ai systems: a survey and prospective on neuro-symbolic ai. arXiv preprint arXiv:2401.01040. Cited by: §I.
- [73] Z. Wan, C. Liu, H. Yang, R. Raj, C. Li, H. You, Y. Fu, C. Wan, S. Li, Y. Kim, et al. (2024) Towards efficient neuro-symbolic ai: from workload characterization to hardware architecture. IEEE Transactions on Circuits and Systems for Artificial Intelligence (TCASAI). Cited by: §VIII.
- [74] Z. Wan, C. Liu, H. Yang, R. Raj, C. Li, H. You, Y. Fu, C. Wan, A. Samajdar, Y. C. Lin, et al. (2024) Towards cognitive ai systems: workload and characterization of neuro-symbolic ai. In 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 268–279. Cited by: §VIII.
- [75] Z. Wan, C. Liu, H. Yang, R. Raj, A. Raychowdhury, and T. Krishna (2025) Efficient processing of neuro-symbolic ai: a tutorial and cross-layer co-design case study. Proceedings of the International Conference on Neuro-symbolic Systems. Cited by: §I.
- [76] Z. Wan, H. Yang, J. Qian, R. Raj, J. Park, C. Wang, A. Raychowdhury, and T. Krishna (2026) Compositional ai beyond llms: system implications of neuro-symbolic-probabilistic architectures. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Volume 1, pp. 67–84. Cited by: Figure 2, Figure 2.
- [77] Z. Wan, H. Yang, R. Raj, C. Liu, A. Samajdar, A. Raychowdhury, and T. Krishna (2025) Cogsys: efficient and scalable neurosymbolic cognition system via algorithm-hardware co-design. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 775–789. Cited by: §VIII.
- [78] Y. Xian, C. Lampert, B. Schiele, and Z. Akata (2018) Zero-shot learning – a comprehensive evaluation of the good, the bad and the ugly. arXiv preprint arXiv:1707.00600. Cited by: § VII-A.
- [79] H. Yang, Z. Wan, R. Raj, J. Park, Z. Li, A. Samajdar, A. Raychowdhury, and T. Krishna (2025) NSFlow: an end-to-end fpga framework with scalable dataflow architecture for neuro-symbolic ai. In 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1–7. Cited by: §VIII.
- [80] Z. Yang, A. Ishay, and J. Lee (2020) Neurasp: embracing neural networks into answer set programming. In 29th International Joint Conference on Artificial Intelligence (IJCAI 2020), Cited by: §VIII.
- [81] C. Zhang, B. Jia, S. Zhu, and Y. Zhu (2021) Abstract spatial-temporal reasoning via probabilistic abduction and execution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9736–9746. Cited by: §I, § II-A.
- [82] H. Zhang, M. Dang, N. Peng, and G. Van den Broeck (2023) Tractable control for autoregressive language generation. In International Conference on Machine Learning (ICML), pp. 40932–40945. Cited by: TABLE I, § VII-A, TABLE V.
- [83] H. Zhang, P. Kung, M. Yoshida, G. Van den Broeck, and N. Peng (2024) Adaptable logical control for large language models. Advances in Neural Information Processing Systems (NeurIPS) 37, pp. 115563–115587. Cited by: §I, TABLE I, § VII-A.
- [84] H. Zhang and T. Yu (2020) AlphaZero. Deep Reinforcement Learning: Fundamentals, Research and Applications, pp. 391–415. Cited by: §VIII.
- [85] Y. Zhang, G. Wang, C. Li, Z. Gan, C. Brockett, and B. Dolan (2020) POINTER: constrained progressive text generation via insertion-based generative pre-training. arXiv preprint arXiv:2005.00558. Cited by: § VII-A.
- [86] K. Zheng, J. M. Han, and S. Polu (2021) Minif2f: a cross-system benchmark for formal olympiad-level mathematics. arXiv preprint arXiv:2109.00110. Cited by: § VII-A, TABLE V.