# REASON: Accelerating Probabilistic Logical Reasoning for Scalable Neuro-Symbolic Intelligence
**Authors**: Zishen Wan, Che-Kai Liu, Jiayi Qian, Hanchen Yang, Arijit Raychowdhury, Tushar Krishna
## Abstract
Neuro-symbolic AI systems integrate neural perception with symbolic and probabilistic reasoning to enable data-efficient, interpretable, and robust intelligence beyond purely neural models. Although this compositional paradigm has shown superior performance in domains such as mathematical reasoning, planning, and verification, its deployment remains challenging due to severe inefficiencies in symbolic and probabilistic inference. Through systematic analysis of representative neuro-symbolic workloads, we identify probabilistic logical reasoning as the inefficiency bottleneck, characterized by irregular control flow, low arithmetic intensity, uncoalesced memory accesses, and poor hardware utilization on CPUs and GPUs.
This paper presents REASON, an integrated acceleration framework for probabilistic logical reasoning in neuro-symbolic AI. At the algorithm level, REASON introduces a unified directed acyclic graph representation that captures common structure across symbolic and probabilistic models, coupled with adaptive pruning and regularization. At the architecture level, REASON features a reconfigurable, tree-based processing fabric optimized for irregular traversal, symbolic deduction, and probabilistic aggregation. At the system level, REASON is tightly integrated with GPU streaming multiprocessors through a programmable interface and multi-level pipeline that efficiently orchestrates neural, symbolic, and probabilistic execution. Evaluated across six neuro-symbolic workloads at a TSMC 28 nm node, REASON achieves 12-50$\times$ speedup and 310-681$\times$ energy-efficiency gains over desktop and edge GPUs. REASON enables real-time probabilistic logical reasoning, completing end-to-end tasks in 0.8 s within a 6 mm$^2$ area and 2.12 W power budget. These results demonstrate that targeted acceleration of probabilistic logical reasoning is critical for practical and scalable neuro-symbolic AI, positioning REASON as a foundational system architecture for next-generation cognitive intelligence.
## I Introduction
Large Language Models (LLMs) have demonstrated remarkable capabilities in natural language understanding, image recognition, and complex pattern learning from vast datasets [23, 46, 42, 16]. However, despite their success, LLMs often struggle with factual accuracy, hallucinations, multi-step reasoning, and interpretability [35, 62, 2, 61]. These limitations have spurred the development of compositional AI systems, which integrate neural models with symbolic and probabilistic reasoning to create robust, transparent, and intelligent cognitive systems.

† Corresponding author.
One promising compositional paradigm is neuro-symbolic AI, which integrates neural, symbolic, and probabilistic components into a unified cognitive architecture [60, 1, 72, 9, 75]. In this system, the neural module captures the statistical, pattern-matching behavior of learned models, performing rapid function approximation and token prediction for intuitive perception and feature extraction. The symbolic and probabilistic modules perform explicit, verifiable reasoning that is structured, interpretable, and robust under uncertainty, managing logic-based reasoning and probabilistic updates. This paradigm integrates intuitive generalization and deliberate reasoning.
Neuro-symbolic AI has demonstrated superior capabilities in abstract deduction, complex question answering, mathematical reasoning, logical reasoning, and cognitive robotics [28, 66, 55, 81, 12, 38, 41, 71]. Its ability to learn efficiently from fewer data points, produce transparent and verifiable outputs, and robustly handle uncertainty and ambiguity makes it particularly advantageous compared to purely neural approaches. For example, Meta's LIPS [28] and Google's AlphaGeometry [66] recently leveraged compositional neuro-symbolic approaches to solve complex math problems at the level of human Olympiad gold medalists. R$^2$-Guard [20] combines LLMs with probabilistic models to improve robust reasoning capability and resilience against jailbreaks. These systems represent a paradigm shift for AI applications that require robust, verifiable, and explainable reasoning.
Despite impressive algorithmic advances in neuro-symbolic AI, often demonstrated on large-scale distributed GPU clusters, efficient deployment at the edge remains a fundamental challenge. Neuro-symbolic agents, particularly in robotics, planning, interactive cognition, and verification, require real-time logical inference to interact effectively with physical environments and multi-agent systems. For example, Ctrl-G, a text-infilling neuro-symbolic agent [83], must execute hundreds of reasoning steps per second to remain responsive, yet current implementations take over 5 minutes on a desktop GPU to complete a single task. This latency gap makes practical deployment of neuro-symbolic AI systems challenging.
To understand the root causes of this inefficiency, we systematically analyze a diverse set of neuro-symbolic workloads and uncover several system- and architecture-level challenges. Symbolic and probabilistic kernels frequently dominate end-to-end runtime and exhibit highly irregular execution characteristics, including heterogeneous compute patterns and memory-bound behavior with low ALU utilization. These kernels suffer from limited exploitable parallelism and irregular, uncoalesced memory accesses, leading to poor performance and efficiency on CPU and GPU architectures.
To address these challenges, we develop an integrated acceleration framework, REASON, which, to the best of our knowledge, is the first to accelerate probabilistic logical reasoning-based neuro-symbolic AI systems. REASON is designed to close the efficiency gap of compositional AI by jointly optimizing algorithms, architecture, and system integration for the irregular and heterogeneous workloads inherent to neuro-symbolic reasoning.
At the algorithm level, REASON introduces a unified directed acyclic graph (DAG) representation that captures shared computational structure across symbolic and probabilistic kernels. An adaptive pruning and regularization technique further reduces model size and computational complexity while preserving task accuracy. At the architecture level, REASON features a flexible design optimized for various irregular symbolic and probabilistic computations, leveraging the unified DAG representation. The architecture comprises reconfigurable tree-based processing elements (PEs), compiler-driven workload mapping, and memory layout to enable highly parallel and energy-efficient symbolic and probabilistic computation. At the system level, REASON is tightly integrated with GPU streaming multiprocessors (SMs), forming a heterogeneous system with a programmable interface and multi-level execution pipeline that efficiently orchestrates neural, symbolic, and probabilistic kernels while maintaining high hardware utilization and scalability as neuro-symbolic models evolve. Notably, unlike conventional tree-like computing arrays optimized primarily for neural workloads, REASON provides reconfigurable support for neural, symbolic, and probabilistic kernels within a unified execution fabric, enabling efficient and scalable neuro-symbolic AI systems.
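To make the shared structure concrete, a unified DAG of this kind can be sketched as follows. This is a minimal Python illustration; the node types, field names, and evaluation rule are our own simplifications, not REASON's actual intermediate representation:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """One vertex of a unified symbolic/probabilistic DAG (illustrative only)."""
    op: str                                              # "LEAF", "AND", "OR", "PROD", or "SUM"
    children: List["Node"] = field(default_factory=list)
    weights: List[float] = field(default_factory=list)   # mixture weights for SUM nodes
    value: float = 0.0                                   # input value for LEAF nodes

def evaluate(node: Node) -> float:
    """Bottom-up evaluation: logical ops operate over {0, 1}, probabilistic ops over [0, 1]."""
    if node.op == "LEAF":
        return node.value
    vals = [evaluate(c) for c in node.children]
    if node.op in ("AND", "PROD"):       # logical conjunction and probabilistic product share a kernel
        out = 1.0
        for v in vals:
            out *= v
        return out
    if node.op == "OR":                  # logical disjunction over {0, 1} values
        return float(any(v > 0 for v in vals))
    if node.op == "SUM":                 # probabilistic weighted mixture (weights assumed normalized)
        return sum(w * v for w, v in zip(node.weights, vals))
    raise ValueError(f"unknown op: {node.op}")
```

The sketch only illustrates why logical AND/OR and probabilistic sum/product nodes can share one traversal structure, which is what REASON's reconfigurable tree-based PEs exploit in hardware. For a true DAG with shared subgraphs, the recursion would be memoized so each node is evaluated once.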
This paper, therefore, makes the following contributions:
- We conduct a systematic workload characterization of representative logical- and probabilistic-reasoning-based neuro-symbolic AI models, identifying key performance bottlenecks and architectural optimization opportunities (Sec. II, Sec. III).
- We propose REASON, an integrated co-design framework, to efficiently accelerate probabilistic logical reasoning in neuro-symbolic AI, enabling practical and scalable deployment of compositional intelligence (Fig. 4).
- REASON introduces cross-layer innovations spanning (i) a unified DAG representation with adaptive pruning at the algorithm level (Sec. IV), (ii) a reconfigurable symbolic/probabilistic architecture and compiler-driven dataflow and mapping at the hardware level (Sec. V), and (iii) a programmable system interface with a multi-level execution pipeline at the system level (Sec. VI) to improve neuro-symbolic efficiency.
- Evaluated across cognitive tasks, REASON enables flexible support for symbolic and probabilistic operations, achieving 12-50$\times$ speedup and 310-681$\times$ energy efficiency compared to desktop and edge GPUs. REASON enables fast and efficient logical and probabilistic reasoning at 0.8 s per task within a 6 mm$^2$ area and 2.12 W power budget (Sec. VII).
## II Neuro-Symbolic AI Systems
This section presents the preliminaries of neuro-symbolic AI with its algorithm flow (Sec. II-A), scaling performance analysis (Sec. II-B), and key computational primitives (Sec. II-C).
Figure 1: Neuro-symbolic algorithmic flow and operations. The neural module serves as a perceptual and intuitive engine for representation learning, while the symbolic module performs structured logical reasoning with probabilistic inference. This compositional pipeline enables complex cognitive tasks across diverse scenarios.
### II-A Neuro-Symbolic Cognitive Intelligence
LLMs and DNNs excel at natural language understanding and image recognition. However, they remain prone to factual errors, hallucinations, challenges in complex multi-step reasoning, and vulnerability to out-of-distribution or adversarial inputs. Their black-box nature also impedes interpretability and formal verification, undermining trust in safety-critical domains. These limitations motivate the development of compositional systems that integrate neural models with symbolic and probabilistic reasoning to achieve greater robustness, transparency, and intelligence.
Neuro-symbolic AI represents a paradigm shift toward more integrated and trustworthy intelligence by combining neural, symbolic, and probabilistic techniques. This hybrid approach has shown superior performance in abstract deduction [81, 12], complex question answering [38, 41], and logical reasoning [66, 55]. By learning from limited data and producing transparent, verifiable outputs, neuro-symbolic systems provide a foundation for cognitive intelligence. Fig. 1 presents a unified neuro-symbolic pipeline, illustrating how its components collaborate to solve complex tasks.
Figure 2: Scaling performance and efficiency. (a)-(c) Task accuracy of compositional LLM-symbolic models (C) and monolithic LLMs (M - shown in gray) across model sizes on complex reasoning, mathematical reasoning, and question-answering tasks. (d) Runtime efficiency comparison between LLM-symbolic models and RL-based CoT models on mathematical reasoning tasks [76].
Neural module. The neural module serves as the perceptual and intuitive engine, typically DNN or LLM, excelling at processing high-dimensional sensory inputs (e.g., images, audio, text) and converting them into feature representations. It handles perception, feature extraction, and associative learning, providing the abstractions needed for higher-level cognition.
Symbolic module. The symbolic module is the logical core operating on neural abstractions and includes symbolic and probabilistic operations. Logical components apply formal logic, rules, and ontologies for structured reasoning and planning, enabling logically sound solutions. Probabilistic components manage uncertainty by representing knowledge probabilistically, supporting belief updates and decision-making under ambiguity, reflecting a nuanced reasoning model.
Together, these modules form a complementary reasoning hierarchy. The neural module captures the statistical, pattern-matching behavior of learned models, producing rapid but non-verifiable outputs (Fast Thinking), while the symbolic module performs explicit, verifiable reasoning that is structured and reliable (Slow Thinking). The probabilistic module complements both and enables robust planning under ambiguity (Bayesian Thinking). This framework integrates intuitive generalization with deliberate reasoning.
### II-B Scaling Performance Analysis of Neuro-Symbolic Systems
Scaling performance analysis. Neuro-symbolic AI systems exhibit superior reasoning capability and scaling behavior compared to monolithic LLMs on complex tasks. We compare representative neuro-symbolic systems against monolithic LLMs across complex reasoning, mathematical reasoning, and question-answering benchmarks (Fig. 2 (a)-(c)). The results reveal two advantages. First, higher accuracy: compositional neuro-symbolic models consistently outperform monolithic LLMs of comparable size. Second, improved scaling efficiency: smaller neuro-symbolic models are sufficient to match or exceed the performance of significantly larger closed-source LLMs. Together, these results highlight the potential scaling limitations of monolithic LLMs and the efficiency benefits of compositional neuro-symbolic reasoning.
Comparison with RL-based reasoning models. Beyond monolithic LLMs, recent advancements in reinforcement learning (RL) and chain-of-thought (CoT) prompting improve LLM reasoning accuracy but incur significant computational and scalability overheads (Fig. 2 (d)). First, computational cost: RL-based reasoning often requires hundreds to thousands of LLM queries per decision step, resulting in prohibitively high inference latency and energy consumption. Second, scalability: task-specific fine-tuning constrains generality, whereas neuro-symbolic models invoke symbolic and probabilistic reasoning modules or tools without retraining. Fig. 2 (d) reveals that the neuro-symbolic model AlphaGeometry [66] achieves over $2\times$ efficiency gains and superior data efficiency compared to CoT-based LLMs on mathematical reasoning tasks.
### II-C Computational Primitives in Neuro-Symbolic AI
We identify the core computational primitives that are commonly used in neuro-symbolic AI systems (Fig. 1). While neural modules rely on DNNs or LLMs for perception and representation learning, the symbolic and probabilistic components implement structured reasoning. In particular, logical reasoning is typically realized through First-Order Logic (FOL) and Boolean Satisfiability (SAT), probabilistic reasoning through Probabilistic Circuits (PCs), and sequential reasoning through Hidden Markov Models (HMMs). Together, these primitives form the algorithmic foundation of neuro-symbolic systems that integrate learning, logic, and uncertainty-aware inference.
First-Order Logic (FOL) and Boolean Satisfiability (SAT). FOL provides a formal symbolic language for representing structured knowledge using predicates, functions, constants, variables, and quantifiers ($\forall,\exists$), combined with logical connectives. For instance, the statement "every student has a mentor" can be expressed as $\forall x\bigl(\mathrm{Student}(x)\to\exists y\,(\mathrm{Mentor}(y)\wedge\mathrm{hasMentor}(x,y))\bigr)$, where predicates encode properties and relations over domain elements. FOL semantics map symbols to domain objects and relations, enabling precise and interpretable logical reasoning. SAT operates over propositional logic and asks whether a conjunctive normal form (CNF) formula $\varphi=\bigwedge_{i=1}^{m}\Bigl(\bigvee_{j=1}^{k_{i}}l_{ij}\Bigr)$ admits a satisfying assignment, where each literal $l_{ij}$ is a Boolean variable or its negation. Modern SAT solvers extend the DPLL algorithm with conflict-driven clause learning (CDCL), incorporating non-chronological backtracking and clause learning to improve scalability [40, 33]. Cube-and-conquer further parallelizes the search by splitting it into "cube" subproblems explored by DPLL and handing them to concurrent CDCL "conquer" solvers [13, 67]. Together, FOL's expressive representations and SAT's solving mechanisms form the logic backbone of neuro-symbolic systems, enabling exact logical inference alongside neural learning.
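The DPLL backbone that CDCL solvers extend can be sketched in a few lines. This is a deliberately minimal version with unit propagation and chronological backtracking only, omitting the clause learning and non-chronological backtracking of modern solvers; clauses follow the DIMACS convention of signed integers:

```python
def dpll(clauses, assignment=None):
    """Minimal DPLL SAT search: simplify, unit-propagate, then branch.
    A clause is a list of nonzero ints; -v denotes the negation of variable v.
    Returns a satisfying assignment {var: bool} or None if unsatisfiable."""
    if assignment is None:
        assignment = {}
    # Simplify clauses under the current partial assignment.
    simplified = []
    for clause in clauses:
        if any(assignment.get(abs(l)) == (l > 0) for l in clause):
            continue                       # clause already satisfied
        rest = [l for l in clause if abs(l) not in assignment]
        if not rest:
            return None                    # conflict: clause falsified
        simplified.append(rest)
    if not simplified:
        return assignment                  # all clauses satisfied
    # Unit propagation: a one-literal clause forces its variable's value.
    for clause in simplified:
        if len(clause) == 1:
            l = clause[0]
            return dpll(simplified, {**assignment, abs(l): l > 0})
    # Branch on the first unassigned variable, backtracking chronologically.
    v = abs(simplified[0][0])
    for choice in (True, False):
        result = dpll(simplified, {**assignment, v: choice})
        if result is not None:
            return result
    return None
```

For example, `dpll([[1, 2], [-1, 2], [-2, 3]])` returns a satisfying assignment, while `dpll([[1], [-1]])` returns `None`. The recursive search tree and per-clause reductions here are exactly the irregular, branch-heavy control flow that maps poorly onto GPUs.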
| Representative Neuro-Symbolic Workloads | AlphaGeometry [66] | R$^2$-Guard [20] | GeLaTo [82] | Ctrl-G [83] | NeuroPC [6] | LINC [52] |
| --- | --- | --- | --- | --- | --- | --- |
| Application | Math theorem proving & reasoning | Unsafety detection | Constrained text generation | Interactive text editing, text infilling | Classification | Logical reasoning, deductive reasoning |
| Advantage vs. LLM | Higher deductive reasoning, higher generalization | Higher LLM resilience, higher data efficiency, effective adaptability | Guaranteed constraint satisfaction, higher generalization | Guaranteed constraint satisfaction, higher generalization | Enhanced interpretability, theoretical guarantee | Higher precision, reduced overconfidence, higher scalability |
| Neural kernel | LLM | LLM | LLM | LLM | DNN | LLM |
| Symbolic kernel | First-order logic, SAT solver, acyclic graph | First-order logic, probabilistic circuit, hidden Markov model | First-order logic, SAT solver, hidden Markov model | Hidden Markov model, probabilistic circuits | First-order logic, probabilistic circuit | First-order logic, solver |

TABLE I: Representative neuro-symbolic workloads. Selected neuro-symbolic workloads used in our analysis, spanning diverse application domains, deployment scenarios, and neural-symbolic computation patterns.
Figure 3: End-to-end neuro-symbolic workload characterization. (a) Benchmarking six neuro-symbolic workloads (AlphaGeometry, R$^{2}$-Guard, GeLaTo, Ctrl-G, NeuroPC, LINC) on a CPU+GPU system, showing that symbolic and probabilistic kernels may become system bottlenecks. (b) Benchmarking neuro-symbolic workloads on tasks of different scales, indicating that real-time performance cannot be met and revealing potential efficiency issues. (c) Benchmarking on A6000 and Orin GPUs. (d) Roofline analysis, indicating that symbolic and probabilistic kernels are severely memory-bound.
Probabilistic Circuits (PC). PCs represent tractable probabilistic models over variables $\mathbf{X}$ as directed acyclic graphs [30, 22, 32]. Each node $n$ performs a probabilistic computation: leaf nodes specify primitive distributions $f_{n}(x)$ , while interior nodes combine their children $ch(n)$ via
$$
p_{n}(x)=\begin{cases}f_{n}(x),&\text{if }n\text{ is a leaf node}\\
\prod_{c\in\mathrm{ch}(n)}p_{c}(x),&\text{if }n\text{ is a product node}\\
\sum_{c\in\mathrm{ch}(n)}\theta_{n,c}p_{c}(x),&\text{if }n\text{ is a sum node}\end{cases} \tag{1}
$$
where $\theta_{n,c}$ denotes the non-negative weight associated with child $c$ . This recursive structure guarantees exact inference (e.g., marginals, conditionals) in time linear in circuit size. PCs' combination of expressiveness and tractable computation makes them an ideal probabilistic backbone for neuro-symbolic systems, where neural modules learn circuit parameters while symbolic engines perform probabilistic reasoning.
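As a concrete illustration, Eq. 1 can be evaluated in one bottom-up pass over the circuit; the sketch below (the `Node` class and layout are our own illustrative assumptions, not REASON's implementation) shows the linear-time recursion:

```python
# Minimal bottom-up evaluation of a probabilistic circuit (Eq. 1).
# Node layout and the `Node` class are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Node:
    kind: str                    # "leaf", "product", or "sum"
    children: list = field(default_factory=list)
    weights: list = field(default_factory=list)  # theta_{n,c} for sum nodes
    leaf_fn: callable = None     # f_n(x) for leaf nodes

def evaluate(node, x):
    """Compute p_n(x) recursively; one visit per node/edge."""
    if node.kind == "leaf":
        return node.leaf_fn(x)
    child_vals = [evaluate(c, x) for c in node.children]
    if node.kind == "product":
        p = 1.0
        for v in child_vals:
            p *= v
        return p
    # sum node: weighted mixture over children
    return sum(w * v for w, v in zip(node.weights, child_vals))

# Tiny circuit: p(x) = 0.3*f1(x) + 0.7*f2(x)*f3(x)
f1 = Node("leaf", leaf_fn=lambda x: 0.5)
f2 = Node("leaf", leaf_fn=lambda x: 0.4)
f3 = Node("leaf", leaf_fn=lambda x: 0.9)
prod = Node("product", children=[f2, f3])
root = Node("sum", children=[f1, prod], weights=[0.3, 0.7])
print(evaluate(root, x=None))    # 0.3*0.5 + 0.7*0.36 = approx. 0.402
```

For a DAG with shared children, the same recursion would additionally memoize per-node results to stay linear in circuit size.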
Hidden Markov Model (HMM). HMMs are probabilistic models for sequential data [43], where a system evolves through hidden states governed by the first-order Markov property: the state at time step $t$ depends only on the state at time step $t-1$ . Each hidden state emits observations according to a probability distribution. The joint distribution over a sequence of hidden states $z_{1:T}$ and observations $x_{1:T}$ is given by
$$
p(z_{1:T},x_{1:T})=p(z_{1})p(x_{1}\mid z_{1})\prod_{t=2}^{T}p(z_{t}\mid z_{t-1})p(x_{t}\mid z_{t}) \tag{2}
$$
where $p(z_{1})$ is the initial state distribution, $p(z_{t}\mid z_{t-1})$ the transition probability, and $p(x_{t}\mid z_{t})$ the emission probability. HMMs naturally support sequential inference tasks such as filtering, smoothing, and decoding, enabling temporal reasoning in neuro-symbolic pipelines.
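Filtering over Eq. 2 reduces to the standard forward recursion; a self-contained sketch with a hypothetical two-state model (the toy parameters are illustrative assumptions):

```python
# Forward (filtering) recursion over Eq. 2; the toy two-state HMM
# below is an illustrative assumption, not a model from the paper.
def forward(pi, A, B, obs):
    """alpha[i] accumulates p(x_{1:t}, z_t = i) for an HMM with initial
    distribution pi, transition matrix A, and emission matrix B."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    for x_t in obs[1:]:
        alpha = [sum(alpha[j] * A[j][i] for j in range(n)) * B[i][x_t]
                 for i in range(n)]
    return sum(alpha)            # p(x_{1:T}), marginalized over z_T

pi = [0.6, 0.4]
A  = [[0.7, 0.3], [0.4, 0.6]]    # p(z_t | z_{t-1})
B  = [[0.9, 0.1], [0.2, 0.8]]    # p(x_t | z_t), two observation symbols
print(forward(pi, A, B, obs=[0, 1, 0]))   # p(x_{1:T}) = approx. 0.1089
```

Smoothing and decoding follow the same message-passing pattern, with a backward pass or a max-sum recursion replacing the sum.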
## III Neuro-Symbolic Workload Characterization
This section characterizes the system behavior of various neuro-symbolic workloads (Sec. III-A - III-B) and provides workload insights for computer architects (Sec. III-C - III-D).
Profiling workloads. To conduct comprehensive profiling analysis, we select six state-of-the-art representative neuro-symbolic workloads, as listed in Tab. I, covering a diverse range of applications and underlying computational patterns.
Profiling setup. We profile and analyze the selected neuro-symbolic models in terms of runtime, memory, and compute operators, using cProfile for latency measurement and NVIDIA Nsight for kernel-level profiling and analysis. Experiments are conducted on a system with an NVIDIA A6000 GPU, Intel Sapphire Rapids CPUs, and DDR5 DRAM. Our software environment includes PyTorch 2.5 and JAX 0.4.6. We also conduct profiling on Jetson Orin [49] for edge deployment scenarios. We track control and data flow by analyzing the profiling results in trace view and graph execution format.
### III-A Compute Latency Analysis
Figure 4: REASON overview. REASON is an integrated acceleration framework for probabilistic logical reasoning grounded neuro-symbolic AI with the goal to achieve efficient and scalable agentic cognition. REASON addresses the challenges of irregular compute and memory, symbolic and probabilistic latency bottleneck, and hardware underutilization, by proposing methodologies including unified DAG representation, reconfigurable PE, efficient dataflow, mapping, scalable architecture, two-level parallelism and programming interface. REASON is deployed across cognitive tasks and consistently demonstrates performance-efficiency improvements for compositional neuro-symbolic systems.
Latency bottleneck. We characterize the latency of representative neuro-symbolic workloads (Fig. 3 (a)). Compared to neuro kernels, symbolic and probabilistic kernels are not negligible in latency and may become system bottlenecks. For instance, the neural (symbolic) components account for 36.2% (63.8%), 37.3% (62.7%), 63.4% (36.6%), 36.1% (63.9%), 49.5% (50.5%), and 65.2% (34.8%) of runtime in AlphaGeometry, R$^{2}$-Guard, GeLaTo, Ctrl-G, NeuroPC, and LINC, respectively. Symbolic kernels dominate AlphaGeometry's runtime, and probabilistic kernels dominate R$^{2}$-Guard's and Ctrl-G's, due to highly irregular memory access, warp divergence, thread underutilization, and limited execution parallelism. FLOPS and latency measurements further highlight this inefficiency. Notably, when using a smaller LLM (LLaMA-7B) for GeLaTo and LINC, overall accuracy remains stable, but symbolic latency rises to 69.0% and 65.5%, respectively. We observe consistent trends on the Orin NX-based platform (Fig. 3 (c)). Symbolic components account for 63.8% of AlphaGeometry runtime on A6000 while contributing only 19.3% of FLOPS, indicating inefficient hardware utilization.
Latency scalability. We evaluate runtime across reasoning tasks of varying difficulty and scale (Fig. 3 (b)). We observe that the relative runtime distribution between neural and symbolic components remains consistent for a given workload across task sizes. Total runtime increases with task complexity and scale. While LLM kernels scale efficiently due to their tensor-based, GPU-friendly inference, logical and probabilistic kernels scale poorly due to the exponential growth of the search space, making them slower than monolithic LLMs.
### III-B Roofline & Symbolic Operation & Inefficiency Analysis
Memory-bounded operation. Fig. 3 (d) presents a roofline analysis of GPU memory bandwidth versus compute efficiency. We observe that the symbolic and probabilistic components are typically memory-bound, limiting performance efficiency. For example, R$^{2}$-Guard's probabilistic circuits use sparse, scattered accesses for marginalization, and Ctrl-G's HMM iteratively reads and writes state probabilities. Low compute per element makes these workloads constrained by memory access, underutilizing GPU compute resources.
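The memory-bound behavior can be sanity-checked with roofline arithmetic; a sketch, where the peak-compute and bandwidth numbers are illustrative placeholders rather than measured A6000 figures:

```python
# Back-of-the-envelope roofline: attainable performance is
# min(peak compute, bandwidth * operation intensity).
# Peak and bandwidth values below are illustrative placeholders.
def attainable_tflops(intensity, peak_tflops=38.7, bw_tbps=0.768):
    """intensity in FLOPs/byte; bandwidth in TB/s -> TFLOPS."""
    return min(peak_tflops, bw_tbps * intensity)

# Sparse matvec in fp32: ~2 FLOPs per nonzero against ~12 bytes touched
# (4 B value + 4 B column index + 4 B gathered vector entry).
spmv_intensity = 2.0 / 12.0
print(attainable_tflops(spmv_intensity))  # ~0.13 TFLOPS: far below peak
```

At roughly 0.17 FLOPs/byte, the kernel sits deep in the bandwidth-limited region of the roofline, which is consistent with the low compute throughput observed in Tab. II.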
TABLE II: Hardware inefficiency analysis. The compute, memory, and communication characteristics of representative neural, symbolic, and probabilistic kernels executed on CPU/GPU platform.
| Metric | MatMul (Neural) | Softmax (Neural) | Sparse MatVec (Symbolic) | Logic (Symbolic) | Marginal (Probabilistic) | Bayesian (Probabilistic) |
| --- | --- | --- | --- | --- | --- | --- |
| **Compute Efficiency** | | | | | | |
| Compute Throughput (%) | 96.8 | 62.2 | 32.5 | 14.7 | 35.0 | 31.1 |
| ALU Utilization (%) | 98.4 | 72.0 | 43.9 | 29.3 | 48.5 | 52.8 |
| **Memory Behavior** | | | | | | |
| L1 Cache Throughput (%) | 82.4 | 58.0 | 27.1 | 20.6 | 32.4 | 37.1 |
| L2 Cache Throughput (%) | 41.7 | 27.6 | 18.3 | 12.4 | 24.2 | 27.5 |
| L1 Cache Hit Rate (%) | 88.5 | 85.0 | 53.6 | 37.0 | 42.4 | 40.7 |
| L2 Cache Hit Rate (%) | 73.4 | 66.7 | 43.9 | 32.7 | 50.2 | 47.6 |
| DRAM BW Utilization (%) | 39.8 | 28.6 | 57.4 | 70.3 | 60.8 | 68.0 |
| **Control Divergence and Scheduling** | | | | | | |
| Warp Execution Efficiency (%) | 96.3 | 94.1 | 48.8 | 54.0 | 59.3 | 50.6 |
| Branch Efficiency (%) | 98.0 | 98.7 | 60.0 | 58.1 | 63.4 | 66.9 |
| Eligible Warps/Cycle (%) | 7.2 | 7.0 | 2.4 | 2.1 | 2.8 | 2.5 |
Hardware inefficiency analysis. We leverage Nsight Systems and Nsight Compute [51, 50] to analyze the computational, memory, and control irregularity of neural, symbolic, and probabilistic kernels, as listed in Tab. II. We observe that: First, compute throughput and ALU utilization: neural kernels achieve high throughput and ALU utilization, while symbolic/probabilistic kernels have low throughput and idle ALUs. Second, memory access and cache utilization: neural kernels see high L1 cache hit rates; symbolic kernels incur cache misses and stalls, and probabilistic kernels face high memory pressure. Third, DRAM bandwidth (BW) utilization and data movement overhead: neural workloads use on-chip caches with minimal DRAM usage, but symbolic/probabilistic workloads are DRAM-bound with heavy random-access overhead.
Sparsity analysis. We observe high, heterogeneous, irregular, and data-dependent sparsity across neuro-symbolic workloads. Symbolic and probabilistic kernels are often extremely sparse, exhibiting on average 82%, 87%, 75%, 83%, 89%, and 83% sparsity across the six representative neuro-symbolic workloads, respectively, with many sparse computational paths arising from low activation or probability mass. This observation motivates our adaptive DAG pruning (Sec. IV-B).
### III-C Unique Characteristics of Neuro-Symbolic vs LLMs
In summary, neuro-symbolic workloads exhibit distinct characteristics compared to monolithic LLMs in compute kernels, memory behavior, dataflow, and performance scaling.
Compute kernels. LLMs are dominated by regular, highly parallel tensor operations well suited to GPUs. In contrast, neuro-symbolic workloads comprise heterogeneous symbolic and probabilistic kernels with irregular control flow, low arithmetic intensity, and poor cache locality, leading to low GPU utilization and frequent performance bottlenecks.
Memory behavior. Symbolic and probabilistic kernels are primarily memory-bound, operating over large, sparse, and irregular data structures. Probabilistic reasoning further increases memory pressure through large intermediate state caching, creating challenging trade-offs between latency, bandwidth, and on-chip storage.
Dataflow and parallelism. Neuro-symbolic workloads exhibit dynamic and tightly coupled data dependencies. Symbolic and probabilistic computations often depend on neural outputs or require compilation into LLM-compatible structures, resulting in serialized execution, limited parallelism, and amplified end-to-end latency.
Performance scaling. LLMs scale efficiently across GPUs via optimized data and model parallelism. In contrast, symbolic workloads are difficult to parallelize due to recursive control dependencies, while probabilistic kernels incur substantial inter-node communication, limiting scalability on multi-GPU systems.
### III-D Identified Opportunities for Neuro-Symbolic Optimization
While neuro-symbolic systems show promise, improving their efficiency is critical for real-time and scalable deployment. Guided by the profiling insights above, we introduce REASON (Fig. 4), an algorithm-hardware co-design framework for accelerated probabilistic logical reasoning in neuro-symbolic AI. Algorithmically, a unified representation with adaptive pruning reduces memory footprint (Sec. IV). In hardware architecture, a flexible architecture and dataflow support various symbolic and probabilistic operations (Sec. V). REASON further provides adaptive scheduling and orchestration of heterogeneous LLM-symbolic agentic workloads through a programmable interface (Sec. VI). Across reasoning tasks, REASON consistently boosts performance, efficiency, and accuracy (Sec. VII).
## IV REASON: Algorithm Optimizations
This section introduces the algorithmic optimizations in REASON for symbolic and probabilistic reasoning kernels. We present a unified DAG-based computational representation (Sec. IV-A), followed by adaptive pruning (Sec. IV-B) and regularization techniques (Sec. IV-C) that jointly reduce model complexity and enable efficient neuro-symbolic systems.
### IV-A Stage 1: DAG Representation Unification
Motivation. Despite addressing different reasoning goals, symbolic and probabilistic reasoning kernels often share common underlying computational patterns. For instance, logical deduction in FOL, constraint propagation in SAT, and marginal inference in PCs all rely on iterative graph-based computations. Capturing this shared structure is essential to system acceleration. DAGs provide a natural abstraction to unify these diverse kernels under a flexible computational model.
| Kernel | Node Semantics | Edge Semantics | Inference Procedure |
| --- | --- | --- | --- |
| SAT/FOL | Literals and logical operators | Logical dependencies between literals, clauses, and formulas | Search and deduction via traversal (DPLL/CDCL) |
| PC | Primitive distributions, sum and product nodes | Weighted dependencies encoding probabilistic factorization | Bottom-up probability aggregation and top-down flow propagation |
| HMM | Hidden state variables at each time step | State transition and emission dependencies | Sequential message passing (forward-backward, decoding) |
Figure 5: Unified DAG representations of neuro-symbolic kernels. Logical (SAT/FOL), probabilistic (PC), and sequential (HMM) reasoning are expressed using DAG abstraction. Nodes represent atomic reasoning operations, edges encode dependency structure, and graph traversals implement inference procedures. This unification enables shared compilation, pruning, and hardware mapping in REASON.
Methodology. We unify symbolic and probabilistic reasoning kernels under a DAG abstraction, where each node represents an atomic reasoning operation and each directed edge encodes a data/control dependency (Fig. 5). This representation enables a uniform compilation flow (construction, transformation, and scheduling) across heterogeneous kernels (logical deduction, constraint solving, probabilistic aggregation, and sequential message passing), and serves as the algorithmic substrate for subsequent pruning and regularization.
#### For FOL and SAT solvers
DAG nodes represent variables and logical connectives, with edges indicating dependencies between literals and clauses. We represent a propositional CNF formula $\varphi=\bigwedge_{i=1}^{m}\Bigl(\bigvee_{j=1}^{k_{i}}l_{ij}\Bigr)$ as a DAG with three layers: literal nodes for each literal $l_{ij}$ , clause nodes implementing disjunction over the literals in $\bigvee_{j}l_{ij}$ , and a formula node implementing conjunction over clauses $\bigwedge_{i}$ . In SAT, the DAG captures the branching and conflict-resolution structure of DPLL/CDCL procedures. In FOL, formulas are encoded as DAGs where inference rules act as graph transformation operators that derive contradictions through node and edge expansion. The compiler converts FOL and SAT inputs (clauses in CNF or quantifier-free predicates) into DAGs via: Step-1 Normalization: predicates are transformed into CNF, removing quantifiers and forming disjunctions of literals. Step-2 Node creation: each literal becomes a leaf node, each clause an OR node over its literals, and the formula an AND node over clauses. Step-3 Edge encoding: edges capture dependencies (literal $\rightarrow$ clause $\rightarrow$ formula), while watch-lists are attached as metadata.
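The three compilation steps above can be sketched as follows; the DIMACS-style signed-integer literal encoding and the flat node/edge records are illustrative assumptions:

```python
# Build the three-layer literal -> clause -> formula DAG from CNF
# (Steps 1-3). Clauses are lists of signed ints (DIMACS-style),
# an encoding assumed here for illustration.
def cnf_to_dag(clauses):
    nodes, edges = [], []
    formula = len(nodes); nodes.append(("AND", None))   # formula root
    for clause in clauses:
        c = len(nodes); nodes.append(("OR", None))      # clause node
        edges.append((c, formula))                      # clause -> formula
        for lit in clause:
            l = len(nodes); nodes.append(("LIT", lit))  # leaf literal
            edges.append((l, c))                        # literal -> clause
    # watch-list metadata: first two literals of each clause
    watches = {i: cl[:2] for i, cl in enumerate(clauses)}
    return nodes, edges, watches

# (x1 v ~x2) ^ (x2 v x3)
nodes, edges, watches = cnf_to_dag([[1, -2], [2, 3]])
print(len(nodes), len(edges))   # 7 nodes, 6 edges
```

The watch-lists ride along as metadata rather than graph structure, mirroring how the compiler keeps BCP bookkeeping separate from the dependency edges.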
#### For PCs
DAG nodes correspond to sum (mixture) or product (factorization) operations $p_{n}(x)$ over an input $x$ (an assignment to variables $\mathbf{X}$ ), with children $\mathrm{ch}(n)$ . Leaves represent primitive distributions $f_{n}(x)$ . Edges model conditional dependencies. The DAG structure facilitates efficient inference through bottom-up probability evaluation, exploiting structural independence and enabling effective pruning and memoization during probability queries (Eq. 1). The compiler converts PCs into DAGs through: Step-1 Graph extraction: nodes represent random variables, factors, or sub-circuits parsed from expressions such as $p_{n}(x)$ . Step-2 Node typing: arithmetic operators map to sum nodes for marginalization and product nodes for factor conjunction, while leaf nodes store constants or probabilities.
#### For HMMs
The unrolled DAG spans time steps, with nodes representing transition factors $p(z_{t}|z_{t-1})$ and emission factors $p(x_{t}|z_{t})$ (Eq. 2), and edges connecting factors across adjacent time steps to reflect the Markov dependency. Sequential inference (filtering/smoothing/decoding) becomes structured message passing on this DAG: each step aggregates contributions from predecessor states through transition factors and then applies emission factors. The compiler converts HMMs into DAGs through: Step-1 Sequence unrolling: each time step becomes a DAG layer, representing states and transitions. Step-2 Node mapping: product nodes combine transition and emission probabilities; sum nodes aggregate over prior states.
The unified DAG abstraction lays the algorithmic foundation for subsequent pruning, regularization, and hardware mapping, supporting efficient acceleration of neuro-symbolic workloads.
Figure 6: Overview of the REASON hardware acceleration system. (a) Integration of REASON as a GPU co-processor. (b) REASON plug-in architecture with PEs, shared local memory, and global scheduling. (c) Tree-based PE architecture enabling broadcast, reduction, and irregular DAG execution. (d) Micro-architecture of a tree node supporting arithmetic and logical operations. (e) FIFO and memory layout supporting symbolic reasoning.
### IV-B Stage 2: Adaptive DAG Pruning
Motivation. While the unified DAG representation provides a common abstraction, it may contain significant redundancy, such as logically implied literals, inactive substructures, or low-probability paths, that inflates DAG size and degrades performance without improving inference quality.
Methodology. We propose adaptive DAG pruning, a semantics-preserving optimization that identifies and removes redundant paths in symbolic and probabilistic DAGs. For symbolic kernels, pruning targets literals and clauses that are logically redundant. For probabilistic kernels, pruning eliminates low-activation edges that minimally impact inference. This process significantly reduces model size and computational complexity while preserving correctness of logical and probabilistic inference.
#### Pruning of FOL and SAT via implication graph
For SAT solvers and FOL reasoning, we prune redundant literals using implication graphs. Given a CNF formula $\varphi=\bigwedge_{i}\left(\bigvee_{j}l_{ij}\right)$ , each binary clause $(l\lor l^{\prime})$ induces two directed implication edges: $\bar{l}\rightarrow l^{\prime}$ and $\bar{l^{\prime}}\rightarrow l$ . The resulting implication graph captures logical dependencies among literals. We perform a depth-first traversal to compute reachability relationships between literals. If a literal $l^{\prime}$ is always implied by another literal $l$ , then $l^{\prime}$ is a hidden literal. Clauses containing both $l$ and $l^{\prime}$ can safely drop $l^{\prime}$ , reducing clause width without semantic changes. For instance, a clause $C=(l\lor l^{\prime})$ is reduced to $C^{\prime}=(l)$ . This procedure removes redundant literals (e.g., hidden tautologies and failed literals), preserves satisfiability, and runs in time linear in the size of the implication graph.
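The procedure above can be sketched in a few dozen lines of C++, assuming signed DIMACS-style literals and following the paper's stated rule (drop $l^{\prime}$ from a clause when another literal $l$ in it implies $l^{\prime}$). The `ImplicationGraph` and `prune_clause` names are illustrative, not REASON's API:

```cpp
// Hidden-literal pruning over an implication graph (illustrative sketch).
// Literal encoding: variable v >= 1; literal +v -> 2v, literal -v -> 2v+1.
#include <vector>
#include <functional>
#include <cassert>

using Lit = int;                            // signed DIMACS-style literal
static int enc(Lit l) { return l > 0 ? 2 * l : -2 * l + 1; }
static Lit neg(Lit l) { return -l; }

struct ImplicationGraph {
    std::vector<std::vector<int>> adj;      // adjacency over encoded literals
    explicit ImplicationGraph(int num_vars) : adj(2 * num_vars + 2) {}

    // Each binary clause (l v l') induces edges !l -> l' and !l' -> l.
    void add_binary_clause(Lit l, Lit lp) {
        adj[enc(neg(l))].push_back(enc(lp));
        adj[enc(neg(lp))].push_back(enc(l));
    }

    // Depth-first reachability: does assigning l force l'?
    bool implies(Lit l, Lit lp) const {
        std::vector<bool> seen(adj.size(), false);
        std::function<bool(int)> dfs = [&](int u) {
            if (u == enc(lp)) return true;
            if (seen[u]) return false;
            seen[u] = true;
            for (int v : adj[u]) if (dfs(v)) return true;
            return false;
        };
        return dfs(enc(l));
    }
};

// Drop hidden literals: if some other literal in the clause implies cand,
// cand is redundant and can be removed without changing satisfiability.
std::vector<Lit> prune_clause(const ImplicationGraph& g, std::vector<Lit> clause) {
    std::vector<Lit> out;
    for (Lit cand : clause) {
        bool hidden = false;
        for (Lit other : clause)
            if (other != cand && g.implies(other, cand)) { hidden = true; break; }
        if (!hidden) out.push_back(cand);
    }
    return out;
}
```

With the single binary clause $(\bar{x}_{1}\lor x_{2})$, the graph contains the edge $x_{1}\rightarrow x_{2}$, so $x_{2}$ is dropped from the clause $(x_{1}\lor x_{2})$, mirroring the reduction from $C=(l\lor l^{\prime})$ to $C^{\prime}=(l)$ in the text.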
#### Pruning of PCs and HMMs via circuit flow
For probabilistic DAGs such as PCs and HMMs, we prune edges based on probability flow, which quantifies each edge's contribution to the overall likelihood.
In HMMs, the DAG is unrolled over time steps, with nodes representing transition factors $p(z_{t}\mid z_{t-1})$ and emission factors $p(x_{t}\mid z_{t})$ . We compute expected transition and emission usage via the forward-backward algorithm, yielding posterior state and transition probabilities. Edges corresponding to transitions or emissions with consistently low posterior probability are pruned, as their contribution to the joint likelihood $p(z_{1:T},x_{1:T})$ is negligible. This pruning preserves inference fidelity while reducing state-transition complexity.
In PCs, a sum node $n$ computes $p_{n}(x)=\sum_{c\in\mathrm{ch}(n)}\theta_{n,c}\,p_{c}(x)$ , where $\theta_{n,c}\geq 0$ denotes the weight associated with child $c$ . For an input $x$ , we define the circuit flow through edge $(n,c)$ as $F_{n,c}(x)=\frac{\theta_{n,c}\,p_{c}(x)}{p_{n}(x)}\cdot F_{n}(x)$ , where $F_{n}(x)$ denotes the top-down flow reaching node $n$ . Intuitively, $F_{n,c}(x)$ measures the fraction of probability mass passing through edge $(n,c)$ for input $x$ . Given a dataset $\mathcal{D}$ , the cumulative flow for edge $(n,c)$ is $F_{n,c}(\mathcal{D})=\sum_{x\in\mathcal{D}}F_{n,c}(x)$ . Edges with the smallest cumulative flow are pruned, as they contribute least to the overall model likelihood. The resulting decrease in average log-likelihood is bounded by $\Delta\log\mathcal{L}\leq\frac{1}{|\mathcal{D}|}\sum_{x\in\mathcal{D}}F_{n,c}(x)$ , providing a theoretically grounded criterion for safe pruning.
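The flow recursion can be sketched on a toy circuit as follows, assuming nodes are stored in topological order (children before parents, root last); the node layout and names are illustrative:

```cpp
// Bottom-up evaluation p_n(x) followed by top-down circuit-flow
// propagation F_n(x) (illustrative sketch).
#include <vector>
#include <cmath>
#include <cassert>

struct Node {
    enum Kind { LEAF, SUM, PROD } kind;
    std::vector<int> ch;                // child indices
    std::vector<double> w;              // sum-edge weights theta_{n,c}
    double value = 0.0;                 // p_n(x), filled bottom-up
    double flow = 0.0;                  // F_n(x), filled top-down
};

void circuit_flow(std::vector<Node>& nodes, const std::vector<double>& leaf_vals) {
    int li = 0;
    for (auto& n : nodes) {             // bottom-up: topological order
        if (n.kind == Node::LEAF) n.value = leaf_vals[li++];
        else if (n.kind == Node::SUM) {
            n.value = 0;
            for (size_t k = 0; k < n.ch.size(); ++k)
                n.value += n.w[k] * nodes[n.ch[k]].value;
        } else {
            n.value = 1;
            for (int c : n.ch) n.value *= nodes[c].value;
        }
        n.flow = 0.0;
    }
    nodes.back().flow = 1.0;            // root flow F_root(x) = 1
    for (int i = (int)nodes.size() - 1; i >= 0; --i) {
        const Node& n = nodes[i];
        for (size_t k = 0; k < n.ch.size(); ++k) {
            // Sum edges carry theta * p_c / p_n of the incoming flow;
            // product nodes pass flow through unchanged.
            double f = (n.kind == Node::SUM)
                ? n.w[k] * nodes[n.ch[k]].value / n.value * n.flow
                : n.flow;
            nodes[n.ch[k]].flow += f;
        }
    }
}
```

For a root sum node with weights $\theta=(0.5,0.5)$ over leaves valued $0.8$ and $0.2$, the edge flows come out as $0.8$ and $0.2$, so the second edge would be the first pruning candidate.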
### IV-C Stage 3: Two-Input DAG Regularization
Methodology. After pruning, the resulting DAGs may still have high fan-in or irregular branching, which hinders efficient hardware execution. To address this, we apply a regularization step that transforms DAGs into a canonical two-input form. Specifically, nodes with more than two inputs are recursively decomposed into balanced binary trees composed of two-input intermediate nodes, preserving the original computation semantics. This normalization promotes uniformity, enabling efficient parallel scheduling, pipelining, and mapping onto REASON architecture, without sacrificing model fidelity or expressive power.
For each symbolic or probabilistic kernel, the compiler generates an initial DAG, applies adaptive pruning, and then performs two-input regularization to produce a unified balanced representation. These DAGs are constructed offline and used to generate an execution binary that is programmed onto REASON hardware. This unification-pruning-regularization flow decouples algorithmic complexity from runtime execution and enables predictable performance.
## V REASON: Hardware Architecture
REASON features a flexible co-processor plug-in architecture (Sec. V-A), reconfigurable symbolic/probabilistic PEs (Sec. V-B), and architectural support for both symbolic and probabilistic kernels (Sec. V-C and V-D). Sec. V-E presents a cycle-by-cycle execution pipeline analysis. Sec. V-F discusses design space exploration and scalability.
### V-A Overall Architecture
Neuro-symbolic workloads exhibit heterogeneous compute and memory patterns with diverse sparsity, diverging from the GEMM-centric design of conventional hardware. Built on the unified DAG representation and optimizations (Sec. IV), REASON is a reconfigurable and flexible architecture designed to efficiently execute the irregular computations of symbolic and probabilistic reasoning stages in neuro-symbolic AI.
Overview. REASON operates as a programmable co-processor tightly integrated with GPU SMs, forming a heterogeneous system architecture (Fig. 6 (a)). In this system, REASON serves as an efficient and reconfigurable "slow thinking" engine, accelerating symbolic and probabilistic kernels that are poorly suited to GPU execution. As illustrated in Fig. 6 (b), REASON comprises an array of tree-based PE cores that act as the primary computation engines. A global controller and workload scheduler manage workload mapping. A shared local memory serves as a unified scratchpad for all cores. Communication between cores and shared memory is handled by a high-bandwidth global interconnect.
Tree architecture. Each PE core is organized as a tree-structured compute engine, as shown in Fig. 6 (c). Each tree node integrates a specialized datapath, memory subsystem, and control logic optimized for executing DAG-based symbolic and probabilistic operations.
Reconfigurable tree engine (RTE). At the core of each PE is a Reconfigurable Tree Engine (RTE), whose datapath forms a bidirectional tree of PEs (Fig. 6 (d)). The RTE supports both SAT-style symbolic broadcast patterns and probabilistic aggregation operations. A Benes network interconnect enables N-to-N routing, decoupling SRAM banking from DAG mapping and simplifying compilation of irregular graph structures (Sec. V-C). Forwarding logic routes intermediate and irregular outputs back to SRAM for subsequent batches.
Memory subsystem. To tackle the memory-bound nature of symbolic and probabilistic kernels, the RTE is backed by a set of dual-port, wide-bitline SRAM banks arranged as a banked L1 cache. A memory front-end with a prefetcher and high-throughput DMA engine moves data from the shared scratchpad. A control/memory management unit (MMU) block handles address translation across the distributed memory system.
Core control and execution. A scalar PE acts as the core-level controller, fetching and issuing VLIW-like instructions that configure the RTE, memory subsystem, and specialized units. Outputs from the RTE are buffered before being consumed by the scalar PE or the SIMD unit, which supports execution of the parallelizable subsets of symbolic solvers.
### V-B Reconfigurable Symbolic/Probabilistic PE
The PE architecture is designed to support a wide range of symbolic and probabilistic computation patterns via a VLIW-driven cycle-reconfigurable datapath. Each PE can switch among three operational modes to efficiently execute heterogeneous kernels mapped from the unified DAG representation.
Probabilistic mode. In probabilistic mode, the node executes irregular DAGs derived from unified probabilistic representations (Sec. V-C). The nodes are programmed by the VLIW instruction stream to perform arithmetic operations, either addition or multiplication, required by the DAG node mapped onto them. This mode supports probabilistic aggregation patterns such as sum-product computation and likelihood propagation, enabling efficient execution of PCs and HMMs.
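This mode can be modeled at the instruction level as a straight-line stream in which each slot configures one node as an adder or a multiplier; the instruction format below is an illustrative assumption, not REASON's ISA:

```cpp
// Instruction-level model of probabilistic mode: each instruction
// programs a node as ADD or MUL over a register file (illustrative).
#include <vector>
#include <cmath>
#include <cassert>

struct Instr { enum Op { ADD, MUL } op; int src0, src1, dst; };

// Execute a straight-line instruction stream over the register file.
void run(const std::vector<Instr>& prog, std::vector<double>& regs) {
    for (const Instr& in : prog) {
        double a = regs[in.src0], b = regs[in.src1];
        regs[in.dst] = (in.op == Instr::ADD) ? a + b : a * b;
    }
}
```

Three such instructions suffice to evaluate a two-child sum node like $0.5\cdot 0.8+0.5\cdot 0.2$: two multiplies followed by one add.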
Figure 7: Compiler-architecture co-design for probabilistic execution. A probabilistic DAG is decomposed, regularized, mapped onto tree-based PEs, and scheduled with pipeline awareness to enable efficient execution of irregular probabilistic kernels in REASON.
Symbolic mode. In symbolic mode, the datapath is repurposed for logical reasoning operations (Sec. V-D). Key hardware components are utilized as follows: (a) The comparator checks logical states for Boolean Constraint Propagation (BCP), identifying literals as TRUE, FALSE, or UNASSIGNED. (b) The adder performs two key functions: address computation, by adding the Clause Base Address and Literal Index to locate the next literal in a clause; and clause evaluation, by acting as a counter that tracks the number of FALSE literals. This enables fast detection of unit clauses and conflicts, accelerating SAT-style symbolic reasoning.
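A behavioral sketch of this clause-evaluation logic, with illustrative encodings for literal values and clause status (a software model, not the RTL):

```cpp
// Comparator + FALSE-literal counter for clause evaluation in
// symbolic mode (illustrative behavioral model).
#include <vector>
#include <cassert>

enum class Val { False, True, Unassigned };            // comparator output
enum class Status { Sat, Unit, Conflict, Unresolved }; // clause status

// assignment[v] holds variable v's value; a negative lit flips polarity.
Val lit_value(const std::vector<Val>& assignment, int lit) {
    Val v = assignment[lit > 0 ? lit : -lit];
    if (v == Val::Unassigned) return Val::Unassigned;
    return ((v == Val::True) == (lit > 0)) ? Val::True : Val::False;
}

// Count FALSE literals (the adder acting as a counter) to detect
// satisfied, unit, and conflicting clauses.
Status evaluate_clause(const std::vector<Val>& assignment,
                       const std::vector<int>& clause) {
    int false_count = 0, unassigned = 0;
    for (int lit : clause) {
        Val v = lit_value(assignment, lit);
        if (v == Val::True) return Status::Sat;        // satisfied: stop early
        if (v == Val::False) ++false_count; else ++unassigned;
    }
    if (false_count == (int)clause.size()) return Status::Conflict;
    if (unassigned == 1) return Status::Unit;          // one literal left
    return Status::Unresolved;
}
```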
SpMSpM mode. The tree-structured PE inherently supports the sparse matrix-matrix multiplication (SpMSpM), a computation pattern widely studied in prior works [24, 45]. In this mode, the leaf nodes are configured as multipliers to compute partial products of the input matrix elements, while the internal nodes are configured as adders to perform hierarchical reductions. This execution pattern allows small-scale neural or neural-symbolic models to be efficiently mapped onto the REASON engine, extending its applicability beyond purely symbolic and probabilistic kernels.
### V-C Architectural Support for Probabilistic Reasoning
Probabilistic reasoning kernels are expressed as DAGs composed of arithmetic nodes (sum and product) connected by data-dependent edges. REASON exploits its pipelined, tree-structured datapath to efficiently map these DAGs onto parallel PEs. Key architectural features include: multi-tree PE mapping for arithmetic DAG execution, a banked register file with flexible crossbar interconnect to support irregular memory access, and compiler-assisted pipeline scheduling with automatic register management to reduce control overhead. Fig. 7 illustrates the overall workflow.
Datapath and pipelined execution. The datapath operates in a pipelined fashion, with each layer of nodes serving as pipeline stages. Each pipeline stage receives inputs from a banked register file, which consists of multiple parallel register banks. Each bank operates independently, providing flexible addressing that accommodates the irregular memory access patterns typical in probabilistic workloads (e.g., PCs, HMMs).
Flexible interconnect. To handle the irregularity in probabilistic DAGs, REASON employs an optimized interconnection. An input Benes crossbar connects the register file banks to inputs of PE trees, allowing flexible and conflict-free routing of operands into computation units. Output connections from PE to register banks are structured as one-bank-one-PE to minimize hardware complexity while preserving flexibility, balancing trade-offs between utilization and performance.
Figure 8: Scalability analysis of interconnect topologies. (a) Normalized latency breakdown as the number of leaf nodes $N$ increases. (b) Normalized broadcast-to-root cycle counts for different PE interconnect structures.
Register management. REASON adopts an automatic write-address generation policy. Data is written to the lowest available register address in each bank, eliminating the need to encode explicit write addresses in instructions. The compiler precisely predicts these write addresses at compile time due to the deterministic execution sequence, further reducing instruction size and energy overhead.
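A behavioral sketch of the lowest-available-address policy, which the compiler can replay deterministically at compile time; the bank size and names are illustrative:

```cpp
// Automatic write-address generation: each bank always allocates the
// lowest free register address (illustrative behavioral model).
#include <vector>
#include <cassert>

struct RegBank {
    std::vector<bool> used;
    explicit RegBank(int n) : used(n, false) {}

    // Allocate the lowest available address; -1 if the bank is full.
    int alloc() {
        for (int a = 0; a < (int)used.size(); ++a)
            if (!used[a]) { used[a] = true; return a; }
        return -1;
    }
    void free_reg(int a) { used[a] = false; }
};
```

Because allocation order is deterministic, a compiler replaying the same alloc/free sequence predicts every write address, so instructions need not encode them.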
Compiler-driven optimization. To efficiently translate unified DAGs into executable kernels and map them onto the hardware datapath, REASON adopts a four-step compiler pipeline (Fig. 7).
Step-1 Block decomposition: The compiler decomposes the unified DAG from Sec. IV into execution blocks through a greedy search that identifies schedulable subgraphs whose maximum depth does not exceed the hardware tree depth. This process maximizes PE utilization while minimizing inter-block dependencies that may cause read-after-write stalls. The resulting tree-like blocks form the basis for efficient mapping.
Step-2 PE mapping: For each block, the compiler jointly assigns nodes to PEs and operands to register banks, considering topological constraints and datapath connectivity. Nodes are mapped to preserve order, while operands are allocated to banks to avoid simultaneous conflicts. The compiler dynamically updates feasible mappings and prioritizes nodes with the fewest valid options. This conflict-aware strategy minimizes bank contention and balances data traffic across banks.
Step-3 Tree mapping: Once block and register mappings are fixed, the compiler constructs physical compute trees that maximize data reuse in the REASON datapath. Node fusion and selective replication enhance operand locality and reduce inter-block communication, allowing intermediate results to be consumed within the datapath and lowering memory traffic.
Step-4 Reordering: The compiler then schedules instructions with awareness of the multi-stage pipeline. Dependent operations are spaced by at least one full pipeline interval, while independent ones are interleaved. Lightweight load, store, and copy operations fill idle cycles without disturbing dependencies. Live-range analysis identifies register pressure and inserts minimal spill and reload instructions when needed.
The DAG-to-hardware mapping is an automated heuristic process that generates a compact VLIW program for REASON. Designers can intervene for design-space exploration, tuning architectural parameters within the flexible hardware template.
### V-D Architectural Support for Symbolic Logical Reasoning
To efficiently support symbolic logical reasoning kernels, REASON features a linked-list-based memory layout and a hardware-managed BCP FIFO mechanism (Fig. 6 (e)), enabling efficient and scalable support for the large-scale solver kernels that are fundamental to logical reasoning.
Watched literals (WLs) unit. The WLs unit acts as a distributed memory controller tightly integrated with $N$ SRAM banks, implementing the two-watched-literals indexing scheme in hardware. This design transforms the primary bottleneck in BCP from a sequential scan over the clause database into a selective parallel memory access problem. Crucially, it enables scalability to industrial-scale SAT problems [44], where only a small subset of clauses (those on the watch list) need to be accessed at any time. This design naturally aligns with a hierarchical memory system, allowing most clauses to reside in remote scratchpad memory or DRAM, with on-chip SRAM caching only the required clauses indexed by WLs unit.
Figure 9: Two-level execution pipeline for symbolic reasoning. Top: task-level overlap between GPU neural execution and REASON symbolic execution. Bottom: detailed cycle-by-cycle timeline of CDCL SAT solving, illustrating pipelined broadcast/reduction, WLs traversal, latency hiding, and priority-based conflict handling. Color represents the causality of hardware events.
SRAM layout. The local SRAM is partitioned to support a linked-list-based organization of watch lists. A dedicated region stores a head pointer table indexed by literal IDs, each pointing to the start of a watch list, enabling $\mathcal{O}(1)$ access. The main data region stores clauses, each augmented with a next-watch pointer that links to other clauses watching the same literal, forming multiple linked lists within the linear address space. Upon a variable assignment, the WLs unit uses the literal ID to fetch the head pointer and traverses the list by following next-watch pointers, dispatching only the relevant clause to PEs. This hardware-managed indexing eliminates full-database scans and maps efficiently to the adder datapath.
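A behavioral sketch of this layout, with a head-pointer table indexed by literal ID and a per-clause next-watch pointer; the field names and in-memory encoding are illustrative:

```cpp
// Linked-list watch lists: head-pointer table + next-watch pointers
// threaded through the clause database (illustrative sketch).
#include <vector>
#include <cassert>

struct ClauseSlot { int next_watch; /* clause payload elided */ };
constexpr int NIL = -1;

struct WatchLists {
    std::vector<int> head;              // head[lit] -> first watching clause
    std::vector<ClauseSlot> slots;      // clause database (linear addresses)
    WatchLists(int num_lits, int num_clauses)
        : head(num_lits, NIL), slots(num_clauses, ClauseSlot{NIL}) {}

    // O(1) insertion at the head of lit's watch list.
    void watch(int lit, int clause_id) {
        slots[clause_id].next_watch = head[lit];
        head[lit] = clause_id;
    }
    // Follow next-watch pointers, visiting only clauses watching lit.
    std::vector<int> watchers(int lit) const {
        std::vector<int> out;
        for (int c = head[lit]; c != NIL; c = slots[c].next_watch)
            out.push_back(c);
        return out;
    }
};
```

On an assignment, only the clauses on the assigned literal's list are touched; the rest of the database is never scanned, which is what lets most clauses stay in remote scratchpad or DRAM.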
BCP FIFO. The BCP FIFO sits atop the M:1 output interconnect (Fig. 6 (c)) and serializes multiple parallel implications generated by the leaf tree nodes in a single cycle. While many implications can be discovered concurrently, BCP must propagate them sequentially to preserve the causality chain for conflict analysis. The controller immediately broadcasts one implication back into the pipeline, while the rest are queued in the FIFO and processed in a pipelined manner. Within a symbolic (DPLL) tree node, implications are causally independent and can be pipelined, but between assignments, sequential ordering is enforced to maintain correctness. Sec. V-E illustrates a detailed cycle-level execution example.
Scalability advantages. A key advantage of the REASON architecture is that its tree-based inter-node topology does not become a bottleneck as the symbolic DPLL tree grows (Fig. 8 (a)). In contrast, all-to-one (or one-to-all) bus interconnects often fail to scale due to post-layout electrical constraints, including high fan-out and buffer insertion for hold-time fixes. Moreover, given that broadcasting is a dominant operation, the root-to-leaf traversal latency is critical. REASON's tree-based inter-node topology achieves exceptional scalability with an $\mathcal{O}(\log N)$ traversal latency, compared to $\mathcal{O}(\sqrt{N})$ for mesh-based designs and $\mathcal{O}(N)$ for bus-based interconnects (Fig. 8 (b)). This property enables robust scalability for large symbolic reasoning workloads.
Listing 1: C++ Programming Interface of REASON
```cpp
// Trigger symbolic execution for a single inference
void REASON_execute(
    int batch_id,               // batch identifier
    int batch_size,             // number of objects in the batch
    const void* neural_buffer,  // neural results in shared memory
    const void* reasoning_mode, // mode selection
    void* symbolic_buffer       // write-back symbolic results
);

// Query current REASON status for a given object
int REASON_check_status(
    int batch_id,               // batch identifier
    bool blocking               // wait till REASON is idle
);
```
### V-E Case Study: A Working Example of Symbolic Execution
Fig. 9 illustrates the dynamic, pipelined per-node execution of REASON during a cube-and-conquer SAT solving phase, which highlights several key hardware mechanisms, including inter-node pipelined broadcast/reduction, latency hiding via parallel WLs traversal, and priority-based conflict handling.
Execution begins with the controller issuing a Decision to assign $x_{1}$, which is broadcast through the distribution tree (T1-T4). At T5, leaf nodes concurrently discover two implications: $x_{2}=1$ and $x_{3}=0$. These implications are returned to the controller via the reduction tree in a pipelined manner, where $x_{2}=1$ arrives first, followed by $x_{3}=0$ at T10. Since the broadcast path is occupied, $x_{3}=0$ is queued into the BCP FIFO at T11.
At T15, the FIFO pops a subsequent implication ($x_{12}$), which triggers a WLs lookup. A local SRAM miss prompts the L2/DMA to begin fetching the clause; meanwhile, the BCP FIFO continues servicing queued implications: $x_{99}$ is popped and broadcast from T17-T19 while the DMA fetch is still in progress.
At T22, the propagation of $x_{99}$ results in a Conflict, which immediately propagates up the reduction tree. Upon receiving the conflict signal at T23, the controller asserts priority control: it halts the ongoing DMA fetch, flushes the FIFO, and discards all pending implications (including $x_{3}$ = $0$ ) from the now-invalid search path. The cube-and-conquer phase terminates, and the parallelized DPLL node is forwarded to the scalar PE for CDCL conflict analysis, as discussed in Sec. II-C.
### V-F Design Space Exploration and Scalability
Design space exploration. To identify the optimal configuration of the REASON architecture, we perform a comprehensive design space exploration, systematically varying key architectural parameters such as the depth of the tree (D), the number of parallel register banks (B), and the number of registers per bank (R). We evaluate each configuration on latency, energy consumption, and energy-delay product (EDP) across representative workloads. The selected configuration (D=3, B=64, R=32) offers the best trade-off between performance and energy efficiency.
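The selection rule can be sketched as follows; the candidate points below are invented for illustration, and only the minimize-EDP criterion comes from the text:

```cpp
// EDP-based selection over a (hypothetical) DSE sweep: each candidate is a
// (D, B, R) point with measured latency and energy; pick the minimum
// energy-delay product.
#include <cassert>
#include <vector>

struct Config { int D, B, R; double latency_ms, energy_mj; };

Config pick_min_edp(const std::vector<Config>& sweep) {
    Config best = sweep.front();
    for (const auto& c : sweep)
        if (c.latency_ms * c.energy_mj < best.latency_ms * best.energy_mj)
            best = c;
    return best;
}
```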
Scalability and variability support. Coupled with the reconfigurable array, pipeline scheduling, and memory layout optimizations, REASON provides flexible hardware support across symbolic and probabilistic kernels (e.g., SAT, FOL, PC, HMM), neuro-symbolic workloads, and cognitive tasks, enabling efficient neuro-symbolic processing at scale (Sec. VII).
Design choice discussion. We adopt a unified architecture for symbolic and probabilistic reasoning to maximize flexibility and efficiency, rather than decoupling them into separate engines. We identify that these kernels share common DAG patterns, enabling REASON to execute them on Tree-PEs through a unified representation. This approach achieves $>$ 90% utilization with 58% lower area/power than decoupled designs, while maintaining tight symbolic-probabilistic coupling. A flexible Benes network and compiler co-design handle routing and memory scheduling, ensuring efficient execution.
## VI REASON: System Integration and Pipeline
This section presents the system-level integration of REASON. We first present the integration principles and workload partitioning strategy between GPU and REASON (Sec. VI-A), then introduce the programming model that enables flexible invocation and synchronization (Sec. VI-B). Finally, we describe the two-level execution pipeline (Sec. VI-C).
### VI-A Integration with GPUs for End-to-End Reasoning
Integration principles. As illustrated in Fig. 6 (a), REASON is integrated as a co-processor within the GPU system to support efficient end-to-end symbolic and probabilistic reasoning. This integration follows two principles: (1) Versatility, to ensure compatibility with a broad range of logical and probabilistic reasoning workloads in neuro-symbolic pipelines, and (2) Efficiency, to achieve low-latency execution for real-time reasoning. These principles necessitate careful workload assignment between the GPU and REASON with pipelined execution.
Workload partitioning. To maximize performance while preserving compatibility with existing and emerging LLM-based neuro-symbolic agentic pipelines, we assign neural computation (e.g., LLM inference) to the GPU and offload symbolic reasoning and probabilistic inference to REASON. This partitioning exploits the GPU's high throughput and programmability for neural kernels, while leveraging REASON's reconfigurable architecture optimized for logical and probabilistic operations. It also minimizes data movement and enables pipelined execution: while REASON processes symbolic reasoning for the current batch, the GPU concurrently executes neural computation for the next batch.
### VI-B Programming Model
REASON's programming model (Listing 1) is designed to offer full flexibility and control, making it easy to use REASON for accelerating various neuro-symbolic applications. It exposes two core functions, for execution and status checking, enabling acceleration of logical and probabilistic kernels.
REASON_execute launches a single symbolic reasoning run. It is called after the GPU SMs complete the neural (LLM) computation. REASON then performs logical reasoning and probabilistic inference, and writes the symbolic results to shared memory, which the SMs use for the next iteration.
REASON_check_status reports the current execution status of REASON (IDLE or EXECUTION) and includes an optional blocking flag. This feature allows the host thread to wait for REASON to complete the current step of reasoning before starting the next, ensuring proper coordination across subtasks without relying on CUDA stream synchronization.
Synchronization. Coordination between SMs and REASON is handled through shared-memory flag buffers and the L2 cache. After executing the LLM kernel, SMs write the output to shared memory and set the neural_ready flag. REASON polls this flag, fetches the data, and performs symbolic reasoning. It then writes the result back to shared memory and sets the symbolic_ready flag, which is retrieved for the final output. This tightly coupled design leverages the GPU's throughput for LLM kernels and REASON's efficiency for symbolic reasoning, minimizing overhead and maximizing performance.
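The handshake can be walked through as a simple state machine. The flag names (neural_ready, symbolic_ready) mirror the text; the string-based "reasoning" step is a stand-in for the actual REASON kernel, not its implementation.

```cpp
// Single-threaded walk-through of the shared-memory handshake: the SM side
// publishes neural output and raises neural_ready; REASON polls the flag,
// consumes the buffer, writes symbolic results back, and raises
// symbolic_ready for the SMs to retrieve.
#include <cassert>
#include <string>

struct SharedMem {
    bool neural_ready = false;
    bool symbolic_ready = false;
    std::string neural_buffer, symbolic_buffer;
};

// SM side: publish neural results, then raise the flag.
void sm_publish(SharedMem& m, const std::string& out) {
    m.neural_buffer = out;
    m.neural_ready = true;
}

// REASON side: poll the flag, consume, write back, raise symbolic_ready.
bool reason_step(SharedMem& m) {
    if (!m.neural_ready) return false;     // still polling
    m.symbolic_buffer = "proved(" + m.neural_buffer + ")"; // stub reasoning
    m.neural_ready = false;
    m.symbolic_ready = true;
    return true;
}
```

In the real system the flags live in GPU shared memory/L2 and polling happens in hardware; the control flow, however, follows this shape.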
TABLE III: Hardware baselines. Comparison of device specifications.
| Device | Node | On-chip Memory | Compute | Area (mm$^{2}$) | Power (W) |
| --- | --- | --- | --- | --- | --- |
| Orin NX [18] | 8 nm | 4 MB | 512 CUDA Cores | 450 | 15 |
| RTX A6000 [48] | 8 nm | 16.5 MB | 10752 CUDA Cores | 628 | 300 |
| Xeon CPU [17] | 10 nm | 112.5 MB | 60 cores per socket | 1600 | 270 |
| TPU [19] | 7 nm | 170 MB | 8 128 $\times$ 128 PEs | 400 | 192 |
| DPU # [59] | 28 nm | 2.4 MB | 8 PEs / 56 Nodes | 3.20 | 1.10 |
| REASON | 28 nm | 1.25 MB | 12 PEs / 80 Nodes | 6.00 | 2.12 |
| REASON * | 12 nm | 1.25 MB | 12 PEs / 80 Nodes | 1.37 | 1.21 |
| REASON * | 8 nm | 1.25 MB | 12 PEs / 80 Nodes | 0.51 | 0.98 |
* The 12 nm and 8 nm data are scaled from DeepScaleTool [57] at a voltage of 0.8 V and a frequency of 500 MHz. # For the tree-based DPU architecture, terminology is renamed (tree to PE, PE to node) to align with the proposed REASON.
### VI-C Two-Level Execution Pipeline
Our system-level design employs a two-level pipelined execution model (Fig. 9 top-left) to maximize concurrency across neural and symbolic kernels. The GPU-REASON pipeline overlaps the execution of symbolic kernels on REASON for step $N$ with neural kernels on the GPU for step $N+1$, effectively hiding the latency of one stage and improving throughput. Within REASON, the Intra-REASON pipeline exploits inter-node pipelined broadcast and reduction to hide communication latency, using parallel watch-list (WLs) traversal and priority-based conflict handling to accelerate symbolic kernels (Sec. V-E). The compiler integrates pipeline-aware scheduling to reorder instructions, avoid read-after-write hazards, and insert no-operation instructions when necessary, ensuring each stage receives valid data without interruption.
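A back-of-envelope model captures why the GPU-REASON overlap helps: with symbolic step $N$ running concurrently with neural step $N+1$, the steady-state time per step is bounded by the slower stage rather than the sum of both. The stage times below are illustrative, not measured.

```cpp
// Two-stage pipeline timing model: serial execution pays (neural + symbolic)
// per step; pipelined execution pays one fill latency, then the max of the
// two stage times per subsequent step.
#include <algorithm>
#include <cassert>

double serial_time(int steps, double neural, double symbolic) {
    return steps * (neural + symbolic);
}

double pipelined_time(int steps, double neural, double symbolic) {
    // fill latency of one full step, then max-bound steady state
    return neural + symbolic + (steps - 1) * std::max(neural, symbolic);
}
```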
Figure 10: REASON layout and specifications. The physical design and key operating specifications of our proposed REASON hardware.
## VII Evaluation
This section introduces REASON evaluation settings (Sec. VII-A), and benchmarks the proposed algorithm optimizations (Sec. VII-B) and hardware architecture (Sec. VII-C).
### VII-A Evaluation Setup
Datasets. We evaluate REASON on 10 commonly used reasoning datasets: IMO [66], MiniF2F [86], TwinSafety [20], XSTest [56], CommonGen [31], News [85], CoAuthor [26], AwA2 [78], FOLIO [11], and ProofWriter [65]. Task performance is measured by reasoning and deductive accuracy.
Algorithm setup. We evaluate REASON on six state-of-the-art neuro-symbolic models: AlphaGeometry [66], R$^{2}$-Guard [20], GeLaTo [82], Ctrl-G [83], NeuroPC [6], and LINC [52]. Following the setup in the original literature, we determine hyperparameters based on end-to-end reasoning performance on the datasets. Our proposed REASON algorithm optimizations are general and work as a plug-and-play extension to existing neuro-symbolic algorithms.
Baselines. We consider several hardware baselines: Orin NX [18] (since our goal is to enable real-time neuro-symbolic reasoning at the edge), RTX GPU [48], Xeon CPU [17], and ML accelerators (TPU [19], DPU [59]). Tab. III lists their configurations.
Hardware setup. We implement the REASON architecture with [59] as the baseline template, synthesize it with Synopsys DC [63], and place and route using Cadence Innovus [5] at the TSMC 28 nm node. Fig. 10 illustrates the layout and key specifications. The REASON hardware consumes an area of 6 mm$^{2}$ and an average power of 2.12 W across workloads, based on Synopsys PTPX [64] power-trace analysis (Fig. 12 (a)). Unlike conventional tree-based arrays that mainly target neural workloads, REASON provides unified, reconfigurable support for neural, symbolic, and probabilistic computation.
Simulation setup. To evaluate end-to-end performance of REASON when integrated with GPUs, we develop a cycle-accurate simulator based on Accel-Sim (built on GPGPU-Sim) [21]. The simulator is configured for Orin NX architecture. The on-chip GPU is modeled with 8 SMs, each supporting 32 threads per warp, 48 KB shared memory, and 128 KB L1 cache, with a unified 2 MB L2 cache shared across SMs. The off-chip memory uses a 128-bit LPDDR5 interface with 104 GB/s peak BW. DRAM latency and energy are modeled using LPDDR5 timing parameters.
Simulator test trace derivation. We use GPGPU-Sim to model interactions between SMs and REASON, including transferring neural results from SMs to REASON and returning symbolic reasoning results from REASON to SMs. To simulate communication overhead, we extract memory access traces from neuro-symbolic model execution on Orin, capturing data volume and access patterns as inputs to GPGPU-Sim for accurate modeling. For GPU comparison baselines, we use real hardware measurements to get accurate ground-truth data.
### VII-B REASON Algorithm Performance
TABLE IV: REASON algorithm optimization performance. REASON achieves comparable accuracy with reduced memory footprint via unified DAG representation, adaptive pruning, and regularization.
| Workloads | Benchmarks | Metrics | Baseline Perf. | REASON Perf. | Memory $\downarrow$ |
| --- | --- | --- | --- | --- | --- |
| AlphaGeo | IMO | Accuracy ($\uparrow$) | 83% | 83% | 25% |
| AlphaGeo | MiniF2F | Accuracy ($\uparrow$) | 81% | 81% | 21% |
| R$^{2}$-Guard | TwinSafety | AUPRC ($\uparrow$) | 0.758 | 0.752 | 37% |
| R$^{2}$-Guard | XSTest | AUPRC ($\uparrow$) | 0.878 | 0.881 | 30% |
| GeLaTo | CommonGen | BLEU ($\uparrow$) | 30.3 | 30.2 | 41% |
| GeLaTo | News | BLEU ($\uparrow$) | 5.4 | 5.4 | 27% |
| Ctrl-G | CoAuthor | Success rate ($\uparrow$) | 87% | 86% | 29% |
| NeuroPC | AwA2 | Accuracy ($\uparrow$) | 87% | 87% | 43% |
| LINC | FOLIO | Accuracy ($\uparrow$) | 92% | 91% | 38% |
| LINC | ProofWriter | Accuracy ($\uparrow$) | 84% | 84% | 26% |
Reasoning accuracy. To evaluate the REASON algorithm optimization (Sec. IV), we benchmark it on ten reasoning tasks (Sec. VII-A). Tab. IV lists the reasoning performance and DAG memory reduction. We observe that REASON maintains comparable reasoning accuracy under unification and adaptive DAG pruning. Through pruning and regularization, REASON delivers 31.7% memory footprint savings on average across the ten reasoning tasks of the six neuro-symbolic workloads.
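As a quick arithmetic check, the 31.7% figure is the mean of the ten per-task memory reductions reported in Tab. IV:

```cpp
// Mean of the per-task memory-footprint reductions (%) from Tab. IV,
// listed in table order: IMO, MiniF2F, TwinSafety, XSTest, CommonGen,
// News, CoAuthor, AwA2, FOLIO, ProofWriter.
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

double mean_pct(const std::vector<double>& v) {
    return std::accumulate(v.begin(), v.end(), 0.0) / v.size();
}

const std::vector<double> kSavings =
    {25, 21, 37, 30, 41, 27, 29, 43, 38, 26}; // sums to 317 -> mean 31.7
```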
### VII-C REASON Architecture Performance
Performance improvement. We benchmark the REASON accelerator against Orin NX, RTX GPU, and Xeon CPU on accelerating neuro-symbolic algorithms across 10 reasoning tasks (Fig. 11). For the GPU baselines, neural kernels use PyTorch, which leverages CUDA and cuBLAS/cuDNN libraries; for symbolic kernels, we implement custom kernels optimized for logic and probabilistic operations. The workload is tiled by cuDNN in PyTorch with block sizes that fit well in GPU memory. REASON exhibits consistent speedup across datasets, e.g., 50.65$\times$/11.98$\times$ over Orin NX and RTX GPU, respectively. Furthermore, REASON achieves real-time performance ($<$ 1.0 s) on math and cognitive reasoning tasks, indicating that REASON enables real-time neuro-symbolic systems based on probabilistic logical reasoning, with superior reasoning and generalization capability, offering a promising solution for future cognitive applications.
Figure 11: End-to-end runtime improvement. REASON consistently outperforms Xeon CPU, Orin NX, and RTX GPU in end-to-end runtime latency evaluated on 10 logical and cognitive reasoning tasks.
Figure 12: Energy efficiency improvement. (a) Power analysis of REASON across workloads. (b) Energy efficiency comparison between REASON and CPUs/GPUs, evaluated from 10 reasoning tasks.
Figure 13: Improved efficiency over ML accelerators. Speedup comparison of neural, symbolic (logical and probabilistic), and end-to-end neuro-symbolic system over TPU-like systolic-based array and DPU-like tree-based array architecture.
Energy efficiency improvement. The REASON accelerator achieves two orders of magnitude higher energy efficiency than Orin NX, RTX GPU, and Xeon CPU consistently across workloads (Fig. 12 (b)). To further assess REASON's energy efficiency in long-term deployment, we perform consecutive tests using mixed workloads, incorporating both low-activity and high-demand periods, with 15 s idle intervals between scenarios. On average, REASON achieves 681$\times$ better energy efficiency than the RTX GPU. Additionally, compared to V100 and A100, REASON shows 4.91$\times$ and 1.60$\times$ speedup, with 802$\times$ and 268$\times$ better energy efficiency, respectively.
Compare with CPU+GPU. We compare the performance of REASON as a GPU plug-in against a CPU+GPU architecture across neuro-symbolic workloads. The CPU+GPU architecture is inefficient for neuro-symbolic computing due to (1) high latency of symbolic/probabilistic operations with poor locality and $<$ 5% CPU parallel efficiency, (2) $>$ 15% inter-device communication overhead from frequent neural-symbolic data transfers, and (3) fine-grained coupling between neural and symbolic modules that makes handoffs costly. REASON co-locates logical and probabilistic reasoning beside the GPU SMs, sharing the L2 cache and inter-SM fabric to eliminate transfers and pipeline neural-symbolic execution.
Compare with ML accelerators. We benchmark the runtime of neural and symbolic operations on a TPU-like systolic array [19] and a DPU-like tree-based array [59] across different neuro-symbolic models and tasks (Fig. 13). For the TPU-like architecture, we use SCALE-Sim [54], configured with eight 128 $\times$ 128 systolic arrays. For the DPU-like architecture, we use MAERI [24], configured with eight PEs in a 56-node tree-based array. Compared with these ML accelerators, REASON achieves similar performance on neural operations while exhibiting superior efficiency on symbolic logic and probabilistic operations, and thus end-to-end speedup in neuro-symbolic systems.
TABLE V: Ablation study on the necessity of co-design. Normalized runtime achieved by the REASON framework with and without the proposed algorithm optimization and hardware techniques on different tasks.
| Algorithm @ Hardware | IMO [66] | MiniF [86] | TwinS [20] | XSTest [56] | ComG [31] |
| --- | --- | --- | --- | --- | --- |
| Baseline [66, 20, 82] @ Orin NX | 100% | 100% | 100% | 100% | 100% |
| REASON Algo. @ Orin NX | 84.2% | 87.0% | 78.3% | 82.9% | 86.6% |
| REASON Algo. @ REASON HW | 2.07% | 1.94% | 2.04% | 1.99% | 2.08% |
Ablation study on the proposed hardware techniques. REASON features a reconfigurable tree-based array architecture, efficient register bank mapping, and an adaptive scheduling strategy to reduce compute latency for neural, symbolic, and probabilistic kernels (Sec. V and Sec. VI). To demonstrate their effectiveness, we measure the runtime of REASON with the scheduling, reconfigurable architecture, and bank mapping techniques ablated. In particular, the proposed memory layout support alone trims runtime by 22% on average. With the reconfigurable array and scheduling strategy added, the runtime reduction grows to 56% and 73%, respectively, indicating that both techniques are necessary for REASON to achieve the desired efficient reasoning capability.
Ablation study of the necessity of co-design. To verify that the algorithm-hardware co-design strategy is necessary for efficient probabilistic logical reasoning-based neuro-symbolic systems, we measure the runtime latency of REASON without the proposed algorithm or hardware techniques in Tab. V. With the proposed REASON algorithm optimization alone, we trim the runtime to 78.3% of R$^2$-Guard [20] on the same Orin NX hardware and TwinSafety task. With both the REASON algorithm optimization and the accelerator, the runtime is reduced to 2.04%, confirming the necessity of the co-design strategy of the REASON framework.
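Since Tab. V reports runtimes normalized to a 100% baseline, each entry converts directly into a speedup (baseline / optimized); a quick sanity check on the TwinSafety column:

```python
# Speedups implied by Table V's normalized runtimes (baseline = 100%),
# shown for the TwinSafety task column.
def speedup(norm_runtime_pct):
    """Convert a normalized runtime (%) into a speedup over baseline."""
    return 100.0 / norm_runtime_pct

algo_only = speedup(78.3)   # REASON Algo. @ Orin NX
co_design = speedup(2.04)   # REASON Algo. @ REASON HW
# Algorithm optimization alone yields roughly 1.28x; the full co-design
# roughly 49x. Most of the gain comes from the accelerator, but only
# once the algorithm exposes the structure the hardware exploits.
```

This arithmetic uses only the reported table values; the interpretation in the comments mirrors the ablation discussion above.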
REASON neural optimization. REASON accelerates symbolic reasoning and enables seamless interaction with GPUs optimized for neural (NN/LLM) computation. To further optimize the neural module, we integrate standard LLM acceleration techniques: memory-efficient attention [25], chunked prefill [69], speculative decoding [27], FlashAttention-3 kernels [58], FP8 KV-cache quantization [70], and prefix caching [68]. These collectively yield 2.8-3.3 $\times$ latency reduction for unique prompts and 4-5 $\times$ when prefixes are reused. While REASON's reported gains stem from its hardware-software co-design, these LLM optimizations are orthogonal and can be applied in conjunction to unlock the full system speedup potential.
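As a rough illustration of this orthogonality, an Amdahl-style composition shows how neural-side and symbolic-side speedups combine end to end. The time fractions and speedup factors below are hypothetical placeholders for illustration, not measured values from this paper.

```python
# Amdahl-style composition of orthogonal speedups: the neural (LLM)
# portion and the symbolic/probabilistic portion are accelerated by
# independent techniques. All numbers here are hypothetical.
def combined_speedup(frac_neural, s_neural, s_symbolic):
    """End-to-end speedup when a frac_neural share of time is sped up
    by s_neural and the remainder by s_symbolic."""
    frac_symbolic = 1.0 - frac_neural
    return 1.0 / (frac_neural / s_neural + frac_symbolic / s_symbolic)

# Suppose 40% of end-to-end time is neural and 60% is symbolic/
# probabilistic; the LLM optimizations give 3x on the neural part and
# the accelerator gives 25x on the symbolic part.
s = combined_speedup(frac_neural=0.4, s_neural=3.0, s_symbolic=25.0)
# 1 / (0.4/3 + 0.6/25) is roughly 6.4x end to end.
```

The takeaway matches the text: neither side's optimization alone bounds the end-to-end gain, so applying both in conjunction compounds.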
## VIII Related Work
Neuro-symbolic AI. Neuro-symbolic AI has emerged as a promising paradigm for addressing the limitations of purely neural models, including factual errors, limited interpretability, and weak multi-step reasoning [84, 3, 14, 8, 53, 37, 80]. Systems such as LIPS [28], AlphaGeometry [66], NSC [39], and NS3D [15] demonstrate strong performance across domains ranging from mathematical reasoning to embodied and cognitive robotics. However, most prior work focuses on algorithmic design and model integration. REASON systematically characterizes the architectural and system-level properties of probabilistic logical reasoning in neuro-symbolic AI, and proposes an integrated acceleration framework for scalable deployment.
System and architecture for neuro-symbolic AI. Early neuro-symbolic systems largely focused on software-level abstractions, such as training semantics and declarative languages that integrate neural models with logical or probabilistic reasoning, including DeepProbLog [36], DreamCoder [10], and Scallop [29]. Recent efforts have begun to address system-level challenges such as heterogeneous mapping, batching of control-heavy reasoning, and kernel specialization, including benchmarking [74], pruning [7], Lobster [4], Dolphin [47], and KLay [34]. At the architectural level, a growing body of work exposes the mismatch between compositional neuro-symbolic workloads and conventional hardware designs, motivating cognitive architectures such as CogSys [77], NVSA architectures [73], and NSFlow [79]. REASON advances this direction with the first flexible system-architecture co-design that supports probabilistic logical reasoning-based neuro-symbolic AI and integrates with GPUs, enabling efficient and scalable deployment of compositional neuro-symbolic and LLM+tools agentic systems.
## IX Conclusion
We present REASON, an integrated acceleration framework for efficiently executing probabilistic logical reasoning in neuro-symbolic AI. REASON introduces a unified DAG abstraction with adaptive pruning and a flexible reconfigurable architecture integrated with GPUs to enable efficient end-to-end execution. Our results show that system-architecture co-design is critical for making neuro-symbolic reasoning practical at scale, and they position REASON as a potential foundation for future agentic AI and LLM+tools systems that require structured and interpretable reasoning alongside neural computation.
## Acknowledgements
This work was supported in part by CoCoSys, one of seven centers in JUMP 2.0, a Semiconductor Research Corporation (SRC) program sponsored by DARPA. We thank Ananda Samajdar, Ritik Raj, Anand Raghunathan, Kaushik Roy, Ningyuan Cao, Katie Zhao, Alexey Tumanov, Shirui Zhao, Xiaoxuan Yang, Zhe Zeng, and the anonymous HPCA reviewers for insightful discussions and valuable feedback.
## References
- [1] K. Ahmed, S. Teso, K. Chang, G. Van den Broeck, and A. Vergari (2022) Semantic probabilistic layers for neuro-symbolic learning. Advances in Neural Information Processing Systems 35, pp. 29944–29959. Cited by: §I.
- [2] R. Aksitov, S. Miryoosefi, Z. Li, D. Li, S. Babayan, K. Kopparapu, Z. Fisher, R. Guo, S. Prakash, P. Srinivasan, et al. (2023) Rest meets react: self-improvement for multi-step reasoning llm agent. arXiv preprint arXiv:2312.10003. Cited by: §I.
- [3] S. Badreddine, A. d. Garcez, L. Serafini, and M. Spranger (2022) Logic tensor networks. Artificial Intelligence 303, pp. 103649. Cited by: §VIII.
- [4] P. Biberstein, Z. Li, J. Devietti, and M. Naik (2025) Lobster: a gpu-accelerated framework for neurosymbolic programming. arXiv preprint arXiv:2503.21937. Cited by: §VIII.
- [5] Cadence Innovus implementation system - cadence. Note: https://www.cadence.com/en_US/home/tools/digital-design-and-signoff/soc-implementation-and-floorplanning/innovus-implementation-system.html Cited by: § VII-A.
- [6] W. Chen, S. Yu, H. Shao, L. Sha, and H. Zhao (2025) Neural probabilistic circuits: enabling compositional and interpretable predictions through logical reasoning. arXiv preprint arXiv:2501.07021. Cited by: TABLE I, § VII-A.
- [7] M. Dang, A. Liu, and G. Van den Broeck (2022) Sparse probabilistic circuits via pruning and growing. Advances in Neural Information Processing Systems (NeurIPS) 35, pp. 28374–28385. Cited by: §VIII.
- [8] H. Dong, J. Mao, T. Lin, C. Wang, L. Li, and D. Zhou (2019) Neural logic machines. In International Conference on Learning Representations (ICLR), Cited by: §VIII.
- [9] S. Du, M. Ibrahim, Z. Wan, L. Zheng, B. Zhao, Z. Fan, C. Liu, T. Krishna, A. Raychowdhury, and H. Li (2025) Cross-layer design of vector-symbolic computing: bridging cognition and brain-inspired hardware acceleration. arXiv preprint arXiv:2508.14245. Cited by: §I.
- [10] K. Ellis, L. Wong, M. Nye, M. Sable-Meyer, L. Cary, L. Anaya Pozo, L. Hewitt, A. Solar-Lezama, and J. B. Tenenbaum (2023) Dreamcoder: growing generalizable, interpretable knowledge with wake–sleep bayesian program learning. Philosophical Transactions of the Royal Society A 381 (2251), pp. 20220050. Cited by: §VIII.
- [11] S. Han, H. Schoelkopf, Y. Zhao, Z. Qi, M. Riddell, W. Zhou, J. Coady, D. Peng, Y. Qiao, L. Benson, et al. (2022) Folio: natural language reasoning with first-order logic. arXiv preprint arXiv:2209.00840. Cited by: § VII-A.
- [12] M. Hersche, M. Zeqiri, L. Benini, A. Sebastian, and A. Rahimi (2023) A neuro-vector-symbolic architecture for solving raven's progressive matrices. Nature Machine Intelligence 5 (4), pp. 363–375. Cited by: §I, § II-A.
- [13] M. J. Heule, O. Kullmann, S. Wieringa, and A. Biere (2011) Cube and conquer: guiding cdcl sat solvers by lookaheads. In Haifa Verification Conference, pp. 50–65. Cited by: § II-C.
- [14] P. Hohenecker and T. Lukasiewicz (2020) Ontology reasoning with deep neural networks. Journal of Artificial Intelligence Research 68, pp. 503–540. Cited by: §VIII.
- [15] J. Hsu, J. Mao, and J. Wu (2023) Ns3d: neuro-symbolic grounding of 3d objects and relations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2614–2623. Cited by: §VIII.
- [16] M. Ibrahim, Z. Wan, H. Li, P. Panda, T. Krishna, P. Kanerva, Y. Chen, and A. Raychowdhury (2024) Special session: neuro-symbolic architecture meets large language models: a memory-centric perspective. In 2024 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), pp. 11–20. Cited by: §I.
- [17] INTEL Corporation (2023) 4th gen intel xeon scalable processors. Note: https://www.intel.com/content/www/us/en/ark/products/series/228622/4th-gen-intel-xeon-scalable-processors.html Cited by: TABLE III, § VII-A.
- [18] () Jetson orin for next-gen robotics – nvidia. Note: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/ (Accessed on 04/02/2024) Cited by: TABLE III, § VII-A.
- [19] N. P. Jouppi, D. H. Yoon, M. Ashcraft, M. Gottscho, T. B. Jablin, G. Kurian, J. Laudon, S. Li, P. Ma, X. Ma, et al. (2021) Ten lessons from three generations shaped google's tpuv4i: industrial product. In 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), pp. 1–14. Cited by: TABLE III, § VII-A, § VII-C.
- [20] M. Kang and B. Li (2025) R$^2$-Guard: robust reasoning enabled llm guardrail via knowledge-enhanced logical reasoning. International Conference on Learning Representations (ICLR). Cited by: §I, TABLE I, § VII-A, § VII-A, § VII-C, TABLE V, TABLE V.
- [21] M. Khairy, Z. Shen, T. M. Aamodt, and T. G. Rogers (2020) Accel-sim: an extensible simulation framework for validated gpu modeling. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pp. 473–486. Cited by: § VII-A.
- [22] P. Khosravi, Y. Choi, Y. Liang, A. Vergari, and G. Van den Broeck (2019) On tractable computation of expected predictions. Advances in Neural Information Processing Systems 32. Cited by: § II-C.
- [23] J. Kuang, Y. Shen, J. Xie, H. Luo, Z. Xu, R. Li, Y. Li, X. Cheng, X. Lin, and Y. Han (2025) Natural language understanding and inference with mllm in visual question answering: a survey. ACM Computing Surveys 57 (8), pp. 1–36. Cited by: §I.
- [24] H. Kwon, A. Samajdar, and T. Krishna (2018) Maeri: enabling flexible dataflow mapping over dnn accelerators via reconfigurable interconnects. ACM Sigplan Notices 53 (2), pp. 461–475. Cited by: § V-B, § VII-C.
- [25] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica (2023) Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles (SOSP), pp. 611–626. Cited by: § VII-C.
- [26] M. Lee, P. Liang, and Q. Yang (2022) Coauthor: designing a human-ai collaborative writing dataset for exploring language model capabilities. In Proceedings of the 2022 CHI conference on human factors in computing systems, pp. 1–19. Cited by: § VII-A.
- [27] Y. Leviathan, M. Kalman, and Y. Matias (2023) Fast inference from transformers via speculative decoding. In International Conference on Machine Learning (ICML), pp. 19274–19286. Cited by: § VII-C.
- [28] Z. Li, Z. Li, W. Tang, X. Zhang, Y. Yao, X. Si, F. Yang, K. Yang, and X. Ma (2025) Proving olympiad inequalities by synergizing llms and symbolic reasoning. International Conference on Learning Representations (ICLR), pp. 1–26. Cited by: §I, §VIII.
- [29] Z. Li, J. Huang, and M. Naik (2023) Scallop: a language for neurosymbolic programming. Proceedings of the ACM on Programming Languages 7 (PLDI), pp. 1463–1487. Cited by: §VIII.
- [30] Y. Liang and G. Van den Broeck (2019) Learning logistic circuits. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 4277–4286. Cited by: § II-C.
- [31] B. Y. Lin, W. Zhou, M. Shen, P. Zhou, C. Bhagavatula, Y. Choi, and X. Ren (2020) CommonGen: a constrained text generation challenge for generative commonsense reasoning. Findings of the Association for Computational Linguistics (EMNLP), pp. 1823–1840. Cited by: § VII-A, TABLE V.
- [32] A. Liu, K. Ahmed, and G. V. d. Broeck (2024) Scaling tractable probabilistic circuits: a systems perspective. International Conference on Machine Learning (ICML), pp. 30630–30646. Cited by: § II-C.
- [33] M. Lo, M. F. Chang, and J. Cong (2025) SAT-accel: a modern sat solver on a fpga. In Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, pp. 234–246. Cited by: § II-C.
- [34] J. Maene, V. Derkinderen, and P. Z. D. Martires (2024) Klay: accelerating arithmetic circuits for neurosymbolic ai. arXiv preprint arXiv:2410.11415. Cited by: §VIII.
- [35] M. Mahaut, L. Aina, P. Czarnowska, M. Hardalov, T. Müller, and L. Màrquez (2024) Factual confidence of llms: on reliability and robustness of current estimators. ACL. Cited by: §I.
- [36] R. Manhaeve, S. Dumancic, A. Kimmig, T. Demeester, and L. De Raedt (2018) Deepproblog: neural probabilistic logic programming. Advances in neural information processing systems (NeurIPS) 31. Cited by: §VIII.
- [37] R. Manhaeve, S. Dumančič, A. Kimmig, T. Demeester, and L. De Raedt (2021) Neural probabilistic logic programming in deepproblog. Artificial Intelligence 298, pp. 103504. Cited by: §VIII.
- [38] J. Mao, C. Gan, P. Kohli, J. B. Tenenbaum, and J. Wu (2019) The neuro-symbolic concept learner: interpreting scenes, words, and sentences from natural supervision. International Conference on Learning Representations (ICLR). Cited by: §I, § II-A.
- [39] J. Mao, J. B. Tenenbaum, and J. Wu (2025) Neuro-symbolic concepts. arXiv preprint arXiv:2505.06191. Cited by: §VIII.
- [40] J. Marques-Silva, I. Lynce, and S. Malik (2021) Conflict-driven clause learning sat solvers. In Handbook of satisfiability, pp. 133–182. Cited by: § II-C.
- [41] L. Mei, J. Mao, Z. Wang, C. Gan, and J. B. Tenenbaum (2022) FALCON: fast visual concept learning by integrating images, linguistic descriptions, and conceptual relations. International Conference on Learning Representations (ICLR). Cited by: §I, § II-A.
- [42] S. Mirchandani, F. Xia, P. Florence, B. Ichter, D. Driess, M. G. Arenas, K. Rao, D. Sadigh, and A. Zeng (2023) Large language models as general pattern machines. CoRL. Cited by: §I.
- [43] B. Mor, S. Garhwal, and A. Kumar (2021) A systematic review of hidden markov models and their applications. Archives of computational methods in engineering 28 (3), pp. 1429–1448. Cited by: § II-C.
- [44] M. W. Moskewicz, C. F. Madigan, Y. Zhao, L. Zhang, and S. Malik (2001) Chaff: engineering an efficient sat solver. In Proceedings of the 38th annual Design Automation Conference, pp. 530–535. Cited by: § V-D.
- [45] F. Muñoz-Martínez, R. Garg, M. Pellauer, J. L. Abellán, M. E. Acacio, and T. Krishna (2023) Flexagon: a multi-dataflow sparse-sparse matrix multiplication accelerator for efficient dnn processing. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pp. 252–265. Cited by: § V-B.
- [46] M. F. Naeem, M. G. Z. A. Khan, Y. Xian, M. Z. Afzal, D. Stricker, L. Van Gool, and F. Tombari (2023) I2mvformer: large language model generated multi-view document supervision for zero-shot image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15169–15179. Cited by: §I.
- [47] A. Naik, J. Liu, C. Wang, A. Sethi, S. Dutta, M. Naik, and E. Wong (2024) Dolphin: a programmable framework for scalable neurosymbolic learning. arXiv preprint arXiv:2410.03348. Cited by: §VIII.
- [48] NVIDIA Corporation (2020) NVIDIA rtx a6000 graphics card. Note: https://www.nvidia.com/en-us/products/workstations/rtx-a6000/ Cited by: TABLE III, § VII-A.
- [49] NVIDIA NVIDIA Jetson Orin. Note: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-orin/ Cited by: §III.
- [50] NVIDIA NVIDIA Nsight Compute. Note: https://developer.nvidia.com/nsight-compute Cited by: § III-B.
- [51] NVIDIA NVIDIA Nsight Systems. Note: https://developer.nvidia.com/nsight-systems Cited by: § III-B.
- [52] T. X. Olausson, A. Gu, B. Lipkin, C. E. Zhang, A. Solar-Lezama, J. B. Tenenbaum, and R. Levy (2023) LINC: a neurosymbolic approach for logical reasoning by combining language models with first-order logic provers. Conference on Empirical Methods in Natural Language Processing (EMNLP). Cited by: TABLE I, § VII-A.
- [53] C. Pryor, C. Dickens, E. Augustine, A. Albalak, W. Wang, and L. Getoor (2022) NeuPSL: neural probabilistic soft logic. Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence (IJCAI) 461, pp. 4145–4153. Cited by: §VIII.
- [54] R. Raj, S. Banerjee, N. Chandra, Z. Wan, J. Tong, A. Samajdhar, and T. Krishna (2025) SCALE-sim v3: a modular cycle-accurate systolic accelerator simulator for end-to-end system analysis. In 2025 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 186–200. Cited by: § VII-C.
- [55] B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. Ruiz, J. S. Ellenberg, P. Wang, O. Fawzi, et al. (2024) Mathematical discoveries from program search with large language models. Nature 625 (7995), pp. 468–475. Cited by: §I, § II-A.
- [56] P. Röttger, H. R. Kirk, B. Vidgen, G. Attanasio, F. Bianchi, and D. Hovy (2023) Xstest: a test suite for identifying exaggerated safety behaviours in large language models. arXiv preprint arXiv:2308.01263. Cited by: § VII-A, TABLE V.
- [57] S. Sarangi and B. Baas (2021) DeepScaleTool: a tool for the accurate estimation of technology scaling in the deep-submicron era. In 2021 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 1–5. Cited by: item *.
- [58] J. Shah, G. Bikshandi, Y. Zhang, V. Thakkar, P. Ramani, and T. Dao (2024) Flashattention-3: fast and accurate attention with asynchrony and low-precision. Advances in Neural Information Processing Systems (NeurIPS) 37, pp. 68658–68685. Cited by: § VII-C.
- [59] N. Shah, W. Meert, and M. Verhelst (2023) DPU-v2: energy-efficient execution of irregular directed acyclic graphs. In 2023 55th IEEE/ACM International Symposium on Microarchitecture (MICRO), pp. 1288–1307. Cited by: TABLE III, § VII-A, § VII-A, § VII-C.
- [60] C. Shengyuan, Y. Cai, H. Fang, X. Huang, and M. Sun (2023) Differentiable neuro-symbolic reasoning on large-scale knowledge graphs. Advances in Neural Information Processing Systems 36, pp. 28139–28154. Cited by: §I.
- [61] C. Singh, J. P. Inala, M. Galley, R. Caruana, and J. Gao (2024) Rethinking interpretability in the era of large language models. arXiv preprint arXiv:2402.01761. Cited by: §I.
- [62] G. Sriramanan, S. Bharti, V. S. Sadasivan, S. Saha, P. Kattakinda, and S. Feizi (2024) Llm-check: investigating detection of hallucinations in large language models. Advances in Neural Information Processing Systems 37, pp. 34188–34216. Cited by: §I.
- [63] Synopsys Design compiler - synopsys. Note: https://www.synopsys.com/implementation-and-signoff/rtl-synthesis-test/dc-ultra.html Cited by: § VII-A.
- [64] Synopsys PrimeTime - synopsys. Note: https://www.synopsys.com/implementation-and-signoff/signoff/primetime.html Cited by: § VII-A.
- [65] O. Tafjord, B. D. Mishra, and P. Clark (2020) ProofWriter: generating implications, proofs, and abductive statements over natural language. arXiv preprint arXiv:2012.13048. Cited by: § VII-A.
- [66] T. H. Trinh, Y. Wu, Q. V. Le, H. He, and T. Luong (2024) Solving olympiad geometry without human demonstrations. Nature 625 (7995), pp. 476–482. Cited by: §I, § II-A, § II-B, TABLE I, § VII-A, § VII-A, TABLE V, TABLE V, §VIII.
- [67] P. Van Der Tak, M. J. Heule, and A. Biere (2012) Concurrent cube-and-conquer. In International Conference on Theory and Applications of Satisfiability Testing, pp. 475–476. Cited by: § II-C.
- [68] vLLM vLLM Automatic Prefix Caching . Note: https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html Cited by: § VII-C.
- [69] vLLM vLLM Performance and Tuning . Note: https://docs.vllm.ai/en/latest/configuration/optimization.html Cited by: § VII-C.
- [70] vLLM vLLM Quantized KV Cache . Note: https://docs.vllm.ai/en/stable/features/quantization/quantized_kvcache.html Cited by: § VII-C.
- [71] Z. Wan, Y. Du, M. Ibrahim, J. Qian, J. Jabbour, Y. Zhao, T. Krishna, A. Raychowdhury, and V. J. Reddi (2025) Reca: integrated acceleration for real-time and efficient cooperative embodied autonomous agents. In Proceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Volume 2, pp. 982–997. Cited by: §I.
- [72] Z. Wan, C. Liu, H. Yang, C. Li, H. You, Y. Fu, C. Wan, T. Krishna, Y. Lin, and A. Raychowdhury (2024) Towards cognitive ai systems: a survey and prospective on neuro-symbolic ai. arXiv preprint arXiv:2401.01040. Cited by: §I.
- [73] Z. Wan, C. Liu, H. Yang, R. Raj, C. Li, H. You, Y. Fu, C. Wan, S. Li, Y. Kim, et al. (2024) Towards efficient neuro-symbolic ai: from workload characterization to hardware architecture. IEEE Transactions on Circuits and Systems for Artificial Intelligence (TCASAI). Cited by: §VIII.
- [74] Z. Wan, C. Liu, H. Yang, R. Raj, C. Li, H. You, Y. Fu, C. Wan, A. Samajdar, Y. C. Lin, et al. (2024) Towards cognitive ai systems: workload and characterization of neuro-symbolic ai. In 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pp. 268–279. Cited by: §VIII.
- [75] Z. Wan, C. Liu, H. Yang, R. Raj, A. Raychowdhury, and T. Krishna (2025) Efficient processing of neuro-symbolic ai: a tutorial and cross-layer co-design case study. Proceedings of the International Conference on Neuro-symbolic Systems. Cited by: §I.
- [76] Z. Wan, H. Yang, J. Qian, R. Raj, J. Park, C. Wang, A. Raychowdhury, and T. Krishna (2026) Compositional ai beyond llms: system implications of neuro-symbolic-probabilistic architectures. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Volume 1, pp. 67–84. Cited by: Figure 2, Figure 2.
- [77] Z. Wan, H. Yang, R. Raj, C. Liu, A. Samajdar, A. Raychowdhury, and T. Krishna (2025) Cogsys: efficient and scalable neurosymbolic cognition system via algorithm-hardware co-design. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA), pp. 775–789. Cited by: §VIII.
- [78] Y. Xian, C. Lampert, B. Schiele, and Z. Akata (2018) Zero-shot learning: a comprehensive evaluation of the good, the bad and the ugly. arXiv preprint arXiv:1707.00600. Cited by: § VII-A.
- [79] H. Yang, Z. Wan, R. Raj, J. Park, Z. Li, A. Samajdar, A. Raychowdhury, and T. Krishna (2025) NSFlow: an end-to-end fpga framework with scalable dataflow architecture for neuro-symbolic ai. In 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1–7. Cited by: §VIII.
- [80] Z. Yang, A. Ishay, and J. Lee (2020) Neurasp: embracing neural networks into answer set programming. In 29th International Joint Conference on Artificial Intelligence (IJCAI 2020), Cited by: §VIII.
- [81] C. Zhang, B. Jia, S. Zhu, and Y. Zhu (2021) Abstract spatial-temporal reasoning via probabilistic abduction and execution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9736–9746. Cited by: §I, § II-A.
- [82] H. Zhang, M. Dang, N. Peng, and G. Van den Broeck (2023) Tractable control for autoregressive language generation. In International Conference on Machine Learning (ICML), pp. 40932–40945. Cited by: TABLE I, § VII-A, TABLE V.
- [83] H. Zhang, P. Kung, M. Yoshida, G. Van den Broeck, and N. Peng (2024) Adaptable logical control for large language models. Advances in Neural Information Processing Systems (NeurIPS) 37, pp. 115563–115587. Cited by: §I, TABLE I, § VII-A.
- [84] H. Zhang and T. Yu (2020) AlphaZero. Deep Reinforcement Learning: Fundamentals, Research and Applications, pp. 391–415. Cited by: §VIII.
- [85] Y. Zhang, G. Wang, C. Li, Z. Gan, C. Brockett, and B. Dolan (2020) POINTER: constrained progressive text generation via insertion-based generative pre-training. arXiv preprint arXiv:2005.00558. Cited by: § VII-A.
- [86] K. Zheng, J. M. Han, and S. Polu (2021) Minif2f: a cross-system benchmark for formal olympiad-level mathematics. arXiv preprint arXiv:2109.00110. Cited by: § VII-A, TABLE V.