# Reason-Align-Respond: Aligning LLM Reasoning with Knowledge Graphs for KGQA
## Abstract
LLMs have demonstrated remarkable capabilities in complex reasoning tasks, yet they often suffer from hallucinations and lack reliable factual grounding. Meanwhile, knowledge graphs (KGs) provide structured factual knowledge but lack the flexible reasoning abilities of LLMs. In this paper, we present Reason-Align-Respond (RAR), a novel framework that systematically integrates LLM reasoning with knowledge graphs for KGQA. Our approach consists of three key components: a Reasoner that generates human-like reasoning chains, an Aligner that maps these chains to valid KG paths, and a Responser that synthesizes the final answer. We formulate this process as a probabilistic model and optimize it using the Expectation-Maximization algorithm, which iteratively refines the reasoning chains and knowledge paths. Extensive experiments on multiple benchmarks demonstrate the effectiveness of RAR, achieving state-of-the-art performance with Hit scores of 93.3% and 91.0% on WebQSP and CWQ respectively. Human evaluation confirms that RAR generates high-quality, interpretable reasoning chains well-aligned with KG paths. Furthermore, RAR exhibits strong zero-shot generalization capabilities and maintains computational efficiency during inference.
Xiangqing Shen, Fanfan Wang, and Rui Xia (corresponding author)
School of Computer Science and Engineering, Nanjing University of Science and Technology, China
{xiangqing.shen, ffwang, rxia}@njust.edu.cn
## 1 Introduction
Large language models (LLMs) have exhibited impressive capabilities across a range of complex tasks (Hadi et al., 2023), yet their reasoning processes often lack reliable factual knowledge. This shortcoming reduces interpretability, leads to hallucinations, and causes factual or logical errors (Huang et al., 2025). Knowledge graphs (KGs) (Bollacker et al., 2008), which organize factual knowledge in a structured format, offer strong interpretability and expressive power, making them a reliable source of factual support for LLM reasoning. Integrating KGs with LLMs in question answering, e.g., knowledge graph question answering (KGQA), has gained interest as an effective strategy to mitigate hallucinations and enhance interpretability.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Comparison of Knowledge Graph Question Answering Methods
### Overview
This diagram illustrates and compares three different approaches for answering a complex, multi-hop question using a knowledge graph (KG). The central question is: **"What's the music style of the album folklore by Scott Swift's daughter?"** The diagram contrasts two existing methods (Training-free and Training-based) that produce incorrect answers with a proposed new framework ("Our Framework") that arrives at the correct answer. The visual narrative emphasizes the reasoning process and knowledge path alignment required for accurate multi-hop reasoning.
### Components/Axes
The diagram is divided into three primary sections, each enclosed in a dashed or solid border:
1. **Top-Left Section: "Training-free Agent Exploration Methods"**
* **Header:** "Training-free Agent Exploration Methods"
* **Main Element:** A box labeled "Explore on graph" containing a small knowledge graph snippet.
* **Graph Nodes & Edges:**
* Nodes: `Scott Swift` (blue), `Taylor Swift` (green), `folklore` (gray), `pop` (red), `Indie Folk` (gray).
* Edges: `daughter` (from Scott to Taylor), `album` (from Taylor to folklore), `genre` (from Taylor to pop, and from folklore to Indie Folk).
* **Process Indicators:**
* `t=1` (green arrow) points from Scott Swift to Taylor Swift.
* `t=2` (red arrow with an 'X') points from Taylor Swift to pop.
* **Output:** A box labeled "Answer: Pop" with a red 'X' icon.
* **Agent Icon:** An icon labeled "LLM Agent" with a circular arrow labeled "T steps".
2. **Top-Right Section: "Training-based Path Generation Methods"**
* **Header:** "Training-based Path Generation Methods"
* **Main Element:** A box showing a generated path: `Scott --daughter--> Taylor --album--> folklore --styleX--> Pop`.
* **Process Flow:**
* A "Path Generator" box receives input from a "KG" (Knowledge Graph) icon.
* The Path Generator outputs the path to an LLM icon (OpenAI logo).
* The LLM outputs the answer.
* **Output:** A box labeled "Answer: Pop" with a red 'X' icon.
3. **Bottom Section: "Our Framework"**
* **Header:** "Our Framework" (in a yellow-highlighted box).
* **Input Question:** A large box contains the full question text: "Question: What's the **music style** of the **album folklore** by **Scott Swift's** daughter?"
* **Two Parallel Components:**
* **Left: "Reasoning Chain by Reasoner"**
* A numbered list outlining a logical, step-by-step reasoning process:
1. "Begin by identifying the daughter of the query entity 'Scott Swift', represented by the intermediate entity 'c'. This step establishes..."
2. "Next, verify that the intermediate entity 'c' has released an album named 'folklore'. This ensures that..."
3. "Finally, determine the music genre of the album 'folklore'. This genre provides the specific music style..."
* **Right: "Knowledge Path by Aligner"**
* A more detailed knowledge graph snippet showing the correct path.
* **Nodes:** `Scott Swift` (blue), `Taylor Swift` (green), `folklore` (blue), `Lover` (gray), `Seven` (gray), `USA` (gray), `pop` (gray), `Indie Folk` (green).
* **Edges & Path:** A highlighted path with numbered steps:
1. `daughter` (orange arrow from Scott to Taylor).
2. `album` (orange arrow from Taylor to folklore).
3. `genre` (orange arrow from folklore to Indie Folk).
* Other edges shown: `nationality` (Scott to USA), `track` (Taylor to Lover, folklore to Seven), `genre` (Taylor to pop).
* **Output:** An arrow from the "Knowledge Path" box points to a "Responser" label, which outputs a box labeled "Answer: Indie Folk" with a green checkmark icon.
### Detailed Analysis
The diagram presents a comparative analysis of methodologies for the same task.
* **Training-free Agent Exploration:** This method uses an LLM agent to explore the knowledge graph step-by-step (`t=1`, `t=2`). The visual shows it correctly identifies Taylor Swift as the daughter (`t=1`, green arrow) but then makes an incorrect jump from Taylor Swift directly to the genre "pop" (`t=2`, red arrow with 'X'), missing the crucial intermediate step of identifying the specific album "folklore". This leads to the wrong answer, "Pop".
* **Training-based Path Generation:** This method uses a dedicated "Path Generator" module to create a reasoning path from the KG. The generated path shown is `Scott -> daughter -> Taylor -> album -> folklore -> style -> Pop`. The error is embedded in the final edge: it incorrectly links the album "folklore" to the genre "Pop" (marked with a red 'X'). This flawed path is fed to an LLM, which outputs the same incorrect answer, "Pop".
* **Our Framework:** This proposed solution decouples the process into two specialized components:
1. **Reasoner:** Generates a high-level, logical reasoning chain (the three-step list) that correctly structures the problem: find the daughter, find her album, find that album's genre.
2. **Aligner:** Grounds this reasoning in the actual knowledge graph, constructing the precise, correct path: `Scott Swift --daughter--> Taylor Swift --album--> folklore --genre--> Indie Folk`. The graph also shows related but irrelevant information (e.g., Taylor's other album "Lover", the track "Seven", her nationality "USA", her general genre "pop") which the framework correctly ignores.
The aligned path is then passed to a "Responser" to generate the final, correct answer: "Indie Folk".
### Key Observations
1. **Error Source:** Both failing methods make a similar logical error: they associate the artist (Taylor Swift) directly with her predominant genre (Pop) instead of tracing the relationship to the specific album in question ("folklore") and then to its unique genre.
2. **Visual Coding:** Colors are used meaningfully. Green indicates correct steps or entities in the final path (Taylor Swift, folklore, Indie Folk). Red indicates errors (the wrong "pop" node and the incorrect edges leading to it). Blue is used for the starting entity (Scott Swift).
3. **Complexity of the KG:** The "Knowledge Path by Aligner" graph reveals the complexity of the underlying data, showing multiple possible paths and relationships. The framework's success lies in selecting the single correct path relevant to the specific question.
4. **Process vs. Output:** The diagram emphasizes that the *process* of reasoning and path construction is as important as the final answer. The "Our Framework" section dedicates equal space to the reasoning chain and the knowledge path.
### Interpretation
This diagram argues for a modular, neuro-symbolic approach to complex question answering over knowledge graphs. It demonstrates that:
* **Pure LLM exploration** (Training-free) can be myopic, taking locally correct steps but missing the global logical structure.
* **Pure path generation** (Training-based) can be brittle, generating plausible but factually incorrect relationships if not properly constrained.
* The proposed **decoupled framework** succeeds by first establishing a correct, high-level reasoning plan (the "what to do") and then meticulously grounding each step in the factual knowledge graph (the "how to do it"). This separation of concerns between logical reasoning and factual retrieval/alignment appears to be the key innovation for handling multi-hop queries where intermediate steps are critical. The diagram serves as a visual proof-of-concept for why this integrated yet modular design is superior for accurate knowledge-grounded reasoning.
</details>
Figure 1: The comparison between our Reason-Align-Respond framework and the existing methods for LLM-based KGQA.
The existing LLM-based KGQA studies broadly fall into two main categories: Training-free Agent Exploration methods (Sun et al., 2024) and Training-based Path Generation methods (Luo et al., 2024a). The former uses LLMs as agents to explore nodes in KGs, while the latter trains generators to retrieve and generate knowledge paths. However, both methods lack the holistic reasoning process that humans typically employ when answering questions. This limitation, combined with the semantic gap between natural language descriptions and structured knowledge graphs, often results in reasoning paths that do not align with human logic, lack semantic relevance, or contain unnecessary noise, as shown in Fig. 1.
On the other hand, the latest deep reasoning LLMs, such as OpenAI-o1 (Zhong et al., 2024) and DeepSeek-R1 (Guo et al., 2025), produce holistic reasoning chains before answering challenging questions, showcasing stronger logical reasoning abilities. However, the reasoning chains obtained through reinforcement learning tend to be verbose, sometimes contain logical fallacies, and can amplify hallucinations (Bao et al., 2025). While our work is not specifically aimed at improving DeepSeek-R1, we believe that using knowledge paths from KGs as constraints on deep reasoning may help alleviate these issues.
The two aforementioned aspects are complementary. For one thing, allowing LLMs to first perform human-like reasoning in KGQA establishes a structured, goal-oriented framework that reduces invalid or noisy KG path exploration. For another, while free-form LLM reasoning tends to increase hallucinations, knowledge paths from KGs provide constraints that keep the reasoning aligned with the graph, thereby reducing hallucinations.
In this work, we introduce Reason-Align-Respond (RAR), a novel framework that systematically integrates three modules (Reasoner, Aligner, and Responser) to align LLM reasoning with knowledge graphs for KGQA. Each of the three modules is a fine-tuned LLM. First, Reasoner performs global reasoning over the question, simulating a human-like thinking process to generate a reasoning chain that guides the LLM's exploration of the KG. Second, Aligner decodes a knowledge path on the KG based on this reasoning chain; each decoding step is guaranteed to correspond to an actual knowledge triple in the graph. Finally, Responser leverages both the reasoning chain and the knowledge path to generate the final answer.
We treat both the reasoning chain $z_{r}$ and the knowledge path $z_{p}$ as latent variables, use a probabilistic model to formalize the distribution of the answer $a$ given the question $q$ and KG ${\mathcal{G}}$, and propose an end-to-end training algorithm to jointly fine-tune the parameters across all three modules. Specifically, we employ the expectation-maximization (EM) algorithm to iteratively optimize model parameters while inferring the latent variables. The E-step estimates the posterior probability of the latent variables ($z_{r}$, $z_{p}$) given the observations ($q$, ${\mathcal{G}}$, $a$) and samples high-quality reasoning chains and aligned knowledge paths accordingly. The M-step maximizes the evidence lower bound (ELBO) of the log-likelihood to update model parameters. Through iterative optimization, the model progressively refines the reasoning chain and knowledge path, which in turn promotes the generation of high-quality responses, ultimately forming a stable closed loop.
We conduct extensive evaluation across multiple benchmarks, including WebQSP, CWQ, CommonsenseQA (CSQA), and MedQA, using Freebase, ConceptNet, and a medical KG as knowledge graphs. The results demonstrate the effectiveness of RAR in improving KGQA performance. Compared to 19 baselines across three categories, RAR achieves state-of-the-art results. Specifically, it achieves Hit scores of 93.3% on WebQSP and 91.0% on CWQ, with corresponding F1 scores of 87.7% and 84.8%, significantly surpassing existing methods. Additionally, RAR demonstrates strong zero-shot generalizability on CSQA and MedQA. The EM algorithm yields gradual performance improvements over iterations and converges stably after 200 steps. Human evaluation of reasoning chains shows that RAR generates high-quality, interpretable chains aligned with KG paths, ensuring accurate and reliable reasoning. We also report results of RAR under different LLM backbone models, and conduct an ablation study that confirms the importance of each component. Notably, RAR achieves this performance while maintaining computational efficiency during inference. These comprehensive results demonstrate that our approach successfully bridges the gap between LLM reasoning and structured knowledge, offering a promising direction for developing more reliable and interpretable question answering systems.
<details>
<summary>x2.png Details</summary>

### Visual Description
## System Architecture Diagram: Iterative Reasoning and Knowledge Graph Alignment for Question Answering
### Overview
The image is a technical flowchart illustrating a machine learning system designed to answer complex questions by generating and refining reasoning chains and knowledge paths. The system uses an iterative process (E-step and M-step) to update its components. The example question used throughout the diagram is: "What's the music style of the album folklore by Scott Swift's daughter?" The final answer is "Indie Folk."
### Components/Axes
The diagram is organized into several interconnected functional blocks, with arrows indicating data flow and update cycles.
**Primary Components:**
1. **Question q** (Top-left, peach-colored box): The input query.
2. **Reasoner p_θ(z_r|q)** (Left, blue box): A module that generates a step-by-step reasoning chain for the question.
3. **Reasoning Chain z_r** (Bottom-left, pink box): The output of the Reasoner, containing a structured, step-by-step thought process.
4. **Aligner p_φ(z_p|G, z_r, q)** (Center, blue box): A module that generates triples (subject, relation, object) in a Knowledge Graph (KG) based on the question and reasoning chain.
5. **Knowledge Path z_p** (Bottom-right, light blue box): The output of the Aligner, visualized as a subgraph from a larger knowledge graph.
6. **Responser p_w(a|z_r, z_p, q)** (Right, blue box): A module that generates the final answer based on the reasoning chain and knowledge path.
7. **Answer a** (Top-right, green box): The final output answer.
**Process Steps:**
* **E-step** (Top-center): Labeled "sample high-quality Reasoning Chains and Knowledge Paths (z_r, z_p) ~ p_{w,φ}((z_r, z_p)|G, q, a)". This step involves sampling or generating candidate reasoning and knowledge components.
* **M-step** (Center, with two downward arrows): Labeled "update Reasoner" and "update Aligner". This step involves updating the parameters of the Reasoner and Aligner modules based on the sampled data.
**Connections & Data Flow:**
* **Prior** (Dashed arrows): Flow from the Reasoner to the Aligner and from the Aligner back to the E-step sampling process.
* **Likelihood** (Dashed arrow): Flow from the Responser back to the E-step sampling process.
* **KG-constrained Decoding** (Arrow from Aligner to Knowledge Path): Indicates the Aligner's output is constrained by the structure of a knowledge graph.
* **Update Responser** (Bidirectional arrows between Answer and Responser): The Responser is updated using the final answer.
* **Reasoning Chain z_r** (Arrow from Reasoning Chain to Responser): The generated reasoning chain is fed as input to the Responser.
### Detailed Analysis
**1. Question & Answer Example:**
* **Input Question (q):** "What's the music style of the album folklore by Scott Swift's daughter?"
* **Output Answer (a):** "Indie Folk"
**2. Reasoning Chain (z_r) Content:**
The pink box details a 3-step reasoning process:
1. "Begin by identifying the daughter of the query entity 'Scott Swift', represented by the intermediate entity 'c'. This step..."
2. "Next, verify that the intermediate entity 'c' has released an album... This ensures..."
3. "Finally, determine the music genre of the album 'folklore'. This genre provides..."
**3. Knowledge Path (z_p) Content:**
The light blue box shows a knowledge graph subgraph with nodes and labeled edges:
* **Nodes (Entities):** Scott Swift, Taylor Swift, Lover, folklore, Seven, USA, pop, Indie Folk.
* **Edges (Relations):**
* `daughter` (Scott Swift → Taylor Swift)
* `nationality` (Scott Swift → USA)
* `track` (Lover → Taylor Swift)
* `album` (Taylor Swift → folklore)
* `genre` (folklore → Indie Folk)
* `genre` (folklore → pop)
* `track` (folklore → Seven)
* **Highlighted Path:** A numbered path (1, 2, 3) is overlaid, tracing: Scott Swift --(1:daughter)--> Taylor Swift --(2:album)--> folklore --(3:genre)--> Indie Folk.
**4. Sampled High-Quality Pair (E-step Output):**
The central box shows an example of a sampled reasoning chain and knowledge path pair:
* **Reasoning Chain:** ``
* **Knowledge Path:** `<ALIGN> (Scott Swift, daughter, Taylor Swift), (Taylor Swift, album, folklore), (folklore, genre, Indie Folk) </ALIGN>`
### Key Observations
1. **Iterative Learning Loop:** The system has a clear feedback loop. The E-step samples potential solutions, and the M-step uses them to update the core reasoning (Reasoner) and alignment (Aligner) modules. The Responser is also updated.
2. **Dual-Component Reasoning:** The system explicitly separates *procedural reasoning* (the step-by-step "how to think" in the Reasoning Chain) from *declarative knowledge* (the factual triples in the Knowledge Path).
3. **Knowledge Graph Constraint:** The Aligner's output is not free-form; it is constrained to generate valid triples from a knowledge graph (KG-constrained Decoding), ensuring factual grounding.
4. **Example Specificity:** The entire diagram is illustrated with a single, concrete example about Taylor Swift's album, making the abstract process tangible.
5. **Symbolic Notation:** The modules are defined with probabilistic notation (e.g., p_θ(z_r|q)), indicating they are parameterized models (likely neural networks) trained to perform their specific tasks.
### Interpretation
This diagram depicts a sophisticated neuro-symbolic AI architecture for complex question answering. It moves beyond simple retrieval or language modeling by:
* **Explicit Reasoning:** Forcing the model to articulate its reasoning steps in a structured chain (`z_r`), which improves interpretability and allows for verification.
* **Knowledge Grounding:** Tying the reasoning process to verifiable facts from a knowledge graph (`z_p`), reducing hallucination and improving factual accuracy.
* **Joint Optimization:** Using an Expectation-Maximization (EM) style framework (E-step/M-step) to jointly optimize the reasoning and knowledge alignment components. The system learns not just to answer, but to *reason effectively* and *align its reasoning with external knowledge*.
Read abductively, the system seeks the best explanation (the reasoning chain and knowledge path) that accounts for the question and the available knowledge, ultimately leading to the answer. Outliers or anomalies would be cases where the knowledge graph is incomplete or the reasoning chain leads to a dead end, which the iterative update process aims to correct over time. The core innovation is the tight, iterative coupling of a reasoning module with a knowledge-grounded module, mediated by a response generator, all within a unified probabilistic framework.
</details>
Figure 2: Illustration of our RAR framework comprising Reasoner, Aligner, Responser with iterative EM optimization.
## 2 Approach
Knowledge graphs (KGs) contain extensive factual knowledge in the form of triples: ${\mathcal{G}}=\{(e,r,e^{\prime})|e,e^{\prime}\in{\mathcal{E}},r\in{\mathcal{R}}\}$ , where ${\mathcal{E}}$ and ${\mathcal{R}}$ denote sets of entities and relations, respectively. The goal of knowledge graph question answering (KGQA) is to retrieve relevant facts from a KG ${\mathcal{G}}$ and generate an answer $a$ in response to a given natural language question $q$ .
### 2.1 Task Formalization
We introduce two latent variables, a reasoning chain $z_{r}$ and a knowledge path $z_{p}$ , working together to answer the question $q$ based on a KG ${\mathcal{G}}$ :
- Reasoning Chain $z_{r}$ denotes a chain of discrete reasoning steps expressed in natural language, working together to address the question $q$ .
- Knowledge Path $z_{p}$ denotes an interconnected path of knowledge triples extracted from a KG ${\mathcal{G}}$ .
Since neither $z_{r}$ nor $z_{p}$ are explicitly annotated in existing KGQA benchmarks, we treat them as latent variables, and employ a probabilistic model to formalize the distribution of the answer $a$ conditioned on the question $q$ and KG ${\mathcal{G}}$ as:
$$
p_{w,\phi,\theta}(a|{\mathcal{G}},q)=\sum_{z_{r},z_{p}}p_{w}(a|z_{r},z_{p},q)\,p_{\phi}(z_{p}|{\mathcal{G}},z_{r},q)\,p_{\theta}(z_{r}|q), \tag{1}
$$
where we assume $a$ is conditionally independent of ${\mathcal{G}}$ given $(z_{r},z_{p},q)$, allowing factorization and summation over these conditional probabilities. On this basis, we introduce our framework, Reason-Align-Respond (RAR), which integrates three modules (Reasoner, Aligner, and Responser) to align LLM reasoning with knowledge graphs for KGQA, as illustrated in Fig. 2.
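Concretely, Eq. (1) marginalizes the answer probability over all latent pairs. A minimal numeric sketch of this marginalization, with made-up discrete latents and probabilities (none of the values come from the paper):

```python
# Toy numeric illustration of the factorization in Eq. (1):
# p(a|G,q) = sum_{z_r,z_p} p_w(a|z_r,z_p,q) * p_phi(z_p|G,z_r,q) * p_theta(z_r|q).
# The latent chains/paths and all probabilities below are made up.

p_theta = {"chain1": 0.7, "chain2": 0.3}        # Reasoner: p(z_r | q)
p_phi = {                                        # Aligner: p(z_p | G, z_r, q)
    "chain1": {"pathA": 0.8, "pathB": 0.2},
    "chain2": {"pathA": 0.1, "pathB": 0.9},
}
p_w = {                                          # Responser: p(a | z_r, z_p, q)
    ("chain1", "pathA"): 0.95,
    ("chain1", "pathB"): 0.40,
    ("chain2", "pathA"): 0.50,
    ("chain2", "pathB"): 0.10,
}

def answer_prob():
    """Marginalize the answer probability over all (z_r, z_p) pairs."""
    return sum(p_w[(z_r, z_p)] * pp * pr
               for z_r, pr in p_theta.items()
               for z_p, pp in p_phi[z_r].items())

assert abs(answer_prob() - 0.63) < 1e-9
```

In the full framework this sum is intractable (the latents are free-form text), which is why the EM procedure of Sec. 2.2 works with sampled latents instead of exact enumeration.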
It is worth noting that each of the three modules is a fine-tuned LLM, with parameters denoted as $\theta$, $\phi$, and $w$, respectively. Each uses the prompts shown in App. F to generate reasoning chains, knowledge paths, and answers.
Firstly, Reasoner $p_{\theta}(z_{r}|q)$ generates a latent reasoning chain $z_{r}$ to address the question $q$ . The Reasoner generates a reasoning chain in the following format:
$$
z_{r}=\texttt{<THINK>}s_{1}\dots s_{t}\texttt{</THINK>},
$$
where <THINK> and </THINK> are special tokens denoting the start and end of the reasoning process, and each $s_{i}$ is a discrete reasoning step.
Secondly, Aligner $p_{\phi}(z_{p}|{\mathcal{G}},z_{r},q)$ takes the question $q$ and the reasoning chain $z_{r}$ as inputs to explore KG ${\mathcal{G}}$, producing a latent knowledge path $z_{p}$ that aligns with $z_{r}$. Aligner processes a prompt containing $q$ and $z_{r}$ to generate a knowledge path in the following format:
$$
z_{p}=\texttt{<ALIGN>}\pi_{1}\dots\pi_{t}\texttt{</ALIGN>},
$$
where <ALIGN> and </ALIGN> mark the beginning and end of the knowledge path. Each $\pi_{i}$ is a triple from the KG ${\mathcal{G}}$ formatted as $\pi_{i}=\texttt{<TRI>}(e_{i}^{h},r_{i},e_{i}^{t})\texttt{</TRI>}$ , where <TRI> and </TRI> bound the triple.
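The triple-sequence serialization above is straightforward to round-trip. A small sketch of formatting and parsing the `<ALIGN>`/`<TRI>` format (the exact whitespace and tokenization used by the paper's implementation may differ; this is illustrative):

```python
import re

# Sketch of serializing/parsing the Knowledge Path format
# z_p = <ALIGN><TRI>(h,r,t)</TRI>...</ALIGN> described above.

def format_path(triples):
    """Serialize a list of (head, relation, tail) triples into z_p."""
    body = "".join(f"<TRI>({h},{r},{t})</TRI>" for h, r, t in triples)
    return f"<ALIGN>{body}</ALIGN>"

def parse_path(z_p):
    """Recover the triples from a serialized Knowledge Path."""
    return [tuple(m.groups())
            for m in re.finditer(r"<TRI>\(([^,]+),([^,]+),([^)]+)\)</TRI>", z_p)]

path = [("Scott Swift", "daughter", "Taylor Swift"),
        ("Taylor Swift", "album", "folklore"),
        ("folklore", "genre", "Indie Folk")]
assert parse_path(format_path(path)) == path
```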
Finally, Responser $p_{w}(a|z_{r},z_{p},q)$ generates the final answer $a$ by synthesizing the question $q$ , reasoning chain $z_{r}$ , and knowledge path $z_{p}$ .
### 2.2 Optimization via the EM Algorithm
As mentioned above, each of the three modules is a fine-tuned LLM. In this section, we propose an end-to-end training algorithm to jointly optimize the parameters across all three modules. The training objective is the likelihood of the distribution $p_{w,\phi,\theta}(a|{\mathcal{G}},q)$ in Eq. (1). Since our model involves latent variables $z_{r}$ and $z_{p}$, we adopt the Expectation-Maximization (EM) algorithm, a principled approach for maximum likelihood estimation (MLE) in latent-variable models (Dempster et al., 1977; Sen et al., 2022; Qu et al., 2021).
Unifying Reasoner and Aligner. In practice, we unify Reasoner and Aligner by merging their latent variables $z_{r}$ and $z_{p}$ into a single one, $z=(z_{r},z_{p})$. This results in a consolidated module, referred to as ReAligner, whose parameters are denoted by $\psi$. Hence, instead of generating $z_{r}$ and $z_{p}$ separately, ReAligner simultaneously outputs both a Reasoning Chain and a Knowledge Path, treating their generation as a single instruction-tuning task with a prompt template (see App. F). With this simplification, the conditional distribution for the final answer $a$ is:
$$
p_{w,\psi}(a|{\mathcal{G}},q)=\sum_{z}p_{w}(a|q,z)p_{\psi}(z|{\mathcal{G}},q), \tag{2}
$$
where ${\mathcal{G}}$ denotes the KG, $q$ the question, and $z$ the unified latent variable (combining the Reasoning Chain and Knowledge Path).
Learning Objective. We aim to learn the parameters $(w,\psi)$ by maximizing the log-likelihood of the training data with respect to Eq. (2), written as:
$$
\max_{w,\psi}\;\mathcal{O}(w,\psi)=\log\mathbb{E}_{z\sim p_{\psi}(z|{\mathcal{G}},q)}\left[p_{w}(a|q,z)\right]. \tag{3}
$$
According to Jensen's inequality, we have:
$$
\mathcal{O}(w,\psi)\geq\underbrace{\mathbb{E}_{q(z)}\log\frac{p_{w}(a|q,z)\,p_{\psi}(z|{\mathcal{G}},q)}{q(z)}}_{{\mathcal{L}}_{\text{ELBO}}}, \tag{4}
$$
where $q(z)$ is a variational distribution. Equality holds when $q(z)=p_{w,\psi}(z|{\mathcal{G}},q,a)$ , the true posterior of $z$ . The term ${\mathcal{L}}_{\text{ELBO}}$ is the Evidence Lower BOund (ELBO), and maximizing ${\mathcal{L}}_{\text{ELBO}}$ indirectly maximizes $\mathcal{O}(w,\psi)$ .
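The bound in Eq. (4) can be checked numerically on a toy discrete latent: any variational $q(z)$ yields a value no greater than the log-likelihood, and the true posterior attains it exactly. A small sketch (all probabilities are made up for illustration):

```python
import math

# Numeric check of Eq. (4): for any variational q(z), the ELBO lower-bounds
# log E_{z~p_psi}[p_w(a|q,z)], with equality at the true posterior.
# Toy prior and likelihood over two discrete latents:
p_psi = {"z1": 0.6, "z2": 0.4}          # prior p_psi(z|G,q)
p_w   = {"z1": 0.9, "z2": 0.2}          # likelihood p_w(a|q,z)

log_lik = math.log(sum(p_w[z] * p_psi[z] for z in p_psi))

def elbo(q):
    """L_ELBO = E_{q(z)} log [p_w(a|q,z) p_psi(z|G,q) / q(z)]."""
    return sum(q[z] * math.log(p_w[z] * p_psi[z] / q[z]) for z in q)

# An arbitrary variational distribution gives a strict lower bound:
assert elbo({"z1": 0.5, "z2": 0.5}) < log_lik

# The true posterior attains the bound (equality in Jensen's inequality):
Z = sum(p_w[z] * p_psi[z] for z in p_psi)
posterior = {z: p_w[z] * p_psi[z] / Z for z in p_psi}
assert abs(elbo(posterior) - log_lik) < 1e-12
```

This is exactly the gap the E-step closes: setting $q(z)$ to the posterior makes the ELBO tight before the M-step re-maximizes it over the parameters.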
EM Algorithm. The EM algorithm alternates between an E-step and an M-step until convergence:
E-step.
Given current parameters $(w^{(t)},\psi^{(t)})$ at iteration $t$ , E-step updates the variational distribution $q^{(t)}(z)$ by minimizing $\mathrm{KL}(q(z)||p_{w^{(t)},\psi^{(t)}}(z|{\mathcal{G}},q,a))$ . The solution is the posterior of $z$ under the current parameters:
$$
q^{(t)}(z)=p_{w^{(t)},\psi^{(t)}}(z|{\mathcal{G}},q,a). \tag{5}
$$
M-step.
Keeping $q^{(t)}(z)$ fixed, M-step maximizes ${\mathcal{L}}_{\text{ELBO}}$ in Eq. (4) with respect to $w$ and $\psi$ . Ignoring terms that do not depend on $(w,\psi)$ , the objective reduces to:
$$
\begin{aligned}
Q(w,\psi|w^{(t)},\psi^{(t)})&=\sum_{({\mathcal{G}},q,a)}\sum_{z}q^{(t)}(z)\,\log\!\left[p_{w}(a|q,z)\,p_{\psi}(z|{\mathcal{G}},q)\right]\\
&=\underbrace{\sum_{({\mathcal{G}},q,a)}\sum_{z}q^{(t)}(z)\log p_{w}(a|q,z)}_{Q_{\text{Responser}}(w)}+\underbrace{\sum_{({\mathcal{G}},q,a)}\sum_{z}q^{(t)}(z)\log p_{\psi}(z|{\mathcal{G}},q)}_{Q_{\text{ReAligner}}(\psi)},
\end{aligned} \tag{6}
$$
which naturally decomposes into the instruction-tuning objectives for Responser and ReAligner, respectively.
By iterative optimization with the EM algorithm, our framework progressively refines its understanding of the question. This iterative process gradually corrects any flaws in Reasoning Chains and Knowledge Paths, leading to answers that are both higher in quality and more interpretable, while significantly reducing the risk of hallucination.
The workflow of the EM algorithm is shown in Alg. 1, with more details in practice in App. A.
Algorithm 1 The EM algorithm in RAR
while not converged do
For each instance, sample $N$ Reasoning Chains and Knowledge Paths $z_{I}$ from ReAligner $p_{\psi}$.
For each instance, update Responser $p_{w}$ with $Q_{\text{Responser}}(w)$ in Eq. (6) using $z_{I}$.
$\boxdot$ E-step: for each instance, identify $K$ high-quality Reasoning Chains and Knowledge Paths $z_{I}^{h}$ from $z_{I}$ based on Eq. (5).
$\boxdot$ M-step: for each instance, update ReAligner $p_{\psi}$ according to $Q_{\text{ReAligner}}(\psi)$ in Eq. (6) using $z_{I}^{h}$.
end while
### 2.3 Techniques During Inference
During inference, given $q$ , Reasoner generates $z_{r}$ , while Aligner produces $z_{p}$ . Responser synthesizes them to produce $a$ . To enhance performance, we introduce three additional key techniques.
Table 1: Performance comparison with different baselines on the two KGQA datasets.
| Types | Methods | WebQSP Hit | WebQSP F1 | CWQ Hit | CWQ F1 |
|---|---|---|---|---|---|
| LLM Reasoning | Qwen2-7B (Yang et al., 2024) | 50.8 | 35.5 | 25.3 | 21.6 |
| | Llama-2-7B (Touvron et al., 2023) | 56.4 | 36.5 | 28.4 | 21.4 |
| | Llama-3.1-8B (Meta, 2024) | 55.5 | 34.8 | 28.1 | 22.4 |
| | GPT-4o-mini (OpenAI, 2024a) | 63.8 | 40.5 | 63.8 | 40.5 |
| | ChatGPT (OpenAI, 2022) | 59.3 | 43.5 | 34.7 | 30.2 |
| | ChatGPT+Few-shot (Brown et al., 2020) | 68.5 | 38.1 | 38.5 | 28.0 |
| | ChatGPT+CoT (Wei et al., 2022b) | 73.5 | 38.5 | 47.5 | 31.0 |
| | ChatGPT+Self-Consistency (Wang et al., 2024) | 83.5 | 63.4 | 56.0 | 48.1 |
| Graph Reasoning | GraftNet (Sun et al., 2018) | 66.7 | 62.4 | 36.8 | 32.7 |
| | NSM (He et al., 2021) | 68.7 | 62.8 | 47.6 | 42.4 |
| | SR+NSM (Zhang et al., 2022) | 68.9 | 64.1 | 50.2 | 47.1 |
| | ReaRev (Mavromatis and Karypis, 2022) | 76.4 | 70.9 | 52.9 | 47.8 |
| KG+LLM | KD-CoT (Wang et al., 2023a) | 68.6 | 52.5 | 55.7 | - |
| | EWEK-QA (Dehghan et al., 2024) | 71.3 | - | 52.5 | - |
| | ToG (GPT-4) (Sun et al., 2024) | 82.6 | - | 68.5 | - |
| | EffiQA (Dong et al., 2025) | 82.9 | - | 69.5 | - |
| | RoG (Llama-2-7B) (Luo et al., 2024b) | 85.7 | 70.8 | 62.6 | 56.2 |
| | GNN-RAG+RA (Mavromatis and Karypis, 2024) | 90.7 | 73.5 | 68.7 | 60.4 |
| | GCR (Llama-3.1-8B + GPT-4o-mini) (Luo et al., 2024c) | 92.2 | 74.1 | 75.8 | 61.7 |
| | **RAR (Llama-3.1-8B + GPT-4o-mini)** | **93.3** | **87.7** | **91.0** | **84.8** |
KG-constrained Decoding. KG-constrained Decoding aims to prevent hallucinated triples that do not exist in the KG. When generating the Knowledge Path, Aligner may inadvertently produce triples absent from the KG. To address this, KG-constrained Decoding restricts the output tokens so that only tokens forming valid KG triples can be produced. In this way, the generated Knowledge Path strictly aligns with actual entities and relations in the KG. Related work (Luo et al., 2024c; Li et al., 2024) also attempts to mitigate similar issues; our approach is tailored specifically to our framework.
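The idea can be sketched at triple granularity: at each step, the only admissible continuations are edges that actually leave the current entity in the KG. (The paper's implementation constrains decoding at the token level; the toy KG, the `choose` callback, and the helper names below are illustrative.)

```python
# Sketch of KG-constrained decoding at triple granularity: the Aligner may
# only emit a triple that exists in the KG and continues the current path,
# so hallucinated triples are unreachable by construction. Toy KG below.

KG = {
    ("Scott Swift", "daughter", "Taylor Swift"),
    ("Scott Swift", "nationality", "USA"),
    ("Taylor Swift", "album", "folklore"),
    ("Taylor Swift", "genre", "pop"),
    ("folklore", "genre", "Indie Folk"),
}

def allowed_triples(current_entity):
    """Valid next triples: only edges leaving the current entity."""
    return {t for t in KG if t[0] == current_entity}

def constrained_decode(start, choose, max_hops=3):
    """Greedily build a path, letting `choose` (the model) pick only
    among KG-valid continuations."""
    path, entity = [], start
    for _ in range(max_hops):
        candidates = allowed_triples(entity)
        if not candidates:
            break
        triple = choose(candidates)
        path.append(triple)
        entity = triple[2]           # hop to the tail entity
    return path

# A stand-in "model" that prefers relations suggested by a reasoning chain:
prefer = {"daughter": 0, "album": 1, "genre": 2, "nationality": 9}
path = constrained_decode("Scott Swift",
                          lambda c: min(c, key=lambda t: prefer[t[1]]))
assert path == [("Scott Swift", "daughter", "Taylor Swift"),
                ("Taylor Swift", "album", "folklore"),
                ("folklore", "genre", "Indie Folk")]
```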
Knowledge Path Expansion. Knowledge Path Expansion addresses the potential incompleteness of initially generated Knowledge Paths. To illustrate, consider a question about countries that share borders with the United States, for which Aligner generates a Knowledge Path containing a single bordering country. While correct, this path represents only one instance of a broader pattern. By abstracting the specific instance into a template:
$$
\texttt{<ALIGN><TRI>(US,borders,?country)</TRI></ALIGN>},
$$
where $?country$ is a variable, we capture the fundamental relationship structure. Applied to the KG, this template retrieves all valid instances, such as:
$$
\texttt{<ALIGN><TRI>(US,borders,Canada)</TRI></ALIGN>}.
$$
This method transforms a single Knowledge Path into a comprehensive query template, enabling more complete and exhaustive answers.
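A minimal sketch of this expansion step, assuming triples are (subject, relation, object) tuples and the KG is a plain list of such tuples (both illustrative simplifications, not the paper's data structures):

```python
# Sketch of Knowledge Path Expansion: abstract one slot of a concrete triple
# into a variable, then retrieve every KG triple matching the template.

def expand_path(triple, kg, var_position=2):
    """Replace one slot of (subject, relation, object) with a variable and
    return all KG triples matching the resulting template."""
    template = list(triple)
    template[var_position] = None  # e.g. (US, borders, ?country)
    return [t for t in kg
            if all(tv is None or tv == v for tv, v in zip(template, t))]


kg = [("US", "borders", "Canada"),
      ("US", "borders", "Mexico"),
      ("US", "capital", "Washington")]

# Expanding the single path Aligner produced retrieves all border triples.
matches = expand_path(("US", "borders", "Canada"), kg)
```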
LLM-driven Consolidation. LLM-driven Consolidation addresses the inconsistencies and noise that emerge when sampling multiple Reasoning Chains and Knowledge Paths. Multiple sampling increases coverage and improves the likelihood of correct answers, but inevitably introduces noise and conflicts between samples. To address this challenge, we propose using a powerful LLM as a "Consolidator" that analyzes and integrates multiple Reasoning Chains and Knowledge Paths to derive final answers, following the prompt template detailed in App. F. This approach preserves the benefits of multiple sampling while leveraging the LLM's analytical capabilities to produce reliable answers.
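As a rough illustration of how the sampled candidates might be packed into a single consolidation prompt (the actual template is given in App. F; the wording below is a placeholder, not the paper's prompt):

```python
# Sketch of assembling a consolidation prompt from multiple sampled
# Reasoning Chains and Knowledge Paths. The prompt wording is illustrative.

def build_consolidation_prompt(question, chains, paths):
    lines = [f"Question: {question}", "",
             "Candidate reasoning chains and knowledge paths:"]
    for i, (chain, path) in enumerate(zip(chains, paths), 1):
        lines.append(f"[Sample {i}] Chain: {chain}")
        lines.append(f"[Sample {i}] Path: {path}")
    lines.append("")
    lines.append("Analyze the samples, discard inconsistent ones, "
                 "and give the final answer.")
    return "\n".join(lines)
```

The returned string would then be sent to the consolidator LLM (GPT-4o-mini in our experiments).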
## 3 Experiment
### 3.1 Experiment Settings
Datasets. Following previous research (Luo et al., 2024c; Sun et al., 2024), we evaluate our model on three datasets: WebQuestionsSP (WebQSP) (Yih et al., 2016), Complex WebQuestions (CWQ) (Talmor and Berant, 2018), and CommonsenseQA (CSQA) (Talmor et al., 2019). The first two use Freebase (Bollacker et al., 2008) as the KG, while CSQA uses ConceptNet (Speer et al., 2017), allowing us to assess generalizability to an unseen KG.
Baselines. We compare RAR with 19 baselines across three categories: LLM reasoning methods, graph reasoning methods, and KG-enhanced LLM reasoning methods.
Evaluation Metrics. For WebQSP and CWQ, we adopt Hit and F1 as evaluation metrics. Hit checks whether the generated predictions match any correct answer, while F1 evaluates overall answer coverage by balancing precision and recall. For CSQA, a multiple-choice QA dataset, we use accuracy as the evaluation metric.
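The two metrics can be sketched as follows, assuming predictions and gold answers are collections of normalized answer strings (an illustrative simplification):

```python
# Sketch of the Hit and F1 metrics described above.

def hit(pred, gold):
    """Hit = 1 if any predicted answer matches a gold answer, else 0."""
    return int(bool(set(pred) & set(gold)))


def f1(pred, gold):
    """F1 balances precision (correct / predicted) and recall (correct / gold)."""
    if not pred or not gold:
        return 0.0
    correct = len(set(pred) & set(gold))
    if correct == 0:
        return 0.0
    p = correct / len(set(pred))
    r = correct / len(set(gold))
    return 2 * p * r / (p + r)
```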
Implementations. Our implementation uses Llama-3.1-8B (Meta, 2024) as the backbone for Reasoner, Aligner, and Responser. To enhance question decomposition, we pretrain both Reasoner and Aligner on 2,000 exemplars demonstrating step-by-step KG-based problem-solving. For each component, we generate top-10 candidates using KG-constrained Decoding and Knowledge Path Expansion, with GPT-4o-mini handling LLM-driven Consolidation. Details are provided in App. C.
### 3.2 Main Results
Tab. 1 shows that RAR achieves significant gains on both WebQSP and CWQ datasets, improving the state-of-the-art by 13.6% and 23.1% in F1, and by 1.1% and 15.2% in Hit respectively. These remarkable improvements demonstrate the effectiveness of integrating structured KG knowledge with LLM reasoning.
Table 2: Impact of different components on CWQ.
| Variant | F1 | Precision | Recall |
|---|---|---|---|
| RAR | 84.8 | 84.9 | 89.0 |
| w/o LC | 83.0 | 81.8 | 91.2 |
| w/o Reasoner | 80.3 | 75.6 | 94.7 |
| w/o KD | 48.9 | 54.3 | 50.8 |
| w/o KPE | 71.7 | 80.2 | 71.7 |
### 3.3 Ablation Study
As shown in Tab. 2, removing LLM-driven Consolidation (LC) lowers precision but increases recall, since LC aims to eliminate noisy predictions. Excluding Reasoner causes a pronounced drop in precision but a rise in recall, indicating that Reasoning Chains guide Aligner to explore the KG more accurately, reducing hallucination and noise. Disabling Knowledge Path Expansion (KPE) diminishes performance, confirming its role in enriching Knowledge Paths. Most importantly, removing KG-constrained Decoding (KD) yields the largest performance decrease, underscoring the importance of restricting generation to valid KG paths.
### 3.4 Further Analyses
<details>
<summary>x3.png Details</summary>

Line chart of Precision, Recall, and F1 (y-axis roughly 45-75) against EM iteration steps (x-axis: 20, 60, 100, 200, 300). All three metrics rise steadily (Precision from ~65 to ~76, F1 from ~54 to ~65, Recall from ~52 to ~63), with the largest gains before 100 steps and a clear plateau after roughly 200 steps. Precision remains highest throughout, with F1 between Precision and Recall.
</details>
Figure 3: Impact of iteration steps of the EM algorithm.
Impact of Iteration Steps. As shown in Fig. 3, RAR exhibits consistent improvement across all metrics during EM updates. The performance rises rapidly in early stages and continues to refine over iterations, eventually reaching convergence with minimal fluctuations.
<details>
<summary>x4.png Details</summary>

Bar chart (Correctness, light blue bars) with a line overlay (Alignment Ratio, salmon markers) for Llama-3.1-8B, GPT-4o, and RAR; y-axis 40-100. Correctness rises from ~82 (Llama-3.1-8B) to ~93 (GPT-4o) to ~97 (RAR), while Alignment Ratio stays low for the baselines (~50 and ~55) and jumps to ~98 for RAR, converging with its Correctness score.
</details>
Figure 4: Human evaluation of reasoning chains on CWQ.
Table 3: Efficiency and performance of RAR compared to different methods on CWQ.
| Type | Method | Hit | Avg. Runtime (s) |
|---|---|---|---|
| Path Generation | GNN-RAG | 66.8 | 1.73 |
| | RoG | 62.6 | 2.68 |
| | GCR | 75.8 | 3.72 |
| Agent Exploration | ToG | 68.5 | 18.89 |
| | EffiQA | 69.5 | - |
| Ours | RAR | 91.0 | 4.38 |
Quality of Reasoning Chains. Through manual evaluation of 500 randomly selected samples, we assess the quality of Reasoning Chains using two criteria: reasoning correctness and KG alignment. The correctness metric evaluates whether the Reasoning Chain successfully solves the given question, while the alignment metric measures how well the reasoning steps correspond to valid KG paths. As shown in Fig. 4, RAR substantially outperforms both GPT-4o and baseline methods across both metrics. These results demonstrate that by aligning Reasoning Chains to KG structures, our approach not only improves the reliability of the reasoning process but also enhances its interpretability.
<details>
<summary>x5.png Details</summary>

Line chart of Precision, Recall, and F1 (y-axis 65-80) against beam size (x-axis: 1, 3, 5, 10). Precision climbs from ~77 at beam size 1 to ~80 by beam size 4-5 and then plateaus, while Recall (~63 to ~72) and F1 (~66 to ~72) keep improving across the whole range, with Recall and F1 converging at ~72 at beam size 10.
</details>
Figure 5: Impact of different beam size on CWQ.
KG-constrained Decoding Effectiveness. We examine the effectiveness of KG-constrained Decoding in mitigating hallucinations and maintaining efficiency. Our method achieves zero hallucinations in Knowledge Paths when answers are correct, while without constraints, even correct answers show a 44% hallucination rate. The efficiency evaluation reveals minimal computational overhead. Particularly noteworthy is the comparison with GCR, the previous state-of-the-art method using KG constraints. Our approach achieves a 15.2% improvement in answer accuracy over GCR, with only a marginal increase in runtime from Reasoning Chain generation. This modest overhead is well justified by the significant gains in answer reliability and interpretability.
Table 4: Impact of using different LLM backbones for Reasoner, Aligner and LLM-driven Consolidation.
| Component | Variant | Hit | F1 |
|---|---|---|---|
| Reasoner and Aligner | Llama-2-7B | 84.0 | 68.9 |
| | Llama-3.1-8B | 85.2 | 72.6 |
| | Qwen2-0.5B | 71.3 | 56.0 |
| | Qwen2-1.5B | 72.0 | 56.6 |
| | Qwen2-7B | 81.4 | 67.1 |
| LLM-driven Consolidation | GPT-4o-mini | 91.0 | 84.8 |
| | GPT-4o | 92.8 | 84.9 |
| | Qwen2-7B | 88.7 | 82.2 |
| | Llama-3.1-8B | 90.6 | 83.7 |
| | Llama-3.1-70B | 92.4 | 83.0 |
Impact of Beam Size. As shown in Fig. 5, increasing beam size allows RAR to explore more potential Reasoning Chains and Knowledge Paths, leading to improved performance across all metrics. This demonstrates that examining multiple candidate solutions helps identify better Reasoning Chains and Knowledge Paths for responses of higher quality.
Impact of Different LLM Backbones. Tab. 4 demonstrates that larger LLMs generally achieve better performance, with Llama-3.1-8B and GPT-4o delivering the strongest results for the Reasoner/Aligner backbone and LLM-driven Consolidation, respectively.
Zero-shot Generalizability to Unseen KGs.
Table 5: Zero-shot transferability to unseen KG.
| Model | CSQA | MedQA |
|---|---|---|
| GPT-4o-mini | 91 | 75 |
| GCR | 94 | 79 |
| RAR | 94 | 80 |
Following Luo et al. (2024c), we evaluate RAR's zero-shot transfer capabilities on CSQA (Talmor et al., 2019) and MedQA (Jin et al., 2021). The results show that RAR achieves superior zero-shot performance compared to GPT-4o-mini on both datasets. Notably, RAR achieves comparable performance to GCR, as both methods leverage the KG to constrain the decoding process and enhance reasoning without requiring additional training on the target KG.
### 3.5 Case study
<details>
<summary>x6.png Details</summary>

Four worked examples, each showing the question, the Reasoning Chain from Reasoner (a `<THINK>` block), the Knowledge Path from Aligner (`<TRIPLE>` elements), a KG-constrained decoding diagram (answer nodes in green, "undecodable" nodes excluded from the path), and the final answer from Responser:

1. *What did dr josef mengele do?* → Physician, via `(Josef Mengele, people.person.profession, Physician)`.
2. *What type of Claude Debussy music appears in the film Black Tights?* → Ballet, via `(Claude Debussy, music.artist.genre, Ballet)` and `(Ballet, film.film_genre.films_in_this_genre, Black Tights)`.
3. *What high school did the artist who recorded "Girl Tonight" attend?* → Petersburg High School, via `(Trey Songz, music.featured_artist.recordings, Girl Tonight)`, `(Trey Songz, people.person.education-education.education.institution, Petersburg High School)`, and `(Petersburg High School, education.educational_institution.school_type, High school)`.
4. *Where did Tennessee Williams attend college, and has an organization headquarters located in New York City?* → The New School, via `(Tennessee Williams, people.person.education-education.education.institution, The New School)`, `(The New School, common.topic.notable_types, College/University)`, and `(The New School, organization.organization.headquarters-location.mailing_address.citytown, New York City)`.

Path length and diagram complexity grow with question complexity, from one hop in Example 1 to three hops in Examples 3 and 4.
</details>
Figure 6: Examples of Reasoning Chains and Knowledge Paths generated by RAR under different iteration steps.
Fig. 6 illustrates qualitative results from our RAR framework. To investigate the effect of EM iterations, in Fig. 7 we also present two representative cases that demonstrate the evolution of RAR's behavior across iterations of the EM algorithm. These cases showcase distinct improvements in RAR's ability to generate Reasoning Chains and Knowledge Paths. In Case 1, we examine RAR's response to the query "What high school did the artist who recorded 'Girl Tonight' attend?" The early-stage model exhibits suboptimal verification behavior, directly searching for high school information instead of following the proper verification sequence: first identifying educational institutions, then verifying that a specific institution is a high school. This Reasoning Chain prevents proper alignment with the Knowledge Path in the KG. In contrast, the later-stage model generates both Reasoning Chains and Knowledge Paths effectively, producing a higher-quality Reasoning Chain that successfully aligns with the Knowledge Path in the KG. Case 2 examines RAR's handling of "What type of Claude Debussy music appears in the film Black Tights?" Here, we observe a different pattern of improvement. While the early-stage model generates the same Reasoning Chain as the later-stage model, it fails to produce Knowledge Paths that fully align with and reflect this reasoning, resulting in a misaligned Knowledge Path that does not lead to the correct answer. The later-stage model maintains consistency between Reasoning Chains and Knowledge Paths, thus arriving at the correct answer. These cases validate the effectiveness of the EM approach.
## 4 Related Work
LLM Reasoning. Recent progress in LLMs has spurred extensive research on improving deep reasoning. One line of work focuses on prompting-based methods, which elicit intermediate reasoning steps during inference, such as Chain-of-Thought (CoT) prompting (Wei et al., 2022a). Building on CoT, self-consistency (Wang et al., 2023b) generates multiple reasoning traces and selects the most consistent answer. Tree-of-Thought (Yao et al., 2023) explores branching steps in a tree structure to uncover optimal solutions. Beyond prompting, researchers have investigated fine-tuning LLMs on reasoning tasks (Yu et al., 2022; Hoffman et al., 2023), including reinforcement learning methods (OpenAI, 2024b; Guo et al., 2025) that encourage more complex multi-step reasoning before arriving at a final answer.
KG-enhanced LLM Reasoning. Despite their remarkable performance, LLMs still face limitations such as incomplete or outdated domain knowledge, interpretability challenges, and the potential for hallucinations. To address these issues, a growing body of work has focused on integrating LLMs with KGs (Pan et al., 2024). KD-CoT (Wang et al., 2023a) enhances CoT by retrieving relevant facts from external KGs, guiding LLMs with more reliable information. RoG (Luo et al., 2024b) employs a plan-retrieve-reason pipeline that explicitly fetches KG-based reasoning paths to ground the final answer. GCR (Luo et al., 2024c) further mitigates hallucinations by enforcing graph-constrained decoding, ensuring that every reasoning step aligns with valid KG connections. GNN-RAG (Mavromatis and Karypis, 2024) leverages a graph neural network for effective KG retrieval, while StructGPT (Jiang et al., 2023) and ToG (Sun et al., 2024) treat the LLM as an agent that navigates the KG, assembling multi-hop paths to produce more trustworthy answers.
## 5 Conclusion
In this paper, we present RAR, a novel framework that integrates LLM reasoning with knowledge graphs for KGQA through three key components: Reasoner, Aligner, and Responser. We formulate this process as a probabilistic model and optimize it using the Expectation-Maximization algorithm. Through extensive experiments on multiple benchmarks, we demonstrate that RAR achieves state-of-the-art performance while maintaining interpretability and computational efficiency. These results show that RAR successfully bridges the gap between LLM reasoning and structured knowledge, offering a promising direction for building reliable and interpretable QA systems.
## Limitations
One limitation of our framework lies in the computational overhead introduced by Reasoner. In certain cases, especially for complex queries, the Reasoning Chain generated by Reasoner can become relatively long, increasing resource consumption. However, our experimental results demonstrate that the performance gains from incorporating the Reasoning Chain justify this additional cost, striking a practical balance between efficiency and effectiveness. Another limitation concerns generalizability to specialized domains. Although our framework, trained on Freebase-based KGQA datasets, generalizes to unseen KGs better than previous methods, its performance on highly specialized knowledge graphs (e.g., medical KGs) still leaves room for improvement. Improving adaptation to domain-specific KGs presents a promising direction for future research.
## References
- Bao et al. (2025) Forrest Bao, Chenyu Xu, and Ofer Mendelevitch. 2025. Deepseek-r1 hallucinates more than deepseek-v3. Accessed: 2025-02-10.
- Bollacker et al. (2008) Kurt D. Bollacker, Colin Evans, Praveen K. Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pages 1247–1250. ACM.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Cheng et al. (2024) Sitao Cheng, Ziyuan Zhuang, Yong Xu, Fangkai Yang, Chaoyun Zhang, Xiaoting Qin, Xiang Huang, Ling Chen, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, and Qi Zhang. 2024. Call me when necessary: LLMs can efficiently and faithfully reason over structured environments. In Findings of the Association for Computational Linguistics: ACL 2024, pages 4275–4295, Bangkok, Thailand. Association for Computational Linguistics.
- Dehghan et al. (2024) Mohammad Dehghan, Mohammad Alomrani, Sunyam Bagga, David Alfonso-Hermelo, Khalil Bibi, Abbas Ghaddar, Yingxue Zhang, Xiaoguang Li, Jianye Hao, Qun Liu, Jimmy Lin, Boxing Chen, Prasanna Parthasarathi, Mahdi Biparva, and Mehdi Rezagholizadeh. 2024. EWEK-QA : Enhanced web and efficient knowledge graph retrieval for citation-based question answering systems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14169â14187, Bangkok, Thailand. Association for Computational Linguistics.
- Dempster et al. (1977) Arthur P Dempster, Nan M Laird, and Donald B Rubin. 1977. Maximum likelihood from incomplete data via the em algorithm. Journal of the royal statistical society: series B (methodological), 39(1):1â22.
- Dong et al. (2025) Zixuan Dong, Baoyun Peng, Yufei Wang, Jia Fu, Xiaodong Wang, Xin Zhou, Yongxue Shan, Kangchen Zhu, and Weiguo Chen. 2025. Effiqa: Efficient question-answering with strategic multi-model collaboration on knowledge graphs. In Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025, Abu Dhabi, UAE, January 19-24, 2025, pages 7180â7194. Association for Computational Linguistics.
- Gu et al. (2024) Yu Gu, Yiheng Shu, Hao Yu, Xiao Liu, Yuxiao Dong, Jie Tang, Jayanth Srinivasa, Hugo Latapie, and Yu Su. 2024. Middleware for LLMs: Tools are instrumental for language agents in complex environments. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7646â7663, Miami, Florida, USA. Association for Computational Linguistics.
- Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- Hadi et al. (2023) Muhammad Usman Hadi, Rizwan Qureshi, Abbas Shah, Muhammad Irfan, Anas Zafar, Muhammad Bilal Shaikh, Naveed Akhtar, Jia Wu, Seyedali Mirjalili, et al. 2023. A survey on large language models: Applications, challenges, limitations, and practical usage. Authorea Preprints, 3.
- He et al. (2021) Gaole He, Yunshi Lan, Jing Jiang, Wayne Xin Zhao, and Ji-Rong Wen. 2021. Improving multi-hop knowledge base question answering by learning intermediate supervision signals. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pages 553–561.
- Hoffman et al. (2023) Matthew Douglas Hoffman, Du Phan, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, and Rif A. Saurous. 2023. Training chain-of-thought via latent-variable inference. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
- Huang et al. (2025) Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. 2025. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55.
- Huang et al. (2024) Xiang Huang, Sitao Cheng, Shanshan Huang, Jiayu Shen, Yong Xu, Chaoyun Zhang, and Yuzhong Qu. 2024. QueryAgent: A reliable and efficient reasoning framework with environmental feedback based self-correction. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5014–5035, Bangkok, Thailand. Association for Computational Linguistics.
- Jiang et al. (2023) Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. 2023. Structgpt: A general framework for large language model to reason over structured data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9237–9251.
- Jiang et al. (2022) Jinhao Jiang, Kun Zhou, Xin Zhao, and Ji-Rong Wen. 2022. Unikgqa: Unified retrieval and reasoning for solving multi-hop question answering over knowledge graph. In The Eleventh International Conference on Learning Representations.
- Jin et al. (2021) Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2021. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421.
- Li et al. (2024) Kun Li, Tianhua Zhang, Xixin Wu, Hongyin Luo, James Glass, and Helen Meng. 2024. Decoding on graphs: Faithful and sound reasoning on knowledge graphs through generation of well-formed chains. arXiv preprint arXiv:2410.18415.
- Li et al. (2023) Tianle Li, Xueguang Ma, Alex Zhuang, Yu Gu, Yu Su, and Wenhu Chen. 2023. Few-shot in-context learning on knowledge base question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6966–6980, Toronto, Canada. Association for Computational Linguistics.
- Luo et al. (2024a) Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. 2024a. Reasoning on graphs: Faithful and interpretable large language model reasoning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
- Luo et al. (2024b) Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. 2024b. Reasoning on graphs: Faithful and interpretable large language model reasoning. In International Conference on Learning Representations.
- Luo et al. (2024c) Linhao Luo, Zicheng Zhao, Chen Gong, Gholamreza Haffari, and Shirui Pan. 2024c. Graph-constrained reasoning: Faithful reasoning on knowledge graphs with large language models. CoRR, abs/2410.13080.
- Mavromatis and Karypis (2022) Costas Mavromatis and George Karypis. 2022. Rearev: Adaptive reasoning for question answering over knowledge graphs. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2447–2458.
- Mavromatis and Karypis (2024) Costas Mavromatis and George Karypis. 2024. Gnn-rag: Graph neural retrieval for large language model reasoning. arXiv preprint arXiv:2405.20139.
- Meta (2024) Meta. 2024. Build the future of ai with meta llama 3.
- Nie et al. (2024) Zhijie Nie, Richong Zhang, Zhongyuan Wang, and Xudong Liu. 2024. Code-style in-context learning for knowledge-based question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18833–18841.
- OpenAI (2022) OpenAI. 2022. Introducing chatgpt.
- OpenAI (2024a) OpenAI. 2024a. Hello gpt-4o.
- OpenAI (2024b) OpenAI. 2024b. Learning to reason with llms.
- Pan et al. (2024) Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. 2024. Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowledge and Data Engineering (TKDE).
- Qu et al. (2021) Meng Qu, Junkun Chen, Louis-Pascal A. C. Xhonneux, Yoshua Bengio, and Jian Tang. 2021. Rnnlogic: Learning logic rules for reasoning on knowledge graphs. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
- Sen et al. (2022) Prithviraj Sen, Breno W. S. R. de Carvalho, Ryan Riegel, and Alexander G. Gray. 2022. Neuro-symbolic inductive logic programming with logical neural networks. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelfth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pages 8212–8219. AAAI Press.
- Speer et al. (2017) Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI conference on artificial intelligence, volume 31.
- Sun et al. (2018) Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William Cohen. 2018. Open domain question answering using early fusion of knowledge bases and text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4231–4242.
- Sun et al. (2024) Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel Ni, Heung-Yeung Shum, and Jian Guo. 2024. Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph. In The Twelfth International Conference on Learning Representations.
- Talmor and Berant (2018) Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 641–651.
- Talmor et al. (2019) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Wang et al. (2023a) Keheng Wang, Feiyu Duan, Sirui Wang, Peiguang Li, Yunsen Xian, Chuantao Yin, Wenge Rong, and Zhang Xiong. 2023a. Knowledge-driven cot: Exploring faithful reasoning in llms for knowledge-intensive question answering. arXiv preprint arXiv:2308.13259.
- Wang et al. (2023b) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023b. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
- Wang et al. (2024) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2024. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
- Wei et al. (2022a) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022a. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
- Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
- Yang et al. (2024) An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zhihao Fan. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671.
- Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
- Yih et al. (2016) Wen-tau Yih, Matthew Richardson, Christopher Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers. The Association for Computer Linguistics.
- Yu et al. (2022) Ping Yu, Tianlu Wang, Olga Golovneva, Badr AlKhamissy, Gargi Ghosh, Mona T. Diab, and Asli Celikyilmaz. 2022. ALERT: adapting language models to reasoning tasks. CoRR, abs/2212.08286.
- Zhang et al. (2022) Jing Zhang, Xiaokang Zhang, Jifan Yu, Jian Tang, Jie Tang, Cuiping Li, and Hong Chen. 2022. Subgraph retrieval enhanced model for multi-hop knowledge base question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5773–5784.
- Zhong et al. (2024) Tianyang Zhong, Zhengliang Liu, Yi Pan, Yutong Zhang, Yifan Zhou, Shizhe Liang, Zihao Wu, Yanjun Lyu, Peng Shu, Xiaowei Yu, Chao Cao, Hanqi Jiang, Hanxu Chen, Yiwei Li, Junhao Chen, Huawen Hu, Yihen Liu, Huaqin Zhao, Shaochen Xu, Haixing Dai, Lin Zhao, Ruidong Zhang, Wei Zhao, Zhenyuan Yang, Jingyuan Chen, Peilong Wang, Wei Ruan, Hui Wang, Huan Zhao, Jing Zhang, Yiming Ren, Shihuan Qin, Tong Chen, Jiaxi Li, Arif Hassan Zidan, Afrar Jahin, Minheng Chen, Sichen Xia, Jason Holmes, Yan Zhuang, Jiaqi Wang, Bochen Xu, Weiran Xia, Jichao Yu, Kaibo Tang, Yaxuan Yang, Bolun Sun, Tao Yang, Guoyu Lu, Xianqiao Wang, Lilong Chai, He Li, Jin Lu, Lichao Sun, Xin Zhang, Bao Ge, Xintao Hu, Lian Zhang, Hua Zhou, Lu Zhang, Shu Zhang, Ninghao Liu, Bei Jiang, Linglong Kong, Zhen Xiang, Yudan Ren, Jun Liu, Xi Jiang, Yu Bao, Wei Zhang, Xiang Li, Gang Li, Wei Liu, Dinggang Shen, Andrea Sikora, Xiaoming Zhai, Dajiang Zhu, and Tianming Liu. 2024. Evaluation of openai o1: Opportunities and challenges of AGI. CoRR, abs/2409.18486.
## Appendix A Rationale and Details of the EM Algorithm
We provide the motivation and the details of applying the EM algorithm to optimize our framework.
### A.1 Overview
We have two modules:
- ReAligner, parameterized by $\psi$ , which generates Graph-aware Reasoning Chains $z$ given $(\mathcal{G},q)$ .
- Responser, parameterized by $w$ , which predicts the final answer $a$ given the question $q$ and candidate Graph-aware Reasoning Chain $z$ .
Given a training set of triples $\{(\mathcal{G},q,a)\}$ , our objective is to maximize:
$$
\mathcal{O}(w,\psi)\;=\;\sum_{(\mathcal{G},q,a)}\log\Bigl(\mathbb{E}_{z\sim p_{\psi}(z\mid\mathcal{G},q)}\bigl[p_{w}(a\mid q,z)\bigr]\Bigr).
$$
Because exact marginalization over $z$ can be expensive, we employ an EM-style approach to iteratively refine both modules.
### A.2 Rationale
The choice of the EM algorithm is fundamentally motivated by the central challenge and objective of RAR: generating latent natural language (NL) reasoning steps for KGQA.
#### A.2.1 Core Challenge: Generating Latent NL Reasoning
- RAR aims to produce complex, human-like NL reasoning chains: the intermediate "thought process" connecting a question to its answer using a Knowledge Graph (KG).
- Crucially, these detailed reasoning steps are not available in standard KGQA training datasets. They constitute latent variables within our model.
#### A.2.2 Inadequacy of Direct Supervision Methods
- In the standard SFT applied within RoG (Luo et al., 2024a), relation sequences absent from the original training data are often generated. To label these sequences for training, the process frequently relies on heuristics that create pseudo-gold labels, such as taking the shortest KG path between the query and answer entities, which can be noisy.
- This heuristic-based supervision is insufficient for training models to generate the complex, multi-step, logically nuanced NL reasoning that RAR targets, especially when multiple constraints are involved.
#### A.2.3 EM as the Principled Approach for Latent Variables
- EM is the standard, principled approach for parameter estimation in models with latent variables.
- It provides a formal framework to optimize the Reasoner (generating the latent NL chain) and the Aligner (grounding the chain to the KG) without requiring explicit supervision for the intermediate NL reasoning steps.
- Optimization is guided indirectly by maximizing the likelihood of observing the correct final answer, conditioned on the feasibility of the generated reasoning chain being aligned to the KG.
#### A.2.4 Necessity of Iterative Refinement
- Generating coherent, long-form NL reasoning is challenging. Initial attempts, especially early in training, are likely to be imperfect or logically flawed.
- The iterative nature of the EM algorithm is well-suited to this progressive refinement. The E-step identifies the most likely, or "best", latent reasoning chains produced by the current model that successfully link the question to the correct answer via a feasible KG alignment; this step essentially evaluates the current reasoning quality based on outcomes. The M-step updates the parameters of the Reasoner and Aligner by training them on the high-quality reasoning chains identified in the E-step; this step aims to make the models generate more chains similar to the successful ones.
- This iterative E-M loop allows the system to gradually improve the quality, logical coherence, and KG-alignability of the generated latent reasoning, as demonstrated qualitatively in Fig. 4.
#### A.2.5 Connections to Reinforcement Learning
Connection to Implicit RL. The EM algorithm, as applied in RAR, can be viewed as a form of implicit Reinforcement Learning:
- The E-step acts as a selection or filtering mechanism based on the quality of the reasoning chain, implicitly assigning a high reward (e.g., 1) to successful chains (those reaching the correct answer via KG alignment) and a low reward (e.g., 0) to unsuccessful ones.
- The M-step, particularly the Reasoner update (maximizing $\log p(\hat{z}_{I}\mid q)$ for the selected high-quality chains $\hat{z}_{I}$), mathematically resembles a policy-gradient update ($\mathbb{E}[R\,\nabla\log p(z\mid q)]\approx R\,\nabla\log p(\hat{z}\mid q)$), where $R$ is effectively this implicit binary reward.
Thus, EM reinforces the generation of "good" reasoning chains without the need for explicit reward engineering.
Why EM Was Preferred Over Explicit Reinforcement Learning. While the EM process here shares similarities with RL (as discussed above), we opted for EM over explicit RL formulations (such as PPO) for several practical reasons:
- Reward Function Design: Crafting a reward function $R$ that accurately captures the quality of multi-step NL reasoning is non-trivial.
- Training Complexity and Cost: Explicit RL methods often lead to higher computational costs and potentially unstable training.
- Efficiency and Simplicity: EM, derived naturally from the maximum likelihood objective for latent variable models, offers a more direct, mathematically grounded, and often simpler optimization pathway for our specific problem structure.
### A.3 Details of EM Algorithm
#### A.3.1 Step 1: Updating Responser
1. Sample Candidate Graph-aware Reasoning Chains. For each training example $(\mathcal{G},q,a)$ , sample $K$ Graph-aware Reasoning Chains:
$$
z_{k}\;\sim\;p_{\psi}(z\mid\mathcal{G},q),\quad k=1,\dots,K.
$$
Let $\hat{z}=\{\,z_{1},z_{2},\ldots,z_{K}\}$ .
1. Approximate the Objective for $w$ . The term
$$
\log\mathbb{E}_{z\sim p_{\psi}(z\mid\mathcal{G},q)}\bigl{[}p_{w}(a\mid q,z)
\bigr{]}
$$
is approximated by
$$
\log\biggl{(}\tfrac{1}{K}\sum_{k=1}^{K}p_{w}(a\mid q,z_{k})\biggr{)}.
$$
We then take gradients (w.r.t. $w$ ) and update $w$ so that $p_{w}(a\mid q,z)$ is more likely to produce the correct $a$ for the sampled Graph-aware Reasoning Chains.
1. Result. After updating $w$ , Responser $p_{w}(a\mid q,z)$ is better aligned with whatever Graph-aware Reasoning Chains $p_{\psi}$ currently emits.
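The Monte Carlo approximation above can be sketched in a few lines. This is a minimal sketch: in the real system the answer probabilities $p_{w}(a\mid q,z_{k})$ come from the Responser LLM, while here they are passed in as plain numbers.

```python
import math

# Approximate log E_{z ~ p_psi}[p_w(a | q, z)] by log((1/K) * sum_k p_w(a | q, z_k)),
# where answer_probs holds p_w(a | q, z_k) for the K sampled chains.
def approx_log_objective(answer_probs):
    k = len(answer_probs)
    return math.log(sum(answer_probs) / k)

# Three hypothetical sampled chains with answer probabilities 0.9, 0.5, 0.1:
# the approximation equals log(0.5).
score = approx_log_objective([0.9, 0.5, 0.1])
```

Gradients of this quantity with respect to $w$ flow through each $p_{w}(a\mid q,z_{k})$ term, which is what the update in Step 2 exploits.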
#### A.3.2 Step 2: EM-Style Update for ReAligner
After $w$ is updated, we refine the ReAligner $p_{\psi}(z\mid\mathcal{G},q)$ . In EM terms, we view $z$ as a latent variable:
E-Step (Posterior Inference)
- Compute / Re-rank Graph-aware Reasoning Chains. Re-sample or re-rank the $K$ Graph-aware Reasoning Chains using the updated $p_{w}$. We want the Graph-aware Reasoning Chains that are "most aligned" with the correct answer $a$. Formally:
$$
p_{w,\psi}(z\mid\mathcal{G},q,a)\;\propto\;p_{w}(a\mid q,z)\,p_{\psi}(z\mid
\mathcal{G},q).
$$
- Scoring. For a single Graph-aware Reasoning Chain $z$ , define
$$
S(z)\;=\;\log p_{w}(a\mid q,z)\;+\;\log p_{\psi}(z\mid\mathcal{G},q).
$$
- Selecting High-Quality Graph-aware Reasoning Chains. Rank (or sample) Graph-aware Reasoning Chains by $S(z)$ and select the top set $z_{I}=\{\text{top-}K\text{ Graph-aware Reasoning Chains}\}$ .
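The scoring and selection step can be sketched as follows. This is a hedged sketch: the two log-probabilities would be computed by the Responser and ReAligner models, and are supplied here as plain numbers.

```python
def select_top_chains(candidates, top_k):
    """Rank chains by S(z) = log p_w(a|q,z) + log p_psi(z|G,q) and keep the top-k.

    candidates: list of (chain, log_p_answer, log_p_chain) tuples.
    """
    ranked = sorted(candidates, key=lambda t: t[1] + t[2], reverse=True)
    return [chain for chain, _, _ in ranked[:top_k]]

# Hypothetical candidates: "z2" has the best combined score (-1.3),
# followed by "z3" (-1.7); "z1" (-3.0) is dropped.
z_I = select_top_chains(
    [("z1", -2.0, -1.0), ("z2", -0.5, -0.8), ("z3", -1.5, -0.2)],
    top_k=2,
)
```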
M-Step (Update $\psi$ )
- Treat $z_{I}$ (the selected high-quality Graph-aware Reasoning Chains) as if they were observed.
- Update $\psi$ by maximizing:
$$
\log p_{\psi}(z_{I}\mid\mathcal{G},q)\;=\;\sum_{z\,\in\,z_{I}}\log p_{\psi}(z
\mid\mathcal{G},q)
$$
- In practice, this amounts to standard fine-tuning (e.g., instruction tuning or teacher forcing) of $p_{\psi}$ on the newly identified high-quality Graph-aware Reasoning Chains.
#### A.3.3 Complete Iteration
A single iteration of our EM-style algorithm proceeds as follows:
1. (Update $w$ ): For each sample $(\mathcal{G},q,a)$ , draw $K$ Graph-aware Reasoning Chains from $p_{\psi}$ , then update $w$ by maximizing
$$
\log\left(\tfrac{1}{K}\sum_{k=1}^{K}p_{w}(a\mid q,z_{k})\right).
$$
1. (E-Step): Using the updated $w$ , compute
$$
p_{w,\psi}(z\mid\mathcal{G},q,a)\;\propto\;p_{w}(a\mid q,z)\,p_{\psi}(z\mid
\mathcal{G},q).
$$
Select high-quality Graph-aware Reasoning Chains $z_{I}$ from the candidates.
1. (M-Step): Update $\psi$ by maximizing $\log p_{\psi}(z_{I}\mid\mathcal{G},q)$ , i.e. fine-tune $p_{\psi}$ so that it is more likely to emit $z_{I}$ in the future.
This loop can be repeated until convergence or for a fixed number of epochs.
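The iteration above can be summarized as a single function. This is only an illustrative skeleton: `realigner` and `responser` are hypothetical objects standing in for the LLMs, and their `update` methods stand in for gradient-based fine-tuning steps.

```python
def em_iteration(realigner, responser, example, K=4, top_k=2):
    """One EM-style iteration for a single (G, q, a) training example.

    The two model objects are assumed to expose:
      sample(G, q)       -> one Graph-aware Reasoning Chain
      log_prob(...)      -> a log-probability under the model
      update(...)        -> a fine-tuning step
    """
    G, q, a = example
    # Step 1: draw K chains from p_psi and update the Responser on them.
    chains = [realigner.sample(G, q) for _ in range(K)]
    responser.update(q, a, chains)
    # E-step: score chains by S(z) = log p_w(a|q,z) + log p_psi(z|G,q).
    ranked = sorted(
        chains,
        key=lambda z: responser.log_prob(a, q, z) + realigner.log_prob(z, G, q),
        reverse=True,
    )
    z_I = ranked[:top_k]
    # M-step: fine-tune the ReAligner on the selected high-quality chains.
    realigner.update(G, q, z_I)
    return z_I
```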
#### A.3.4 Practical Variations
1. Top-$K$ vs. Full Posterior. Instead of summing or sampling over the full posterior, it is simpler to pick the top-$K$ Graph-aware Reasoning Chains by $S(\cdot)$.
1. Skipping Responser Optimization. To further improve efficiency, we can skip optimizing Responser. LLMs often possess strong zero-shot summarization or question-answering capabilities, which means they can produce high-quality answers from given Graph-aware Reasoning Chains without additional training. As a result, we can treat an LLM as a pre-optimized Responser and focus solely on updating ReAligner, thereby reducing overall computation.
## Appendix B More Related Work
### B.1 Comparison with Agent Exploration Methods
To situate RAR within the KGQA landscape, we first contrast it with representative agent exploration methods such as ToG (Sun et al., 2024). Although both paradigms comprise stages that can be informally mapped to reasoning and grounding, their internal principles diverge markedly.
Training methodology and optimization.
RAR is trained with a mathematically grounded expectation-maximization (EM) procedure that explicitly and stably refines two complementary capabilities: the NL Reasoner and the KG-aware Aligner. By contrast, many agent methods rely more heavily on prompt or workflow engineering (Cheng et al., 2024; Gu et al., 2024; Huang et al., 2024; Li et al., 2023; Nie et al., 2024); they seldom perform task-specific optimization that directly targets the core reasoning mechanism.
Accuracy, reliability, and complexity handling.
The synergy between NL reasoning, KG-constrained alignment, and EM-guided supervised fine-tuning translates into markedly higher accuracy and robustness for RAR (see Tab. 1). Empirically, its explicit decomposition allows it to cope well with multi-hop and conjunctive constraints that are challenging for purely prompt-driven agents.
Resource consumption.
Once training is finished, inference in RAR can be carried out by a collection of relatively small, specialized, fine-tuned models, one each for the Reasoner, Aligner, and Responser. This modularity yields the efficiency gains reported in Tab. 3. Agent systems, in contrast, often incur higher latency and cost because they perform several large-LLM calls while exploring the KG.
Taken together, these differences show that RAR is not a mere variant of the agent paradigm; its EM-centered optimization strategy and KG-constrained decoding constitute a distinct design that offers both practical efficiency and stronger empirical performance.
### B.2 Comparison with Path Generation Methods
RAR also differs fundamentally from path generation methods such as RoG (Luo et al., 2024a) and GCR (Luo et al., 2024c). Where those systems directly predict a linear sequence of KG relations, ours maintains a higher-level, human-readable plan in natural language and lets the Aligner ground each step to KG triples.
Specifically, (i) the Reasoner produces a multi-step NL chain that expresses the conceptual logic; (ii) the Aligner incrementally matches every NL step to concrete triples through KG-constrained decoding; and (iii) the Responser integrates evidence from both the NL chain and the aligned KG path to craft the final answer. Training again relies on EM, iteratively improving latent NL chains that can be aligned and that ultimately yield correct answers, whereas RoG and related work usually depend on direct SFT with shortest KG paths that may provide noisy supervision signals.
Illustrative example.
For the query "What did Dr Josef Mengele do?" the two paradigms unfold differently:
- RoG. An LLM planner outputs the relation people.person.profession; a symbolic retriever then follows this edge in the KG to obtain the triple (Josef Mengele, profession, Physician), and the answer Physician is returned.
- RAR. Step 1 (A Reasoner step) proposes: âIdentify the profession associated with Josef Mengele.â The Aligner grounds this to the same triple as above. The Responser finally reports Physician, explicitly citing both the reasoning chain and the grounded path.
Handling complex constraints.
Because RoG represents reasoning as a single linear relation sequence, it struggles with conjunctive queries such as "presidents who were also actors"; after following profession $\rightarrow$ Actor, it cannot backtrack to verify profession $\rightarrow$ President. RAR, in contrast, naturally decomposes the query into two successive NL steps ("find presidents" $\rightarrow$ "filter those who are actors") and grounds each step separately, ensuring both constraints are satisfied.
Robustness to noisy supervision.
Shortest-path supervision can be incorrect when more semantically plausible paths exist. By letting EM discover latent NL chains that are both alignable and answer-bearing, RAR avoids this brittleness and achieves the quality gains visualized in Fig. 4.
In summary, the collaboration of an explicit NL Reasoner, a KG-constrained Aligner, and EM-based optimization endows RAR with a distinctive combination of interpretability, flexibility, and empirical strength that is not achieved by prior agent exploration or path generation methods.
## Appendix C Experimental Setup
### C.1 Fine-tuning Datasets and Knowledge Graph
For evaluation, we use two benchmark KGQA datasets: WebQuestionsSP (WebQSP) (Yih et al., 2016) and Complex WebQuestions (CWQ) (Talmor and Berant, 2018). To ensure fair comparison, we adopt identical train and test splits as previous works (Jiang et al., 2022; Luo et al., 2024b). The detailed statistics of these datasets are presented in Tab. 6. Both WebQSP and CWQ are based on Freebase (Bollacker et al., 2008). To reduce computational overhead and memory consumption, we utilize the same subgraphs as previous work (Luo et al., 2024b). Additionally, we preprocess Freebase by converting CVT (Compound Value Type) nodes, which represent n-ary relationships, into binary relationships by concatenating edge labels with "-" as the delimiter, following Li et al. (2024).
| Dataset | #Train | #Test | #Ans = 1 | 2 $\leq$ #Ans $\leq$ 4 | 5 $\leq$ #Ans $\leq$ 9 | #Ans $\geq$ 10 |
|---|---|---|---|---|---|---|
| WebQSP | 2,826 | 1,628 | 51.2% | 27.4% | 8.3% | 12.1% |
| CWQ | 27,639 | 3,531 | 70.6% | 19.4% | 6% | 4% |
Table 6: Statistics of datasets.
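The CVT flattening described above can be sketched in code. This is a minimal sketch with a hypothetical education CVT; the actual Freebase relation names and entity MIDs used here are illustrative.

```python
def flatten_cvt(subject, in_relation, cvt_out_edges):
    """Replace an n-ary CVT node with binary triples whose relation is the
    in-edge and out-edge labels concatenated with '-' as the delimiter."""
    return [(subject, f"{in_relation}-{out_rel}", obj)
            for out_rel, obj in cvt_out_edges]

# Hypothetical example: a person's education CVT pointing to an institution
# and a degree becomes two binary triples.
triples = flatten_cvt(
    "m.trey_songz",
    "people.person.education",
    [("education.education.institution", "m.petersburg_high"),
     ("education.education.degree", "m.high_school_diploma")],
)
```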
### C.2 Datasets for Cold Starting
To initialize the training of Reasoner and Aligner, we leverage high-quality Reasoning Chains and Knowledge Paths derived from SPARQL queries in the WebQSP and CWQ datasets. This approach prevents the models from generating malformed outputs during early training stages.
The SPARQL queries in these datasets represent the gold-standard Reasoning Chain that human experts would use to solve questions using the KG. We decompose these SPARQL queries according to a predefined grammar, breaking them into atomic chunks that each represent a single reasoning step. For each chunk, we query Freebase to obtain the corresponding triples, then use GPT-4o to generate natural language reasoning chains based on these retrieved triples. Through this process, we generate a dataset of 2,000 high-quality Reasoning Chains with their corresponding Knowledge Paths for each question. This dataset enables us to perform cold-start pretraining of Reasoner and Aligner, teaching them to generate well-structured, step-by-step Reasoning Chains and Knowledge Paths with appropriate special tokens, a crucial foundation for subsequent optimization using the EM algorithm.
### C.3 Hyperparameters
For Responser, we adopt Llama-3.1-8B (Meta, 2024) without fine-tuning based on our preliminary experiments (detailed analysis in App. A.3.4). For both Reasoner and Aligner, we conduct extensive experiments with various lightweight LLMs ranging from 0.5B to 8B parameters (Yang et al., 2024; Touvron et al., 2023; Meta, 2024). All models share the same hyperparameter configuration: training for 3 epochs with a batch size of 4 and a learning rate of 2e-5. We employ a cosine learning rate scheduler with a warmup ratio of 0.03. The training is performed on 4 A6000 GPUs for each model variant.
## Appendix D Details of KG-constrained Decoding
For efficient knowledge graph operations, we implemented a Virtuoso-based Freebase instance with distributed high-speed SPARQL querying capabilities. Our system achieves a throughput of 2,000 requests per second, enabling rapid graph traversal and neighborhood node retrieval during the constrained decoding process. This high-performance infrastructure allows us to efficiently retrieve next-hop candidates for token constraints directly from the knowledge graph.
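As an illustration of the token-level constraint, the sketch below filters the next tokens so the decoder can only continue toward a relation that actually exists as a next hop. This is a simplified sketch: the real system retrieves candidate relations via SPARQL and operates over the LLM's subword vocabulary, whereas here tokenization on '.' stands in for subword tokenization.

```python
def allowed_next_tokens(prefix, candidate_relations, tokenize):
    """Tokens that extend `prefix` toward at least one valid KG relation."""
    allowed = set()
    n = len(prefix)
    for rel in candidate_relations:
        toks = tokenize(rel)
        if toks[:n] == prefix and len(toks) > n:
            allowed.add(toks[n])
    return allowed

# Hypothetical relations reachable from the current entity.
rels = ["people.person.profession", "people.person.education", "film.actor.film"]
tok = lambda s: s.split(".")
# After emitting "people.person", only two continuations remain valid.
nxt = allowed_next_tokens(["people", "person"], rels, tok)
```

In practice this filter is applied at every decoding step, so the Aligner can never emit a relation absent from the retrieved neighborhood.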
## Appendix E Case Study of Different Iteration Steps
In Fig. 7, we provide two examples to investigate the effect of iteration steps of the EM algorithm.
<details>
<summary>x7.png Details</summary>

### Visual Description
## [Table/Comparison Diagram]: Reasoning Chain Comparison for Question-Answering Tasks
### Overview
The image displays a structured comparison table analyzing two distinct question-answering cases (Case 1 and Case 2). Each case contrasts two different reasoning approaches, labeled "Steps 20" and "Steps 100," to solve a factual question. The table highlights the reasoning chain, knowledge path, and final response for each approach, using color-coding (red for incorrect paths/responses, green for correct ones) and symbols (✗ for incorrect, ✓ for correct) to indicate accuracy.
### Components/Axes
The table is organized into the following rows and columns:
* **Columns:** The table has three main columns. The first column lists the row labels (Case, Steps, Reasoning Chain & Knowledge Path, Response). The second and third columns present the two contrasting approaches for each case, labeled "20" and "100" respectively.
* **Rows (per case):**
* **Case Header:** States the case number and the specific question being answered, along with the correct answer.
* **Steps:** A numerical label ("20" or "100") for the reasoning approach.
* **Reasoning Chain & Knowledge Path:** A textual description of the logical steps taken, interspersed with knowledge triplets formatted as `<Subject, Predicate, Object>`. Arrows (`→`) indicate the flow from one step to the next.
* **Response:** The final answer generated by the reasoning chain.
### Detailed Analysis
**Case 1:**
* **Question:** What high school did the artist who recorded "Girl Tonight" attend?
* **Correct Answer:** Petersburg.
* **Approach "20" (Left Column):**
* **Reasoning Chain:** "First, identify the artist 'c' associated with the recording 'Girl Tonight'." → Knowledge triplet: `<John H. Guyer, school_type, High school>` (highlighted in red). "Next, determine the high school attended by the artist 'c'." → Knowledge triplet: `<John H. Guyer, notable_object, High school>` (highlighted in red).
* **Response:** John H. Guyer (marked with a red ✗).
* **Approach "100" (Right Column):**
* **Reasoning Chain:** "First, identify the artist 'c' associated with the recording 'Girl Tonight'." → Knowledge triplet: `<Trey Songz, recordings, Girl Tonight>`. "Next, determine the educational institutions 'x' attended by artist 'c'." → Knowledge triplet: `<Trey Songz, education_institution, Petersburg>`. "Finally, confirm which of educational institutions 'x' is a high school." → Knowledge triplet: `<Petersburg, school_type, High school>`.
* **Response:** Petersburg (marked with a green ✓).
**Case 2:**
* **Question:** What type of Claude Debussy music appears in the film Black Tights?
* **Correct Answer:** Ballet.
* **Approach "20" (Left Column):**
* **Reasoning Chain:** "First, identify the type of music 'x' associated with Claude Debussy." → Knowledge triplet: `<Claude Debussy, compositions, En blanc et noir>` (highlighted in red). "Next, confirm that the type of music 'x' is featured in the film Black Tights." → Knowledge triplet: `<En blanc et noir, composer, Claude Debussy>` (highlighted in red).
* **Response:** En blanc et noir (marked with a red ✗).
* **Approach "100" (Right Column):**
* **Reasoning Chain:** "First, identify the type of music 'x' associated with Claude Debussy." → Knowledge triplet: `<Claude Debussy, genre, Ballet>`. "Next, determine whether the type of music 'x' appears in the film Black Tights." → Knowledge triplet: `<Ballet, films_in_this_genre, Black Tights>`.
* **Response:** Ballet (marked with a green ✓).
### Key Observations
1. **Error Pattern in "20" Approach:** Both incorrect ("20") reasoning chains exhibit a similar flaw. They retrieve a specific named entity ("John H. Guyer" or "En blanc et noir") instead of the requested categorical attribute (a high school or a music type/genre). The subsequent knowledge triplets then attempt to validate this incorrect entity, leading to a circular or irrelevant path.
2. **Correct Pattern in "100" Approach:** Both correct ("100") reasoning chains follow a more logical, multi-step verification process. They first identify the correct intermediate entity (the artist Trey Songz, or the genre Ballet), then retrieve the target attribute (the education institution, or the film), and finally perform a verification step to confirm the attribute matches the question's constraint (confirming Petersburg is a high school, or that Ballet is a genre in the film).
3. **Visual Coding:** Red text and the ✗ symbol are consistently used to denote errors in the reasoning path and final answer. Green text (in the knowledge triplets) and the ✓ symbol denote correct reasoning and answers. The arrows (`→`) clearly delineate the step-by-step flow of logic.
### Interpretation
This diagram serves as a technical comparison between two configurations of the same knowledge-grounded question-answering system, where "20" and "100" denote the number of EM iteration steps used to train RAR.
The data demonstrates that the "100" approach employs a more robust and verifiable reasoning strategy. It correctly parses the question to identify the *type* of information needed (a high school, a music genre) and uses a chain of knowledge lookups that culminates in a verification step. In contrast, the "20" approach appears to make an initial, incorrect associative leap to a specific named entity and then fails to correct course, resulting in a logically flawed and incorrect answer.
The key takeaway is that effective factual question answering requires not just retrieving related information, but precisely understanding the *category* of the answer sought and implementing a multi-step verification process to ensure the retrieved data satisfies all constraints of the original query. The "100" method exemplifies this, while the "20" method illustrates a common failure mode of associative, non-verifying retrieval.
</details>
Figure 7: Examples of Reasoning Chains and Knowledge Paths generated by RAR under different iteration steps.
## Appendix F Templates and Prompts
In this section, we illustrate all the prompts used in the experiments.
Reasoning Chain Template. The template of Reasoning Chains generated by Reasoner is shown in Fig. 8, where each $s_{i}$ is a discrete reasoning step in natural language.
Reasoning Chain Template
<THINK> $s_{1}s_{2}s_{3}\dots s_{n}$ </THINK>
Figure 8: The template for the Reasoning Chain generated by Reasoner.
Knowledge Path Template. The template of Knowledge Paths generated by Aligner is shown in Fig. 9, where $e$ and $r$ denote the entities and relations from the KG.
Knowledge Path Template
<ALIGN><TRIPLE> <|> $e_{1}$ <|> $r_{1}$ <|> $e_{1}^{\prime}$ </TRIPLE> $\dots$ <TRIPLE> <|> $e_{n}$ <|> $r_{n}$ <|> $e_{n}^{\prime}$ </TRIPLE></ALIGN>
Figure 9: The template of Knowledge Paths generated by Aligner.
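For illustration, the template in Fig. 9 can be serialized and parsed with a pair of helpers (a sketch with hypothetical function names; in RAR these tokens are emitted by the Aligner during KG-constrained decoding rather than formatted post hoc):

```python
import re

def format_path(triples):
    """Serialize (e, r, e') triples into the Knowledge Path template."""
    return "<ALIGN>" + " ".join(
        f"<TRIPLE> <|> {e} <|> {r} <|> {e2} </TRIPLE>" for e, r, e2 in triples
    ) + "</ALIGN>"

def parse_path(text):
    """Recover the list of (e, r, e') triples from a serialized path."""
    return [tuple(part.strip() for part in m.split("<|>")[1:])
            for m in re.findall(r"<TRIPLE>(.*?)</TRIPLE>", text)]
```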
ReAligner Prompt. The prompt for instructing ReAligner is shown in Fig. 10, where the task is to generate the Reasoning Chain and Knowledge Path given the question and the KG.
ReAligner Prompt
============================= Prompt Input =============================
Generate a step-by-step thinking process for the given question. Ensure the thinking process is aligned with triples in the knowledge base.
Question: <Question>
Query entities: <Question Entities>
============================= LLM Output =============================
Thinking process: <Reasoning Chain>
Align Process: <Knowledge Path>
Figure 10: The prompt template for ReAligner.
Responser Prompt. The prompt for instructing Responser is shown in Fig. 11, where the task is to generate the final answer based on the given question and the generated Reasoning Chain and Knowledge Path.
Responser Prompt
============================= Prompt Input =============================
Generate a step-by-step thinking process for the given question. Ensure the thinking process is aligned with triples in the knowledge base.
Question: <Question>
Query entities: <Question Entities>
Thinking process: <Reasoning Chain>
Align Process: <Knowledge Path>
Summarization:
============================= LLM Output =============================
<Answer>
Figure 11: The prompt template for Responser.
LLM-driven Consolidation Prompt. The prompt for LLM-driven Consolidation is shown in Fig. 12. We use RAR to generate $K$ Knowledge Paths and hypothesis answers for each question. The Knowledge Paths and hypothesis answers are provided to general LLMs to answer the questions without fine-tuning.
LLM-driven Consolidation Prompt
============================= Prompt Input =============================
Relevant triples: <Knowledge Path 1>. Therefore, a possible answer could be: <Hypothesis Answer 1>
$\ldots$
<Knowledge Path K>. Therefore, a possible answer could be: <Hypothesis Answer K>
Question: <Question>
Based on the reasoning paths, please answer the given question. Please keep the answer as simple as possible and only return answers. Please return each answer in a new line.
============================= LLM Output =============================
<Answer 1>
<Answer 2>
$\ldots$
Figure 12: The prompt template for LLM-driven Consolidation.
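Assembling the consolidation prompt from the $K$ sampled paths and hypothesis answers is mechanical; a minimal sketch follows (the function name, argument order, and the choice to prefix every path with "Relevant triples:" are our assumptions):

```python
def build_consolidation_prompt(question, paths, answers):
    """Assemble the LLM-driven Consolidation prompt (Fig. 12) from
    K Knowledge Paths and their hypothesis answers."""
    assert len(paths) == len(answers)
    lines = [f"Relevant triples: {p}. Therefore, a possible answer could be: {a}"
             for p, a in zip(paths, answers)]
    lines.append(f"Question: {question}")
    lines.append("Based on the reasoning paths, please answer the given "
                 "question. Please keep the answer as simple as possible and "
                 "only return answers. Please return each answer in a new line.")
    return "\n".join(lines)
```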