# Reason-Align-Respond: Aligning LLM Reasoning with Knowledge Graphs for KGQA
Abstract
LLMs have demonstrated remarkable capabilities in complex reasoning tasks, yet they often suffer from hallucinations and lack reliable factual grounding. Meanwhile, knowledge graphs (KGs) provide structured factual knowledge but lack the flexible reasoning abilities of LLMs. In this paper, we present Reason-Align-Respond (RAR), a novel framework that systematically integrates LLM reasoning with knowledge graphs for KGQA. Our approach consists of three key components: a Reasoner that generates human-like reasoning chains, an Aligner that maps these chains to valid KG paths, and a Responser that synthesizes the final answer. We formulate this process as a probabilistic model and optimize it using the Expectation-Maximization algorithm, which iteratively refines the reasoning chains and knowledge paths. Extensive experiments on multiple benchmarks demonstrate the effectiveness of RAR, achieving state-of-the-art performance with Hit scores of 93.3% and 91.0% on WebQSP and CWQ respectively. Human evaluation confirms that RAR generates high-quality, interpretable reasoning chains well-aligned with KG paths. Furthermore, RAR exhibits strong zero-shot generalization capabilities and maintains computational efficiency during inference.
Xiangqing Shen, Fanfan Wang, and Rui Xia (corresponding author)
School of Computer Science and Engineering, Nanjing University of Science and Technology, China
{xiangqing.shen, ffwang, rxia}@njust.edu.cn
1 Introduction
Large language models (LLMs) have exhibited impressive capabilities across a range of complex tasks (Hadi et al., 2023), yet their reasoning processes often lack reliable factual knowledge. This shortcoming reduces interpretability, leads to hallucinations, and causes factual or logical errors (Huang et al., 2025). Knowledge graphs (KGs) (Bollacker et al., 2008), which organize factual knowledge in a structured format, offer strong interpretability and expressive power, making them a reliable source of factual support for LLM reasoning. Integrating KGs with LLMs in question answering, e.g., knowledge graph question answering (KGQA), has gained interest as an effective strategy to mitigate hallucinations and enhance interpretability.
<details>
<summary>x1.png Details</summary>

### Visual Description
## Diagram: Comparison of Agent Exploration and Path Generation Methods
### Overview
The image presents a comparative diagram illustrating two approaches to answering a question about music style: "What's the music style of the album folklore by Scott Swift's daughter?". The left side depicts a "Training-free Agent Exploration Method," while the right side shows a "Training-based Path Generation Method." Below these, "Our Framework" is presented, detailing a "Reasoning Chain by Reasoner" and a "Knowledge Path by Aligner" that correctly answers the question.
### Components/Axes
**Top Section:**
* **Titles:** "Training-free Agent Exploration Methods" (left), "Training-based Path Generation Methods" (right).
* **Answer Boxes:** Each method has an "Answer" box, both initially stating "Pop" with a red "X" indicating an incorrect answer.
* **Question:** "What's the music style of the album folklore by Scott Swift's daughter?"
**Left Side (Training-free Agent Exploration):**
* **Graph:** A simple graph with nodes representing entities (Scott Swift, Taylor Swift, Pop, Indie Folk, folklore) and edges representing relationships (daughter, album, genre).
* Scott Swift (blue node) is connected to Taylor Swift (green node) via "daughter" (t=1, green arrow).
* Taylor Swift (green node) is connected to Pop (red node) via "genre" (t=2, red arrow with an X).
* Taylor Swift (green node) is connected to folklore (grey node) via "album" (black arrow).
* folklore (grey node) is connected to Indie Folk (grey node) via "genre" (black arrow).
* **LLM Agent:** An icon representing a Large Language Model Agent.
* **T steps:** A green arrow indicating the agent takes T steps.
* **Explore on graph:** Label below the graph.
**Right Side (Training-based Path Generation):**
* **Graph:** A graph with nodes representing entities (Scott, Taylor, folklore, Pop) and edges representing relationships (daughter, album, style).
* Scott is connected to Taylor via "daughter".
* Taylor is connected to folklore via "album".
* folklore is connected to Pop via "style" (red X over style).
* **KG:** An icon representing a Knowledge Graph.
* **Path Generator:** A box labeled "Path Generator".
* **Arrow Flow:** Arrows indicate the flow of information from the question to the Path Generator, from the KG to the Path Generator, from the Path Generator to the graph, and from the graph to the "Answer" box.
**Bottom Section (Our Framework):**
* **Title:** "Our Framework"
* **Left Box (Reasoning Chain by Reasoner):**
* **Title:** "Reasoning Chain by Reasoner"
* **Steps:**
1. "Begin by identifying the daughter of the query entity 'Scott Swift', represented by the intermediate entity 'c'. This step establishes..."
2. "Next, verify that the intermediate entity 'c' has released an album named 'folklore'. This ensures that..."
3. "Finally, determine the music genre of the album 'folklore'. This genre provides the specific music style..."
* **Right Box (Knowledge Path by Aligner):**
* **Title:** "Knowledge Path by Aligner"
* **Graph:** A graph with nodes representing entities (Scott Swift, Taylor Swift, Pop, Indie Folk, folklore, USA, Lover, Seven) and edges representing relationships (daughter, album, genre, nationality, track).
* Scott Swift (blue node) is connected to Taylor Swift (green node) via "daughter" (orange arrow labeled "1").
* Taylor Swift (green node) is connected to Indie Folk (green node) via "genre" (pink arrow labeled "3").
* Taylor Swift (green node) is connected to folklore (blue node) via "album" (orange arrow labeled "2").
* Scott Swift (blue node) is connected to USA (grey node) via "nationality".
* Taylor Swift (green node) is connected to Lover (grey node) via "track".
* folklore (blue node) is connected to Seven (grey node) via "track".
* Taylor Swift (green node) is connected to Pop (green node) via "genre".
* **Responser:** Label above the "Answer" box.
* **Answer Box:** "Answer: Indie Folk" with a green checkmark indicating the correct answer.
### Detailed Analysis or ### Content Details
* **Incorrect Answers:** Both the "Training-free" and "Training-based" methods initially provide the incorrect answer "Pop."
* **Correct Answer:** "Our Framework" correctly identifies the answer as "Indie Folk."
* **Reasoning Chain:** The "Reasoning Chain by Reasoner" outlines the steps taken to arrive at the correct answer.
* **Knowledge Path:** The "Knowledge Path by Aligner" visually represents the connections between entities that lead to the correct answer.
### Key Observations
* The diagram highlights the difference between methods that rely on exploration versus those that use training to generate paths.
* The "Our Framework" section demonstrates a more effective approach to answering the question.
* The use of color and arrows helps to visually represent the flow of information and relationships between entities.
### Interpretation
The diagram illustrates the superiority of the proposed framework in answering complex questions compared to simpler exploration or path generation methods. The initial failure of both "Training-free" and "Training-based" methods underscores the need for a more sophisticated approach. "Our Framework," which combines reasoning and knowledge alignment, successfully identifies the correct answer, "Indie Folk." This suggests that a combination of structured reasoning and knowledge graph traversal is more effective for answering questions that require understanding relationships between entities. The diagram emphasizes the importance of considering multiple factors and relationships to arrive at an accurate conclusion.
</details>
Figure 1: The comparison between our Reason-Align-Respond framework and the existing methods for LLM-based KGQA.
The existing LLM-based KGQA studies broadly fall into two main categories: Training-free Agent Exploration methods (Sun et al., 2024) and Training-based Path Generation methods (Luo et al., 2024a). The former uses LLMs as agents to explore nodes in KGs, while the latter trains generators to retrieve and generate knowledge paths. However, both methods lack the holistic reasoning process that humans typically employ when answering questions. This limitation, combined with the semantic gap between natural language descriptions and structured knowledge graphs, often results in reasoning paths that do not align with human logic, lack semantic relevance, or contain unnecessary noise, as shown in Fig. 1.
On the other hand, the latest deep reasoning LLMs, such as OpenAI-o1 (Zhong et al., 2024) and DeepSeek-R1 (Guo et al., 2025), produce holistic reasoning chains before answering challenging questions, showcasing stronger logical reasoning abilities. However, reasoning chains obtained through reinforcement learning tend to be verbose, sometimes contain logical fallacies, and can amplify hallucinations (Bao et al., 2025). While our work is not specifically aimed at improving DeepSeek-R1, we believe that using knowledge paths from KGs as constraints on deep reasoning may alleviate these issues.
The two aforementioned aspects are complementary. For one thing, allowing LLMs to first perform human-like reasoning in KGQA establishes a structured, goal-oriented framework that helps reduce invalid or noisy KG path exploration. For another, while free-form reasoning by LLMs tends to increase hallucinations, incorporating knowledge paths from KGs provides constraints that keep the reasoning aligned with the knowledge graph, thereby reducing hallucinations.
In this work, we introduce Reason-Align-Respond (RAR), a novel framework that systematically integrates three modules, Reasoner, Aligner, and Responser, to align LLM reasoning with knowledge graphs for KGQA. Each of the three modules is a fine-tuned LLM. Firstly, Reasoner conducts global reasoning over the question, simulating a human-like thinking process to generate a reasoning chain that guides the LLM's exploration of the KG. Secondly, Aligner decodes a knowledge path on the KG based on this reasoning chain, with each decoding step guaranteed to correspond to an actual knowledge triple in the graph. Finally, Responser leverages both the reasoning chain and the knowledge path to generate the final answer.
We treat both the reasoning chain $z_{r}$ and the knowledge path $z_{p}$ as latent variables, use a probabilistic model to formalize the distribution of the answer $a$ given the question $q$ and KG ${\mathcal{G}}$, and propose an end-to-end training algorithm to jointly fine-tune the parameters across all three modules. Specifically, we employ the expectation-maximization (EM) algorithm to iteratively optimize model parameters while inferring the latent variables. The E-step estimates the posterior probability of the latent variables ($z_{r}$, $z_{p}$) given the observations ($q$, ${\mathcal{G}}$, $a$) and samples high-quality reasoning chains and aligned knowledge paths accordingly. The M-step maximizes the evidence lower bound (ELBO) of the log-likelihood to update model parameters. Through iterative optimization, the model progressively refines the reasoning chain and knowledge path, which in turn promotes the generation of high-quality responses, ultimately forming a stable closed loop.
We conduct extensive evaluation across multiple benchmarks, including WebQSP, CWQ, CommonsenseQA (CSQA), and MedQA, using Freebase, ConceptNet, and a medical KG as knowledge graphs. The results demonstrate the effectiveness of RAR in improving KGQA performance. Compared to 19 baselines across three categories, RAR achieves state-of-the-art results. Specifically, it achieves Hit scores of 93.3% on WebQSP and 91.0% on CWQ, with corresponding F1 scores of 87.7% and 84.8%, significantly surpassing existing methods. Additionally, RAR demonstrates strong zero-shot generalizability on CSQA and MedQA. The EM algorithm yields gradual performance improvements with increasing iterations and shows stable convergence after 200 steps. Human evaluation of reasoning chains shows that RAR generates high-quality, interpretable chains aligned with KG paths, ensuring accurate and reliable reasoning. We also report results of RAR under different LLM backbone models, and conduct an ablation study that confirms the importance of each component. Notably, RAR achieves this performance while maintaining computational efficiency during inference. These comprehensive results demonstrate that our approach successfully bridges the gap between LLM reasoning and structured knowledge, offering a promising direction for developing more reliable and interpretable question answering systems.
<details>
<summary>x2.png Details</summary>

### Visual Description
## Diagram: Reasoning and Knowledge Path Generation
### Overview
The image is a diagram illustrating a multi-step process for answering a question using reasoning chains and knowledge paths. It involves components like a Reasoner, Aligner, and Responser, with updates occurring in M-steps and sampling in an E-step. The diagram shows the flow of information and dependencies between these components.
### Components/Axes
* **Question q:** (Top-left, in a peach rounded box) "What's the music style of the album folklore by Scott Swift's daughter?"
* **Reasoner pθ(zr|q):** (Left, in a light blue rounded box) "Generate a step-by-step reasoning chain for the question: ..."
* **Reasoning Chain zr:** (Bottom-left, in a light gray rounded box)
* "1. Begin by identifying the daughter of the query entity "Scott Swift", represented by the intermediate entity "c". This step..."
* "2. Next, verify that the intermediate entity "c" has released an album... This ensures..."
* "3. Finally, determine the music genre of the album "folklore". This genre provides..."
* **Aligner pφ(zp|G, zr, q):** (Bottom-center, in a light blue rounded box) "Generate triples in the KG based on the question and reasoning chain: ..."
* **Knowledge Path zp:** (Bottom-right, in a light green rounded box) A graph with nodes and edges representing relationships between entities.
* Nodes: Scott Swift, Taylor Swift, USA, Lover, folklore, Indie Folk, Seven, pop
* Edges: daughter (Scott Swift -> Taylor Swift), nationality (Scott Swift -> USA), album (Taylor Swift -> folklore), genre (Taylor Swift -> pop), genre (folklore -> Indie Folk), track (Lover -> Taylor Swift), track (folklore -> Seven)
* Numbered paths:
* 1: Scott Swift -> Taylor Swift -> folklore
* 2: Taylor Swift -> folklore -> Indie Folk
* 3: Taylor Swift -> pop
* **Responser pw(a|zr, zp, q):** (Right, in a light blue rounded box) "Answer the question based on the reasoning chain and knowledge path: ..."
* **Answer a:** (Top-right, in a light green rounded box) "Indie Folk"
* **E-step:** (Top-center) "sample high-quality Reasoning Chains and Knowledge Paths (zr, zp) ~ pw,φ((zr, zp)|G, q, a)"
* **M-step:** (Left-center, Bottom-center) "update Reasoner", "update Aligner"
* **Reasoning Chain:** (Top-center, in a light gray rounded box) "<THINK> 1. Begin by identifying... 2. Next, verify that the... 3. Finally, determine the music genre of the album "folklore"... </THINK>"
* **Knowledge Path:** (Top-center, in a light gray rounded box) "<ALIGN> (Scott Swift, daughter, Taylor Swift), (Taylor Swift, album, folklore), (folklore, genre, Indie Folk) </ALIGN>"
* **KG-constrained Decoding:** (Right-center)
* **Prior:** (Dashed arrows) Indicates prior knowledge or information flow.
* **Likelihood:** (Dashed arrows) Indicates likelihood or probability.
### Detailed Analysis or Content Details
The diagram illustrates a pipeline:
1. **Question Input:** The process begins with a question (q).
2. **Reasoner:** The Reasoner generates a step-by-step reasoning chain (zr) based on the question.
3. **Reasoning Chain:** The reasoning chain outlines the steps needed to answer the question.
4. **Aligner:** The Aligner generates triples in the Knowledge Graph (KG) based on the question and reasoning chain, resulting in a knowledge path (zp).
5. **Knowledge Path:** The knowledge path represents the relationships between entities in the KG.
6. **Responser:** The Responser answers the question (a) based on the reasoning chain and knowledge path.
7. **Answer Output:** The final answer is provided.
8. **E-step:** High-quality reasoning chains and knowledge paths are sampled.
9. **M-step:** The Reasoner and Aligner are updated based on the sampled chains and paths.
The reasoning chain consists of three steps: identifying Scott Swift's daughter, verifying that the daughter has released an album, and determining the music genre of the album "folklore".
The knowledge path shows the relationships between entities such as Scott Swift, Taylor Swift, folklore, and Indie Folk. It also includes entities like USA, Lover, Seven, and pop.
### Key Observations
* The diagram shows a cyclical process with updates to the Reasoner and Aligner.
* The reasoning chain and knowledge path are used to generate the answer.
* The knowledge path is a graph representation of relationships between entities.
### Interpretation
The diagram illustrates a system for answering questions by generating reasoning chains and knowledge paths. The Reasoner and Aligner work together to create a structured representation of the information needed to answer the question. The Responser then uses this information to generate the final answer. The E-step and M-step indicate a learning or optimization process where the system improves its reasoning and alignment capabilities over time. The use of prior knowledge and likelihood suggests a probabilistic approach to reasoning and knowledge representation. The diagram highlights the importance of both reasoning and knowledge in answering complex questions.
</details>
Figure 2: Illustration of our RAR framework comprising Reasoner, Aligner, Responser with iterative EM optimization.
2 Approach
Knowledge graphs (KGs) contain extensive factual knowledge in the form of triples: ${\mathcal{G}}=\{(e,r,e^{\prime})\mid e,e^{\prime}\in{\mathcal{E}},r\in{\mathcal{R}}\}$ , where ${\mathcal{E}}$ and ${\mathcal{R}}$ denote sets of entities and relations, respectively. The goal of knowledge graph question answering (KGQA) is to retrieve relevant facts from a KG ${\mathcal{G}}$ and generate an answer $a$ in response to a given natural language question $q$ .
2.1 Task Formalization
We introduce two latent variables, a reasoning chain $z_{r}$ and a knowledge path $z_{p}$ , working together to answer the question $q$ based on a KG ${\mathcal{G}}$ :
- Reasoning Chain $z_{r}$ denotes a chain of discrete reasoning steps expressed in natural language, working together to address the question $q$ .
- Knowledge Path $z_{p}$ denotes an interconnected path of knowledge triples extracted from a KG ${\mathcal{G}}$ .
Since neither $z_{r}$ nor $z_{p}$ are explicitly annotated in existing KGQA benchmarks, we treat them as latent variables, and employ a probabilistic model to formalize the distribution of the answer $a$ conditioned on the question $q$ and KG ${\mathcal{G}}$ as:
$$
p_{w,\phi,\theta}(a|{\mathcal{G}},q)=\sum_{z_{r},z_{p}}p_{w}(a|z_{r},z_{p},q)\,p_{\phi}(z_{p}|{\mathcal{G}},z_{r},q)\,p_{\theta}(z_{r}|q), \tag{1}
$$
where we assume $a$ is conditionally independent of ${\mathcal{G}}$ given $(z_{r},z_{p},q)$ , allowing factorization and summation over these conditional probabilities. On this basis, we introduce our framework, Reason-Align-Respond (RAR), which integrates three modules, Reasoner, Aligner, and Responser, to align LLM reasoning with knowledge graphs for KGQA, as illustrated in Fig. 2.
It is worth noting that each of the three modules is a fine-tuned LLM, with parameters denoted by $\theta$ , $\phi$ , and $w$ , respectively. Each module uses the prompts shown in App. F to generate reasoning chains, knowledge paths, and answers.
Firstly, Reasoner $p_{\theta}(z_{r}|q)$ generates a latent reasoning chain $z_{r}$ to address the question $q$ . The Reasoner generates a reasoning chain in the following format:
$$
z_{r}=\texttt{<THINK>}s_{1}\dots s_{t}\texttt{</THINK>},
$$
where <THINK> and </THINK> are special tokens denoting the start and end of the reasoning process, and each $s_{i}$ is a discrete reasoning step.
Secondly, Aligner $p_{\phi}(z_{p}|{\mathcal{G}},z_{r},q)$ takes the question $q$ and the reasoning chain $z_{r}$ as inputs to explore the KG ${\mathcal{G}}$ , producing a latent knowledge path $z_{p}$ that aligns with $z_{r}$ . Aligner processes a prompt containing $q$ and $z_{r}$ to generate a knowledge path in the following format:
$$
z_{p}=\texttt{<ALIGN>}\pi_{1}\dots\pi_{t}\texttt{</ALIGN>},
$$
where <ALIGN> and </ALIGN> mark the beginning and end of the knowledge path. Each $\pi_{i}$ is a triple from the KG ${\mathcal{G}}$ formatted as $\pi_{i}=\texttt{<TRI>}(e_{i}^{h},r_{i},e_{i}^{t})\texttt{</TRI>}$ , where <TRI> and </TRI> bound the triple.
Finally, Responser $p_{w}(a|z_{r},z_{p},q)$ generates the final answer $a$ by synthesizing the question $q$ , reasoning chain $z_{r}$ , and knowledge path $z_{p}$ .
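The three tagged formats above are plain strings, so extracting the latent variables from model output is straightforward. Below is a minimal illustrative parser for the <THINK> and <ALIGN>/<TRI> layouts, written as a sketch under the assumption that steps are numbered "1.", "2.", … and that entity names contain no commas (a robust implementation would need a stricter triple grammar):

```python
import re

def parse_reasoning_chain(text):
    """Extract the steps s_1 ... s_t between <THINK> tags."""
    m = re.search(r"<THINK>(.*?)</THINK>", text, re.DOTALL)
    if m is None:
        return []
    # Split on "1. ", "2. ", ... step markers (a simplification).
    steps = re.split(r"\s*\d+\.\s+", m.group(1))
    return [s.strip() for s in steps if s.strip()]

def parse_knowledge_path(text):
    """Extract (head, relation, tail) triples between <ALIGN>/<TRI> tags."""
    m = re.search(r"<ALIGN>(.*?)</ALIGN>", text, re.DOTALL)
    if m is None:
        return []
    triples = []
    for body in re.findall(r"<TRI>\((.*?)\)</TRI>", m.group(1)):
        head, rel, tail = (part.strip() for part in body.split(","))
        triples.append((head, rel, tail))
    return triples
```

For the running example, `parse_knowledge_path` applied to the folklore path yields the three triples `(Scott Swift, daughter, Taylor Swift)`, `(Taylor Swift, album, folklore)`, and `(folklore, genre, Indie Folk)`.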
2.2 Optimization via the EM Algorithm
As mentioned above, each of the three modules is a fine-tuned LLM. In this section, we propose an end-to-end training algorithm to jointly optimize the parameters of all three modules. The training objective is the likelihood of the distribution $p_{w,\phi,\theta}(a|{\mathcal{G}},q)$ in Eq. (1). Since our model involves latent variables $z_{r}$ and $z_{p}$ , we adopt the Expectation-Maximization (EM) algorithm, a principled approach for maximum likelihood estimation (MLE) in latent-variable models (Dempster et al., 1977; Sen et al., 2022; Qu et al., 2021).
Unifying Reasoner and Aligner. In practice, we unify Reasoner and Aligner by merging their latent variables $z_{r}$ and $z_{p}$ into a single variable $z=(z_{r},z_{p})$ . This results in a consolidated module, referred to as ReAligner, whose parameters are denoted by $\psi$ . Hence, instead of generating $z_{r}$ and $z_{p}$ separately, ReAligner outputs both a Reasoning Chain and a Knowledge Path simultaneously, treating generation as a single instruction-tuning task with a prompt template (see App. F). With this simplification, the conditional distribution of the final answer $a$ is:
$$
p_{w,\psi}(a|{\mathcal{G}},q)=\sum_{z}p_{w}(a|q,z)p_{\psi}(z|{\mathcal{G}},q), \tag{2}
$$
where ${\mathcal{G}}$ denotes the KG, $q$ the question, and $z$ the unified latent variable (combining the Reasoning Chain and Knowledge Path).
Learning Objective. We aim to learn the parameters $(w,\psi)$ by maximizing the log-likelihood of the training data with respect to Eq. (2), written as:
$$
\max_{w,\psi}\;\mathcal{O}(w,\psi)=\log\mathbb{E}_{z\sim p_{\psi}(z|{\mathcal{G}},q)}[p_{w}(a|q,z)]. \tag{3}
$$
According to Jensen's inequality, we have:
$$
\mathcal{O}(w,\psi)\geq\underbrace{\mathbb{E}_{q(z)}\log\!\left(\frac{p_{w}(a|q,z)\,p_{\psi}(z|{\mathcal{G}},q)}{q(z)}\right)}_{{\mathcal{L}}_{\text{ELBO}}}, \tag{4}
$$
where $q(z)$ is a variational distribution. Equality holds when $q(z)=p_{w,\psi}(z|{\mathcal{G}},q,a)$ , the true posterior of $z$ . The term ${\mathcal{L}}_{\text{ELBO}}$ is the Evidence Lower BOund (ELBO), and maximizing ${\mathcal{L}}_{\text{ELBO}}$ indirectly maximizes $\mathcal{O}(w,\psi)$ .
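Equivalently, the bound can be rearranged into a reconstruction term minus a KL regularizer (a standard identity for latent-variable models, stated here only to make the structure of the bound explicit):

```latex
\mathcal{L}_{\text{ELBO}}
  = \mathbb{E}_{q(z)}\!\left[\log p_{w}(a|q,z)\right]
  - \mathrm{KL}\!\left(q(z)\,\|\,p_{\psi}(z|{\mathcal{G}},q)\right),
\qquad
\mathcal{O}(w,\psi)-\mathcal{L}_{\text{ELBO}}
  = \mathrm{KL}\!\left(q(z)\,\|\,p_{w,\psi}(z|{\mathcal{G}},q,a)\right).
```

The second identity shows directly that the gap vanishes exactly when $q(z)$ equals the true posterior, which is what the E-step targets.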
EM Algorithm. The EM algorithm alternates between an E-step and an M-step until convergence:
E-step.
Given the current parameters $(w^{(t)},\psi^{(t)})$ at iteration $t$ , the E-step updates the variational distribution $q^{(t)}(z)$ by minimizing $\mathrm{KL}(q(z)\,\|\,p_{w^{(t)},\psi^{(t)}}(z|{\mathcal{G}},q,a))$ . The solution is the posterior of $z$ under the current parameters:
$$
q^{(t)}(z)=p_{w^{(t)},\psi^{(t)}}(z|{\mathcal{G}},q,a). \tag{5}
$$
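Sampling from this posterior directly is intractable. One standard approximation, consistent with the algorithm's strategy of first drawing $N$ candidates from ReAligner and then keeping $K$ high-quality ones, is self-normalized importance sampling: since the candidates come from the prior $p_{\psi}(z|{\mathcal{G}},q)$, each one's importance weight reduces to the Responser likelihood $p_{w}(a|q,z)$. The sketch below assumes a hypothetical `log_likelihood` scoring function standing in for the Responser:

```python
import math

def e_step_select(samples, log_likelihood, answer, k):
    """Approximate the E-step posterior by importance sampling.

    `samples` are N latent candidates z = (reasoning chain, knowledge
    path) drawn from the ReAligner prior; with prior samples, the
    self-normalized importance weight of each candidate reduces to
    the Responser likelihood p_w(a|q,z). `log_likelihood(z, answer)`
    is a placeholder for that score. Returns the top-k candidates
    under the approximate posterior.
    """
    log_w = [log_likelihood(z, answer) for z in samples]
    m = max(log_w)
    weights = [math.exp(lw - m) for lw in log_w]   # numerically stable
    total = sum(weights)
    weights = [w / total for w in weights]
    ranked = sorted(zip(samples, weights), key=lambda p: p[1], reverse=True)
    return [z for z, _ in ranked[:k]]
```

Subtracting the maximum before exponentiating keeps the normalization stable even when log-likelihoods are large negative numbers.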
M-step.
Keeping $q^{(t)}(z)$ fixed, the M-step maximizes ${\mathcal{L}}_{\text{ELBO}}$ in Eq. (4) with respect to $w$ and $\psi$ . Ignoring terms that do not depend on $(w,\psi)$ , the objective reduces to:
$$
\begin{aligned}
Q(w,\psi|w^{(t)},\psi^{(t)})
&=\sum_{({\mathcal{G}},q,a)}\sum_{z}q^{(t)}(z)\,\log\!\big[p_{w}(a|q,z)\,p_{\psi}(z|{\mathcal{G}},q)\big] \\
&=\underbrace{\sum_{({\mathcal{G}},q,a)}\sum_{z}q^{(t)}(z)\log p_{w}(a|q,z)}_{Q_{\text{Responser}}(w)}
+\underbrace{\sum_{({\mathcal{G}},q,a)}\sum_{z}q^{(t)}(z)\log p_{\psi}(z|{\mathcal{G}},q)}_{Q_{\text{ReAligner}}(\psi)},
\end{aligned} \tag{6}
$$
which naturally decomposes into instruction-tuning objectives for Responser and ReAligner, matching the two factors in Eq. (2).
By iterative optimization with the EM algorithm, our framework progressively refines its understanding of the question. This iterative process gradually corrects any flaws in Reasoning Chains and Knowledge Paths, leading to answers that are both higher in quality and more interpretable, while significantly reducing the risk of hallucination.
The workflow of the EM algorithm is shown in Alg. 1, with more details in practice in App. A.
Algorithm 1 The EM algorithm in RAR
while not converged do
$\boxdot$ E-step:
For each instance, sample $N$ Reasoning Chains and Knowledge Paths $z_{I}$ from ReAligner $p_{\psi}$ .
For each instance, identify $K$ high-quality Reasoning Chains and Knowledge Paths $z_{I}^{h}$ from $z_{I}$ based on Eq. (5).
$\boxdot$ M-step:
For each instance, update Responser $p_{w}$ with $Q_{\text{Responser}}(w)$ in Eq. (6) using $z_{I}$ .
For each instance, update ReAligner $p_{\psi}$ according to $Q_{\text{ReAligner}}(\psi)$ in Eq. (6) using $z_{I}^{h}$ .
end while
2.3 Techniques During Inference
During inference, given $q$ , Reasoner generates $z_{r}$ , while Aligner produces $z_{p}$ . Responser synthesizes them to produce $a$ . To enhance performance, we introduce three additional key techniques.
Table 1: Performance comparison with different baselines on the two KGQA datasets.
| Type | Method | WebQSP Hit | WebQSP F1 | CWQ Hit | CWQ F1 |
|---|---|---|---|---|---|
| LLM Reasoning | Qwen2-7B (Yang et al., 2024) | 50.8 | 35.5 | 25.3 | 21.6 |
| | Llama-2-7B (Touvron et al., 2023) | 56.4 | 36.5 | 28.4 | 21.4 |
| | Llama-3.1-8B (Meta, 2024) | 55.5 | 34.8 | 28.1 | 22.4 |
| | GPT-4o-mini (OpenAI, 2024a) | 63.8 | 40.5 | 63.8 | 40.5 |
| | ChatGPT (OpenAI, 2022) | 59.3 | 43.5 | 34.7 | 30.2 |
| | ChatGPT+Few-shot (Brown et al., 2020) | 68.5 | 38.1 | 38.5 | 28.0 |
| | ChatGPT+CoT (Wei et al., 2022b) | 73.5 | 38.5 | 47.5 | 31.0 |
| | ChatGPT+Self-Consistency (Wang et al., 2024) | 83.5 | 63.4 | 56.0 | 48.1 |
| Graph Reasoning | GraftNet (Sun et al., 2018) | 66.7 | 62.4 | 36.8 | 32.7 |
| | NSM (He et al., 2021) | 68.7 | 62.8 | 47.6 | 42.4 |
| | SR+NSM (Zhang et al., 2022) | 68.9 | 64.1 | 50.2 | 47.1 |
| | ReaRev (Mavromatis and Karypis, 2022) | 76.4 | 70.9 | 52.9 | 47.8 |
| KG+LLM | KD-CoT (Wang et al., 2023a) | 68.6 | 52.5 | 55.7 | - |
| | EWEK-QA (Dehghan et al., 2024) | 71.3 | - | 52.5 | - |
| | ToG (GPT-4) (Sun et al., 2024) | 82.6 | - | 68.5 | - |
| | EffiQA (Dong et al., 2025) | 82.9 | - | 69.5 | - |
| | RoG (Llama-2-7B) (Luo et al., 2024b) | 85.7 | 70.8 | 62.6 | 56.2 |
| | GNN-RAG+RA (Mavromatis and Karypis, 2024) | 90.7 | 73.5 | 68.7 | 60.4 |
| | GCR (Llama-3.1-8B + GPT-4o-mini) (Luo et al., 2024c) | 92.2 | 74.1 | 75.8 | 61.7 |
| | RAR (Llama-3.1-8B + GPT-4o-mini) | 93.3 | 87.7 | 91.0 | 84.8 |
KG-constrained Decoding. KG-constrained Decoding aims to prevent hallucinated triples that do not exist in the KG. When generating the Knowledge Path, Aligner may inadvertently produce triples absent from the KG. To address this, KG-constrained Decoding restricts the output tokens so that only tokens forming valid KG triples can be produced. In this way, the generated Knowledge Path strictly aligns with actual entities and relations in the KG. Related work (Luo et al., 2024c; Li et al., 2024) also attempts to mitigate similar issues; our approach is tailored specifically to our framework.
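One common way to realize such a constraint is to index the KG's triples in a prefix trie and intersect the decoder's choices with the trie at every step. The sketch below is a simplified illustration of that idea (one "token" per triple element rather than subword tokens, and a hypothetical `score` function standing in for the Aligner's next-token scores):

```python
def build_trie(triples):
    """Prefix trie over KG triples; one node per element for brevity
    (a real implementation would operate on subword tokens)."""
    root = {}
    for head, rel, tail in triples:
        node = root
        for element in (head, rel, tail):
            node = node.setdefault(element, {})
    return root

def constrained_decode(score, trie):
    """Greedy decoding restricted to paths in the trie.

    `score(prefix, candidate)` is a stand-in for the Aligner's
    next-token score; only candidates that keep the output a prefix
    of a real KG triple are considered, so hallucinated triples are
    impossible by construction.
    """
    prefix, node = [], trie
    while node:                      # stop when the triple is complete
        best = max(node, key=lambda t: score(prefix, t))
        prefix.append(best)
        node = node[best]
    return tuple(prefix)
```

For instance, with the toy KG `[("folklore", "genre", "Indie Folk"), ("folklore", "track", "Seven")]` and a score that prefers tokens mentioned in the reasoning chain ("determine the music genre of the album folklore"), the decoder can only emit `("folklore", "genre", "Indie Folk")`; a relation absent from the KG is never even a candidate.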
Knowledge Path Expansion. Knowledge Path Expansion addresses the potential incompleteness of initially generated Knowledge Paths. To illustrate, consider a question about the countries that share borders with the United States, for which Aligner generates a single Knowledge Path. While correct, this path represents only one instance of a broader pattern. By abstracting the specific instance into a template:
$$
\texttt{<ALIGN><TRI>(US,borders,?country)</TRI></ALIGN>},
$$
where $?country$ is a variable, we capture the fundamental relationship structure. Applied to the KG, this template retrieves all valid instances, such as:
$$
\texttt{<ALIGN><TRI>(US,borders,Canada)</TRI></ALIGN>}.
$$
This method transforms a single Knowledge Path into a comprehensive query template, enabling more complete and exhaustive answers.
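The expansion step amounts to pattern-matching a templated triple against the KG. A minimal single-triple sketch of this idea (the paper's version abstracts and re-grounds full multi-hop paths; `?`-prefixed tokens mark variables, as in the example above):

```python
def expand_template(template, kg):
    """Ground a templated triple against the KG.

    Any element starting with '?' is a variable and matches every
    value; concrete elements must match exactly. Returns all KG
    triples instantiating the template.
    """
    head, rel, tail = template
    return [
        (h, r, t) for (h, r, t) in kg
        if (head.startswith("?") or h == head)
        and (rel.startswith("?") or r == rel)
        and (tail.startswith("?") or t == tail)
    ]
```

Applied to a toy KG containing `(US, borders, Canada)` and `(US, borders, Mexico)`, the template `(US, borders, ?country)` retrieves both border triples while ignoring unrelated facts such as `(US, capital, Washington, D.C.)`.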
LLM-driven Consolidation. LLM-driven Consolidation addresses the inconsistencies and noise that emerge when sampling multiple Reasoning Chains and Knowledge Paths. Multiple sampling increases coverage and improves the likelihood of correct answers, but inevitably introduces noise and conflicts between samples. To address this challenge, we propose using a powerful LLM as a "Consolidator" that analyzes and integrates multiple Reasoning Chains and Knowledge Paths to derive final answers, following the prompt template detailed in App. F. This approach preserves the benefits of multiple sampling while leveraging the LLM's analytical capabilities to produce reliable answers.
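Structurally, consolidation is an aggregation step over the sampled runs. The sketch below shows the shape of that step, with a hypothetical `consolidator` callable standing in for the prompted LLM and a simple frequency vote as an illustrative non-LLM fallback (the fallback is our addition for illustration, not part of the paper's method):

```python
from collections import Counter

def consolidate(samples, consolidator=None):
    """Aggregate answers from multiple sampled (chain, path, answer) runs.

    `consolidator` stands in for the prompted LLM Consolidator (a
    hypothetical callable taking all samples and returning a final
    answer); when absent, fall back to a plain frequency vote over
    the sampled answers.
    """
    if consolidator is not None:
        return consolidator(samples)
    votes = Counter(answer for _, _, answer in samples)
    return votes.most_common(1)[0][0]
```

The LLM-based consolidator can exploit the content of the chains and paths, not just answer frequency, which is what distinguishes it from self-consistency-style voting.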
3 Experiment
3.1 Experiment Settings
Datasets. Following previous research (Luo et al., 2024c; Sun et al., 2024), we evaluate our model on three datasets: WebQuestionsSP (WebQSP) (Yih et al., 2016), Complex WebQuestions (CWQ) (Talmor and Berant, 2018), and CommonsenseQA (CSQA) (Talmor et al., 2019). The first two use Freebase (Bollacker et al., 2008) as the KG, while CSQA leverages ConceptNet (Speer et al., 2017), allowing us to assess model generalizability to an unseen KG.
Baselines. We compare RAR with 19 baselines across three categories: LLM reasoning methods, graph reasoning methods, and KG-enhanced LLM reasoning methods.
Evaluation Metrics. For WebQSP and CWQ, we adopt Hit and F1 as evaluation metrics. Hit checks whether the generated predictions match any correct answer, while F1 evaluates overall answer coverage by balancing precision and recall. For CSQA, a multiple-choice QA dataset, we use accuracy as the evaluation metric.
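These set-based metrics have a common straightforward implementation, sketched below; exact answer normalization (casing, alias matching) varies by benchmark and is not specified here:

```python
def hit_score(predictions, gold):
    """Hit: 1 if any predicted answer matches a gold answer, else 0."""
    return int(bool(set(predictions) & set(gold)))

def answer_f1(predictions, gold):
    """F1 over predicted vs. gold answer sets, balancing precision
    (fraction of predictions that are correct) and recall (fraction
    of gold answers that are covered)."""
    pred, ans = set(predictions), set(gold)
    overlap = len(pred & ans)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ans)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting `["a", "b"]` against gold `["a", "c"]` yields Hit = 1 but F1 = 0.5, which is why F1 is the stricter measure of answer coverage.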
Implementations. Our implementation uses Llama-3.1-8B (Meta, 2024) as the backbone for Reasoner, Aligner, and Responser. To enhance question decomposition, we pretrain both Reasoner and Aligner using 2,000 exemplars demonstrating step-by-step KG-based problem solving. For each component, we generate the top-$10$ candidates using KG-constrained Decoding and Knowledge Path Expansion, with GPT-4o-mini handling LLM-driven Consolidation. Details are provided in App. C.
3.2 Main Results
Tab. 1 shows that RAR achieves significant gains on both WebQSP and CWQ datasets, improving the state-of-the-art by 13.6% and 23.1% in F1, and by 1.1% and 15.2% in Hit respectively. These remarkable improvements demonstrate the effectiveness of integrating structured KG knowledge with LLM reasoning.
Table 2: Impact of different components on CWQ.
| Variant | F1 | Precision | Recall |
|---|---|---|---|
| RAR | 84.8 | 84.9 | 89.0 |
| w/o LC | 83.0 | 81.8 | 91.2 |
| w/o Reasoner | 80.3 | 75.6 | 94.7 |
| w/o KD | 48.9 | 54.3 | 50.8 |
| w/o KPE | 71.7 | 80.2 | 71.7 |
3.3 Ablation Study
As shown in Tab. 2, removing LLM-driven Consolidation (LC) lowers precision but increases recall, since LC aims to eliminate noisy predictions. Excluding Reasoner causes a pronounced drop in precision but a rise in recall, indicating that Reasoning Chains guide Aligner to explore the KG more accurately, reducing hallucination and noise. Disabling Knowledge Path Expansion (KPE) diminishes performance, confirming its role in enriching Knowledge Paths. Most importantly, removing KG-constrained Decoding (KD) yields the largest performance decrease, underscoring the importance of restricting generation to valid KG paths.
3.4 Further Analyses
*[Chart: Precision, Recall, and F1 vs. EM iteration steps (20 to 300). All three metrics climb steeply in early iterations, with Precision rising from about 65 to 76, Recall from about 52 to 64, and F1 from about 54 to 66, then plateau by roughly 200 steps.]*
Figure 3: Impact of iteration steps of the EM algorithm.
Impact of Iteration Steps. As shown in Fig. 3, RAR exhibits consistent improvement across all metrics during EM updates. The performance rises rapidly in early stages and continues to refine over iterations, eventually reaching convergence with minimal fluctuations.
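Schematically, each EM update alternates between sampling latent Reasoning Chains and Knowledge Paths (E-step) and re-fitting the Reasoner and Aligner on the weighted samples (M-step). The loop below is illustrative only; `sample_latents`, `weight`, and `update` are our own placeholder names standing in for the components defined by the paper's probabilistic model:

```python
def em_iterations(questions, sample_latents, weight, update, steps=3):
    """Schematic EM loop over latent (reasoning chain, knowledge path) pairs.

    E-step: sample latent chains/paths per question and weight them by an
    (approximate) posterior. M-step: re-fit parameters on the weighted batch.
    All callables are placeholders, not the paper's actual implementation.
    """
    params = {}
    for _ in range(steps):
        batch = []
        for q in questions:  # E-step: gather weighted latent samples
            for chain, path in sample_latents(q, params):
                batch.append((q, chain, path, weight(q, chain, path, params)))
        params = update(batch)  # M-step: maximize expected log-likelihood
    return params
```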
*[Chart: human-evaluated Correctness (bars) and Alignment Ratio (line) for Llama-3.1-8B, GPT-4o, and RAR. Correctness is about 82%, 92%, and 97% respectively; Alignment Ratio is about 52%, 56%, and 96%.]*
Figure 4: Human evaluation of reasoning chains on CWQ.
Table 3: Efficiency and performance of RAR compared to different methods on CWQ.
| Types | Methods | Hit | Avg. Runtime (s) |
| --- | --- | --- | --- |
| Path Generation | GNN-RAG | 66.8 | 1.73 |
| Path Generation | RoG | 62.6 | 2.68 |
| Path Generation | GCR | 75.8 | 3.72 |
| Agent Exploration | ToG | 68.5 | 18.89 |
| Agent Exploration | EffiQA | 69.5 | - |
| Ours | RAR | 91.0 | 4.38 |
Quality of Reasoning Chains. Through manual evaluation of 500 randomly selected samples, we assess the quality of Reasoning Chains using two criteria: reasoning correctness and KG alignment. The correctness metric evaluates whether the Reasoning Chain successfully solves the given question, while the alignment metric measures how well the reasoning steps correspond to valid KG paths. As shown in Fig. 4, RAR substantially outperforms both GPT-4o and baseline methods across both metrics. These results demonstrate that by aligning Reasoning Chains to KG structures, our approach not only improves the reliability of the reasoning process but also enhances its interpretability.
*[Chart: Precision, Recall, and F1 vs. beam size (1, 3, 5, 10). Precision rises from about 77 to 80 and plateaus after beam size 3; Recall (about 63 to 72) and F1 (about 66 to 71.5) improve steadily with larger beams.]*
Figure 5: Impact of different beam size on CWQ.
KG-constrained Decoding Effectiveness. We examine the effectiveness of KG-constrained Decoding in mitigating hallucinations and maintaining efficiency. Our method achieves zero hallucinations in Knowledge Paths when answers are correct, while without constraints, even correct answers show a 44% hallucination rate. The efficiency evaluation reveals minimal computational overhead. Particularly noteworthy is the comparison with GCR, the previous state-of-the-art method using KG constraints. Our approach achieves a 15.2% improvement in answer accuracy over GCR, with only a marginal increase in runtime from Reasoning Chain generation. This modest overhead is well justified by the significant gains in answer reliability and interpretability.
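KG-constrained Decoding can be pictured, at the level of whole relations rather than sub-word tokens, as masking every candidate hop that does not correspond to an actual edge leaving the current entity. A minimal sketch under that simplification (the real method constrains token-level generation; the helper names and scoring function here are ours):

```python
def valid_hops(kg, entity):
    """All (relation, tail) pairs that actually leave `entity` in the KG."""
    return {(r, t) for h, r, t in kg if h == entity}

def decode_path(kg, start, score, max_hops=3):
    """Greedily extend a path, choosing only among KG-valid hops, so no
    hallucinated edge can appear in the decoded Knowledge Path."""
    path, entity = [], start
    for _ in range(max_hops):
        options = valid_hops(kg, entity)
        if not options:
            break
        rel, tail = max(options, key=score)  # score stands in for LLM likelihood
        path.append((entity, rel, tail))
        entity = tail
    return path
```

Because candidates outside `valid_hops` are never considered, every decoded triple is guaranteed to exist in the KG, which is the property the zero-hallucination result above relies on.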
Table 4: Impact of using different LLM backbones for Reasoner, Aligner and LLM-driven Consolidation.
| Components | Variants | Hit | F1 |
| --- | --- | --- | --- |
| Reasoner and Aligner | Llama-2-7B | 84.0 | 68.9 |
| Reasoner and Aligner | Llama-3.1-8B | 85.2 | 72.6 |
| Reasoner and Aligner | Qwen2-0.5B | 71.3 | 56.0 |
| Reasoner and Aligner | Qwen2-1.5B | 72.0 | 56.6 |
| Reasoner and Aligner | Qwen2-7B | 81.4 | 67.1 |
| LLM-driven Consolidation | GPT-4o-mini | 91.0 | 84.8 |
| LLM-driven Consolidation | GPT-4o | 92.8 | 84.9 |
| LLM-driven Consolidation | Qwen2-7B | 88.7 | 82.2 |
| LLM-driven Consolidation | Llama-3.1-8B | 90.6 | 83.7 |
| LLM-driven Consolidation | Llama-3.1-70B | 92.4 | 83.0 |
Impact of Beam Size. As shown in Fig. 5, increasing beam size allows RAR to explore more potential Reasoning Chains and Knowledge Paths, leading to improved performance across all metrics. This demonstrates that examining multiple candidate solutions helps identify better Reasoning Chains and Knowledge Paths for responses of higher quality.
Impact of Different LLM Backbones. Tab. 4 demonstrates that larger LLMs generally achieve better performance, with Llama-3.1-8B and GPT-4o delivering the strongest results for the Reasoner/Aligner backbone and for LLM-driven Consolidation (LC), respectively.
Zero-shot Generalizability to Unseen KGs.
Table 5: Zero-shot transferability to unseen KG.
| Model | CSQA | MedQA |
| --- | --- | --- |
| GPT-4o-mini | 91 | 75 |
| GCR | 94 | 79 |
| RAR | 94 | 80 |
Following Luo et al. (2024c), we evaluate RAR's zero-shot transfer capabilities on CSQA (Talmor et al., 2019) and MedQA (Jin et al., 2021). The results show that RAR achieves superior zero-shot performance compared to GPT-4o-mini on both datasets. Notably, RAR achieves comparable performance to GCR, as both methods leverage the KG to constrain the decoding process, enhancing reasoning without requiring additional training on the target KG.
3.5 Case Study
*[Figure panel: four QA examples, each pairing a question with the Reasoning Chain generated by Reasoner, the Knowledge Path generated by Aligner as KG triples, and a KG-constrained decoding diagram, e.g. the single-hop path (Josef Mengele, people.person.profession, Physician) for "What did Dr. Josef Mengele do?" and the three-hop path through (Trey Songz, music.featured_artist.recordings, Girl Tonight), (Trey Songz, people.person.education-education.education.institution, Petersburg High School), and (Petersburg High School, education.educational_institution.school_type, High school) for "What high school did the artist who recorded 'Girl Tonight' attend?".]*
Figure 6: Examples of Reasoning Chains and Knowledge Paths generated by RAR under different iteration steps.
Fig. 6 illustrates qualitative results from our RAR framework. To investigate the effect of EM iterations, Fig. 7 presents two representative cases that demonstrate the evolution of RAR's behavior across iterations of the EM algorithm, showcasing distinct improvements in RAR's ability to generate Reasoning Chains and Knowledge Paths. In Case 1, we examine RAR's response to the query "What high school did the artist who recorded 'Girl Tonight' attend?" The early-stage model exhibits suboptimal verification behavior: it searches directly for high school information instead of following the proper verification sequence of first identifying educational institutions and then verifying that the specific institution is a high school. This Reasoning Chain prevents proper alignment with the Knowledge Path in the KG. In contrast, the later-stage model generates both effectively, producing a higher-quality Reasoning Chain that successfully aligns with the Knowledge Path in the KG. Case 2 examines RAR's handling of "What type of Claude Debussy music appears in the film Black Tights?" Here we observe a different pattern of improvement. While the early-stage model generates the same Reasoning Chain as the later-stage model, it fails to generate Knowledge Paths that fully align with and reflect this reasoning, resulting in a misaligned Knowledge Path that does not lead to the correct answer. The later-stage model maintains consistency between Reasoning Chains and Knowledge Paths, thus arriving at the correct answer. These cases validate the effectiveness of the EM approach.
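The alignment property at issue in these cases reduces to a simple check: a Knowledge Path is faithful exactly when every one of its triples is a real KG edge and the path terminates at the answer. A toy sketch using the Case 2 triples (the helper name is ours):

```python
def path_aligned(kg, path, answer):
    """True iff every triple is a real KG edge and the path ends at `answer`."""
    return all(t in kg for t in path) and bool(path) and path[-1][2] == answer

# Toy KG fragment from Case 2 ("What type of Claude Debussy music
# appears in the film Black Tights?"; answer: Ballet).
kg = {
    ("Claude Debussy", "music.artist.genre", "Ballet"),
    ("Ballet", "film.film_genre.films_in_this_genre", "Black Tights"),
}
good = [("Claude Debussy", "music.artist.genre", "Ballet")]
bad = [("Claude Debussy", "music.artist.genre", "Opera")]  # hallucinated edge
```

A misaligned path like `bad` fails the check even when its surface form resembles a plausible reasoning step, which is the failure mode the early-stage model exhibits in Case 2.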
4 Related Work
LLM Reasoning. Recent progress in LLMs has spurred extensive research to improve deep reasoning. One line of work focuses on prompting-based methods, which elicit intermediate reasoning steps during inference, such as Chain-of-Thought (CoT) prompting (Wei et al., 2022a). Building on CoT, self-consistency (Wang et al., 2023b) generates multiple reasoning traces and selects the most consistent answer. Tree-of-Thought (Yao et al., 2023) explores branching steps in a tree structure to uncover optimal solutions. Beyond prompting, researchers have investigated fine-tuning LLMs on reasoning tasks (Yu et al., 2022; Hoffman et al., 2023), including reinforcement learning methods (OpenAI, 2024b; Guo et al., 2025) that encourage more complex multi-step reasoning before arriving at a final answer.
KG-enhanced LLM Reasoning. Despite their remarkable performance, LLMs still face limitations such as incomplete or outdated domain knowledge, interpretability challenges, and the potential for hallucinations. To address these issues, a growing body of work has focused on integrating LLMs with KGs (Pan et al., 2024). KD-CoT (Wang et al., 2023a) enhances CoT by retrieving relevant facts from external KGs, guiding LLMs with more reliable information. RoG (Luo et al., 2024b) employs a plan-retrieve-reason pipeline that explicitly fetches KG-based reasoning paths to ground the final answer. GCR (Luo et al., 2024c) further mitigates hallucinations by enforcing graph-constrained decoding, ensuring that every reasoning step aligns with valid KG connections. GNN-RAG (Mavromatis and Karypis, 2024) leverages a graph neural network for effective KG retrieval, while StructGPT (Jiang et al., 2023) and ToG (Sun et al., 2024) treat the LLM as an agent that navigates the KG, assembling multi-hop paths to produce more trustworthy answers.
5 Conclusion
In this paper, we present RAR, a novel framework that integrates LLM reasoning with knowledge graphs for KGQA through three key components: Reasoner, Aligner, and Responser. We formulate this process as a probabilistic model and optimize it using the Expectation-Maximization algorithm. Through extensive experiments on multiple benchmarks, we demonstrate that RAR achieves state-of-the-art performance while maintaining interpretability and computational efficiency. These results demonstrate that RAR successfully bridges the gap between LLM reasoning and structured knowledge, offering a promising direction for building reliable and interpretable QA systems.
Limitations
One limitation of our framework lies in the computational overhead introduced by Reasoner. In certain cases, especially for complex queries, the Reasoning Chain generated by Reasoner can become relatively long, increasing resource consumption. However, our experimental results demonstrate that the performance gains from incorporating Reasoning Chains justify this additional cost, striking a practical balance between efficiency and effectiveness. Another limitation concerns generalizability to specialized domains. Though our framework, trained on Freebase-based KGQA datasets, shows improved generalization to unseen KGs compared to previous methods, its performance on highly specialized knowledge graphs (e.g., medical KGs) remains to be improved. Improving adaptation to domain-specific KGs presents a promising direction for future research.
References
- Bao et al. (2025) Forrest Bao, Chenyu Xu, and Ofer Mendelevitch. 2025. Deepseek-r1 hallucinates more than deepseek-v3. Accessed: 2025-02-10.
- Bollacker et al. (2008) Kurt D. Bollacker, Colin Evans, Praveen K. Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, pages 1247–1250. ACM.
- Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Cheng et al. (2024) Sitao Cheng, Ziyuan Zhuang, Yong Xu, Fangkai Yang, Chaoyun Zhang, Xiaoting Qin, Xiang Huang, Ling Chen, Qingwei Lin, Dongmei Zhang, Saravan Rajmohan, and Qi Zhang. 2024. Call me when necessary: LLMs can efficiently and faithfully reason over structured environments. In Findings of the Association for Computational Linguistics: ACL 2024, pages 4275–4295, Bangkok, Thailand. Association for Computational Linguistics.
- Dehghan et al. (2024) Mohammad Dehghan, Mohammad Alomrani, Sunyam Bagga, David Alfonso-Hermelo, Khalil Bibi, Abbas Ghaddar, Yingxue Zhang, Xiaoguang Li, Jianye Hao, Qun Liu, Jimmy Lin, Boxing Chen, Prasanna Parthasarathi, Mahdi Biparva, and Mehdi Rezagholizadeh. 2024. EWEK-QA: Enhanced web and efficient knowledge graph retrieval for citation-based question answering systems. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14169–14187, Bangkok, Thailand. Association for Computational Linguistics.
- Dempster et al. (1977) Arthur P Dempster, Nan M Laird, and Donald B Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B (Methodological), 39(1):1–22.
- Dong et al. (2025) Zixuan Dong, Baoyun Peng, Yufei Wang, Jia Fu, Xiaodong Wang, Xin Zhou, Yongxue Shan, Kangchen Zhu, and Weiguo Chen. 2025. Effiqa: Efficient question-answering with strategic multi-model collaboration on knowledge graphs. In Proceedings of the 31st International Conference on Computational Linguistics, COLING 2025, Abu Dhabi, UAE, January 19-24, 2025, pages 7180–7194. Association for Computational Linguistics.
- Gu et al. (2024) Yu Gu, Yiheng Shu, Hao Yu, Xiao Liu, Yuxiao Dong, Jie Tang, Jayanth Srinivasa, Hugo Latapie, and Yu Su. 2024. Middleware for LLMs: Tools are instrumental for language agents in complex environments. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7646–7663, Miami, Florida, USA. Association for Computational Linguistics.
- Guo et al. (2025) Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948.
- Hadi et al. (2023) Muhammad Usman Hadi, Rizwan Qureshi, Abbas Shah, Muhammad Irfan, Anas Zafar, Muhammad Bilal Shaikh, Naveed Akhtar, Jia Wu, Seyedali Mirjalili, et al. 2023. A survey on large language models: Applications, challenges, limitations, and practical usage. Authorea Preprints, 3.
- He et al. (2021) Gaole He, Yunshi Lan, Jing Jiang, Wayne Xin Zhao, and Ji-Rong Wen. 2021. Improving multi-hop knowledge base question answering by learning intermediate supervision signals. In Proceedings of the 14th ACM International Conference on Web Search and Data Mining, pages 553–561.
- Hoffman et al. (2023) Matthew Douglas Hoffman, Du Phan, David Dohan, Sholto Douglas, Tuan Anh Le, Aaron Parisi, Pavel Sountsov, Charles Sutton, Sharad Vikram, and Rif A. Saurous. 2023. Training chain-of-thought via latent-variable inference. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
- Huang et al. (2025) Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. 2025. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55.
- Huang et al. (2024) Xiang Huang, Sitao Cheng, Shanshan Huang, Jiayu Shen, Yong Xu, Chaoyun Zhang, and Yuzhong Qu. 2024. QueryAgent: A reliable and efficient reasoning framework with environmental feedback based self-correction. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5014–5035, Bangkok, Thailand. Association for Computational Linguistics.
- Jiang et al. (2023) Jinhao Jiang, Kun Zhou, Zican Dong, Keming Ye, Wayne Xin Zhao, and Ji-Rong Wen. 2023. Structgpt: A general framework for large language model to reason over structured data. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9237–9251.
- Jiang et al. (2022) Jinhao Jiang, Kun Zhou, Xin Zhao, and Ji-Rong Wen. 2022. Unikgqa: Unified retrieval and reasoning for solving multi-hop question answering over knowledge graph. In The Eleventh International Conference on Learning Representations.
- Jin et al. (2021) Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2021. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421.
- Li et al. (2024) Kun Li, Tianhua Zhang, Xixin Wu, Hongyin Luo, James Glass, and Helen Meng. 2024. Decoding on graphs: Faithful and sound reasoning on knowledge graphs through generation of well-formed chains. arXiv preprint arXiv:2410.18415.
- Li et al. (2023) Tianle Li, Xueguang Ma, Alex Zhuang, Yu Gu, Yu Su, and Wenhu Chen. 2023. Few-shot in-context learning on knowledge base question answering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6966–6980, Toronto, Canada. Association for Computational Linguistics.
- Luo et al. (2024a) Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. 2024a. Reasoning on graphs: Faithful and interpretable large language model reasoning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
- Luo et al. (2024b) Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. 2024b. Reasoning on graphs: Faithful and interpretable large language model reasoning. In International Conference on Learning Representations.
- Luo et al. (2024c) Linhao Luo, Zicheng Zhao, Chen Gong, Gholamreza Haffari, and Shirui Pan. 2024c. Graph-constrained reasoning: Faithful reasoning on knowledge graphs with large language models. CoRR, abs/2410.13080.
- Mavromatis and Karypis (2022) Costas Mavromatis and George Karypis. 2022. Rearev: Adaptive reasoning for question answering over knowledge graphs. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2447–2458.
- Mavromatis and Karypis (2024) Costas Mavromatis and George Karypis. 2024. Gnn-rag: Graph neural retrieval for large language model reasoning. arXiv preprint arXiv:2405.20139.
- Meta (2024) Meta. 2024. Build the future of ai with meta llama 3.
- Nie et al. (2024) Zhijie Nie, Richong Zhang, Zhongyuan Wang, and Xudong Liu. 2024. Code-style in-context learning for knowledge-based question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18833–18841.
- OpenAI (2022) OpenAI. 2022. Introducing chatgpt.
- OpenAI (2024a) OpenAI. 2024a. Hello gpt-4o.
- OpenAI (2024b) OpenAI. 2024b. Learning to reason with llms.
- Pan et al. (2024) Shirui Pan, Linhao Luo, Yufei Wang, Chen Chen, Jiapu Wang, and Xindong Wu. 2024. Unifying large language models and knowledge graphs: A roadmap. IEEE Transactions on Knowledge and Data Engineering (TKDE).
- Qu et al. (2021) Meng Qu, Junkun Chen, Louis-Pascal A. C. Xhonneux, Yoshua Bengio, and Jian Tang. 2021. Rnnlogic: Learning logic rules for reasoning on knowledge graphs. In 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.
- Sen et al. (2022) Prithviraj Sen, Breno W. S. R. de Carvalho, Ryan Riegel, and Alexander G. Gray. 2022. Neuro-symbolic inductive logic programming with logical neural networks. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pages 8212–8219. AAAI Press.
- Speer et al. (2017) Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. Conceptnet 5.5: An open multilingual graph of general knowledge. In Proceedings of the AAAI conference on artificial intelligence, volume 31.
- Sun et al. (2018) Haitian Sun, Bhuwan Dhingra, Manzil Zaheer, Kathryn Mazaitis, Ruslan Salakhutdinov, and William Cohen. 2018. Open domain question answering using early fusion of knowledge bases and text. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4231–4242.
- Sun et al. (2024) Jiashuo Sun, Chengjin Xu, Lumingyuan Tang, Saizhuo Wang, Chen Lin, Yeyun Gong, Lionel Ni, Heung-Yeung Shum, and Jian Guo. 2024. Think-on-graph: Deep and responsible reasoning of large language model on knowledge graph. In The Twelfth International Conference on Learning Representations.
- Talmor and Berant (2018) Alon Talmor and Jonathan Berant. 2018. The web as a knowledge-base for answering complex questions. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 641–651.
- Talmor et al. (2019) Alon Talmor, Jonathan Herzig, Nicholas Lourie, and Jonathan Berant. 2019. Commonsenseqa: A question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4149–4158.
- Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Wang et al. (2023a) Keheng Wang, Feiyu Duan, Sirui Wang, Peiguang Li, Yunsen Xian, Chuantao Yin, Wenge Rong, and Zhang Xiong. 2023a. Knowledge-driven cot: Exploring faithful reasoning in llms for knowledge-intensive question answering. arXiv preprint arXiv:2308.13259.
- Wang et al. (2023b) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2023b. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
- Wang et al. (2024) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. 2024. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations.
- Wei et al. (2022a) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022a. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
- Wei et al. (2022b) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022b. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
- Yang et al. (2024) An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, Guanting Dong, Haoran Wei, Huan Lin, Jialong Tang, Jialin Wang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Ma, Jin Xu, Jingren Zhou, Jinze Bai, Jinzheng He, Junyang Lin, Kai Dang, Keming Lu, Keqin Chen, Kexin Yang, Mei Li, Mingfeng Xue, Na Ni, Pei Zhang, Peng Wang, Ru Peng, Rui Men, Ruize Gao, Runji Lin, Shijie Wang, Shuai Bai, Sinan Tan, Tianhang Zhu, Tianhao Li, Tianyu Liu, Wenbin Ge, Xiaodong Deng, Xiaohuan Zhou, Xingzhang Ren, Xinyu Zhang, Xipin Wei, Xuancheng Ren, Yang Fan, Yang Yao, Yichang Zhang, Yu Wan, Yunfei Chu, Yuqiong Liu, Zeyu Cui, Zhenru Zhang, and Zhihao Fan. 2024. Qwen2 technical report. arXiv preprint arXiv:2407.10671.
- Yao et al. (2023) Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of thoughts: Deliberate problem solving with large language models. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
- Yih et al. (2016) Wen-tau Yih, Matthew Richardson, Christopher Meek, Ming-Wei Chang, and Jina Suh. 2016. The value of semantic parse labeling for knowledge base question answering. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 2: Short Papers. The Association for Computer Linguistics.
- Yu et al. (2022) Ping Yu, Tianlu Wang, Olga Golovneva, Badr AlKhamissy, Gargi Ghosh, Mona T. Diab, and Asli Celikyilmaz. 2022. ALERT: adapting language models to reasoning tasks. CoRR, abs/2212.08286.
- Zhang et al. (2022) Jing Zhang, Xiaokang Zhang, Jifan Yu, Jian Tang, Jie Tang, Cuiping Li, and Hong Chen. 2022. Subgraph retrieval enhanced model for multi-hop knowledge base question answering. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5773–5784.
- Zhong et al. (2024) Tianyang Zhong, Zhengliang Liu, Yi Pan, Yutong Zhang, Yifan Zhou, Shizhe Liang, Zihao Wu, Yanjun Lyu, Peng Shu, Xiaowei Yu, Chao Cao, Hanqi Jiang, Hanxu Chen, Yiwei Li, Junhao Chen, Huawen Hu, Yihen Liu, Huaqin Zhao, Shaochen Xu, Haixing Dai, Lin Zhao, Ruidong Zhang, Wei Zhao, Zhenyuan Yang, Jingyuan Chen, Peilong Wang, Wei Ruan, Hui Wang, Huan Zhao, Jing Zhang, Yiming Ren, Shihuan Qin, Tong Chen, Jiaxi Li, Arif Hassan Zidan, Afrar Jahin, Minheng Chen, Sichen Xia, Jason Holmes, Yan Zhuang, Jiaqi Wang, Bochen Xu, Weiran Xia, Jichao Yu, Kaibo Tang, Yaxuan Yang, Bolun Sun, Tao Yang, Guoyu Lu, Xianqiao Wang, Lilong Chai, He Li, Jin Lu, Lichao Sun, Xin Zhang, Bao Ge, Xintao Hu, Lian Zhang, Hua Zhou, Lu Zhang, Shu Zhang, Ninghao Liu, Bei Jiang, Linglong Kong, Zhen Xiang, Yudan Ren, Jun Liu, Xi Jiang, Yu Bao, Wei Zhang, Xiang Li, Gang Li, Wei Liu, Dinggang Shen, Andrea Sikora, Xiaoming Zhai, Dajiang Zhu, and Tianming Liu. 2024. Evaluation of openai o1: Opportunities and challenges of AGI. CoRR, abs/2409.18486.
Appendix A Rationale and Details of the EM Algorithm
We provide the motivation and the details of applying the EM algorithm to optimize our framework.
A.1 Overview
We have two modules:
- ReAligner, parameterized by $\psi$ , which generates Graph-aware Reasoning Chains $z$ given $(\mathcal{G},q)$ .
- Responser, parameterized by $w$ , which predicts the final answer $a$ given the question $q$ and candidate Graph-aware Reasoning Chain $z$ .
Given a training set of triples $\{(\mathcal{G},q,a)\}$ , our objective is to maximize:
$$
\mathcal{O}(w,\psi)\;=\;\sum_{(\mathcal{G},q,a)}\log\Bigl{(}\mathbb{E}_{z\sim p_{\psi}(z\mid\mathcal{G},q)}\bigl{[}p_{w}(a\mid q,z)\bigr{]}\Bigr{)}.
$$
Because exact marginalization over $z$ can be expensive, we employ an EM-style approach to iteratively refine both modules.
A.2 Rationale
The selection of the EM algorithm is fundamentally motivated by the central challenge and objective of RAR: generating latent natural language (NL) reasoning steps for KGQA.
A.2.1 Core Challenge: Generating Latent NL Reasoning
- RAR aims to produce complex, human-like NL reasoning chains: the intermediate "thought process" connecting a question to its answer using a knowledge graph (KG).
- Crucially, these detailed reasoning steps are not available in standard KGQA training datasets. They constitute latent variables within our model.
A.2.2 Inadequacy of Direct Supervision Methods
- Standard SFT, as applied in RoG (Luo et al., 2024a), often requires labels for relation sequences that are absent from the original training data. To obtain them, the process frequently relies on heuristics that create pseudo-gold labels, such as taking the shortest KG path between the query and answer entities, which can be noisy.
- This heuristic-based supervision is insufficient for training models to generate the complex, multi-step, logically nuanced NL reasoning that RAR targets, especially when multiple constraints are involved.
A.2.3 EM as the Principled Approach for Latent Variables
- EM is the standard, principled approach for parameter estimation in models with latent variables.
- It provides a formal framework to optimize the Reasoner (generating the latent NL chain) and the Aligner (grounding the chain to the KG) without requiring explicit supervision for the intermediate NL reasoning steps.
- Optimization is guided indirectly by maximizing the likelihood of observing the correct final answer, conditioned on the feasibility of the generated reasoning chain being aligned to the KG.
A.2.4 Necessity of Iterative Refinement
- Generating coherent, long-form NL reasoning is challenging. Initial attempts, especially early in training, are likely to be imperfect or logically flawed.
- The iterative nature of the EM algorithm is well suited to this progressive refinement. The E-step identifies the most likely or "best" latent reasoning chains produced by the current model that successfully link the question to the correct answer via a feasible KG alignment; this step essentially evaluates the current reasoning quality based on outcomes. The M-step updates the parameters of the Reasoner and Aligner by training them on the high-quality reasoning chains identified in the E-step, so that the models generate more chains similar to the successful ones.
- This iterative E-M loop allows the system to gradually improve the quality, logical coherence, and KG-alignability of the generated latent reasoning, as demonstrated qualitatively in Fig. 4.
A.2.5 Connections to Reinforcement Learning
Connection to Implicit RL. The EM algorithm, as applied in RAR, can be viewed as a form of implicit Reinforcement Learning:
- The E-step acts like a selection or filtering mechanism based on the quality of the reasoning chain, implicitly assigning a high reward (e.g., 1) to successful chains (reaching the correct answer via KG alignment) and a low reward (e.g., 0) to unsuccessful ones.
- The M-step, particularly the Reasoner update (maximizing $\log p(\hat{z}_{I}\mid q)$ for the selected high-quality chains $\hat{z}_{I}$), mathematically resembles a policy-gradient update ($\mathbb{E}[R\cdot\nabla\log p(z\mid q)]\approx R\cdot\nabla\log p(\hat{z}\mid q)$), where $R$ is effectively this implicit binary reward.
Thus, EM reinforces the generation of "good" reasoning chains without the need for explicit reward engineering.
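To make this correspondence concrete, the selected-chain update can be written as a REINFORCE-style gradient with the implicit binary reward (a sketch using the notation above; $R(z)\in\{0,1\}$ indicates whether chain $z$ reaches the correct answer via a feasible KG alignment):
$$
\nabla_{\psi}\,\mathbb{E}_{z\sim p_{\psi}(z\mid q)}\bigl[R(z)\bigr]=\mathbb{E}_{z\sim p_{\psi}(z\mid q)}\bigl[R(z)\,\nabla_{\psi}\log p_{\psi}(z\mid q)\bigr]\;\approx\;\sum_{z\in z_{I}}\nabla_{\psi}\log p_{\psi}(z\mid q),
$$
where the Monte Carlo approximation keeps, up to a constant factor, only the sampled chains with $R(z)=1$, i.e., exactly the M-step gradient on the selected set $z_{I}$.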
Why EM Was Preferred Over Explicit Reinforcement Learning. While the EM process here shares similarities with RL (as discussed above), we opted for EM over explicit RL formulations (such as PPO) for several practical reasons:
- Reward Function Design: Crafting a good reward function ($R$) that accurately captures the quality of multi-step NL reasoning is non-trivial.
- Training Complexity and Cost: Explicit RL methods often lead to higher computational costs and potentially unstable training.
- Efficiency and Simplicity: EM, derived naturally from the maximum likelihood objective for latent variable models, offers a more direct, mathematically grounded, and often simpler optimization pathway for our specific problem structure.
A.3 Details of EM Algorithm
A.3.1 Step 1: Updating Responser
1. Sample Candidate Graph-aware Reasoning Chains. For each training example $(\mathcal{G},q,a)$ , sample $K$ Graph-aware Reasoning Chains:
$$
z_{k}\;\sim\;p_{\psi}(z\mid\mathcal{G},q),\quad k=1,\dots,K.
$$
Let $\hat{z}=\{\,z_{1},z_{2},...,z_{K}\}$ .
2. Approximate the Objective for $w$. The term
$$
\log\mathbb{E}_{z\sim p_{\psi}(z\mid\mathcal{G},q)}\bigl{[}p_{w}(a\mid q,z)\bigr{]}
$$
is approximated by
$$
\log\biggl{(}\tfrac{1}{K}\sum_{k=1}^{K}p_{w}(a\mid q,z_{k})\biggr{)}.
$$
We then take gradients (w.r.t. $w$ ) and update $w$ so that $p_{w}(a\mid q,z)$ is more likely to produce the correct $a$ for the sampled Graph-aware Reasoning Chains.
3. Result. After updating $w$, Responser $p_{w}(a\mid q,z)$ is better aligned with whatever Graph-aware Reasoning Chains $p_{\psi}$ currently emits.
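As a minimal illustration, the Monte Carlo approximation used in this step can be sketched as follows (a toy sketch: in practice each $p_{w}(a\mid q,z_{k})$ comes from the Responser LLM, while here it is simply a list of probabilities):

```python
import math

def responser_objective(answer_probs):
    """Approximate log E_z[p_w(a | q, z)] with K sampled chains:
    log((1/K) * sum_k p_w(a | q, z_k)).

    answer_probs: the value p_w(a | q, z_k) for each sampled chain z_k.
    """
    k = len(answer_probs)
    return math.log(sum(answer_probs) / k)
```

Gradients of this quantity with respect to $w$ drive the Responser update.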
A.3.2 Step 2: EM-Style Update for ReAligner
After $w$ is updated, we refine the ReAligner $p_{\psi}(z\mid\mathcal{G},q)$ . In EM terms, we view $z$ as a latent variable:
E-Step (Posterior Inference)
- Compute / Re-rank Graph-aware Reasoning Chains. Re-sample or re-rank the $K$ Graph-aware Reasoning Chains using the updated $p_{w}$. We want Graph-aware Reasoning Chains that are "most aligned" with the correct answer $a$. Formally:
$$
p_{w,\psi}(z\mid\mathcal{G},q,a)\;\propto\;p_{w}(a\mid q,z)\,p_{\psi}(z\mid\mathcal{G},q).
$$
- Scoring. For a single Graph-aware Reasoning Chain $z$ , define
$$
S(z)\;=\;\log p_{w}(a\mid q,z)\;+\;\log p_{\psi}(z\mid\mathcal{G},q).
$$
- Selecting High-Quality Graph-aware Reasoning Chains. Rank (or sample) Graph-aware Reasoning Chains by $S(z)$ and select the top set $z_{I}=\{\text{top-}K\text{ Graph-aware Reasoning Chains}\}$ .
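The scoring and selection above can be sketched in a few lines (a sketch under the assumption that each candidate carries its two log-probabilities, which in practice come from the Responser and ReAligner models):

```python
def select_top_chains(candidates, top_k):
    """E-step selection: rank candidates by
    S(z) = log p_w(a | q, z) + log p_psi(z | G, q)
    and keep the top-K as the high-quality set z_I.

    candidates: list of (chain, log_p_answer, log_p_chain).
    """
    ranked = sorted(candidates, key=lambda c: c[1] + c[2], reverse=True)
    return [chain for chain, _, _ in ranked[:top_k]]
```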
M-Step (Update $\psi$ )
- Treat $z_{I}$ (the selected high-quality Graph-aware Reasoning Chains) as if they were observed.
- Update $\psi$ by maximizing:
$$
\log p_{\psi}(z_{I}\mid\mathcal{G},q)\;=\;\sum_{z\,\in\,z_{I}}\log p_{\psi}(z\mid\mathcal{G},q)
$$
- In practice, this amounts to standard fine-tuning (e.g., instruction tuning or teacher forcing) of $p_{\psi}$ on the newly identified high-quality Graph-aware Reasoning Chains.
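As a toy analogue of this M-step, suppose (purely for illustration) that $p_{\psi}$ were a simple categorical distribution over whole chains rather than an LLM. Maximizing $\sum_{z\in z_{I}}\log p_{\psi}(z)$ then yields the empirical frequencies of the selected chains; with a real LLM, the corresponding operation is teacher-forced fine-tuning:

```python
from collections import Counter

def m_step_categorical(selected_chains):
    """MLE for a categorical p_psi over whole chains: the maximizer of
    sum_{z in z_I} log p_psi(z) is the empirical distribution of z_I."""
    counts = Counter(selected_chains)
    total = sum(counts.values())
    return {chain: n / total for chain, n in counts.items()}
```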
A.3.3 Complete Iteration
A single iteration of our EM-style algorithm proceeds as follows:
1. (Update $w$ ): For each sample $(\mathcal{G},q,a)$ , draw $K$ Graph-aware Reasoning Chains from $p_{\psi}$ , then update $w$ by maximizing
$$
\log\left(\tfrac{1}{K}\sum_{k=1}^{K}p_{w}(a\mid q,z_{k})\right).
$$
2. (E-Step): Using the updated $w$, compute
$$
p_{w,\psi}(z\mid\mathcal{G},q,a)\;\propto\;p_{w}(a\mid q,z)\,p_{\psi}(z\mid\mathcal{G},q).
$$
Select high-quality Graph-aware Reasoning Chains $z_{I}$ from the candidates.
3. (M-Step): Update $\psi$ by maximizing $\log p_{\psi}(z_{I}\mid\mathcal{G},q)$, i.e., fine-tune $p_{\psi}$ so that it is more likely to emit $z_{I}$ in the future.
This loop can be repeated until convergence or for a fixed number of epochs.
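Putting the three steps together, one iteration can be outlined as below (a structural sketch; `sample_chains`, `update_responser`, `score`, and `update_realigner` are hypothetical callables standing in for the LLM-based components):

```python
def em_iteration(examples, sample_chains, update_responser,
                 score, update_realigner, k=5, top_k=2):
    """One EM-style iteration over (G, q, a) training triples."""
    for graph, q, a in examples:
        # Step 1: draw K chains z_k ~ p_psi and update the Responser on them.
        chains = sample_chains(graph, q, k)
        update_responser(graph, q, a, chains)
        # E-step: rank chains by S(z) = log p_w(a|q,z) + log p_psi(z|G,q).
        ranked = sorted(chains, key=lambda z: score(graph, q, a, z), reverse=True)
        # M-step: fine-tune p_psi on the selected high-quality chains z_I.
        update_realigner(graph, q, ranked[:top_k])
```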
A.3.4 Practical Variations
1. Top-$K$ vs. Full Posterior. Instead of summing/sampling over all subsets, it is simpler to pick the top-$K$ Graph-aware Reasoning Chains by $S(\cdot)$.
2. Skipping Responser Optimization. To further improve efficiency, we can skip optimizing Responser. LLMs often possess strong zero-shot summarization or question-answering capabilities, which means they can produce high-quality answers from given Graph-aware Reasoning Chains without additional training. As a result, we can treat an LLM as a pre-optimized Responser and focus solely on updating ReAligner, thereby reducing overall computation.
Appendix B More Related Work
B.1 Comparison with Agent Exploration Methods
To situate RAR within the KGQA landscape, we first contrast it with representative agent exploration methods such as ToG (Sun et al., 2024). Although both paradigms comprise stages that can be informally mapped to reasoning and grounding, their internal principles diverge markedly.
Training methodology and optimization.
RAR is trained with a mathematically grounded expectation-maximization (EM) procedure that explicitly and stably refines two complementary capabilities: the NL Reasoner and the KG-aware Aligner. By contrast, many agent methods rely more heavily on prompt or workflow engineering (Cheng et al., 2024; Gu et al., 2024; Huang et al., 2024; Li et al., 2023; Nie et al., 2024); they seldom perform task-specific optimization that directly targets the core reasoning mechanism.
Accuracy, reliability, and complexity handling.
The synergy between NL reasoning, KG-constrained alignment, and EM-guided supervised fine-tuning translates into markedly higher accuracy and robustness for RAR (see Tab. 1). Empirically, its explicit decomposition allows it to cope well with multi-hop and conjunctive constraints that are challenging for purely prompt-driven agents.
Resource consumption.
Once training is finished, inference in RAR can be carried out by a collection of relatively small, specialized, fine-tuned models, one each for the Reasoner, Aligner, and Responser. This modularity yields the efficiency gains reported in Tab. 3. Agent systems, in contrast, often incur higher latency and cost because they perform several large-LLM calls while exploring the KG.
Taken together, these differences show that RAR is not a mere variant of the agent paradigm; its EM-centered optimization strategy and KG-constrained decoding constitute a distinct design that offers both practical efficiency and stronger empirical performance.
B.2 Comparison with Path Generation Methods
RAR also differs fundamentally from path generation methods such as RoG (Luo et al., 2024a) and GCR (Luo et al., 2024c). Where those systems directly predict a linear sequence of KG relations, ours maintains a higher-level, human-readable plan in natural language and lets the Aligner ground each step to KG triples.
Specifically, (i) the Reasoner produces a multi-step NL chain that expresses the conceptual logic; (ii) the Aligner incrementally matches every NL step to concrete triples through KG-constrained decoding; and (iii) the Responser integrates evidence from both the NL chain and the aligned KG path to craft the final answer. Training again relies on EM, iteratively improving latent NL chains that can be aligned and that ultimately yield correct answers, whereas RoG and related work usually depend on direct SFT with shortest KG paths that may be noisy supervision signals.
Illustrative example.
For the query "What did Dr. Josef Mengele do?" the two paradigms unfold differently:
- RoG. An LLM planner outputs the relation people.person.profession; a symbolic retriever then follows this edge in the KG to obtain the triple (Josef Mengele, profession, Physician), and the answer Physician is returned.
- RAR. Step 1 (a Reasoner step) proposes: "Identify the profession associated with Josef Mengele." The Aligner grounds this to the same triple as above. The Responser finally reports Physician, explicitly citing both the reasoning chain and the grounded path.
Handling complex constraints.
Because RoG represents reasoning as a single linear relation sequence, it struggles with conjunctive queries such as "presidents who were also actors"; after following profession $\rightarrow$ Actor, it cannot backtrack to verify profession $\rightarrow$ President. RAR, in contrast, naturally decomposes the query into two successive NL steps ("find presidents" $\rightarrow$ "filter those who are actors") and grounds each step separately, ensuring both constraints are satisfied.
Robustness to noisy supervision.
Shortest-path supervision can be incorrect when more semantically plausible paths exist. By letting EM discover latent NL chains that are both alignable and answer-bearing, RAR avoids this brittleness and achieves the quality gains visualized in Fig. 4.
In summary, the collaboration of an explicit NL Reasoner, a KG-constrained Aligner, and EM-based optimization endows RAR with a distinctive combination of interpretability, flexibility, and empirical strength that is not achieved by prior agent exploration or path generation methods.
Appendix C Experimental Setup
C.1 Fine-tuning Datasets and Knowledge Graph
For evaluation, we use two benchmark KGQA datasets: WebQuestionSP (WebQSP) (Yih et al., 2016) and Complex WebQuestions (CWQ) (Talmor and Berant, 2018). To ensure fair comparison, we adopt identical train and test splits as previous works (Jiang et al., 2022; Luo et al., 2024b). The detailed statistics of these datasets are presented in Tab. 6. Both WebQSP and CWQ are based on Freebase (Bollacker et al., 2008). To reduce computational overhead and memory consumption, we utilize the same subgraphs as previous work (Luo et al., 2024b). Additionally, we preprocess Freebase by converting CVT (Compound Value Type) nodes, which represent n-ary relationships, into binary relationships by concatenating edge labels with "-" as the delimiter, following Li et al. (2024).
| Dataset | #Train | #Test | #Ans = 1 | 2 ≤ #Ans ≤ 4 | 5 ≤ #Ans ≤ 9 | #Ans ≥ 10 |
| --- | --- | --- | --- | --- | --- | --- |
| WebQSP | 2,826 | 1,628 | 51.2% | 27.4% | 8.3% | 12.1% |
| CWQ | 27,639 | 3,531 | 70.6% | 19.4% | 6% | 4% |
Table 6: Statistics of datasets.
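For illustration, the CVT flattening step described above might look like the following sketch (a hypothetical helper; the relation names are examples rather than actual Freebase identifiers):

```python
def flatten_cvt(subject, in_relation, cvt_node_edges):
    """Collapse an n-ary CVT node into binary triples by joining the
    incoming and outgoing edge labels with '-' as the delimiter.

    cvt_node_edges: (out_relation, object) pairs leaving the CVT node.
    """
    return [
        (subject, f"{in_relation}-{out_relation}", obj)
        for out_relation, obj in cvt_node_edges
    ]
```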
C.2 Datasets for Cold Starting
To initialize the training of Reasoner and Aligner, we leverage high-quality Reasoning Chains and Knowledge Paths derived from SPARQL queries in the WebQSP and CWQ datasets. This approach prevents the models from generating malformed outputs during early training stages.
The SPARQL queries in these datasets represent the gold-standard Reasoning Chain that human experts would use to solve questions using the KG. We decompose these SPARQL queries according to a predefined grammar, breaking them into atomic chunks that each represent a single reasoning step. For each chunk, we query Freebase to obtain the corresponding triples, then use GPT-4o to generate natural language reasoning chains based on these retrieved triples. Through this process, we generate a dataset of 2,000 high-quality Reasoning Chains with their corresponding Knowledge Paths for each question. This dataset enables us to perform cold-start pretraining of Reasoner and Aligner, teaching them to generate well-structured, step-by-step Reasoning Chains and Knowledge Paths with appropriate special tokens, a crucial foundation for subsequent optimization using the EM algorithm.
C.3 Hyperparameters
For Responser, we adopt Llama-3.1-8B (Meta, 2024) without fine-tuning based on our preliminary experiments (detailed analysis in App. A.3.4). For both Reasoner and Aligner, we conduct extensive experiments with various lightweight LLMs ranging from 0.5B to 8B parameters (Yang et al., 2024; Touvron et al., 2023; Meta, 2024). All models share the same hyperparameter configuration: training for 3 epochs with a batch size of 4 and a learning rate of 2e-5. We employ a cosine learning rate scheduler with a warmup ratio of 0.03. The training is performed on 4 A6000 GPUs for each model variant.
Appendix D Details of KG-constrained Decoding
For efficient knowledge graph operations, we implemented a Virtuoso-based Freebase instance with distributed high-speed SPARQL querying capabilities. Our system achieves a throughput of 2,000 requests per second, enabling rapid graph traversal and neighborhood node retrieval during the constrained decoding process. This high-performance infrastructure allows us to efficiently retrieve next-hop candidates for token constraints directly from the knowledge graph.
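The token-level constraint itself can be sketched as follows (a simplified sketch over abstract token sequences; in the real system the valid serializations are derived from live SPARQL neighborhood queries rather than a precomputed list):

```python
def allowed_next_tokens(prefix, candidate_serializations):
    """KG-constrained decoding: permit only tokens that extend the
    current prefix toward some valid serialization of a next-hop triple.

    prefix: tokens generated so far for the current triple.
    candidate_serializations: token sequences of all valid next-hop triples.
    """
    n = len(prefix)
    return {
        seq[n]
        for seq in candidate_serializations
        if len(seq) > n and list(seq[:n]) == list(prefix)
    }
```

At each decoding step, the LLM's logits are masked so that only the returned token set can be sampled.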
Appendix E Case Study of Different Iteration Steps
In Fig. 7, we provide two examples to investigate the effect of iteration steps of the EM algorithm.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Table: Reasoning Chain and Knowledge Path Examples
### Overview
The image presents a table illustrating two cases of question-answering, detailing the reasoning chain and knowledge path used to arrive at the answer. Each case includes the question, the steps involved in reasoning, the knowledge path taken, and the final response, indicating whether the response is correct or incorrect.
### Components/Axes
The table is structured with the following rows:
- **Case**: Indicates the case number (Case 1, Case 2).
- **Question**: The question being asked.
- **Steps**: Number of steps taken in the reasoning chain (20, 100).
- **Reasoning Chain & Knowledge Path**: Details the steps taken to arrive at the answer, including the entities and relations used.
- **Response**: The final answer provided and whether it is correct (indicated by a checkmark) or incorrect (indicated by an "X").
The table is structured with the following columns:
- **Left Column**: Contains the labels for each row (Case, Steps, Reasoning Chain & Knowledge Path, Response).
- **Middle Column**: Shows the reasoning chain and knowledge path for the model's incorrect answer.
- **Right Column**: Shows the reasoning chain and knowledge path for the model's correct answer.
### Content Details
**Case 1:**
- **Question:** What high school did the artist who recorded "Girl Tonight" attend? Answer: Petersburg.
- **Steps:** 20 (incorrect), 100 (correct).
- **Reasoning Chain & Knowledge Path (Incorrect):**
- First, identify the artist "c" associated with the recording "Girl Tonight". `=>` `<John H. Guyer, school_type, High school>`
- Next, determine the high school attended by the artist "c". `=>` `<John H. Guyer, notable_object, High school>`
- **Reasoning Chain & Knowledge Path (Correct):**
- First, identify the artist "c" associated with the recording "Girl Tonight". `=>` `<Trey Songz, recordings, Girl Tonight>`
- Next, determine the educational institutions "x" attended by artist "c". `=>` `<Trey Songz, education_institution, Petersburg>`
- Finally, confirm which of educational institutions "x" is a high school. `=>` `<Petersburg, school_type, High school>`
- **Response:** John H. Guyer (Incorrect - ✗), Petersburg (Correct - ✓).
**Case 2:**
- **Question:** What type of Claude Debussy music appears in the film Black Tights? Answer: Ballet.
- **Steps:** 20 (incorrect), 100 (correct).
- **Reasoning Chain & Knowledge Path (Incorrect):**
- First, identify the type of music "x" associated with Claude Debussy. `=>` `<Claude Debussy, compositions, En blanc et noir>`
- Next, confirm that the type of music "x" is featured in the film Black Tights. `=>` `<En blanc et noir, composer, Claude Debussy>`
- **Reasoning Chain & Knowledge Path (Correct):**
- First, identify the type of music "x" associated with Claude Debussy. `=>` `<Claude Debussy, genre, Ballet>`
- Next, determine whether the type of music "x" appears in the film Black Tights. `=>` `<Ballet, films_in_this_genre, Black Tights>`
- **Response:** En blanc et noir (Incorrect - ✗), Ballet (Correct - ✓).
### Key Observations
- The table presents two distinct cases where the model initially provides an incorrect answer but eventually arrives at the correct answer through a more detailed reasoning chain.
- The "Steps" row indicates the number of steps taken in the reasoning chain. The correct answers consistently involve a higher number of steps (100) compared to the incorrect answers (20).
- The knowledge paths are represented as triplets within angle brackets, indicating entities and their relationships.
### Interpretation
The data suggests that a more thorough and multi-step reasoning process, as indicated by the higher "Steps" value, leads to more accurate answers. The incorrect answers are likely due to a superficial or incomplete analysis of the question and available knowledge. The knowledge paths demonstrate how the model connects different entities and relationships to arrive at a conclusion. The examples highlight the importance of considering multiple factors and relationships to answer questions accurately.
</details>
Figure 7: Examples of Reasoning Chains and Knowledge Paths generated by RAR under different iteration steps.
Appendix F Templates and Prompts
In this section, we illustrate all the prompts used in the experiments.
Reasoning Chain Template. The template of Reasoning Chains generated by Reasoner is shown in Fig. 8, where each $s_{i}$ is a discrete reasoning step in natural language.
Reasoning Chain Template
<THINK> $s_{1}s_{2}s_{3}... s_{n}$ </THINK>
Figure 8: The template for the Reasoning Chain generated by Reasoner.
Knowledge Path Template. The template of the Knowledge Path generated by Aligner is shown in Fig. 9, where $e$ and $r$ denote the entities and relations from the KG.
Knowledge Path Template
<ALIGN><TRIPLE> <|> $e_{1}$ <|> $r_{1}$ <|> $e_{1}^{\prime}$ </TRIPLE> … <TRIPLE> <|> $e_{n}$ <|> $r_{n}$ <|> $e_{n}^{\prime}$ </TRIPLE></ALIGN>
Figure 9: The template of Knowledge Paths generated by Aligner.
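One plausible rendering of this template in code (a sketch; the exact whitespace and tokenization of the special tokens are assumptions for illustration):

```python
def format_knowledge_path(triples):
    """Serialize aligned (head, relation, tail) triples in the
    Knowledge Path template of Fig. 9."""
    body = " ".join(
        f"<TRIPLE> <|> {head} <|> {rel} <|> {tail} </TRIPLE>"
        for head, rel, tail in triples
    )
    return f"<ALIGN>{body}</ALIGN>"
```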
ReAligner Prompt. The prompt for instructing ReAligner is shown in Fig. 10, where the task is to generate the Reasoning Chain and Knowledge Path given the question and the KG.
ReAligner Prompt
============================= Prompt Input ================================ Generate a step-by-step thinking process for the given question. Ensure the thinking process is aligned with triples in the knowledge base. Question: <Question> Query entities: <Question Entities> ============================= LLM Output ================================ Thinking process: <Reasoning Chain> Align Process: <Knowledge Path>
Figure 10: The prompt template for ReAligner.
Responser Prompt. The prompt for instructing Responser is shown in Fig. 11, where the task is to generate the final answer based on the given question and the generated Reasoning Chain and Knowledge Path.
Responser Prompt
============================= Prompt Input ================================ Generate a step-by-step thinking process for the given question. Ensure the thinking process is aligned with triples in the knowledge base. Question: <Question> Query entities: <Question Entities> Thinking process: <Reasoning Chain> Align Process: <Knowledge Path> Summarization: ============================= LLM Output ================================ <Answer>
Figure 11: The prompt template for Responser.
LLM-driven Consolidation Prompt. The prompt for LLM-driven Consolidation is shown in Fig. 12. We use RAR to generate $K$ Knowledge Paths and hypothesis answers for each question. The Knowledge Paths and hypothesis answers are provided to general LLMs to answer the questions without fine-tuning.
LLM-driven Consolidation Prompt
============================= Prompt Input ================================ Relevant triples: <Knowledge Path 1>. Therefore, a possible answer could be: <Hypothesis Answer 1> $...$ <Knowledge Path K>. Therefore, a possible answer could be: <Hypothesis Answer K> Question: <Question> Based on the reasoning paths, please answer the given question. Please keep the answer as simple as possible and only return answers. Please return each answer in a new line. ============================= LLM Output ================================ <Answer 1> <Answer 2> $...$
Figure 12: The prompt template for LLM-driven Consolidation.