# Graph-Augmented Reasoning: Evolving Step-by-Step Knowledge Graph Retrieval for LLM Reasoning
**Authors**: Wenjie Wu, Yongcheng Jing, Yingjie Wang, Wenbin Hu, Dacheng Tao
> Wenjie Wu and Wenbin Hu are with Wuhan University (Email: hwb@whu.edu.cn).
> Yongcheng Jing, Yingjie Wang, and Dacheng Tao are with Nanyang Technological University (Email: dacheng.tao@ntu.edu.sg).
Abstract
Recent large language model (LLM) reasoning, despite its success, suffers from limited domain knowledge, susceptibility to hallucinations, and constrained reasoning depth, particularly in small-scale models deployed in resource-constrained environments. This paper presents the first investigation into integrating step-wise knowledge graph retrieval with step-wise reasoning to address these challenges, introducing a novel paradigm termed graph-augmented reasoning. Our goal is to enable frozen, small-scale LLMs to retrieve and process relevant mathematical knowledge in a step-wise manner, enhancing their problem-solving abilities without additional training. To this end, we propose KG-RAR, a framework centered on process-oriented knowledge graph construction, a hierarchical retrieval strategy, and a universal post-retrieval processing and reward model (PRP-RM) that refines retrieved information and evaluates each reasoning step. Experiments on the Math500 and GSM8K benchmarks across six models demonstrate that KG-RAR yields encouraging results, achieving a 20.73% relative improvement with Llama-3B on Math500.
Index Terms: Large Language Model, Knowledge Graph, Reasoning
1 Introduction
Enhancing the reasoning capabilities of large language models (LLMs) continues to be a major challenge [1, 2, 3]. Conventional methods, such as chain-of-thought (CoT) prompting [4], improve inference by encouraging step-by-step articulation [5, 6, 7, 8, 9, 10], while external tool use and domain-specific fine-tuning further improve performance on specific tasks [11, 12, 13, 14, 15, 16]. Most recently, o1-like multi-step reasoning has emerged as a paradigm shift [17, 18, 19, 20, 21, 22], leveraging test-time compute strategies [5, 23, 24, 25, 26], exemplified by reasoning models like GPT-o1 [27] and DeepSeek-R1 [28]. These approaches, including Best-of-N [29] and Monte Carlo Tree Search [23], allocate additional computational resources during inference to dynamically refine reasoning paths [30, 31, 29, 25].
<details>
<summary>x1.png Details</summary>

### Visual Description
Flow diagram comparing three reasoning methods as vertical pipelines from "Problem" to "Answer": Chain of Thought (direct CoT prompting through Step1 and Step2); Traditional RAG (external "Docs" feed a "CoT + RAG" stage before the same steps); and Step-by-Step KG-RAR (a "KG" feeds "CoT + KG-RAR", with a retrieved "Sub-KG" and a "KG-RAR of Step1" feedback loop after each step, plus retrieval loops back to the KG). Lightning-bolt icons on the first two columns suggest direct, quick inference; the hourglass and feedback loops in KG-RAR indicate a more deliberate, iterative refinement against the knowledge graph.
</details>
Figure 1: Illustration of the proposed step-by-step knowledge graph retrieval for o1-like reasoning, which dynamically retrieves and utilises structured sub-graphs (Sub-KGs) during reasoning. Our approach iteratively refines the reasoning process by retrieving relevant Sub-KGs at each step, enhancing accuracy, consistency, and reasoning depth for complex tasks, thereby offering a novel form of scaling test-time computation.
Despite encouraging advancements in o1-like reasoning, LLMs—particularly smaller and less powerful variants—continue to struggle with complex reasoning tasks in mathematics and science [2, 3, 32, 33]. These challenges arise from insufficient domain knowledge, susceptibility to hallucinations, and constrained reasoning depth [34, 35, 36]. Given the novelty of o1-like reasoning, effective solutions to these issues remain largely unexplored, with few studies addressing this gap in the literature [17, 18, 22]. One potential solution from the pre-o1 era is retrieval-augmented generation (RAG), which has been shown to mitigate hallucinations and factual inaccuracies by retrieving relevant information from external knowledge sources (Fig. 1, the 2nd column) [37, 38, 39]. However, in the context of o1-like multi-step reasoning, traditional RAG faces two significant challenges:
- (1) Step-wise hallucinations: LLMs may hallucinate during intermediate steps—a problem not addressed by applying RAG solely to the initial prompt [40, 41];
- (2) Missing structured relationships: Traditional RAG may retrieve information that lacks the structured relationships necessary for complex reasoning tasks, leading to inadequate augmentation that fails to capture the depth required for accurate reasoning [39, 42].
In this paper, we strive to address both challenges by introducing a novel graph-augmented multi-step reasoning scheme to enhance LLMs’ o1-like reasoning capability, as depicted in Fig. 1. Our idea is motivated by the recent success of knowledge graphs (KGs) in knowledge-based question answering and fact-checking [43, 44, 45, 46, 47, 48, 49]. Recent advances have demonstrated the effectiveness of KGs in augmenting prompts with retrieved knowledge or enabling LLMs to query KGs for factual information [50, 51]. However, little attention has been given to improving step-by-step reasoning for complex tasks with KGs [52, 53], such as mathematical reasoning, which requires iterative logical inference rather than simple knowledge retrieval.
To fill this gap, the objective of the proposed graph-augmented reasoning paradigm is to integrate structured KG retrieval into the reasoning process in a step-by-step manner, providing contextually relevant information at each reasoning step to refine reasoning paths and mitigate step-wise inaccuracies and hallucinations, thereby addressing both aforementioned challenges simultaneously. This approach operates without additional training, making it particularly well-suited for small-scale LLMs in resource-constrained environments. Moreover, it extends test-time compute by incorporating external knowledge into the reasoning context, transitioning from direct CoT to step-wise guided retrieval and reasoning.
Nevertheless, implementing this graph-augmented reasoning paradigm raises several key issues: (1) Frozen LLMs struggle to query KGs effectively [50], necessitating a dynamic integration strategy for iterative incorporation of graph-based knowledge; (2) Existing KGs primarily encode static facts rather than the procedural knowledge required for multi-step reasoning [54, 55], highlighting the need for process-oriented KGs; (3) Reward models, which are essential for validating reasoning steps [30, 29], often require costly fine-tuning and suffer from poor generalization [56], underscoring the need for a universal, training-free scoring mechanism tailored to KGs.
To address these issues, we propose KG-RAR, a step-by-step knowledge graph based retrieval-augmented reasoning framework that retrieves, refines, and reasons using structured knowledge graphs in a step-wise manner. Specifically, to enable effective KG querying, we design a hierarchical retrieval strategy in which questions and reasoning steps are progressively matched to relevant subgraphs, dynamically narrowing the search space. Also, we present a process-oriented math knowledge graph (MKG) construction method that encodes step-by-step procedural knowledge, ensuring that LLMs retrieve and apply structured reasoning sequences rather than static facts. Furthermore, we introduce the post-retrieval processing and reward model (PRP-RM)—a training-free scoring mechanism that refines retrieved knowledge before reasoning and evaluates step correctness in real time. By integrating structured retrieval with test-time computation, our approach mitigates reasoning inconsistencies, reduces hallucinations, and enhances stepwise verification—all without additional training.
In sum, our contribution is the first attempt to dynamically integrate step-by-step KG retrieval into an o1-like multi-step reasoning process. This is achieved through our proposed hierarchical retrieval, process-oriented graph construction method, and PRP-RM—a training-free scoring mechanism that ensures retrieval relevance and step correctness. Experiments on Math500 and GSM8K validate the effectiveness of our approach across six smaller models from the Llama3 and Qwen2.5 series. The best-performing model, Llama-3B on Math500, achieves a 20.73% relative improvement over CoT prompting, followed by Llama-8B on Math500 with a 15.22% relative gain and Llama-8B on GSM8K with an 8.68% improvement.
2 Related Work
<details>
<summary>x2.png Details</summary>

### Visual Description
Flowchart for the example question "A factory produces items in batches of 35. If today is the 1234th batch, what is the remainder?". The left "Retrieving" side explores the KG: concepts such as Congruence, Modular Arithmetic, and Number Theory link to similar solved problems (e.g., remainders of large integers divided by 4, 36, or 10) and to procedures such as the division algorithm, last-digit properties, and congruence checks, with an orange node marking a path where the solution fails the congruence condition. The right "CoT" side alternates "Refining" retrievals (e.g., "The question can be solved using modular arithmetic...") with "Reasoning" steps (Step1: 1234 ÷ 35 = 35.2571; Step2: 1234 − 35 × 35 = 9), illustrating how conceptual retrieval and procedural execution combine in problem-solving.
</details>
Figure 2: Example of Step-by-Step KG-RAR’s iterative process: 1) Retrieving: For a given question or intermediate reasoning step, the KG is retrieved to find the most similar problem or procedure (underlined in the figure) and extract its subgraph as the raw retrieval. 2) Refining: A frozen LLM processes the raw retrieval to generate a refined and targeted context for reasoning. 3) Reasoning: Using the refined retrieval, another LLM reflects on previous steps and generates next intermediate reasoning steps. This iterative workflow refines and guides the reasoning path to problem-solving.
2.1 LLM Reasoning
Large Language Models (LLMs) have advanced in structured reasoning through techniques like Chain-of-Thought (CoT) [4], Self-Consistency [5], and Tree-of-Thought [10], improving inference by generating intermediate steps rather than relying on greedy decoding [6, 7, 8, 57, 58, 9]. Recently, GPT-o1-like reasoning has emerged as a paradigm shift [17, 18, 25, 22], leveraging Test-Time Compute strategies such as Best-of-N [29], Beam Search [5], and Monte Carlo Tree Search [23], often integrated with reward models to refine reasoning paths dynamically [30, 31, 29, 59, 60, 61]. Reasoning models like DeepSeek-R1 exemplify this trend by iteratively searching, verifying, and refining solutions, significantly enhancing inference accuracy and robustness [27, 28]. However, these methods remain computationally expensive and challenging for small-scale LLMs, which struggle with hallucinations and inconsistencies due to limited reasoning capacity and lack of domain knowledge [32, 34, 3].
2.2 Knowledge Graphs Enhanced LLM Reasoning
Knowledge Graphs (KGs) are structured repositories of interconnected entities and relationships, offering efficient graph-based knowledge representation and retrieval [62, 63, 64, 65, 54]. Prior work integrating KGs with LLMs has primarily focused on knowledge-based reasoning tasks such as knowledge-based question answering [43, 44, 66, 67], fact-checking [45, 46, 68], and entity-centric reasoning [69, 47, 70, 48, 49]. However, in these tasks, "reasoning" is predominantly limited to identifying and retrieving static knowledge rather than performing iterative, multi-step logical computations [71, 72, 73]. In contrast, our work integrates KGs with LLMs for o1-like reasoning in domains such as mathematics, where solving problems demands dynamic, step-by-step inference rather than static knowledge retrieval.
2.3 Reward Models
Reward models are essential across various domains such as computer vision [74, 75]. Notably, they play a crucial role in aligning LLM outputs with human preferences by evaluating accuracy, relevance, and logical consistency [76, 77, 78]. Fine-tuned reward models, including Outcome Reward Models (ORMs) [30] and Process Reward Models (PRMs) [31, 29, 60], improve validation accuracy but come at a high training cost and often lack generalization across diverse tasks [56]. Generative reward models [61] further enhance performance by integrating CoT reasoning into reward assessments, leveraging Test-Time Compute to refine evaluation. However, the reliance on fine-tuning makes these models resource-intensive and limits adaptability [56]. This underscores the need for universal, training-free scoring mechanisms that maintain robust performance while ensuring computational efficiency across various reasoning domains.
3 Pre-analysis
3.1 Motivation and Problem Definition
Motivation. LLMs have demonstrated remarkable capabilities across various domains [79, 4, 80, 81, 82], yet their proficiency in complex reasoning tasks remains limited [3, 83, 84]. Challenges such as hallucinations [34, 85], inaccuracies, and difficulties in handling complex, multi-step reasoning due to insufficient reasoning depth are particularly evident in smaller models or resource-constrained environments [86, 87, 88]. Moreover, traditional reward models, including ORMs [30] and PRMs [31, 29], require extensive fine-tuning, incurring significant computational costs for dataset collection, GPU usage, and prolonged training time [89, 90, 91]. Despite these efforts, fine-tuned reward models often suffer from poor generalization, restricting their effectiveness across diverse reasoning tasks [56].
To simultaneously overcome these challenges, this paper introduces a novel paradigm tailored for o1-like multi-step reasoning:
Remark 3.1 (Graph-Augmented Multi-Step Reasoning). The goal of graph-augmented reasoning is to enhance the step-by-step reasoning ability of frozen LLMs by integrating external knowledge graphs (KGs), eliminating the need for additional fine-tuning.
The proposed graph-augmented scheme aims to offer the following unique advantages:
- Improving Multi-Step Reasoning: Enhances reasoning capabilities, particularly for small-scale LLMs in resource-constrained environments;
- Scaling Test-Time Compute: Introduces a novel dimension of scaling test-time compute through dynamic integration of external knowledge;
- Transferability Across Reasoning Tasks: By leveraging domain-specific KGs, the framework can be easily adapted to various reasoning tasks, enabling transferability across different domains.
3.2 Challenge Analysis
However, implementing the proposed graph-augmented reasoning paradigm presents several critical challenges:
- Effective Integration: How can KGs be efficiently integrated with LLMs to support step-by-step reasoning without requiring model modifications? Frozen LLMs cannot directly query KGs effectively [50]. Additionally, since LLMs may suffer from hallucinations and inaccuracies during intermediate reasoning steps [34, 85, 41], it is crucial to dynamically integrate KGs at each step rather than relying solely on static knowledge retrieved at the initial stage;
- Knowledge Graph Construction: How can we design and construct process-oriented KGs tailored for LLM-driven multi-step reasoning? Existing KGs predominantly store static knowledge rather than the procedural and logical information required for reasoning [54, 55, 92]. A well-structured KG that represents reasoning steps, dependencies, and logical flows is necessary to support iterative reasoning;
- Universal Scoring Mechanism: How can we develop a training-free reward mechanism capable of universally evaluating reasoning paths across diverse tasks without domain-specific fine-tuning? Current approaches depend on fine-tuned reward models, which are computationally expensive and lack adaptability [56]. A universal, training-free scoring mechanism leveraging frozen LLMs is essential for scalable and efficient reasoning evaluation.
To address these challenges and unlock the full potential of graph-augmented reasoning, we propose a Step-by-Step Knowledge Graph based Retrieval-Augmented Reasoning (KG-RAR) framework, accompanied by a dedicated Post-Retrieval Processing and Reward Model (PRP-RM), which will be elaborated in the following section.
4 Proposed Approach
4.1 Overview
Our objective is to integrate KGs for o1-like reasoning with frozen, small-scale LLMs in a training-free and universal manner. This is achieved by integrating a step-by-step knowledge graph based retrieval-augmented reasoning (KG-RAR) module within a structured, iterative reasoning framework. As shown in Figure 2, the iterative process comprises three core phases: retrieving, refining, and reasoning.
4.2 Process-Oriented Math Knowledge Graph
<details>
<summary>x3.png Details</summary>

### Visual Description
Diagram of math knowledge graph generation: problems and rated solution steps from process supervision datasets (correct steps marked with positive emojis, an incorrect Step3 with a negative one) are fed to a frozen LLM (depicted with llama and snowflake icons). The LLM emits the graph components on the right: Branch and SubField, Problem and Problem Type, Procedure1 and Procedure2 for the correct steps, and Error3 for the incorrect step, each procedure or error linked to its related Knowledge node. Box colors distinguish node types (yellow for taxonomy, green for problems, teal for procedures, light blue for knowledge, orange for errors).
</details>
Figure 3: Pipeline for constructing the process-oriented math knowledge graph from process supervision datasets.
To support o1-like multi-step reasoning, we construct a Mathematical Knowledge Graph tailored for multi-step logical inference. Public process supervision datasets, such as PRM800K [29], provide structured problem-solving steps annotated with human ratings. Each sample is decomposed into the following structured components: branch, subfield, problem type, problem, procedures, errors, and related knowledge.
The Knowledge Graph is formally defined as: $G=(V,E)$ , where $V$ represents nodes—including problems, procedures, errors, and mathematical knowledge—and $E$ represents edges encoding their relationships (e.g., "derived from," "related to").
As shown in Figure 3, for a given problem $P$ with solutions $S_{1},S_{2},...,S_{n}$ and human ratings, the structured representation is:
$$
P\mapsto\{B_{p},F_{p},T_{p},\mathbf{r}\},
$$
$$
S_{i}^{\text{good}}\mapsto\{S_{i},K_{i},\mathbf{r}_{i}^{\text{good}}\},\quad S_{i}^{\text{bad}}\mapsto\{E_{i},K_{i},\mathbf{r}_{i}^{\text{bad}}\},
$$
where $B_{p},F_{p},T_{p}$ represent the branch, subfield, and type of $P$ , respectively. The symbols $S_{i}$ and $E_{i}$ denote the procedures derived from correct steps and the errors from incorrect steps, respectively. Additionally, $K_{i}$ contains relevant mathematical knowledge. The relationships between problems, steps, and knowledge are encoded through $\mathbf{r},\mathbf{r}_{i}^{\text{good}},\mathbf{r}_{i}^{\text{bad}}$ , which capture the edge relationships linking these elements.
To balance computational efficiency and KG quality, we employ a Llama-3.1-8B-Instruct model to process about 10,000 unduplicated samples from PRM800K. The LLM is prompted to output structured JSON data, which is subsequently transformed into a Neo4j-based MKG. This process yields a graph with approximately $80,000$ nodes and $200,000$ edges, optimized for efficient retrieval.
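To make the decomposition concrete, the following is a minimal, runnable sketch of how one PRM800K-style sample could be turned into nodes and typed edges of the process-oriented MKG. The field names, relation labels, and rating convention are illustrative assumptions, not the exact schema used in our Neo4j pipeline.

```python
# Hypothetical schema: each sample carries branch/subfield/type metadata and
# rated steps; correct steps become procedure nodes, incorrect ones error nodes.
def build_mkg(samples):
    """Return (nodes, edges), where edges are (src, relation, dst) triples."""
    nodes, edges = set(), []
    for s in samples:
        p = s["problem"]
        nodes.update([s["branch"], s["subfield"], s["type"], p])
        edges += [(s["branch"], "has_subfield", s["subfield"]),
                  (s["subfield"], "has_type", s["type"]),
                  (s["type"], "has_problem", p)]
        for i, step in enumerate(s["steps"]):
            # Rating >= 0 is treated here as "correct" (illustrative threshold).
            kind = "procedure" if step["rating"] >= 0 else "error"
            node = f'{kind}{i + 1}: {step["text"]}'
            nodes.add(node)
            nodes.add(step["knowledge"])
            edges += [(p, "has_step", node),
                      (node, "uses_knowledge", step["knowledge"])]
    return nodes, edges

sample = {"branch": "Number Theory", "subfield": "Modular Arithmetic",
          "type": "Remainder", "problem": "Remainder of 1234 mod 35?",
          "steps": [{"text": "Divide 1234 by 35", "rating": 1,
                     "knowledge": "division algorithm"},
                    {"text": "1234 - 35*35 = 9", "rating": 1,
                     "knowledge": "modular arithmetic"}]}
nodes, edges = build_mkg([sample])
```

In the actual pipeline, such triples would be loaded into Neo4j; the in-memory form above is only meant to show the node and edge types involved.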
4.3 Step-by-Step Knowledge Graph Retrieval
KG-RAR for Problem Retrieval. For a given test problem $Q$ , the most relevant problem $P^{*}\in V_{p}$ and its subgraph are retrieved to assist reasoning. The retrieval pipeline comprises the following steps:
1. Initial Filtering: Classify $Q$ by $B_{q},F_{q},T_{q}$ (branch, subfield, and problem type). The candidate set $V_{Q}\subset V_{p}$ is filtered hierarchically, starting from $T_{q}$ , expanding to $F_{q}$ , and then to $B_{q}$ if no exact match is found.
2. Semantic Similarity Scoring:
$$
P^{*}=\arg\max_{P\in V_{Q}}\cos(\mathbf{e}_{Q},\mathbf{e}_{P}),
$$
where:
$$
\cos(\mathbf{e}_{Q},\mathbf{e}_{P})=\frac{\langle\mathbf{e}_{Q},\mathbf{e}_{P}\rangle}{\|\mathbf{e}_{Q}\|\,\|\mathbf{e}_{P}\|}
$$
and $\mathbf{e}_{Q},\mathbf{e}_{P}\in\mathbb{R}^{d}$ are embeddings of $Q$ and $P$ , respectively.
3. Context Retrieval: Perform Depth-First Search (DFS) on $G$ to retrieve procedures ( $S_{p}$ ), errors ( $E_{p}$ ), and knowledge ( $K_{p}$ ) connected to $P^{*}$ .
Input: Test problem $Q$ and MKG $G$
Output: Most relevant problem $P^{*}$ and its context $(S_{p},E_{p},K_{p})$
1. Filter $G$ using $B_{q}$ , $F_{q}$ , and $T_{q}$ to obtain $V_{Q}$ ;
2. foreach $P\in V_{Q}$ , compute $\text{Sim}_{\text{semantic}}(Q,P)$ ;
3. $P^{*}\leftarrow\arg\max_{P\in V_{Q}}\text{Sim}_{\text{semantic}}(Q,P)$ ;
4. Retrieve $S_{p}$ , $E_{p}$ , and $K_{p}$ from $P^{*}$ using DFS;
5. return $P^{*}$ , $S_{p}$ , $E_{p}$ , and $K_{p}$ .
Algorithm 1 KG-RAR for Problem Retrieval
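The hierarchical filtering and cosine scoring of Algorithm 1 can be sketched as follows. The bag-of-words `embed` function is a deliberately simple stand-in for whatever embedding model is actually used (the paper does not pin one down here), and the flat list-of-dicts KG is a toy substitute for the Neo4j store.

```python
import math

def embed(text):
    """Toy bag-of-words embedding (stand-in for a real sentence encoder)."""
    vec = {}
    for w in text.lower().split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_problem(query, kg, b_q, f_q, t_q):
    """Filter hierarchically (type -> subfield -> branch), then score by cosine."""
    candidates = []
    for key, val in (("type", t_q), ("subfield", f_q), ("branch", b_q)):
        candidates = [p for p in kg if p[key] == val]
        if candidates:      # stop at the most specific non-empty level
            break
    e_q = embed(query)
    # P* = argmax cosine similarity; its subgraph (S_p, E_p, K_p) would then
    # be gathered via DFS in the full algorithm.
    return max(candidates, key=lambda p: cosine(e_q, embed(p["problem"])))

kg = [{"problem": "What is the remainder when 1,493,824 is divided by 4?",
       "branch": "Number Theory", "subfield": "Modular Arithmetic",
       "type": "Remainder"},
      {"problem": "Find the slope of a line through two points",
       "branch": "Algebra", "subfield": "Linear Equations", "type": "Slope"}]
best = retrieve_problem("What is the remainder when 1234 is divided by 35?",
                        kg, "Number Theory", "Modular Arithmetic", "Remainder")
```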
KG-RAR for Step Retrieval. Given an intermediate reasoning step $S$ , the most relevant step $S^{*}\in G$ and its subgraph are retrieved dynamically:
1. Contextual Filtering: Restrict the search space $V_{S}$ to the subgraph induced by the previously retrieved top-k similar problems $\{P_{1},P_{2},...,P_{k}\}\subseteq V_{Q}$ .
2. Step Similarity Scoring:
$$
S^{*}=\arg\max_{S_{i}\in V_{S}}\cos(\mathbf{e}_{S},\mathbf{e}_{S_{i}}).
$$
3. Context Retrieval: Perform Breadth-First Search (BFS) on $G$ to extract the subgraph of $S^{*}$ , including potential next steps, related knowledge, and error patterns.
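The BFS context extraction in step 3 amounts to a depth-limited traversal from the matched step node. A minimal sketch (the depth limit and adjacency-dict representation are assumptions for illustration):

```python
from collections import deque

def bfs_subgraph(adj, start, max_depth=2):
    """Collect all nodes reachable from `start` within `max_depth` hops."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_depth:
            continue            # do not expand beyond the depth limit
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen

# Toy adjacency around a matched step S*: next steps, knowledge, error patterns.
adj = {"S*": ["next_step", "knowledge1"], "next_step": ["error_pattern"]}
context = bfs_subgraph(adj, "S*")
```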
4.4 Post-Retrieval Processing and Reward Model
Step Verification and End-of-Reasoning Detection. Inspired by previous works [93, 56, 94, 61], we use a frozen LLM to evaluate both step correctness and whether reasoning should terminate. The model is queried with an instruction, producing a binary classification decision:
$$
\textit{Is this step correct (Yes/No)? }\begin{cases}\text{Yes}\xrightarrow[\text{Probability}]{\text{Token}}p(\text{Yes})\\\text{No}\xrightarrow[\text{Probability}]{\text{Token}}p(\text{No})\\\text{Other Tokens.}\end{cases}
$$
The corresponding confidence score for step verification or reasoning termination is computed as:
$$
\operatorname{Score}(S,I)=\frac{\exp(p(\text{Yes}|S,I))}{\exp(p(\text{Yes}|S,I))+\exp(p(\text{No}|S,I))}.
$$
For step correctness, the instruction $I$ is "Is this step correct (Yes/No)?", while for reasoning termination, the instruction $I_{E}$ is "Has a final answer been reached (Yes/No)?".
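The normalization above is a two-class softmax over the logits of the "Yes" and "No" tokens, ignoring the rest of the vocabulary. A minimal sketch, assuming the verifier model exposes per-token logits (the function name is ours):

```python
import math

def yes_no_score(yes_logit: float, no_logit: float) -> float:
    """Score(S, I) = exp(l_yes) / (exp(l_yes) + exp(l_no)),
    a two-class softmax over the Yes/No token logits; all
    other tokens in the vocabulary are ignored."""
    m = max(yes_logit, no_logit)      # subtract the max for numerical stability
    ey = math.exp(yes_logit - m)
    en = math.exp(no_logit - m)
    return ey / (ey + en)
```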
Post-Retrieval Processing. Post-retrieval processing is a crucial component of the retrieval-augmented generation (RAG) framework, ensuring that retrieved information is improved to maximize relevance while minimizing noise [37, 95, 96].
For a problem $P$ or a reasoning step $S$ :
$$
\mathcal{R}^{\prime}=\operatorname{LLM}_{\text{refine}}(P+\mathcal{R}\text{ or }S+\mathcal{R}),
$$
where $\mathcal{R}$ is the raw retrieved context, and $\mathcal{R}^{\prime}$ represents its rewritten, targeted form.
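One possible shape of the $\operatorname{LLM}_{\text{refine}}$ call is prompt templating around a frozen model. The prompt wording below is illustrative, not the paper's template, and `llm` stands for any callable mapping a prompt string to a completion:

```python
def refine_retrieval(llm, query: str, raw_context: str) -> str:
    """Ask a frozen LLM to rewrite the raw KG context R into a
    targeted form R' for the current problem or step."""
    prompt = (
        "Rewrite the retrieved context so it directly supports the "
        "query below. Drop irrelevant facts, keep formulas intact.\n"
        f"Query: {query}\n"
        f"Retrieved context: {raw_context}\n"
        "Rewritten context:"
    )
    return llm(prompt)
```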
Iterative Refinement and Verification. Inspired by generative reward models [61, 97], we integrate retrieval refinement as a form of CoT reasoning before scoring each step. To ensure consistency in multi-step reasoning, we employ an iterative retrieval refinement and scoring mechanism, as illustrated in Figure 4.
Figure 4: Illustration of the Post-Retrieval Processing and Reward Model (PRP-RM). Given a problem $P$ and its retrieved context $\mathcal{R}_{p}$ from the Knowledge Graph (KG), PRP-RM refines it into $\mathcal{R}^{\prime}_{p}$ . The Reasoner LLM generates step $S_{1}$ based on $\mathcal{R}^{\prime}_{p}$ , followed by iterative retrieval and refinement ( $\mathcal{R}_{t}→\mathcal{R}^{\prime}_{t}$ ) for each step $S_{t}$ . Correctness is assessed using $I=$ "Is this step correct?" to compute $\operatorname{Score}(S_{t})$ , while completion is checked via $I_{E}=$ "Has a final answer been reached?" to compute $\operatorname{End}(S_{t})$ . The process continues until $\operatorname{End}(S_{t})$ surpasses a threshold or a predefined inference depth is reached.
Input: Current step $S$ and retrieved problems $\{P_{1},...,P_{k}\}$
Output: Relevant step $S^{*}$ and its context subgraph
1 Initialize step collection $V_{S}←\bigcup_{i=1}^{k}\text{Steps}(P_{i})$ ;
2 foreach $S_{i}∈ V_{S}$ do
3 Compute semantic similarity $\text{Sim}_{\text{semantic}}(S,S_{i})$ ;
4 end
5 $S^{*}←\arg\max_{S_{i}∈ V_{S}}\text{Sim}_{\text{semantic}}(S,S_{i})$ ;
6 Construct context subgraph via $\text{BFS}(S^{*})$ ;
7 return $S^{*}$ , $\text{subgraph}(S^{*})$ ;
Algorithm 2 KG-RAR for Step Retrieval
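Algorithm 2 can be sketched as follows, assuming precomputed step embeddings and an adjacency-list encoding of the MKG; the function names and the BFS depth bound are our own choices, not the paper's:

```python
from collections import deque
import numpy as np

def bfs_subgraph(adj: dict, start, depth: int = 2) -> set:
    """Collect nodes reachable from `start` within `depth` hops
    (candidate next steps, related knowledge, error patterns)."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:
            continue
        for nxt in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen

def retrieve_step(s_emb: np.ndarray, step_embs: dict, adj: dict):
    """Pick S* = argmax cosine similarity over candidate steps V_S,
    then return S* together with its BFS context subgraph."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    best = max(step_embs, key=lambda s: cos(s_emb, step_embs[s]))
    return best, bfs_subgraph(adj, best)
```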
For a reasoning step $S_{t}$ , the iterative refinement history is:
$$
H_{t}=\{P+\mathcal{R}_{p},\mathcal{R}^{\prime}_{p},S_{1}+\mathcal{R}_{1},\mathcal{R}^{\prime}_{1},\dots,S_{t}+\mathcal{R}_{t},\mathcal{R}^{\prime}_{t}\}.
$$
The refined retrieval context is generated recursively:
$$
\mathcal{R}^{\prime}_{t}=\operatorname{LLM}_{\text{refine}}(H_{t-1},S_{t}+\mathcal{R}_{t}).
$$
The correctness and end-of-reasoning probabilities are:
$$
\operatorname{Score}(S_{t})=\frac{\exp(p(\text{Yes}|H_{t},I))}{\exp(p(\text{Yes}|H_{t},I))+\exp(p(\text{No}|H_{t},I))},
$$
$$
\operatorname{End}(S_{t})=\frac{\exp(p(\text{Yes}|H_{t},I_{E}))}{\exp(p(\text{Yes}|H_{t},I_{E}))+\exp(p(\text{No}|H_{t},I_{E}))}.
$$
This process iterates until $\operatorname{End}(S_{t})>\theta$ , signaling completion.
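Putting the pieces together, the iterate-until-$\operatorname{End}$ loop can be sketched as below. Here `reasoner`, `retrieve`, `refine`, `score`, and `end_prob` stand in for the Reasoner LLM, KG retrieval, PRP refinement, and the Yes/No scoring calls, and the history bookkeeping is a simplification of $H_t$, all assumptions of this sketch:

```python
def reason(problem, reasoner, retrieve, refine, score, end_prob,
           theta: float = 0.9, max_depth: int = 8):
    """Iterative retrieve-refine-reason loop with step scoring;
    stops once End(S_t) > theta or the depth budget is exhausted."""
    history = []                                          # simplified H_t
    context = refine(history, problem, retrieve(problem))  # R'_p
    steps, scores = [], []
    for _ in range(max_depth):
        step = reasoner(problem, steps, context)           # S_t
        context = refine(history, step, retrieve(step))    # R_t -> R'_t
        history.extend([step, context])
        scores.append(score(history))                      # Score(S_t)
        steps.append(step)
        if end_prob(history) > theta:                      # End(S_t)
            break
    return steps, scores
```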
TABLE I: Performance evaluation across different levels of the Math500 dataset using various models and methods.
| Model | Method | Level 1 (+9.09%) Maj | Last | Level 2 (+5.38%) Maj | Last | Level 3 (+8.90%) Maj | Last | Level 4 (+7.61%) Maj | Last | Level 5 (+16.43%) Maj | Last | Overall (+8.95%) Maj | Last |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B (+15.22%) | CoT-prompting | 80.6 | 80.6 | 74.1 | 74.1 | 59.4 | 59.4 | 46.4 | 46.4 | 27.4 | 27.1 | 51.9 | 51.9 |
| | Step-by-Step KG-RAR | 88.4 | 81.4 | 83.3 | 82.2 | 70.5 | 69.5 | 53.9 | 53.9 | 32.1 | 25.4 | 59.8 | 57.0 |
| Llama-3.2-3B (+20.73%) | CoT-prompting | 63.6 | 65.1 | 61.9 | 61.9 | 51.1 | 51.1 | 43.2 | 43.2 | 20.4 | 20.4 | 43.9 | 44.0 |
| | Step-by-Step KG-RAR | 83.7 | 79.1 | 68.9 | 68.9 | 61.0 | 52.4 | 49.2 | 47.7 | 29.9 | 28.4 | 53.0 | 50.0 |
| Llama-3.2-1B (-4.02%) | CoT-prompting | 64.3 | 64.3 | 52.6 | 52.2 | 41.6 | 41.6 | 25.3 | 25.3 | 8.0 | 8.2 | 32.3 | 32.3 |
| | Step-by-Step KG-RAR | 72.1 | 72.1 | 50.0 | 50.0 | 40.0 | 40.0 | 18.0 | 19.5 | 10.4 | 13.4 | 31.0 | 32.2 |
| Qwen2.5-7B (+2.91%) | CoT-prompting | 95.3 | 95.3 | 88.9 | 88.9 | 86.7 | 86.3 | 77.3 | 77.1 | 50.0 | 49.8 | 75.6 | 75.4 |
| | Step-by-Step KG-RAR | 95.3 | 93.0 | 90.0 | 90.0 | 87.6 | 88.6 | 79.7 | 79.7 | 54.5 | 56.7 | 77.8 | 78.4 |
| Qwen2.5-3B (+3.13%) | CoT-prompting | 93.0 | 93.0 | 85.2 | 85.2 | 81.0 | 80.6 | 62.5 | 62.5 | 40.1 | 39.3 | 67.1 | 66.8 |
| | Step-by-Step KG-RAR | 95.3 | 95.3 | 84.4 | 85.6 | 83.8 | 77.1 | 64.1 | 64.1 | 44.0 | 38.1 | 69.2 | 66.4 |
| Qwen2.5-1.5B (-5.12%) | CoT-prompting | 88.4 | 88.4 | 78.5 | 77.4 | 71.4 | 68.9 | 49.2 | 49.5 | 34.6 | 34.3 | 58.6 | 57.9 |
| | Step-by-Step KG-RAR | 97.7 | 93.0 | 78.9 | 75.6 | 66.7 | 67.6 | 48.4 | 44.5 | 24.6 | 23.1 | 55.6 | 53.4 |
Role-Based System Prompting. Inspired by agent-based reasoning frameworks [98, 99, 100], we introduce role-based system prompting to further optimize our PRP-RM. In this approach, we define three distinct personas to enhance the reasoning process. The Responsible Teacher [101] processes retrieved knowledge into structured guidance and evaluates the correctness of each step. The Socratic Teacher [102], rather than providing direct guidance, reformulates the retrieved content into heuristic questions, encouraging self-reflection. Finally, the Critical Teacher [103] acts as a critical evaluator, diagnosing reasoning errors before assigning a score. Each role focuses on different aspects of post-retrieval processing, improving robustness and interpretability.
5 Experiments
In this section, we evaluate the effectiveness of our proposed Step-by-Step KG-RAR and PRP-RM methods by comparing them with Chain-of-Thought (CoT) prompting [4] and finetuned reward models [30, 29]. Additionally, we perform ablation studies to examine the impact of individual components.
5.1 Experimental Setup
Following prior works [5, 29, 21], we evaluate on Math500 [104] and GSM8K [30], using Accuracy (%) as the primary metric. Experiments focus on instruction-tuned Llama3 [105] and Qwen2.5 [106], with Best-of-N [107, 30, 29] search ( $n=8$ for Math500, $n=4$ for GSM8K). We employ Majority Vote [5] for self-consistency and Last Vote [25] for benchmarking reward models. To evaluate PRP-RM, we compare against fine-tuned reward models: Math-Shepherd-PRM-7B [108], RLHFlow-ORM-Deepseek-8B and RLHFlow-PRM-Deepseek-8B [109]. For Step-by-Step KG-RAR, we set step depth to $8$ and padding to $4$ . The Socratic Teacher role is used in PRP-RM to minimize direct solving. Both the Reasoner LLM and PRP-RM remain consistent for fair comparison.
[Figure 5 data: accuracy (%) vs. Math500 difficulty level (1-5) under Last@8. PRP-RM (frozen Llama-3.2-3B): ≈79/70/54/48/28; Math-Shepherd-PRM-7B: ≈68/65/50/43/18; RLHFlow-ORM-Deepseek-8B: ≈63/63/53/41/20; RLHFlow-PRM-Deepseek-8B: ≈70/61/54/46/20.]
Figure 5: Comparison of reward models under Last@8.
5.2 Comparative Experimental Results
TABLE II: Evaluation results on the GSM8K dataset.
| Model | Method | Maj@4 | Last@4 |
| --- | --- | --- | --- |
| Llama-3.1-8B (+8.68%) | CoT-prompting | 81.8 | 82.0 |
| | Step-by-Step KG-RAR | 88.9 | 88.0 |
| Qwen-2.5-7B (+1.09%) | CoT-prompting | 91.6 | 91.1 |
| | Step-by-Step KG-RAR | 92.6 | 93.1 |
Table I shows that Step-by-Step KG-RAR consistently outperforms CoT-prompting across all difficulty levels on Math500, with more pronounced improvements in the Llama3 series compared to Qwen2.5, likely because Qwen2.5's higher baseline accuracy leaves less room for improvement. Performance declines for smaller models like Qwen-1.5B and Llama-1B on harder problems due to increased reasoning inconsistencies. Among models showing improvements, Step-by-Step KG-RAR achieves an average relative accuracy gain of 8.95% on Math500 under Maj@8, while Llama-3.1-8B attains an 8.68% improvement on GSM8K under Maj@4 (Table II). Additionally, PRP-RM achieves comparable performance to ORM and PRM. Figure 5 confirms its effectiveness with Llama-3B on Math500, highlighting its viability as a training-free alternative.
[Figure 6 data: accuracy with raw vs. processed retrieval. Maj@8: ≈32% raw vs. ≈40% processed; Last@8: ≈36% raw vs. ≈40% processed.]
Figure 6: Comparison of raw and processed retrieval.
[Figure 7 data: accuracy by retrieval type. Maj@8: none ≈51%, unstructured RAG ≈52%, KG-RAR ≈54%; Last@8: none ≈44%, RAG ≈45%, KG-RAR ≈52%.]
Figure 7: Comparison of various RAG types.
5.3 Ablation Studies
Effectiveness of Post-Retrieval Processing (PRP). We compare reasoning with refined retrieval from PRP-RM against raw retrieval taken directly from the knowledge graph. Figure 6 shows that refining the retrieval context significantly improves performance (Llama-3B, Math500 Level 3).
Effectiveness of Knowledge Graphs (KGs). KG-RAR outperforms both no RAG and unstructured RAG (PRM800K) baselines, demonstrating the advantage of structured retrieval (Figure 7, Qwen-0.5B Reasoner, Qwen-3B PRP-RM, Math500).
[Figure 8 data (partial): accuracy (Maj@8, left axis) and computation (right axis: Less/Medium/More) vs. step padding (1, 4, +∞); Maj@8 ≈34% at padding 1, with computation decreasing as padding grows.]
Figure 8: Comparison of step padding settings.
[Figure 9 data: accuracy by PRP-RM role. Maj@8: Socratic ≈61%, Responsible ≈64%, Critical ≈70%; Last@8: Socratic ≈53%, Responsible ≈62%, Critical ≈69%.]
Figure 9: Comparison of PRP-RM roles.
Effectiveness of Step-by-Step RAG. We evaluate step padding values of 1, 4, and +∞ (a padding larger than any reasoning chain). Small padding causes inconsistencies between steps, while large padding hinders refinement. Figure 8 illustrates this trade-off (Llama-1B, Math500 Level 3).
[Figure 10 data: accuracy vs. PRP-RM size. Maj@8: ≈27% (0.5B), ≈42% (1.5B), ≈54% (3B); Last@8: ≈31%, ≈43%, ≈51%.]
Figure 10: Scaling PRP-RM sizes.
[Figure 11 data: accuracy vs. Reasoner size. Maj@8: ≈54% (0.5B), ≈62% (1.5B), ≈69% (3B); Last@8: ≈52%, ≈62%, ≈67%.]
Figure 11: Scaling reasoner LLM sizes.
Comparison of PRP-RM Roles. The Socratic Teacher minimizes direct problem-solving but sometimes introduces extraneous questions. Figure 9 shows that the Critical Teacher performs best among the three roles (Llama-3B, Math500).
Scaling Model Size. Scaling trends in Section 5.2 are validated on Math500. Figures 10 and 11 confirm performance gains as the PRP-RM and the Reasoner LLM scale independently.
Scaling of Number of Solutions. We vary the number of generated solutions using Llama-3B on Math500 Level 3. Figure 12 shows accuracy improves incrementally with more solutions, underscoring the benefits of multiple candidates.
[Figure 12 data: accuracy vs. number of generated solutions (1 to 64). Maj@8 climbs from ≈45% (1 solution) to ≈73% (16), then plateaus; Last@8 rises from ≈47% to ≈74% (64). Baselines: Llama-3.1-8B ≈56%, Llama-3.2-3B ≈49%.]
Figure 12: Scaling of the number of solutions.
[Figure 13 data: accuracy by voting method. Maj-Vote ≈65%, Last-Vote ≈63%, Min-Vote ≈61.5%, Last-Max ≈56%, Min-Max ≈55.5%.]
Figure 13: Comparison of various voting methods.
Comparison of Voting Methods. We evaluate five PRP-RM voting strategies: Majority Vote, Last Vote, Min Vote, Min-Max, and Last-Max [23, 21]. Majority Vote and Last Vote outperform the others, as methods based on extreme step scores are prone to PRP-RM overconfidence in incorrect solutions (Figure 13).
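The five strategies differ only in how each solution's per-step scores are collapsed to a single score and how those scores select a final answer. A minimal sketch following common process-reward-model aggregation conventions (the exact rules used with PRP-RM may differ):

```python
from collections import Counter, defaultdict

def aggregate(solutions, strategy):
    """Pick a final answer from scored candidate solutions.

    Each solution is (answer, step_scores), where step_scores are the
    per-step reward scores of that reasoning chain. Strategy names mirror
    Figure 13; the concrete rules here are assumptions:
      maj-vote  : unweighted majority over final answers
      last-vote : per-answer sum of last-step scores
      min-vote  : per-answer sum of minimum step scores
      last-max  : single solution with the highest last-step score
      min-max   : single solution with the highest minimum step score
    """
    if strategy == "maj-vote":
        return Counter(ans for ans, _ in solutions).most_common(1)[0][0]

    # Collapse a solution's step scores to one score.
    def solution_score(scores):
        return scores[-1] if strategy.startswith("last") else min(scores)

    if strategy in ("last-max", "min-max"):
        return max(solutions, key=lambda s: solution_score(s[1]))[0]

    if strategy in ("last-vote", "min-vote"):
        totals = defaultdict(float)
        for ans, scores in solutions:
            totals[ans] += solution_score(scores)
        return max(totals, key=totals.get)

    raise ValueError(f"unknown strategy: {strategy}")
```

The sketch makes the failure mode concrete: a single overconfident score on a wrong chain can flip `last-max` or `min-max`, whereas the vote-based variants require the wrong answer to dominate across many sampled solutions.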
6 Conclusions and Limitations
In this paper, we introduce a novel graph-augmented reasoning paradigm that enhances the o1-like multi-step reasoning capabilities of frozen LLMs by integrating external KGs. To this end, we present step-by-step knowledge-graph-based retrieval-augmented reasoning (KG-RAR), an iterative retrieve-refine-reason framework, facilitated by a post-retrieval processing and reward model (PRP-RM) that refines raw retrievals and assigns step-wise scores to guide reasoning more effectively. Experimental results demonstrate an average relative improvement of 8.95% over CoT prompting on Math500, with PRP-RM achieving performance competitive with fine-tuned reward models, without their heavy training or fine-tuning costs.
Despite these merits, the proposed approach has limitations, including higher computational overhead and cases where KG-RAR may introduce unnecessary noise or fail to enhance reasoning. Our future work will focus on optimising the framework by incorporating active learning to dynamically update KGs, improving retrieval efficiency, and exploring broader applications in complex reasoning domains such as scientific discovery and real-world decision-making.
References
- [1] J. Huang and K. C.-C. Chang, “Towards reasoning in large language models: A survey,” arXiv preprint arXiv:2212.10403, 2022.
- [2] J. Sun, C. Zheng, E. Xie, Z. Liu, R. Chu, J. Qiu, J. Xu, M. Ding, H. Li, M. Geng et al., “A survey of reasoning with foundation models,” arXiv preprint arXiv:2312.11562, 2023.
- [3] A. Plaat, A. Wong, S. Verberne, J. Broekens, N. van Stein, and T. Back, “Reasoning with large language models, a survey,” arXiv preprint arXiv:2407.11511, 2024.
- [4] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022.
- [5] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” arXiv preprint arXiv:2203.11171, 2022.
- [6] S. Zhou, U. Alon, F. F. Xu, Z. Jiang, and G. Neubig, “Docprompting: Generating code by retrieving the docs,” in The Eleventh International Conference on Learning Representations, 2022.
- [7] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” Advances in neural information processing systems, vol. 35, pp. 22199–22213, 2022.
- [8] A. Creswell, M. Shanahan, and I. Higgins, “Selection-inference: Exploiting large language models for interpretable logical reasoning,” arXiv preprint arXiv:2205.09712, 2022.
- [9] N. Shinn, B. Labash, and A. Gopinath, “Reflexion: an autonomous agent with dynamic memory and self-reflection,” arXiv preprint arXiv:2303.11366, 2023.
- [10] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” Advances in Neural Information Processing Systems, vol. 36, 2024.
- [11] W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,” arXiv preprint arXiv:2211.12588, 2022.
- [12] R. Yamauchi, S. Sonoda, A. Sannai, and W. Kumagai, “Lpml: Llm-prompting markup language for mathematical reasoning,” arXiv preprint arXiv:2309.13078, 2023.
- [13] Y. Zhuang, Y. Yu, K. Wang, H. Sun, and C. Zhang, “Toolqa: A dataset for llm question answering with external tools,” Advances in Neural Information Processing Systems, vol. 36, pp. 50117–50143, 2023.
- [14] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez et al., “Code llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950, 2023.
- [15] L. Yu, W. Jiang, H. Shi, J. Yu, Z. Liu, Y. Zhang, J. T. Kwok, Z. Li, A. Weller, and W. Liu, “Metamath: Bootstrap your own mathematical questions for large language models,” arXiv preprint arXiv:2309.12284, 2023.
- [16] Y. Wu, F. Jia, S. Zhang, H. Li, E. Zhu, Y. Wang, Y. T. Lee, R. Peng, Q. Wu, and C. Wang, “Mathchat: Converse to tackle challenging math problems with llm agents,” in ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024.
- [17] S. Wu, Z. Peng, X. Du, T. Zheng, M. Liu, J. Wu, J. Ma, Y. Li, J. Yang, W. Zhou et al., “A comparative study on reasoning patterns of openai’s o1 model,” arXiv preprint arXiv:2410.13639, 2024.
- [18] D. Zhang, J. Wu, J. Lei, T. Che, J. Li, T. Xie, X. Huang, S. Zhang, M. Pavone, Y. Li et al., “Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning,” arXiv preprint arXiv:2410.02884, 2024.
- [19] Y. Zhang, S. Wu, Y. Yang, J. Shu, J. Xiao, C. Kong, and J. Sang, “o1-coder: an o1 replication for coding,” arXiv preprint arXiv:2412.00154, 2024.
- [20] J. Wang, F. Meng, Y. Liang, and J. Zhou, “Drt-o1: Optimized deep reasoning translation via long chain-of-thought,” arXiv preprint arXiv:2412.17498, 2024.
- [21] J. Wang, M. Fang, Z. Wan, M. Wen, J. Zhu, A. Liu, Z. Gong, Y. Song, L. Chen, L. M. Ni et al., “Openr: An open source framework for advanced reasoning with large language models,” arXiv preprint arXiv:2410.09671, 2024.
- [22] H. Luo, L. Shen, H. He, Y. Wang, S. Liu, W. Li, N. Tan, X. Cao, and D. Tao, “O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning,” arXiv preprint arXiv:2501.12570, 2025.
- [23] X. Feng, Z. Wan, M. Wen, Y. Wen, W. Zhang, and J. Wang, “Alphazero-like tree-search can guide large language model decoding and training,” in International Conference on Machine Learning (ICML), 2024.
- [24] S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Z. Hu, “Reasoning with language model is planning with world model,” arXiv preprint arXiv:2305.14992, 2023.
- [25] C. Snell, J. Lee, K. Xu, and A. Kumar, “Scaling llm test-time compute optimally can be more effective than scaling model parameters,” arXiv preprint arXiv:2408.03314, 2024.
- [26] Y. Chen, X. Pan, Y. Li, B. Ding, and J. Zhou, “A simple and provable scaling law for the test-time compute of large language models,” arXiv preprint arXiv:2411.19477, 2024.
- [27] OpenAI, “Learning to reason with llms,” https://openai.com/index/learning-to-reason-with-llms/, 2024.
- [28] DeepSeek-AI, D. Guo, D. Yang, and et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2501.12948
- [29] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, “Let’s verify step by step,” arXiv preprint arXiv:2305.20050, 2023.
- [30] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021.
- [31] J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins, “Solving math word problems with process-and outcome-based feedback,” arXiv preprint arXiv:2211.14275, 2022.
- [32] A. Satpute, N. Gießing, A. Greiner-Petter, M. Schubotz, O. Teschke, A. Aizawa, and B. Gipp, “Can llms master math? investigating large language models on math stack exchange,” in Proceedings of the 47th international ACM SIGIR conference on research and development in information retrieval, 2024, pp. 2316–2320.
- [33] I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and M. Farajtabar, “Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models,” arXiv preprint arXiv:2410.05229, 2024.
- [34] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin et al., “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,” arXiv preprint arXiv:2311.05232, 2023.
- [35] S. Banerjee, A. Agarwal, and S. Singla, “Llms will always hallucinate, and we need to live with this,” arXiv preprint arXiv:2409.05746, 2024.
- [36] Z. Xu, S. Jain, and M. Kankanhalli, “Hallucination is inevitable: An innate limitation of large language models,” arXiv preprint arXiv:2401.11817, 2024.
- [37] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, and H. Wang, “Retrieval-augmented generation for large language models: A survey,” arXiv preprint arXiv:2312.10997, 2023.
- [38] W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li, “A survey on rag meeting llms: Towards retrieval-augmented large language models,” in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024, pp. 6491–6501.
- [39] B. Peng, Y. Zhu, Y. Liu, X. Bo, H. Shi, C. Hong, Y. Zhang, and S. Tang, “Graph retrieval-augmented generation: A survey,” arXiv preprint arXiv:2408.08921, 2024.
- [40] S. Barnett, S. Kurniawan, S. Thudumu, Z. Brannelly, and M. Abdelrazek, “Seven failure points when engineering a retrieval augmented generation system,” in Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering-Software Engineering for AI, 2024, pp. 194–199.
- [41] Z. Wang, A. Liu, H. Lin, J. Li, X. Ma, and Y. Liang, “Rat: Retrieval augmented thoughts elicit context-aware reasoning in long-horizon generation,” arXiv preprint arXiv:2403.05313, 2024.
- [42] Y. Hu, Z. Lei, Z. Zhang, B. Pan, C. Ling, and L. Zhao, “Grag: Graph retrieval-augmented generation,” arXiv preprint arXiv:2405.16506, 2024.
- [43] X. Li, R. Zhao, Y. K. Chia, B. Ding, S. Joty, S. Poria, and L. Bing, “Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources,” arXiv preprint arXiv:2305.13269, 2023.
- [44] X. He, Y. Tian, Y. Sun, N. V. Chawla, T. Laurent, Y. LeCun, X. Bresson, and B. Hooi, “G-retriever: Retrieval-augmented generation for textual graph understanding and question answering,” arXiv preprint arXiv:2402.07630, 2024.
- [45] R.-C. Chang and J. Zhang, “Communitykg-rag: Leveraging community structures in knowledge graphs for advanced retrieval-augmented generation in fact-checking,” arXiv preprint arXiv:2408.08535, 2024.
- [46] Y. Mu, P. Niu, K. Bontcheva, and N. Aletras, “Predicting and analyzing the popularity of false rumors in weibo,” Expert Systems with Applications, vol. 243, p. 122791, 2024.
- [47] L. Luo, Y.-F. Li, G. Haffari, and S. Pan, “Reasoning on graphs: Faithful and interpretable large language model reasoning,” arXiv preprint arXiv:2310.01061, 2023.
- [48] J. Sun, C. Xu, L. Tang, S. Wang, C. Lin, Y. Gong, H.-Y. Shum, and J. Guo, “Think-on-graph: Deep and responsible reasoning of large language model with knowledge graph,” arXiv preprint arXiv:2307.07697, 2023.
- [49] H. Liu, S. Wang, Y. Zhu, Y. Dong, and J. Li, “Knowledge graph-enhanced large language models via path selection,” arXiv preprint arXiv:2406.13862, 2024.
- [50] L. Luo, Z. Zhao, C. Gong, G. Haffari, and S. Pan, “Graph-constrained reasoning: Faithful reasoning on knowledge graphs with large language models,” arXiv preprint arXiv:2410.13080, 2024.
- [51] Q. Zhang, J. Dong, H. Chen, D. Zha, Z. Yu, and X. Huang, “Knowgpt: Knowledge graph based prompting for large language models,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
- [52] N. Choudhary and C. K. Reddy, “Complex logical reasoning over knowledge graphs using large language models,” arXiv preprint arXiv:2305.01157, 2023.
- [53] Z. Zhao, Y. Rong, D. Guo, E. Gözlüklü, E. Gülboy, and E. Kasneci, “Stepwise self-consistent mathematical reasoning with large language models,” arXiv preprint arXiv:2402.17786, 2024.
- [54] S. Ji, S. Pan, E. Cambria, P. Marttinen, and S. Y. Philip, “A survey on knowledge graphs: Representation, acquisition, and applications,” IEEE transactions on neural networks and learning systems, vol. 33, no. 2, pp. 494–514, 2021.
- [55] J. Wang, “Math-kg: Construction and applications of mathematical knowledge graph,” arXiv preprint arXiv:2205.03772, 2022.
- [56] C. Zheng, Z. Zhang, B. Zhang, R. Lin, K. Lu, B. Yu, D. Liu, J. Zhou, and J. Lin, “Processbench: Identifying process errors in mathematical reasoning,” arXiv preprint arXiv:2412.06559, 2024.
- [57] M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, L. Gianinazzi, J. Gajda, T. Lehmann, M. Podstawski, H. Niewiadomski, P. Nyczyk et al., “Graph of thoughts: Solving elaborate problems with large language models,” arXiv preprint arXiv:2308.09687, 2023.
- [58] E. Zelikman, Y. Wu, J. Mu, and N. Goodman, “Star: Bootstrapping reasoning with reasoning,” Advances in Neural Information Processing Systems, vol. 35, pp. 15476–15488, 2022.
- [59] Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J.-G. Lou, and W. Chen, “Making large language models better reasoners with step-aware verifier,” arXiv preprint arXiv:2206.02336, 2022.
- [60] F. Yu, A. Gao, and B. Wang, “Ovm, outcome-supervised value models for planning in mathematical reasoning,” in Findings of the Association for Computational Linguistics: NAACL 2024, 2024, pp. 858–875.
- [61] L. Zhang, A. Hosseini, H. Bansal, M. Kazemi, A. Kumar, and R. Agarwal, “Generative verifiers: Reward modeling as next-token prediction,” arXiv preprint arXiv:2408.15240, 2024.
- [62] H. Paulheim, “Knowledge graph refinement: A survey of approaches and evaluation methods,” Semantic web, vol. 8, no. 3, pp. 489–508, 2016.
- [63] Q. Wang, Z. Mao, B. Wang, and L. Guo, “Knowledge graph embedding: A survey of approaches and applications,” IEEE transactions on knowledge and data engineering, vol. 29, no. 12, pp. 2724–2743, 2017.
- [64] Y. Jing, Y. Yang, X. Wang, M. Song, and D. Tao, “Meta-aggregator: Learning to aggregate for 1-bit graph neural networks,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 5301–5310.
- [65] Y. Jing, C. Yuan, L. Ju, Y. Yang, X. Wang, and D. Tao, “Deep graph reprogramming,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 24345–24354.
- [66] D. Sanmartin, “Kg-rag: Bridging the gap between knowledge and creativity,” arXiv preprint arXiv:2405.12035, 2024.
- [67] Y. Wang, N. Lipka, R. A. Rossi, A. Siu, R. Zhang, and T. Derr, “Knowledge graph prompting for multi-document question answering,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 17, 2024, pp. 19206–19214.
- [68] A. Kau, X. He, A. Nambissan, A. Astudillo, H. Yin, and A. Aryani, “Combining knowledge graphs and large language models,” arXiv preprint arXiv:2407.06564, 2024.
- [69] J. Jiang, K. Zhou, Z. Dong, K. Ye, W. X. Zhao, and J.-R. Wen, “Structgpt: A general framework for large language model to reason over structured data,” arXiv preprint arXiv:2305.09645, 2023.
- [70] Z. Chai, T. Zhang, L. Wu, K. Han, X. Hu, X. Huang, and Y. Yang, “Graphllm: Boosting graph reasoning ability of large language model,” arXiv preprint arXiv:2310.05845, 2023.
- [71] Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou, Y. Yao, S. Deng, H. Chen, and N. Zhang, “Llms for knowledge graph construction and reasoning: Recent capabilities and future opportunities,” World Wide Web, vol. 27, no. 5, p. 58, 2024.
- [72] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, “Unifying large language models and knowledge graphs: A roadmap,” IEEE Transactions on Knowledge and Data Engineering, 2024.
- [73] G. Agrawal, T. Kumarage, Z. Alghamdi, and H. Liu, “Can knowledge graphs reduce hallucinations in llms?: A survey,” arXiv preprint arXiv:2311.07914, 2023.
- [74] A. S. Pinto, A. Kolesnikov, Y. Shi, L. Beyer, and X. Zhai, “Tuning computer vision models with task rewards,” in International Conference on Machine Learning. PMLR, 2023, pp. 33229–33239.
- [75] Y. Jing, Y. Yang, X. Wang, M. Song, and D. Tao, “Turning frequency to resolution: Video super-resolution via event cameras,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7772–7781.
- [76] M. Kwon, S. M. Xie, K. Bullard, and D. Sadigh, “Reward design with language models,” arXiv preprint arXiv:2303.00001, 2023.
- [77] Y. Cao, H. Zhao, Y. Cheng, T. Shu, Y. Chen, G. Liu, G. Liang, J. Zhao, J. Yan, and Y. Li, “Survey on large language model-enhanced reinforcement learning: Concept, taxonomy, and methods,” IEEE Transactions on Neural Networks and Learning Systems, 2024.
- [78] Z. Wang, B. Bi, S. K. Pentyala, K. Ramnath, S. Chaudhuri, S. Mehrotra, X.-B. Mao, S. Asur et al., “A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more,” arXiv preprint arXiv:2407.16216, 2024.
- [79] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
- [80] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” arXiv preprint arXiv:2210.03629, 2022.
- [81] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. V. Le, and E. H. Chi, “Least-to-most prompting enables complex reasoning in large language models,” in The Eleventh International Conference on Learning Representations, ICLR 2023, 2023.
- [82] X. Wang and D. Zhou, “Chain-of-thought reasoning without prompting,” arXiv preprint arXiv:2402.10200, 2024.
- [83] Y. Zhang, S. Mao, T. Ge, X. Wang, A. de Wynter, Y. Xia, W. Wu, T. Song, M. Lan, and F. Wei, “Llm as a mastermind: A survey of strategic reasoning with large language models,” arXiv preprint arXiv:2404.01230, 2024.
- [84] F. Yu, H. Zhang, P. Tiwari, and B. Wang, “Natural language reasoning, a survey,” ACM Computing Surveys, vol. 56, no. 12, pp. 1–39, 2024.
- [85] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin et al., “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,” ACM Transactions on Information Systems, 2024.
- [86] Z. Chu, J. Chen, Q. Chen, W. Yu, T. He, H. Wang, W. Peng, M. Liu, B. Qin, and T. Liu, “A survey of chain of thought reasoning: Advances, frontiers and future,” arXiv preprint arXiv:2309.15402, 2023.
- [87] J. Ahn, R. Verma, R. Lou, D. Liu, R. Zhang, and W. Yin, “Large language models for mathematical reasoning: Progresses and challenges,” arXiv preprint arXiv:2402.00157, 2024.
- [88] Z. Li, Y. Cao, X. Xu, J. Jiang, X. Liu, Y. S. Teo, S.-W. Lin, and Y. Liu, “Llms for relational reasoning: How far are we?” in Proceedings of the 1st International Workshop on Large Language Models for Code, 2024, pp. 119–126.
- [89] Y. Wang, W. Zhong, L. Li, F. Mi, X. Zeng, W. Huang, L. Shang, X. Jiang, and Q. Liu, “Aligning large language models with human: A survey,” arXiv preprint arXiv:2307.12966, 2023.
- [90] T. Shen, R. Jin, Y. Huang, C. Liu, W. Dong, Z. Guo, X. Wu, Y. Liu, and D. Xiong, “Large language model alignment: A survey,” arXiv preprint arXiv:2309.15025, 2023.
- [91] S. R. Balseiro, O. Besbes, and D. Pizarro, “Survey of dynamic resource-constrained reward collection problems: Unified model and analysis,” Operations Research, vol. 72, no. 5, pp. 2168–2189, 2024.
- [92] D. Tomaszuk, Ł. Szeremeta, and A. Korniłowicz, “Mmlkg: Knowledge graph for mathematical definitions, statements and proofs,” Scientific Data, vol. 10, no. 1, p. 791, 2023.
- [93] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” Advances in Neural Information Processing Systems, vol. 36, pp. 46595–46623, 2023.
- [94] D. Li, B. Jiang, L. Huang, A. Beigi, C. Zhao, Z. Tan, A. Bhattacharjee, Y. Jiang, C. Chen, T. Wu et al., “From generation to judgment: Opportunities and challenges of llm-as-a-judge,” arXiv preprint arXiv:2411.16594, 2024.
- [95] Y. Shi, X. Zi, Z. Shi, H. Zhang, Q. Wu, and M. Xu, “Enhancing retrieval and managing retrieval: A four-module synergy for improved quality and efficiency in rag systems,” arXiv preprint arXiv:2407.10670, 2024.
- [96] Y. Cao, Z. Gao, Z. Li, X. Xie, and S. K. Zhou, “Lego-graphrag: Modularizing graph-based retrieval-augmented generation for design space exploration,” arXiv preprint arXiv:2411.05844, 2024.
- [97] D. Mahan, D. Van Phung, R. Rafailov, C. Blagden, N. Lile, L. Castricato, J.-P. Fränken, C. Finn, and A. Albalak, “Generative reward models,” arXiv preprint arXiv:2410.12832, 2024.
- [98] C.-M. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang, J. Fu, and Z. Liu, “Chateval: Towards better llm-based evaluators through multi-agent debate,” arXiv preprint arXiv:2308.07201, 2023.
- [99] Y. Talebirad and A. Nadiri, “Multi-agent collaboration: Harnessing the power of intelligent llm agents,” arXiv preprint arXiv:2306.03314, 2023.
- [100] A. Zhao, D. Huang, Q. Xu, M. Lin, Y.-J. Liu, and G. Huang, “Expel: Llm agents are experiential learners,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 17, 2024, pp. 19632–19642.
- [101] X. Ning, Z. Wang, S. Li, Z. Lin, P. Yao, T. Fu, M. B. Blaschko, G. Dai, H. Yang, and Y. Wang, “Can llms learn by teaching for better reasoning? a preliminary study,” arXiv preprint arXiv:2406.14629, 2024.
- [102] E. Y. Chang, “Prompting large language models with the socratic method,” in 2023 IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC). IEEE, 2023, pp. 0351–0360.
- [103] P. Ke, B. Wen, Z. Feng, X. Liu, X. Lei, J. Cheng, S. Wang, A. Zeng, Y. Dong, H. Wang et al., “Critiquellm: Scaling llm-as-critic for effective and explainable evaluation of large language model generation,” arXiv preprint arXiv:2311.18702, 2023.
- [104] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, “Measuring mathematical problem solving with the math dataset,” NeurIPS, 2021.
- [105] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
- [106] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei et al., “Qwen2.5 technical report,” arXiv preprint arXiv:2412.15115, 2024.
- [107] E. Charniak and M. Johnson, “Coarse-to-fine n-best parsing and maxent discriminative reranking,” in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), 2005, pp. 173–180.
- [108] P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui, “Math-shepherd: Verify and reinforce llms step-by-step without human annotations,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 9426–9439.
- [109] W. Xiong, H. Zhang, N. Jiang, and T. Zhang, “An implementation of generative prm,” https://github.com/RLHFlow/RLHF-Reward-Modeling, 2024.