# Graph-Augmented Reasoning: Evolving Step-by-Step Knowledge Graph Retrieval for LLM Reasoning
**Authors**: Wenjie Wu, Yongcheng Jing, Yingjie Wang, Wenbin Hu, Dacheng Tao
> Wenjie Wu and Wenbin Hu are with Wuhan University. Yongcheng Jing, Yingjie Wang, and Dacheng Tao are with Nanyang Technological University.
## Abstract
Recent large language model (LLM) reasoning, despite its success, suffers from limited domain knowledge, susceptibility to hallucinations, and constrained reasoning depth, particularly in small-scale models deployed in resource-constrained environments. This paper presents the first investigation into integrating step-wise knowledge graph retrieval with step-wise reasoning to address these challenges, introducing a novel paradigm termed graph-augmented reasoning. Our goal is to enable frozen, small-scale LLMs to retrieve and process relevant mathematical knowledge in a step-wise manner, enhancing their problem-solving abilities without additional training. To this end, we propose KG-RAR, a framework centered on process-oriented knowledge graph construction, a hierarchical retrieval strategy, and a universal post-retrieval processing and reward model (PRP-RM) that refines retrieved information and evaluates each reasoning step. Experiments on the Math500 and GSM8K benchmarks across six models demonstrate that KG-RAR yields encouraging results, achieving a 20.73% relative improvement with Llama-3B on Math500.
Index Terms: Large Language Model, Knowledge Graph, Reasoning
## 1 Introduction
Enhancing the reasoning capabilities of large language models (LLMs) continues to be a major challenge [1, 2, 3]. Conventional methods, such as chain-of-thought (CoT) prompting [4], improve inference by encouraging step-by-step articulation [5, 6, 7, 8, 9, 10], while external tool usage and domain-specific fine-tuning further improve performance on specific tasks [11, 12, 13, 14, 15, 16]. Most recently, o1-like multi-step reasoning has emerged as a paradigm shift [17, 18, 19, 20, 21, 22], leveraging test-time compute strategies [5, 23, 24, 25, 26], exemplified by reasoning models like GPT-o1 [27] and DeepSeek-R1 [28]. These approaches, including Best-of-N [29] and Monte Carlo Tree Search [23], allocate additional computational resources during inference to dynamically refine reasoning paths [30, 31, 29, 25].
Figure 1: Illustration of the proposed step-by-step knowledge graph retrieval for o1-like reasoning, which dynamically retrieves and utilises structured sub-graphs (Sub-KGs) during reasoning. Our approach iteratively refines the reasoning process by retrieving relevant Sub-KGs at each step, enhancing accuracy, consistency, and reasoning depth for complex tasks, thereby offering a novel form of scaling test-time computation.
Despite encouraging advancements in o1-like reasoning, LLMs—particularly smaller and less powerful variants—continue to struggle with complex reasoning tasks in mathematics and science [2, 3, 32, 33]. These challenges arise from insufficient domain knowledge, susceptibility to hallucinations, and constrained reasoning depth [34, 35, 36]. Given the novelty of o1-like reasoning, effective solutions to these issues remain largely unexplored, with few studies addressing this gap in the literature [17, 18, 22]. One potential solution from the pre-o1 era is retrieval-augmented generation (RAG), which has been shown to mitigate hallucinations and factual inaccuracies by retrieving relevant information from external knowledge sources (Fig. 1, the 2nd column) [37, 38, 39]. However, in the context of o1-like multi-step reasoning, traditional RAG faces two significant challenges:
- (1) Step-wise hallucinations: LLMs may hallucinate during intermediate steps—a problem not addressed by applying RAG solely to the initial prompt [40, 41];
- (2) Missing structured relationships: Traditional RAG may retrieve information that lacks the structured relationships necessary for complex reasoning tasks, leading to inadequate augmentation that fails to capture the depth required for accurate reasoning [39, 42].
In this paper, we strive to address both challenges by introducing a novel graph-augmented multi-step reasoning scheme to enhance LLMs’ o1-like reasoning capability, as depicted in Fig. 1. Our idea is motivated by the recent success of knowledge graphs (KGs) in knowledge-based question answering and fact-checking [43, 44, 45, 46, 47, 48, 49]. Recent advances have demonstrated the effectiveness of KGs in augmenting prompts with retrieved knowledge or enabling LLMs to query KGs for factual information [50, 51]. However, little attention has been given to improving step-by-step reasoning for complex tasks with KGs [52, 53], such as mathematical reasoning, which requires iterative logical inference rather than simple knowledge retrieval.
To fill this gap, the objective of the proposed graph-augmented reasoning paradigm is to integrate structured KG retrieval into the reasoning process in a step-by-step manner, providing contextually relevant information at each reasoning step to refine reasoning paths and mitigate step-wise inaccuracies and hallucinations, thereby addressing both aforementioned challenges simultaneously. This approach operates without additional training, making it particularly well-suited for small-scale LLMs in resource-constrained environments. Moreover, it extends test-time compute by incorporating external knowledge into the reasoning context, transitioning from direct CoT to step-wise guided retrieval and reasoning.
Nevertheless, implementing this graph-augmented reasoning paradigm is accompanied by several key issues: (1) Frozen LLMs struggle to query KGs effectively [50], necessitating a dynamic integration strategy for iterative incorporation of graph-based knowledge; (2) Existing KGs primarily encode static facts rather than the procedural knowledge required for multi-step reasoning [54, 55], highlighting the need for process-oriented KGs; (3) Reward models, which are essential for validating reasoning steps [30, 29], often require costly fine-tuning and suffer from poor generalization [56], underscoring the need for a universal, training-free scoring mechanism tailored to KGs.
To address these issues, we propose KG-RAR, a step-by-step knowledge graph based retrieval-augmented reasoning framework that retrieves, refines, and reasons using structured knowledge graphs in a step-wise manner. Specifically, to enable effective KG querying, we design a hierarchical retrieval strategy in which questions and reasoning steps are progressively matched to relevant subgraphs, dynamically narrowing the search space. Also, we present a process-oriented math knowledge graph (MKG) construction method that encodes step-by-step procedural knowledge, ensuring that LLMs retrieve and apply structured reasoning sequences rather than static facts. Furthermore, we introduce the post-retrieval processing and reward model (PRP-RM)—a training-free scoring mechanism that refines retrieved knowledge before reasoning and evaluates step correctness in real time. By integrating structured retrieval with test-time computation, our approach mitigates reasoning inconsistencies, reduces hallucinations, and enhances stepwise verification—all without additional training.
In sum, our contribution is the first attempt to dynamically integrate step-by-step KG retrieval into an o1-like multi-step reasoning process. This is achieved through our proposed hierarchical retrieval, process-oriented graph construction method, and PRP-RM—a training-free scoring mechanism that ensures retrieval relevance and step correctness. Experiments on Math500 and GSM8K validate the effectiveness of our approach across six smaller models from the Llama3 and Qwen2.5 series. The best-performing model, Llama-3B on Math500, achieves a 20.73% relative improvement over CoT prompting, followed by Llama-8B on Math500 with a 15.22% relative gain and Llama-8B on GSM8K with an 8.68% improvement.
## 2 Related Work
Figure 2: Example of Step-by-Step KG-RAR’s iterative process: 1) Retrieving: For a given question or intermediate reasoning step, the KG is searched for the most similar problem or procedure (underlined in the figure), and its subgraph is extracted as the raw retrieval. 2) Refining: A frozen LLM processes the raw retrieval to generate a refined and targeted context for reasoning. 3) Reasoning: Using the refined retrieval, another LLM reflects on previous steps and generates the next intermediate reasoning step. This iterative workflow refines and guides the reasoning path to problem-solving.
### 2.1 LLM Reasoning
Large Language Models (LLMs) have advanced in structured reasoning through techniques like Chain-of-Thought (CoT) [4], Self-Consistency [5], and Tree-of-Thought [10], improving inference by generating intermediate steps rather than relying on greedy decoding [6, 7, 8, 57, 58, 9]. Recently, GPT-o1-like reasoning has emerged as a paradigm shift [17, 18, 25, 22], leveraging Test-Time Compute strategies such as Best-of-N [29], Beam Search [5], and Monte Carlo Tree Search [23], often integrated with reward models to refine reasoning paths dynamically [30, 31, 29, 59, 60, 61]. Reasoning models like DeepSeek-R1 exemplify this trend by iteratively searching, verifying, and refining solutions, significantly enhancing inference accuracy and robustness [27, 28]. However, these methods remain computationally expensive and challenging for small-scale LLMs, which struggle with hallucinations and inconsistencies due to limited reasoning capacity and lack of domain knowledge [32, 34, 3].
### 2.2 Knowledge Graphs Enhanced LLM Reasoning
Knowledge Graphs (KGs) are structured repositories of interconnected entities and relationships, offering efficient graph-based knowledge representation and retrieval [62, 63, 64, 65, 54]. Prior work integrating KGs with LLMs has primarily focused on knowledge-based reasoning tasks such as knowledge-based question answering [43, 44, 66, 67], fact-checking [45, 46, 68], and entity-centric reasoning [69, 47, 70, 48, 49]. However, in these tasks, "reasoning" is predominantly limited to identifying and retrieving static knowledge rather than performing iterative, multi-step logical computations [71, 72, 73]. In contrast, our work integrates KGs with LLMs for o1-like reasoning in domains such as mathematics, where solving problems demands dynamic, step-by-step inference rather than static knowledge retrieval.
### 2.3 Reward Models
Reward models are essential across various domains such as computer vision [74, 75]. Notably, they play a crucial role in aligning LLM outputs with human preferences by evaluating accuracy, relevance, and logical consistency [76, 77, 78]. Fine-tuned reward models, including Outcome Reward Models (ORMs) [30] and Process Reward Models (PRMs) [31, 29, 60], improve validation accuracy but come at a high training cost and often lack generalization across diverse tasks [56]. Generative reward models [61] further enhance performance by integrating CoT reasoning into reward assessments, leveraging Test-Time Compute to refine evaluation. However, the reliance on fine-tuning makes these models resource-intensive and limits adaptability [56]. This underscores the need for universal, training-free scoring mechanisms that maintain robust performance while ensuring computational efficiency across various reasoning domains.
## 3 Pre-analysis
### 3.1 Motivation and Problem Definition
Motivation. LLMs have demonstrated remarkable capabilities across various domains [79, 4, 80, 81, 82], yet their proficiency in complex reasoning tasks remains limited [3, 83, 84]. Challenges such as hallucinations [34, 85], inaccuracies, and difficulties in handling complex, multi-step reasoning due to insufficient reasoning depth are particularly evident in smaller models or resource-constrained environments [86, 87, 88]. Moreover, traditional reward models, including ORMs [30] and PRMs [31, 29], require extensive fine-tuning, incurring significant computational costs for dataset collection, GPU usage, and prolonged training time [89, 90, 91]. Despite these efforts, fine-tuned reward models often suffer from poor generalization, restricting their effectiveness across diverse reasoning tasks [56].
To simultaneously overcome these challenges, this paper introduces a novel paradigm tailored for o1-like multi-step reasoning:
Remark 3.1 (Graph-Augmented Multi-Step Reasoning). The goal of graph-augmented reasoning is to enhance the step-by-step reasoning ability of frozen LLMs by integrating external knowledge graphs (KGs), eliminating the need for additional fine-tuning.
The proposed graph-augmented scheme aims to offer the following unique advantages:
- Improving Multi-Step Reasoning: Enhances reasoning capabilities, particularly for small-scale LLMs in resource-constrained environments;
- Scaling Test-Time Compute: Introduces a novel dimension of scaling test-time compute through dynamic integration of external knowledge;
- Transferability Across Reasoning Tasks: By leveraging domain-specific KGs, the framework can be easily adapted to various reasoning tasks, enabling transferability across different domains.
### 3.2 Challenge Analysis
However, implementing the proposed graph-augmented reasoning paradigm presents several critical challenges:
- Effective Integration: How can KGs be efficiently integrated with LLMs to support step-by-step reasoning without requiring model modifications? Frozen LLMs cannot directly query KGs effectively [50]. Additionally, since LLMs may suffer from hallucinations and inaccuracies during intermediate reasoning steps [34, 85, 41], it is crucial to dynamically integrate KGs at each step rather than relying solely on static knowledge retrieved at the initial stage;
- Knowledge Graph Construction: How can we design and construct process-oriented KGs tailored for LLM-driven multi-step reasoning? Existing KGs predominantly store static knowledge rather than the procedural and logical information required for reasoning [54, 55, 92]. A well-structured KG that represents reasoning steps, dependencies, and logical flows is necessary to support iterative reasoning;
- Universal Scoring Mechanism: How can we develop a training-free reward mechanism capable of universally evaluating reasoning paths across diverse tasks without domain-specific fine-tuning? Current approaches depend on fine-tuned reward models, which are computationally expensive and lack adaptability [56]. A universal, training-free scoring mechanism leveraging frozen LLMs is essential for scalable and efficient reasoning evaluation.
To address these challenges and unlock the full potential of graph-augmented reasoning, we propose a Step-by-Step Knowledge Graph based Retrieval-Augmented Reasoning (KG-RAR) framework, accompanied by a dedicated Post-Retrieval Processing and Reward Model (PRP-RM), which will be elaborated in the following section.
## 4 Proposed Approach
### 4.1 Overview
Our objective is to integrate KGs for o1-like reasoning with frozen, small-scale LLMs in a training-free and universal manner. This is achieved by integrating a step-by-step knowledge graph based retrieval-augmented reasoning (KG-RAR) module within a structured, iterative reasoning framework. As shown in Figure 2, the iterative process comprises three core phases: retrieving, refining, and reasoning.
### 4.2 Process-Oriented Math Knowledge Graph
Figure 3: Pipeline for constructing the process-oriented math knowledge graph from process supervision datasets.
To support o1-like multi-step reasoning, we construct a Mathematical Knowledge Graph tailored for multi-step logical inference. Public process supervision datasets, such as PRM800K [29], provide structured problem-solving steps annotated with human ratings. Each sample is decomposed into the following structured components: branch, subfield, problem type, problem, procedures, errors, and related knowledge.
The Knowledge Graph is formally defined as: $G=(V,E)$ , where $V$ represents nodes—including problems, procedures, errors, and mathematical knowledge—and $E$ represents edges encoding their relationships (e.g., "derived from," "related to").
As shown in Figure 3, for a given problem $P$ with solutions $S_{1},S_{2},\dots,S_{n}$ and human ratings, the structured representation is:
$$
P\mapsto\{B_{p},F_{p},T_{p},\mathbf{r}\},
$$
$$
S_{i}^{\text{good}}\mapsto\{S_{i},K_{i},\mathbf{r}_{i}^{\text{good}}\},\quad S_{i}^{\text{bad}}\mapsto\{E_{i},K_{i},\mathbf{r}_{i}^{\text{bad}}\},
$$
where $B_{p},F_{p},T_{p}$ represent the branch, subfield, and type of $P$ , respectively. The symbols $S_{i}$ and $E_{i}$ denote the procedures derived from correct steps and the errors from incorrect steps, respectively. Additionally, $K_{i}$ contains relevant mathematical knowledge. The relationships between problems, steps, and knowledge are encoded through $\mathbf{r},\mathbf{r}_{i}^{\text{good}},\mathbf{r}_{i}^{\text{bad}}$ , which capture the edge relationships linking these elements.
To balance computational efficiency with the quality of the KG, we employ a Llama-3.1-8B-Instruct model to process approximately 10,000 unique samples from PRM800K. The LLM is prompted to output structured JSON data, which is subsequently transformed into a Neo4j-based MKG. This process yields a graph with approximately $80{,}000$ nodes and $200{,}000$ edges, optimized for efficient retrieval.
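The decomposition described above can be sketched in code. The node labels, edge names, and the toy sample below are illustrative placeholders, not the paper's actual JSON schema or Neo4j data model:

```python
from dataclasses import dataclass

@dataclass
class MKGNode:
    label: str  # e.g. "Problem", "Branch", "Subfield", "Type", "Procedure", "Error", "Knowledge"
    text: str

@dataclass
class MKGEdge:
    src: MKGNode
    dst: MKGNode
    relation: str  # e.g. "belongs_to", "derived_from", "related_to"

def decompose(sample):
    """Map one rated sample to MKG nodes and edges."""
    problem = MKGNode("Problem", sample["problem"])
    nodes, edges = [problem], []
    # Branch, subfield, and problem type form the hierarchy above the problem.
    for cat in ("branch", "subfield", "type"):
        node = MKGNode(cat.capitalize(), sample[cat])
        nodes.append(node)
        edges.append(MKGEdge(problem, node, "belongs_to"))
    # Positively rated steps become Procedure nodes, negatively rated ones
    # become Error nodes; each links to the knowledge it relies on.
    for step in sample["steps"]:
        kind = MKGNode("Procedure" if step["rating"] > 0 else "Error", step["text"])
        know = MKGNode("Knowledge", step["knowledge"])
        nodes += [kind, know]
        edges.append(MKGEdge(problem, kind, "derived_from"))
        edges.append(MKGEdge(kind, know, "related_to"))
    return nodes, edges

sample = {
    "problem": "What is the remainder when 1234 is divided by 35?",
    "branch": "Number Theory",
    "subfield": "Modular Arithmetic",
    "type": "Remainder",
    "steps": [{"text": "1234 = 35*35 + 9, so the remainder is 9.",
               "rating": 1, "knowledge": "division algorithm"}],
}
nodes, edges = decompose(sample)
```

In the actual pipeline, such node/edge lists would be serialized and bulk-loaded into Neo4j; the sketch only shows how one rated sample expands into the structured components $\{B_{p},F_{p},T_{p}\}$ and per-step procedure/error/knowledge triples.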
### 4.3 Step-by-Step Knowledge Graph Retrieval
KG-RAR for Problem Retrieval. For a given test problem $Q$ , the most relevant problem $P^{*}\in V_{p}$ and its subgraph are retrieved to assist reasoning. The retrieval pipeline comprises the following steps:
1. Initial Filtering: Classify $Q$ by $B_{q},F_{q},T_{q}$ (branch, subfield, and problem type). The candidate set $V_{Q}\subset V_{p}$ is filtered hierarchically, starting from $T_{q}$ , expanding to $F_{q}$ , and then to $B_{q}$ if no exact match is found.
2. Semantic Similarity Scoring:
$$
P^{*}=\arg\max_{P\in V_{Q}}\cos(\mathbf{e}_{Q},\mathbf{e}_{P}),
$$
where:
$$
\cos(\mathbf{e}_{Q},\mathbf{e}_{P})=\frac{\langle\mathbf{e}_{Q},\mathbf{e}_{P}\rangle}{\|\mathbf{e}_{Q}\|\,\|\mathbf{e}_{P}\|}
$$
and $\mathbf{e}_{Q},\mathbf{e}_{P}\in\mathbb{R}^{d}$ are embeddings of $Q$ and $P$ , respectively.
3. Context Retrieval: Perform Depth-First Search (DFS) on $G$ to retrieve procedures ( $S_{p}$ ), errors ( $E_{p}$ ), and knowledge ( $K_{p}$ ) connected to $P^{*}$ .
Input: Test problem $Q$ and MKG $G$
Output: Most relevant problem $P^{*}$ and its context $(S_{p},E_{p},K_{p})$
1. Filter $G$ using $B_{q}$, $F_{q}$, and $T_{q}$ to obtain $V_{Q}$;
2. foreach $P\in V_{Q}$ do compute $\text{Sim}_{\text{semantic}}(Q,P)$;
3. $P^{*}\leftarrow\arg\max_{P\in V_{Q}}\text{Sim}_{\text{semantic}}(Q,P)$;
4. Retrieve $S_{p}$, $E_{p}$, and $K_{p}$ from $P^{*}$ using DFS;
5. return $P^{*}$, $S_{p}$, $E_{p}$, and $K_{p}$;

Algorithm 1: KG-RAR for Problem Retrieval
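The problem-retrieval pipeline can be sketched as follows, assuming precomputed embeddings and a simple in-memory adjacency representation; the dict-based node schema and the `retrieve_problem` helper are illustrative assumptions, not the paper's Neo4j implementation:

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve_problem(q_emb, q_tags, problems, graph):
    """q_tags = (branch, subfield, type); each problem is a dict with
    'id', 'tags', and 'emb'; `graph` is an adjacency dict over node ids."""
    # 1) Hierarchical filtering: try problem type first, then subfield,
    #    then branch, widening only when no match exists at a level.
    cand = []
    for level in (2, 1, 0):
        cand = [p for p in problems if p["tags"][level] == q_tags[level]]
        if cand:
            break
    cand = cand or problems
    # 2) Semantic similarity scoring over the filtered candidates.
    best = max(cand, key=lambda p: cosine(q_emb, p["emb"]))
    # 3) DFS from the matched problem to collect its subgraph context
    #    (procedures, errors, and knowledge nodes).
    context, stack, seen = [], [best["id"]], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        context.append(node)
        stack.extend(graph.get(node, []))
    return best, context

problems = [
    {"id": "p1", "tags": ("NT", "Mod", "Remainder"), "emb": np.array([1.0, 0.0])},
    {"id": "p2", "tags": ("NT", "Mod", "Remainder"), "emb": np.array([0.0, 1.0])},
    {"id": "p3", "tags": ("Alg", "Eq", "Linear"), "emb": np.array([1.0, 1.0])},
]
graph = {"p1": ["s1"], "s1": ["k1"]}
best, ctx = retrieve_problem(np.array([0.9, 0.1]), ("NT", "Mod", "Remainder"),
                             problems, graph)
```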
KG-RAR for Step Retrieval. Given an intermediate reasoning step $S$ , the most relevant step $S^{*}\in G$ and its subgraph are retrieved dynamically:
1. Contextual Filtering: Restrict the search space $V_{S}$ to the subgraph induced by the previously retrieved top-$k$ similar problems $\{P_{1},P_{2},\dots,P_{k}\}\subseteq V_{Q}$ .
2. Step Similarity Scoring:
$$
S^{*}=\arg\max_{S_{i}\in V_{S}}\cos(\mathbf{e}_{S},\mathbf{e}_{S_{i}}).
$$
3. Context Retrieval: Perform Breadth-First Search (BFS) on $G$ to extract the subgraph of $S^{*}$ , including potential next steps, related knowledge, and error patterns.
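A minimal, self-contained sketch of the step-retrieval variant, assuming candidate steps already carry precomputed embeddings and the subgraph is a plain adjacency dict (both are illustrative placeholders); BFS with a hop limit stands in for the paper's subgraph extraction:

```python
import math
from collections import deque

def cosine(a, b):
    # Cosine similarity over plain Python lists.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve_step(s_emb, step_nodes, graph, depth=2):
    """step_nodes: candidate steps inside the top-k problems' subgraphs,
    each a dict with 'id' and 'emb'. BFS up to `depth` hops collects the
    context (potential next steps, knowledge, error patterns)."""
    best = max(step_nodes, key=lambda s: cosine(s_emb, s["emb"]))
    context, seen = [best["id"]], {best["id"]}
    frontier = deque([(best["id"], 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == depth:  # hop limit keeps the retrieved subgraph small
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                context.append(nxt)
                frontier.append((nxt, d + 1))
    return best, context

steps = [{"id": "s1", "emb": [1.0, 0.0]}, {"id": "s9", "emb": [0.0, 1.0]}]
graph = {"s1": ["k1", "s2"], "s2": ["k2"], "k2": ["x"]}
best, ctx = retrieve_step([0.8, 0.2], steps, graph, depth=2)
```

BFS (rather than the DFS used for problem retrieval) favors immediate neighbors of $S^{*}$, which matches the goal of surfacing likely next steps before distant knowledge.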
### 4.4 Post-Retrieval Processing and Reward Model
Step Verification and End-of-Reasoning Detection. Inspired by previous works [93, 56, 94, 61], we use a frozen LLM to evaluate both step correctness and whether reasoning should terminate. The model is queried with an instruction, producing a binary classification decision:
$$
\textit{Is this step correct (Yes/No)? }\begin{cases}\text{Yes}\xrightarrow{\text{Token Probability}}p(\text{Yes})\\ \text{No}\xrightarrow{\text{Token Probability}}p(\text{No})\\ \text{Other Tokens.}\end{cases}
$$
The corresponding confidence score for step verification or reasoning termination is computed as:
$$
Score(S,I)=\frac{\exp(p(\text{Yes}|S,I))}{\exp(p(\text{Yes}|S,I))+\exp(p(\text{No}|S,I))}.
$$
For step correctness, the instruction $I$ is "Is this step correct (Yes/No)?", while for reasoning termination, the instruction $I_{E}$ is "Has a final answer been reached (Yes/No)?".
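The scoring rule reduces to a two-way softmax over the "Yes"/"No" answer tokens. A minimal sketch, assuming `p_yes` and `p_no` are the values $p(\text{Yes}|S,I)$ and $p(\text{No}|S,I)$ returned by the frozen LLM (the model call itself is omitted):

```python
import math

def prp_rm_score(p_yes: float, p_no: float) -> float:
    """Score(S, I): softmax over the two answer tokens."""
    e_yes, e_no = math.exp(p_yes), math.exp(p_no)
    return e_yes / (e_yes + e_no)

# Equal evidence for Yes and No yields a neutral score of 0.5;
# the score rises monotonically as p_yes grows relative to p_no.
neutral = prp_rm_score(0.0, 0.0)
confident = prp_rm_score(2.0, 0.0)
```

The same function serves both uses: with instruction $I$ it scores step correctness, and with $I_{E}$ it decides whether reasoning should terminate.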
Post-Retrieval Processing. Post-retrieval processing is a crucial component of the retrieval-augmented generation (RAG) framework, ensuring that retrieved information is improved to maximize relevance while minimizing noise [37, 95, 96].
For a problem $P$ or a reasoning step $S$ :
$$
\mathcal{R}^{\prime}=\operatorname{LLM}_{\text{refine}}(P+\mathcal{R}\text{ or }S+\mathcal{R}),
$$
where $\mathcal{R}$ is the raw retrieved context, and $\mathcal{R}^{\prime}$ represents its rewritten, targeted form.
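In practice this refinement step is a single prompted call to a frozen LLM. A sketch under stated assumptions: the prompt wording and the `call_llm` callable are placeholders for whatever model and template the deployment uses, not the paper's actual prompt:

```python
# Hypothetical refinement prompt; the exact wording is an assumption.
REFINE_PROMPT = (
    "Given the current reasoning step:\n{step}\n\n"
    "and the raw retrieved context:\n{retrieval}\n\n"
    "Rewrite the retrieval into a concise, targeted hint for the next step."
)

def refine(step: str, retrieval: str, call_llm) -> str:
    """R' = LLM_refine(S + R): rewrite raw retrieval into targeted context.

    `call_llm` is any callable mapping a prompt string to the model's
    completion, so the sketch stays model-agnostic."""
    return call_llm(REFINE_PROMPT.format(step=step, retrieval=retrieval))
```

Passing the problem $P$ instead of a step $S$ gives the problem-level variant of the same call.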
Iterative Refinement and Verification. Inspired by generative reward models [61, 97], we integrate retrieval refinement as a form of CoT reasoning before scoring each step. To ensure consistency in multi-step reasoning, we employ an iterative retrieval refinement and scoring mechanism, as illustrated in Figure 4.
Figure 4: Illustration of the Post-Retrieval Processing and Reward Model (PRP-RM). Given a problem $P$ and its retrieved context $\mathcal{R}_{p}$ from the Knowledge Graph (KG), PRP-RM refines it into $\mathcal{R}^{\prime}_{p}$. The Reasoner LLM generates step $S_{1}$ based on $\mathcal{R}^{\prime}_{p}$, followed by iterative retrieval and refinement ($\mathcal{R}_{t}\to\mathcal{R}^{\prime}_{t}$) for each step $S_{t}$. Correctness is assessed using $I=$ “Is this step correct?” to compute $\operatorname{Score}(S_{t})$, while completion is checked via $I_{E}=$ “Has a final answer been reached?” to compute $\operatorname{End}(S_{t})$. The process continues until $\operatorname{End}(S_{t})$ surpasses a threshold or a predefined inference depth is reached.
Input: Current step $S$ and retrieved problems $\{P_{1},\ldots,P_{k}\}$
Output: Relevant step $S^{*}$ and its context subgraph
1: Initialize the step collection $V_{S}\leftarrow\bigcup_{i=1}^{k}\text{Steps}(P_{i})$;
2: foreach $S_{i}\in V_{S}$ do
3:  Compute the semantic similarity $\text{Sim}_{\text{semantic}}(S,S_{i})$;
4: end
5: $S^{*}\leftarrow\arg\max_{S_{i}\in V_{S}}\text{Sim}_{\text{semantic}}(S,S_{i})$;
6: Construct the context subgraph via $\text{BFS}(S^{*})$;
7: return $S^{*}$, $\text{subgraph}(S^{*})$
Algorithm 2: KG-RAR for Step Retrieval
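Algorithm 2 can be sketched in Python as follows, assuming hypothetical `similarity` and `neighbors` callables (for the semantic comparison and knowledge-graph adjacency) and an illustrative `max_hops` bound on the BFS:

```python
from collections import deque

def retrieve_step(step_vec, problems, similarity, neighbors, max_hops=2):
    """KG-RAR step retrieval (sketch of Algorithm 2).

    problems:   mapping problem -> list of step node ids
    similarity: function (step_vec, node) -> semantic similarity score
    neighbors:  function node -> adjacent nodes in the knowledge graph
    """
    # 1. Pool the steps of all retrieved problems into V_S.
    candidates = [s for steps in problems.values() for s in steps]
    # 2. Pick the most semantically similar step S*.
    best = max(candidates, key=lambda s: similarity(step_vec, s))
    # 3. BFS from S* to collect its context subgraph.
    seen, queue = {best}, deque([(best, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue
        for nxt in neighbors(node):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return best, seen
```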
For a reasoning step $S_{t}$ , the iterative refinement history is:
$$
H_{t}=\{P+\mathcal{R}_{p},\mathcal{R}^{\prime}_{p},S_{1}+\mathcal{R}_{1},\mathcal{R}^{\prime}_{1},\dots,S_{t}+\mathcal{R}_{t},\mathcal{R}^{\prime}_{t}\}.
$$
The refined retrieval context is generated recursively:
$$
\mathcal{R}^{\prime}_{t}=\operatorname{LLM}_{\text{refine}}(H_{t-1},S_{t}+\mathcal{R}_{t}).
$$
The correctness and end-of-reasoning probabilities are:
$$
\operatorname{Score}(S_{t})=\frac{\exp(p(\text{Yes}|H_{t},I))}{\exp(p(\text{Yes}|H_{t},I))+\exp(p(\text{No}|H_{t},I))},
$$
$$
\operatorname{End}(S_{t})=\frac{\exp(p(\text{Yes}|H_{t},I_{E}))}{\exp(p(\text{Yes}|H_{t},I_{E}))+\exp(p(\text{No}|H_{t},I_{E}))}.
$$
This process iterates until $\operatorname{End}(S_{t})>\theta$ , signaling completion.
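Putting the pieces together, the iterative retrieve-refine-score loop might look like the following sketch, where all five callables are hypothetical stand-ins for the KG retriever, the PRP-RM, and the Reasoner LLM (the default `max_depth=8` mirrors the step depth used in the experiments):

```python
def reason_with_kg(problem, retrieve, refine, generate_step, score, end_prob,
                   theta=0.9, max_depth=8):
    """Iterative retrieve-refine-reason loop (sketch).

    retrieve(text)      -> raw context R_t from the knowledge graph
    refine(history, x)  -> rewritten context R'_t
    generate_step(h)    -> next reasoning step S_t
    score(h)            -> Score(S_t), step-correctness confidence
    end_prob(h)         -> End(S_t), probability reasoning is complete
    """
    history = [problem, retrieve(problem)]        # P + R_p
    history.append(refine(history, history[-1]))  # R'_p
    steps = []
    for _ in range(max_depth):
        step = generate_step(history)
        ctx = retrieve(step)
        history += [step + ctx, refine(history, step + ctx)]  # S_t+R_t, R'_t
        steps.append((step, score(history)))
        if end_prob(history) > theta:  # termination check End(S_t) > theta
            break
    return steps
```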
TABLE I: Performance evaluation across difficulty levels of the Math500 dataset using various models and methods. Relative improvements per level: Level 1 +9.09%, Level 2 +5.38%, Level 3 +8.90%, Level 4 +7.61%, Level 5 +16.43%, Overall +8.95%.
| Model | Method | Level 1 Maj | Level 1 Last | Level 2 Maj | Level 2 Last | Level 3 Maj | Level 3 Last | Level 4 Maj | Level 4 Last | Level 5 Maj | Level 5 Last | Overall Maj | Overall Last |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Llama-3.1-8B (+15.22%) | CoT-prompting | 80.6 | 80.6 | 74.1 | 74.1 | 59.4 | 59.4 | 46.4 | 46.4 | 27.4 | 27.1 | 51.9 | 51.9 |
| | Step-by-Step KG-RAR | 88.4 | 81.4 | 83.3 | 82.2 | 70.5 | 69.5 | 53.9 | 53.9 | 32.1 | 25.4 | 59.8 | 57.0 |
| Llama-3.2-3B (+20.73%) | CoT-prompting | 63.6 | 65.1 | 61.9 | 61.9 | 51.1 | 51.1 | 43.2 | 43.2 | 20.4 | 20.4 | 43.9 | 44.0 |
| | Step-by-Step KG-RAR | 83.7 | 79.1 | 68.9 | 68.9 | 61.0 | 52.4 | 49.2 | 47.7 | 29.9 | 28.4 | 53.0 | 50.0 |
| Llama-3.2-1B (-4.02%) | CoT-prompting | 64.3 | 64.3 | 52.6 | 52.2 | 41.6 | 41.6 | 25.3 | 25.3 | 8.0 | 8.2 | 32.3 | 32.3 |
| | Step-by-Step KG-RAR | 72.1 | 72.1 | 50.0 | 50.0 | 40.0 | 40.0 | 18.0 | 19.5 | 10.4 | 13.4 | 31.0 | 32.2 |
| Qwen2.5-7B (+2.91%) | CoT-prompting | 95.3 | 95.3 | 88.9 | 88.9 | 86.7 | 86.3 | 77.3 | 77.1 | 50.0 | 49.8 | 75.6 | 75.4 |
| | Step-by-Step KG-RAR | 95.3 | 93.0 | 90.0 | 90.0 | 87.6 | 88.6 | 79.7 | 79.7 | 54.5 | 56.7 | 77.8 | 78.4 |
| Qwen2.5-3B (+3.13%) | CoT-prompting | 93.0 | 93.0 | 85.2 | 85.2 | 81.0 | 80.6 | 62.5 | 62.5 | 40.1 | 39.3 | 67.1 | 66.8 |
| | Step-by-Step KG-RAR | 95.3 | 95.3 | 84.4 | 85.6 | 83.8 | 77.1 | 64.1 | 64.1 | 44.0 | 38.1 | 69.2 | 66.4 |
| Qwen2.5-1.5B (-5.12%) | CoT-prompting | 88.4 | 88.4 | 78.5 | 77.4 | 71.4 | 68.9 | 49.2 | 49.5 | 34.6 | 34.3 | 58.6 | 57.9 |
| | Step-by-Step KG-RAR | 97.7 | 93.0 | 78.9 | 75.6 | 66.7 | 67.6 | 48.4 | 44.5 | 24.6 | 23.1 | 55.6 | 53.4 |
Role-Based System Prompting. Inspired by agent-based reasoning frameworks [98, 99, 100], we introduce role-based system prompting to further optimize our PRP-RM. In this approach, we define three distinct personas to enhance the reasoning process. The Responsible Teacher [101] processes retrieved knowledge into structured guidance and evaluates the correctness of each step. The Socratic Teacher [102], rather than providing direct guidance, reformulates the retrieved content into heuristic questions that encourage self-reflection. Finally, the Critical Teacher [103] acts as a critical evaluator, diagnosing reasoning errors before assigning a score. Each role focuses on a different aspect of post-retrieval processing, improving robustness and interpretability.
## 5 Experiments
In this section, we evaluate the effectiveness of our proposed Step-by-Step KG-RAR and PRP-RM methods by comparing them with Chain-of-Thought (CoT) prompting [4] and fine-tuned reward models [30, 29]. Additionally, we perform ablation studies to examine the impact of individual components.
### 5.1 Experimental Setup
Following prior works [5, 29, 21], we evaluate on Math500 [104] and GSM8K [30], using Accuracy (%) as the primary metric. Experiments focus on instruction-tuned Llama3 [105] and Qwen2.5 [106], with Best-of-N [107, 30, 29] search ( $n=8$ for Math500, $n=4$ for GSM8K). We employ Majority Vote [5] for self-consistency and Last Vote [25] for benchmarking reward models. To evaluate PRP-RM, we compare against fine-tuned reward models: Math-Shepherd-PRM-7B [108], RLHFlow-ORM-Deepseek-8B and RLHFlow-PRM-Deepseek-8B [109]. For Step-by-Step KG-RAR, we set step depth to $8$ and padding to $4$ . The Socratic Teacher role is used in PRP-RM to minimize direct solving. Both the Reasoner LLM and PRP-RM remain consistent for fair comparison.
Figure 5: Comparison of reward models under Last@8.
### 5.2 Comparative Experimental Results
TABLE II: Evaluation results on the GSM8K dataset.
| Model | Method | Maj@4 | Last@4 |
| --- | --- | --- | --- |
| Llama-3.1-8B (+8.68%) | CoT-prompting | 81.8 | 82.0 |
| | Step-by-Step KG-RAR | 88.9 | 88.0 |
| Qwen-2.5-7B (+1.09%) | CoT-prompting | 91.6 | 91.1 |
| | Step-by-Step KG-RAR | 92.6 | 93.1 |
Table I shows that Step-by-Step KG-RAR consistently outperforms CoT-prompting across all difficulty levels on Math500, with more pronounced improvements in the Llama3 series than in Qwen2.5, likely because Qwen2.5's higher baseline accuracy leaves less room for improvement. Performance declines for smaller models such as Qwen-1.5B and Llama-1B on harder problems due to increased reasoning inconsistencies. Among models showing improvements, Step-by-Step KG-RAR achieves an average relative accuracy gain of 8.95% on Math500 under Maj@8, while Llama-3.1-8B attains an 8.68% improvement on GSM8K under Maj@4 (Table II). Additionally, PRP-RM achieves performance comparable to ORM and PRM baselines. Figure 5 confirms its effectiveness with Llama-3B on Math500, highlighting its viability as a training-free alternative.
Figure 6: Comparison of raw and processed retrieval.
Figure 7: Comparison of various RAG types.
### 5.3 Ablation Studies
Effectiveness of Post-Retrieval Processing (PRP). We compare reasoning with refined retrieval from PRP-RM against raw retrieval taken directly from the knowledge graph. Figure 6 shows that refining the retrieval context significantly improves performance (Llama-3B, Math500 Level 3).
Effectiveness of Knowledge Graphs (KGs). KG-RAR outperforms both no RAG and unstructured RAG (PRM800K) baselines, demonstrating the advantage of structured retrieval (Figure 7, Qwen-0.5B Reasoner, Qwen-3B PRP-RM, Math500).
Figure 8: Comparison of step padding settings.
Figure 9: Comparison of PRP-RM roles.
Effectiveness of Step-by-Step RAG. We evaluate step padding values of 1, 4, and 1000. Small padding causes inconsistencies across steps, while large padding hinders refinement. Figure 8 illustrates this trade-off (Llama-1B, Math500 Level 3).
Figure 10: Scaling PRP-RM sizes.
Figure 11: Scaling reasoner LLM sizes.
Comparison of PRP-RM Roles. Socratic Teacher minimizes direct problem-solving but sometimes introduces extraneous questions. Figure 9 shows Critical Teacher performs best among three roles (Llama-3B, Math500).
Scaling Model Size. Scaling trends in Section 5.2 are validated on Math500. Figures 10 and 11 confirm performance gains as the Reasoner LLM and PRP-RM are scaled independently.
Scaling the Number of Solutions. We vary the number of generated solutions using Llama-3B on Math500 Level 3. Figure 12 shows that accuracy improves incrementally with more solutions, underscoring the benefit of multiple candidates.
Figure 12: Scaling of the number of solutions.
Figure 13: Comparison of various voting methods.
Comparison of Voting Methods. We evaluate five PRP-RM voting strategies: Majority Vote, Last Vote, Min Vote, Min-Max, and Last-Max [23, 21]. Majority Vote and Last Vote outperform the others, as extreme-value methods are prone to PRP-RM overconfidence in incorrect solutions (Figure 13).
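As an illustration, the five strategies can be sketched over candidate solutions scored step by step; the exact aggregation rules below are our reading of the voting names, not the paper's released code:

```python
from collections import Counter, defaultdict

def vote(solutions, method="maj"):
    """Aggregate Best-of-N candidates (sketch).

    solutions: list of (answer, step_scores) pairs from the PRP-RM.
    """
    if method == "maj":  # unweighted majority over final answers
        return Counter(a for a, _ in solutions).most_common(1)[0][0]
    weight = {"last": lambda s: s[-1], "min": min}
    if method in weight:  # answer-level vote weighted by a reward summary
        totals = defaultdict(float)
        for ans, scores in solutions:
            totals[ans] += weight[method](scores)
        return max(totals, key=totals.get)
    # solution-level: keep the single candidate with the best summary score
    picker = {"last-max": lambda s: s[-1], "min-max": min}
    return max(solutions, key=lambda s: picker[method](s[1]))[0]
```

Under this reading, extreme-based methods (`last-max`, `min-max`) commit to a single candidate, so one overconfident score on a wrong chain can flip the outcome, whereas `maj` and `last` average that risk across candidates.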
## 6 Conclusions and Limitations
In this paper, we introduce a novel graph-augmented reasoning paradigm that enhances the o1-like multi-step reasoning capabilities of frozen LLMs by integrating external KGs. To this end, we present step-by-step knowledge-graph-based retrieval-augmented reasoning (KG-RAR), a novel iterative retrieve-refine-reason framework that strengthens o1-like reasoning, facilitated by an innovative post-retrieval processing and reward model (PRP-RM) that refines raw retrievals and assigns step-wise scores to guide reasoning more effectively. Experimental results demonstrate an 8.95% average relative improvement over CoT-prompting on Math500, with PRP-RM achieving competitive performance against fine-tuned reward models without their heavy training or fine-tuning costs.
Despite these merits, the proposed approach has limitations, including higher computational overhead and cases where KG-RAR may introduce unnecessary noise or fail to enhance reasoning. Our future work will focus on optimizing the framework by incorporating active learning to dynamically update KGs, improving retrieval efficiency, and exploring broader applications in complex reasoning domains such as scientific discovery and real-world decision-making.
## References
- [1] J. Huang and K. C.-C. Chang, “Towards reasoning in large language models: A survey,” arXiv preprint arXiv:2212.10403, 2022.
- [2] J. Sun, C. Zheng, E. Xie, Z. Liu, R. Chu, J. Qiu, J. Xu, M. Ding, H. Li, M. Geng et al., “A survey of reasoning with foundation models,” arXiv preprint arXiv:2312.11562, 2023.
- [3] A. Plaat, A. Wong, S. Verberne, J. Broekens, N. van Stein, and T. Back, “Reasoning with large language models, a survey,” arXiv preprint arXiv:2407.11511, 2024.
- [4] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022.
- [5] X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” arXiv preprint arXiv:2203.11171, 2022.
- [6] S. Zhou, U. Alon, F. F. Xu, Z. Jiang, and G. Neubig, “Docprompting: Generating code by retrieving the docs,” in The Eleventh International Conference on Learning Representations, 2022.
- [7] T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” Advances in neural information processing systems, vol. 35, pp. 22199–22213, 2022.
- [8] A. Creswell, M. Shanahan, and I. Higgins, “Selection-inference: Exploiting large language models for interpretable logical reasoning,” arXiv preprint arXiv:2205.09712, 2022.
- [9] N. Shinn, B. Labash, and A. Gopinath, “Reflexion: an autonomous agent with dynamic memory and self-reflection,” arXiv preprint arXiv:2303.11366, 2023.
- [10] S. Yao, D. Yu, J. Zhao, I. Shafran, T. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” Advances in Neural Information Processing Systems, vol. 36, 2024.
- [11] W. Chen, X. Ma, X. Wang, and W. W. Cohen, “Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks,” arXiv preprint arXiv:2211.12588, 2022.
- [12] R. Yamauchi, S. Sonoda, A. Sannai, and W. Kumagai, “Lpml: Llm-prompting markup language for mathematical reasoning,” arXiv preprint arXiv:2309.13078, 2023.
- [13] Y. Zhuang, Y. Yu, K. Wang, H. Sun, and C. Zhang, “Toolqa: A dataset for llm question answering with external tools,” Advances in Neural Information Processing Systems, vol. 36, pp. 50117–50143, 2023.
- [14] B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remez et al., “Code llama: Open foundation models for code,” arXiv preprint arXiv:2308.12950, 2023.
- [15] L. Yu, W. Jiang, H. Shi, J. Yu, Z. Liu, Y. Zhang, J. T. Kwok, Z. Li, A. Weller, and W. Liu, “Metamath: Bootstrap your own mathematical questions for large language models,” arXiv preprint arXiv:2309.12284, 2023.
- [16] Y. Wu, F. Jia, S. Zhang, H. Li, E. Zhu, Y. Wang, Y. T. Lee, R. Peng, Q. Wu, and C. Wang, “Mathchat: Converse to tackle challenging math problems with llm agents,” in ICLR 2024 Workshop on Large Language Model (LLM) Agents, 2024.
- [17] S. Wu, Z. Peng, X. Du, T. Zheng, M. Liu, J. Wu, J. Ma, Y. Li, J. Yang, W. Zhou et al., “A comparative study on reasoning patterns of openai’s o1 model,” arXiv preprint arXiv:2410.13639, 2024.
- [18] D. Zhang, J. Wu, J. Lei, T. Che, J. Li, T. Xie, X. Huang, S. Zhang, M. Pavone, Y. Li et al., “Llama-berry: Pairwise optimization for o1-like olympiad-level mathematical reasoning,” arXiv preprint arXiv:2410.02884, 2024.
- [19] Y. Zhang, S. Wu, Y. Yang, J. Shu, J. Xiao, C. Kong, and J. Sang, “o1-coder: an o1 replication for coding,” arXiv preprint arXiv:2412.00154, 2024.
- [20] J. Wang, F. Meng, Y. Liang, and J. Zhou, “Drt-o1: Optimized deep reasoning translation via long chain-of-thought,” arXiv preprint arXiv:2412.17498, 2024.
- [21] J. Wang, M. Fang, Z. Wan, M. Wen, J. Zhu, A. Liu, Z. Gong, Y. Song, L. Chen, L. M. Ni et al., “Openr: An open source framework for advanced reasoning with large language models,” arXiv preprint arXiv:2410.09671, 2024.
- [22] H. Luo, L. Shen, H. He, Y. Wang, S. Liu, W. Li, N. Tan, X. Cao, and D. Tao, “O1-pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning,” arXiv preprint arXiv:2501.12570, 2025.
- [23] X. Feng, Z. Wan, M. Wen, Y. Wen, W. Zhang, and J. Wang, “Alphazero-like tree-search can guide large language model decoding and training,” in International Conference on Machine Learning (ICML), 2024.
- [24] S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Z. Hu, “Reasoning with language model is planning with world model,” arXiv preprint arXiv:2305.14992, 2023.
- [25] C. Snell, J. Lee, K. Xu, and A. Kumar, “Scaling llm test-time compute optimally can be more effective than scaling model parameters,” arXiv preprint arXiv:2408.03314, 2024.
- [26] Y. Chen, X. Pan, Y. Li, B. Ding, and J. Zhou, “A simple and provable scaling law for the test-time compute of large language models,” arXiv preprint arXiv:2411.19477, 2024.
- [27] OpenAI, “Learning to reason with llms,” https://openai.com/index/learning-to-reason-with-llms/, 2024.
- [28] DeepSeek-AI, D. Guo, D. Yang et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” 2025. [Online]. Available: https://arxiv.org/abs/2501.12948
- [29] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe, “Let’s verify step by step,” arXiv preprint arXiv:2305.20050, 2023.
- [30] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano et al., “Training verifiers to solve math word problems,” arXiv preprint arXiv:2110.14168, 2021.
- [31] J. Uesato, N. Kushman, R. Kumar, F. Song, N. Siegel, L. Wang, A. Creswell, G. Irving, and I. Higgins, “Solving math word problems with process-and outcome-based feedback,” arXiv preprint arXiv:2211.14275, 2022.
- [32] A. Satpute, N. Gießing, A. Greiner-Petter, M. Schubotz, O. Teschke, A. Aizawa, and B. Gipp, “Can llms master math? investigating large language models on math stack exchange,” in Proceedings of the 47th international ACM SIGIR conference on research and development in information retrieval, 2024, pp. 2316–2320.
- [33] I. Mirzadeh, K. Alizadeh, H. Shahrokhi, O. Tuzel, S. Bengio, and M. Farajtabar, “Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models,” arXiv preprint arXiv:2410.05229, 2024.
- [34] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin et al., “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,” arXiv preprint arXiv:2311.05232, 2023.
- [35] S. Banerjee, A. Agarwal, and S. Singla, “Llms will always hallucinate, and we need to live with this,” arXiv preprint arXiv:2409.05746, 2024.
- [36] Z. Xu, S. Jain, and M. Kankanhalli, “Hallucination is inevitable: An innate limitation of large language models,” arXiv preprint arXiv:2401.11817, 2024.
- [37] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, and H. Wang, “Retrieval-augmented generation for large language models: A survey,” arXiv preprint arXiv:2312.10997, 2023.
- [38] W. Fan, Y. Ding, L. Ning, S. Wang, H. Li, D. Yin, T.-S. Chua, and Q. Li, “A survey on rag meeting llms: Towards retrieval-augmented large language models,” in Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024, pp. 6491–6501.
- [39] B. Peng, Y. Zhu, Y. Liu, X. Bo, H. Shi, C. Hong, Y. Zhang, and S. Tang, “Graph retrieval-augmented generation: A survey,” arXiv preprint arXiv:2408.08921, 2024.
- [40] S. Barnett, S. Kurniawan, S. Thudumu, Z. Brannelly, and M. Abdelrazek, “Seven failure points when engineering a retrieval augmented generation system,” in Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering-Software Engineering for AI, 2024, pp. 194–199.
- [41] Z. Wang, A. Liu, H. Lin, J. Li, X. Ma, and Y. Liang, “Rat: Retrieval augmented thoughts elicit context-aware reasoning in long-horizon generation,” arXiv preprint arXiv:2403.05313, 2024.
- [42] Y. Hu, Z. Lei, Z. Zhang, B. Pan, C. Ling, and L. Zhao, “Grag: Graph retrieval-augmented generation,” arXiv preprint arXiv:2405.16506, 2024.
- [43] X. Li, R. Zhao, Y. K. Chia, B. Ding, S. Joty, S. Poria, and L. Bing, “Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources,” arXiv preprint arXiv:2305.13269, 2023.
- [44] X. He, Y. Tian, Y. Sun, N. V. Chawla, T. Laurent, Y. LeCun, X. Bresson, and B. Hooi, “G-retriever: Retrieval-augmented generation for textual graph understanding and question answering,” arXiv preprint arXiv:2402.07630, 2024.
- [45] R.-C. Chang and J. Zhang, “Communitykg-rag: Leveraging community structures in knowledge graphs for advanced retrieval-augmented generation in fact-checking,” arXiv preprint arXiv:2408.08535, 2024.
- [46] Y. Mu, P. Niu, K. Bontcheva, and N. Aletras, “Predicting and analyzing the popularity of false rumors in weibo,” Expert Systems with Applications, vol. 243, p. 122791, 2024.
- [47] L. Luo, Y.-F. Li, G. Haffari, and S. Pan, “Reasoning on graphs: Faithful and interpretable large language model reasoning,” arXiv preprint arXiv:2310.01061, 2023.
- [48] J. Sun, C. Xu, L. Tang, S. Wang, C. Lin, Y. Gong, H.-Y. Shum, and J. Guo, “Think-on-graph: Deep and responsible reasoning of large language model with knowledge graph,” arXiv preprint arXiv:2307.07697, 2023.
- [49] H. Liu, S. Wang, Y. Zhu, Y. Dong, and J. Li, “Knowledge graph-enhanced large language models via path selection,” arXiv preprint arXiv:2406.13862, 2024.
- [50] L. Luo, Z. Zhao, C. Gong, G. Haffari, and S. Pan, “Graph-constrained reasoning: Faithful reasoning on knowledge graphs with large language models,” arXiv preprint arXiv:2410.13080, 2024.
- [51] Q. Zhang, J. Dong, H. Chen, D. Zha, Z. Yu, and X. Huang, “Knowgpt: Knowledge graph based prompting for large language models,” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
- [52] N. Choudhary and C. K. Reddy, “Complex logical reasoning over knowledge graphs using large language models,” arXiv preprint arXiv:2305.01157, 2023.
- [53] Z. Zhao, Y. Rong, D. Guo, E. Gözlüklü, E. Gülboy, and E. Kasneci, “Stepwise self-consistent mathematical reasoning with large language models,” arXiv preprint arXiv:2402.17786, 2024.
- [54] S. Ji, S. Pan, E. Cambria, P. Marttinen, and S. Y. Philip, “A survey on knowledge graphs: Representation, acquisition, and applications,” IEEE transactions on neural networks and learning systems, vol. 33, no. 2, pp. 494–514, 2021.
- [55] J. Wang, “Math-kg: Construction and applications of mathematical knowledge graph,” arXiv preprint arXiv:2205.03772, 2022.
- [56] C. Zheng, Z. Zhang, B. Zhang, R. Lin, K. Lu, B. Yu, D. Liu, J. Zhou, and J. Lin, “Processbench: Identifying process errors in mathematical reasoning,” arXiv preprint arXiv:2412.06559, 2024.
- [57] M. Besta, N. Blach, A. Kubicek, R. Gerstenberger, L. Gianinazzi, J. Gajda, T. Lehmann, M. Podstawski, H. Niewiadomski, P. Nyczyk et al., “Graph of thoughts: Solving elaborate problems with large language models,” arXiv preprint arXiv:2308.09687, 2023.
- [58] E. Zelikman, Y. Wu, J. Mu, and N. Goodman, “Star: Bootstrapping reasoning with reasoning,” Advances in Neural Information Processing Systems, vol. 35, pp. 15476–15488, 2022.
- [59] Y. Li, Z. Lin, S. Zhang, Q. Fu, B. Chen, J.-G. Lou, and W. Chen, “Making large language models better reasoners with step-aware verifier,” arXiv preprint arXiv:2206.02336, 2022.
- [60] F. Yu, A. Gao, and B. Wang, “Ovm, outcome-supervised value models for planning in mathematical reasoning,” in Findings of the Association for Computational Linguistics: NAACL 2024, 2024, pp. 858–875.
- [61] L. Zhang, A. Hosseini, H. Bansal, M. Kazemi, A. Kumar, and R. Agarwal, “Generative verifiers: Reward modeling as next-token prediction,” arXiv preprint arXiv:2408.15240, 2024.
- [62] H. Paulheim, “Knowledge graph refinement: A survey of approaches and evaluation methods,” Semantic web, vol. 8, no. 3, pp. 489–508, 2016.
- [63] Q. Wang, Z. Mao, B. Wang, and L. Guo, “Knowledge graph embedding: A survey of approaches and applications,” IEEE transactions on knowledge and data engineering, vol. 29, no. 12, pp. 2724–2743, 2017.
- [64] Y. Jing, Y. Yang, X. Wang, M. Song, and D. Tao, “Meta-aggregator: Learning to aggregate for 1-bit graph neural networks,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 5301–5310.
- [65] Y. Jing, C. Yuan, L. Ju, Y. Yang, X. Wang, and D. Tao, “Deep graph reprogramming,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 24345–24354.
- [66] D. Sanmartin, “Kg-rag: Bridging the gap between knowledge and creativity,” arXiv preprint arXiv:2405.12035, 2024.
- [67] Y. Wang, N. Lipka, R. A. Rossi, A. Siu, R. Zhang, and T. Derr, “Knowledge graph prompting for multi-document question answering,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 17, 2024, pp. 19206–19214.
- [68] A. Kau, X. He, A. Nambissan, A. Astudillo, H. Yin, and A. Aryani, “Combining knowledge graphs and large language models,” arXiv preprint arXiv:2407.06564, 2024.
- [69] J. Jiang, K. Zhou, Z. Dong, K. Ye, W. X. Zhao, and J.-R. Wen, “Structgpt: A general framework for large language model to reason over structured data,” arXiv preprint arXiv:2305.09645, 2023.
- [70] Z. Chai, T. Zhang, L. Wu, K. Han, X. Hu, X. Huang, and Y. Yang, “Graphllm: Boosting graph reasoning ability of large language model,” arXiv preprint arXiv:2310.05845, 2023.
- [71] Y. Zhu, X. Wang, J. Chen, S. Qiao, Y. Ou, Y. Yao, S. Deng, H. Chen, and N. Zhang, “Llms for knowledge graph construction and reasoning: Recent capabilities and future opportunities,” World Wide Web, vol. 27, no. 5, p. 58, 2024.
- [72] S. Pan, L. Luo, Y. Wang, C. Chen, J. Wang, and X. Wu, “Unifying large language models and knowledge graphs: A roadmap,” IEEE Transactions on Knowledge and Data Engineering, 2024.
- [73] G. Agrawal, T. Kumarage, Z. Alghamdi, and H. Liu, “Can knowledge graphs reduce hallucinations in llms?: A survey,” arXiv preprint arXiv:2311.07914, 2023.
- [74] A. S. Pinto, A. Kolesnikov, Y. Shi, L. Beyer, and X. Zhai, “Tuning computer vision models with task rewards,” in International Conference on Machine Learning. PMLR, 2023, pp. 33229–33239.
- [75] Y. Jing, Y. Yang, X. Wang, M. Song, and D. Tao, “Turning frequency to resolution: Video super-resolution via event cameras,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7772–7781.
- [76] M. Kwon, S. M. Xie, K. Bullard, and D. Sadigh, “Reward design with language models,” arXiv preprint arXiv:2303.00001, 2023.
- [77] Y. Cao, H. Zhao, Y. Cheng, T. Shu, Y. Chen, G. Liu, G. Liang, J. Zhao, J. Yan, and Y. Li, “Survey on large language model-enhanced reinforcement learning: Concept, taxonomy, and methods,” IEEE Transactions on Neural Networks and Learning Systems, 2024.
- [78] Z. Wang, B. Bi, S. K. Pentyala, K. Ramnath, S. Chaudhuri, S. Mehrotra, X.-B. Mao, S. Asur et al., “A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more,” arXiv preprint arXiv:2407.16216, 2024.
- [79] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
- [80] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” arXiv preprint arXiv:2210.03629, 2022.
- [81] D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. V. Le, and E. H. Chi, “Least-to-most prompting enables complex reasoning in large language models,” in The Eleventh International Conference on Learning Representations, ICLR 2023, 2023.
- [82] X. Wang and D. Zhou, “Chain-of-thought reasoning without prompting,” arXiv preprint arXiv:2402.10200, 2024.
- [83] Y. Zhang, S. Mao, T. Ge, X. Wang, A. de Wynter, Y. Xia, W. Wu, T. Song, M. Lan, and F. Wei, “Llm as a mastermind: A survey of strategic reasoning with large language models,” arXiv preprint arXiv:2404.01230, 2024.
- [84] F. Yu, H. Zhang, P. Tiwari, and B. Wang, “Natural language reasoning, a survey,” ACM Computing Surveys, vol. 56, no. 12, pp. 1–39, 2024.
- [85] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin et al., “A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions,” ACM Transactions on Information Systems, 2024.
- [86] Z. Chu, J. Chen, Q. Chen, W. Yu, T. He, H. Wang, W. Peng, M. Liu, B. Qin, and T. Liu, “A survey of chain of thought reasoning: Advances, frontiers and future,” arXiv preprint arXiv:2309.15402, 2023.
- [87] J. Ahn, R. Verma, R. Lou, D. Liu, R. Zhang, and W. Yin, “Large language models for mathematical reasoning: Progresses and challenges,” arXiv preprint arXiv:2402.00157, 2024.
- [88] Z. Li, Y. Cao, X. Xu, J. Jiang, X. Liu, Y. S. Teo, S.-W. Lin, and Y. Liu, “Llms for relational reasoning: How far are we?” in Proceedings of the 1st International Workshop on Large Language Models for Code, 2024, pp. 119–126.
- [89] Y. Wang, W. Zhong, L. Li, F. Mi, X. Zeng, W. Huang, L. Shang, X. Jiang, and Q. Liu, “Aligning large language models with human: A survey,” arXiv preprint arXiv:2307.12966, 2023.
- [90] T. Shen, R. Jin, Y. Huang, C. Liu, W. Dong, Z. Guo, X. Wu, Y. Liu, and D. Xiong, “Large language model alignment: A survey,” arXiv preprint arXiv:2309.15025, 2023.
- [91] S. R. Balseiro, O. Besbes, and D. Pizarro, “Survey of dynamic resource-constrained reward collection problems: Unified model and analysis,” Operations Research, vol. 72, no. 5, pp. 2168–2189, 2024.
- [92] D. Tomaszuk, Ł. Szeremeta, and A. Korniłowicz, “Mmlkg: Knowledge graph for mathematical definitions, statements and proofs,” Scientific Data, vol. 10, no. 1, p. 791, 2023.
- [93] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing et al., “Judging llm-as-a-judge with mt-bench and chatbot arena,” Advances in Neural Information Processing Systems, vol. 36, pp. 46595–46623, 2023.
- [94] D. Li, B. Jiang, L. Huang, A. Beigi, C. Zhao, Z. Tan, A. Bhattacharjee, Y. Jiang, C. Chen, T. Wu et al., “From generation to judgment: Opportunities and challenges of llm-as-a-judge,” arXiv preprint arXiv:2411.16594, 2024.
- [95] Y. Shi, X. Zi, Z. Shi, H. Zhang, Q. Wu, and M. Xu, “Enhancing retrieval and managing retrieval: A four-module synergy for improved quality and efficiency in rag systems,” arXiv preprint arXiv:2407.10670, 2024.
- [96] Y. Cao, Z. Gao, Z. Li, X. Xie, and S. K. Zhou, “Lego-graphrag: Modularizing graph-based retrieval-augmented generation for design space exploration,” arXiv preprint arXiv:2411.05844, 2024.
- [97] D. Mahan, D. Van Phung, R. Rafailov, C. Blagden, N. Lile, L. Castricato, J.-P. Fränken, C. Finn, and A. Albalak, “Generative reward models,” arXiv preprint arXiv:2410.12832, 2024.
- [98] C.-M. Chan, W. Chen, Y. Su, J. Yu, W. Xue, S. Zhang, J. Fu, and Z. Liu, “Chateval: Towards better llm-based evaluators through multi-agent debate,” arXiv preprint arXiv:2308.07201, 2023.
- [99] Y. Talebirad and A. Nadiri, “Multi-agent collaboration: Harnessing the power of intelligent llm agents,” arXiv preprint arXiv:2306.03314, 2023.
- [100] A. Zhao, D. Huang, Q. Xu, M. Lin, Y.-J. Liu, and G. Huang, “Expel: Llm agents are experiential learners,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 17, 2024, pp. 19632–19642.
- [101] X. Ning, Z. Wang, S. Li, Z. Lin, P. Yao, T. Fu, M. B. Blaschko, G. Dai, H. Yang, and Y. Wang, “Can llms learn by teaching for better reasoning? a preliminary study,” arXiv preprint arXiv:2406.14629, 2024.
- [102] E. Y. Chang, “Prompting large language models with the socratic method,” in 2023 IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC). IEEE, 2023, pp. 0351–0360.
- [103] P. Ke, B. Wen, Z. Feng, X. Liu, X. Lei, J. Cheng, S. Wang, A. Zeng, Y. Dong, H. Wang et al., “Critiquellm: Scaling llm-as-critic for effective and explainable evaluation of large language model generation,” arXiv preprint arXiv:2311.18702, 2023.
- [104] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt, “Measuring mathematical problem solving with the math dataset,” NeurIPS, 2021.
- [105] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024.
- [106] A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei et al., “Qwen2.5 technical report,” arXiv preprint arXiv:2412.15115, 2024.
- [107] E. Charniak and M. Johnson, “Coarse-to-fine n-best parsing and maxent discriminative reranking,” in Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), 2005, pp. 173–180.
- [108] P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui, “Math-shepherd: Verify and reinforce llms step-by-step without human annotations,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 9426–9439.
- [109] W. Xiong, H. Zhang, N. Jiang, and T. Zhang, “An implementation of generative prm,” https://github.com/RLHFlow/RLHF-Reward-Modeling, 2024.