# Follow the Path: Reasoning over Knowledge Graph Paths to Improve LLM Factuality
Abstract
We introduce fs1, a simple yet effective method that improves the factuality of reasoning traces by sourcing them from large reasoning models (e.g., DeepSeek-R1) and grounding them by conditioning on knowledge graph (KG) paths. We fine-tune eight instruction-tuned Large Language Models (LLMs) on 3.9K factually grounded reasoning traces and rigorously evaluate them on six complex open-domain question-answering (QA) benchmarks encompassing 23.9K questions. Our results demonstrate that our fs1-tuned model (32B parameters) consistently outperforms instruction-tuned counterparts with parallel sampling by 6-14 absolute points (pass@$16$). Our detailed analysis shows that fs1 considerably improves model performance over more complex questions (requiring 3 or more hops on KG paths) and numerical answer types compared to the baselines. Furthermore, in single-pass inference, we observe that smaller LLMs show the largest improvements. While prior works demonstrate the effectiveness of reasoning traces primarily in the STEM domains, our work shows strong evidence that anchoring reasoning to factual KG paths is a critical step in transforming LLMs for reliable knowledge-intensive tasks.
1 Introduction
Factual consistency of LLM-generated output is a requirement for critical real-world applications. LLM reasoning in the form of "thinking" has shown promising improvements in model performance on complex downstream tasks, such as mathematical reasoning and puzzle-like questions, using additional compute during inference (e.g., test-time scaling; Wu et al., 2024; Muennighoff et al., 2025; Zhang et al., 2025). However, it remains an open question whether these reasoning techniques improve factuality, particularly for complex multi-hop QA (mQA). This task tests a model's ability to answer a question by synthesizing information from multiple pieces of evidence, often spread across different resources and requiring several reasoning steps. We hypothesize that reasoning models should perform better than non-reasoning LLMs on the mQA task. To test this hypothesis, we source reasoning traces from state-of-the-art reasoning models and fine-tune several non-reasoning LLMs to induce reasoning capabilities. However, we have no guarantee that the reasoning traces from the large reasoning models are factually correct. To provide a formal factual grounding for these traces, we condition the models on retrieved knowledge graph (KG) paths relevant to the questions. This is possible because KGs encode facts as directed, labeled graphs over entities and relations, which offers a verifiable foundation to inform each step of the reasoning process. We call our approach fs1 (factual simple test-time scaling; Muennighoff et al., 2025).
We fine-tune eight LLMs of different sizes on the original reasoning traces (rt; 3.4K samples) or on our KG-enhanced traces (fs1; 3.9K samples). We evaluate the fine-tuned models on six QA test sets spanning 23.9K questions, finding that fine-tuning on this amount of data can improve accuracy by 6-14 absolute points (pass@$16$) for a 32B-parameter model across the benchmarks. A snapshot of our method is shown in Figure 1. This setup enables us to address our research question (RQ): To what extent does grounding the reasoning processes of LLMs in KG paths enhance their factual accuracy for mQA? To address this question, our contributions are as follows:
<details>
<summary>x1.png Details</summary>

### Visual Description
A four-panel diagram comparing how four settings answer the question "The visual artist that created the art series of Las Meninas, where did they live?": (1) an instruction-tuned model answers "The answer is Paris." (incorrect, red 'X'); (2) Chain-of-Thought answers "Let me think step-by-step… My answer is Vienna." (incorrect); (3) Thinking (rt) reasons inside `<think>` tags and answers "The answer should be Barcelona." (incorrect); (4) Thinking with KGs (fs1) receives the question together with Wikidata KG paths (indicated by a Wikidata logo) and answers "The answer is Madrid." (correct, green checkmark). Only the KG-grounded setting arrives at the correct answer, illustrating that reasoning tokens alone are not sufficient for factual accuracy without grounding in a reliable knowledge source.
</details>
Figure 1: Snapshot of Method. We show a snapshot of the experiments executed in this study. There are four settings in which a question can be answered: (1) direct answer from an instruction-tuned model, (2) step-by-step reasoning via Chain-of-Thought, (3) original "thinking", and (4) knowledge-graph-enhanced "thinking". We show an example of what (4) looks like in Figure 3.
- We demonstrate that with test-time scaling (parallel sampling), our fs1-tuned Qwen2.5-32B model improves factual accuracy by 6-14 absolute points (at pass@$16$).
- We conduct an analysis over question and answer types (e.g., question difficulty, answer type, and domain) to investigate where fs1-tuned models provide improvements. We find that fs1-tuned models perform better on more difficult questions requiring 3 or more hops.
- We examine the performance of eight fs1-tuned models (360M-32B parameters) in a pass@$1$ setting against baselines. We find that smaller LLMs show the largest performance gains, whereas larger models see less pronounced improvements.
- We release 3.4K raw reasoning traces and 3.9K KG-enhanced reasoning traces both sourced from QwQ-32B and Deepseek-R1. All code, datasets, and models are publicly available under an MIT license: https://github.com/jjzha/fs1.
2 Reasoning Data
rt: Distilling Reasoning Traces.
To obtain reasoning traces, we use ComplexWebQuestions (CWQ; Talmor & Berant, 2018), a dataset designed for complex mQA. The CWQ dataset is created by automatically generating complex SPARQL queries based on Freebase (Bollacker et al., 2008). These queries are then automatically transformed into natural language questions, which are further refined by human paraphrasing. We take the CWQ dev. set, which consists of 3,519 questions, to curate the reasoning traces. We query both QwQ-32B (Qwen Team, 2025) and Deepseek-R1 (671B; DeepSeek-AI, 2025). By querying the model directly with a question, e.g., "What art movement was Pablo Picasso part of?", we retrieve the reasoning traces surrounded by "think" tokens (<think>...</think>) and force the model to give the final answer to the question in \boxed{} format. We extract around 3.4K correct-only traces (i.e., the final answer is correct), which we call rt. We show full examples in Figure 8 and Figure 9 (Appendix C).
fs1: Enhancing Reasoning Traces with Knowledge Graph Paths.
We steer the reasoning traces with KG paths to remove inaccuracies in the traces. Since the CWQ dataset consists of entities from Freebase, we align them to their corresponding Wikidata entities. For each question in the dev. set of CWQ, relevant KG paths are extracted from Wikidata via random walks implemented as SPARQL queries, as shown in Appendix E. Each mQA pair in the dataset may have multiple valid KG paths, which we linearize into text while retaining the structural information of the KG. The paths are generated by extracting the relevant entities from the question and the gold answer. Because diverse KG paths can lead to the same answer, they reflect the possible diversity of the reasoning traces; including linearized graphs thus improves the interpretability and explainability of the traces. The prompt used to obtain the improved reasoning traces, again from QwQ-32B and Deepseek-R1, is shown in Figure 3. Full examples are in Figure 10 and Figure 11 (Appendix C).
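A minimal sketch of how such linearized paths could be assembled into the fs1 prompt of Figure 3. The helper names and the (subject, relation, object) triple format are our own illustration, not the paper's released code:

```python
# Sketch: linearise Wikidata-style KG paths into an fs1-style prompt.
# Triple format and function names are illustrative assumptions.

def linearise_paths(paths):
    """Render each KG path as lines of 'subject, relation, object' triples."""
    lines = []
    for path in paths:
        for subj, rel, obj in path:
            lines.append(f"{subj}, {rel}, {obj}")
    return "\n".join(lines)

def build_fs1_prompt(question, paths):
    """Combine the question with the linearised graph, as in Figure 3."""
    graph = linearise_paths(paths)
    return (
        f"{question} While answering the question, make use of the following "
        f"linearised graph as an inspiration in your reasoning, not as the only "
        f"answer: {graph}. Put your final answer within \\boxed{{}}."
    )

paths = [
    [("1994 NBA Finals", "winner", "Houston Rockets"),
     ("Houston Rockets", "owned by", "Leslie Alexander")],
]
prompt = build_fs1_prompt(
    "When did the sports team owned by Leslie Alexander win the NBA championship?",
    paths,
)
```

In practice, several alternative paths for the same question would simply be appended as further triples, mirroring the duplicated paths in Figure 3.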
Table 1: Training Data Statistics. Statistics of reasoning traces from QwQ-32B and Deepseek-R1 on CWQ, based on the Qwen2.5-32B tokenizer. The original reasoning traces (rt) come from querying the reasoning models with the question alone, whereas fs1 indicates the statistics when queried together with the knowledge graph paths. We compute the accuracy of the models' final answers via LLM-as-a-Judge, and show that fs1 achieves higher accuracy than rt.
| Metric | QwQ-32B rt | QwQ-32B fs1 | R1-671B rt | R1-671B fs1 | Total rt | Total fs1 |
| --- | --- | --- | --- | --- | --- | --- |
| Exact Match | 0.46 | $\uparrow$ 0.63 | 0.56 | $\uparrow$ 0.72 | 0.51 | $\uparrow$ 0.67 |
| Sem. Match (all-MiniLM-L6-v2) | 0.50 | $\uparrow$ 0.58 | 0.55 | $\uparrow$ 0.63 | 0.52 | $\uparrow$ 0.60 |
| LLM-as-a-Judge (gpt-4o-mini) | 0.44 | $\uparrow$ 0.61 | 0.54 | $\uparrow$ 0.70 | 0.49 | $\uparrow$ 0.65 |
| *Samples with only correct answers* | | | | | | |
| Number of Samples | 1,533 | 1,972 | 1,901 | 1,914 | 3,434 | 3,886 |
| Avg. Reasoning Length (subwords) | 937 | 897 | 1,043 | 637 | 990 | 767 |
| Avg. Answer Length (subwords) | 40 | 93 | 64 | 116 | 52 | 104 |
<details>
<summary>x2.png Details</summary>

### Visual Description
Violin plots of reasoning-trace length (token count in subwords, roughly 0 to 8,000) for the rt and fs1 data sources, with one pair of plots per model: QwQ-32B on the left and DeepSeek-R1 on the right. Dashed horizontal lines mark the median, first quartile (Q1), and third quartile (Q3) of each distribution. For QwQ-32B, rt and fs1 are nearly identical (median 552 vs. 553; Q1 392 vs. 386; Q3 1,017 vs. 1,039). For DeepSeek-R1, fs1 traces are markedly shorter than rt traces (median 496 vs. 635; Q1 359 vs. 431; Q3 792 vs. 1,274). All distributions are right-skewed, with long tails of occasional very long traces.
</details>
Figure 2: Distribution of Reasoning Traces. We show the distribution of reasoning length, in subwords, for the queried models under rt and fs1. In particular for Deepseek-R1, the fs1 reasoning traces are notably shorter than the rt traces.
Data Statistics.
In Table 1 and Figure 2, we compare the accuracy and statistics of the rt and fs1 reasoning traces. We evaluate reasoning traces using three methods: (1) Exact Match, checking whether the \boxed{} answer exactly matches or is a sub-phrase of any gold answer; (2) Semantic Match, accepting answers with a cosine similarity score $>$ 0.5; and (3) LLM-as-a-Judge, verifying entity alignment using gpt-4o-mini-2024-07-18. The results show that fs1 achieves higher accuracy, indicating that it contains more factual answers. While rt traces are longer (up to 1K subwords on average), fs1 traces are typically shorter (around 800 subwords). The median length in subwords is similar for QwQ-32B (552 for rt and 553 for fs1), while there is a clear difference for Deepseek-R1 (median 635 for rt and 496 for fs1). Spot-checking reveals that fs1 yields more definitive answers.
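The Exact Match criterion above can be sketched as follows; the function name and case normalization are our own illustrative choices, not the paper's exact implementation:

```python
# Sketch of Exact Match: the \boxed{} answer counts as correct if it equals,
# or is a sub-phrase of, any gold answer (case-insensitive here by assumption).

def exact_match(predicted: str, gold_answers: list[str]) -> bool:
    pred = predicted.strip().lower()
    return any(pred == g.lower() or pred in g.lower() for g in gold_answers)

exact_match("houston rockets", ["Houston Rockets"])  # equal after normalization
exact_match("1994", ["1994 and 1995"])               # sub-phrase of a gold answer
exact_match("Barcelona", ["Madrid"])                 # no match
```

Semantic Match and LLM-as-a-Judge then catch paraphrases this string-level check misses.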
3 Methodology
3.1 Training and Inference
We fine-tune six Qwen2.5-Instruct models (0.5B to 32B) on rt and fs1, using only reasoning traces with correct final answers. During inference, we evaluate the model on the original questions to test its performance. Following Muennighoff et al. (2025), we train for 5 epochs with a sequence length of 8,192, a batch size of 16, a learning rate of $1Γ 10^{-5}$ (cosine schedule, 5% warmup), and a weight decay of $1Γ 10^{-4}$ . The models are optimized with a standard supervised fine-tuning (SFT) loss, which minimizes the negative log-likelihood (implemented as the cross-entropy function) of target tokens in an autoregressive manner. Let $y_{t}^{*}$ be the correct token and $p_{\theta}(y_{t}^{*}\mid x,y_{<t})$ be the modelβs probability of predicting it. The model optimizes the function:
$$
\mathcal{L}_{\text{SFT}}(\theta)=-\frac{1}{T}\sum_{t=1}^{T}\log p_{\theta}\left(y_{t}^{*}\mid x,y_{<t}\right). \tag{1}
$$
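As a minimal numeric illustration of Eq. (1), the SFT loss is the mean negative log-likelihood the model assigns to the gold target tokens. The probabilities below are made up for illustration:

```python
# Numeric sketch of Eq. (1): L_SFT = -(1/T) * sum_t log p_theta(y_t* | x, y_<t).
import math

def sft_loss(target_token_probs):
    """Mean negative log-likelihood over the T gold target tokens."""
    T = len(target_token_probs)
    return -sum(math.log(p) for p in target_token_probs) / T

probs = [0.9, 0.6, 0.8]   # illustrative p_theta(y_t* | x, y_<t) for T = 3 tokens
loss = sft_loss(probs)    # lower when the model is confident in the gold tokens
```

A perfectly confident model (probability 1.0 on every gold token) attains a loss of zero, and the loss grows without bound as any gold-token probability approaches zero.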
For inference, we use a temperature ( $T$ ) of $0.7$ and top_p of $0.8$ for original instruct models. Otherwise, we use $T=0.6$ and top_p of $0.95$ . Further details on hardware and costs are in Appendix B.
fs1 Prompt Example
"When did the sports team owned by Leslie Alexander win the NBA championship? While answering the question, make use of the following linearised graph as an inspiration in your reasoning, not as the only answer: 1994 NBA Finals, winner, Houston Rockets; Houston Rockets, owned by, Leslie Alexander; 1995 NBA Finals, winner, Houston Rockets; Houston Rockets, owned by, Leslie Alexander. Put your final answer within \boxed{}." (For illustration) Gold Answer: ["1994 and 1995"]
Figure 3: fs1 Prompt Example. We depict how we prompt both Deepseek-R1 and QwQ-32B to obtain better reasoning traces with KG paths.
LLM-as-a-Judge (Llama-3.3-70B)
gold answer: ["Joule per gram per kelvin", "Joule per kilogram per kelvin"] predicted answer: "J/(kg$\cdot$K)" Is the gold answer entity or value contained in the predicted answer? Respond only with 0 (no) or 1 (yes). # Llama-3.3-70B-Instruct outputs "1"
Figure 4: Prompt for LLM-as-a-Judge. We show the LLM-as-a-Judge prompt for evaluating whether the predicted and gold answer refer to the same real-world entity, where regular exact string matching will not capture the alignment between the gold and predicted answer in this example (i.e., the measurement unit).
Table 2: Test Benchmark. Overview of the mQA test sets used in our evaluation.
| Dataset | License | Size | Description |
| --- | --- | --- | --- |
| CWQ (Talmor & Berant, 2018) | apache-2.0 | 3.5K | Multi-hop QA built from WebQuestionsSP with compositional SPARQL queries over Freebase, paraphrased by crowd workers. |
| ExaQT (Jia et al., 2021) | cc-by-4.0 | 3.2K | Temporal-QA benchmark combining eight KB-QA datasets, focusing on time-specific queries. |
| GrailQA (Gu et al., 2021) | apache-2.0 | 6.8K | Freebase QA dataset with annotated answers and logical forms (SPARQL/S-expressions) across 86 domains. |
| SimpleQA (Wei et al., 2024a) | MIT | 4.3K | Fact-seeking questions with verified answers, designed to measure and challenge the factual accuracy of language models. |
| Mintaka (Sen et al., 2022) | cc-by-4.0 | 4.0K | Multilingual QA (9 languages), entity-linked pairs across diverse domains (English test split). |
| WebQSP (Yih et al., 2016) | apache-2.0 | 2.0K | Enhanced WebQuestions with Freebase QA annotated with SPARQL ($\sim$82% coverage). |
| Total | | 23.9K | |
<details>
<summary>x3.png Details</summary>

### Visual Description
Three heatmaps measuring question overlap among CWQ_train, CWQ_test, ExaQT, GrailQA, SimpleQA, Mintaka, and WebQSP (datasets on both axes, darker cells indicating higher values). Left: counts of question pairs with cosine similarity above 0.90; the largest overlaps are CWQ_train vs. CWQ_test (109) and ExaQT vs. WebQSP (83), followed by CWQ_test vs. ExaQT (80), with most other cells at or near zero (GrailQA and SimpleQA overlap with almost nothing). Middle: exact-match counts, which are essentially all zero apart from a single match between CWQ_train and WebQSP. Right: average pairwise cosine similarity, which is uniformly low (roughly 0.05 to 0.15). Overall, the training set shares very few near-duplicate questions with the test benchmarks.
</details>
Figure 5: Data Overlap. We show data overlap between the train set and benchmark. On the left, one can observe the count of similar questions when the cosine similarity $>$ 0.90 (measured with paraphrase-MiniLM-L6-v2; Reimers & Gurevych, 2019). In the middle, we measure exact match counts. On the right, we show the average pairwise cosine similarity across the full test sets.
3.2 Benchmarks and Evaluation
We show the test datasets, licenses, sizes, and a short description in Table 2. We have four baselines, namely Qwen2.5-72B-Instruct (Qwen Team, 2024), QwQ-32B, Deepseek-R1, and o3-mini (OpenAI, 2025). To evaluate our models, we select a suite of six mQA benchmarks with a total of 23.9K questions. We have four setups for benchmarking the models: (1) all models, including baselines, are evaluated zero-shot (i.e., only querying the question); (2) the models are queried using zero-shot chain-of-thought prompting (Kojima et al., 2022; Wei et al., 2022), where we simply append the prompt "Put your final answer within \boxed{}. Think step-by-step."; (3) we benchmark the models fine-tuned on rt; (4) we benchmark the models fine-tuned on fs1. In Figure 12 (Section D.1), we show an example from each dataset in the test benchmark.
Possible Data Leakage.
In Figure 5, we show the overlap between the questions in the training set of ComplexWebQuestions (CWQ_train) and all other benchmarks used in our study (all questions lower-cased). On the left, we count the instances where the cosine similarity between questions exceeds 0.90. The largest overlap is between CWQ_train and CWQ_test (109 questions), and the second largest between WebQSP and ExaQT (83 questions). In the middle, we show that there are almost no exact string matches between the questions. On the right, we show that the average pairwise cosine similarity across the benchmarks is at most 0.15.
Evaluation Metric.
Similar to previous studies, e.g., Ma et al. (2025), we report pass@$k$, which reflects the probability that at least one out of $k$ randomly selected completions (drawn from a total of $n$ completions per problem) is correct. As such, it serves as an upper bound on practical performance, which would require a subsequent selection mechanism. Formally, pass@$k$ is given by $\mathbb{E}_{\text{problems}}\left[1-\frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$, where $n$ is the number of generated completions per problem and $c$ is the count of correct completions (Chen et al., 2021). For our benchmarks, we evaluate $k=\{1,2,4,8,16\}$. In practice, pass@32 is typically reported for formal theorem-proving tasks, while pass@1 (reducing to standard top-1 accuracy) is standard for math and coding tasks, as noted by Ma et al. (2025). In this work, for factual mQA, we report up to $k=16$.
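The per-problem estimator above can be computed directly with the unbiased formula of Chen et al. (2021):

```python
# Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k), averaged over problems,
# where n completions were sampled per problem and c of them are correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:  # every size-k subset must contain a correct completion
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

pass_at_k(16, 0, 4)   # no correct completions -> 0.0
pass_at_k(16, 16, 1)  # all completions correct -> 1.0
pass_at_k(2, 1, 1)    # one of two correct -> 0.5
```

Averaging `pass_at_k(n, c, k)` over all benchmark questions yields the reported pass@$k$ curves.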
LLM-as-a-Judge.
To decide whether an answer is correct (1 or 0), our main evaluation approach is LLM-as-a-Judge with Llama-3.3-70B-Instruct. (We compared gpt-4o-mini-2024-07-18 and Llama-3.3-70B-Instruct on a large subsample of our outputs and found almost no difference in predictions; additionally, Llama-3.3-70B is rated higher than gpt-4o-mini in LM Arena, 79${}^{\text{th}}$ vs. 83${}^{\text{rd}}$ at the time of writing.) The judge determines whether a predicted answer obtained from the \boxed{} output refers to the same real-world entity as the gold answer; an example is shown in Figure 4. When the model does not generate a \boxed{} output, we take the last 10 subwords as the predicted answer, from which the judge can still infer the predicted real-world entity even when there is no exact string match. The same approach is used in Table 1. LLM-as-a-Judge rates the quality of output similarly to the exact string matching and semantic similarity evaluation methods.
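The answer-extraction step feeding the judge can be sketched as follows; the paper's fallback operates on subwords, which this sketch approximates with whitespace tokens (an assumption of ours):

```python
# Sketch: extract the \boxed{...} content if present, otherwise fall back
# to the tail of the generation (approximating the last-10-subwords rule).
import re

def extract_answer(generation: str, tail_tokens: int = 10) -> str:
    m = re.search(r"\\boxed\{([^}]*)\}", generation)
    if m:
        return m.group(1).strip()
    return " ".join(generation.split()[-tail_tokens:])

extract_answer(r"...so the answer is \boxed{Madrid}")  # extracts 'Madrid'
extract_answer("no boxed answer was produced here")    # falls back to the tail
```

The extracted string is then paired with the gold answers in a judge prompt like the one in Figure 4.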
<details>
<summary>x4.png Details</summary>

### Visual Description
\n
## Charts: Pass@k Performance Across Datasets
### Overview
The image presents six separate line charts, each displaying the pass@k performance (in percentage) for different models (inst, cot, rt, fs1) across six different question answering datasets: ComplexWebQuestions, ExaQT, GrailQA, Mintaka, SimpleQA, and WebQSP. The x-axis represents 'k', likely indicating the number of retrieved answers considered, and the y-axis represents the pass@k score (percentage).
### Components/Axes
* **X-axis Label (all charts):** k
* **Y-axis Label (all charts):** pass@k (%)
* **Legend (bottom-center):**
* inst (cyan solid line)
* cot (green solid line with triangle markers)
* rt (red dotted line)
* fs1 (red dashed line with square markers)
* **Chart Titles (top-center of each chart):**
* ComplexWebQuestions
* ExaQT
* GrailQA
* Mintaka
* SimpleQA
* WebQSP
* **X-axis Scale:** Varies per chart.
* ComplexWebQuestions: 1, 2, 4, 8, 16
* ExaQT: 1, 2, 4, 8, 16
* GrailQA: 1, 2, 4, 8, 16
* Mintaka: 1, 2, 4, 8, 16
* SimpleQA: 1, 2, 4, 8, 16
* WebQSP: 1, 2, 4, 8, 16
* **Y-axis Scale:** Varies per chart, but generally ranges from approximately 30% to 85%.
### Detailed Analysis
Here's a breakdown of the data for each chart:
**1. ComplexWebQuestions:**
* **inst:** Starts at ~40%, increases to ~60% at k=16. The line slopes upward, but the increase slows down after k=8.
* **cot:** Starts at ~45%, increases to ~62% at k=16. Similar upward slope to 'inst', with a slowing increase after k=8.
* **rt:** Starts at ~50%, increases to ~62% at k=16. Relatively flat slope.
* **fs1:** Starts at ~52%, increases to ~63% at k=16. Slightly steeper slope than 'rt'.
**2. ExaQT:**
* **inst:** Starts at ~38%, increases to ~52% at k=16. Upward slope, with a more pronounced slowing after k=8.
* **cot:** Starts at ~42%, increases to ~55% at k=16. Similar trend to 'inst'.
* **rt:** Starts at ~45%, increases to ~55% at k=16. Relatively flat slope.
* **fs1:** Starts at ~48%, increases to ~56% at k=16. Slightly steeper slope than 'rt'.
**3. GrailQA:**
* **inst:** Starts at ~35%, increases to ~48% at k=16. Upward slope, slowing after k=8.
* **cot:** Starts at ~38%, increases to ~50% at k=16. Similar trend to 'inst'.
* **rt:** Starts at ~42%, increases to ~52% at k=16. Relatively flat slope.
* **fs1:** Starts at ~45%, increases to ~53% at k=16. Slightly steeper slope than 'rt'.
**4. Mintaka:**
* **inst:** Starts at ~72%, increases to ~82% at k=16. Steep upward slope, with a slight slowing after k=8.
* **cot:** Starts at ~75%, increases to ~84% at k=16. Similar trend to 'inst'.
* **rt:** Starts at ~78%, increases to ~84% at k=16. Relatively flat slope.
* **fs1:** Starts at ~80%, increases to ~85% at k=16. Slightly steeper slope than 'rt'.
**5. SimpleQA:**
* **inst:** Starts at ~8%, increases to ~20% at k=16. Steep upward slope, with a slight slowing after k=8.
* **cot:** Starts at ~10%, increases to ~22% at k=16. Similar trend to 'inst'.
* **rt:** Starts at ~12%, increases to ~22% at k=16. Relatively flat slope.
* **fs1:** Starts at ~14%, increases to ~23% at k=16. Slightly steeper slope than 'rt'.
**6. WebQSP:**
* **inst:** Starts at ~64%, increases to ~72% at k=16. Upward slope, slowing after k=8.
* **cot:** Starts at ~67%, increases to ~74% at k=16. Similar trend to 'inst'.
* **rt:** Starts at ~70%, increases to ~74% at k=16. Relatively flat slope.
* **fs1:** Starts at ~72%, increases to ~75% at k=16. Slightly steeper slope than 'rt'.
### Key Observations
* 'fs1' consistently performs better than 'rt' across all datasets.
* 'inst' and 'cot' generally show similar performance trends.
* The performance gains from increasing 'k' diminish as 'k' increases, indicating diminishing returns.
* Mintaka has the highest overall pass@k scores, while SimpleQA has the lowest.
* The relative performance of the models varies across datasets. For example, 'rt' and 'fs1' are closer in performance on GrailQA than on Mintaka.
### Interpretation
The charts show the pass@k performance of four Qwen2.5-32B variants on six question-answering benchmarks. Here 'k' is the number of independently sampled generations per question, and pass@k is the fraction of questions for which at least one of the k generations is judged correct.
The consistent outperformance of 'fs1' over 'rt' suggests that the KG-grounded traces of 'fs1' make better use of additional samples. The similar performance of 'inst' and 'cot' indicates that chain-of-thought fine-tuning alone does not significantly improve performance in this setting.
The varying performance across datasets highlights the importance of dataset characteristics. Mintaka, with its high pass@k scores, likely contains questions that are easier for the models to answer, while SimpleQA, with its low scores, targets long-tail facts that demand more complex reasoning.
The diminishing returns as 'k' increases suggest a limit to the benefit of drawing more samples: beyond a certain point, additional generations largely repeat earlier answers, so an optimal 'k' likely exists for each dataset and model combination.
</details>
Figure 6: Upper-bound Test-Time Scaling for Factual Reasoning. We show with Qwen2.5-32B that parallel scaling is beneficial for complex mQA, measured by pass@$k$, especially when fine-tuned on fs1, compared to single-pass inference.
4 Results and Discussion
4.1 Results with Test-time Scaling
Parallel scaling can achieve lower latency by running multiple (identical) model instances simultaneously, either locally (e.g., across multiple GPUs or via batching techniques) or through API-based methods, to generate multiple answers. Parallel sampling typically entails an aggregation technique that combines $N$ independent solutions into a single final prediction, commonly known as a best-of-$N$ approach (Chollet, 2019; Irvine et al., 2023; Brown et al., 2024a; Li et al., 2022). Formally, given a set of $N$ predictions $P=\{p_{1},...,p_{N}\}$, the best-of-$N$ method selects a prediction $p \in P$ as the final output.
In this work, we present results using pass@$k$ (see Section 3.2), extending the number of samples up to $k=16$. In Figure 6, we show parallel scaling results from 16 inference runs with Qwen2.5-32B-Instruct, CoT, rt, and fs1 on each test dataset. For parallel sampling, we limit ourselves to Qwen2.5-32B, as running 16 inferences for 8 models in all 4 settings would require 12.2M model inferences across the test benchmarks, which is computationally prohibitive. As $k$ increases, pass@$k$ (indicating whether at least one generation is correct) rises steadily across all benchmarks. Parallel sampling boosts the chance of producing a correct answer, especially when fine-tuned on fs1. For example, on CWQ we see a performance increase of 16 absolute points at $k=16$, and on SimpleQA around 6 absolute points at the same $k$, compared to their original instruction-tuned counterpart.
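The pass@$k$ numbers reported here can be computed with the standard unbiased estimator of Chen et al. (2021); a minimal sketch follows (function names are illustrative, not from our codebase):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations, c of which
    are correct, is correct."""
    if n - c < k:
        return 1.0  # fewer than k incorrect generations: a correct one is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average pass@k over questions; results holds (n, c) pairs per question."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

For example, a question with 4 correct answers out of 16 generations contributes pass@1 = 0.25, since a single draw is correct with probability 4/16.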
<details>
<summary>x5.png Details</summary>

### Visual Description
## Bar Chart: Relative Improvement (RI) by Hops
### Overview
This bar chart displays the relative improvement (RI) in percentage (%) for three different models ('cot', 'rt', and 'fs1') across varying numbers of hops (1, 2, 3, and 3+). The y-axis represents the RI (%) at pass@16, while the x-axis indicates the number of hops. Each hop count has three bars representing the RI for each model.
### Components/Axes
* **Title:** "Relative Improvement (RI) by Hops" - positioned at the top-center.
* **X-axis Label:** "Hops" - indicating the number of hops. Markers are at 1, 2, 3, and 3+.
* **Y-axis Label:** "RI (%) ; pass@16" - indicating the relative improvement in percentage and the pass@16 metric. The scale ranges from 0 to 60, with increments of 10.
* **Legend:** Located at the top-left corner.
* 'cot' - Light blue color.
* 'rt' - Gray color.
* 'fs1' - Teal color.
### Detailed Analysis
The chart consists of grouped bar plots for each hop count.
* **Hops = 1:**
* 'cot': Approximately 38%
* 'rt': Approximately 34%
* 'fs1': Approximately 31%
* **Hops = 2:**
* 'cot': Approximately 36%
* 'rt': Approximately 43%
* 'fs1': Approximately 32%
* **Hops = 3:**
* 'cot': Approximately 31%
* 'rt': Approximately 52%
* 'fs1': Approximately 59%
* **Hops = 3+:**
* 'cot': Approximately 26%
* 'rt': Approximately 45%
* 'fs1': Approximately 48%
**Trends:**
* **'cot'**: The RI for 'cot' generally decreases as the number of hops increases, starting at approximately 38% for 1 hop and decreasing to approximately 26% for 3+ hops.
* **'rt'**: The RI for 'rt' generally increases with the number of hops, starting at approximately 34% for 1 hop and reaching approximately 45% for 3+ hops.
* **'fs1'**: The RI for 'fs1' increases significantly from 1 to 3 hops (approximately 31% to 59%), then slightly decreases to approximately 48% for 3+ hops.
### Key Observations
* 'fs1' consistently demonstrates the highest relative improvement at 3 hops, significantly outperforming 'cot' and 'rt'.
* 'cot' shows the lowest relative improvement across all hop counts.
* The relative improvement for 'rt' increases steadily with the number of hops.
* The largest jump in relative improvement for 'fs1' occurs between 2 and 3 hops.
### Interpretation
The data suggests that increasing the number of hops generally improves the relative gains of 'rt' and 'fs1', while the gains of 'cot' shrink. 'fs1' appears to benefit the most from increasing hops, achieving the highest relative improvement at 3 hops, which could indicate that it is better suited for complex reasoning tasks requiring multiple steps. The decreasing gains of 'cot' with more hops suggest it may be less effective at complex reasoning, possibly due to error propagation across steps. Here pass@16 is the underlying accuracy metric (at least one correct answer among 16 samples), and RI measures the improvement relative to the instruction-tuned baseline. The 3+ hops category aggregates a more diverse set of questions, which may explain the slight dip in 'fs1' performance relative to its peak at 3 hops.
</details>
(a) Performance by number of hops required to answer the question, measured in pass@$16$.
<details>
<summary>x6.png Details</summary>

### Visual Description
## Bar Chart: Relative Improvement (RI) by Answer Type
### Overview
This bar chart displays the relative improvement (RI) in percentage (%) for different answer types (date, place, person, other, number) across three fine-tuning settings: chain-of-thought (cot), reasoning traces (rt), and factually grounded reasoning traces (fs1). The performance metric is pass@16.
### Components/Axes
* **Title:** "Relative Improvement (RI) by Answer Type"
* **X-axis:** Answer Type (date, place, person, other, number)
* **Y-axis:** RI (%) ; pass@16 (Scale from 0 to 60, increments of 10)
* **Legend:**
* cot (Yellow)
* rt (Light Blue/Teal)
* fs1 (Dark Blue/Navy)
### Detailed Analysis
The chart consists of five groups of three bars, one for each answer type and method.
**Date:**
* cot: Approximately 48%
* rt: Approximately 42%
* fs1: Approximately 55%
* Trend: fs1 shows the highest improvement, followed by cot, then rt.
**Place:**
* cot: Approximately 32%
* rt: Approximately 20%
* fs1: Approximately 26%
* Trend: cot shows the highest improvement, followed by fs1, then rt.
**Person:**
* cot: Approximately 22%
* rt: Approximately 25%
* fs1: Approximately 12%
* Trend: rt shows the highest improvement, followed by cot, then fs1.
**Other:**
* cot: Approximately 20%
* rt: Approximately 30%
* fs1: Approximately 30%
* Trend: rt and fs1 show the highest improvement, followed by cot.
**Number:**
* cot: Approximately 16%
* rt: Approximately 24%
* fs1: Approximately 26%
* Trend: fs1 shows the highest improvement, followed by rt, then cot.
### Key Observations
* The "fs1" method consistently performs well, achieving the highest relative improvement for "date", "number", and being tied for highest on "other".
* The "cot" method shows strong performance for "date" and "place".
* The "rt" method performs best for "person" and ties for best on "other".
* The relative improvement varies significantly depending on the answer type.
### Interpretation
The data suggests that the effectiveness of different methods for improving answer accuracy depends on the type of question being asked. "fs1" appears to be a robust method across multiple answer types, while "cot" and "rt" excel in specific areas, indicating that a hybrid approach leveraging the strengths of each method based on the answer type might yield the best overall results. Here pass@16 means the correct answer appears among the top 16 generated responses, and RI is measured relative to the instruction-tuned baseline. The large differences in RI across answer types suggest that the underlying challenges of each type (e.g., identifying dates vs. understanding relationships between people) require different improvement strategies.
</details>
(b) Performance by answer type (i.e., what type of entity the answer is) in pass@$16$.
<details>
<summary>x7.png Details</summary>

### Visual Description
## Bar Chart: Relative Improvement (RI) by Domain
### Overview
This bar chart displays the relative improvement (RI) across different domains, measured as a percentage. Three different models ("cot", "rt", and "fs1") are compared for each domain. The Y-axis represents the RI (%) at pass@16, and the X-axis lists the domains.
### Components/Axes
* **Title:** "Relative Improvement (RI) by Domain" - positioned at the top-center of the chart.
* **X-axis Label:** "Domain" - labels the categories along the horizontal axis. The domains are: "art", "sports", "other", "geography", "tv shows", "video games", "politics", "music", "sci & tech", "history".
* **Y-axis Label:** "RI (%) ; pass@16" - labels the vertical axis, indicating the relative improvement percentage. The scale ranges from 0 to approximately 85.
* **Legend:** Located in the top-right corner of the chart.
* "cot" - represented by a solid blue color.
* "rt" - represented by a black color.
* "fs1" - represented by a red color.
### Detailed Analysis
The chart consists of grouped bar plots for each domain, representing the RI for each model.
* **Art:**
* cot: Approximately 72%
* rt: Approximately 62%
* fs1: Approximately 68%
* **Sports:**
* cot: Approximately 48%
* rt: Approximately 62%
* fs1: Approximately 44%
* **Other:**
* cot: Approximately 44%
* rt: Approximately 44%
* fs1: Approximately 44%
* **Geography:**
* cot: Approximately 48%
* rt: Approximately 32%
* fs1: Approximately 36%
* **TV Shows:**
* cot: Approximately 40%
* rt: Approximately 36%
* fs1: Approximately 40%
* **Video Games:**
* cot: Approximately 36%
* rt: Approximately 44%
* fs1: Approximately 64%
* **Politics:**
* cot: Approximately 64%
* rt: Approximately 36%
* fs1: Approximately 84%
* **Music:**
* cot: Approximately 24%
* rt: Approximately 32%
* fs1: Approximately 28%
* **Sci & Tech:**
* cot: Approximately 16%
* rt: Approximately 24%
* fs1: Approximately 12%
* **History:**
* cot: Approximately 16%
* rt: Approximately 12%
* fs1: Approximately 18%
### Key Observations
* The "fs1" model consistently demonstrates the highest relative improvement in "politics" (approximately 84%).
* The "cot" model shows the highest relative improvement in "art" (approximately 72%).
* "Sci & Tech" and "History" consistently show the lowest relative improvements across all models.
* In the "sports" domain, the "rt" model outperforms both "cot" and "fs1".
* The "video games" domain shows a significant improvement for the "fs1" model compared to "cot".
### Interpretation
The chart illustrates the performance of three different models ("cot", "rt", and "fs1") across various domains. The relative improvement (RI) varies significantly depending on the domain and the model used, suggesting that the effectiveness of each model is domain-specific. The "fs1" model appears to be particularly strong in "politics", while the "cot" model excels in "art". The consistently low improvements in "sci & tech" and "history" might indicate that these domains present unique challenges for all three models, potentially due to the complexity of the subject matter or the availability of training data. The differences between the models highlight the importance of selecting the appropriate model for a given task or domain. Performance is measured as pass@16 (at least one correct answer among 16 samples), and RI indicates improvement relative to the instruction-tuned baseline.
</details>
(c) Performance measured per domain in pass@$16$.
Figure 7: Relative Improvements across Different Axes. We show the relative performance improvement (%) at pass@$16$ of different Qwen2.5-32B variants (i.e., CoT, rt, and fs1) against the original instruct model. In (a), we show performance by the number of hops required to answer the question; in (b), by answer type; in (c), by question domain. Absolute numbers are in Figure 13 (Appendix F).
4.2 What type of samples do models seem to fail on?
In Figure 7, we investigate which kinds of questions the model (Qwen2.5-32B) fails on. We use metadata from SimpleQA (Wei et al., 2024a), covering question difficulty as the number of hops required to answer the question (Figure 7(a)), the answer type (Figure 7(b)), and the question domain (Figure 7(c)). For question difficulty, we source the number of hops for each SimpleQA question from Lavrinovics et al. (2025): we count the number of Wikidata relations (P-properties), which indicates the number of hops required to go from entity A to B. When a question does not contain any relations, we assume it takes more than 3 hops to answer.
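The hop-counting heuristic above can be sketched as follows, assuming KG paths are linearized as whitespace-separated Wikidata identifiers; the exact path format in the pipeline may differ, so this helper is purely illustrative:

```python
import re

def count_hops(kg_path: str, default: int = 4) -> int:
    """Count Wikidata relation IDs (P-properties) in a linearized KG path,
    e.g. 'Q937 P19 Q1085 P17 Q213' has 2 hops. A path without any relation
    is assumed to need more than 3 hops (bucketed as '3+')."""
    hops = len(re.findall(r"\bP\d+\b", kg_path))
    return hops if hops > 0 else default
```

Questions whose paths contain no P-relation thus fall into the 3+ hops bucket used in Figure 7(a).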
In Figure 7(a), we observe that fs1 yields lower relative improvements on easier questions (1 or 2 hops required), but outperforms the other models as questions become more complex (3 or more hops required). This indicates that inducing KG paths helps in answering complex questions. In Figure 7(b), we show that fs1 achieves the largest relative improvement on numerical answers, such as numbers and dates, as well as on miscellaneous answer types. Finally, in Figure 7(c), fs1 performs best on questions related to video games, geography, politics, music, and miscellaneous topics; rt performs best on art- and history-related questions; and CoT performs best on questions related to sports.
4.3 Single pass results across scale
In Table 3, we show accuracy via LLM-as-a-Judge at pass@$1$ (i.e., one inference run per question) on all test datasets. Among the baselines, o3-mini is the dominant model, achieving the highest score on five of six datasets, e.g., 0.774 accuracy on Mintaka and 0.680 on WebQSP. The only exception is SimpleQA, where R1-70B performs best with a score of 0.188. These are followed by Qwen2.5-72B-Instruct and QwQ-32B in overall performance.
Turning to the Qwen2.5 results, the benefits of fine-tuning on rt and fs1 are most pronounced at the sub-billion parameter scale. For instance, fine-tuning the 0.5B model on fs1 yields substantial relative gains across all tasks, peaking at +74.6% on WebQSP. As model size increases, however, the picture becomes more nuanced: for the 1.5B model, the same fs1 fine-tuning degrades performance on four of six datasets, e.g., ExaQT (-4.7%) and WebQSP (-1.1%). While larger models like the 32B still benefit from fine-tuning (e.g., rt and fs1 are often the best performers in their group), the relative gains are smaller than those seen at the 0.5B scale.
Our results also show that fine-tuning improvements do not uniformly generalize across different model families at the sub-billion parameter scale. A comparison between the fine-tuned Qwen2.5 and SmolLM2 models reveals a significant performance divergence. Specifically, fine-tuning on fs1 provided consistent enhancements for the Qwen2.5-0.5B model, improving its CWQ score from 0.135 to 0.209. In contrast, the same fine-tuning on SmolLM2-360M yielded mixed results; while it improved performance on most tasks, it caused a notable degradation of -15.9% on GrailQA.
This variance diminishes with scale: models at the 1.5B/1.7B parameter scale exhibit more convergent behavior. For example, fine-tuning with rt on GrailQA provides a nearly identical small boost to both Qwen2.5-1.5B (+1.9%) and SmolLM2-1.7B (+1.8%). Overall, we hypothesize that this scale-dependent effect may occur because larger models (e.g., 32B) possess stronger parametric knowledge, making them less reliant on the explicit guidance from KG paths.
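The relative improvements reported in parentheses in Table 3 follow the usual definition, (tuned - base) / base * 100; a minimal sketch with values taken from Table 3:

```python
def relative_improvement(tuned: float, base: float) -> float:
    """Relative gain (%) of a fine-tuned model over its instruction-tuned base."""
    return (tuned - base) / base * 100.0

# Qwen2.5-0.5B + fs1 on WebQSP: 0.173 -> 0.302
print(round(relative_improvement(0.302, 0.173), 1))  # 74.6
# Qwen2.5-0.5B + fs1 on CWQ: 0.135 -> 0.209
print(round(relative_improvement(0.209, 0.135), 1))  # 54.8
```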
Table 3: Single Pass (pass@1) Results on mQA Benchmarks. We show accuracy and relative performance gains on our benchmarks for several baselines, Qwen2.5, and SmolLM2 models. For each size, we show the original instruction-tuned model followed by versions fine-tuned with chain-of-thought, rt, and fs1. Parentheses indicate the relative improvement over the instruction-tuned counterpart. The benefits of fine-tuning are most pronounced for smaller models.
| Model | CWQ | ExaQT | GrailQA | SimpleQA | Mintaka | WebQSP |
| --- | --- | --- | --- | --- | --- | --- |
| Large Language Model Baselines | | | | | | |
| Qwen2.5-72B | 0.481 | 0.440 | 0.361 | 0.117 | 0.736 | 0.653 |
| QwQ-32B | 0.479 | 0.390 | 0.358 | 0.097 | 0.708 | 0.612 |
| R1-70B | 0.501 | 0.476 | 0.340 | 0.188 | 0.755 | 0.549 |
| o3-mini | 0.558 | 0.497 | 0.438 | 0.138 | 0.774 | 0.680 |
| Small Language Models (0.36B-1.7B) | | | | | | |
| SmolLM2-360M | 0.148 | 0.088 | 0.164 | 0.024 | 0.175 | 0.235 |
| + cot | 0.151 (+2.0 %) | 0.101 (+14.8 %) | 0.169 (+3.0 %) | 0.025 (+4.2 %) | 0.188 (+7.4 %) | 0.230 (-2.1 %) |
| + rt | 0.192 (+29.7 %) | 0.111 (+26.1 %) | 0.156 (-4.9 %) | 0.029 (+20.8 %) | 0.202 (+15.4 %) | 0.293 (+24.7 %) |
| + fs1 | 0.179 (+20.9 %) | 0.093 (+5.7 %) | 0.138 (-15.9 %) | 0.027 (+12.5 %) | 0.197 (+12.6 %) | 0.264 (+12.3 %) |
| Qwen2.5-0.5B | 0.135 | 0.058 | 0.127 | 0.023 | 0.131 | 0.173 |
| + cot | 0.161 (+19.3 %) | 0.104 (+79.3 %) | 0.141 (+11.0 %) | 0.031 (+34.8 %) | 0.214 (+63.4 %) | 0.234 (+35.3 %) |
| + rt | 0.190 (+40.7 %) | 0.089 (+53.4 %) | 0.155 (+22.0 %) | 0.022 (-4.3 %) | 0.178 (+35.9 %) | 0.286 (+65.3 %) |
| + fs1 | 0.209 (+54.8 %) | 0.101 (+74.1 %) | 0.166 (+30.7 %) | 0.035 (+52.2 %) | 0.202 (+54.2 %) | 0.302 (+74.6 %) |
| Qwen2.5-1.5B | 0.234 | 0.170 | 0.208 | 0.031 | 0.316 | 0.360 |
| + cot | 0.252 (+7.7 %) | 0.179 (+5.3 %) | 0.216 (+3.8 %) | 0.041 (+32.3 %) | 0.318 (+0.6 %) | 0.391 (+8.6 %) |
| + rt | 0.255 (+9.0 %) | 0.173 (+1.8 %) | 0.212 (+1.9 %) | 0.038 (+22.6 %) | 0.294 (-7.0 %) | 0.360 (+0.0 %) |
| + fs1 | 0.263 (+12.4 %) | 0.162 (-4.7 %) | 0.204 (-1.9 %) | 0.035 (+12.9 %) | 0.301 (-4.7 %) | 0.356 (-1.1 %) |
| SmolLM2-1.7B | 0.248 | 0.176 | 0.219 | 0.032 | 0.293 | 0.408 |
| + cot | 0.285 (+14.9 %) | 0.177 (+0.6 %) | 0.209 (-4.6 %) | 0.032 (+0.0 %) | 0.295 (+0.7 %) | 0.409 (+0.2 %) |
| + rt | 0.306 (+23.4 %) | 0.184 (+4.5 %) | 0.223 (+1.8 %) | 0.038 (+18.7 %) | 0.366 (+24.9 %) | 0.454 (+11.3 %) |
| + fs1 | 0.305 (+23.0 %) | 0.179 (+1.7 %) | 0.218 (-0.5 %) | 0.036 (+12.5 %) | 0.341 (+16.4 %) | 0.426 (+4.4 %) |
| Large Language Models (3B-32B) | | | | | | |
| Qwen2.5-3B | 0.317 | 0.214 | 0.252 | 0.044 | 0.396 | 0.466 |
| + cot | 0.302 (-4.7 %) | 0.222 (+3.7 %) | 0.248 (-1.6 %) | 0.048 (+9.1 %) | 0.431 (+8.8 %) | 0.477 (+2.4 %) |
| + rt | 0.363 (+14.5 %) | 0.235 (+9.8 %) | 0.279 (+10.7 %) | 0.053 (+20.5 %) | 0.495 (+25.0 %) | 0.483 (+3.6 %) |
| + fs1 | 0.330 (+4.1 %) | 0.205 (-4.2 %) | 0.253 (+0.4 %) | 0.045 (+2.3 %) | 0.444 (+12.1 %) | 0.406 (-12.9 %) |
| Qwen2.5-7B | 0.376 | 0.281 | 0.299 | 0.070 | 0.548 | 0.580 |
| + cot | 0.383 (+1.9 %) | 0.292 (+3.9 %) | 0.295 (-1.3 %) | 0.062 (-11.4 %) | 0.580 (+5.8 %) | 0.565 (-2.6 %) |
| + rt | 0.401 (+6.6 %) | 0.296 (+5.3 %) | 0.300 (+0.3 %) | 0.067 (-4.3 %) | 0.576 (+5.1 %) | 0.517 (-10.9 %) |
| + fs1 | 0.408 (+8.5 %) | 0.272 (-3.2 %) | 0.303 (+1.3 %) | 0.053 (-24.3 %) | 0.551 (+0.5 %) | 0.492 (-15.2 %) |
| Qwen2.5-14B | 0.392 | 0.336 | 0.318 | 0.068 | 0.624 | 0.599 |
| + cot | 0.422 (+7.7 %) | 0.356 (+6.0 %) | 0.322 (+1.3 %) | 0.080 (+17.6 %) | 0.664 (+6.4 %) | 0.592 (-1.2 %) |
| + rt | 0.451 (+15.1 %) | 0.352 (+4.8 %) | 0.331 (+4.1 %) | 0.082 (+20.6 %) | 0.678 (+8.7 %) | 0.562 (-6.2 %) |
| + fs1 | 0.454 (+15.8 %) | 0.339 (+0.9 %) | 0.328 (+3.1 %) | 0.079 (+16.2 %) | 0.654 (+4.8 %) | 0.558 (-6.8 %) |
| Qwen2.5-32B | 0.428 | 0.362 | 0.334 | 0.087 | 0.674 | 0.621 |
| + cot | 0.435 (+1.6 %) | 0.366 (+1.1 %) | 0.332 (-0.6 %) | 0.099 (+13.8 %) | 0.696 (+3.3 %) | 0.614 (-1.1 %) |
| + rt | 0.471 (+10.0 %) | 0.366 (+1.1 %) | 0.342 (+2.4 %) | 0.094 (+8.0 %) | 0.680 (+0.9 %) | 0.563 (-9.3 %) |
| + fs1 | 0.477 (+11.4 %) | 0.361 (-0.3 %) | 0.344 (+3.0 %) | 0.078 (-10.3 %) | 0.682 (+1.2 %) | 0.576 (-7.2 %) |
5 Related Work
Different methods that involve long chain-of-thought processes (Kojima et al., 2022; Wei et al., 2022) involving reflection, backtracking, thinking (e.g., DeepSeek-AI, 2025; Muennighoff et al., 2025), self-consistency (e.g., Wang et al., 2023), and additional computation at inference time, such as test-time scaling (Wu et al., 2024; Muennighoff et al., 2025; Zhang et al., 2025), have shown promising improvements in LLM performance on complex reasoning tasks. Our work intersects with efforts in factuality, knowledge graph grounding, and test-time scaling.
Graph-enhanced In-context Learning.
Enhancing the factual consistency of LLMs using KGs has been explored in several directions, including semantic parsing methods that convert natural language questions into formal KG queries (Lan & Jiang, 2020; Ye et al., 2022). Retrieval-augmented methods (KG-RAG; Li et al., 2023; Sanmartin, 2024; Jiang et al., 2023) aim to reduce LLMs' reliance on latent knowledge by incorporating explicit, structured information from a KG, and reasoning-on-graphs (RoG) models (Luo et al., 2024) generate relation paths grounded in KGs as faithful paths for the model to follow. Recent works like Tan et al. (2025) also use KG paths to guide reasoning; in contrast, our fs1 approach focuses on distilling and fine-tuning general instruction models on these factually grounded traces, and we provide a broad empirical study of its effect across model scales.
Long Form Factuality.
Factuality in NLP involves multiple challenges (Augenstein et al., 2024), and while prior efforts have established reasoning datasets like SAFE (Wei et al., 2024b) and SimpleQA (Wei et al., 2024a), they often lack explicit grounding in structured knowledge subgraphs. In contrast, Tian et al. (2024) directly address factual accuracy by fine-tuning models on automatically generated preference rankings that prioritize factual consistency.
Test-Time Scaling as a Performance Upper-Bound.
Our evaluation using pass@$k$ is situated within the broader context of test-time scaling, which seeks to improve performance by dedicating more compute at inference. This field encompasses parallel scaling (e.g., Best-of-N), where multiple candidate solutions are generated to increase the probability of finding a correct one (Chollet, 2019; Irvine et al., 2023; Brown et al., 2024a; Li et al., 2022), and sequential scaling, where a single solution is iteratively refined through techniques like chain-of-thought prompting and revision (Wei et al., 2022; Nye et al., 2021; Madaan et al., 2023; Lee et al., 2025; Hou et al., 2025; Huang et al., 2023; Min et al., 2024; Muennighoff et al., 2025; Wang et al., 2024b; Li et al., 2025; Jurayj et al., 2025). While practical applications of parallel scaling depend on a selection mechanism (e.g., majority voting or reward-model-based scoring) to choose the final answer (Wang et al., 2023; Christiano et al., 2017; Lightman et al., 2024; Wang et al., 2024a; Wu et al., 2024; Beeching et al., 2025; Pan et al., 2024; Hassid et al., 2024; Stroebl et al., 2024), the performance of any such method is fundamentally limited by the quality of the underlying generations, often facing diminishing returns (Brown et al., 2024b; Snell et al., 2024; Wu et al., 2024; Levi, 2024). Our work, therefore, focuses on improving the quality of each individual reasoning trace through fine-tuning, thereby directly boosting the upper-bound potential that is measured by pass@$k$.
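As a concrete illustration of the selection mechanisms discussed above, majority voting over $N$ parallel samples can be sketched as follows; the answer normalization here is deliberately simplistic (real QA evaluation must handle entity aliases, dates, etc.):

```python
from collections import Counter

def majority_vote(answers: list[str]) -> str:
    """Pick the most frequent answer among N parallel samples (best-of-N
    with self-consistency-style voting). Ties resolve to the answer seen
    first, since Counter preserves insertion order."""
    normalized = [a.strip().lower() for a in answers]
    return Counter(normalized).most_common(1)[0][0]

print(majority_vote(["Paris", "paris", "Lyon"]))  # prints "paris"
```

Note that such a selector can only surface an answer that some generation already produced, which is why pass@$k$ upper-bounds any best-of-$N$ method.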
Domain-specific Test-Time Scaling.
Test-time scaling also spans specific domains like coding and medicine. Z1-7B optimizes coding tasks through constrained reasoning windows, reducing overthinking while maintaining accuracy (Yu et al., 2025). In medicine, extended reasoning significantly boosts smaller models' clinical QA performance (Huang et al., 2025), complemented by structured datasets like MedReason, which enhance factual reasoning via knowledge-graph-guided paths (Wu et al., 2025), similar to our work.
6 Conclusion
In this work, we investigated whether grounding reasoning traces in knowledge graph paths and training models on them yields tangible gains in factual accuracy on complex open-domain QA tasks. After distilling over 3K original and knowledge-graph-enhanced reasoning traces from models such as QwQ-32B and DeepSeek-R1, we fine-tuned 8 LLMs on rt and fs1 and evaluated them across 6 diverse benchmarks. In short, with parallel sampling, we consistently improve accuracy by 6-14 absolute points over the instruction-tuned counterparts. In particular, on SimpleQA we highlight that CoT and rt perform better on simpler questions (1 or 2 hops required), whereas our fs1-tuned model performs better on more complex questions requiring 3 hops or more. Lastly, we examined the performance of eight fs1-tuned models across different parameter scales, finding that smaller models (below the 1.7B parameter range) show the largest performance increase, while larger models see less pronounced improvements in a pass@1 setting. By releasing all code, models, and reasoning traces, we provide a rich resource for future work on process-level verification and the development of factuality-aware reward models. In turn, we hope this work facilitates more factual large language models, making them more useful for real-world usage.
Limitations.
Our approach assumes that conditioning on KG paths improves the accuracy of reasoning traces, though it does not guarantee correct intermediate processes. Additionally, accurately evaluating entity answers poses challenges; we mitigated this using LLM-based judgments, but these methods have their own inherent limitations. For evaluation, we note that pass@$k$ is an upper-bound performance measure: a practical implementation would require an additional selection mechanism, such as majority voting or a verifier model, to choose the final answer. Lastly, some of the test datasets used may be older and English-only, and we have no control over whether they have been included in any LLM pre- or post-training.
Future Work.
Several future research directions emerge from both the KG and test-time scaling perspectives. One promising avenue is leveraging these reasoning traces to develop process reward models, which are designed for complex reasoning and decision-making tasks where evaluating intermediate (factual reasoning) steps is critical to achieving the desired outcomes; this is a crucial step towards more factual LLMs. This could be combined with KGs: for example, Amayuelas et al. (2025) attempt to ground every generation in a knowledge graph entry, an approach that could be applied during the generation of long reasoning traces.
Ethics Statement
The primary ethical motivation for this research is to enhance the factuality and reliability of LLMs. By reducing the probability that these models generate incorrect information, our work aims to contribute positively to the development of more trustworthy AI systems. We do not foresee any direct negative ethical implications arising from this research. Instead, our goal is to provide a methodology that mitigates existing risks associated with misinformation, thereby promoting a safer and more beneficial application of language technologies.
Reproducibility Statement
We are committed to full reproducibility. All associated artifacts, such as source code, datasets, and pretrained model weights, will be made publicly available via GitHub and the Huggingface Hub upon publication. Further details regarding the computational environment, including hardware specifications, software dependencies, and hyperparameters used for fine-tuning and inference, are documented in Section 3 and Appendix B. We acknowledge that minor variations in numerical results may arise from discrepancies in hardware or software versions; however, we have provided sufficient detail to allow for a faithful replication of our experimental setup.
Acknowledgments
MZ and JB are supported by the research grant (VIL57392) from VILLUM FONDEN. We would like to thank the AAU-NLP group for helpful discussions and feedback on an earlier version of this article. We acknowledge the Danish e-Infrastructure Cooperation for awarding this project access (No. 465001263; DeiC-AAU-N5-2024078 - H2-2024-18) to the LUMI supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CSC (Finland) and the LUMI consortium through DeiC, Denmark.
References
- Amayuelas et al. (2025) Alfonso Amayuelas, Joy Sain, Simerjot Kaur, and Charese Smiley. Grounding llm reasoning with knowledge graphs, 2025. URL https://arxiv.org/abs/2502.13247.
- Augenstein et al. (2024) Isabelle Augenstein, Timothy Baldwin, Meeyoung Cha, Tanmoy Chakraborty, Giovanni Luca Ciampaglia, David Corney, Renee DiResta, Emilio Ferrara, Scott Hale, Alon Halevy, et al. Factuality challenges in the era of large language models and opportunities for fact-checking. Nature Machine Intelligence, 6(8):852-863, 2024.
- Beeching et al. (2025) Edward Beeching, Lewis Tunstall, and Sasha Rush. Scaling test-time compute with open models, 2025. URL https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute.
- Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD '08, pp. 1247-1250, New York, NY, USA, 2008. Association for Computing Machinery. ISBN 9781605581026. doi: 10.1145/1376616.1376746. URL https://doi.org/10.1145/1376616.1376746.
- Brown et al. (2024a) Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V Le, Christopher RΓ©, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. ArXiv preprint, abs/2407.21787, 2024a. URL https://arxiv.org/abs/2407.21787.
- Brown et al. (2024b) Bradley Brown, Jordan Juravsky, Ryan Ehrlich, Ronald Clark, Quoc V. Le, Christopher RΓ©, and Azalia Mirhoseini. Large language monkeys: Scaling inference compute with repeated sampling. ArXiv preprint, abs/2407.21787, 2024b. URL https://arxiv.org/abs/2407.21787.
- Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. ArXiv preprint, abs/2107.03374, 2021. URL https://arxiv.org/abs/2107.03374.
- Chollet (2019) François Chollet. On the measure of intelligence. ArXiv preprint, abs/1911.01547, 2019. URL https://arxiv.org/abs/1911.01547.
- Christiano et al. (2017) Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 4299β4307, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e49-Abstract.html.
- DeepSeek-AI (2025) DeepSeek-AI. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. ArXiv preprint, abs/2501.12948, 2025. URL https://arxiv.org/abs/2501.12948.
- Gu et al. (2021) Yu Gu, Sue Kase, Michelle Vanni, Brian M. Sadler, Percy Liang, Xifeng Yan, and Yu Su. Beyond I.I.D.: three levels of generalization for question answering on knowledge bases. In Jure Leskovec, Marko Grobelnik, Marc Najork, Jie Tang, and Leila Zia (eds.), WWW β21: The Web Conference 2021, Virtual Event / Ljubljana, Slovenia, April 19-23, 2021, pp. 3477β3488. ACM / IW3C2, 2021. doi: 10.1145/3442381.3449992. URL https://doi.org/10.1145/3442381.3449992.
- Hassid et al. (2024) Michael Hassid, Tal Remez, Jonas Gehring, Roy Schwartz, and Yossi Adi. The larger the better? improved llm code-generation via budget reallocation. ArXiv preprint, abs/2404.00725, 2024. URL https://arxiv.org/abs/2404.00725.
- Hou et al. (2025) Zhenyu Hou, Xin Lv, Rui Lu, Jiajie Zhang, Yujiang Li, Zijun Yao, Juanzi Li, Jie Tang, and Yuxiao Dong. Advancing language model reasoning through reinforcement learning and inference scaling. ArXiv preprint, abs/2501.11651, 2025. URL https://arxiv.org/abs/2501.11651.
- Huang et al. (2023) Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, and Jiawei Han. Large language models can self-improve. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 1051β1068, Singapore, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.67. URL https://aclanthology.org/2023.emnlp-main.67.
- Huang et al. (2025) Xiaoke Huang, Juncheng Wu, Hui Liu, Xianfeng Tang, and Yuyin Zhou. m1: Unleash the potential of test-time scaling for medical reasoning with large language models, 2025. URL https://arxiv.org/abs/2504.00869.
- Irvine et al. (2023) Robert Irvine, Douglas Boubert, Vyas Raina, Adian Liusie, Ziyi Zhu, Vineet Mudupalli, Aliaksei Korshuk, Zongyi Liu, Fritz Cremer, Valentin Assassi, Christie-Carol Beauchamp, Xiaoding Lu, Thomas Rialan, and William Beauchamp. Rewarding chatbots for real-world engagement with millions of users. ArXiv preprint, abs/2303.06135, 2023. URL https://arxiv.org/abs/2303.06135.
- Jia et al. (2021) Zhen Jia, Soumajit Pramanik, Rishiraj Saha Roy, and Gerhard Weikum. Complex temporal question answering on knowledge graphs. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, CIKM β21, pp. 792β802, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450384469. doi: 10.1145/3459637.3482416. URL https://doi.org/10.1145/3459637.3482416.
- Jiang et al. (2023) Jinhao Jiang, Kun Zhou, Xin Zhao, Yaliang Li, and Ji-Rong Wen. ReasoningLM: Enabling structural subgraph reasoning in pre-trained language models for question answering over knowledge graph. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 3721β3735, Singapore, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.228. URL https://aclanthology.org/2023.emnlp-main.228.
- Jurayj et al. (2025) William Jurayj, Jeffrey Cheng, and Benjamin Van Durme. Is that your final answer? test-time scaling improves selective question answering. ArXiv preprint, abs/2502.13962, 2025. URL https://arxiv.org/abs/2502.13962.
- Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html.
- Kwon et al. (2023) Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.
- Lacoste et al. (2019) Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. Quantifying the carbon emissions of machine learning. ArXiv preprint, abs/1910.09700, 2019. URL https://arxiv.org/abs/1910.09700.
- Lan & Jiang (2020) Yunshi Lan and Jing Jiang. Query graph generation for answering multi-hop complex questions from knowledge bases. In Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 969β974, Online, 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.91. URL https://aclanthology.org/2020.acl-main.91.
- Lavrinovics et al. (2025) Ernests Lavrinovics, Russa Biswas, Katja Hose, and Johannes Bjerva. Multihal: Multilingual dataset for knowledge-graph grounded evaluation of llm hallucinations, 2025. URL https://arxiv.org/abs/2505.14101.
- Lee et al. (2025) Kuang-Huei Lee, Ian Fischer, Yueh-Hua Wu, Dave Marwood, Shumeet Baluja, Dale Schuurmans, and Xinyun Chen. Evolving deeper llm thinking. ArXiv preprint, abs/2501.09891, 2025. URL https://arxiv.org/abs/2501.09891.
- Levi (2024) Noam Levi. A simple model of inference scaling laws. ArXiv preprint, abs/2410.16377, 2024. URL https://arxiv.org/abs/2410.16377.
- Li et al. (2025) Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G. Patil, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Llms can easily learn to reason from demonstrations: structure, not content, is what matters! ArXiv preprint, abs/2502.07374, 2025. URL https://arxiv.org/abs/2502.07374.
- Li et al. (2023) Shiyang Li, Yifan Gao, Haoming Jiang, Qingyu Yin, Zheng Li, Xifeng Yan, Chao Zhang, and Bing Yin. Graph reasoning for question answering with triplet retrieval. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Findings of the Association for Computational Linguistics: ACL 2023, pp. 3366β3375, Toronto, Canada, 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-acl.208. URL https://aclanthology.org/2023.findings-acl.208.
- Li et al. (2022) Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, RΓ©mi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Gabriel Synnaeve, David Imrie, Peter Ying, Peter Battaglia, and Oriol Vinyals. Competition-level code generation with alphacode. ArXiv preprint, abs/2203.07814, 2022. URL https://arxiv.org/abs/2203.07814.
- Lightman et al. (2024) Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Letβs verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi.
- Luo et al. (2024) Linhao Luo, Yuan-Fang Li, Gholamreza Haffari, and Shirui Pan. Reasoning on graphs: Faithful and interpretable large language model reasoning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=ZGNWW7xZ6Q.
- Ma et al. (2025) Wenjie Ma, Jingxuan He, Charlie Snell, Tyler Griggs, Sewon Min, and Matei Zaharia. Reasoning models can be effective without thinking, 2025. URL https://arxiv.org/abs/2504.09858.
- Madaan et al. (2023) Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html.
- Min et al. (2024) Yingqian Min, Zhipeng Chen, Jinhao Jiang, Jie Chen, Jia Deng, Yiwen Hu, Yiru Tang, Jiapeng Wang, Xiaoxue Cheng, Huatong Song, Feng Ji, Qi Zhang, and Fuli Luo. Imitate, explore, and self-improve: A reproduction report on slow-thinking reasoning systems. ArXiv preprint, abs/2412.09413, 2024. URL https://arxiv.org/abs/2412.09413.
- Muennighoff et al. (2025) Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling. ArXiv preprint, abs/2501.19393, 2025. URL https://arxiv.org/abs/2501.19393.
- Nye et al. (2021) Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, Wojciech Zaremba, Illya Sutskever, Charles Sutton, Martin Wattenberg, Michael Terry, Jeff Dean, Douglas Eck, Bastien Potier, and Ethan Dyer. Show your work: Scratchpads for intermediate computation with language models. ArXiv preprint, abs/2112.00114, 2021. URL https://arxiv.org/abs/2112.00114.
- OpenAI (2025) OpenAI. OpenAI o3-mini: Pushing the frontier of cost-effective reasoning, 2025. URL https://openai.com/index/openai-o3-mini/.
- Pan et al. (2024) Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, and Yizhe Zhang. Training software engineering agents and verifiers with swe-gym. ArXiv preprint, abs/2412.21139, 2024. URL https://arxiv.org/abs/2412.21139.
- Qwen Team (2024) Qwen Team. Qwen2.5: A party of foundation models, 2024. URL https://qwenlm.github.io/blog/qwen2.5/.
- Qwen Team (2025) Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, 2025. URL https://qwenlm.github.io/blog/qwq-32b/.
- Reimers & Gurevych (2019) Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Kentaro Inui, Jing Jiang, Vincent Ng, and Xiaojun Wan (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982β3992, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1410. URL https://aclanthology.org/D19-1410/.
- Sanmartin (2024) Diego Sanmartin. Kg-rag: Bridging the gap between knowledge and creativity. ArXiv preprint, abs/2405.12035, 2024. URL https://arxiv.org/abs/2405.12035.
- Sen et al. (2022) Priyanka Sen, Alham Fikri Aji, and Amir Saffari. Mintaka: A complex, natural, and multilingual dataset for end-to-end question answering. In Nicoletta Calzolari, Chu-Ren Huang, Hansaem Kim, James Pustejovsky, Leo Wanner, Key-Sun Choi, Pum-Mo Ryu, Hsin-Hsi Chen, Lucia Donatelli, Heng Ji, Sadao Kurohashi, Patrizia Paggio, Nianwen Xue, Seokhwan Kim, Younggyun Hahm, Zhong He, Tony Kyungil Lee, Enrico Santus, Francis Bond, and Seung-Hoon Na (eds.), Proceedings of the 29th International Conference on Computational Linguistics, pp. 1604β1619, Gyeongju, Republic of Korea, October 2022. International Committee on Computational Linguistics. URL https://aclanthology.org/2022.coling-1.138/.
- Snell et al. (2024) Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. ArXiv preprint, abs/2408.03314, 2024. URL https://arxiv.org/abs/2408.03314.
- Stroebl et al. (2024) Benedikt Stroebl, Sayash Kapoor, and Arvind Narayanan. Inference scaling flaws: The limits of llm resampling with imperfect verifiers. ArXiv preprint, abs/2411.17501, 2024. URL https://arxiv.org/abs/2411.17501.
- Talmor & Berant (2018) Alon Talmor and Jonathan Berant. The web as a knowledge-base for answering complex questions. In Marilyn Walker, Heng Ji, and Amanda Stent (eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 641β651, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-1059. URL https://aclanthology.org/N18-1059/.
- Tan et al. (2025) Xingyu Tan, Xiaoyang Wang, Qing Liu, Xiwei Xu, Xin Yuan, and Wenjie Zhang. Paths-over-graph: Knowledge graph empowered large language model reasoning. In Proceedings of the ACM on Web Conference 2025, pp. 3505β3522, 2025.
- Tian et al. (2024) Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, and Chelsea Finn. Fine-tuning language models for factuality. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=WPZ2yPag4K.
- Wang et al. (2024a) Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 9426β9439, 2024a.
- Wang et al. (2023) Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=1PL1NIMMrw.
- Wang et al. (2024b) Yifei Wang, Yuyang Wu, Zeming Wei, Stefanie Jegelka, and Yisen Wang. A theoretical understanding of self-correction through in-context alignment. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024b. URL http://papers.nips.cc/paper_files/paper/2024/hash/a399456a191ca36c7c78dff367887f0a-Abstract-Conference.html.
- Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html.
- Wei et al. (2024a) Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models. ArXiv preprint, abs/2411.04368, 2024a. URL https://arxiv.org/abs/2411.04368.
- Wei et al. (2024b) Jerry Wei, Chengrun Yang, Xinying Song, Yifeng Lu, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, and Quoc V. Le. Long-form factuality in large language models. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024b. URL http://papers.nips.cc/paper_files/paper/2024/hash/937ae0e83eb08d2cb8627fe1def8c751-Abstract-Conference.html.
- Wu et al. (2025) Juncheng Wu, Wenlong Deng, Xingxuan Li, Sheng Liu, Taomian Mi, Yifan Peng, Ziyang Xu, Yi Liu, Hyunjin Cho, Chang-In Choi, Yihan Cao, Hui Ren, Xiang Li, Xiaoxiao Li, and Yuyin Zhou. Medreason: Eliciting factual medical reasoning steps in llms via knowledge graphs, 2025. URL https://arxiv.org/abs/2504.00993.
- Wu et al. (2024) Yangzhen Wu, Zhiqing Sun, Shanda Li, Sean Welleck, and Yiming Yang. Inference scaling laws: An empirical analysis of compute-optimal inference for problem-solving with language models. ArXiv preprint, abs/2408.00724, 2024. URL https://arxiv.org/abs/2408.00724.
- Ye et al. (2022) Xi Ye, Semih Yavuz, Kazuma Hashimoto, Yingbo Zhou, and Caiming Xiong. RNG-KBQA: Generation augmented iterative ranking for knowledge base question answering. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 6032β6043, Dublin, Ireland, 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.417. URL https://aclanthology.org/2022.acl-long.417.
- Yih et al. (2016) Wen-tau Yih, Matthew Richardson, Chris Meek, Ming-Wei Chang, and Jina Suh. The value of semantic parse labeling for knowledge base question answering. In Katrin Erk and Noah A. Smith (eds.), Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 201β206, Berlin, Germany, August 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-2033. URL https://aclanthology.org/P16-2033/.
- Yu et al. (2025) Zhaojian Yu, Yinghao Wu, Yilun Zhao, Arman Cohan, and Xiao-Ping Zhang. Z1: Efficient test-time scaling with code, 2025. URL https://arxiv.org/abs/2504.00810.
- Zhang et al. (2025) Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Zhihan Guo, Yufei Wang, Irwin King, Xue Liu, and Chen Ma. What, how, where, and how well? a survey on test-time scaling in large language models, 2025. URL https://arxiv.org/abs/2503.24235.
Appendix A Large Language Model Use
We made use of LLMs to polish our writing and to plot our figures.
Appendix B Training and Inference
For running DeepSeek-R1, o3-mini, and some LLM-as-a-Judge experiments with gpt-4o-mini, we use API-based solutions via OpenAI (https://platform.openai.com/) or TogetherAI (https://api.together.ai/). Running inference on all data cost around 250 USD. For fine-tuning and running inference with the local models, we make use of a large HPC cluster with hardware configurations comprising multiple nodes (depending on model size; e.g., 32B models require 4 nodes for training and 1 node for inference), where each node contains eight AMD MI250x GPU modules alongside a single 64-core AMD EPYC "Trento" CPU. We use the vLLM library (Kwon et al., 2023) for inference. In total, the experiments required around 6,500 GPU hours.
B.1 Environmental Impact
We acknowledge that conducting a large-scale analysis using LLMs comes with an environmental impact. Experiments were conducted using private infrastructure in [Redacted] running on green energy. A cumulative 6,500 GPU hours of computation was performed on AMD MI250x GPU modules, each with a TDP of 500 Watts. The experiments were run from February to September 2025. During this time, the average carbon intensity in [Redacted] was 0.085 kg/kWh according to https://app.electricitymaps.com/map. This means we released about 276 kg of CO2 equivalent. Estimates were made using the Machine Learning Impact calculator (https://mlco2.github.io/impact) presented in Lacoste et al. (2019).
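The emissions figure follows directly from energy (GPU hours times per-module power draw) multiplied by grid carbon intensity; a quick back-of-the-envelope check of the arithmetic (a sketch of the calculation, not the calculator's exact methodology):

```python
# Reproduce the CO2 estimate from the reported numbers:
# 6,500 GPU hours on MI250x modules (500 W TDP each),
# grid carbon intensity 0.085 kg CO2-eq per kWh.

GPU_HOURS = 6_500
TDP_KW = 0.5                 # 500 W per MI250x module
CARBON_INTENSITY = 0.085     # kg CO2-eq / kWh

energy_kwh = GPU_HOURS * TDP_KW            # 3,250 kWh of energy consumed
co2_kg = energy_kwh * CARBON_INTENSITY     # emissions in kg CO2-eq
print(round(co2_kg, 2))                    # 276.25
```

Rounded, this matches the roughly 276 kg of CO2 equivalent reported above.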
Appendix C Example Reasoning Traces
We show four full examples: Figure 8 and Figure 9 for rt, and Figure 10 and Figure 11 for fs1. For visualization purposes, we mark the specific parts with special tags, <question>Q</question>, <think>T</think>, and <gold>G</gold>, to indicate what each part refers to.

Deepseek-R1 Example (rt)
<question>In Family Guy, Seth Macfarlane plays the parent of Chris Griffin. What is his character's name?</question> <think>Okay, so I need to figure out the name of the character Seth MacFarlane voices who is the parent of Chris Griffin in Family Guy. Let me start by recalling what I know about the show.

First, Family Guy is an animated sitcom created by Seth MacFarlane. He's known for voicing several characters in the show. The main family is the Griffins. The children in the family are Chris, Meg, and Stewie, and there's also the talking dog Brian. The parents would be the father and mother of these kids.

I remember that the father is Peter Griffin, who is the bumbling, overweight patriarch of the family. Seth MacFarlane does voice Peter Griffin, along with other characters like Stewie and Brian. The mother is Lois Griffin, but I think she's voiced by Alex Borstein, not Seth MacFarlane. So, since the question is about the parent of Chris Griffin that Seth MacFarlane plays, it must be Peter Griffin.

Wait, let me double-check. Chris is the son, so his parents are Peter and Lois. Since Seth MacFarlane voices Peter, that's the answer. But just to be thorough, are there any other parents in the show that Seth voices? For example, maybe a grandparent? But Chris's grandparents would be Peter's parents or Lois's parents. I think Peter's father is named Francis Griffin, but I don't recall if Seth voices him. Lois's parents are Carter and Babs Pewterschmidt, and they are voiced by other actors. So no, the main parent characters voiced by Seth would be Peter.

Also, Stewie is the baby, and Brian is the dog, both voiced by Seth, but they aren't parents. So yes, the answer should be Peter Griffin.</think> The character voiced by Seth MacFarlane who is the parent of Chris Griffin in *Family Guy* is \boxed{Peter Griffin}. <gold>["Peter Griffin", "Peter Lowenbrou Griffin"]</gold>
Figure 8: Example Instance (rt). A successful reasoning trace for a simple multi-hop question by Deepseek-R1.

QwQ-32B Example (rt)
<question>What Eastern Europe main religion worshipers also attend St. Mary's Cathedral, in Batticaloa, Sri Lanka?</question> <think>Okay, so the user is asking about the main religion worshippers in Eastern Europe who attend St. Mary's Cathedral in Batticaloa, Sri Lanka. Hmm, first I need to break this down. Let me start by recalling that Batticaloa is a city in Sri Lanka, which is in South Asia. The cathedral mentioned is a Christian church, specifically a Catholic cathedral since it's named St. Mary's.

Now, the question is about Eastern Europeans who worship there. Wait, Eastern Europe's main religions are predominantly Christian, like Eastern Orthodox or Roman Catholic in some countries. But Sri Lanka's main religions are Buddhism, Hinduism, Islam, and then there's a Christian minority. So why would Eastern Europeans be attending a cathedral there? Maybe there's a community of Eastern Europeans living or working in Batticaloa? Or perhaps it's a question about the cathedral's affiliation?

Alternatively, maybe there's confusion in the question. St. Mary's Cathedral in Batticaloa is likely part of the Roman Catholic Church. Eastern Europeans who are Catholic, like from countries such as Poland, Lithuania, or parts of Croatia, might attend. But Eastern Orthodox Christians would typically go to their own churches. Since the cathedral is Catholic, the main worshippers there would be local Catholics, but the question specifies Eastern Europeans.

Wait, perhaps the user is thinking of the term "Eastern Europe" in a different context. Maybe they're referring to the Eastern Catholic Churches, which are in communion with Rome but have Eastern rites. For example, Ukrainian Catholics or Ruthenian Catholics. But would they be in Sri Lanka? Maybe there's a small expat community or missionary group. Alternatively, maybe the question is mixing up terms, and the answer is Roman Catholic, as that's the cathedral's denomination, and Eastern Europeans of that faith would attend.

Alternatively, could there be a confusion with Batticaloa's demographics? The area has a significant Tamil population, but Tamils are mostly Hindu or Christian. The cathedral serves the local Catholic community, which includes Sri Lankan Tamils and others. But the question specifically asks about Eastern Europeans. Maybe the answer is Roman Catholics from Eastern Europe, but I need to confirm if there's a specific group.

Alternatively, perhaps the question is a trick, and the main religion in Eastern Europe (like Orthodox) isn't the one at the cathedral, so the answer is Roman Catholics. Since the cathedral is Catholic, Eastern European Catholics would be the ones attending. So the main religion worshippers there from Eastern Europe would be Roman Catholics.

I think that's the most logical answer here. The cathedral is Catholic, so Eastern Europeans who are Roman Catholic would attend. The main religions in Eastern Europe include Orthodox and Catholic, so the Catholic ones from there would be the answer.</think> \boxed{Roman Catholics}. <gold>["Catholicism", "Roman Catholic convert", "Roman Catholic"]</gold>
Figure 9: Example Instance (rt). A successful reasoning trace for a simple multi-hop question by QwQ-32B.

QwQ-32B Example (fs1)
<question>What geographic region is the country that used the Ottoman Lira used? While answering the question, make use of the following linearised graph as an inspiration in your reasoning, not as the only answer: Konstantin Mikhailovich Bazili, place of birth, Istanbul Istanbul, work location, Near East. Put your final answer within \boxed{}.</question> <think>Okay, let's tackle this question step by step. The user is asking about the geographic region of the country that used the Ottoman Lira. First, I need to recall what the Ottoman Lira was. From what I remember, the lira was the currency used in the Ottoman Empire. So the country in question here is the Ottoman Empire. But since the Ottoman Empire is a historical entity and not a current country, the region would refer to where the empire was located. Now, the Ottoman Empire was a large and extensive empire that existed from the 14th century until after World War I. Its heartland was in Anatolia (modern-day Turkey) and Southeast Europe. The capital was Constantinople, which is now known as Istanbul, Turkey. The empire also controlled much of the Balkans, parts of North Africa, and the Middle East. … (cutoff) … Wait, but the question is about the geographic region of the country that used the Ottoman Lira. The country is the Ottoman Empire. The geographic region that it is in is the Near East. So combining the clues provided (Istanbul being part of the Near East), the answer would be the Near East. Alternatively, if they want the specific country now, like Turkey, but since the Ottoman Empire was not Turkey, but Turkey is the successor state. The question is a bit ambiguous. However, given the clues pointing towards Near East as the work location linked to Istanbul, I think the intended answer is the Near East region. So the final answer would be Near East.</think> The Ottoman Lira was the currency of the Ottoman Empire, which was centered in the geographic region known historically as the Near East. This region included the empire's heartland in Anatolia (modern-day Turkey), Istanbul (then Constantinople), parts of Southeast Europe, and the Middle East. The provided clues about Istanbul, the former capital, and its association with the Near East further reinforce this connection. \boxed{Near East}. <gold>["Near East"]</gold>
Figure 10: Example Instance (fs1). A successful reasoning trace for a question by QwQ-32B.

Deepseek-R1 Example (fs1)
<question>What part of the world map is the Government of the Republic of Seychelles located in? While answering the question, make use of the following linearised graph as an inspiration in your reasoning, not as the only answer: Seychellois Creole, country, Seychelles Seychelles, continent, Africa Seychellois Creole, indigenous to, Seychelles Seychelles, continent, Africa. Put your final answer within \boxed{}.</question> <think>Okay, so I need to figure out where the Government of the Republic of Seychelles is located on the world map. Let me start by recalling what I know about Seychelles. I think Seychelles is a country made up of islands. But where exactly? The user provided a linearised graph that mentions Seychelles is in the continent of Africa. Hmm, that's interesting because when I think of Africa, I usually picture the mainland, but there are also island countries associated with Africa. Let me verify that. Seychelles is an archipelago in the Indian Ocean, right? So it's a group of islands northeast of Madagascar. Since Madagascar itself is off the southeastern coast of Africa, Seychelles being northeast of there would place it in the Indian Ocean, near the African continent. The graph also mentions Seychellois Creole being indigenous to Seychelles, which makes sense as the local language. So, putting this together, the Government of the Republic of Seychelles is located in the country of Seychelles, which is part of the African continent. Even though it's an island nation, geographically and politically, it's considered part of Africa. Therefore, the answer should be that the government is located in Seychelles, which is in Africa.</think> The Government of the Republic of Seychelles is located in the country of Seychelles. According to the provided graph, Seychelles is part of the continent of Africa. Although geographically situated in the Indian Ocean northeast of Madagascar, it is politically and regionally classified as part of Africa. \boxed{Africa} <gold>["Africa", "Afrique"]</gold>
Figure 11: Example Instance (fs1). A successful reasoning trace for a question by Deepseek-R1.
Appendix D Test Benchmark
D.1 Examples Test Benchmark
In Figure 12, we show, for each dataset in the test benchmark, an example of what the questions and answers look like.

Benchmark Test Examples
<cwq>Who was the president in 2012 of the country where Unidad de Valor Constante is the chosen currency? Answers: ["Rafael Correa", "Rafael Correa Delgado", "Rafael Vicente Correa Delgado"]</cwq> <exaqt>what nba team won the back to back championship in 1989 and 1990? Answer: ["Detroit Pistons"]</exaqt> <grailqa>lonnie wheeler contributed to a book edition published by what publisher? Answer: ["Simon & Schuster"]</grailqa> <simpleqa>Who received the IEEE Frank Rosenblatt Award in 2010? Answer: ["Michio Sugeno"]</simpleqa> <mintaka>How many books are in Goosebumps? Answer: ["235"]</mintaka> <webqsp>where did diego velazquez die? Answer: ["Madrid"]</webqsp>
Figure 12: Test Examples. For each dataset in the benchmark, we show an example.
Appendix E SPARQL queries
The query format to retrieve the Wikidata entities for Freebase entities is given by
SELECT ?wikientity
WHERE {
  ?wikientity wdt:P646 "$FREEBASE_ENTITY" .
}
The general structure of the SPARQL queries for 2-hop paths between the source and target entities are given by
SELECT ?p1 ?p1Label ?o1 ?o1Label ?p2 ?p2Label
WHERE
{ wd:$SOURCE_ENTITY ?p1 ?o1 .
  ?o1 ?p2 wd:$TARGET_ENTITY .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". } }
SELECT ?o1 ?o1Label ?p1 ?p1Label ?p2 ?p2Label
WHERE
{ ?o1 ?p1 wd:$SOURCE_ENTITY .
  ?o1 ?p2 wd:$TARGET_ENTITY .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],mul,en". } }
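The paths retrieved by these queries are verbalised into the comma-separated triple format shown in Figure 11. The following is a minimal sketch of such a linearisation step, assuming triples arrive as (subject, predicate, object) tuples; the function name is illustrative and not taken from the paper's codebase.

```python
# Illustrative sketch: verbalise KG triples into the linearised prompt
# format shown in Figure 11 (not the authors' actual code).
def linearise(triples):
    """Join (subject, predicate, object) tuples as 'S, P, O' segments."""
    return " ".join(f"{s}, {p}, {o}" for s, p, o in triples)


path = [
    ("Seychellois Creole", "country", "Seychelles"),
    ("Seychelles", "continent", "Africa"),
]
linearised = linearise(path)
# "Seychellois Creole, country, Seychelles Seychelles, continent, Africa"
```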
Appendix F Absolute Performance Results for Analysis
We show the absolute performance numbers from Figure 7 at pass@16 in Figure 13.
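With 16 parallel samples, pass@16 records whether at least one sample is correct. For the general pass@k metric, a common choice is the unbiased estimator of Chen et al. (2021); the sketch below assumes that convention, which may differ from the paper's exact implementation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations (c of them correct) passes."""
    if n - c < k:
        return 1.0  # every size-k draw must include a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# With n = k = 16, pass@16 is 1.0 whenever any of the 16 samples is correct.
```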
[Figure x8.png: bar chart "Absolute Performance by Hops"; x-axis: hops (1, 2, 3, 3+); y-axis: pass@16; one bar per method (inst, cot, rt, fs1).]
(a) Performance by number of hops required to answer the question, measured in pass@16.
[Figure x9.png: bar chart "Absolute Performance by Answer Type"; x-axis: answer type (date, number, other, person, place); y-axis: pass@16; one bar per method (inst, cot, rt, fs1).]
(b) Performance by answer type (i.e., what type of entity the answer is), measured in pass@16.
[Figure x10.png: bar chart "Absolute Performance by Domain"; x-axis: domain (art, geography, history, music, other, politics, sci & tech, sports, tv shows, video games); y-axis: pass@16; one bar per method (inst, cot, rt, fs1).]
(c) Performance per domain, measured in pass@16.
Figure 13: Absolute Performance across Different Axes. We show the absolute performance at pass@16 of different versions of Qwen-32B (i.e., instruct, CoT, rt, and fs1). In (a), we show the performance of the models by the number of hops required to answer the question. In (b), we show the performance of the models by answer type. In (c), we show the performance by domain of the question.