# Inference Scaling for Long-Context Retrieval Augmented Generation
**Authors**: Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, Michael Bendersky
> University of Illinois Urbana-Champaign · Google DeepMind · University of Massachusetts Amherst
> (Equal contribution; work done while at Google DeepMind.)

Corresponding authors: zhenrui3@illinois.edu, hlz@google.com
Abstract
The scaling of inference computation has unlocked the potential of long-context large language models (LLMs) across diverse settings. For knowledge-intensive tasks, the increased compute is often allocated to incorporate more external knowledge. However, without effectively utilizing such knowledge, solely expanding context does not always enhance performance. In this work, we investigate inference scaling for retrieval augmented generation (RAG), exploring the combination of multiple strategies beyond simply increasing the quantity of knowledge, including in-context learning and iterative prompting. These strategies provide additional flexibility to scale test-time computation (e.g., by increasing retrieved documents or generation steps), thereby enhancing LLMs’ ability to effectively acquire and utilize contextual information. We address two key questions: (1) How does RAG performance benefit from the scaling of inference computation when optimally configured? (2) Can we predict the optimal test-time compute allocation for a given budget by modeling the relationship between RAG performance and inference parameters? Our observations reveal that increasing inference computation leads to nearly linear gains in RAG performance when optimally allocated, a relationship we describe as the inference scaling laws for RAG. Building on this, we further develop the computation allocation model to estimate RAG performance across different inference configurations. The model predicts optimal inference parameters under various computation constraints, which align closely with the experimental results. By applying these optimal configurations, we demonstrate that scaling inference compute on long-context LLMs achieves up to 58.9% gains on benchmark datasets compared to standard RAG.
keywords: Inference scaling, Retrieval augmented generation, Long-context LLMs
1 Introduction
Long-context large language models (LLMs) are designed to handle extended input sequences, enabling them to process and understand longer context (e.g., Gemini 1.5 Pro with up to 2M tokens) (Achiam et al., 2023; Team et al., 2023; Reid et al., 2024). Combined with increased inference computation, long-context LLMs demonstrate improved performance across various downstream tasks (Agarwal et al., 2024; Snell et al., 2024). For example, many-shot in-context learning (ICL) can match the performance of supervised fine-tuning by providing extensive in-context examples (Bertsch et al., 2024). Particularly for knowledge-intensive tasks that leverage retrieval augmented generation (RAG), increasing the quantity or size of retrieved documents up to a certain threshold consistently enhances performance (Ram et al., 2023; Xu et al., 2024; Jiang et al., 2024).
Figure 1: Normalized performance vs. effective context length on MuSiQue. Each line represents a fixed configuration, scaled by adjusting the number of documents. Red dots and dashed lines represent the optimal configurations and their fitting results. Standard RAG plateaus early at $10^{4}$ tokens; in contrast, DRAG and IterDRAG show near-linear improvement as the effective context length grows.
Previous studies on inference scaling for RAG focus on expanding the retrieved knowledge by increasing the number or length of retrieved documents (Xu et al., 2024; Jiang et al., 2024; Shao et al., 2024). However, emphasizing knowledge quantity alone, without further guidance, has clear limitations. On one hand, current long-context LLMs still have limited ability to locate relevant information in ultra-long sequences on challenging tasks (Li et al., 2024; Kuratov et al., 2024); for instance, the optimal performance of long-context LLMs is often achieved without fully utilizing the maximum context length (Agarwal et al., 2024). On the other hand, numerous studies show that retrieving beyond soft thresholds (e.g., top-10 documents) leads to a performance plateau and may even cause declines (Ram et al., 2023; Lee et al., 2024a; Kuratov et al., 2024). Such performance drops can be traced back to the increased noise within the context, which distracts the model and adversely affects generation (Yoran et al., 2024; Zhang et al., 2024; Leng et al., 2024). As a result, inference scaling of long-context RAG remains challenging for existing methods.
In this work, we leverage a broader range of strategies to comprehensively explore how RAG benefits from the scaling of inference computation. A straightforward strategy is demonstration-based RAG (DRAG), where multiple RAG examples are provided as demonstrations to utilize the long-context capabilities of LLMs (Brown et al., 2020). DRAG allows models to learn in-context how to locate relevant information and apply it to response generation. (Different from in-context RAG, which prepends documents / QA examples (Press et al., 2023; Ram et al., 2023), we leverage multiple examples comprising documents, questions and answers to demonstrate the task.) Nevertheless, the quality of one-step retrieval varies across tasks and often fails to provide sufficient information. Inspired by iterative methods (Trivedi et al., 2023; Yoran et al., 2024), we develop iterative demonstration-based RAG (IterDRAG). IterDRAG learns to decompose input queries into simpler sub-queries and answer them using interleaved retrieval. By iteratively retrieving and generating upon sub-queries, LLMs construct reasoning chains that bridge the compositionality gap for multi-hop queries. Together, these strategies provide additional flexibility in scaling inference computation for RAG, allowing long-context LLMs to more effectively address complex knowledge-intensive queries.
Building on these strategies, we investigate multiple ways to scale up inference computation. We measure computation by the total number of input tokens across all iterations, referred to as the effective context length. In DRAG, the effective context length can be scaled by increasing two inference parameters: the number of retrieved documents and the number of in-context examples. In IterDRAG, test-time compute can be extended further by introducing additional generation steps. Since different combinations of inference parameters result in different allocations of computational resources, our goal is to establish the relationship between RAG performance and the scale and allocation of inference computation. Through extensive experiments on benchmark QA datasets, we demonstrate an almost linear relationship between RAG performance and the scale of effective context length when combining both RAG strategies, as shown in Figure 1 (right). Moreover, our RAG strategies outperform merely scaling the number of documents, achieving state-of-the-art performance with the compact Gemini 1.5 Flash (see evaluation in Figure 2).
Drawing from our observations, we examine the relationship between RAG performance and inference computation, which we characterize as the inference scaling laws for RAG. These observed scaling laws reveal that RAG performance consistently improves as the effective context length expands under optimal configurations. Consequently, we take a deeper dive into modeling RAG performance with respect to different allocations of inference computation. Our goal is to predict the optimal set of inference parameters that maximizes performance across different RAG tasks. To achieve this, we quantitatively model the relationship between RAG performance and varying inference configurations with the computation allocation model for RAG. Using the estimated computation allocation model, the optimal configurations can be determined empirically and generalize well across scenarios, thereby maximizing utilization of the computation budget. We summarize our contributions as follows:
- We systematically investigate inference scaling for long-context RAG, for which we introduce two scaling strategies, DRAG and IterDRAG, to effectively scale inference compute.
- We comprehensively evaluate DRAG and IterDRAG, where they not only achieve state-of-the-art performance, but also exhibit superior scaling properties compared to solely increasing the quantity of documents.
- Through extensive experiments on benchmark QA datasets, we demonstrate that when test-time compute is optimally allocated, long-context RAG performance can scale almost linearly with the increasing order of magnitude of the computation budget.
- We quantitatively model the relationship between RAG performance and different inference parameters, deriving the computation allocation model. This model aligns closely with our experimental results and generalizes well across scenarios, providing practical guidance for optimal computation allocation in long-context RAG.
Figure 2: Evaluation accuracy of Gemini 1.5 Flash using different methods: zero-shot QA, many-shot QA, RAG (with an optimal number of documents), DRAG and IterDRAG on benchmark QA datasets. By scaling up inference compute (up to 5M tokens), DRAG consistently outperforms baselines, while IterDRAG improves upon DRAG through interleaving retrieval and iterative generation.
2 Related Work
2.1 Long-Context LLMs
Long-context large language models (LLMs) are designed to utilize extensive context and thereby improve their generative capabilities. Early works in extending context lengths involve sparse / low-rank kernels to reduce memory requirements (Kitaev et al., 2019; Beltagy et al., 2020; Zaheer et al., 2020; Choromanski et al., 2020). In addition, recurrent and state space models (SSMs) are proposed as efficient substitutes for transformer-based models (Gu et al., 2021; Gu and Dao, 2023; Peng et al., 2023a; Beck et al., 2024). For causal LLMs, extrapolation and interpolation methods have proven effective in expanding context window lengths (Press et al., 2021; Chen et al., 2023; Sun et al., 2023; Peng et al., 2023b). Recent advancements in efficient attention methods (Dao et al., 2022; Jacobs et al., 2023; Liu et al., 2023) further enable LLMs to train and infer upon input sequences comprising millions of tokens (Achiam et al., 2023; Team et al., 2023; Reid et al., 2024).
2.2 In-Context Learning
In-context learning (ICL) offers a computationally efficient approach to enhancing model performance at inference time by conditioning on a few demonstrations of the task (Brown et al., 2020). To further improve ICL performance, existing works focus on pretraining strategies that optimize language models to learn in-context (Min et al., 2022; Wei et al., 2023; Gu et al., 2023). In addition, selective use of few-shot examples is shown to enhance downstream task performance (Liu et al., 2022; Rubin et al., 2022; Wang et al., 2024). Notably, reformatting or finding an optimal ordering of in-context examples also improves ICL effectiveness (Lu et al., 2022; Wu et al., 2023; Liu et al., 2024a). With the emergence of long-context LLMs (Achiam et al., 2023; Team et al., 2023; Reid et al., 2024), scaling the number of examples becomes possible in ICL (Li et al., 2023; Bertsch et al., 2024; Agarwal et al., 2024). For instance, Agarwal et al. (2024) show that many-shot ICL can mitigate pretraining biases within LLMs and thus improve performance across various tasks.
2.3 Retrieval Augmented Generation
Retrieval augmented generation (RAG) improves language model performance by incorporating relevant knowledge from external sources (Lewis et al., 2020; Guu et al., 2020; Karpukhin et al., 2020). In contrast to naïve RAG, optimizing the retrieval stage can effectively enhance context relevance and improve generation performance (Ma et al., 2023; Trivedi et al., 2023; Jiang et al., 2023; Shi et al., 2024; Sarthi et al., 2024; Lin et al., 2024). An example is REPLUG, in which Shi et al. (2024) use an LLM as supervision to learn a dense retriever. In addition, encoding documents can improve knowledge retrieval and generation capabilities (Khandelwal et al., 2019; Izacard and Grave, 2021; Borgeaud et al., 2022; Izacard et al., 2023). For instance, Izacard and Grave (2021) leverage a fusion-in-decoder architecture to encode multiple question-passage pairs while maintaining model efficiency. Alternatively, selectively utilizing knowledge from the documents improves the robustness of LLMs against irrelevant context (Yu et al., 2023; Yoran et al., 2024; Yan et al., 2024; Yue et al., 2024; Zhang et al., 2024). For example, RAFT trains language models with negative documents to improve generation quality and relevance (Zhang et al., 2024). Concurrent to our work, long-document retrieval and datastore scaling have been proposed to optimize RAG performance (Jiang et al., 2024; Shao et al., 2024). Despite such progress, inference scaling remains under-explored for long-context RAG methods. As such, we investigate how variations in inference computation impact RAG performance, with the goal of optimizing test-time compute allocation.
3 Inference Scaling Strategies for RAG
3.1 Preliminaries
We measure inference computation with effective context length, defined as the total number of input tokens across all iterations before the LLM outputs the final answer. For most methods that only call the LLM once, the effective context length is equivalent to the number of input tokens in the prompt and is limited by the context window limit of the LLM. For methods that iteratively call the LLM, the effective context length can be extended indefinitely depending on the strategy. We exclude output tokens and retrieval costs from our analysis, as LLMs typically generate significantly fewer tokens (fewer than 10) in knowledge-intensive tasks. Additionally, retrieval is generally much less computationally expensive than LLM inference, especially with scalable matching methods (Sun et al., 2024). Our objective is to understand how RAG performance changes as we scale up inference computation. In demonstration-based RAG (DRAG), we achieve such scaling by incorporating both extensive documents and in-context examples. For further scaling, we increase generation steps through iterative demonstration-based RAG (IterDRAG). We introduce both strategies below.
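To make this accounting concrete, here is a minimal sketch of the effective context length as a sum of input tokens over LLM calls; the whitespace tokenizer is a hypothetical stand-in for the model's real one:

```python
def effective_context_length(prompts, count_tokens=lambda s: len(s.split())):
    """Total input tokens across all LLM calls issued before the final
    answer. Output tokens and retrieval cost are excluded, as above.
    `count_tokens` is a stand-in tokenizer (whitespace split here)."""
    return sum(count_tokens(p) for p in prompts)

# A single-call method (e.g., DRAG) issues one prompt, so its effective
# context length is bounded by the model's context window.
single_call = ["docs demos question"]
# An iterative method (e.g., IterDRAG) issues one prompt per step, so the
# summed length can grow beyond any single context window.
iterative = ["docs question", "docs question subq1 ans1"]

print(effective_context_length(single_call))  # 3
print(effective_context_length(iterative))    # 6
```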
3.2 Demonstration-Based RAG
Demonstration-based RAG (DRAG) leverages in-context learning to exploit the capabilities of long-context LLMs by directly generating answers from an extended input context. DRAG builds upon naïve RAG and integrates both documents and in-context examples into the input prompt. This expanded context allows the model to generate answers to the input query within a single inference request (see Figure 3, left). For both in-context examples and the test-time query, we employ a retrieval model to select the top-$k$ documents from a large corpus (e.g., Wikipedia). We reverse the order of the retrieved documents, placing higher-ranked documents closer to the query (Liu et al., 2024b). As we use instruction-tuned LLMs, we design a prompt template similar to that of Agarwal et al. (2024) and align the formatting with prefixes for retrieved documents, input and output (see Appendix H). Unlike previous works (Press et al., 2023; Trivedi et al., 2023), DRAG incorporates extensive retrieved documents within the demonstrations, enabling long-context LLMs to learn to extract relevant information and answer questions using a rich input context.
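The prompt assembly described above can be sketched as follows; the prefixes and helper name are illustrative assumptions, not the paper's exact template (which is given in its Appendix H):

```python
def build_drag_prompt(examples, query, query_docs):
    """Assemble a DRAG prompt: m in-context examples (each with its own
    top-k retrieved documents, a question and an answer) followed by the
    test query and its documents. Documents are reversed so higher-ranked
    ones sit closer to the question (Liu et al., 2024b)."""
    blocks = []
    for docs, question, answer in examples:
        doc_block = "\n".join(reversed(docs))  # rank-1 document ends up last
        blocks.append(f"{doc_block}\nQuestion: {question}\nAnswer: {answer}")
    doc_block = "\n".join(reversed(query_docs))
    blocks.append(f"{doc_block}\nQuestion: {query}\nAnswer:")
    return "\n\n".join(blocks)

prompt = build_drag_prompt(
    examples=[(["[1] demo doc A", "[2] demo doc B"], "Who wrote A?", "Author A")],
    query="Who wrote B?",
    query_docs=["[1] test doc A", "[2] test doc B"],
)
```

Since the model completes the trailing `Answer:` prefix, the whole request fits in one inference call; scaling $k$ or $m$ simply grows this single prompt.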
Figure 3: DRAG vs. IterDRAG. IterDRAG breaks down the input query into sub-queries and answers them to improve the accuracy of the final answer. At test time, IterDRAG scales computation through multiple inference steps to decompose complex queries and retrieve documents.
3.3 Iterative Demonstration-Based RAG
Despite access to external knowledge, complex multi-hop queries remain challenging due to the compositionality gap. To tackle this issue, we introduce iterative demonstration-based RAG (IterDRAG), which handles complex queries by decomposing them into simpler sub-queries. For each sub-query, retrieval is performed to gather additional contextual information, which is then used to generate intermediate answers. After all sub-queries are resolved, the retrieved context, sub-queries, and their answers are combined to synthesize the final answer (see Figure 3, right).
While multiple existing datasets provide training data with queries and corresponding answers, sub-queries and intermediate answers are often absent. To generate in-context examples with sub-queries and intermediate answers, we prompt LLMs with constrained decoding to follow the Self-Ask format (Press et al., 2023; Koo et al., 2024). In each iteration, the LLM generates either a sub-query, an intermediate answer, or the final answer. If a sub-query is generated, additional documents are retrieved and interleaved into the prompt before the intermediate answer is produced. IterDRAG continues until the final answer is generated or the maximum number of iterations is reached, at which point the LLM is forced to generate the final answer. We retain examples with intermediate steps and correct final answers to construct in-context demonstrations. Each example includes the retrieved documents, sub-query and answer pairs, as well as the final answer.
During inference, in-context examples are prepended to the initial documents retrieved for the input query. As above, each inference request yields a sub-query, an intermediate answer, or the final answer. When a sub-query is generated, additional documents are retrieved and merged with the initial ones to generate the intermediate answer. In our implementation, we allow up to five iterations of query decomposition before generating the final answer. This iterative process effectively scales test-time computation, with the input tokens from all iterations summed to compute the effective context length. IterDRAG facilitates a more granular approach by learning to: (1) decompose the query into simple and manageable sub-queries; and (2) retrieve and locate relevant information to answer (sub-)queries. As such, the iterative retrieval and generation strategy helps narrow the compositionality gap and improves knowledge extraction, thereby enhancing overall RAG performance.
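The inference loop described in this section can be sketched as below, with hypothetical `retrieve` and `generate` callables and an illustrative prompt format; constrained-decoding details and demonstration prepending are omitted for brevity:

```python
def format_step_prompt(docs, query, context, force_final=False):
    """Illustrative prompt layout: documents, question, then the trace of
    sub-queries and intermediate answers generated so far."""
    trace = "".join(f"{kind}: {text}\n" for kind, text in context)
    suffix = "Final Answer:" if force_final else ""
    return "\n".join(docs) + f"\nQuestion: {query}\n{trace}{suffix}"

def iter_drag(query, retrieve, generate, max_iters=5):
    """IterDRAG-style loop. `retrieve(q)` returns documents; `generate(p)`
    returns a (kind, text) pair with kind in {'sub_query', 'intermediate',
    'final'}. Sub-queries trigger extra retrieval merged with the initial
    documents; after `max_iters` steps a final answer is forced. Returns
    the answer plus every prompt issued (whose input tokens sum to the
    effective context length)."""
    docs = retrieve(query)  # initial retrieval for the input query
    context, prompts = [], []
    for _ in range(max_iters):
        prompt = format_step_prompt(docs, query, context)
        prompts.append(prompt)
        kind, text = generate(prompt)
        if kind == "final":
            return text, prompts
        context.append((kind, text))
        if kind == "sub_query":
            docs += [d for d in retrieve(text) if d not in docs]  # merge
    prompt = format_step_prompt(docs, query, context, force_final=True)
    prompts.append(prompt)
    return generate(prompt)[1], prompts

# Toy run with stub callables: one decomposition step, then the answer.
steps = iter([("sub_query", "Where was X born?"),
              ("intermediate", "Paris"),
              ("final", "France")])
answer, prompts = iter_drag("Which country was X born in?",
                            retrieve=lambda q: [f"doc({q})"],
                            generate=lambda p: next(steps))
print(answer)        # France
print(len(prompts))  # 3
```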
4 RAG Performance and Inference Computation Scale
4.1 Fixed Budget Optimal Performance
For a given budget on inference computation, i.e., a maximum effective context length $L_{\text{max}}$ , there are multiple ways to optimize the use of computation resources through inference parameters. For example, in DRAG, we can adjust both the number of retrieved documents and in-context examples, while in the IterDRAG strategy, we additionally introduce the number of iterations for retrieval and generation. Henceforth, we use $\theta$ to denote all these inference parameters.
For each input query and its ground-truth answer $(x_{i},y_{i})\in\mathcal{X}$ , we can apply the RAG inference strategy $f$ parameterized by $\theta$ . We denote the effective input context length to the LLM as $l(x_{i};\theta)$ and the obtained prediction as $\hat{y}_{i}=f(x_{i};\theta)$ . A metric $P(y_{i},\hat{y}_{i})$ can then be calculated based on $y_{i}$ and $\hat{y}_{i}$ . To understand the relationship between RAG performance and inference computation, we sample a few different inference computation budgets. For each budget $L_{\text{max}}$ , we find the optimal average metric $P^{*}(L_{\text{max}})$ achievable within this budget by enumerating different $\theta\in\Theta$ :
$$
P^{*}(L_{\text{max}}) := \max_{\theta\in\Theta}\Big\{\frac{1}{|\mathcal{X}|}\sum_{i}P\big(y_{i},f(x_{i};\theta)\big) \,\Big|\, \forall i,\; l(x_{i};\theta)\leq L_{\text{max}}\Big\}. \tag{1}
$$
Our goal is to establish the relationship between the inference computation budget $L_{\text{max}}$ and the best possible performance within this budget $P^{*}(L_{\text{max}})$ , using any possible strategies and parameter configurations to allocate the inference computation resources. For simplicity, we also refer to $P^{*}(L_{\text{max}})$ as the optimal performance. We investigate the following factors within the inference parameter set $\theta$ : (1) the number of documents $k$ , which are retrieved from a large corpus (e.g., Wikipedia) based on the input query; (2) the number of in-context examples $m$ , where each of the examples consists of $k$ documents, an input query and its label; and (3) the number of generation iterations $n$ . In DRAG, an answer can be directly generated upon input context, so $n=1$ . In contrast, IterDRAG involves multiple steps of interleaved retrieval and generation, expanding both the effective context length and inference compute without needing longer context windows.
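A minimal sketch of this budget-constrained enumeration (Eq. 1) follows; `run_rag` and `context_len` are assumed stand-ins for an actual RAG pipeline and its token accounting, and the iteration count $n$ is omitted for brevity:

```python
from itertools import product

def accuracy_metric(y, y_hat):
    """Accuracy-style metric: the ground truth appears in the prediction."""
    return float(y.lower() in y_hat.lower())

def optimal_performance(l_max, dataset, run_rag, context_len,
                        ks=(1, 2, 5, 10), ms=(0, 1, 2, 4)):
    """Enumerate theta = (k, m) over documents and in-context examples,
    keep only configurations whose effective context length stays within
    l_max on every example, and return the best average metric."""
    best = None
    for k, m in product(ks, ms):
        if any(context_len(x, k, m) > l_max for x, _ in dataset):
            continue  # budget violated for some example
        avg = sum(accuracy_metric(y, run_rag(x, k, m))
                  for x, y in dataset) / len(dataset)
        best = avg if best is None else max(best, avg)
    return best

# Toy stubs: each document costs 100 tokens and each example 50; the
# answer is only found with at least two retrieved documents.
run = lambda x, k, m: "gold answer" if k >= 2 else "miss"
length = lambda x, k, m: 100 * k + 50 * m
print(optimal_performance(250, [("q", "gold")], run, length))  # 1.0
```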
Table 1: Optimal performance of different methods with varying maximum effective context lengths $L_{\text{max}}$ (i.e., the total number of input tokens across all iterations). ZS QA and MS QA refer to zero-shot QA and many-shot QA, respectively. Partial results are omitted for methods that do not scale further with increasing $L_{\text{max}}$. For clarity, we mark the best results for each $L_{\text{max}}$ in bold.
| $L_{\text{max}}$ | Method | Bamboogle | | | HotpotQA | | | MuSiQue | | | 2WikiMultiHopQA | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | | EM | F1 | Acc | EM | F1 | Acc | EM | F1 | Acc | EM | F1 | Acc |
| 16k | ZS QA | 16.8 | 25.9 | 19.2 | 22.7 | 32.0 | 25.2 | 5.0 | 13.2 | 6.6 | 28.3 | 33.5 | 30.7 |
| 16k | MS QA | 24.0 | 30.7 | 24.8 | 24.6 | 34.0 | 26.2 | 7.4 | 16.4 | 8.5 | 33.2 | 37.5 | 34.3 |
| 16k | RAG | 44.0 | 54.5 | 45.6 | 44.2 | 57.9 | 49.2 | 12.3 | 21.5 | 15.3 | 42.3 | 49.3 | 46.5 |
| 16k | DRAG | 44.0 | 55.2 | 45.6 | **45.5** | **58.5** | **50.2** | **14.5** | **24.6** | **16.9** | **45.2** | **53.5** | **50.5** |
| 16k | IterDRAG | **46.4** | **56.2** | **51.2** | 36.0 | 47.4 | 44.4 | 8.1 | 17.5 | 12.2 | 33.2 | 38.8 | 43.8 |
| 32k | RAG | **48.8** | 56.2 | 49.6 | 44.2 | 58.2 | 49.3 | 12.3 | 21.5 | 15.3 | 42.9 | 50.6 | 48.0 |
| 32k | DRAG | **48.8** | **59.2** | 50.4 | **46.9** | **60.3** | **52.0** | **15.4** | **26.0** | 17.3 | **45.9** | 53.7 | 51.4 |
| 32k | IterDRAG | 46.4 | 56.2 | **52.0** | 38.3 | 49.8 | 44.4 | 12.5 | 23.1 | **19.7** | 44.3 | **54.6** | **56.8** |
| 128k | RAG | 51.2 | 60.3 | 52.8 | 45.7 | 59.6 | 50.9 | 14.0 | 23.7 | 16.8 | 43.1 | 50.7 | 48.4 |
| 128k | DRAG | 52.8 | 62.3 | 54.4 | **47.4** | **61.3** | 52.2 | 15.4 | 26.0 | 17.9 | 47.5 | 55.3 | 53.1 |
| 128k | IterDRAG | **63.2** | **74.8** | **68.8** | 44.8 | 59.4 | **52.8** | **17.3** | **28.0** | **24.5** | **62.3** | **73.8** | **74.6** |
| 1M | DRAG | 56.0 | 62.9 | 57.6 | 47.4 | 61.3 | 52.2 | 15.9 | 26.0 | 18.2 | 48.2 | 55.7 | 53.3 |
| 1M | IterDRAG | **65.6** | **75.6** | **68.8** | **48.7** | **63.3** | **55.3** | **22.2** | **34.3** | **30.5** | **65.7** | **75.2** | **76.4** |
| 5M | IterDRAG | **65.6** | **75.6** | **68.8** | **51.7** | **64.4** | **56.4** | **22.5** | **35.0** | **30.5** | **67.0** | **75.2** | **76.9** |
We evaluate the performance of Gemini 1.5 Flash with a context window of up to 1M tokens on knowledge-intensive question answering, including the multi-hop datasets Bamboogle, HotpotQA, MuSiQue and 2WikiMultiHopQA (Press et al., 2023; Yang et al., 2018; Trivedi et al., 2022; Ho et al., 2020). Additional results are provided in Appendix B and Appendix C. To manage the computational costs of extensive experiments, we follow Wu et al. (2024); Gutiérrez et al. (2024) and sample 1.2k examples from each dataset for evaluation. The evaluation metrics include exact match (EM), F1 score (F1) and accuracy (Acc), where the accuracy metric assesses whether the ground truth is located within the prediction. We sample the inference computation budget $L_{\text{max}}$ as 16k, 32k, 128k, 1M and 5M tokens. For the parameter space $\Theta$ of DRAG, we consider the number of documents $k \in \{0,1,2,5,10,20,50,100,200,500,1000\}$ and the number of in-context examples $m \in \{0, 2^{0}, 2^{1}, \ldots, 2^{8}\}$ . For IterDRAG, we further experiment with the number of iterations $n$ up to 5. We compare to the following baselines: (1) zero-shot QA (ZS QA), where the model does not leverage any retrieved documents or demonstrations; (2) many-shot QA (MS QA), where the model only uses a varying number of demonstrations $m$ without any retrieved documents; and (3) retrieval augmented generation (RAG), where the model only uses $k$ retrieved documents without demonstrations. We report the optimal performance of each method under different maximum effective context length budgets by examining its performance across inference parameter configurations.
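The three metrics can be sketched in a few lines. This follows the common SQuAD-style normalization (lowercasing, stripping articles and punctuation), which is an assumption about the exact preprocessing; the accuracy metric implements the substring check described above:

```python
import re
from collections import Counter

def normalize(text):
    """Lowercase, drop articles and punctuation, collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    return " ".join(text.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    """Token-level F1 between prediction and ground truth."""
    p, g = normalize(pred).split(), normalize(gold).split()
    common = Counter(p) & Counter(g)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def accuracy(pred, gold):
    # Acc checks whether the ground truth appears within the prediction.
    return float(normalize(gold) in normalize(pred))
```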
Figure 4: Normalized performance vs. effective context lengths across datasets. Each line represents a fixed configuration, scaled by varying the number of documents. Red dots indicate the optimal configurations, with the dashed line showing the fitting results. The observed optimal performance can be approximated by a linear relationship with the effective context lengths.
4.2 Overall Performance
We report the optimal performance $P^{*}(L_{\text{max}})$ for different inference strategies in Table 1, where we identify the optimal inference parameters for each computation budget $L_{\text{max}}$ . Some variants are omitted for certain $L_{\text{max}}$ because they do not scale to the corresponding context length. For example, the zero-shot QA prompt cannot be lengthened, and the number of in-context examples for many-shot QA is capped at $2^{8}$ , so neither scales to $L_{\text{max}}=32\text{k}$ . Similarly, RAG does not scale to $L_{\text{max}}$ larger than 128k, and DRAG is limited by the LLM's 1M context window.
Unlike the QA and RAG baselines, the performance of DRAG and IterDRAG consistently increases as we expand the maximum effective context length. More specifically, we observe: (1) DRAG and IterDRAG scale better than the baselines. Baselines like many-shot QA peak at 16k tokens, while RAG improves until 128k, after which performance plateaus. In comparison, DRAG and IterDRAG can find optimal configurations that more effectively utilize test-time compute, exhibiting superior performance and scaling properties. Performance of DRAG consistently improves until 1M tokens, while IterDRAG further enhances RAG performance with a 5M-token computation budget by iteratively calling LLMs. (2) DRAG excels at shorter maximum lengths, while IterDRAG scales more effectively with longer effective context lengths. At 16k and 32k, DRAG typically delivers the best performance, while at 128k and beyond, IterDRAG achieves superior results overall, highlighting the effectiveness of iterative retrieval and generation. These results suggest that increasing $L_{\text{max}}$ is beneficial for RAG, with the DRAG and IterDRAG strategies each excelling at different scales.
4.3 Inference Scaling Laws for RAG
To analyze RAG performance with different effective context lengths, we plot the performance of all configurations across datasets in Figure 4. Similar to Figure 1, we visualize DRAG and IterDRAG and highlight the optimal performance $P^{*}(L_{\text{max}})$ for different selections of $L_{\text{max}}$ . The fitting results are shown as grey dashed lines. We provide additional dataset-specific results in Appendix E.
The optimal performance exhibits consistent gains as the effective context length expands, demonstrating a strong linear correlation, which we term the inference scaling laws for RAG. Combined with dataset-specific results, our key observations are: (1) The optimal performance scales nearly linearly with the order of magnitude of the inference compute. This linear relationship suggests that RAG performance can be improved by increasing computation, allowing for more accurate predictions of performance given available compute resources. (2) For $L_{\text{max}}$ above $10^{5}$ , IterDRAG continues to scale effectively through interleaved retrieval and iterative generation. This aligns with our results in Table 1, where IterDRAG better utilizes computation budgets for effective context lengths exceeding 128k. (3) Gains in optimal performance gradually diminish beyond an effective context length of 1M. Despite dataset variations, performance follows similar trends up to 1M tokens. Beyond that, improvements from 1M to 5M are less substantial or plateau, potentially due to limitations in long-context modeling. In summary, while gains are smaller beyond 1M tokens, optimal RAG performance scales almost linearly with increasing inference compute through DRAG and IterDRAG.
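The near-linear relation can be fitted by simple least squares on $\log_{10}$ of the effective context length. A minimal stdlib sketch (the fitting procedure here is our illustration, not the paper's exact implementation):

```python
# Sketch: least-squares fit of P*(L) ≈ slope * log10(L) + intercept,
# the near-linear scaling of optimal performance with the order of
# magnitude of the effective context length.
import math

def fit_scaling_line(lengths, perfs):
    """Fit perfs ≈ slope * log10(lengths) + intercept by least squares."""
    xs = [math.log10(l) for l in lengths]
    n = len(xs)
    mx, my = sum(xs) / n, sum(perfs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, perfs))
    slope = sxy / sxx
    return slope, my - slope * mx
```

Once fitted on observed optima, the line predicts $P^{*}$ for an unseen budget as `slope * math.log10(l_max) + intercept`.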
(a) Averaged DRAG performance heatmap for different metrics.
(b) Performance vs. number of documents.
(c) Performance vs. number of shots.
Figure 5: RAG performance changes with varying number of documents and in-context examples. 5(a) reports the averaged metric values across datasets, whereas in 5(b) and 5(c), each line represents the normalized performance of a consistent configuration with progressively increasing documents / shots.
4.4 Parameter-Specific Scaling
To gain further insights into the dynamics of DRAG and IterDRAG, we grid search over different combinations of $\theta$ and evaluate the performance. The results are presented in Figure 5, where we visualize DRAG performance using heatmaps (see the IterDRAG heatmap in Appendix C). Additionally, we provide further results with varying numbers of documents ( $k$ ) and shots ( $m$ ). In summary, scaling retrieval, demonstrations and generation steps leads to performance gains in most cases, yet such gains vary by effective context length and method. In particular, we note: (1) Documents and in-context examples are not equally helpful. For a fixed configuration, increasing the number of retrieved documents $k$ usually leads to more substantial performance gains, as evidenced by the differing slopes in Figure 5. (2) Increasing shots $m$ is more helpful for IterDRAG. For example, increasing $m$ from 0 to 1 (rather than increasing $k$ ) is more helpful for IterDRAG, possibly because demonstrations lead to improved in-context query decomposition and knowledge extraction. (3) Scaling saturates differently for DRAG and IterDRAG. An example can be found in the increase of $m$ from 0 to 1, which results in significant improvements for IterDRAG but shows little impact on DRAG. Beyond these soft thresholds, further increases in $k$ or $m$ yield marginal gains or even result in performance declines. (4) For a given $L_{\text{max}}$ , the optimal $\theta$ depends on the method, metric and dataset. As illustrated in Figure 5(a) and Figure 8, the optimal combinations are sensitive to the metrics and located differently, posing challenges for performance modeling w.r.t. $\theta$ . In conclusion, increasing documents, demonstrations and iterations can enhance RAG performance, but each contributes differently to the overall results. As such, identifying the optimal combination of hyperparameters remains challenging.
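The grid search above can be sketched as a budget-constrained scan over $\theta=(k,m,n)$. Here `context_length` and `run_rag` are hypothetical hooks standing in for the length accounting and an actual DRAG / IterDRAG evaluation:

```python
# Sketch: exhaustive grid search over theta = (k, m, n) under a length budget.
# context_length(k, m, n) -> tokens and run_rag(k, m, n) -> metric are
# caller-supplied hooks (hypothetical, not the paper's implementation).
from itertools import product

def best_config(l_max, ks, ms, ns, context_length, run_rag):
    best, best_score = None, float("-inf")
    for k, m, n in product(ks, ms, ns):
        if context_length(k, m, n) > l_max:
            continue  # configuration exceeds the budget L_max
        score = run_rag(k, m, n)
        if score > best_score:
            best, best_score = (k, m, n), score
    return best, best_score
```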
5 Inference Computation Allocation for Long-Context RAG
After examining the overall performance of different RAG strategies and the varying impacts of different inference parameters, we now quantify the relationship between performance and the hyperparameter set $\theta$ . We hypothesize that for long-context RAG, we can model such test-time scaling properties, and we term the result the computation allocation model for RAG. This model, in turn, can be used to guide the selection of $\theta$ based on the maximum effective context length $L_{\text{max}}$ .
5.1 Formulation and Estimation
With a slight abuse of notation, we redefine the average performance metric $P$ (e.g., accuracy) on dataset $\mathcal{X}$ as a function of $\theta$ . We consider the number of documents $k$ , demonstrations $m$ and maximum iterations $n$ within $\theta$ , namely $\theta:=(k,m,n)^{T}$ . To account for the variance across methods and tasks, we introduce $i:=(i_{\text{doc}},i_{\text{shot}},0)^{T}$ , where $i_{\text{doc}}$ and $i_{\text{shot}}$ measure the informativeness of documents and in-context examples, respectively. While technically we could also define an $i_{\text{iter}}$ to measure the informativeness of additional generation steps, applying $i_{\text{iter}}$ does not yield improved accuracy, so we leave it as 0 in our experiments. We formulate the computation allocation model as follows (in our implementation, we shift the values within $\theta$ by a small $\epsilon$ to prevent numerical issues with $\log(0)$ ):
$$
\sigma^{-1}(P(\theta))\approx(a+b\odot i)^{T}\log(\theta)+c, \tag{2}
$$
where $\odot$ refers to the element-wise product. $a,b \in \mathbb{R}^{3}$ and $c \in \mathbb{R}$ are parameters to be estimated, and $i$ can be computed based on the specific task. There are different ways to define $i$ ; we compute it from the performance difference between selected base configurations. In particular, for each strategy on each dataset, $i_{\text{doc}}$ is defined as the performance gain from adding a single document compared to zero-shot QA. Similarly, $i_{\text{shot}}$ is defined as the performance gain from adding a single in-context example compared to zero-shot QA. To account for the sub-linearity at extremely long context lengths (above 1M), we apply an inverse sigmoidal mapping $\sigma^{-1}$ to scale the values of the metric $P$ . Further implementation details are reported in Appendix H.
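The forward direction of Equation 2 can be sketched directly. The parameter values passed in are illustrative placeholders, not the paper's fitted values:

```python
# Sketch of Equation 2: sigma^{-1}(P(theta)) ≈ (a + b ⊙ i)^T log(theta) + c.
# theta is shifted by a small epsilon inside the log, per the paper's note.
import math

EPS = 1e-3  # shift to avoid log(0)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def logit(p):
    """Inverse sigmoid sigma^{-1}, applied to the metric P during fitting."""
    return math.log(p / (1.0 - p))

def predict_performance(theta, a, b, i, c):
    """P(theta) for theta = (k, m, n) under the computation allocation model."""
    log_theta = [math.log(t + EPS) for t in theta]
    coeff = [aj + bj * ij for aj, bj, ij in zip(a, b, i)]  # a + b ⊙ i
    return sigmoid(sum(cj * lt for cj, lt in zip(coeff, log_theta)) + c)
```

With all coefficients at zero, the prediction collapses to $\sigma(c)$; increasing any component of $\theta$ whose effective coefficient is positive monotonically increases the predicted metric.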
Figure 6: The estimated performance using the proposed computation allocation model vs. actual metric values in DRAG. The subplots represent different datasets; each line corresponds to a fixed number of documents, and we scale the context length by increasing the number of shots.
In Equation 2, the estimates of $a$ , $b$ and $c$ are specific to a given model, reflecting how the LLM improves with varying numbers of documents and shots (i.e., its in-context learning / zero-shot capabilities). In contrast, $i$ models the performance variations within the selected task (i.e., how external knowledge / demonstrations help in responding to the query). Therefore, the computation allocation model can be estimated once and applied to various downstream tasks without requiring additional calibration. To estimate the parameters, varying combinations of $\theta$ are evaluated to perform ordinary least squares on $a$ , $b$ and $c$ . We report the parameters for Gemini 1.5 Flash in Appendix F.
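The least-squares estimation can be sketched as a single linear regression on the logit-transformed metrics: stacking $\log\theta$, $i\odot\log\theta$ and a constant column gives a design matrix whose coefficients are $[a\,|\,b\,|\,c]$. Data and shapes below are illustrative, not the paper's measurements:

```python
# Sketch: OLS estimation of a, b, c in Equation 2. Each observation supplies
# theta = (k, m, n), the task-level informativeness vector i, and a metric P.
import numpy as np

EPS = 1e-3  # shift inside log, as in the paper's implementation note

def fit_allocation_model(thetas, infos, metrics):
    """Return (a, b, c) minimizing ||(a + b ⊙ i)·log(theta) + c - logit(P)||²."""
    log_t = np.log(np.asarray(thetas, float) + EPS)   # shape (N, 3)
    i = np.asarray(infos, float)                      # shape (N, 3)
    p = np.asarray(metrics, float)
    y = np.log(p / (1.0 - p))                         # inverse sigmoid of P
    # Design matrix: [log theta | i ⊙ log theta | 1] -> coefficients [a | b | c]
    X = np.hstack([log_t, i * log_t, np.ones((len(y), 1))])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[:3], coef[3:6], coef[6]
```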
5.2 Validating the Computation Allocation Model for RAG
We evaluate the computation allocation model for RAG by comparing the predicted metrics to the actual values, with normalized results for DRAG visualized in Figure 6. Here, each subplot represents a different dataset, and each line corresponds to a document setting ( $k$ ); we scale the context length by adjusting the number of in-context examples ( $m$ ). As illustrated, performance improves with increasing $k$ and $m$ across datasets, and the predicted and actual metric values display highly consistent trends, despite some variations. Notably, the datasets exhibit different levels of consistency: Bamboogle shows the highest consistency, while HotpotQA produces more variable results. Our findings demonstrate how external knowledge and in-context learning can effectively enhance RAG performance with long-context capabilities, suggesting the effectiveness of the computation allocation model for RAG and how it may be used to predict benchmark results.
Table 2: Ablation study results of the computation allocation model for RAG.
| Variant | $R^{2}$ | MSE |
| --- | --- | --- |
| Exclude $b$ | 0.866 | 0.116 |
| Quadratic $\theta$ | 0.867 | 0.117 |
| Linear $\sigma$ | 0.876 | 0.109 |
| Sigmoidal $\sigma$ | 0.903 | 0.085 |
Ablation Study.
To verify the effectiveness of the computation allocation model, we perform ablation studies and evaluate the fitting performance of different variants. In particular, we assess: (1) estimation without $b$ and $i$ (Exclude $b$ ); (2) a quadratic form of the input $\log(\theta)$ (Quadratic $\theta$ ); (3) linear scaling of $P$ (Linear $\sigma$ ); and (4) sigmoidal scaling of $P$ (Sigmoidal $\sigma$ ). The $R^{2}$ and MSE values for these variants are reported in Table 2, in which (4) represents the complete design of our computation allocation model. The results indicate that incorporating $b$ with $i$ improves the fit and reduces error across all tasks. Moreover, applying the inverse sigmoid to $P$ significantly improves the estimation compared to quadratic $\theta$ or linear scaling.
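The two goodness-of-fit measures used in the ablation are standard; a minimal sketch of how they would be computed between predicted and observed (transformed) metric values:

```python
# Sketch: R² and MSE between observed and predicted values, as used to
# compare the model variants in the ablation study.
def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```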
Table 3: Domain generalization results of the computation allocation model for RAG.
| Method | Bamboogle | | | HotpotQA | | | MuSiQue | | | 2WikiMultiHopQA | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | EM | F1 | Acc | EM | F1 | Acc | EM | F1 | Acc | EM | F1 | Acc |
| Baseline | 49.6 | 58.8 | 51.2 | 46.3 | 60.2 | 51.4 | 14.9 | 24.7 | 16.9 | 46.5 | 53.7 | 51.6 |
| Predict | 64.0 | 75.6 | 68.0 | 47.8 | 63.3 | 55.3 | 19.3 | 32.5 | 29.3 | 60.8 | 72.4 | 74.9 |
| Oracle | 65.6 | 75.6 | 68.8 | 48.7 | 63.3 | 55.3 | 22.2 | 34.3 | 30.5 | 65.7 | 75.2 | 76.4 |
Domain Generalization.
We also examine how the computation allocation model for RAG generalizes to unseen domains. That is, the parameters of Equation 2 are learnt from the remaining domains and tested on the target domain; at inference, only $i$ is derived from the target domain. We report the results for a 1M effective context length in Table 3, comparing against an 8-shot baseline configuration (scaled by increasing retrieved documents) and the optimal results (Oracle). In summary, the computation allocation model significantly outperforms the baseline and closely aligns with the oracle results (96.6% of the optimal performance). Notably, Bamboogle and HotpotQA exhibit highly similar target results, with performance metrics varying by less than 2.5% from the oracle. These results suggest the potential of applying the computation allocation model for RAG to a wider range of knowledge-intensive tasks.
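The leave-one-domain-out protocol can be sketched as follows, again with a simplified logit-linear stand-in for Equation 2: shared slopes on $\log k$ and $\log m$ are learnt from the held-in domains, and only an intercept (playing the role of the task-specific $i$) is re-estimated on the target domain. The four synthetic "domains" and all coefficients are illustrative.

```python
import numpy as np

def logit(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
# Four synthetic "domains" sharing the same scaling slopes but with
# different task-specific offsets.
features, labels = [], []
for offset in (0.0, -0.3, 0.4, 0.1):
    k = rng.choice([1.0, 10.0, 100.0, 1000.0], size=20)
    m = rng.choice([1.0, 2.0, 4.0, 8.0], size=20)
    X = np.column_stack([np.ones_like(k), np.log(k), np.log(m)])
    y = sigmoid(offset - 1.0 + 0.3 * np.log(k) + 0.2 * np.log(m))
    features.append(X)
    labels.append(y)

errors = []
for held in range(4):
    # Fit shared parameters on the remaining three domains.
    train_X = np.vstack([features[d] for d in range(4) if d != held])
    train_y = np.concatenate([labels[d] for d in range(4) if d != held])
    w, *_ = np.linalg.lstsq(train_X, logit(train_y), rcond=None)
    # On the target domain, re-estimate only the intercept (the offset).
    Xh, yh = features[held], labels[held]
    intercept = np.mean(logit(yh) - Xh[:, 1:] @ w[1:])
    pred = sigmoid(intercept + Xh[:, 1:] @ w[1:])
    errors.append(float(np.mean(np.abs(pred - yh))))
print([round(e, 4) for e in errors])
```

When the scaling slopes truly transfer across domains, the held-out prediction error stays small even though the model never sees the target domain's full grid.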
Table 4: Length extrapolation results of the computation allocation model for RAG.
| | Target 32k | | | Target 128k | | | Target 1M | | | Target 5M | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | EM | F1 | Acc | EM | F1 | Acc | EM | F1 | Acc | EM | F1 | Acc |
| Baseline | 37.4 | 47.6 | 40.4 | 39.0 | 49.5 | 42.2 | 39.3 | 49.3 | 42.8 | 44.5 | 55.4 | 49.8 |
| Predict | 37.4 | 48.2 | 41.0 | 41.2 | 52.0 | 45.4 | 48.0 | 60.9 | 56.9 | 47.9 | 59.8 | 55.2 |
| Oracle | 39.2 | 49.8 | 42.7 | 46.9 | 59.0 | 55.1 | 50.5 | 62.1 | 57.7 | 51.7 | 62.6 | 58.1 |
Length Extrapolation.
In addition to predictability on unseen domains, we explore context length extrapolation with the computation allocation model. Here, we estimate the parameters of Equation 2 using experiments with shorter context lengths and assess their predictive accuracy on longer ones. We evaluate different extrapolation settings and present the predicted metric values in Table 4. Our observations are: (1) The predictions are accurate and consistently outperform the 8-shot baseline; for instance, the average difference between the predicted and oracle results from 128k to 1M tokens is just 2.8%. (2) Extrapolating from 32k to 128k is challenging, because DRAG performs best around 32k while IterDRAG typically excels at the longer context of 128k, as evidenced in Figure 4; this creates a discrepancy between the training and prediction performance distributions. (3) The 5M context length is less predictable, with an average difference between predicted and oracle metrics of a substantial 5.6%. Overall, length extrapolation with the computation allocation model is accurate, and most effective for target lengths of up to 1M tokens.
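A minimal extrapolation check under the same simplified logit-linear stand-in for Equation 2, here as a function of effective context length only and on synthetic data: parameters are fitted on configurations of at most 128k tokens and evaluated on longer ones.

```python
import numpy as np

def logit(p, eps=1e-6):
    p = np.clip(p, eps, 1 - eps)
    return np.log(p / (1 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
length = 10 ** rng.uniform(4, 6.5, size=200)   # effective context tokens
y = sigmoid(-4.0 + 0.45 * np.log(length))      # synthetic monotone trend

train = length <= 128_000                       # fit on short contexts only
X = np.column_stack([np.ones(length.size), np.log(length)])
w, *_ = np.linalg.lstsq(X[train], logit(y[train]), rcond=None)

pred = sigmoid(X @ w)
err_long = float(np.mean(np.abs(pred[~train] - y[~train])))
print(round(err_long, 6))
```

With a single monotone trend the extrapolation is near-exact; the regime shift between DRAG and IterDRAG described above is precisely the case where this assumption breaks, which is why 32k-to-128k extrapolation is harder.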
6 Discussion
Retrieval.
One critical factor in improving RAG performance lies in the quality of the retrieved documents. To study how retrieval impacts final accuracy, we analyze retrieval performance across different document set sizes in Appendix A. In all datasets, recall improves as the number of documents increases, approaching near-perfect scores with large document sets (e.g., $\sim$ 1k). Despite consistent gains in recall, the results show diminishing returns on rank-discounted metrics like NDCG, indicating increasing distraction within the context. This trend is also evident in Figure 5(b), where RAG performance peaks between 100 and 500 documents. These observations suggest the necessity of refining retrieval (e.g., through re-ranking) to further optimize document relevance, particularly for complex, multi-hop queries; however, how the inference scaling behavior discovered in this paper would change in the presence of such a refinement component remains unknown. Alternatively, iterative retrieval, as seen in IterDRAG, improves recall by using simpler, straightforward sub-queries to collect additional context for each intermediate answer. In summary, retrieving more documents improves recall but does not necessarily lead to better generation quality if the documents are not effectively ranked or filtered. This highlights the need for retrieval methods that dynamically adjust to minimize irrelevant content.
Error Analysis.
Despite overall improvements, our error analysis in Appendix G reveals that certain errors persist, particularly in compositional reasoning tasks that require multiple hops of reasoning. The common errors fall into four categories: (1) inaccurate or outdated retrieval; (2) incorrect or missing reasoning; (3) hallucination or unfaithful reasoning; and (4) evaluation issues or refusal to answer. The first category highlights the need to enhance retrieval methods and maintain a reliable, up-to-date knowledge base, especially for complex questions that rely on multiple supporting facts. In addition, incorrect or missing reasoning steps often result in wrong or partially correct answers. In our experiments, we observe that both (1) and (2) are substantially improved with IterDRAG, suggesting the importance of interleaving retrieval and iterative generation for multi-hop queries. Moreover, developing faithful LLMs and strategies to mitigate hallucination could further enhance RAG performance. Finally, we note that existing metrics fail in certain cases (e.g., abbreviations), underscoring the need for more robust and reliable evaluation methods.
Long-Context Modeling.
We also discuss the impact of long-context modeling on RAG performance. In summary, we find that retrieving more documents is generally beneficial for RAG performance, as demonstrated in Section 4. Nevertheless, naïvely extending the context length in each generation step does not always lead to better results. Specifically, DRAG performance peaks at around $10^{5}$ tokens, while IterDRAG achieves optimal performance at around $10^{6}$ tokens by leveraging multiple rounds of generation. As seen in the performance plateaus in Figure 1 and Figure 11, LLMs struggle to effectively utilize very long contexts ($\geq 10^{5}$ tokens) in each iteration, potentially due to inherent limitations of long-context modeling. Our observations suggest that: (1) the model's ability to identify relevant information within extensive context remains to be improved, especially when presented with a large quantity of "similar" documents; and (2) long-context modeling should be further refined to enhance in-context learning capabilities in settings where multiple lengthy demonstrations are provided.
Trade-Off Between Inference Compute and RAG Performance.
In our experiments, we observe consistent benefits of inference scaling using DRAG and IterDRAG, potentially changing the optimal trade-off between inference compute and RAG performance. Existing methods often exhibit diminishing returns when scaling inference compute beyond certain thresholds where RAG performance plateaus. As a result, the optimal trade-off between inference compute and RAG performance is unlikely to be found beyond these thresholds, as further investment in scaling inference compute becomes inefficient. In contrast, our findings demonstrate that long-context RAG performance can improve almost linearly with increased test-time compute when optimally allocated. Therefore, the optimal trade-off in our setting largely depends on the inference budget, with higher budgets consistently yielding steady gains. Combined with the computation allocation model for RAG, this approach enables the derivation of a (nearly) optimal solution for long-context RAG given computation constraints.
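Under a computation constraint, a (nearly) optimal configuration can be found by exhaustively scoring feasible $(k, m)$ settings, as in the sketch below. The per-document and per-demonstration token costs, as well as the stand-in predictor, are hypothetical placeholders for the fitted computation allocation model.

```python
import math

def context_cost(k, m, doc_len=1024, demo_len=2048, query_len=128):
    """Rough effective context length of a prompt with k documents and m shots
    (assumed average token counts)."""
    return k * doc_len + m * demo_len + query_len

def predicted_perf(k, m):
    # Stand-in for the fitted computation allocation model (made-up coefficients).
    return 1.0 / (1.0 + math.exp(-(-1.0 + 0.35 * math.log(k) + 0.25 * math.log(m))))

def best_config(budget, ks=(1, 5, 10, 50, 100, 500), ms=(1, 2, 4, 8)):
    """Exhaustively score feasible (k, m) settings under the token budget."""
    feasible = [(k, m) for k in ks for m in ms if context_cost(k, m) <= budget]
    if not feasible:
        return None
    return max(feasible, key=lambda km: predicted_perf(*km))

for budget in (16_000, 128_000, 1_000_000):
    print(budget, best_config(budget))  # larger budgets admit more docs and shots
```

The configuration grid is small enough that brute-force enumeration suffices; the monotone predictor simply pushes the chosen $(k, m)$ toward the feasibility boundary as the budget grows.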
7 Conclusion
In this paper, we explore inference scaling in long-context RAG. By systematically studying performance under different inference configurations, we demonstrate that RAG performance improves almost linearly as test-time compute grows by orders of magnitude, provided the inference parameters are optimally chosen. Based on these observations, we derive inference scaling laws for RAG and the corresponding computation allocation model, designed to predict RAG performance under varying hyperparameters. Through extensive experiments, we show that optimal configurations can be accurately estimated and align closely with the experimental results. These insights provide a strong foundation for future research in optimizing inference strategies for long-context RAG.
References
- Achiam et al. (2023) J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Agarwal et al. (2024) R. Agarwal, A. Singh, L. M. Zhang, B. Bohnet, L. Rosias, S. C. Chan, B. Zhang, A. Faust, and H. Larochelle. Many-shot in-context learning. In ICML 2024 Workshop on In-Context Learning, 2024.
- Beck et al. (2024) M. Beck, K. Pöppel, M. Spanring, A. Auer, O. Prudnikova, M. Kopp, G. Klambauer, J. Brandstetter, and S. Hochreiter. xLSTM: Extended long short-term memory. arXiv preprint arXiv:2405.04517, 2024.
- Beltagy et al. (2020) I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- Bertsch et al. (2024) A. Bertsch, M. Ivgi, U. Alon, J. Berant, M. R. Gormley, and G. Neubig. In-context learning with long-context models: An in-depth exploration. arXiv preprint arXiv:2405.00200, 2024.
- Borgeaud et al. (2022) S. Borgeaud, A. Mensch, J. Hoffmann, T. Cai, E. Rutherford, K. Millican, G. B. Van Den Driessche, J.-B. Lespiau, B. Damoc, A. Clark, et al. Improving language models by retrieving from trillions of tokens. In International conference on machine learning, pages 2206–2240. PMLR, 2022.
- Brown et al. (2020) T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Chen et al. (2023) S. Chen, S. Wong, L. Chen, and Y. Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.
- Choromanski et al. (2020) K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794, 2020.
- Dao et al. (2022) T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré. Flashattention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
- Geva et al. (2021) M. Geva, D. Khashabi, E. Segal, T. Khot, D. Roth, and J. Berant. Did Aristotle use a laptop? a question answering benchmark with implicit reasoning strategies. Transactions of the Association for Computational Linguistics, 9:346–361, 2021.
- Gu and Dao (2023) A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- Gu et al. (2021) A. Gu, K. Goel, and C. Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
- Gu et al. (2023) Y. Gu, L. Dong, F. Wei, and M. Huang. Pre-training to learn in context. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4849–4870, 2023.
- Gutiérrez et al. (2024) B. J. Gutiérrez, Y. Shu, Y. Gu, M. Yasunaga, and Y. Su. HippoRAG: Neurobiologically inspired long-term memory for large language models. arXiv preprint arXiv:2405.14831, 2024.
- Guu et al. (2020) K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR, 2020.
- Ho et al. (2020) X. Ho, A.-K. D. Nguyen, S. Sugawara, and A. Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020.
- Izacard and Grave (2021) G. Izacard and É. Grave. Leveraging passage retrieval with generative models for open domain question answering. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 874–880, 2021.
- Izacard et al. (2023) G. Izacard, P. Lewis, M. Lomeli, L. Hosseini, F. Petroni, T. Schick, J. Dwivedi-Yu, A. Joulin, S. Riedel, and E. Grave. Atlas: Few-shot learning with retrieval augmented language models. Journal of Machine Learning Research, 24(251):1–43, 2023.
- Jacobs et al. (2023) S. A. Jacobs, M. Tanaka, C. Zhang, M. Zhang, L. Song, S. Rajbhandari, and Y. He. DeepSpeed Ulysses: System optimizations for enabling training of extreme long sequence transformer models. arXiv preprint arXiv:2309.14509, 2023.
- Jiang et al. (2023) Z. Jiang, F. F. Xu, L. Gao, Z. Sun, Q. Liu, J. Dwivedi-Yu, Y. Yang, J. Callan, and G. Neubig. Active retrieval augmented generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7969–7992, 2023.
- Jiang et al. (2024) Z. Jiang, X. Ma, and W. Chen. LongRAG: Enhancing retrieval-augmented generation with long-context LLMs. arXiv preprint arXiv:2406.15319, 2024.
- Joshi et al. (2017) M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017.
- Karpukhin et al. (2020) V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6769–6781, 2020.
- Khandelwal et al. (2019) U. Khandelwal, O. Levy, D. Jurafsky, L. Zettlemoyer, and M. Lewis. Generalization through memorization: Nearest neighbor language models. In International Conference on Learning Representations, 2019.
- Kitaev et al. (2019) N. Kitaev, L. Kaiser, and A. Levskaya. Reformer: The efficient transformer. In International Conference on Learning Representations, 2019.
- Koo et al. (2024) T. Koo, F. Liu, and L. He. Automata-based constraints for language model decoding. arXiv preprint arXiv:2407.08103, 2024.
- Kuratov et al. (2024) Y. Kuratov, A. Bulatov, P. Anokhin, I. Rodkin, D. Sorokin, A. Sorokin, and M. Burtsev. BABILong: Testing the limits of LLMs with long context reasoning-in-a-haystack. arXiv preprint arXiv:2406.10149, 2024.
- Kwiatkowski et al. (2019) T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
- Lee et al. (2024a) J. Lee, A. Chen, Z. Dai, D. Dua, D. S. Sachan, M. Boratko, Y. Luan, S. M. Arnold, V. Perot, S. Dalmia, et al. Can long-context language models subsume retrieval, RAG, SQL, and more? arXiv preprint arXiv:2406.13121, 2024a.
- Lee et al. (2024b) J. Lee, Z. Dai, X. Ren, B. Chen, D. Cer, J. R. Cole, K. Hui, M. Boratko, R. Kapadia, W. Ding, et al. Gecko: Versatile text embeddings distilled from large language models. arXiv preprint arXiv:2403.20327, 2024b.
- Leng et al. (2024) Q. Leng, J. Portes, S. Havens, M. Zaharia, and M. Carbin. Long context RAG performance of large language models. arXiv preprint arXiv:2411.03538, 2024.
- Lewis et al. (2020) P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
- Li et al. (2023) M. Li, S. Gong, J. Feng, Y. Xu, J. Zhang, Z. Wu, and L. Kong. In-context learning with many demonstration examples. arXiv preprint arXiv:2302.04931, 2023.
- Li et al. (2024) T. Li, G. Zhang, Q. D. Do, X. Yue, and W. Chen. Long-context LLMs struggle with long in-context learning. arXiv preprint arXiv:2404.02060, 2024.
- Lin et al. (2024) X. V. Lin, X. Chen, M. Chen, W. Shi, M. Lomeli, R. James, P. Rodriguez, J. Kahn, G. Szilvasy, M. Lewis, et al. RA-DIT: Retrieval-augmented dual instruction tuning. In The Twelfth International Conference on Learning Representations, 2024.
- Liu et al. (2023) H. Liu, M. Zaharia, and P. Abbeel. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889, 2023.
- Liu et al. (2022) J. Liu, D. Shen, Y. Zhang, W. B. Dolan, L. Carin, and W. Chen. What makes good in-context examples for GPT-3? In Proceedings of Deep Learning Inside Out (DeeLIO 2022): The 3rd Workshop on Knowledge Extraction and Integration for Deep Learning Architectures, pages 100–114, 2022.
- Liu et al. (2024a) S. Liu, H. Ye, L. Xing, and J. Y. Zou. In-context vectors: Making in context learning more effective and controllable through latent space steering. In Forty-first International Conference on Machine Learning, 2024a.
- Liu et al. (2024b) Z. Liu, W. Ping, R. Roy, P. Xu, C. Lee, M. Shoeybi, and B. Catanzaro. ChatQA: Surpassing GPT-4 on conversational QA and RAG. arXiv preprint arXiv:2401.10225, 2024b.
- Lu et al. (2022) Y. Lu, M. Bartolo, A. Moore, S. Riedel, and P. Stenetorp. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8086–8098, 2022.
- Ma et al. (2023) X. Ma, Y. Gong, P. He, H. Zhao, and N. Duan. Query rewriting in retrieval-augmented large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 5303–5315, 2023.
- Min et al. (2022) S. Min, M. Lewis, L. Zettlemoyer, and H. Hajishirzi. MetaICL: Learning to learn in context. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2791–2809, 2022.
- Ni et al. (2021) J. Ni, C. Qu, J. Lu, Z. Dai, G. H. Ábrego, J. Ma, V. Y. Zhao, Y. Luan, K. B. Hall, M.-W. Chang, et al. Large dual encoders are generalizable retrievers. arXiv preprint arXiv:2112.07899, 2021.
- Peng et al. (2023a) B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, S. Biderman, H. Cao, X. Cheng, M. Chung, L. Derczynski, et al. RWKV: Reinventing RNNs for the transformer era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 14048–14077, 2023a.
- Peng et al. (2023b) B. Peng, J. Quesnelle, H. Fan, and E. Shippole. YaRN: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071, 2023b.
- Petroni et al. (2020) F. Petroni, A. Piktus, A. Fan, P. Lewis, M. Yazdani, N. De Cao, J. Thorne, Y. Jernite, V. Karpukhin, J. Maillard, et al. KILT: a benchmark for knowledge intensive language tasks. arXiv preprint arXiv:2009.02252, 2020.
- Press et al. (2021) O. Press, N. A. Smith, and M. Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409, 2021.
- Press et al. (2023) O. Press, M. Zhang, S. Min, L. Schmidt, N. A. Smith, and M. Lewis. Measuring and narrowing the compositionality gap in language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 5687–5711, 2023.
- Ram et al. (2023) O. Ram, Y. Levine, I. Dalmedigos, D. Muhlgay, A. Shashua, K. Leyton-Brown, and Y. Shoham. In-context retrieval-augmented language models. Transactions of the Association for Computational Linguistics, 11:1316–1331, 2023.
- Reid et al. (2024) M. Reid, N. Savinov, D. Teplyashin, D. Lepikhin, T. Lillicrap, J.-b. Alayrac, R. Soricut, A. Lazaridou, O. Firat, J. Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
- Rubin et al. (2022) O. Rubin, J. Herzig, and J. Berant. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2655–2671, 2022.
- Sarthi et al. (2024) P. Sarthi, S. Abdullah, A. Tuli, S. Khanna, A. Goldie, and C. D. Manning. RAPTOR: Recursive abstractive processing for tree-organized retrieval. In The Twelfth International Conference on Learning Representations, 2024.
- Shao et al. (2024) R. Shao, J. He, A. Asai, W. Shi, T. Dettmers, S. Min, L. Zettlemoyer, and P. W. Koh. Scaling retrieval-based language models with a trillion-token datastore. arXiv preprint arXiv:2407.12854, 2024.
- Shi et al. (2024) W. Shi, S. Min, M. Yasunaga, M. Seo, R. James, M. Lewis, L. Zettlemoyer, and W.-t. Yih. REPLUG: Retrieval-augmented black-box language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8364–8377, 2024.
- Snell et al. (2024) C. Snell, J. Lee, K. Xu, and A. Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
- Sun et al. (2024) P. Sun, D. Simcha, D. Dopson, R. Guo, and S. Kumar. SOAR: improved indexing for approximate nearest neighbor search. Advances in Neural Information Processing Systems, 36, 2024.
- Sun et al. (2023) Y. Sun, L. Dong, B. Patra, S. Ma, S. Huang, A. Benhaim, V. Chaudhary, X. Song, and F. Wei. A length-extrapolatable transformer. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14590–14604, 2023.
- Team et al. (2023) G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Trivedi et al. (2022) H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal. MuSiQue: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022.
- Trivedi et al. (2023) H. Trivedi, N. Balasubramanian, T. Khot, and A. Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10014–10037, 2023.
- Wang et al. (2024) X. Wang, W. Zhu, M. Saxon, M. Steyvers, and W. Y. Wang. Large language models are latent variable models: Explaining and finding good demonstrations for in-context learning. Advances in Neural Information Processing Systems, 36, 2024.
- Wei et al. (2022) J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
- Wei et al. (2023) J. Wei, L. Hou, A. Lampinen, X. Chen, D. Huang, Y. Tay, X. Chen, Y. Lu, D. Zhou, T. Ma, et al. Symbol tuning improves in-context learning in language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 968–979, 2023.
- Wu et al. (2024) K. Wu, E. Wu, and J. Zou. How faithful are RAG models? quantifying the tug-of-war between RAG and LLMs’ internal prior. arXiv preprint arXiv:2404.10198, 2024.
- Wu et al. (2023) Z. Wu, Y. Wang, J. Ye, and L. Kong. Self-adaptive in-context learning: An information compression perspective for in-context example selection and ordering. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1423–1436, 2023.
- Xu et al. (2024) P. Xu, W. Ping, X. Wu, L. McAfee, C. Zhu, Z. Liu, S. Subramanian, E. Bakhturina, M. Shoeybi, and B. Catanzaro. Retrieval meets long context large language models. In The Twelfth International Conference on Learning Representations, 2024.
- Yan et al. (2024) S.-Q. Yan, J.-C. Gu, Y. Zhu, and Z.-H. Ling. Corrective retrieval augmented generation. arXiv preprint arXiv:2401.15884, 2024.
- Yang et al. (2018) Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, 2018.
- Yoran et al. (2024) O. Yoran, T. Wolfson, O. Ram, and J. Berant. Making retrieval-augmented language models robust to irrelevant context. In The Twelfth International Conference on Learning Representations, 2024.
- Yu et al. (2023) W. Yu, H. Zhang, X. Pan, K. Ma, H. Wang, and D. Yu. Chain-of-note: Enhancing robustness in retrieval-augmented language models. arXiv preprint arXiv:2311.09210, 2023.
- Yue et al. (2024) Z. Yue, H. Zeng, Y. Lu, L. Shang, Y. Zhang, and D. Wang. Evidence-driven retrieval augmented response generation for online misinformation. arXiv preprint arXiv:2403.14952, 2024.
- Zaheer et al. (2020) M. Zaheer, G. Guruganesh, K. A. Dubey, J. Ainslie, C. Alberti, S. Ontanon, P. Pham, A. Ravula, Q. Wang, L. Yang, et al. Big bird: Transformers for longer sequences. Advances in neural information processing systems, 33:17283–17297, 2020.
- Zhang et al. (2024) T. Zhang, S. G. Patil, N. Jain, S. Shen, M. Zaharia, I. Stoica, and J. E. Gonzalez. RAFT: Adapting language model to domain specific rag. arXiv preprint arXiv:2403.10131, 2024.
Appendix A Retrieval Quality
We assess the retrieval quality of DRAG and IterDRAG using the Gecko-1B model (Lee et al., 2024b) and evaluate its impact on final RAG performance. Specifically, we retrieve varying numbers of documents per input query, from 1 to 2k, and measure retrieval quality with three metrics: Recall, NDCG, and MRR. The retrieval results of DRAG are shown in Figure 7. In addition, we evaluate the quality of iterative retrieval, where a maximum of five interleaved retrieval steps are performed. Here, we retrieve 50 documents at each step and use a 2-shot setting; the results are compared to DRAG in Table 5.
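For reference, the three metrics can be computed as in this sketch, assuming binary relevance; the document IDs and the example ranking are illustrative.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of gold documents that appear in the top-k ranked list."""
    return len(set(ranked[:k]) & set(relevant)) / len(relevant)

def mrr(ranked, relevant):
    """Reciprocal rank of the first relevant document (0 if none retrieved)."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG: DCG of the ranking over DCG of an ideal ranking."""
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc in enumerate(ranked[:k], start=1) if doc in relevant)
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

ranked = ["d3", "d7", "d1", "d9", "d2"]  # hypothetical retrieved ranking
gold = {"d1", "d2"}                      # hypothetical supporting documents
print(recall_at_k(ranked, gold, 5))      # 1.0: both gold docs are in the top 5
print(mrr(ranked, gold))                 # first gold doc at rank 3 -> 1/3
```

Note how recall rewards mere presence in the retrieved set, while NDCG and MRR discount documents that appear low in the ranking — the gap behind the divergence discussed below.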
Figure 7: Retrieval performance of DRAG on different datasets.
In Figure 7, recall demonstrates consistent improvements as the number of documents increases, approaching near-perfect scores when large document sets (e.g., 1k) are retrieved. However, both NDCG and MRR plateau early, at around 100 documents, with diminishing gains as the document count rises further. This divergence suggests that while more documents lead to better recall, the relevance and ranking quality (captured by NDCG and MRR) do not improve proportionally, and the additional documents may even introduce extensive noise. Therefore, higher recall does not necessarily translate into better final answer quality when the retrieved documents are not effectively ranked or filtered.
Table 5: Retrieval performance of DRAG and IterDRAG ( $k=50$ documents, $m=2$ shots).
| | Bamboogle | | | HotpotQA | | | MuSiQue | | | 2WikiMultiHopQA | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | Recall | NDCG | MRR | Recall | NDCG | MRR | Recall | NDCG | MRR | Recall | NDCG | MRR |
| DRAG | 0.632 | 0.321 | 0.239 | 0.783 | 0.535 | 0.465 | 0.509 | 0.255 | 0.188 | 0.722 | 0.421 | 0.336 |
| IterDRAG | 0.736 | 0.420 | 0.346 | 0.855 | 0.549 | 0.478 | 0.670 | 0.365 | 0.291 | 0.935 | 0.605 | 0.528 |
Unlike the one-step retrieval in DRAG, iterative retrieval based on query decomposition often yields simpler sub-queries, facilitating more effective retrieval. In addition, merging the retrieved documents from different steps typically results in higher overall retrieval performance, as evidenced in Table 5. With IterDRAG, the performance gains are consistent, averaging 30.5%. Specifically, we observe greater gains for complex multi-hop queries (e.g., 2WikiMultiHopQA), where metric improvements reach as high as 57.1%. Moreover, the rank-discounted metrics show greater improvements (30.7% in NDCG and 39.9% in MRR) compared to recall (21.7%). In summary, these findings highlight the superiority of iterative retrieval with query decomposition over one-step retrieval, which effectively contributes to the overall performance of IterDRAG.
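The effect of merging documents across retrieval steps can be illustrated with a small sketch (hypothetical document IDs; this is not the paper's retrieval pipeline):

```python
def merged_recall(step_results, relevant):
    """Recall of the union of documents retrieved across (sub-)query steps."""
    merged = set().union(*step_results)
    return len(merged & set(relevant)) / len(relevant)

# Hypothetical multi-hop question with two gold documents: the one-step query
# surfaces only one hop, while decomposed sub-queries cover both.
one_step = [{"d1", "d4", "d5"}]
decomposed = [{"d1", "d4"}, {"d2", "d6"}]
gold = {"d1", "d2"}
print(merged_recall(one_step, gold))    # 0.5
print(merged_recall(decomposed, gold))  # 1.0
```

Because each sub-query targets a single hop, the union of per-step retrievals covers supporting facts that a single query over the composite question tends to miss.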
Appendix B Chain-of-Thought vs. IterDRAG
Table 6: Chain-of-thought (CoT) vs. IterDRAG results ( $k=5$ documents, $m=4$ shots).
| | HotpotQA | | | MuSiQue | | | 2WikiMultiHopQA | | |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| | EM | F1 | Acc | EM | F1 | Acc | EM | F1 | Acc |
| CoT | 40.2 | 51.3 | 45.6 | 8.9 | 16.1 | 10.8 | 33.0 | 37.9 | 36.7 |
| IterDRAG | 44.8 | 59.4 | 52.8 | 17.9 | 30.1 | 25.9 | 57.5 | 69.9 | 72.3 |
To evaluate different iterative strategies, we compare the commonly used chain-of-thought (CoT) prompting (Wei et al., 2022) with IterDRAG. In particular, we generate the CoT examples following Trivedi et al. (2023) and adopt the 4-shot setting with 5 documents. The results on the three larger datasets (HotpotQA, MuSiQue and 2WikiMultiHopQA), reported in Table 6, highlight the performance differences between these strategies: IterDRAG consistently outperforms CoT by significant margins. The difference can be traced to three key factors: (1) without interleaved retrieval as in IterDRAG, the retrieval quality of CoT is limited; (2) Gemini 1.5 Flash is relatively small and may not perform free-form reasoning as well as larger LLMs; and (3) the generated CoT examples are less informative than handcrafted ones and underperform compared to constrained decoding with Self-Ask (Press et al., 2023; Koo et al., 2024). Consequently, IterDRAG demonstrates its effectiveness as a scalable method for knowledge-intensive tasks.
Appendix C Additional RAG Results
<details>
<summary>2410.04343v2/x7.png Details</summary>

### Visual Description
Heatmaps of EM, F1, and Acc performance (blue = low, red = high), with document counts (0-Doc to 1000-Doc) on the x-axis and shot counts (0-Shot to 2^8-Shot) on the y-axis. Approximate values read from the heatmaps:
#### **EM Performance**
| Shot Type | 0-Doc | 1-Doc | 2-Doc | 5-Doc | 10-Doc | 20-Doc | 50-Doc | 100-Doc | 200-Doc | 500-Doc | 1000-Doc |
|-----------------|-------|-------|-------|-------|--------|--------|--------|---------|---------|---------|----------|
| 0-Shot | 12.2 | 13.9 | 18.6 | 19.1 | 20.5 | 21.4 | 20.6 | 22.4 | 21.1 | 21.3 | 21.2 |
| 2^0-Shot | 26.4 | 34.1 | 36.8 | 38.8 | 42.0 | 43.2 | 45.1 | 46.3 | 46.6 | | |
| 2^1-Shot | 27.6 | 32.7 | 36.5 | 40.8 | 44.4 | 45.2 | 44.9 | 47.5 | 46.9 | | |
| 2^2-Shot | 27.8 | 34.8 | 40.4 | 42.7 | 45.4 | 46.4 | 47.5 | 48.6 | 47.6 | | |
| 2^3-Shot | 28.0 | 35.4 | 41.4 | 44.0 | 46.0 | 47.2 | 48.7 | 48.8 | 49.7 | | |
| 2^4-Shot | 28.7 | 37.0 | 43.0 | 45.4 | 47.4 | 47.2 | 48.2 | 48.8 | | | |
| 2^5-Shot | 30.7 | 38.6 | 42.2 | 45.7 | 48.1 | 47.3 | 47.3 | | | | |
| 2^6-Shot | 29.6 | 39.8 | 41.8 | 44.7 | 47.1 | 47.2 | | | | | |
| 2^7-Shot | 30.5 | 39.3 | 42.2 | 44.5 | 46.8 | | | | | | |
| 2^8-Shot | 31.0 | 40.3 | 42.3 | 45.2 | 47.5 | | | | | | |
#### **F1 Performance**
| Shot Type | 0-Doc | 1-Doc | 2-Doc | 5-Doc | 10-Doc | 20-Doc | 50-Doc | 100-Doc | 200-Doc | 500-Doc | 1000-Doc |
|-----------------|-------|-------|-------|-------|--------|--------|--------|---------|---------|---------|----------|
| 0-Shot | 24.6 | 25.4 | 30.3 | 33.3 | 35.8 | 36.7 | 36.7 | 38.2 | 37.7 | 38.0 | 38.5 |
| 2^0-Shot | 35.9 | 44.7 | 48.5 | 52.1 | 55.0 | 56.5 | 58.7 | 59.6 | 59.1 | | |
| 2^1-Shot | 36.5 | 42.8 | 47.8 | 53.8 | 56.9 | 57.6 | 57.7 | 59.9 | 59.1 | | |
| 2^2-Shot | 37.5 | 45.6 | 51.2 | 56.0 | 58.0 | 59.6 | 60.6 | 61.3 | 60.0 | | |
| 2^3-Shot | 37.8 | 45.4 | 52.3 | 57.4 | 58.2 | 59.4 | 60.5 | 60.8 | 61.1 | | |
| 2^4-Shot | 37.7 | 47.8 | 53.7 | 58.1 | 59.2 | 60.0 | 60.3 | 60.4 | | | |
| 2^5-Shot | 39.5 | 48.9 | 53.9 | 58.4 | 59.9 | 59.6 | 59.3 | | | | |
| 2^6-Shot | 39.0 | 51.2 | 54.2 | 57.2 | 59.0 | 59.7 | | | | | |
| 2^7-Shot | 39.8 | 50.8 | 54.2 | 57.6 | 59.4 | | | | | | |
| 2^8-Shot | 40.7 | 50.5 | 53.6 | 57.3 | 59.1 | | | | | | |
#### **Acc Performance**
| Shot Type | 0-Doc | 1-Doc | 2-Doc | 5-Doc | 10-Doc | 20-Doc | 50-Doc | 100-Doc | 200-Doc | 500-Doc | 1000-Doc |
|-----------------|-------|-------|-------|-------|--------|--------|--------|---------|---------|---------|----------|
| 0-Shot | 31.3 | 32.8 | 35.6 | 40.3 | 42.2 | 44.5 | 44.7 | 46.8 | 46.8 | 46.3 | 46.3 |
| 2^0-Shot | 30.6 | 42.4 | 46.1 | 48.6 | 52.1 | 53.4 | 55.7 | 56.2 | 56.5 | | |
| 2^1-Shot | 31.4 | 40.2 | 43.7 | 50.2 | 53.5 | 53.5 | 53.8 | 55.4 | 54.7 | | |
| 2^2-Shot | 31.4 | 42.6 | 48.2 | 52.4 | 53.7 | 55.9 | 56.3 | 56.8 | 55.9 | | |
| 2^3-Shot | 31.5 | 42.0 | 48.7 | 52.8 | 53.6 | 55.1 | 55.8 | 56.0 | 56.4 | | |
| 2^4-Shot | 31.8 | 43.6 | 49.4 | 53.2 | 54.6 | 54.7 | 55.0 | 55.2 | | | |
| 2^5-Shot | 33.6 | 43.7 | 48.8 | 53.0 | 54.8 | 54.4 | 54.2 | | | | |
| 2^6-Shot | 32.7 | 45.2 | 48.2 | 50.8 | 53.1 | 53.6 | | | | | |
| 2^7-Shot | 33.7 | 44.5 | 48.3 | 50.5 | 52.9 | | | | | | |
| 2^8-Shot | 34.2 | 44.2 | 47.5 | 50.8 | 53.1 | | | | | | |
Empty cells correspond to shot/document combinations that were not evaluated. Performance generally improves with more documents and more shots, with the best values concentrated toward larger context configurations.
</details>
Figure 8: IterDRAG performance heatmap for different metrics averaged across datasets.
We report the IterDRAG results averaged across datasets in Figure 8, shown as heatmaps where the x-axis represents the number of documents and the y-axis represents the number of shots. Performance is color-coded, with blue indicating lower values and red indicating higher values. The best-performing combinations are located toward the bottom right of each heatmap, which corresponds to longer context lengths. In comparison to DRAG, as reported in Figure 5(a), the optimal number of in-context examples is higher at 32, which highlights the importance of in-context demonstrations in enabling better query decomposition and interleaved retrieval. Combined with multiple generation steps, IterDRAG further improves RAG performance over DRAG.
<details>
<summary>2410.04343v2/x8.png Details</summary>

### Visual Description
Two heatmaps of DRAG accuracy, TriviaQA (left) and Natural Questions (right), with the number of documents on the x-axis (`1-Doc` to `100-Doc`) and the number of shots on the y-axis (`2^0-Shot` to `2^8-Shot`); blue indicates lower and red higher accuracy. Values read from the heatmaps:
#### TriviaQA Accuracy
| Shot Size | 1-Doc | 2-Doc | 5-Doc | 10-Doc | 20-Doc | 50-Doc | 100-Doc |
|-------------|-------|-------|-------|--------|--------|--------|---------|
| 2^0-Shot | 63.6 | 64.7 | 66.6 | 67.6 | 68.6 | 65.9 | 68.2 |
| 2^2-Shot | 64.0 | 65.5 | 66.9 | 67.7 | 68.5 | 69.0 | 65.7 |
| 2^4-Shot | 65.2 | 66.5 | 67.5 | 67.9 | 68.6 | 66.8 | - |
| 2^6-Shot | 65.5 | 66.1 | 67.3 | 67.5 | 68.2 | - | - |
| 2^8-Shot | 65.5 | 66.4 | 67.2 | 67.6 | - | - | - |
Accuracy on TriviaQA improves with more documents up to around 20–50 documents, peaking at 69.0 (`2^2-Shot`, 50-Doc), and declines at larger document counts.
#### NaturalQ Accuracy
| Shot Size | 1-Doc | 2-Doc | 5-Doc | 10-Doc | 20-Doc | 50-Doc | 100-Doc |
|-------------|-------|-------|-------|--------|--------|--------|---------|
| 2^0-Shot | 44.8 | 49.3 | 53.5 | 54.0 | 54.6 | 42.3 | 45.6 |
| 2^2-Shot | 45.2 | 48.5 | 52.0 | 53.2 | 53.0 | 41.5 | 44.0 |
| 2^4-Shot | 45.3 | 49.0 | 51.7 | 52.3 | 52.7 | 44.2 | - |
| 2^6-Shot | 45.8 | 49.4 | 52.2 | 52.7 | 52.0 | - | - |
| 2^8-Shot | 45.2 | 48.4 | 51.2 | 50.7 | - | - | - |
Accuracy on Natural Questions peaks around 20 documents (54.6 at `2^0-Shot`, 20-Doc) and drops noticeably at 50 and 100 documents; TriviaQA accuracy is consistently higher than Natural Questions across shot sizes and document counts. Cells marked `-` were not evaluated.
</details>
Figure 9: Evaluation accuracy of DRAG on TriviaQA and Natural Questions (NaturalQ.).
In addition to multi-hop question answering datasets, we also report results on one-hop datasets, specifically TriviaQA and Natural Questions (Joshi et al., 2017; Kwiatkowski et al., 2019). The evaluations for one-hop datasets are performed with DRAG and presented in Figure 9, similar to Figure 8. For TriviaQA, increasing the number of documents generally leads to improved accuracy, where the highest accuracy of 69.0% is achieved with 50 documents. In Natural Questions, performance increases with the number of documents up to about 10 or 20 documents, but further increases in the document count lead to diminishing returns or even slight declines in accuracy. The highest accuracy of 54.6% is achieved with 20 documents in 1-shot, and performance drops slightly when more documents are included. In summary, the optimal number of shots falls between 1 and 4. While increasing the number of shots and documents leads to initial performance gains, these improvements plateau beyond certain thresholds. This trend, in contrast to multi-hop datasets, may be partially attributed to the nature of the one-hop questions and retrieval relevance.
Table 7: StrategyQA accuracy results.
| | Zero-shot QA | Many-shot QA | RAG | DRAG | IterDRAG |
| --- | --- | --- | --- | --- | --- |
| Acc | 61.1 | 74.7 | 74.7 | 79.0 | 83.4 |
We also include StrategyQA, a multi-hop dataset with binary answers, in our experiments; see Table 7 (Geva et al., 2021). Despite the binary question format, we observe trends similar to our main experiments. For example, DRAG consistently outperforms the baseline QA and RAG methods, with a 29.3% relative accuracy improvement over the zero-shot QA baseline. Furthermore, performance is boosted to 83.4% accuracy with the iterative IterDRAG. These results demonstrate that even for binary, multi-hop tasks, iterative approaches provide substantial gains, confirming the effectiveness of both long-context and iterative strategies for inference scaling in RAG.
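The 29.3% figure is a relative improvement, which can be verified directly from the Table 7 accuracies:

```python
zero_shot_acc = 61.1  # StrategyQA zero-shot QA accuracy (Table 7)
drag_acc = 79.0       # StrategyQA DRAG accuracy (Table 7)

# Relative gain of DRAG over the zero-shot QA baseline, in percent.
relative_gain = (drag_acc - zero_shot_acc) / zero_shot_acc * 100
print(f"{relative_gain:.1f}%")  # 29.3%
```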
Appendix D Additional Results on Inference Scaling Laws with GTR Retriever
<details>
<summary>2410.04343v2/x9.png Details</summary>

### Visual Description
Two line charts of normalized performance (y-axis, -2 to 2) versus effective context length (x-axis, log scale from 10² to 10⁶). Left: RAG (purple triangles). Right: DRAG (blue triangles) and IterDRAG (green triangles). In both charts, red circles mark the optimal configuration at each context length and a dashed black line traces their trend, which rises steadily with longer effective contexts; IterDRAG attains the highest values at the largest context lengths.
</details>
Figure 10: Normalized performance with increasing effective context lengths on MuSiQue, the results are obtained with GTR XXL retriever and Gemini 1.5 Flash.
To enhance the generalizability of our scaling observations and validate the findings with an alternative retriever, we conduct additional experiments using the open-source GTR XXL retriever and Gemini 1.5 Flash (Ni et al., 2021). Figure 10 shows the results on MuSiQue, evaluated on 100 sampled examples from the dataset for computational efficiency. In contrast to the performance plateau of standard RAG, DRAG and IterDRAG yield consistent performance gains with increasing context length, especially IterDRAG at longer context lengths. Overall, the results demonstrate consistent inference scaling patterns even with a different retriever model, highlighting the potential of expanding test-time compute in long-context RAG in more general scenarios.
Appendix E Additional Results on Inference Scaling Laws for RAG
<details>
<summary>2410.04343v2/x10.png Details</summary>

### Visual Description
Two line charts of normalized performance (y-axis, -2 to 2) versus effective context length (x-axis, log scale from 10² to 10⁶). Left: RAG (purple triangles) against the optimal configurations (red circles) and a dashed trend line. Right: DRAG (blue triangles) and IterDRAG (green triangles) against the optimal configurations. All methods improve with longer effective contexts; the optimal configurations follow a near-linear upward trend, with DRAG and IterDRAG tracking it more closely than RAG at long context lengths.
</details>
(a) Normalized performance vs. effective context lengths on Bamboogle.
<details>
<summary>2410.04343v2/x11.png Details</summary>

### Visual Description
Two line charts of normalized performance (y-axis, -2 to 2) versus effective context length (x-axis, log scale from 10² to 10⁶). Left: RAG (purple triangles) rises steeply at short contexts, then flattens beyond roughly 10⁴ tokens. Right: DRAG (blue triangles) plateaus earlier, while IterDRAG (green triangles) continues to gain up to about 10⁵ tokens, approaching the optimal configurations (red circles, dashed trend line).
</details>
(b) Normalized performance vs. effective context lengths on HotpotQA.
<details>
<summary>2410.04343v2/x12.png Details</summary>

### Visual Description
Two line charts of normalized performance (y-axis, -2 to 2) versus effective context length (x-axis, log scale from 10² to 10⁶). Left: RAG (purple triangles) is volatile at short contexts and levels off near zero. Right: DRAG (blue triangles) plateaus around 10⁴ tokens, while IterDRAG (green triangles) rises sharply between 10⁴ and 10⁶ tokens, tracking the optimal configurations (red circles, dashed trend line) most closely at long context lengths.
</details>
(c) Normalized performance vs. effective context lengths on 2WikiMultiHopQA.
Figure 11: Normalized performance with increasing effective context lengths on different datasets.
We present dataset-specific results on the relationship between performance and effective context length. Figure 11 covers the three datasets other than MuSiQue (see Figure 1 for the visualized results on MuSiQue). The behavior varies across datasets: the gains are more linear and consistent on Bamboogle and MuSiQue, and nearly linear on 2WikiMultiHopQA up to 1M tokens. However, HotpotQA, and 2WikiMultiHopQA at effective context lengths beyond 100k tokens, exhibit more sigmoidal patterns, likely due to the difficulty of the datasets and the quality of the retrieved documents.
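Comparing trends across datasets with different metric ranges requires putting the metrics on a common scale. A common choice for such "normalized performance" plots is a per-dataset z-score over all evaluated configurations; this is a plausible sketch of that normalization, not necessarily the paper's exact procedure:

```python
import math

def z_normalize(values):
    """Z-score normalization: rescale one dataset's metric values to zero
    mean and unit (population) variance so trends are comparable across
    datasets with different metric ranges."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var) for v in values]

# Hypothetical accuracies for one dataset at increasing context lengths.
normalized = z_normalize([30.0, 40.0, 50.0])
print([round(x, 2) for x in normalized])  # symmetric around 0
```

Under this scheme, a value of 0 marks a dataset's average configuration, so curves from datasets with very different absolute scores (e.g., MuSiQue vs. HotpotQA) can share one y-axis.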
Appendix F Additional Results on Computation Allocation Model for RAG
<details>
<summary>2410.04343v2/x13.png Details</summary>

### Visual Description
Four line charts (Bamboogle, HotpotQA, MuSiQue and 2WikiMultiHopQA) of normalized performance (y-axis) versus number of shots (x-axis, log scale from 10⁰ to 10²). Each chart compares four document settings (`0-Doc`, `1-Doc`, `10-Doc`, `100-Doc`) with shaded confidence bands. The `100-Doc` setting performs best and `0-Doc` worst throughout; gains peak at moderate shot counts (around 10¹) and diminish at larger shot counts, where the confidence bands also widen.
</details>
Figure 12: The estimated performance using the proposed computation allocation model vs. actual metric values in IterDRAG. The subplots represent different datasets, where each line corresponds to a fixed number of documents, we scale the context length by increasing the number of shots.
Table 8: Computation allocation model of Gemini 1.5 Flash with $p$ -value, $R^{2}$ and MSE statistics.
| | Parameters | | | | | | | $R^{2}$ | MSE |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Value | 0.325 | 0.101 | 0.177 | -0.067 | -0.008 | 0 | -0.730 | 0.903 | 0.085 |
| $p$ -value | 0.000 | 0.000 | 0.000 | 0.000 | 0.092 | N/A | 0.000 | N/A | N/A |
We further explore the findings on the computation allocation model. In particular, we report the estimated parameters along with $p$ -values, $R^{2}$ , and MSE statistics in Table 8. In our implementation, we constrain the last element of $b$ , leaving six learnable parameters in total. Our analysis shows that all parameters are statistically significant, except for $b_{1}$ , which has a $p$ -value slightly above 0.05. Nonetheless, our experiments suggest that retaining $b_{1}$ improves generalization in many cases, such as IterDRAG on multi-hop datasets. For sigmoid scaling, we fit a custom function between the predicted $\hat{P}$ and ground truth $P$ values, defined as $\sigma(x)=\frac{3.30}{1+e^{-1.81(x+0.46)}}-2.18$ .
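The fitted calibration function can be written out directly; its two asymptotes follow from the reported constants and bound the calibrated performance values:

```python
import math

def sigma(x: float) -> float:
    """Sigmoid fitted between predicted and ground-truth performance:
    sigma(x) = 3.30 / (1 + exp(-1.81 * (x + 0.46))) - 2.18."""
    return 3.30 / (1.0 + math.exp(-1.81 * (x + 0.46))) - 2.18

# Calibrated values are bounded by the two asymptotes of the sigmoid:
print(round(sigma(-100.0), 2))  # lower asymptote: -2.18
print(round(sigma(100.0), 2))   # upper asymptote: 3.30 - 2.18 = 1.12
```

This squashing keeps the model's predictions within the observed range of normalized performance, which matters at the extremes of the compute budget where a linear fit would extrapolate unrealistically.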
We also visualize the predictions for IterDRAG across different datasets in Figure 12, where each subplot represents a dataset and each line corresponds to a document setting ( $k$ ). The inference compute is scaled by increasing the number of in-context examples ( $m$ ) and generation iterations ( $n$ ). Here, we find trends similar to those in Figure 6, although IterDRAG shows larger variations than DRAG. HotpotQA and 2WikiMultiHopQA align more consistently with the predictions, likely due to the predominance of multi-hop queries. In summary, our findings are consistent for both DRAG and IterDRAG, demonstrating that RAG performance can be accurately modeled by our computation allocation model for RAG. For Bamboogle, HotpotQA and 2WikiMultiHopQA, we provide the normalized performance with increasing effective context lengths in Figure 11, where we observe trends similar to the results on MuSiQue (see Figure 1). We also illustrate the prediction surface for both DRAG and IterDRAG in Figure 13.
![3D scatter of normalized performance over the number of documents and number of shots, overlaid with the fitted prediction surface](2410.04343v2/x14.png)
(a) Performance vs. predicted surface for DRAG.
![3D scatter of normalized performance over the number of documents and number of shots, overlaid with the fitted prediction surface](2410.04343v2/x15.png)
(b) Performance vs. predicted surface for IterDRAG.
Figure 13: Normalized performance vs. predicted surface for DRAG and IterDRAG.
Inaccurate or outdated retrieval
- **Question**: What is the lowest elevation of the longest railway tunnel? **Prediction**: 500 meters **Annotation**: 312 m
- **Question**: According to QS World University Rankings, where does the college that Ibrahim Shihata attended rank? **Prediction**: 3rd **Annotation**: 551-600
- **Question**: Which battle occurred first, the Battle of Manila or the Battle of Guam? **Prediction**: Battle of Manila **Annotation**: Battle of Guam
(a) Example mistakes due to inaccurate or outdated retrieval.
Incorrect or lack of reasoning
- **Question**: Which mountain, Masherbrum or Khunyang Chhish, is a taller mountain? **Prediction**: Masherbrum **Annotation**: Khunyang Chhish
- **Question**: What is the date of death of the director of film The Organization (Film)? **Prediction**: April 15, 2018 **Annotation**: December 12, 2012
- **Question**: Who introduced a system of musical notation in the 14th century that is used in the area where most of the invasion of the eastern Roman Empire took place? **Prediction**: Philippe de Vitry **Annotation**: John Kukuzelis
(b) Example mistakes due to incorrect or lack of reasoning.
Hallucination or unfaithful reasoning
- **Question**: Who was the last emperor of the dynasty that succeeded the Song dynasty? **Prediction**: Emperor Yuanzhen **Annotation**: Toghon Temür
- **Question**: What is another notable work by the illustrator of Sylvester and the Magic Pebble? **Prediction**: Shrek! **Annotation**: Doctor De Soto
- **Question**: In what movie did a Kenyan-Mexican actress, who graduated from Hampshire College, star in in 2015? **Prediction**: Queen of Katwe **Annotation**: Star Wars: The Force Awakens
(c) Example mistakes due to hallucination or unfaithful reasoning.
Evaluation issues or refusal to answer
- **Question**: The most populous city in Punjab is how large (area wise)? **Prediction**: 310 sq. km **Annotation**: 310 square kilometers
- **Question**: Renáta Tomanová and Larisa Neiland are former professional athletes for what sport? **Prediction**: Tennis **Annotation**: Professional tennis
(d) Example mistakes due to evaluation issues or refusal to answer.
Figure 14: Example mistakes of DRAG and IterDRAG across datasets.
Appendix G Error Analysis
Despite the performance gains from scaling effective context length, RAG performance on challenging datasets like MuSiQue remains moderate, even for IterDRAG. To understand why, we analyze the mistakes of both DRAG and IterDRAG and examine the limitations inherent in these approaches. In the following, we explore common failure cases (see Figure 14) to understand where each method falls short and how it could be further improved.
We provide selected example mistakes from Figure 14(a) to Figure 14(d), with retrieved documents omitted for brevity. The reasons for common errors can be grouped into four categories: (1) inaccurate or outdated retrieval; (2) incorrect or lack of reasoning; (3) hallucination or unfaithful reasoning; and (4) evaluation issues or refusal to answer. We elaborate on these categories below:
- Inaccurate or outdated retrieval: A major source of RAG errors stems from the retrieval process, where relevant knowledge is not correctly retrieved. For example, in the first question of Figure 14(a), the top-50 retrieved documents do not contain the correct answer. A similar issue occurs in the second QA pair, where outdated retrieval results fail to provide useful information. In the third case, although both battles are retrieved, the initial documents focus overly on the Battle of Manila, leading to an incorrect response.
- Incorrect or lack of reasoning: Beyond retrieval issues, incorrect reasoning chains are another common source of errors. For example, in the first case in Figure 14(b), although the correct documents are retrieved, the reasoning process is incomplete (i.e., no explicit comparison of the mountain heights), leading to an incorrect answer in DRAG. Similarly, in the second and third cases, the reasoning is either absent (as in DRAG) or flawed. As a result, reasoning-related errors tend to occur more frequently in difficult questions and in the one-step DRAG approach.
- Hallucination or unfaithful reasoning: Beyond retrieval and reasoning, hallucination and unfaithful reasoning also contribute to errors in knowledge-intensive tasks. In the first case, the prediction is incorrect and cannot be found in the retrieved documents. In the remaining cases, while the answers are related, certain steps in the reasoning chain are flawed, causing errors in the final answers. These cases highlight the persistent challenge of hallucination in LLMs, particularly in long-context generation tasks.
- Evaluation issues or refusal to answer: Finally, we observe several issues that may lead to inaccurate evaluation. For instance, the use of abbreviations or variations in date format can result in incorrect scoring across all metrics. Moreover, our experiments do not account for abstaining from answering, which can result in unfair scores.
![Prompt construction: each of the $m$ in-context examples consists of $k$ retrieved documents, a query and an answer, followed by the test documents and test query, from which the model generates the final answer with RAG or iterative RAG](2410.04343v2/x16.png)
Figure 15: Input prompt that comprises $m$ in-context examples and the test documents and query, in which each document chunk consists of $k$ retrieved documents. For IterDRAG, the example answers additionally provide sub-queries and intermediate answers as demonstrations.
Appendix H Implementation
In our experiments, we utilize the Gecko-1B (en) embedding model to index both the documents and input queries (Lee et al., 2024b), using Wikipedia passages from the KILT benchmark as the document source (Petroni et al., 2020). At test time, the input query is compared against all embeddings in the corpus, and the top- $k$ neighbors are selected for inference. Each document is then truncated on the right to a maximum of 1024 tokens using whitespace tokenization. For each example, we arrange the elements in the following order: documents, query, and label, with the retrieved documents listed in reverse order, placing higher-ranked documents closer to the query (Liu et al., 2024b). Consequently, the prompt comprises multiple in-context examples, followed by the test documents and test query, as illustrated in Figure 15.
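As a rough illustration of this pipeline, the sketch below mimics the top- $k$ selection, whitespace truncation, and reverse-order example assembly. The embedding model is replaced by plain inner products over precomputed vectors, and all function names (`top_k_documents`, `truncate_doc`, `build_example`) are our own, not the paper's implementation:

```python
import numpy as np

def top_k_documents(query_emb, doc_embs, docs, k):
    """Return the k nearest documents by inner-product similarity
    (stand-in for the Gecko-1B embedding search described above)."""
    scores = doc_embs @ query_emb
    ranked = np.argsort(-scores)[:k]  # indices, best first
    return [docs[i] for i in ranked]

def truncate_doc(doc, max_tokens=1024):
    """Right-truncate a document to max_tokens whitespace tokens."""
    return " ".join(doc.split()[:max_tokens])

def build_example(docs, query, answer=None):
    """One example: documents in reverse rank order (top-ranked document
    sits closest to the query), then the query, then the label (or an
    empty 'Answer:' slot for the test query)."""
    parts = [truncate_doc(d) for d in reversed(docs)]
    parts.append(f"Question: {query}")
    parts.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(parts)
```

Concatenating $m$ such examples (with labels) followed by one unlabeled test example yields a prompt with the layout of Figure 15.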
For generation, we utilize Gemini 1.5 Flash for more efficient experiments. In DRAG, inference scaling is achieved by increasing the context length through the combination of documents ( $k$ ) and in-context examples ( $m$ ). The prompt (see Figure 16) is then provided to the model for one-time generation using the default generation parameters. For IterDRAG, the input prompt is constructed in a similar fashion, with the example answers consisting of assembled sub-queries, intermediate answers, and the final answer (see Figure 17). Here, we scale test-time compute by incorporating iterative retrieval and generation, along with increasing the numbers of documents and demonstrations. In each iteration, we restrict the generation to adhere to the Self-Ask format, in which the response must start with “Follow up: ”, “Intermediate answer: ” or “So the final answer is: ” (Koo et al., 2024). Each iteration begins with the generation of a sub-query and concludes with the production of an intermediate answer. If a sub-query is generated, additional documents are retrieved and appended to the initial set (i.e., Test Documents in Figure 15), after which the model generates an intermediate answer. We allow up to five iterations, after which the model is forced to produce the final answer.
To evaluate the estimated parameters within the computation allocation model for RAG, we normalize the performance metrics by subtracting the mean and dividing by the standard deviation for each dataset and metric. For DRAG, the effective context length is calculated by counting the tokens in the prompt, while for IterDRAG, it is determined by summing the context tokens across all inference requests. We constrain the last parameter in $b$ and perform ordinary least squares to estimate the remaining six parameters in Equation 2. To prevent numerical instability, we shift the values in $\theta$ by a small constant $\epsilon$ of 0.01. When computing $R^{2}$ and MSE, we manage noisy data by excluding peak and valley outliers in our experiments. However, for domain generalization and length extrapolation, all data points are included in the evaluation. To predict downstream task performance, $i$ should be computed for each task. Specifically, for each strategy and task: $i_{\text{doc}}=P(k=1,m=0,n=1)-P(k=0,m=0,n=1)$ and $i_{\text{shot}}=P(k=0,m=1,n=1)-P(k=0,m=0,n=1)$ . For the predicted optimal hyperparameters, we present the actual metric values to validate the efficacy of the computation allocation model for RAG.
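A minimal sketch of the per-dataset normalization and the constrained least-squares fit is given below. This is a generic illustration, not the paper's code: the actual design matrix follows Equation 2, and here the constraint on the last parameter is imposed by simply dropping its column before solving (equivalent to fixing the coefficient to zero):

```python
import numpy as np

def zscore(values):
    """Normalize a metric within one dataset: subtract the mean and
    divide by the standard deviation."""
    v = np.asarray(values, dtype=float)
    return (v - v.mean()) / v.std()

def fit_constrained_ols(X, y):
    """Ordinary least squares with the LAST coefficient constrained to 0:
    drop the last column of the design matrix, solve for the remaining
    coefficients, and re-insert the zero."""
    coef, *_ = np.linalg.lstsq(X[:, :-1], y, rcond=None)
    return np.append(coef, 0.0)
```

In the same spirit, the shift by $\epsilon=0.01$ would be applied to $\theta$ before building the design matrix, so that zero-valued inference parameters do not produce degenerate features.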
Prompt for DRAG
You are an expert in question answering. I am going to give you one or more example triples of context, question and answer, in which the context may or may not be relevant to the question. The examples will be written. Context (which may or may not be relevant): <Retrieved documents> Question: What is the place of birth of the director of film Servant’S Entrance? Answer: Helsingfors <Further demonstrations> After the examples, I am going to provide another pair of context and question, in which the context may or may not be relevant to the question. I want you to answer the question. Give only the answer, and no extra commentary, formatting, or chattiness. Answer the question. Context (which may or may not be relevant): <Retrieved documents> Question: Who was born first out of Thomas Henry Holland and Jean-Mandé Sigogne? Answer:
Figure 16: Example prompt for DRAG. The prompt comprises instructions and a varying number of demonstrations, followed by a test example.
Prompt for IterDRAG
You are an expert in question answering. I am going to give you one or more example sets of context, question, potential follow up questions and their respective answers, in which the context may or may not be relevant to the questions. The examples will be written. Context: <Retrieved documents> Question: What nationality is the director of film Boggy Creek Ii: And The Legend Continues? Follow up: Who is the director of the film Boggy Creek II: And The Legend Continues? Intermediate answer: The director of the film Boggy Creek II: And The Legend Continues is Charles B. Pierce. Follow up: What is the nationality of Charles B. Pierce? Intermediate answer: The nationality of Charles B. Pierce is American. So the final answer is: American <Further demonstrations> After the examples, I am going to provide another pair of context and question, in which the context may or may not be relevant to the question. I want you to answer the question. When needed, generate follow up question(s) using the format ’Follow up: X’, where X is the follow up question. Then, answer each follow up question using ’Intermediate answer: X’ with X being the answer. Finally, answer to the main question with the format ’So the final answer is: X’, where X is the final answer. Context: <Retrieved documents (with interleaving retrieval)> Question: Where was the director of film Death Of A Friend born? Follow up: | Intermediate answer: | So the final answer is:
Figure 17: Example prompt for IterDRAG. The prompt comprises instructions and a varying number of demonstrations, followed by a test example. In each iteration, we constrain the generation to follow the Self-Ask format with constrained decoding.