# Base of RoPE Bounds Context Length
**Authors**:
- Xin Men (Baichuan Inc.)
- Mingyu Xu (Baichuan Inc.)
- Bingning Wang (Baichuan Inc.)
- Qingyu Zhang
- Hongyu Lin
- Xianpei Han
- Weipeng Chen (Baichuan Inc.)

> Equal contribution. Corresponding author: daniel@baichuan-inc.com
## Abstract
Position embedding is a core component of current Large Language Models (LLMs). Rotary position embedding (RoPE), a technique that encodes the position information with a rotation matrix, has been the de facto choice for position embedding in many LLMs, such as the Llama series. RoPE has further been utilized to extend long-context capability, roughly by adjusting the base parameter of RoPE to mitigate out-of-distribution (OOD) problems in position embedding. However, in this paper, we find that LLMs may obtain a superficial long-context ability under the OOD theory. We revisit the role of RoPE in LLMs and propose a novel property of long-term decay, from which we derive that the base of RoPE bounds context length: there is an absolute lower bound on the base value required to obtain a given context length capability. Our work reveals the relationship between context length and RoPE base both theoretically and empirically, which may shed light on future long-context training.
*(Figure: context length $x$ versus the lower bound of RoPE's base $y$ on a log-scaled axis; the measured points follow the fitted curve $y=0.0424\,x^{1.628}$.)*
Figure 1: Context length and its corresponding lower bound of RoPE’s base value.
## 1 Introduction
In the past few years, large language models have demonstrated surprising capabilities and undergone rapid development. By now, LLMs have been widely applied across various domains, including chatbots, intelligent agents, and code assistants (Achiam et al., 2023; Jiang et al., 2023b). The Transformer (Vaswani et al., 2017), based on the attention mechanism, has been the most popular backbone of LLMs due to its good performance and scaling properties (Tay et al., 2022). One of the key component modules in the Transformer is position embedding, which is introduced to embed positional information that is vital for processing sequential data. Rotary position embedding (RoPE), which encodes relative distance information in the form of absolute position embedding (Su et al., 2024), has been a popular choice and applied in many LLMs (Touvron et al., 2023a; Yang et al., 2023; Bai et al., 2023).
RoPE introduces no trainable parameters and shows improvement in language modeling and many other tasks (Su et al., 2024; Heo et al., 2024). One reason RoPE is widely used is its ability for context length extrapolation (Peng et al., 2023b; Chen et al., 2023), which extends the context length of a trained LLM without expensive retraining. In practice, many works (Touvron et al., 2023a; Liu et al., 2024a; Young et al., 2024) have successfully extended the window length by simply increasing the base value, the only hyper-parameter in RoPE, and fine-tuning on long texts.
The reasons behind the success of these long context extensions are often explained as avoiding out-of-distribution (OOD) rotation angles (Liu et al., 2024b; Han et al., 2023) in RoPE, meaning the extended context length (OOD) can be mapped to the in-distribution context length that has been properly trained. Based on the OOD theory, a recent study (Liu et al., 2024b) finds that a smaller base can mitigate OOD and is beneficial for the model’s ability to process long contexts, which inspires us to further study the relationship between the base of RoPE and the length of context the model can process.
In this paper, we find that a model may show superficial long-context capability with an inappropriate RoPE base value, in which case the model preserves low perplexity but loses the ability to retrieve long-context information. We also show that the out-of-distribution (OOD) theory in position embedding, which motivates most length extrapolation works (Peng et al., 2023b; Chen et al., 2023; Liu et al., 2024b), is insufficient to fully reflect the model’s ability to process long contexts. Therefore, we revisit the role of RoPE in LLMs and derive a novel property of long-term decay in RoPE: the ability to pay more attention to similar tokens than to random tokens decays as the relative distance increases. While previous long-context works often focus on the relative scale of the RoPE base, our theory yields an absolute lower bound for the base value of RoPE to obtain a certain context length ability, as shown in Figure 1. To verify our theory, we conduct thorough experiments on various LLMs such as Llama2-7B (Touvron et al., 2023b), Baichuan2-7B (Yang et al., 2023), and a 2-billion-parameter model we trained from scratch, demonstrating that this lower bound holds not only in the fine-tuning stage but also in the pre-training stage.
We summarize the contributions of the paper as follows:
- Theoretical perspective: we derive a novel property of long-term decay in RoPE, characterizing the model’s ability to attend more to similar tokens than to random tokens, which offers a new perspective for studying the long-context capability of LLMs.
- Lower Bound of RoPE’s Base: to achieve the expected context length capability, we derive an absolute lower bound for RoPE’s base according to our theory. In short, the base of RoPE bounds context length.
- Superficial Capability: we reveal that if the RoPE’s base is smaller than a lower bound, the model may obtain superficial long context capability, which can preserve low perplexity but lose the ability to retrieve information from long context.
## 2 Background
In this section, we first introduce the Transformer and RoPE, which are most commonly used in current LLMs. Then we discuss long context methods based on the OOD of rotation angle theory.
### 2.1 Attention and RoPE
Current LLMs are primarily based on the Transformer (Vaswani et al., 2017), whose core component is the attention mechanism. Naive attention can be written as:
$$
A_{ij}=q_{i}^{T}k_{j},\qquad\text{ATTN}(X)=\text{softmax}(A/\sqrt{d})\,v, \tag{1}
$$

where $A\in\mathbb{R}^{L\times L}$ and $q,k,v\in\mathbb{R}^{d}$. Position embedding is introduced to make use of the order of the sequence in attention.
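As a minimal, illustrative sketch of Eq. 1 (not code from the paper; `naive_attention` is a hypothetical helper):

```python
import numpy as np

def naive_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    A = Q @ K.T                              # (L, L) score matrix, A_ij = q_i^T k_j
    A = A / np.sqrt(d)
    A = A - A.max(axis=-1, keepdims=True)    # subtract row max for numerical stability
    P = np.exp(A)
    P = P / P.sum(axis=-1, keepdims=True)    # row-wise softmax
    return P @ V

rng = np.random.default_rng(0)
L, d = 4, 8
Q, K, V = rng.normal(size=(3, L, d))
out = naive_attention(Q, K, V)
assert out.shape == (L, d)
```

Note that no position information enters this computation, which is exactly why position embedding is needed.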
RoPE (Su et al., 2024) implements relative position embedding through absolute position embedding by applying a rotation matrix in the attention-score calculation of Eq. 1, which can be written as:
$$
A_{ij}=(R_{i,\theta}q_{i})^{T}(R_{j,\theta}k_{j})=q_{i}^{T}R_{j-i,\theta}k_{j}=q_{i}^{T}R_{m,\theta}k_{j}, \tag{3}
$$
where $m=j-i$ is the relative distance between $i$ and $j$, and $R_{m,\theta}$ is a rotation matrix defined as:
$$
R_{m,\theta}=\left[\begin{array}{ccccccc}\cos(m\theta_{0})&-\sin(m\theta_{0})&0&0&\cdots&0&0\\
\sin(m\theta_{0})&\cos(m\theta_{0})&0&0&\cdots&0&0\\
0&0&\cos(m\theta_{1})&-\sin(m\theta_{1})&\cdots&0&0\\
0&0&\sin(m\theta_{1})&\cos(m\theta_{1})&\cdots&0&0\\
\vdots&\vdots&\vdots&\vdots&\ddots&\vdots&\vdots\\
0&0&0&0&\cdots&\cos(m\theta_{d/2-1})&-\sin(m\theta_{d/2-1})\\
0&0&0&0&\cdots&\sin(m\theta_{d/2-1})&\cos(m\theta_{d/2-1})\end{array}\right] \tag{11}
$$
Generally, the rotation angles are chosen as $\theta_{i}=base^{-2i/d}$. The typical base value for current LLMs is 10,000; the settings used in several LLMs are shown in Table 1.
Table 1: The setting of RoPE’s base and context length in various LLMs.
| Base | 10,000 | 10,000 | 500,000 | 1,000,000 | 10,000 |
| --- | --- | --- | --- | --- | --- |
| Length | 2,048 | 4,096 | 8,192 | 32,768 | 4,096 |
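To make the rotation concrete, here is a small sketch (hypothetical helpers `rope_angles` and `apply_rope`, assuming the interleaved pairing of Eq. 11) that also checks RoPE's relative-position property from Eq. 3:

```python
import numpy as np

def rope_angles(d, base=10_000.0):
    """theta_i = base^(-2i/d) for i = 0 .. d/2 - 1."""
    i = np.arange(d // 2)
    return base ** (-2.0 * i / d)

def apply_rope(x, m, theta):
    """Apply the block-diagonal rotation R_{m,theta} of Eq. 11 to a vector x."""
    pairs = x.reshape(-1, 2)                       # consecutive pairs (x_{2i}, x_{2i+1})
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    out = np.empty_like(pairs)
    out[:, 0] = cos * pairs[:, 0] - sin * pairs[:, 1]
    out[:, 1] = sin * pairs[:, 0] + cos * pairs[:, 1]
    return out.reshape(-1)

# Relative-position property of Eq. 3: (R_i q)^T (R_j k) = q^T R_{j-i} k.
d = 8
theta = rope_angles(d)
rng = np.random.default_rng(0)
q, k = rng.normal(size=(2, d))
lhs = apply_rope(q, 3, theta) @ apply_rope(k, 7, theta)
rhs = q @ apply_rope(k, 4, theta)                  # m = 7 - 3 = 4
assert np.isclose(lhs, rhs)
```

The check passes because the rotation blocks satisfy $R_{i}^{T}R_{j}=R_{j-i}$, so the attention score depends only on the relative distance.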
(a) base=1e4
(b) base=500
(c) base= $b· s^{\frac{d}{d-2}}$
Figure 2: An illustration of OOD in RoPE when we extend the context length from 4k to 32k, and two solutions to avoid the OOD. We show the last dimension, as it is the lowest-frequency part of RoPE and suffers most from OOD in extrapolation. (a) For a 4k-context model with base value 1e4, when we extend the context length to 32k without changing the base value, the positions from 4k to 32k are OOD for RoPE (red area in the figure). (b) OOD can be avoided with a small base value like 500 (Liu et al., 2024b), since the full period has been fitted during the fine-tuning stage. (c) We set the base to $b\cdot s^{\frac{d}{d-2}}$ following NTK (Peng et al., 2023b). The blue line denotes the pre-training stage (base=1e4) and the red dashed line the fine-tuning stage (base $=b\cdot s^{\frac{d}{d-2}}$); the RoPE rotation angles of the extended positions are in-distribution.
### 2.2 OOD theory of relative rotation angle
Based on RoPE, researchers have proposed various methods to extend the long context ability of LLMs, among which representatives are PI (Chen et al., 2023) and NTK-series (NTK-aware (bloc97, 2023), YaRN (Peng et al., 2023b), and Dynamical-NTK (emozilla, 2023)). Those methods depend on the relative scale $s=T_{\text{new}}/T_{\text{origin}}$ , where $T_{\text{origin}}$ is the training length of the original pre-trained model and $T_{\text{new}}$ is the training length in long-context fine-tuning.
**PI**
PI directly interpolates the position embedding, and the calculation of $A_{ij}$ becomes:
$$
A_{ij}=(R_{i/s}q_{i})^{T}(R_{j/s}k_{j})=q_{i}^{T}R_{(j-i)/s}k_{j}=q_{i}^{T}R_{m/s}k_{j}, \tag{12}
$$
In other words, the position embedding of the token at position $i$ in pre-training becomes $i/s$ in fine-tuning, ensuring the position embedding range of the longer context remains the same as before.
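A toy illustration of this rescaling (hypothetical `pi_position`, not the paper's code): with $s=8$, every position in a 32k window maps back into the original 4k range.

```python
def pi_position(m, s):
    """Position interpolation: position m is rescaled to m / s."""
    return m / s

# Extending a 4k model to 32k gives s = 32768 / 4096 = 8.
s = 32768 / 4096
assert pi_position(32767, s) < 4096   # every extended position falls in the trained range
assert pi_position(4096, s) == 512.0
```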
**NTK-series**
The idea is that neural networks have difficulty learning high-frequency features, and direct interpolation distorts the high-frequency parts. The NTK-aware method therefore achieves high-frequency extrapolation and low-frequency interpolation by modifying the base value of RoPE. Specifically, it modifies the base $b$ of RoPE to:
$$
\displaystyle b_{\text{new}}=b\ s^{\frac{d}{d-2}}. \tag{13}
$$
This expression is derived from $T_{\text{new}}b_{\text{new}}^{-\frac{d-2}{d}}=T_{\text{origin}}b^{-\frac{d-2}{d}}$, which ensures that the lowest-frequency part is interpolated.
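A small numeric check of Eq. 13 (illustrative; `ntk_base` is a hypothetical helper), using a head dimension of $d=128$ and a 4k-to-32k extension ($s=8$) as example values:

```python
def ntk_base(b, d, s):
    """NTK-aware base scaling: b_new = b * s^(d / (d - 2))  (Eq. 13)."""
    return b * s ** (d / (d - 2))

b, d, s = 10_000.0, 128, 32768 / 4096
b_new = ntk_base(b, d, s)

# The constraint behind the formula: the lowest-frequency rotation angle is
# interpolated, i.e. T_new * b_new^(-(d-2)/d) == T_origin * b^(-(d-2)/d).
lhs = 32768 * b_new ** (-(d - 2) / d)
rhs = 4096 * b ** (-(d - 2) / d)
assert abs(lhs - rhs) / rhs < 1e-9
```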
A recent study (Liu et al., 2024b) proposes to set a much smaller base (e.g. 500), in which case $\theta_{i}=base^{-\frac{2i}{d}}$ is small enough and typical training length (say 4,096) fully covers the period of $\cos(t-s)\theta_{i}$ , so the model can obtain longer context capabilities.
One perspective to explain current extrapolation methods is the OOD of rotation angle (Liu et al., 2024b; Han et al., 2023). If all possible values of $\cos(t-s)\theta_{i}$ have been fitted during the pre-training stage, OOD would be avoided when processing longer context. Figure 2 demonstrates how these methods avoid OOD of RoPE.
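The OOD argument can be checked numerically: the lowest-frequency component $\theta_{d/2-1}=base^{-(d-2)/d}$ completes a full period only after $2\pi/\theta_{d/2-1}$ tokens. A sketch, with $d=128$ as an assumed head dimension:

```python
import numpy as np

def lowest_period(base, d):
    """Tokens needed for one full period of the slowest RoPE component."""
    theta_min = base ** (-(d - 2) / d)
    return 2 * np.pi / theta_min

d = 128
# base = 10,000: the slowest component needs ~54k tokens per period, so
# positions beyond a 4,096-token training length are OOD for it.
assert lowest_period(10_000, d) > 4096
# base = 500: the full period (under 3k tokens) fits inside the training
# window, so longer contexts introduce no OOD rotation angles.
assert lowest_period(500, d) < 4096
```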
## 3 Motivation
(a) Perplexity
(b) Long-eval (Li* et al., 2023)
(c) Needle in Haystack (G, 2023)
Figure 3: The superficial long context capability of avoiding OOD by the smaller base. Following the recent work (Liu et al., 2024b), we fine-tune Llama2-7B with a small base (500) to a context length of 32k.
NTK-based methods are widely adopted in long-context extension (Touvron et al., 2023a; Liu et al., 2024a; Young et al., 2024). To obtain better long-context capability, however, practitioners often adopt a much larger base than the original NTK-aware method suggested. This leads to speculation that there is another bound of RoPE’s base determined by context length.
On the other hand, a recent work (Liu et al., 2024b) proposes to set a much smaller base for RoPE to extend the context length. However, we find this may be a superficial long-context capability, as shown in Figure 3. This method obtains low perplexity even at a 128k context length, which can be explained by the OOD theory described above, but the model fails to retrieve related information at context lengths as short as 1k, far below the model’s pre-trained length. Our findings support previous research (Hu et al., 2024) on the limitations of perplexity in evaluating long-context abilities. To delve deeper into this phenomenon, we conduct a theoretical exploration in the next section.
## 4 Theory Perspective
For the attention mechanism in language modeling, we have the following desiderata:
**Desiderata 1**
*Closer tokens get more attention: the current token tends to pay more attention to tokens at a smaller relative distance.*
**Desiderata 2**
*Similar tokens get more attention: the current token tends to pay more attention to tokens whose key vectors are more similar to its query vector.*
Then we examine the desiderata when we apply RoPE to the attention mechanism in LLMs.
### 4.1 Long-term Decay of Upper Bound of Attention Score
For Desiderata 1, the property of RoPE makes the model attend more to closer tokens. This kind of long-term decay has been thoroughly discussed in previous work (Su et al., 2024; Sun et al., 2022). It comes from the upper bound of attention score calculation, which can be written as:
$$
|A_{ij}|=|q_{i}^{T}R_{m}k_{j}|\leq\max_{l}(|h_{l}-h_{l+1}|)\sum_{n=1}^{d/2}|S_{n}|=\max_{l}(|h_{l}-h_{l+1}|)\sum_{n=1}^{d/2}\left|\sum_{l=0}^{n-1}e^{(j-i)\theta_{l}\sqrt{-1}}\right|, \tag{14}
$$
where $h_{l}=q_{i}[2l:2l+1]^{T}k_{j}[2l:2l+1]$. Eq. 14 indicates that the upper bound of the attention score $|A_{ij}|$ decays as the relative distance increases. Figure 5 shows the long-term decay curve of this upper bound, which is in accordance with previous findings (Su et al., 2024; Sun et al., 2022).
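The distance-dependent factor $\sum_{n}|S_{n}|$ in Eq. 14 can be evaluated numerically (an illustrative sketch; `relative_upper_bound` is a hypothetical helper, and $d=128$ is an assumed head dimension):

```python
import numpy as np

def relative_upper_bound(m, d, base=10_000.0):
    """sum_n |S_n| with S_n = sum_{l<n} exp(i * m * theta_l), the distance term of Eq. 14."""
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    phases = np.exp(1j * m * theta)
    S = np.cumsum(phases)          # partial sums S_1 .. S_{d/2}
    return np.abs(S).sum()

d = 128
b = [relative_upper_bound(m, d) for m in (0, 16, 256, 4096)]
# The bound is largest at m = 0 and decays with relative distance.
assert b[0] == max(b)
assert b[0] > b[1] > b[3]
```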
### 4.2 Long-term Decay of the Ability to Attend More to Similar Tokens than Random Tokens
In addition to the attention score’s upper bound, we also find there exists another long-term decay property in RoPE: the ability to attend more to similar tokens than random tokens decays as the relative distance increases. We define the ability to attend more to similar tokens than random tokens as:
$$
\displaystyle\mathbb{E}_{q,k^{*}}\left[q^{T}R_{m,\theta}k^{*}\right]-\mathbb{E}_{q,k}\left[q^{T}R_{m,\theta}k\right], \tag{15}
$$
where $q\in\mathbb{R}^{d}$ is the query vector of the current token, $k^{*}=q+\epsilon$ is the key vector of a similar token with $\epsilon$ a small random variable, $k\in\mathbb{R}^{d}$ is the key vector of a random token, and $R_{m,\theta}$ is the rotation matrix in RoPE. The first term in Eq. 15 is the attention score between $q$ and the similar token $k^{*}$; the second term is the attention score between $q$ and the random token $k$. We then derive the following theorem:
**Theorem 1**
*Assume that the components of the query $q\in\mathbb{R}^{d}$ and the key $k\in\mathbb{R}^{d}$ are independent and identically distributed with standard deviation $\sigma\in\mathbb{R}$, and that $k^{*}=q+\epsilon$ is a token similar to the query, where $\epsilon$ is a random variable with mean 0. Then we have:
$$
\displaystyle\frac{1}{2\sigma^{2}}(\mathbb{E}_{q,k^{*}}\left[q^{T}R_{m,\theta}k^{*}\right]-\mathbb{E}_{q,k}\left[q^{T}R_{m,\theta}k\right])=\sum_{i=0}^{d/2-1}\cos(m\theta_{i}) \tag{16}
$$*
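Theorem 1 can be checked empirically with a small Monte Carlo simulation. The sketch below compares the sampled left-hand side of Eq. 16 against $\sum_{i}\cos(m\theta_{i})$; the dimension, base, noise scale of $\epsilon$, and sample count are illustrative assumptions:

```python
import numpy as np

def rope_rotate(x, m, theta):
    """Apply the rotation R_{m,theta} to each pair (x_{2i}, x_{2i+1}) of the last axis."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = np.cos(m * theta), np.sin(m * theta)
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
d, m, base, sigma, n = 64, 100, 10000.0, 1.0, 50_000
theta = base ** (-2.0 * np.arange(d // 2) / d)

q = rng.normal(0.0, sigma, size=(n, d))
k = rng.normal(0.0, sigma, size=(n, d))           # random token
k_star = q + rng.normal(0.0, 0.1, size=(n, d))    # similar token: k* = q + eps

lhs = (np.einsum('nd,nd->n', q, rope_rotate(k_star, m, theta)).mean()
       - np.einsum('nd,nd->n', q, rope_rotate(k, m, theta)).mean()) / (2 * sigma ** 2)
rhs = np.cos(m * theta).sum()                     # B_{m,theta} from Theorem 1
```

With enough samples, `lhs` approaches $\sum_{i}\cos(m\theta_{i})$ regardless of the noise scale, since $\epsilon$ has zero mean.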
<details>
<summary>2405.14591v1/x8.png Details</summary>

### Visual Description
Two side-by-side line charts of the relative upper bound versus relative distance. Left panel: bases 1e2, 1e3, 1e4 over distances 0–4,000; right panel: bases 1e4, 1e5, 1e6 over distances 0–30,000. All curves decline steeply from their initial values (~10–15) and then oscillate around ~5.0–7.5 (left) or ~5.0–10.0 (right), with larger bases showing more pronounced oscillations.
</details>
Figure 4: The upper bound of attention score with respect to the relative distance.
<details>
<summary>2405.14591v1/x9.png Details</summary>

### Visual Description
Two side-by-side line charts of $B_{m,\theta}$ versus relative distance. Left panel: bases 1e2, 1e3, 1e4 over distances 0–4,000; right panel: bases 1e4, 1e5, 1e6 over distances 0–30,000. All curves start between ~0.4 and ~0.6 and decay toward zero with damped oscillations as the relative distance grows.
</details>
Figure 5: The ability to attend more to similar tokens than random tokens.
The proof is shown in Appendix A. We denote $\sum_{i=0}^{d/2-1}\cos(m\theta_{i})$ as $B_{m,\theta}$; according to Theorem 1, $B_{m,\theta}$ measures the ability to attend more to similar tokens than to random tokens, and it decreases as the relative distance $m$ increases, as shown in Figure 5. For a very small base value, $B_{m,\theta}$ even falls below zero beyond a certain distance, meaning random tokens receive larger attention scores than similar tokens, which may be problematic for long-context modeling.
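$B_{m,\theta}$ is straightforward to evaluate numerically. The sketch below locates the first relative distance at which $B_{m,\theta}$ drops below zero for a given base; the head dimension $d=128$ (as in Llama-style models) and the base values are illustrative assumptions:

```python
import numpy as np

def B(m, d=128, base=10000.0):
    """B_{m,theta} = sum_{i=0}^{d/2-1} cos(m * theta_i) with theta_i = base^(-2i/d)."""
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    return np.cos(m * theta).sum()

def first_negative_distance(d=128, base=10000.0, max_m=200_000):
    """Smallest relative distance m with B_{m,theta} < 0 (None if not reached by max_m)."""
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    Bm = np.cos(np.outer(np.arange(max_m), theta)).sum(axis=1)
    neg = np.flatnonzero(Bm < 0)
    return int(neg[0]) if neg.size else None

# A smaller base makes B_{m,theta} cross zero at a shorter relative distance.
m_small = first_negative_distance(base=1e2)
m_large = first_negative_distance(base=1e4)
```

At $m=0$ every cosine equals one, so $B_{0,\theta}=d/2$; as $m$ grows the terms dephase and the sum decays, eventually crossing zero.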
4.3 Base of RoPE Bounds the Context Length
To satisfy Desiderata 2, we need $\mathbb{E}_{q,k^{*}}\left[q^{T}R_{m,\theta}k^{*}\right]\geq\mathbb{E}_{q,k}\left[q^{T}R_{m,\theta}k\right]$. According to Theorem 1, this requires $B_{m,\theta}$ to be no less than zero. Given the $\theta$ in RoPE, the context length $L_{\theta}$ that can be truly obtained satisfies:
$$
\displaystyle L_{\theta}=\sup\{L|B_{m,\theta}\geq 0,\forall m\in[0,1,...,L]\} \tag{17}
$$
In other words, if we follow the setting $\theta_{i}=base^{-2i/d}$, then to obtain an expected context length $L$ there is a lower bound on the base value, $base_{L}$:
$$
\displaystyle base_{L}=\inf\{base|B_{m,\theta}\geq 0,\forall m\in[0,1,...,L]\} \tag{18}
$$
In summary, RoPE’s base determines the upper bound of the context length the model can truly obtain. Although this absolute lower bound exists, Eq. 17 and Eq. 18 are hard to solve in closed form since $B_{m,\theta}$ is a summation of many cosine functions. Therefore, in this paper, we obtain the solution numerically. Table 2 shows this lower bound for context lengths ranging from 1,000 to one million. In Figure 1, we plot the context length against the corresponding lower bound; as the context length increases, the required base also increases.
Table 2: Context length and its corresponding lower bound of RoPE’s base.
| Context Len. | 1k | 2k | 4k | 8k | 16k | 32k | 64k | 128k | 256k | 512k | 1M |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Lower Bound | 4.3e3 | 1.6e4 | 2.7e4 | 8.4e4 | 3.1e5 | 6.4e5 | 2.1e6 | 7.8e6 | 3.6e7 | 6.4e7 | 5.1e8 |
Note: this boundary is not strict, because the stacking of layers in LLMs allows the model to extract information beyond a single layer’s range, which may increase the context length in Eq. 17 and decrease the base in Eq. 18. Notwithstanding, in Section 5 we find that the derived bound approximates the real context length well in practice.
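Since no closed form is available, Eq. 18 can be solved numerically with a simple search over the base. A bisection sketch, under the assumptions (for illustration) that the head dimension is 128 and that the achievable context length $L_{\theta}$ grows monotonically with the base:

```python
import numpy as np

def context_bound(base, d=128, max_m=1_100):
    """L_theta of Eq. 17: the largest L with B_{m,theta} >= 0 for all m in [0, L]."""
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    Bm = np.cos(np.outer(np.arange(max_m), theta)).sum(axis=1)
    neg = np.flatnonzero(Bm < 0)
    return int(neg[0]) - 1 if neg.size else max_m

def base_lower_bound(L, d=128, lo=1e2, hi=1e9, iters=40):
    """Approximate base_L of Eq. 18 by bisection in log space, assuming
    L_theta grows monotonically with the base."""
    for _ in range(iters):
        mid = (lo * hi) ** 0.5
        if context_bound(mid, d, max_m=L + 1) >= L:
            hi = mid
        else:
            lo = mid
    return hi

# Example: approximate the lower bound of the base for a 1k context.
b1k = base_lower_bound(1000)
```

The returned value is the smallest tested base that keeps $B_{m,\theta}\geq 0$ over the whole range $[0, L]$; tightening `iters` narrows the bracket around $base_{L}$.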
Long-term decay from different perspectives. Sections 4.1 and 4.2 describe long-term decay from different perspectives. The former refers to the decay of the attention score’s upper bound as the relative distance increases, which ensures that the current token tends to pay more attention to tokens closer to it. The latter indicates that, with the introduction of the rotation matrix in attention, the ability to discriminate relevant tokens from irrelevant ones decreases as the relative distance increases. Therefore, a large $B_{m,\theta}$, corresponding to a large base value, is important for maintaining the model’s discrimination ability in long-context modeling.
5 Experiment
In this section, we conduct thorough experiments. The empirical results are summarized in Table 3; details are given in the following sections.
Table 3: In Section 5, we aim to answer the following questions.
| Q: Does RoPE’s base bound the context length during the fine-tuning stage? | Yes. When the base is small, it is difficult to extrapolate to the expected context length. |
| --- | --- |
| Q: Does RoPE’s base bound the context length during the pre-training stage? | Yes. Our proposed lower bound for RoPE’s base also applies to pre-training. If we train a model from scratch with a small base but a large context length (larger than the bounded length), the resulting model has very limited context length capability, meaning some of the context in pre-training is wasted. |
| Q: What happens when the base is set smaller than the lower bound? | The model acquires a superficial long-context capability: it can keep perplexity low, but cannot retrieve useful information from long contexts. |
5.1 Experiments Setup
For fine-tuning, we utilized Llama2-7B (Touvron et al., 2023a) and Baichuan2-7B (Yang et al., 2023), both popular open-source models employing RoPE with a base of 1e4. We used a fixed learning rate of 2e-5 and a global batch size of 128, and fine-tuned for 1,000 steps. For pre-training, we trained a Llama-like 2B model from scratch on a total of 1 trillion tokens, with a learning rate of 1e-4 and a cosine decay schedule. The dataset we used is a subset of RedPajama (Computer, 2023). More details of the experimental setup are provided in Appendix B.
Our evaluation focused on two aspects: (1) Perplexity: we use the PG19 dataset (Rae et al., 2019), which is often used in long-context evaluation; (2) Retrieval: in addition to perplexity, we also adopt retrieval, since it represents the real long-context understanding ability of LLMs. We choose a) the Long-eval benchmark from (Li* et al., 2023) and b) needle in a haystack (NIH) (G, 2023). The Long-eval benchmark generates numerous similar random sentences and asks the model to answer questions based on a specific sentence within the context, while NIH requires the model to retrieve information from various positions in the long context.
5.2 Base of RoPE bounds context length in fine-tuning stages
<details>
<summary>2405.14591v1/x10.png Details</summary>

### Visual Description
Line chart of perplexity (y-axis, ~6–20) versus context (x-axis, up to 30,000) for fine-tuned models with bases 1e4, 2e5, 9e5, 5e6, 1e9, and 1e12, plus a 4K baseline. The 4K baseline shows a sharp spike near the start before settling around 8, while all 32k-base curves remain flat between roughly 8 and 9.5 across the full context range, with larger bases sitting slightly higher.
</details>
(a) Perplexity
<details>
<summary>2405.14591v1/x11.png Details</summary>

### Visual Description
Line chart of Long-eval accuracy at 32k context (y-axis) versus RoPE base value (x-axis, logarithmic from 1e4 to 1e12), with a dashed vertical line marking the derived lower bound near 9e5. Accuracy is ~0 at base 1e4, rises sharply through 2e5 and 9e5, peaks around 0.35 at 5e6, dips slightly at 1e9, and stays high for bases at or above the lower bound.
</details>
(b) Long-eval 32k
Figure 6: Fine-tuning Llama2-7B-Base on 32k context length with varying RoPE’s base. Although the perplexity remains low with varying bases, the Long-eval accuracy reveals a discernible bound for the base value, below which the Long-eval accuracy declines significantly. The dotted line denotes the lower bound derived from Eq. 18.
According to Eq. 18, there is a lower bound on RoPE’s base determined by the expected context length. We fine-tune Llama2-7B-Base on a 32k context with varying bases. As depicted in Figure 6, although the difference in perplexity between different bases is negligible, the Long-eval accuracy varies significantly. In Figure 6(b), the dotted line denotes the lower bound derived from Eq. 18, below which the Long-eval accuracy declines significantly. Additional results are provided in Appendix C. Notably, this empirically observed lower bound closely aligns with our theoretical derivation. On the other hand, $base=2e5$ achieves the best perplexity while its Long-eval accuracy is very low, which again indicates the limitations of perplexity in evaluating long-context capabilities.
5.3 The Base of RoPE bounds context length in pre-training stages
<details>
<summary>2405.14591v1/x12.png Details</summary>

### Visual Description
Line chart of perplexity (~10–15) versus context (0–4,000) for a single curve: perplexity rises from ~11 to a peak of ~14.5 around context 2,000, then flattens and declines slightly to ~13.5 at 4,000.
</details>
<details>
<summary>2405.14591v1/x13.png Details</summary>

### Visual Description
Line chart of accuracy versus context length (1,000–5,000): accuracy starts at ~0.52 at 1,000, drops sharply to ~0.3 by 1,500, then fluctuates below 0.15, ending near 0.02 at 5,000.
</details>
<details>
<summary>2405.14591v1/x14.png Details</summary>

### Visual Description
Needle-in-a-haystack heatmap of retrieval score (0 = red to 10 = green) over token limit (0–5,000, x-axis) and context length (0–100, y-axis): scores are high at short token limits and low context lengths, with a green stripe near token limit 500, and degrade sharply beyond a token limit of ~2,000 and context length of ~50.
</details>
<details>
<summary>2405.14591v1/x15.png Details</summary>

### Visual Description
Line chart of perplexity (axis range 10–30) versus context (0–4,000) for a single curve: perplexity rises from ~13 to ~14.5 by context 1,000, then declines gradually to ~13.5 at 4,000.
</details>
<details>
<summary>2405.14591v1/x16.png Details</summary>

### Visual Description
Line chart of accuracy versus context length (1,000–5,000): accuracy starts near 0.58 at 1,000 and decays steadily to ~0.05–0.08 for context lengths of 4,000 and beyond.
</details>
<details>
<summary>2405.14591v1/x17.png Details</summary>
Heatmap of retrieval score (0 = red to 10 = green) over token limit (0–5000, x-axis) and context length (0–100, y-axis): predominantly green for token limits below ~1000, mixed green/yellow in the 1000–3000 range, and red/orange for token limits above ~3000.
</details>
<details>
<summary>2405.14591v1/x18.png Details</summary>
Line graph of perplexity (10–30) against context (0–4000), single blue series: starts near 12, peaks at ~14.5 around 2000, then settles at ~13.5–14.
</details>
<details>
<summary>2405.14591v1/x19.png Details</summary>
Line graph of accuracy (0.0–0.5) against context length (1000–5000), single blue series labeled "Model Performance": peaks at ~0.55 around 2000, drops sharply beyond 3000, and stabilizes at ~0.05–0.10 for lengths ≥4000.
</details>
<details>
<summary>2405.14591v1/x20.png Details</summary>
Heatmap of score (0 = red to 10 = green) over token limit (~100–5000, x-axis) and context length (0–100, y-axis): high scores dominate smaller token limits and moderate context lengths, with scores declining toward larger token limits.
</details>
Figure 7: The first row: the results of a 2B model training from scratch with base=1e2. The second row: The results of fine-tuning the 2B model with base=1e4. The third row: The results of fine-tuning the 2B model with base=1e6.
According to Theorem 1 and Eq. 18, this constraint should also apply to the pre-training stage. To validate this, we trained a 2B model from scratch with RoPE base=100. The results, depicted in the first row of Figure 7, indicate that even though the model was trained with a context length of 4,096 tokens, it was capable of retrieving information from only the most recent ~500 tokens. This demonstrates that the base parameter bounds the context length during the pre-training stage as well. We define the maximum length within which the model can effectively retrieve information as the effective context length.
According to our theory, the effective context length extends as RoPE's base increases. To validate this, we further fine-tuned this 2B model on a 32k context length with RoPE's base set to 1e4, as shown in the second row of Figure 7. While the effective context length increased, it remains significantly below 32k, since the effective context length bounded by base=1e4 is much smaller than 32k. Furthermore, when we increase the base to 1e6 and fine-tune the base 2B model on 32k (the third row of Figure 7), the model obtains a larger effective context length than with base=1e4, in accordance with our theory.
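This bound can be probed numerically. The sketch below (our own illustration, not the authors' code) computes $B_{m,\theta}=\sum_i \cos(m\theta_i)$ for a given base and reports the first relative distance $m$ at which it turns negative, a rough proxy for the effective context length under our theory; the head dimension 128 matches Llama2 but is otherwise an assumption.

```python
import numpy as np

def first_negative_m(base: float, d: int = 128, max_m: int = 40_000) -> int:
    """Return the first relative distance m where B_m = sum_i cos(m * theta_i)
    drops below zero, or -1 if it stays non-negative up to max_m."""
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)  # RoPE frequencies theta_i = base^(-2i/d)
    for m in range(1, max_m):
        if np.sum(np.cos(m * theta)) < 0.0:
            return m
    return -1

# A small base exhausts its discriminating range after a few hundred tokens,
# while a larger base keeps B_m non-negative much longer.
for base in (1e2, 1e4, 1e6):
    print(base, first_negative_m(base))
```

With base=1e2 the sum goes negative after only a few hundred positions, consistent with the ~500-token effective context length observed above.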
To further remove the influence of model size, we also fine-tuned a larger 7B model on a 32k context length with a RoPE base set to 1e4 and observed an effective context length nearly identical to that of the 2B model with the same RoPE base (see Appendix 14). This is empirical proof that the effective context length is determined by RoPE’s base.
5.4 Interpretation of the superficial long-context capability with a small base
Based on our theory and empirical observations, it is easy to explain what happens in Figure 3.
Better Extrapolation (Perplexity)? Due to the small base, $B_{m,\theta}$ can drop below zero as $m$ increases, as shown in Figure 5. The model cannot attend more to similar tokens than to random tokens at a large relative distance, so it tends to focus on nearby tokens. This leads to a smaller empirical receptive field, even smaller than the training length; in this regime the model maintains stable perplexity (Chi et al., 2023).
Worse Ability (Long-eval and NIH)! According to our previous analysis, RoPE’s base bounds the context length, and the context length bounded by 500 is much lower than that bound by 10,000. Therefore, when the base is set to 500, the effective context length drops sharply, even after training on 32k context length.
5.5 OOD theory is insufficient to reveal long context capability
Table 4: The comparison of "Method 1" and "Method 2". Both methods are carefully designed to avoid OOD rotation angles, yet they behave very differently under our theory.
| Method 1 | \usym 2718 | 0.33 | 0.27 | 0 | 0 |
| --- | --- | --- | --- | --- | --- |
| Method 2 | \usym 2718 | 0.40 | 0.00 | 97 | 2554 |
Section 3 mentions that methods based on the OOD theory of rotation angles may not fully reflect the long context capability. In this section, we conduct further experiments to substantiate and explain this observation. We present two methods to extend the context length of Llama2 from 4k to 32k. Both of them are devoid of OOD angles. These methods are delineated mathematically as follows:
- Method 1: $\theta_{i}=(5e6)^{-2i/d}$ ,
- Method 2: $\theta_{i}=\begin{cases}(1e4)^{-2i/128}/8,&i\geq 44\\(1e4\cdot 8^{128/88})^{-2i/128},&i<44.\end{cases}$
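The two schedules above can be written down directly. The sketch below (our own reconstruction, with head dimension 128 as in Llama2) implements both $\theta_i$ definitions and evaluates $B_{m,\theta}=\sum_i\cos(m\theta_i)$ at a few relative distances, which is how the region where "Method 2" violates $B_{m,\theta}\geq 0$ can be inspected.

```python
import numpy as np

D = 128                      # head dimension of Llama2
i = np.arange(D // 2)

# Method 1: theta_i = (5e6)^(-2i/d)
theta_m1 = (5e6) ** (-2.0 * i / D)

# Method 2: piecewise schedule; the two branches agree at i = 44
theta_m2 = np.where(
    i >= 44,
    (1e4) ** (-2.0 * i / D) / 8.0,
    (1e4 * 8.0 ** (128.0 / 88.0)) ** (-2.0 * i / D),
)

def B(m, theta):
    """B_{m,theta} = sum_i cos(m * theta_i), the quantity from Theorem 1."""
    return float(np.sum(np.cos(m * theta)))

for m in (1_000, 15_000, 20_000, 30_000):
    print(m, B(m, theta_m1), B(m, theta_m2))
```

Note that the two branches of Method 2 coincide at $i=44$, since $(1e4\cdot 8^{128/88})^{-88/128}=(1e4)^{-88/128}\cdot 8^{-1}$, so the schedule is continuous in $i$.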
We can see from Table 4 that these two methods exhibit significantly different long context capabilities. Under the perspective of OOD rotation angles, both methods avoid OOD, suggesting effective extrapolation. However, despite being trained on a context length of 32k, "Method 2" struggles to complete the retrieval task at a context length of 32k. This phenomenon is beyond what the OOD theory can explain.
Under our perspective, "Method 2" severely violates $B_{m,\theta}\geq 0$ when $m\in[15k,30k]$, thereby impeding its ability to achieve long-context discrimination. We speculate that the model may achieve better extrapolation in the fine-tuning stage if the base is sufficiently large to surpass the lower bound while also avoiding OOD rotation angles.
6 Related Work
Position embedding.
Since its introduction, the Transformer (Vaswani et al., 2017) has achieved remarkable results in the field of natural language processing. To make full use of the order of the sequence, researchers introduced position embedding. The earliest position embedding was based on sinusoidal functions (Vaswani et al., 2017) for absolute positions; learnable absolute position embedding (Devlin et al., 2018) and many variants (Kiyono et al., 2021; Li et al., 2019) were proposed later. Nevertheless, absolute position embedding has difficulty extending directly to texts longer than the training length. Subsequently, researchers proposed relative position embedding methods (Shaw et al., 2018; Ke et al., 2020). With the development of large language models, rotary position embedding and its variants (Su et al., 2024; Sun et al., 2022) have become widely used, e.g., in Llama2 (Touvron et al., 2023a), Baichuan2 (Yang et al., 2023), and Mistral-7B (Jiang et al., 2023a). A recent study reveals that transformers without any position embedding are also promising (Kazemnejad et al., 2024).
Long context learning.
Implementing models with longer or even infinitely long contexts has always been an important goal in natural language processing. Due to the quadratic complexity of the transformer in sequence length, a significant portion of the work focuses on improving the model structure (Gu & Dao, 2023; Peng et al., 2023a; Qin et al., 2024), although most work is still based on the transformer architecture. Another line of work aims at reducing the computational complexity of attention itself, such as sparse attention (Beltagy et al., 2020) and grouped-query attention (Ainslie et al., 2023). There are also optimizations in engineering efficiency, such as FlashAttention (Dao et al., 2022) and ring attention (Liu et al., 2023). In the inference stage, methods such as KV cache compression (Hooper et al., 2024) accelerate long-context decoding and save memory. Position embedding is also important for extrapolation: in the fine-tuning process, methods such as PI (Chen et al., 2023), NTK (bloc97, 2023), and YaRN (Peng et al., 2023b) change the original position embedding information, while FoT (Tworkowski et al., 2024) assigns tokens outside the local context the position of the first token in the local context.
7 Limitation
In this work, we investigate the relationship between the base of RoPE and context length. Although we have derived that there exists a lower bound for the base of RoPE determined by context length, the existence of the upper bound for RoPE’s base remains an open question that warrants further exploration. In addition, because of the lack of effective benchmarks for assessing long-context capabilities, the scope of long-context capabilities discussed in this paper may be limited.
8 Conclusion
Our work presents a comprehensive study on the role of RoPE in LLMs for effectively modeling long context. Our main contribution lies in uncovering a novel property of RoPE through theoretical analysis, demonstrating that as the relative distance between tokens increases, the model's ability to attend more to similar tokens decreases. Based on this property, we derive a lower bound for RoPE's base to accommodate an expected context length. Our experimental results validate that the base of RoPE bounds context length not only in the fine-tuning stage but also in pre-training. Our theory offers a new perspective on understanding the functionality of RoPE in long-context modeling. By shedding light on the relationship between context length and position embedding, we hope our work provides insights for enhancing the long context capability of LLMs.
References
- Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- Ainslie et al. (2023) Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebron, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
- Bai et al. (2023) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
- Beltagy et al. (2020) Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
- bloc97 (2023) bloc97. Ntk-aware scaled rope allows llama models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/, 2023.
- Chen et al. (2023) Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023.
- Chi et al. (2023) Ta-Chung Chi, Ting-Han Fan, Alexander Rudnicky, and Peter Ramadge. Dissecting transformer length extrapolation via the lens of receptive field analysis. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13522–13537, 2023.
- Computer (2023) Together Computer. Redpajama: An open source recipe to reproduce llama training dataset, April 2023. URL https://github.com/togethercomputer/RedPajama-Data.
- Dao et al. (2022) Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- emozilla (2023) emozilla. Dynamically scaled rope further increases performance of long context llama with zero fine-tuning. https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/, 2023.
- G (2023) Kamradt G. Needle in a haystack - pressure testing llms. https://github.com/gkamradt/LLMTest_NeedleInAHaystack, 2023.
- Gu & Dao (2023) Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
- Han et al. (2023) Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. Lm-infinite: Simple on-the-fly length generalization for large language models. arXiv preprint arXiv:2308.16137, 2023.
- Heo et al. (2024) Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. arXiv preprint arXiv:2403.13298, 2024.
- Hooper et al. (2024) Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. arXiv preprint arXiv:2401.18079, 2024.
- Hu et al. (2024) Yutong Hu, Quzhe Huang, Mingxu Tao, Chen Zhang, and Yansong Feng. Can perplexity reflect large language model’s ability in long text understanding? In The Second Tiny Papers Track at ICLR 2024, 2024.
- Jiang et al. (2023a) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023a.
- Jiang et al. (2023b) Nan Jiang, Kevin Liu, Thibaud Lutellier, and Lin Tan. Impact of code language models on automated program repair. In 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pp. 1430–1442. IEEE, 2023b.
- Kazemnejad et al. (2024) Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers. Advances in Neural Information Processing Systems, 36, 2024.
- Ke et al. (2020) Guolin Ke, Di He, and Tie-Yan Liu. Rethinking positional encoding in language pre-training. In International Conference on Learning Representations, 2020.
- Kiyono et al. (2021) Shun Kiyono, Sosuke Kobayashi, Jun Suzuki, and Kentaro Inui. Shape: Shifted absolute position embedding for transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 3309–3321, 2021.
- Li* et al. (2023) Dacheng Li*, Rulin Shao*, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph E. Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can open-source llms truly promise on context length?, June 2023. URL https://lmsys.org/blog/2023-06-29-longchat.
- Li et al. (2019) Hailiang Li, YC Adele, Yang Liu, Du Tang, Zhibin Lei, and Wenye Li. An augmented transformer architecture for natural language generation tasks. In 2019 International Conference on Data Mining Workshops (ICDMW), pp. 1–7. IEEE, 2019.
- Liu et al. (2023) Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.
- Liu et al. (2024a) Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024a.
- Liu et al. (2024b) Xiaoran Liu, Hang Yan, Chenxin An, Xipeng Qiu, and Dahua Lin. Scaling laws of roPE-based extrapolation. In The Twelfth International Conference on Learning Representations, 2024b. URL https://openreview.net/forum?id=JO7k0SJ5V6.
- Mohtashami & Jaggi (2024) Amirkeivan Mohtashami and Martin Jaggi. Random-access infinite context length for transformers. Advances in Neural Information Processing Systems, 36, 2024.
- Peng et al. (2023a) Bo Peng, Eric Alcaide, Quentin Anthony, Alon Albalak, Samuel Arcadinho, Stella Biderman, Huanqi Cao, Xin Cheng, Michael Chung, Leon Derczynski, et al. Rwkv: Reinventing rnns for the transformer era. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 14048–14077, 2023a.
- Peng et al. (2023b) Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. In The Twelfth International Conference on Learning Representations, 2023b.
- Qin et al. (2024) Zhen Qin, Songlin Yang, and Yiran Zhong. Hierarchically gated recurrent neural network for sequence modeling. Advances in Neural Information Processing Systems, 36, 2024.
- Rae et al. (2019) Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, Chloe Hillier, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling. In International Conference on Learning Representations, 2019.
- Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 464–468, 2018.
- Shoeybi et al. (2020) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020.
- Su et al. (2024) Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
- Sun et al. (2022) Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, and Furu Wei. A length-extrapolatable transformer. arXiv preprint arXiv:2212.10554, 2022.
- Tay et al. (2022) Yi Tay, Mostafa Dehghani, Samira Abnar, Hyung Won Chung, William Fedus, Jinfeng Rao, Sharan Narang, Vinh Q Tran, Dani Yogatama, and Donald Metzler. Scaling laws vs model architectures: How does inductive bias influence scaling? arXiv preprint arXiv:2207.10551, 2022.
- Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. Llama: Open and efficient foundation language models, 2023a.
- Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open foundation and fine-tuned chat models, 2023b.
- Tworkowski et al. (2024) Szymon Tworkowski, Konrad Staniszewski, Mikołaj Pacek, Yuhuai Wu, Henryk Michalewski, and Piotr Miłoś. Focused transformer: Contrastive training for context scaling. Advances in Neural Information Processing Systems, 36, 2024.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Yang et al. (2023) Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023.
- Young et al. (2024) Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652, 2024.
Appendix A The proof of Theorem 1.
Assuming that the components of query $q∈ R^{d}$ and key $k∈ R^{d}$ are independent, their standard deviations are denoted as $\sigma∈ R^{d}$ and their means are denoted as $\mu∈ R^{d}$. The key $k^{*}$ that is similar to $q$ is modeled as $q+\epsilon$, where $\epsilon$ is a random variable with a mean of 0. Then, we have:
$$
\begin{aligned}
\mathbb{E}_{q,k^{*}}q^{T}R_{m}k^{*}-\mathbb{E}_{q,k}q^{T}R_{m}k&=\mathbb{E}_{q}q^{T}R_{m}q+\mathbb{E}_{q,\epsilon}q^{T}R_{m}\epsilon-\mathbb{E}_{q,k}q^{T}R_{m}k\\
&=\mathbb{E}_{q}\sum_{i=0}^{d/2-1}\left(q_{2i}^{2}\cos(m\theta_{i})-q_{2i}q_{2i+1}\sin(m\theta_{i})+q_{2i+1}q_{2i}\sin(m\theta_{i})+q_{2i+1}^{2}\cos(m\theta_{i})\right)+\mathbb{E}_{q}q^{T}R_{m}\mathbb{E}_{\epsilon}\epsilon\\
&\quad-\mathbb{E}_{q,k}\sum_{i=0}^{d/2-1}\left(q_{2i}k_{2i}\cos(m\theta_{i})-q_{2i}k_{2i+1}\sin(m\theta_{i})+q_{2i+1}k_{2i}\sin(m\theta_{i})+q_{2i+1}k_{2i+1}\cos(m\theta_{i})\right)\\
&=\sum_{i=0}^{d/2-1}\left(\mathbb{E}(q_{2i}^{2})\cos(m\theta_{i})-\mu_{2i}\mu_{2i+1}\sin(m\theta_{i})+\mu_{2i}\mu_{2i+1}\sin(m\theta_{i})+\mathbb{E}(q_{2i+1}^{2})\cos(m\theta_{i})\right)+\mu^{T}R_{m}\cdot 0\\
&\quad-\sum_{i=0}^{d/2-1}\left(\mu_{2i}^{2}\cos(m\theta_{i})-\mu_{2i}\mu_{2i+1}\sin(m\theta_{i})+\mu_{2i}\mu_{2i+1}\sin(m\theta_{i})+\mu_{2i+1}^{2}\cos(m\theta_{i})\right)\\
&=\sum_{i=0}^{d/2-1}\left(\mathbb{E}(q_{2i}^{2}+q_{2i+1}^{2})-\mu_{2i}^{2}-\mu_{2i+1}^{2}\right)\cos(m\theta_{i})\\
&=\sum_{i=0}^{d/2-1}\left(\sigma_{2i}^{2}+\sigma_{2i+1}^{2}\right)\cos(m\theta_{i})
\end{aligned}\tag{19}
$$
Then we can get:
$$
\displaystyle\sum_{i=0}^{d/2-1}(\sigma_{2i}^{2}+\sigma_{2i+1}^{2})\cos(m\theta_{i})=\mathbb{E}_{q,k^{*}}q^{T}R_{m}k^{*}-\mathbb{E}_{q,k}q^{T}R_{m}k \tag{20}
$$
And when all $\sigma$ are equal, we can get:
$$
\displaystyle\sum_{i=0}^{d/2-1}\cos(m\theta_{i})=\frac{1}{2\sigma^{2}}(\mathbb{E}_{q,k^{*}}q^{T}R_{m}k^{*}-\mathbb{E}_{q,k}q^{T}R_{m}k) \tag{21}
$$
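The identity above can be checked numerically. The sketch below is a minimal Monte Carlo verification: it draws independent Gaussian queries and keys with per-component means $\mu$ and standard deviations $\sigma$ (the dimension, position, base, and distributions here are arbitrary test settings, not values from the paper) and compares the empirical estimate of $\mathbb{E}q^{T}R_{m}k^{*}-\mathbb{E}q^{T}R_{m}k$ against $\sum_{i}(\sigma_{2i}^{2}+\sigma_{2i+1}^{2})\cos(m\theta_{i})$.

```python
import numpy as np

# Monte Carlo check of Eq. 19 with arbitrary test settings (d, m, base,
# mu, sigma below are illustrative, not values from the paper).
rng = np.random.default_rng(0)
d, m, base = 8, 7, 10000.0
theta = base ** (-2.0 * np.arange(d // 2) / d)  # RoPE frequencies theta_i

def rotate(x, m):
    # Apply the rotation R_m to each consecutive pair (x_{2i}, x_{2i+1}).
    c, s = np.cos(m * theta), np.sin(m * theta)
    p = x.reshape(-1, d // 2, 2)
    out = np.empty_like(p)
    out[..., 0] = p[..., 0] * c - p[..., 1] * s
    out[..., 1] = p[..., 0] * s + p[..., 1] * c
    return out.reshape(-1, d)

mu = rng.normal(size=d)           # per-component means
sigma = rng.uniform(0.5, 1.5, d)  # per-component standard deviations
n = 200_000
q = mu + sigma * rng.normal(size=(n, d))
k = mu + sigma * rng.normal(size=(n, d))  # key independent of q
k_star = q + rng.normal(size=(n, d))      # similar key q + eps, E[eps] = 0

# Empirical left-hand side vs. closed-form right-hand side of Eq. 19.
lhs = (q * rotate(k_star, m)).sum(1).mean() - (q * rotate(k, m)).sum(1).mean()
rhs = ((sigma[0::2] ** 2 + sigma[1::2] ** 2) * np.cos(m * theta)).sum()
```

With a couple hundred thousand samples the two quantities agree to within Monte Carlo noise, which is a quick sanity check on the sign and index bookkeeping in the derivation.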
Appendix B Detailed experimental settings.
For training, we mainly conducted experiments on Llama2-7B (Touvron et al., 2023a) and Baichuan2-7B (Yang et al., 2023). In addition, we also trained a 2B model from scratch, whose structure is the same as Baichuan2-7B-Base but with a smaller hidden size (2048). Both training and testing are accelerated by FlashAttention-2 (Dao et al., 2022) and Megatron-LM (Shoeybi et al., 2020). The dataset for both fine-tuning and training from scratch is a subset of RedPajama (Computer, 2023). The training hyper-parameters are listed in Table 5. All experiments are conducted on a cluster of 16 machines with 128 NVIDIA A100 80GB GPUs in total.
Table 5: Training hyper-parameters in our experiments
| Model | Context length | Tokens | Batch size | Learning rate | LR schedule | Weight decay |
| --- | --- | --- | --- | --- | --- | --- |
| Llama2-7B-Base | 32K | 4B | 128 | 2e-5 | constant | 0 |
| Baichuan2-7B-Base | 32K | 4B | 128 | 2e-5 | constant | 0 |
| Our-2B-Base | 4K | 1T | 1024 | 2e-4 | cosine | 0.1 |
For evaluation, we test long-context capability comprehensively on the benchmarks listed below. Perplexity on the PG19 (Rae et al., 2019) test split: we evaluate the perplexity of each sample and report the mean value across samples.
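The PG19 metric above is a per-sample perplexity averaged over samples, which can be sketched as follows (the helper names are hypothetical; the per-token log-probabilities are assumed to come from the model under test):

```python
import math

def sample_perplexity(token_logprobs):
    # Perplexity of one document: exp of the mean negative log-likelihood
    # over its tokens (natural-log probabilities assumed).
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def corpus_perplexity(per_sample_logprobs):
    # "Mean value across samples": average the per-sample perplexities,
    # rather than pooling all tokens into one token-weighted estimate.
    ppls = [sample_perplexity(lp) for lp in per_sample_logprobs]
    return sum(ppls) / len(ppls)
```

Note that averaging per-sample perplexities weights every document equally, whereas pooling all tokens would weight long documents more heavily.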
Long-eval (Li* et al., 2023). This test generates a large number of similar random sentences and asks the model to answer a question about one specific sentence in the context. Because the long context consists of many similar patterns, obtaining the right answer is more difficult; we find this test harder than other long-context evaluations such as perplexity, Passkey Retrieval (Mohtashami & Jaggi, 2024), and Needle in Haystack (G, 2023). A test sample is shown in Figure 8.
<details>
<summary>2405.14591v1/x21.png Details</summary>

### Visual Description
Long-eval sample prompt: a list of lines of the form `line <label>: REGISTER_CONTENT is <number>`, where `<label>` is a hyphen-separated identifier:
- line swift-baby: REGISTER_CONTENT is 12821
- line dangerous-breast: REGISTER_CONTENT is 28051
- line bad-sculptural: REGISTER_CONTENT is 32916
- line flashy-college: REGISTER_CONTENT is 34027
- line voiceless-brochure: REGISTER_CONTENT is 8964
- line fast-peony: REGISTER_CONTENT is 5218

The record ends with the query "Now tell me what is the <REGISTER_CONTENT> in line dangerous-breast?" (expected answer: 28051).
</details>
Figure 8: Long-eval sample prompt
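For illustration, a prompt in the format of Figure 8 can be generated with a short sketch. The word lists and value range below are hypothetical stand-ins; only the line/query format follows the figure:

```python
import random

def make_longeval_prompt(n_lines, seed=0):
    # Illustrative reconstruction of the Long-eval "lines" task in Figure 8:
    # many near-identical lines, then a query about one specific line.
    rng = random.Random(seed)
    adjectives = ["swift", "dangerous", "bad", "flashy", "voiceless", "fast"]
    nouns = ["baby", "breast", "sculptural", "college", "brochure", "peony"]
    labels, lines = [], []
    while len(labels) < n_lines:
        label = f"{rng.choice(adjectives)}-{rng.choice(nouns)}"
        if label in labels:  # labels must be unique to make the query well-defined
            continue
        labels.append(label)
        lines.append(f"line {label}: REGISTER_CONTENT is {rng.randint(1000, 99999)}")
    target = rng.choice(labels)
    query = f"Now tell me what is the <REGISTER_CONTENT> in line {target}?"
    answer = lines[labels.index(target)].rsplit(" ", 1)[-1]
    return "\n".join(lines + [query]), target, answer
```

Because every line shares the same template, the model cannot rely on surface cues and must actually retrieve the value bound to the queried label.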
Needle in Haystack (NIH) (G, 2023). NIH tests long-context capability not only under different context lengths but also at different positions of the correct answer within the context, which provides a more detailed view of long-context capability.
Appendix C Baichuan2-7B-Base: lower bound of RoPE's base
<details>
<summary>2405.14591v1/x22.png Details</summary>

### Visual Description
Line chart of perplexity (6–20) vs. context (5000–30000) for Baichuan2-7B fine-tuned on 32K context with RoPE bases 1e4, 2e5, 9e5, 5e6, 1e9, and 1e12, plus a 4K baseline. All 32K curves remain in a low perplexity band (roughly 6–9) across the full 32K context, with base 2e5 lowest (≈7.4) and base 1e4 highest (≈8.5 at 30K); the 4K baseline spikes sharply beyond its original context.
</details>
(a) Perplexity
<details>
<summary>2405.14591v1/x23.png Details</summary>

### Visual Description
Line chart of Long-eval accuracy on 32K (0.0–0.6) vs. RoPE base value (log scale, 1e4–1e12). Accuracy is near 0 at base 1e4, rises sharply between 2e5 and 5e6 (≈0.15 to ≈0.5), peaks at ≈0.6 for base 1e9, and declines to ≈0.45 at 1e12. A vertical dashed line marks the derived lower bound at base ≈ 9e5.
</details>
(b) Long-eval 32k
Figure 9: Fine-tuning Baichuan2-7B-Base on 32k context length with varying RoPE base. Although the perplexity remains low across bases, the Long-eval accuracy reveals a discernible bound on the base value, below which accuracy declines significantly. The dotted line denotes the lower bound derived from Eq. 18.
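A rough numerical reading of such a lower bound can be sketched as follows. Assuming the Eq. 18 condition amounts to requiring $B_{\theta,d}(m)=\sum_{i=0}^{d/2-1}\cos(m\theta_{i})$ to stay non-negative over the whole training context (our paraphrase; the exact condition is stated in the main text), the smallest admissible base can be located by a grid scan. The settings below (head dimension d=128, the grid range) are illustrative assumptions:

```python
import numpy as np

def min_bm(base, context_len, d=128):
    # B_{theta,d}(m) = sum_i cos(m * theta_i) with theta_i = base^(-2i/d);
    # return its minimum over relative positions m = 1..context_len.
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    m = np.arange(1, context_len + 1)[:, None]
    return float(np.cos(m * theta).sum(axis=1).min())

def lower_bound_base(context_len, d=128, grid=None):
    # Smallest base on a log-spaced grid for which B_m stays non-negative
    # across the whole context window (our reading of the Eq. 18 condition).
    if grid is None:
        grid = np.logspace(4, 10, 61)
    for base in grid:
        if min_bm(base, context_len, d) >= 0.0:
            return float(base)
    return None
```

Since the minimum is taken over a larger set of positions, a longer context can only raise the resulting bound, matching the monotone curve of lower bound vs. context length shown in the paper.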
Appendix D Long Context Test Results on Various LLMs
<details>
<summary>2405.14591v1/x24.png Details</summary>

### Visual Description
Line chart of perplexity (6–14) vs. context size (0–30000): the curve stays near 7 up to 10000 tokens, then rises gradually to ≈7.8 at 30000.
</details>
<details>
<summary>2405.14591v1/x25.png Details</summary>

### Visual Description
Line chart of Long-eval accuracy (0.2–1.0) vs. context length (0–12500): accuracy starts around 0.8 with fluctuations up to 5000, then declines steadily, dropping below 0.2 by 12500.
</details>
<details>
<summary>2405.14591v1/x26.png Details</summary>

### Visual Description
Needle-in-Haystack heatmap: retrieval score (0 = red to 10 = green) as a function of token limit (1000–32000, x-axis) and needle position in the context (0–100, y-axis). Scores are mixed across the grid, with low-score (red) regions at the smallest token limits and at deep needle positions for the largest token limits.
</details>
Figure 10: Llama2-7B-Base with base=1e4 fine-tuned on 32k context (original context=4096)
<details>
<summary>2405.14591v1/x27.png Details</summary>

### Visual Description
Line chart of perplexity (6–14) vs. context size (0–30000): ≈7.2 at the start, dipping to ≈7.0 at 10000, then rising smoothly to ≈7.8 at 30000.
</details>
<details>
<summary>2405.14591v1/x28.png Details</summary>

### Visual Description
Line chart of Long-eval accuracy (0.2–1.0) vs. context length (0–12500): accuracy starts at 1.0, fluctuates between ≈0.85 and ≈0.95, and stabilizes near 0.9 across the full range with no sustained decline.
</details>
<details>
<summary>2405.14591v1/x29.png Details</summary>

### Visual Description
Needle-in-Haystack heatmap: retrieval score (0 = red to 10 = green) over token limit (1000–32000, x-axis) and needle position in the context (0–98.4, y-axis). High scores dominate at larger token limits, with dense green regions beyond a token limit of ≈25000.
</details>
Figure 11: Llama2-7B-Base with base=2e5 fine-tuned on 32k context (original context=4096)
<details>
<summary>2405.14591v1/x30.png Details</summary>

### Visual Description
Line chart of perplexity (8–14) vs. context (0–30000): starts at ≈8.5, dips to ≈8.3 at 10000, then rises to ≈9.0 at 30000 (noisy but generally increasing).
</details>
<details>
<summary>2405.14591v1/x31.png Details</summary>

### Visual Description
Line chart of Long-eval accuracy (0.0–1.0) vs. context length (0–12500): accuracy starts at 1.0 and declines steadily with fluctuations (≈0.6 at 2500, ≈0.4 at 5000, ≈0.2 at 7500), stabilizing near 0.05–0.1 for context lengths ≥10000.
</details>
<details>
<summary>2405.14591v1/x32.png Details</summary>

### Visual Description
# Technical Document Analysis of Heatmap
## 1. Axis Labels and Markers
- **Y-Axis (Vertical):**
- Title: `Context length`
- Values: `0.0, 8.0, 12.0, 16.0, 20.0, 24.0, 29.0, 33.0, 37.0, 41.0, 45.0, 49.0, 53.0, 57.0, 61.0, 65.0, 69.0, 73.0, 78.0, 82.0, 86.0, 90.0, 94.0, 98.0`
- Increment: `8.0` (except final segment `94.0 → 98.0`)
- **X-Axis (Horizontal):**
- Title: `Token limit`
- Values: `1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000, 9000, 10000, 11000, 12000, 13000, 14000, 15000, 16000, 17000, 18000, 19000, 20000, 21000, 22000, 23000, 24000, 25000, 26000, 27000, 28000, 29000, 30000, 31000, 32200`
- Increment: `1000` (final segment `31000 → 32200`)
- **Legend (Right Side):**
- Title: `Score`
- Color Scale:
- `Red` → `0`
- `Yellow` → `4`
- `Green` → `10`
- Spatial Grounding: `[x: 32200, y: 98.0]` (aligned with axis maxima)
## 2. Heatmap Structure
- **Grid Dimensions:**
- Rows: `24` (context lengths)
- Columns: `32` (token limits)
- **Color Distribution:**
- **Left Region (Low Token Limits):**
- Dominant colors: `Green` (high scores) and `Yellow` (moderate scores).
- Example: `[x: 1000, y: 0.0]` = `Green` (score ≈ 10).
- **Right Region (High Token Limits):**
- Dominant colors: `Red` (low scores) and `Orange` (moderate scores).
- Example: `[x: 32200, y: 98.0]` = `Red` (score ≈ 0).
- **Diagonal Boundary:**
- Transition from `Green/Yellow` to `Red/Orange` occurs at `[x: 10000, y: 70.0]`.
- This boundary represents a critical threshold where scores drop sharply.
## 3. Key Trends
- **Trend 1:**
- **High Scores (Green/Yellow):**
- Occur in the **lower-left quadrant** (low token limits, low context lengths).
- Example: `[x: 1000, y: 0.0]` to `[x: 5000, y: 69.0]`.
- **Low Scores (Red/Orange):**
- Occur in the **upper-right quadrant** (high token limits, high context lengths).
- Example: `[x: 25000, y: 98.0]` to `[x: 32200, y: 98.0]`.
- **Trend 2:**
- **Score Degradation:**
- Scores decrease monotonically as token limits increase beyond `10,000`.
- Example: `[x: 10000, y: 70.0]` (score ≈ 6) → `[x: 32200, y: 98.0]` (score ≈ 0).
- **Trend 3:**
- **Context Length Impact:**
- Higher context lengths (e.g., `y > 70.0`) correlate with lower scores, especially for token limits > `10,000`.
## 4. Critical Threshold
- **Boundary Point:**
- `[x: 10000, y: 70.0]` marks the transition from high to low scores.
- This suggests a system limitation where token limits > `10,000` and context lengths > `70.0` degrade performance.
## 5. Component Isolation
- **Main Chart:**
- Heatmap dominates the image, with no headers/footers.
- **Legend:**
- Positioned to the right of the x-axis, aligned with the maximum token limit (`32200`).
## 6. Data Extraction
- **Sample Data Points (Score ≈ 10):**
- `[x: 1000, y: 0.0]`
- `[x: 2000, y: 8.0]`
- `[x: 3000, y: 12.0]`
- **Sample Data Points (Score ≈ 0):**
- `[x: 25000, y: 98.0]`
- `[x: 30000, y: 94.0]`
- `[x: 32200, y: 98.0]`
## 8. Validation
- **Legend Consistency:**
- All `Red` cells correspond to scores ≤ `2`.
- All `Green` cells correspond to scores ≥ `8`.
## 9. Conclusion
The heatmap shows scores dropping sharply once the token limit exceeds roughly `10,000`, even though the model was fine-tuned on a 32k context. This is consistent with the paper's observation that a RoPE base of 1e4 yields only a superficial long-context ability.
</details>
Figure 12: Baichuan2-7B-Base with base=1e4 fine-tuned on 32k context (original context=4096)
<details>
<summary>2405.14591v1/x33.png Details</summary>

### Visual Description
# Technical Document Extraction: Line Graph Analysis
## 1. **Axis Labels and Titles**
- **Y-Axis**: Labeled "Perplexity" with numerical markers at intervals of 2 (6, 8, 10, 12, 14).
- **X-Axis**: Labeled "Context size" with numerical markers at intervals of 10,000 (0, 10,000, 20,000, 30,000).
## 2. **Legend and Color Mapping**
- **Legend**: Not present; the single data series is drawn as a **blue line**.
## 3. **Data Points and Trends**
- **Key Data Points**:
- At **x = 0**: Perplexity ≈ 7.2.
- At **x = 10,000**: Perplexity ≈ 7.0 (slight dip).
- At **x = 30,000**: Perplexity ≈ 7.8 (gradual increase).
- **Trend Verification**:
- The line exhibits a **slight downward trend** from x = 0 to x = 10,000, followed by a **gradual upward trend** from x = 10,000 to x = 30,000. The fluctuations are minor, with no sharp peaks or troughs.
## 4. **Chart Components**
- **Grid**: Dotted gridlines for reference, spanning the entire plot area.
- **Line Style**: Smooth, continuous blue line with no markers or annotations.
## 5. **Component Isolation**
- **Main Chart**: Dominates the image, occupying ~90% of the space. No header or footer elements are present.
## 6. **Textual Content**
- No embedded text, annotations, or data tables are visible in the image.
## 8. **Final Notes**
- The graph illustrates a relationship between "Context size" (x-axis) and "Perplexity" (y-axis), with a net increase in perplexity over the observed range despite an initial dip. No additional contextual or categorical data is provided.
</details>
<details>
<summary>2405.14591v1/x34.png Details</summary>

### Visual Description
# Technical Document Extraction: Line Graph Analysis
## 1. Labels and Axis Titles
- **X-Axis**:
- Title: "Context length"
- Range: 0 to 12500
- Increment: 2500 (markers at 0, 2500, 5000, 7500, 10000, 12500)
- **Y-Axis**:
- Title: "Accuracy"
- Range: 0.2 to 1.0
- Increment: 0.2 (markers at 0.2, 0.4, 0.6, 0.8, 1.0)
- **Grid**: Horizontal dashed lines at each y-axis increment.
## 2. Line Characteristics
- **Color**: Blue (#0000FF)
- **Trend**:
- Initial dip from ~0.85 (x=0) to ~0.75 (x=2500).
- Sharp peak to ~0.95 at x=5000.
- Stabilizes between ~0.9 and ~0.95 from x=5000 to x=10000.
- Final decline to ~0.85 at x=12500.
## 3. Data Points (Extracted from Visual Trends)
| Context Length (x) | Accuracy (y) |
|---------------------|--------------|
| 0 | ~0.85 |
| 2500 | ~0.75 |
| 5000 | ~0.95 |
| 7500 | ~0.93 |
| 10000 | ~0.92 |
| 12500 | ~0.85 |
## 4. Legend and Color Matching
- **Legend**: Not present; the single data series is the blue line.
## 5. Spatial Grounding
- **X-Axis**: Bottom of the graph.
- **Y-Axis**: Left side of the graph.
- **Line Placement**: Central, fluctuating between y=0.75 and y=0.95.
## 6. Component Isolation
- **Main Chart**: Line graph with grid.
- **No Header/Footer**: No textual or graphical elements outside the main chart.
## 7. Trend Verification
- **Initial Dip**: Accuracy decreases sharply from x=0 to x=2500.
- **Peak at x=5000**: Highest accuracy (~0.95) observed.
- **Stabilization**: Minor fluctuations between x=5000 and x=10000.
- **Final Decline**: Accuracy drops to ~0.85 at x=12500.
## 8. Missing Elements
- **No Data Table**: The graph lacks embedded tables or numerical data.
- **No Secondary Axes**: Single y-axis for accuracy.
## 10. Critical Observations
- Accuracy declines gradually beyond x=5000, from ~0.95 to ~0.85 at x=12500.
- Performance over the initial range (x=0 to x=2500) is lower than the mid-range peak.
</details>
<details>
<summary>2405.14591v1/x35.png Details</summary>

### Visual Description
# Technical Document Extraction: Heatmap Analysis
## 1. **Axis Labels and Markers**
- **X-axis (Horizontal):**
- Label: `Token Limit`
- Range: `1000` to `32000`
- Tick Intervals: `1000, 2000, 3000, ..., 32000`
- **Y-axis (Vertical):**
- Label: `Context length`
- Range: `0` to `100`
- Tick Intervals: `0, 5, 10, ..., 100`
## 2. **Legend and Color Scale**
- **Legend Position:** Right side of the heatmap.
- **Color Scale:**
- Gradient from `red` (score = 0) to `green` (score = 10).
- Intermediate colors: `orange` (score ≈ 3), `yellow` (score ≈ 5).
- **Spatial Grounding:**
- The legend is aligned vertically with the heatmap, occupying the rightmost 5% of the image.
## 3. **Key Trends and Data Points**
- **Heatmap Structure:**
- Grid of colored squares representing scores.
- **High Scores (Green):**
- Dominant in the **lower-left region** (Token Limit: `1000–10000`, Context Length: `0–50`).
- Example: `[1000, 0]` (green, score ≈ 10).
- **Mid-Range Scores (Yellow):**
- Scattered in the **middle-right** (Token Limit: `15000–25000`, Context Length: `50–100`).
- Example: `[15000, 64]` (yellow, score ≈ 5).
- **Lower Scores (Orange):**
- Clustered near the **right edge** (Token Limit: `25000–32000`, Context Length: `5–32`).
- Notable cells:
- `[25000, 5]` (orange, score ≈ 3).
- `[32000, 32]` (orange, score ≈ 3).
## 4. **Visual Trend Analysis**
- **Primary Trend:**
- Scores decline as **Token Limit** grows, although green (high-score) cells dominate most of the grid.
- Example: Scores fall from ≈ `10` (green) at `[1000, 0]` to ≈ `3` (orange) at `[32000, 32]`.
## 5. **Component Isolation**
- **Main Chart:**
- Heatmap grid with no additional annotations or text blocks.
- **Legend:**
- Color bar with numerical labels (`0` to `10`) and corresponding colors.
## 6. **Data Table Reconstruction**
| Token Limit | Context Length | Score | Color |
|-------------|----------------|-------|--------|
| 1000 | 0 | 10 | Green |
| 5000 | 0 | 10 | Green |
| 15000 | 64 | 5 | Yellow |
| 25000 | 5 | 3 | Orange |
| 32000 | 32 | 3 | Orange |
## 8. **Final Notes**
- The heatmap visualizes performance across **Token Limit** and **Context Length**.
- Green (high-score) cells dominate the grid, with yellow/orange (lower-score) cells confined to the longest token limits, indicating that the base=2e5 model retains most of its performance across the full 32k range.
- No explicit numerical data table is present; values are inferred from color intensity and spatial positioning.
</details>
Figure 13: Baichuan2-7B-Base with base=2e5 fine-tuned on 32k context (original context=4096)
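Figures 12 and 13 differ only in the RoPE base (1e4 vs 2e5) used for the 32k fine-tuning, and the out-of-distribution intuition behind raising the base can be sketched numerically. In RoPE, dimension pair j rotates at frequency θ_j = base^(−2j/d); pairs whose angle never completes a full turn within the training context meet unseen angles when the context is extended. A minimal sketch (the head dimension d = 128 and the helper name are illustrative assumptions, not values taken from the paper):

```python
import math

def wrapped_dims(base: float, context: int, dim: int = 128) -> int:
    """Count rotary dimension pairs j whose angle at the last position,
    context * base**(-2j/dim), completes at least one full turn (>= 2*pi).
    Pairs that never wrap within the training context see novel (OOD)
    angles once the context is extended."""
    return sum(
        context * base ** (-2 * j / dim) >= 2 * math.pi
        for j in range(dim // 2)
    )

# base=1e4 trained at 4096 tokens: 46 of 64 pairs wrap at least once.
print(wrapped_dims(1e4, 4096))    # -> 46
# Extending to 32k with the same base drags 14 more pairs into
# angle ranges never seen during 4096-token training.
print(wrapped_dims(1e4, 32768))   # -> 60
# Raising the base to 2e5 at 32k roughly restores the original profile.
print(wrapped_dims(2e5, 32768))   # -> 45
```

Under these assumptions, fine-tuning at 32k with base=1e4 (Figure 12) leaves many low-frequency pairs out of distribution, while base=2e5 (Figure 13) keeps the wrapped-pair count at 32k close to the 4096-token profile, matching the contrast between the two figures.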
<details>
<summary>2405.14591v1/x36.png Details</summary>

### Visual Description
# Technical Document Extraction: Line Graph Analysis
## 1. **Axis Labels and Markers**
- **X-Axis (Horizontal):**
- Label: `Context`
- Markers: `0`, `10000`, `20000`, `30000`
- **Y-Axis (Vertical):**
- Label: `Perplexity`
- Markers: `8`, `10`, `12`, `14`
## 2. **Legend**
- **Placement:** Top-right corner
- **Entry:**
- Color: Blue
- Label: `Perplexity Trend`
## 3. **Data Series and Trends**
- **Line Characteristics:**
- Color: Blue (matches legend)
- Trend:
- Starts at ~8.5 at `x=0`
- Dips slightly to ~8.3 at `x=10000`
- Rises to ~8.7 at `x=20000`
- Peaks at ~9.0 at `x=30000`
- Overall: Gradual upward trend with minor fluctuations.
## 4. **Chart Structure**
- **Grid:** Dotted lines forming a Cartesian grid.
- **Background:** White with no additional annotations or text.
## 5. **Key Observations**
- No secondary data series or annotations present.
- Perplexity values remain within the range `[8, 14]` throughout the context range `[0, 30000]`.
## 6. **Conclusion**
The graph illustrates a **perplexity trend** over increasing context values, showing a general increase from ~8.5 to ~9.0 with minor variability. No textual data or additional categories are embedded in the image.
</details>
<details>
<summary>2405.14591v1/x37.png Details</summary>

### Visual Description
# Technical Document Extraction: Line Graph Analysis
## Image Description
The image depicts a **line graph** illustrating the relationship between **context length** (x-axis) and **accuracy** (y-axis). The graph features a single blue line representing accuracy values across varying context lengths. The background includes a grid of dashed lines for reference.
---
## Key Components
### Axis Labels and Markers
- **X-axis (Horizontal):**
- Label: `Context length`
- Markers: `0`, `2000`, `4000`, `6000`
- **Y-axis (Vertical):**
- Label: `Accuracy`
- Markers: `0.2`, `0.4`, `0.6`, `0.8`, `1.0`
### Legend
- Not present; the single data series is drawn as a blue line.
---
## Data Trends and Points
### Line Behavior
1. **Initial Decline:**
- Starts at `(0, 1.0)` and sharply decreases to `(1000, 0.6)`.
- Fluctuates between `0.6` and `0.8` until `x = 2000`.
2. **Mid-Range Volatility:**
- Drops to `0.4` at `x = 3000`, then rises to `0.5` at `x = 3500`.
- Declines to `0.2` at `x = 4500`, with a brief peak to `0.25` at `x = 5000`.
3. **Final Decline:**
- Reaches a low of `0.1` at `x = 5500`, then slightly increases to `0.15` at `x = 6000`.
### Extracted Data Points (Approximate)
| Context Length | Accuracy |
|----------------|----------|
| 0 | 1.0 |
| 500 | 0.9 |
| 1000 | 0.6 |
| 1500 | 0.8 |
| 2000 | 0.65 |
| 2500 | 0.6 |
| 3000 | 0.4 |
| 3500 | 0.5 |
| 4000 | 0.45 |
| 4500 | 0.2 |
| 5000 | 0.25 |
| 5500 | 0.1 |
| 6000 | 0.15 |
---
## Observations
- **General Trend:** Accuracy decreases as context length increases, with periodic fluctuations.
- **Critical Drop:** A significant decline occurs between `x = 0` and `x = 1000`, followed by stabilization until `x = 3000`.
- **Late-Stage Instability:** Accuracy becomes erratic after `x = 4000`, with sharp drops and minor recoveries.
---
## Notes
- No additional text, tables, or non-English content is present.
- The graph focuses solely on the relationship between context length and accuracy, with no legends or secondary data series.
</details>
<details>
<summary>2405.14591v1/x38.png Details</summary>

### Visual Description
# Technical Document Analysis of Heatmap Image
## Image Type
The image is a **heatmap** visualizing the relationship between two variables: **Token Limit** (x-axis) and **Context Length** (y-axis), with color intensity representing a numerical **Score** (0–10).
---
## Labels and Axis Titles
- **X-Axis (Horizontal):**
- Label: `Token Limit`
- Range: `1000` to `32000`
- Increment: `2000` (e.g., 1000, 3000, 5000, ..., 32000)
- **Y-Axis (Vertical):**
- Label: `Context Length`
- Range: `0` to `100`
- Increment: `5` (e.g., 0, 5, 10, ..., 100)
- **Legend:**
- Located on the **right side** of the heatmap.
- Gradient from **red (0)** to **teal (10)**.
- Label: `Score` (implicitly tied to color intensity).
---
## Color Scale and Legend
- **Color Bar:**
- Range: `0` (red) to `10` (teal).
- Intermediate values:
- `2` (light orange), `4` (orange), `6` (yellow), `8` (bright yellow).
- Spatial Grounding: Legend occupies the **rightmost 10% of the image width**, aligned vertically with the y-axis.
---
## Key Trends and Data Points
1. **General Pattern:**
- **Lower Token Limits (1000–8000):**
- Cells are predominantly **teal/green** (scores ≈ 8–10).
- Example: At `Token Limit = 1000`, cells are teal (score ≈ 10) across all `Context Length` values.
- **Mid-Range Token Limits (8000–20000):**
- Cells transition through **yellow** and **orange** (scores ≈ 3–6).
- Example: At `Token Limit = 12000`, scores range from green (≈ 7) down to orange (≈ 4) depending on `Context Length`.
- **High Token Limits (20000–32000):**
- Cells are largely **red** (scores ≈ 0–1).
- Example: A **vertical red stripe** at `Token Limit = 28000` spans `Context Length = 5` to `15`, indicating minimum scores (0).
2. **Diagonal Transition:**
- A **diagonal band** of decreasing scores (teal → red) runs from the **bottom-left (1000, 0)** to the **top-right (32000, 100)**, indicating that scores fall as `Token Limit` grows.
3. **Anomalies:**
- **Red Vertical Stripe at 28000:**
- Scores drop to `0` (red) for `Context Length = 5–15`, slightly below the surrounding orange cells (≈ 1–2).
- **Green/Yellow Patches in Mid-Range:**
- At `Token Limit = 16000`, `Context Length = 30–40` retains scores of ≈ `6–7` (green/yellow).
---
## Categories and Sub-Categories
- **Token Limit Categories:**
- `1000–32000` (in 2000 increments).
- **Context Length Categories:**
- `0–100` (in 5 increments).
- **Score Categories:**
- `0–10` (mapped to red → teal).
---
## Spatial Grounding and Verification
- **Legend Position:** Right-aligned, occupying the full height of the heatmap.
- **Color Consistency Check:**
- Red squares (e.g., at `Token Limit = 28000`) align with the legend’s `0` (red).
- Teal squares (e.g., at `Token Limit = 1000`) align with the legend’s `10` (teal).
---
## Component Isolation
1. **Header:** No explicit header text.
2. **Main Chart:**
- Gridlines divide the heatmap into cells representing `(Token Limit, Context Length)` pairs.
- Color intensity encodes the `Score`.
3. **Footer:** No explicit footer text.
---
## Transcribed Text
- Axis Labels:
- `Token Limit` (x-axis)
- `Context Length` (y-axis)
- Legend:
- `Score` (color gradient: red = 0, teal = 10)
---
## Conclusion
The heatmap shows that **higher Token Limits** (e.g., 28000) yield **near-zero scores**, while lower Token Limits (e.g., 1000) score near the maximum (10). The vertical red stripe at `Token Limit = 28000` marks near-complete failure at long context, consistent with the accompanying accuracy plot and with the paper's finding that a base of 1e4 yields only a superficial long-context ability even after fine-tuning on 32k.
</details>
Figure 14: Qwen1.5-7B-Base (Bai et al., 2023) with base=1e4 fine-tuned on 32k context (original context=4096)