## Line Graphs: ΔP Trends Across Layers for Qwen3-8B and Qwen3-32B Models
### Overview
The image contains two line graphs comparing the performance of Qwen3-8B and Qwen3-32B models across layers (0–30 and 0–60, respectively) using different anchoring strategies (Q-Anchored vs. A-Anchored) and datasets (PopQA, TriviaQA, HotpotQA, NQ). The y-axis represents ΔP (change in performance), and the x-axis represents model layers. Shaded regions indicate variability/confidence intervals.
---
### Components/Axes
- **X-Axis (Layer)**:
  - Qwen3-8B: 0 to 30 (intervals of 10)
  - Qwen3-32B: 0 to 60 (intervals of 20)
- **Y-Axis (ΔP)**:
  - Range: -80 to 0 (negative values indicate performance degradation)
  - Units: Not explicitly labeled, but ΔP implies relative change.
- **Legends**:
  - **Qwen3-8B**:
    - Solid lines: Q-Anchored (PopQA, TriviaQA, HotpotQA, NQ)
    - Dashed lines: A-Anchored (PopQA, TriviaQA, HotpotQA, NQ)
  - **Qwen3-32B**:
    - Solid lines: Q-Anchored (PopQA, TriviaQA, HotpotQA, NQ)
    - Dashed lines: A-Anchored (PopQA, TriviaQA, HotpotQA, NQ)
  - Colors:
    - Blue: PopQA
    - Green: TriviaQA
    - Purple: HotpotQA
    - Red: NQ
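
The legend encoding described above (color per dataset, line style per anchoring strategy) can be captured as a small lookup table. This is a minimal sketch, not the code that produced the figure: `DATASET_COLORS`, `ANCHOR_LINESTYLES`, and `line_style` are hypothetical names, and the keyword names simply follow matplotlib conventions so that a call such as `ax.plot(x, y, **line_style("PopQA", "Q-Anchored"))` would plausibly reproduce one curve's appearance.

```python
# Hypothetical style table mirroring the legend described in the text.
DATASET_COLORS = {
    "PopQA": "blue",
    "TriviaQA": "green",
    "HotpotQA": "purple",
    "NQ": "red",
}

ANCHOR_LINESTYLES = {
    "Q-Anchored": "-",   # solid line
    "A-Anchored": "--",  # dashed line
}


def line_style(dataset: str, anchoring: str) -> dict:
    """Return plotting keyword arguments for one (dataset, anchoring) curve."""
    return {
        "color": DATASET_COLORS[dataset],
        "linestyle": ANCHOR_LINESTYLES[anchoring],
        "label": f"{anchoring} ({dataset})",
    }


# Each panel shows eight curves: four datasets times two anchoring strategies.
ALL_STYLES = [
    line_style(dataset, anchoring)
    for anchoring in ANCHOR_LINESTYLES
    for dataset in DATASET_COLORS
]
```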
---
### Detailed Analysis
#### Qwen3-8B Graph
- **Q-Anchored (Solid Lines)**:
  - **PopQA**: Starts near 0, drops sharply to ~-80 by layer 30.
  - **TriviaQA**: Begins at ~-20, declines to ~-70.
  - **HotpotQA**: Starts at ~-10, falls to ~-75.
  - **NQ**: Starts at ~-5, declines to ~-70.
- **A-Anchored (Dashed Lines)**:
  - **PopQA**: Starts at 0, declines to ~-60.
  - **TriviaQA**: Begins at ~-10, drops to ~-65.
  - **HotpotQA**: Starts at ~-5, falls to ~-60.
  - **NQ**: Starts at ~-2, declines to ~-60.
#### Qwen3-32B Graph
- **Q-Anchored (Solid Lines)**:
  - **PopQA**: Starts near 0, drops to ~-80 by layer 60.
  - **TriviaQA**: Begins at ~-20, declines to ~-75.
  - **HotpotQA**: Starts at ~-10, falls to ~-70.
  - **NQ**: Starts at ~-5, declines to ~-70.
- **A-Anchored (Dashed Lines)**:
  - **PopQA**: Starts at 0, declines to ~-60.
  - **TriviaQA**: Begins at ~-10, drops to ~-65.
  - **HotpotQA**: Starts at ~-5, falls to ~-60.
  - **NQ**: Starts at ~-2, declines to ~-60.
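
To compare how steep the curves are across two panels with different x-axis ranges, one option is to divide each curve's total drop by the number of layers it spans. The sketch below does this with the approximate start and end values listed in this section. It rests on two assumptions: each curve is treated as a straight line between its endpoints, and the final x-axis tick (30 or 60) stands in for the last layer. `CURVES`, `total_drop`, and `mean_slope` are illustrative names, and the numbers are rough readings rather than exact data. Calling `mean_slope("Qwen3-8B", "Q-Anchored", "PopQA")` gives the average ΔP change per layer for that curve.

```python
# Approximate (start, end) ΔP values as listed in the text; rough readings only.
# The A-Anchored readings are the same for both models in the description.
A_ANCHORED = {"PopQA": (0, -60), "TriviaQA": (-10, -65), "HotpotQA": (-5, -60), "NQ": (-2, -60)}

CURVES = {
    "Qwen3-8B": {
        "last_layer": 30,  # assumption: final x-axis tick, not the true layer count
        "Q-Anchored": {"PopQA": (0, -80), "TriviaQA": (-20, -70), "HotpotQA": (-10, -75), "NQ": (-5, -70)},
        "A-Anchored": A_ANCHORED,
    },
    "Qwen3-32B": {
        "last_layer": 60,  # assumption: final x-axis tick, not the true layer count
        "Q-Anchored": {"PopQA": (0, -80), "TriviaQA": (-20, -75), "HotpotQA": (-10, -70), "NQ": (-5, -70)},
        "A-Anchored": A_ANCHORED,
    },
}


def total_drop(start: float, end: float) -> float:
    """Change in ΔP from the first to the last layer (negative means decline)."""
    return end - start


def mean_slope(model: str, anchoring: str, dataset: str) -> float:
    """Average ΔP change per layer, assuming a straight line between endpoints."""
    start, end = CURVES[model][anchoring][dataset]
    return total_drop(start, end) / CURVES[model]["last_layer"]
```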
---
### Key Observations
1. **Q-Anchored vs. A-Anchored**:
   - Q-Anchored curves (solid lines) decline more steeply in ΔP across layers than A-Anchored curves (dashed lines), suggesting a stronger dependence on question anchoring.
   - A-Anchored curves decline more gradually, indicating greater stability under answer anchoring.
2. **Dataset Variability**:
   - **PopQA** (blue) consistently shows the steepest decline among the Q-Anchored curves, implying higher sensitivity to question anchoring.
   - **NQ** (red; Natural Questions) shows a moderate decline, suggesting intermediate reliance on the anchoring strategy.
3. **Model Size**:
   - Qwen3-32B (the larger model) shows trends similar to Qwen3-8B but with slightly less variability in ΔP, possibly because its greater capacity mitigates anchoring effects.
4. **Shaded Regions**:
   - Wider shaded areas in the Qwen3-8B panel suggest higher uncertainty for the smaller model, while the Qwen3-32B panel shows tighter intervals.
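
The first observation can be made concrete by subtracting the A-Anchored endpoint from the Q-Anchored endpoint for each dataset. The sketch below uses the approximate final-layer values listed under Detailed Analysis. `FINAL_DELTA_P`, `anchoring_gap`, and `mean_gap` are illustrative names, and the numbers are rough readings rather than exact data. With these values every gap is negative (the Q-Anchored curve ends lower on all four datasets) and the mean gap comes out the same for both models, which is consistent with the Q-versus-A ordering holding at both model sizes.

```python
# Approximate final-layer ΔP values as listed under Detailed Analysis; rough readings only.
FINAL_DELTA_P = {
    "Qwen3-8B": {
        "Q-Anchored": {"PopQA": -80, "TriviaQA": -70, "HotpotQA": -75, "NQ": -70},
        "A-Anchored": {"PopQA": -60, "TriviaQA": -65, "HotpotQA": -60, "NQ": -60},
    },
    "Qwen3-32B": {
        "Q-Anchored": {"PopQA": -80, "TriviaQA": -75, "HotpotQA": -70, "NQ": -70},
        "A-Anchored": {"PopQA": -60, "TriviaQA": -65, "HotpotQA": -60, "NQ": -60},
    },
}


def anchoring_gap(model: str) -> dict:
    """Per-dataset difference (Q-Anchored minus A-Anchored) at the final layer."""
    q = FINAL_DELTA_P[model]["Q-Anchored"]
    a = FINAL_DELTA_P[model]["A-Anchored"]
    return {dataset: q[dataset] - a[dataset] for dataset in q}


def mean_gap(model: str) -> float:
    """Average of the per-dataset gaps; negative means Q-Anchored ends lower."""
    gaps = anchoring_gap(model)
    return sum(gaps.values()) / len(gaps)
```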
---
### Interpretation
- **Anchoring Strategy Impact**: Q-Anchored curves degrade more rapidly as the layer index increases, highlighting their reliance on question-level context. A-Anchored curves, which anchor to the answer, decline less, suggesting that answer-level grounding is more robust.
- **Dataset Complexity**: PopQA (simple QA) and NQ (complex QA) exhibit distinct trends, with PopQA being more sensitive to anchoring shifts. This may reflect differences in task structure, though multi-hop reasoning specifically is associated with HotpotQA rather than NQ.
- **Model Scaling**: Qwen3-32B reaches final ΔP values similar to those of Qwen3-8B, but over twice the plotted layer range and with tighter shaded regions, so a larger parameter count may help stabilize anchoring effects. However, the fundamental trend (Q-Anchored declining more than A-Anchored) persists at both sizes, which points to a trade-off between the two grounding strategies.
---
### Spatial Grounding & Cross-Reference
- **Legend Position**: Bottom of both graphs, aligned with x-axis.
- **Color Consistency**:
  - Q-Anchored: Solid lines (blue, green, purple, red).
  - A-Anchored: Dashed lines (blue, green, purple, red).
  - Dataset colors match across both graphs (e.g., blue = PopQA in both 8B and 32B).
---
### Conclusion
The graphs suggest that the anchoring strategy (Q vs. A) strongly influences layer-wise ΔP degradation, with the Q-Anchored condition being more sensitive. Dataset and model size further modulate these effects, which may inform the design of question-answering systems.