## Chart Type: Dual-Axis Bar and Line Chart
### Overview
This technical chart compares three different information retrieval/processing methods (**RAG**, **MemR3**, and **Full-Context**) across four distinct task categories. It evaluates these methods based on two metrics: **Retrieved Token Usage** (efficiency) and **LLM-as-a-Judge Score** (performance). The chart uses a logarithmic scale for token usage and a linear percentage scale for the performance score.
### Components/Axes
* **X-Axis (Methods):** Categorical axis listing three methods:
    * **RAG** (Retrieval-Augmented Generation)
    * **MemR3**
    * **Full-Context**
* **Left Y-Axis (Retrieved Token Usage):** Logarithmic scale ranging from $10^0$ (1) to $10^5$ (100,000). This axis corresponds to the **bar charts**.
* **Right Y-Axis (LLM-as-a-Judge Score %):** Linear scale ranging from 50% to 100%. This axis corresponds to the **line graphs**.
* **Legends:**
    * **Top-Left (Categories):** Defines the color-coding and markers for both bars and lines:
        * **Teal Circle ($\bullet$):** Multi-hop
        * **Dark Blue Square ($\blacksquare$):** Temporal
        * **Tan Star ($\star$):** Open-domain
        * **Pink Hexagon (⬢):** Single-hop
    * **Top-Center/Right (Data Types):**
        * **Grey Bar:** Token Usage (bar)
        * **Black Line with Circle:** LLM-as-a-Judge (line)
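
Because the left axis is logarithmic, bar heights compress large differences in token counts. The following is a minimal sketch of how a value maps to a vertical position on a $10^0$ to $10^5$ log axis; the helper name `log_axis_fraction` is hypothetical, and the sample token counts are rounded illustrative values, not exact data.

```python
import math

def log_axis_fraction(value, axis_min=1.0, axis_max=1e5):
    """Fraction of the axis height (0 = bottom, 1 = top) where `value` sits on a log10 axis."""
    if value <= 0:
        raise ValueError("a log axis cannot represent non-positive values")
    span = math.log10(axis_max) - math.log10(axis_min)
    return (math.log10(value) - math.log10(axis_min)) / span

# Rounded, illustrative token counts (assumed for this sketch).
for label, tokens in [("RAG", 1_200), ("MemR3", 2_000), ("Full-Context", 25_000)]:
    print(f"{label:>12}: {log_axis_fraction(tokens):.2f} of axis height")
```

With these inputs, a roughly 20-fold difference in tokens spans only about a quarter of the axis height, which is why the bars look closer together than the raw counts are.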
### Content Details
#### 1. Retrieved Token Usage (Bars - Left Y-Axis)
The bars show the number of retrieved tokens for each method and category.
* **RAG:** All four categories use a uniform amount of tokens, approximately **$1.2 \times 10^3$ (1,200 tokens)**.
* **MemR3:** Varies by category, from about 1,500 to 2,200 tokens (mean ≈ 1,900, i.e. on the order of **$2 \times 10^3$**):
    * Multi-hop: ~1,800
    * Temporal: ~2,000
    * Open-domain: ~2,200
    * Single-hop: ~1,500
* **Full-Context:** All four categories use a uniform, significantly higher amount of tokens, approximately **$2.5 \times 10^4$ (25,000 tokens)**.
#### 2. LLM-as-a-Judge Score (Lines - Right Y-Axis)
The lines track performance across the methods.
* **Multi-hop (Teal):** Slopes upward from RAG (~69%) to MemR3 (~72%) and reaches its peak at Full-Context (~73%).
* **Temporal (Dark Blue):** Shows a sharp upward trend from RAG (~65%) to a peak at MemR3 (~78%), followed by a significant drop at Full-Context (~58%).
* **Open-domain (Tan):** Remains relatively flat with a very slight peak at MemR3 (~60%) compared to RAG (~58%) and Full-Context (~59%).
* **Single-hop (Pink):** The highest performing category. It slopes upward from RAG (~84%) to a peak at MemR3 (~88%), then dips slightly at Full-Context (~86%).
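
The per-category trends above can be tabulated directly. This is a small sketch that uses the approximate scores read from the chart (not exact data) to find which method peaks in each category and how MemR3 compares with Full-Context; the names `SCORES` and `best_method` are illustrative.

```python
# Approximate LLM-as-a-Judge scores (%) as read from the chart lines.
SCORES = {
    "Multi-hop":   {"RAG": 69, "MemR3": 72, "Full-Context": 73},
    "Temporal":    {"RAG": 65, "MemR3": 78, "Full-Context": 58},
    "Open-domain": {"RAG": 58, "MemR3": 60, "Full-Context": 59},
    "Single-hop":  {"RAG": 84, "MemR3": 88, "Full-Context": 86},
}

def best_method(by_method):
    """Return the method name with the highest score."""
    return max(by_method, key=by_method.get)

peaks = {category: best_method(by_method) for category, by_method in SCORES.items()}

for category, by_method in SCORES.items():
    gap = by_method["MemR3"] - by_method["Full-Context"]
    print(f"{category:<12} peak={peaks[category]:<13} MemR3 - Full-Context = {gap:+d} pts")
```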
### Key Observations
* **Token Efficiency:** Full-Context uses roughly **11x to 21x as many tokens** as MemR3 and RAG (≈25,000 versus ≈1,200 to 2,200), a substantial increase in the amount of context the model must process.
* **Performance Peak:** For three out of four categories (Temporal, Open-domain, Single-hop), **MemR3 achieves the highest performance score**, despite using significantly fewer tokens than Full-Context.
* **The "Temporal" Anomaly:** The Temporal category shows a unique pattern where Full-Context performance (~58%) is actually worse than the baseline RAG (~65%) and much worse than MemR3 (~78%).
* **Task Difficulty:** "Single-hop" tasks score highest under every method (~84% to ~88%). "Open-domain" tasks score lowest under RAG and MemR3 and never exceed ~60%; under Full-Context, "Temporal" (~58%) falls slightly below "Open-domain" (~59%).
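
The token-ratio observation above can be checked with a short calculation. This sketch assumes the approximate bar values read from the chart; `TOKENS` and `ratio_range` are illustrative names.

```python
# Approximate retrieved-token counts as read from the chart bars.
CATEGORIES = ("Multi-hop", "Temporal", "Open-domain", "Single-hop")
TOKENS = {
    "RAG":          dict.fromkeys(CATEGORIES, 1_200),
    "MemR3":        {"Multi-hop": 1_800, "Temporal": 2_000,
                     "Open-domain": 2_200, "Single-hop": 1_500},
    "Full-Context": dict.fromkeys(CATEGORIES, 25_000),
}

def ratio_range(baseline, other):
    """Smallest and largest per-category ratio of `baseline` tokens to `other` tokens."""
    ratios = [TOKENS[baseline][c] / TOKENS[other][c] for c in CATEGORIES]
    return min(ratios), max(ratios)

for method in ("RAG", "MemR3"):
    low, high = ratio_range("Full-Context", method)
    print(f"Full-Context vs {method}: {low:.1f}x to {high:.1f}x")
```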
### Interpretation
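The data suggests that **MemR3 offers the best trade-off between score and token usage** among the three methods. It achieves the highest score in three of four categories while using less than a tenth of Full-Context's tokens, although RAG remains the cheapest method in tokens.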
MemR3 outperforms Full-Context in three of four categories, most notably with a roughly 20-point lead on "Temporal" tasks. This suggests that providing an LLM with the entire context (Full-Context) can introduce noise or "distractions" that degrade performance compared to a more targeted retrieval or memory management system like MemR3.
The "Temporal" drop-off in Full-Context is consistent with the "lost in the middle" phenomenon or context-window saturation, where the model struggles to maintain chronological or sequential logic when given a massive, unfiltered context; the chart by itself does not establish the cause. Conversely, MemR3's result suggests it filters or structures information in a way that preserves these temporal relationships.
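
One way to make the "trade-off" reading concrete is a Pareto-dominance check: a method is dominated in a category if another method uses no more tokens and scores no lower, with at least one strict improvement. The sketch below applies that check to the approximate values read from the chart. It is an illustrative analysis, not something plotted in the figure, and the names `POINTS`, `dominates`, and `dominated_methods` are hypothetical.

```python
# (tokens, score %) per method and category; approximate values read from the chart.
POINTS = {
    "Multi-hop":   {"RAG": (1_200, 69), "MemR3": (1_800, 72), "Full-Context": (25_000, 73)},
    "Temporal":    {"RAG": (1_200, 65), "MemR3": (2_000, 78), "Full-Context": (25_000, 58)},
    "Open-domain": {"RAG": (1_200, 58), "MemR3": (2_200, 60), "Full-Context": (25_000, 59)},
    "Single-hop":  {"RAG": (1_200, 84), "MemR3": (1_500, 88), "Full-Context": (25_000, 86)},
}

def dominates(a, b):
    """True if a uses no more tokens and scores no lower than b, and is strictly better on one."""
    (tok_a, score_a), (tok_b, score_b) = a, b
    return tok_a <= tok_b and score_a >= score_b and (tok_a < tok_b or score_a > score_b)

def dominated_methods(points):
    """Names of methods that some other method dominates within one category."""
    return sorted(
        name for name, p in points.items()
        if any(dominates(q, p) for other, q in points.items() if other != name)
    )

dominated = {category: dominated_methods(points) for category, points in POINTS.items()}
for category, names in dominated.items():
    print(f"{category:<12} dominated: {names or 'none'}")
```

Under these approximate readings, Full-Context is dominated in three of four categories (all except Multi-hop), while RAG and MemR3 are never dominated: RAG is cheaper in tokens and MemR3 scores higher.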