## Horizontal Bar Chart: R1-Llama | AIME25
### Overview
This is a horizontal stacked bar chart comparing the share (%) of "Content Words" versus "Function Words" across ten performance percentile groups for the model identified as "R1-Llama" on the "AIME25" benchmark. The chart illustrates how the composition of language (content vs. function words) varies with performance level.
### Components/Axes
* **Title:** "R1-Llama | AIME25" (centered at the top).
* **Y-Axis (Vertical):** Lists performance percentile ranges. From top to bottom:
    * 90-100%
    * 80-90%
    * 70-80%
    * 60-70%
    * 50-60%
    * 40-50%
    * 30-40%
    * 20-30%
    * 10-20%
    * Top-10%
* **X-Axis (Horizontal):** Labeled "Ratio (%)". Scale runs from 0 to 100 with major tick marks at 0, 20, 40, 60, 80, and 100.
* **Legend:** Positioned in the top-right corner.
    * **Red Solid Bar:** Labeled "Content Words".
    * **Gray Hatched Bar:** Labeled "Function Words".
* **Data Series:** Each percentile range has one stacked horizontal bar made of two segments. The red "Content Words" segment starts at 0 on the left, and the gray hatched "Function Words" segment fills the remainder to the right, so the two sum to 100% in every row.
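
The layout described above can be redrawn with a short script. This is a minimal sketch, not the code that produced the original figure: it assumes matplotlib is available, uses the content-word values listed in the Detailed Analysis table, and the exact colors, hatch pattern, figure size, and output filename are guesses. The helper names `stacked_segments` and `plot_chart` are hypothetical.

```python
# Content-word percentages transcribed from the chart, top to bottom on the y-axis.
GROUPS = ["90-100%", "80-90%", "70-80%", "60-70%", "50-60%",
          "40-50%", "30-40%", "20-30%", "10-20%", "Top-10%"]
CONTENT = [29.3, 29.6, 30.7, 30.1, 30.0, 31.5, 32.6, 35.5, 39.3, 44.3]


def stacked_segments(content):
    """Return ((left, width), (left, width)) for the content and function segments."""
    segments = []
    for c in content:
        function = round(100.0 - c, 1)  # function words fill the rest of the row
        segments.append(((0.0, c), (c, function)))
    return segments


def plot_chart(path="r1_llama_aime25.png"):
    """Draw the stacked bars and save them; styling choices are assumptions."""
    import matplotlib
    matplotlib.use("Agg")  # render off-screen, no display needed
    import matplotlib.pyplot as plt

    segs = stacked_segments(CONTENT)
    rows = list(range(len(GROUPS)))
    fig, ax = plt.subplots(figsize=(6, 4))
    # Red segment starts at 0; gray hatched segment starts where the red one ends.
    ax.barh(rows, [s[0][1] for s in segs],
            color="firebrick", label="Content Words")
    ax.barh(rows, [s[1][1] for s in segs], left=[s[1][0] for s in segs],
            color="lightgray", edgecolor="gray", hatch="//",
            label="Function Words")
    ax.set_yticks(rows)
    ax.set_yticklabels(GROUPS)
    ax.invert_yaxis()  # keep 90-100% at the top, Top-10% at the bottom
    ax.set_xlim(0, 100)
    ax.set_xlabel("Ratio (%)")
    ax.set_title("R1-Llama | AIME25")
    ax.legend(loc="upper right")
    fig.savefig(path, bbox_inches="tight")
    return path
```

Calling `plot_chart()` writes a PNG to the given path. The segment computation is kept separate from the drawing code so the numbers can be checked without a plotting backend.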
### Detailed Analysis
The chart gives the following values for each percentile group. Overall, the proportion of **Content Words increases** as performance improves (moving down the y-axis), while the proportion of **Function Words decreases** by the same amount.
| Percentile Range | Content Words (Red Bar) | Function Words (Gray Hatched Bar) |
| :--- | :--- | :--- |
| 90-100% | 29.3% | 70.7% |
| 80-90% | 29.6% | 70.4% |
| 70-80% | 30.7% | 69.3% |
| 60-70% | 30.1% | 69.9% |
| 50-60% | 30.0% | 70.0% |
| 40-50% | 31.5% | 68.5% |
| 30-40% | 32.6% | 67.4% |
| 20-30% | 35.5% | 64.5% |
| 10-20% | 39.3% | 60.7% |
| Top-10% | 44.3% | 55.7% |

**Trend Verification:**
* **Content Words (Red):** The right ends of the red segments move to the right going down the chart, indicating an overall increase in percentage from the lowest-performing group (90-100% at 29.3%) to the highest-performing group (Top-10% at 44.3%), with a small dip in the middle groups.
* **Function Words (Gray):** Every gray segment ends at 100%, so its left edge coincides with the right end of the red segment. The gray segments shorten going down the chart, an overall decrease from 70.7% to 55.7% across the same groups.
### Key Observations
1. **Inverse Relationship:** The percentages for Content and Function words are perfectly complementary for each row, summing to 100%.
2. **Nearly Monotonic Trend:** The increase in Content Words (and decrease in Function Words) is nearly monotonic across the performance spectrum. The only deviation comes after the 70-80% group (30.7%): the Content Words percentage dips to 30.1% (60-70%) and 30.0% (50-60%) before the upward trend resumes at 40-50% (31.5%).
3. **Significant Gap:** The largest single jump in Content Words percentage occurs between the "10-20%" group (39.3%) and the "Top-10%" group (44.3%), a 5-percentage-point increase.
4. **Dominance of Function Words:** In all percentile groups, Function Words constitute the majority of the ratio (always >55%).
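
The four observations above can be checked directly against the table values. The following is a small sketch in plain Python; the variable names are illustrative, and the only inputs are the content-word percentages from the table.

```python
# (group, content-word %) pairs from the table, in top-to-bottom chart order.
ROWS = [("90-100%", 29.3), ("80-90%", 29.6), ("70-80%", 30.7),
        ("60-70%", 30.1), ("50-60%", 30.0), ("40-50%", 31.5),
        ("30-40%", 32.6), ("20-30%", 35.5), ("10-20%", 39.3),
        ("Top-10%", 44.3)]

# Observation 1: function-word share is the complement of the content share.
function = [round(100 - c, 1) for _, c in ROWS]

# Observations 2 and 3: change in content share between adjacent groups.
steps = [(ROWS[i][0], ROWS[i + 1][0], round(ROWS[i + 1][1] - ROWS[i][1], 1))
         for i in range(len(ROWS) - 1)]
dips = [s for s in steps if s[2] < 0]        # places where the trend reverses
largest = max(steps, key=lambda s: s[2])     # biggest single increase

# Observation 4: smallest function-word share across all groups.
min_function = min(function)

for src, dst, delta in steps:
    print(f"{src:>8} -> {dst:<8} {delta:+.1f}")
print("dips:", dips)
print("largest jump:", largest)
print("minimum function-word share:", min_function)
```

Listing every adjacent step makes it easy to see where the trend reverses and how large the final jump is relative to the others.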
### Interpretation
This chart suggests a clear association between the linguistic composition of the model's output and its performance on the AIME25 benchmark. Higher-performing groups (the "Top-10%" and "10-20%" brackets) use a noticeably higher proportion of **Content Words** (words carrying semantic meaning, such as nouns, verbs, and adjectives) than lower-performing groups: 39.3% to 44.3%, versus roughly 29% to 31% in the bottom half.
Conversely, the lower-performing groups rely more heavily on **Function Words** (grammatical words such as prepositions, articles, and conjunctions that structure language but carry less intrinsic meaning).
**What this might mean:**
* **Precision vs. Structure:** Better performance may be associated with more precise, information-dense language (content words) rather than verbose, structurally complex but semantically lighter language (function words).
* **Efficiency:** Outputs in the Top-10% group might be communicating ideas more efficiently, using fewer "filler" or structural words to convey the same or better information.
* **Benchmark Nature:** The AIME25 benchmark likely rewards answers that are direct, factual, and semantically rich, which aligns with a higher content-word ratio. This pattern could be specific to this type of evaluation.
The data implies that analyzing the part-of-speech distribution in model outputs could serve as a diagnostic tool for performance, with a higher content-to-function word ratio being a potential indicator of higher-quality reasoning or answer generation for this specific task.
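
As a rough illustration of such a diagnostic, the sketch below computes a content-word percentage for a piece of text. The chart does not state how words were classified, so everything here is an assumption: the function-word list is a small hand-picked sample, the tokenizer keeps only alphabetic words (numbers and symbols are ignored), and `content_word_ratio` is a hypothetical helper name. A real analysis would more likely use a part-of-speech tagger or a fuller word list.

```python
import re

# Hypothetical, deliberately small function-word list (assumption, not from the chart).
FUNCTION_WORDS = {
    "a", "an", "the", "of", "in", "on", "at", "to", "for", "with", "by",
    "from", "and", "or", "but", "if", "so", "that", "this", "these", "those",
    "is", "are", "was", "were", "be", "been", "it", "we", "i", "you", "he",
    "she", "they", "as", "not", "can", "will",
}


def content_word_ratio(text):
    """Return the percentage of alphabetic tokens that are not function words."""
    # Lowercase and keep alphabetic words only, allowing one internal apostrophe.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    if not tokens:
        return 0.0  # avoid division by zero on empty or non-alphabetic input
    content = [t for t in tokens if t not in FUNCTION_WORDS]
    return 100.0 * len(content) / len(tokens)


sample = "The sum of the roots is equal to the coefficient"
print(content_word_ratio(sample))  # 4 content words out of 10 tokens -> 40.0
```

The result depends heavily on the word list and tokenizer, so values from this sketch are not directly comparable to the percentages in the chart.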