# The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation
**Authors**:
- İbrahim Ethem Deveci (Department of Cognitive Science), Ankara, Turkey
- Duygu Ataman (Department of Cognitive Science), Ankara, Turkey
Abstract
The rapid rise of Large Language Models (LLMs) and Large Reasoning Models (LRMs) has been accompanied by an equally rapid proliferation of the benchmarks used to assess them. However, because of improved model competence resulting from scaling and novel training advances, and because many of these datasets are likely included in pre- or post-training data, results quickly saturate, driving a continuous need for new and more challenging replacements. In this paper, we ask whether surpassing a benchmark truly demonstrates reasoning ability, or whether we are simply tracking numbers divorced from the capabilities we claim to measure. We present an investigation of three model families, from OpenAI, Anthropic, and Google, and of how their reasoning capabilities across different benchmarks have evolved over the years. We also analyze performance trends across different reasoning tasks and discuss the current state of benchmarking and its remaining challenges. By offering a comprehensive overview of benchmarks and reasoning tasks, our work aims to serve as a first reference to ground future research in reasoning evaluation and model development.
1 Introduction
Benchmarks have long played a central role in evaluating and comparing machine learning models [1]. As models scale up in size and capability, particularly Large Language Models (LLMs) and the specialized Large Reasoning Models (LRMs), many benchmarks quickly saturate, often reaching or surpassing human-level performance. Whether this saturation is driven primarily by improved model capability or by dataset contamination is generally unknown. Nevertheless, rapid saturation forces the development of new and more challenging benchmarks against which new model families can be compared. In this paper, we investigate two key research questions: how effective are current benchmarks at measuring model capabilities, and does surpassing a benchmark reliably indicate genuine reasoning?
To examine these questions, we select three model families, OpenAI, Anthropic, and Google, and compile performance data from official sources [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]. We gather a comprehensive list of 52 benchmarks used in evaluating these models and classify them according to the types of reasoning they aim to evaluate. Analyzing performance trends over the years, we highlight where models improve, where they struggle, and what these trends reveal about the current state of benchmarking. Finally, we discuss the implications of the saturation cycle and emphasize the need for improved evaluation practices that more accurately capture model capabilities.
Our contributions are threefold: (1) we provide a curated list of reasoning benchmarks, classified by the types of reasoning they aim to assess; (2) we analyze performance trends over the years to assess benchmarking effectiveness; (3) we examine the current landscape of existing benchmarks, identifying which have reached high performance thresholds and which remain unsolved.
By situating our analysis within the broader evaluation landscape, our work collects evidence to emphasize the need for reasoning tasks that are more representative of the nature of the reasoning process and that target evaluation beyond downstream accuracy.
2 Benchmark Landscape and Categorization
In order to analyze how the creation and adoption of reasoning benchmarks have evolved over time, we examine three model families and compile the set of benchmarks employed to evaluate them, with the aim of providing a comprehensive overview of current benchmarking practices. The complete list of benchmarks, their assigned reasoning types, and short summaries can be found in Appendix A. To facilitate analysis, we categorize benchmarks into seven reasoning types: commonsense and logical reasoning, mathematical reasoning, multimodal reasoning, programming and coding, reading comprehension and question answering, reasoning with general knowledge, and LLM-specific capabilities such as safety, tool use, and instruction following. Figure 1 illustrates a marked increase in benchmark adoption for multimodal reasoning, mathematical reasoning, programming, reasoning with general knowledge, and LLM-specific benchmarks after 2023. In contrast, no new benchmarks in reading comprehension or commonsense reasoning were adopted by these model families during this period. While the literature contains several other benchmarks in these areas [23, 24, 25, 26, 27, 28, 29], our analysis shows that none of them has been adopted by the three model families considered here. This likely reflects an evolving understanding of what constitutes reasoning in computational models, shaped by current model capabilities and by what the community deems important to evaluate. Since most models now have direct commercial applications, performance in commercially relevant domains such as coding and tool use may also motivate the emphasis on certain categories of reasoning tasks.
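To make the tabulation behind Figure 1 concrete, the following minimal Python sketch shows one way to derive cumulative per-category counts from a list of (benchmark, reasoning type, adoption year) records. The records below are illustrative placeholders rather than the full Appendix A list, and the counting scheme is an assumption about how such a figure could be produced, not the authors' exact procedure.

```python
from collections import Counter

# Illustrative records: (benchmark, reasoning type, year first adopted by one
# of the three model families). Placeholder entries; see Appendix A for the
# full list of 52 benchmarks.
benchmarks = [
    ("HellaSwag", "Commonsense and Logical Reasoning", 2019),
    ("MMLU", "Reasoning with General Knowledge", 2021),
    ("GSM8K", "Mathematical Reasoning", 2021),
    ("MMMU", "Multimodal Reasoning", 2024),
    ("SWE-bench Verified", "Programming and Coding", 2024),
]

# Cumulative number of adopted benchmarks per reasoning type for each year,
# mirroring the quantity plotted in Figure 1.
years = range(2015, 2026)
counts = {year: Counter() for year in years}
for _, category, adopted in benchmarks:
    for year in years:
        if adopted <= year:
            counts[year][category] += 1

for year in years:
    print(year, dict(counts[year]))
```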
[Figure 1 (figures/benchmarks_by_year.png): line chart; x-axis: year (2015–2025), y-axis: number of benchmarks; one line per reasoning type.]
Figure 1: Number of benchmarks in different reasoning types over time.
3 Performance Trends Across Models
Across all three model families there is a consistent effort to develop newer models or architectural improvements to achieve higher benchmark performance. However, comparing performance across families is challenging, as each family often employs different benchmarks, and even within a single family, benchmarks used can vary between model iterations. This variation appears to stem from two main factors: first, certain benchmarks reach saturation due to high performance; second, benchmark updates or more challenging subsets are introduced, such as the transition from MATH to MATH-500 [30].
We observe a recurring pattern: once a model family achieves high performance on a particular benchmark, subsequent models tend to use that benchmark less frequently or discontinue its use entirely. This reflects both practical and conceptual considerations: benchmarks that no longer discriminate between models provide limited evaluative value, and benchmark selection increasingly reflects the evolving understanding of which reasoning tasks remain challenging for current architectures.
Interestingly, performance trends reveal consistent directional correlations across benchmarks within the same reasoning type. For example, when a model demonstrates improved performance on one benchmark, it generally shows corresponding improvements on other benchmarks of the same type, while lower performance on one benchmark tends to coincide with lower performance on the others. Nevertheless, the magnitude of improvement differs across benchmarks, potentially due to variations in problem complexity and the scaling limitations evident in smaller models, as seen within the OpenAI family. This pattern suggests that benchmarks within a reasoning type often capture overlapping aspects of reasoning, so that advances in a model's capabilities tend to propagate across related tasks. At the same time, variation in the magnitude of performance gains provides insight into the relative difficulty of different benchmarks within the same reasoning type. Detailed plots illustrating performance changes within model families for different reasoning types are provided in Appendix B.
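As a rough illustration of the within-type consistency described above, the sketch below computes pairwise correlations of benchmark scores across successive model generations. The score matrix is a made-up placeholder for illustration, not reported data, and correlation across release order is only one of several ways this consistency could be quantified.

```python
import numpy as np

# Rows: successive models of one family (in release order); columns: benchmarks
# of the same reasoning type. Values are illustrative placeholders only.
scores = np.array([
    [52.0, 48.0, 35.0],
    [61.0, 57.0, 44.0],
    [74.0, 70.0, 52.0],
    [83.0, 80.0, 63.0],
])

# Pairwise Pearson correlations between benchmarks across generations; high
# positive values indicate that gains propagate across benchmarks of the same
# reasoning type.
print(np.corrcoef(scores, rowvar=False))

# Average per-generation gain on each benchmark hints at relative difficulty.
print(np.diff(scores, axis=0).mean(axis=0))
```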
Finally, we note that newer models generally achieve higher performance on previously low-scoring benchmarks. However, the limited overlap of common benchmarks across model families complicates cross-family comparisons. This raises a critical question: if benchmarks are intended to evaluate and compare model capabilities, why are they not consistently adopted or reported across families? If benchmarks are intended to provide a shared measure of capability, their fragmented and selective use undermines that goal and exemplifies the need for more standardized, representative, and domain-informed evaluation frameworks.
4 Performance of Models within Benchmarks
We collect all reported model performances across benchmarks and analyze saturation, defining a benchmark as saturated when at least one model has achieved 80% accuracy or higher on it. Out of the full set of benchmarks, we find that 27 surpass this threshold in at least one model family, while 25 never reach it. The majority of “solved” benchmarks belong to commonsense and logical reasoning, mathematical reasoning, reasoning with general knowledge, and reading comprehension and question answering. By contrast, benchmarks targeting LLM-specific capabilities and programming and coding remain comparatively difficult, with few instances of performance above 80%.
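A minimal sketch of this saturation criterion follows, assuming a mapping from each benchmark to the best score reported for it across the three families. The entries and scores are illustrative placeholders, not the collected data.

```python
from collections import defaultdict

SATURATION_THRESHOLD = 80.0  # percent accuracy

# Best reported score per (benchmark, reasoning type) across the three model
# families. Placeholder values for illustration only.
best_scores = {
    ("GSM8K", "Mathematical Reasoning"): 96.0,
    ("MMLU", "Reasoning with General Knowledge"): 91.0,
    ("SWE-bench Verified", "Programming and Coding"): 74.0,
    ("ZeroBench", "Multimodal Reasoning"): 5.0,
}

saturated, unsolved = defaultdict(int), defaultdict(int)
for (_, category), score in best_scores.items():
    bucket = saturated if score >= SATURATION_THRESHOLD else unsolved
    bucket[category] += 1

for category in sorted(set(saturated) | set(unsolved)):
    s, u = saturated[category], unsolved[category]
    print(f"{category}: {s}/{s + u} above threshold")
```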
We then examine the release years of benchmarks that never surpass the 80% threshold. The distribution is striking: 60% of unsolved benchmarks were introduced in 2025, 32% in 2024, and only two benchmarks released before 2024 remain unsolved, namely ActivityNet [31] and EgoSchema [32], both multimodal reasoning benchmarks. This distribution suggests a clear trend: nearly all benchmarks released before 2024 have already been surpassed by at least one model family, indicating rapid saturation, while the benchmarks still below the threshold overwhelmingly correspond to the most recently introduced evaluation tasks.
[Figure 2a (figures/stacked_bar_saturation.png): horizontal stacked bar chart; x-axis: percentage of benchmarks, y-axis: reasoning type; green = saturated (≥ 80%), red = not saturated.]
(a) Distribution of benchmarks on which models have surpassed the 80% threshold versus those not yet surpassed, grouped by reasoning type.
[Figure 2b (figures/pie_saturation_by_year.png): two pie charts of benchmark release years; left (green): benchmarks surpassed at the 80% threshold, right (red): benchmarks not yet surpassed.]
(b) Release years of benchmarks relative to the 80% threshold: left pie shows surpassed benchmarks, right pie shows unsolved benchmarks.
Figure 2: Benchmark saturation dynamics.
This temporal pattern highlights the central dynamic of the saturation cycle: older benchmarks are rapidly mastered and lose discriminative power, while newly introduced benchmarks become the standards for demonstrating progress. Nearly all unsolved benchmarks are recent, highlighting both the accelerating pace of benchmark creation and the difficulty of maintaining evaluations that remain challenging over time. Yet this difficulty seems only temporary. It is highly plausible that within one or two years many of these currently unsolved benchmarks will also be surpassed, at which point model families will shift to alternative or newly designed evaluations to preserve differentiation. Crucially, this pattern reflects the fact that performance gains are often specific to individual benchmarks rather than to the broader reasoning type they are intended to assess. As the analyses indicate, while models often perform consistently and even strongly on benchmarks within a domain, the introduction of a more challenging, novel benchmark frequently leads to a drop in performance. This pattern may arise from the increased difficulty of the new benchmark, or from contamination that inflated performance on earlier benchmarks without truly reflecting generalizable reasoning ability. This situation raises the question of whether what appears as “reasoning ability” is often tied more to benchmark design and prior exposure than to robust mastery of the reasoning type itself. This saturation cycle casts doubt on the long-term evaluation value of benchmarks.
5 Discussion: Limitations of Current Benchmarking
Our analysis of three model families demonstrates that benchmark performance has generally increased over time, with newer models achieving higher scores across most reasoning types and benchmarks. However, given that many benchmarks have already been surpassed with high accuracy, we would like to highlight a question originally posed in [25] regarding commonsense reasoning, reframed here for reasoning in general: Have neural language models successfully acquired reasoning, or are we overestimating the true capabilities of machine reasoning? Several studies in the literature show that these models still perform poorly when required to generalize to longer contexts or handle tasks requiring inductive and compositional reasoning [33, 34, 35, 36, 37, 38]. This discrepancy suggests a limitation of current benchmarking practices: improvements in benchmark scores do not necessarily reflect generalizable reasoning ability.
We believe this discrepancy can be reduced by developing more sophisticated, task-specific evaluation metrics that capture intermediate reasoning steps or different modes of error. Additionally, formalizing reasoning for different task types can support these efforts, enabling more structured analyses and clearer assessment of models’ reasoning abilities. Such a formalization enables structured representations of diverse reasoning types and their interrelationships [39, 40, 41], and facilitates the design of layered, targeted evaluation procedures that assess specific reasoning capabilities rather than merely reporting overall accuracy. Furthermore, formal reasoning frameworks can support the development of algorithms that deliver structured feedback to models, guiding the refinement of their reasoning abilities. By integrating formalized reasoning with task-specific evaluations, benchmarking can be conducted in a more targeted and informative manner.
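As one hedged illustration of what a step-aware, task-specific metric might look like, the toy function below scores intermediate reasoning steps and the final answer separately instead of reporting a single accuracy number. The exact-match step comparison is a deliberate simplification of our own; a usable metric would need semantic or formal matching of steps, along the lines of the formalization discussed above.

```python
from dataclasses import dataclass

@dataclass
class ReasoningTrace:
    steps: list[str]      # intermediate reasoning steps
    final_answer: str

def layered_score(prediction: ReasoningTrace, reference: ReasoningTrace) -> dict:
    """Toy layered metric: report step-level agreement and final-answer
    correctness separately rather than a single accuracy score."""
    matched = sum(
        p.strip() == r.strip()
        for p, r in zip(prediction.steps, reference.steps)
    )
    step_recall = matched / len(reference.steps) if reference.steps else 0.0
    return {
        "final_answer_correct": prediction.final_answer == reference.final_answer,
        "step_recall": step_recall,
        "extra_steps": max(0, len(prediction.steps) - len(reference.steps)),
    }

# Illustrative traces for a simple arithmetic word problem.
ref = ReasoningTrace(steps=["48 / 2 = 24", "24 + 6 = 30"], final_answer="30")
pred = ReasoningTrace(steps=["48 / 2 = 24", "24 + 5 = 29"], final_answer="29")
print(layered_score(pred, ref))
```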
6 Limitations
The analysis in our study focuses on 52 benchmarks used by the three model families. Other model families and reasoning-focused models are not fully explored: including them, together with the more than two hundred benchmarks identified from other model families and from studies evaluating different types of reasoning in large models, would create a combinatorial explosion of comparisons. This restriction was necessary to keep the scope of our work on a qualitative evaluation of benchmark design and adoption rather than an exhaustive quantitative analysis of all models and benchmarks. A comprehensive comparison across a wider range of models and benchmarks is left for future work.
7 Conclusion
In this work, we analyze 52 benchmarks across three model families, covering multiple reasoning types. Our study reveals the rapid saturation of older benchmarks, selective adoption of new ones, and temporal dynamics that govern the utility of benchmarks in evaluating model performance. While model performance generally improves over time and correlations within reasoning types indicate overlapping evaluation properties, the introduction of more challenging benchmarks generally resets performance, suggesting that apparent reasoning ability is influenced more by extrinsic factors than by mastering the reasoning itself, as supported by other studies. This saturation cycle highlights the limitations of current practices: benchmarks provide only a partial view of model reasoning. Meaningful progress requires formalized reasoning tasks, layered evaluation procedures, and task-specific metrics that go beyond accuracy scores.
References
- [1] Thomas Liao, Rohan Taori, Deborah Raji, and Ludwig Schmidt. Are we learning yet? a meta review of evaluation failures across machine learning. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021.
- [2] Anthropic. Introducing the next generation of claude, March 2024. Accessed: 2025-08-28.
- [3] Anthropic. Claude 3.5 sonnet, June 2024. Accessed: 2025-08-28.
- [4] Anthropic. Introducing claude 4, May 2025. Accessed: 2025-08-28.
- [5] Anthropic. Introducing claude 3.5 haiku, October 2024. Accessed: 2025-08-28.
- [6] Anthropic. Claude 3.7 sonnet and claude code, February 2025. Accessed: 2025-08-28.
- [7] Anthropic. Claude opus 4.1, August 2025. Accessed: 2025-08-28.
- [8] Google DeepMind. Gemini 2.5 flash-lite, June 2025. Accessed: 2025-08-28.
- [9] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025.
- [10] Google DeepMind. Gemini 2.5: Our most intelligent ai model, March 2025. Accessed: 2025-08-28.
- [11] Gemini Team, Petko Georgiev, Ving Ian Lei, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.
- [12] Gemini Team, Rohan Anil, Sebastian Borgeaud, et al. Gemini: A family of highly capable multimodal models, 2025.
- [13] OpenAI. Openai o1-mini: Advancing cost-efficient reasoning, September 2024. Accessed: 2025-08-28.
- [14] OpenAI. Introducing gpt-4.1 in the api, April 2025. Accessed: 2025-08-28.
- [15] OpenAI. Introducing gpt-4.5, February 2025. Accessed: 2025-08-28.
- [16] OpenAI. gpt-oss-120b & gpt-oss-20b model card, August 2025. Accessed: 2025-08-28.
- [17] OpenAI. Introducing gpt-5, August 2025. Accessed: 2025-08-28.
- [18] OpenAI. Model release notes. Accessed: 2025-08-28.
- [19] OpenAI. Introducing openai o3 and o4-mini, April 2025. Accessed: 2025-08-28.
- [20] OpenAI. Gpt-4o mini: Advancing cost-efficient intelligence, July 2024. Accessed: 2025-08-28.
- [21] OpenAI. Hello gpt-4o, May 2024. Accessed: 2025-08-28.
- [22] OpenAI. Learning to reason with llms, September 2024. Accessed: 2025-08-28.
- [23] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jiasen Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020.
- [24] Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1823–1840, Online, November 2020. Association for Computational Linguistics.
- [25] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: an adversarial winograd schema challenge at scale. Commun. ACM, 64(9):99–106, August 2021.
- [26] Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, and Jonathan Berant. Commonsenseqa 2.0: Exposing the limits of ai through gamification, 2022.
- [27] Andong Wang, Bo Wu, Sunli Chen, Zhenfang Chen, Haotian Guan, Wei-Ning Lee, Li Erran Li, and Chuang Gan. Sok-bench: A situated video reasoning benchmark with aligned open-world knowledge, 2024.
- [28] Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. Logiqa: a challenge dataset for machine reading comprehension with logical reasoning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI’20, 2021.
- [29] Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. Reclor: A reading comprehension dataset requiring logical reasoning. In International Conference on Learning Representations, 2020.
- [30] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021.
- [31] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 961–970, 2015.
- [32] Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding, 2023.
- [33] Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi. Faith and fate: limits of transformers on compositionality. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc.
- [34] Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models, 2025.
- [35] Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity, 2025.
- [36] Jackson Petty, Michael Y. Hu, Wentao Wang, Shauli Ravfogel, William Merrill, and Tal Linzen. Relic: Evaluating compositional instruction following via language recognition, 2025.
- [37] S. Bedi, Y. Jiang, P. Chung, S. Koyejo, and N. Shah. Fidelity of medical reasoning in large language models. JAMA Network Open, 8(8):e2526021, 2025.
- [38] Karthik Valmeekam, Kaya Stechly, Atharva Gundawar, and Subbarao Kambhampati. A systematic evaluation of the planning and scheduling abilities of the reasoning model o1. Transactions on Machine Learning Research, 2025.
- [39] P. N. Johnson-Laird. Mental models: towards a cognitive science of language, inference, and consciousness. Harvard University Press, USA, 1986.
- [40] Patrick Blackburn and Johannes Bos. Representation and Inference for Natural Language: A First Course in Computational Semantics. Center for the Study of Language and Information, Stanford, Calif., 2005.
- [41] Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40:e253, 2017.
- [42] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy, July 2019. Association for Computational Linguistics.
- [43] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021.
- [44] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, and Jason Wei. Challenging BIG-bench tasks and whether chain-of-thought can solve them. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, Toronto, Canada, July 2023. Association for Computational Linguistics.
- [45] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021.
- [46] Long Phan, Alice Gatti, Ziwen Han, et al. Humanity’s last exam, 2025.
- [47] Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David Ifeoluwa Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Sebastian Ruder, Wei-Yin Ko, Antoine Bosselut, Alice Oh, Andre Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, Marzieh Fadaee, Beyza Ermis, and Sara Hooker. Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18761–18799, Vienna, Austria, July 2025. Association for Computational Linguistics.
- [48] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023.
- [49] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark, 2024.
- [50] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018.
- [51] Omer Goldman, Uri Shaham, Dan Malkin, Sivan Eiger, Avinatan Hassidim, Yossi Matias, Joshua Maynez, Adi Mayrav Gilady, Jason Riesa, Shruti Rijhwani, Laura Rimell, Idan Szpektor, Reut Tsarfaty, and Matan Eyal. Eclektic: a novel challenge set for evaluation of cross-lingual knowledge transfer, 2025.
- [52] Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
- [53] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.
- [54] Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners, 2022.
- [55] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024.
- [56] Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Matej Vrzala, Jaime Sevilla, Qiuyu Ren, Elizabeth Pratt, Lionel Levine, Grant Barkley, Natalie Stewart, Bogdan Grechuk, Tetiana Grechuk, Shreepranav Varma Enugandla, and Mark Wildon. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai, 2024.
- [57] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2024.
- [58] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images, 2016.
- [59] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland, May 2022. Association for Computational Linguistics.
- [60] Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. Docvqa: A dataset for vqa on document images, 2021.
- [61] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read, 2019.
- [62] Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos, 2025.
- [63] Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, Ethan Yeo, Eugenie Lamprecht, Qi Liu, Yuqi Wang, Eric Chen, Deyu Fu, Lei Li, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Mikel Artetxe, and Yi Tay. Vibe-eval: A hard evaluation suite for measuring progress of multimodal language models, 2024.
- [64] Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye Lin, Gyungin Shin, Qiaochu Yang, Anh Totti Nguyen, David I. Atkinson, Aaditya Baranwal, Alexandru Coca, Mikah Dang, Sebastian Dziadzio, Jakob D. Kunz, Kaiqu Liang, Alexander Lo, Brian Pulfer, Steven Walton, Charig Yang, Kai Han, and Samuel Albanie. Zerobench: An impossible visual benchmark for contemporary large multimodal models, 2025.
- [65] Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 113569–113697. Curran Associates, Inc., 2024.
- [66] Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark, 2025.
- [67] Google DeepMind. Gemini robotics: Bringing ai into the physical world, 2025. Accessed: 2025-08-29.
- [68] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024.
- [69] Stanford University and Laude Institute. Terminal-bench: A benchmark for ai agents in terminal environments, 2025. Accessed: 2025-08-29.
- [70] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021.
- [71] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024.
- [72] Aider. o1 tops aider’s new polyglot leaderboard, 2024. Accessed: 2025-08-29.
- [73] Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. Swe-lancer: Can frontier llms earn $1 million from real-world freelance software engineering?, 2025.
- [74] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. $\tau$-bench: A benchmark for tool-agent-user interaction in real-world domains, 2024.
- [75] Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. $\tau^{2}$-bench: Evaluating conversational agents in a dual-control environment, 2025.
- [76] Shunyu Yao, Howard Chen, Austin W. Hanjie, Runzhe Yang, and Karthik Narasimhan. Collie: Systematic construction of constrained text generation tasks, 2023.
- [77] Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models, 2024.
- [78] Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldshtein, and Dipanjan Das. The facts grounding leaderboard: Benchmarking llms’ ability to ground responses to long-form input, 2025.
- [79] Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents, 2025.
- [80] Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, and Jie Tang. Complexfuncbench: Exploring multi-step and constrained function calling under long-context scenario, 2025.
- [81] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023.
- [82] Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu Xu, Hongjiang Lv, Shruti Bhosale, Chenguang Zhu, Karthik Abinav Sankararaman, Eryk Helenowski, Melanie Kambadur, Aditya Tayade, Hao Ma, Han Fang, and Sinong Wang. Multi-if: Benchmarking llms on multi-turn and multilingual instructions following, 2024.
- [83] Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, Sébastien M. R. Arnold, Vincent Perot, Siddharth Dalmia, Hexiang Hu, Xudong Lin, Panupong Pasupat, Aida Amini, Jeremy R. Cole, Sebastian Riedel, Iftekhar Naim, Ming-Wei Chang, and Kelvin Guu. Can long-context language models subsume retrieval, rag, sql, and more?, 2024.
- [84] Kaustubh Deshpande, Ved Sirdeshmukh, Johannes Baptist Mols, Lifeng Jin, Ed-Yeremai Hernandez-Cardona, Dean Lee, Jeremy Kritz, Willow E. Primack, Summer Yue, and Chen Xing. MultiChallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier LLMs. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 18632–18702, Vienna, Austria, July 2025. Association for Computational Linguistics.
- [85] Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. Healthbench: Evaluating large language models towards improved human health, 2025.
Appendix A Reasoning Benchmarks
Table 1: Taxonomy of benchmarks used in this study.
| Benchmark | Reasoning Type | Year | Description |
| --- | --- | --- | --- |
| HellaSwag [42] | Commonsense and Logical Reasoning | 2019 | Multiple-choice task: choose the most plausible sentence continuation. |
| MMLU [43] | Reasoning with General Knowledge | 2021 | Multiple-choice task: answer questions across 57 domains to test knowledge and problem-solving. |
| Big-Bench-Hard [44] | Reasoning with General Knowledge | 2023 | Open-generation task: solve difficult BIG-Bench problems testing multi-step reasoning and problem-solving. |
| MMMLU [45] | Reasoning with General Knowledge | 2024 | Multiple-choice task: answer 57 domain questions translated into 14 languages to test multilingual knowledge and problem-solving. |
| Humanity’s Last Exam [46] | Reasoning with General Knowledge | 2025 | Multi-modal task: answer closed-ended questions across many subjects to test verifiable knowledge. |
| Global MMLU (Lite) [47] | Reasoning with General Knowledge | 2025 | Multiple-choice task: answer 42-language questions with culturally sensitive labeling to test equitable multilingual knowledge. |
| GPQA Diamond [48] | Reasoning with General Knowledge | 2023 | Multiple-choice task: answer 448 expert-level science questions in biology, physics, and chemistry that are Google-proof and highly challenging. |
| MMLU Pro [49] | Reasoning with General Knowledge | 2024 | Multiple-choice task: extended from MMLU, answer more challenging reasoning questions with 10 options across diverse domains. |
| ARC (AI2 Reasoning Challenge) [50] | Reading Comprehension and Question Answering | 2018 | Multiple-choice task: answer grade-school science questions requiring advanced knowledge and reasoning beyond simple retrieval. |
| ECLeKTic [51] | Reading Comprehension and Question Answering | 2025 | Closed-book QA task: answer 12-language questions to test cross-lingual knowledge transfer. |
| DROP [52] | Reading Comprehension and Question Answering | 2019 | Open-ended QA task: answer 96k English questions requiring discrete reasoning over paragraph content. |
| GSM8K [53] | Mathematical Reasoning | 2021 | Open-ended QA task: solve grade-school problems requiring multi-step mathematical reasoning. |
| MATH [30] | Mathematical Reasoning | 2021 | Open-ended QA: solve 12,500 challenging competition problems with step-by-step solutions to test advanced mathematical reasoning. |
| MATH 500 [30] | Mathematical Reasoning | 2024 | Open-ended QA: Challenging subset of MATH benchmark. |
| MGSM [54] | Mathematical Reasoning | 2023 | Open-ended QA: solve 250 GSM8K problems translated into 10 languages. |
| MathVista [55] | Mathematical Reasoning | 2024 | Open-ended multimodal QA: solve 6,141 math problems requiring visual and compositional reasoning. |
| AIME 2024 | Mathematical Reasoning | 2024 | Open-ended QA: solve challenging competition-level mathematics problems. |
| AIME 2025 | Mathematical Reasoning | 2025 | Open-ended QA: solve challenging competition-level mathematics problems. |
| FrontierMath [56] | Mathematical Reasoning | 2024 | Open-ended QA: tests advanced mathematical reasoning across diverse and expert-level domains, requiring multi-step problem solving and deep mathematical knowledge. |
| MMMU [57] | Multimodal Reasoning | 2024 | Question answering task: multimodal multiple-choice and open-ended questions across 30 subjects requiring advanced reasoning and domain-specific knowledge. |
| AI2D [58] | Multimodal Reasoning | 2016 | Open-ended QA: multimodal questions with 5,000 diagrams and 15,000 Q&A pairs requiring diagram structure understanding and reasoning. |
| ChartQA [59] | Multimodal Reasoning | 2022 | Open-ended QA: multimodal questions with 32.7K chart-based problems requiring visual and logical reasoning. |
| EgoSchema [32] | Multimodal Reasoning | 2023 | Multiple-choice QA: multimodal questions with 5,000 long-form video clips requiring understanding of human activity and temporal reasoning. |
| DocVQA [60] | Multimodal Reasoning | 2021 | Open-ended QA: 50,000 questions over 12,767 document images requiring reading and interpreting document layout and structure. |
| TextVQA [61] | Multimodal Reasoning | 2019 | Open-ended QA: 45,336 questions on 28,408 images requiring reading and reasoning about embedded text. |
| VideoMMMU [62] | Multimodal Reasoning | 2025 | Open-ended QA: multimodal questions with 300 expert-level videos and 900 Q&A pairs assessing knowledge acquisition through perception, comprehension, and adaptation. |
| Vibe-Eval [63] | Multimodal Reasoning | 2024 | Open-ended QA: multimodal questions, testing visual understanding and multimodal chat capabilities. |
| ZeroBench [64] | Multimodal Reasoning | 2025 | Open-ended QA: multimodal questions with 434 visual reasoning problems designed to be impossible for current LMMs. |
| CharXiv [65] | Multimodal Reasoning | 2024 | Open-ended QA: multimodal questions with 2,323 charts requiring descriptive analysis and complex reasoning. |
| MMMU Pro [66] | Multimodal Reasoning | 2025 | QA task: multimodal multiple-choice and open-ended questions, extended from MMMU, testing integrated visual and textual reasoning. |
| ActivityNet [31] | Multimodal Reasoning | 2015 | Multiple-choice and open-ended QA: evaluates recognition and understanding of complex human activities in untrimmed videos, testing visual perception and temporal reasoning. |
| ERQA [67] | Multimodal Reasoning | 2025 | Multiple-choice QA: evaluates embodied reasoning and spatial understanding in real-world scenarios, requiring models to integrate text and visual inputs to select the correct answer. |
| SWE-bench Verified [68] | Programming and Coding | 2024 | Open-ended QA: solve a human-validated 500-problem subset of SWE-bench's 2,294 real GitHub issues, requiring multi-file code edits and complex reasoning. |
| Terminal-bench [69] | Programming and Coding | 2025 | Open-ended QA: complete complex tasks in terminal environments using text-based commands and reasoning. |
| HumanEval [70] | Programming and Coding | 2021 | Open-ended QA: solve Python programming problems specified by docstrings, requiring functional code synthesis. |
| LiveCode Bench [71] | Programming and Coding | 2025 | Open-ended QA: solve 600+ coding problems from contests, testing generation, self-repair, execution, and test prediction. |
| Aider Polyglot [72] | Programming and Coding | 2024 | Open-ended QA: solve 225 difficult coding problems in C++, Go, Java, JavaScript, Python, and Rust. |
| SWE-Lancer [73] | Programming and Coding | 2025 | Open-ended QA: complete 1,400 freelance software engineering tasks, including implementation and managerial decisions, with real-world evaluation. |
| SWE-Lancer Diamond [73] | Programming and Coding | 2025 | Open-ended QA: complete tasks from the public SWE-Lancer Diamond split, including implementation and managerial software engineering problems. |
| TAU-bench [74] | Tool Use – LLM | 2024 | Open-ended QA: tests reasoning, consistency, and rule-following in dynamic, tool-assisted human-agent interactions. |
| TAU2-bench [75] | Tool Use – LLM | 2025 | Open-ended QA: tests multi-turn reasoning, coordination, and communication in dual-control environments where both agent and user act with tools. |
| COLLIE [76] | Constrained Text Generation – LLM | 2023 | Open-ended QA: answer 2,080 prompts requiring constrained text generation with compositional, grammar-based, and reasoning challenges. |
| SimpleQA [77] | Factuality – LLM | 2024 | Factual QA benchmark designed to test factual accuracy and knowledge calibration. |
| FACTS Grounding [78] | Factuality – LLM | 2024 | Open-ended QA: answer questions requiring LLMs to generate factually accurate and well-grounded responses from provided source material. |
| BrowseComp [79] | Factuality – LLM | 2025 | Open-ended QA: answer 1,266 questions by persistently navigating the internet to find hard-to-locate information. |
| ComplexFunc Bench [80] | Tool Use – LLM | 2025 | Open-ended QA: answer complex function-calling tasks in five real-world scenarios requiring multi-step reasoning, parameter management, and long-context handling. |
| IFEval [81] | Instruction Following – LLM | 2023 | Open-ended QA: answer 500 prompts requiring LLMs to follow verifiable natural language instructions. |
| Multi-IF [82] | Instruction Following – LLM | 2024 | Open-ended QA: answer 4,501 multilingual multi-turn prompts requiring accurate instruction-following across languages and conversation turns. |
| LOFT [83] | Long-Context – LLM | 2024 | Open-ended QA: answer real-world tasks requiring reasoning and in-context retrieval over millions of tokens. |
| Graphwalks [14] | Long-Context – LLM | 2025 | Open-ended QA: perform multi-hop reasoning across a graph of millions of tokens to answer questions requiring breadth-first traversal. |
| Multi Challenge [84] | Multi-turn Conversation – LLM | 2025 | Open-ended QA: answer multi-turn conversation prompts requiring instruction-following, context management, and in-context reasoning. |
| HealthBench [85] | Safety – LLM | 2025 | Open-ended QA: evaluates LLMs on multi-turn healthcare conversations, requiring factual reasoning, safety awareness, and context-sensitive decision-making across diverse medical contexts. |
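To make this classification easy to reuse when aggregating scores by reasoning category, it can be encoded as plain data. The sketch below is illustrative only and is not part of the original compilation: the entries shown are a small example subset of the 52 benchmarks, and the helper name `group_by_category` is our own choice.

```python
# Illustrative sketch: a plain-Python encoding of part of the benchmark
# taxonomy above, so scores can later be grouped by reasoning category.
# The entries shown are an example subset, not the full list of 52.
from collections import defaultdict

BENCHMARKS = [
    # (benchmark name, reasoning category, release year)
    ("GSM8K", "Mathematical Reasoning", 2021),
    ("AIME 2025", "Mathematical Reasoning", 2025),
    ("MMMU", "Multimodal Reasoning", 2024),
    ("GPQA Diamond", "Reasoning with General Knowledge", 2023),
    ("SWE-bench Verified", "Programming and Coding", 2024),
]

def group_by_category(benchmarks):
    """Return a mapping from reasoning category to its (name, year) entries."""
    grouped = defaultdict(list)
    for name, category, year in benchmarks:
        grouped[category].append((name, year))
    return dict(grouped)

if __name__ == "__main__":
    for category, entries in group_by_category(BENCHMARKS).items():
        print(category, "->", [name for name, _ in entries])
```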
Appendix B Performance of Models
[Figure: figures/claude_2_plots/claude_performance_Commonsense_and_Logical_Reasoning.png; line chart of HellaSwag score (%) vs. model number.]
(a) Commonsense and Logical Reasoning
[Figure: figures/claude_2_plots/claude_performance_Mathematical_Reasoning.png; line chart of score (%) vs. model number for GSM8K, MGSM, MATH, MathVista, MATH 500, AIME 2024, and AIME 2025.]
(b) Mathematical Reasoning
[Figure: figures/claude_2_plots/claude_performance_Multimodal_Reasoning.png; line chart of score (%) vs. model number for DocVQA, AI2D, ChartQA, and MMMU.]
(c) Multimodal Reasoning
[Figure: figures/claude_2_plots/claude_performance_Programming_and_Coding.png; line chart of score (%) vs. model number for HumanEval, SWE-bench Verified, and Terminal-bench.]
(d) Programming and Coding
[Figure: figures/claude_2_plots/claude_performance_Reading_Comprehension_and_Question_Answering.png; line chart of score (%) vs. model number for ARC (AI2 Reasoning Challenge) and DROP.]
(e) Reading Comprehension and QA
[Figure: figures/claude_2_plots/claude_performance_Reasoning_with_General_Knowledge.png; line chart of score (%) vs. model number for Big-Bench-Hard, MMLU, MMLU Pro, MMMLU, and GPQA Diamond.]
(f) Reasoning with General Knowledge
[Figure: figures/claude_2_plots/claude_performance_LLM_Benchmarks_Combined.png; line chart of score (%) vs. model number for IFEval, TAU-bench Retail, and TAU-bench Airline.]
(g) LLM Benchmarks
Figure 3: Performance of the Claude family on reasoning benchmarks by category. Model numbers and corresponding names are as follows: 1 – Claude 3 Haiku; 2 – Claude 3 Sonnet; 3 – Claude 3 Opus; 4 – Claude 3.5 Haiku; 5 – Claude 3.5 Sonnet; 6 – Claude 3.7 Sonnet; 7 – Claude 3.7 Sonnet (64K Extended Thinking); 8 – Claude Sonnet 4; 9 – Claude Opus 4; 10 – Claude Opus 4.1.
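The panels in Figure 3 (and the Gemini panels that follow) share one structure: a score trajectory per benchmark plotted against the model index, with each benchmark name written next to the last point of its line. The sketch below shows how such a panel could be reproduced; it is not the original figure-generation code, it assumes scores are kept in a simple benchmark-to-scores dictionary, and the numbers used are placeholders rather than reported results.

```python
# Minimal plotting sketch (not the original figure-generation code).
# `scores` maps each benchmark to {model number: score in %}; the values
# below are placeholders, not results reported for any model family.
import matplotlib.pyplot as plt

scores = {
    "Benchmark A": {1: 55.0, 2: 63.0, 3: 71.0, 5: 82.0},  # gaps in coverage are allowed
    "Benchmark B": {4: 40.0, 5: 52.0, 6: 68.0},
}

fig, ax = plt.subplots(figsize=(6, 4))
for name, series in scores.items():
    xs = sorted(series)                  # model numbers with a reported score
    ys = [series[x] for x in xs]
    ax.plot(xs, ys, marker="o")
    ax.annotate(name, (xs[-1], ys[-1]))  # in-line label instead of a legend box
ax.set_xlabel("Model Number")
ax.set_ylabel("Score (%)")
ax.set_xticks(range(1, 11))
ax.grid(True, linestyle="--", alpha=0.5)
fig.tight_layout()
plt.show()
```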
[Figure: figures/gemini_2_plots/gemini_performance_Commonsense_and_Logical_Reasoning.png; line chart of HellaSwag score (%) vs. model number.]
(a) Commonsense and Logical Reasoning
[Figure: figures/gemini_2_plots/gemini_performance_Mathematical_Reasoning.png; line chart of score (%) vs. model number for GSM8K, MGSM, MATH, MathVista, AIME 2024, and AIME 2025.]
(b) Mathematical Reasoning
<details>
<summary>figures/gemini_2_plots/gemini_performance_Multimodal_Reasoning.png Details</summary>

### Visual Description
## Line Chart: Benchmark Scores Across Model Numbers
### Overview
This image is a line chart displaying the performance scores of various evaluation benchmarks across a sequential series of "Model Numbers." The chart tracks nine distinct benchmarks, each represented by a uniquely colored line and marker style. The data suggests a comparison of different iterations, sizes, or versions of an AI model against a suite of standardized tests.
### Components/Axes
* **Y-Axis (Vertical):**
* **Label:** "Score (%)"
* **Scale:** Ranges from 0 to 100 (implied top), with major tick marks and labels at 0, 20, 40, 60, and 80.
* **Gridlines:** Light gray, dashed horizontal lines extend from each major tick mark, including an unlabelled line at the 100 mark.
* **X-Axis (Horizontal):**
* **Label:** "Model Number"
* **Scale:** Discrete integer values from 1 to 10.
* **Gridlines:** Light gray, dashed vertical lines extend upward from each integer.
* **Legend/Labels:** There is no separate legend box. Instead, the name of each benchmark is written directly on the chart, placed adjacent to the final data point of its respective line. The text color of the label matches the line color.
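This legend-free, end-of-line labeling convention recurs across every panel in the figure. As a point of reference, a minimal matplotlib sketch of the idea is shown below; the series names and values are illustrative placeholders rather than data read from the chart.

```python
import matplotlib.pyplot as plt

# Placeholder data: benchmark name -> (model numbers, approximate scores).
# These values are illustrative only; they are not taken from the figure.
series = {
    "Benchmark A": ([1, 2, 3, 4], [80, 72, 90, 94]),
    "Benchmark B": ([3, 4, 5, 6, 7, 8], [64, 70, 63, 68, 78, 84]),
}

fig, ax = plt.subplots()
for name, (models, scores) in series.items():
    (line,) = ax.plot(models, scores, marker="o")
    # Write the series name next to its final data point, in the line's color,
    # instead of drawing a separate legend box.
    ax.annotate(name, xy=(models[-1], scores[-1]),
                xytext=(5, 0), textcoords="offset points",
                color=line.get_color(), va="center")

ax.set_xlabel("Model Number")
ax.set_ylabel("Score (%)")
ax.set_xlim(0.5, 10.5)
ax.grid(True, linestyle="--", alpha=0.4)
plt.show()
```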
### Detailed Analysis
Below is the extraction of data for each series. Values are visual approximations (denoted by `~`) based on the Y-axis scale.
**1. AI2D**
* **Spatial Grounding:** Label is located at the top center, near x=4, y=95.
* **Visual Attributes:** Red line, solid diamond markers.
* **Trend:** Starts high, experiences a slight dip at Model 2, then rises sharply to Model 3, and slightly more to Model 4, where the series ends.
* **Data Points:**
* Model 1: ~79
* Model 2: ~73
* Model 3: ~91
* Model 4: ~94
**2. DocVQA**
* **Spatial Grounding:** Label is located at the top center, just below AI2D, near x=4, y=92.
* **Visual Attributes:** Brown line, solid pentagon markers.
* **Trend:** Starts as the highest scoring benchmark, dips slightly at Model 2, recovers at Model 3, and peaks at Model 4, where the series ends.
* **Data Points:**
* Model 1: ~91
* Model 2: ~88
* Model 3: ~90
* Model 4: ~93
**3. ChartQA**
* **Spatial Grounding:** Label is located at the top center, below DocVQA, near x=4, y=87.
* **Visual Attributes:** Green line, solid upward-pointing triangle markers.
* **Trend:** Follows the common early trend: starts high, dips at Model 2, rises sharply at Model 3, and rises slightly to Model 4, where the series ends.
* **Data Points:**
* Model 1: ~80
* Model 2: ~74
* Model 3: ~85
* Model 4: ~87
**4. TextVQA**
* **Spatial Grounding:** Label is located in the upper center-left, near x=4, y=79.
* **Visual Attributes:** Dark blue line, solid circle markers.
* **Trend:** Starts high, dips at Model 2, rises at Model 3, and remains perfectly flat to Model 4, where the series ends.
* **Data Points:**
* Model 1: ~82
* Model 2: ~74
* Model 3: ~79
* Model 4: ~79
**5. EgoSchema**
* **Spatial Grounding:** Label is located in the center, near x=4, y=72.
* **Visual Attributes:** Pink line, hollow square markers.
* **Trend:** This is a short series. It begins at Model 3 and slopes upward to Model 4, where it ends.
* **Data Points:**
* Model 3: ~66
* Model 4: ~72
**6. VideoMMMU**
* **Spatial Grounding:** Label is located in the upper right, near x=8, y=84.
* **Visual Attributes:** Cyan (light blue) line, cross (+) markers.
* **Trend:** Begins at Model 3, rises to Model 4, dips at Model 5, then exhibits a steady, continuous upward climb through Models 6, 7, and 8, where the series ends.
* **Data Points:**
* Model 3: ~65
* Model 4: ~70
* Model 5: ~64
* Model 6: ~68
* Model 7: ~79
* Model 8: ~83
**7. MMMU**
* **Spatial Grounding:** Label is located on the far right, near x=10, y=73.
* **Visual Attributes:** Orange line, solid square markers.
* **Trend:** This series spans the entire X-axis. It starts moderately high, drops sharply at Model 2, rises through Model 4, dips at Model 5, rises steadily to peak at Model 8, drops sharply at Model 9, and remains flat to Model 10. Notably, it tracks almost identically with VideoMMMU between Models 5 and 8.
* **Data Points:**
* Model 1: ~59
* Model 2: ~48
* Model 3: ~58
* Model 4: ~68
* Model 5: ~65
* Model 6: ~69
* Model 7: ~80
* Model 8: ~82
* Model 9: ~73
* Model 10: ~73
**8. Vibe-Eval (Reka)**
* **Spatial Grounding:** Label is located on the middle right, near x=10, y=58.
* **Visual Attributes:** Gray line, star markers.
* **Trend:** Begins at Model 3. It fluctuates, rising to Model 4, dipping at Model 5, rising steadily to peak at Model 8, dropping sharply at Model 9, and recovering slightly at Model 10. Its shape closely mirrors the MMMU line from Model 4 onwards, but at a lower score tier.
* **Data Points:**
* Model 3: ~52
* Model 4: ~56
* Model 5: ~51
* Model 6: ~55
* Model 7: ~65
* Model 8: ~69
* Model 9: ~51
* Model 10: ~58
**9. ZeroBench**
* **Spatial Grounding:** Label is located in the bottom right, near x=8, y=6.
* **Visual Attributes:** Yellow-green line, 'x' markers.
* **Trend:** An extreme outlier. Begins at Model 3 and remains nearly flat at the very bottom of the chart, showing only a microscopic upward slope until a very slight bump at Model 8, where it ends.
* **Data Points:**
* Model 3: ~1
* Model 4: ~1
* Model 5: ~1
* Model 6: ~1.5
* Model 7: ~2
* Model 8: ~5
### Key Observations
* **The "Model 2" Dip:** Every single benchmark evaluated at Model 1 (AI2D, DocVQA, ChartQA, TextVQA, MMMU) experiences a noticeable drop in performance at Model 2 before recovering at Model 3.
* **Truncated Data:** Five of the nine benchmarks (AI2D, DocVQA, ChartQA, TextVQA, EgoSchema) cease reporting data after Model 4.
* **Correlated Performance:** Between Models 4 and 8, the lines for VideoMMMU, MMMU, and Vibe-Eval (Reka) follow nearly identical trajectory shapes (up, down, up, up, up), suggesting these models scale similarly across these specific, perhaps related, multimodal tasks.
* **The "Model 9" Drop:** The only two benchmarks that continue past Model 8 (MMMU and Vibe-Eval) both show a sharp decline in performance at Model 9.
* **Outlier:** ZeroBench scores are drastically lower than all other benchmarks, never exceeding 5%.
### Interpretation
This chart likely visualizes the evaluation of a specific family of Large Multimodal Models (LMMs) across different developmental iterations or parameter sizes (represented by "Model Number" 1 through 10).
The universal dip at Model 2 suggests a regression in that specific model version—perhaps a smaller parameter size in a family of models, or a failed training checkpoint.
The clustering of lines ending at Model 4 implies a change in testing methodology. It is highly probable that Models 1-4 represent one phase of development or one specific model architecture, while Models 5-10 represent a newer phase where older benchmarks (like TextVQA or ChartQA) were either deemed "solved" (as they were approaching 90-95%) or deprecated in favor of harder, newer benchmarks like MMMU and VideoMMMU.
The near-zero performance on "ZeroBench" indicates it is an exceptionally difficult, perhaps adversarial, benchmark designed to test capabilities that none of these models currently possess. The sharp drop at Model 9 for the remaining benchmarks suggests that Model 9 is either a smaller, more efficient model variant (like a "Mobile" or "Nano" version) rather than a direct, more powerful successor to Model 8.
</details>
(c) Multimodal Reasoning
<details>
<summary>figures/gemini_2_plots/gemini_performance_Programming_and_Coding.png Details</summary>

### Visual Description
## Line Chart: Benchmark Scores Across Different Models
### Overview
This image is a line chart displaying the performance scores (in percentages) of various models across five different benchmarks. The chart tracks "Score (%)" on the vertical axis against "Model Number" on the horizontal axis. Instead of a traditional legend box, the data series are labeled directly on the chart canvas adjacent to their respective lines.
### Components/Axes
* **Y-Axis (Left):** Labeled "Score (%)". The scale ranges from 0 to 80, with major tick marks and corresponding labels at intervals of 10 (0, 10, 20, 30, 40, 50, 60, 70, 80). There is an unlabeled gridline above 80, representing 90.
* **X-Axis (Bottom):** Labeled "Model Number". The scale ranges from 1 to 10, with major tick marks and integer labels for every number (1, 2, 3, 4, 5, 6, 7, 8, 9, 10).
* **Gridlines:** The chart features a background grid of light gray, dashed lines corresponding to every major tick on both the X and Y axes.
* **Labels (Inline Legend):**
* "HumanEval" (Top center, dark blue text)
* "SWE-bench Verified M" (Middle right, cyan text)
* "LiveCodeBench" (Middle right, green text)
* "SWE-bench Verified S" (Lower right, brown text)
* "Aider Polygot" (Lower right, gray text)
### Detailed Analysis
*Note: All numerical values are visual approximations extracted from the chart.*
**1. HumanEval (Dark Blue Line, Circular Markers)**
* **Spatial Grounding:** The label is located at the top center of the chart, immediately to the right of the data point at Model 4.
* **Visual Trend:** This series only spans Models 1 through 4. It starts high, dips slightly at Model 2, recovers at Model 3, and rises to the highest overall point on the chart at Model 4.
* **Data Points:**
* Model 1: ~74%
* Model 2: ~68%
* Model 3: ~74%
* Model 4: ~84%
**2. SWE-bench Verified M (Cyan Line, Small Circular Markers)**
* **Spatial Grounding:** The label is located on the right side, positioned above the line segment connecting Models 9 and 10.
* **Visual Trend:** Spans Models 4 through 10. It starts in the mid-30s, dips at Model 5, rises steadily through Model 7, peaks at Model 8, drops sharply at Model 9, and flattens out with a slight rise at Model 10.
* **Data Points:**
* Model 4: ~34%
* Model 5: ~23%
* Model 6: ~34%
* Model 7: ~60%
* Model 8: ~67%
* Model 9: ~43%
* Model 10: ~45%
**3. LiveCodeBench (Green Line, Square Markers)**
* **Spatial Grounding:** The label is located on the right side, just above the data point for Model 9.
* **Visual Trend:** Spans Models 3 through 10. It remains remarkably flat and stable from Model 3 to Model 6. It then spikes sharply upward at Model 7, peaks at Model 8, drops precipitously at Model 9, and remains flat to Model 10.
* **Data Points:**
* Model 3: ~30%
* Model 4: ~30%
* Model 5: ~29%
* Model 6: ~29%
* Model 7: ~59%
* Model 8: ~74%
* Model 9: ~34%
* Model 10: ~34%
**4. SWE-bench Verified S (Brown Line, Upward Triangle Markers)**
* **Spatial Grounding:** The label is located on the lower right side, just below the data point for Model 9.
* **Visual Trend:** Spans Models 3 through 10. It exhibits a zig-zag pattern initially (rising at 4, dipping at 5), then climbs steadily through Model 7 to peak at Model 8. It drops sharply at Model 9 and declines slightly to Model 10.
* **Data Points:**
* Model 3: ~10%
* Model 4: ~22%
* Model 5: ~13%
* Model 6: ~21%
* Model 7: ~49%
* Model 8: ~59%
* Model 9: ~32%
* Model 10: ~28%
**5. Aider Polyglot (Gray Line, Diamond Markers)**
* **Spatial Grounding:** The label is located at the bottom right, directly below the "SWE-bench Verified S" label.
* **Visual Trend:** Spans Models 3 through 10. It starts at the lowest point on the chart. It follows a similar zig-zag to the brown line (up at 4, down at 5, up at 6), then experiences a massive, steep climb through Model 7 to reach the highest peak among the lower four series at Model 8. It then crashes sharply at Model 9 and remains flat to Model 10.
* **Data Points:**
* Model 3: ~3%
* Model 4: ~17%
* Model 5: ~11%
* Model 6: ~21%
* Model 7: ~57%
* Model 8: ~82%
* Model 9: ~27%
* Model 10: ~27%
### Key Observations
* **The "Model 8" Anomaly:** There is a massive, uniform spike in performance at Model 8 across all four benchmarks that were tested on it. Aider Polygot sees the most dramatic increase, jumping from ~21% at Model 6 to ~82% at Model 8.
* **The "Model 5" Dip:** Conversely, Models 4, 5, and 6 show a consistent "V" shape across three of the benchmarks (SWE-bench M, SWE-bench S, and Aider Polygot), indicating Model 5 performed noticeably worse than the models immediately preceding and succeeding it.
* **Incomplete Data:** HumanEval is only plotted for Models 1-4, while the other four benchmarks are plotted for Models 3-10 (or 4-10).
* **Benchmark Correlation:** The four benchmarks spanning Models 3/4 to 10 exhibit highly correlated movement. They all dip at Model 5, rise at 6 and 7, peak at 8, and drop sharply at 9.
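To make the correlation observation above concrete, the sketch below computes Pearson correlations from the approximate values read off this chart over the shared Models 4 through 10. These are visual estimates, not official scores, so the coefficients are only indicative.

```python
import numpy as np

# Approximate scores read off the chart for Models 4-10 (visual estimates, in %).
swe_bench_m = np.array([34, 23, 34, 60, 67, 43, 45])  # SWE-bench Verified M
swe_bench_s = np.array([22, 13, 21, 49, 59, 32, 28])  # SWE-bench Verified S
live_code = np.array([30, 29, 29, 59, 74, 34, 34])    # LiveCodeBench
aider = np.array([17, 11, 21, 57, 82, 27, 27])        # Aider Polyglot

# Pearson correlation of each series against SWE-bench Verified M.
for name, scores in [("SWE-bench Verified S", swe_bench_s),
                     ("LiveCodeBench", live_code),
                     ("Aider Polyglot", aider)]:
    r = np.corrcoef(swe_bench_m, scores)[0, 1]
    print(f"corr(SWE-bench Verified M, {name}) = {r:.2f}")
```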
### Interpretation
The data strongly suggests a chronological or sequential progression of AI models (likely Large Language Models evaluated on coding tasks, given the benchmark names).
* **HumanEval as a Baseline:** HumanEval scores are significantly higher than the others for early models. This implies HumanEval is likely an older, easier, or more saturated benchmark. The testers may have stopped running it after Model 4 because it was no longer providing useful differentiation, shifting focus to the harder benchmarks.
* **Model 8 is a Breakthrough:** Model 8 represents a massive leap in capability. Given the sharp drop-off at Models 9 and 10, Model 8 might represent a much larger parameter model (e.g., a 70B model compared to 7B/8B models), a different architecture, or a model specifically fine-tuned for the tasks these benchmarks measure.
* **Models 9 & 10:** The fact that performance drops back down to levels similar to Models 4-6 suggests that Models 9 and 10 are not direct successors to Model 8 in terms of scale or capability. They might be smaller, more efficient models released later, or experimental branches that did not retain the coding capabilities of Model 8.
* **Benchmark Difficulty:** Based on the vertical stacking of the lines at the Model 8 peak, we can infer the relative difficulty of the benchmarks for that specific model: Aider Polyglot (easiest/highest score) > LiveCodeBench > SWE-bench Verified M > SWE-bench Verified S (hardest/lowest score). However, this difficulty hierarchy is not strictly consistent across all models (e.g., at Model 4, SWE-bench M is higher than LiveCodeBench).
</details>
(d) Programming and Coding
<details>
<summary>figures/gemini_2_plots/gemini_performance_Reading_Comprehension_and_Question_Answering.png Details</summary>

### Visual Description
## Line Chart: Model Performance Scores (DROP vs. ECLeKTic)
### Overview
This image is a 2D line chart displaying the performance scores of two distinct evaluation metrics or datasets, labeled "DROP" and "ECLeKTic", across a sequential progression of "Model Numbers". The chart uses a minimalist design with a white background, faint gridlines, and inline labeling rather than a traditional separate legend box.
### Components/Axes
**1. Y-Axis (Vertical - Left)**
* **Label:** "Score (%)" (Rotated 90 degrees counter-clockwise, centered vertically).
* **Scale:** Linear scale.
* **Markers:** Major tick marks at 20, 30, 40, 50, 60, 70, and 80.
* **Range:** The visible axis line starts slightly below 20 (approx. 10) and ends slightly above 80 (approx. 85).
**2. X-Axis (Horizontal - Bottom)**
* **Label:** "Model Number" (Centered horizontally below the axis markers).
* **Scale:** Discrete/Sequential integer scale.
* **Markers:** 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
* **Range:** 1 to 10.
**3. Gridlines**
* Faint, light-gray, dashed lines extend from every major tick mark on both the X and Y axes, forming a complete grid to aid in reading values.
**4. Legend / Series Identifiers (Inline)**
* There is no standalone legend box. Series are identified by text placed directly adjacent to the data lines.
* **Series 1:** Dark blue text "DROP" is positioned to the right of the final data point of the dark blue line.
* **Series 2:** Light blue/cyan text "ECLeKTic" is positioned just above and to the left of the final data point of the light blue/cyan line.
---
### Detailed Analysis
#### Trend Verification & Data Extraction
**Series 1: DROP**
* **Visual Trend:** The dark blue line (with circular markers) occupies the upper portion of the chart. It begins at a high point, experiences a notable dip, recovers partially, and then dips slightly again, ending abruptly at Model 4. Overall, it remains relatively flat within the 74-83% range.
* **Data Points (Approximate values based on grid alignment):**
* Model 1: ≈ 82.5% (Positioned slightly above the 80 gridline)
* Model 2: ≈ 74.0% (Positioned just below the midway point between 70 and 80)
* Model 3: ≈ 78.5% (Positioned just below the 80 gridline)
* Model 4: ≈ 75.0% (Positioned exactly midway between 70 and 80)
**Series 2: ECLeKTic**
* **Visual Trend:** The light blue/cyan line (with square markers) occupies the lower-to-middle portion of the chart. It begins at Model 3 at a very low score and exhibits a consistent, positive upward slope through Model 8. The rate of improvement varies, with steeper climbs between Models 3-4 and Models 7-8, and a flatter plateau between Models 4-5.
* **Data Points (Approximate values based on grid alignment):**
* Model 3: ≈ 16.5% (Positioned below the 20 gridline)
* Model 4: ≈ 27.0% (Positioned above the midway point between 20 and 30)
* Model 5: ≈ 28.0% (Positioned slightly higher than Model 4, just below 30)
* Model 6: ≈ 34.0% (Positioned slightly below the midway point between 30 and 40)
* Model 7: ≈ 37.0% (Positioned above the midway point between 30 and 40)
* Model 8: ≈ 47.0% (Positioned above the midway point between 40 and 50)
#### Reconstructed Data Table
| Model Number | DROP Score (%) | ECLeKTic Score (%) |
| :--- | :--- | :--- |
| 1 | ≈ 82.5 | *No Data* |
| 2 | ≈ 74.0 | *No Data* |
| 3 | ≈ 78.5 | ≈ 16.5 |
| 4 | ≈ 75.0 | ≈ 27.0 |
| 5 | *No Data* | ≈ 28.0 |
| 6 | *No Data* | ≈ 34.0 |
| 7 | *No Data* | ≈ 37.0 |
| 8 | *No Data* | ≈ 47.0 |
| 9 | *No Data* | *No Data* |
| 10 | *No Data* | *No Data* |
---
### Key Observations
1. **Disjointed Data Ranges:** The most striking feature is the lack of continuity across the X-axis for both metrics. "DROP" is only measured for Models 1 through 4. "ECLeKTic" is only measured for Models 3 through 8.
2. **Overlap:** The only models where both scores are recorded simultaneously are Models 3 and 4.
3. **Performance Disparity:** Where they overlap (Models 3 and 4), the DROP score is vastly superior (approx. 50-60 percentage points higher) to the ECLeKTic score.
4. **Opposing Trajectories:** While DROP shows a slight overall degradation or stagnation from Model 1 to 4, ECLeKTic shows significant, continuous improvement from Model 3 to 8.
### Interpretation
* **What the data suggests:** The chart tracks the evolution of a system (likely machine learning models, given the terminology) across sequential iterations (Models 1 through 10). "DROP" appears to be an "easier" benchmark or a task the earlier models were already highly optimized for, starting at >80%. "ECLeKTic" appears to be a much more difficult benchmark, where early models (Model 3) fail significantly (<20%), but subsequent iterations show steady learning and capability gains, approaching 50% by Model 8.
* **Reading between the lines (Peircean abduction):** The abrupt stop of the "DROP" metric at Model 4 strongly implies a shift in research or development focus. Because the models were already performing adequately on DROP (~75-80%), the developers likely deemed it a "solved" or less informative metric for future iterations. Conversely, the introduction of "ECLeKTic" at Model 3 suggests a new, harder benchmark was introduced to test capabilities that DROP could not measure. The fact that Models 9 and 10 are on the axis but have no data points suggests this chart may be a snapshot of ongoing work, or that testing for those specific models on these specific benchmarks was abandoned or is pending.
</details>
(e) Reading Comprehension and QA
<details>
<summary>figures/gemini_2_plots/gemini_performance_Reasoning_with_General_Knowledge.png Details</summary>

### Visual Description
## Line Chart: AI Model Performance Across Various Benchmarks
### Overview
This image is a line chart displaying the performance scores (in percentages) of ten sequential or distinct entities, labeled as "Model Number" 1 through 10, across five different evaluation benchmarks. The chart illustrates how performance varies across different tests, with some tests showing high saturation (scores near 90%) and others proving exceptionally difficult (scores below 25%).
### Components/Axes
**Main Chart Area:**
The chart uses a standard Cartesian coordinate system with faint, dashed, light-gray gridlines corresponding to the major axis ticks.
**X-Axis (Horizontal):**
* **Label:** "Model Number" (Centered at the bottom).
* **Scale:** Discrete integer values from 1 to 10.
* **Markers:** 1, 2, 3, 4, 5, 6, 7, 8, 9, 10.
**Y-Axis (Vertical):**
* **Label:** "Score (%)" (Rotated 90 degrees counter-clockwise, centered on the left).
* **Scale:** Continuous percentage scale, visually ranging from 0 to 100.
* **Markers:** 20, 40, 60, 80.
**Legend / Data Series Labels:**
There is no separate legend box. Instead, labels are placed directly within the chart area adjacent to their respective data lines.
* **Big-Bench-Hard:** Brown text, located top-center near Model 4. Corresponds to the brown line with triangle markers.
* **MMLU:** Green text, located top-center near Model 4, just below the Big-Bench-Hard label. Corresponds to the green line with square markers.
* **Global MMLU (Lite):** Gray text, located top-right near Models 9 and 10. Corresponds to the gray line with diamond markers.
* **GPQA Diamond:** Dark blue text, located mid-right near Models 9 and 10. Corresponds to the dark blue line with circular markers.
* **Humanity's Last Exam:** Light blue/cyan text, located bottom-right near Models 8 and 9. Corresponds to the light blue/cyan line with small circular/pentagonal markers.
### Detailed Analysis
*Note: All numerical values extracted below are visual approximations based on the y-axis scale, with an estimated uncertainty of ±2%.*
**1. Big-Bench-Hard (Brown line, Triangle markers)**
* **Visual Trend:** This series only contains data for Models 1 through 4. The line starts high, dips noticeably at Model 2, recovers at Model 3, and reaches its peak at Model 4.
* **Data Points:**
* Model 1: ~83%
* Model 2: ~75%
* Model 3: ~85%
* Model 4: ~89%
**2. MMLU (Green line, Square markers)**
* **Visual Trend:** Similar to Big-Bench-Hard, this series only covers Models 1 through 4. It starts at the highest point on the entire chart for Model 1, drops sharply at Model 2, remains perfectly flat at Model 3, and rises again at Model 4.
* **Data Points:**
* Model 1: ~90%
* Model 2: ~79%
* Model 3: ~79%
* Model 4: ~86%
**3. Global MMLU (Lite) (Gray line, Diamond markers)**
* **Visual Trend:** This series begins at Model 3 and continues to Model 10. It shows a general upward trajectory with minor fluctuations. It rises from Model 3 to 4, dips slightly at 5, climbs steadily to a peak at Model 8, drops at Model 9, and recovers slightly at Model 10.
* **Data Points:**
* Model 3: ~72%
* Model 4: ~81%
* Model 5: ~78%
* Model 6: ~83%
* Model 7: ~88%
* Model 8: ~90% (Peak)
* Model 9: ~81%
* Model 10: ~84%
**4. GPQA Diamond (Dark Blue line, Circle markers)**
* **Visual Trend:** This series spans all 10 models. It exhibits high volatility but a strong overall upward trend from Model 2 to Model 8. It starts relatively low, dips at Model 2, surges at Model 3, climbs to a peak at Model 8 (nearly matching Global MMLU Lite), then suffers a massive drop at Model 9 before stabilizing slightly at Model 10.
* **Data Points:**
* Model 1: ~36%
* Model 2: ~28%
* Model 3: ~50%
* Model 4: ~58%
* Model 5: ~50%
* Model 6: ~65%
* Model 7: ~82%
* Model 8: ~86% (Peak)
* Model 9: ~64%
* Model 10: ~66%
**5. Humanity's Last Exam (Light Blue/Cyan line, Small circle markers)**
* **Visual Trend:** This series begins at Model 4 and continues to Model 10. It represents the lowest scores on the chart by a wide margin. The trend is completely flat and near zero from Models 4 to 6, shows a slight rise at Model 7, peaks at Model 8, drops back to baseline at Model 9, and rises marginally at Model 10.
* **Data Points:**
* Model 4: ~5%
* Model 5: ~5%
* Model 6: ~5%
* Model 7: ~11%
* Model 8: ~21% (Peak)
* Model 9: ~5%
* Model 10: ~7%
### Key Observations
* **Incomplete Data Series:** Not all benchmarks were tested on all models. "Big-Bench-Hard" and "MMLU" stop after Model 4. "Global MMLU (Lite)" starts at Model 3, and "Humanity's Last Exam" starts at Model 4. Only "GPQA Diamond" spans the entire x-axis.
* **The "Model 8" Peak:** For the three benchmarks that span the latter half of the chart (Global MMLU Lite, GPQA Diamond, Humanity's Last Exam), Model 8 represents the absolute peak performance.
* **The "Model 9" Drop:** Conversely, Model 9 shows a significant performance regression across all three active benchmarks compared to Model 8.
* **Benchmark Difficulty Stratification:** The chart clearly shows three tiers of difficulty:
1. *Easier/Saturated:* MMLU, Big-Bench-Hard, and Global MMLU (Lite) generally score between 70% and 90%.
2. *Moderate/High Variance:* GPQA Diamond shows the most growth, moving from ~28% to ~86%.
3. *Extreme Difficulty:* "Humanity's Last Exam" rarely breaks above 10%, maxing out at ~21%.
### Interpretation
This chart likely tracks the historical progression or a specific comparative lineup of Large Language Models (LLMs) against standardized AI benchmarks. The "Model Number" likely represents either chronological releases (e.g., GPT-1 through a modern model) or a specific tier list of competing models from different organizations.
The data demonstrates the concept of "benchmark saturation." Older benchmarks like MMLU and Big-Bench-Hard were likely abandoned after Model 4 because the models were already scoring near 90%, leaving little room to measure meaningful improvement.
To replace them, harder benchmarks were introduced. GPQA Diamond shows a beautiful capability curve, where early models failed, but later models (specifically Model 8) mastered it.
The most striking element is "Humanity's Last Exam." Its placement at the very bottom of the chart, hovering near a 0-5% baseline for most models, indicates it is a next-generation benchmark designed specifically to be resistant to current AI capabilities. Even Model 8, which excels at everything else, barely achieves 20% on this test.
Finally, the sharp drop at Model 9 suggests that the x-axis is not strictly chronological by capability. Model 9 and 10 might represent smaller, more efficient models (like "mini" or "haiku" variants of flagship models), or models from a different, slightly less capable lineage compared to the state-of-the-art Model 8.
</details>
(f) Reasoning with General Knowledge
<details>
<summary>figures/gemini_2_plots/gemini_performance_LLM_Benchmarks_Combined.png Details</summary>

### Visual Description
## Line Chart: Model Performance Scores Across Different Benchmarks
### Overview
This image is a line chart displaying the performance scores (in percentages) of various numbered models across four different evaluation benchmarks. The chart illustrates how performance varies significantly depending on the specific task, with some models showing extreme volatility in certain long-context retrieval tasks, while maintaining high stability in fact-grounding tasks. Note that data is only plotted for models 3 through 10.
### Components/Axes
**Spatial Grounding & Layout:**
* **Main Chart Area:** Occupies the majority of the image, featuring a white background with a light gray grid. Horizontal gridlines are dashed (every 10 units), and vertical gridlines are dotted (every 1 unit).
* **X-Axis (Bottom):** Labeled **"Model Number"**. The scale runs from 1 to 10, with major tick marks and labels at every integer (1, 2, 3, 4, 5, 6, 7, 8, 9, 10).
* **Y-Axis (Left):** Labeled **"Score (%)"**. The scale runs from 10 to 90, with major tick marks and labels at increments of 10 (10, 20, 30, 40, 50, 60, 70, 80, 90).
* **Legend/Labels:** There is no standalone legend box. Instead, labels are placed directly within the chart area, adjacent to or pointing toward their respective data lines.
* *Top-Right (Pink text):* **"LOFT (hard retrieval) <=128K"** (Points via a thin gray line to the pink triangle at x=8).
* *Top-Right (Red text):* **"FACTS Grounding"** (Placed directly over the red line between x=8 and x=10).
* *Middle-Right (Teal text):* **"LOFT (hard retrieval) 1M"** (Points via a thin gray line to the teal diamond at x=8).
* *Bottom-Right (Dark Blue text):* **"SimpleQA"** (Placed just above the dark blue line at x=10).
### Detailed Analysis
*Note: All numerical values are visual approximations extracted from the chart.*
**1. FACTS Grounding (Red line, Square markers)**
* *Trend Verification:* This line represents the highest overall scores on the chart. It slopes slightly downward from model 3 to 4, then exhibits a steady, gradual upward trend through model 8, dips slightly at model 9, and rises again at model 10. It is the most stable metric.
* *Data Points:*
* Model 3: ~83%
* Model 4: ~80%
* Model 5: ~82.5%
* Model 6: ~84.5%
* Model 7: ~85.5%
* Model 8: ~88% (Peak)
* Model 9: ~84%
* Model 10: ~87%
**2. LOFT (hard retrieval) <=128K (Pink line, Upward-pointing Triangle markers)**
* *Trend Verification:* This series shows significant volatility. It slopes upward from model 3 to 4, experiences a sharp decline at model 5, and then climbs steeply and consistently to peak at model 8, where it nearly converges with the FACTS Grounding line. This series terminates at model 8.
* *Data Points:*
* Model 3: ~67%
* Model 4: ~76%
* Model 5: ~51% (Local minimum)
* Model 6: ~58%
* Model 7: ~82%
* Model 8: ~87% (Peak)
* Models 9 & 10: No data plotted.
**3. LOFT (hard retrieval) 1M (Teal line, Diamond markers)**
* *Trend Verification:* This is the most volatile series. It slopes upward from model 3 to 4, then crashes dramatically to near-zero scores for models 5 and 6. It then exhibits a massive recovery, spiking sharply upward through models 7 and 8. This series also terminates at model 8.
* *Data Points:*
* Model 3: ~37%
* Model 4: ~47%
* Model 5: ~7.5% (Trough)
* Model 6: ~7.5% (Trough)
* Model 7: ~59%
* Model 8: ~70% (Peak)
* Models 9 & 10: No data plotted.
**4. SimpleQA (Dark Blue line, Circle markers)**
* *Trend Verification:* This series generally occupies the lowest scoring tier. It is highly erratic, zigzagging up and down. It peaks significantly at model 8, but then crashes to its lowest points for models 9 and 10.
* *Data Points:*
* Model 3: ~9%
* Model 4: ~25%
* Model 5: ~16.5%
* Model 6: ~30%
* Model 7: ~27%
* Model 8: ~54% (Peak)
* Model 9: ~10.5%
* Model 10: ~13%
### Key Observations
* **Missing Data:** Models 1 and 2 have no data points for any metric. Furthermore, the two "LOFT" metrics abruptly end at Model 8, while FACTS Grounding and SimpleQA continue to Model 10.
* **The "Model 8" Anomaly:** Model 8 represents a universal peak. Every single metric evaluated reaches its highest score at Model 8.
* **The "Model 5/6" Collapse:** Models 5 and 6 struggle significantly with the "LOFT" retrieval tasks, particularly the 1M context window, which drops to single digits.
* **Context Window Difficulty:** Comparing the two LOFT metrics, the 1M (1 million) context window is consistently and significantly harder for the models to process than the <=128K context window, evidenced by the teal line always remaining below the pink line.
* **Task Disparity:** "FACTS Grounding" is clearly the easiest task for these models (or the models are specifically optimized for it), consistently scoring in the 80%+ range. Conversely, "SimpleQA" yields very poor results, mostly staying below 30%, except for the spike at Model 8.
### Interpretation
Reading between the lines, this chart likely represents an evaluation of a specific family of Large Language Models (LLMs), where the "Model Number" corresponds to different iterations, sizes, or training checkpoints (e.g., Model 3 might be a 7B parameter model, while Model 8 might be a 70B parameter model).
The data suggests that **Model 8 is the most capable model** in this lineup, showing strong competence across all tasks, including a massive improvement in the difficult SimpleQA benchmark.
The severe crash of the LOFT 1M metric at Models 5 and 6 indicates a critical failure in long-context retrieval for those specific architectures or checkpoints. They completely lose the ability to find information in a 1-million-token haystack, whereas Model 4 and Model 7 handle it much better.
The absence of LOFT data for Models 9 and 10 suggests one of two things: either those models do not support context windows large enough to run the LOFT benchmarks (meaning they are limited to less than 128k tokens), or the researchers simply did not run/finish those specific evaluations before generating the chart.
Finally, the vast gap between "FACTS Grounding" (high) and "SimpleQA" (low) implies a definitional difference in the benchmarks. "FACTS Grounding" might provide the model with the text to ground its answers in (making it an easier reading comprehension task), whereas "SimpleQA" might require the model to rely solely on its internal parametric memory, which these models clearly struggle with.
</details>
(g) LLM Benchmarks
Figure 4: Performance of the Gemini family on reasoning benchmarks by category. Model numbers and corresponding names are as follows: 1 – Gemini Ultra; 2 – Gemini Pro; 3 – Gemini 1.5 Flash; 4 – Gemini 1.5 Pro; 5 – Gemini 2.0 Flash-Lite; 6 – Gemini 2.0 Flash; 7 – Gemini 2.5 Flash; 8 – Gemini 2.5 Pro; 9 – Gemini 2.5 Flash Lite (no thinking); 10 – Gemini 2.5 Flash Lite (thinking).
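For readers re-plotting these panels, the caption's mapping from model numbers to Gemini model names can be kept as a small lookup table; the snippet below is a convenience sketch, not the authors' plotting code.

```python
# Model-number-to-name mapping taken from the Figure 4 caption.
GEMINI_MODELS = {
    1: "Gemini Ultra",
    2: "Gemini Pro",
    3: "Gemini 1.5 Flash",
    4: "Gemini 1.5 Pro",
    5: "Gemini 2.0 Flash-Lite",
    6: "Gemini 2.0 Flash",
    7: "Gemini 2.5 Flash",
    8: "Gemini 2.5 Pro",
    9: "Gemini 2.5 Flash Lite (no thinking)",
    10: "Gemini 2.5 Flash Lite (thinking)",
}

# Example use with a matplotlib Axes `ax` whose x-axis is the model number:
#   ax.set_xticks(list(GEMINI_MODELS))
#   ax.set_xticklabels(GEMINI_MODELS.values(), rotation=45, ha="right")
```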
<details>
<summary>figures/gpt_2_plots/gpt_performance_Mathematical_Reasoning.png Details</summary>

### Visual Description
## Line Chart: AI Model Performance on Mathematical Benchmarks
### Overview
This image is a line chart illustrating the performance scores (in percentages) of various sequentially numbered models across seven different mathematical evaluation benchmarks. The chart demonstrates a general trend of increasing performance as the model number increases, with older benchmarks reaching near-perfect scores while newer benchmarks remain highly challenging.
### Components/Axes
* **Y-Axis (Left):** Labeled "Score (%)". The scale ranges from 20 to 100, with major tick marks and horizontal dashed grid lines at 20, 40, 60, 80, and 100.
* **X-Axis (Bottom):** Labeled "Model Number". The scale ranges from 1 to 22, with major tick marks and vertical dashed grid lines at every integer from 1 to 22.
* **Legend:** There is no standalone legend box. Instead, data series are identified by text labels placed directly adjacent to their respective lines, matching the color of the line.
### Detailed Analysis
The chart contains seven distinct data series. Below is the trend verification and extracted approximate data points for each, grounded by their visual characteristics.
**1. MGSM**
* **Visuals:** Orange line, square markers. Label "MGSM" is located near the top-left, above the line at Model 5.
* **Trend:** The line starts at a mid-range score, rises sharply by Model 3, dips slightly at Model 4, and peaks at Model 5, showing early mastery by lower-numbered models.
* **Data Points (Approximate):**
* Model 1: 56%
* Model 2: 75%
* Model 3: 88%
* Model 4: 87%
* Model 5: 91%
**2. MATH**
* **Visuals:** Blue line, circular markers. Label "MATH" is located near the top-left, to the right of the final data point at Model 5.
* **Trend:** Starts lower than MGSM, remains flat between Models 1 and 2, jumps significantly at Model 3, dips slightly, and rises again at Model 5.
* **Data Points (Approximate):**
* Model 1: 43%
* Model 2: 42%
* Model 3: 73%
* Model 4: 70%
* Model 5: 77%
**3. MATH-500**
* **Visuals:** Pink/Light Purple line, small circular/dot markers. Label "MATH-500" is located near the top-center, to the right of the final data point at Model 8.
* **Trend:** This series begins where the "MATH" series ends (Model 5) but at a lower score. It shows a steep, uninterrupted linear progression upward, terminating near 100%.
* **Data Points (Approximate):**
* Model 5: 60%
* Model 6: 85%
* Model 7: 90%
* Model 8: 95%
**4. MathVista**
* **Visuals:** Red line, upward-pointing triangle markers. Label "MathVista" is located in the upper-right quadrant, below the line at Model 16.
* **Trend:** Starts at Model 3, rises gradually, experiences a sharp drop at Model 10, recovers immediately by Model 11, plateaus briefly, and then continues to rise toward 90%.
* **Data Points (Approximate):**
* Model 3: 58%
* Model 4: 56%
* Model 5: 64%
* Model 8: 74%
* Model 10: 56%
* Model 11: 73%
* Model 12: 72%
* Model 13: 72%
* Model 15: 84%
* Model 16: 87%
**5. AIME 2024**
* **Visuals:** Brown line, diamond markers. Label "AIME 2024" is located in the top-right, adjacent to the data point at Model 18.
* **Trend:** This is the most volatile series. It starts extremely low at Model 4, rises rapidly to Model 9, suffers a massive drop at Model 10, recovers partially, drops again at Model 13, and then skyrockets to near 100% by Model 15, where it stabilizes.
* **Data Points (Approximate):**
* Model 4: 8%
* Model 5: 13%
* Model 6: 56%
* Model 7: 70%
* Model 8: 83%
* Model 9: 86%
* Model 10: 29%
* Model 11: 49%
* Model 12: 48%
* Model 13: 37%
* Model 14: 87%
* Model 15: 93%
* Model 16: 91%
* Model 17: 93%
* Model 18: 97%
**6. AIME 2025**
* **Visuals:** Olive Green/Yellow-Green line, circular markers. Label "AIME 2025" is located at the extreme top-right, above the final data points.
* **Trend:** Starts relatively high at Model 8, rises steadily with a slight plateau between Models 16 and 18, and finishes at a perfect or near-perfect score by Model 22.
* **Data Points (Approximate):**
* Model 8: 79%
* Model 14: 86%
* Model 15: 92%
* Model 16: 98%
* Model 18: 98%
* Model 21: 99%
* Model 22: 100%
**7. FrontierMath, Tier 1-3**
* **Visuals:** Cyan/Light Blue line, star markers. Label "FrontierMath, Tier 1-3" is located in the bottom-right quadrant, above the line at Model 18.
* **Trend:** This series appears only for the latest models (15-22). It starts very low, dips slightly, and then exhibits a slow, steady climb, but remains significantly lower than all other benchmarks.
* **Data Points (Approximate):**
* Model 15: 19%
* Model 16: 16%
* Model 20: 27%
* Model 21: 26%
* Model 22: 32%
### Key Observations
* **Benchmark Saturation:** Older or easier benchmarks (MGSM, MATH, MATH-500) are effectively "solved" (reaching 90%+) by earlier models (Models 5-8).
* **Volatility in Mid-Models:** Models 10 and 13 show significant performance regressions specifically on the AIME 2024 and MathVista benchmarks, breaking the otherwise upward trend.
* **The Frontier:** The "FrontierMath, Tier 1-3" benchmark is the only evaluation where the most advanced models (Models 20-22) fail to achieve high scores, maxing out at approximately 32%.
### Interpretation
This chart visually narrates the rapid progression of AI capabilities in mathematics and the corresponding need for increasingly difficult evaluation metrics. The X-axis ("Model Number") acts as a proxy for chronological advancement or increasing model scale/capability.
As models progress from 1 to 22, they systematically conquer benchmarks. Once a benchmark like MGSM or MATH approaches the 100% ceiling, it loses its utility for differentiating advanced models, necessitating the introduction of harder tests like AIME 2024/2025.
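This retire-when-saturated dynamic can be stated as a simple heuristic: flag a benchmark once its most recent scores all sit above a high threshold. The sketch below is only an illustrative rule of thumb applied to approximate values read off this chart; the threshold and window are arbitrary choices, not parameters the figure specifies.

```python
from typing import Dict, List

def is_saturated(scores: List[float], threshold: float = 85.0, window: int = 2) -> bool:
    """Flag a benchmark as saturated if its last `window` scores all reach `threshold`."""
    return len(scores) >= window and all(s >= threshold for s in scores[-window:])

# Approximate scores read off the chart (visual estimates, in %), oldest model first.
benchmarks: Dict[str, List[float]] = {
    "MGSM": [56, 75, 88, 87, 91],
    "MATH-500": [60, 85, 90, 95],
    "FrontierMath, Tier 1-3": [19, 16, 27, 26, 32],
}

for name, scores in benchmarks.items():
    print(f"{name}: saturated={is_saturated(scores)}")
```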
The severe dips at Models 10 and 13 for AIME 2024 and MathVista are notable anomalies. Reading between the lines, these "Model Numbers" likely do not represent a strict, single-lineage chronological release (e.g., GPT-1 to GPT-4). Instead, they likely represent a mix of different model families, sizes, or architectures plotted on a general timeline of release. Models 10 and 13 might be smaller, more efficient models, or models not specifically trained on the reasoning required for AIME or the visual components required for MathVista, resulting in lower scores compared to their immediate predecessors.
Finally, the introduction of "FrontierMath, Tier 1-3" highlights the current edge of AI research. While Models 20-22 can perfectly solve AIME 2025, they struggle immensely with FrontierMath, indicating that this specific benchmark is currently the primary metric for measuring future mathematical reasoning advancements in AI.
</details>
(a) Mathematical Reasoning
<details>
<summary>figures/gpt_2_plots/gpt_performance_Multimodal_Reasoning.png Details</summary>

### Visual Description
## Line Chart: Model Performance Scores across Model Numbers
### Overview
This image is a multi-series line chart displaying the performance scores (in percentages) of various models across different benchmarks or datasets. The x-axis represents a sequential "Model Number," while the y-axis represents the "Score (%)." Instead of a traditional legend box, the data series are labeled directly on the chart area, typically near the beginning or end of their respective lines.
### Components/Axes
* **Y-Axis (Vertical):**
* **Label:** "Score (%)"
* **Scale:** Ranges from 30 to roughly 95.
* **Markers:** Major tick marks and corresponding horizontal dashed light-gray gridlines are placed at 40, 50, 60, 70, 80, and 90.
* **X-Axis (Horizontal):**
* **Label:** "Model Number"
* **Scale:** Ranges from 1 to 22.
* **Markers:** Major tick marks and corresponding vertical dashed light-gray gridlines are placed at every integer from 1 to 22.
* **Legend/Labels:** There is no separate legend. Labels are color-coded to match their respective lines and are placed adjacent to the data points.
### Detailed Analysis
The data series can be categorized into three distinct visual patterns: short-span early models, highly volatile mid-span models, and long-span steady scaling models.
#### Group 1: Short-Span Early Models (x=3 to x=5)
These lines only contain two data points and represent benchmarks evaluated only on early model numbers. All show an upward trend.
* **AI2D (Purple line, square markers):**
* *Position:* Top-left.
* *Trend:* Slopes upward.
* *Data Points:* (x=3, y~89.5), (x=5, y~94.5)
* **DocVQA (Green line, upward triangle markers):**
* *Position:* Top-left, just below AI2D.
* *Trend:* Slopes upward.
* *Data Points:* (x=3, y~87), (x=5, y~92.5)
* **ChartQA (Red line, diamond markers):**
* *Position:* Top-left, below DocVQA.
* *Trend:* Slopes upward steeply.
* *Data Points:* (x=3, y~78), (x=5, y~86)
* **EgoSchema (Dark Blue line, circle markers):**
* *Position:* Mid-left.
* *Trend:* Slopes upward.
* *Data Points:* (x=3, y~64), (x=5, y~72)
* **ActivityNet (Orange line, square markers with cross inside):**
* *Position:* Mid-left.
* *Trend:* Slopes upward slightly.
* *Data Points:* (x=3, y~59.5), (x=5, y~62)
#### Group 2: Volatile Mid-Span Models (x=3/4 to x=13/21)
These lines exhibit significant fluctuations, notably sharing a sharp, distinct drop at Model Number 10.
* **CharXiv-D (Pink line, asterisk/star markers):**
* *Position:* Upper-middle. Label is at x~13.
* *Trend:* Starts high, dips slightly, rises, experiences a sharp drop at x=10, recovers immediately, and plateaus high.
* *Data Points:* (x=4, y~76.5), (x=5, y~85.5), (x=8, y~89), (x=10, y~74), (x=11, y~88.5), (x=12, y~88), (x=13, y~90)
* **MMMU (Brown line, pentagon markers):**
* *Position:* Spans from mid-left to top-right. Label is at x~21.
* *Trend:* Fluctuates early, drops sharply at x=10, then climbs steadily to merge with VideoMMMU at the end.
* *Data Points:* (x=3, y~63), (x=4, y~59.5), (x=8, y~78), (x=10, y~55.5), (x=11, y~72.5), (x=12, y~74.5), (x=13, y~75), (x=15, y~81.5), (x=16, y~83), (x=21, y~84.5)
* **CharXiv-R (Gray line, 'x' markers):**
* *Position:* Spans from bottom-left to upper-right. Label is at x~21.
* *Trend:* Starts very low, spikes up, declines steadily, drops sharply at x=10, recovers to a plateau, then climbs steeply.
* *Data Points:* (x=4, y~37), (x=5, y~59), (x=8, y~55), (x=10, y~40.5), (x=11, y~56.5), (x=12, y~56.5), (x=13, y~55.5), (x=15, y~72), (x=16, y~78.5), (x=21, y~81)
#### Group 3: Long-Span Steady Models (x=5 to x=21)
These lines show a consistent, near-linear upward trajectory with very few data points spread across a wide range.
* **VideoMMMU (Olive Green line, plus '+' markers):**
* *Position:* Spans mid-left to top-right. Label is at x~21.
* *Trend:* Steady, smooth upward slope.
* *Data Points:* (x=5, y~61.5), (x=11, y~73), (x=15, y~81.5), (x=16, y~83.5), (x=21, y~84.5)
* **MMMU Pro (Teal line, circle markers):**
* *Position:* Spans mid-left to mid-right. Label is at x~21.
* *Trend:* Steady, smooth upward slope.
* *Data Points:* (x=5, y~60), (x=16, y~76.5), (x=21, y~78.5)
* **ERQA (Light Blue line, downward triangle markers):**
* *Position:* Spans bottom-left to mid-right. Label is at x~21.
* *Trend:* Steady, smooth upward slope.
* *Data Points:* (x=5, y~35.5), (x=16, y~64), (x=21, y~65.5)
### Key Observations
1. **The "Model 10" Anomaly:** There is a severe, synchronized drop in performance at Model Number 10 for three specific benchmarks: CharXiv-D, MMMU, and CharXiv-R.
2. **General Upward Trend:** Despite the volatility in the middle section, the overarching trend for every single benchmark is positive; performance at the highest recorded model number is always greater than at the lowest recorded model number.
3. **Data Sparsity:** The chart mixes high-frequency testing (e.g., models 10, 11, 12, 13 for CharXiv and MMMU) with very low-frequency testing (e.g., ERQA and MMMU Pro only have data points at 5, 16, and 21).
4. **Convergence:** By Model 21, VideoMMMU and MMMU converge at nearly the exact same score (~84.5%).
### Interpretation
This chart likely illustrates the scaling laws or iterative improvements of a specific family of AI models (e.g., a series of Large Language Models or Multimodal Models released sequentially or scaled by parameter count, represented by "Model Number").
The general upward trend demonstrates that as the "Model Number" increases, the model becomes more capable across a wide variety of tasks (document reading, chart analysis, video understanding, etc.).
The most critical investigative takeaway is the anomaly at **Model 10**. Because CharXiv-D, MMMU, and CharXiv-R all crash simultaneously at this exact point, it strongly implies that Model 10 suffered from a specific architectural flaw, a bug during training, or was a specialized checkpoint that catastrophically forgot certain reasoning capabilities while perhaps optimizing for something else. The immediate recovery at Model 11 suggests the developers identified and fixed this issue.
Furthermore, the grouping of the data suggests different testing regimens. The short lines on the left (AI2D, DocVQA) might represent older benchmarks that were "solved" (reaching 90%+) early on and thus abandoned for later models. Conversely, the long, straight lines (VideoMMMU, ERQA) suggest benchmarks that are computationally expensive to run, resulting in them only being tested on major milestone models (e.g., 5, 16, 21) rather than every incremental iteration.
</details>
(b) Multimodal Reasoning
<details>
<summary>figures/gpt_2_plots/gpt_performance_Programming_and_Coding.png Details</summary>

### Visual Description
## Line Chart: Model Performance Scores Across Different Benchmarks
### Overview
This image is a line chart displaying the performance scores (in percentages) of various numbered models across four different evaluation benchmarks. The chart illustrates how performance varies significantly depending on the specific benchmark being tested, with one benchmark showing consistently high scores early on, while the other three exhibit high volatility across a wider range of models.
### Components/Axes
**1. Y-Axis (Vertical):**
* **Label:** "Score (%)"
* **Scale:** Linear, ranging from 0 to roughly 100.
* **Markers:** Major tick marks and faint horizontal dashed gridlines are present at 0, 20, 40, 60, and 80.
**2. X-Axis (Horizontal):**
* **Label:** "Model Number"
* **Scale:** Discrete integer values.
* **Markers:** Tick marks and faint vertical dashed gridlines are present for every integer from 1 to 22.
**3. Legend / Data Series Labels (Inline Spatial Grounding):**
Instead of a traditional separate legend box, the labels are placed directly on the chart area, adjacent to their respective data lines.
* **Top-Center:** "HumanEval" (Dark Blue text, corresponds to the dark blue line with circle markers).
* **Top-Right:** "Aider's Polygot Whole" (Pink text, corresponds to the pink line with triangle markers).
* **Upper-Right:** "Aider's Polygot Diff" (Red text, corresponds to the red line with square markers).
* **Upper-Right (below Red):** "SWE-Bench Verified" (Cyan text, corresponds to the cyan line with diamond markers).
---
### Detailed Analysis
*Note: All numerical values extracted from the chart are approximate (denoted by ~) based on visual interpolation between the gridlines.*
**Series 1: HumanEval**
* **Visual Attributes:** Dark blue line, solid circle markers.
* **Trend Verification:** This line appears only on the left side of the chart. It starts relatively high, dips slightly, experiences a sharp upward step, and then plateaus at a very high score.
* **Data Points:**
* Model 1: ~68%
* Model 2: ~67%
* Model 3: ~87%
* Model 4: ~87%
* Model 5: ~90%
* Model 6: ~92%
* Model 7: ~92% (Line terminates here)
**Series 2: Aider's Polyglot Whole**
* **Visual Attributes:** Pink line, solid upward-pointing triangle markers.
* **Trend Verification:** This line exhibits extreme volatility. It starts near zero, spikes up, crashes back down near zero, climbs steadily with a slight dip, crashes again, and finally spikes to its highest point.
* **Data Points:**
* Model 4: ~3%
* Model 5: ~31%
* Model 8: ~64%
* Model 10: ~9%
* Model 11: ~34%
* Model 12: ~52%
* Model 14: ~66%
* Model 15: ~69%
* Model 16: ~81%
* Model 18: ~44%
* Model 21: ~88%
**Series 3: Aider's Polyglot Diff**
* **Visual Attributes:** Red line, solid square markers.
* **Trend Verification:** This line closely tracks the shape and trajectory of the "Aider's Polyglot Whole" (pink) line, though it generally scores slightly lower and includes a data point at Model 13 that the pink line lacks. It terminates earlier than the pink line.
* **Data Points:**
* Model 4: ~3%
* Model 5: ~18%
* Model 8: ~62%
* Model 10: ~6%
* Model 11: ~32%
* Model 12: ~53%
* Model 13: ~45%
* Model 14: ~61%
* Model 15: ~58%
* Model 16: ~80% (Line terminates here)
**Series 4: SWE-Bench Verified**
* **Visual Attributes:** Cyan line, solid diamond markers.
* **Trend Verification:** This line follows a generally upward but highly erratic trajectory. It shares some directional movements with the Aider lines (e.g., the drop at Model 11 and the local peak at Model 16) but diverges at other points (e.g., at Model 13 it falls more sharply than the red line, and at Model 18 it declines only slightly while the pink line crashes).
* **Data Points:**
* Model 4: ~9%
* Model 5: ~33%
* Model 8: ~49%
* Model 11: ~24%
* Model 12: ~55%
* Model 13: ~38%
* Model 14: ~61%
* Model 15: ~68%
* Model 16: ~69%
* Model 18: ~62%
* Model 21: ~75%
---
### Key Observations
1. **Benchmark Difficulty Disparity:** The "HumanEval" benchmark yields vastly higher scores for early models (Models 1-7) compared to the other three benchmarks, which start near zero for Model 4.
2. **High Correlation:** The "Aider's Polyglot Whole" (pink) and "Aider's Polyglot Diff" (red) benchmarks are highly correlated in their trends, moving up and down in tandem, with the "Whole" metric generally scoring slightly higher.
3. **Missing Data / Sparse Testing:** The x-axis is continuous (1-22), but the data points are sparse. For example, no models were tested on the bottom three benchmarks between Models 5 and 8, or Models 8 and 10. Furthermore, not all models were tested on all benchmarks (e.g., Model 13 has data for Red and Cyan, but not Pink).
4. **Convergence Point:** At Model 12, the three lower benchmarks (Pink, Red, Cyan) converge tightly, all scoring between ~52% and ~55%.
5. **Extreme Volatility:** Models 8, 10, and 11 show massive swings in capability. Model 8 performs relatively well (~50-64%), Model 10 fails drastically (~6-9%), and Model 11 recovers partially (~24-34%).
---
### Interpretation
* **Evolution of Benchmarks:** The data strongly suggests a chronological or capability-based progression of Large Language Models (LLMs) or coding assistants. "HumanEval" is a well-known, older, and relatively simple coding benchmark. The fact that the early models already climb to roughly 90% on it by Models 5-7, and that it is not tracked for later models, implies it became "saturated" or too easy to be a useful metric for advanced models.
* **Introduction of Harder Tasks:** "SWE-Bench Verified" and the "Aider" benchmarks represent much more complex, real-world software engineering tasks. The low initial scores (Models 4-5) reflect this difficulty.
* **Non-Linear Model Progression:** The extreme volatility (especially the crash at Model 10 and the dip at Model 18 for the pink line) indicates that "Model Number" does not represent a strictly linear progression of capability. These numbers likely represent different model families, different sizes (e.g., 7B vs 70B parameters), or models trained with different methodologies. Model 10, for instance, might be a very small or specialized model that lacks general coding reasoning, whereas Models 16 and 21 are likely state-of-the-art, large-scale models.
* **Metric Relationships:** The tight tracking of "Aider's Polyglot Whole" and "Aider's Polyglot Diff" suggests they measure fundamentally similar underlying capabilities, likely the ability to generate entire files versus generating diffs/edits, with diff generation (red) appearing slightly more difficult or prone to formatting errors for the models tested.
</details>
(c) Programming and Coding
<details>
<summary>figures/gpt_2_plots/gpt_performance_Reading_Comprehension_and_Question_Answering.png Details</summary>

### Visual Description
## Line Chart: Model Score Progression
### Overview
The image is a 2D line chart displaying the performance scores of a sequence of models. The chart plots a single data series consisting of five connected data points against a grid. Notably, while the x-axis accommodates 22 distinct models, data is only provided for the first five, and the final data point carries the label "DROP". The language used in the chart is entirely English.
### Components/Axes
**Component Isolation & Spatial Grounding:**
* **Y-Axis (Left):**
* **Label:** "Score (%)" (Rotated 90 degrees counter-clockwise, centered vertically).
* **Scale:** Continuous numerical scale starting at 70 at the bottom and ending at 86 at the top.
* **Markers:** Tick marks and corresponding labels are placed at intervals of 2 (70, 72, 74, 76, 78, 80, 82, 84, 86).
* **X-Axis (Bottom):**
* **Label:** "Model Number" (Centered horizontally below the axis numbers).
* **Scale:** Discrete integer scale.
* **Markers:** Numbered sequentially from 1 to 22, with a tick mark for every integer.
* **Grid Area (Center/Main):** Light gray, dashed grid lines extend from every tick mark on both the x and y axes, creating a coordinate matrix.
* **Data Series:** A single solid blue line connecting solid blue circular markers.
* **Annotation (Top-Left Quadrant):** The text "DROP" appears in blue, positioned directly above the data point at x=5.
### Detailed Analysis
**Trend Verification and Data Extraction:**
The data series consists of a single blue line tracking five points.
1. **Model 1:** The line begins at the bottom left.
* *Value:* Located just above the 70 line. Approximately **~70.2%**.
2. **Trend 1 to 2:** A steep upward slope.
* **Model 2:** The point is located roughly halfway between the 80 and 82 grid lines. Approximately **~80.9%**.
3. **Trend 2 to 3:** A continued upward slope, though slightly less steep than the previous segment.
* **Model 3:** The point rests exactly on the top grid line. Exactly **86.0%**.
4. **Trend 3 to 4:** A sharp, steep downward slope.
* **Model 4:** The point falls just below the 80 grid line. Approximately **~79.7%**.
5. **Trend 4 to 5:** A moderate upward slope.
* **Model 5:** The point is located between 82 and 84, closer to the 84 line. Approximately **~83.4%**.
* *Embedded Text:* This specific point is annotated with the word "**DROP**" in matching blue text.
6. **Models 6 through 22:** The chart area is completely empty. No data points or lines are drawn in this region.
### Key Observations
* **High Volatility:** The scores fluctuate significantly between models, with a massive initial jump (+~10.7%), a peak, a severe regression (-~6.3%), and a partial recovery.
* **Peak Performance:** Model 3 achieved the highest score (86.0%) of the recorded set.
* **Incomplete Data Set:** The x-axis is explicitly scaled to accommodate 22 models, but the data abruptly terminates at Model 5.
* **Series Label, Not a Trend Note:** The annotation "DROP" is attached to the final data point even though the visual trend there is moving *upward* (from ~79.7% to ~83.4%); it follows the inline-legend convention of the other panels and names the series rather than describing the curve.
### Interpretation
This panel plots the GPT family's scores on DROP (Discrete Reasoning Over Paragraphs), a reading-comprehension benchmark; as in the other panels, the blue text is an inline legend placed at the last reported data point, not a comment on the curve's direction.
Read against the figure caption, the five points correspond to GPT-3.5 through GPT-4o. Model 1 (GPT-3.5) serves as a low-performing baseline, GPT-4 and GPT-4 Turbo improve rapidly to a peak of 86.0%, the dip at Model 4 coincides with GPT-4o mini (a smaller model), and GPT-4o partially recovers to ~83.4%.
The empty region from Models 6 to 22 does not indicate an abandoned experiment: the x-axis is shared across all panels to accommodate the full 22-model lineup, while DROP scores simply stop appearing in official reports after GPT-4o. The quiet disappearance of a largely saturated benchmark from reporting is itself an instance of the benchmark life cycle this paper examines.
</details>
(d) Reading Comprehension and QA
<details>
<summary>figures/gpt_2_plots/gpt_performance_Reasoning_with_General_Knowledge.png Details</summary>

### Visual Description
## Line Chart: AI Model Performance Across Various Benchmarks
### Overview
This image is a line chart displaying the performance scores of various numbered models across four different evaluation benchmarks. The chart illustrates a general upward trend in model capabilities, while highlighting the stark difficulty differences between established benchmarks and a newer, significantly harder benchmark.
### Components/Axes
**1. Y-Axis (Left):**
* **Label:** "Score (%)"
* **Scale:** Ranges from 0 to 100 (implied), with visible major tick marks and labels at 20, 40, 60, and 80.
* **Gridlines:** Faint horizontal dashed lines appear at intervals of 10 (e.g., 10, 20, 30, 40...).
**2. X-Axis (Bottom):**
* **Label:** "Model Number"
* **Scale:** Discrete integer values from 1 to 22.
* **Gridlines:** Faint vertical dashed lines align with each integer.
**3. Legend / Data Series Labels (Inline):**
Instead of a traditional legend box, labels are placed directly on the chart area, pointing to the final or near-final data point of their respective series.
* **Top-Right (Blue text):** "MMLU" points to the blue line with circle markers.
* **Top-Right (Red text):** "GPQA Diamond" points to the red line with square markers.
* **Center-Right (Pink text):** "MMMLU" points to the pink line with triangle markers.
* **Middle-Right (Cyan text):** "Humanity's Last Exam" points to the cyan line with diamond markers.
---
### Detailed Analysis
*Note: All numerical values extracted below are approximate based on visual alignment with the y-axis gridlines.*
#### Series 1: MMLU (Blue Line, Circle Markers)
* **Visual Trend:** This series starts at a high baseline, experiences a quick initial jump, and remains relatively stable at the top of the chart with minor fluctuations. It represents the highest overall scores on the chart.
* **Data Points:**
* Model 1: ~70%
* Model 2: ~86%
* Model 3: ~86%
* Model 4: ~82%
* Model 5: ~89%
* Model 6: ~92%
* Model 7: ~85%
* Model 8: ~92%
* Model 9: *No data point*
* Model 10: ~80%
* Model 11: ~87%
* Model 12: ~90%
* Model 13: ~91%
* Model 14: ~87%
* Models 15-17: *No data points*
* Model 18: ~90% (Label "MMLU" is attached here)
#### Series 2: GPQA Diamond (Red Line, Square Markers)
* **Visual Trend:** This series exhibits high volatility in the earlier models (1-10), featuring sharp peaks and deep valleys. From Model 10 onward, it shows a strong, consistent upward slope, eventually converging with the MMLU scores by Model 22.
* **Data Points:**
* Model 1: ~31%
* Model 2: ~36%
* Model 3: ~48%
* Model 4: ~40%
* Model 5: ~70%
* Model 6: ~78%
* Model 7: ~60%
* Model 8: ~78%
* Model 9: ~79%
* Model 10: ~50%
* Model 11: ~65%
* Model 12: ~66%
* Model 13: ~71%
* Model 14: ~80%
* Model 15: ~81%
* Model 16: ~83%
* Model 17: ~84%
* Models 18-20: *No data points*
* Model 21: ~87%
* Model 22: ~89% (Label "GPQA Diamond" is attached here)
#### Series 3: MMMLU (Pink Line, Triangle Markers)
* **Visual Trend:** This series begins later (at Model 4). It generally tracks slightly below the MMLU line, sharing a similar sharp dip at Model 10, before recovering and flattening out in the low 80s.
* **Data Points:**
* Model 4: ~70%
* Model 5: ~81%
* Model 6: ~84%
* Model 7: *No data point*
* Model 8: ~87%
* Model 9: *No data point*
* Model 10: ~67%
* Model 11: ~78%
* Model 12: ~87%
* Model 13: ~85%
* Model 14: ~81%
* Model 15: ~81%
* Models 16-17: *No data points*
* Model 18: ~81% (Label "MMMLU" is attached here)
#### Series 4: Humanity's Last Exam (Cyan Line, Diamond Markers)
* **Visual Trend:** This series starts much later (Model 9) and significantly lower than all other benchmarks. It slopes upward gradually, experiences a sharp spike at Model 20, dips, and recovers. It remains the lowest scoring benchmark by a wide margin.
* **Data Points:**
* Model 9: ~8%
* Model 14: ~13%
* Model 15: ~18%
* Model 16: ~25%
* Model 17: *No data point*
* Model 18: ~19%
* Model 19: ~27%
* Model 20: ~41%
* Model 21: ~35%
* Model 22: ~42% (Label "Humanity's Last Exam" is attached here)
---
### Key Observations
1. **Missing Data:** Not all models were tested on all benchmarks. For example, Models 1-3 lack MMMLU data, Models 1-8 lack "Humanity's Last Exam" data, and several models in the late teens are missing data across multiple series.
2. **Model 10 Anomaly:** There is a distinct, sharp drop in performance across all three active benchmarks (MMLU, GPQA, MMMLU) specifically at Model 10.
3. **Convergence:** By the later models (18-22), MMLU, GPQA Diamond, and MMMLU are all converging in the 80%-90% range.
4. **Difficulty Gap:** "Humanity's Last Exam" is vastly more difficult for these models than the other three benchmarks, never exceeding ~42%, while the others routinely score above 80%.
### Interpretation
This chart visually demonstrates the progression of the GPT family against standardized knowledge benchmarks; per the figure caption, the x-axis is release order and mixes full-size frontier models with deliberately small variants (the synchronized dip at Model 10 corresponds to GPT-4.1 nano, not a regression in frontier capability).
The data suggests that older, more established benchmarks like MMLU and its multilingual counterpart MMMLU are reaching saturation; the models score so high (near 90%) that these tests no longer effectively differentiate the newest releases. GPQA Diamond, initially far harder, has been rapidly closed as well, with the latest models approaching 90%.
The introduction of "Humanity's Last Exam" represents a paradigm shift in evaluation. Because the models are scoring exceptionally low on it (starting below 10% and struggling to break 40%), it serves as a new "frontier" benchmark designed to test advanced reasoning or knowledge that current models have not yet mastered. The chart effectively communicates the necessity of creating harder tests as AI models rapidly conquer existing ones.
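To make the saturation point concrete, the minimal sketch below (assuming only the approximate read-offs listed above, not officially reported numbers) turns each benchmark's latest plotted score into remaining headroom:

```python
# A rough headroom calculation using the approximate latest scores read off
# panel (e); these are visual estimates, not officially reported numbers.
latest_scores = {
    "MMLU": 90.0,                  # Model 18, last plotted point
    "GPQA Diamond": 89.0,          # Model 22
    "MMMLU": 81.0,                 # Model 18
    "Humanity's Last Exam": 42.0,  # Model 22
}

for name, score in sorted(latest_scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:>21}: {score:5.1f}%  headroom: {100 - score:4.1f} pts")
```

On these estimates, MMLU and GPQA Diamond leave only about ten points of headroom, while Humanity's Last Exam leaves well over half the scale, which is exactly the margin a frontier benchmark is meant to provide.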
</details>
(e) Reasoning with General Knowledge
Figure 5: Performance of the GPT family on general reasoning benchmarks. Model numbers and corresponding names are as follows: 1 – GPT-3.5; 2 – GPT-4; 3 – GPT-4 Turbo; 4 – GPT-4o mini; 5 – GPT-4o; 6 – o1-preview; 7 – o1-mini; 8 – o1; 9 – o1-pro; 10 – GPT-4.1 nano; 11 – GPT-4.1 mini; 12 – GPT-4.1; 13 – GPT-4.5; 14 – o3-mini; 15 – o4-mini; 16 – o3; 17 – o3-pro; 18 – gpt-oss-120b; 19 – GPT-5 with Deep Research; 20 – ChatGPT Agent; 21 – GPT-5; 22 – GPT-5 Pro.
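Because every panel indexes models by number only, it is convenient to keep the caption's mapping at hand; the sketch below simply transcribes the list above (it applies equally to Figure 6, which uses the same numbering):

```python
# Model-number-to-name mapping used on the x-axes of Figures 5 and 6,
# transcribed directly from the captions.
GPT_MODELS = {
    1: "GPT-3.5", 2: "GPT-4", 3: "GPT-4 Turbo", 4: "GPT-4o mini", 5: "GPT-4o",
    6: "o1-preview", 7: "o1-mini", 8: "o1", 9: "o1-pro", 10: "GPT-4.1 nano",
    11: "GPT-4.1 mini", 12: "GPT-4.1", 13: "GPT-4.5", 14: "o3-mini",
    15: "o4-mini", 16: "o3", 17: "o3-pro", 18: "gpt-oss-120b",
    19: "GPT-5 with Deep Research", 20: "ChatGPT Agent", 21: "GPT-5", 22: "GPT-5 Pro",
}
```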
<details>
<summary>figures/gpt_2_plots/gpt_performance_Constrained_Text_Generation_-_LLM.png Details</summary>

### Visual Description
## Line Chart: COLLIE Score (%) vs. Model Number
### Overview
This image is a 2D line chart displaying performance on COLLIE, a constrained text generation benchmark. It plots a percentage-based "Score" against a sequential "Model Number". The chart features a single data series represented by a solid blue line with circular markers at specific data points. The data shows significant volatility across earlier models before stabilizing at a near-perfect score for later models.
### Components/Axes
**1. Y-Axis (Left):**
* **Label:** "Score (%)" (Rotated 90 degrees counter-clockwise, centered vertically).
* **Scale:** Linear, ranging from 40 to 100.
* **Markers/Ticks:** 40, 50, 60, 70, 80, 90, 100.
* **Gridlines:** Light gray, dashed horizontal lines extending from each major tick mark across the plot area.
**2. X-Axis (Bottom):**
* **Label:** "Model Number" (Centered horizontally below the axis).
* **Scale:** Linear, discrete integer values ranging from 1 to 22.
* **Markers/Ticks:** 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22.
* **Gridlines:** Light gray, dotted vertical lines extending upward from each integer tick mark.
**3. Data Series & Annotations:**
* **Line/Markers:** A single series using a medium-blue (steel blue) line connecting solid circular markers of the same color.
* **Annotation/Legend:** The text "COLLIE" is located in the top-right corner of the chart area, positioned just above the final data point. It is rendered in the same medium-blue color as the data line, serving as an inline legend identifying the series.
### Detailed Analysis
*Trend Verification and Data Point Extraction:*
* **Initial Climb:** The line begins at Model 4 and slopes upward to Model 5.
* Model 4: ~53%
* Model 5: ~61%
* **First Peak:** The line slopes steeply upward from Model 5 to Model 8. (Note: Models 6 and 7 are skipped).
* Model 8: ~95%
* **Sharp Decline:** The line drops precipitously from Model 8 to Model 10. (Note: Model 9 is skipped).
* Model 10: ~42.5% (Global Minimum)
* **Recovery Phase:** The line slopes steadily upward in a near-linear fashion through consecutive models from 10 to 14.
* Model 11: ~55%
* Model 12: ~66%
* Model 13: ~72%
* Model 14: ~98.5%
* **Plateau/Stabilization:** The line remains relatively flat, with a very slight dip and subsequent rise, from Model 14 to Model 21. (Note: Models 15, 17, 18, 19, and 20 are skipped).
* Model 16: ~98%
* Model 21: ~99% (Global Maximum)
*Reconstructed Data Table (Approximate Values):*
| Model Number (X) | Score (%) (Y) |
| :--- | :--- |
| 4 | ~53.0 |
| 5 | ~61.0 |
| 8 | ~95.0 |
| 10 | ~42.5 |
| 11 | ~55.0 |
| 12 | ~66.0 |
| 13 | ~72.0 |
| 14 | ~98.5 |
| 16 | ~98.0 |
| 21 | ~99.0 |
*(Note: Models 1, 2, 3, 6, 7, 9, 15, 17, 18, 19, 20, and 22 have no plotted data points).*
### Key Observations
* **Missing Data Points:** The x-axis is continuous from 1 to 22, but data is only plotted for 10 specific models. There is no data for models 1-3.
* **Extreme Volatility:** The performance between Model 8 (~95%) and Model 10 (~42.5%) represents a massive degradation in performance (a drop of over 50 percentage points).
* **Convergence:** After Model 14, the model appears to hit a performance ceiling, stabilizing just below 100%.
### Interpretation
This chart tracks the GPT family (release order per the figure caption) on COLLIE, a benchmark of constrained text generation, rather than versions of a system named "COLLIE".
* **The Gaps:** The missing data points indicate that COLLIE results were simply not reported for every release; only models with published scores are plotted.
* **The Collapse at Model 10:** The severe drop at Model 10 is not a failed training run: per the caption, Model 10 is GPT-4.1 nano, a very small model whose limited capacity shows up sharply on constraint satisfaction.
* **The Recovery and Plateau:** The climb from Models 10 to 14 reflects progressively larger GPT-4.1 variants and the o3-mini reasoning model, after which the benchmark is effectively saturated: o3 (Model 16) and GPT-5 (Model 21) hold a near-perfect ~98-99%, leaving little room for COLLIE to discriminate among newer models. A short plotting sketch below re-creates this panel from the reconstructed values.
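As referenced above, a minimal plotting sketch (using the approximate read-offs from the reconstructed table, not official numbers) re-creates this panel; only model numbers with a reported score are connected, which is why the line skips the unevaluated releases:

```python
# A sketch re-plotting panel (a) from the approximate values reconstructed above.
# Scores are visual read-offs, not official numbers; matplotlib connects only the
# model numbers present in the dictionary, so gaps appear where no score exists.
import matplotlib.pyplot as plt

collie = {4: 53.0, 5: 61.0, 8: 95.0, 10: 42.5, 11: 55.0,
          12: 66.0, 13: 72.0, 14: 98.5, 16: 98.0, 21: 99.0}

xs, ys = zip(*sorted(collie.items()))
plt.plot(xs, ys, marker="o", color="steelblue", label="COLLIE")
plt.xticks(range(1, 23))
plt.xlabel("Model Number")
plt.ylabel("Score (%)")
plt.legend()
plt.show()
```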
</details>
(a) Constrained Text Generation
<details>
<summary>figures/gpt_2_plots/gpt_performance_Factuality_-_LLM.png Details</summary>

### Visual Description
## Line Chart: Model Performance Scores (SimpleQA vs. BrowseComp)
### Overview
This image is a 2D line chart tracking the performance scores of two distinct evaluation metrics—"SimpleQA" and "BrowseComp"—across a sequential series of "Model Numbers." The chart uses two distinct lines with different colors and marker shapes to differentiate the data series. There is no traditional legend box; instead, the series labels are placed directly on the chart area adjacent to specific data points.
*Language Declaration:* All text in this image is in English.
### Components/Axes
**1. Y-Axis (Vertical, Left)**
* **Label:** "Score (%)" (Rotated 90 degrees counter-clockwise, positioned centrally along the axis).
* **Scale:** Ranges from 0 to 70.
* **Markers:** Major tick marks are placed at intervals of 10 (0, 10, 20, 30, 40, 50, 60, 70).
* **Grid:** Faint, light gray, dashed horizontal lines extend from each major tick mark across the chart area.
**2. X-Axis (Horizontal, Bottom)**
* **Label:** "Model Number" (Positioned centrally below the axis numbers).
* **Scale:** Ranges from 1 to 22.
* **Markers:** Major tick marks are placed at intervals of 1 (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22).
* **Grid:** Faint, light gray, dashed vertical lines extend upward from each major tick mark.
**3. Inline Legends (Spatial Grounding)**
* **SimpleQA:** The text "SimpleQA" is written in dark blue. It is positioned in the lower-middle-right section of the chart, immediately to the right of the dark blue data point at Model Number 14.
* **BrowseComp:** The text "BrowseComp" is written in light teal. It is positioned in the upper-right section of the chart, placed horizontally between Model Numbers 19 and 21, just below the peak data point at Model Number 20.
### Detailed Analysis
**Data Series 1: SimpleQA**
* **Visual Identification:** Dark blue line with solid circle markers.
* **Trend Verification:** The line begins at Model 5, slopes upward at a moderate, steady pace through Model 8 to reach its peak at Model 13. Immediately following this peak, the line drops precipitously to its lowest recorded point at Model 14, where the data series ends.
* **Data Points (Approximate values ±0.5%):**
* Model 5: ~38.0%
* Model 8: ~47.0%
* Model 13: ~62.5% (Peak)
* Model 14: ~15.0% (Lowest point; label "SimpleQA" is placed here)
**Data Series 2: BrowseComp**
* **Visual Identification:** Light teal line with solid square markers.
* **Trend Verification:** The line begins at Model 5 at a near-zero baseline and remains completely flat until Model 8. From Model 8, it slopes upward steadily to Model 15, then jumps sharply to Model 16. It plateaus slightly, rising only marginally to Model 19, before spiking sharply to its absolute peak at Model 20. Finally, it experiences a moderate decline at Model 21, where the data series ends.
* **Data Points (Approximate values ±0.5%):**
* Model 5: ~2.0%
* Model 8: ~2.0%
* Model 15: ~28.5%
* Model 16: ~49.5%
* Model 19: ~51.5%
* Model 20: ~69.0% (Peak; label "BrowseComp" is placed just below this)
* Model 21: ~55.0%
### Key Observations
* **Intersection:** The visual paths of the two lines cross between Model 13 and Model 15. During this window, SimpleQA experiences a catastrophic drop, while BrowseComp is in the middle of a steady climb.
* **Data Sparsity:** Neither series has data points for every model number on the x-axis. There are large gaps (e.g., between Model 8 and 13 for SimpleQA, and Model 8 and 15 for BrowseComp).
* **Asynchronous Lifespans:** The SimpleQA evaluation stops at Model 14, whereas the BrowseComp evaluation continues up to Model 21.
* **Post-Peak Drops:** Both metrics exhibit a significant drop immediately after reaching their respective maximum scores (SimpleQA falls by roughly 47.5 percentage points after Model 13; BrowseComp by roughly 14 points after Model 20).
### Interpretation
* **Different Models, Not a Single Training Run:** Read against the figure caption, the two curves compare distinct releases rather than checkpoints of one model, so the swings reflect model choice rather than "catastrophic forgetting."
* **SimpleQA (Models 5-13):** SimpleQA probes parametric factual recall. Scores rise with model scale, peaking at ~62.5% for GPT-4.5 (Model 13); the crash to ~15% at Model 14 corresponds to o3-mini, a small reasoning model with far less stored world knowledge. SimpleQA is simply not reported in these sources after that point, which is why the series ends rather than the capability being "abandoned."
* **BrowseComp (Models 5-21):** BrowseComp requires web browsing, so non-agentic models sit near zero. Scores climb once browsing-capable reasoning and agentic systems appear, peaking at ~69% for ChatGPT Agent (Model 20); the lower ~55% at Model 21 suggests that a general-purpose release like GPT-5 still trails a dedicated agentic configuration on this task, rather than indicating a brittle or overfitted training state.
</details>
(b) Factuality
<details>
<summary>figures/gpt_2_plots/gpt_performance_Instruction_Following_-_LLM.png Details</summary>

### Visual Description
## Line Chart: Model Evaluation Scores (IFEval vs. Multi-IF)
### Overview
This image is a 2D line chart comparing the performance scores of various models across two distinct evaluation metrics: "IFEval" and "Multi-IF". The chart plots the score percentage on the vertical axis against a sequential model number on the horizontal axis. The data is presented in English; no other languages are present.
### Components/Axes
**Component Isolation & Spatial Grounding:**
* **Y-axis (Left):** Labeled vertically as "Score (%)". The axis features solid black text. Major tick marks are labeled at intervals of 5, starting from 60 and ending at 95 (60, 65, 70, 75, 80, 85, 90, 95). Horizontal grid lines extend from these ticks across the chart area. The grid lines are light gray and dashed.
* **X-axis (Bottom):** Labeled horizontally as "Model Number". The axis features solid black text. Major tick marks are labeled with integers from 1 to 22, inclusive. Vertical grid lines extend upward from each integer. These lines are very faint, solid light gray.
* **Data Series 1 (Top Line):** Represented by a dark blue line with solid circular markers. The label "IFEval" is written in matching dark blue text, positioned in the top-right quadrant of the chart area, directly above the final data point at X=14.
* **Data Series 2 (Bottom Line):** Represented by a cyan (light blue) line with solid square markers. The label "Multi-IF" is written in matching cyan text, positioned in the middle-right area of the chart, directly above the final data point at X=14.
### Detailed Analysis
**Trend Verification:**
Before extracting specific values, the visual trends of both lines must be established.
* **IFEval (Dark Blue, Circles):** The line begins at Model 4, slopes upward to Model 5, and continues a steep upward slope to a peak at Model 8. It then experiences a sharp, steep decline to a local minimum at Model 10. From Model 10, it recovers with a steep upward slope to Model 11, continues a moderate upward slope to Model 12, flattens slightly to Model 13, and finishes with a sharp upward spike to its highest point at Model 14.
* **Multi-IF (Cyan, Squares):** The line follows an almost identical geometric path to the IFEval line, but at a lower absolute Y-value. It starts at Model 4, slopes up to Model 5, spikes to Model 8, drops sharply to a minimum at Model 10, recovers steeply to Model 11, slopes up to Model 12, remains perfectly flat (horizontal) between Model 12 and 13, and finishes with a sharp upward spike to Model 14.
**Data Point Extraction (Reconstructed Data Table):**
*Note: Y-axis values are visual approximations based on the placement of markers relative to the 5-point increment gridlines. Uncertainty is approximately ±0.5%.*
| Model Number (X) | IFEval Score (%) [Dark Blue / Circle] | Multi-IF Score (%) [Cyan / Square] |
| :--- | :--- | :--- |
| 4 | ~ 78.5 | ~ 58.0 |
| 5 | ~ 81.0 | ~ 61.0 |
| 8 | ~ 92.0 | ~ 78.0 |
| 10 | ~ 74.5 | ~ 57.5 |
| 11 | ~ 84.0 | ~ 67.0 |
| 12 | ~ 87.5 | ~ 71.0 |
| 13 | ~ 88.0 | ~ 71.0 |
| 14 | ~ 94.0 | ~ 79.5 |
*Note: Models 1, 2, 3, 6, 7, 9, and 15 through 22 are present on the X-axis but contain no data points.*
### Key Observations
1. **High Correlation:** The most striking visual feature is the parallel movement of the two lines. Every increase or decrease in the IFEval score is mirrored by a corresponding increase or decrease in the Multi-IF score.
2. **Consistent Performance Gap:** The IFEval score is higher than the Multi-IF score for every model evaluated. The gap between the two metrics ranges roughly between 14 and 20 percentage points depending on the specific model.
3. **The "Model 10" Anomaly:** Both metrics show a severe degradation in performance at Model 10. For Multi-IF, Model 10 represents the absolute lowest score on the chart (~57.5%), dropping even below the starting point of Model 4.
4. **Missing Data:** Data is only plotted for 8 specific models out of the 22 listed on the X-axis.
### Interpretation
* **Metric Difficulty:** The data suggests that "Multi-IF" is a more demanding evaluation than "IFEval". Because the lines move in tandem, both metrics clearly test related instruction-following capabilities, but Multi-IF layers on additional difficulty (it extends instruction following to multi-turn, multilingual settings), which keeps its scores a consistent step below IFEval; the per-model gap is computed in the sketch after this list.
* **Model Progression:** Per the figure caption, "Model Number" is the GPT family in release order, and the overall trend is positive: Model 14 (o3-mini) clearly outperforms Model 4 (GPT-4o mini) on both metrics.
* **Small-Model Dip, Not Instability:** The sharp drop at Model 10 corresponds to GPT-4.1 nano, a deliberately small model, so the dip reflects capacity rather than a failed training run; the larger GPT-4.1 variants and o3-mini (Models 11-14) restore and extend the trend.
* **Selective Reporting:** The absence of data for Models 1-3, 6-7, 9, and 15-22 simply means these metrics were not reported for those releases in the sources used; the x-axis range is shared across all panels of the figure rather than anticipating future data.
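As referenced in the list above, a small sketch (using the approximate read-offs from the reconstructed table, not official numbers) makes the per-model IFEval/Multi-IF gap explicit:

```python
# Per-model gap between the two instruction-following metrics, computed from the
# approximate values in the reconstructed table above (not official numbers).
ifeval   = {4: 78.5, 5: 81.0, 8: 92.0, 10: 74.5, 11: 84.0, 12: 87.5, 13: 88.0, 14: 94.0}
multi_if = {4: 58.0, 5: 61.0, 8: 78.0, 10: 57.5, 11: 67.0, 12: 71.0, 13: 71.0, 14: 79.5}

gaps = {m: ifeval[m] - multi_if[m] for m in ifeval}
for m, gap in gaps.items():
    print(f"Model {m:2d}: IFEval - Multi-IF = {gap:4.1f} pts")
print(f"gap range: {min(gaps.values()):.1f} to {max(gaps.values()):.1f} pts")
```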
</details>
(c) Instruction Following
<details>
<summary>figures/gpt_2_plots/gpt_performance_Long-Context_-_LLM.png Details</summary>

### Visual Description
## Line Chart: Model Performance Scores by Graphwalks Type and Size
### Overview
This image is a 2D line chart displaying the performance "Score (%)" of various "Model Numbers" across four different configurations related to "Graphwalks". The chart compares two methods ("parents" and "bfs") across two size/complexity thresholds ("<128000" and ">128000"). The data reveals high volatility in performance across different models for the smaller threshold, and consistently low performance for the larger threshold on a limited subset of models.
### Components/Axes
**1. Y-Axis (Vertical, Left)**
* **Title:** "Score (%)" (Rotated 90 degrees counter-clockwise, centered vertically).
* **Scale:** Linear, ranging from 0 to 70 (with gridlines extending slightly above 70).
* **Markers:** 0, 10, 20, 30, 40, 50, 60, 70.
**2. X-Axis (Horizontal, Bottom)**
* **Title:** "Model Number" (Centered horizontally below the axis).
* **Scale:** Linear, discrete integer values.
* **Markers:** 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22.
**3. Gridlines**
* Light gray, dashed gridlines intersect at every major tick mark on both the X and Y axes.
**4. Legend (Inline)**
There is no separate legend box. Instead, text labels are placed directly on the chart area, color-coded to match their respective data series.
* **Blue Text** (Top center, near X=10-14): `Graphwalks parents <128000` (Corresponds to the Blue line with circle markers).
* **Red Text** (Below blue text, near X=10-14): `Graphwalks bfs <128000` (Corresponds to the Red line with square markers).
* **Pink Text** (Middle right, near X=10-14): `Graphwalks parents >128000` (Corresponds to the Pink line with triangle markers).
* **Teal Text** (Below pink text, near X=10-14): `Graphwalks bfs >128000` (Corresponds to the Teal line with diamond markers).
---
### Detailed Analysis
*Note: All Y-axis values are approximate (denoted by ~) based on visual interpolation between the gridlines.*
**Series 1: Graphwalks parents <128000 (Blue Line, Circle Markers)**
* **Visual Trend:** The line starts low, rises steadily to a peak at Model 8, crashes significantly at Model 10, rebounds sharply to a high plateau at Models 11-12, peaks at Model 13, and drops at Model 14.
* **Data Points:**
* Model 4: ~13%
* Model 5: ~35%
* Model 8: ~51%
* Model 10: ~9%
* Model 11: ~60%
* Model 12: ~58%
* Model 13: ~72%
* Model 14: ~58%
**Series 2: Graphwalks bfs <128000 (Red Line, Square Markers)**
* **Visual Trend:** This line closely mirrors the shape of Series 1 but generally scores higher. It rises to a peak at Model 8, crashes at Model 10, rebounds sharply to a plateau at Models 11-12, peaks at Model 13, and drops sharply at Model 14.
* **Data Points:**
* Model 4: ~29%
* Model 5: ~42%
* Model 8: ~62%
* Model 10: ~25%
* Model 11: ~62%
* Model 12: ~62%
* Model 13: ~72%
* Model 14: ~51%
**Series 3: Graphwalks parents >128000 (Pink Line, Triangle Markers)**
* **Visual Trend:** This series only exists for a short span (Models 10-12). It starts very low and slopes upward steadily.
* **Data Points:**
* Model 10: ~5%
* Model 11: ~11%
* Model 12: ~25%
**Series 4: Graphwalks bfs >128000 (Teal Line, Diamond Markers)**
* **Visual Trend:** Similar to Series 3, this only exists for Models 10-12. It starts very low and slopes upward, crossing over Series 3.
* **Data Points:**
* Model 10: ~3%
* Model 11: ~15%
* Model 12: ~19%
---
### Key Observations
1. **Missing Data:** There are significant gaps in the data. Models 1, 2, 3, 6, 7, 9, and 15 through 22 have no data points plotted for any series.
2. **Performance Disparity by Size:** The `<128000` configurations score vastly higher (ranging from ~9% to ~72%) than the `>128000` configurations (ranging from ~3% to ~25%).
3. **The "Model 10" Anomaly:** Model 10 represents a massive drop in performance for the `<128000` tasks compared to Models 8 and 11. Interestingly, it is also the starting point for the `>128000` evaluations.
4. **Method Comparison (bfs vs. parents):** For the `<128000` category, the "bfs" method (red) consistently outperforms the "parents" method (blue) from Models 4 through 12. They tie at Model 13, and then "parents" overtakes "bfs" at Model 14.
5. **Peak Performance:** Model 13 achieves the highest score (~72%) for both the "bfs" and "parents" methods in the `<128000` category.
---
### Interpretation
* **Contextual Meaning:** The chart evaluates the GPT family (release order per the figure caption) on Graphwalks, a long-context benchmark in which a large graph is serialized into the prompt and the model must traverse it.
* **Task Complexity:** Consistent with the panel's long-context theme, the `128000` threshold refers to the length of the input context in tokens: `<128000` covers prompts that fit within a 128k-token window, `>128000` covers longer ones. The data clearly shows that models struggle once the context exceeds 128k tokens, yielding scores below 25%.
* **Task Variants:** "bfs" (breadth-first search) and "parents" denote two task variants: one asks the model to return the nodes reached by a breadth-first traversal from a given node, the other to identify a node's parents (a minimal BFS sketch follows this list). The data suggests the bfs variant is generally easier for these models on the shorter contexts, though the gap disappears at the best-performing model (Model 13).
* **Model Evolution:** The erratic shape of the `<128000` lines reflects the mix of model sizes on the x-axis rather than a smooth training progression: the crash at Model 10 corresponds to GPT-4.1 nano, while the larger GPT-4.1 variants and reasoning models (11-13) restore and extend performance.
* **Testing Scope:** The `>128000` split appears only for Models 10-12, the GPT-4.1 family whose context window extends well beyond 128k tokens; results above that length were not reported for the other releases. The lack of data beyond Model 14 indicates that later releases were not reported on this benchmark in the sources used.
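As referenced above, a minimal sketch of a breadth-first traversal illustrates the operation the "bfs" variant plausibly asks models to reproduce over a graph serialized in the prompt; the example graph, node names, and depth below are invented for illustration:

```python
# A minimal BFS sketch (illustrative only): given an edge list of the kind a
# Graphwalks-style prompt serializes, return all nodes within `depth` hops of
# a start node. The example graph and parameters are invented.
from collections import deque

def bfs(edges, start, depth):
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, []).append(dst)
    seen, frontier, reached = {start}, deque([(start, 0)]), []
    while frontier:
        node, d = frontier.popleft()
        if d == depth:          # do not expand beyond the requested depth
            continue
        for nxt in neighbors.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                reached.append(nxt)
                frontier.append((nxt, d + 1))
    return reached

print(bfs([("a", "b"), ("b", "c"), ("a", "d"), ("c", "e")], start="a", depth=2))
# -> ['b', 'd', 'c']
```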
</details>
(d) Long Context
<details>
<summary>figures/gpt_2_plots/gpt_performance_Multi-turn_Conversation_-_LLM.png Details</summary>

### Visual Description
## Line Chart: MultiChallenge Score vs. Model Number
### Overview
This image is a 2D line chart that visualizes the performance of various models on a specific evaluation metric. The chart plots "Score (%)" against "Model Number" for a single data series identified as "MultiChallenge". The data is not continuous across all model numbers; rather, it shows results for a specific subset of models, connected by a line to illustrate the progression or variance in performance between the tested iterations.
### Components/Axes
**1. Y-Axis (Left)**
* **Label:** "Score (%)" (Oriented vertically, reading bottom to top).
* **Scale:** Linear numerical scale.
* **Markers/Ticks:** Major grid lines and labels are provided at intervals of 10, specifically: 20, 30, 40, 50, 60, and 70. (Note: The data dips below 20, implying the axis extends down to at least 10, though it is not explicitly labeled).
* **Gridlines:** Faint, dashed, light-grey horizontal lines extend from each major tick mark across the chart area.
**2. X-Axis (Bottom)**
* **Label:** "Model Number" (Centered horizontally below the axis).
* **Scale:** Discrete numerical scale.
* **Markers/Ticks:** Numbered sequentially from 1 to 22 in increments of 1.
* **Gridlines:** Faint, dashed, light-grey vertical lines extend upward from each number.
**3. Chart Area & Legend (Center to Top-Right)**
* **Data Series:** A single solid blue line connecting solid blue circular markers.
* **Label/Legend:** The text "MultiChallenge" is located in the top-right quadrant of the chart area, positioned directly above the final data point. The text is colored blue, perfectly matching the color of the data line and markers, confirming this line represents the "MultiChallenge" dataset.
### Detailed Analysis
**Trend Verification:**
The visual trend of the blue "MultiChallenge" line is highly volatile in the earlier models but shows a general upward trajectory in the later models.
* The line begins at Model 4 with a low score, sharply inclines to Model 5, and continues a slight upward slope to Model 8.
* A severe, steep decline occurs between Model 8 and Model 10, marking the lowest point on the graph.
* From Model 10, the line sharply recovers to Model 11, followed by a jagged, fluctuating upward trend through Models 12, 13, 14, and 15.
* A significant, steep upward jump occurs between Model 15 and 16.
* Finally, a steady, moderate incline connects Model 16 to the final and highest point at Model 21.
**Data Extraction Table:**
*Note: Values are visual approximations based on the placement of the blue markers relative to the dashed gridlines.*
| Model Number (X-Axis) | Score (%) (Y-Axis) | Visual Placement Notes |
| :--- | :--- | :--- |
| 4 | ~20.5 | Just barely above the 20 gridline. |
| 5 | ~40.5 | Resting almost exactly on, or slightly above, the 40 gridline. |
| 8 | ~45.0 | Positioned exactly halfway between the 40 and 50 gridlines. |
| 10 | ~15.0 | Positioned halfway between the 20 gridline and the implied 10 baseline. |
| 11 | ~36.0 | Positioned slightly above the midpoint between 30 and 40. |
| 12 | ~38.5 | Positioned just below the 40 gridline. |
| 13 | ~44.0 | Positioned just below the midpoint between 40 and 50. |
| 14 | ~40.0 | Resting exactly on the 40 gridline. |
| 15 | ~43.0 | Positioned slightly below the midpoint between 40 and 50. |
| 16 | ~60.5 | Resting almost exactly on, or slightly above, the 60 gridline. |
| 21 | ~69.5 | Positioned just barely below the 70 gridline. |
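Because these values are read off visually, each placement note translates into a number by simple linear interpolation between the two bracketing gridlines; the hypothetical helper below (`between` is not from any library) mirrors that procedure:

```python
# A sketch of the read-off procedure behind the table above: a marker judged to
# sit a given fraction of the way between two gridlines maps linearly to a value.
# The fractions encode visual judgments, not measured pixel positions.
def between(lower, upper, fraction):
    """Value of a marker sitting `fraction` of the way from `lower` to `upper`."""
    return lower + fraction * (upper - lower)

print(between(40, 50, 0.5))   # "exactly halfway between the 40 and 50 gridlines" -> 45.0
print(between(60, 70, 0.05))  # "slightly above the 60 gridline" -> 60.5
```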
### Key Observations
* **Missing Data:** There are significant gaps in the X-axis where no data points exist. Models 1, 2, 3, 6, 7, 9, 17, 18, 19, 20, and 22 have no recorded scores on this chart.
* **Absolute Maximum:** The highest recorded score is achieved by Model 21, nearing 70%.
* **Absolute Minimum:** The lowest recorded score is Model 10, dropping to approximately 15%.
* **Highest Volatility:** The most drastic changes in performance occur around Model 10 (a drop of ~30% from Model 8, followed by a recovery of ~21% to Model 11) and between Models 15 and 16 (a sudden increase of ~17.5%).
### Interpretation
**What the data suggests:**
The chart tracks the GPT family across sequential releases ("Model Number", per the figure caption) on MultiChallenge, a benchmark of multi-turn conversational ability.
**Reading between the lines:**
1. **Iterative Improvement:** The general upward trend from Model 10 to Model 21 reflects successive releases becoming more capable at sustaining multi-turn interaction.
2. **The "Model 10" Dip:** The sharp drop at Model 10 corresponds to GPT-4.1 nano, a deliberately small model; the low score reflects limited capacity rather than a failed experiment or a buggy build, and the larger GPT-4.1 mini (Model 11) recovers immediately.
3. **The "Model 16" Jump:** The sudden leap at Model 16 (from the low 40s to ~60) coincides with o3, a large reasoning model, suggesting that extended reasoning, rather than an incidental fix, is what unlocked the higher scores.
4. **Selective Reporting:** The missing data points (e.g., Models 1-3, 6-7, 9, 17-20, 22) most plausibly mean that:
* MultiChallenge scores were not reported for those releases in the sources compiled here, or
* the benchmark post-dates some of the earlier models and was never run on them.
5. **Steady Growth After the Jump:** The slope from Model 16 to 21 is shallower than the jump from 15 to 16 but steady, indicating that after o3 the subsequent releases up to GPT-5 (Model 21) deliver incremental rather than step-change gains on this benchmark.
</details>
(e) Multi-turn Conversation
<details>
<summary>figures/gpt_2_plots/gpt_performance_Safety_-_LLM.png Details</summary>

### Visual Description
## Line Plot: Model Performance on HealthBench Metrics
### Overview
This image is a 2D line and scatter plot illustrating the performance scores (in percentages) of various numbered models across three different evaluation metrics related to "HealthBench." The chart plots discrete model numbers on the horizontal axis against their corresponding percentage scores on the vertical axis.
### Components/Axes
* **Y-axis (Left):** Labeled "Score (%)". The scale ranges from 30 to 90, with major tick marks and corresponding horizontal dashed light-gray grid lines at increments of 10 (30, 40, 50, 60, 70, 80, 90).
* **X-axis (Bottom):** Labeled "Model Number". The scale ranges from 1 to 22, with major tick marks and corresponding vertical dashed light-gray grid lines at increments of 1.
* **Legend:** There is no separate legend box. Instead, data series are labeled directly on the chart area near their respective data points or at the end of their lines.
* Cyan/Light Blue Triangle: `HealthBench Consensus`
* Medium Blue Circle: `HealthBench`
* Brown/Maroon Square: `HealthBench Hard`
### Detailed Analysis
**1. Series: HealthBench Consensus**
* *Spatial Grounding:* Located in the top-right quadrant of the chart area. The label is placed directly above the single data point.
* *Trend Verification:* This is a single, isolated data point; therefore, there is no line or trend to describe.
* *Data Points:*
* Model Number: 18
* Score: ~90.0% (The cyan triangle rests exactly on the 90 grid line).
**2. Series: HealthBench**
* *Spatial Grounding:* Spans from the lower-left to the middle-right of the chart. The label is placed just above the final data point at Model 21.
* *Trend Verification:* The line slopes upward significantly from Model 5 to Model 16, dips slightly downward to Model 18, and then slopes upward again to Model 21.
* *Data Points:*
* Model Number: 5 | Score: ~32% (Medium blue circle, slightly above the 30 line).
* Model Number: 16 | Score: ~60.0% (Medium blue circle, exactly on the 60 line).
* Model Number: 18 | Score: ~58% (Medium blue circle, slightly below the 60 line).
* Model Number: 21 | Score: ~67% (Medium blue circle, situated between 60 and 70, closer to 70).
**3. Series: HealthBench Hard**
* *Spatial Grounding:* Located in the lower-right quadrant of the chart. The label is placed just above the final data point at Model 21.
* *Trend Verification:* The line starts at Model 16, slopes slightly downward to Model 18, and then slopes sharply upward to Model 21.
* *Data Points:*
* Model Number: 16 | Score: ~32% (Brown square, slightly above the 30 line).
* Model Number: 18 | Score: ~30.0% (Brown square, exactly on the 30 line).
* Model Number: 21 | Score: ~46% (Brown square, situated between 40 and 50, slightly above the midpoint).
### Key Observations
* **Data Sparsity:** Data is not provided for every model number. Only models 5, 16, 18, and 21 have recorded scores.
* **Model 18 Convergence:** Model Number 18 is the only point on the x-axis where data exists for all three metrics simultaneously.
* **Difficulty Gap:** For the models where both are measured (16, 18, 21), the standard "HealthBench" score is consistently and significantly higher (by roughly 20-28 percentage points) than the "HealthBench Hard" score.
* **Performance Dip:** Both "HealthBench" and "HealthBench Hard" show a slight decrease in performance from Model 16 to Model 18, before rebounding strongly at Model 21.
* **Outlier/Peak:** The "HealthBench Consensus" score for Model 18 (90%) is a massive outlier compared to the standard score for the same model (~58%).
### Interpretation
* **Evolution of Capability:** The chart likely tracks the historical progression of a specific family of AI models (e.g., iterations of a Large Language Model) on a medical or healthcare-specific evaluation benchmark ("HealthBench"). The general upward trajectory from Model 5 to Model 21 indicates overall improvement in the models' capabilities over time or iterations.
* **Benchmark Stratification:** The introduction of "HealthBench Hard" at Model 16 suggests that as models improved (reaching 60% on the standard benchmark), the evaluators needed a more rigorous subset of questions to prevent the benchmark from topping out and to accurately measure advanced reasoning. The consistent gap proves the "Hard" dataset is functioning as intended.
* **The "Consensus" Anomaly (Peircean Inference):** The most striking feature of the chart is Model 18. While its base performance on standard and hard metrics actually *regressed* slightly compared to Model 16, its "Consensus" score is 90%.
* *Reading between the lines:* "Consensus" usually implies an ensemble method (multiple models voting), a multi-agent framework (e.g., Medprompt), or a human-in-the-loop verification process. The chart demonstrates that while the raw, zero-shot capability of Model 18 might have dipped slightly, applying a "Consensus" methodology to that specific model yields state-of-the-art results, vastly outperforming even the newer Model 21's base score.
* **Milestone Reporting:** The non-sequential x-axis data points (5, 16, 18, 21) suggest these are major milestone releases or specific checkpoints chosen for publication, rather than a continuous daily or weekly training log.
</details>
(f) Safety
<details>
<summary>figures/gpt_2_plots/gpt_performance_Tool_Use_-_LLM.png Details</summary>

### Visual Description
## Line Chart: Model Performance Scores Across Various Benchmarks
### Overview
This image is a line chart displaying the performance scores (in percentages) of various numbered models across six different benchmark tests. The chart illustrates how performance evolves or fluctuates across different model iterations or variants, highlighting significant volatility in certain benchmarks and smoother progression in others.
### Components/Axes
**1. Y-Axis (Left):**
* **Label:** "Score (%)" (Rotated 90 degrees counter-clockwise).
* **Scale:** Ranges from 0 to 100 (though 0 is not explicitly marked, the axis starts below 20).
* **Major Ticks:** 20, 40, 60, 80, 100.
* **Gridlines:** Solid light gray horizontal lines at major ticks. Dashed light gray horizontal lines at midpoints (10, 30, 50, 70, 90).
**2. X-Axis (Bottom):**
* **Label:** "Model Number" (Centered below the axis).
* **Scale:** Discrete integer values from 1 to 22.
* **Major Ticks:** 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22.
* **Gridlines:** Dashed light gray vertical lines extending upward from each integer tick.
**3. Legend/Labels:**
There is no separate legend box. Instead, the labels for each data series are placed directly on the chart area, generally positioned to the right side near the final data points of their respective lines. The text color matches the line color.
### Detailed Analysis
The chart contains six distinct data series. Notably, the data density varies; three series have data points for many models, while three series only have data points for Models 5, 16, and 21.
**Series 1: Tau2-bench Telecom**
* **Visual Identification:** Cyan line, pentagon/diamond markers. Label located at the top right.
* **Trend Verification:** The line slopes upward steadily from its first point to its second, and then slopes upward sharply to its final point, representing the highest score on the chart.
* **Data Points (Approximate ±2%):**
* Model 5: 23%
* Model 16: 58%
* Model 21: 97%
**Series 2: Tau2-bench Retail**
* **Visual Identification:** Olive/gold line, small circle markers. Label located in the upper right.
* **Trend Verification:** The line starts relatively high, slopes upward gradually to the middle point, and then flattens out, showing almost no growth between the last two points.
* **Data Points (Approximate ±2%):**
* Model 5: 63%
* Model 16: 80%
* Model 21: 81%
**Series 3: Tau-bench Retail**
* **Visual Identification:** Green line, square markers. Label located in the mid-upper right.
* **Trend Verification:** Highly volatile. Starts mid-range, rises, experiences a massive drop at Model 10, recovers sharply, fluctuates slightly, and ends relatively high.
* **Data Points (Approximate ±2%):**
* Model 4: 44%
* Model 5: 60%
* Model 8: 71%
* Model 10: 23%
* Model 11: 65%
* Model 12: 74%
* Model 13: 68%
* Model 14: 58%
* Model 15: 72%
* Model 16: 74%
* Model 18: 68%
**Series 4: Tau2-bench Airline**
* **Visual Identification:** Pink line, diamond markers. Label located in the mid-right.
* **Trend Verification:** Starts mid-range, slopes upward gradually to the middle point, and then exhibits a very slight downward slope to the final point.
* **Data Points (Approximate ±2%):**
* Model 5: 45%
* Model 16: 65%
* Model 21: 63%
**Series 5: Tau-bench Airline**
* **Visual Identification:** Dark blue line, circle markers. Label located in the mid-lower right.
* **Trend Verification:** Volatile, mirroring the shape of Tau-bench Retail but at a lower score tier. Rises initially, drops sharply at Model 10, recovers, dips again at Model 14, and stabilizes in the middle range.
* **Data Points (Approximate ±2%):**
* Model 4: 22%
* Model 5: 43%
* Model 8: 50%
* Model 10: 14%
* Model 11: 36%
* Model 12: 49%
* Model 13: 50%
* Model 14: 32%
* Model 15: 49%
* Model 16: 52%
* Model 18: 49%
**Series 6: ComplexFuncBench**
* **Visual Identification:** Purple line, triangle markers. Label located in the lower right.
* **Trend Verification:** Extremely volatile. Starts mid-low, spikes high, crashes to near-zero at Model 10, recovers sharply, and crashes again at Model 14.
* **Data Points (Approximate ±2%):**
* Model 4: 38%
* Model 5: 66%
* Model 8: 47%
* Model 10: 5%
* Model 11: 49%
* Model 12: 65%
* Model 13: 63%
* Model 14: 17%
### Key Observations
1. **The "Model 10" Anomaly:** There is a severe, synchronized drop in performance at Model 10 across all three benchmarks that evaluated it (Tau-bench Retail, Tau-bench Airline, ComplexFuncBench). ComplexFuncBench drops to nearly 0%.
2. **The "Model 14" Dip:** A secondary, less severe synchronized drop occurs at Model 14 for the same three benchmarks.
3. **Evaluation Discrepancy:** The "Tau2" benchmarks (Telecom, Retail, Airline) were only evaluated on Models 5, 16, and 21. The "Tau" benchmarks and ComplexFuncBench were evaluated on a much denser cluster of models (4, 5, 8, 10-16, 18).
4. **Highest/Lowest Performers:** Model 21 on Tau2-bench Telecom achieved the highest score (~97%). Model 10 on ComplexFuncBench achieved the lowest score (~5%).
### Interpretation
The data suggests a comparison of a family of models (likely sequential versions or varying parameter sizes, given the numerical x-axis) against a suite of tasks.
**Reading Between the Lines (Peircean Investigative Analysis):**
* **Model 10 is Small, Not Broken:** The across-the-board failure at Model 10 reflects the model itself rather than any single benchmark: per the figure caption, Model 10 is GPT-4.1 nano, a very small model with little capacity for multi-step tool use and complex function calling. The dip at Model 14 (o3-mini) is the same effect in milder form.
* **Tau vs. Tau2:** The naming convention indicates "Tau2" (τ²-bench) is a newer, revised version of the "Tau" (τ-bench) suite. That Tau2 appears only for Models 5, 16, and 21 (GPT-4o, o3, GPT-5) suggests it was reported only for flagship releases, whereas the denser Tau and ComplexFuncBench coverage spans the intermediate models as well.
* **Illusion of Smoothness:** The Tau2 lines appear much smoother and show a clearer upward trajectory, but this is an artifact of low data density: by skipping the volatile middle models (the small Model 10 and the dip at Model 14), they draw a straight line over what is actually a bumpy progression, as illustrated by the subsampling sketch below.
* **Overall Progress:** Despite the severe regressions at Models 10 and 14, the general trend from left to right is positive. The later models (16, 18, 21) generally outperform the earlier models (4, 5), indicating that the development process is ultimately yielding more capable systems, particularly in Telecom and Retail domains.
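As referenced in the "Illusion of Smoothness" point above, a small sketch (using the approximate Tau-bench Retail read-offs, not official numbers) shows how sampling only the milestone models with dense coverage (5 and 16) draws a smooth line over the Model 10 crash:

```python
# How sparse evaluation can hide volatility: the dense Tau-bench Retail read-offs
# show the Model 10 crash, while sampling only milestone models yields a smooth,
# monotone line. Values are approximate read-offs from the description above.
import matplotlib.pyplot as plt

tau_retail = {4: 44, 5: 60, 8: 71, 10: 23, 11: 65, 12: 74,
              13: 68, 14: 58, 15: 72, 16: 74, 18: 68}
milestones = {m: tau_retail[m] for m in (5, 16)}

plt.plot(*zip(*sorted(tau_retail.items())), marker="s", label="dense evaluation")
plt.plot(*zip(*sorted(milestones.items())), marker="o", linestyle="--",
         label="milestone models only")
plt.xlabel("Model Number")
plt.ylabel("Score (%)")
plt.legend()
plt.show()
```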
</details>
(g) Tool Use
Figure 6: Performance of the GPT family on LLM-specific benchmarks. Model numbers and corresponding names are as follows: 1 – GPT-3.5; 2 – GPT-4; 3 – GPT-4 Turbo; 4 – GPT-4o mini; 5 – GPT-4o; 6 – o1-preview; 7 – o1-mini; 8 – o1; 9 – o1-pro; 10 – GPT-4.1 nano; 11 – GPT-4.1 mini; 12 – GPT-4.1; 13 – GPT-4.5; 14 – o3-mini; 15 – o4-mini; 16 – o3; 17 – o3-pro; 18 – gpt-oss-120b; 19 – GPT-5 with Deep Research; 20 – ChatGPT Agent; 21 – GPT-5; 22 – GPT-5 Pro.