# The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation
**Authors**:
- İbrahim Ethem Deveci (Department of Cognitive Science)
- Ankara, Turkey
- Duygu Ataman (Department of Cognitive Science)
- Ankara, Turkey
## Abstract
The rapid rise of Large Language Models (LLMs) and Large Reasoning Models (LRMs) has been accompanied by an equally rapid increase in the number of benchmarks used to assess them. However, results quickly saturate, driven both by improved model competence from scaling and novel training advances and by the likely inclusion of many of these datasets in pre- or post-training data, creating a continuous need for new and more challenging replacements. In this paper, we discuss whether surpassing a benchmark truly demonstrates reasoning ability, or whether we are simply tracking numbers divorced from the capabilities we claim to measure. We present an investigation focused on three model families, OpenAI, Anthropic, and Google, and how their reasoning capabilities across different benchmarks evolve over the years. We also analyze performance trends over the years across different reasoning tasks and discuss the current state of benchmarking and its remaining challenges. By offering a comprehensive overview of benchmarks and reasoning tasks, our work aims to serve as a first reference to ground future research in reasoning evaluation and model development.
## 1 Introduction
Benchmarks have long played a central role in evaluating and comparing machine learning models [1]. As models scale up in size and capability, particularly Large Language Models (LLMs) and the specialized Large Reasoning Models (LRMs), many benchmarks quickly saturate, often reaching or surpassing human-level performance. Whether this saturation is driven primarily by improved model capability or by dataset contamination is generally unknown. Nevertheless, this rapid saturation forces the development of new and more challenging benchmarks against which new model families can be compared. In this paper, we investigate two key research questions: how effective are current benchmarks at measuring model capabilities, and does surpassing a benchmark reliably indicate genuine reasoning?
To examine these questions, we select three model families, OpenAI, Anthropic, and Google, and compile performance data from official sources [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]. We gather a comprehensive list of 52 benchmarks used in evaluating these models and classify them according to the types of reasoning they aim to evaluate. Analyzing performance trends over the years, we highlight where models improve, where they struggle, and what these trends reveal about the current state of benchmarking. Finally, we discuss the implications of the saturation cycle and emphasize the need for improved evaluation practices that more accurately capture model capabilities.
Our contributions are threefold: (1) we provide a curated list of reasoning benchmarks, classified by the types of reasoning they aim to assess; (2) we analyze performance trends over the years to assess benchmarking effectiveness; and (3) we examine the current landscape of existing benchmarks, identifying which have reached high performance thresholds and which remain unsolved.
By situating our analysis within the broader evaluation landscape, our work collects evidence to emphasize the need for reasoning tasks that are more representative of the nature of the reasoning process and that target evaluation beyond downstream accuracy.
## 2 Benchmark Landscape and Categorization
To analyze how the creation and adoption of reasoning benchmarks have evolved over time, we examine three model families and compile the set of benchmarks employed to evaluate them, aiming to provide a comprehensive overview of current benchmarking practices. The complete list of benchmarks, their assigned reasoning types, and short summaries can be found in Appendix A. To facilitate analysis, we categorize benchmarks into seven reasoning types: commonsense and logical reasoning, mathematical reasoning, multimodal reasoning, programming and coding, reading comprehension and question answering, reasoning with general knowledge, and LLM-specific capabilities such as safety, tool use, and instruction following. Figure 1 illustrates a marked increase in benchmark adoption for multimodal reasoning, mathematical reasoning, programming, reasoning with general knowledge, and LLM-specific benchmarks after 2023. In contrast, no new benchmarks in reading comprehension or commonsense reasoning were adopted by these model families during this period. While the literature contains several other benchmarks in these areas [23, 24, 25, 26, 27, 28, 29], our analysis shows they have not been utilized by any of the prominent model families. This likely reflects the evolving understanding of what constitutes reasoning in computational models, shaped by their current capabilities and by what the community deems important to evaluate. Since most models now have direct commercial applications, performance in commercially relevant domains, such as coding and tool use, may also motivate the emphasis on certain categories of reasoning tasks.
<details>
<summary>figures/benchmarks_by_year.png Details</summary>

Line chart of the number of adopted benchmarks per reasoning type, 2015–2025 (X-axis: Year; Y-axis: Number of Benchmarks, 0–14). Multimodal reasoning grows steadily from 1 in 2015 to 13 in 2025; LLM-specific benchmarks jump from 0 in 2023 to 3 in 2024 and 13 in 2025; mathematical reasoning (8 by 2025), programming and coding (7 by 2025), and reasoning with general knowledge (7 by 2025) accelerate after 2023; commonsense and logical reasoning (1 since 2019) and reading comprehension and question answering (1 from 2018, 3 in 2025) remain nearly flat.
</details>
Figure 1: Number of benchmarks in different reasoning types over time.
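To make the adoption counts behind Figure 1 concrete, the following is a minimal sketch of the tallying, assuming a hypothetical list of (benchmark, reasoning type, adoption year) records compiled from the model cards; the entries shown are illustrative placeholders drawn from Appendix A, not our full dataset.

```python
# Minimal sketch of the tallying behind Figure 1 (illustrative data only).
from collections import defaultdict

records = [
    ("HellaSwag", "Commonsense and Logical Reasoning", 2019),
    ("MMLU", "Reasoning with General Knowledge", 2021),
    ("MMMU", "Multimodal Reasoning", 2024),
    ("SWE-bench Verified", "Programming and Coding", 2024),
    # ... remaining entries from Appendix A
]

# Count newly adopted benchmarks per reasoning type and year.
new_per_year = defaultdict(lambda: defaultdict(int))
for _, rtype, year in records:
    new_per_year[rtype][year] += 1

# Cumulative adoption curves over 2015-2025, as plotted in Figure 1.
for rtype, per_year in new_per_year.items():
    total, series = 0, []
    for year in range(2015, 2026):
        total += per_year.get(year, 0)
        series.append((year, total))
    print(rtype, series)
```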
## 3 Performance Trends Across Models
Across all three model families, there is a consistent effort to develop newer models and architectural improvements that achieve higher benchmark performance. However, comparing performance across families is challenging, as each family often employs different benchmarks, and even within a single family, the benchmarks used can vary between model iterations. This variation appears to stem from two main factors: first, certain benchmarks reach saturation due to high performance; second, benchmark updates or more challenging subsets are introduced, such as the transition from MATH to MATH-500 [30].
We observe a recurring pattern: once a model family achieves high performance on a particular benchmark, subsequent models tend to use that benchmark less frequently or may discontinue its use entirely. This reflects both practical and conceptual considerations: benchmarks that no longer discriminate between models provide limited evaluative value, and benchmark selection increasingly reflects the evolving understanding of which reasoning tasks remain challenging for current architectures.
Interestingly, performance trends reveal consistent directional correlations across benchmarks within the same reasoning type. For example, when a model demonstrates improved performance on one benchmark, it generally shows corresponding improvements on other benchmarks of the same type, while lower performance on one benchmark tends to coincide with lower performance on others. Nevertheless, the magnitude of improvement differs across benchmarks, potentially due to variations in problem complexity and the scaling limitations evident in smaller models, as seen within the OpenAI family. This pattern suggests that benchmarks within a reasoning type often capture overlapping aspects of reasoning, so that advances in a model's capabilities tend to propagate across related tasks. At the same time, variations in the magnitude of performance gains provide insight into the relative difficulty of different benchmarks within the same reasoning type. Detailed plots illustrating performance changes within model families for different reasoning types are provided in Appendix B.
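As a concrete illustration of how such directional correlations can be checked, below is a minimal sketch using two score trajectories read off an Appendix B plot (GSM8K and MATH for successive Claude models); the values are approximate and serve only as an example.

```python
# Sketch of a within-type correlation check (approximate scores, for illustration).
import numpy as np

gsm8k = np.array([89, 92, 95, 95, 96], dtype=float)  # successive models
math_bench = np.array([39, 43, 60, 70, 78], dtype=float)

# Pearson correlation of the two score trajectories.
r = np.corrcoef(gsm8k, math_bench)[0, 1]

# Directional agreement: do model-to-model changes share the same sign?
agree = np.mean(np.sign(np.diff(gsm8k)) == np.sign(np.diff(math_bench)))
print(f"Pearson r = {r:.2f}, sign agreement = {agree:.0%}")
```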
Finally, we note that newer models generally achieve higher performance on previously low-scoring benchmarks. However, the limited overlap of common benchmarks across model families complicates cross-family comparisons. This raises a critical question: if benchmarks are intended to evaluate and compare model capabilities, why are they not consistently adopted or reported across families? Their fragmented and selective use undermines the goal of providing a shared measure of capability and underscores the need for more standardized, representative, and domain-informed evaluation frameworks.
## 4 Performance of Models within Benchmarks
We collect all reported model performances across benchmarks and analyze saturation, which we define as a model having achieved at least 80% accuracy on a given benchmark. Out of the full set of benchmarks, we find that 27 benchmarks surpass this threshold in at least one model family, while 25 benchmarks never reach it. The majority of "solved" benchmarks belong to commonsense and logical reasoning, mathematical reasoning, reasoning with general knowledge, and reading comprehension and question answering. By contrast, benchmarks targeting LLM-specific capabilities and programming and coding remain comparatively difficult, with few instances of performance above 80%.
We then examine the release years of benchmarks that never surpass the 80% threshold. The distribution is striking: 60% of unsolved benchmarks were introduced in 2025, 32% in 2024, and only two benchmarks released before 2024 remain unsolved, namely ActivityNet [31] (2015) and EgoSchema [32] (2023), both multimodal reasoning benchmarks. This distribution suggests a clear trend. Nearly all benchmarks released prior to 2025 have already been surpassed by at least one model family, indicating rapid saturation. By contrast, the benchmarks still below the threshold overwhelmingly correspond to the most recently introduced evaluation tasks.
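The classification behind this analysis is straightforward; the following is a minimal sketch, assuming a hypothetical mapping from each benchmark to its best reported score across the three families and to its release year (illustrative entries only, not our full dataset).

```python
# Sketch of the 80%-threshold saturation analysis (illustrative data only).
from collections import Counter

SATURATION_THRESHOLD = 80.0  # percent accuracy

best_score = {"GSM8K": 96.0, "HellaSwag": 95.0, "ZeroBench": 5.0, "SWE-Lancer": 45.0}
release_year = {"GSM8K": 2021, "HellaSwag": 2019, "ZeroBench": 2025, "SWE-Lancer": 2025}

# A benchmark counts as "solved" if any model family reaches the threshold.
solved = {b for b, s in best_score.items() if s >= SATURATION_THRESHOLD}
unsolved = set(best_score) - solved

# Release-year distribution of still-unsolved benchmarks (right pie of Figure 2b).
year_dist = Counter(release_year[b] for b in sorted(unsolved))
print(f"solved={len(solved)}, unsolved={len(unsolved)}, unsolved by year={dict(year_dist)}")
```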
<details>
<summary>figures/stacked_bar_saturation.png Details</summary>

Horizontal stacked bar chart showing, for each reasoning type, the percentage of benchmarks that are saturated (green) versus not saturated (red); X-axis: Percentage of Benchmarks (0–100%). Saturation rates: Commonsense and Logical Reasoning 100% (1/1), Mathematical Reasoning 87.5% (7/8), Reasoning with General Knowledge 71.4% (5/7), Reading Comprehension and Question Answering 66.7% (2/3), Multimodal Reasoning 46.2% (6/13), Programming and Coding 33.3% (3/9), LLM 23.1% (3/13).
</details>
(a) Distribution of benchmarks on which models surpassed the 80% threshold and those not yet surpassed, grouped by reasoning type.
<details>
<summary>figures/pie_saturation_by_year.png Details</summary>

Two pie charts of benchmark release years. Left (surpassed benchmarks): 2016 3.7% (1), 2018 3.7% (1), 2019 11.1% (3), 2021 18.5% (5), 2022 3.7% (1), 2023 18.5% (5), 2024 29.6% (8), 2025 3.7% (1). Right (unsolved benchmarks): 2015 4.0% (1), 2023 4.0% (1), 2024 32.0% (8), 2025 60.0% (15).
</details>
(b) Release years of benchmarks relative to the 80% threshold: left pie shows surpassed benchmarks, right pie shows unsolved benchmarks.
Figure 2: Benchmark saturation dynamics.
This temporal pattern highlights the central dynamic of the saturation cycle: older benchmarks are rapidly mastered and lose discriminative power, while newly introduced benchmarks become the standards for demonstrating progress. Nearly all unsolved benchmarks are recent, underscoring both the accelerating pace of benchmark creation and the difficulty of maintaining evaluations that remain challenging over time. Yet this difficulty seems only temporary. It is highly plausible that within one or two years many of these currently unsolved benchmarks will also be surpassed, at which point model families will shift to alternative or newly designed evaluations to preserve differentiation. Crucially, this pattern reflects the fact that performance gains are often specific to individual benchmarks rather than to the broader reasoning type they are intended to assess. As the analyses indicate, while models often perform consistently and even strongly on benchmarks within a domain, the introduction of a more challenging, novel benchmark frequently leads to a drop in performance. This drop may arise from the increased difficulty of the new benchmark, or from contamination that inflated performance on earlier benchmarks without truly reflecting generalizable reasoning ability. This raises the question of whether what appears as "reasoning ability" is often tied more to benchmark design and prior exposure than to robust mastery of the reasoning type itself. This saturation cycle casts doubt on the long-term evaluative value of benchmarks.
## 5 Discussion: Limitations of Current Benchmarking
Our analysis of three model families demonstrates that benchmark performance has generally increased over time, with newer models achieving higher scores across most reasoning types and benchmarks. However, given that many benchmarks have already been surpassed with high accuracy, we would like to highlight a question originally posed in [25] regarding commonsense reasoning, reframed here for reasoning in general: Have neural language models successfully acquired reasoning, or are we overestimating the true capabilities of machine reasoning? Several studies in the literature show that these models still perform poorly when required to generalize to longer contexts or handle tasks requiring inductive and compositional reasoning [33, 34, 35, 36, 37, 38]. This discrepancy suggests a limitation of current benchmarking practices: improvements in benchmark scores do not necessarily reflect generalizable reasoning ability.
We believe this discrepancy can be reduced by developing more sophisticated, task-specific evaluation metrics that capture intermediate reasoning steps or different modes of error. Additionally, formalizing reasoning for different task types can support these efforts, enabling more structured analyses and clearer assessment of modelsβ reasoning abilities. Such a formalization enables structured representations of diverse reasoning types and their interrelationships [39, 40, 41], and facilitates the design of layered, targeted evaluation procedures that assess specific reasoning capabilities rather than merely reporting overall accuracy. Furthermore, formal reasoning frameworks can support the development of algorithms that deliver structured feedback to models, guiding the refinement of their reasoning abilities. By integrating formalized reasoning with task-specific evaluations, benchmarking can be conducted in a more targeted and informative manner.
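As one minimal, purely hypothetical illustration of such a step-aware metric (our own sketch, not an existing benchmark's procedure), the function below scores intermediate reasoning steps separately from the final answer; exact string matching is a crude proxy, and a practical metric would need softer matching and explicit error typing.

```python
# Hypothetical sketch of a layered, step-level evaluation metric.
from typing import List


def step_level_score(gold_steps: List[str], model_steps: List[str],
                     gold_answer: str, model_answer: str) -> dict:
    """Return final-answer correctness plus the fraction of matched steps."""
    matched = sum(1 for g, m in zip(gold_steps, model_steps)
                  if g.strip().lower() == m.strip().lower())
    return {
        "final_correct": gold_answer.strip() == model_answer.strip(),
        "step_accuracy": matched / max(len(gold_steps), 1),
    }


print(step_level_score(
    gold_steps=["48 / 2 = 24", "24 + 48 = 72"],
    model_steps=["48 / 2 = 24", "24 + 50 = 74"],
    gold_answer="72",
    model_answer="74",
))  # -> {'final_correct': False, 'step_accuracy': 0.5}
```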
## 6 Limitations
The analysis in our study focuses on 52 benchmarks used by the three model families. Other model families and reasoning-focused models are not fully explored because including them, along with more than two hundred benchmarks identified from other model families and several studies evaluating different types of reasoning in large models, would create a combinatorial explosion of comparisons. This restriction was necessary to keep the scope of our work focused on a qualitative evaluation of benchmark design and adoption rather than an exhaustive quantitative analysis of all models and benchmarks. A comprehensive comparison across a wider range of models and benchmarks is left for future work.
## 7 Conclusion
In this work, we analyze 52 benchmarks across three model families, covering multiple reasoning types. Our study reveals the rapid saturation of older benchmarks, the selective adoption of new ones, and the temporal dynamics that govern the utility of benchmarks in evaluating model performance. While model performance generally improves over time and correlations within reasoning types indicate overlapping evaluation properties, the introduction of more challenging benchmarks generally resets performance, suggesting that apparent reasoning ability is influenced more by extrinsic factors than by mastery of the reasoning itself, as supported by other studies. This saturation cycle highlights the limitations of current practices: benchmarks provide only a partial view of model reasoning. Meaningful progress requires formalized reasoning tasks, layered evaluation procedures, and task-specific metrics that go beyond accuracy scores.
## References
- [1] Thomas Liao, Rohan Taori, Deborah Raji, and Ludwig Schmidt. Are we learning yet? a meta review of evaluation failures across machine learning. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021.
- [2] Anthropic. Introducing the next generation of claude, March 2024. Accessed: 2025-08-28.
- [3] Anthropic. Claude 3.5 sonnet, June 2024. Accessed: 2025-08-28.
- [4] Anthropic. Introducing claude 4, May 2025. Accessed: 2025-08-28.
- [5] Anthropic. Introducing claude 3.5 haiku, October 2024. Accessed: 2025-08-28.
- [6] Anthropic. Claude 3.7 sonnet and claude code, February 2025. Accessed: 2025-08-28.
- [7] Anthropic. Claude opus 4.1, August 2025. Accessed: 2025-08-28.
- [8] Google DeepMind. Gemini 2.5 flash-lite, June 2025. Accessed: 2025-08-28.
- [9] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025.
- [10] Google DeepMind. Gemini 2.5: Our most intelligent ai model, March 2025. Accessed: 2025-08-28.
- [11] Gemini Team, Petko Georgiev, Ving Ian Lei, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.
- [12] Gemini Team, Rohan Anil, Sebastian Borgeaud, et al. Gemini: A family of highly capable multimodal models, 2025.
- [13] OpenAI. Openai o1-mini: Advancing cost-efficient reasoning, September 2024. Accessed: 2025-08-28.
- [14] OpenAI. Introducing gpt-4.1 in the api, April 2025. Accessed: 2025-08-28.
- [15] OpenAI. Introducing gpt-4.5, February 2025. Accessed: 2025-08-28.
- [16] OpenAI. gpt-oss-120b & gpt-oss-20b model card, August 2025. Accessed: 2025-08-28.
- [17] OpenAI. Introducing gpt-5, August 2025. Accessed: 2025-08-28.
- [18] OpenAI. Model release notes. Accessed: 2025-08-28.
- [19] OpenAI. Introducing openai o3 and o4-mini, April 2025. Accessed: 2025-08-28.
- [20] OpenAI. Gpt-4o mini: Advancing cost-efficient intelligence, July 2024. Accessed: 2025-08-28.
- [21] OpenAI. Hello gpt-4o, May 2024. Accessed: 2025-08-28.
- [22] OpenAI. Learning to reason with llms, September 2024. Accessed: 2025-08-28.
- [23] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020.
- [24] Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1823–1840, Online, November 2020. Association for Computational Linguistics.
- [25] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: an adversarial winograd schema challenge at scale. Commun. ACM, 64(9):99–106, August 2021.
- [26] Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, and Jonathan Berant. Commonsenseqa 2.0: Exposing the limits of ai through gamification, 2022.
- [27] Andong Wang, Bo Wu, Sunli Chen, Zhenfang Chen, Haotian Guan, Wei-Ning Lee, Li Erran Li, and Chuang Gan. Sok-bench: A situated video reasoning benchmark with aligned open-world knowledge, 2024.
- [28] Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. Logiqa: a challenge dataset for machine reading comprehension with logical reasoning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI'20, 2021.
- [29] Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. Reclor: A reading comprehension dataset requiring logical reasoning. In International Conference on Learning Representations, 2020.
- [30] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021.
- [31] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 961–970, 2015.
- [32] Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding, 2023.
- [33] Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi. Faith and fate: limits of transformers on compositionality. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS '23, Red Hook, NY, USA, 2023. Curran Associates Inc.
- [34] Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models, 2025.
- [35] Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity, 2025.
- [36] Jackson Petty, Michael Y. Hu, Wentao Wang, Shauli Ravfogel, William Merrill, and Tal Linzen. Relic: Evaluating compositional instruction following via language recognition, 2025.
- [37] S. Bedi, Y. Jiang, P. Chung, S. Koyejo, and N. Shah. Fidelity of medical reasoning in large language models. JAMA Network Open, 8(8):e2526021, 2025.
- [38] Karthik Valmeekam, Kaya Stechly, Atharva Gundawar, and Subbarao Kambhampati. A systematic evaluation of the planning and scheduling abilities of the reasoning model o1. Transactions on Machine Learning Research, 2025.
- [39] P. N. Johnson-Laird. Mental models: towards a cognitive science of language, inference, and consciousness. Harvard University Press, USA, 1986.
- [40] Patrick Blackburn and Johannes Bos. Representation and Inference for Natural Language: A First Course in Computational Semantics. Center for the Study of Language and Information, Stanford, Calif., 2005.
- [41] Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40:e253, 2017.
- [42] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy, July 2019. Association for Computational Linguistics.
- [43] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021.
- [44] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, and Jason Wei. Challenging BIG-bench tasks and whether chain-of-thought can solve them. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, Toronto, Canada, July 2023. Association for Computational Linguistics.
- [45] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021.
- [46] Long Phan, Alice Gatti, Ziwen Han, et al. Humanity's last exam, 2025.
- [47] Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David Ifeoluwa Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Sebastian Ruder, Wei-Yin Ko, Antoine Bosselut, Alice Oh, Andre Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, Marzieh Fadaee, Beyza Ermis, and Sara Hooker. Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18761–18799, Vienna, Austria, July 2025. Association for Computational Linguistics.
- [48] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023.
- [49] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark, 2024.
- [50] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018.
- [51] Omer Goldman, Uri Shaham, Dan Malkin, Sivan Eiger, Avinatan Hassidim, Yossi Matias, Joshua Maynez, Adi Mayrav Gilady, Jason Riesa, Shruti Rijhwani, Laura Rimell, Idan Szpektor, Reut Tsarfaty, and Matan Eyal. Eclektic: a novel challenge set for evaluation of cross-lingual knowledge transfer, 2025.
- [52] Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
- [53] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.
- [54] Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners, 2022.
- [55] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024.
- [56] Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Matej Vrzala, Jaime Sevilla, Qiuyu Ren, Elizabeth Pratt, Lionel Levine, Grant Barkley, Natalie Stewart, Bogdan Grechuk, Tetiana Grechuk, Shreepranav Varma Enugandla, and Mark Wildon. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai, 2024.
- [57] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2024.
- [58] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images, 2016.
- [59] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland, May 2022. Association for Computational Linguistics.
- [60] Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. Docvqa: A dataset for vqa on document images, 2021.
- [61] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read, 2019.
- [62] Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos, 2025.
- [63] Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, Ethan Yeo, Eugenie Lamprecht, Qi Liu, Yuqi Wang, Eric Chen, Deyu Fu, Lei Li, Che Zheng, Cyprien de Masson d'Autume, Dani Yogatama, Mikel Artetxe, and Yi Tay. Vibe-eval: A hard evaluation suite for measuring progress of multimodal language models, 2024.
- [64] Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye Lin, Gyungin Shin, Qiaochu Yang, Anh Totti Nguyen, David I. Atkinson, Aaditya Baranwal, Alexandru Coca, Mikah Dang, Sebastian Dziadzio, Jakob D. Kunz, Kaiqu Liang, Alexander Lo, Brian Pulfer, Steven Walton, Charig Yang, Kai Han, and Samuel Albanie. Zerobench: An impossible visual benchmark for contemporary large multimodal models, 2025.
- [65] Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 113569–113697. Curran Associates, Inc., 2024.
- [66] Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark, 2025.
- [67] Google DeepMind. Gemini robotics: Bringing ai into the physical world, 2025. Accessed: 2025-08-29.
- [68] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024.
- [69] Stanford University and Laude Institute. Terminal-bench: A benchmark for ai agents in terminal environments, 2025. Accessed: 2025-08-29.
- [70] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021.
- [71] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024.
- [72] Aider. o1 tops aider's new polyglot leaderboard, 2024. Accessed: 2025-08-29.
- [73] Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. Swe-lancer: Can frontier llms earn $1 million from real-world freelance software engineering?, 2025.
- [74] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. $\tau$-bench: A benchmark for tool-agent-user interaction in real-world domains, 2024.
- [75] Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. $\tau^{2}$-bench: Evaluating conversational agents in a dual-control environment, 2025.
- [76] Shunyu Yao, Howard Chen, Austin W. Hanjie, Runzhe Yang, and Karthik Narasimhan. Collie: Systematic construction of constrained text generation tasks, 2023.
- [77] Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models, 2024.
- [78] Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldshtein, and Dipanjan Das. The facts grounding leaderboard: Benchmarking llms' ability to ground responses to long-form input, 2025.
- [79] Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents, 2025.
- [80] Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, and Jie Tang. Complexfuncbench: Exploring multi-step and constrained function calling under long-context scenario, 2025.
- [81] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023.
- [82] Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu Xu, Hongjiang Lv, Shruti Bhosale, Chenguang Zhu, Karthik Abinav Sankararaman, Eryk Helenowski, Melanie Kambadur, Aditya Tayade, Hao Ma, Han Fang, and Sinong Wang. Multi-if: Benchmarking llms on multi-turn and multilingual instructions following, 2024.
- [83] Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, SΓ©bastien M. R. Arnold, Vincent Perot, Siddharth Dalmia, Hexiang Hu, Xudong Lin, Panupong Pasupat, Aida Amini, Jeremy R. Cole, Sebastian Riedel, Iftekhar Naim, Ming-Wei Chang, and Kelvin Guu. Can long-context language models subsume retrieval, rag, sql, and more?, 2024.
- [84] Kaustubh Deshpande, Ved Sirdeshmukh, Johannes Baptist Mols, Lifeng Jin, Ed-Yeremai Hernandez-Cardona, Dean Lee, Jeremy Kritz, Willow E. Primack, Summer Yue, and Chen Xing. MultiChallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier LLMs. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 18632–18702, Vienna, Austria, July 2025. Association for Computational Linguistics.
- [85] Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. Healthbench: Evaluating large language models towards improved human health, 2025.
## Appendix A Reasoning Benchmarks
Table 1: Taxonomy of benchmarks used in this study.
| Benchmark | Reasoning Type | Year | Description |
| --- | --- | --- | --- |
| HellaSwag [42] | Commonsense and Logical Reasoning | 2019 | Multiple-choice task: choose the most plausible sentence continuation. |
| MMLU [43] | Reasoning with General Knowledge | 2021 | Multiple-choice task: answer questions across 57 domains to test knowledge and problem-solving. |
| Big-Bench-Hard [44] | Reasoning with General Knowledge | 2023 | Open-generation task: solve difficult BIG-Bench problems testing multi-step reasoning and problem-solving. |
| MMMLU [45] | Reasoning with General Knowledge | 2024 | Multiple-choice task: answer 57 domain questions translated into 14 languages to test multilingual knowledge and problem-solving. |
| Humanityβs Last Exam [46] | Reasoning with General Knowledge | 2025 | Multi-modal task: answer closed-ended questions across many subjects to test verifiable knowledge. |
| Global MMLU (Lite) [47] | Reasoning with General Knowledge | 2025 | Multiple-choice task: answer 42-language questions with culturally sensitive labeling to test equitable multilingual knowledge. |
| GPQA Diamond [48] | Reasoning with General Knowledge | 2023 | Multiple-choice task: answer 448 expert-level science questions in biology, physics, and chemistry that are Google-proof and highly challenging. |
| MMLU Pro [49] | Reasoning with General Knowledge | 2024 | Multiple-choice task: extended from MMLU, answer more challenging reasoning questions with 10 options across diverse domains. |
| ARC (AI2 Reasoning Challenge) [50] | Reading Comprehension and Question Answering | 2018 | Multiple-choice task: answer grade-school science questions requiring advanced knowledge and reasoning beyond simple retrieval. |
| ECLeKTic [51] | Reading Comprehension and Question Answering | 2025 | Closed-book QA task: answer 12-language questions to test cross-lingual knowledge transfer. |
| DROP [52] | Reading Comprehension and Question Answering | 2019 | Open-ended QA task: answer 96k English questions requiring discrete reasoning over paragraph content. |
| GSM8K [53] | Mathematical Reasoning | 2021 | Open-ended QA task: solve grade-school problems requiring multi-step mathematical reasoning. |
| MATH [30] | Mathematical Reasoning | 2021 | Open-ended QA: solve 12,500 challenging competition problems with step-by-step solutions to test advanced mathematical reasoning. |
| MATH 500 [30] | Mathematical Reasoning | 2024 | Open-ended QA: challenging subset of the MATH benchmark. |
| MGSM [54] | Mathematical Reasoning | 2023 | Open-ended QA: solve 250 GSM8K problems translated into 10 languages. |
| MathVista [55] | Mathematical Reasoning | 2024 | Open-ended multimodal QA: solve 6,141 math problems requiring visual and compositional reasoning. |
| AIME 2024 | Mathematical Reasoning | 2024 | Open-ended QA: solve challenging competition-level mathematics problems. |
| AIME 2025 | Mathematical Reasoning | 2025 | Open-ended QA: solve challenging competition-level mathematics problems. |
| FrontierMath [56] | Mathematical Reasoning | 2024 | Open-ended QA: tests advanced mathematical reasoning across diverse and expert-level domains, requiring multi-step problem solving and deep mathematical knowledge. |
| MMMU [57] | Multimodal Reasoning | 2024 | Question answering task: multimodal multiple-choice and open-ended questions across 30 subjects requiring advanced reasoning and domain-specific knowledge. |
| AI2D [58] | Multimodal Reasoning | 2016 | Open-ended QA: multimodal questions with 5,000 diagrams and 15,000 Q&A pairs requiring diagram structure understanding and reasoning. |
| ChartQA [59] | Multimodal Reasoning | 2022 | Open-ended QA: multimodal questions with 32.7K chart-based problems requiring visual and logical reasoning. |
| EgoSchema [32] | Multimodal Reasoning | 2023 | Multiple-choice QA: multimodal questions with 5,000 long-form video clips requiring understanding of human activity and temporal reasoning. |
| DocVQA [60] | Multimodal Reasoning | 2021 | Open-ended QA: multimodal questions with 50,000 document images requiring reading and interpreting document layout and structure. |
| TextVQA [61] | Multimodal Reasoning | 2019 | Open-ended QA: multimodal questions with 45,336 images requiring reading and reasoning about embedded text. |
| VideoMMMU [62] | Multimodal Reasoning | 2025 | Open-ended QA: multimodal questions with 300 expert-level videos and 900 Q&A pairs assessing knowledge acquisition through perception, comprehension, and adaptation. |
| Vibe-Eval [63] | Multimodal Reasoning | 2024 | Open-ended QA: multimodal questions, testing visual understanding and multimodal chat capabilities. |
| ZeroBench [64] | Multimodal Reasoning | 2025 | Open-ended QA: multimodal questions with 434 visual reasoning problems designed to be impossible for current LMMs. |
| CharXiv [65] | Multimodal Reasoning | 2024 | Open-ended QA: multimodal questions with 2,323 charts requiring descriptive analysis and complex reasoning. |
| MMMU Pro [66] | Multimodal Reasoning | 2025 | QA task: multimodal multiple-choice and open-ended questions, extended from MMMU, testing integrated visual and textual reasoning. |
| ActivityNet [31] | Multimodal Reasoning | 2015 | Multiple-choice and open-ended QA: evaluates recognition and understanding of complex human activities in untrimmed videos, testing visual perception and temporal reasoning. |
| ERQA [67] | Multimodal Reasoning | 2025 | Multiple-choice QA: evaluates embodied reasoning and spatial understanding in real-world scenarios, requiring models to integrate text and visual inputs to select the correct answer. |
| SWE-bench Verified [68] | Programming and Coding | 2024 | Open-ended QA: answer 2,294 software engineering problems requiring multi-file code edits and complex reasoning. |
| Terminal-bench [69] | Programming and Coding | 2025 | Open-ended QA: answer complex tasks in terminal environments using text-based commands and reasoning. |
| HumanEval [70] | Programming and Coding | 2021 | Open-ended QA: answer Python programming problems from docstrings requiring functional code synthesis. |
| LiveCodeBench [71] | Programming and Coding | 2025 | Open-ended QA: answer 600+ coding problems from contests, testing generation, self-repair, execution, and test prediction. |
| Aider Polyglot [72] | Programming and Coding | 2024 | Open-ended QA: answer 225 difficult coding problems in C++, Go, Java, JavaScript, Python, and Rust. |
| SWE-Lancer [73] | Programming and Coding | 2025 | Open-ended QA: answer 1,400 freelance software engineering tasks, including implementation and managerial decisions, with real-world evaluation. |
| SWE-Lancer Diamond [73] | Programming and Coding | 2025 | Open-ended QA: answer tasks from the public SWE-Lancer Diamond split, including implementation and managerial software engineering problems. |
| TAU-bench [74] | Tool Use – LLM | 2024 | Open-ended QA: tests reasoning, consistency, and rule-following in dynamic, tool-assisted human-agent interactions. |
| TAU2-bench [75] | Tool Use – LLM | 2025 | Open-ended QA: tests multi-turn reasoning, coordination, and communication in dual-control environments where both agent and user act with tools. |
| COLLIE [76] | Constrained Text Generation – LLM | 2023 | Open-ended QA: answer 2,080 prompts requiring constrained text generation with compositional, grammar-based, and reasoning challenges. |
| SimpleQA [77] | Factuality – LLM | 2024 | Factual QA benchmark designed to test factual accuracy and knowledge calibration. |
| FACTS Grounding [78] | Factuality – LLM | 2024 | Open-ended QA: answer questions requiring LLMs to generate factually accurate and well-grounded responses from provided source material. |
| BrowseComp [79] | Factuality – LLM | 2025 | Open-ended QA: answer 1,266 questions by persistently navigating the internet to find hard-to-locate information. |
| ComplexFuncBench [80] | Tool Use – LLM | 2025 | Open-ended QA: answer complex function-calling tasks in five real-world scenarios requiring multi-step reasoning, parameter management, and long-context handling. |
| IFEval [81] | Instruction Following – LLM | 2023 | Open-ended QA: answer 500 prompts requiring LLMs to follow verifiable natural language instructions. |
| Multi-IF [82] | Instruction Following – LLM | 2024 | Open-ended QA: answer 4,501 multilingual multi-turn prompts requiring accurate instruction-following across languages and conversation turns. |
| LOFT [83] | Long-Context – LLM | 2024 | Open-ended QA: answer real-world tasks requiring reasoning and in-context retrieval over millions of tokens. |
| Graphwalks [14] | Long-Context – LLM | 2025 | Open-ended QA: perform multi-hop reasoning across a graph of millions of tokens to answer questions requiring breadth-first traversal. |
| MultiChallenge [84] | Multi-turn Conversation – LLM | 2025 | Open-ended QA: answer multi-turn conversation prompts requiring instruction-following, context management, and in-context reasoning. |
| HealthBench [85] | Safety – LLM | 2025 | Open-ended QA: evaluates LLMs on multi-turn healthcare conversations, requiring factual reasoning, safety awareness, and context-sensitive decision-making across diverse medical contexts. |
## Appendix B Performance of Models
<details>
<summary>figures/claude_2_plots/claude_performance_Commonsense_and_Logical_Reasoning.png Details</summary>

Line chart (X-axis: Model Number, 1–10; Y-axis: Score (%), 86–96). A single blue line labeled HellaSwag rises from 86% (model 1) to 89% (model 2) and 95% (model 3).
</details>
(a) Commonsense and Logical Reasoning
<details>
<summary>figures/claude_2_plots/claude_performance_Mathematical_Reasoning.png Details</summary>

Line chart (X-axis: Model Number, 1–10; Y-axis: Score (%), 20–100) with five series: GSM8K rises from 89% to about 96% and plateaus; MGSM climbs from 75% to 92% with a dip at model 4; MATH accelerates from 39% to 79%; MATH 500 stays in the 81–83% range; AIME 2024 jumps from 15% (model 1) to 90% (model 5) before dropping to 78%.
</details>
(b) Mathematical Reasoning
<details>
<summary>figures/claude_2_plots/claude_performance_Multimodal_Reasoning.png Details</summary>

Line chart of Score (%) against Model Number for DocVQA, AI2D, ChartQA, and MMMU.
</details>
(c) Multimodal Reasoning
<details>
<summary>figures/claude_2_plots/claude_performance_Programming_and_Coding.png Details</summary>

Line chart of Score (%) against Model Number for HumanEval, SWE-bench Verified, and Terminal-bench.
</details>
(d) Programming and Coding
<details>
<summary>figures/claude_2_plots/claude_performance_Reading_Comprehension_and_Question_Answering.png Details</summary>

Line chart of Score (%) against Model Number for ARC (AI2 Reasoning Challenge) and DROP.
</details>
(e) Reading Comprehension and QA
<details>
<summary>figures/claude_2_plots/claude_performance_Reasoning_with_General_Knowledge.png Details</summary>

Line chart of Score (%) against Model Number for Big-Bench-Hard, MMLU, MMLU Pro, and MMMU.
</details>
(f) Reasoning with General Knowledge
<details>
<summary>figures/claude_2_plots/claude_performance_LLM_Benchmarks_Combined.png Details</summary>

Line chart of Score (%) against Model Number for IFEval, TAU-bench Retail, and TAU-bench Airline.
</details>
(g) LLM Benchmarks
Figure 3: Performance of the Claude family on reasoning benchmarks by category. Model numbers and corresponding names are as follows: 1 – Claude 3 Haiku; 2 – Claude 3 Sonnet; 3 – Claude 3 Opus; 4 – Claude 3.5 Haiku; 5 – Claude 3.5 Sonnet; 6 – Claude 3.7 Sonnet; 7 – Claude 3.7 Sonnet (64K Extended Thinking); 8 – Claude Sonnet 4; 9 – Claude Opus 4; 10 – Claude Opus 4.1.
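The panels in Figures 3–5 all share one layout: benchmark score (%) on the y-axis against an ordinal model number on the x-axis, one line per benchmark, with gaps wherever a score was never officially reported. As a minimal sketch of how such a panel can be regenerated from a compiled score table (assuming matplotlib; the inlined scores are read off Figure 3(b), and all names here are illustrative rather than the authors' released code):

```python
# A minimal sketch (not the authors' released code) of one panel in the
# style of Figure 3: benchmark score (%) against ordinal model number.
import math
import matplotlib.pyplot as plt

# Scores read off Figure 3(b); None marks a score that was never
# reported for that model in the official material.
scores = {
    "GSM8K": [89, 92, 95, 95, 96, 96, 96, 96, 96, 96],
    "MGSM":  [75, 83, 91, 86, 92, None, None, None, None, None],
    "MATH":  [39, 43, 60, 70, 78, 79, None, None, None, None],
}

model_numbers = list(range(1, 11))  # 1 = Claude 3 Haiku, ..., 10 = Claude Opus 4.1
fig, ax = plt.subplots()
for name, values in scores.items():
    # NaN breaks the line, so unreported scores show up as gaps
    # rather than being interpolated over.
    ys = [math.nan if v is None else v for v in values]
    ax.plot(model_numbers, ys, marker="o", label=name)

ax.set_xlabel("Model Number")
ax.set_ylabel("Score (%)")
ax.set_xticks(model_numbers)
ax.legend(loc="lower right")
fig.savefig("claude_performance_Mathematical_Reasoning.png", dpi=150)
```

Mapping unreported scores to NaN, rather than imputing or dropping them, keeps the reporting gaps visible; those gaps matter in the panels above, since vendors often stop reporting a benchmark once it saturates.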
<details>
<summary>figures/gemini_2_plots/gemini_performance_Commonsense_and_Logical_Reasoning.png Details</summary>

Line chart of Score (%) against Model Number; a single HellaSwag series fluctuating between 85% and 93%, peaking at model 4.
</details>
(a) Commonsense and Logical Reasoning
<details>
<summary>figures/gemini_2_plots/gemini_performance_Mathematical_Reasoning.png Details</summary>

Line chart of Score (%) against Model Number for GSM8K, MGSM, MATH, MathVista, and AIME 2025, with a single AIME 2024 point at model 8.
</details>
(b) Mathematical Reasoning
<details>
<summary>figures/gemini_2_plots/gemini_performance_Multimodal_Reasoning.png Details</summary>

Line chart of Score (%) against Model Number for AI2D, DocVQA, ChartQA, TextVQA, EgoSchema, VideoMMMU, MMMU, Vibe-Eval (Reka), and ZeroBench.
</details>
(c) Multimodal Reasoning
<details>
<summary>figures/gemini_2_plots/gemini_performance_Programming_and_Coding.png Details</summary>

Line chart of Score (%) against Model Number for HumanEval, SWE-bench Verified (M and S), LiveCodeBench, and Aider Polyglot.
</details>
(d) Programming and Coding
<details>
<summary>figures/gemini_2_plots/gemini_performance_Reading_Comprehension_and_Question_Answering.png Details</summary>

Line chart of Score (%) against Model Number for DROP and ECLeKTic.
</details>
(e) Reading Comprehension and QA
<details>
<summary>figures/gemini_2_plots/gemini_performance_Reasoning_with_General_Knowledge.png Details</summary>

Line chart of Score (%) against Model Number for Big-Bench-Hard, MMLU, Global MMLU (Lite), GPQA Diamond, and Humanity's Last Exam.
</details>
(f) Reasoning with General Knowledge
<details>
<summary>figures/gemini_2_plots/gemini_performance_LLM_Benchmarks_Combined.png Details</summary>

Line chart of Score (%) against Model Number for LOFT (hard retrieval) &lt;128K, LOFT (hard retrieval) 1M, and SimpleQA.
</details>
(g) LLM Benchmarks
Figure 4: Performance of the Gemini family on reasoning benchmarks by category. Model numbers and corresponding names are as follows: 1 – Gemini Ultra; 2 – Gemini Pro; 3 – Gemini 1.5 Flash; 4 – Gemini 1.5 Pro; 5 – Gemini 2.0 Flash-Lite; 6 – Gemini 2.0 Flash; 7 – Gemini 2.5 Flash; 8 – Gemini 2.5 Pro; 9 – Gemini 2.5 Flash Lite (no thinking); 10 – Gemini 2.5 Flash Lite (thinking).
<details>
<summary>figures/gpt_2_plots/gpt_performance_Mathematical_Reasoning.png Details</summary>

Line chart of Score (%) against Model Number (1–22) for MGSM, MATH, MATH-500, MathVista, and AIME 2025.
</details>
(a) Mathematical Reasoning
<details>
<summary>figures/gpt_2_plots/gpt_performance_Multimodal_Reasoning.png Details</summary>

Line chart of Score (%) against Model Number (1–22) for AI2D, DocVQA, ChartQA, ActivityNet, EgoSchema, CharXiv-D, VideoMMMU, MMMU, CharXiv-R, MMMU Pro, and ERQA.
</details>
(b) Multimodal Reasoning
<details>
<summary>figures/gpt_2_plots/gpt_performance_Programming_and_Coding.png Details</summary>

Line chart of Score (%) against Model Number (1–22) for HumanEval, Aider Polyglot (whole), and SWE-bench Verified.
</details>
(c) Programming and Coding
<details>
<summary>figures/gpt_2_plots/gpt_performance_Reading_Comprehension_and_Question_Answering.png Details</summary>

Line chart of Score (%) against Model Number (1–22); a single DROP series with scores reported only for models 1–5.
</details>
(d) Reading Comprehension and QA
<details>
<summary>figures/gpt_2_plots/gpt_performance_Reasoning_with_General_Knowledge.png Details</summary>

### Visual Description
# Model Performance Comparison
## Chart Components
- **Title**: Model Performance Comparison
- **X-Axis**: Model Number (1β22)
- **Y-Axis**: Score (%) (0β100)
- **Legend**: Located in the top-right corner
- **Red squares**: GPQA Diamond
- **Blue circles**: MMLU
- **Cyan diamonds**: Humanity's Last Exam
## Data Series Analysis
### GPQA Diamond (Red Squares)
- **Trend**: Steady upward trajectory with minor fluctuations
- **Key Points**:
- Model 1: 30%
- Model 2: 35%
- Model 3: 48%
- Model 4: 40%
- Model 5: 70%
- Model 6: 78%
- Model 7: 60%
- Model 8: 78%
- Model 9: 79%
- Model 10: 50%
- Model 11: 65%
- Model 12: 67%
- Model 13: 72%
- Model 14: 80%
- Model 15: 81%
- Model 16: 83%
- Model 17: 84%
- Model 18: 81%
- Model 19: 85%
- Model 20: 87%
- Model 21: 89%
- Model 22: 90%
### MMLU (Blue Circles)
- **Trend**: Initial peak followed by stabilization and slight decline
- **Key Points**:
- Model 1: 70%
- Model 2: 86%
- Model 3: 86%
- Model 4: 82%
- Model 5: 88%
- Model 6: 92%
- Model 7: 85%
- Model 8: 92%
- Model 9: 79%
- Model 10: 80%
- Model 11: 88%
- Model 12: 90%
- Model 13: 86%
- Model 14: 87%
- Model 15: 88%
- Model 16: 89%
- Model 17: 90%
- Model 18: 81%
- Model 19: 85%
- Model 20: 90%
- Model 21: 88%
- Model 22: 89%
### Humanity's Last Exam (Cyan Diamonds)
- **Trend**: Gradual rise with a sharp late-stage increase, a dip at Model 21, and a final recovery
- **Key Points**:
- Model 1: 10%
- Model 2: 12%
- Model 3: 14%
- Model 4: 16%
- Model 5: 18%
- Model 6: 20%
- Model 7: 22%
- Model 8: 25%
- Model 9: 19%
- Model 10: 21%
- Model 11: 23%
- Model 12: 25%
- Model 13: 27%
- Model 14: 29%
- Model 15: 31%
- Model 16: 35%
- Model 17: 30%
- Model 18: 32%
- Model 19: 35%
- Model 20: 41%
- Model 21: 36%
- Model 22: 42%
## Spatial Grounding
- **Legend Position**: Top-right corner
- **Axis Markers**:
- X-axis: Incremented by 1 (Model Numbers)
- Y-axis: Incremented by 20 (0, 20, 40, 60, 80, 100)
## Source
- **Footer Text**: AI Model Performance Analysis, 2023
## Validation
- All legend colors match corresponding data series:
- Red squares (GPQA Diamond) consistently represent red data points
- Blue circles (MMLU) consistently represent blue data points
- Cyan diamonds (Humanity's Last Exam) consistently represent cyan data points
- Trend descriptions align with visual patterns:
- GPQA Diamond shows overall upward movement
- MMLU exhibits early peaks and late stabilization
- Humanity's Last Exam demonstrates gradual growth with late acceleration
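For reference, a minimal matplotlib sketch that reproduces the layout this description reports; the series values are the approximate chart-read estimates listed above, not official leaderboard numbers:

```python
# Minimal sketch: re-plot the three described series from the
# approximate chart-read values (not official scores).
import matplotlib.pyplot as plt

models = list(range(1, 23))
gpqa = [30, 35, 48, 40, 70, 78, 60, 78, 79, 50, 65,
        67, 72, 80, 81, 83, 84, 81, 85, 87, 89, 90]
mmlu = [70, 86, 86, 82, 88, 92, 85, 92, 79, 80, 88,
        90, 86, 87, 88, 89, 90, 81, 85, 90, 88, 89]
hle = [10, 12, 14, 16, 18, 20, 22, 25, 19, 21, 23,
       25, 27, 29, 31, 35, 30, 32, 35, 41, 36, 42]

plt.plot(models, gpqa, "rs-", label="GPQA Diamond")         # red squares
plt.plot(models, mmlu, "bo-", label="MMLU")                 # blue circles
plt.plot(models, hle, "cD-", label="Humanity's Last Exam")  # cyan diamonds
plt.xlabel("Model Number")
plt.ylabel("Score (%)")
plt.ylim(0, 100)
plt.legend(loc="upper right")
plt.show()
```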
</details>
(e) Reasoning with General Knowledge
Figure 5: Performance of the GPT family on general reasoning benchmarks. Model numbers and corresponding names are as follows: 1 → GPT-3.5; 2 → GPT-4; 3 → GPT-4 Turbo; 4 → GPT-4o mini; 5 → GPT-4o; 6 → o1-preview; 7 → o1-mini; 8 → o1; 9 → o1-pro; 10 → GPT-4.1 nano; 11 → GPT-4.1 mini; 12 → GPT-4.1; 13 → GPT-4.5; 14 → o3-mini; 15 → o4-mini; 16 → o3; 17 → o3-pro; 18 → gpt-oss-120b; 19 → GPT-5 with Deep Research; 20 → ChatGPT Agent; 21 → GPT-5; 22 → GPT-5 Pro.
<details>
<summary>figures/gpt_2_plots/gpt_performance_Constrained_Text_Generation_-_LLM.png Details</summary>

### Visual Description
# Technical Document Extraction: Line Chart Analysis
## 1. Chart Identification
- **Title**: COLLIE (Blue text, top-right corner)
- **Type**: Line chart with single data series
- **Legend**:
- Label: "COLLIE" (Blue)
- Position: Top-right corner
- Color match: Confirmed (Blue line matches legend)
## 2. Axis Labels & Scales
- **X-axis (Horizontal)**:
- Label: "Model Number"
- Range: 1–22 (Integer increments)
- Notable: Data begins at Model 4
- **Y-axis (Vertical)**:
- Label: "Score (%)"
- Range: 40–100 (Integer increments)
- Notable: Data points cluster between 42–99%
## 3. Data Points & Coordinates
| Model Number | Score (%) | Spatial Grounding (x,y) |
|--------------|-----------|-------------------------|
| 4 | 52 | (4, 52) |
| 5 | 61 | (5, 61) |
| 8 | 95 | (8, 95) |
| 10 | 42 | (10, 42) |
| 11 | 55 | (11, 55) |
| 12 | 66 | (12, 66) |
| 13 | 72 | (13, 72) |
| 14 | 99 | (14, 99) |
| 15 | 99 | (15, 99) |
| 16 | 98 | (16, 98) |
| 21 | 99 | (21, 99) |
## 4. Trend Analysis
- **Initial Growth Phase**:
- Model 4 (52%) → Model 5 (61%): +9% increase
- Model 5 (61%) → Model 8 (95%): +34% increase
- **Sharp Decline**:
- Model 8 (95%) → Model 10 (42%): -53% drop
- **Recovery Phase**:
- Model 10 (42%) → Model 13 (72%): +30% increase
- Model 13 (72%) → Model 14 (99%): +27% increase
- **Stabilization**:
- Models 14–21: Maintain 98–99% range (flat trend)
## 5. Key Observations
1. **Data Gaps**: No recorded values for Models 1–3, 6–7, 9, 17–20 (see the sketch after this list)
2. **Extreme Values**:
- Minimum: 42% (Model 10)
- Maximum: 99% (Models 14, 21)
3. **Pattern**:
- U-shaped dip between Models 8–13
- Post-Model 14 plateau at near-maximum scores
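A minimal sketch of how such a sparse series can be handled, assuming the chart-read values above: keying scores by model number leaves untested models absent rather than zero-filled:

```python
# Minimal sketch: a sparse series keyed by model number; models with no
# recorded value (1-3, 6-7, 9, 17-20) are simply absent.
collie = {4: 52, 5: 61, 8: 95, 10: 42, 11: 55, 12: 66,
          13: 72, 14: 99, 15: 99, 16: 98, 21: 99}

lo = min(collie, key=collie.get)  # model with the minimum score (10)
hi = max(collie, key=collie.get)  # first model hitting the maximum (14)
print(f"min {collie[lo]}% at model {lo}; max {collie[hi]}% at model {hi}")

# Score changes between consecutive *observed* models only
observed = sorted(collie)
for a, b in zip(observed, observed[1:]):
    print(f"{a}->{b}: {collie[b] - collie[a]:+d}")
```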
## 6. Technical Validation
- **Legend Consistency**: All data points match "COLLIE" blue series
- **Scale Accuracy**: Y-axis increments align with data point spacing
- **Trend Logic**: Visual slope matches numerical percentage changes
## 7. Missing Elements
- No secondary data series or annotations
- No gridlines visible in source image
- No color legend beyond primary series
## 8. Data Table Reconstruction
| Model Number | Score (%) |
|--------------|-----------|
| 4 | 52 |
| 5 | 61 |
| 8 | 95 |
| 10 | 42 |
| 11 | 55 |
| 12 | 66 |
| 13 | 72 |
| 14 | 99 |
| 15 | 99 |
| 16 | 98 |
| 21 | 99 |
## 9. Final Notes
- Chart emphasizes performance trends across model iterations
- Critical inflection points at Models 8 (peak), 10 (trough), and 14 (recovery)
- Post-Model 14 stability suggests optimized configuration
</details>
(a) Constrained Text Generation
<details>
<summary>figures/gpt_2_plots/gpt_performance_Factuality_-_LLM.png Details</summary>

### Visual Description
# Technical Document Extraction: Line Chart Analysis
## 1. Chart Components & Labels
- **X-Axis**: Labeled "Model Number" with integer markers from 1 to 22 (inclusive).
- **Y-Axis**: Labeled "Score (%)" with percentage markers from 0 to 70 (inclusive).
- **Legend**: Located in the **top-right corner** of the chart.
- **SimpleQA**: Represented by **dark blue** line and markers.
- **BrowseComp**: Represented by **teal** line and markers.
## 2. Data Series & Trends
### SimpleQA (Dark Blue)
- **Data Points**:
- (5, 38), (8, 47), (13, 62), (14, 15), (15, 28), (16, 50), (19, 51), (20, 69), (21, 55).
- **Trend**:
- Initial upward slope from (5, 38) to (13, 62).
- Sharp decline to (14, 15).
- Gradual recovery to (16, 50), followed by a plateau and peak at (20, 69).
- Final decline to (21, 55).
### BrowseComp (Teal)
- **Data Points**:
- (8, 2), (15, 28), (16, 50), (19, 51), (20, 69), (21, 55).
- **Trend**:
- Flat baseline at 2% from (8, 2) to (14, 2).
- Steep rise to (15, 28), followed by incremental increases to (20, 69).
- Post-peak decline to (21, 55).
## 3. Spatial Grounding
- **Legend Position**: Top-right corner (coordinates: [x=16–22, y=65–70] relative to chart bounds).
- **Data Point Alignment**:
- SimpleQA markers align with dark blue line.
- BrowseComp markers align with teal line.
## 4. Key Observations
- **SimpleQA**: Exhibits volatility with a significant drop at Model 14 and a peak at Model 20.
- **BrowseComp**: Shows delayed growth, surpassing SimpleQA after Model 15 and peaking at Model 20.
- **Intersection**: Both series intersect at (20, 69), indicating parity at their highest scores.
## 5. Missing Elements
- No title or subtitle present in the chart.
- No gridlines or annotations beyond axis labels and legend.
## 6. Data Reconstruction (Hypothetical Table)
| Model Number | SimpleQA Score (%) | BrowseComp Score (%) |
|--------------|--------------------|----------------------|
| 5 | 38 | - |
| 8 | 47 | 2 |
| 13 | 62 | - |
| 14 | 15 | - |
| 15 | 28 | 28 |
| 16 | 50 | 50 |
| 19 | 51 | 51 |
| 20 | 69 | 69 |
| 21 | 55 | 55 |
*Note: "-" indicates no data point for that model number in the respective series.*
## 7. Language & Transcription
- **Primary Language**: English.
- **No Additional Languages Detected**.
## 8. Validation Checks
- **Color Consistency**: Confirmed legend colors match line/marker colors.
- **Trend Logic**: Numerical data aligns with visual slope directions (e.g., SimpleQA's drop at Model 14 corresponds to a steep downward slope).
- **Axis Coverage**: All axis markers (1–22 for x, 0–70 for y) are explicitly labeled.
</details>
(b) Factuality
<details>
<summary>figures/gpt_2_plots/gpt_performance_Instruction_Following_-_LLM.png Details</summary>

### Visual Description
# Technical Document Extraction: Chart Analysis
## Chart Overview
The image contains a line chart comparing two performance metrics across model numbers 4-14. Key components:
### Axis Labels
- **X-axis**: Model Number (4-22, though data ends at 14)
- **Y-axis**: Score (%)
- Range: 55% to 95%
- Increment: 5% intervals
### Legend
- **Location**: Top-right corner
- **Entries**:
- `IFEval` (Blue line)
- `Multi-IF` (Teal line)
## Data Series Analysis
### IFEval (Blue Line)
**Trend**: Overall upward trajectory with volatility
**Key Points**:
- Model 4: 78%
- Model 5: 81%
- Model 8: 92% (Peak)
- Model 10: 74% (Trough)
- Model 11: 84%
- Model 12: 87%
- Model 13: 88%
- Model 14: 94% (Final Peak)
### Multi-IF (Teal Line)
**Trend**: Volatile with delayed growth
**Key Points**:
- Model 4: 58%
- Model 5: 61%
- Model 8: 78% (Peak)
- Model 10: 56% (Trough)
- Model 11: 67%
- Model 12: 71%
- Model 13: 71%
- Model 14: 80%
## Spatial Grounding
- Legend position: [x=14, y=94] (Top-right)
- Data point verification:
- Blue markers match IFEval legend
- Teal markers match Multi-IF legend
## Trend Verification
1. **IFEval**:
- Initial rise (4→8): +14%
- Sharp drop (8→10): -18%
- Recovery (10→14): +20%
- Final peak at model 14: 94%
2. **Multi-IF**:
- Early growth (4→8): +20%
- Severe drop (8→10): -22%
- Gradual recovery (10→14): +24%
- Final score at model 14: 80%
## Component Isolation
1. **Header**: Chart title not visible
2. **Main Chart**:
- Two overlaid line series
- Grid background with 5% increments
3. **Footer**: No additional text
## Data Table Reconstruction
| Model | IFEval (%) | Multi-IF (%) |
|-------|------------|--------------|
| 4 | 78 | 58 |
| 5 | 81 | 61 |
| 8 | 92 | 78 |
| 10 | 74 | 56 |
| 11 | 84 | 67 |
| 12 | 87 | 71 |
| 13 | 88 | 71 |
| 14 | 94 | 80 |
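As a sanity check on the trend-verification arithmetic above, a minimal sketch asserting the listed percentage-point changes against the reconstructed table (chart-read values, not official scores):

```python
# Minimal sketch: verify the stated deltas against the table values.
ifeval  = {4: 78, 5: 81, 8: 92, 10: 74, 11: 84, 12: 87, 13: 88, 14: 94}
multiif = {4: 58, 5: 61, 8: 78, 10: 56, 11: 67, 12: 71, 13: 71, 14: 80}

assert ifeval[8] - ifeval[4] == 14    # initial rise (4->8)
assert ifeval[10] - ifeval[8] == -18  # sharp drop (8->10)
assert ifeval[14] - ifeval[10] == 20  # recovery (10->14)

assert multiif[8] - multiif[4] == 20    # early growth (4->8)
assert multiif[10] - multiif[8] == -22  # severe drop (8->10)
assert multiif[14] - multiif[10] == 24  # gradual recovery (10->14)
print("all listed deltas match the reconstructed table")
```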
## Language Analysis
- All text in English
- No non-English content detected
## Critical Observations
1. IFEval consistently outperforms Multi-IF across all models
2. Both metrics show significant volatility between models 8-10
3. Multi-IF demonstrates delayed improvement compared to IFEval
4. Final scores (model 14):
- IFEval: 94%
- Multi-IF: 80%
</details>
(c) Instruction Following
<details>
<summary>figures/gpt_2_plots/gpt_performance_Long-Context_-_LLM.png Details</summary>

### Visual Description
# Technical Document Extraction: Line Chart Analysis
## 1. Axis Labels and Markers
- **X-axis**: Model Number (1–22)
- **Y-axis**: Score (%) (0–70)
- **Legend**: Located at top-right corner
- Blue: Graphwalks parents
- Red: Graphwalks bfs
- Magenta: Graphwalks parents >128000
## 2. Data Series Analysis
### A. Graphwalks parents (Blue Line)
- **Trend**:
- Initial rise from 12% (Model 4) to 51% (Model 8)
- Sharp decline to 9% (Model 10)
- Rapid ascent to 72% (Model 13)
- Post-peak decline to 58% (Model 14)
- **Key Points**:
- Model 4: 12%
- Model 8: 51%
- Model 10: 9%
- Model 13: 72%
- Model 14: 58%
### B. Graphwalks bfs (Red Line)
- **Trend**:
- Initial rise from 29% (Model 4) to 62% (Model 8)
- Sharp decline to 25% (Model 10)
- Rapid ascent to 72% (Model 13)
- Post-peak decline to 51% (Model 14)
- **Key Points**:
- Model 4: 29%
- Model 8: 62%
- Model 10: 25%
- Model 13: 72%
- Model 14: 51%
### C. Graphwalks parents >128000 (Magenta Line)
- **Trend**:
- Initial rise from 3% (Model 10) to 25% (Model 12)
- Sharp decline to 10% (Model 14)
- **Key Points**:
- Model 10: 3%
- Model 12: 25%
- Model 14: 10%
## 3. Annotations
- **Model 14**:
- "Graphwalks parents <128000" (Blue line)
- "Graphwalks bfs <128000" (Red line)
- **Model 12**:
- "Graphwalks parents >128000" (Magenta line)
## 4. Spatial Grounding
- **Legend Position**: Top-right quadrant
- **Data Point Color Verification**:
- Blue points match "Graphwalks parents"
- Red points match "Graphwalks bfs"
- Magenta points match "Graphwalks parents >128000"
## 5. Trend Verification
- **Graphwalks parents**:
- Steep upward slope (Model 4→8)
- Sharp V-shaped dip (Model 8→10)
- Steep upward slope (Model 10→13)
- Gradual decline (Model 13→14)
- **Graphwalks bfs**:
- Steeper initial ascent than Graphwalks parents
- Similar V-shaped dip pattern
- Identical peak at Model 13
- Slightly more pronounced post-peak decline (72% → 51% vs. 72% → 58%)
- **Graphwalks parents >128000**:
- Shallow initial rise
- Steep decline post-peak
## 6. Critical Observations
1. Both Graphwalks parents and bfs achieve identical maximum scores (72%) at Model 13
2. Graphwalks parents >128000 shows significantly lower performance across all models
3. Model 10 represents a performance trough for all data series
4. Post-peak decline (Model 13→14) is more pronounced for Graphwalks bfs (72% → 51%) than for Graphwalks parents (72% → 58%)
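A minimal sketch (assuming the chart-read table values) that locates each series' trough, checking the observation that Model 10 is the low point for all three series:

```python
# Minimal sketch: find each series' trough from the chart-read values.
series = {
    "Graphwalks parents": {4: 12, 8: 51, 10: 9, 12: 58, 13: 72, 14: 58},
    "Graphwalks bfs": {4: 29, 8: 62, 10: 25, 12: 62, 13: 72, 14: 51},
    "Graphwalks parents >128000": {10: 3, 12: 25, 14: 10},
}
for name, pts in series.items():
    trough = min(pts, key=pts.get)  # model number with the lowest score
    print(f"{name}: trough {pts[trough]}% at model {trough}")
```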
## 7. Data Table Reconstruction
| Model | Graphwalks parents | Graphwalks bfs | Graphwalks parents >128000 |
|-------|--------------------|----------------|----------------------------|
| 4 | 12% | 29% | - |
| 8 | 51% | 62% | - |
| 10 | 9% | 25% | 3% |
| 12 | 58% | 62% | 25% |
| 13 | 72% | 72% | - |
| 14 | 58% | 51% | 10% |
## 8. Language Analysis
- **Primary Language**: English
- **No secondary languages detected**
## 9. Structural Components
1. Header: Chart title (implied by axis labels)
2. Main Chart: Three-line plot with annotations
3. Footer: Legend and axis markers
</details>
(d) Long Context
<details>
<summary>figures/gpt_2_plots/gpt_performance_Multi-turn_Conversation_-_LLM.png Details</summary>

### Visual Description
# Technical Document: MultiChallenge Performance Analysis
## Chart Overview
- **Title**: MultiChallenge (Blue text at top right)
- **Type**: Line chart with single data series
- **Visual Style**: Blue line with circular markers
## Axis Details
### X-Axis (Model Number)
- **Label**: "Model Number" (Bold black text at bottom)
- **Range**: 1 to 22 (Integer increments)
- **Tick Marks**: Every 1 unit (Dashed gray lines)
- **Grid Lines**: Vertical dashed lines at each integer
### Y-Axis (Score %)
- **Label**: "Score (%)" (Bold black text at left)
- **Range**: 10% to 70% (10% increments)
- **Tick Marks**: Every 10% (Dashed gray lines)
- **Grid Lines**: Horizontal dashed lines at each 10% interval
## Legend
- **Location**: Top right corner
- **Color**: Blue (Matches line color)
- **Text**: "MultiChallenge" (Same as title)
## Data Points (Model Number → Score %)
1. [4, 20]
2. [5, 40]
3. [8, 45]
4. [10, 15]
5. [11, 35]
6. [12, 38]
7. [13, 43]
8. [14, 40]
9. [15, 43]
10. [16, 60]
11. [21, 70]
## Trend Analysis
1. **Initial Growth**:
- Starts at Model 4 (20%)
- Sharp increase to Model 5 (40%)
- Continues upward to Model 8 (45%)
2. **Significant Dip**:
- Abrupt drop at Model 10 (15%)
- Recovery begins at Model 11 (35%)
3. **Fluctuation Phase**:
- Gradual increase to Model 13 (43%)
- Minor dip at Model 14 (40%)
- Slight recovery at Model 15 (43%)
4. **Steep Ascent**:
- Sharp rise from Model 16 (60%) to Model 21 (70%)
## Spatial Grounding
- **Legend Position**: [x=21, y=70] (Top right corner)
- **Data Point Verification**: All blue markers match legend color
- **Axis Alignment**: All labels and ticks properly aligned with grid
## Critical Observations
1. **Performance Pattern**:
- Non-linear progression with volatility
- Strong correlation between model numbers >15 and score improvement
2. **Anomalies**:
- Model 10 drops 30 points from the previous peak (45% → 15%), roughly a 67% relative decline (see the sketch after this list)
- Model 21 achieves maximum score (70%)
3. **Missing Data**:
- No data points for Models 1-3, 6-7, 9, 17-20
- Potential gaps in model testing sequence
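A minimal sketch distinguishing the absolute (percentage-point) drop from the relative decline at Model 10, using the chart-read values above:

```python
# Minimal sketch: absolute vs. relative drop at the Model 10 trough.
peak, trough = 45, 15           # Model 8 peak and Model 10 trough (Score %)
absolute = peak - trough        # 30 percentage points
relative = absolute / peak      # ~0.67, i.e. roughly a two-thirds decline
print(f"absolute drop: {absolute} pts; relative decline: {relative:.0%}")
```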
## Technical Specifications
- **Coordinate System**: Cartesian (x=Model Number, y=Score %)
- **Scale**: Linear for both axes
- **Data Density**: 11 data points across 22 possible models
- **Visual Emphasis**: Blue color dominates (title, line, legend)
## Recommendations for Further Analysis
1. Investigate cause of Model 10 performance drop
2. Analyze factors contributing to post-Model 15 improvement
3. Consider interpolation for missing data points
4. Compare with baseline performance metrics
</details>
(e) Multi-turn Conversation
<details>
<summary>figures/gpt_2_plots/gpt_performance_Safety_-_LLM.png Details</summary>

### Visual Description
# Technical Document Extraction: Line Chart Analysis
## Chart Overview
The image depicts a **line chart** with two primary data series and a consensus marker, plotted against a grid background. The chart visualizes performance scores across model numbers, with distinct trends for each series.
---
### **Axis Labels and Markers**
- **X-Axis (Horizontal):**
- Title: `Model Number`
- Range: 1 to 22 (discrete intervals)
- Tick Marks: Every integer from 1 to 22.
- **Y-Axis (Vertical):**
- Title: `Score (%)`
- Range: 30% to 90% (increments of 10%)
- Tick Marks: 30, 40, 50, 60, 70, 80, 90.
---
### **Legend**
- **Location:** Top-right corner of the chart.
- **Entries:**
1. **HealthBench** (Blue line with circular markers).
2. **HealthBench Hard** (Brown line with square markers).
---
### **Data Series and Trends**
#### 1. **HealthBench (Blue Line)**
- **Trend:**
- Starts at `(5, 32%)`, slopes upward to `(16, 60%)`, dips slightly to `(18, 58%)`, then rises sharply to `(21, 68%)`.
- **Key Data Points:**
| Model Number | Score (%) |
|--------------|-----------|
| 5 | 32 |
| 16 | 60 |
| 18 | 58 |
| 21 | 68 |
#### 2. **HealthBench Hard (Brown Line)**
- **Trend:**
- Begins at `(16, 32%)`, declines to `(18, 30%)`, then surges to `(21, 46%)`.
- **Key Data Points:**
| Model Number | Score (%) |
|--------------|-----------|
| 16 | 32 |
| 18 | 30 |
| 21 | 46 |
#### 3. **HealthBench Consensus (Annotation)**
- **Location:** `(18, 90%)` (annotated with a cyan arrow).
- **Note:** This is a standalone data point, not part of the plotted lines.
---
### **Spatial Grounding**
- **Legend Position:** Top-right quadrant (coordinates: `[x=18, y=90]` relative to the chart's grid).
- **Color Consistency:**
- Blue markers (`HealthBench`) match the blue line.
- Brown markers (`HealthBench Hard`) match the brown line.
---
### **Additional Observations**
- The chart emphasizes performance divergence between `HealthBench` and `HealthBench Hard` models, with the latter showing a late-stage improvement.
- The `HealthBench Consensus` at `(18, 90%)` suggests a benchmark or target score, though its relationship to the lines is not explicitly defined.
---
### **Conclusion**
The chart provides a comparative analysis of model performance, highlighting trends and a consensus benchmark. All textual and numerical data has been extracted and cross-referenced for accuracy.
</details>
(f) Safety
<details>
<summary>figures/gpt_2_plots/gpt_performance_Tool_Use_-_LLM.png Details</summary>

### Visual Description
# Technical Document Analysis: Line Chart
## Chart Type
- **Line Chart** with multiple data series plotted against a grid background.
## Axes
- **X-Axis (Horizontal)**:
- Label: `Model Number`
- Range: 1 to 22 (discrete ticks at every integer)
- Tick Interval: 1 unit
- **Y-Axis (Vertical)**:
- Label: `Score (%)`
- Range: 0 to 100
- Tick Interval: 20 units
## Legend
- **Location**: Top-right corner of the chart
- **Color-Label Mapping**:
- **Cyan**: `Tau2-bench Telecom`
- **Gold**: `Tau2-bench Retail`
- **Green**: `Tau-bench Retail`
- **Pink**: `Tau2-bench Airline`
- **Blue**: `Tau-bench Airline`
- **Purple**: `ComplexFuncBench` (annotated at Model 14)
## Data Series Analysis
### 1. Tau2-bench Telecom (Cyan)
- **Trend**: Sharp upward trajectory starting at Model 4.
- **Key Points**:
- Model 4: 20%
- Model 20: 100% (peak)
- Final Value (Model 22): 100%
### 2. Tau2-bench Retail (Gold)
- **Trend**: Steady linear increase.
- **Key Points**:
- Model 4: 60%
- Model 20: 80%
- Final Value (Model 22): 80%
### 3. Tau-bench Retail (Green)
- **Trend**: Volatile with peaks and troughs.
- **Key Points**:
- Model 4: 40%
- Model 10: 20% (trough)
- Model 20: 65%
- Final Value (Model 22): 65%
### 4. Tau2-bench Airline (Pink)
- **Trend**: Gradual upward slope with minor fluctuations.
- **Key Points**:
- Model 4: 40%
- Model 20: 65%
- Final Value (Model 22): 60%
### 5. Tau-bench Airline (Blue)
- **Trend**: Initial rise, sharp dip at Model 10, then stabilization.
- **Key Points**:
- Model 4: 20%
- Model 10: 10% (trough)
- Model 20: 50%
- Final Value (Model 22): 50%
### 6. ComplexFuncBench (Purple)
- **Annotation**: Text label at Model 14, Score 20%.
- **Trend**: Not explicitly plotted; only annotated at Model 14.
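A minimal sketch ranking the benchmarks by their final (Model 22) chart-read values; ComplexFuncBench is omitted because only a single annotated point is reported:

```python
# Minimal sketch: rank tool-use benchmarks by final chart-read value.
final = {
    "Tau2-bench Telecom": 100,
    "Tau2-bench Retail": 80,
    "Tau-bench Retail": 65,
    "Tau2-bench Airline": 60,
    "Tau-bench Airline": 50,
}
for name, score in sorted(final.items(), key=lambda kv: -kv[1]):
    print(f"{score:>3}%  {name}")
```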
## Spatial Grounding
- **Legend Position**: Top-right quadrant (coordinates: [x=19, y=95] relative to chart bounds).
- **Data Point Verification**:
- All line colors match legend labels (e.g., cyan = `Tau2-bench Telecom`).
## Notes
- The chart emphasizes performance trends across model iterations for different benchmarks.
- `ComplexFuncBench` is explicitly called out at Model 14 but lacks a full data series.
- No textual data tables or heatmaps present; all information is encoded in line plots.
## Transcribed Text (Non-English)
- No non-English text detected in the image.
</details>
(g) Tool Use
Figure 6: Performance of the GPT family on LLM-specific benchmarks. Model numbers and corresponding names are as follows: 1 → GPT-3.5; 2 → GPT-4; 3 → GPT-4 Turbo; 4 → GPT-4o mini; 5 → GPT-4o; 6 → o1-preview; 7 → o1-mini; 8 → o1; 9 → o1-pro; 10 → GPT-4.1 nano; 11 → GPT-4.1 mini; 12 → GPT-4.1; 13 → GPT-4.5; 14 → o3-mini; 15 → o4-mini; 16 → o3; 17 → o3-pro; 18 → gpt-oss-120b; 19 → GPT-5 with Deep Research; 20 → ChatGPT Agent; 21 → GPT-5; 22 → GPT-5 Pro.