# The Ouroboros of Benchmarking: Reasoning Evaluation in an Era of Saturation
**Authors**:
- İbrahim Ethem Deveci (Department of Cognitive Science, Ankara, Turkey)
- Duygu Ataman (Department of Cognitive Science, Ankara, Turkey)
Abstract
The rapid rise of Large Language Models (LLMs) and Large Reasoning Models (LRMs) has been accompanied by an equally rapid proliferation of benchmarks used to assess them. However, because of improved model competence resulting from scaling and novel training advances, and because many of these datasets are likely included in pre- or post-training data, results quickly saturate, driving a continuous need for new and more challenging replacements. In this paper, we ask whether surpassing a benchmark truly demonstrates reasoning ability, or whether we are simply tracking numbers divorced from the capabilities we claim to measure. We present an investigation focused on three model families, OpenAI, Anthropic, and Google, and how their reasoning capabilities across different benchmarks evolve over the years. We also analyze performance trends over the years across different reasoning tasks and discuss the current state of benchmarking and its remaining challenges. By offering a comprehensive overview of benchmarks and reasoning tasks, our work aims to serve as a first reference to ground future research in reasoning evaluation and model development.
1 Introduction
Benchmarks have long played a central role in evaluating and comparing machine learning models [1]. As models scale up in size and capability, particularly Large Language Models (LLMs) and the specialized Large Reasoning Models (LRMs), many benchmarks quickly saturate, often reaching or surpassing human-level performance. Whether this saturation is driven primarily by improved model capability or by dataset contamination is generally unknown. Nevertheless, this quick saturation forces the development of new and more challenging benchmarks that can be used to compare new model families. In this paper, we investigate two key research questions: how effective are current benchmarks at measuring model capabilities, and does surpassing a benchmark reliably indicate genuine reasoning?
To examine these questions, we select three model families, OpenAI, Anthropic, and Google, and compile performance data from official sources [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]. We gather a comprehensive list of 52 benchmarks used in evaluating these models and classify them according to the types of reasoning they aim to evaluate. Analyzing performance trends over the years, we highlight where models improve, where they struggle, and what these trends reveal about the current state of benchmarking. Finally, we discuss the implications of the saturation cycle and emphasize the need for improved evaluation practices that more accurately capture model capabilities.
Our contributions are threefold: (1) we provide a curated list of reasoning benchmarks, classified by the types of reasoning they aim to assess; (2) we analyze performance trends over the years to assess benchmarking effectiveness; (3) we examine the current landscape of existing benchmarks, identifying which have reached high performance thresholds and which remain unsolved.
By situating our analysis within the broader evaluation landscape, our work collects evidence to emphasize the need for reasoning tasks that are more representative of the nature of the reasoning process and that target evaluation beyond downstream accuracy.
2 Benchmark Landscape and Categorization
To analyze how the creation and adoption of reasoning benchmarks have evolved over time, we examine three model families and compile the set of benchmarks employed to evaluate them, aiming to provide a comprehensive overview of current benchmarking practices. The complete list of benchmarks, their assigned reasoning types, and short summaries can be found in Appendix A. To facilitate analysis, we categorize benchmarks into seven reasoning types: commonsense and logical reasoning, mathematical reasoning, multimodal reasoning, programming and coding, reading comprehension and question answering, reasoning with general knowledge, and LLM-specific capabilities such as safety, tool use, and instruction following. Figure 1 illustrates a marked increase in benchmark adoption for multimodal reasoning, mathematical reasoning, programming, reasoning with general knowledge, and LLM-specific benchmarks after 2023. In contrast, no new benchmarks in reading comprehension or commonsense reasoning were adopted by these model families during this period. While the literature contains several other benchmarks in these areas [23, 24, 25, 26, 27, 28, 29], our analysis shows they have not been utilized by any of the prominent model families. This likely reflects the evolving understanding of what constitutes reasoning in computational models, in accordance with their current capabilities and with what the community deems important to evaluate. Since most models now have direct commercial applications, performance in more applicable domains, such as coding and tool use, may also motivate evaluation in certain categories of reasoning tasks.
<details>
<summary>figures/benchmarks_by_year.png Details</summary>

Line chart of the number of benchmarks per reasoning type, 2015–2025. Commonsense and logical reasoning (one benchmark since 2018) and reading comprehension and question answering (two benchmarks since 2018) remain flat. Multimodal reasoning rises steadily from 1 in 2015 to 13 in 2025, while LLM-specific benchmarks (0 in 2022 to 13 in 2025), mathematical reasoning (to 8), programming and coding (to 7), and reasoning with general knowledge (to 7) all increase sharply after 2023.
</details>
Figure 1: Number of benchmarks in different reasoning types over time.
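To make the taxonomy concrete, the following minimal sketch shows one way to represent benchmark records and tally adoption per reasoning type and year, the quantity plotted in Figure 1. The three `Benchmark` entries are an illustrative subset, not the full list of 52 from Appendix A.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Benchmark:
    name: str
    reasoning_type: str  # one of the seven categories used in this paper
    year: int            # year the benchmark was introduced

# Illustrative subset of the 52 benchmarks catalogued in Appendix A.
BENCHMARKS = [
    Benchmark("HellaSwag", "Commonsense and Logical Reasoning", 2019),
    Benchmark("GSM8K", "Mathematical Reasoning", 2021),
    Benchmark("SWE-bench Verified", "Programming and Coding", 2024),
]

# Count benchmarks per (reasoning type, year), as shown in Figure 1.
adoption = Counter((b.reasoning_type, b.year) for b in BENCHMARKS)
for (rtype, year), n in sorted(adoption.items()):
    print(f"{year}  {rtype}: {n}")
```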
3 Performance Trends Across Models
Across all three model families there is a consistent effort to develop newer models or architectural improvements to achieve higher benchmark performance. However, comparing performance across families is challenging, as each family often employs different benchmarks, and even within a single family, benchmarks used can vary between model iterations. This variation appears to stem from two main factors: first, certain benchmarks reach saturation due to high performance; second, benchmark updates or more challenging subsets are introduced, such as the transition from MATH to MATH-500 [30].
We observe a recurring pattern: once a model family achieves high performance on a particular benchmark, subsequent models tend to use that benchmark less frequently or discontinue its use entirely. This reflects both practical and conceptual considerations: benchmarks that no longer discriminate between models provide limited evaluative value, and benchmark selection increasingly reflects the evolving understanding of which reasoning tasks remain challenging for current architectures.
Interestingly, performance trends reveal consistent directional correlations across benchmarks within the same reasoning type. For example, when a model demonstrates improved performance on one benchmark, it generally shows corresponding improvements on other benchmarks of the same type, while lower performance on one benchmark tends to coincide with lower performance on others. Nevertheless, the magnitude of improvement differs across benchmarks, potentially due to variations in problem complexity and the scaling limitations evident in smaller models, as seen within the OpenAI family. This pattern suggests that benchmarks within a reasoning type often capture overlapping aspects of reasoning, so that advances in a model's capabilities tend to propagate across related tasks. At the same time, variations in the magnitude of performance gains provide insight into the relative difficulty of different benchmarks within the same reasoning type. Detailed plots illustrating performance changes within model families for different reasoning types are provided in Appendix B.
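A minimal sketch of the kind of check underlying this observation: correlating the score trajectories of two benchmarks across successive models of a family. The scores below are hypothetical placeholders, not values reported by any vendor.

```python
from statistics import correlation  # Python 3.10+

# Hypothetical scores (%) for five successive models of one family on two
# mathematical-reasoning benchmarks; real values come from vendor reports.
gsm8k = [89.0, 92.0, 95.0, 96.4, 96.4]
math_scores = [39.0, 43.0, 60.1, 69.3, 78.0]

# A strongly positive Pearson correlation is what "directional agreement
# within a reasoning type" looks like in the data.
print(f"Pearson r = {correlation(gsm8k, math_scores):.2f}")
```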
Finally, we note that newer models generally achieve higher performance on previously low-scoring benchmarks. However, the limited overlap of common benchmarks across model families complicates cross-family comparisons. This raises a critical question: if benchmarks are intended to evaluate and compare model capabilities, why are they not consistently adopted or reported across families? If benchmarks are intended to provide a shared measure of capability, their fragmented and selective use undermines that goal and exemplifies the need for more standardized, representative, and domain-informed evaluation frameworks.
4 Performance of Models within Benchmarks
We collect all reported model performances across benchmarks and analyze saturation, defining a benchmark as saturated once at least one model achieves 80% accuracy on it. Out of the full set of 52 benchmarks, we find that 27 surpass this threshold in at least one model family, while 25 never reach it. The majority of “solved” benchmarks belong to commonsense and logical reasoning, mathematical reasoning, reasoning with general knowledge, and reading comprehension and question answering. By contrast, benchmarks targeting LLM-specific capabilities and programming and coding remain comparatively difficult, with few instances of performance above 80%.
We then examine the release years of benchmarks that never surpass the 80% threshold. The distribution is striking: 60% of unsolved benchmarks were introduced in 2025, 32% in 2024, and only two benchmarks released before 2024 remain unsolved: ActivityNet [31] (2015) and EgoSchema [32] (2023), both multimodal reasoning benchmarks. This distribution suggests a clear trend. Nearly all benchmarks released prior to 2024 have already been surpassed by at least one model family, indicating rapid saturation. By contrast, the benchmarks still below the threshold overwhelmingly correspond to the most recently introduced evaluation tasks.
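The saturation analysis described above can be sketched as follows, assuming per-benchmark best scores have already been aggregated across the three families. The `records` entries are hypothetical placeholders, not the values underlying Figure 2.

```python
from collections import Counter

THRESHOLD = 80.0  # a benchmark counts as saturated at >= 80% accuracy

# (benchmark, release year, best score across the three families) -- all
# placeholder values for illustration, not the numbers reported in this paper.
records = [
    ("GSM8K", 2021, 96.0),
    ("ActivityNet", 2015, 67.0),
    ("EgoSchema", 2023, 72.0),
    ("ZeroBench", 2025, 4.0),
]

solved = [name for name, _, score in records if score >= THRESHOLD]
unsolved = [(name, year) for name, year, score in records if score < THRESHOLD]

# Release-year distribution of still-unsolved benchmarks (cf. Figure 2b).
year_dist = Counter(year for _, year in unsolved)
print(f"solved: {len(solved)}, unsolved: {len(unsolved)}")
for year, n in sorted(year_dist.items()):
    print(f"  {year}: {n} unsolved")
```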
<details>
<summary>figures/stacked_bar_saturation.png Details</summary>

Horizontal stacked bar chart of the percentage of saturated versus not-saturated benchmarks per reasoning type: commonsense and logical reasoning 100.0% (1/1) saturated; mathematical reasoning 87.5% (7/8); reasoning with general knowledge 71.4% (5/7); reading comprehension and question answering 66.7% (2/3); multimodal reasoning 46.2% (6/13); programming and coding 33.3% (3/9); LLM-specific 23.1% (3/13).
</details>
(a) Distribution of benchmarks that models surpassed 80% threshold and those not yet surpassed, grouped by reasoning type.
<details>
<summary>figures/pie_saturation_by_year.png Details</summary>

Two pie charts of benchmark release years. Left (surpassed, 27 benchmarks): 2016 (1), 2018 (1), 2019 (3), 2021 (5), 2022 (1), 2023 (5), 2024 (8), 2025 (3). Right (unsolved, 25 benchmarks): 2015 (1), 2023 (1), 2024 (8), 2025 (15).
</details>
(b) Release years of benchmarks relative to the 80% threshold: left pie shows surpassed benchmarks, right pie shows unsolved benchmarks.
Figure 2: Benchmark saturation dynamics.
This temporal pattern highlights the central dynamic of the saturation cycle: older benchmarks are rapidly mastered and lose discriminative power, while newly introduced benchmarks become the standards for demonstrating progress. That nearly all unsolved benchmarks are recent underscores both the accelerating pace of benchmark creation and the difficulty of maintaining evaluations that remain challenging over time. Yet this difficulty seems only temporary: it is highly plausible that within one or two years many of these currently unsolved benchmarks will also be surpassed, at which point model families will shift to alternative or newly designed evaluations to preserve differentiation. Crucially, this pattern reflects the fact that performance gains are often specific to individual benchmarks rather than to the broader reasoning type they are intended to assess. As the analyses indicate, while models often perform consistently and even strongly on benchmarks within a domain, the introduction of a more challenging, novel benchmark frequently leads to a drop in performance. This drop may arise from the increased difficulty of the new benchmark, or from contamination that inflated performance on earlier benchmarks without truly reflecting generalizable reasoning ability. It raises the question of whether what appears as “reasoning ability” is often tied more to benchmark design and prior exposure than to robust mastery of the reasoning type itself, casting doubt on the long-term evaluative value of benchmarks.
5 Discussion: Limitations of Current Benchmarking
Our analysis of three model families demonstrates that benchmark performance has generally increased over time, with newer models achieving higher scores across most reasoning types and benchmarks. However, given that many benchmarks have already been surpassed with high accuracy, we would like to highlight a question originally posed in [25] regarding commonsense reasoning, reframed here for reasoning in general: Have neural language models successfully acquired reasoning, or are we overestimating the true capabilities of machine reasoning? Several studies in the literature show that these models still perform poorly when required to generalize to longer contexts or handle tasks requiring inductive and compositional reasoning [33, 34, 35, 36, 37, 38]. This discrepancy suggests a limitation of current benchmarking practices: improvements in benchmark scores do not necessarily reflect generalizable reasoning ability.
We believe this discrepancy can be reduced by developing more sophisticated, task-specific evaluation metrics that capture intermediate reasoning steps or different modes of error. Additionally, formalizing reasoning for different task types can support these efforts, enabling more structured analyses and clearer assessment of models’ reasoning abilities. Such a formalization enables structured representations of diverse reasoning types and their interrelationships [39, 40, 41], and facilitates the design of layered, targeted evaluation procedures that assess specific reasoning capabilities rather than merely reporting overall accuracy. Furthermore, formal reasoning frameworks can support the development of algorithms that deliver structured feedback to models, guiding the refinement of their reasoning abilities. By integrating formalized reasoning with task-specific evaluations, benchmarking can be conducted in a more targeted and informative manner.
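As an illustration of what a task-specific, step-level metric could look like, the sketch below scores a model's intermediate reasoning steps against annotated reference steps rather than only the final answer. The step format and the exact-match rule are simplifying assumptions for illustration; real metrics would need entailment or execution checks.

```python
def step_accuracy(predicted_steps: list[str], reference_steps: list[str]) -> dict:
    """Score intermediate reasoning steps, not only the final answer.

    A predicted step counts as correct if it matches the reference step at
    the same position after whitespace/case normalization -- a deliberately
    crude rule standing in for entailment or execution checks.
    """
    def norm(s: str) -> str:
        return " ".join(s.lower().split())

    hits = sum(norm(p) == norm(r) for p, r in zip(predicted_steps, reference_steps))
    final_ok = bool(predicted_steps and reference_steps) and (
        norm(predicted_steps[-1]) == norm(reference_steps[-1])
    )
    return {
        "step_accuracy": hits / max(len(reference_steps), 1),
        "final_answer_correct": final_ok,
    }

# Two chains reaching the same final answer can still diverge mid-derivation:
# downstream accuracy treats them identically, step accuracy does not.
print(step_accuracy(["2 + 3 = 6", "5 * 4 = 20"], ["2 + 3 = 5", "5 * 4 = 20"]))
# -> {'step_accuracy': 0.5, 'final_answer_correct': True}
```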
6 Limitations
The analysis in our study focuses on 52 benchmarks used by the three model families. Other model families and reasoning-focused models are not fully explored: including them, along with the more than two hundred benchmarks identified from other model families and from studies evaluating different types of reasoning in large models, would create a combinatorial explosion of comparisons. This restriction keeps the scope of our work focused on a qualitative evaluation of benchmark design and adoption rather than an exhaustive quantitative analysis of all models and benchmarks. A comprehensive comparison across a wider range of models and benchmarks is left for future work.
7 Conclusion
In this work, we analyze 52 benchmarks across three model families, covering multiple reasoning types. Our study reveals the rapid saturation of older benchmarks, selective adoption of new ones, and temporal dynamics that govern the utility of benchmarks in evaluating model performance. While model performance generally improves over time and correlations within reasoning types indicate overlapping evaluation properties, the introduction of more challenging benchmarks generally resets performance, suggesting that apparent reasoning ability is influenced more by extrinsic factors than by mastering the reasoning itself, as supported by other studies. This saturation cycle highlights the limitations of current practices: benchmarks provide only a partial view of model reasoning. Meaningful progress requires formalized reasoning tasks, layered evaluation procedures, and task-specific metrics that go beyond accuracy scores.
References
- [1] Thomas Liao, Rohan Taori, Deborah Raji, and Ludwig Schmidt. Are we learning yet? a meta review of evaluation failures across machine learning. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021.
- [2] Anthropic. Introducing the next generation of claude, March 2024. Accessed: 2025-08-28.
- [3] Anthropic. Claude 3.5 sonnet, June 2024. Accessed: 2025-08-28.
- [4] Anthropic. Introducing claude 4, May 2025. Accessed: 2025-08-28.
- [5] Anthropic. Introducing claude 3.5 haiku, October 2024. Accessed: 2025-08-28.
- [6] Anthropic. Claude 3.7 sonnet and claude code, February 2025. Accessed: 2025-08-28.
- [7] Anthropic. Claude opus 4.1, August 2025. Accessed: 2025-08-28.
- [8] Google DeepMind. Gemini 2.5 flash-lite, June 2025. Accessed: 2025-08-28.
- [9] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities, 2025.
- [10] Google DeepMind. Gemini 2.5: Our most intelligent ai model, March 2025. Accessed: 2025-08-28.
- [11] Gemini Team, Petko Georgiev, Ving Ian Lei, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024.
- [12] Gemini Team, Rohan Anil, Sebastian Borgeaud, et al. Gemini: A family of highly capable multimodal models, 2025.
- [13] OpenAI. Openai o1-mini: Advancing cost-efficient reasoning, September 2024. Accessed: 2025-08-28.
- [14] OpenAI. Introducing gpt-4.1 in the api, April 2025. Accessed: 2025-08-28.
- [15] OpenAI. Introducing gpt-4.5, February 2025. Accessed: 2025-08-28.
- [16] OpenAI. gpt-oss-120b & gpt-oss-20b model card, August 2025. Accessed: 2025-08-28.
- [17] OpenAI. Introducing gpt-5, August 2025. Accessed: 2025-08-28.
- [18] OpenAI. Model release notes. Accessed: 2025-08-28.
- [19] OpenAI. Introducing openai o3 and o4-mini, April 2025. Accessed: 2025-08-28.
- [20] OpenAI. Gpt-4o mini: Advancing cost-efficient intelligence, July 2024. Accessed: 2025-08-28.
- [21] OpenAI. Hello gpt-4o, May 2024. Accessed: 2025-08-28.
- [22] OpenAI. Learning to reason with llms, September 2024. Accessed: 2025-08-28.
- [23] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jiasen Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 7432–7439, 2020.
- [24] Bill Yuchen Lin, Wangchunshu Zhou, Ming Shen, Pei Zhou, Chandra Bhagavatula, Yejin Choi, and Xiang Ren. CommonGen: A constrained text generation challenge for generative commonsense reasoning. In Trevor Cohn, Yulan He, and Yang Liu, editors, Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1823–1840, Online, November 2020. Association for Computational Linguistics.
- [25] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: an adversarial winograd schema challenge at scale. Commun. ACM, 64(9):99–106, August 2021.
- [26] Alon Talmor, Ori Yoran, Ronan Le Bras, Chandra Bhagavatula, Yoav Goldberg, Yejin Choi, and Jonathan Berant. Commonsenseqa 2.0: Exposing the limits of ai through gamification, 2022.
- [27] Andong Wang, Bo Wu, Sunli Chen, Zhenfang Chen, Haotian Guan, Wei-Ning Lee, Li Erran Li, and Chuang Gan. Sok-bench: A situated video reasoning benchmark with aligned open-world knowledge, 2024.
- [28] Jian Liu, Leyang Cui, Hanmeng Liu, Dandan Huang, Yile Wang, and Yue Zhang. Logiqa: a challenge dataset for machine reading comprehension with logical reasoning. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI’20, 2021.
- [29] Weihao Yu, Zihang Jiang, Yanfei Dong, and Jiashi Feng. Reclor: A reading comprehension dataset requiring logical reasoning. In International Conference on Learning Representations, 2020.
- [30] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, volume 1, 2021.
- [31] Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 961–970, 2015.
- [32] Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding, 2023.
- [33] Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi. Faith and fate: limits of transformers on compositionality. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA, 2023. Curran Associates Inc.
- [34] Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. Gsm-symbolic: Understanding the limitations of mathematical reasoning in large language models, 2025.
- [35] Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity, 2025.
- [36] Jackson Petty, Michael Y. Hu, Wentao Wang, Shauli Ravfogel, William Merrill, and Tal Linzen. Relic: Evaluating compositional instruction following via language recognition, 2025.
- [37] S. Bedi, Y. Jiang, P. Chung, S. Koyejo, and N. Shah. Fidelity of medical reasoning in large language models. JAMA Network Open, 8(8):e2526021, 2025.
- [38] Karthik Valmeekam, Kaya Stechly, Atharva Gundawar, and Subbarao Kambhampati. A systematic evaluation of the planning and scheduling abilities of the reasoning model o1. Transactions on Machine Learning Research, 2025.
- [39] P. N. Johnson-Laird. Mental models: towards a cognitive science of language, inference, and consciousness. Harvard University Press, USA, 1986.
- [40] Patrick Blackburn and Johannes Bos. Representation and Inference for Natural Language: A First Course in Computational Semantics. Center for the Study of Language and Information, Stanford, Calif., 2005.
- [41] Brenden M. Lake, Tomer D. Ullman, Joshua B. Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40:e253, 2017.
- [42] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy, July 2019. Association for Computational Linguistics.
- [43] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021.
- [44] Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc Le, Ed Chi, Denny Zhou, and Jason Wei. Challenging BIG-bench tasks and whether chain-of-thought can solve them. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 13003–13051, Toronto, Canada, July 2023. Association for Computational Linguistics.
- [45] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding, 2021.
- [46] Long Phan, Alice Gatti, Ziwen Han, et al. Humanity’s last exam, 2025.
- [47] Shivalika Singh, Angelika Romanou, Clémentine Fourrier, David Ifeoluwa Adelani, Jian Gang Ngui, Daniel Vila-Suero, Peerat Limkonchotiwat, Kelly Marchisio, Wei Qi Leong, Yosephine Susanto, Raymond Ng, Shayne Longpre, Sebastian Ruder, Wei-Yin Ko, Antoine Bosselut, Alice Oh, Andre Martins, Leshem Choshen, Daphne Ippolito, Enzo Ferrante, Marzieh Fadaee, Beyza Ermis, and Sara Hooker. Global MMLU: Understanding and addressing cultural and linguistic biases in multilingual evaluation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18761–18799, Vienna, Austria, July 2025. Association for Computational Linguistics.
- [48] David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023.
- [49] Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark, 2024.
- [50] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018.
- [51] Omer Goldman, Uri Shaham, Dan Malkin, Sivan Eiger, Avinatan Hassidim, Yossi Matias, Joshua Maynez, Adi Mayrav Gilady, Jason Riesa, Shruti Rijhwani, Laura Rimell, Idan Szpektor, Reut Tsarfaty, and Matan Eyal. Eclektic: a novel challenge set for evaluation of cross-lingual knowledge transfer, 2025.
- [52] Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics.
- [53] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021.
- [54] Freda Shi, Mirac Suzgun, Markus Freitag, Xuezhi Wang, Suraj Srivats, Soroush Vosoughi, Hyung Won Chung, Yi Tay, Sebastian Ruder, Denny Zhou, Dipanjan Das, and Jason Wei. Language models are multilingual chain-of-thought reasoners, 2022.
- [55] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024.
- [56] Elliot Glazer, Ege Erdil, Tamay Besiroglu, Diego Chicharro, Evan Chen, Alex Gunning, Caroline Falkman Olsson, Jean-Stanislas Denain, Anson Ho, Emily de Oliveira Santos, Olli Järviniemi, Matthew Barnett, Robert Sandler, Matej Vrzala, Jaime Sevilla, Qiuyu Ren, Elizabeth Pratt, Lionel Levine, Grant Barkley, Natalie Stewart, Bogdan Grechuk, Tetiana Grechuk, Shreepranav Varma Enugandla, and Mark Wildon. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai, 2024.
- [57] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2024.
- [58] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images, 2016.
- [59] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Findings of the Association for Computational Linguistics: ACL 2022, pages 2263–2279, Dublin, Ireland, May 2022. Association for Computational Linguistics.
- [60] Minesh Mathew, Dimosthenis Karatzas, and C. V. Jawahar. Docvqa: A dataset for vqa on document images, 2021.
- [61] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read, 2019.
- [62] Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos, 2025.
- [63] Piotr Padlewski, Max Bain, Matthew Henderson, Zhongkai Zhu, Nishant Relan, Hai Pham, Donovan Ong, Kaloyan Aleksiev, Aitor Ormazabal, Samuel Phua, Ethan Yeo, Eugenie Lamprecht, Qi Liu, Yuqi Wang, Eric Chen, Deyu Fu, Lei Li, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Mikel Artetxe, and Yi Tay. Vibe-eval: A hard evaluation suite for measuring progress of multimodal language models, 2024.
- [64] Jonathan Roberts, Mohammad Reza Taesiri, Ansh Sharma, Akash Gupta, Samuel Roberts, Ioana Croitoru, Simion-Vlad Bogolin, Jialu Tang, Florian Langer, Vyas Raina, Vatsal Raina, Hanyi Xiong, Vishaal Udandarao, Jingyi Lu, Shiyang Chen, Sam Purkis, Tianshuo Yan, Wenye Lin, Gyungin Shin, Qiaochu Yang, Anh Totti Nguyen, David I. Atkinson, Aaditya Baranwal, Alexandru Coca, Mikah Dang, Sebastian Dziadzio, Jakob D. Kunz, Kaiqu Liang, Alexander Lo, Brian Pulfer, Steven Walton, Charig Yang, Kai Han, and Samuel Albanie. Zerobench: An impossible visual benchmark for contemporary large multimodal models, 2025.
- [65] Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, and Danqi Chen. Charxiv: Charting gaps in realistic chart understanding in multimodal llms. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 113569–113697. Curran Associates, Inc., 2024.
- [66] Xiang Yue, Tianyu Zheng, Yuansheng Ni, Yubo Wang, Kai Zhang, Shengbang Tong, Yuxuan Sun, Botao Yu, Ge Zhang, Huan Sun, Yu Su, Wenhu Chen, and Graham Neubig. Mmmu-pro: A more robust multi-discipline multimodal understanding benchmark, 2025.
- [67] Google DeepMind. Gemini robotics: Bringing ai into the physical world, 2025. Accessed: 2025-08-29.
- [68] Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024.
- [69] Stanford University and Laude Institute. Terminal-bench: A benchmark for ai agents in terminal environments, 2025. Accessed: 2025-08-29.
- [70] Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. Evaluating large language models trained on code, 2021.
- [71] Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024.
- [72] Aider. o1 tops aider’s new polyglot leaderboard, 2024. Accessed: 2025-08-29.
- [73] Samuel Miserendino, Michele Wang, Tejal Patwardhan, and Johannes Heidecke. Swe-lancer: Can frontier llms earn $1 million from real-world freelance software engineering?, 2025.
- [74] Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. $\tau$-bench: A benchmark for tool-agent-user interaction in real-world domains, 2024.
- [75] Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan. $\tau^{2}$-bench: Evaluating conversational agents in a dual-control environment, 2025.
- [76] Shunyu Yao, Howard Chen, Austin W. Hanjie, Runzhe Yang, and Karthik Narasimhan. Collie: Systematic construction of constrained text generation tasks, 2023.
- [77] Jason Wei, Nguyen Karina, Hyung Won Chung, Yunxin Joy Jiao, Spencer Papay, Amelia Glaese, John Schulman, and William Fedus. Measuring short-form factuality in large language models, 2024.
- [78] Alon Jacovi, Andrew Wang, Chris Alberti, Connie Tao, Jon Lipovetz, Kate Olszewska, Lukas Haas, Michelle Liu, Nate Keating, Adam Bloniarz, Carl Saroufim, Corey Fry, Dror Marcus, Doron Kukliansky, Gaurav Singh Tomar, James Swirhun, Jinwei Xing, Lily Wang, Madhu Gurumurthy, Michael Aaron, Moran Ambar, Rachana Fellinger, Rui Wang, Zizhao Zhang, Sasha Goldshtein, and Dipanjan Das. The facts grounding leaderboard: Benchmarking llms’ ability to ground responses to long-form input, 2025.
- [79] Jason Wei, Zhiqing Sun, Spencer Papay, Scott McKinney, Jeffrey Han, Isa Fulford, Hyung Won Chung, Alex Tachard Passos, William Fedus, and Amelia Glaese. Browsecomp: A simple yet challenging benchmark for browsing agents, 2025.
- [80] Lucen Zhong, Zhengxiao Du, Xiaohan Zhang, Haiyi Hu, and Jie Tang. Complexfuncbench: Exploring multi-step and constrained function calling under long-context scenario, 2025.
- [81] Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023.
- [82] Yun He, Di Jin, Chaoqi Wang, Chloe Bi, Karishma Mandyam, Hejia Zhang, Chen Zhu, Ning Li, Tengyu Xu, Hongjiang Lv, Shruti Bhosale, Chenguang Zhu, Karthik Abinav Sankararaman, Eryk Helenowski, Melanie Kambadur, Aditya Tayade, Hao Ma, Han Fang, and Sinong Wang. Multi-if: Benchmarking llms on multi-turn and multilingual instructions following, 2024.
- [83] Jinhyuk Lee, Anthony Chen, Zhuyun Dai, Dheeru Dua, Devendra Singh Sachan, Michael Boratko, Yi Luan, Sébastien M. R. Arnold, Vincent Perot, Siddharth Dalmia, Hexiang Hu, Xudong Lin, Panupong Pasupat, Aida Amini, Jeremy R. Cole, Sebastian Riedel, Iftekhar Naim, Ming-Wei Chang, and Kelvin Guu. Can long-context language models subsume retrieval, rag, sql, and more?, 2024.
- [84] Kaustubh Deshpande, Ved Sirdeshmukh, Johannes Baptist Mols, Lifeng Jin, Ed-Yeremai Hernandez-Cardona, Dean Lee, Jeremy Kritz, Willow E. Primack, Summer Yue, and Chen Xing. MultiChallenge: A realistic multi-turn conversation evaluation benchmark challenging to frontier LLMs. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Findings of the Association for Computational Linguistics: ACL 2025, pages 18632–18702, Vienna, Austria, July 2025. Association for Computational Linguistics.
- [85] Rahul K. Arora, Jason Wei, Rebecca Soskin Hicks, Preston Bowman, Joaquin Quiñonero-Candela, Foivos Tsimpourlas, Michael Sharman, Meghan Shah, Andrea Vallone, Alex Beutel, Johannes Heidecke, and Karan Singhal. Healthbench: Evaluating large language models towards improved human health, 2025.
Appendix A Reasoning Benchmarks
Table 1: Taxonomy of benchmarks used in this study.
| Benchmark | Reasoning Type | Year | Description |
| --- | --- | --- | --- |
| HellaSwag [42] | Commonsense and Logical Reasoning | 2019 | Multiple-choice task: choose the most plausible sentence continuation. |
| MMLU [43] | Reasoning with General Knowledge | 2021 | Multiple-choice task: answer questions across 57 domains to test knowledge and problem-solving. |
| Big-Bench-Hard [44] | Reasoning with General Knowledge | 2023 | Open-generation task: solve difficult BIG-Bench problems testing multi-step reasoning and problem-solving. |
| MMMLU [45] | Reasoning with General Knowledge | 2024 | Multiple-choice task: answer 57 domain questions translated into 14 languages to test multilingual knowledge and problem-solving. |
| Humanity’s Last Exam [46] | Reasoning with General Knowledge | 2025 | Multi-modal task: answer closed-ended questions across many subjects to test verifiable knowledge. |
| Global MMLU (Lite) [47] | Reasoning with General Knowledge | 2025 | Multiple-choice task: answer 42-language questions with culturally sensitive labeling to test equitable multilingual knowledge. |
| GPQA Diamond [48] | Reasoning with General Knowledge | 2023 | Multiple-choice task: answer 448 expert-level science questions in biology, physics, and chemistry that are Google-proof and highly challenging. |
| MMLU Pro [49] | Reasoning with General Knowledge | 2024 | Multiple-choice task: extended from MMLU, answer more challenging reasoning questions with 10 options across diverse domains. |
| ARC (AI2 Reasoning Challenge) [50] | Reading Comprehension and Question Answering | 2018 | Multiple-choice task: answer grade-school science questions requiring advanced knowledge and reasoning beyond simple retrieval. |
| ECLeKTic [51] | Reading Comprehension and Question Answering | 2025 | Closed-book QA task: answer 12-language questions to test cross-lingual knowledge transfer. |
| DROP [52] | Reading Comprehension and Question Answering | 2019 | Open-ended QA task: answer 96k English questions requiring discrete reasoning over paragraph content. |
| GSM8K [53] | Mathematical Reasoning | 2021 | Open-ended QA task: solve grade-school problems requiring multi-step mathematical reasoning. |
| MATH [30] | Mathematical Reasoning | 2021 | Open-ended QA: solve 12,500 challenging competition problems with step-by-step solutions to test advanced mathematical reasoning. |
| MATH 500 [30] | Mathematical Reasoning | 2024 | Open-ended QA: Challenging subset of MATH benchmark. |
| MGSM [54] | Mathematical Reasoning | 2023 | Open-ended QA: solve 250 GSM8K problems translated into 10 languages. |
| MathVista [55] | Mathematical Reasoning | 2024 | Open-ended multimodal QA: solve 6,141 math problems requiring visual and compositional reasoning. |
| AIME 2024 | Mathematical Reasoning | 2024 | Open-ended QA: solve challenging competition-level mathematics problems. |
| AIME 2025 | Mathematical Reasoning | 2025 | Open-ended QA: solve challenging competition-level mathematics problems. |
| FrontierMath [56] | Mathematical Reasoning | 2024 | Open-ended QA: tests advanced mathematical reasoning across diverse and expert-level domains, requiring multi-step problem solving and deep mathematical knowledge. |
| MMMU [57] | Multimodal Reasoning | 2024 | Question answering task: multimodal multiple-choice and open-ended questions across 30 subjects requiring advanced reasoning and domain-specific knowledge. |
| AI2D [58] | Multimodal Reasoning | 2016 | Open-ended QA: multimodal questions with 5,000 diagrams and 15,000 Q&A pairs requiring diagram structure understanding and reasoning. |
| ChartQA [59] | Multimodal Reasoning | 2022 | Open-ended QA: multimodal questions with 32.7K chart-based problems requiring visual and logical reasoning. |
| EgoSchema [32] | Multimodal Reasoning | 2023 | Multiple-choice QA: multimodal questions with 5,000 long-form video clips requiring understanding of human activity and temporal reasoning. |
| DocVQA [60] | Multimodal Reasoning | 2021 | Open-ended QA: multimodal questions with 50,000 document images requiring reading and interpreting document layout and structure. |
| TextVQA [61] | Multimodal Reasoning | 2019 | Open-ended QA: multimodal questions with 45,336 images requiring reading and reasoning about embedded text. |
| VideoMMMU [62] | Multimodal Reasoning | 2025 | Open-ended QA: multimodal questions with 300 expert-level videos and 900 Q&A pairs assessing knowledge acquisition through perception, comprehension, and adaptation. |
| Vibe-Eval [63] | Multimodal Reasoning | 2024 | Open-ended QA: multimodal questions, testing visual understanding and multimodal chat capabilities. |
| ZeroBench [64] | Multimodal Reasoning | 2025 | Open-ended QA: multimodal questions with 434 visual reasoning problems designed to be impossible for current LMMs. |
| CharXiv [65] | Multimodal Reasoning | 2024 | Open-ended QA: multimodal questions with 2,323 charts requiring descriptive analysis and complex reasoning. |
| MMMU Pro [66] | Multimodal Reasoning | 2025 | QA task: multimodal multiple-choice and open-ended questions, extended from MMMU, testing integrated visual and textual reasoning. |
| ActivityNet [31] | Multimodal Reasoning | 2015 | Multiple-choice and open-ended QA: evaluates recognition and understanding of complex human activities in untrimmed videos, testing visual perception and temporal reasoning. |
| ERQA [67] | Multimodal Reasoning | 2025 | Multiple-choice QA: evaluates embodied reasoning and spatial understanding in real-world scenarios, requiring models to integrate text and visual inputs to select the correct answer. |
| SWE-bench Verified [68] | Programming and Coding | 2024 | Open-ended coding task: resolve real-world GitHub issues from a 500-problem human-validated subset of SWE-bench, requiring multi-file code edits and complex reasoning. |
| Terminal-bench [69] | Programming and Coding | 2025 | Open-ended QA: answer complex tasks in terminal environments using text-based commands and reasoning. |
| HumanEval [70] | Programming and Coding | 2021 | Open-ended QA: answer Python programming problems from docstrings requiring functional code synthesis. |
| LiveCodeBench [71] | Programming and Coding | 2025 | Open-ended QA: solve 600+ coding problems from contests, testing generation, self-repair, execution, and test prediction. |
| Aider Polyglot [72] | Programming and Coding | 2024 | Open-ended QA: solve 225 difficult coding problems in C++, Go, Java, JavaScript, Python, and Rust. |
| SWE-Lancer [73] | Programming and Coding | 2025 | Open-ended QA: answer 1,400 freelance software engineering tasks, including implementation and managerial decisions, with real-world evaluation. |
| SWE-Lancer Diamond [73] | Programming and Coding | 2025 | Open-ended QA: answer tasks from the public SWE-Lancer Diamond split, including implementation and managerial software engineering problems. |
| TAU-bench [74] | Tool Use – LLM | 2024 | Open-ended QA: tests reasoning, consistency, and rule-following in dynamic, tool-assisted human-agent interactions. |
| TAU2-bench [75] | Tool Use – LLM | 2025 | Open-ended QA: tests multi-turn reasoning, coordination, and communication in dual-control environments where both agent and user act with tools. |
| COLLIE [76] | Constrained Text Generation – LLM | 2023 | Open-ended QA: answer 2,080 prompts requiring constrained text generation with compositional, grammar-based, and reasoning challenges. |
| SimpleQA [77] | Factuality – LLM | 2024 | Factual QA benchmark designed to test factual accuracy and knowledge calibration. |
| FACTS Grounding [78] | Factuality – LLM | 2024 | Open-ended QA: answer questions requiring LLMs to generate factually accurate and well-grounded responses from provided source material. |
| BrowseComp [79] | Factuality – LLM | 2025 | Open-ended QA: answer 1,266 questions by persistently navigating the internet to find hard-to-locate information. |
| ComplexFuncBench [80] | Tool Use – LLM | 2025 | Open-ended QA: answer complex function-calling tasks in five real-world scenarios requiring multi-step reasoning, parameter management, and long-context handling. |
| IFEval [81] | Instruction Following – LLM | 2023 | Open-ended QA: answer 500 prompts requiring LLMs to follow verifiable natural language instructions. |
| Multi-IF [82] | Instruction Following – LLM | 2024 | Open-ended QA: answer 4,501 multilingual multi-turn prompts requiring accurate instruction-following across languages and conversation turns. |
| LOFT [83] | Long-Context – LLM | 2024 | Open-ended QA: answer real-world tasks requiring reasoning and in-context retrieval over millions of tokens. |
| Graphwalks [14] | Long-Context – LLM | 2025 | Open-ended QA: perform multi-hop reasoning across a graph of millions of tokens to answer questions requiring breadth-first traversal. |
| MultiChallenge [84] | Multi-turn Conversation – LLM | 2025 | Open-ended QA: answer multi-turn conversation prompts requiring instruction-following, context management, and in-context reasoning. |
| HealthBench [85] | Safety – LLM | 2025 | Open-ended QA: evaluates LLMs on multi-turn healthcare conversations, requiring factual reasoning, safety awareness, and context-sensitive decision-making across diverse medical contexts. |
Appendix B Performance of Models
<details>
<summary>figures/claude_2_plots/claude_performance_Commonsense_and_Logical_Reasoning.png Details</summary>

Line chart of score (%) versus model number for the Claude family on HellaSwag: roughly 86% at model 1, 89% at model 2, and 95% at model 3, with no data for later models.
</details>
(a) Commonsense and Logical Reasoning
<details>
<summary>figures/claude_2_plots/claude_performance_Mathematical_Reasoning.png Details</summary>

### Visual Description
## Line Chart: Model Performance Comparison
### Overview
The image is a line chart comparing the performance of several models (GSM8K, MGSM, MATH, MathVista, MATH 500, AIME 2024, AIME 2025) across different model numbers (1 to 10). The y-axis represents the score in percentage (%), and the x-axis represents the model number. Each model's performance is plotted as a line, with different colors and markers distinguishing them.
### Components/Axes
* **X-axis:** Model Number, labeled from 1 to 10.
* **Y-axis:** Score (%), labeled from 20 to 100 in increments of 10.
* **Legend:** Located at the top of the chart, identifying each model by its name and corresponding line color/marker.
* GSM8K (Red line with triangle markers)
* MGSM (Orange line with square markers)
* MATH (Brown line with diamond markers)
* MathVista (Blue line with circle markers)
* MATH 500 (Yellow-Green line with no markers)
* AIME 2024 (Pink line with star markers)
* AIME 2025 (Teal line with star markers)
### Detailed Analysis
* **GSM8K (Red triangles):** Starts at approximately 89% at Model Number 1, increases to approximately 92% at Model Number 2, increases to approximately 95% at Model Number 3, and remains relatively stable around 96% for Model Numbers 4-6.
* **MGSM (Orange squares):** Starts at approximately 75% at Model Number 1, increases to approximately 84% at Model Number 2, increases to approximately 91% at Model Number 3, decreases to approximately 86% at Model Number 4, increases to approximately 93% at Model Number 5, and decreases to approximately 86% at Model Number 6.
* **MATH (Brown diamonds):** Starts at approximately 39% at Model Number 1, increases to approximately 43% at Model Number 2, increases to approximately 60% at Model Number 3, increases to approximately 69% at Model Number 4, increases to approximately 78% at Model Number 5.
* **MathVista (Blue circles):** Starts at approximately 47% at Model Number 1, increases to approximately 48% at Model Number 2, increases to approximately 51% at Model Number 3, increases to approximately 62% at Model Number 4, increases to approximately 68% at Model Number 5.
* **MATH 500 (Yellow-Green):** Starts at approximately 82% at Model Number 6, increases to approximately 97% at Model Number 7.
* **AIME 2024 (Pink):** Starts at approximately 16% at Model Number 5, increases to approximately 24% at Model Number 6, increases to approximately 80% at Model Number 7.
* **AIME 2025 (Teal):** Starts at approximately 87% at Model Number 8, increases to approximately 90% at Model Number 9, decreases to approximately 79% at Model Number 10.
### Key Observations
* Scores on GSM8K are consistently high across the model numbers for which it is reported.
* MGSM shows some fluctuation in performance.
* MATH and MathVista show a general upward trend in performance as the model number increases.
* AIME 2024 shows a significant jump in performance between Model Numbers 6 and 7.
* MATH 500 has only two data points, both showing high performance.
* AIME 2025 has three data points, showing a peak at Model Number 9.
### Interpretation
The chart summarizes mathematical-reasoning progress across the Claude family. GSM8K is the most stable and highest-scoring benchmark throughout, while MATH and MathVista improve steadily with newer models. The dramatic jump on AIME 2024 between Model Numbers 6 and 7 coincides with the 64K extended-thinking variant of Claude 3.7 Sonnet, suggesting that larger reasoning budgets matter most on competition-style problems. The limited data for MATH 500 and AIME 2025 makes their trends harder to assess. Notably, the easier benchmarks stop being reported for later models while harder ones appear, consistent with benchmarks being rotated out as they saturate.
</details>
(b) Mathematical Reasoning
<details>
<summary>figures/claude_2_plots/claude_performance_Multimodal_Reasoning.png Details</summary>

### Visual Description
## Line Chart: Model Performance Comparison
### Overview
The image is a line chart comparing scores on four multimodal benchmarks (DocVQA, AI2D, ChartQA, and MMMU) across ten Claude releases, labeled "Model Number" 1 to 10. The y-axis represents the "Score (%)", ranging from 50 to 90. The chart visually displays how the score on each benchmark changes as the model number increases.
### Components/Axes
* **X-axis:** "Model Number", with tick marks at integers from 1 to 10.
* **Y-axis:** "Score (%)", ranging from 50 to 90. Tick marks are not explicitly labeled, but implied at intervals of 10.
* **Legend:** Located at the top of the chart, associating each model with a specific color and marker:
* **DocVQA:** Pink line with triangle markers.
* **AI2D:** Red line with square markers.
* **ChartQA:** Blue line with circle markers.
* **MMMU:** Light blue line with diamond markers.
### Detailed Analysis
* **DocVQA (Pink, Triangles):**
* Trend: Relatively stable performance, with a slight increase towards the end.
* Model 1: ~89%
* Model 2: ~89%
* Model 3: ~89%
* Model 4: ~90%
* Model 5: ~93%
* Model 10: Not explicitly shown, but the line appears to be slightly above 93%.
* **AI2D (Red, Squares):**
* Trend: Relatively stable performance, with a more pronounced increase towards the end.
* Model 1: ~87%
* Model 2: ~89%
* Model 3: ~89%
* Model 4: ~88%
* Model 5: ~92%
* Model 10: Not explicitly shown, but the line appears to be slightly above 93%.
* **ChartQA (Blue, Circles):**
* Trend: Starts relatively high, dips slightly, then increases significantly.
* Model 1: ~82%
* Model 2: ~81%
* Model 3: ~81%
* Model 4: ~87%
* Model 5: ~91%
* Model 10: Not explicitly shown, but the line appears to be slightly above 91%.
* **MMMU (Light Blue, Diamonds):**
* Trend: Consistently increasing performance across all model numbers.
* Model 1: ~50%
* Model 2: ~53%
* Model 3: ~59%
* Model 4: ~61%
* Model 5: ~70%
* Model 6: ~72%
* Model 7: ~75%
* Model 8: ~74%
* Model 9: ~77%
* Model 10: ~77%
### Key Observations
* DocVQA and AI2D show similar score trajectories, with AI2D slightly lower at the beginning but catching up by Model 5.
* ChartQA starts below DocVQA and AI2D and remains slightly behind them through Model 5, despite a sharp rise from Model 4 onward.
* MMMU starts with the lowest score but shows the most consistent improvement across all model numbers, although it remains well below the other benchmarks.
* Scores on all four benchmarks improve as the model number increases, consistent with iterative gains from successive releases.
### Interpretation
The chart compares four multimodal benchmarks across ten Claude releases. Scores on DocVQA and AI2D are the highest and relatively flat, which could indicate that these benchmarks are approaching their ceiling; ChartQA follows closely. MMMU, while improving steadily, remains well below the others, indicating the most remaining headroom. The contrast between the near-plateaued document and diagram benchmarks and the still-climbing MMMU curve suggests that broad, multi-discipline multimodal understanding is the harder and less saturated task.
</details>
(c) Multimodal Reasoning
<details>
<summary>figures/claude_2_plots/claude_performance_Programming_and_Coding.png Details</summary>

### Visual Description
## Chart: Model Performance Comparison
### Overview
The image is a line chart comparing the performance of different models across three benchmarks: HumanEval, SWE-bench Verified, and Terminal-bench. The x-axis represents the Model Number (from 1 to 10), and the y-axis represents the Score (in percentage). Each benchmark is represented by a different colored line with distinct markers.
### Components/Axes
* **X-axis:** Model Number, ranging from 1 to 10 in increments of 1.
* **Y-axis:** Score (%), ranging from 40 to 90 in increments of 10.
* **Legend:**
* **HumanEval:** Blue line with circle markers. Located at the top of the chart.
* **SWE-bench Verified:** Brown line with square markers. Located in the middle-right of the chart.
* **Terminal-bench:** Cyan line with triangle markers. Located at the bottom-right of the chart.
### Detailed Analysis
* **HumanEval (Blue, Circle Markers):** The line generally slopes upward, indicating increasing performance with higher model numbers.
* Model 1: Approximately 76%
* Model 2: Approximately 73%
* Model 3: Approximately 85%
* Model 4: Approximately 88%
* Model 5: Approximately 94%
* **SWE-bench Verified (Brown, Square Markers):** The line increases sharply until Model 8, then decreases slightly.
* Model 4: Approximately 41%
* Model 5: Approximately 49%
* Model 6: Approximately 70%
* Model 8: Approximately 80%
* Model 10: Approximately 75%
* **Terminal-bench (Cyan, Triangle Markers):** The line shows a peak at Model 9.
* Model 8: Approximately 41%
* Model 9: Approximately 50%
* Model 10: Approximately 43%
### Key Observations
* HumanEval scores generally increase with model number, reaching approximately 94% by Model 5.
* SWE-bench Verified scores increase significantly from Model 4 to Model 8, then slightly decrease.
* Terminal-bench scores are significantly lower than the other two benchmarks, peaking at Model 9.
### Interpretation
The chart suggests that successive Claude releases improve steadily on HumanEval, a function-level code-generation benchmark, with near-saturated scores by Model 5. SWE-bench Verified, which requires resolving real repository-level software-engineering issues, improves sharply up to Model 8 and then plateaus or dips slightly. Terminal-bench, which evaluates task completion in a terminal environment, yields markedly lower scores, peaking at Model 9. The pattern indicates that isolated code generation is largely solved for this family, while long-horizon, agentic software tasks remain challenging.
</details>
(d) Programming and Coding
<details>
<summary>figures/claude_2_plots/claude_performance_Reading_Comprehension_and_Question_Answering.png Details</summary>

### Visual Description
## Line Chart: Model Performance on ARC and DROP Datasets
### Overview
The image is a line chart comparing the performance of successive models on two datasets: ARC (AI2 Reasoning Challenge) and DROP. The chart plots the "Score (%)" on the y-axis against the "Model Number" on the x-axis, with one line per dataset.
### Components/Axes
* **X-axis:** "Model Number" ranging from 1 to 10.
* **Y-axis:** "Score (%)" ranging from 77.5 to 95.0, with increments of 2.5.
* **Data Series:**
* **ARC (AI2 Reasoning Challenge):** Light blue line with square markers.
* **DROP:** Blue line with circular markers.
### Detailed Analysis
**ARC (AI2 Reasoning Challenge) - Light Blue Line with Square Markers:**
* **Trend:** The line slopes upward, indicating increasing performance with higher model numbers.
* **Data Points:**
* Model 1: Approximately 89.3%
* Model 2: Approximately 93.1%
* The line continues upward beyond Model 2, but no further data points are explicitly shown.
**DROP - Blue Line with Circular Markers:**
* **Trend:** The line initially increases, plateaus, and then increases again.
* **Data Points:**
* Model 1: Approximately 78.4%
* Model 2: Approximately 78.9%
* Model 3: Approximately 83.1%
* Model 4: Approximately 83.1%
* Model 5: Approximately 88.2%
### Key Observations
* Performance on ARC increases consistently with model number.
* Performance on DROP rises, plateaus, and then rises again.
* Scores on ARC are higher than scores on DROP for all models shown.
### Interpretation
The chart suggests that the two datasets stress different abilities. ARC scores climb smoothly with newer models, while DROP alternates between improvement and stagnation, consistent with its emphasis on discrete numerical reasoning over paragraphs. The higher ARC scores may indicate that the multiple-choice science questions of the AI2 Reasoning Challenge are better matched to these models than DROP's numerical and span-extraction demands; the plateau in DROP points to a capability the intermediate releases did not improve.
</details>
(e) Reading Comprehension and QA
<details>
<summary>figures/claude_2_plots/claude_performance_Reasoning_with_General_Knowledge.png Details</summary>

### Visual Description
## Line Chart: Model Performance Comparison
### Overview
The image is a line chart comparing model performance on various general-knowledge benchmarks. The chart displays the "Score (%)" on the y-axis against the "Model Number" on the x-axis. There are five data series, each representing a different benchmark: "Big-Bench-Hard", "MMLU", "MMMLU", "GPQA Diamond", and "MMLU Pro".
### Components/Axes
* **X-axis:** "Model Number" ranging from 1 to 10.
* **Y-axis:** "Score (%)" ranging from 40 to 90, with gridlines at intervals of 10.
* **Legend:** Located at the top of the chart, identifying each data series by color and name:
* Green squares: "Big-Bench-Hard"
* Brown triangles: "MMLU"
* Light blue circles: "MMMLU"
* Dark blue circles: "GPQA Diamond"
* Gray diamonds: "MMLU Pro"
### Detailed Analysis
**1. Big-Bench-Hard (Green Squares):**
* Trend: Generally increasing, with a slight plateau towards the end.
* Data Points:
* Model 1: ~74%
* Model 2: ~83%
* Model 3: ~87%
* Model 4: ~90%
**2. MMLU (Brown Triangles):**
* Trend: Increasing, then plateauing.
* Data Points:
* Model 1: ~75%
* Model 2: ~79%
* Model 3: ~87%
* Model 4: ~89%
**3. MMMLU (Light Blue Circles):**
* Trend: Reported only from Model 5 onward; rises gradually to a high plateau.
* Data Points:
* Model 5: ~82%
* Model 6: ~85%
* Model 7: ~86%
* Model 8: ~87%
* Model 9: ~89%
* Model 10: ~89%
**4. GPQA Diamond (Dark Blue Circles):**
* Trend: Highly variable initially, then increases and plateaus.
* Data Points:
* Model 1: ~33%
* Model 2: ~40%
* Model 3: ~50%
* Model 4: ~41%
* Model 5: ~65%
* Model 6: ~68%
* Model 7: ~85%
* Model 8: ~84%
* Model 9: ~84%
* Model 10: ~82%
**5. MMLU Pro (Gray Diamonds):**
* Trend: Only two data points are available, showing an increase.
* Data Points:
* Model 4: ~65%
* Model 5: ~78%
### Key Observations
* "Big-Bench-Hard" and "MMLU" show relatively consistent high performance across the first four models.
* "GPQA Diamond" has the most variable performance, with a significant jump between models 5 and 7.
* "MMMLU" achieves high scores, plateauing after model 7.
* "MMLU Pro" only has two data points, making it difficult to assess its overall trend.
### Interpretation
The chart compares general-knowledge benchmarks across the Claude family. GPQA Diamond is clearly the most challenging for the earlier models, with scores roughly doubling between Models 4 and 7 before plateauing. Big-Bench-Hard and MMLU are handled well even by the early releases and stop being reported after Model 4, consistent with saturation. MMMLU scores are high for the later models, and the two MMLU Pro points suggest a harder variant being introduced as MMLU saturated. The data indicates that benchmark rotation tracks model progress: as one benchmark saturates, a more difficult replacement takes its place.
</details>
(f) Reasoning with General Knowledge
<details>
<summary>figures/claude_2_plots/claude_performance_LLM_Benchmarks_Combined.png Details</summary>

### Visual Description
## Line Chart: Model Performance Comparison
### Overview
The image is a line chart comparing scores on three benchmarks: IFEval, TAU-bench Retail, and TAU-bench Airline. The x-axis represents the "Model Number" ranging from 1 to 10, and the y-axis represents the "Score (%)" ranging from 20 to 90. Each benchmark is plotted as a line, showing how the score changes across model numbers.
### Components/Axes
* **X-axis:** "Model Number" with tick marks at 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10.
* **Y-axis:** "Score (%)" with tick marks at 20, 30, 40, 50, 60, 70, 80, and 90.
* **Legend:** Located on the top-right of the chart, identifying the models:
* IFEval (light blue, triangle marker)
* TAU-bench Retail (brown, square marker)
* TAU-bench Airline (dark blue, circle marker)
### Detailed Analysis
* **IFEval (light blue, triangle marker):** The line starts at Model Number 5 with a score of approximately 90%, increases slightly to approximately 92% at Model Number 7, and remains relatively stable thereafter.
* Model 5: ~90%
* Model 7: ~92%
* **TAU-bench Retail (brown, square marker):** The line starts at Model Number 4 with a score of approximately 51%, increases sharply to approximately 72% at Model Number 5, and then to approximately 81% at Model Number 6. It remains relatively stable around 81% for Model Numbers 7 and 8, and increases slightly to approximately 82% at Model Number 10.
* Model 4: ~51%
* Model 5: ~72%
* Model 6: ~81%
* Model 8: ~81%
* Model 10: ~82%
* **TAU-bench Airline (dark blue, circle marker):** The line starts at Model Number 4 with a score of approximately 23%, increases sharply to approximately 49% at Model Number 5, and then to approximately 59% at Model Number 6. It remains relatively stable around 60% for Model Numbers 7 and 8, and decreases slightly to approximately 58% at Model Number 9, and then to approximately 56% at Model Number 10.
* Model 4: ~23%
* Model 5: ~49%
* Model 6: ~59%
* Model 8: ~60%
* Model 9: ~58%
* Model 10: ~56%
### Key Observations
* Scores on IFEval, reported from Model 5 onward, stay at or above roughly 90%, higher than on either TAU-bench setting.
* TAU-bench Retail improves sharply from Model Number 4 to Model Number 6, then plateaus.
* TAU-bench Airline improves sharply from Model Number 4 to Model Number 6, then plateaus and declines slightly at Model Numbers 9 and 10.
* Scores on TAU-bench Airline are markedly lower than on the other two benchmarks, especially for the earlier models.
### Interpretation
The chart suggests that instruction following, as measured by IFEval, is essentially solved for the recent Claude models, while the agentic tool-use scenarios of TAU-bench remain harder. TAU-bench Retail improves strongly before plateauing, and TAU-bench Airline, the most difficult of the three settings, improves initially but stays well below the others and even declines slightly in the latest releases.
</details>
(g) LLM Benchmarks
Figure 3: Performance of the Claude family on reasoning benchmarks by category. Model numbers and corresponding names are as follows: 1 – Claude 3 Haiku; 2 – Claude 3 Sonnet; 3 – Claude 3 Opus; 4 – Claude 3.5 Haiku; 5 – Claude 3.5 Sonnet; 6 – Claude 3.7 Sonnet; 7 – Claude 3.7 Sonnet (64K Extended Thinking); 8 – Claude Sonnet 4; 9 – Claude Opus 4; 10 – Claude Opus 4.1.
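Charts such as Figure 3 are straightforward to regenerate once the scores are tabulated. The sketch below is illustrative rather than the authors' actual plotting code: it assumes the scores are stored as (model number, score) pairs per benchmark, uses a few approximate values read off the mathematical-reasoning panel above, and lets series that are reported for only a subset of releases start and stop naturally.

```python
# Minimal illustrative sketch of the per-category trend plots in Figures 3-5.
# Scores below are approximate values read off the Claude "Mathematical
# Reasoning" panel; each series lists only the model numbers for which a
# score is reported, so shorter series simply start later or end earlier.
import matplotlib.pyplot as plt

scores = {
    "GSM8K":     [(1, 89), (2, 92), (3, 95), (4, 96), (5, 96), (6, 96)],
    "MATH":      [(1, 39), (2, 43), (3, 60), (4, 69), (5, 78)],
    "MATH 500":  [(6, 82), (7, 97)],
    "AIME 2024": [(5, 16), (6, 24), (7, 80)],
}

fig, ax = plt.subplots(figsize=(7, 4))
for name, points in scores.items():
    xs, ys = zip(*points)               # unzip (model number, score) pairs
    ax.plot(xs, ys, marker="o", label=name)

ax.set_xlabel("Model Number")
ax.set_ylabel("Score (%)")
ax.set_xticks(range(1, 11))             # models 1-10, as in the Claude panels
ax.set_title("Mathematical Reasoning (Claude family)")
ax.legend(loc="upper left", fontsize=8)
fig.tight_layout()
fig.savefig("claude_performance_Mathematical_Reasoning.png", dpi=150)
```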
<details>
<summary>figures/gemini_2_plots/gemini_performance_Commonsense_and_Logical_Reasoning.png Details</summary>

### Visual Description
## Line Chart: Model Performance
### Overview
The image is a line chart showing the score (in percentage) of successive Gemini models, numbered 1 through 10, on the HellaSwag benchmark. The chart displays a single data series, represented by a blue line, with the series label "HellaSwag" printed next to it.
### Components/Axes
* **X-axis:** "Model Number", with tick marks at each integer from 1 to 10.
* **Y-axis:** "Score (%)", ranging from 86 to 92, with tick marks at each integer value.
* **Data Series:** A single blue line representing the performance score of each model.
* **Label:** "HellaSwag" is written in blue text above the data point for Model Number 4.
### Detailed Analysis
The blue line represents the performance of the models.
* **Model 1:** Score is approximately 87.8%.
* **Model 2:** Score is approximately 84.8%.
* **Model 3:** Score is approximately 86.5%.
* **Model 4:** Score is approximately 93.2% (the "HellaSwag" series label sits above this point).
* **Models 5-10:** No data is shown for these models.
The line initially slopes downward from Model 1 to Model 2, then slopes upward from Model 2 to Model 4.
### Key Observations
* Model 4 has the highest HellaSwag score among the models shown.
* Model 2 has the lowest score.
* There is a significant increase in score from Model 3 to Model 4.
* The chart only shows data for models 1 through 4.
### Interpretation
The chart tracks HellaSwag scores across the first four Gemini models. Model 4 scores highest by a clear margin, and the dip from Model 1 to Model 2 shows that successive releases do not always improve on this benchmark; scores recover from Model 2 onward. HellaSwag is not reported for Models 5 through 10, so later models cannot be compared, a gap consistent with the benchmark being dropped from evaluations once largely saturated.
</details>
(a) Commonsense and Logical Reasoning
<details>
<summary>figures/gemini_2_plots/gemini_performance_Mathematical_Reasoning.png Details</summary>

### Visual Description
## Line Chart: Model Performance Comparison
### Overview
The image is a line chart comparing scores on mathematical-reasoning benchmarks across successive Gemini models. The chart plots the "Score (%)" on the y-axis against the "Model Number" on the x-axis. There are six data series, each representing a different benchmark: GSM8K, MGSM, MATH, MathVista, AIME 2024, and AIME 2025.
### Components/Axes
* **X-axis:** "Model Number" ranging from 1 to 10.
* **Y-axis:** "Score (%)" ranging from 20 to 90, with gridlines at intervals of 10.
* **Legend:** Located in the top-right area of the chart, associating colors and markers with task names.
* GSM8K: Pink line with diamond markers.
* MGSM: Blue line with circle markers.
* MATH: Green line with square markers.
* MathVista: Purple line with triangle markers.
* AIME 2024: Teal line with circle markers.
* AIME 2025: Yellow-green line with circle markers.
### Detailed Analysis
* **GSM8K (Pink, Diamond):** Starts at approximately 94% for Model 1, decreases to about 87% for Model 2, remains relatively stable at approximately 87% for Model 3, and increases slightly to approximately 91% for Model 4.
* **MGSM (Blue, Circle):** Starts at approximately 79% for Model 1, decreases to about 63% for Model 2, increases to approximately 83% for Model 3, and increases slightly to approximately 87% for Model 4.
* **MATH (Green, Square):** Starts at approximately 53% for Model 1, decreases to about 33% for Model 2, increases to approximately 55% for Model 3, and increases to approximately 68% for Model 4.
* **MathVista (Purple, Triangle):** Starts at approximately 53% for Model 1, decreases to about 45% for Model 2, increases to approximately 58% for Model 3, and increases to approximately 65% for Model 4.
* **AIME 2024 (Teal, Circle):** The only data point is at Model 8, with a score of approximately 93%.
* **AIME 2025 (Yellow-Green, Circle):** Starts at approximately 15% for Model 3, increases to approximately 18% for Model 4, increases to approximately 24% for Model 5, increases to approximately 30% for Model 6, increases to approximately 72% for Model 7, increases to approximately 88% for Model 8, decreases to approximately 50% for Model 9, and increases to approximately 63% for Model 10.
### Key Observations
* GSM8K and MGSM generally outperform MATH and MathVista across the first four models.
* AIME 2024 is reported only for Model 8.
* AIME 2025 shows a significant performance increase from Model 3 to Model 8, followed by a decrease and then a slight increase.
### Interpretation
The chart summarizes mathematical-reasoning progress across the Gemini family. GSM8K and MGSM yield higher scores than MATH and MathVista for the early models (1-4), after which they stop being reported. AIME 2024 appears only for Model 8 (Gemini 2.5 Pro). AIME 2025 shows the largest swings: scores climb steeply through Model 8 and then fall for Models 9 and 10, which are the smaller Gemini 2.5 Flash Lite variants, so the drop reflects model size rather than a regression. Overall, harder competition-style benchmarks replace grade-school ones as the family matures.
</details>
(b) Mathematical Reasoning
<details>
<summary>figures/gemini_2_plots/gemini_performance_Multimodal_Reasoning.png Details</summary>

### Visual Description
## Line Chart: Model Performance Comparison
### Overview
The image is a line chart comparing scores on several multimodal benchmarks (AI2D, DocVQA, ChartQA, TextVQA, EgoSchema, VideoMMMU, MMMU, Vibe-Eval (Reka), and ZeroBench) across model numbers 1 to 10. The y-axis represents the score in percentage, ranging from 0% to 100%.
### Components/Axes
* **X-axis:** "Model Number" with tick marks at integers from 1 to 10.
* **Y-axis:** "Score (%)" with tick marks at 0, 20, 40, 60, 80.
* **Legend:** Located on the right side of the chart, listing the benchmarks and their corresponding line colors/markers.
* AI2D (brown line, diamond marker)
* DocVQA (red line, diamond marker)
* ChartQA (green line, triangle marker)
* TextVQA (blue line, circle marker)
* EgoSchema (pink line, plus marker)
* VideoMMMU (teal line, plus marker)
* MMMU (orange line, square marker)
* Vibe-Eval (Reka) (gray line, no marker)
* ZeroBench (yellow-green line, x marker)
### Detailed Analysis
* **AI2D (brown line, diamond marker):** The line starts at approximately 89% at Model Number 1, dips slightly to around 87% at Model Number 2, then increases to approximately 92% at Model Number 3, and continues to increase slightly to approximately 93% at Model Number 4. The line remains relatively stable at approximately 93% for Model Numbers 5-10.
* **DocVQA (red line, diamond marker):** The line starts at approximately 80% at Model Number 1, dips to approximately 75% at Model Number 2, then increases to approximately 85% at Model Number 3, and continues to increase slightly to approximately 87% at Model Number 4. The line remains relatively stable at approximately 87% for Model Numbers 5-10.
* **ChartQA (green line, triangle marker):** The line starts at approximately 80% at Model Number 1, dips to approximately 75% at Model Number 2, then increases to approximately 85% at Model Number 3, and continues to increase slightly to approximately 86% at Model Number 4. The line remains relatively stable at approximately 86% for Model Numbers 5-10.
* **TextVQA (blue line, circle marker):** The line starts at approximately 82% at Model Number 1, dips to approximately 74% at Model Number 2, then increases to approximately 79% at Model Number 3, and remains relatively stable at approximately 79% for Model Numbers 4-10.
* **EgoSchema (pink line, plus marker):** The line starts at approximately 79% at Model Number 1, dips to approximately 74% at Model Number 2, then decreases to approximately 65% at Model Number 3, and increases to approximately 70% at Model Number 4. The line remains relatively stable at approximately 70% for Model Numbers 5-10.
* **VideoMMMU (teal line, plus marker):** The line starts at approximately 80% at Model Number 1, dips to approximately 74% at Model Number 2, then decreases to approximately 64% at Model Number 3, and increases to approximately 69% at Model Number 4. The line increases to approximately 82% at Model Number 8, and remains relatively stable at approximately 82% for Model Numbers 9-10.
* **MMMU (orange line, square marker):** The line starts at approximately 60% at Model Number 1, dips to approximately 48% at Model Number 2, then increases to approximately 58% at Model Number 3, and increases to approximately 68% at Model Number 4. The line increases to approximately 74% at Model Number 9, and remains relatively stable at approximately 74% for Model Number 10.
* **Vibe-Eval (Reka) (gray line, no marker):** The line starts at approximately 55% at Model Number 1, dips to approximately 52% at Model Number 2, then increases to approximately 53% at Model Number 3, and remains relatively stable at approximately 53% for Model Numbers 4-10.
* **ZeroBench (yellow-green line, x marker):** The line starts at approximately 0% at Model Number 1, and remains relatively stable at approximately 1% for Model Numbers 2-7. The line increases to approximately 4% at Model Number 8, and remains relatively stable at approximately 4% for Model Numbers 9-10.
### Key Observations
* Scores on AI2D stay high across all model numbers.
* Scores on ZeroBench stay near zero across all model numbers.
* Some benchmarks (e.g., MMMU, VideoMMMU) show significant improvement as the model number increases.
* Scores on Vibe-Eval (Reka) are relatively stable across all model numbers.
### Interpretation
The chart compares multimodal benchmarks across the Gemini family. Scores on the document and diagram benchmarks (AI2D, DocVQA, ChartQA) are high and nearly flat, suggesting these tasks are close to saturation, while MMMU and VideoMMMU still improve with newer models. ZeroBench, which was designed to be beyond the reach of current frontier models, stays near zero throughout; rather than indicating an unsuitable benchmark, this shows how far even the latest models remain from the hardest visual-reasoning targets.
</details>
(c) Multimodal Reasoning
<details>
<summary>figures/gemini_2_plots/gemini_performance_Programming_and_Coding.png Details</summary>

### Visual Description
## Line Chart: Model Performance Comparison
### Overview
The image is a line chart comparing the performance of different models (numbered 1 to 10) across several coding benchmarks: HumanEval, SWE-bench Verified M, LiveCodeBench, SWE-bench Verified S, and Aider Polyglot. The y-axis represents the score in percentage (%), and the x-axis represents the model number.
### Components/Axes
* **X-axis:** Model Number (ranging from 1 to 10)
* **Y-axis:** Score (%) (ranging from 0 to 80)
* **Legend (Right Side):**
* HumanEval (Blue)
* SWE-bench Verified M (Cyan)
* LiveCodeBench (Green)
* SWE-bench Verified S (Brown)
* Aider Polyglot (Gray)
### Detailed Analysis
* **HumanEval (Blue):** Starts at approximately 75% for Model 1, dips to around 68% for Model 2, then rises to about 74% for Model 3, and continues to increase to approximately 84% for Model 4.
* Model 1: ~75%
* Model 2: ~68%
* Model 3: ~74%
* Model 4: ~84%
* **SWE-bench Verified M (Cyan):** Starts around 34% for Model 4, decreases to approximately 23% for Model 5, increases to about 29% for Model 6, then rises sharply to approximately 57% for Model 7, peaks at approximately 67% for Model 8, and then decreases to approximately 43% for Model 9.
* Model 4: ~34%
* Model 5: ~23%
* Model 6: ~29%
* Model 7: ~57%
* Model 8: ~67%
* Model 9: ~43%
* **LiveCodeBench (Green):** Remains relatively stable around 30% for Models 3 to 6, then increases sharply to approximately 60% for Model 7, peaks at approximately 74% for Model 8, and then decreases to approximately 35% for Model 9.
* Model 3: ~30%
* Model 4: ~30%
* Model 5: ~29%
* Model 6: ~29%
* Model 7: ~60%
* Model 8: ~74%
* Model 9: ~35%
* **SWE-bench Verified S (Brown):** Starts at approximately 9% for Model 3, increases to approximately 22% for Model 4, decreases to approximately 11% for Model 5, increases to approximately 21% for Model 6, rises sharply to approximately 48% for Model 7, peaks at approximately 59% for Model 8.
* Model 3: ~9%
* Model 4: ~22%
* Model 5: ~11%
* Model 6: ~21%
* Model 7: ~48%
* Model 8: ~59%
* **Aider Polyglot (Gray):** Starts at approximately 3% for Model 3, increases to approximately 17% for Model 4, decreases to approximately 10% for Model 5, increases to approximately 21% for Model 6, rises sharply to approximately 57% for Model 7, peaks at approximately 82% for Model 8.
* Model 3: ~3%
* Model 4: ~17%
* Model 5: ~10%
* Model 6: ~21%
* Model 7: ~57%
* Model 8: ~82%
### Key Observations
* HumanEval scores are high for the models on which it is reported (1-4), with a generally increasing trend.
* Aider Polyglot shows the most significant improvement from Model 3 to Model 8.
* LiveCodeBench and SWE-bench Verified M peak at Model 8 and then decline.
* SWE-bench Verified S follows a similar trend to Aider Polyglot but with lower overall scores.
* Models 7 and 8 are the strongest across all benchmarks except HumanEval, which is not reported for them.
### Interpretation
The chart tracks coding benchmarks across the Gemini family. HumanEval, reported only for the early models, is comparatively easy, with all of them scoring well. Aider Polyglot shows the steepest improvement, from near zero at Model 3 to roughly 82% at Model 8 (Gemini 2.5 Pro). The declines after Model 8 on LiveCodeBench and SWE-bench Verified M are expected rather than alarming: Models 9 and 10 are the smaller Gemini 2.5 Flash Lite variants. Overall, Model 8 is the strongest coder in the family, and repository-level and multi-language editing tasks are where recent releases gained the most.
</details>
(d) Programming and Coding
<details>
<summary>figures/gemini_2_plots/gemini_performance_Reading_Comprehension_and_Question_Answering.png Details</summary>

### Visual Description
## Line Chart: Model Performance Comparison
### Overview
The image is a line chart comparing scores on two benchmarks, "DROP" and "ECLEKTic", across successive Gemini models. The y-axis represents the score in percentage, and the x-axis represents the model number.
### Components/Axes
* **X-axis:** "Model Number", ranging from 1 to 10.
* **Y-axis:** "Score (%)", ranging from 20 to 80, with gridlines at intervals of 10.
* **Data Series 1:** "DROP" (represented by a blue line with circular markers).
* **Data Series 2:** "ECLEKTic" (represented by a teal line with square markers).
### Detailed Analysis
* **DROP:**
* Model 1: Score approximately 82%.
* Model 2: Score approximately 74%.
* Model 4: Score approximately 78%.
* Model 5: Score approximately 75%.
* Trend: Performance on DROP starts high, decreases slightly, increases again, and then decreases slightly again.
* **ECLEKTic:**
* Model 3: Score approximately 16%.
* Model 4: Score approximately 27%.
* Model 5: Score approximately 28%.
* Model 6: Score approximately 34%.
* Model 7: Score approximately 37%.
* Model 8: Score approximately 46%.
* Trend: Performance on ECLEKTic consistently increases as the model number increases.
### Key Observations
* Scores on DROP are higher than scores on ECLEKTic across the tested model numbers.
* ECLEKTic shows a clear upward trend, indicating improvement in newer models.
* Scores on DROP are relatively stable, with minor fluctuations.
### Interpretation
The chart shows that scores on DROP are high and roughly flat, while scores on ECLEKTic, a more recently introduced cross-lingual knowledge-transfer benchmark, rise steadily with newer models. Because the two benchmarks test different capabilities, the gap between the lines is less informative than their slopes: DROP appears close to a plateau, whereas ECLEKTic still has substantial headroom and improves with every release.
</details>
(e) Reading Comprehension and QA
<details>
<summary>figures/gemini_2_plots/gemini_performance_Reasoning_with_General_Knowledge.png Details</summary>

### Visual Description
## Line Chart: Model Performance Comparison
### Overview
The image is a line chart comparing model performance across several general-knowledge benchmarks. The x-axis represents the model number (1 to 10), and the y-axis represents the score in percentage (%). There are five data series, each representing a different benchmark: Big-Bench-Hard, MMLU, Global MMLU (Lite), GPQA Diamond, and Humanity's Last Exam.
### Components/Axes
* **X-axis:** Model Number (1 to 10, incrementing by 1)
* **Y-axis:** Score (%) (0 to 80, incrementing by 20)
* **Legend:** Located at the top-right of the chart.
* Big-Bench-Hard (Brown line with triangle markers)
* MMLU (Green line with square markers)
* Global MMLU (Lite) (Gray line with diamond markers)
* GPQA Diamond (Blue line with circle markers)
* Humanity's Last Exam (Light Blue line with diamond markers)
### Detailed Analysis
* **Big-Bench-Hard (Brown):**
* Trend: A dip at Model 2, then recovery to slightly above the starting level.
* Model 1: ~84%
* Model 2: ~75%
* Model 3: ~82%
* Model 4: ~85%
* **MMLU (Green):**
* Trend: A drop after Model 1, then roughly stable around 80%.
* Model 1: ~90%
* Model 2: ~79%
* Model 3: ~79%
* Model 4: ~81%
* **Global MMLU (Lite) (Gray):**
* Trend: Generally increasing with fluctuations, peaking around Models 7-8.
* Model 3: ~72%
* Model 4: ~81%
* Model 5: ~78%
* Model 6: ~83%
* Model 7: ~88%
* Model 8: ~89%
* Model 9: ~81%
* Model 10: ~88%
* **GPQA Diamond (Blue):**
* Trend: Increasing significantly, peaking at Model 8, then decreasing.
* Model 1: ~36%
* Model 2: ~28%
* Model 3: ~50%
* Model 4: ~58%
* Model 5: ~51%
* Model 6: ~66%
* Model 7: ~82%
* Model 8: ~86%
* Model 9: ~66%
* Model 10: ~68%
* **Humanity's Last Exam (Light Blue):**
* Trend: Very low scores, with a peak at Model 8.
* Model 4: ~6%
* Model 5: ~6%
* Model 6: ~7%
* Model 7: ~11%
* Model 8: ~22%
* Model 9: ~8%
* Model 10: ~9%
### Key Observations
* Among the benchmarks reported for the later models, Global MMLU (Lite) generally yields the highest scores.
* GPQA Diamond shows the most significant performance variation across the models, with a notable peak at Model 8.
* Humanity's Last Exam consistently results in the lowest scores, indicating it is a challenging benchmark for these models.
* Big-Bench-Hard and MMLU benchmarks show relatively stable performance across the models.
### Interpretation
The chart compares general-knowledge benchmarks across the Gemini family. GPQA Diamond is the most sensitive to model capability, spanning roughly 28% to 86% with a clear peak at Model 8 (Gemini 2.5 Pro) and a drop for the smaller Flash Lite variants. The consistently low scores on Humanity's Last Exam show that it remains far from solved, which is precisely its purpose as a frontier benchmark. Big-Bench-Hard and MMLU, reported only for the early models, vary little, suggesting they had already stopped discriminating among models of this caliber. Overall, the panel illustrates why evaluation must span benchmarks of very different difficulty.
</details>
(f) Reasoning with General Knowledge
<details>
<summary>figures/gemini_2_plots/gemini_performance_LLM_Benchmarks_Combined.png Details</summary>

### Visual Description
## Line Chart: Model Performance Comparison
### Overview
The image is a line chart comparing scores on four benchmarks (LOFT with hard retrieval <=128K, FACTS Grounding, LOFT with hard retrieval 1M, and SimpleQA) across model numbers 1 to 10. The y-axis represents the score in percentage, ranging from 0 to 90.
### Components/Axes
* **X-axis:** Model Number (ranging from 1 to 10)
* **Y-axis:** Score (%) (ranging from 0 to 90)
* **Legend (top-right):**
* Red squares: LOFT (hard retrieval) <=128K
* Pink triangles: FACTS Grounding
* Teal diamonds: LOFT (hard retrieval) 1M
* Blue circles: SimpleQA
### Detailed Analysis
* **LOFT (hard retrieval) <=128K (Red Squares):**
* Trend: Relatively stable and high-performing.
* Model 3: ~83%
* Model 4: ~83%
* Model 5: ~82%
* Model 6: ~85%
* Model 7: ~86%
* Model 8: ~87%
* Model 9: Not present
* Model 10: Not present
* **FACTS Grounding (Pink Triangles):**
* Trend: Starts lower, peaks at Model 4, then decreases, and increases again.
* Model 3: ~67%
* Model 4: ~76%
* Model 5: ~50%
* Model 6: ~58%
* Model 7: ~59%
* Model 8: Not present
* Model 9: Not present
* Model 10: Not present
* **LOFT (hard retrieval) 1M (Teal Diamonds):**
* Trend: Rises to a local peak at Model 4, drops sharply through Model 6, then recovers strongly by Model 8.
* Model 3: ~37%
* Model 4: ~47%
* Model 5: ~17%
* Model 6: ~8%
* Model 7: ~59%
* Model 8: ~70%
* Model 9: Not present
* Model 10: Not present
* **SimpleQA (Blue Circles):**
* Trend: Low throughout, with local peaks at Models 4 and 8 and a sharp dip at Model 6.
* Model 3: ~9%
* Model 4: ~25%
* Model 5: ~17%
* Model 6: ~8%
* Model 7: ~27%
* Model 8: ~30%
* Model 9: ~11%
* Model 10: ~12%
### Key Observations
* Scores on LOFT (hard retrieval) <=128K are consistently the highest of the four benchmarks.
* Scores on SimpleQA are consistently the lowest.
* LOFT (hard retrieval) 1M and FACTS Grounding show more variability across model numbers.
* Only SimpleQA is reported through Model Number 10; the other series end earlier.
### Interpretation
The chart covers long-context retrieval and factuality benchmarks. Retrieval within 128K tokens (LOFT <=128K) is handled well throughout, whereas the 1M-token setting is far more volatile and only reaches about 70% at Model 8, showing that very-long-context retrieval remains unreliable. The consistently low SimpleQA scores point to a persistent factuality gap, which is exactly what that benchmark is designed to expose, rather than a flaw in the benchmark itself. The missing points for Models 9 and 10 on most series suggest these benchmarks were simply not reported for the Flash Lite variants.
</details>
(g) LLM Benchmarks
Figure 4: Performance of the Gemini family on reasoning benchmarks by category. Model numbers and corresponding names are as follows: 1 – Gemini Ultra; 2 – Gemini Pro; 3 – Gemini 1.5 Flash; 4 – Gemini 1.5 Pro; 5 – Gemini 2.0 Flash-Lite; 6 – Gemini 2.0 Flash; 7 – Gemini 2.5 Flash; 8 – Gemini 2.5 Pro; 9 – Gemini 2.5 Flash Lite (no thinking); 10 – Gemini 2.5 Flash Lite (thinking).
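A recurring artifact in Figure 4 is that scores fall from Model 8 (Gemini 2.5 Pro) to Models 9 and 10 (Gemini 2.5 Flash Lite): the numbering mixes model sizes, so a raw trend over model number conflates capability growth with tier changes. The sketch below shows one way such a summary could control for this by grouping releases into rough capability tiers first; the tier labels are our own illustrative assignments rather than an official taxonomy, and the GPQA Diamond values are approximate readings from the panel above.

```python
# Illustrative sketch: summarize per-benchmark progress by capability tier so
# that small "Lite" releases do not read as regressions. The tier labels for
# the Figure 4 numbering are rough, unofficial assignments for illustration.
GEMINI_TIER = {1: "large", 2: "mid", 3: "mid", 4: "large", 5: "small",
               6: "mid", 7: "mid", 8: "large", 9: "small", 10: "small"}

def best_per_tier(points):
    """points: iterable of (model_number, score); best observed score per tier."""
    best = {}
    for model, score in points:
        tier = GEMINI_TIER[model]
        best[tier] = max(best.get(tier, float("-inf")), score)
    return best

# Approximate GPQA Diamond readings from the panel above:
gpqa_diamond = [(1, 36), (2, 28), (3, 50), (4, 58), (5, 51),
                (6, 66), (7, 82), (8, 86), (9, 66), (10, 68)]
print(best_per_tier(gpqa_diamond))  # -> {'large': 86, 'mid': 82, 'small': 68}
```

Within each tier the best score still rises with newer releases, which separates genuine capability gains from the apparent dips caused by smaller variants.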
<details>
<summary>figures/gpt_2_plots/gpt_performance_Mathematical_Reasoning.png Details</summary>

### Visual Description
## Line Chart: Model Performance Comparison
### Overview
The image is a line chart comparing scores on several mathematical benchmarks (MATH-500, MGSM, MATH, MathVista, AIME 2024, AIME 2025, and FrontierMath Tier 1-3) across model numbers 1 to 22. The y-axis represents the score in percentage, ranging from 0 to 100. Each benchmark is plotted as a line, with different colors and markers distinguishing them.
### Components/Axes
* **X-axis:** Model Number, ranging from 1 to 22 in integer increments.
* **Y-axis:** Score (%), ranging from 0 to 100 in increments of 20.
* **Legend (Top):**
* MATH-500 (Pink Line, Circle Marker)
* MGSM (Orange Line, Square Marker)
* MATH (Blue Line, Circle Marker)
* MathVista (Red Line, Triangle Marker)
* AIME 2024 (Yellow-Green Line, Circle Marker)
* AIME 2025 (Green Line, Circle Marker)
* FrontierMath, Tier 1-3 (Teal Line, Circle Marker)
* Unlabeled (Brown Line, Diamond Marker)
### Detailed Analysis
* **MATH-500 (Pink Line, Circle Marker):**
* Model 4: ~60%
* Model 5: ~70%
* Model 6: ~75%
* Model 7: ~80%
* Model 8: ~83%
* Model 9: ~86%
* Model 10: ~90%
Trend: Generally increasing from Model 4 to Model 10.
* **MGSM (Orange Line, Square Marker):**
* Model 1: ~56%
* Model 2: ~75%
* Model 3: ~88%
* Model 4: ~90%
* Model 5: ~92%
* Model 6: ~88%
Trend: Rapidly increases from Model 1 to Model 4, then plateaus and decreases slightly.
* **MATH (Blue Line, Circle Marker):**
* Model 1: ~43%
* Model 2: ~43%
* Model 3: ~73%
* Model 4: ~68%
* Model 5: ~77%
Trend: Flat from Model 1 to Model 2, a jump at Model 3, a slight dip at Model 4, then a rise to Model 5.
* **MathVista (Red Line, Triangle Marker):**
* Model 3: ~58%
* Model 4: ~57%
* Model 5: ~62%
* Model 6: ~68%
* Model 7: ~70%
* Model 8: ~74%
* Model 9: ~80%
* Model 10: ~55%
* Model 11: ~73%
* Model 12: ~73%
* Model 13: ~72%
* Model 14: ~85%
* Model 15: ~85%
Trend: Generally increasing, with some fluctuations, up to Model 15.
* **AIME 2024 (Yellow-Green Line, Circle Marker):**
* Model 7: ~83%
* Model 8: ~84%
* Model 9: ~85%
* Model 15: ~90%
* Model 16: ~92%
* Model 17: ~98%
* Model 18: ~98%
* Model 19: ~98%
* Model 20: ~98%
* Model 21: ~99%
* Model 22: ~100%
Trend: Steadily increasing, reaching near-perfect scores from Model 17 onwards.
* **AIME 2025 (Green Line, Circle Marker):**
* Model 17: ~95%
* Model 18: ~97%
* Model 19: ~98%
* Model 20: ~98%
* Model 21: ~99%
* Model 22: ~100%
Trend: High and relatively stable, approaching perfect scores.
* **FrontierMath, Tier 1-3 (Teal Line, Circle Marker):**
* Model 15: ~19%
* Model 16: ~16%
* Model 19: ~24%
* Model 20: ~27%
* Model 21: ~27%
* Model 22: ~28%
Trend: Low and relatively flat, with a slight upward trend.
* **Unlabeled (Brown Line, Diamond Marker):**
* Model 5: ~9%
* Model 6: ~14%
* Model 7: ~57%
* Model 8: ~78%
* Model 9: ~84%
* Model 10: ~86%
* Model 11: ~30%
* Model 12: ~50%
* Model 13: ~48%
* Model 14: ~37%
* Model 15: ~19%
Trend: Highly volatile, with a sharp increase followed by a sharp decrease.
### Key Observations
* Scores on AIME 2024 and AIME 2025 are the highest, especially for the later models.
* Scores on FrontierMath (Tier 1-3) are consistently the lowest across all model numbers.
* The unlabeled series (brown line) exhibits the most significant fluctuations.
* MGSM scores rise quickly at first but then plateau and dip slightly.
### Interpretation
The chart traces mathematical-reasoning scores across the GPT family. The near-perfect AIME 2024 and AIME 2025 scores from Model 17 (o3-pro) onward indicate that competition-level benchmarks once considered frontier are now effectively saturated for this family. FrontierMath (Tier 1-3), by contrast, stays below 30% even for the strongest models, making it one of the few mathematical benchmarks with substantial headroom left. Much of the volatility in the unlabeled brown series and in MathVista tracks the alternation between full-size and mini/nano variants in the numbering rather than genuine regressions. MGSM and MATH, reported only for the early models, were evidently retired as they saturated.
</details>
(a) Mathematical Reasoning
<details>
<summary>figures/gpt_2_plots/gpt_performance_Multimodal_Reasoning.png Details</summary>

### Visual Description
## Line Chart: Model Performance Comparison
### Overview
The image is a line chart comparing scores on several multimodal benchmarks across the GPT family. The y-axis represents the score in percentage, and the x-axis represents the model number. Each line represents a different benchmark, showing how its score changes as the model number increases.
### Components/Axes
* **X-axis:** Model Number, ranging from 1 to 22.
* **Y-axis:** Score (%), ranging from 40 to 90, with gridlines at intervals of 10.
* **Legend:** Located at the top of the chart, identifying each benchmark by color and name. The benchmarks are:
* AI2D (Purple)
* DocVQA (Green)
* ChartQA (Red)
* EgoSchema (Blue)
* ActivityNet (Orange)
* CharXiv-D (Pink)
* VideoMMMU (Light Brown)
* MMMU (Dark Yellow)
* CharXiv-R (Gray)
* MMMU Pro (Dark Gray)
* ERQA (Teal)
### Detailed Analysis
* **AI2D (Purple):** Starts at approximately 89% at model number 3 and increases to approximately 94% at model number 5.
* **DocVQA (Green):** Starts at approximately 87% at model number 3 and increases to approximately 93% at model number 5.
* **ChartQA (Red):** Starts at approximately 78% at model number 3, increases to approximately 85% at model number 4.
* **EgoSchema (Blue):** Starts at approximately 64% at model number 3, increases to approximately 73% at model number 4, decreases to approximately 60% at model number 5, and then increases to approximately 78% at model number 6.
* **ActivityNet (Orange):** Starts at approximately 60% at model number 3, decreases to approximately 59% at model number 4, increases to approximately 62% at model number 5.
* **CharXiv-D (Pink):** Starts at approximately 77% at model number 4, increases to approximately 90% at model number 8, decreases to approximately 74% at model number 10, increases to approximately 88% at model number 11, decreases to approximately 87% at model number 13.
* **VideoMMMU (Light Brown):** Starts at approximately 74% at model number 11, increases to approximately 83% at model number 16, and plateaus around 84% at model number 22.
* **MMMU (Dark Yellow):** Starts at approximately 71% at model number 10, increases to approximately 83% at model number 16, and plateaus around 84% at model number 22.
* **CharXiv-R (Gray):** Starts at approximately 37% at model number 4, increases to approximately 60% at model number 5, decreases to approximately 55% at model number 8, decreases to approximately 40% at model number 10, increases to approximately 57% at model number 11, plateaus around 56% at model number 13, increases to approximately 77% at model number 16, and plateaus around 79% at model number 22.
* **MMMU Pro (Dark Gray):** Starts at approximately 64% at model number 3, decreases to approximately 36% at model number 5, increases to approximately 55% at model number 8, decreases to approximately 41% at model number 10, increases to approximately 57% at model number 12, plateaus around 56% at model number 13, increases to approximately 77% at model number 16, and plateaus around 79% at model number 22.
* **ERQA (Teal):** Starts at approximately 35% at model number 5, increases to approximately 64% at model number 16, and plateaus around 66% at model number 22.
### Key Observations
* AI2D and DocVQA have the highest initial scores but are only evaluated for the first few model numbers.
* CharXiv-D shows high variability in performance across different model numbers.
* VideoMMMU and MMMU show a steady increase and plateau in performance.
* CharXiv-R and MMMU Pro show significant improvement over the model numbers.
* ERQA has the lowest initial score but shows a steady increase in performance.
### Interpretation
The chart tracks multimodal benchmarks across the GPT family. AI2D and DocVQA start high and are reported only for the early models, consistent with early saturation. VideoMMMU and MMMU improve steadily before plateauing in the low 80s. CharXiv-R and MMMU Pro are more volatile, with dips that partly align with the mini and nano variants in the numbering (e.g., Model 10 is GPT-4.1 nano), but both improve substantially by the latest models. ERQA starts lowest and climbs steadily, suggesting it retains the most headroom of the benchmarks shown.
</details>
(b) Multimodal Reasoning
<details>
<summary>figures/gpt_2_plots/gpt_performance_Programming_and_Coding.png Details</summary>

### Visual Description
## Line Chart: Model Performance Comparison
### Overview
The image is a line chart comparing the performance of different models across a range of model numbers. The chart displays the "Score (%)" on the y-axis and "Model Number" on the x-axis. Four different data series are plotted: "HumanEval", "Aider's Polyglot Whole", "Aider's Polyglot Diff", and "SWE-Bench Verified".
### Components/Axes
* **X-axis:** "Model Number" ranging from 1 to 22.
* **Y-axis:** "Score (%)" ranging from 0 to 80.
* **Legend:** Located at the top-right of the chart, identifying the data series:
* "HumanEval" (Blue line with circle markers)
* "Aider's Polygot Whole" (Pink line with triangle markers)
* "Aider's Polygot Diff" (Red line with square markers)
* "SWE-Bench Verified" (Cyan line with diamond markers)
### Detailed Analysis
* **HumanEval (Blue):**
* Trend: Starts relatively low, rapidly increases, and then plateaus.
* Data Points:
* Model 1: ~68%
* Model 2: ~67%
* Model 4: ~86%
* Model 5: ~86%
* Model 6: ~89%
* Model 7: ~91%
* Models 8-22: ~91% (approximately constant)
* **Aider's Polyglot Whole (Pink):**
* Trend: Highly variable, with peaks and troughs across different model numbers.
* Data Points:
* Model 4: ~33%
* Model 8: ~63%
* Model 10: ~8%
* Model 12: ~55%
* Model 16: ~80%
* Model 19: ~45%
* Model 21: ~75%
* **Aider's Polyglot Diff (Red):**
* Trend: Similar to "Aider's Polyglot Whole" but with some differences in magnitude.
* Data Points:
* Model 4: ~3%
* Model 5: ~19%
* Model 8: ~62%
* Model 10: ~6%
* Model 12: ~32%
* Model 13: ~45%
* Model 15: ~59%
* Model 16: ~79%
* **SWE-Bench Verified (Cyan):**
* Trend: Variable, with a general upward trend towards the end.
* Data Points:
* Model 4: ~10%
* Model 8: ~50%
* Model 11: ~24%
* Model 13: ~38%
* Model 14: ~62%
* Model 16: ~70%
* Model 18: ~62%
* Model 21: ~75%
### Key Observations
* "HumanEval" consistently outperforms the other models after Model 4.
* "Aider's Polygot Whole", "Aider's Polygot Diff", and "SWE-Bench Verified" show significant performance fluctuations across different model numbers.
* Models 8 and 16 appear to be high-performing models for "Aider's Polygot Whole" and "Aider's Polygot Diff".
* "SWE-Bench Verified" shows a general upward trend, especially after Model 11.
### Interpretation
The chart compares coding benchmarks across the GPT family. HumanEval saturates early, plateauing around 91% from Model 7 onward, so it stops discriminating among newer models. The Aider's Polyglot and SWE-Bench Verified series fluctuate strongly, with dips aligning with the mini and nano variants (e.g., Model 10 is GPT-4.1 nano) and peaks with the full reasoning models (Models 8 and 16, i.e., o1 and o3). The general upward trend on SWE-Bench Verified indicates steady progress on repository-level software-engineering tasks, which remain far from saturated.
</details>
(c) Programming and Coding
<details>
<summary>figures/gpt_2_plots/gpt_performance_Reading_Comprehension_and_Question_Answering.png Details</summary>

### Visual Description
## Line Chart: Model Performance
### Overview
The image is a line chart showing the performance score (%) of different models, numbered 1 through 22. The chart displays a single data series, represented by a blue line with circular markers, indicating the score for each model. The performance initially increases sharply, peaks, then decreases before slightly increasing again.
### Components/Axes
* **X-axis:** "Model Number", labeled from 1 to 22 in increments of 1.
* **Y-axis:** "Score (%)", labeled from 70 to 86 in increments of 2.
* **Data Series:** A single blue line with circular markers representing the performance score of each model.
* **Annotation:** The series label "DROP" appears near the data point for Model Number 5.
### Detailed Analysis
The blue line represents the performance score of the models.
* **Model 1:** Score is approximately 70%.
* **Model 2:** Score is approximately 81%.
* **Model 3:** Score is approximately 86%.
* **Model 4:** Score is approximately 80%.
* **Model 5:** Score is approximately 83%. The series label "DROP" appears near this point.
* **Models 6-22:** No data is plotted, implying DROP scores were not reported for these models.
### Key Observations
* The model performance peaks at Model 3 with a score of approximately 86%.
* There is a sharp increase in performance from Model 1 to Model 3.
* There is a decrease in performance from Model 3 to Model 4.
* The performance increases slightly from Model 4 to Model 5.
* The data for models 6 through 22 is not displayed on the chart.
### Interpretation
The chart tracks DROP scores across the GPT family, with Model 3 (GPT-4 Turbo) achieving the highest score at approximately 86%. "DROP" is simply the label of the plotted series, not an event annotation. The absence of data beyond Model 5 indicates that DROP stopped being reported after GPT-4o, a common pattern once a reading-comprehension benchmark no longer differentiates models.
</details>
(d) Reading Comprehension and QA
<details>
<summary>figures/gpt_2_plots/gpt_performance_Reasoning_with_General_Knowledge.png Details</summary>

### Visual Description
## Line Chart: Model Performance Comparison
### Overview
The image is a line chart comparing model performance on various general-knowledge benchmarks. The chart displays the "Score (%)" on the y-axis against the "Model Number" on the x-axis. Four different benchmarks are represented by different colored lines: MMLU (blue), GPQA Diamond (red), AMMLU (pink), and Humanity's Last Exam (cyan).
### Components/Axes
* **X-axis:** "Model Number" ranging from 1 to 22.
* **Y-axis:** "Score (%)" ranging from 20 to 80, with implied values extending to 0 and 100.
* **Legend:** Located in the top-right corner, associating colors with model names:
* Blue: MMLU
* Red: GPQA Diamond
* Pink: AMMLU
* Cyan: Humanity's Last Exam
### Detailed Analysis
* **MMLU (Blue):**
* Trend: Generally high and relatively stable, with some fluctuations.
* Data Points:
* Model 1: ~70%
* Model 2: ~86%
* Model 3: ~86%
* Model 4: ~82%
* Model 5: ~87%
* Model 6: ~88%
* Model 7: ~91%
* Model 8: ~89%
* Model 9: ~87%
* Model 10: ~80%
* Model 11: ~87%
* Model 12: ~88%
* Model 13: ~89%
* Model 14: ~88%
* Model 15: ~86%
* Model 16: ~85%
* Model 17: ~84%
* Model 18: ~84%
* Model 19: ~84%
* Model 20: ~85%
* Model 21: ~85%
* Model 22: ~86%
* **GPQA Diamond (Red):**
* Trend: Starts low, increases sharply, fluctuates, and then stabilizes at a high level.
* Data Points:
* Model 1: ~31%
* Model 2: ~36%
* Model 3: ~49%
* Model 4: ~40%
* Model 5: ~72%
* Model 6: ~79%
* Model 7: ~60%
* Model 8: ~78%
* Model 9: ~79%
* Model 10: ~51%
* Model 11: ~65%
* Model 12: ~66%
* Model 13: ~71%
* Model 14: ~72%
* Model 15: ~80%
* Model 16: ~81%
* Model 17: ~82%
* Model 18: ~83%
* Model 19: ~83%
* Model 20: ~84%
* Model 21: ~85%
* Model 22: ~86%
* **AMMLU (Pink):**
* Trend: Starts around 70%, peaks around model 9, then decreases and stabilizes around 80%.
* Data Points:
* Model 5: ~70%
* Model 9: ~88%
* Model 14: ~78%
* Model 18: ~81%
* Model 22: ~82%
* **Humanity's Last Exam (Cyan):**
* Trend: Starts very low and increases significantly towards the end.
* Data Points:
* Model 9: ~8%
* Model 14: ~13%
* Model 16: ~24%
* Model 18: ~19%
* Model 19: ~26%
* Model 20: ~41%
* Model 21: ~35%
* Model 22: ~40%
### Key Observations
* Scores on MMLU are consistently high across all models.
* Scores on GPQA Diamond improve significantly across the model numbers.
* Humanity's Last Exam starts with very low scores but shows a substantial increase in performance for the later models.
* AMMLU has fewer data points than the other benchmarks.
### Interpretation
The chart illustrates the performance of different models on various benchmarks. MMLU appears to be a strong performer across all models tested. GPQA Diamond demonstrates a learning curve, with performance increasing significantly as the model number increases. Humanity's Last Exam, while starting with low scores, shows a notable improvement in later models, suggesting potential for further development. The AMMLU data is sparse, making it difficult to draw definitive conclusions about its overall performance relative to the other models. The data suggests that different models excel at different tasks or benchmarks, highlighting the importance of selecting the appropriate model for a specific application.
</details>
(e) Reasoning with General Knowledge
Figure 5: Performance of the GPT family on general reasoning benchmarks. Model numbers and corresponding names are as follows: 1 – GPT-3.5; 2 – GPT-4; 3 – GPT-4 Turbo; 4 – GPT-4o mini; 5 – GPT-4o; 6 – o1-preview; 7 – o1-mini; 8 – o1; 9 – o1-pro; 10 – GPT-4.1 nano; 11 – GPT-4.1 mini; 12 – GPT-4.1; 13 – GPT-4.5; 14 – o3-mini; 15 – o4-mini; 16 – o3; 17 – o3-pro; 18 – gpt-oss-120b; 19 – GPT-5 with Deep Research; 20 – ChatGPT Agent; 21 – GPT-5; 22 – GPT-5 Pro.
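For readers who want to re-examine these trends, the following minimal Python sketch replots three of the panel (e) series using the approximate values listed in the description above. All numbers are rough chart readings rather than official figures, and the plotting choices are ours.
```python
# Minimal sketch: replot three Figure 5(e) series from approximate chart
# readings (not official scores). Model numbers follow the Figure 5 caption.
import matplotlib.pyplot as plt

mmlu = {1: 70, 2: 86, 3: 86, 4: 82, 5: 87, 6: 88, 7: 91, 8: 89, 9: 87,
        10: 80, 11: 87, 12: 88, 13: 89, 14: 88, 15: 86, 16: 85, 17: 84,
        18: 84, 19: 84, 20: 85, 21: 85, 22: 86}
gpqa_diamond = {1: 31, 2: 36, 3: 49, 4: 40, 5: 72, 6: 79, 7: 60, 8: 78,
                9: 79, 10: 51, 11: 65, 12: 66, 13: 71, 14: 72, 15: 80,
                16: 81, 17: 82, 18: 83, 19: 83, 20: 84, 21: 85, 22: 86}
hle = {9: 8, 14: 13, 16: 24, 18: 19, 19: 26, 20: 41, 21: 35, 22: 40}

fig, ax = plt.subplots()
for name, series in [("MMLU", mmlu), ("GPQA Diamond", gpqa_diamond),
                     ("Humanity's Last Exam", hle)]:
    xs = sorted(series)  # benchmarks are reported for different model subsets
    ax.plot(xs, [series[x] for x in xs], marker="o", label=name)
ax.set_xlabel("Model Number")
ax.set_ylabel("Score (%)")
ax.legend()
plt.show()
```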
<details>
<summary>figures/gpt_2_plots/gpt_performance_Constrained_Text_Generation_-_LLM.png Details</summary>

### Visual Description
## Line Chart: Model Performance Score
### Overview
The image is a line chart displaying the performance score (in percentage) of different models, numbered from 1 to 22, on the COLLIE benchmark. The chart shows the trend of GPT-family COLLIE scores across these model numbers.
### Components/Axes
* **X-axis:** "Model Number" ranging from 1 to 22, with tick marks at each integer value.
* **Y-axis:** "Score (%)" ranging from 40 to 100, with tick marks at intervals of 10.
* **Legend:** Located at the top-right corner, the label "COLLIE" is associated with the blue line.
### Detailed Analysis
The chart contains one data series, "COLLIE," represented by a blue line.
* **Trend:** The "COLLIE" line exhibits significant fluctuations in the early model numbers, followed by a sharp increase and then plateaus at a high score.
* **Data Points:**
* Model 4: Score approximately 53%
* Model 5: Score approximately 61%
* Model 8: Score approximately 95%
* Model 10: Score approximately 42%
* Model 11: Score approximately 55%
* Model 12: Score approximately 66%
* Model 13: Score approximately 72%
* Model 14: Score approximately 98%
* Model 16: Score approximately 98%
* Model 21: Score approximately 99%
### Key Observations
* The "COLLIE" model's performance varies significantly between model numbers 4 and 13.
* The model achieves a high and stable performance score from model number 14 onwards.
* There is a notable dip in performance at model number 10.
### Interpretation
The fluctuations across the early model numbers reflect differences between GPT releases rather than changes to COLLIE, which is a fixed constrained-generation benchmark. The jump after model number 13 coincides with the reasoning-oriented o-series models, and the near-ceiling scores from model number 14 onwards indicate that COLLIE is effectively saturated for the newest releases. The dip at model number 10 corresponds to GPT-4.1 nano, a small model, and likely reflects model scale rather than a regression in the family as a whole.
</details>
(a) Constrained Text Generation
<details>
<summary>figures/gpt_2_plots/gpt_performance_Factuality_-_LLM.png Details</summary>

### Visual Description
## Line Chart: Model Performance Comparison
### Overview
The image is a line chart comparing the performance of two models, "BrowseComp" and "SimpleQA," across different model numbers. The chart plots the score (in percentage) on the y-axis against the model number on the x-axis.
### Components/Axes
* **X-axis:** "Model Number" ranging from 1 to 22, with integer increments.
* **Y-axis:** "Score (%)" ranging from 0 to 70, with increments of 10.
* **Legend:**
* "BrowseComp" is represented by a light blue line with square markers.
* "SimpleQA" is represented by a dark blue line with circle markers.
### Detailed Analysis
* **BrowseComp (Light Blue, Square Markers):**
* The line starts at Model Number 5 with a score of approximately 2%.
* It remains relatively flat until Model Number 8, staying around 2%.
* The line then increases to approximately 28% at Model Number 15.
* The line increases to approximately 50% at Model Number 16.
* The line remains relatively flat until Model Number 19, staying around 51%.
* The line increases sharply to approximately 69% at Model Number 20.
* The line decreases to approximately 54% at Model Number 21.
* **SimpleQA (Dark Blue, Circle Markers):**
* The line starts at Model Number 5 with a score of approximately 38%.
* It increases to approximately 47% at Model Number 8.
* The line increases sharply to approximately 62% at Model Number 13.
* The line drops sharply to approximately 16% at Model Number 15.
### Key Observations
* Scores on SimpleQA are initially higher than scores on BrowseComp.
* BrowseComp scores increase significantly at later model numbers.
* SimpleQA scores drop sharply after Model Number 13.
* BrowseComp peaks at Model Number 20.
### Interpretation
Because both benchmarks target factuality, the divergent trends are informative. BrowseComp, which requires locating hard-to-find information, improves steadily as models gain browsing and agentic capabilities, peaking at roughly 69% with Model 20 (ChatGPT Agent) before decreasing to roughly 54% at Model 21 (GPT-5). The sharp SimpleQA drop at Model 15 (o4-mini) most likely reflects the smaller model's weaker parametric knowledge rather than a change in the benchmark itself; SimpleQA scores for later models are not plotted in this panel. Even at their peaks, both benchmarks remain well below saturation.
</details>
(b) Factuality
<details>
<summary>figures/gpt_2_plots/gpt_performance_Instruction_Following_-_LLM.png Details</summary>

### Visual Description
## Line Chart: Model Performance Comparison
### Overview
The image is a line chart comparing the performance of two models, "IFEval" and "Multi-IF," across a range of model numbers. The chart displays the score (in percentage) on the y-axis against the model number on the x-axis.
### Components/Axes
* **X-axis:** "Model Number" ranging from 1 to 22, with tick marks at each integer value.
* **Y-axis:** "Score (%)" ranging from 60 to 95, with tick marks at intervals of 5.
* **Legend:**
* "IFEval" is represented by a dark blue line with circular markers.
* "Multi-IF" is represented by a light blue line with square markers.
### Detailed Analysis
* **IFEval (Dark Blue Line):**
* **Trend:** Generally increasing with fluctuations.
* **Data Points:**
* Model 4: Approximately 78.7%
* Model 5: Approximately 81.1%
* Model 8: Approximately 92.3%
* Model 10: Approximately 74.6%
* Model 11: Approximately 84.1%
* Model 12: Approximately 87.4%
* Model 13: Approximately 88.2%
* Model 14: Approximately 93.5%
* **Multi-IF (Light Blue Line):**
* **Trend:** More volatile, with significant ups and downs.
* **Data Points:**
* Model 4: Approximately 58.1%
* Model 5: Approximately 60.7%
* Model 8: Approximately 78.2%
* Model 10: Approximately 57.5%
* Model 11: Approximately 66.8%
* Model 12: Approximately 70.8%
* Model 13: Approximately 70.8%
* Model 14: Approximately 80.0%
### Key Observations
* Scores on IFEval are higher than scores on Multi-IF at every reported model number.
* IFEval shows a generally increasing trend, suggesting instruction following improves as the model number increases.
* Multi-IF exhibits larger fluctuations, indicating greater variability in performance across model numbers.
### Interpretation
The chart compares GPT-family performance on two instruction-following benchmarks. Scores on IFEval are uniformly higher, which is consistent with Multi-IF being the harder setting: it extends single-turn instruction following to multi-turn (and multilingual) scenarios. Both series rise with newer releases and move largely in parallel, sharing, for example, the dip at Model 10 (GPT-4.1 nano); this suggests the fluctuations track model scale and capability rather than benchmark noise (the per-model gap is computed in the sketch after this panel).
</details>
(c) Instruction Following
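As a quick check on the reading above, the sketch below computes the per-model gap between the two series from the approximate values listed in the panel description; the scores are chart estimates, not official numbers.
```python
# Sketch: per-model gap between IFEval and Multi-IF (approximate readings
# from Figure 6(c)). A consistently positive gap supports the reading that
# Multi-IF's multi-turn setting is uniformly harder for every reported model.
ifeval   = {4: 78.7, 5: 81.1, 8: 92.3, 10: 74.6, 11: 84.1, 12: 87.4,
            13: 88.2, 14: 93.5}
multi_if = {4: 58.1, 5: 60.7, 8: 78.2, 10: 57.5, 11: 66.8, 12: 70.8,
            13: 70.8, 14: 80.0}

for m in sorted(ifeval):
    print(f"Model {m:2d}: IFEval - Multi-IF = {ifeval[m] - multi_if[m]:+.1f} points")
# Gaps range from roughly +13.5 to +20.6 points and never change sign.
```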
<details>
<summary>figures/gpt_2_plots/gpt_performance_Long-Context_-_LLM.png Details</summary>

### Visual Description
## Line Chart: Model Score Comparison
### Overview
The image is a line chart comparing GPT-family scores on the Graphwalks long-context benchmark under two task types (parents and bfs) and two context-length buckets (below and above 128,000 tokens). The x-axis represents the model number, and the y-axis represents the score in percentage.
### Components/Axes
* **X-axis:** "Model Number", ranging from 1 to 22 with integer increments.
* **Y-axis:** "Score (%)", ranging from 0 to 70 with increments of 10.
* **Legend (Top-Right):**
* Blue: "Graphwalks parents <128000"
* Red: "Graphwalks bfs <128000"
* Pink: "Graphwalks parents >128000"
* Cyan: "Graphwalks bfs >128000"
### Detailed Analysis
* **Graphwalks parents <128000 (Blue):**
* Model 4: Approximately 12%
* Model 5: Approximately 35%
* Model 8: Approximately 51%
* Model 10: Approximately 10%
* Model 11: Approximately 60%
* Model 13: Approximately 71%
* Model 14: Approximately 51%
*Trend:* The blue line starts low, increases sharply to model 8, drops sharply to model 10, then increases sharply again to model 13, before decreasing to model 14.
* **Graphwalks bfs <128000 (Red):**
* Model 4: Approximately 29%
* Model 5: Approximately 42%
* Model 8: Approximately 62%
* Model 10: Approximately 26%
* Model 11: Approximately 62%
* Model 13: Approximately 72%
* Model 14: Approximately 52%
*Trend:* The red line starts at a moderate value, increases to model 8, decreases to model 10, increases again to model 13, and then decreases to model 14.
* **Graphwalks parents >128000 (Pink):**
* Model 10: Approximately 10%
* Model 11: Approximately 11%
* Model 12: Approximately 16%
*Trend:* The pink line shows a slight upward trend from model 10 to model 12.
* **Graphwalks bfs >128000 (Cyan):**
* Model 10: Approximately 2%
* Model 11: Approximately 15%
* Model 12: Approximately 19%
*Trend:* The cyan line shows an upward trend from model 10 to model 12.
### Key Observations
* The "Graphwalks parents <128000" (blue) and "Graphwalks bfs <128000" (red) lines have similar trends, with peaks at model 8 and model 13.
* The "Graphwalks parents >128000" (pink) and "Graphwalks bfs >128000" (cyan) lines have much lower scores compared to the other two lines.
* The data for "Graphwalks parents >128000" and "Graphwalks bfs >128000" is only available for models 10, 11, and 12.
### Interpretation
The chart compares GPT-family performance on Graphwalks under different configurations. Scores on the sub-128,000-token splits (blue and red lines) are consistently much higher than on the over-128,000-token splits (pink and cyan lines), showing how sharply long-context performance degrades beyond the 128k-token boundary (a small aggregation sketch follows this panel). The peaks at model numbers 8 (o1) and 13 (GPT-4.5) on the sub-128k splits mark the strongest long-context performers among the reported models. The limited data for the over-128k splits makes definitive conclusions difficult, but the available points stay below 20%. The choice of task type (parents vs. bfs) has a smaller effect on scores than the context-length bucket.
</details>
(d) Long Context
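The aggregation sketch referenced in the interpretation above averages the approximate Graphwalks readings per task type and context-length bucket, making the sub-128k versus over-128k gap explicit. All values are chart estimates.
```python
# Sketch: mean Graphwalks score per (task, context bucket), from approximate
# Figure 6(d) readings; all numbers are chart estimates, not official scores.
from statistics import mean

graphwalks = {
    ("parents", "<128k"): {4: 12, 5: 35, 8: 51, 10: 10, 11: 60, 13: 71, 14: 51},
    ("bfs",     "<128k"): {4: 29, 5: 42, 8: 62, 10: 26, 11: 62, 13: 72, 14: 52},
    ("parents", ">128k"): {10: 10, 11: 11, 12: 16},
    ("bfs",     ">128k"): {10: 2, 11: 15, 12: 19},
}

for (task, bucket), scores in graphwalks.items():
    print(f"Graphwalks {task:7s} {bucket}: mean ~ {mean(scores.values()):.1f}%")
# The <128k buckets average roughly 41-49%, while the >128k buckets stay
# near 12%, quantifying the degradation beyond the 128k-token context.
```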
<details>
<summary>figures/gpt_2_plots/gpt_performance_Multi-turn_Conversation_-_LLM.png Details</summary>

### Visual Description
## Line Chart: Model Performance Scores
### Overview
The image is a line chart displaying GPT-family scores on the MultiChallenge multi-turn conversation benchmark. The x-axis represents the model number, ranging from 1 to 22, and the y-axis represents the score in percentage. Scores increase notably at the higher model numbers, specifically model 16 and beyond. "MultiChallenge" is the label of the single data series, not the name of a model.
### Components/Axes
* **X-axis:** Model Number, ranging from 1 to 22.
* **Y-axis:** Score (%), ranging from 20 to 70.
* **Data Series:** A single blue line representing the performance score of each model.
* **Label:** "MultiChallenge" is positioned near the data point for Model 21.
### Detailed Analysis
The blue line represents the performance score of each model.
* **Model 4:** Score is approximately 20%.
* **Model 5:** Score is approximately 40%.
* **Model 8:** Score is approximately 45%.
* **Model 10:** Score is approximately 15%.
* **Model 11:** Score is approximately 36%.
* **Model 13:** Score is approximately 44%.
* **Model 14:** Score is approximately 40%.
* **Model 16:** Score is approximately 60%.
* **Model 21:** Score is approximately 69%.
**Trend Analysis:**
* From Model 4 to Model 8, the score increases.
* From Model 8 to Model 10, the score decreases significantly.
* From Model 10 to Model 13, the score increases.
* From Model 13 to Model 14, the score decreases slightly.
* From Model 14 to Model 16, the score increases sharply.
* From Model 16 to Model 21, the score increases gradually.
### Key Observations
* Model 10 has the lowest score among all models.
* Model 21 has the highest score; "MultiChallenge" is the benchmark's series label, not a model name.
* There is a significant performance jump between Model 14 and Model 16.
### Interpretation
The chart tracks GPT-family scores on MultiChallenge, a multi-turn conversation benchmark. Model 21 (GPT-5) scores highest at roughly 69%, while Model 10 (GPT-4.1 nano) scores lowest, again pointing to model scale as a driver of the dips. The sharp jump between Model 14 (o3-mini) and Model 16 (o3) suggests that multi-turn capability improved markedly with the larger reasoning models. With the best reported score below 70%, MultiChallenge remains far from saturated.
</details>
(e) Multi-turn Conversation
<details>
<summary>figures/gpt_2_plots/gpt_performance_Safety_-_LLM.png Details</summary>

### Visual Description
## Line Chart: Model Performance Comparison
### Overview
The image is a line chart comparing the performance of different models based on their "Score (%)" on the y-axis and "Model Number" on the x-axis. Three data series are plotted: "HealthBench Consensus", "HealthBench", and "HealthBench Hard".
### Components/Axes
* **X-axis:** "Model Number", ranging from 1 to 22, with integer increments.
* **Y-axis:** "Score (%)", ranging from 30 to 90, with increments of 10.
* **Legend:**
* "HealthBench Consensus" - Light Blue Triangle
* "HealthBench" - Blue Circle
* "HealthBench Hard" - Brown Square
### Detailed Analysis
* **HealthBench Consensus (Light Blue Triangle):** This series has only one data point, located at Model Number ~17, Score ~90.
* **HealthBench (Blue Circle):**
* The line starts at Model Number 5 with a score of approximately 32%.
* It increases to approximately 60% at Model Number 16.
* It dips slightly to approximately 58% at Model Number 18.
* It then rises to approximately 67% at Model Number 21.
* **HealthBench Hard (Brown Square):**
* The line starts at Model Number 16 with a score of approximately 32%.
* It decreases slightly to approximately 30% at Model Number 18.
* It then rises to approximately 47% at Model Number 21.
### Key Observations
* "HealthBench Consensus" has the highest score, but only for one model.
* "HealthBench" generally outperforms "HealthBench Hard".
* "HealthBench Hard" only has data points for the last few models.
### Interpretation
The chart compares GPT-family performance on three related safety benchmarks: "HealthBench Consensus", "HealthBench", and "HealthBench Hard". The "HealthBench" series shows a general upward trend, indicating that later models perform better than earlier ones, and "HealthBench Hard" follows a similar trend at lower scores. The single "HealthBench Consensus" point is the highest score in the panel, which suggests the consensus-scored variant is the easiest of the three to score well on, not the hardest. The data suggests that the models are improving over time, but with "HealthBench Hard" still below 50%, substantial headroom remains.
</details>
(f) Safety
<details>
<summary>figures/gpt_2_plots/gpt_performance_Tool_Use_-_LLM.png Details</summary>

### Visual Description
## Line Chart: Model Performance Comparison
### Overview
The image is a line chart comparing GPT-family performance on several tool-use benchmarks across a range of model numbers. The y-axis represents the score in percentage, and the x-axis represents the model number. There are six distinct data series, each representing a different benchmark or benchmark variant, distinguished by color and label.
### Components/Axes
* **X-axis:** "Model Number", ranging from 1 to 22. Axis markers are present at each integer value.
* **Y-axis:** "Score (%)", ranging from 0 to 100. Axis markers are present at intervals of 20 (0, 20, 40, 60, 80, 100).
* **Legend:** Located on the top-right of the chart, identifying each data series by color and label:
* **Teal:** "Tau2-bench Telecom"
* **Yellow:** "Tau2-bench Retail"
* **Green:** "Tau-bench Retail"
* **Pink:** "Tau2-bench Airline"
* **Blue:** "Tau-bench Airline"
* **Purple:** "ComplexFuncBench"
### Detailed Analysis
* **Tau2-bench Telecom (Teal):** This line generally slopes upward.
* Model 4: ~22%
* Model 5: ~23%
* Model 11: ~36%
* Model 16: ~58%
* Model 21: ~98%
* **Tau2-bench Retail (Yellow):** This line is relatively flat, with a slight upward trend.
* Model 5: ~65%
* Model 11: ~70%
* Model 16: ~80%
* Model 21: ~82%
* **Tau-bench Retail (Green):** This line fluctuates.
* Model 4: ~44%
* Model 8: ~70%
* Model 10: ~23%
* Model 12: ~74%
* Model 14: ~58%
* Model 16: ~73%
* Model 21: ~68%
* **Tau2-bench Airline (Pink):** This line is relatively flat, with a slight upward trend.
* Model 4: ~47%
* Model 8: ~50%
* Model 13: ~64%
* Model 16: ~65%
* **Tau-bench Airline (Blue):** This line fluctuates.
* Model 4: ~42%
* Model 11: ~49%
* Model 13: ~50%
* Model 14: ~32%
* Model 16: ~52%
* Model 21: ~50%
* **ComplexFuncBench (Purple):** This line fluctuates significantly.
* Model 4: ~38%
* Model 8: ~50%
* Model 10: ~7%
* Model 13: ~62%
* Model 16: ~20%
### Key Observations
* "Tau2-bench Telecom" shows the most significant performance increase as the model number increases.
* "ComplexFuncBench" has the most volatile performance across different model numbers.
* "Tau2-bench Retail" and "Tau2-bench Airline" show relatively stable and high performance.
* "Tau-bench Retail" and "Tau-bench Airline" show more fluctuation than "Tau2-bench Retail" and "Tau2-bench Airline".
### Interpretation
The chart compares GPT-family performance across six tool-use benchmarks. Scores on "Tau2-bench Telecom" rise most sharply, reaching roughly 98% at Model 21 (GPT-5), which suggests this domain is close to saturation for the newest models. "ComplexFuncBench" is the most volatile series, indicating sensitivity to the specific model rather than a steady capability trend. Tau-bench and Tau2-bench are successive versions of the same tool-use benchmark suite, not different models; the Tau2 Retail and Airline variants track their Tau counterparts with generally higher and more stable scores, and the Telecom domain appears only in the Tau2 version.
</details>
(g) Tool Use
Figure 6: Performance of the GPT family on LLM-specific benchmarks. Model numbers and corresponding names are as follows: 1 – GPT-3.5; 2 – GPT-4; 3 – GPT-4 Turbo; 4 – GPT-4o mini; 5 – GPT-4o; 6 – o1-preview; 7 – o1-mini; 8 – o1; 9 – o1-pro; 10 – GPT-4.1 nano; 11 – GPT-4.1 mini; 12 – GPT-4.1; 13 – GPT-4.5; 14 – o3-mini; 15 – o4-mini; 16 – o3; 17 – o3-pro; 18 – gpt-oss-120b; 19 – GPT-5 with Deep Research; 20 – ChatGPT Agent; 21 – GPT-5; 22 – GPT-5 Pro.
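To make the saturated-versus-unsolved distinction concrete, the sketch below classifies the benchmarks from Figures 5 and 6 by the latest approximate GPT-family score visible in each panel. Both the scores (rough chart readings) and the cutoff thresholds are illustrative assumptions on our part, not values taken from official reports.
```python
# Sketch: classify benchmarks by the latest approximate GPT-family score
# read off Figures 5-6. Scores and thresholds are illustrative, not official.
latest_scores = {
    "MMLU": 86, "GPQA Diamond": 86, "Humanity's Last Exam": 40,
    "COLLIE": 99, "BrowseComp": 54, "SimpleQA": 16,
    "IFEval": 93.5, "Multi-IF": 80.0, "MultiChallenge": 69,
    "HealthBench": 67, "HealthBench Hard": 47,
    "Tau2-bench Telecom": 98, "Tau2-bench Retail": 82,
}

SATURATED, UNSOLVED = 90.0, 50.0  # illustrative cutoffs, not from the paper
for name, score in sorted(latest_scores.items(), key=lambda kv: -kv[1]):
    status = ("saturated" if score >= SATURATED
              else "unsolved" if score < UNSOLVED else "in progress")
    print(f"{name:22s} {score:5.1f}%  {status}")
```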